Fifty years of computing progress, and we still interact with desktop computers the same way. We gave them more raw power, GPUs, and new form factors, but by not rethinking how we interface with our primary workhorse devices, we’ve hamstrung machines that are far more capable than our interfaces let them be.
The keyboard, mouse, and GUI gave us point-and-click, type-and-view apps and workflows. This paradigm, a logical extension of the typewriter, worked well even for the early web era. However, the digital world has gotten richer and more visual over time, while our interface hasn’t kept up.
It's bizarre that we can type or speak in any app instantly, but we can't scribble natively. Freehand drawing is intuitive, universal, high fidelity, and closer to how we think. Yet on computers, it is either treated as a niche feature buried inside a few apps or relegated to creative design tools.
Freehand isn’t a creative luxury. It’s a fundamental form of expression and communication. What we need today is drawing as a first-class, system-wide1 input mechanism, as fundamental as typing or clicking.
Say you’re a developer who spots an error in Terminal, or a marketer reviewing an artifact: you can’t just highlight something directly on a window and say ‘fix this’. Instead you’re forced to perform a tedious ritual: screenshot, save, crop, open another tool to annotate, copy/paste, explain with a wall of text, and send it off.
Want to write out a math equation for homework? Mark up UI for a bug report? Or just select something on screen to identify who or what it is? You get the drift.
These aren’t edge cases, but neglected universal friction points. They’re everyday moments where current tools fall short, and where drawing in context, without breaking the flow, feels like the obvious solution. The technology exists, but the cohesive experience doesn’t.
“…but we are doing just fine with current tools”
Maybe, but not for long, because we are at a major inflection point.
Remember how Dropbox made file backups seamless, even though ‘alternatives’ existed before? We’re at a similar moment, except this time the catalyst is AI instead of the cloud.
Today’s AI can unlock new experiences across text, voice, vision, and beyond. Yet our interfaces and interactions remain the bottleneck. AI products lean heavily on text prompts and chat windows. Sure, voice mode is a nice upgrade, but words alone flatten our thoughts and feel inadequate for natural expression.
When people keep repeating things like ‘natural language is the interface’ or ‘voice is the next frontier,’ it just feels lazy and reductive. Yes, those modalities are impressive and useful, but they’re not enough.
Natural expression > Natural language
“…but multimodal AI models with language and image as input are enough”
Try this: open a maps app (or any app) on your computer, screen-share it with an AI, and start asking questions that require spatial awareness. Even the most advanced multimodal models struggle with questions that demand precise visual understanding, not to mention the amount of back-and-forth and detailed instruction it takes just to direct the model’s attention.
Now imagine if you could simply highlight something and say ‘tell me about this’, or use visual cues to convey intent.
People dream of Jarvis-inspired experiences where you can simply talk to the screen, and AI just understands everything and gets things done. But that future won’t arrive without solving for visual input and spatial grounding.
Some assume bigger/better AI models alone will solve this problem by just ‘understanding the screen’ better, so screenshots and screen sharing get passed off as the solution. But that’s like building self-driving cars with paper maps instead of GPS.
While the industry swings between static screenshots on one end and AR/VR (or even brain-computer interfaces) on the other, what’s missing is the in-between. Something basic. Something as old as humanity itself: freehand drawing.
With AI, we’ve given computers eyes, but we still have no easy way to show them where to look. That’s the bridge I’m building with Flik.
If system-wide drawing is so obvious, why are we still stuck with a PrtScr-era paradigm, i.e., screenshots? Two theories:
Theory 1: An illusion reinforced by device makers
Software constraints. Today’s operating systems are often presented as rigid, untouchable, and insufficient.
Hardware fallacy. ‘Drawing with a mouse is clumsy’ is true, but to me that argument is a failure of imagination and not a technical wall. Pencils and styluses exist, but incentives for desktop integration don’t. Why bring that experience to computers when you can sell tablets?
This has also created an artificial divide: ‘serious’ work happens on Macs/PCs, while fluid, visual expression happens on separate devices.
Perfectionism. Some believe that a system-wide drawing tool would be inferior to a professional design tool experience. But that’s missing the point.
We don’t need a system-wide Figma or Procreate. We’re talking about a basic, universal act of communication. Even simple, software- and ML-augmented freehand input is enough to be transformative2.
Theory 2: We simply stopped looking deeper
As software development moved to higher levels of abstraction, we mostly defaulted to building siloed ‘apps’. We stopped tinkering with the foundational, OS-level interactions, and thus missed opportunities for holistic experiences.
I didn’t set out to build this. Years working3 (and often fumbling) on various AI and chat products led to a simple realization: to unlock the full potential of computers beyond language-based interfaces, end users must be able to easily manipulate pixels, not just text.
Something that had been a long-standing annoyance with markup tools suddenly became a priority when I hit the limits of the multimodal AI applications I’d built with screenshots or screen sharing.
‘Prompting’ felt like an inefficient crutch.
I dissected aspects of the OS, browsers, various input devices, and AI models to understand what was even possible, and after multiple iterations, I built the software layer that lets me draw on any window on my Mac.
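For the technically curious, here’s a minimal sketch of the general idea, not Flik’s actual implementation: on macOS, a transparent, borderless AppKit window floated above other apps can capture mouse strokes and render them over whatever sits beneath it.

```swift
import AppKit

// Minimal sketch of a system-wide drawing overlay (plain AppKit, illustrative only).
// A transparent, borderless window floats above other apps and records freehand
// strokes as the mouse is dragged.
final class StrokeView: NSView {
    private var strokes: [[NSPoint]] = []

    override func mouseDown(with event: NSEvent) {
        strokes.append([convert(event.locationInWindow, from: nil)])   // start a new stroke
    }

    override func mouseDragged(with event: NSEvent) {
        strokes[strokes.count - 1].append(convert(event.locationInWindow, from: nil))
        needsDisplay = true                                            // redraw as the stroke grows
    }

    override func draw(_ dirtyRect: NSRect) {
        NSColor.systemRed.setStroke()
        for stroke in strokes where stroke.count > 1 {
            let path = NSBezierPath()
            path.lineWidth = 3
            path.move(to: stroke[0])
            stroke.dropFirst().forEach { path.line(to: $0) }
            path.stroke()
        }
    }
}

// Full-screen overlay window: transparent background, floats above app windows.
func makeOverlay(on screen: NSScreen) -> NSWindow {
    let window = NSWindow(contentRect: screen.frame,
                          styleMask: .borderless,
                          backing: .buffered,
                          defer: false)
    window.isOpaque = false
    window.backgroundColor = .clear
    window.level = .screenSaver          // above normal application windows
    window.contentView = StrokeView()
    window.orderFrontRegardless()
    return window
}
```

The hard parts, of course, live around this sketch: toggling the overlay in and out without stealing focus, keeping strokes attached to the right windows as you switch apps, and handing them off to whatever consumes them next.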
I began with drawing, but using it in my real life and tinkering further led to an ‘aha’ moment: vision + voice is the most powerful form of input parallelism we can achieve with our existing devices. You can draw with your hands while speaking out loud, giving the computer the richest, highest-signal multimodal input possible. No other input pair comes close.
This is where Flik comes in. With Flik’s universal drawing feature, you can highlight a bug on a live app and use AI to draft annotated reports, mark up windows, redact text in emails, and more, all without breaking your flow. You can also switch apps, draw on multiple windows, and pick up where you left off, seamlessly.
Because it's tightly integrated with the OS, it feels less like an ‘app’ and more like a natural extension of your workflows: the missing input primitive your computer always needed.
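To make the vision + voice pairing concrete, here’s a hedged sketch of how an annotated window capture plus a spoken note could travel to a vision-capable model as one request. This is my illustration, not Flik’s actual pipeline; the endpoint and model name follow OpenAI’s public chat-completions format and are stand-ins for whichever multimodal API you prefer.

```swift
import Foundation

// Illustrative only: bundle an annotated capture (PNG) and a voice transcript
// into a single multimodal request. Endpoint/model follow OpenAI's public
// chat-completions format; swap in any vision-capable API.
func sendVisualPrompt(annotatedPNG: Data, transcript: String, apiKey: String) async throws -> Data {
    let payload: [String: Any] = [
        "model": "gpt-4o",
        "messages": [[
            "role": "user",
            "content": [
                ["type": "text", "text": transcript],          // what you said
                ["type": "image_url",                          // what you drew on
                 "image_url": ["url": "data:image/png;base64," + annotatedPNG.base64EncodedString()]]
            ]
        ]]
    ]

    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: payload)

    let (data, _) = try await URLSession.shared.data(for: request)
    return data   // model's response, e.g., a drafted bug report
}
```

The point of the sketch is the shape of the input: the circled region carries the ‘where’, the transcript carries the ‘what’, and neither alone is enough.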
Recording: Drawing on multiple windows and using visual prompts to draft a Linear report
The goal here is NOT to build a ‘drawing’ app, but to make visual and spatial expression a first-class input mode on desktop systems. What you see here is just a start, part of something (hopefully) bigger4, but it’s in a state where I can get it out there, gather feedback, and refine. Achieving the fluidity of pencil on paper, and making drawing a truly foundational input mode, will require further work across software, OS, ML/AI, I/O, and hardware.
If you’re ready to move past screenshots and clunky markup, get in touch to get onboarded.
Also if you're working in this space and wish to chat, or are just curious about what I’m working on, you can find me on X/Twitter @notnotrishi.

