
Building a Real-Time AI Screen Reader with Azure and Go

How I built Bazure, a system that captures your screen, analyzes it with Azure Computer Vision, and talks back to you about what it sees.

  • Go
  • Azure
  • Computer Vision
  • AI

By Jenn Barosa


Bazure captures your screen, sends it to Azure's Computer Vision API, and speaks the analysis back to you. It's a loop: screenshot, analyze, talk. I wanted to see what it would be like to have an AI that passively watches what you're doing and can answer questions about it.

Three Modules

The system runs three things concurrently:

barosa-screen-capture takes screenshots of the active window using maim and xdotool on Linux. It captures at a configurable interval (default is every 2 seconds) and writes frames to a shared buffer.
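A minimal sketch of that capture loop, shelling out to maim and xdotool and publishing into a latest-frame-wins buffer. The function and type names are mine, not Bazure's, and the buffer is a simple one-slot channel that drops stale frames:

```go
package main

import (
	"os/exec"
	"strings"
	"time"
)

// frameBuffer holds only the newest frame; a slow consumer never
// sees a backlog of stale screenshots.
type frameBuffer struct{ ch chan []byte }

func newFrameBuffer() *frameBuffer {
	return &frameBuffer{ch: make(chan []byte, 1)}
}

// Put publishes a frame, discarding the previous one if it was
// never consumed. Assumes a single producer (the capture loop).
func (b *frameBuffer) Put(frame []byte) {
	select {
	case b.ch <- frame:
	default:
		select { // buffer full: drop the stale frame first
		case <-b.ch:
		default:
		}
		b.ch <- frame
	}
}

// activeWindowID asks xdotool for the currently focused window.
func activeWindowID() (string, error) {
	out, err := exec.Command("xdotool", "getactivewindow").Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

// captureWindow grabs one window as PNG; maim writes the image to
// stdout when no output file is given.
func captureWindow(windowID string) ([]byte, error) {
	return exec.Command("maim", "-i", windowID).Output()
}

// captureLoop screenshots the active window on each tick.
func captureLoop(buf *frameBuffer, interval time.Duration, stop <-chan struct{}) {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for {
		select {
		case <-stop:
			return
		case <-tick.C:
			id, err := activeWindowID()
			if err != nil {
				continue // no focused window; try again next tick
			}
			if png, err := captureWindow(id); err == nil {
				buf.Put(png)
			}
		}
	}
}
```

The one-slot buffer is the important part: if analysis falls behind, the capture side silently overwrites old frames instead of queueing them, so the analyzer always works on the most recent screen state.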

barosa-azure reads from that buffer and sends images to Azure's Computer Vision API. The API returns detected objects, OCR text, scene classification, and a natural language description. I merge all of that into a single observation object.
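The merge step might look something like this. The response shapes follow Azure Computer Vision's Analyze v3.2 JSON (trimmed to a few fields; the field names come from the public API docs, and the Observation struct is my own illustration, not Bazure's actual type):

```go
package main

import "encoding/json"

// analyzeResponse mirrors a subset of the Analyze v3.2 response.
type analyzeResponse struct {
	Description struct {
		Captions []struct {
			Text       string  `json:"text"`
			Confidence float64 `json:"confidence"`
		} `json:"captions"`
	} `json:"description"`
	Objects []struct {
		Object     string  `json:"object"`
		Confidence float64 `json:"confidence"`
	} `json:"objects"`
	Tags []struct {
		Name string `json:"name"`
	} `json:"tags"`
}

// Observation is the single merged record downstream modules consume.
type Observation struct {
	Caption string   // natural-language scene description
	Objects []string // detected object labels
	Tags    []string // scene classification tags
	OCRText string   // filled in separately from the Read (OCR) API
}

// mergeAnalyze flattens one Analyze response into an Observation.
func mergeAnalyze(raw []byte) (Observation, error) {
	var r analyzeResponse
	if err := json.Unmarshal(raw, &r); err != nil {
		return Observation{}, err
	}
	var obs Observation
	if len(r.Description.Captions) > 0 {
		obs.Caption = r.Description.Captions[0].Text
	}
	for _, o := range r.Objects {
		obs.Objects = append(obs.Objects, o.Object)
	}
	for _, t := range r.Tags {
		obs.Tags = append(obs.Tags, t.Name)
	}
	return obs, nil
}
```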

barosa-microphone handles audio input and output through Azure Speech Services. It transcribes your speech so you can ask questions about what's on screen, and it generates spoken responses using Azure's neural TTS voices.

Go Backend, Node Frontend

Go handles the backend because its concurrency model maps well to this workload. The capture loop, API calls, and audio processing all run as goroutines. The Azure SDK for Go works well enough.
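The wiring between the modules reduces to channels connecting the stages. A minimal sketch with the Azure calls stubbed out as injected functions, so the shape is visible without the credentials (the function names are illustrative, not Bazure's actual API):

```go
package main

// startPipeline wires capture -> analyze -> speak as concurrent
// stages. analyze and speak are injected so the wiring runs
// without any Azure dependency.
func startPipeline(
	frames <-chan []byte,
	analyze func([]byte) string,
	speak func(string),
) <-chan struct{} {
	observations := make(chan string)
	done := make(chan struct{})

	go func() { // analysis stage: one observation per frame
		for f := range frames {
			observations <- analyze(f)
		}
		close(observations)
	}()

	go func() { // speech stage: drains observations in order
		for obs := range observations {
			speak(obs)
		}
		close(done)
	}()

	return done
}
```

Because each stage owns its input channel and closes its output, shutdown cascades naturally: close the frame channel and the whole pipeline drains and exits.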

The frontend is an Express.js app that serves a config dashboard and status display. Node was the fastest path to a working UI. For a prototype this split worked fine.

Dealing with Latency

The pipeline has a lot of steps: capture, upload, API processing, response parsing, speech synthesis, audio playback. Each one adds time. If the total round-trip is too slow, the AI ends up describing what was on your screen five seconds ago, which feels broken.

Two things helped. First, captures are asynchronous. The system doesn't wait for one analysis to finish before taking the next screenshot. Second, it deduplicates: if the screen hasn't changed much (measured by a pixel diff threshold), it skips the API call entirely. No point in analyzing the same image twice.
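The dedup check can be as simple as counting changed pixels between consecutive RGBA frames. A sketch; the 2% threshold and per-channel tolerance are assumptions for illustration, not Bazure's tuned values:

```go
package main

// changedFraction compares two same-size RGBA buffers and returns
// the fraction of pixels whose R, G, or B delta exceeds a small
// tolerance (to ignore compression and rendering noise).
func changedFraction(prev, cur []byte) float64 {
	if len(prev) != len(cur) || len(prev) == 0 {
		return 1.0 // size change counts as a full change
	}
	const tol = 8
	changed, total := 0, len(prev)/4
	for i := 0; i+3 < len(prev); i += 4 {
		if absDiff(prev[i], cur[i]) > tol ||
			absDiff(prev[i+1], cur[i+1]) > tol ||
			absDiff(prev[i+2], cur[i+2]) > tol {
			changed++
		}
	}
	return float64(changed) / float64(total)
}

func absDiff(a, b byte) int {
	if a > b {
		return int(a - b)
	}
	return int(b - a)
}

// shouldSkip reports whether the new frame is close enough to the
// previous one that the API call can be skipped entirely.
func shouldSkip(prev, cur []byte) bool {
	return changedFraction(prev, cur) < 0.02
}
```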

With these optimizations, the full round trip comes in around 1.5-2 seconds: Azure's Computer Vision API responds in 500-800ms for a 1080p image, and speech synthesis adds another 300-500ms.

What It's Actually Good At

Azure's vision API handles UI content better than I expected. It can tell the difference between a code editor with Python open and a browser showing a social media feed. OCR on standard fonts is basically perfect.

Some things I found it useful for: describing screen content out loud (accessibility use case), narrating what's happening while streaming or recording, and monitoring dashboards. You could point it at a Grafana instance and have it tell you when a graph looks off.

Voice Quality

Azure's neural TTS voices sound good. For short phrases they're hard to distinguish from a real person. Longer passages still have a slight flatness to the intonation, but it's getting there. For this kind of utility output where you're hearing short descriptions and answers, the quality is more than sufficient.