Bazure

Azure Live Computer Vision with AI voice feedback

  • Go
  • Node.js
  • Express
  • Azure Cognitive Services
  • Azure Speech
Screenshot of Bazure

Bazure captures your screen in a loop, sends the image to Azure's Computer Vision API, and speaks the analysis back to you through Azure's neural TTS voices. You can also talk to it and ask questions about what's on screen using speech transcription.

It runs three modules concurrently: barosa-screen-capture handles window screenshots using maim and xdotool on Linux, barosa-azure sends captures to Azure's vision API and parses the results (detected objects, OCR text, scene descriptions), and barosa-microphone manages speech input/output through Azure Speech Services.
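The three-module layout above can be sketched as a Go channel pipeline: one goroutine producing frames, one analyzing them, and a consumer that would hand results to TTS. This is a minimal illustration, not the project's actual code; the `Frame`/`Analysis` types, `runPipeline`, and the stub descriptions are all placeholders for the richer data (screenshot bytes, detected objects, OCR text) the real modules exchange.

```go
package main

import "fmt"

// Frame and Analysis are simplified stand-ins for the data the real
// modules exchange (raw screenshot bytes, detected objects, OCR text).
type Frame struct{ ID int }
type Analysis struct{ Description string }

// runPipeline wires three stages the way the three modules are wired:
// a capture goroutine (barosa-screen-capture), an analysis goroutine
// (barosa-azure), and a consumer loop standing in for the speech side
// (barosa-microphone). All names here are illustrative.
func runPipeline(n int) []Analysis {
	frames := make(chan Frame, 4) // buffered: capture can run ahead of analysis
	results := make(chan Analysis, 4)

	go func() { // capture stage
		for i := 0; i < n; i++ {
			frames <- Frame{ID: i}
		}
		close(frames)
	}()

	go func() { // analysis stage (the real one calls Azure's vision API)
		for f := range frames {
			results <- Analysis{Description: fmt.Sprintf("scene for frame %d", f.ID)}
		}
		close(results)
	}()

	var out []Analysis
	for a := range results { // the TTS stage would speak each result
		out = append(out, a)
	}
	return out
}

func main() {
	for _, a := range runPipeline(3) {
		fmt.Println("TTS:", a.Description)
	}
}
```

Closing each channel when its producer finishes lets the downstream `range` loops drain and exit cleanly, so the stages need no extra shutdown signaling.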

The backend is Go (good fit for the concurrent capture/API/audio loops running as goroutines) and the frontend is an Express.js dashboard for configuration and status.

Round-trip latency is about 1.5-2 seconds after optimizations. The system skips API calls when the screen hasn't changed much (pixel diff threshold) and captures asynchronously so it doesn't block on analysis.
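The change-detection gate can be sketched as a pixel diff over two frames. This is a reduced illustration, assuming grayscale byte buffers and an arbitrary 2% threshold; the real project diffs full screenshots, and `changedFraction`/`shouldAnalyze` are hypothetical names.

```go
package main

import "fmt"

// changedFraction reports the fraction of pixels that differ between
// two equally sized grayscale frames. Mismatched or empty frames are
// treated as fully changed so they always trigger an analysis.
func changedFraction(prev, cur []byte) float64 {
	if len(prev) != len(cur) || len(prev) == 0 {
		return 1.0
	}
	diff := 0
	for i := range prev {
		if prev[i] != cur[i] {
			diff++
		}
	}
	return float64(diff) / float64(len(prev))
}

// shouldAnalyze gates the Azure call: skip the API round trip when the
// screen is essentially unchanged since the last analyzed frame.
func shouldAnalyze(prev, cur []byte, threshold float64) bool {
	return changedFraction(prev, cur) > threshold
}

func main() {
	a := []byte{10, 10, 10, 10}
	b := []byte{10, 10, 10, 11}            // one of four pixels changed (25%)
	fmt.Println(shouldAnalyze(a, b, 0.02)) // true: enough change to analyze
	fmt.Println(shouldAnalyze(a, a, 0.02)) // false: identical frames, skip
}
```

Keeping the gate as a pure function over the previous and current frame makes it cheap to run on every capture, so the expensive vision call only fires when the fraction crosses the threshold.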