Cap2


Live captions for Hard of Hearing users, on-device.

Smart glasses + phone · On-device ML · Accessibility

The problem

Existing live-caption tools depend on the cloud and don’t cover the places Hard of Hearing users actually are: phone media in third-party apps, in-person speech across noisy rooms, headphones-on contexts. The audio path is slow, and routing media through the phone speaker broadcasts what the user is listening to into the room.

My role

Solo product and interaction lead. Architecture, UX, hardware integration spec, on-device ML pipeline.

Approach

Three modes (Live / Media / Both) over one on-device Vosk recognizer. An Accessibility scraper reads verbatim CC and lyrics out of YouTube, YouTube Music and Spotify when they’re already on screen. Audio routes through the glasses’ speaker and mic for privacy, so the phone speaker stays out of the path.

Cap2

Live captions for Hard of Hearing users, on the device. Audio I/O runs privately through the glasses’ speaker-and-mic so the wearer hears and speaks without broadcasting to the room. Recognition runs on-device with Vosk. Captions show inside the AR display, as a floating pill on the phone screen, or both. For media playing in other apps the system reads captions and lyrics directly out of those apps via Accessibility, falling back to the glasses speaker+mic when an app refuses to expose them. Nothing goes to the cloud, and the phone’s speaker stays out of the path so it doesn’t disturb anyone nearby.

Smart glasses + paired phone · On-device ML

Problem

Existing live-caption tools depend on the cloud, lag on phone media playback, and don’t cover the places Hard of Hearing users actually are: in-person conversation, ambient rooms, phone media.

User

Anyone who needs captions for speech in their environment. Hard of Hearing users primarily, but also people in noisy rooms, language learners, and anyone trying to follow a quiet podcast in a loud cafe.

Approach

One on-device recognizer with three modes. LIVE: conversation around you. MEDIA: audio playing on the phone, with verbatim CC and lyrics scraped from the foreground app via Accessibility. BOTH: split surface. Audio I/O runs privately through the glasses; the phone speaker stays quiet.
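
A minimal sketch of how the three modes could map onto the two caption pipelines. The names `Mode` and `active_pipelines` are hypothetical and only illustrate the routing described above; they are not taken from the Cap2 codebase.

```python
from __future__ import annotations

from enum import Enum


class Mode(Enum):
    LIVE = "live"    # conversation around the wearer
    MEDIA = "media"  # audio playing on the phone
    BOTH = "both"    # split surface, both streams at once


def active_pipelines(mode: Mode) -> set[str]:
    """Return the caption pipelines that run in a given mode (assumed names)."""
    if mode is Mode.LIVE:
        return {"live"}
    if mode is Mode.MEDIA:
        return {"media"}
    # BOTH runs the two pipelines side by side.
    return {"live", "media"}


if __name__ == "__main__":
    for mode in Mode:
        print(mode.name, sorted(active_pipelines(mode)))
```

Keeping the mode as a single value on the phone side matches the note that the phone, not the glasses, decides the source.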

What the user sees inside the AR glasses

The Rokid panel is 480 × 640 pixels per eye (portrait), monochrome green Micro-LED on near-transparent waveguides with a 30° diagonal field of view. Captions live in the lower portion. Status text, when present, sits at top center. No source pill anywhere; the phone picks the source.

Figure (cap2 · glasses 01): Empty panel with only a dim italic status banner at the top center, reading "Waiting for phone…".

Idle, waiting for phone

Before captioning starts (or while the Bluetooth link is still negotiating), the panel shows nothing but a small italic status line at the top in dim green. The rest of the field of view is clear so the wearer's view of the world is undisturbed.

Figure (cap2 · glasses 02): A partial caption forms in the lower portion of the panel, italic and at lower opacity: "so what I think we should do tomorrow is…"

Partial: the speaker is still talking

The recognizer streams a best-guess transcript while the speaker is mid-sentence. Partials render italic and at lower brightness so the user reads them as provisional. The exact words may shift before the sentence locks in.
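
As a rough sketch of the partial/final split: Vosk reports interim hypotheses as JSON with a `partial` key and committed results with a `text` key. The code below turns either message into a caption update and picks a render style. `CaptionUpdate`, `parse_update`, `style_for`, and the style values are assumptions for illustration, not the project's actual types or numbers.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionUpdate:
    text: str
    is_final: bool


def parse_update(raw: str) -> CaptionUpdate | None:
    """Map a Vosk-style JSON message to a caption update; None if it is empty."""
    msg = json.loads(raw)
    if "partial" in msg:
        text = msg["partial"].strip()
        return CaptionUpdate(text, is_final=False) if text else None
    text = msg.get("text", "").strip()
    return CaptionUpdate(text, is_final=True) if text else None


def style_for(update: CaptionUpdate) -> dict:
    """Partials read as provisional; finals lock in. Values are assumed."""
    if update.is_final:
        return {"italic": False, "brightness": 1.0, "weight": "medium"}
    return {"italic": True, "brightness": 0.6, "weight": "regular"}


if __name__ == "__main__":
    update = parse_update('{"partial": "so what I think we should do tomorrow is"}')
    print(update, style_for(update))
```

Dropping empty messages keeps the panel clear during silence instead of flashing a blank caption.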

Figure (cap2 · glasses 03): Two-line caption locked at full brightness, with the previous caption above at lower opacity. Previous: "It leaves at 6:42, gets in by ten." Current: "So what I think we should do tomorrow is take the early train."

Final: the sentence is committed

When the speaker pauses, the caption commits at full brightness and medium weight. The previous caption stays visible above at lower opacity so the user can keep one line of context while reading the current one. No chrome around the captions; they sit cleanly in the lower portion of the display.
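
One way to keep a single line of context is a bounded buffer of committed captions. This is a sketch under assumptions: `CaptionStack` is a hypothetical name and the brightness values are placeholders.

```python
from collections import deque

FULL = 1.0  # current caption (assumed value)
DIM = 0.5   # context line above it (assumed value)


class CaptionStack:
    """Holds the current final caption plus a fixed number of context lines."""

    def __init__(self, context_lines: int = 1) -> None:
        # maxlen drops the oldest caption automatically on overflow.
        self._lines = deque(maxlen=context_lines + 1)

    def commit(self, text: str) -> None:
        self._lines.append(text)

    def render(self) -> list:
        """Return (text, brightness) pairs, oldest first."""
        last = len(self._lines) - 1
        return [(text, FULL if i == last else DIM)
                for i, text in enumerate(self._lines)]


if __name__ == "__main__":
    stack = CaptionStack()
    stack.commit("It leaves at 6:42, gets in by ten.")
    stack.commit("So what I think we should do tomorrow is take the early train.")
    for text, brightness in stack.render():
        print(brightness, text)
```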

Figure (cap2 · glasses 04): Final caption with low-confidence words italicised: "He landed in Tucson around midnight last night."

Words the recognizer is not sure about

Confidence is tracked per word. Words below the threshold render italic and at lower brightness; these are typically proper nouns, words spoken quickly, and words half-drowned by background noise. Uncertainty is conveyed by weight and brightness, never by color.
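
A small sketch of the per-word check. With word output enabled, a Vosk final result carries a `result` list in which each word has a `conf` score. The threshold of 0.7, the name `mark_words`, and the sample payload are all assumptions for illustration; the source does not state the actual threshold.

```python
import json

CONF_THRESHOLD = 0.7  # assumed value, not from the source


def mark_words(result_json: str, threshold: float = CONF_THRESHOLD) -> list:
    """Return (word, is_uncertain) pairs from a Vosk-style final result."""
    msg = json.loads(result_json)
    return [
        # A word without a score is treated as confident.
        (w["word"], w.get("conf", 1.0) < threshold)
        for w in msg.get("result", [])
    ]


if __name__ == "__main__":
    # Made-up payload in the shape described above.
    sample = json.dumps({
        "result": [
            {"word": "he", "conf": 1.0},
            {"word": "landed", "conf": 0.98},
            {"word": "in", "conf": 1.0},
            {"word": "tucson", "conf": 0.42},
        ],
        "text": "he landed in tucson",
    })
    for word, uncertain in mark_words(sample):
        print(f"{word:10s} {'italic, dim' if uncertain else 'normal'}")
```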

Figure (cap2 · glasses 05): Caption pushed to the glasses while the phone is in MEDIA mode and the Accessibility scraper is pulling verbatim lyrics or CC from the foreground app. Status line: "media · lyric scraped from YouTube Music". Caption: "And the shame, was on the other side".

Media mode: captions pulled from the media app

When the phone is in MEDIA mode and the user is watching or listening in another app, the Accessibility scraper reads the captions or lyrics that app already shows on screen and pushes them to the glasses verbatim. The audio path is not used; the user sees the exact text the media itself is displaying.
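
The overview also mentions falling back to the glasses speaker and mic when an app refuses to expose its captions. A minimal sketch of that preference order, assuming the scraper and the recognizer each hand over an optional string (`pick_media_caption` is a hypothetical helper):

```python
from __future__ import annotations


def pick_media_caption(scraped: str | None,
                       recognized: str | None) -> tuple[str, str] | None:
    """Prefer verbatim scraped text; fall back to recognizer output."""
    if scraped and scraped.strip():
        return ("scraped", scraped.strip())
    if recognized and recognized.strip():
        return ("recognizer", recognized.strip())
    return None  # nothing to show


if __name__ == "__main__":
    print(pick_media_caption("And the shame, was on the other side", None))
    print(pick_media_caption(None, "and the shame was on the other side"))
```

Returning the source label alongside the text lets the phone UI show where a caption came from, while the glasses HUD can ignore it.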

Figure (cap2 · glasses 06): BOTH mode HUD on the glasses. A small "BOTH" status banner sits at the very top. MEDIA (larger, on top): "and the third climber reaches the summit just before sunset." LIVE (smaller, single line below): "Hey, are you still watching the climbing documentary?"

BOTH mode: media captions on top, live conversation below

BOTH mode runs LIVE and MEDIA pipelines simultaneously, and both surface on the glasses HUD. The media stream renders larger at the top because the wearer is paying primary attention to it. The live conversation stream renders smaller in a single line below as the awareness layer for speech happening in the room. A small BOTH status banner at the top of the panel signals the mode. Source decisions happen on the phone side, so the HUD still has no source pill.
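
To make the stacking concrete, here is a sketch that composes the three HUD rows for BOTH mode. `HudLine`, `compose_both`, the scale factors, and the 40-character budget for the live line are assumptions; the source only says that media renders larger and live stays on a single line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HudLine:
    role: str     # "status", "media", or "live"
    text: str
    scale: float  # relative text size (assumed values)


def compose_both(media_text: str, live_text: str,
                 live_max_chars: int = 40) -> list:
    """Return HUD rows top to bottom: status banner, media, then live."""
    live = live_text.strip()
    if len(live) > live_max_chars:
        # Keep the most recent words so the live row stays on one line.
        live = "…" + live[len(live) - (live_max_chars - 1):]
    return [
        HudLine("status", "BOTH", 0.6),
        HudLine("media", media_text.strip(), 1.0),
        HudLine("live", live, 0.75),
    ]


if __name__ == "__main__":
    rows = compose_both(
        "and the third climber reaches the summit just before sunset.",
        "Hey, are you still watching the climbing documentary?",
    )
    for row in rows:
        print(f"{row.role:6s} x{row.scale}: {row.text}")
```

Trimming from the front favors the newest speech, which suits the live row's role as an awareness layer.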