Cap2

Live captions for Hard of Hearing users, on-device.

Smart glasses + phone · On-device ML · Accessibility

The problem

Existing live-caption tools depend on the cloud and don’t cover the places Hard of Hearing users actually are: phone media in third-party apps, in-person speech across noisy rooms, headphones-on contexts. The audio path is slow, and routing media through the phone speaker broadcasts what the user is listening to into the room.

My role

Solo product and interaction lead. Architecture, UX, hardware integration spec, on-device ML pipeline.

Approach

Three modes (Live / Media / Both) over one on-device Vosk recognizer. Accessibility scraper reads verbatim CC and lyrics out of YouTube, YT Music and Spotify when they’re already on screen. Audio routes through the glasses’ speaker-and-mic for privacy, so the phone speaker stays out of the path.

Cap2

Live captions for Hard of Hearing users, on the device. Audio I/O runs privately through the glasses’ speaker-and-mic so the wearer hears and speaks without broadcasting to the room. Recognition runs on-device with Vosk. Captions show inside the AR display, as a floating pill on the phone screen, or both. For media playing in other apps the system reads captions and lyrics directly out of those apps via Accessibility, falling back to the glasses speaker+mic when an app refuses to expose them. Nothing goes to the cloud, and the phone’s speaker stays out of the path so it doesn’t disturb anyone nearby.

Smart glasses + paired phone · On-device ML

Problem

Existing live-caption tools depend on the cloud, lag on phone media playback, and don’t cover the places Hard of Hearing users actually are: in-person conversation, ambient rooms, phone media.

User

Anyone who needs captions for speech in their environment: primarily Hard of Hearing users, but also people in noisy rooms, language learners, or anyone trying to follow a quiet podcast in a loud cafe.

Approach

One on-device recognizer with three modes. LIVE: conversation around you. MEDIA: audio playing on the phone, with verbatim CC and lyrics scraped from the foreground app via Accessibility. BOTH: split surface. Audio I/O runs privately through the glasses; the phone speaker stays quiet.
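The three modes can be pictured as a small mapping from mode to the pipelines that run. This is a minimal sketch in Python rather than the app's own code; the names `Mode` and `active_pipelines` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    LIVE = "live"    # conversation around the wearer
    MEDIA = "media"  # audio playing on the phone
    BOTH = "both"    # split surface, both streams at once


def active_pipelines(mode: Mode) -> set[str]:
    """Return the caption pipelines that run for a given mode."""
    if mode is Mode.LIVE:
        return {"live"}
    if mode is Mode.MEDIA:
        return {"media"}
    # BOTH runs the live and media pipelines together.
    return {"live", "media"}
```

Keeping the mode as a single value makes it straightforward for the phone to own every source decision, which is why the HUD never needs a source pill.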

What the user sees inside the AR glasses

The Rokid panel is 480 × 640 pixels per eye (portrait), monochrome green Micro-LED on near-transparent waveguides with a 30° diagonal field of view. Captions live in the lower portion. Status text, when present, sits at top center. No source pill anywhere; the phone picks the source.
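For a rough sense of how fine the text can be, the panel figures above give an approximate angular pixel density. The sketch below assumes pixels are spread evenly across the diagonal field of view, which is a simplification and not a measured property of the optics.

```python
import math

WIDTH_PX, HEIGHT_PX = 480, 640  # per-eye panel, portrait (from the text)
DIAGONAL_FOV_DEG = 30.0         # diagonal field of view (from the text)

# Pixel length of the panel diagonal.
diagonal_px = math.hypot(WIDTH_PX, HEIGHT_PX)

# Approximate pixels per degree, assuming uniform angular spacing.
px_per_degree = diagonal_px / DIAGONAL_FOV_DEG

print(f"{diagonal_px:.0f} px diagonal, about {px_per_degree:.1f} px per degree")
```

The diagonal works out to 800 pixels, so roughly 27 pixels cover each degree of view under this assumption.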

cap2 · glasses 01

Idle, waiting for phone

Before captioning starts (or while the Bluetooth link is still negotiating), the panel shows nothing but a small italic status line at the top in dim green. The rest of the field of view is clear so the wearer's view of the world is undisturbed.

cap2 · glasses 02

Partial: the speaker is still talking

The recognizer streams a best-guess transcript while the speaker is mid-sentence. Partials render italic and at lower brightness so the user reads them as provisional. The exact words may shift before the sentence locks in.

cap2 · glasses 03

Final: the sentence is committed

When the speaker pauses, the caption commits at full brightness and medium weight. The previous caption stays visible above at lower opacity so the user can keep one line of context while reading the current one. No chrome around the captions; they sit cleanly in the lower portion of the display.
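The partial and final states above can be sketched as a small state holder fed by recognizer messages. Vosk reports interim text as JSON of the form `{"partial": "..."}` and committed text as `{"text": "..."}`; the `CaptionState` class and its field names are hypothetical, and this is an illustration in Python, not the app's implementation.

```python
import json
from dataclasses import dataclass


@dataclass
class CaptionState:
    previous: str = ""        # last committed caption, shown dimmed above
    current: str = ""         # caption in the main slot
    is_partial: bool = False  # True while the current caption is provisional

    def feed(self, message: str) -> None:
        """Update the state from one recognizer JSON message."""
        data = json.loads(message)
        if "partial" in data:
            text = data["partial"]
            if not text:
                return  # empty partials during silence change nothing
            if not self.is_partial and self.current:
                # A new sentence is starting: move the committed line up.
                self.previous = self.current
            self.current = text
            self.is_partial = True
        elif data.get("text"):
            if not self.is_partial and self.current:
                # Two finals in a row: the older one becomes context.
                self.previous = self.current
            self.current = data["text"]
            self.is_partial = False
```

A renderer would draw `current` dim and italic while `is_partial` is true, and at full brightness once it flips to false, with `previous` at lower opacity above.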

cap2 · glasses 04

Words the recognizer is not sure about

Confidence is tracked per word. Words below the threshold render italic and at lower brightness: proper nouns, words spoken quickly, words half-drowned by background noise. Severity is conveyed by weight and brightness, never by color.
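Per-word styling could look roughly like the sketch below. When word output is enabled, Vosk final results carry a `result` list whose entries include `word` and `conf`; the threshold value and the style labels here are assumptions for illustration, not values from the project.

```python
import json

LOW_CONFIDENCE_THRESHOLD = 0.6  # hypothetical cut-off, not from the source


def style_words(final_message: str,
                threshold: float = LOW_CONFIDENCE_THRESHOLD) -> list[tuple[str, str]]:
    """Pair each recognized word with a display style based on confidence."""
    data = json.loads(final_message)
    styled = []
    for entry in data.get("result", []):
        # Low-confidence words are dimmed and italicized, never recolored.
        style = "dim-italic" if entry["conf"] < threshold else "normal"
        styled.append((entry["word"], style))
    return styled
```

For example, a result with `meet` at confidence 0.97 and `siobhan` at 0.41 would render the name dim and italic while the rest stays at normal weight.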

cap2 · glasses 05

Media mode: captions pulled from the media app

When the phone is in MEDIA mode and the user is watching or listening in another app, the Accessibility scraper reads the captions or lyrics that app already shows on screen and pushes them to the glasses verbatim. The audio path is not used; the user sees the exact text the media itself is displaying.
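The choice between scraped text and the audio fallback mentioned earlier can be sketched as a simple preference order. The function name and the source labels are hypothetical; the real scraper and recognizer are not shown.

```python
from typing import Optional


def media_caption(scraped_text: Optional[str],
                  recognized_text: Optional[str]) -> tuple[str, str]:
    """Return (source, text) for MEDIA mode, preferring on-screen captions."""
    if scraped_text and scraped_text.strip():
        # Captions or lyrics already on screen: pass them through verbatim.
        return ("scraper", scraped_text)
    if recognized_text and recognized_text.strip():
        # The app exposes nothing: fall back to on-device recognition.
        return ("recognizer", recognized_text)
    return ("none", "")
```

Preferring the scraper means the wearer sees exactly what the media displays whenever that text exists, and recognition errors only appear when there is no alternative.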

cap2 · glasses 06

BOTH mode: media captions on top, live conversation below

BOTH mode runs LIVE and MEDIA pipelines simultaneously, and both surface on the glasses HUD. The media stream renders larger at the top because the wearer is paying primary attention to it. The live conversation stream renders smaller in a single line below as the awareness layer for speech happening in the room. A small BOTH status banner at the top of the panel signals the mode. Source decisions happen on the phone side, so the HUD still has no source pill.
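One way to picture the BOTH layout is as an ordered list of rows, with the live stream trimmed to its most recent words so it fits on one line. The row format, the size labels, and the `live_max_chars` limit are assumptions made for this sketch.

```python
def last_words_that_fit(text: str, max_chars: int) -> str:
    """Keep the most recent whole words that fit within max_chars."""
    kept: list[str] = []
    length = 0
    for word in reversed(text.split()):
        extra = len(word) + (1 if kept else 0)  # +1 for the joining space
        if length + extra > max_chars:
            break
        kept.append(word)
        length += extra
    return " ".join(reversed(kept))


def compose_both_hud(media_text: str, live_text: str,
                     live_max_chars: int = 32) -> list[dict]:
    """Build HUD rows top to bottom: status banner, media, then live."""
    rows = [{"slot": "status", "text": "BOTH", "size": "small"}]
    if media_text:
        rows.append({"slot": "media", "text": media_text, "size": "large"})
    live_line = last_words_that_fit(live_text, live_max_chars)
    if live_line:
        rows.append({"slot": "live", "text": live_line, "size": "small"})
    return rows
```

Trimming from the front keeps the newest speech visible, which suits an awareness layer where the latest words matter most.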

Phone app: mode picker, capturing surface, settings

All controls live on one scrollable screen. The user picks a mode and presses Start; the app handles the rest, including routing audio through the glasses’ speaker and mic rather than the phone’s speaker, which would broadcast to the room.

cap2 · phone 01

Main screen: pick a mode, hit start

Three modes: LIVE (speech around you), MEDIA (audio playing on the phone), BOTH (split surface). Audio I/O runs through the glasses’ speaker and mic whenever they’re paired so the wearer hears and speaks privately. No one nearby has to overhear. The phone’s own speaker and mic are a last resort because they’d broadcast to the room. The toggles below let the user pick where captions show up. No source selection: the phone makes that choice.
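The routing preference described above reduces to a short rule. This sketch assumes a boolean pairing state and a user preference that defaults to on, matching the settings described later; the function name and the route labels are hypothetical.

```python
def choose_audio_route(glasses_paired: bool, prefer_glasses: bool = True) -> str:
    """Pick the audio I/O route, keeping the phone as a last resort."""
    if glasses_paired and prefer_glasses:
        return "glasses"  # private to the wearer
    return "phone"        # would broadcast to the room


print(choose_audio_route(glasses_paired=True))   # glasses
print(choose_audio_route(glasses_paired=False))  # phone
```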

cap2 · phone 02

Capturing LIVE: captions scroll on phone and glasses together

When the session is running, the status pill turns green and names the active mic source. The transport-details row shows the Wi-Fi link to the glasses streaming frames. Captions land here on the phone in real time and on the glasses HUD at the same time. Partials show dimmer and italic, finals show solid, and low-confidence words show italic, following the same conventions the glasses use.

cap2 · phone 03

Captions overlaid on a social-media video

A TikTok-style portrait video plays in another app. The creator is talking but the video has no native captions, which is most social video. Cap2’s floating overlay window sits over the lower-third of the screen as a translucent pill with white captions, captured live as the creator speaks. The pill is draggable so the user can move it off the creator’s face or off the engagement-icons column on the right. No other app needs to be modified; the overlay is a system-level window the user opted in to.

cap2 · phone 04

BOTH mode: live conversation and media at the same time

If the user is watching media but still wants to hear the person sitting next to them, BOTH mode runs both pipelines together. Live captions take the top of the surface; media captions fill the bottom. Useful in homes where someone may walk in mid-show, or on transit where a stop announcement matters.

cap2 · phone 05

Appearance and behavior settings

Caption appearance (text size, color preset, glasses HUD text size) sits alongside the two behavior toggles that matter most. The first prefers the glasses’ speaker and mic when paired. It is on by default and keeps audio private to the wearer, because the phone’s speaker would broadcast media into the room; this audio path is also the fallback when the in-app caption scraper can’t fire. The second enables network lookups for lyric fetch in MEDIA mode. It is off by default; when on, it sends only the artist and track title to lrclib.net, never audio. A HUD brightness picker for the glasses display sits below.
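The privacy property of the lyric lookup can be made concrete by building the request from artist and title alone, and only when the toggle is on. The endpoint path and parameter names below are assumptions to be checked against the LRCLIB API documentation; the function only builds a URL and does not send a request.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify against the LRCLIB docs.
LYRICS_ENDPOINT = "https://lrclib.net/api/search"


def lyric_lookup_url(artist: str, title: str,
                     network_lookups_enabled: bool = False) -> Optional[str]:
    """Build a lyric lookup URL, or return None when lookups are off."""
    if not network_lookups_enabled:
        return None  # default: nothing leaves the device
    # Only the artist and track title go into the query, never audio.
    query = urlencode({"artist_name": artist, "track_name": title})
    return f"{LYRICS_ENDPOINT}?{query}"
```

Returning `None` by default mirrors the off-by-default toggle, so a caller has to opt in explicitly before any network request can be formed.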