we’re finally moving from speech-to-text to environment-to-context!! standard voice assistants run everything through an ASR (automatic speech recognition) pipeline, which throws away most of the acoustic context (tone, background sounds, anything that isn’t words) the moment audio becomes a transcript. what OpenHome is showing likely uses native audio transformers or CLAP (Contrastive Language-Audio Pretraining) embeddings to process raw audio spectrograms continuously, so it can detect acoustic events (AED) and paralinguistic cues (sighs, tone of voice) instead of just words. now add an always-on camera feed with vision transformers, and you’ve given your agent eyes to match its spatial hearing. true multimodal sensor fusion may make manual prompting obsolete. just something to think about
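
for anyone curious what the CLAP piece might look like in practice, here’s a minimal sketch of zero-shot acoustic event detection using the open laion/clap-htsat-unfused checkpoint from Hugging Face transformers. the event labels, the 48 kHz mono input, and the 5-second window are my own illustrative assumptions, not anything OpenHome has confirmed about their stack:

```python
# zero-shot acoustic event detection with CLAP embeddings (sketch)
# assumption: 48 kHz mono audio in a numpy array, illustrative event labels
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# candidate acoustic events / paralinguistic cues (made up for the example)
labels = [
    "a person speaking calmly",
    "a person sighing",
    "a door slamming",
    "glass breaking",
    "a dog barking",
    "silence with room tone",
]

# stand-in for one ~5 second chunk from an always-on microphone
audio = np.random.randn(48_000 * 5).astype(np.float32)

inputs = processor(
    text=labels,
    audios=audio,
    sampling_rate=48_000,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    out = model(**inputs)

# similarity of the audio embedding to each text embedding -> event scores
probs = out.logits_per_audio.softmax(dim=-1).squeeze(0)
for label, p in sorted(zip(labels, probs.tolist()), key=lambda x: -x[1]):
    print(f"{p:.2f}  {label}")
```

in a streaming setup you’d run something like this over sliding windows and push the top-scoring events into the agent’s context alongside (or instead of) the transcript — that’s the environment-to-context shift in a nutshell.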