Google researchers have developed a deep-learning audio-visual model that can isolate one speaker's voice in a cacophony of noise.
The 'cocktail party effect' -- the ability to tune out every other voice in a crowd and focus on a single speaker -- comes easily to humans but not machines.
It's also an obstacle to a Google Glass application I'd personally like to see developed one day: a real-time speech-recognition and live-transcription system to support hearing-aid wearers.
Voice separation is a hard nut to crack, but Google's AI researchers may have part of the answer to my Glass dream in the form of a deep-learning audio-visual model that can isolate speech from a mixture of sounds.
The scenario they present is two speakers standing side by side, jabbering simultaneously. The technique hasn't been proven in a real-world crowd, but it does work on a video with two speakers sharing a single audio track.
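At a high level, separation systems like this typically work by predicting a time-frequency mask for each speaker and applying it to the mixed audio's spectrogram. The sketch below is only an illustration of that masking step, not Google's actual model: the mask would normally be produced by the trained audio-visual network, so here a hand-built placeholder mask and two sine tones stand in for two speakers.

```python
import numpy as np
from scipy.signal import stft, istft

sr = 16000
t = np.arange(sr) / sr

# Toy "mixture": two tones standing in for two overlapping speakers.
speaker_a = np.sin(2 * np.pi * 440 * t)    # low-pitched "speaker"
speaker_b = np.sin(2 * np.pi * 1000 * t)   # high-pitched "speaker"
mixture = speaker_a + speaker_b

# Move the single mixed track into the time-frequency domain.
freqs, frames, Z = stft(mixture, fs=sr, nperseg=512)

# Placeholder mask: keep only bins below 700 Hz. In the real system,
# a per-speaker mask like this is what the neural network predicts.
mask = (freqs < 700).astype(float)[:, None]

# Apply the mask and invert back to a waveform for one "speaker".
_, separated = istft(Z * mask, fs=sr, nperseg=512)
```

After masking, `separated` closely tracks the low-pitched tone while the high-pitched one is suppressed, which is the essence of isolating one voice from a shared audio track.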