Study: Using AI to Turn Audio Recordings Into Accurate Street Images | Bolt Picks
The Possibilities AI Brings to Smart City Development, Accessibility Technology, and Beyond

By UT NEWS
Researchers at the University of Texas at Austin have used generative AI to convert audio recordings of street environments into street-view images. The visual accuracy of the generated images suggests that machines can replicate the connection humans make between what they hear and what they see in an environment.
Published in the journal Computers, Environment and Urban Systems, the study describes how the research team trained a Soundscape-to-Image Diffusion Model using audio and visual data collected from various urban and rural street scenes, then used this model to generate images from audio recordings.
"Our research found that acoustic environments contain sufficient visual cues to generate highly recognizable street-view images that accurately depict different places," said Yuhao Kang, assistant professor in UT Austin's Department of Geography and the Environment and co-author of the study. "This means we can transform acoustic environments into vivid visual representations, effectively turning sounds into sights."

Image: AI-generated images vs. actual scenes
Research Methods
The research team built pairs of 10-second audio clips and still images from YouTube videos recorded in cities across North America, Asia, and Europe, and used them to train an AI model capable of generating high-resolution images from audio input. They then compared 100 AI-generated images against their corresponding real photographs using both human and computer evaluation. The computer assessment measured how closely the proportions of greenery, buildings, and sky in the generated images correlated with those in the source images, while human judges were asked to pick, from three generated images, the one that matched an audio sample.
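The two evaluations described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each image has already been turned into a per-pixel label map by some segmentation step, it uses the Pearson coefficient as the correlation measure (the article does not say which measure the study used), and the label names and function names are hypothetical.

```python
from math import sqrt

# Hypothetical label names; the study's actual segmentation classes may differ.
CLASSES = ("greenery", "building", "sky")


def class_proportions(label_map):
    """Return the fraction of pixels carrying each label in CLASSES.

    label_map is a 2D list of per-pixel label strings (assumed input format).
    """
    pixels = [label for row in label_map for label in row]
    total = len(pixels)
    return {c: pixels.count(c) / total for c in CLASSES}


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def proportion_correlations(real_maps, generated_maps):
    """Correlate per-class proportions across paired real/generated images."""
    real = [class_proportions(m) for m in real_maps]
    gen = [class_proportions(m) for m in generated_maps]
    return {
        c: pearson([r[c] for r in real], [g[c] for g in gen])
        for c in CLASSES
    }


def matching_accuracy(choices, correct):
    """Share of human judgments that picked the correct generated image."""
    hits = sum(1 for chosen, truth in zip(choices, correct) if chosen == truth)
    return hits / len(choices)
```

With three candidate images per audio sample, random guessing would yield a matching accuracy of about one in three, which is the baseline against which the human results should be read.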
Results
Key findings:
- AI-generated images showed strong correlation with real scenes in sky and greenery proportions; building proportion correlation was slightly lower.
- Human participants matched source audio samples to corresponding generated images with an average accuracy of 80%.
"Traditionally, imagining a scene from sound has been a uniquely human ability, reflecting our deep sensory connection with the environment. Using advanced AI technology supported by large language models, we've demonstrated AI's potential to approximate this human perceptual experience," Kang said. "This suggests AI's capabilities extend beyond merely recognizing physical environments; it also holds promise for enriching people's subjective experiences of different places."

Image: Correlation comparison between AI-generated images and actual scenes
Additional Findings
Beyond matching real images in sky, greenery, and building proportions, the generated images generally preserved the architectural styles and the distances between objects seen in their real-world counterparts, and accurately reflected whether soundscapes were recorded under sunny, cloudy, or nighttime lighting conditions. The authors note that the lighting information likely comes from variations in soundscape activity: traffic sounds or the chirping of nocturnal insects, for example, can reveal the time of day. These observations advance our understanding of how multisensory factors shape people's feelings about and experience of a place.
"When you close your eyes and listen, the sounds around you paint pictures in your mind," Kang said. "The distant hum of traffic becomes a bustling cityscape, while the gentle rustle of leaves transports you to a serene forest. Each sound weaves vivid scenes, as if by magic, in the theater of your imagination."
Future Research Directions
This study demonstrates the innovative potential of AI technology, opening new possibilities for smart city development, accessibility technology, and other fields. Kang's team is further exploring AI applications in urban feature recognition, with related research published in the international journal Nature.
🔗 Paper link:
https://www.sciencedirect.com/science/article/abs/pii/S0198971524000516

Linear Bolt
Bolt is an investment initiative by Linear Capital designed for early-stage AI applications targeting global markets. It upholds Linear's investment philosophy, focusing on projects where technology drives transformative change, and aims to help founders find the shortest path to their goals. In both its speed of action and its investment approach, Bolt commits to being lighter, faster, and more flexible. In the first half of 2024, Bolt invested in seven AI application projects, including Final Round, Xinguang, Cathoven, Xbuddy, and Midreal.