multicloud365

Pushing the frontiers of audio generation

by admin
May 1, 2025
in AI and Machine Learning in the Cloud

Technologies

Published
30 October 2024
Authors

Zalán Borsos, Matt Sharifi and Marco Tagliasacchi

An illustration depicting speech patterns, iterative progress on dialogue generation, and a relaxed conversation between two voices.

Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology for generating natural, dynamic voices continues to improve, we’re unlocking richer, more engaging digital experiences.

Over the past few years, we’ve been pushing the frontiers of audio generation, developing models that can create high-quality, natural speech from a range of inputs, such as text, tempo controls and particular voices. This technology powers single-speaker audio in many Google products and experiments, including Gemini Live, Project Astra, Journey Voices and YouTube’s auto dubbing, and is helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Working together with partners across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue for making complex content more accessible:

  • NotebookLM Audio Overviews turns uploaded documents into engaging and lively dialogue. With one click, two AI hosts summarize user material, make connections between topics and banter back and forth.
  • Illuminate creates formal AI-generated discussions about research papers to help make knowledge more accessible and digestible.

Here, we provide an overview of our latest speech generation research underpinning all of these products and experimental tools.

Pioneering techniques for audio generation

For years, we've been investing in audio generation research and exploring new ways of generating more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.

This extended our earlier work, SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input without compromising its quality. As part of the training process, SoundStream learns how to map audio to a range of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.

AudioLM treats audio generation as a language modeling task to produce the acoustic tokens of codecs like SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and can flexibly handle a variety of sounds without needing architectural adjustments, which makes it a good candidate for modeling multi-speaker dialogues.

Example of a multi-speaker dialogue generated by NotebookLM Audio Overview, based on a few potato-related documents.

Building upon this research, our latest speech generation technology can produce 2 minutes of dialogue, with improved naturalness, speaker consistency and acoustic quality, when given a script of dialogue and speaker turn markers. The model also performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio over 40 times faster than real time.
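The "over 40 times faster than real time" figure follows directly from the two numbers quoted above. As a quick check, the sketch below divides the audio duration by the compute time; the function name is only illustrative and not part of any product API.

```python
def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


# Figures quoted in the text: 2 minutes of dialogue in under 3 seconds.
AUDIO_SECONDS = 2 * 60
COMPUTE_SECONDS_UPPER_BOUND = 3.0

# Because 3 seconds is an upper bound on compute time, the result is a
# lower bound on the speed-up.
speedup_lower_bound = real_time_factor(AUDIO_SECONDS, COMPUTE_SECONDS_UPPER_BOUND)
print(f"at least {speedup_lower_bound:.0f}x faster than real time")
```

Since the compute time is stated as "under 3 seconds", 40 is a floor rather than an exact measurement.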

Scaling our audio generation models

Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec for compressing audio into a sequence of tokens, at as little as 600 bits per second, without compromising the quality of its output.

The tokens produced by our codec have a hierarchical structure and are grouped by time frames. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
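To make the coarse-to-fine grouping concrete, here is a minimal sketch using residual quantization, the general approach SoundStream is known for. The text does not say that the newer codec works this way, so treat this purely as an illustration: the scalar "frames", the codebook values and all function names are hypothetical.

```python
from typing import List, Sequence

# Hypothetical toy codebooks, ordered coarse to fine (not the real codec's).
CODEBOOKS: List[List[float]] = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-0.2, -0.1, 0.0, 0.1, 0.2],
    [-0.04, -0.02, 0.0, 0.02, 0.04],
]


def nearest_index(value: float, codebook: Sequence[float]) -> int:
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def encode_frame(value: float) -> List[int]:
    """Quantize one frame value into one token per level, coarse first."""
    tokens: List[int] = []
    residual = value
    for codebook in CODEBOOKS:
        index = nearest_index(residual, codebook)
        tokens.append(index)
        residual -= codebook[index]  # the next level refines what is left over
    return tokens


def decode_frame(tokens: Sequence[int], levels: int) -> float:
    """Reconstruct a frame value from its first `levels` tokens."""
    return sum(CODEBOOKS[level][token] for level, token in enumerate(tokens[:levels]))


def flatten(frames: Sequence[Sequence[int]]) -> List[int]:
    """Lay frame groups end to end: all tokens of frame 0, then frame 1, ..."""
    return [token for frame in frames for token in frame]


frame_values = [-0.36, 0.62, 0.9]  # made-up per-frame signal values
frames = [encode_frame(value) for value in frame_values]
print("tokens grouped by frame:", frames)
print("flattened token stream: ", flatten(frames))
for levels in (1, 2, 3):
    errors = [abs(v - decode_frame(f, levels)) for v, f in zip(frame_values, frames)]
    print(f"levels used: {levels}, reconstruction errors:", [round(e, 3) for e in errors])
```

In this toy, decoding with only the first token of each group gives a rough reconstruction, and each additional token shrinks the error. That mirrors the idea that early tokens carry the broad content while later tokens add fine detail.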

Even with our new speech codec, producing a 2-minute dialogue requires generating over 5000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
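A rough back-of-the-envelope calculation shows what these figures imply. It assumes, which the text does not state, that the 5000-token count corresponds to the 600 bits-per-second setting. The variable names are illustrative.

```python
BITRATE_BPS = 600          # lowest codec bitrate quoted in the text
DIALOGUE_SECONDS = 2 * 60  # a 2-minute dialogue
MIN_TOKENS = 5000          # the text says "over 5000 tokens"

# Total size of the compressed dialogue at this bitrate.
total_bits = BITRATE_BPS * DIALOGUE_SECONDS

# "Over 5000 tokens" gives a lower bound on the token rate...
min_tokens_per_second = MIN_TOKENS / DIALOGUE_SECONDS

# ...and therefore an upper bound on the average bits carried per token
# (assumption: both figures describe the same codec setting).
max_avg_bits_per_token = total_bits / MIN_TOKENS

print(f"compressed size: {total_bits} bits ({total_bits // 8} bytes)")
print(f"token rate: more than {min_tokens_per_second:.1f} tokens per second")
print(f"average payload: at most {max_avg_bits_per_token:.1f} bits per token")
```

Under that assumption, two minutes of dialogue fit in about 9 kilobytes, yet the model still has to predict several thousand tokens in sequence, which is why sequence length drives the architecture choice.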

With this technique, we can efficiently generate acoustic tokens that correspond to the dialogue, within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
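The shape of an autoregressive pass can be sketched with a stand-in for the model. Everything here is hypothetical: `toy_next_token` is a deterministic placeholder rather than a trained Transformer, the `[S1]`/`[S2]` markers are an invented turn-marker format, and the frame and vocabulary sizes are arbitrary. Decoding tokens to a waveform is omitted because it needs the codec.

```python
from typing import Callable, List, Sequence

TOKENS_PER_FRAME = 3  # hypothetical: coarse-to-fine tokens per time frame
VOCAB_SIZE = 5        # hypothetical codebook size


def toy_next_token(script: str, prefix: Sequence[int]) -> int:
    """Placeholder model: a deterministic function of the script and of
    every token generated so far (a real model would be a neural network)."""
    state = sum(ord(char) for char in script)
    for position, token in enumerate(prefix):
        state = (state * 31 + token + position) % 1_000_003
    return state % VOCAB_SIZE


def generate(
    script: str,
    num_frames: int,
    next_token: Callable[[str, Sequence[int]], int] = toy_next_token,
) -> List[int]:
    """Generate tokens one at a time, each conditioned on the full prefix."""
    tokens: List[int] = []
    for _ in range(num_frames * TOKENS_PER_FRAME):
        tokens.append(next_token(script, tokens))
    return tokens


def group_by_frame(tokens: Sequence[int]) -> List[List[int]]:
    """Regroup the flat stream into per-frame token groups."""
    return [
        list(tokens[start:start + TOKENS_PER_FRAME])
        for start in range(0, len(tokens), TOKENS_PER_FRAME)
    ]


script = "[S1] Have you tried roasting them? [S2] Every single week."
stream = generate(script, num_frames=4)
print(group_by_frame(stream))
```

Because each token depends only on the script and on earlier tokens, generating fewer frames yields a prefix of the longer stream. This is the property that lets the output be produced, and in principle decoded, as a continuous stream.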

Animation showing how our speech generation model produces a stream of audio tokens autoregressively, which are decoded back to a waveform consisting of a two-speaker dialogue.

To teach our model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. Then we finetuned it on a much smaller dataset of dialogue with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a number of voice actors and realistic disfluencies: the “umm”s and “aah”s of real conversation. This step taught the model how to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.

In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we’re incorporating our SynthID technology to watermark non-transient AI-generated audio content from these models, to help safeguard against the potential misuse of this technology.

New speech experiences ahead

We’re now focused on improving our model’s fluency and acoustic quality and on adding more fine-grained controls for features like prosody, while exploring how best to combine these advances with other modalities, such as video.

The potential applications for advanced speech generation are vast, especially when combined with our Gemini family of models. From enhancing learning experiences to making content more universally accessible, we’re excited to continue pushing the boundaries of what’s possible with voice-based technologies.

Acknowledgements

Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas, Sertan Girgin, Jonas Rothfuss, Jake Walker and Marco Tagliasacchi.

We thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong and RJ Skerry-Ryan for their critical efforts on dialogue data.

We’re very grateful to our collaborators across Labs, Illuminate, Cloud, Speech and YouTube for their outstanding work bringing these models into products.

We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their guidance on the project.
