At Nvidia’s Speech AI Summit today, the company announced its new speech artificial intelligence (AI) ecosystem, developed through a partnership with Mozilla Common Voice. The ecosystem focuses on developing crowdsourced multilingual speech corpora and open-source pretrained models. Nvidia and Mozilla Common Voice aim to accelerate the growth of automatic speech recognition models that work universally for every language speaker in the world.
Nvidia found that standard voice assistants, such as Amazon Alexa and Google Home, support less than 1% of languages spoken globally. To address this issue, the company aims to improve language inclusion in voice AI and expand the availability of voice data for global and low-resource languages.
Nvidia joins a race that Meta and Google are already leading: both companies recently released voice AI models meant to facilitate communication between people who speak different languages. Google’s Translation Hub, an AI-powered translation service, can translate large volumes of documents into many different languages. Google also just announced that it is building a universal voice translator, trained on over 400 languages, with the claim that it offers the “widest language model coverage seen in a voice model today.”
At the same time, Meta AI’s Universal Speech Translator (UST) project aims to build AI systems that enable real-time speech-to-speech translation across all languages, even those that are spoken but not commonly written.
An ecosystem for language users around the world
According to Nvidia, language inclusion for voice AI has comprehensive benefits for data health, such as helping AI models understand speaker diversity and a range of noise profiles. The new speech AI ecosystem helps developers build, maintain, and improve speech AI models and datasets for language inclusion, usability, and experience. Users can train their models on Mozilla Common Voice datasets and then offer those pretrained models as high-quality automatic speech recognition architectures, which other organizations and individuals around the world can adapt to build their own voice AI applications.
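To make that workflow concrete, here is a minimal sketch, assuming the Hugging Face datasets library and Nvidia’s open-source NeMo toolkit; the dataset version, language code, and checkpoint name are illustrative assumptions, not details announced by Nvidia or Mozilla.

```python
# Hedged sketch of the workflow described above: pull a Common Voice split
# and start from a pretrained ASR checkpoint instead of training from scratch.
# Dataset version, language code, and checkpoint name are illustrative.
from datasets import load_dataset
import nemo.collections.asr as nemo_asr

# Load a Common Voice split for one language (here Bengali, "bn"),
# as published by the Mozilla Foundation on the Hugging Face Hub.
common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "bn", split="train"
)

# Load a pretrained ASR architecture to adapt to the new language.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="stt_en_conformer_ctc_small"
)

# From here, the Common Voice clips would be converted into a training
# manifest and used to fine-tune the pretrained model.
```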
“Demographic diversity is key to capturing linguistic diversity,” said Caroline de Brito Gottlieb, product manager at Nvidia. “There are several vital factors impacting speech variation, such as underserved dialects, sociolects, pidgins, and accents. Through this partnership, we aim to create an ecosystem of datasets that helps communities to create datasets and speech models for any language or context.”
The Mozilla Common Voice platform currently supports 100 languages, with 24,000 hours of voice data available from 500,000 contributors worldwide. The latest version of the Common Voice dataset also includes six new languages: Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona, and Cantonese, as well as more voice data from female speakers.
Through the Mozilla Common Voice platform, users can donate audio data by recording phrases as short voice clips, which Mozilla validates to ensure dataset quality upon submission.
“The Voice AI ecosystem is largely focused not only on language diversity, but also on the accents and noise profiles of different language speakers across the world,” Siddharth Sharma, head of product marketing for AI and deep learning at Nvidia, told VentureBeat. “This has been our sole focus at Nvidia and we’ve created a solution that can be customized for every aspect of the voice AI model pipeline.”
Current implementations of Nvidia’s voice AI
The company develops speech AI for several use cases, such as automatic speech recognition (ASR), automatic speech translation (AST), and speech synthesis. Nvidia Riva, part of the Nvidia AI platform, provides state-of-the-art GPU-optimized workflows for building and deploying fully customizable, real-time AI pipelines for applications such as contact center agent assists, virtual assistants, digital avatars, branded voices, and video conferencing transcription. Applications developed through Riva can be deployed across all types of clouds and data centers, at the edge, or on embedded devices.
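From the application side, a Riva pipeline is consumed over gRPC. The snippet below is a rough sketch using Nvidia’s nvidia-riva-client Python package to request an offline transcription; the server address and audio file are placeholder assumptions, and a Riva server must already be deployed and reachable.

```python
# Hedged sketch: requesting an offline transcription from a deployed Riva
# server via the nvidia-riva-client package. URI and file path are placeholders.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")  # gRPC endpoint of the Riva server
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",            # one of the languages the server supports
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

with open("sample.wav", "rb") as f:   # placeholder audio file
    audio_bytes = f.read()

response = asr_service.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)
```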
NCS, a multinational corporation and transportation technology partner of the Singapore government, customized Nvidia’s Riva FastPitch model and built its own text-to-speech engine for Singaporean English using voice data from local speakers. NCS recently built Breeze, a local driver app that translates languages such as Mandarin, Hokkien, Malay, and Tamil into Singaporean English with the same clarity and expressiveness with which a native Singaporean would speak them.
Mobile communications conglomerate T-Mobile has also partnered with Nvidia to develop AI-powered software for its customer experience centers that transcribes customer conversations in real time and recommends solutions to the thousands of agents working on the front lines. To create the software, T-Mobile used Nvidia NeMo, an open-source framework for cutting-edge conversational AI models, alongside Riva. These Nvidia tools allowed T-Mobile engineers to fine-tune ASR models on T-Mobile’s custom datasets and accurately interpret customer jargon in noisy environments.
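As a rough illustration of that kind of fine-tuning, the sketch below adapts a pretrained NeMo ASR checkpoint to a custom dataset; the manifest path, checkpoint, and training settings are assumptions for illustration, not T-Mobile’s actual configuration.

```python
# Hedged sketch: fine-tuning a pretrained NeMo ASR model on a custom dataset.
# Manifest path, checkpoint, and hyperparameters are illustrative only.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="stt_en_conformer_ctc_small"
)

# Point the model at a NeMo-style manifest: one JSON object per line with
# "audio_filepath", "duration", and "text" fields.
model.setup_training_data(
    train_data_config={
        "manifest_filepath": "train_manifest.json",  # placeholder path
        "labels": model.decoder.vocabulary,          # reuse the model's vocabulary
        "sample_rate": 16000,
        "batch_size": 16,
        "shuffle": True,
    }
)

trainer = pl.Trainer(max_epochs=5, accelerator="gpu", devices=1)
trainer.fit(model)
```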
Nvidia’s future focuses on voice AI
Sharma says Nvidia aims to incorporate current developments in AST and next-generation voice AI into real-time metaverse use cases.
“Today we are limited to offering only slow translations from one language to another, and those translations have to go through text,” he said. “But the future is one where you can have people in the metaverse, speaking so many different languages, all able to have instant translation with each other.”
“The next step,” he added, “is to develop systems that will enable seamless interactions with people around the world through voice recognition for all languages and real-time text-to-speech.”