NVIDIA Riva: ASR and TTS
NVIDIA Riva
NVIDIA Riva is a GPU-accelerated multilingual speech and translation AI software development kit for building fully customizable, real-time conversational AI pipelines—including automatic speech recognition (ASR), text-to-speech (TTS), and neural machine translation (NMT) applications—that can be deployed in clouds, in data centers, at the edge, or on embedded devices.
The Riva Speech API server exposes a simple API for performing speech recognition, speech synthesis, and a variety of natural language processing inferences and is integrated into LangChain for ASR and TTS. See instructions on how to setup a Riva Speech API server below.
Integrating NVIDIA Riva to LangChain Chains
The NVIDIARivaASR
, NVIDIARivaTTS
utility runnables are LangChain runnables that integrate NVIDIA Riva into LCEL chains for Automatic Speech Recognition (ASR) and Text To Speech (TTS).
This example goes over how to use these LangChain runnables to:
- Accept streamed audio,
- convert the audio to text,
- send the text to an LLM,
- stream a textual LLM response, and
- convert the response to streamed human-sounding audio.
1. NVIDIA Riva Runnables
There are 2 Riva Runnables:
a. RivaASR: Converts audio bytes into text for an LLM using NVIDIA Riva.
b. RivaTTS: Converts text into audio bytes using NVIDIA Riva.
a. RivaASR
The RivaASR runnable converts audio bytes into a string for an LLM using NVIDIA Riva.
It's useful for sending an audio stream (a message containing streaming audio) into a chain and preprocessing that audio by converting it to a string to create an LLM prompt.
ASRInputType = AudioStream # the AudioStream type is a custom type for a message queue containing streaming audio
ASROutputType = str
class RivaASR(
RivaAuthMixin,
RivaCommonConfigMixin,
RunnableSerializable[ASRInputType, ASROutputType],
):
"""A runnable that performs Automatic Speech Recognition (ASR) using NVIDIA Riva."""
name: str = "nvidia_riva_asr"
description: str = (
"A Runnable for converting audio bytes to a string."
"This is useful for feeding an audio stream into a chain and"
"preprocessing that audio to create an LLM prompt."
)
# riva options
audio_channel_count: int = Field(
1, description="The number of audio channels in the input audio stream."
)
profanity_filter: bool = Field(
True,
description=(
"Controls whether or not Riva should attempt to filter "
"profanity out of the transcribed text."
),
)
enable_automatic_punctuation: bool = Field(
True,
description=(
"Controls whether Riva should attempt to correct "
"senetence puncuation in the transcribed text."
),
)
When this runnable is called on an input, it takes an input audio stream that acts as a queue and concatenates transcription as chunks are returned.After a response is fully generated, a string is returned.
- Note that since the LLM requires a full query the ASR is concatenated and not streamed in token-by-token.
b. RivaTTS
The RivaTTS runnable converts text output to audio bytes.
It's useful for processing the streamed textual response from an LLM by converting the text to audio bytes. These audio bytes sound like a natural human voice to be played back to the user.
TTSInputType = Union[str, AnyMessage, PromptValue]
TTSOutputType = byte
class RivaTTS(
RivaAuthMixin,
RivaCommonConfigMixin,
RunnableSerializable[TTSInputType, TTSOutputType],
):
"""A runnable that performs Text-to-Speech (TTS) with NVIDIA Riva."""
name: str = "nvidia_riva_tts"
description: str = (
"A tool for converting text to speech."
"This is useful for converting LLM output into audio bytes."
)
# riva options
voice_name: str = Field(
"English-US.Female-1",
description=(
"The voice model in Riva to use for speech. "
"Pre-trained models are documented in "
"[the Riva documentation]"
"(https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html)."
),
)
output_directory: Optional[str] = Field(
None,
description=(
"The directory where all audio files should be saved. "
"A null value indicates that wave files should not be saved. "
"This is useful for debugging purposes."
),
When this runnable is called on an input, it takes iterable text chunks and streams them into output audio bytes that are either written to a .wav
file or played out loud.
2. Installation
The NVIDIA Riva client library must be installed.
%pip install --upgrade --quiet nvidia-riva-client
Note: you may need to restart the kernel to use updated packages.
3. Setup
To get started with NVIDIA Riva:
- Follow the Riva Quick Start setup instructions for Local Deployment Using Quick Start Scripts.
4. Import and Inspect Runnables
Import the RivaASR and RivaTTS runnables and inspect their schemas to understand their fields.
import json
from langchain_community.utilities.nvidia_riva import (
RivaASR,
RivaTTS,
)
Let's view the schemas.
print(json.dumps(RivaASR.schema(), indent=2))
print(json.dumps(RivaTTS.schema(), indent=2))
{
"title": "RivaASR",
"description": "A runnable that performs Automatic Speech Recognition (ASR) using NVIDIA Riva.",
"type": "object",
"properties": {
"name": {
"title": "Name",
"default": "nvidia_riva_asr",
"type": "string"
},
"encoding": {
"description": "The encoding on the audio stream.",
"default": "LINEAR_PCM",
"allOf": [
{
"$ref": "#/definitions/RivaAudioEncoding"
}
]
},
"sample_rate_hertz": {
"title": "Sample Rate Hertz",
"description": "The sample rate frequency of audio stream.",
"default": 8000,
"type": "integer"
},
"language_code": {
"title": "Language Code",
"description": "The [BCP-47 language code](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) for the target language.",
"default": "en-US",
"type": "string"
},
"url": {
"title": "Url",
"description": "The full URL where the Riva service can be found.",
"default": "http://localhost:50051",
"examples": [
"http://localhost:50051",
"https://user@pass:riva.example.com"
],
"anyOf": [
{
"type": "string",
"minLength": 1,
"maxLength": 65536,
"format": "uri"
},
{
"type": "string"
}
]
},
"ssl_cert": {
"title": "Ssl Cert",
"description": "A full path to the file where Riva's public ssl key can be read.",
"type": "string"
},
"description": {
"title": "Description",
"default": "A Runnable for converting audio bytes to a string.This is useful for feeding an audio stream into a chain andpreprocessing that audio to create an LLM prompt.",
"type": "string"
},
"audio_channel_count": {
"title": "Audio Channel Count",
"description": "The number of audio channels in the input audio stream.",
"default": 1,
"type": "integer"
},
"profanity_filter": {
"title": "Profanity Filter",
"description": "Controls whether or not Riva should attempt to filter profanity out of the transcribed text.",
"default": true,
"type": "boolean"
},
"enable_automatic_punctuation": {
"title": "Enable Automatic Punctuation",
"description": "Controls whether Riva should attempt to correct senetence puncuation in the transcribed text.",
"default": true,
"type": "boolean"
}
},
"definitions": {
"RivaAudioEncoding": {
"title": "RivaAudioEncoding",
"description": "An enum of the possible choices for Riva audio encoding.\n\nThe list of types exposed by the Riva GRPC Protobuf files can be found\nwith the following commands:\n\`\`\`python\nimport riva.client\nprint(riva.client.AudioEncoding.keys()) # noqa: T201\n\`\`\`",
"enum": [
"ALAW",
"ENCODING_UNSPECIFIED",
"FLAC",
"LINEAR_PCM",
"MULAW",
"OGGOPUS"
],
"type": "string"
}
}
}
{
"title": "RivaTTS",
"description": "A runnable that performs Text-to-Speech (TTS) with NVIDIA Riva.",
"type": "object",
"properties": {
"name": {
"title": "Name",
"default": "nvidia_riva_tts",
"type": "string"
},
"encoding": {
"description": "The encoding on the audio stream.",
"default": "LINEAR_PCM",
"allOf": [
{
"$ref": "#/definitions/RivaAudioEncoding"
}
]
},
"sample_rate_hertz": {
"title": "Sample Rate Hertz",
"description": "The sample rate frequency of audio stream.",
"default": 8000,
"type": "integer"
},
"language_code": {
"title": "Language Code",
"description": "The [BCP-47 language code](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) for the target language.",
"default": "en-US",
"type": "string"
},
"url": {
"title": "Url",
"description": "The full URL where the Riva service can be found.",
"default": "http://localhost:50051",
"examples": [
"http://localhost:50051",
"https://user@pass:riva.example.com"
],
"anyOf": [
{
"type": "string",
"minLength": 1,
"maxLength": 65536,
"format": "uri"
},
{
"type": "string"
}
]
},
"ssl_cert": {
"title": "Ssl Cert",
"description": "A full path to the file where Riva's public ssl key can be read.",
"type": "string"
},
"description": {
"title": "Description",
"default": "A tool for converting text to speech.This is useful for converting LLM output into audio bytes.",
"type": "string"
},
"voice_name": {
"title": "Voice Name",
"description": "The voice model in Riva to use for speech. Pre-trained models are documented in [the Riva documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html).",
"default": "English-US.Female-1",
"type": "string"
},
"output_directory": {
"title": "Output Directory",
"description": "The directory where all audio files should be saved. A null value indicates that wave files should not be saved. This is useful for debugging purposes.",
"type": "string"
}
},
"definitions": {
"RivaAudioEncoding": {
"title": "RivaAudioEncoding",
"description": "An enum of the possible choices for Riva audio encoding.\n\nThe list of types exposed by the Riva GRPC Protobuf files can be found\nwith the following commands:\n\`\`\`python\nimport riva.client\nprint(riva.client.AudioEncoding.keys()) # noqa: T201\n\`\`\`",
"enum": [
"ALAW",
"ENCODING_UNSPECIFIED",
"FLAC",
"LINEAR_PCM",
"MULAW",
"OGGOPUS"
],
"type": "string"
}
}
}