Microsoft Azure Voices

The default TTS provider

✴️ Microsoft TTS

The Microsoft Azure Text to Speech provider supports more than 100 languages and offers a large variety of authentic human voices, including different intonations and voice styles. This provider is integrated into D-ID's API: choose your desired voice and reference it in your API request.

✴️ Example Usage

D-ID provides a Microsoft Azure Cognitive Services integration to generate text to speech. Set the provider object in your request, optionally adding a voice style:

"provider": {
  "type": "microsoft",
  "voice_id": "en-GB-AbbiNeural"
}

"provider": {
  "type": "microsoft",
  "voice_id": "en-US-JennyNeural",
  "voice_config": {
    "style": "Cheerful"
  }
}

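The provider fragments above belong inside the script object of a request body. A minimal Python sketch of assembling such a body follows. The helper names `build_provider` and `build_talk_payload` and the placeholder image URL are hypothetical, and the `source_url`/`script` layout mirrors the full JSON example further down this page. The code only builds a dictionary and sends nothing.

```python
import json
from typing import Optional


def build_provider(voice_id: str, style: Optional[str] = None) -> dict:
    """Build the "provider" object for Microsoft TTS."""
    provider = {"type": "microsoft", "voice_id": voice_id}
    if style is not None:
        # voice_config is optional; include it only when a style is requested.
        provider["voice_config"] = {"style": style}
    return provider


def build_talk_payload(source_url: str, text: str, voice_id: str,
                       style: Optional[str] = None) -> dict:
    """Assemble a request body with a text script (hypothetical helper)."""
    return {
        "source_url": source_url,
        "script": {
            "type": "text",
            "input": text,
            "provider": build_provider(voice_id, style),
        },
    }


if __name__ == "__main__":
    payload = build_talk_payload(
        "https://example.com/portrait.jpg",  # placeholder image URL (assumption)
        "Hello from Microsoft TTS",
        "en-US-JennyNeural",
        style="Cheerful",
    )
    print(json.dumps(payload, indent=2))
```

Keeping the style optional means the same helper covers both fragments: with and without voice_config.
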
✴️ Available Voices

Go to the Microsoft Voice Gallery, and:

  1. Select any voice from the voice gallery
  2. Click on the "Sample code" tab on the right
  3. Copy the voice name: config.SpeechSynthesisVoiceName ="en-GB-AbbiNeural"
  4. Use this string en-GB-AbbiNeural in the voice_id field
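If you collect voice names from several gallery snippets, steps 3 and 4 can be automated. The sketch below pulls the voice name out of a pasted sample-code line; the function name `extract_voice_id` is hypothetical, and the pattern assumes the line has the same shape as the one shown in step 3.

```python
import re

# Matches the assignment shown in the gallery's sample code, tolerating
# optional spaces around "=".
_VOICE_NAME = re.compile(r'SpeechSynthesisVoiceName\s*=\s*"([^"]+)"')


def extract_voice_id(sample_code: str) -> str:
    """Return the voice name found in a pasted sample-code snippet."""
    match = _VOICE_NAME.search(sample_code)
    if match is None:
        raise ValueError("no SpeechSynthesisVoiceName assignment found")
    return match.group(1)


if __name__ == "__main__":
    line = 'config.SpeechSynthesisVoiceName ="en-GB-AbbiNeural"'
    print(extract_voice_id(line))  # en-GB-AbbiNeural
```
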


Get all Text-to-Speech supported voices

See the /voices endpoint to get all the supported voices from all integrated TTS providers.
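
Once the list has been fetched and decoded, narrowing it to Microsoft voices for one locale is a small filtering step. The sketch below is illustrative only: the field names `id` and `provider` and the sample records are assumptions, not the confirmed response format, so check them against the actual /voices response before relying on them.

```python
from typing import Iterable, List


def microsoft_voice_ids(voices: Iterable[dict], locale: str) -> List[str]:
    """Return IDs of Microsoft voices whose ID starts with the given locale."""
    prefix = locale + "-"
    return [
        v["id"]
        for v in voices
        # "provider" and "id" are assumed field names, not confirmed ones.
        if v.get("provider") == "microsoft" and v.get("id", "").startswith(prefix)
    ]


if __name__ == "__main__":
    # Hand-written sample records standing in for a decoded /voices response.
    sample = [
        {"id": "en-GB-AbbiNeural", "provider": "microsoft"},
        {"id": "en-US-JennyNeural", "provider": "microsoft"},
        {"id": "some-other-voice", "provider": "other"},
    ]
    print(microsoft_voice_ids(sample, "en-US"))  # ['en-US-JennyNeural']
```
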


Using other text to speech providers

You can also generate the audio with any other external provider you like, and either pass it as an audio URL or upload it as an audio file.

✴️ Adding Pauses

You can add pauses to the audio generated by Microsoft TTS using SSML (Speech Synthesis Markup Language), with the syntax shown in the examples below. Adding <break time="5000ms"/> creates a natural 5-second pause in the speech. It is not just silence inserted between words: the AI understands this syntax and paces the speech around the pause. Inside a JSON string, the inner double quotes must be escaped, as in <break time=\"5000ms\"/>.

      "ssml": true,
      "input": "Enjoy the 5000 milliseconds break <break time=\"5000ms\"/> generated by Microsoft",

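Escaping the quotes by hand is easy to get wrong, so it can help to build the SSML string in code and let a JSON encoder add the backslashes. The following is a small sketch; the helper names `break_tag` and `ssml_script_fields` are hypothetical and are not part of any D-ID SDK.

```python
import json


def break_tag(milliseconds: int) -> str:
    """Return an SSML break element for the given duration."""
    if milliseconds <= 0:
        raise ValueError("pause length must be positive")
    return f'<break time="{milliseconds}ms"/>'


def ssml_script_fields(before: str, pause_ms: int, after: str) -> dict:
    """Build the "ssml" and "input" fields with one pause between two phrases."""
    return {"ssml": True, "input": f"{before} {break_tag(pause_ms)} {after}"}


if __name__ == "__main__":
    fields = ssml_script_fields(
        "Enjoy the 5000 milliseconds break", 5000, "generated by Microsoft"
    )
    # json.dumps adds the backslashes before the inner quotes automatically.
    print(json.dumps(fields))
```
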
✴️ Fully Silent (Idle) Video Example

A fully silent (idle) video is a video generated using the /talks endpoint without any words: the avatar stays silent and doesn't pronounce anything. See the silent video example below. Head movements, gentle smiles, and eye blinks still persist, which makes the silent video a useful asset in your projects.

A silent video can be used in various chatbot applications while the avatar is awaiting the user's input. Once the bot's answer is ready to be shown (stream data has arrived), the silent video can be replaced by the talking video using a CSS transition that fades the opacity in and out.
In this example, Chat D-ID, the idle video is cached by the browser on the first download and looped in the background. Once the answer is ready, it is replaced as described above.

To create a silent video of 15 seconds (recommended), create a /talks request with the Microsoft TTS provider and three consecutive silent breaks of 5 seconds each. In the request example below, "ssml": true indicates that the markup language will be used, and <break time=\"5000ms\"/> is a 5-second pause of silence. Three such pauses add up to a 15-second video.

Full JSON example of generating a Silent (Idle) video:

    {
        "source_url": "",
        "driver_url": "bank://lively/driver-06",
        "script": {
            "type": "text",
            "ssml": true,
            "input": "<break time=\"5000ms\"/><break time=\"5000ms\"/><break time=\"5000ms\"/>",
            "provider": {
                "type": "microsoft",
                "voice_id": "en-US-JennyNeural"
            }
        },
        "config": {
            "fluent": true
        }
    }

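The same request body can be generated for other idle lengths by repeating the break element. The sketch below is one way to do that; the function name `build_idle_payload` and its parameters are hypothetical, the image URL is a placeholder, and the field values are copied from the JSON example above.

```python
import json


def build_idle_payload(source_url: str, total_seconds: int = 15,
                       break_seconds: int = 5) -> dict:
    """Build a silent-video request body made only of SSML breaks."""
    if total_seconds <= 0 or break_seconds <= 0:
        raise ValueError("durations must be positive")
    if total_seconds % break_seconds != 0:
        raise ValueError("total_seconds must be a multiple of break_seconds")
    count = total_seconds // break_seconds
    pause = f'<break time="{break_seconds * 1000}ms"/>'
    return {
        "source_url": source_url,
        "driver_url": "bank://lively/driver-06",
        "script": {
            "type": "text",
            "ssml": True,
            "input": pause * count,  # e.g. three 5-second breaks for 15 seconds
            "provider": {"type": "microsoft", "voice_id": "en-US-JennyNeural"},
        },
        "config": {"fluent": True},
    }


if __name__ == "__main__":
    # Placeholder image URL (assumption); replace with your own source image.
    print(json.dumps(build_idle_payload("https://example.com/portrait.jpg"), indent=4))
```
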
Best practices to follow when generating a silent video:

"fluent": true
The "fluent": true parameter in the config object makes the video start and end on the same frame. The video can then be looped with no noticeable transition or jump cut, as in the example above.

"driver_url": "bank://lively/driver-06"
The driver_url parameter of the talks request should be set specifically to "bank://lively/driver-06". This driver gives the best results for silent video generation.



  1. Fully silent video generation is available only with the Microsoft TTS provider.
  2. If a Silent (Idle) video is used with the Streams API, set the fluent and driver_url parameters for both the pre-generated idle video and the streamed talking videos, so that there are no jump cuts and the same driver is used throughout.

✴️ Support

Have any questions? We are here to help! Please leave your question in the Discussions section and we will be happy to answer shortly.
