
Creating Interactive Digital Avatars: Overcoming Playback Delays and Video Conflicts

I'm developing an AI-based project that creates digital avatars for users. Other people can then access these avatars and have conversations with them. The workflow of the project is as follows:

1. A user inputs text (or voice, which is transcribed to text).
2. The text is sent to a fine-tuned OpenAI model, which returns a text response.
3. The response is segmented into multiple audio clips.
4. The audio clips are converted into video using the D-ID Streams API.
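For step 3, here is a minimal sketch of how a model response could be split into sentence-sized segments before text-to-speech conversion. The function name and the `maxLen` parameter are my own illustrative choices, not part of the demo or the D-ID API:

```javascript
// Illustrative sketch: split a text response into short segments,
// each suitable for conversion into a single audio clip.
function segmentText(text, maxLen = 120) {
  // Break on sentence-ending punctuation, keeping the punctuation.
  const sentences = text.match(/[^.!?]+[.!?]*/g) || [];
  const segments = [];
  let current = "";
  for (const s of sentences) {
    const piece = s.trim();
    if (!piece) continue;
    // Start a new segment if appending would exceed the target length.
    if (current && current.length + piece.length + 1 > maxLen) {
      segments.push(current);
      current = piece;
    } else {
      current = current ? current + " " + piece : piece;
    }
  }
  if (current) segments.push(current);
  return segments;
}
```

Keeping segments short is what makes each resulting clip only a few seconds long, so the first clip can start playing while later ones are still being generated.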
I'm currently facing a couple of issues:

1. I'm using the live-streaming demo at https://github.com/de-id/live-streaming-demo, but I'm experiencing significant delays in video playback, even though each audio clip is only about 200 KB and roughly 3 seconds long.
2. When I submit multiple audio clips to the D-ID API simultaneously, the returned videos conflict with each other during playback.
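One common cause of the second issue is sending clips concurrently rather than serializing them. As a sketch (not the demo's actual code, and the `submitFn` callback is a hypothetical stand-in for whatever sends one clip to the stream), a simple promise chain guarantees each clip is submitted only after the previous one has finished:

```javascript
// Illustrative sketch: serialize clip submissions so each one
// completes before the next begins, avoiding overlapping playback.
class ClipQueue {
  constructor() {
    // The tail of the promise chain; starts resolved.
    this.tail = Promise.resolve();
  }

  // submitFn: async function that sends one clip and resolves
  // when that clip's stream segment has completed.
  enqueue(submitFn) {
    const next = this.tail.then(() => submitFn());
    // Keep the chain alive even if one submission fails.
    this.tail = next.catch(() => {});
    return next;
  }
}
```

With a queue like this, each segmented audio clip from the workflow above would be enqueued as it becomes ready, and the avatar plays them back in order instead of conflicting.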

For comparison, the demo at chat.d-id.com was very smooth.