Guidance on Streaming Full-Body

Hi D-ID Support team,

I’m building a service where all conversational logic runs on my own backend; I only need D-ID for the visual layer:

Avatar type: full-body Clips Premium presenter

Mode: real-time streaming (the backend pushes each user message as it arrives)

Idle state: when the avatar is not speaking, I overlay a silent “idle” loop so the scene never freezes

The missing piece is a reliable signal for when the streaming avatar actually starts and finishes speaking, so I can fade the idle clip out/in at the right time.

I’ve seen three possible approaches:

WebSocket events (speech_start, speech_end) – some docs/forums reference them, but I can’t establish a wss://…/clips/streams/{id}/events connection for Clips Premium.

WebRTC track events (track.onunmute, track.onended) – they never fire in my case, because the track arrives already unmuted and stays open for the whole session.

Polling RTCPeerConnection.getStats() – feasible, but it’s heuristics-based (audioLevel / framesDecoded) and less precise.
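For option 3, this is the kind of hysteresis I would feed with the audioLevel value from periodic getStats() samples — a minimal sketch; the threshold and window sizes are guesses on my side, not official values:

```typescript
// Hysteresis-based speech detector driven by periodic getStats() samples.
// All numeric defaults here are assumptions to be tuned per stream.
class SpeechDetector {
  private speaking = false;
  private above = 0;
  private below = 0;

  constructor(
    private readonly threshold = 0.01, // audioLevel floor (guess)
    private readonly startFrames = 3,  // consecutive loud samples to declare "start"
    private readonly stopFrames = 10,  // consecutive quiet samples to declare "stop"
  ) {}

  // Call once per polling tick with the inbound-rtp audioLevel (0..1).
  // Returns "start", "stop", or null when the speaking state is unchanged.
  sample(audioLevel: number): "start" | "stop" | null {
    if (audioLevel >= this.threshold) {
      this.above += 1;
      this.below = 0;
      if (!this.speaking && this.above >= this.startFrames) {
        this.speaking = true;
        return "start"; // fade the idle loop out here
      }
    } else {
      this.below += 1;
      this.above = 0;
      if (this.speaking && this.below >= this.stopFrames) {
        this.speaking = false;
        return "stop"; // fade the idle loop back in
      }
    }
    return null;
  }
}
```

In the browser I would drive it from a setInterval that calls pc.getStats(), scans the report for an "inbound-rtp" entry of kind "audio", and passes its audioLevel into sample() — which is exactly why I call this approach heuristic and would prefer a server-side event if one exists.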

Could you confirm:

Which method is officially supported for Clips Premium streams today?

If WebSocket events are unavailable for this tier, is polling WebRTC stats the recommended fallback?

Any best-practice tips for generating a seamless idle clip (e.g., specific fluent/pad_audio settings) so transitions look smooth?
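For context on the last question, this is the kind of request body I am experimenting with for the idle clip. I am assuming the fluent and pad_audio config fields apply to Clips Premium the same way they are documented elsewhere in the API, and the audio URL is a placeholder of mine — please correct me if either assumption is wrong:

```typescript
// Hypothetical idle-clip request body. Whether Clips Premium streams honor
// these config fields is an assumption on my side, not a documented fact.
const idleClipRequest = {
  script: {
    type: "audio",
    // Placeholder: a silent track, so the presenter idles without speaking.
    audio_url: "https://example.com/silence-10s.mp3",
  },
  config: {
    fluent: true,   // assumed: smoother blending between segments
    pad_audio: 0.0, // assumed: no trailing padding, keeping the loop point tight
  },
};
```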

Thanks a lot for your guidance.
Best regards,

David.