VAD (Voice Activity Detection)

Introduction

VAD is a module that detects the presence of human speech in audio.

AIVoice provides a neural-network-based VAD that can be used in speech enhancement, ASR systems, etc.

Configurations

VAD configurable parameters

sensitivity:: Three sensitivity levels are provided, each with a predefined threshold. A higher level makes speech easier to detect but also produces more false alarms.
left_margin:: Time margin added to the start of a speech segment, which makes the start offset earlier than the raw prediction. It only affects offset_ms in the VAD output; it does not affect the trigger time of the status 1 event.
right_margin:: Time margin added to the end of a speech segment, which makes the end offset later than the raw prediction. It affects both offset_ms in the VAD output and the time of the status 0 event.

../../../rst_ai/aivoice/aivoice_vad/figures/vad_lm_rm_introduction.svg
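The effect of the two margins can be sketched as a small timing model. This is only an illustration of the behavior described above, not the AIVoice implementation: the types `raw_segment_t` and `reported_segment_t` and the function `apply_margins` are hypothetical names, the detector's own latency is ignored, and clamping the start offset at 0 is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not part of the AIVoice API. */
typedef struct {
    int64_t start_ms;  /* raw predicted start of speech */
    int64_t end_ms;    /* raw predicted end of speech */
} raw_segment_t;

typedef struct {
    int64_t start_offset_ms;  /* reported start offset */
    int64_t end_offset_ms;    /* reported end offset */
    int64_t start_event_ms;   /* time at which status 1 would be raised */
    int64_t end_event_ms;     /* time at which status 0 would be raised */
} reported_segment_t;

static reported_segment_t apply_margins(raw_segment_t raw,
                                        int64_t left_margin_ms,
                                        int64_t right_margin_ms)
{
    assert(left_margin_ms >= 0 && right_margin_ms >= 0);

    reported_segment_t out;

    /* left_margin moves only the reported start offset earlier. */
    out.start_offset_ms = raw.start_ms - left_margin_ms;
    if (out.start_offset_ms < 0) {
        out.start_offset_ms = 0;  /* assumption: clamp to the stream start */
    }
    /* The status 1 event time is not moved by left_margin. */
    out.start_event_ms = raw.start_ms;

    /* right_margin moves both the reported end offset and the status 0 event later. */
    out.end_offset_ms = raw.end_ms + right_margin_ms;
    out.end_event_ms  = raw.end_ms + right_margin_ms;

    return out;
}
```

In this model, a raw segment from 1000 ms to 2500 ms with left_margin = 300 and right_margin = 160 is reported as starting at 700 ms and ending at 2660 ms, while the status 1 event still fires at 1000 ms.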

Refer to ${aivoice_lib_dir}/include/aivoice_vad_config.h for details.

Note

left_margin only affects the offset_ms returned by VAD; it does not affect the VAD event trigger time. If you need the audio that falls within left_margin, implement a buffer to retain it.
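One way to keep that audio is a ring buffer that always holds the most recent samples. The following is a minimal sketch, not part of AIVoice: the names `audio_history_t`, `history_push`, and `history_read_latest` are hypothetical, and the capacity assumes 16-bit mono audio at 16 kHz with a left_margin of 300 ms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed capacity: 300 ms of 16 kHz mono audio = 4800 samples. */
#define HISTORY_CAPACITY 4800

/* Hypothetical ring buffer holding the most recent audio samples. */
typedef struct {
    int16_t samples[HISTORY_CAPACITY];
    size_t  write_pos;  /* index of the next sample to be written */
    size_t  count;      /* number of valid samples, at most HISTORY_CAPACITY */
} audio_history_t;

static void history_init(audio_history_t *h)
{
    assert(h != NULL);
    h->write_pos = 0;
    h->count = 0;
}

/* Append n samples, overwriting the oldest ones once the buffer is full. */
static void history_push(audio_history_t *h, const int16_t *in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        h->samples[h->write_pos] = in[i];
        h->write_pos = (h->write_pos + 1) % HISTORY_CAPACITY;
        if (h->count < HISTORY_CAPACITY) {
            h->count++;
        }
    }
}

/* Copy the most recent n samples into out, oldest first.
 * Returns the number of samples actually copied. */
static size_t history_read_latest(const audio_history_t *h, int16_t *out, size_t n)
{
    if (n > h->count) {
        n = h->count;
    }
    size_t start = (h->write_pos + HISTORY_CAPACITY - n) % HISTORY_CAPACITY;
    for (size_t i = 0; i < n; i++) {
        out[i] = h->samples[(start + i) % HISTORY_CAPACITY];
    }
    return n;
}
```

Under these assumptions, every audio frame passed to VAD is also pushed into the buffer. When the status 1 event arrives, the application reads back the samples covering left_margin and places them in front of the audio it forwards.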

Suggestions for adjusting parameters

Suggestion for adjusting left_margin

The larger left_margin is, the further the VAD segment extends to the left and the more audio around the start of the speech it contains, which reduces the chance that the beginning of the speech is clipped. However, a large left_margin is also more likely to include noise (background noise or irrelevant speech), and it requires a larger cache to be reserved.
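The cache cost grows linearly with left_margin. The helper below is a rough sizing sketch; the function name `left_margin_cache_bytes` is hypothetical, and the audio format passed in (sample rate, channel count, sample width) is an assumption that must match the actual input stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold left_margin_ms of PCM audio in the given format.
 * The sample count is rounded up so the cache never falls short. */
static size_t left_margin_cache_bytes(uint32_t left_margin_ms,
                                      uint32_t sample_rate_hz,
                                      uint32_t channels,
                                      uint32_t bytes_per_sample)
{
    assert(sample_rate_hz > 0 && channels > 0 && bytes_per_sample > 0);

    uint64_t samples = ((uint64_t)left_margin_ms * sample_rate_hz + 999u) / 1000u;
    return (size_t)(samples * channels * bytes_per_sample);
}
```

For example, with 16 kHz, mono, 16-bit audio, a left_margin of 300 ms needs 4800 samples, or 9600 bytes.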

Case 1: Properly increase left_margin to reduce the clipping of the front part of the speech

../../../rst_ai/aivoice/aivoice_vad/figures/vad_lm_cond_1.svg

Case 2: Excessive increase in left_margin may introduce irrelevant speech

../../../rst_ai/aivoice/aivoice_vad/figures/vad_lm_cond_2.svg

Suggestion for adjusting right_margin

The larger right_margin is, the further the VAD segment extends to the right and the more audio around the end of the speech it contains, which reduces the chance that the end of the speech is clipped. However, a right_margin that is too large is more likely to include noise (background noise or irrelevant speech), and it increases latency.

Case 1: Properly increase right_margin to reduce clipping at the tail of the speech

../../../rst_ai/aivoice/aivoice_vad/figures/vad_rm_cond_1.svg

Case 2: Excessive increase in right_margin may introduce irrelevant noise

../../../rst_ai/aivoice/aivoice_vad/figures/vad_rm_cond_2.svg

Case 3: In long-sentence scenarios, increasing right_margin reduces the chance that a long sentence is split apart at pauses

../../../rst_ai/aivoice/aivoice_vad/figures/vad_rm_cond_3.svg
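Case 3 can be approximated offline with a simplified model: two raw speech segments are joined when the pause between them is no longer than right_margin. This is a sketch under that assumption, not the AIVoice algorithm; `segment_t` and `merge_across_pauses` are hypothetical names, and the reported end of each merged segment would still be extended by right_margin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical raw speech segment on the stream timeline. */
typedef struct {
    int64_t start_ms;
    int64_t end_ms;
} segment_t;

/* Merge neighboring segments in place when the pause between them is at most
 * right_margin_ms. The input must be sorted by start time and non-overlapping.
 * Returns the number of segments that remain. */
static size_t merge_across_pauses(segment_t *segs, size_t n, int64_t right_margin_ms)
{
    assert(right_margin_ms >= 0);
    if (n == 0) {
        return 0;
    }

    size_t last = 0;  /* index of the segment currently being extended */
    for (size_t i = 1; i < n; i++) {
        int64_t pause_ms = segs[i].start_ms - segs[last].end_ms;
        if (pause_ms <= right_margin_ms) {
            /* Pause is bridged by right_margin: extend the current segment. */
            segs[last].end_ms = segs[i].end_ms;
        } else {
            /* Pause is too long: start a new output segment. */
            last++;
            segs[last] = segs[i];
        }
    }
    return last + 1;
}
```

With raw segments at 0 to 1000 ms, 1300 to 2000 ms, and 3000 to 3500 ms, a right_margin of 200 ms leaves three segments, while 500 ms joins the first two because their 300 ms pause is bridged.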

In general, left_margin and right_margin should not be too large; set them just large enough to cover most speech segments. For long-sentence dialogue scenarios, increase right_margin so that the algorithm does not end the segment prematurely when the user pauses in the middle of speaking. Because a larger right_margin also increases latency, tune it against the latency your application can tolerate.
