VAD (Voice Activity Detection)
Introduction
VAD is the module that detects the presence of human speech in audio.
AIVoice provides a neural-network-based VAD that can be used in speech enhancement, ASR systems, etc.
Configurations
VAD configurable parameters
- sensitivity:
Three levels of sensitivity are provided, each with a predefined threshold. A higher level makes speech easier to detect but also produces more false alarms.
- left_margin:
Time margin added to the start of a speech segment, which makes the start offset earlier than the raw prediction. It only affects offset_ms of the VAD output; it does not affect the event trigger time of status 1.
- right_margin:
Time margin added to the end of a speech segment, which makes the end offset later than the raw prediction. It affects both offset_ms of the VAD output and the event time of status 0.
Refer to ${aivoice_lib_dir}/include/aivoice_vad_config.h for details.
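The effect of the two margins on segment boundaries and event times can be sketched as follows. This is a simplified model rather than the AIVoice API: the struct, the function names, the absolute millisecond timeline, the clamp at zero, and the choice to model the status 1 event time as the raw start time are all assumptions made for illustration. How a boundary maps onto the real offset_ms field is defined in aivoice_vad_config.h and the related headers.

```c
#include <assert.h>

/* Hypothetical model of one VAD event on an absolute stream timeline (ms).
 * These names are illustrative only; they are not the AIVoice API. */
typedef struct {
    int status;        /* 1 = speech started, 0 = speech ended */
    long boundary_ms;  /* segment boundary after the margin is applied */
    long event_ms;     /* time at which the event is raised */
} vad_event_model;

/* Start of speech: left_margin moves the boundary earlier,
 * but the event time is left unchanged. */
static vad_event_model model_speech_start(long raw_start_ms, long left_margin_ms)
{
    assert(raw_start_ms >= 0 && left_margin_ms >= 0);
    vad_event_model e;
    e.status = 1;
    /* Assumption: the boundary is clamped to the beginning of the stream. */
    e.boundary_ms = raw_start_ms > left_margin_ms
                        ? raw_start_ms - left_margin_ms
                        : 0;
    /* Assumption: the event fires at the raw start time (no model delay). */
    e.event_ms = raw_start_ms;
    return e;
}

/* End of speech: right_margin moves both the boundary and the event later. */
static vad_event_model model_speech_end(long raw_end_ms, long right_margin_ms)
{
    assert(raw_end_ms >= 0 && right_margin_ms >= 0);
    vad_event_model e;
    e.status = 0;
    e.boundary_ms = raw_end_ms + right_margin_ms;
    e.event_ms = raw_end_ms + right_margin_ms;
    return e;
}
```

In this model, a raw start at 1000 ms with a left_margin of 300 ms gives a boundary at 700 ms while the event is still raised at 1000 ms. A raw end at 2000 ms with a right_margin of 400 ms moves both the boundary and the event to 2400 ms.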
Note
left_margin only affects the offset_ms returned by VAD; it does not affect the VAD event trigger time. If you need the audio that falls within left_margin, implement a buffer to retain it.
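One possible way to keep that audio is a fixed-size ring buffer that always holds the most recent samples. The sketch below is not part of AIVoice. It assumes 16 kHz, 16-bit mono PCM, and all names (preroll_buffer, preroll_push, preroll_copy_last, PREROLL_MAX_MS) are hypothetical. PREROLL_MAX_MS should be at least as large as left_margin, plus any detection delay in your pipeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREROLL_SAMPLE_RATE_HZ 16000  /* assumption: 16 kHz mono PCM */
#define PREROLL_MAX_MS         500    /* assumption: must cover left_margin */
#define PREROLL_CAPACITY ((PREROLL_SAMPLE_RATE_HZ / 1000) * PREROLL_MAX_MS)

/* Hypothetical pre-roll buffer holding the most recent audio samples. */
typedef struct {
    int16_t data[PREROLL_CAPACITY];
    size_t head;   /* index where the next sample will be written */
    size_t count;  /* number of valid samples, at most PREROLL_CAPACITY */
} preroll_buffer;

static void preroll_init(preroll_buffer *rb)
{
    assert(rb != NULL);
    rb->head = 0;
    rb->count = 0;
}

/* Append samples, overwriting the oldest ones once the buffer is full. */
static void preroll_push(preroll_buffer *rb, const int16_t *samples, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        rb->data[rb->head] = samples[i];
        rb->head = (rb->head + 1) % PREROLL_CAPACITY;
        if (rb->count < PREROLL_CAPACITY) {
            rb->count++;
        }
    }
}

/* Copy the most recent `ms` milliseconds of audio into `out`, oldest sample
 * first. Returns the number of samples copied, which may be fewer than
 * requested if less audio is buffered or `out` is too small. */
static size_t preroll_copy_last(const preroll_buffer *rb, unsigned ms,
                                int16_t *out, size_t out_cap)
{
    size_t want = (size_t)ms * (PREROLL_SAMPLE_RATE_HZ / 1000);
    if (want > rb->count) {
        want = rb->count;
    }
    if (want > out_cap) {
        want = out_cap;
    }
    size_t start = (rb->head + PREROLL_CAPACITY - want) % PREROLL_CAPACITY;
    for (size_t i = 0; i < want; i++) {
        out[i] = rb->data[(start + i) % PREROLL_CAPACITY];
    }
    return want;
}
```

A typical use would be to call preroll_push for every audio frame fed to VAD and, when the status 1 event arrives, call preroll_copy_last with the left_margin value to recover the audio that precedes the trigger.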
Suggestions for adjusting parameters
Suggestion for adjusting left_margin
The larger left_margin is, the further the VAD segment extends to the left and the more audio near the start of the speech it contains, which reduces the chance that the beginning of the speech is clipped. However, a large left_margin is also more likely to include noise (background noise or irrelevant speech) and requires a larger audio buffer to be reserved.
Case 1: A moderate increase in left_margin reduces clipping at the start of the speech
Case 2: An excessive increase in left_margin may introduce irrelevant speech
Suggestion for adjusting right_margin
The larger right_margin is, the further the VAD segment extends to the right and the more audio near the end of the speech it contains, which reduces the chance that the end of the speech is clipped. However, a right_margin that is too large can easily include noise (background noise or irrelevant speech) and increases latency.
Case 1: A moderate increase in right_margin reduces clipping at the tail of the speech
Case 2: An excessive increase in right_margin may introduce irrelevant noise
Case 3: In long-sentence scenarios, increasing right_margin reduces the chance that a long sentence is split apart at pauses
In general, left_margin and right_margin should not be too large; adjust them so that they just cover most speech segments. For long-sentence dialogue scenarios, increase right_margin so that the algorithm does not end the segment prematurely when the user pauses in mid-sentence. Because a larger right_margin also increases latency, tune it against the actual use case.
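The trade-off between bridging pauses and latency can be illustrated with a simple hangover counter, a common technique in VAD post-processing. This is a sketch and not necessarily how AIVoice implements right_margin. The function name and the per-frame representation of raw decisions are assumptions, with hangover_frames standing in for right_margin expressed in frames.

```c
#include <assert.h>
#include <stddef.h>

/* Count the segments produced from per-frame raw decisions
 * (1 = speech, 0 = silence) when a segment is closed only after more than
 * `hangover_frames` consecutive silent frames. Hypothetical helper for
 * illustration; not part of the AIVoice API. */
static size_t count_segments(const int *raw, size_t n, size_t hangover_frames)
{
    assert(raw != NULL || n == 0);
    size_t segments = 0;
    size_t silence_run = 0;
    int in_segment = 0;

    for (size_t i = 0; i < n; i++) {
        if (raw[i]) {
            if (!in_segment) {
                in_segment = 1;  /* a new segment starts here */
                segments++;
            }
            silence_run = 0;     /* speech resets the hangover counter */
        } else if (in_segment) {
            silence_run++;
            if (silence_run > hangover_frames) {
                in_segment = 0;  /* pause outlasted the hangover: close */
            }
        }
    }
    return segments;
}
```

With a three-frame pause in the middle of a sentence, a hangover of 2 frames yields two segments, whereas a hangover of 3 frames bridges the pause and yields one. The cost is that the end of a segment is only declared hangover_frames + 1 frames after the last speech frame, which is the added latency described above.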