KWS (Keyword Spotting)

Introduction

KWS is the module that detects specific wake-up words in audio. It is usually the first step in a voice interaction system: after detecting the keyword, the device enters a state in which it waits for voice commands.

AIVoice provides two KWS solutions: a fixed keyword solution and a user-defined keyword solution. The former can achieve optimal performance on low-resource devices, while the latter allows flexible customization of keywords.

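Solution              Training data      Available keywords                                      Feature
--------------------  -----------------  ------------------------------------------------------  ---------------------------------
Fixed keyword         Specific keywords  Keywords same as training data                          Better performance, smaller model
User-defined keyword  Common data        Flexible keyword of the same language as training data  More flexible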

Currently, the SDK provides a fixed keyword model library and a user-defined keyword model.

Fixed Keyword Model

  • Supports the Chinese keyword xiao-qiang-xiao-qiang or ni-hao-xiao-qiang.

  • Other keywords or performance optimizations can be provided through customized services.

User-defined Keyword Model

  • Language Support: Chinese only

  • Number of Keywords: Up to 5 keywords are supported simultaneously.

  • Word Length: Each keyword must contain 3 to 6 Chinese characters; words outside this range are invalid.

  • Keyword Selection Guidelines

    • Avoid characters with zero initials (e.g., yīn).

    • Avoid common daily phrases (e.g., put on clothes, eat breakfast).

    • Ensure high phonetic distinction between adjacent syllables.
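The count and length limits above can be checked before a keyword list is handed to the model. The following is a minimal sketch, not SDK code: it assumes keywords are written as hyphen-separated lowercase pinyin with one syllable per Chinese character (as in xiao-qiang-xiao-qiang), and every name in it (count_syllables, keyword_length_ok, keyword_set_ok) is hypothetical. It does not attempt to check the phonetic guidelines.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_KEYWORDS  5 /* up to 5 keywords simultaneously */
#define MIN_SYLLABLES 3 /* each keyword: 3 to 6 Chinese characters */
#define MAX_SYLLABLES 6

/* Count hyphen-separated syllables in a pinyin keyword.
 * Returns 0 for malformed input: NULL, an empty string, an empty syllable
 * (leading, trailing, or doubled '-'), or any character other than a-z/'-'. */
static size_t count_syllables(const char *keyword)
{
    size_t count = 0;
    size_t letters = 0; /* letters seen in the current syllable */

    if (keyword == NULL || *keyword == '\0') {
        return 0;
    }
    for (const char *p = keyword; ; ++p) {
        if (*p == '-' || *p == '\0') {
            if (letters == 0) {
                return 0; /* empty syllable */
            }
            ++count;
            letters = 0;
            if (*p == '\0') {
                break;
            }
        } else if (*p >= 'a' && *p <= 'z') {
            ++letters;
        } else {
            return 0; /* unexpected character */
        }
    }
    return count;
}

/* True if the keyword has 3 to 6 syllables (one per Chinese character). */
static bool keyword_length_ok(const char *keyword)
{
    size_t n = count_syllables(keyword);
    return n >= MIN_SYLLABLES && n <= MAX_SYLLABLES;
}

/* True if the list holds 1 to 5 keywords and each one has a valid length. */
static bool keyword_set_ok(const char *const *keywords, size_t n)
{
    if (keywords == NULL || n == 0 || n > MAX_KEYWORDS) {
        return false;
    }
    for (size_t i = 0; i < n; ++i) {
        if (!keyword_length_ok(keywords[i])) {
            return false;
        }
    }
    return true;
}
```

With this sketch, ni-hao-xiao-qiang (4 syllables) passes, while xiao-qiang (2 syllables) is rejected as too short.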

KWS Mode

Two KWS modes are provided for different use cases. Single-channel mode takes single-channel audio as input, while multi-channel mode takes multi-channel audio as input. Compared to single-channel mode, multi-channel mode improves KWS and ASR accuracy, but it also increases computational resource consumption and memory usage.

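KWS mode             Function                                    Description
-------------------  ------------------------------------------  -----------------------------------------------------------
Single-channel mode  void rtk_aivoice_set_single_kws_mode(void)  Less computation resource consumption and less memory usage
Multi-channel mode   void rtk_aivoice_set_multi_kws_mode(void)   Better KWS and ASR accuracy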

Attention

The KWS mode MUST be set before creating an instance in these flows:

  • aivoice_iface_full_flow_v1

  • aivoice_iface_afe_kws_v1

  • aivoice_iface_afe_kws_vad_v1
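The ordering rule above can be enforced on the application side with a small guard. This is only a sketch under stated assumptions: the two rtk_aivoice_set_*_kws_mode functions are defined here as stand-ins so the example builds without the SDK, and app_select_kws_mode, app_create_instance, and the two global flags are hypothetical names that do not exist in the SDK.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the two SDK functions named in this section, so that the
 * sketch builds without the SDK. In a real project, remove these definitions
 * and use the declarations from the AIVoice headers instead. */
static int g_multi_channel = 0; /* records which stand-in ran last */
void rtk_aivoice_set_single_kws_mode(void) { g_multi_channel = 0; }
void rtk_aivoice_set_multi_kws_mode(void)  { g_multi_channel = 1; }

/* Hypothetical application-side flag; not part of the SDK. */
static bool g_instance_created = false;

/* Select the KWS mode. Refuses (returns false) once an instance exists,
 * because the mode must be set before the instance is created. */
bool app_select_kws_mode(bool multi_channel)
{
    if (g_instance_created) {
        fprintf(stderr, "KWS mode must be set before the instance is created\n");
        return false;
    }
    if (multi_channel) {
        rtk_aivoice_set_multi_kws_mode();
    } else {
        rtk_aivoice_set_single_kws_mode();
    }
    return true;
}

/* Hypothetical placeholder for the step that creates an instance of one of
 * the flows listed above; the actual SDK call is intentionally omitted. */
void app_create_instance(void)
{
    /* ... create the flow instance through the SDK here ... */
    g_instance_created = true;
}
```

Calling app_select_kws_mode(true) and then app_create_instance() follows the required order; a later call to app_select_kws_mode is rejected and leaves the mode unchanged.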

Algorithm Flow

  • Single-channel Mode

../../../rst_ai/aivoice/aivoice_kws/figures/kws_flow_single_channel.svg

  • Multi-channel Mode

../../../rst_ai/aivoice/aivoice_kws/figures/kws_flow_multi_channel.svg

Configurations

KWS configurable parameters:

keywords:

Keywords used for wake-up. The available keywords depend on the KWS model. With a fixed keyword model, keywords can only be chosen from the trained words. With a user-defined keyword model, keywords can be any combination of units of the same language (such as pinyin for Chinese). Example: xiao-qiang-xiao-qiang.

thresholds:

Wake-up threshold, in the range [0, 1]. A higher threshold gives fewer false alarms but makes the device harder to wake up. Set it to 0 to use the sensitivity parameter with its predefined thresholds.

sensitivity:

Three sensitivity levels are provided, each with predefined thresholds. A higher sensitivity makes the device easier to wake up but also causes more false alarms. This parameter ONLY takes effect when thresholds is set to 0.

Refer to ${aivoice_lib_dir}/include/aivoice_kws_config.h for details.
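The interaction between thresholds and sensitivity can be summarized as a small resolution rule: an explicit non-zero threshold wins, and 0 falls back to the preset for the chosen sensitivity level. The sketch below illustrates that rule only. The enum, the function name, and especially the preset values are assumptions made for illustration; the real presets and field layout are defined in aivoice_kws_config.h and are not reproduced here.

```c
/* Hypothetical sensitivity levels; the SDK's own names may differ. */
enum app_kws_sensitivity {
    APP_KWS_SENSITIVITY_LOW = 0,
    APP_KWS_SENSITIVITY_MEDIUM = 1,
    APP_KWS_SENSITIVITY_HIGH = 2
};

/* Placeholder presets, NOT the SDK's real values. They only preserve the
 * documented direction: higher sensitivity means a lower threshold. */
static const float k_placeholder_presets[3] = { 0.70f, 0.50f, 0.30f };

/* Resolve the threshold that would apply to one keyword.
 * - threshold outside [0, 1] or an unknown sensitivity: returns -1.0f
 * - threshold == 0: returns the preset for the given sensitivity
 * - otherwise: returns the explicit threshold (sensitivity is ignored) */
float app_effective_threshold(float threshold, enum app_kws_sensitivity sensitivity)
{
    if (threshold < 0.0f || threshold > 1.0f) {
        return -1.0f;
    }
    if (threshold == 0.0f) {
        if ((int)sensitivity < APP_KWS_SENSITIVITY_LOW ||
            (int)sensitivity > APP_KWS_SENSITIVITY_HIGH) {
            return -1.0f;
        }
        return k_placeholder_presets[sensitivity];
    }
    return threshold;
}
```

In this sketch, passing a threshold of 0.42 returns 0.42 regardless of the sensitivity argument, which mirrors the rule that sensitivity only takes effect when the threshold is 0.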

Threshold Adjustment Suggestions

  • As the threshold increases, both the wake-up rate and the number of false wake-ups decrease (i.e., sensitivity shifts from high to low). Users should select a threshold that suits their actual needs.

  • For the fixed keyword model, three sensitivity levels are provided: High, Medium, and Low, corresponding to approximately one false trigger per 12 h, 24 h, and 48 h, respectively. For finer adjustment, users can configure the thresholds parameter to suit their usage scenario, with a step size of 0.02.

  • For the user-defined keyword model, thresholds are typically lower than for the fixed keyword model, with a suggested adjustment step size of 0.005.

../../../rst_ai/aivoice/aivoice_kws/figures/kws_roc.svg
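The suggested step sizes can be applied with a small tuning helper when iterating on field test results. This is a minimal sketch with hypothetical names (app_kws_model, app_step_threshold); it assumes the tuned value should never reach 0, because a threshold of 0 switches to the sensitivity presets instead of acting as a very permissive threshold.

```c
/* Hypothetical model kinds, used only to pick the suggested step size. */
enum app_kws_model {
    APP_KWS_MODEL_FIXED,        /* suggested step: 0.02  */
    APP_KWS_MODEL_USER_DEFINED  /* suggested step: 0.005 */
};

/* Move a threshold by a whole number of suggested steps.
 * steps > 0 raises the threshold (fewer false wake-ups, harder to wake up);
 * steps < 0 lowers it (easier to wake up, more false wake-ups).
 * The result is clamped to [step, 1] so that it never becomes 0. */
float app_step_threshold(float current, enum app_kws_model model, int steps)
{
    const float step = (model == APP_KWS_MODEL_FIXED) ? 0.02f : 0.005f;
    float next = current + (float)steps * step;

    if (next > 1.0f) {
        next = 1.0f;
    }
    if (next < step) {
        next = step;
    }
    return next;
}
```

For example, if a fixed keyword model at 0.50 produces too many false wake-ups, app_step_threshold(0.50f, APP_KWS_MODEL_FIXED, 1) yields roughly 0.52 as the next value to try.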