DSP Memory
DSP Memory Architecture
The AmebaLite HiFi5 DSP memory is divided into on-chip and off-chip regions.
DSP On-Chip Memory (exclusive to the DSP):
- ICache
- DCache
- DTCM

DSP Off-Chip Memory (shared with the AmebaLite KM4 and KR4):
- SRAM
- PSRAM
These memories differ significantly in access speed and capacity.

In terms of access performance:

- DTCM and DCache offer the best real-time characteristics: they run at the same frequency as the DSP and achieve single-cycle data access.
- SRAM offers the next-best performance, with a 240MHz clock and a 64-bit bus.
- PSRAM has the lowest actual bandwidth despite its nominal 250MHz frequency, because of its narrow data path: an 8-bit DDR interface, i.e. an effective width of 16 bits per clock.
Capacity scales inversely with speed:

- PSRAM provides the largest expandable space, up to 16MB (the exact capacity depends on the chip model).
- The total SRAM capacity is 512KB, of which 447KB is available to the DSP after deducting the 65KB used by the KM4 and KR4 processors.
- DTCM and DCache serve as dedicated high-speed storage: the smallest capacity, but the lowest latency.
DSP Memory Access Speed
Source | Target | Memory Access Speed (MB/s) | Transfer Method
---|---|---|---
SRAM | DTCM | 1899 | iDMA
SRAM | DCache | 1791 | memcpy
PSRAM | DTCM | 430 | iDMA
PSRAM | DCache | 425 | memcpy
Note
Test conditions: DSP 500MHz, SRAM 240MHz, PSRAM 250MHz.
Memory access speed may vary under different test conditions. Data in the table represents measured peak values.
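The following sketch illustrates how such a figure can be measured: time a copy with the Xtensa HAL cycle counter and convert cycles to bandwidth. This is a minimal example, not the procedure used for the table; the buffer placement, the block size, and the psram_src pointer are illustrative assumptions, and moving the source/destination to other memory sections measures the other paths.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <xtensa/hal.h>   /* xthal_get_ccount() */

#define DSP_CLOCK_HZ 500000000u   /* DSP at 500MHz, as in the test conditions above */
#define TEST_SIZE    (64 * 1024)  /* illustrative block size */

/* Destination in DTCM, placed via a linker section attribute. */
static int8_t __attribute__((section(".dram0.data"), aligned(16))) dst_buf[TEST_SIZE];

extern const int8_t *psram_src;   /* assumed to point at a PSRAM buffer */

void measure_memcpy_bandwidth(void)
{
    uint32_t start = xthal_get_ccount();
    memcpy(dst_buf, psram_src, TEST_SIZE);
    uint32_t cycles = xthal_get_ccount() - start;

    /* MB/s = bytes * cycles-per-second / cycles / 1e6 */
    uint64_t mb_per_s = (uint64_t)TEST_SIZE * DSP_CLOCK_HZ / cycles / 1000000u;
    printf("memcpy: %u bytes in %u cycles, ~%u MB/s\n",
           (unsigned)TEST_SIZE, (unsigned)cycles, (unsigned)mb_per_s);
}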
DSP Memory Access Methods
Since memory access speed affects DSP computational performance and can even become a bottleneck, algorithms running on the DSP should prefer DTCM and SRAM over PSRAM whenever possible.

In practice, the data storage location can be chosen based on the size of the algorithm model. For example:

- If the model is smaller than 256KB, it can be preloaded entirely into DTCM at program startup and stay resident there (see the sketch after this list).
- For larger models, data can be transferred between PSRAM and DTCM block by block.
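For the small-model case, the following is a minimal sketch of a startup preload. The model blob (model_data), its size, and the buffer names are illustrative assumptions; the .dram0.data section attribute is the same one used in the iDMA example below.

#include <stdint.h>
#include <string.h>

#define MODEL_SIZE (96 * 1024)   /* assumed model size, small enough for DTCM */

/* Destination buffer placed in DTCM via a linker section attribute. */
static int8_t __attribute__((section(".dram0.data"), aligned(16)))
    model_dtcm[MODEL_SIZE];

/* Hypothetical model blob, linked into PSRAM by default. */
extern const int8_t model_data[MODEL_SIZE];

void preload_model(void)
{
    /* One-time copy at startup; the weights then stay resident in DTCM,
     * so inference never pays the PSRAM access penalty. */
    memcpy(model_dtcm, model_data, MODEL_SIZE);
}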
Two methods can actively transfer data from PSRAM to DTCM:

- memcpy
- iDMA

Compared to memcpy, the main advantage of iDMA is that it frees up CPU computing power: the DSP can execute other tasks while an iDMA transfer is in flight (a minimal blocking transfer is sketched after this list). In terms of raw PSRAM access speed, however, iDMA shows no significant advantage:

- iDMA is slightly faster when transferring large data blocks (64KB/128KB).
- memcpy is faster when transferring small data blocks (8KB/16KB/32KB).
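As a baseline for the double-buffering example below, here is a minimal sketch of a single blocking iDMA transfer. It uses the same iDMA library calls as the example in the next section; the buffer size and placement are illustrative assumptions.

#include <stddef.h>
#include <stdint.h>
#include <xtensa/idma.h>

#define BLOCKING_NUM_DESCS 1
IDMA_BUFFER_DEFINE(dmaBlockingBuffer, BLOCKING_NUM_DESCS, IDMA_1D_DESC);

/* Destination in DTCM; size is illustrative. */
static int8_t __attribute__((section(".dram0.data"), aligned(16))) dst_buf[4096];

void idma_blocking_copy(void *src, size_t size)
{
    idma_init(0, MAX_BLOCK_16, 16, TICK_CYCLES_1, 0, NULL);
    idma_init_loop(dmaBlockingBuffer, IDMA_1D_DESC, BLOCKING_NUM_DESCS, NULL, NULL);

    /* Schedule one descriptor, then spin until the transfer completes.
     * Functionally this matches memcpy, but the spin loop could be
     * replaced with useful work, which is what double-buffering does. */
    idma_copy_desc(dst_buf, src, size, 0);
    while (idma_buffer_status() > 0) {}
}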
iDMA Double-Buffering Data Transfer Example
For detailed iDMA usage, refer to Chapter 8, “The Integrated DMA Library API”, in the official Xtensa document sys_sw_rm.pdf.
This example demonstrates double-buffering: while one data block is being transferred from PSRAM to DTCM, the previously transferred block is processed, so transfer and computation overlap for acceleration.
Pseudo Code
#define ALIGN(x) __attribute__((aligned(x)))
#define DRAM0 __attribute__((section(".dram0.data")))
#define DRAM1 __attribute__((section(".dram1.data")))

int8_t ALIGN(16) DRAM0 dst_ping[USER_BUFFER_SIZE];
int8_t ALIGN(16) DRAM1 dst_pong[USER_BUFFER_SIZE];

#define NUM_DESCRIPTORS 2
IDMA_BUFFER_DEFINE(dmaBuffer, NUM_DESCRIPTORS, IDMA_1D_DESC);

void idma_pingpong_buffers_example(void) {
    idma_init(0, MAX_BLOCK_16, 16, TICK_CYCLES_1, 0, NULL);
    idma_init_loop(dmaBuffer, IDMA_1D_DESC, NUM_DESCRIPTORS, NULL, NULL);

    // prepare the 1st data block
    idma_copy_desc(dst_ping, src, size, 0);

    // wait for the 1st iDMA transfer to finish
    while (idma_buffer_status() > 0) {}

    // prepare the 2nd data block
    idma_copy_desc(dst_pong, src, size, 0);

    // process the 1st block while the 2nd is being transferred
    user_process_1(dst_ping, ...);

    // wait for the 2nd iDMA transfer to finish
    while (idma_buffer_status() > 0) {}

    // prepare the 3rd data block
    idma_copy_desc(dst_ping, src, size, 0);

    // process the 2nd block while the 3rd is being transferred
    user_process_2(dst_pong, ...);

    // wait for the 3rd iDMA transfer to finish
    while (idma_buffer_status() > 0) {}

    // prepare the 4th data block
    idma_copy_desc(dst_pong, src, size, 0);

    // process the 3rd block while the 4th is being transferred
    user_process_3(dst_ping, ...);

    // wait for the 4th iDMA transfer to finish
    while (idma_buffer_status() > 0) {}

    // prepare the 5th data block
    idma_copy_desc(dst_ping, src, size, 0);

    // process the 4th block while the 5th is being transferred
    user_process_4(dst_pong, ...);

    // ... and so on
}
The code consists of the following parts:
Define iDMA buffers
#define NUM_DESCRIPTORS 2
IDMA_BUFFER_DEFINE(dmaBuffer, NUM_DESCRIPTORS, IDMA_1D_DESC);
Initialize iDMA
idma_init(0, MAX_BLOCK_16, 16, TICK_CYCLES_1, 0, NULL);
idma_init_loop(dmaBuffer, IDMA_1D_DESC, NUM_DESCRIPTORS, NULL, NULL);
Define two data buffers located in DTCM. Placing the ping and pong buffers in the two separate DTCM banks (.dram0.data and .dram1.data) lets the iDMA engine write one bank while the DSP reads the other.
#define ALIGN(x) __attribute__((aligned(x)))
#define DRAM0 __attribute__((section(".dram0.data")))
#define DRAM1 __attribute__((section(".dram1.data")))
int8_t ALIGN(16) DRAM0 dst_ping[USER_BUFFER_SIZE];
int8_t ALIGN(16) DRAM1 dst_pong[USER_BUFFER_SIZE];
For each block, the same three steps repeat: schedule a transfer by updating a descriptor, wait for an earlier transfer to complete, and process the received data:

idma_copy_desc(dst_X, src, size, 0);
while (idma_buffer_status() > 0) {}
user_process_N(dst_X, ...);
In the pseudo code, the two buffers alternate between consecutive blocks:

- Odd-numbered transfers and computations (1st, 3rd, 5th, …) use the ping buffer.
- Even-numbered transfers and computations (2nd, 4th, 6th, …) use the pong buffer.

As a result, while the Nth data block is being transferred, the (N-1)th block is being processed, so data transfer and computation proceed in parallel. The same pattern can be written compactly as a loop, as sketched below.
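The following is a minimal sketch of that loop under the same setup as the example above (iDMA already initialized, dst_ping/dst_pong defined as before); num_blocks, src, and user_process() are placeholders for the application's own data and computation.

void idma_pingpong_loop(const int8_t *src, int num_blocks)
{
    int8_t *bufs[2] = { dst_ping, dst_pong };

    /* Prime the pipeline: start transferring block 0. */
    idma_copy_desc(bufs[0], (void *)src, USER_BUFFER_SIZE, 0);

    for (int n = 0; n < num_blocks; n++) {
        /* Wait for block n to land in DTCM. */
        while (idma_buffer_status() > 0) {}

        /* Kick off block n+1 into the other buffer ... */
        if (n + 1 < num_blocks) {
            idma_copy_desc(bufs[(n + 1) % 2],
                           (void *)(src + (size_t)(n + 1) * USER_BUFFER_SIZE),
                           USER_BUFFER_SIZE, 0);
        }

        /* ... and process block n while that transfer runs. */
        user_process(bufs[n % 2], USER_BUFFER_SIZE);
    }
}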
Note
When using iDMA, a small amount of iDMA descriptor data (on the order of a few hundred bytes) must reside in DTCM. The DTCM space actually available to the user is therefore slightly less than 256KB.