DSP Applications

DSP Memory

DSP Memory Architecture

The AmebaLite HiFi5 DSP memory is divided into on-chip and off-chip regions.

  • DSP On-Chip Memory (DSP exclusive)

    • ICache

    • DCache

    • DTCM

  • DSP Off-Chip Memory (Shared with AmebaLite KM4, KR4)

    • SRAM

    • PSRAM

These memories exhibit significant differences in access speed and capacity.

In terms of access performance:

  • DTCM and DCache offer the best real-time characteristics, running at the same frequency as the DSP and achieving single-cycle data access.

  • SRAM comes second, running at 240MHz over a 64-bit bus.

  • PSRAM has the lowest actual bandwidth despite its nominal 250MHz frequency, because its 8-bit DDR interface transfers only 16 bits per clock cycle.

Capacity follows the opposite order:

  • PSRAM provides the largest expandable space of up to 16MB (specific capacity depends on chip model);

  • The total SRAM capacity is 512KB, of which 447KB is actually available to the DSP after deducting the 65KB used by KM4 and KR4 processors;

  • DTCM and DCache serve as dedicated high-speed storage with the smallest capacity but lowest latency.

DSP Memory Access Speed

DSP Memory Transfer Performance Comparison

Source   Target   Memory Access Speed (MB/s)   Transfer Method
SRAM     DTCM     1899                         iDMA
SRAM     DCache   1791                         memcpy
PSRAM    DTCM     430                          iDMA
PSRAM    DCache   425                          memcpy

Note

Test conditions: DSP 500MHz, SRAM 240MHz, PSRAM 250MHz.

Memory access speed may vary under different test conditions. Data in the table represents measured peak values.
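To put the measured values in context, the following minimal sketch computes a theoretical peak bandwidth from clock frequency and bus width and compares it with the table. It assumes 1 MB = 10^6 bytes and reads the PSRAM interface as 2 bytes per clock cycle; the function names are hypothetical and not part of any SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Theoretical peak bandwidth in MB/s (assuming 1 MB = 10^6 bytes):
 * clock frequency in MHz multiplied by the bytes moved per clock cycle. */
uint32_t peak_mb_per_s(uint32_t clock_mhz, uint32_t bytes_per_clock)
{
    assert(clock_mhz > 0 && bytes_per_clock > 0);
    return clock_mhz * bytes_per_clock;
}

/* Measured throughput as a percentage of the theoretical peak, rounded down. */
uint32_t efficiency_percent(uint32_t measured_mb_per_s, uint32_t peak)
{
    assert(peak > 0);
    return (measured_mb_per_s * 100u) / peak;
}
```

Under these assumptions, SRAM (240MHz, 8 bytes per clock) peaks at 1920 MB/s, so the measured 1899 MB/s is about 98% of the peak. PSRAM (250MHz, 2 bytes per clock) peaks at 500 MB/s, so the measured 430 MB/s is about 86%.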

DSP Memory Access Methods

Since memory access speed affects DSP computational performance and can even become a bottleneck, algorithms running on the DSP should prioritize using DTCM and SRAM over PSRAM whenever possible.
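Whether memory becomes the bottleneck for a given algorithm can be estimated by comparing the transfer time of a data block with the time needed to process it. The sketch below is a rough estimate only: it assumes 1 MB = 10^6 bytes, uses the measured peak speeds from the table above, and all function names are hypothetical.

```c
#include <stdint.h>

/* Time to move `bytes` at `mb_per_s`, in microseconds.
 * With 1 MB = 10^6 bytes (assumption), 1 MB/s equals 1 byte per microsecond. */
double transfer_time_us(uint32_t bytes, double mb_per_s)
{
    return (double)bytes / mb_per_s;
}

/* Number of DSP cycles that elapse during the transfer at the given core clock. */
double transfer_cycles(uint32_t bytes, double mb_per_s, double dsp_mhz)
{
    return transfer_time_us(bytes, mb_per_s) * dsp_mhz;
}

/* Returns 1 if moving the block takes longer than computing on it,
 * i.e. memory access is the bottleneck; returns 0 otherwise. */
int is_memory_bound(uint32_t bytes, double mb_per_s, double dsp_mhz,
                    double compute_cycles)
{
    return transfer_cycles(bytes, mb_per_s, dsp_mhz) > compute_cycles;
}
```

For example, a 64KB block read from PSRAM at 430 MB/s takes roughly 152 µs, which is about 76,000 cycles of a 500MHz DSP. An algorithm that spends fewer cycles than that per block would be waiting on memory.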

In practical applications, different data storage locations can be selected based on algorithm model size. For example:

  • If the model is smaller than 256KB, its data can be preloaded entirely into DTCM at program startup and remain resident there.

  • For large models, data can be transferred between PSRAM and DTCM.
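The size-based choice above can be sketched as a small helper. This is an illustration only: the enum, the function, and the `reserved_bytes` parameter (DTCM already occupied by other data) are hypothetical names, not part of the SDK.

```c
#include <stddef.h>

#define DTCM_TOTAL_BYTES (256u * 1024u)   /* DTCM size stated in this section */

typedef enum {
    PLACE_DTCM_RESIDENT,    /* preload once and keep the model in DTCM */
    PLACE_PSRAM_STREAMED    /* keep the model in PSRAM, move blocks into DTCM */
} model_placement_t;

/* Decide where model data should live.
 * reserved_bytes: DTCM already taken by other data (hypothetical parameter). */
model_placement_t choose_model_placement(size_t model_bytes, size_t reserved_bytes)
{
    size_t usable = (reserved_bytes < DTCM_TOTAL_BYTES)
                        ? DTCM_TOTAL_BYTES - reserved_bytes
                        : 0;
    return (model_bytes <= usable) ? PLACE_DTCM_RESIDENT : PLACE_PSRAM_STREAMED;
}
```

Passing a non-zero `reserved_bytes` accounts for buffers and other data that must also live in DTCM, so a model close to 256KB falls back to streaming from PSRAM.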

Two methods can actively transfer data from PSRAM to DTCM:

  • memcpy

  • iDMA

Compared to memcpy, the advantage of iDMA is that it frees up CPU computing power, because the DSP can execute other tasks while an iDMA transfer is in progress. In terms of PSRAM access speed, however, iDMA offers no significant advantage:

  • iDMA is slightly faster when transferring large data blocks (64KB/128KB).

  • memcpy is faster when transferring small data blocks (8KB/16KB/32KB).
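The trade-off above can be expressed as a simple selection rule. In this minimal sketch, the 64KB cut-over is an assumption taken from the block sizes listed above (behavior between 32KB and 64KB is not specified), and the type and function names are hypothetical.

```c
#include <stddef.h>

typedef enum { XFER_MEMCPY, XFER_IDMA } xfer_method_t;

/* Assumed cut-over point: blocks of 64KB and above favor iDMA. */
#define IDMA_MIN_BLOCK_BYTES (64u * 1024u)

/* Pick a transfer method for one PSRAM-to-DTCM block.
 * cpu_has_other_work: non-zero if the DSP can do useful work during the copy. */
xfer_method_t pick_transfer_method(size_t block_bytes, int cpu_has_other_work)
{
    if (cpu_has_other_work) {
        return XFER_IDMA;   /* overlap transfer with computation */
    }
    return (block_bytes >= IDMA_MIN_BLOCK_BYTES) ? XFER_IDMA : XFER_MEMCPY;
}
```

When the DSP has nothing else to do during the copy, raw speed decides; otherwise iDMA is preferred regardless of block size because the transfer can run in the background.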

iDMA Double-Buffering Data Transfer Example

For detailed iDMA usage, refer to Chapter 8 “The Integrated DMA Library API” in the official Xtensa document sys_sw_rm.pdf.

This example uses double buffering to transfer data from PSRAM to DTCM while computation runs in parallel, so that transfer time is hidden behind processing time.

Pseudo Code

#define ALIGN(x) __attribute__((aligned(x)))
#define DRAM0 __attribute__((section(".dram0.data")))
#define DRAM1 __attribute__((section(".dram1.data")))

// ping and pong data buffers, both located in DTCM
int8_t ALIGN(16) DRAM0 dst_ping[USER_BUFFER_SIZE];
int8_t ALIGN(16) DRAM1 dst_pong[USER_BUFFER_SIZE];

#define NUM_DESCRIPTORS 2
IDMA_BUFFER_DEFINE(dmaBuffer, NUM_DESCRIPTORS, IDMA_1D_DESC);

void idma_pingpong_buffers_example(void) {
    idma_init(0, MAX_BLOCK_16, 16, TICK_CYCLES_1, 0, NULL);
    idma_init_loop(dmaBuffer, IDMA_1D_DESC, NUM_DESCRIPTORS, NULL, NULL);

    // prepare the first data
    idma_copy_desc(dst_ping, ...);

    // wait for the first idma finish
    while (idma_buffer_status() > 0) {}

                                            // prepare the second data
                                            idma_copy_desc(dst_pong, src, size, 0);

    // do the first process
    user_process_1(dst_ping, ...);

                                            // wait for the second idma finish
                                            while (idma_buffer_status() > 0) {}

    // prepare the third data
    idma_copy_desc(dst_ping, ...);

                                            // do the second process
                                            user_process_2(dst_pong, ...);

    // wait for the third idma finish
    while (idma_buffer_status() > 0) {}

                                            // prepare the fourth data
                                            idma_copy_desc(dst_pong, src, size, 0);

    // do the third process
    user_process_3(dst_ping, ...);

                                            // wait for the fourth idma finish
                                            while (idma_buffer_status() > 0) {}

    // prepare the fifth data
    idma_copy_desc(dst_ping, ...);

                                            // do the fourth process
                                            user_process_4(dst_pong, ...);
    ......
}

The code consists of the following parts:

  • Define iDMA buffers

#define NUM_DESCRIPTORS 2
IDMA_BUFFER_DEFINE(dmaBuffer, NUM_DESCRIPTORS, IDMA_1D_DESC);

  • Initialize iDMA

idma_init(0, MAX_BLOCK_16, 16, TICK_CYCLES_1, 0, NULL);
idma_init_loop(dmaBuffer, IDMA_1D_DESC, NUM_DESCRIPTORS, NULL, NULL);

  • Define two data buffers located in DTCM

#define ALIGN(x) __attribute__((aligned(x)))
#define DRAM0 __attribute__((section(".dram0.data")))
#define DRAM1 __attribute__((section(".dram1.data")))

int8_t ALIGN(16) DRAM0 dst_ping[USER_BUFFER_SIZE];
int8_t ALIGN(16) DRAM1 dst_pong[USER_BUFFER_SIZE];

  • For the Nth transfer, update the descriptor and schedule it, wait for the transfer to finish, then process the data

idma_copy_desc(dst_X, src, size, 0);
while (idma_buffer_status() > 0) {}
user_process_N(dst_X, ...);

In the pseudo code, for clarity, sequentially executed code is split into two columns:

  • Left column: 1st, 3rd, 5th,… odd-numbered transfers and computations, using the ping data buffer.

  • Right column: 2nd, 4th, 6th,… even-numbered transfers and computations, using the pong data buffer.

The Nth computation overlaps with the (N+1)th transfer: while the (N+1)th data block is being transferred into one buffer, the Nth data block is processed in the other, so data transfer and computation proceed simultaneously.
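The unrolled sequence can also be written as a loop that alternates between the two buffers. The sketch below is a portable illustration of that scheduling only: `start_transfer()` and `wait_transfer()` are hypothetical stand-ins (implemented with memcpy so the code runs anywhere) for `idma_copy_desc()` and the `idma_buffer_status()` polling loop, and `process_block()` is a hypothetical computation that sums the block.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64u                 /* small illustrative block size */

static int8_t buf[2][BLOCK_BYTES];      /* stand-ins for dst_ping / dst_pong */

/* Stand-in for idma_copy_desc(): here the copy completes immediately. */
static void start_transfer(int8_t *dst, const int8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Stand-in for: while (idma_buffer_status() > 0) {} */
static void wait_transfer(void) {}

/* Hypothetical per-block computation: sum of all samples in the block. */
static int32_t process_block(const int8_t *data, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Process num_blocks consecutive blocks of BLOCK_BYTES bytes starting at src. */
int32_t process_stream(const int8_t *src, size_t num_blocks)
{
    int32_t total = 0;
    if (num_blocks == 0) {
        return 0;
    }
    start_transfer(buf[0], src, BLOCK_BYTES);          /* prime the ping buffer */
    for (size_t n = 0; n < num_blocks; n++) {
        wait_transfer();                                /* block n has arrived */
        if (n + 1 < num_blocks) {                       /* fetch block n+1 into the other buffer */
            start_transfer(buf[(n + 1) & 1u],
                           src + (n + 1) * BLOCK_BYTES, BLOCK_BYTES);
        }
        total += process_block(buf[n & 1u], BLOCK_BYTES);  /* runs during the next transfer */
    }
    return total;
}
```

The index `n & 1` selects ping for even iterations and pong for odd ones. With a real asynchronous DMA engine, the call to `process_block()` would run while the transfer started just before it is still in flight.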

Note

When using iDMA, a small amount of iDMA descriptor data (approximately a few hundred bytes) must reside in DTCM. Therefore, the actual usable DTCM space for users is slightly less than 256KB.