DSP Memory
DSP Memory Architecture
The AmebaLite HiFi5 DSP memory is divided into on-chip and off-chip regions.
DSP On-Chip Memory (exclusive to the DSP):
- ICache
- DCache
- DTCM

DSP Off-Chip Memory (shared with the AmebaLite KM4 and KR4):
- SRAM
- PSRAM
These memories differ significantly in access speed and capacity.

In terms of access performance:

- DTCM and DCache offer the best real-time characteristics: they run at the same frequency as the DSP and achieve single-cycle data access.
- SRAM offers the next-best performance, with a 240MHz clock and a 64-bit bus.
- PSRAM has the lowest actual bandwidth despite its nominal 250MHz frequency, because of its narrow data path: an 8-bit DDR interface, i.e. an effective width of 16 bits per clock.
Capacity scales inversely with speed:

- PSRAM provides the largest expandable space, up to 16MB (the exact capacity depends on the chip model).
- The total SRAM capacity is 512KB, of which 447KB is available to the DSP after deducting the 65KB used by the KM4 and KR4 processors.
- DTCM and DCache serve as dedicated high-speed storage: the smallest capacity, but the lowest latency.
DSP Memory Access Speed
Source | Target | Memory Access Speed (MB/s) | Transfer Method
---|---|---|---
SRAM | DTCM | 1899 | iDMA
SRAM | DCache | 1791 | memcpy
PSRAM | DTCM | 430 | iDMA
PSRAM | DCache | 425 | memcpy
Note
Test conditions: DSP 500MHz, SRAM 240MHz, PSRAM 250MHz.
Memory access speed may vary under different test conditions. Data in the table represents measured peak values.
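The following sketch illustrates how such a figure can be measured: time a copy with the Xtensa HAL cycle counter and convert cycles to bandwidth. This is a minimal example, not the procedure used for the table; the buffer placement, the block size, and the psram_src pointer are illustrative assumptions, and moving the source/destination to other memory sections measures the other paths.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <xtensa/hal.h>   /* xthal_get_ccount() */

#define DSP_CLOCK_HZ 500000000u   /* DSP at 500MHz, as in the test conditions above */
#define TEST_SIZE    (64 * 1024)  /* illustrative block size */

/* Destination in DTCM, placed via a linker section attribute. */
static int8_t __attribute__((section(".dram0.data"), aligned(16))) dst_buf[TEST_SIZE];

extern const int8_t *psram_src;   /* assumed to point at a PSRAM buffer */

void measure_memcpy_bandwidth(void)
{
    uint32_t start = xthal_get_ccount();
    memcpy(dst_buf, psram_src, TEST_SIZE);
    uint32_t cycles = xthal_get_ccount() - start;

    /* MB/s = bytes * cycles-per-second / cycles / 1e6 */
    uint64_t mb_per_s = (uint64_t)TEST_SIZE * DSP_CLOCK_HZ / cycles / 1000000u;
    printf("memcpy: %u bytes in %u cycles, ~%u MB/s\n",
           (unsigned)TEST_SIZE, (unsigned)cycles, (unsigned)mb_per_s);
}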
DSP Memory Access Methods
Since memory access speed affects DSP computational performance and can even become a bottleneck, algorithms running on the DSP should prefer DTCM and SRAM over PSRAM whenever possible.

In practice, the data storage location can be chosen based on the size of the algorithm model. For example:

- If the model is smaller than 256KB, it can be preloaded entirely into DTCM at program startup and stay resident there (see the sketch after this list).
- For larger models, data can be transferred between PSRAM and DTCM block by block.
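For the small-model case, the following is a minimal sketch of a startup preload. The model blob (model_data), its size, and the buffer names are illustrative assumptions; the .dram0.data section attribute is the same one used in the iDMA example below.

#include <stdint.h>
#include <string.h>

#define MODEL_SIZE (96 * 1024)   /* assumed model size, small enough for DTCM */

/* Destination buffer placed in DTCM via a linker section attribute. */
static int8_t __attribute__((section(".dram0.data"), aligned(16)))
    model_dtcm[MODEL_SIZE];

/* Hypothetical model blob, linked into PSRAM by default. */
extern const int8_t model_data[MODEL_SIZE];

void preload_model(void)
{
    /* One-time copy at startup; the weights then stay resident in DTCM,
     * so inference never pays the PSRAM access penalty. */
    memcpy(model_dtcm, model_data, MODEL_SIZE);
}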
Two methods can actively transfer data from PSRAM to DTCM:

- memcpy
- iDMA

Compared to memcpy, the main advantage of iDMA is that it frees up CPU computing power: the DSP can execute other tasks while an iDMA transfer is in flight (a minimal blocking transfer is sketched after this list). In terms of raw PSRAM access speed, however, iDMA shows no significant advantage:

- iDMA is slightly faster when transferring large data blocks (64KB/128KB).
- memcpy is faster when transferring small data blocks (8KB/16KB/32KB).
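As a baseline for the double-buffering example below, here is a minimal sketch of a single blocking iDMA transfer. It uses the same iDMA library calls as the example in the next section; the buffer size and placement are illustrative assumptions.

#include <stddef.h>
#include <stdint.h>
#include <xtensa/idma.h>

#define BLOCKING_NUM_DESCS 1
IDMA_BUFFER_DEFINE(dmaBlockingBuffer, BLOCKING_NUM_DESCS, IDMA_1D_DESC);

/* Destination in DTCM; size is illustrative. */
static int8_t __attribute__((section(".dram0.data"), aligned(16))) dst_buf[4096];

void idma_blocking_copy(void *src, size_t size)
{
    idma_init(0, MAX_BLOCK_16, 16, TICK_CYCLES_1, 0, NULL);
    idma_init_loop(dmaBlockingBuffer, IDMA_1D_DESC, BLOCKING_NUM_DESCS, NULL, NULL);

    /* Schedule one descriptor, then spin until the transfer completes.
     * Functionally this matches memcpy, but the spin loop could be
     * replaced with useful work, which is what double-buffering does. */
    idma_copy_desc(dst_buf, src, size, 0);
    while (idma_buffer_status() > 0) {}
}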
iDMA Double-Buffering Data Transfer Example
For detailed iDMA usage, refer to Chapter 8, “The Integrated DMA Library API”, in the official Xtensa document sys_sw_rm.pdf.
This example demonstrates double-buffering: while one data block is being transferred from PSRAM to DTCM, the previously transferred block is processed, so transfer and computation overlap for acceleration.
Pseudo Code
#define ALIGN(x) __attribute__((aligned(x)))
#define DRAM0 __attribute__((section(".dram0.data")))
#define DRAM1 __attribute__((section(".dram1.data")))

int8_t ALIGN(16) DRAM0 dst_ping[USER_BUFFER_SIZE];
int8_t ALIGN(16) DRAM1 dst_pong[USER_BUFFER_SIZE];

#define NUM_DESCRIPTORS 2
IDMA_BUFFER_DEFINE(dmaBuffer, NUM_DESCRIPTORS, IDMA_1D_DESC);

void idma_pingpong_buffers_example(void) {
    idma_init(0, MAX_BLOCK_16, 16, TICK_CYCLES_1, 0, NULL);
    idma_init_loop(dmaBuffer, IDMA_1D_DESC, NUM_DESCRIPTORS, NULL, NULL);

    // prepare the 1st data block
    idma_copy_desc(dst_ping, src, size, 0);

    // wait for the 1st iDMA transfer to finish
    while (idma_buffer_status() > 0) {}

    // prepare the 2nd data block
    idma_copy_desc(dst_pong, src, size, 0);

    // process the 1st block while the 2nd is being transferred
    user_process_1(dst_ping, ...);

    // wait for the 2nd iDMA transfer to finish
    while (idma_buffer_status() > 0) {}

    // prepare the 3rd data block
    idma_copy_desc(dst_ping, src, size, 0);

    // process the 2nd block while the 3rd is being transferred
    user_process_2(dst_pong, ...);

    // wait for the 3rd iDMA transfer to finish
    while (idma_buffer_status() > 0) {}

    // prepare the 4th data block
    idma_copy_desc(dst_pong, src, size, 0);

    // process the 3rd block while the 4th is being transferred
    user_process_3(dst_ping, ...);

    // wait for the 4th iDMA transfer to finish
    while (idma_buffer_status() > 0) {}

    // prepare the 5th data block
    idma_copy_desc(dst_ping, src, size, 0);

    // process the 4th block while the 5th is being transferred
    user_process_4(dst_pong, ...);

    // ... and so on
}
The code consists of the following parts:
Define iDMA buffers
#define NUM_DESCRIPTORS 2
IDMA_BUFFER_DEFINE(dmaBuffer, NUM_DESCRIPTORS, IDMA_1D_DESC);
Initialize iDMA
idma_init(0, MAX_BLOCK_16, 16, TICK_CYCLES_1, 0, NULL);
idma_init_loop(dmaBuffer, IDMA_1D_DESC, NUM_DESCRIPTORS, NULL, NULL);
Define two data buffers located in DTCM. Placing the ping and pong buffers in the two separate DTCM banks (.dram0.data and .dram1.data) lets the iDMA engine write one bank while the DSP reads the other.
#define ALIGN(x) __attribute__((aligned(x)))
#define DRAM0 __attribute__((section(".dram0.data")))
#define DRAM1 __attribute__((section(".dram1.data")))
int8_t ALIGN(16) DRAM0 dst_ping[USER_BUFFER_SIZE];
int8_t ALIGN(16) DRAM1 dst_pong[USER_BUFFER_SIZE];
For each block, the same three steps repeat: schedule a transfer by updating a descriptor, wait for an earlier transfer to complete, and process the received data:

idma_copy_desc(dst_X, src, size, 0);
while (idma_buffer_status() > 0) {}
user_process_N(dst_X, ...);
In the pseudo code, the two buffers alternate between consecutive blocks:

- Odd-numbered transfers and computations (1st, 3rd, 5th, …) use the ping buffer.
- Even-numbered transfers and computations (2nd, 4th, 6th, …) use the pong buffer.

As a result, while the Nth data block is being transferred, the (N-1)th block is being processed, so data transfer and computation proceed in parallel. The same pattern can be written compactly as a loop, as sketched below.
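The following is a minimal sketch of that loop under the same setup as the example above (iDMA already initialized, dst_ping/dst_pong defined as before); num_blocks, src, and user_process() are placeholders for the application's own data and computation.

void idma_pingpong_loop(const int8_t *src, int num_blocks)
{
    int8_t *bufs[2] = { dst_ping, dst_pong };

    /* Prime the pipeline: start transferring block 0. */
    idma_copy_desc(bufs[0], (void *)src, USER_BUFFER_SIZE, 0);

    for (int n = 0; n < num_blocks; n++) {
        /* Wait for block n to land in DTCM. */
        while (idma_buffer_status() > 0) {}

        /* Kick off block n+1 into the other buffer ... */
        if (n + 1 < num_blocks) {
            idma_copy_desc(bufs[(n + 1) % 2],
                           (void *)(src + (size_t)(n + 1) * USER_BUFFER_SIZE),
                           USER_BUFFER_SIZE, 0);
        }

        /* ... and process block n while that transfer runs. */
        user_process(bufs[n % 2], USER_BUFFER_SIZE);
    }
}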
Note
When using iDMA, a small amount of iDMA descriptor data (on the order of a few hundred bytes) must reside in DTCM. The DTCM space actually available to the user is therefore slightly less than 256KB.