[Feature] Implement DMA support #293

Open
BenkangPeng wants to merge 24 commits into tancheng:master from BenkangPeng:dma-cgra

@BenkangPeng

Collaborator

Related issue: coredac/CGRA-SoC#2

This PR introduces CgraDmaRTL, which integrates the CGRA with a DMA engine, enabling direct memory transfers between external DRAM (not implemented yet) and the CGRA's data SPM.

@BenkangPeng BenkangPeng requested review from HobbitQia and tancheng June 2, 2026 13:55
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/dma/DmaEngineRTL.py
Comment thread mem/dma/DmaEngineRTL.py
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread cgra/CgraDmaRTL.py Outdated
Comment on lines +116 to +118
s.mem_rd_req_val = OutPort() # dma_read_request_valid
s.mem_rd_req_rdy = InPort() # dma_read_request_ready
s.mem_rd_req_addr = OutPort(DmaDramAddrType)

Owner

Why can't we use the RecvIfcRTL and SendIfcRTL interfaces to connect the DmaRTL?

Collaborator Author

Fixed.
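
For readers following the interface question above: the file path `lib/basic/val_rdy/ifcs.py` and the `msg`/`val`/`rdy` ports shown later in this thread suggest the send/recv interfaces carry a message with a val/rdy handshake, where a transfer happens only in a cycle in which the sender asserts `val` and the receiver asserts `rdy`. The following is a plain-Python sketch of that handshake only. It does not use PyMTL3 or the repository's `SendIfcRTL`/`RecvIfcRTL`, and `simulate_val_rdy` is a hypothetical helper.

```python
from collections import deque


def simulate_val_rdy(messages, rdy_pattern):
    """Model a val/rdy channel for len(rdy_pattern) cycles.

    messages:    items the sender wants to transfer, in order.
    rdy_pattern: one bool per cycle, the receiver's rdy signal.
    Returns (received, still_pending).
    """
    pending = deque(messages)
    received = []
    for rdy in rdy_pattern:
        val = bool(pending)          # sender asserts val while it has data
        if val and rdy:              # a transfer needs both signals high
            received.append(pending.popleft())
    return received, list(pending)


if __name__ == "__main__":
    # The receiver stalls in the second cycle, so only two items get through.
    print(simulate_val_rdy([1, 2, 3], [True, False, True]))
```

Bundling the three signals into one interface object, as the reviewer suggests, keeps this handshake in one place instead of repeating it per port.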

Comment thread mem/data/DataMemControllerRTL.py Outdated
Comment thread cgra/CgraTemplateRTL.py Outdated
Comment on lines +225 to +230
s.data_mem.spm_dma_rval //= s.spm_dma_rval
s.data_mem.spm_dma_rrdy //= s.spm_dma_rrdy
s.data_mem.spm_dma_raddr //= s.spm_dma_raddr
s.data_mem.spm_dma_rresp_val //= s.spm_dma_rresp_val
s.data_mem.spm_dma_rresp_rdy //= s.spm_dma_rresp_rdy
s.data_mem.spm_dma_rresp_data //= s.spm_dma_rresp_data

Owner

As we discussed before, should we connect the DMA to the controller (as an intermediate interface) instead of connecting it directly to the data SPM?

So then we can leverage

s.recv_from_cpu_pkt //= s.recv_from_cpu_pkt_queue.recv
s.send_to_cpu_pkt //= s.send_to_cpu_pkt_queue.send
to decide the next location (e.g., local SPM bank, remote SPM)?

Collaborator Author

Fixed.

@HobbitQia

Collaborator

Hi @tancheng @BenkangPeng, I summarized two directions for the DMA design below:

  • Rely on data controller

    • DMA is added as a new client of the DataMemControllerRTL, where data in the DMA engine communicates directly with the DataMemControllerRTL, and the logic for multiplexing SPM ports is also implemented in that module.

      To initiate DMA, the CPU can send dma_mvin or dma_mvout to the CGRA, after which the controller activates the DMA engine by sending start signals.

    • Pros: Keeps the controller clean; provides a faster path because data does not go through the controller.

    • Cons: Additional logic is required to feed DMA results into the control memory.

    image
  • All in controller

    • All decoding logic is handled in the controller. The logic for handling the logic port should still reside in the DataMemControllerRTL, since the data memory should have its own port multiplexing logic.

      The logic of packeting should also be implemented in the controller module.

    • Pros: Unifies control and data memory within the controller (the controller is already connected to both control and data memory).

    • Cons: Introduces complex control logic in the controller; results in a slower path.

    image

I prefer the second method, but I think some logic should still be written in DataMemControllerRTL. WDYT?

@tancheng

tancheng commented Jun 5, 2026

Owner

Hi @HobbitQia, option 2 looks good to me, though I am not sure what additional logic should go in DataMemController?

Comment thread cgra/CgraDmaRTL.py Outdated
@HobbitQia

Collaborator

> Hi @HobbitQia, option 2 looks good to me, though I am not sure what additional logic should go in DataMemController?

I am thinking that if we want DMA and traditional load/store to run concurrently, we need to multiplex the ports of the Data SPM, and I think this logic can be implemented in DataMemController. However, we could also transform the data from DRAM into packets and use commands like CMD_STORE_REQUEST or CMD_LOAD_REQUEST to send them to the SPM. Then we could write our logic entirely in the controller, but with higher latency. I think the former method may have better performance with minimal additions to DataMemController.

@tancheng

tancheng commented Jun 7, 2026

Owner

> > Hi @HobbitQia, option 2 looks good to me, though I am not sure what additional logic should go in DataMemController?
>
> I am thinking that if we want DMA and traditional load/store to run concurrently, we need to multiplex the ports of the Data SPM, and I think this logic can be implemented in DataMemController. However, we could also transform the data from DRAM into packets and use commands like CMD_STORE_REQUEST or CMD_LOAD_REQUEST to send them to the SPM. Then we could write our logic entirely in the controller, but with higher latency. I think the former method may have better performance with minimal additions to DataMemController.

I thought they would have the same latency if we split CMD_STORE_REQUEST into CMD_STORE_REQUEST_FROM_NOC and CMD_STORE_REQUEST_FROM_CPU (and add another inport on the xbar) in the controller?

Adding logic inside the DataMemController kind of bypasses the CGRA controller, which doesn't align with your Option 2. WDYT?

@HobbitQia

Collaborator

> > > Hi @HobbitQia, option 2 looks good to me, though I am not sure what additional logic should go in DataMemController?
> >
> > I am thinking that if we want DMA and traditional load/store to run concurrently, we need to multiplex the ports of the Data SPM, and I think this logic can be implemented in DataMemController. However, we could also transform the data from DRAM into packets and use commands like CMD_STORE_REQUEST or CMD_LOAD_REQUEST to send them to the SPM. Then we could write our logic entirely in the controller, but with higher latency. I think the former method may have better performance with minimal additions to DataMemController.
>
> I thought they would have the same latency if we split CMD_STORE_REQUEST into CMD_STORE_REQUEST_FROM_NOC and CMD_STORE_REQUEST_FROM_CPU (and add another inport on the xbar) in the controller?
>
> Adding logic inside the DataMemController kind of bypasses the CGRA controller, which doesn't align with your Option 2. WDYT?

If the DMA data should go through the controller packet path, there may be extra latency of packeting, and there may be competitions between NoC/CPU/tile request to SPM? Or we have two separate paths in Controller?

@tancheng

tancheng commented Jun 8, 2026

Owner

> there may be extra latency of packeting

Packing can just be combinational logic before putting into a queue.

> and there may be competitions between NoC/CPU/tile request to SPM? Or we have two separate paths in Controller?

I was thinking about separate paths, so mentioned FROM_CPU and FROM_NOC. I feel the requests targeting same SPM bank should anyway conflict with each other, WDYT?
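
To illustrate the point above that packing can be purely combinational: assembling a packet amounts to concatenating fields with shifts and ORs, which needs no state. The sketch below models that in plain Python. The field widths, the field order, and the names `pack_store_request` and `unpack_store_request` are assumptions for illustration; the PR does not state the packet layout.

```python
# Hypothetical field widths; the PR does not state the packet layout.
ADDR_BITS = 16
DATA_BITS = 32


def pack_store_request(opcode, addr, data):
    """Concatenate opcode | addr | data into one integer (data in the low bits)."""
    assert 0 <= addr < (1 << ADDR_BITS)
    assert 0 <= data < (1 << DATA_BITS)
    return (opcode << (ADDR_BITS + DATA_BITS)) | (addr << DATA_BITS) | data


def unpack_store_request(word):
    """Inverse of pack_store_request: slice the fields back out."""
    data = word & ((1 << DATA_BITS) - 1)
    addr = (word >> DATA_BITS) & ((1 << ADDR_BITS) - 1)
    opcode = word >> (ADDR_BITS + DATA_BITS)
    return opcode, addr, data


if __name__ == "__main__":
    word = pack_store_request(3, 0x10, 0xDEADBEEF)
    print(hex(word), unpack_store_request(word))
```

In hardware the same concatenation is only wiring, so any added latency would come from the queue that follows it, not from the packing itself.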

@HobbitQia

Collaborator

> > there may be extra latency of packeting
>
> Packing can just be combinational logic before putting into a queue.
>
> > and there may be competitions between NoC/CPU/tile request to SPM? Or we have two separate paths in Controller?
>
> I was thinking about separate paths, so mentioned FROM_CPU and FROM_NOC. I feel the requests targeting same SPM bank should anyway conflict with each other, WDYT?

Got it. I mean the conflict between different paths. Even if we have FROM_CPU and FROM_NOC, there may be concurrent reads/writes on the SPMs. So maybe we need to multiplex the in/out ports of the SPMs, and the logic for multiplexing and handling the conflicts can be implemented in DataSPMController?
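
The conflict handling raised above can be pictured as a per-bank arbiter: requests that target different banks proceed in the same cycle, and requests that target the same bank are granted one at a time. The sketch below uses round-robin priority, which is one possible policy and not something the thread decides on. The requester names, the bank count, and the address-to-bank mapping are all assumptions.

```python
SOURCES = ("cpu", "noc", "dma")  # hypothetical requester names
NUM_BANKS = 4                    # hypothetical number of SPM banks


def bank_of(addr):
    # Assumed low-order interleaving of addresses across banks.
    return addr % NUM_BANKS


def arbitrate(requests, last_granted):
    """Grant at most one requester per bank for one cycle.

    requests:     dict mapping source name -> address (None or absent if idle).
    last_granted: dict mapping bank -> index in SOURCES of the previous winner;
                  updated in place so priority rotates between calls.
    Returns a dict mapping bank -> granted source name.
    """
    grants = {}
    for bank in range(NUM_BANKS):
        contenders = [s for s in SOURCES
                      if requests.get(s) is not None
                      and bank_of(requests[s]) == bank]
        if not contenders:
            continue
        # Start checking from the source right after the previous winner.
        start = (last_granted.get(bank, -1) + 1) % len(SOURCES)
        order = SOURCES[start:] + SOURCES[:start]
        winner = next(s for s in order if s in contenders)
        grants[bank] = winner
        last_granted[bank] = SOURCES.index(winner)
    return grants


if __name__ == "__main__":
    state = {}
    reqs = {"cpu": 0, "noc": 4, "dma": 1}  # cpu and noc both hit bank 0
    print(arbitrate(reqs, state))
    print(arbitrate(reqs, state))
```

Where this logic lives (controller crossbar or data memory controller) is exactly the open question in the thread; the sketch only shows the behaviour, not the placement.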

@tancheng

tancheng commented Jun 8, 2026

Owner

> > > there may be extra latency of packeting
> >
> > Packing can just be combinational logic before putting into a queue.
> >
> > > and there may be competitions between NoC/CPU/tile request to SPM? Or we have two separate paths in Controller?
> >
> > I was thinking about separate paths, so mentioned FROM_CPU and FROM_NOC. I feel the requests targeting same SPM bank should anyway conflict with each other, WDYT?
>
> Got it. I mean the conflict between different paths. Even if we have FROM_CPU and FROM_NOC, there may be concurrent reads/writes on the SPMs. So maybe we need to multiplex the in/out ports of the SPMs, and the logic for multiplexing and handling the conflicts can be implemented in DataSPMController?

Oh, we don't need to distinguish FROM_CPU and FROM_NOC, we can decompose the requests from CPU like:

s.recv_from_cpu_pkt_queue.send.rdy @= s.crossbar.recv[kFromCpuCtrlAndDataIdx].rdy

@HobbitQia

HobbitQia commented Jun 8, 2026

Collaborator

> > > > there may be extra latency of packeting
> > >
> > > Packing can just be combinational logic before putting into a queue.
> > >
> > > > and there may be competitions between NoC/CPU/tile request to SPM? Or we have two separate paths in Controller?
> > >
> > > I was thinking about separate paths, so mentioned FROM_CPU and FROM_NOC. I feel the requests targeting same SPM bank should anyway conflict with each other, WDYT?
> >
> > Got it. I mean the conflict between different paths. Even if we have FROM_CPU and FROM_NOC, there may be concurrent reads/writes on the SPMs. So maybe we need to multiplex the in/out ports of the SPMs, and the logic for multiplexing and handling the conflicts can be implemented in DataSPMController?
>
> Oh, we don't need to distinguish FROM_CPU and FROM_NOC, we can decompose the requests from CPU like:
>
> s.recv_from_cpu_pkt_queue.send.rdy @= s.crossbar.recv[kFromCpuCtrlAndDataIdx].rdy

But if we add DMA, there should be another path FROM_DMA, which is different from FROM_CPU/FROM_NOC?

@tancheng

tancheng commented Jun 8, 2026

Owner

> But if we add DMA, there should be another path FROM_DMA, which is different from FROM_CPU/FROM_NOC?

Ah, yes, makes sense. That FROM_DMA would be similar to FROM_CPU. The DMA_Controller then can assemble the different ports into our struct and send/recv interface.

@HobbitQia

Collaborator

> DMA_Controller

So this DMA_Controller refers to DataSPMController or our DMA engine?

@tancheng

tancheng commented Jun 9, 2026

Owner

> So this DMA_Controller refers to DataSPMController or our DMA engine?

DMA engine in your figure, or @BenkangPeng's DmaEngineRTL.

@tancheng

Owner

@BenkangPeng will you update this PR accordingly?

@BenkangPeng

Collaborator Author

> @BenkangPeng will you update this PR accordingly?

Yes, sorry for the delay. I will update this PR as soon as possible.

…RTL. The DMA is connected to the data memory controller indirectly via the controller, with the decoding logic integrated into the controller.
Comment thread cgra/CgraTemplateRTL.py Outdated
Comment thread cgra/CgraTemplateRTL.py Outdated
Comment thread controller/ControllerRTL.py Outdated
Comment thread controller/ControllerRTL.py Outdated
…interface for enhanced data transfer capabilities.
…te requests and adjust related signal handling for clarity and consistency.
…TL by passing DmaDataType and DmaCmdType as parameters, and updating related type definitions for improved clarity and consistency.
…rite requests, enhancing type definitions for DmaCmdType and DmaDataType
Comment thread controller/ControllerRTL.py Outdated
YType = mk_bits(max(clog2(multi_cgra_rows), 1))
TileIdType = mk_bits(clog2(num_tiles + 1))
ControllerXbarPktType = mk_controller_noc_xbar_pkt(InterCgraPktType)
DmaOpcodeType = DmaCmdType.get_field_type('opcode')

Owner

Similar to above attr, we should use kAttrOpcode instead of opcode here?

Collaborator Author

Fixed.

Comment thread controller/ControllerRTL.py Outdated
TileIdType = mk_bits(clog2(num_tiles + 1))
ControllerXbarPktType = mk_controller_noc_xbar_pkt(InterCgraPktType)
DmaOpcodeType = DmaCmdType.get_field_type('opcode')
DmaDramAddrType = DmaCmdType.get_field_type('dram_addr')

Owner

ditto for dram_addr?

Collaborator Author

Fixed.

Comment thread cgra/CgraTemplateRTL.py
s.dma_done //= s.controller.dma_done

# DMA engine <-> controller side of the SPM path.
s.dma_spm //= s.controller.dma_spm_from_dma

Owner

Rename this? dma_spm and dma_spm_from_dma sound confusing.

Collaborator Author

How about s.dma_spm_from_dma //= s.controller.dma_spm_from_dma? My original idea was that dma_spm indicates the signal is used for communication between DMA and SPM, with suffixes like from_dma and to_mem indicating the direction of the signal.

Owner

We should have two interfaces, one send and one recv, WDYT? and the "send"/"recv" should be in the name.

Comment thread cgra/CgraDmaRTL.py

# Abstract external dram memory interfaces for the internal DMA engine.

s.dram_rd_req = SendIfcRTL(DmaDramAddrType)

Owner

Name this as s.send_dram_rd_req, to indicate its direction.

Comment thread cgra/CgraDmaRTL.py
# Abstract external dram memory interfaces for the internal DMA engine.

s.dram_rd_req = SendIfcRTL(DmaDramAddrType)
s.dram_rd_resp = RecvIfcRTL(DmaMemDataType)

Owner

name this as s.recv_dram_rd_resp.

Comment thread cgra/CgraDmaRTL.py

# Components.

s.cgra = CgraTemplateRTL(CgraPayloadType,

Owner

I was actually wondering why we need this CgraDmaRTL.py. Can we just expand the CgraTemplateRTL to include the DMA-related ports/interfaces?

Comment thread lib/basic/val_rdy/ifcs.py
s.msg = OutPort( Type )
s.val = OutPort()
s.rdy = OutPort()
s.trace_len = len(str(Type()))

Owner

Can we just reuse the ValRdySendIfcRTL, but embed the trace_len into msg? The msg could have both msg.data and msg.trace_len. WDYT?

Comment thread lib/basic/val_rdy/ifcs.py
Comment on lines +112 to +114
- write : Output (Send). DMA sends write requests to SPM.
- read : Output (Send). DMA sends read requests to SPM.
- read_resp: Input (Recv). DMA receives read data from SPM.

Owner

Can we reuse our send/recv interfaces, instead of creating new one?

Comment thread lib/basic/val_rdy/ifcs.py
Comment on lines +136 to +138
- write : Input (Recv). SPM receives write requests from DMA.
- read : Input (Recv). SPM receives read requests from DMA.
- read_resp: Output (Send). SPM sends read data back to DMA.

Owner

ditto

Comment thread lib/util/common.py
Comment on lines +84 to +92
STATE_IDLE = StateType( 0 ) # Waiting for a new DMA command
STATE_MVIN_REQ = StateType( 1 ) # MVIN: Issuing DRAM read request
STATE_MVIN_RESP = StateType( 2 ) # MVIN: Waiting for DRAM read response
STATE_MVIN_WRITE = StateType( 3 ) # MVIN: Writing unpacked words to SPM
STATE_MVOUT_READ = StateType( 4 ) # MVOUT: Issuing SPM read request
STATE_MVOUT_RESP = StateType( 5 ) # MVOUT: Receiving SPM read response and packing
STATE_MVOUT_WRITE = StateType( 6 ) # MVOUT: Issuing DRAM write request
STATE_MVOUT_WAIT = StateType( 7 ) # MVOUT: Waiting for DRAM write response
STATE_DONE = StateType( 8 ) # Signaling command completion

Owner

Change from STATE_xxx to STATE_DMA_xxx.
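
As a rough picture of the requested rename and of how the nine states relate, here is a plain-Python model using `enum.IntEnum`. The encodings are copied from the diff above with the `STATE_DMA_` prefix applied. The transition order is an assumption read off the state comments (the real engine may loop over several words per command), and `DmaState`, `next_state`, and the `"mvin"`/`"mvout"` command strings are hypothetical names.

```python
from enum import IntEnum


class DmaState(IntEnum):
    STATE_DMA_IDLE = 0         # Waiting for a new DMA command
    STATE_DMA_MVIN_REQ = 1     # MVIN: issuing DRAM read request
    STATE_DMA_MVIN_RESP = 2    # MVIN: waiting for DRAM read response
    STATE_DMA_MVIN_WRITE = 3   # MVIN: writing unpacked words to SPM
    STATE_DMA_MVOUT_READ = 4   # MVOUT: issuing SPM read request
    STATE_DMA_MVOUT_RESP = 5   # MVOUT: receiving SPM read response and packing
    STATE_DMA_MVOUT_WRITE = 6  # MVOUT: issuing DRAM write request
    STATE_DMA_MVOUT_WAIT = 7   # MVOUT: waiting for DRAM write response
    STATE_DMA_DONE = 8         # Signaling command completion


# Assumed single-pass transitions; handshake stalls and multi-word loops omitted.
_NEXT = {
    DmaState.STATE_DMA_MVIN_REQ: DmaState.STATE_DMA_MVIN_RESP,
    DmaState.STATE_DMA_MVIN_RESP: DmaState.STATE_DMA_MVIN_WRITE,
    DmaState.STATE_DMA_MVIN_WRITE: DmaState.STATE_DMA_DONE,
    DmaState.STATE_DMA_MVOUT_READ: DmaState.STATE_DMA_MVOUT_RESP,
    DmaState.STATE_DMA_MVOUT_RESP: DmaState.STATE_DMA_MVOUT_WRITE,
    DmaState.STATE_DMA_MVOUT_WRITE: DmaState.STATE_DMA_MVOUT_WAIT,
    DmaState.STATE_DMA_MVOUT_WAIT: DmaState.STATE_DMA_DONE,
    DmaState.STATE_DMA_DONE: DmaState.STATE_DMA_IDLE,
}


def next_state(state, cmd=None):
    """Return the next state; cmd ("mvin"/"mvout") only matters in IDLE."""
    if state == DmaState.STATE_DMA_IDLE:
        if cmd == "mvin":
            return DmaState.STATE_DMA_MVIN_REQ
        if cmd == "mvout":
            return DmaState.STATE_DMA_MVOUT_READ
        return DmaState.STATE_DMA_IDLE
    return _NEXT[state]


if __name__ == "__main__":
    s = next_state(DmaState.STATE_DMA_IDLE, "mvin")
    while s != DmaState.STATE_DMA_IDLE:
        print(s.name)
        s = next_state(s)
```

The prefix matters mainly because these constants live in a shared `lib/util/common.py`, where an unqualified `STATE_IDLE` could collide with other state machines.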

Comment thread lib/messages.py

return mk_bitstruct(new_name, {
'dram_data': DramDataType,
'dram_mask': DramMaskType,

Owner

Explain what dram_mask is with a comment?

Comment thread lib/messages.py
'dram_data': DramDataType,
'dram_mask': DramMaskType,
'spm_data': SpmDataType,
'spm_mask': SpmMaskType,

Owner

Explain what spm_mask is with a comment?
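
The PR does not say what `dram_mask` and `spm_mask` mean, which is the point of the two requests above. One common convention for a mask that travels with write data is a per-byte write enable, where bit i of the mask decides whether byte i of the stored word is overwritten. The sketch below shows that convention only as an example of the kind of comment being asked for; whether these fields work this way is an assumption, and `apply_byte_mask` is a hypothetical helper.

```python
def apply_byte_mask(old_word, new_word, mask, num_bytes):
    """Merge new_word into old_word, one byte at a time.

    Bit i of mask set   -> byte i is taken from new_word.
    Bit i of mask clear -> byte i keeps its value from old_word.
    Byte 0 is the least significant byte.
    """
    result = 0
    for i in range(num_bytes):
        source = new_word if (mask >> i) & 1 else old_word
        result |= ((source >> (8 * i)) & 0xFF) << (8 * i)
    return result


if __name__ == "__main__":
    # Mask 0b0101 updates bytes 0 and 2 only.
    print(hex(apply_byte_mask(0x11223344, 0xAABBCCDD, 0b0101, 4)))
```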

@tancheng

Owner

> Hi @tancheng @BenkangPeng, I summarized two directions for the DMA design below:
>
>   • Rely on data controller
>
>     • DMA is added as a new client of the DataMemControllerRTL, where data in the DMA engine communicates directly with the DataMemControllerRTL, and the logic for multiplexing SPM ports is also implemented in that module.
>       To initiate DMA, the CPU can send dma_mvin or dma_mvout to the CGRA, after which the controller activates the DMA engine by sending start signals.
>
>     • Pros: Keeps the controller clean; provides a faster path because data does not go through the controller.
>
>     • Cons: Additional logic is required to feed DMA results into the control memory.
>
>       image
>   • All in controller
>
>     • All decoding logic is handled in the controller. The logic for handling the logic port should still reside in the DataMemControllerRTL, since the data memory should have its own port multiplexing logic.
>       The logic of packeting should also be implemented in the controller module.
>
>     • Pros: Unifies control and data memory within the controller (the controller is already connected to both control and data memory).
>
>     • Cons: Introduces complex control logic in the controller; results in a slower path.
>
>       image
>
> I prefer the second method, but I think some logic should still be written in DataMemControllerRTL. WDYT?

We should also include the figure 2 into our

3 participants