CN108234147B - DMA broadcast data transmission method based on host counting in GPDSP

Info

Publication number: CN108234147B
Application number: CN201711480231.7A
Authority: CN
Inventors: 马胜; 雷元武; 张美迪; 万江华; 陈胜刚; 李勇; 彭元喜; 孙书为
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2021-06-18
Anticipated expiration: 2037-12-29
Also published as: CN108234147A

Abstract

The invention discloses a DMA broadcast data transmission method based on host counting in GPDSP, which comprises the following steps: starting DMA broadcast data transmission by a host DMA, generating a broadcast read request and then sending the broadcast read request to the outside of a core through an on-chip network; the host DMA receives the read return data of each slave DMA, counts to confirm whether the data transmission is finished, when the data transmission is confirmed to be finished, the host DMA sends out a buffer emptying command to all the slave DMAs, and each slave DMA receives the buffer emptying command and executes the buffer emptying operation to finish the broadcast transmission. The invention can start one DMA transmission transaction to realize DMA broadcast data transmission, and has the advantages of simple realization principle, low cost, low DMA transmission power consumption and starting overhead, high data transmission efficiency and DDR reading efficiency, large transmission bandwidth and the like.

Description

DMA broadcast data transmission method based on host counting in GPDSP

Technical Field

The invention relates to the technical field of General Purpose Digital Signal Processors (GPDSPs), and in particular to a DMA (Direct Memory Access) broadcast data transmission method based on host counting in a GPDSP.

Background

The GPDSP is a novel architecture that retains the essential characteristics of an embedded DSP, including its high performance and low power consumption, while also supporting general scientific computing efficiently. It overcomes the problems that arise when an ordinary DSP is used for scientific computing, and can efficiently support both 64-bit high-performance computers and embedded high-precision signal processing. The architecture has the following characteristics: (1) direct representation of double-precision floating-point and 64-bit fixed-point data, with general-purpose registers, data buses and instruction widths of 64 bits or more and address buses of 40 bits or more; (2) tightly coupled heterogeneous CPU and DSP cores, in which the CPU core supports a complete operating system and the scalar unit of the DSP core supports an operating system microkernel; (3) a unified programming model covering the CPU core, the DSP core and the vector array structure inside the DSP core; (4) cross simulation and debugging from another machine is retained, and a local CPU host debugging mode is also provided; (5) apart from word width, the basic characteristics of an ordinary DSP are retained.

The GPDSP generally comprises a plurality of homogeneous 64-bit processing units that form a processing array to obtain high floating-point performance. However, because of the huge amount of data the GPDSP has to process, a large amount of data must be exchanged between the in-core storage units and the out-of-core storage units. Data stored in the out-of-core storage space must be moved into the in-core storage space before the core can compute on it, and results computed by the core must be moved back to the out-of-core storage space for storage. The data transmission rate between the in-core and out-of-core storage components therefore becomes a key factor limiting the processing speed of the GPDSP.

The DMA can transfer data at high speed in the background while the processing core performs computation; the processing core does not need to take part in the transfer, so the DMA helps to relieve the memory wall problem. Because DMA overlaps the computation of the core with the data movement of the storage components, the influence of the data transmission speed between in-core and out-of-core storage components on the processing performance of the GPDSP is reduced to a certain extent. However, as the number of processing cores integrated in the GPDSP keeps growing, the existing DMA data transmission modes can no longer supply the data volume that multi-core parallel processing requires. An efficient multi-core DMA must take into account both the memory access requirements of the application program and the hardware structure of the multi-core GPDSP.

When common algorithms and application programs such as matrix multiplication, the fast Fourier transform and HPL (High Performance Linpack) are implemented in parallel on a multi-core GPDSP, all cores may access the same block of memory within a period of time. For example, in GEMM matrix multiplication (C += AB), matrix A is a shared matrix that every DSP core needs. If a traditional DMA transmission mode is used, each DSP core initiates a point-to-point transfer to read the data block at the same DDR location. Because the cores are at different distances from the DDR, the data being read by different cores at a given moment may lie on different DDR pages, which increases DDR page misses and page switches, increases access latency, and greatly reduces DDR read efficiency. If several or all cores start DMA transfer transactions, this not only consumes a large amount of power but also puts pressure on the network, and contention or page misses occur when the out-of-core DDR storage space is accessed.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: aiming at the technical problems in the prior art, the invention provides the host counting-based DMA broadcast data transmission method in the GPDSP, which has the advantages of simple realization principle, low cost, low DMA transmission power consumption and starting overhead, high data transmission efficiency and DDR (double data rate) reading efficiency and large transmission bandwidth.

In order to solve the technical problems, the technical scheme provided by the invention is as follows:

a DMA broadcast data transmission method based on host counting in GPDSP comprises the following steps: starting DMA broadcast data transmission by a host DMA, generating a broadcast read request and then sending the broadcast read request to an off-core storage space through an on-chip network; and the out-core storage space sends the read return data to the on-chip network according to the broadcast read request, each core in the GPDSP receives the read return data from the on-chip network and writes the read return data into the in-core storage space, and the host DMA receives the read return data and counts to confirm whether the data transmission is finished.

As a further improvement of the present invention, the confirming whether the transmission is completed specifically includes: the method comprises the steps of presetting broadcast transmission parameters including a source frame number SrcArrCnt, a source frame residual unit number SrcEleCnt, a destination frame number DstArrCnt and a destination frame residual unit number DstEleCnt, wherein the source frame number SrcArrCnt is used for configuring the frame number of data to be moved outside a core, the source frame residual unit number SrcEleCnt is used for counting the number of data units which are not read in a current source frame, the destination frame number DstArrCnt is used for configuring the number of data frames written into a storage space in the core, the destination frame residual unit number DstEleCnt is used for counting the number of data units which are not written in the current destination frame, and whether data transmission is finished or not is confirmed according to the values of the broadcast transmission parameters.

As a further improvement of the present invention, the source frame number SrcArrCnt, the source frame residual unit number SrcEleCnt, the destination frame number DstArrCnt and the destination frame residual unit number DstEleCnt satisfy the following formula:

(SrcArrCnt+1)*SrcEleCnt == (DstArrCnt+1)*DstEleCnt;

wherein SrcArrCnt +1 is the frame number of data to be moved outside the core, SrcEleCnt is the number of data units which are not read yet in the current source frame, DstArrCnt +1 is the number of data frames which are required to be written into the storage space in the core, and DstEleCnt is the number of data units which are not written yet in the current destination frame.

As a further development of the invention, the method further comprises a transfer mode parameter TMODE for configuring the DMA transfer mode, which transfer mode parameter TMODE, when active, initiates the execution of a DMA broadcast data transfer.

As a further improvement of the invention, the broadcast read request comprises a read return selection vector RetVec used for identifying the data return core information, and the target core required to be returned by the read return data is determined according to the read return selection vector RetVec.

As a further improvement of the present invention, the read return selection vector RetVec specifically has multiple bits, and each bit corresponds to a state that identifies whether a participating core participating in transmission needs to return read return data.

As a further improvement of the invention: the broadcast read request further comprises one or more of a read address, a read mask, and a read return address.

As a further improvement of the present invention, when the data transmission is confirmed to be completed, the present invention further includes a buffer clearing step, and the specific steps are as follows: and the master DMA sends a buffer clearing command to all the slave DMAs, and each slave DMA receives the buffer clearing command and executes buffer clearing operation to finish broadcast transmission.

As a further improvement of the invention, the method comprises the following specific steps:

S1, presetting the broadcast transmission parameters, namely the source frame number SrcArrCnt, the source frame residual unit number SrcEleCnt, the destination frame number DstArrCnt and the destination frame residual unit number DstEleCnt;

S2, after the broadcast transmission parameters are configured, the host DMA starts the DMA broadcast data transmission, generates a broadcast read request according to the broadcast transmission parameters, and sends it to the out-of-core storage space through the network on chip;

S3, the out-of-core storage space sends read return data to the network on chip according to the broadcast read request; each core in the GPDSP receives the read return data from the network on chip and writes it into its in-core storage space, and the host DMA receives the read return data and updates the broadcast transmission parameters so as to count;

S4, when the host DMA receives the last block of data, counting is complete and the host DMA sends a buffer clearing command to all DSP cores; each slave DMA executes the clearing operation after receiving the command and, once clearing is complete, issues an interrupt request according to its interrupt enable bit; after the slave DMA buffer has been emptied, the value of the broadcast end register BOR preset in the slave is set, and the broadcast transmission transaction ends.

As a further improvement of the present invention, when the host DMA receives the read return data in step S3, the method further includes a data validity determination step, which specifically includes: and judging whether the data is valid, if so, forwarding the data to the internal storage space of the core, starting the host DMA to count, and if not, directly starting the host DMA to count.

Compared with the prior art, the invention has the advantages that:

1) The DMA broadcast data transmission method based on host counting in the GPDSP of the invention moves the same data block of the out-of-core storage space into the in-core storage spaces of all DSP cores on the chip through a single DMA transmission transaction. The host DMA generates the read request and counts the transmitted data blocks to confirm that transmission is complete, so only the DMA broadcast transmission transaction of one DSP core needs to be started for the same data block of the out-of-core storage space to be broadcast to all DSP cores on the chip. This satisfies the case in which all DSP cores need the same data, avoids having all cores start DMA transfers at the same time, effectively reduces DMA transmission power consumption and start-up overhead, and reduces congestion on the network on chip.

2) The DMA broadcast data transmission method based on host counting in the GPDSP can realize broadcast transmissions such as that of the shared matrix A in GEMM matrix multiplication (C += AB). Starting a single DMA transmission transaction is enough to satisfy the need of all DSP cores for the out-of-core data, which greatly reduces the number of page switches and the number of accesses to the out-of-core DDR, greatly improves the DDR read efficiency and row hit rate, and effectively improves the transmission bandwidth.

3) The DMA broadcast data transmission method based on host counting in the GPDSP further sets broadcast transmission parameters of a source frame number SrcArrCnt, a source frame residual unit number SrcEleCnt, a target frame number DstArrCnt and a target frame residual unit number DstEleCnt, so that DMA broadcast data transmission control can be conveniently realized by configuring the broadcast transmission parameters, a DMA broadcast transmission transaction of one DSP core is simply and efficiently started, the same data block of an extra-core storage space of the GPDSP can be transmitted to all DSP cores on a chip in a broadcast mode, and flexible configuration can be realized based on the broadcast transmission parameters.

Drawings

Fig. 1 is a schematic diagram of the architecture principle of the GPDSP employed in the present embodiment.

Fig. 2 is a schematic diagram of the location and operation principle of the DMA in the GPDSP in this embodiment.

FIG. 3 is a schematic diagram of a DMA broadcast data transfer according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of a DMA broadcast data transfer parameter word according to an embodiment of the present invention.

Fig. 5 is a schematic flow chart of an implementation of DMA broadcast data transmission according to the present embodiment.

Detailed Description

The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.

As shown in fig. 1 to 5, the host count-based DMA broadcast data transmission method in the GPDSP of the present embodiment includes: starting DMA broadcast data transmission by a host DMA, generating a broadcast read request and then sending the broadcast read request to an off-core storage space through an on-chip network; the method comprises the steps that a read return data is sent to a network-on-chip by an extra-core storage space according to a broadcast read request, each core in the GPDSP receives the read return data from the network-on-chip and writes the read return data into a storage space in the core, and the host DMA receives the read return data and counts to confirm whether data transmission is completed or not, namely, the DSP core initiating DMA transmission transaction is used as the host to be responsible for generating the read request in the broadcast transmission, and meanwhile, the DSP core counts transmission data of all other cores to confirm that the data transmission is completed.

In this method, the same data block of the out-of-core storage space is moved into the in-core storage spaces of all DSP cores on the chip through a single DMA transmission transaction. The host DMA generates the read request and counts the transmitted data blocks to confirm that transmission is complete, so only the DMA broadcast transmission transaction of one DSP core needs to be started for the same data block of the out-of-core storage space to be broadcast to all DSP cores on the chip. This satisfies the case in which all DSP cores need the same data, avoids having all cores start DMA transfers at the same time, effectively reduces DMA transmission power consumption and start-up overhead, and reduces congestion on the network on chip.

The method can realize broadcast transmissions such as that of the shared matrix A in GEMM matrix multiplication (C += AB). Starting a single DMA transmission transaction is enough to satisfy the need of all DSP cores for the out-of-core data, which greatly reduces the number of page switches and the number of accesses to the out-of-core DDR, greatly improves the DDR read efficiency and row hit rate, and effectively improves the transmission bandwidth.

The GPDSP architecture adopted in this embodiment is shown in fig. 1, and the multi-core GPDSP is composed of core nodes, IO nodes, a network on chip, a DDR controller, and a storage component DDR outside the core, where each core node includes two DSP cores, the DDR controller controls the migration of DDR data, and the network on chip implements data communication between the DSPs and the storage space outside the core.

As shown in fig. 2, in the present embodiment the DMA is connected to the scalar processing unit SPU through the configuration bus PBUS in the DSP core, to the in-core storage space (a vector memory unit AM and a scalar memory unit SM) through a data bus, and to the out-of-core storage space DDR through an out-of-core bus interface. The SPU is responsible for generating the transmission parameter words for the DMA, so that the DMA can actively move data from the in-core storage space to the out-of-core storage space or from the out-of-core storage space to the in-core storage space; the DMA can also passively receive read and write requests from the network on chip.

In this embodiment, the determining whether the transmission is completed specifically includes: the method comprises the steps that broadcast transmission parameters including a source frame number SrcArrCnt, a source frame residual unit number SrcEleCnt, a destination frame number DstArrCnt and a destination frame residual unit number DstEleCnt are respectively preset, the source frame number SrcArrCnt is used for configuring the frame number of data to be moved outside a core, the frame number of the data to be moved from an outside storage space is represented as SrcArrCnt +1, the source frame residual unit number SrcEleCnt is used for counting the number of data units which are not read in a current source frame, and the data units are the minimum granularity of DMA transmission in a GPDSP; the target frame number DstArrCnt is used for configuring the number of data frames written into the storage space in the core, the number of the data frames written into the storage space in the core is DstArrCnt +1, the residual unit number DstEleCnt of the target frame is used for counting the number of data units which are not written in the current target frame, and whether data transmission is finished or not is confirmed according to the value of the broadcast transmission parameter.

In this embodiment, by setting the broadcast transmission parameters SrcArrCnt (source frame number), SrcEleCnt (source frame residual unit number), DstArrCnt (destination frame number) and DstEleCnt (destination frame residual unit number), DMA broadcast data transmission control can be conveniently realized by configuring these parameters, so that the DMA broadcast transmission transaction of one DSP core is started simply and efficiently, and the same data block of the out-of-core storage space of the GPDSP can be transmitted to all DSP cores on the chip in a broadcast manner.

In this embodiment, the source frame number SrcArrCnt, the source frame residual unit number SrcEleCnt, the destination frame number DstArrCnt and the destination frame residual unit number DstEleCnt satisfy the following equation:

(SrcArrCnt+1)*SrcEleCnt == (DstArrCnt+1)*DstEleCnt;

When the DMA performs broadcast data transmission, the designated host DMA issues a broadcast read request for a data block in the DDR, and the returned data is sent to all DSP cores. SrcArrCnt configures the number of frames to be moved from outside the core, which is SrcArrCnt + 1, and SrcEleCnt is the number of units remaining in the current source frame, a data unit being the minimum unit of DMA transmission; the size of the broadcast transfer is therefore (SrcArrCnt + 1) * SrcEleCnt. When SrcEleCnt reaches 0, the read requests for the current frame are complete and SrcArrCnt is decremented by 1; when SrcArrCnt is 0 and SrcEleCnt is also 0, the read requests for the whole transfer are complete. DstArrCnt is the destination frame number and DstEleCnt is the number of units remaining in the current destination frame, where (SrcArrCnt + 1) * SrcEleCnt == (DstArrCnt + 1) * DstEleCnt.
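The countdown described above can be sketched in software as follows. This is only an illustrative model, not the patented hardware: the class name `BroadcastCounters`, the `frame_len` field and the reloading of SrcEleCnt at each frame boundary are assumptions made so that the sketch runs; the text itself only states when each counter reaches zero and the size equation the parameters must satisfy.

```python
from dataclasses import dataclass


def sizes_match(src_arr_cnt, src_ele_cnt, dst_arr_cnt, dst_ele_cnt):
    """Check (SrcArrCnt+1)*SrcEleCnt == (DstArrCnt+1)*DstEleCnt."""
    return (src_arr_cnt + 1) * src_ele_cnt == (dst_arr_cnt + 1) * dst_ele_cnt


@dataclass
class BroadcastCounters:
    """Hypothetical software model of the host DMA's source-side counters."""
    src_arr_cnt: int    # SrcArrCnt: number of frames to move, minus one
    src_ele_cnt: int    # SrcEleCnt: units not yet read in the current frame
    frame_len: int = 0  # units per frame (assumed helper, not in the text)

    def __post_init__(self):
        # Remember the configured frame length so SrcEleCnt can be reloaded.
        self.frame_len = self.src_ele_cnt

    def consume_unit(self):
        """Count one data unit; return True once the whole transfer is counted."""
        self.src_ele_cnt -= 1
        if self.src_ele_cnt == 0:
            if self.src_arr_cnt == 0:
                return True               # both counters are 0: transfer done
            self.src_arr_cnt -= 1         # current frame finished
            self.src_ele_cnt = self.frame_len
        return False


counters = BroadcastCounters(src_arr_cnt=3, src_ele_cnt=96)
units = 1
while not counters.consume_unit():
    units += 1
print(units)  # total units counted for 4 frames of 96 units
```

With 4 frames of 96 units the loop counts (3 + 1) * 96 units before both counters reach zero, which is the transfer size given by the formula.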

In this embodiment, a transfer mode parameter TMODE for configuring the DMA transfer mode is further included, and DMA broadcast data transfer is started when TMODE is valid. Specifically, when TMODE is 2'b11 the transfer mode is configured as broadcast data transfer, that is, the host DMA starts a broadcast data transfer when TMODE is 2'b11.

In this embodiment, the broadcast read request includes a read return selection vector RetVec that identifies the cores to which data is to be returned, and the destination cores of the read return data are determined according to RetVec. In other words, the broadcast read request carries the data return information, and each DSP core determines from the value of RetVec whether the read return data is to be returned to it. RetVec has n bits, and each bit corresponds to one DSP core and indicates whether data needs to be returned to that core. The broadcast read request further includes a read address, a read mask, a read return address and the like; that is, the read request carries the read address, the read mask, the read return address, the read return selection vector RetVec and other information.
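As a minimal sketch of how such a per-core selection vector could be built and decoded, the following helpers treat RetVec as an n-bit integer. The function names are hypothetical, and the mapping of bit i to core i is an assumption; the text only says that each bit corresponds to one DSP core.

```python
def make_retvec(target_cores, n_cores):
    """Build an n-bit vector with one bit set per target core (bit i = core i, assumed)."""
    vec = 0
    for core in target_cores:
        if not 0 <= core < n_cores:
            raise ValueError(f"core {core} is outside 0..{n_cores - 1}")
        vec |= 1 << core
    return vec


def broadcast_retvec(n_cores):
    """All n bits set: read return data goes to every core."""
    return (1 << n_cores) - 1


def cores_selected(retvec, n_cores):
    """List the cores whose bit is set in the vector."""
    return [core for core in range(n_cores) if (retvec >> core) & 1]


print(hex(broadcast_retvec(12)))             # all-ones vector for 12 cores
print(cores_selected(make_retvec([0, 3], 12), 12))
```

An all-ones vector is the broadcast case; a sparser vector would select a subset of cores under the same encoding.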

In a specific embodiment, when TMODE is 2'b11 (that is, the DMA performs broadcast data transmission) and the broadcast transmission parameters have been configured, the DMA starts the broadcast data transmission and generates a broadcast read request according to the broadcast parameters SrcArrCnt, SrcEleCnt, DstArrCnt and DstEleCnt, which it sends to the network on chip. The read request includes a read address, a read mask, a read return address and the read return selection vector RetVec. RetVec has n bits in total, and in broadcast transmission all n bits are 1, indicating that the read return data is returned to all DSP cores. The out-of-core storage space returns data to the network on chip according to the read request, and all slave DMAs passively receive the data through the network on chip and write it into their cores.

On the basis of the point-to-point DMA transfer mode, the present embodiment configures 5 parameters: the transmission process is controlled by a transmission mode TMODE, a source frame number SrcArrCnt, a source frame residual unit number SrcEleCnt, a target frame number DstArrCnt and a target frame residual unit number DstEleCnt, a broadcast data transmission request is generated by a host DMA according to configured transmission parameters, meanwhile, counting statistics of moving data is carried out until the broadcast transmission is finished, a DMA broadcast transmission transaction of a DSP core can be started, and the same data block of an extra-core storage space of a GPDSP can be transmitted to all DSP cores on a chip in a broadcast mode.

In this embodiment, when it is determined that data transmission is completed, the method further includes a buffer clearing step, and the specific steps are as follows: and after the counting of the master DMA is finished, the master DMA sends out a buffer emptying command to all slave DMAs, each slave DMA receives the buffer emptying command and executes buffer emptying operation, and the transmission transaction is finished after the buffer emptying is finished.

In the specific embodiment of the present invention, DMA broadcast data transmission is implemented as shown in fig. 3. The chip has 12 DSP cores, each with its own DMA and LM, where LM is the in-core storage space (including a vector storage unit AM and a scalar storage unit SM). Array denotes a frame being moved; for C0-C11, each line of the data block is 8 words, i.e. 512 bits. The data block transmitted by broadcast is 4x96 words: the DMA starts a broadcast data transmission of 4 frames in total, each frame being 96 words. Data is read from the DDR in the direction indicated by the arrows in the figure: the DMA first moves one full DDR page and only then switches pages to move the next one. The DDR sends data to the network on chip according to the read request, and the slave DSP cores passively receive the data through the network. With the broadcast data transmission mode of this embodiment, the number of DDR page switches is therefore greatly reduced, transmission delay is reduced, the number of DDR accesses is effectively reduced, and the transmission bandwidth and DDR read efficiency are improved.
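The figures in this example can be checked with a few lines of arithmetic. The comparison at the end is only an illustration under a stated assumption: in point-to-point mode every one of the 12 cores is taken to read the whole block from the DDR once, with no sharing between reads; the text does not give measured numbers.

```python
WORDS_PER_LINE = 8      # from the example: each line is 8 words
BITS_PER_LINE = 512     # from the example: each line is 512 bits
FRAMES = 4              # frames in the broadcast transfer
WORDS_PER_FRAME = 96    # words per frame
N_CORES = 12            # DSP cores on the chip

bits_per_word = BITS_PER_LINE // WORDS_PER_LINE    # 64-bit words
total_words = FRAMES * WORDS_PER_FRAME             # 384 words in the block
total_bytes = total_words * bits_per_word // 8     # 3072 bytes
total_lines = total_words // WORDS_PER_LINE        # 48 lines of 512 bits

# Assumed comparison: words read from the DDR in each mode.
p2p_words_read = N_CORES * total_words             # every core reads the block
broadcast_words_read = total_words                 # the block is read once

print(bits_per_word, total_words, total_bytes, total_lines)
print(p2p_words_read // broadcast_words_read)      # reduction factor in DDR reads
```

Under that assumption the broadcast transfer reads the block from the DDR once instead of once per core, which is where the reduction in DDR accesses comes from.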

In the specific embodiment of the present invention, the DMA broadcast data transmission parameter word is shown in fig. 4, and comprises the transmission mode TMODE, the source frame number SrcArrCnt, the source frame residual unit number SrcEleCnt, the destination frame number DstArrCnt and the destination frame residual unit number DstEleCnt. TMODE is 2 bits wide; when its value is 2'b11, the DMA starts a broadcast data transmission that moves the same data block from the out-of-core storage space DDR to the 12 DSP cores. SrcArrCnt is the source frame number; it is 32 bits wide, so the maximum number of frames is 2^32. SrcEleCnt is the number of units remaining in the current source frame; it is 32 bits wide, with a maximum value of 2^32 - 1. DstArrCnt is the destination frame number; it is 32 bits wide, so the maximum number of frames is 2^32. DstEleCnt is the number of units remaining in the current destination frame, with a maximum value of 2^32 - 1.
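To make the field widths concrete, here is a hedged sketch that packs the five parameters into one integer and unpacks them again. The field order in `FIELDS`, the packing into a single word and the function names are all assumptions for illustration, since the layout of fig. 4 is not reproduced in the text; the 32-bit width of DstEleCnt is likewise inferred from its stated maximum of 2^32 - 1.

```python
TMODE_BROADCAST = 0b11  # the text gives 2'b11 as the broadcast mode

# (name, width in bits), listed from least to most significant; order is assumed.
FIELDS = [
    ("DstEleCnt", 32),  # width inferred from the stated maximum value
    ("DstArrCnt", 32),
    ("SrcEleCnt", 32),
    ("SrcArrCnt", 32),
    ("TMODE", 2),
]


def pack_params(values):
    """Pack the named fields into one integer, checking each against its width."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack_params(word):
    """Inverse of pack_params under the same assumed layout."""
    values, shift = {}, 0
    for name, width in FIELDS:
        values[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return values


params = {"TMODE": TMODE_BROADCAST, "SrcArrCnt": 3, "SrcEleCnt": 96,
          "DstArrCnt": 3, "DstEleCnt": 96}
word = pack_params(params)
print(unpack_params(word) == params)
```

Because the frame counts are stored as "number of frames minus one", a 32-bit field holding at most 2^32 - 1 corresponds to the maximum of 2^32 frames mentioned above.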

As shown in fig. 5, the specific steps for implementing DMA broadcast data transmission in the GPDSP in this embodiment are as follows:

S2, after the broadcast transmission parameters are configured, the host DMA starts the DMA broadcast data transmission and generates a broadcast read request according to the broadcast transmission parameters SrcArrCnt, SrcEleCnt, DstArrCnt and DstEleCnt, wherein the read request comprises a read address, a read mask, a read return address and the read return selection vector RetVec, and the generated broadcast read request is sent to the out-of-core storage space through the network on chip;

S3, the out-of-core storage space sends read return data to the network on chip according to the broadcast read request, each core in the GPDSP receives the read return data from the network on chip and writes it into its in-core storage space, and the host DMA receives the read return data and updates the broadcast transmission parameters to count the data transmitted to the cores;

In this embodiment, when the host DMA receives the read return data in step S3, the method further includes a data validity determination step, which includes the specific steps of: and judging whether the data is valid, if so, forwarding the data to the internal storage space of the core, starting the host DMA to count, and if not, directly starting the host DMA to count.

As shown in fig. 5, after the broadcast data transmission parameter word is configured, the host DMA starts the broadcast data transmission and sends a broadcast read request to the out-of-core storage space DDR. The DDR returns read return data to the network on chip; the host DMA receives the read return data from the network on chip, and all other cores passively receive it from the network on chip as well. If the host DMA finds the data valid, it writes the data into the in-core storage space and counts it; if the data is invalid, it only counts it. When the host finishes counting, it sends a buffer clearing command to the other cores; each slave sets the value of its broadcast end flag register to 1 and, after the clearing operation is complete, raises an interrupt if its interrupt enable bit is 1.
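The flow of fig. 5 can be walked through with a small software simulation. Everything here is a sketch under assumptions: the names `SlaveDMA` and `run_broadcast` are hypothetical, the per-unit `valid` flag stands in for whatever validity check the hardware performs, a single `remaining` counter stands in for the SrcArrCnt/SrcEleCnt pair, and "clearing the buffer" is modeled as draining buffered units into local memory.

```python
class SlaveDMA:
    """Hypothetical model of a slave DMA that passively receives broadcast data."""

    def __init__(self, interrupt_enable):
        self.buffer = []                # units received but not yet written
        self.local_mem = []             # in-core storage space
        self.bor = 0                    # broadcast end register
        self.interrupt_enable = interrupt_enable
        self.interrupt_raised = False

    def receive(self, unit):
        self.buffer.append(unit)

    def flush(self):
        """Buffer clearing command: drain the buffer, set BOR, maybe interrupt."""
        self.local_mem.extend(self.buffer)
        self.buffer.clear()
        self.bor = 1
        if self.interrupt_enable:
            self.interrupt_raised = True


def run_broadcast(ddr_block, n_slaves, valid=None, interrupt_enable=True):
    """Simulate one broadcast transaction; returns (host memory, slaves)."""
    slaves = [SlaveDMA(interrupt_enable) for _ in range(n_slaves)]
    host_mem = []
    remaining = len(ddr_block)          # host-side count of outstanding units
    for i, unit in enumerate(ddr_block):
        for slave in slaves:            # every slave sees the returned data
            slave.receive(unit)
        if valid is None or valid[i]:   # valid: host stores the unit as well
            host_mem.append(unit)
        remaining -= 1                  # counted whether valid or not
    if remaining == 0:                  # counting finished: flush all slaves
        for slave in slaves:
            slave.flush()
    return host_mem, slaves


host_mem, slaves = run_broadcast(list(range(384)), n_slaves=11)
print(len(host_mem), all(s.bor == 1 for s in slaves))
```

The point of the sketch is that only the host counts; the slaves never track transfer progress and learn that the broadcast has ended only through the buffer clearing command.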

The foregoing describes only preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the present invention has been described with reference to the preferred embodiments, it is not limited to them. Therefore, any simple modification, equivalent change or adaptation made to the above embodiments in accordance with the technical spirit of the present invention falls within the protection scope of the technical solution of the present invention, provided that it does not depart from the content of that technical solution.

Claims

1. A DMA broadcast data transmission method based on host counting in GPDSP is characterized in that the method comprises the following steps: starting DMA broadcast data transmission by a host DMA, generating a broadcast read request and then sending the broadcast read request to an off-core storage space through an on-chip network; the out-core storage space sends the read return data to the on-chip network according to the broadcast read request, each core in the GPDSP receives the read return data from the on-chip network and writes the read return data into the in-core storage space, and the host DMA receives the read return data and counts to confirm whether the data transmission is finished;

the confirming whether the transmission is completed specifically includes: the method comprises the steps of presetting broadcast transmission parameters including a source frame number SrcArrCnt, a source frame residual unit number SrcEleCnt, a destination frame number DstArrCnt and a destination frame residual unit number DstEleCnt, wherein the source frame number SrcArrCnt is used for configuring the frame number of data to be moved outside a core, the source frame residual unit number SrcEleCnt is used for counting the number of data units which are not read in a current source frame, the destination frame number DstArrCnt is used for configuring the number of data frames written into a storage space in the core, the destination frame residual unit number DstEleCnt is used for counting the number of data units which are not written in the current destination frame, generating a broadcast read request according to the broadcast transmission parameters, and confirming whether data transmission is finished or not according to the values of the broadcast transmission parameters.

2. The method for host count-based DMA broadcast data transmission in the GPDSP of claim 1, wherein the source frame number SrcArrCnt, the source frame residual unit number SrcEleCnt, the destination frame number DstArrCnt and the destination frame residual unit number DstEleCnt satisfy the following formula:

(SrcArrCnt+1)*SrcEleCnt == (DstArrCnt+1)*DstEleCnt;

3. The method of claim 1 or 2, further comprising configuring a transfer mode parameter TMODE for DMA transfer mode, and when the transfer mode parameter TMODE is valid, starting to perform DMA broadcast data transfer.

4. The method of claim 1 or 2, wherein the broadcast read request includes a read return selection vector RetVec for identifying data return core information, and the destination core required for returning the read return data is determined according to the read return selection vector RetVec.

5. The method of claim 4, wherein the read return selection vector RetVec has a plurality of bits, and each bit corresponds to a state identifying whether a participating core participating in the transfer needs to return read return data.

6. The method of claim 5, wherein the broadcast read request further comprises one or more of a read address, a read mask, and a read return address.

7. The method for host count-based DMA broadcast data transfer in a GPDSP according to claim 1 or 2, characterized in that when the completion of data transfer is confirmed, it further comprises a buffer flushing step, specifically comprising the steps of: and the master DMA sends a buffer clearing command to all the slave DMAs, and each slave DMA receives the buffer clearing command and executes buffer clearing operation to finish broadcast transmission.

8. The method for host count-based DMA broadcast data transmission in a GPDSP according to claim 1 or 2, characterized in that the method comprises the following specific steps:

9. The method for host count-based DMA broadcast data transmission in a GPDSP according to claim 8, wherein when the host DMA receives the read return data in step S3, the method further comprises a data validity determination step, specifically comprising: and judging whether the data is valid, if so, forwarding the data to the internal storage space of the core, starting the host DMA to count, and if not, directly starting the host DMA to count.