CN103049421A - Method and device for data transmission between central processing unit (CPU) and co-processors - Google Patents

Method and device for data transmission between central processing unit (CPU) and co-processors

Info

Publication number
CN103049421A
CN103049421A · CN2012105322924A · CN201210532292A
Authority
CN
China
Prior art keywords
coprocessor
data
cpu
data slicer
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105322924A
Other languages
Chinese (zh)
Other versions
CN103049421B (en
Inventor
欧阳剑
王勇
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210532292.4A priority Critical patent/CN103049421B/en
Publication of CN103049421A publication Critical patent/CN103049421A/en
Application granted granted Critical
Publication of CN103049421B publication Critical patent/CN103049421B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Advance Control (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides a method and a device for data transmission between a central processing unit (CPU) and co-processors. The method comprises controlling the data transmission of N co-processors in parallel according to N threads generated by the CPU, N being an integer no less than 2. Under this control, a co-processor either receives data sent by the CPU in slice form, or, while receiving and storing the data slice of the current moment sent by the CPU or by the preceding co-processor, sends the stored data slice of the previous moment to the next co-processor. The method and device make full use of the buses between the CPU and the co-processors and among the co-processors themselves, improving transmission efficiency both when the CPU sends data to multiple co-processors and when one co-processor sends data to the remaining co-processors.

Description

Method and device for data transmission between a CPU and coprocessors
[technical field]
The present invention relates to processor data transmission technology, and in particular to a method and device for data transmission between a CPU and coprocessors.
[background technology]
Coprocessors, typified by the GPU (graphics processing unit), now offer increasingly powerful computing capability. In many fields that require high-performance computing, calculation tasks are carried out by multiple coprocessors cooperating with a CPU. This process frequently requires data transmission between the CPU and the coprocessors and among the coprocessors themselves, and the efficiency of that transmission directly affects how efficiently the calculation task executes.
In existing methods, transmission efficiency is very low both when data is transferred from the CPU to multiple coprocessors and when data is broadcast from one coprocessor to the others:
When a block of data is transferred from the CPU to multiple coprocessors, the CPU typically transmits to the coprocessors one after another: only after the whole block has been transferred to one coprocessor does the CPU begin transferring it to the next. While the CPU is transmitting to one coprocessor, the buses of all the other coprocessors sit idle, so overall bus utilization is very low.
When one coprocessor must transmit a block of data to the remaining coprocessors, existing methods either first copy the data into CPU memory and then transfer it from CPU memory to each of the other coprocessors in turn, or use a transfer function provided by the coprocessor vendor to send the data to the remaining coprocessors one by one. As in the CPU-to-coprocessors case, both approaches leave the buses of the coprocessors not currently involved in the transfer idle, and overall bus utilization is very low.
These problems make data transmission between the CPU and coprocessors, and among coprocessors, very inefficient, which directly reduces the computing capability of the whole system. For example, in the training process of speech recognition, multiple GPUs cooperate with the CPU and each GPU must hold a copy of the training data; because the data-transfer overhead is so large, training with multiple GPUs can end up no faster than training with a single GPU.
[summary of the invention]
In view of this, the present invention provides a method and device for data transmission between a CPU and coprocessors, which improve transmission efficiency both when the CPU sends data to multiple coprocessors and when one coprocessor sends data to the remaining coprocessors.
The concrete technical scheme is as follows:
A method for data transmission between a CPU and coprocessors, the method comprising:
controlling the data transmission of N coprocessors in parallel according to N threads generated by the CPU, N being an integer greater than or equal to 2;
wherein the control comprises: a coprocessor receiving data sent by the CPU in data-slice form; or, while a coprocessor receives and stores the data slice of the current moment sent by the CPU or by the preceding coprocessor, sending the stored data slice of the previous moment to the next coprocessor.
According to a preferred embodiment of the present invention, when the method is used to transfer data from the CPU to N target coprocessors, the CPU sends the data in slice form to one of the target coprocessors and, through the corresponding thread, controls that target coprocessor so that, while it receives and stores the data slice of the current moment sent by the CPU, it sends the stored data slice of the previous moment to the next target coprocessor.
According to a preferred embodiment of the present invention, when the method is used to transfer data from the CPU to N target coprocessors, the CPU sends the data in slice form to all N target coprocessors and, through the corresponding threads, controls the N target coprocessors to receive and store the data slices sent by the CPU simultaneously.
According to a preferred embodiment of the present invention, when the method is used to transfer data from a source coprocessor to the other N-1 target coprocessors, the CPU, through the corresponding thread, controls the source coprocessor to send the data to the CPU in slice form; while the CPU receives and stores the data slice of the current moment sent by the source coprocessor, it sends the stored data slice of the previous moment to one of the target coprocessors and, through the corresponding thread, controls that target coprocessor so that, while it receives and stores the data slice of the current moment sent by the CPU, it sends the stored data slice of the previous moment to the next target coprocessor.
According to a preferred embodiment of the present invention, if said next target coprocessor is the last target coprocessor, the last target coprocessor is controlled through the corresponding thread to receive and store the arriving data slice; otherwise, said next target coprocessor is controlled through the corresponding thread so that, while receiving the data slice of the current moment sent by the preceding target coprocessor, it sends the stored data slice of the previous moment to the next target coprocessor, and so on until the last target coprocessor.
According to a preferred embodiment of the present invention, when the method is used to transfer data from a source coprocessor to the other N-1 target coprocessors, the CPU, through the corresponding thread, notifies the source coprocessor to send the data to the CPU in slice form; while the CPU receives and stores the data slice of the current moment sent by the source coprocessor, it sends the stored data slice of the previous moment to the N-1 target coprocessors and, through the corresponding threads, controls the N-1 target coprocessors to receive and store the data slices sent by the CPU simultaneously.
A device for data transmission between a CPU and coprocessors, the device being arranged in the CPU and comprising:
a thread control unit, configured to generate N threads;
a transmission control unit, configured to control the data transmission of N coprocessors in parallel according to the N threads, N being an integer greater than or equal to 2;
wherein the control comprises: a coprocessor receiving data sent by the CPU in data-slice form; or, while a coprocessor receives and stores the data slice of the current moment sent by the CPU or by the preceding coprocessor, sending the stored data slice of the previous moment to the next coprocessor.
According to a preferred embodiment of the present invention, when the device is used to transfer data from the CPU to N target coprocessors, the CPU sends the data in slice form to one of the target coprocessors, and the transmission control unit, through the corresponding thread, controls that target coprocessor so that, while it receives and stores the data slice of the current moment sent by the CPU, it sends the stored data slice of the previous moment to the next target coprocessor.
According to a preferred embodiment of the present invention, when the device is used to transfer data from the CPU to N target coprocessors, the CPU sends the data in slice form to all N target coprocessors, and the transmission control unit, through the corresponding threads, controls the N target coprocessors to receive and store the data slices sent by the CPU simultaneously.
According to a preferred embodiment of the present invention, when the device is used to transfer data from a source coprocessor to the other N-1 target coprocessors, the transmission control unit, through the corresponding thread, controls the source coprocessor to send the data to the CPU in slice form; while the CPU receives and stores the data slice of the current moment sent by the source coprocessor, it sends the stored data slice of the previous moment to one of the target coprocessors, and the transmission control unit, through the corresponding thread, controls that target coprocessor so that, while it receives and stores the data slice of the current moment sent by the CPU, it sends the stored data slice of the previous moment to the next target coprocessor.
According to a preferred embodiment of the present invention, if said next target coprocessor is the last target coprocessor, the transmission control unit controls the last target coprocessor through the corresponding thread to receive and store the arriving data slice; otherwise, the transmission control unit controls said next target coprocessor through the corresponding thread so that, while receiving the data slice of the current moment sent by the preceding target coprocessor, it sends the stored data slice of the previous moment to the next target coprocessor, and so on until the last target coprocessor.
According to a preferred embodiment of the present invention, when the device is used to transfer data from a source coprocessor to the other N-1 target coprocessors, the transmission control unit, through the corresponding thread, controls the source coprocessor to send the data to the CPU in slice form; while the CPU receives and stores the data slice of the current moment sent by the source coprocessor, it sends the stored data slice of the previous moment to the N-1 target coprocessors, and the transmission control unit, through the corresponding threads, controls the N-1 target coprocessors to receive and store the data slices sent by the CPU simultaneously.
As can be seen from the above technical scheme, the present invention generates multiple threads to control the coprocessors to transmit data in slice form, and each thread, in parallel, controls its corresponding coprocessor to perform the receive or send operation for the corresponding data slice. The invention thus makes full use of the buses between each coprocessor and the CPU, and of the buses among the coprocessors, significantly improving transmission efficiency both when the CPU sends data to multiple coprocessors and when one coprocessor sends data to the remaining coprocessors.
[description of drawings]
Fig. 1 is an illustration of method A, in which data is transferred from the CPU to multiple coprocessors, provided by embodiment one of the invention;
Fig. 2 is an illustration of method B, in which data is transferred from the CPU to multiple coprocessors, provided by embodiment one of the invention;
Fig. 3 is an illustration of method C, in which data is transferred from one coprocessor to multiple coprocessors, provided by embodiment one of the invention;
Fig. 4 is an illustration of method D, in which data is transferred from one coprocessor to multiple coprocessors, provided by embodiment one of the invention;
Fig. 5 is a schematic diagram of the data transmission device between a CPU and coprocessors provided by embodiment two of the invention.
[embodiment]
To make the purpose, technical scheme and advantages of the present invention clearer, the invention is described below with reference to the drawings and specific embodiments.
In existing methods, when data is transferred from the CPU to multiple coprocessors, or from one coprocessor to the remaining coprocessors, each transfer can take place only between the CPU and one coprocessor, or between two coprocessors, while the buses of all the other coprocessors sit idle. If a method allowed transmission to proceed simultaneously between the CPU and multiple coprocessors, or among multiple coprocessors, transmission efficiency would improve markedly. The present invention has the CPU generate multiple threads that control multiple coprocessors to transmit data in slice form, making full use of the bus bandwidth of each processor and thereby improving transmission efficiency.
Embodiment one
Embodiment one of the invention provides a method for data transmission between a CPU and coprocessors. The method covers: transferring data from the CPU to multiple coprocessors; and transferring data from one coprocessor to multiple coprocessors.
The method improves transmission efficiency both when the CPU sends data to multiple coprocessors and when one coprocessor sends data to the remaining coprocessors; the two cases are described in turn below.
1. Transferring data from the CPU to multiple coprocessors. This can be realized in two ways, method A and method B:
Method A: the CPU generates N threads that respectively control N coprocessors, where N is the number of coprocessors. The CPU sends the data slices one by one to coprocessor 1; thread 1 controls coprocessor 1 so that, while it receives and stores a data slice sent by the CPU, it forwards the slice it has already stored to coprocessor 2; thread 2 controls coprocessor 2 so that, while it receives and stores a data slice sent by coprocessor 1, it forwards the slice it has already stored to coprocessor 3; and so on until the data reaches all coprocessors.
To better understand how method A transfers data from the CPU to multiple coprocessors, method A is described below with reference to the example shown in Fig. 1. As shown in Fig. 1, the CPU needs to transmit a block of data to 4 coprocessors. The CPU generates 4 threads to control the 4 coprocessors; for ease of description the threads are numbered thread 1 through thread 4, controlling coprocessor 1 through coprocessor 4 respectively. When transmission begins, the CPU sends each slice of the data in turn from CPU memory to coprocessor 1. Thread 1 controls coprocessor 1 to receive each slice sent by the CPU and store it in coprocessor 1's memory while forwarding the slices it has already stored, one by one, to coprocessor 2. Thread 2 controls coprocessor 2 to receive the slices sent by coprocessor 1 and store them in coprocessor 2's memory while forwarding its stored slices to coprocessor 3; likewise, thread 3 controls coprocessor 3 to receive the slices sent by coprocessor 2 and store them in coprocessor 3's memory while forwarding its stored slices to coprocessor 4; and thread 4 controls coprocessor 4 to receive the slices sent by coprocessor 3 and store them in coprocessor 4's memory.
This method makes full use of the buses between the coprocessors: in each transfer step, while the CPU is sending a data slice to one coprocessor, data slices are also being transferred between the remaining coprocessors. For example, as shown in Fig. 1, at some moment during transmission the following transfers happen simultaneously: the CPU sends slice Slice_x to coprocessor 1; coprocessor 1 sends its stored slice Slice_x-1 to coprocessor 2; coprocessor 2 receives and stores Slice_x-1 from coprocessor 1 while sending its stored slice Slice_x-2 to coprocessor 3; coprocessor 3 receives and stores Slice_x-2 from coprocessor 2 while sending its stored slice Slice_x-3 to coprocessor 4; and coprocessor 4 receives and stores Slice_x-3 from coprocessor 3. Threads 1 through 4 each control the receive or send operation of their corresponding coprocessor and run in parallel, so all the coprocessors can transmit at the same time. Compared with the prior art, the method makes full use of the inter-coprocessor buses throughout the transmission and achieves higher transmission efficiency.
The above describes method A, the transfer of data from the CPU to multiple coprocessors, with reference to Fig. 1.
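As a rough illustration only (not part of the patent's disclosure), the pipelined relay of method A can be simulated in Python, with one thread per coprocessor and queues standing in for the buses; the function and variable names here are invented for the sketch:

```python
import queue
import threading

def pipeline_broadcast(data, slice_size, num_coprocs):
    """Simulate method A: the CPU feeds slices to coprocessor 1, and each
    coprocessor stores a received slice while relaying it to the next one."""
    # Cut the data into fixed-size slices (the patent suggests 4 KB pages).
    slices = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]

    links = [queue.Queue() for _ in range(num_coprocs)]  # bus into each coprocessor
    memories = [[] for _ in range(num_coprocs)]          # each coprocessor's memory

    def coproc(idx):
        # Thread idx controls coprocessor idx: receive a slice, store it,
        # and forward it down the chain; None marks end of data.
        while True:
            s = links[idx].get()
            if s is None:
                if idx + 1 < num_coprocs:
                    links[idx + 1].put(None)
                return
            memories[idx].append(s)
            if idx + 1 < num_coprocs:
                links[idx + 1].put(s)

    threads = [threading.Thread(target=coproc, args=(i,)) for i in range(num_coprocs)]
    for t in threads:
        t.start()
    for s in slices:              # the CPU sends slice after slice to coprocessor 1
        links[0].put(s)
    links[0].put(None)
    for t in threads:
        t.join()
    return [b"".join(m) for m in memories]
```

Because each queue decouples its producer and consumer, slice x can enter coprocessor 1 while slice x-1 moves from coprocessor 1 to coprocessor 2, matching the overlap the patent describes.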
Method B: the CPU generates N threads that respectively control N coprocessors, where N is the number of coprocessors. The CPU sends the data slices in turn to all N coprocessors at once, and each thread controls its corresponding coprocessor to receive and store the slices sent by the CPU.
To better understand how method B transfers data from the CPU to multiple coprocessors, method B is described below with reference to the example shown in Fig. 2. As shown in Fig. 2, the CPU needs to transmit a block of data to 4 coprocessors. The CPU generates 4 threads to control the 4 coprocessors; for ease of description the threads are numbered thread 1 through thread 4, controlling coprocessor 1 through coprocessor 4 respectively. When transmission begins, the CPU sends each slice of the data in turn from CPU memory to the 4 coprocessors simultaneously, and threads 1 through 4 each control their corresponding coprocessor to receive the slices sent by the CPU.
This method makes full use of the buses between the CPU and each coprocessor: in each transfer step the CPU sends one data slice to all coprocessors at the same time. For example, as shown in Fig. 2, at some moment during transmission the CPU sends slice Slice_x from CPU memory to the 4 coprocessors simultaneously, while threads 1 through 4 control their corresponding coprocessors to receive it. The 4 threads run in parallel, so all the coprocessors receive at the same time. Compared with the prior art, the method makes full use of the CPU-coprocessor buses throughout the transmission and achieves higher transmission efficiency.
The above describes method B, the transfer of data from the CPU to multiple coprocessors, with reference to Fig. 2.
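Method B's simultaneous broadcast can be sketched the same way, purely as an illustration with invented names; the CPU's main loop puts each slice on every coprocessor's bus in the same step:

```python
import queue
import threading

def parallel_broadcast(data, slice_size, num_coprocs):
    """Simulate method B: the CPU pushes every slice to all N coprocessors
    at once, and thread i only receives on behalf of coprocessor i."""
    slices = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    links = [queue.Queue() for _ in range(num_coprocs)]  # CPU-to-coprocessor buses
    memories = [[] for _ in range(num_coprocs)]

    def coproc(idx):
        # Thread idx controls coprocessor idx: receive and store each slice.
        while True:
            s = links[idx].get()
            if s is None:
                return
            memories[idx].append(s)

    threads = [threading.Thread(target=coproc, args=(i,)) for i in range(num_coprocs)]
    for t in threads:
        t.start()
    for s in slices:
        for q in links:           # one slice goes out on every bus in the same step
            q.put(s)
    for q in links:
        q.put(None)               # end-of-data marker
    for t in threads:
        t.join()
    return [b"".join(m) for m in memories]
```

The contrast with method A is visible in the main loop: here the CPU drives every bus directly, which is why the scheme depends on CPU memory bandwidth rather than on the inter-coprocessor links.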
Both method A and method B transfer data from the CPU to multiple coprocessors in slice form. A data slice in both methods is a data block of preset size: in each step, the thread-controlled transfer between the CPU and a coprocessor, or between two coprocessors, moves one such block. The slice size can be set according to actual requirements; if the slice is too large, transmission latency becomes excessive, and if it is too small, efficiency suffers. As a preferred implementation, the invention sets the data slice size to one page, i.e. 4 KB.
In methods A and B, the CPU learns the state of each coprocessor through its thread and has the thread control the coprocessor to complete the corresponding operation. A self-defined data structure in CPU memory can record each coprocessor's state (idle, receiving data, or sending data) and the transfer status of the data slices (which slices have been sent, which are still pending), so that the CPU can control and schedule the threads; for example, when two coprocessors are both idle, the CPU can control them through their threads to transfer a data slice. This part is prior art and is not elaborated here.
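The "self-defined data structure" mentioned above might, as one hypothetical sketch with illustrative field names, look like a small per-coprocessor record tracking the device state and the sent/pending slice sets:

```python
from dataclasses import dataclass, field

@dataclass
class CoprocState:
    """Hypothetical per-coprocessor bookkeeping record kept in CPU memory:
    the device's current state plus which slices have been sent and which
    are still pending. All names here are illustrative, not the patent's."""
    state: str = "idle"                      # "idle", "receiving" or "sending"
    sent: set = field(default_factory=set)   # slice ids already delivered
    pending: set = field(default_factory=set)  # slice ids still to deliver

    def mark_sent(self, slice_id):
        # Move a slice from pending to sent once its transfer completes.
        self.pending.discard(slice_id)
        self.sent.add(slice_id)

    def is_idle(self):
        return self.state == "idle"
```

A scheduler could then pair up two coprocessors for a slice transfer whenever `is_idle()` holds for both, as the example in the text suggests.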
As between methods A and B: because the bus bandwidth between a coprocessor and the CPU is usually higher than the bus bandwidth between coprocessors, method B achieves higher transfer efficiency than method A in practice. However, method B is limited by CPU memory bandwidth, so it suits situations where CPU memory bandwidth is high, such as multi-CPU systems; when CPU memory bandwidth is insufficient, method A is the better choice.
2. Transferring data from one coprocessor to multiple coprocessors. This can be realized in two ways, method C and method D:
Method C: the CPU generates N threads that respectively control N coprocessors, where N is the number of coprocessors. Thread 1 controls coprocessor 1 to send the data slices one by one to the CPU; the CPU forwards each received slice in turn to coprocessor 2; thread 2 controls coprocessor 2 so that, while it receives and stores a slice sent by the CPU, it forwards its stored slice to coprocessor 3; thread 3 controls coprocessor 3 so that, while it receives and stores a slice sent by coprocessor 2, it forwards its stored slice to coprocessor 4; and so on until the data reaches all target coprocessors.
To better understand how method C transfers data from one coprocessor to multiple coprocessors, method C is described below with reference to the example shown in Fig. 3. As shown in Fig. 3, coprocessor 1 needs to transmit a block of data to the other 3 coprocessors. The CPU generates 4 threads to control the 4 coprocessors; for ease of description the threads are numbered thread 1 through thread 4, controlling coprocessor 1 through coprocessor 4 respectively. When transmission begins, thread 1 controls coprocessor 1 to send the data, slice by slice, to the CPU; the CPU receives and stores each slice sent by coprocessor 1 while forwarding the slices it has already stored, one by one, to coprocessor 2. Thread 2 controls coprocessor 2 to receive the slices sent by the CPU and store them in coprocessor 2's memory while forwarding its stored slices to coprocessor 3; thread 3 controls coprocessor 3 to receive the slices sent by coprocessor 2 and store them in coprocessor 3's memory while forwarding its stored slices to coprocessor 4; and thread 4 controls coprocessor 4 to receive the slices sent by coprocessor 3 and store them in coprocessor 4's memory.
This method makes full use of the buses between the coprocessors: in each transfer step, while one coprocessor sends a data slice to the CPU and another receives a slice from the CPU, data slices are also being transferred between the remaining coprocessors. For example, as shown in Fig. 3, at some moment during transmission the following transfers happen simultaneously: coprocessor 1 sends slice Slice_x to the CPU; the CPU receives and stores Slice_x while sending its stored slice Slice_x-1 to coprocessor 2; coprocessor 2 receives and stores Slice_x-1 from the CPU while sending its stored slice Slice_x-2 to coprocessor 3; coprocessor 3 receives and stores Slice_x-2 from coprocessor 2 while sending its stored slice Slice_x-3 to coprocessor 4; and coprocessor 4 receives and stores Slice_x-3 from coprocessor 3. Threads 1 through 4 each control the receive and send operations of their corresponding coprocessor and run in parallel, so all the coprocessors can transmit at the same time. Compared with the prior art, the method makes full use of the inter-coprocessor buses throughout the transmission and achieves higher transmission efficiency.
The above describes method C, the transfer of data from one coprocessor to multiple coprocessors, with reference to Fig. 3.
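Method C can likewise be simulated, again purely as an illustration with invented names: the source coprocessor streams slices to the CPU, which stores and forwards them into the same kind of relay chain as method A:

```python
import queue
import threading

def relay_via_cpu(data, slice_size, num_targets):
    """Simulate method C: coprocessor 1 streams slices to the CPU; the CPU
    forwards each stored slice to target 1, which relays to target 2, etc."""
    slices = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    # links[0] is the source-to-CPU bus; links[i + 1] feeds target i.
    links = [queue.Queue() for _ in range(num_targets + 1)]
    memories = [[] for _ in range(num_targets)]

    def relay(idx):
        # idx 0 plays the CPU (store-and-forward only); idx >= 1 is a target
        # coprocessor that also keeps a copy of every slice in its memory.
        while True:
            s = links[idx].get()
            if s is None:
                if idx < num_targets:
                    links[idx + 1].put(None)
                return
            if idx >= 1:
                memories[idx - 1].append(s)
            if idx < num_targets:
                links[idx + 1].put(s)

    threads = [threading.Thread(target=relay, args=(i,)) for i in range(num_targets + 1)]
    for t in threads:
        t.start()
    for s in slices:              # the source coprocessor sends slice after slice
        links[0].put(s)
    links[0].put(None)
    for t in threads:
        t.join()
    return [b"".join(m) for m in memories]
```

The CPU node here behaves exactly like a chain link that stores nothing permanently, which is how the patent distinguishes its role from the target coprocessors.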
Method D: the CPU generates N threads that respectively control N coprocessors, where N is the number of coprocessors. Thread 1 controls coprocessor 1 to send the data slices one by one to the CPU; the CPU forwards each received slice in turn to all the remaining coprocessors, and threads 2 through N each control their corresponding coprocessor to receive the slices sent by the CPU.
To better understand how method D transfers data from one coprocessor to multiple coprocessors, method D is described below with reference to the example shown in Fig. 4. As shown in Fig. 4, coprocessor 1 needs to transmit a block of data to the other 3 coprocessors. The CPU generates 4 threads to control the 4 coprocessors; for ease of description the threads are numbered thread 1 through thread 4, controlling coprocessor 1 through coprocessor 4 respectively. When transmission begins, thread 1 controls coprocessor 1 to send the data, slice by slice, to the CPU; the CPU receives and stores each slice sent by coprocessor 1 while forwarding the slices it has already stored, one by one, to the other 3 coprocessors simultaneously, and threads 2 through 4 each control their corresponding coprocessor to receive and store the slices sent by the CPU.
This method makes full use of the buses between the CPU and each coprocessor: in each transfer step, while one coprocessor sends a data slice to the CPU, the CPU simultaneously sends a previously received slice to the remaining coprocessors. For example, at some moment during transmission, coprocessor 1 sends slice Slice_x to the CPU while the CPU sends the previously received slice Slice_x-1 to the remaining coprocessors, and threads 2 through 4 control their corresponding coprocessors to receive Slice_x-1. Threads 1 through 4 run in parallel, so all the coprocessors can transmit at the same time. Compared with the prior art, the method makes full use of the CPU-coprocessor buses throughout the transmission and achieves higher transmission efficiency.
The above describes method D, the transfer of data from one coprocessor to multiple coprocessors, with reference to Fig. 4.
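Method D's gather-then-broadcast pattern can be sketched the same way, as an illustration only with invented names: one queue carries slices from the source to the CPU, and a CPU thread fans each slice out to every target bus:

```python
import queue
import threading

def gather_then_broadcast(data, slice_size, num_targets):
    """Simulate method D: the source coprocessor streams slices to the CPU,
    and while slice x arrives the CPU forwards the stored slice x-1 to all
    remaining coprocessors in parallel."""
    slices = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    to_cpu = queue.Queue()                                 # source-to-CPU bus
    to_targets = [queue.Queue() for _ in range(num_targets)]  # CPU-to-target buses
    memories = [[] for _ in range(num_targets)]

    def cpu():
        # Receive from the source and fan each slice out to every target bus;
        # the None end-of-data marker is broadcast too.
        while True:
            s = to_cpu.get()
            for q in to_targets:
                q.put(s)
            if s is None:
                return

    def target(idx):
        while True:
            s = to_targets[idx].get()
            if s is None:
                return
            memories[idx].append(s)

    threads = [threading.Thread(target=cpu)]
    threads += [threading.Thread(target=target, args=(i,)) for i in range(num_targets)]
    for t in threads:
        t.start()
    for s in slices:              # thread 1: the source sends slice after slice
        to_cpu.put(s)
    to_cpu.put(None)
    for t in threads:
        t.join()
    return [b"".join(m) for m in memories]
```

As with method B, the fan-out loop in the CPU thread is the bandwidth-critical step, which is why the text below recommends method D only when CPU memory bandwidth is ample.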
Said method C and method D are data are transferred to a plurality of coprocessors by a coprocessor transmission method, two kinds of methods are all transmitted data in the mode of section, data slicer in two kinds of methods is the data block that presets size, be in the transmission course, the data volume of transmitting between the each coprocessor of Thread control and CPU or the coprocessor is a certain size data block, the size of section can be set according to the actual requirements, if but it is excessive to cut into slices, then transmission delay is excessive, too small if cut into slices, then efficient is lower, the invention provides a kind of preferred implementation data are cut into slices: the data slicer size is set as page, i.e. a 4KB.
In methods C and D, the CPU learns the state of each coprocessor through its thread and controls the coprocessors through the threads to complete the corresponding operations. A custom data structure in CPU memory can record the state of each coprocessor (e.g. idle, receiving data, or sending data) and the progress of the slice transfer (which data slices have been sent and which are still pending), so that the CPU can control and schedule the threads. For example, when two coprocessors are both idle, the CPU can control them through their threads to transfer a data slice. This part is prior art and is not elaborated here.
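The text leaves this custom data structure unspecified beyond its contents. One minimal sketch, with illustrative field and method names, might be:

```python
# Hypothetical per-coprocessor record kept in CPU memory: the current
# state (idle / receiving / sending) and which slice indices have
# already been delivered to that coprocessor.
from dataclasses import dataclass, field

@dataclass
class CoprocState:
    state: str = "idle"            # "idle" | "receiving" | "sending"
    sent_slices: set[int] = field(default_factory=set)

    def next_pending(self, total_slices: int):
        """Index of the first slice not yet sent, or None when done."""
        for i in range(total_slices):
            if i not in self.sent_slices:
                return i
        return None
```

A scheduler could poll `next_pending` for each idle coprocessor to decide which slice its thread should transfer next.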
For methods C and D: because the bus bandwidth between a coprocessor and the CPU is usually higher than the bus bandwidth between two coprocessors, method D achieves higher transfer efficiency than method C in practice. However, method D is limited by CPU memory bandwidth, so it suits systems with high CPU memory bandwidth, such as multi-CPU systems; when CPU memory bandwidth is insufficient, method C is the better choice.
The above describes the data transmission methods between a CPU and coprocessors provided by embodiment one of the present invention. As can be seen, the present invention generates multiple threads to control the coprocessors to transmit data in slices, and each thread can, in parallel, control its corresponding coprocessor to receive or send data slices. This makes full use of the bus between each coprocessor and the CPU, as well as the buses between the coprocessors, and can significantly improve data transmission efficiency both when the CPU sends data to a plurality of coprocessors and when one coprocessor sends data to the remaining coprocessors. The present invention is applicable to data transmission between a CPU and GPUs, as well as various GPU-like coprocessors such as ARM FPGAs and the Intel MIC (many-core processor).
Embodiment two
Fig. 5 is a schematic diagram of the data transmission device between a CPU and coprocessors provided by embodiment two of the present invention. The device is arranged in the CPU and, as shown in Fig. 5, comprises: a thread control unit 10 and a transmission control unit 20.
The thread control unit 10 is used to generate N threads;
The transmission control unit 20 is used to control, in parallel, the data transmission of N coprocessors according to the N threads, where N is an integer greater than or equal to 2.
Specifically, the transmission control unit 20 can be used to control a coprocessor to receive, in slice form, data sent by the CPU; or to control a coprocessor, while it receives and stores the data slice of the current moment sent by the CPU or by the preceding coprocessor, to send the data slice stored at the previous moment to the next coprocessor.
The transmission control unit 20 can also be used to control a coprocessor to send data to the CPU in data slice form.
The device provided by the present invention can improve data transmission efficiency both when the CPU sends data to a plurality of coprocessors and when one coprocessor sends data to the remaining coprocessors. The two cases are described below.
1. When data is transferred from the CPU to a plurality of coprocessors, the transmission control unit 20 can perform the following operation A or operation B:
Operation A: the N coprocessors are controlled respectively according to the N threads generated by the thread control unit 10, where N is the number of coprocessors. Data slices are sent in turn from the CPU to coprocessor 1 (the transfer out of the CPU itself can be completed by an existing data transmission unit of the CPU, here and in the following description; this data transmission unit is not shown in the figures). The transmission control unit 20, according to thread 1, controls coprocessor 1 to receive and store the data slices sent by the CPU while simultaneously sending the stored data slices in turn to coprocessor 2; according to thread 2, it controls coprocessor 2 to receive and store the data slices sent by coprocessor 1 while simultaneously sending the stored data slices in turn to coprocessor 3; and so on, until the data reaches all coprocessors.
For example, suppose the CPU needs to transmit a block of data to 4 coprocessors. The thread control unit 10 generates 4 threads by which the CPU controls the 4 coprocessors; for ease of description, the 4 threads are numbered thread 1 to thread 4 and control coprocessor 1 to coprocessor 4 respectively. When the transmission starts, the CPU sends each slice of the data in turn from CPU memory to coprocessor 1. The transmission control unit 20, according to thread 1, controls coprocessor 1 to receive the data slices sent by the CPU and store them in the memory of coprocessor 1 while sending the stored data slices in turn to coprocessor 2; according to thread 2, it controls coprocessor 2 to receive the data slices sent by coprocessor 1 and store them in the memory of coprocessor 2 while sending the stored data slices in turn to coprocessor 3; by analogy, according to thread 3, it controls coprocessor 3 to receive the data slices sent by coprocessor 2 and store them in the memory of coprocessor 3 while sending the stored data slices in turn to coprocessor 4; and according to thread 4, it controls coprocessor 4 to receive the data slices sent by coprocessor 3 and store them in the memory of coprocessor 4.
This operation makes full use of the buses between the coprocessors: in each transfer step, while the CPU sends a data slice to one coprocessor, data slices can also be transferred between the remaining coprocessors.
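Operation A can be sketched as a daisy-chain simulation under stated assumptions: queues stand in for the inter-coprocessor buses, each thread models one coprocessor that stores an arriving slice and forwards it down the chain. The function name `daisy_chain` is illustrative.

```python
# Illustrative model of operation A: the CPU feeds slices only to
# coprocessor 1; each coprocessor i stores every slice it receives and
# simultaneously forwards it to coprocessor i+1 (a pipelined chain).
import queue
import threading

def daisy_chain(slices: list[bytes], n: int) -> list[bytes]:
    links = [queue.Queue() for _ in range(n)]   # links[i] feeds coprocessor i+1
    stores = [bytearray() for _ in range(n)]    # each coprocessor's memory

    def coproc(i):                              # thread i controls coprocessor i+1
        while (s := links[i].get()) is not None:
            stores[i].extend(s)                 # save to local memory...
            if i + 1 < n:
                links[i + 1].put(s)             # ...and relay down the chain
        if i + 1 < n:
            links[i + 1].put(None)              # propagate end-of-stream marker

    threads = [threading.Thread(target=coproc, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for s in slices:                            # CPU sends only to coprocessor 1
        links[0].put(s)
    links[0].put(None)
    for t in threads:
        t.join()
    return [bytes(b) for b in stores]
```

Because each slice moves one hop per step, the last coprocessor finishes roughly N-1 slice-times after the first, while the CPU-side bus carries each slice only once.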
Operation B: the N coprocessors are controlled respectively according to the N threads generated by the thread control unit 10, where N is the number of coprocessors. Data slices are sent in turn from the CPU to all N coprocessors, and the transmission control unit 20 controls each of the N coprocessors, according to its corresponding thread, to receive and store the data slices sent by the CPU.
For example, suppose the CPU needs to transmit a block of data to 4 coprocessors. The thread control unit 10 generates 4 threads by which the CPU controls the 4 coprocessors; for ease of description, the 4 threads are numbered thread 1 to thread 4 and control coprocessor 1 to coprocessor 4 respectively. When the transmission starts, the CPU sends each slice of the data in turn from CPU memory to the 4 coprocessors simultaneously, and the transmission control unit 20, according to thread 1 to thread 4, controls each corresponding coprocessor to receive the data slices sent by the CPU.
This operation makes full use of the bus between each coprocessor and the CPU: in each transfer step, the CPU can send a data slice to all coprocessors simultaneously.
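Operation B can likewise be sketched as a broadcast simulation; queues again stand in for the per-coprocessor buses, and the name `broadcast` is an assumption of the sketch.

```python
# Illustrative model of operation B: the CPU pushes each slice onto
# every coprocessor's bus at once, and the N per-coprocessor threads
# drain their own queues in parallel.
import queue
import threading

def broadcast(slices: list[bytes], n: int) -> list[bytes]:
    buses = [queue.Queue() for _ in range(n)]   # one CPU<->coprocessor bus each
    stores = [bytearray() for _ in range(n)]

    def receiver(i):                            # thread i controls coprocessor i+1
        while (s := buses[i].get()) is not None:
            stores[i].extend(s)

    threads = [threading.Thread(target=receiver, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for s in slices:                            # one send step reaches all buses
        for q in buses:
            q.put(s)
    for q in buses:
        q.put(None)
    for t in threads:
        t.join()
    return [bytes(b) for b in stores]
```

Note that every slice is read from CPU memory N times here, which is why the text says this operation is bounded by CPU memory bandwidth.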
The data slices in operations A and B are data blocks of a preset size; that is, in each transfer step, the amount of data that a thread moves between the CPU and a coprocessor, or between two coprocessors, is a block of that size. The slice size can be set as required: if the slice is too large, the transmission delay becomes excessive; if it is too small, efficiency drops. The present invention provides a preferred implementation in which the data slice size is set to one memory page, i.e. 4 KB.
In operations A and B, the CPU learns the state of each coprocessor through its thread and controls the coprocessors through the threads to complete the corresponding operations. A custom data structure in CPU memory can record the state of each coprocessor (e.g. idle, receiving data, or sending data) and the progress of the slice transfer (which data slices have been sent and which are still pending), so that the CPU can control and schedule the threads. For example, when two coprocessors are both idle, the CPU can control them through their threads to transfer a data slice. This part is prior art and is not elaborated here.
Because the bus bandwidth between a coprocessor and the CPU is usually higher than the bus bandwidth between two coprocessors, operation B achieves higher transfer efficiency than operation A in practice. However, operation B is limited by CPU memory bandwidth, so it suits systems with high CPU memory bandwidth, such as multi-CPU systems; when CPU memory bandwidth is insufficient, operation A is the better choice.
2. When data is transferred from one coprocessor to a plurality of coprocessors, the transmission control unit 20 can perform the following operation C or operation D:
Operation C: the N coprocessors are controlled respectively according to the N threads generated by the thread control unit 10, where N is the number of coprocessors. The transmission control unit 20, according to thread 1, controls coprocessor 1 to send data slices in turn to the CPU; the CPU sends the received data slices in turn to coprocessor 2. The transmission control unit 20, according to thread 2, controls coprocessor 2 to receive and store the data slices sent by the CPU while simultaneously sending the stored data slices in turn to coprocessor 3; according to thread 3, it controls coprocessor 3 to receive and store the data slices sent by coprocessor 2 while simultaneously sending the stored data slices in turn to coprocessor 4; and so on, until the data reaches all target coprocessors.
For example, suppose coprocessor 1 needs to transmit a block of data to the other 3 coprocessors. The thread control unit 10 generates 4 threads by which the CPU controls the 4 coprocessors; for ease of description, the 4 threads are numbered thread 1 to thread 4 and control coprocessor 1 to coprocessor 4 respectively. When the transmission starts, the transmission control unit 20, according to thread 1, controls coprocessor 1 to send the data to the CPU slice by slice; the CPU receives and stores each data slice sent by coprocessor 1 and, at the same time, sends the stored data slices in turn to coprocessor 2. The transmission control unit 20, according to thread 2, controls coprocessor 2 to receive the data slices sent by the CPU and store them in the memory of coprocessor 2 while sending the stored data slices in turn to coprocessor 3; according to thread 3, it controls coprocessor 3 to receive the data slices sent by coprocessor 2 and store them in the memory of coprocessor 3 while sending the stored data slices in turn to coprocessor 4; and according to thread 4, it controls coprocessor 4 to receive the data slices sent by coprocessor 3 and store them in the memory of coprocessor 4.
This operation makes full use of the buses between the coprocessors: in each transfer step, while one coprocessor sends a data slice to the CPU and another coprocessor receives a data slice from the CPU, data slices can also be transferred between the remaining coprocessors.
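Operation C combines the two previous patterns, and can be sketched as follows under the same assumptions (queues as buses, illustrative names): the source coprocessor streams slices to the CPU, the CPU forwards each slice to the first target, and the targets relay down the chain.

```python
# Illustrative model of operation C: coprocessor 1 -> CPU -> coprocessor 2,
# then a daisy chain through coprocessors 3..N, all stages pipelined.
import queue
import threading

def relay_via_cpu(slices: list[bytes], n: int) -> list[bytes]:
    # hops[0]: coprocessor 1 -> CPU; hops[i>=1]: previous stage -> coprocessor i+1
    hops = [queue.Queue() for _ in range(n)]
    stores = [bytearray() for _ in range(n - 1)]  # memories of targets 2..N

    def source():                                 # thread 1: coprocessor 1 sends
        for s in slices:
            hops[0].put(s)
        hops[0].put(None)

    def cpu():                                    # CPU stores, then forwards to cp 2
        while (s := hops[0].get()) is not None:
            hops[1].put(s)
        hops[1].put(None)

    def target(i):                                # threads 2..N: store and relay
        while (s := hops[i].get()) is not None:
            stores[i - 1].extend(s)
            if i + 1 < n:
                hops[i + 1].put(s)
        if i + 1 < n:
            hops[i + 1].put(None)

    threads = [threading.Thread(target=source), threading.Thread(target=cpu)]
    threads += [threading.Thread(target=target, args=(i,)) for i in range(1, n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [bytes(b) for b in stores]
```

With `relay_via_cpu([...], 4)` all three targets end up with the full data, and each bus (source-to-CPU, CPU-to-target, target-to-target) carries every slice exactly once.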
Operation D: the N coprocessors are controlled respectively according to the N threads generated by the thread control unit 10, where N is the number of coprocessors. The transmission control unit 20, according to thread 1, controls coprocessor 1 to send data slices in turn to the CPU; the CPU sends the received data slices in turn to the remaining coprocessors, and the transmission control unit 20, according to thread 2 to thread N, controls each corresponding coprocessor to receive the data slices sent by the CPU.
For example, suppose coprocessor 1 needs to transmit a block of data to the other 3 coprocessors. The thread control unit 10 generates 4 threads by which the CPU controls the 4 coprocessors; for ease of description, the 4 threads are numbered thread 1 to thread 4 and control coprocessor 1 to coprocessor 4 respectively. When the transmission starts, the transmission control unit 20, according to thread 1, controls coprocessor 1 to send the data to the CPU slice by slice; the CPU receives and stores each data slice sent by coprocessor 1 and, at the same time, sends the stored data slices in turn to the other 3 coprocessors, while the transmission control unit 20, according to thread 2 to thread 4, controls each corresponding coprocessor to receive and store the data slices sent by the CPU.
This operation makes full use of the bus between each coprocessor and the CPU: in each transfer step, one coprocessor sends a data slice to the CPU while the CPU simultaneously sends a previously received data slice to the remaining coprocessors.
The data slices in operations C and D are data blocks of a preset size; that is, in each transfer step, the amount of data that a thread moves between a coprocessor and the CPU, or between two coprocessors, is a block of that size. The slice size can be set as required: if the slice is too large, the transmission delay becomes excessive; if it is too small, efficiency drops. The present invention provides a preferred implementation in which the data slice size is set to one memory page, i.e. 4 KB.
In operations C and D, the CPU learns the state of each coprocessor through its thread and controls the coprocessors through the threads to complete the corresponding operations. A custom data structure in CPU memory can record the state of each coprocessor (e.g. idle, receiving data, or sending data) and the progress of the slice transfer (which data slices have been sent and which are still pending), so that the CPU can control and schedule the threads. For example, when two coprocessors are both idle, the CPU can control them through their threads to transfer a data slice. This part is prior art and is not elaborated here.
In practice, operation D achieves higher transfer efficiency than operation C. However, operation D is limited by CPU memory bandwidth, so it suits systems with high CPU memory bandwidth, such as multi-CPU systems; when CPU memory bandwidth is insufficient, operation C is the better choice.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (12)

1. A data transmission method between a CPU and coprocessors, characterized in that the method comprises:
controlling, in parallel, the data transmission of N coprocessors according to N threads generated by the CPU, where N is an integer greater than or equal to 2;
wherein said controlling comprises: a coprocessor receiving, in data slice form, data sent by the CPU; or, a coprocessor, while receiving and storing the data slice of the current moment sent by the CPU or by the preceding coprocessor, sending the stored data slice of the previous moment to the next coprocessor.
2. The method according to claim 1, characterized in that, when the method is used to transfer data from the CPU to N target coprocessors, the CPU sends the data in data slice form to one of the target coprocessors, and this target coprocessor, controlled by its corresponding thread, while receiving and storing the data slice of the current moment sent by the CPU, sends the stored data slice of the previous moment to the next target coprocessor.
3. The method according to claim 1, characterized in that, when the method is used to transfer data from the CPU to N target coprocessors, the CPU sends the data in data slice form to the N target coprocessors, and the N target coprocessors, controlled by their corresponding threads, simultaneously receive and store the data slices sent by the CPU.
4. The method according to claim 1, characterized in that, when the method is used to transfer data from a source coprocessor to N-1 other target coprocessors, the CPU controls the source coprocessor through its corresponding thread to send the data in data slice form to the CPU; the CPU, while receiving and storing the data slice of the current moment sent by the source coprocessor, sends the stored data slice of the previous moment to one of the target coprocessors; and this target coprocessor, controlled by its corresponding thread, while receiving and storing the data slice of the current moment sent by the CPU, sends the stored data slice of the previous moment to the next target coprocessor.
5. The method according to claim 2 or 4, characterized in that, if said next target coprocessor is the last target coprocessor, the last target coprocessor, controlled by its corresponding thread, receives and stores the arriving data slices; otherwise, said next target coprocessor, controlled by its corresponding thread, while receiving the data slice of the current moment sent by the preceding target coprocessor, sends the stored data slice of the previous moment to the next target coprocessor, and so on until the last target coprocessor.
6. The method according to claim 1, characterized in that, when the method is used to transfer data from a source coprocessor to N-1 other target coprocessors, the CPU notifies the source coprocessor through its corresponding thread to send the data in data slice form from the source coprocessor to the CPU; the CPU, while receiving and storing the data slice of the current moment sent by the source coprocessor, sends the stored data slice of the previous moment to the N-1 target coprocessors; and the N-1 target coprocessors, controlled by their corresponding threads, simultaneously receive and store the data slices sent by the CPU.
7. A data transmission device between a CPU and coprocessors, the device being arranged in the CPU, characterized in that the device comprises:
a thread control unit, used to generate N threads; and
a transmission control unit, used to control, in parallel, the data transmission of N coprocessors according to the N threads, where N is an integer greater than or equal to 2;
wherein said control comprises: a coprocessor receiving, in data slice form, data sent by the CPU; or, a coprocessor, while receiving and storing the data slice of the current moment sent by the CPU or by the preceding coprocessor, sending the stored data slice of the previous moment to the next coprocessor.
8. The device according to claim 7, characterized in that, when the device is used to transfer data from the CPU to N target coprocessors, the CPU sends the data in data slice form to one of the target coprocessors, and the transmission control unit controls this target coprocessor, according to its corresponding thread, while it receives and stores the data slice of the current moment sent by the CPU, to send the stored data slice of the previous moment to the next target coprocessor.
9. The device according to claim 7, characterized in that, when the device is used to transfer data from the CPU to N target coprocessors, the CPU sends the data in data slice form to the N target coprocessors, and the transmission control unit controls the N target coprocessors, according to their corresponding threads, to simultaneously receive and store the data slices sent by the CPU.
10. The device according to claim 7, characterized in that, when the device is used to transfer data from a source coprocessor to N-1 other target coprocessors, the transmission control unit controls the source coprocessor, according to its corresponding thread, to send the data in data slice form to the CPU; the CPU, while receiving and storing the data slice of the current moment sent by the source coprocessor, sends the stored data slice of the previous moment to one of the target coprocessors; and the transmission control unit controls this target coprocessor, according to its corresponding thread, while it receives and stores the data slice of the current moment sent by the CPU, to send the stored data slice of the previous moment to the next target coprocessor.
11. The device according to claim 8 or 10, characterized in that, if said next target coprocessor is the last target coprocessor, the transmission control unit controls the last target coprocessor, according to its corresponding thread, to receive and store the arriving data slices; otherwise, the transmission control unit controls said next target coprocessor, according to its corresponding thread, while it receives the data slice of the current moment sent by the preceding target coprocessor, to send the stored data slice of the previous moment to the next target coprocessor, and so on until the last target coprocessor.
12. The device according to claim 7, characterized in that, when the device is used to transfer data from a source coprocessor to N-1 other target coprocessors, the transmission control unit controls the source coprocessor, according to its corresponding thread, to send the data in data slice form to the CPU; the CPU, while receiving and storing the data slice of the current moment sent by the source coprocessor, sends the stored data slice of the previous moment to the N-1 target coprocessors; and the transmission control unit controls the N-1 target coprocessors, according to their corresponding threads, to simultaneously receive and store the data slices sent by the CPU.
CN201210532292.4A 2012-12-11 2012-12-11 Data transmission method and device between a kind of CPU and coprocessor Active CN103049421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210532292.4A CN103049421B (en) 2012-12-11 2012-12-11 Data transmission method and device between a kind of CPU and coprocessor


Publications (2)

Publication Number Publication Date
CN103049421A true CN103049421A (en) 2013-04-17
CN103049421B CN103049421B (en) 2019-08-27

Family

ID=48062066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210532292.4A Active CN103049421B (en) 2012-12-11 2012-12-11 Data transmission method and device between a kind of CPU and coprocessor

Country Status (1)

Country Link
CN (1) CN103049421B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536936A (en) * 2015-01-28 2015-04-22 浪潮电子信息产业股份有限公司 Draw-bar box type programmable calculator device
WO2015192812A1 (en) * 2014-06-20 2015-12-23 Tencent Technology (Shenzhen) Company Limited Data parallel processing method and apparatus based on multiple graphic procesing units
CN106776455A (en) * 2016-12-13 2017-05-31 郑州云海信息技术有限公司 A kind of method and device of many GPU communications of unit
WO2017206591A1 (en) * 2016-06-01 2017-12-07 华为技术有限公司 Data processing system and data processing method
CN107846709A (en) * 2017-09-29 2018-03-27 深圳市亿兆互联技术有限公司 A kind of radio communication device and wireless communications method based on LoRa
CN110908805A (en) * 2019-11-29 2020-03-24 深圳前海达闼云端智能科技有限公司 Information distribution method, robot and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020084963A (en) * 2001-05-03 2002-11-16 엘지전자 주식회사 Interrupt processing apparatus
US20080117220A1 (en) * 2006-09-25 2008-05-22 Neurala Llc Graphic Processor Based Accelerator System and Method
US20080120489A1 (en) * 2006-11-16 2008-05-22 Shinri Inamori Scalable Multi-Threaded Sequencing/Synchronizing Processor Architecture
US20080183912A1 (en) * 2007-01-31 2008-07-31 Monroe George T Coprocessor command synchronization using a dma channel
CN101526934A (en) * 2009-04-21 2009-09-09 浪潮电子信息产业股份有限公司 Construction method of GPU and CPU combined processor
US7602395B1 (en) * 2005-04-22 2009-10-13 Nvidia Corporation Programming multiple chips from a command buffer for stereo image generation
CN102007479A (en) * 2008-03-31 2011-04-06 先进微装置公司 Peer-to-peer special purpose processor architecture and method
CN102036043A (en) * 2010-12-15 2011-04-27 成都市华为赛门铁克科技有限公司 Video data processing method and device as well as video monitoring system
CN102135949A (en) * 2011-03-01 2011-07-27 浪潮(北京)电子信息产业有限公司 Computing network system, method and device based on graphic processing unit
CN102184397A (en) * 2011-04-25 2011-09-14 中国测绘科学研究院 Fast remote sensing image normal incidence correction method
CN102541804A (en) * 2011-12-26 2012-07-04 中国人民解放军信息工程大学 Multi-GPU (graphic processing unit) interconnection system structure in heterogeneous system


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BEYOND071: "《http://bbs.csdn.net/topics/370097506》", 1 August 2011 *
LIU SHA等: "《SOLVERS FOR SYSTEMS OF LARGE SPARSE LINEAR AND NONLINEAR EQUATIONS BASED ON MULTI-GPUS》", 《TRANSACTIONS OF NANJING UNIVERSITY OF AERONAUTICS & ASTRONAUTICS》 *
刘伟峰等: "《基于多GPU的三维Kirchhoff积分法体偏移》", 《华中科技大学学报(自然科学版)》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015192812A1 (en) * 2014-06-20 2015-12-23 Tencent Technology (Shenzhen) Company Limited Data parallel processing method and apparatus based on multiple graphic procesing units
CN104035751B (en) * 2014-06-20 2016-10-12 深圳市腾讯计算机系统有限公司 Data parallel processing method based on multi-graphics processor and device
US10282809B2 (en) 2014-06-20 2019-05-07 Tencent Technology (Shenzhen) Company Limited Data parallel processing method and apparatus based on multiple graphic processing units
CN104536936A (en) * 2015-01-28 2015-04-22 浪潮电子信息产业股份有限公司 Draw-bar box type programmable calculator device
WO2017206591A1 (en) * 2016-06-01 2017-12-07 华为技术有限公司 Data processing system and data processing method
CN106776455A (en) * 2016-12-13 2017-05-31 郑州云海信息技术有限公司 A kind of method and device of many GPU communications of unit
CN107846709A (en) * 2017-09-29 2018-03-27 深圳市亿兆互联技术有限公司 A kind of radio communication device and wireless communications method based on LoRa
CN107846709B (en) * 2017-09-29 2021-08-24 深圳市亿兆互联技术有限公司 Wireless communication device and wireless communication method based on LoRa
CN110908805A (en) * 2019-11-29 2020-03-24 深圳前海达闼云端智能科技有限公司 Information distribution method, robot and storage medium

Also Published As

Publication number Publication date
CN103049421B (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN103049421A (en) Method and device for data transmission between central processing unit (CPU) and co-processors
CN101546276B (en) Method for achieving interrupt scheduling under multi-core environment and multi-core processor
JP6224244B2 (en) Power balancing to increase working density and improve energy efficiency
CN104796337A (en) Method and device for forwarding message
JP2015527681A5 (en)
CN105516024B (en) A kind of task flux monitoring method and system based on queue
CN102821164B (en) Efficient parallel-distribution type data processing system
WO2011150346A3 (en) Accelerator system for use with secure data storage
WO2013082069A3 (en) Method of power calculation for performance optimization
CN103516744A (en) A data processing method, an application server and an application server cluster
CN103064807A (en) Multi-channel direct memory access controller
CN105183549A (en) Automatic ticketing system based on task assignment
EP3054387A1 (en) Data compression method and storage system
CN107239342A (en) A kind of storage cluster task management method and device
CN104580503A (en) Efficient dynamic load balancing system and method for processing large-scale data
CN103607360A (en) Message processing method, line card and switching equipment
CN103916316A (en) Linear speed capturing method of network data packages
CN103888452B (en) For the order-preserving method and device of message compression
CN106126841B (en) A kind of method and apparatus based on hardware frequency conversion
CN103353750B (en) A kind of microwave metallurgical control method based on multibus
CN102446155A (en) Synchronizing device and method
CN104008067B (en) A kind of method and device of data storage
CN102780642A (en) Multichannel network message transmission method
EP3142333A1 (en) Data processing apparatus and data processing method
US11468127B1 (en) Data delivery

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant