CN103377032A - Fine granularity scientific computation parallel processing device on basis of heterogenous multi-core chip - Google Patents

Fine granularity scientific computation parallel processing device on basis of heterogenous multi-core chip

Info

Publication number
CN103377032A
Authority
CN
China
Prior art keywords
core
task
parallel processing
flag
fine granularity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012101057224A
Other languages
Chinese (zh)
Inventor
刘鹏
杨劼
顾雄礼
史册
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN2012101057224A
Publication of CN103377032A
Legal status: Pending


Abstract

An embodiment of the invention discloses a fine granularity scientific computation parallel processing device on the basis of a heterogenous multi-core chip. The fine granularity scientific computation parallel processing device is characterized in that an interface module runs on a main core, and task type identifiers FLAG are generated according to data dependence relationships of objects and are transmitted into a recording module; the recording module runs on the main core and records the task type identifiers FLAG and target processor numbers TaskDest of follow-up objects, and the task type identifiers FLAG are determined according to a data flow model; an object distributing module runs on the main core and is used for distributing tasks to corresponding slave cores according to FLAG values and the TaskDest and updating FLAG and TaskDest in object tables of agent managers on the corresponding slave cores; agent manager modules which are used as agents of parallel processing devices are arranged on the main core and the various slave cores, are used for managing runtime systems and comprise the object tables, actuators and type selectors. The fine granularity scientific computation parallel processing device has the advantage that the fine granularity scientific computation parallel processing device is used for realizing parallelization and performance optimization for fine granularity scientific computation on a heterogenous multi-core system on a chip.

Description

A fine-granularity scientific computation parallel processing device based on a heterogeneous multi-core chip
Technical field
The invention belongs to the field of computer application technology, and in particular relates to a fine-granularity scientific computation parallel processing device based on a heterogeneous multi-core chip.
Background
Multi-core processor chips were first used in supercomputers, whose main workload is scientific computing. Because supercomputers need large computing power, they were built from multiple processors from the start, and the software they run is parallel software. Once multi-core chips were developed, they delivered large computing power, and existing software could be ported to them without major changes, so they were adopted smoothly in supercomputers.
Applied research on embedded on-chip multi-core systems also needs to draw on the experience of parallel programming for supercomputers. In terms of processor architecture, on-chip multiprocessors can be divided into two types according to the structure of their cores: homogeneous and heterogeneous. In a homogeneous multi-core processor, all processor cores inside the chip have the same structure, and every core has exactly the same function and status; each can execute tasks on its own, and its function and structure are close to those of a general-purpose single-core processor. A heterogeneous multi-core processor chip contains several processor cores with different functions, and different cores are responsible for different tasks. Heterogeneous multi-core processors are currently used mainly in special-purpose computing, such as multimedia processors, embedded processors and supercomputer processors. They usually comprise a general-purpose processor and processors dedicated to accelerating computation, such as digital signal processors, network processors and streaming-media processors. The general-purpose processor is usually responsible for managing the multi-core system, while the accelerator processors mainly carry out specific computing tasks. Because a heterogeneous on-chip multiprocessor can be built from different processor cores according to application requirements, a heterogeneous structure can achieve the best performance-to-power ratio. In a fine-granularity scientific computing program the ratio of task computation to communication is small, so the share of processor time spent on computation generally has to be increased as far as possible to reach high efficiency. A heterogeneous structure uses the general-purpose processor to manage the multi-core system and frees the accelerator processors to be devoted to computation, so adopting a heterogeneous structure helps fine-granularity scientific computing programs run efficiently on an on-chip multi-core system.
Parallel computing on multi-core processors is a focus of parallel software development, and its central question is how to allocate and schedule processes and threads. The allocation strategy assigns processes to suitable processor cores. In a heterogeneous structure the processor cores differ in nature and their running state varies over time, so the quality of the allocation affects system performance. The most important and typical parallel computing models include the parallel random access machine (PRAM) model, the bulk synchronous parallel (BSP) model, and the LogP (Latency, overhead, gap, Processor) model for distributed-memory multiprocessors with point-to-point communication. Each model has many extensions under different assumptions. The PRAM model targets single-instruction multiple-data parallel machines; it requires the parallel machine to have shared memory that every processor can access at any time, so it does not suit heterogeneous multi-core platforms with a distributed memory architecture. The BSP model places no requirement on the memory organization of the parallel machine, which may be shared or distributed, but it has no extension for heterogeneous multi-core platforms. The LogP model is based on distributed memory, but it requires message passing between processors to be strictly point-to-point and to follow a fixed order, so it does not suit scientific computing that involves one-to-many message passing.
Therefore, no targeted solution currently exists for parallelizing fine-granularity scientific computing applications on an on-chip heterogeneous multi-core system, which limits the performance these applications can achieve on such systems. Given this shortcoming of the prior art, it is necessary to study the problem and provide a scheme that remedies it, so that fine-granularity scientific computing applications do not perform poorly on an on-chip heterogeneous multi-core system.
Summary of the invention
To solve the above problems, the object of the invention is to provide a fine-granularity scientific computation parallel processing device based on a heterogeneous multi-core chip, for use on an on-chip heterogeneous multi-core system and aimed at fine-granularity scientific computing. By defining scientific computation task types and a protocol between the application program and the operating system, the device reduces the computation overhead, call overhead and communication overhead incurred when scientific computing runs on the on-chip heterogeneous multi-core system, and thereby achieves parallelization and performance tuning of fine-granularity scientific computing on that system.
To achieve the above object, the technical solution of the invention is as follows:
A fine-granularity scientific computation parallel processing device based on a heterogeneous multi-core chip, applied to a heterogeneous multi-core chip comprising one main core and at least one slave core, comprises an interface module, a recording module, an object distribution module and an agent manager module.
The interface module runs on the main core and implements the protocol between the application program and the operating system. It generates the task type identifier FLAG according to the data dependence relationships of the objects and passes it to the recording module.
The recording module runs on the main core and records the information about objects defined by the protocol between the application program and the operating system. The record includes the task type identifier FLAG and the destination processor number TaskDest of the successor object, both determined according to the data flow model.
The object distribution module runs on the main core and distributes objects at initialization. It assigns each task to the corresponding slave core according to the FLAG value and TaskDest, and updates FLAG and TaskDest in the object table of the agent manager on the corresponding slave core.
The agent manager module manages the runtime system. It is present on the main core and on each slave core as the agent of the parallel processing device, and comprises an object table, an actuator and a type selector.
Preferably, the task type identifier FLAG is defined as four flag values, which correspond in order to the inter-core producer, the inter-core consumer, the intra-core producer and the intra-core consumer. A flag value of 1 indicates that the corresponding dependence exists, and a flag value of 0 indicates that it does not.
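The four-bit FLAG described above can be sketched as a small data structure. This is only an illustration: the class name `TaskFlag` and its field names are hypothetical, and the patent does not specify how FLAG is stored; only the bit order (inter-core producer, inter-core consumer, intra-core producer, intra-core consumer) comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskFlag:
    """Four dependence bits in the order A, B, C, D used by FLAG (hypothetical type)."""
    inter_core_producer: bool  # A: data arrives from a task on another core
    inter_core_consumer: bool  # B: data is sent to a task on another core
    intra_core_producer: bool  # C: data arrives from a task on the same core
    intra_core_consumer: bool  # D: data is sent to a task on the same core

    def encode(self) -> str:
        """Return the FLAG as a four-character string such as '1001'."""
        bits = (self.inter_core_producer, self.inter_core_consumer,
                self.intra_core_producer, self.intra_core_consumer)
        return "".join("1" if bit else "0" for bit in bits)

    @classmethod
    def decode(cls, flag: str) -> "TaskFlag":
        """Parse a four-character FLAG string back into its four bits."""
        if len(flag) != 4 or any(ch not in "01" for ch in flag):
            raise ValueError(f"FLAG must be four characters of 0 or 1, got {flag!r}")
        return cls(*(ch == "1" for ch in flag))


# A task fed from another core whose result stays on its own core.
flag = TaskFlag(True, False, False, True)
print(flag.encode())
print(TaskFlag.decode("0100"))
```

Because each bit is independent, the sixteen possible strings from 0000 to 1111 cover every combination of the four dependences.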
Compared with the prior art, the invention has the following beneficial effects:
(1) Because task data is transferred differently within a core and between cores, tasks must be distinguished accordingly. Identifying the type of a task mainly requires answering four questions: 1) whether the task has a producer; 2) whether the task has a consumer; 3) whether the data transfer of the task takes place within a core; 4) whether the data transfer of the task takes place between cores. On this basis the invention proposes a classified scheduling strategy. Tasks are first divided into 16 types, which fully cover the 16 combinations of the four dependence relations. To distinguish these 16 types, a task type identifier FLAG is defined as four flag values ABCD, which correspond in order to the inter-core producer, the inter-core consumer, the intra-core producer and the intra-core consumer; a value of 1 indicates that the corresponding dependence exists, and a value of 0 indicates that it does not. Each type is executed in a targeted way, which removes unnecessary call overhead and benefits the parallelization and performance tuning of fine-granularity scientific computing on an on-chip heterogeneous multi-core system.
(2) Because the protocol between the application program and the operating system is defined, parallel programmers no longer need to handle synchronization and data communication between objects, which effectively reduces the difficulty of parallel programming.
Description of drawings
Fig. 1 is a structural block diagram of the fine-granularity scientific computation parallel processing device based on a heterogeneous multi-core chip according to an embodiment of the invention;
Fig. 2 is a flow chart of the agent manager functions in the parallel processing device according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the RED platform structure on which the embodiment of the parallel processing device is applied;
Fig. 4a is the data dependence graph of application example one, matrix multiplication;
Fig. 4b is the execution graph of application example one, matrix multiplication;
Fig. 5a is the data dependence graph of application example two, FFT;
Fig. 5b is the execution graph of application example two, FFT.
Embodiment
To make the object, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and are not intended to limit it.
On the contrary, the invention covers any substitution, modification, equivalent method and scheme made within the spirit and scope of the invention as defined by the claims. Further, to give the public a better understanding of the invention, some specific details are described at length below. A person skilled in the art can fully understand the invention even without these details.
Referring to Fig. 1, in a fine-granularity scientific computation parallel processing device based on a heterogeneous multi-core chip, the heterogeneous chip comprises 1 main core and i slave cores, where i is an integer not less than 1. The device comprises an interface module, a recording module, an object distribution module and an agent manager module, wherein:
The interface module 101 runs on the main core and implements the protocol between the application program and the operating system. It generates the task type identifier FLAG according to the data dependence relationships of the objects and passes it to the recording module.
In the specific application examples, FLAG is defined as four flag values ABCD, which correspond in order to the inter-core producer, the inter-core consumer, the intra-core producer and the intra-core consumer. A flag value of 1 indicates that the corresponding dependence exists, and a flag value of 0 indicates that it does not.
The recording module 102 runs on the main core and records the information about objects defined by the protocol between the application program and the operating system: the task type identifier FLAG and the destination processor number TaskDest of the successor object, both determined according to the data flow model.
The object distribution module 103 runs on the main core and distributes objects at initialization. It assigns each task to the corresponding slave core according to the FLAG value and TaskDest, and updates FLAG and TaskDest in the object table of the agent manager on the corresponding slave core.
The agent manager module manages the runtime system. It is present on the main core and on each slave core as the agent of the scheduling system; the figure shows the agent manager module 104 on the main core and the agent manager module 105 on slave core 1. The agent manager module further comprises an object table, an actuator and a type selector, wherein:
The object table contains all the information about the objects mapped onto the core.
Fig. 2 shows the flow chart of the task execution steps of the actuator. The process carried out by the actuator comprises the task activation phase, the task execution phase and the task synchronization phase, with the following steps:
The task activation phase comprises:
S201: check whether the input data buffer is ready;
S202: if the input data buffer is ready, send feedback for the received data produced by the inter-core predecessor object, and enter the task execution phase.
The task execution phase comprises:
S203: execute the task;
determine whether the task has finished;
if it has not finished, continue executing;
if it has finished, enter the task synchronization phase.
The task synchronization phase comprises:
S204: synchronize with the successor object to determine whether the output data buffer is valid;
S205: if it is valid, transfer the data to the inter-core object;
S206: determine whether the data transfer to the inter-core object has completed;
S207: if the data transfer to the inter-core object has completed, set the input data buffer of the inter-core successor object to valid and set the output data buffer of the inter-core predecessor object to valid;
S208: determine whether the feedback from the successor object has been received;
S209: if the feedback from the successor object has been received, set the input data buffer of the intra-core successor object to valid and set the output data buffer of the intra-core predecessor object to valid.
The type selector selects the corresponding actuator steps according to the FLAG value in the object table, as shown in Table 1:
Table 1. Actuator steps selected by the type selector
FLAG value | Executed steps
0000 | S203
0001 | S203, S204, S207
0010 | S201, S203, S209
0011 | S201, S203, S204, S205, S206, S207
0100 | S203, S204, S205, S206
0101 | S203, S204, S205, S206, S207
0110 | S201, S203, S204, S205, S206, S209
0111 | S201, S202, S203, S204, S205, S206, S208, S209
1000 | S201, S202, S203, S208
1001 | S201, S202, S203, S204, S207, S208
1010 | S201, S202, S203, S208, S209
1011 | S201, S202, S203, S207, S208, S209
1100 | S201, S202, S203, S204, S205, S206, S208
1101 | S201, S202, S203, S204, S205, S206, S207, S208
1110 | S201, S202, S203, S204, S205, S206, S208, S209
1111 | S201, S202, S203, S204, S205, S206, S207, S208, S209
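The type selector can be pictured as a lookup from FLAG to a list of step labels. The sketch below transcribes Table 1 as it is printed; the names `TABLE_1`, `select_steps` and `skipped_steps` are hypothetical, and representing the table as a Python dictionary is an assumption made for illustration rather than the structure used in the patent.

```python
# Table 1 transcribed as printed: FLAG value -> step numbers n, meaning step S20n.
TABLE_1 = {
    "0000": (3,),
    "0001": (3, 4, 7),
    "0010": (1, 3, 9),
    "0011": (1, 3, 4, 5, 6, 7),
    "0100": (3, 4, 5, 6),
    "0101": (3, 4, 5, 6, 7),
    "0110": (1, 3, 4, 5, 6, 9),
    "0111": (1, 2, 3, 4, 5, 6, 8, 9),
    "1000": (1, 2, 3, 8),
    "1001": (1, 2, 3, 4, 7, 8),
    "1010": (1, 2, 3, 8, 9),
    "1011": (1, 2, 3, 7, 8, 9),
    "1100": (1, 2, 3, 4, 5, 6, 8),
    "1101": (1, 2, 3, 4, 5, 6, 7, 8),
    "1110": (1, 2, 3, 4, 5, 6, 8, 9),
    "1111": (1, 2, 3, 4, 5, 6, 7, 8, 9),
}

ALL_STEPS = tuple(range(1, 10))  # S201 .. S209


def select_steps(flag):
    """Return the step labels the actuator runs for this FLAG value."""
    return [f"S20{n}" for n in TABLE_1[flag]]


def skipped_steps(flag):
    """Return the step labels the actuator skips for this FLAG value."""
    chosen = set(TABLE_1[flag])
    return [f"S20{n}" for n in ALL_STEPS if n not in chosen]


print(select_steps("0100"))
print(skipped_steps("0100"))
```

Every FLAG value selects S203, since every task must at least be executed; the remaining entries decide which activation and synchronization steps can be left out.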
Parallelizing fine-granularity scientific computing in this system comprises the following steps:
(1) the interface module receives the application information, generates the task type identifier FLAG according to the data dependence relationships of the objects, and passes it to the recording module;
(2) the recording module records the task type identifier FLAG and the destination processor number TaskDest of the successor object;
(3) the object distribution module assigns each task to the corresponding slave core according to the FLAG value and TaskDest, and updates FLAG and TaskDest in the object table of the agent manager on the corresponding slave core;
(4) the type selector in the agent manager module selects the corresponding actuator functions according to the FLAG in the object table, until all objects in the object table have been executed.
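Steps (2) and (3) above amount to grouping recorded objects by destination core. A minimal sketch follows, assuming that TaskDest is used as the number of the core an object is placed on, with 0 denoting the main core, as in the application examples; `ObjectRecord` and `distribute` are hypothetical names, and the placement of t21 on core 1 is an assumption based on it being the intra-core successor of t11.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ObjectRecord:
    """One entry kept by the recording module (hypothetical layout)."""
    name: str
    flag: str       # four-character FLAG string, e.g. "1001"
    task_dest: int  # destination core number; 0 is taken to be the main core


def distribute(records):
    """Build one object table per core, keeping the recorded order."""
    tables = defaultdict(list)
    for record in records:
        tables[record.task_dest].append(record)
    return dict(tables)


records = [
    ObjectRecord("t00", "0100", 0),  # stays on the main core
    ObjectRecord("t11", "1001", 1),  # goes to slave core DSP1
    ObjectRecord("t21", "1011", 1),  # assumed to share DSP1 with t11
]

object_tables = distribute(records)
for core, table in sorted(object_tables.items()):
    print(core, [(rec.name, rec.flag) for rec in table])
```

Each per-core list stands in for the object table of that core's agent manager, which its type selector would then walk through entry by entry.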
Matrix multiplication and the fast Fourier transform (FFT) are described below as application examples. Matrix multiplication is a typical mathematical problem; because it is computationally heavy, it is commonly used to test the floating-point performance of a computer. For a parallel machine, parallel efficiency can also be tested with matrix multiplication. Fields such as process control, image processing and scientific computing use a large number of matrix multiplications, so computing them in parallel improves operating efficiency. The FFT is one of the basic theories and methods of modern signal analysis and processing, communication engineering, power engineering, control and information engineering, and it is widely applied in related areas of mathematics, physics and engineering such as mechanics, optics, quantum physics and the analysis of linear systems.
Application example one
The Cannon algorithm implementation of matrix multiplication is shown in Fig. 4. Matrix multiplication is a very common application in fine-granularity scientific computing. With the Cannon algorithm, its data dependence graph is as shown in Fig. 4a. The task dependence graph is described as follows:
1) Objects ti0 (i=0...3) are the predecessor objects of the inter-core successor objects tij (i=1...4, j=1...8) and transmit the initial matrix data.
2) Objects tij (i=1, 2, 3, j=1...8) compute the intermediate results of the sub-matrices; they are the predecessor objects of the intra-core successor objects ti+1,j (i=1, 2, 3, j=1...8).
3) Objects t4j (j=1...8) compute the final results of the sub-matrices; they are the predecessor objects of the inter-core successor object t50.
4) Object t40 assembles the final result of the matrix computation.
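For readers unfamiliar with the Cannon algorithm named above, the following sketch simulates its textbook form on a square logical grid with one matrix element per grid cell. It illustrates only the align, multiply and shift pattern; it is not the patent's task graph, and it does not reproduce the mapping onto the main core and 8 DSPs shown in Fig. 4. The function name `cannon_multiply` is hypothetical.

```python
def cannon_multiply(a, b):
    """Multiply two n x n matrices with the textbook Cannon schedule.

    Grid cell (i, j) holds one element of A and one of B. After an initial
    alignment, every cell multiplies its two local values, then A shifts
    left by one cell and B shifts up by one cell; this repeats n times.
    """
    n = len(a)
    # Initial alignment: row i of A shifts left by i, column j of B shifts up by j.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Local multiply-accumulate in every cell.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Circular shifts: A one step left, B one step up.
        a_loc = [[a_loc[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_loc = [[b_loc[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In a real deployment each grid cell would hold a sub-matrix rather than a single element, and the shifts would be the inter-core data transfers that the FLAG mechanism classifies.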
The following describes how the parallel processing system executes the application:
(1) The interface module runs on the main core and implements the protocol between the application program and the operating system. It generates the task type identifier FLAG according to the data dependence relationships of the objects and passes it to the recording module.
1) Objects ti0 (i=0...3) have no intra-core producer or consumer and no inter-core producer, but have the inter-core consumers tij (i=1...4, j=1...8); according to the protocol description, the interface module sets their FLAG value to 0100.
2) Objects tij (i=1, j=1...8) have an intra-core consumer, no intra-core producer, an inter-core producer and no inter-core consumer; according to the protocol description, the interface module sets their FLAG value to 1001.
3) Objects tij (i=2...4, j=1...8) have an intra-core consumer, an intra-core producer, an inter-core producer and no inter-core consumer; according to the protocol description, the interface module sets their FLAG value to 1011.
4) Object t40 has an inter-core producer, no inter-core consumer, no intra-core producer and no intra-core consumer; according to the protocol description, the interface module sets its FLAG value to 1000.
(2) The recording module runs on the main core and records the information about objects defined by the protocol between the application program and the operating system: the task type identifier FLAG and the object destination processor number TaskDest, both determined according to the data flow model. The main entries are shown in Table 2:
Table 2. Task type identifiers and object destination processor numbers for matrix multiplication
(3) The object distribution module runs on the main core and distributes objects at initialization. It assigns each task to the corresponding slave core according to the FLAG value and TaskDest, and updates FLAG and the successor TaskDest in the object table of the agent manager on the corresponding slave core.
In this application example, the object distribution module first reads the tasks in the recording module in order. For instance, the TaskDest of task t00 is 0, so the task is assigned to the main core; the TaskDest of task t11 is 1, so the task is assigned to slave core DSP1, and FLAG and the successor TaskDest are updated in the object table of the agent manager on DSP1. The final execution graph is shown in Fig. 4b.
(4) The agent manager module manages the runtime system. It is present on the main core and on each slave core as the agent of the scheduling system, and comprises an object table, an actuator and a type selector, wherein:
The object table contains all the information about the objects mapped onto the core.
The actuator comprises 9 functions, whose order follows Fig. 2. The type selector selects the corresponding actuator functions according to the FLAG value in the object table. In this application example, the FLAG of task t00 is 0100, so the actuator executes only functions 3, 4, 5 and 6 and skips functions 1, 2, 7, 8 and 9; the FLAG of task t11 is 1001, so the actuator executes only functions 1, 2, 3, 4, 7 and 8 and skips functions 5, 6 and 9.
Application example two
The FFT algorithm is a fast algorithm for the discrete Fourier transform, and the most popular variant is the Cooley-Tukey algorithm. Its computing formula is as follows:
$$ n = N_2 n_1 + n_2, \quad 0 \le n_1 \le N_1 - 1, \quad 0 \le n_2 \le N_2 - 1 $$
$$ k = k_1 + N_1 k_2, \quad 0 \le k_1 \le N_1 - 1, \quad 0 \le k_2 \le N_2 - 1 $$
$$ N = N_1 N_2 $$
The application example uses a 64-point FFT as the test case, with N1=N2=8, as shown in Fig. 5a.
1) Object t00 is the predecessor object of the inter-core successor objects t1j (j=1...8). It rearranges the original FFT data and then sends them to the corresponding accelerator units (DSPs); it is mapped onto the controlling RISC.
2) Objects t1j (j=1...8) compute the first inner FFT (formula image BDA0000152347340000104 in the original) and then pass the result to the inter-core successor object t20. These objects are mapped onto the DSPs.
3) Object t20 computes the factors given by formula image BDA0000152347340000105 in the original and then multiplies them with the data sent in step 2), so that the arrangement of the data changes from x[k1, n2] to x[n2, k2]. It then passes the result to the inter-core successor objects t3j (j=1...8). t20 is mapped onto the RISC.
4) Objects t3j (j=1...8) compute the second inner FFT (formula image BDA0000152347340000111 in the original) and then pass the result to the inter-core successor object t40. Objects t3j (j=1...8) are mapped onto the DSPs.
5) Object t40 converts the output data to the frequency domain and obtains the final transform result; it is mapped onto the RISC.
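The index mapping n = N2·n1 + n2, k = k1 + N1·k2 splits an N-point transform into N2 inner transforms of length N1, a multiplication by twiddle factors, and N1 outer transforms of length N2. The sketch below checks this decomposition for N1 = N2 = 8 against a direct DFT. The function names are hypothetical, the formula images of the original are not reproduced, and the comments relating each stage to objects t1j, t20 and t3j are an interpretation of the description rather than the patent's code.

```python
import cmath


def dft(x):
    """Direct O(n^2) discrete Fourier transform, used as the reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def fft_two_stage(x, n1_size, n2_size):
    """Cooley-Tukey split with n = N2*n1 + n2 and k = k1 + N1*k2."""
    total = n1_size * n2_size
    if len(x) != total:
        raise ValueError("input length must equal N1 * N2")
    # Stage 1 (compare objects t1j): for each n2, a length-N1 DFT over n1.
    inner = [dft([x[n2_size * n1 + n2] for n1 in range(n1_size)])
             for n2 in range(n2_size)]  # inner[n2][k1]
    # Stage 2 (compare object t20): multiply by the twiddle factors W_N^(n2*k1).
    twiddled = [[inner[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / total)
                 for k1 in range(n1_size)] for n2 in range(n2_size)]
    # Stage 3 (compare objects t3j): for each k1, a length-N2 DFT over n2.
    out = [0j] * total
    for k1 in range(n1_size):
        column = dft([twiddled[n2][k1] for n2 in range(n2_size)])
        for k2 in range(n2_size):
            out[k1 + n1_size * k2] = column[k2]
    return out


signal = [complex(i % 7, -(i % 3)) for i in range(64)]
error = max(abs(r - g) for r, g in zip(dft(signal), fft_two_stage(signal, 8, 8)))
print(error < 1e-8)
```

The eight independent transforms in stage 1 and in stage 3 are what make the work divisible across eight accelerator cores, with the twiddle multiplication as the single gathering point between them.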
The following describes how the parallel processing system executes the application:
(1) The interface module runs on the main core and implements the protocol between the application program and the operating system. It generates the task type identifier FLAG according to the data dependence relationships of the objects and passes it to the recording module.
1) Object t00 has no intra-core producer or consumer and no inter-core producer, but has the inter-core consumers tij (i=1, j=1...8); according to the protocol description, the interface module sets its FLAG value to 0100.
2) Objects tij (i=1, j=1...8) have no intra-core producer or consumer, have the inter-core producer t00 and have the inter-core consumer t20; according to the protocol description, the interface module sets their FLAG value to 1100.
3) Object t20 has no intra-core producer or consumer, has the inter-core producers tij (i=1, j=1...8) and has the inter-core consumers tij (i=3, j=1...8); according to the protocol description, the interface module sets its FLAG value to 1100.
4) Objects tij (i=3, j=1...8) have no intra-core producer or consumer, have the inter-core producer t20 and have the inter-core consumer t40; according to the protocol description, the interface module sets their FLAG value to 1100.
5) Object t40 has an inter-core producer, no inter-core consumer, no intra-core producer and no intra-core consumer; according to the protocol description, the interface module sets its FLAG value to 1000.
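The FLAG values listed above follow mechanically from the dependence edges and the core each object is mapped to. The sketch below derives them for the FFT task graph; `derive_flags` is a hypothetical helper, core 0 stands for the RISC main core, and the mapping of t1j and t3j to DSP number j is an assumption (the description confirms it only for t11).

```python
def derive_flags(core_of, edges):
    """Compute the ABCD FLAG string of every object.

    core_of maps an object name to its core number; edges is a list of
    (producer, consumer) pairs taken from the data dependence graph.
    """
    bits = {name: [0, 0, 0, 0] for name in core_of}
    for producer, consumer in edges:
        same_core = core_of[producer] == core_of[consumer]
        # The consumer gains a producer: bit C if intra-core, bit A if inter-core.
        bits[consumer][2 if same_core else 0] = 1
        # The producer gains a consumer: bit D if intra-core, bit B if inter-core.
        bits[producer][3 if same_core else 1] = 1
    return {name: "".join(str(b) for b in flag) for name, flag in bits.items()}


# FFT task graph of application example two: t00 -> t1j -> t20 -> t3j -> t40.
core_of = {"t00": 0, "t20": 0, "t40": 0}
edges = []
for j in range(1, 9):
    core_of[f"t1{j}"] = j  # assumed: t1j runs on DSP j
    core_of[f"t3{j}"] = j  # assumed: t3j runs on DSP j
    edges += [("t00", f"t1{j}"), (f"t1{j}", "t20"),
              ("t20", f"t3{j}"), (f"t3{j}", "t40")]

flags = derive_flags(core_of, edges)
print(flags["t00"], flags["t11"], flags["t20"], flags["t31"], flags["t40"])
```

With this mapping every edge crosses a core boundary, which is why none of the FFT objects sets the intra-core bits C or D.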
(2) The recording module runs on the main core and records the information about objects defined by the protocol between the application program and the operating system: the task type identifier FLAG and the object destination processor number TaskDest, both determined according to the data flow model. The main entries are shown in Table 3:
Table 3. Task type identifiers and object destination processor numbers for the FFT algorithm
(Table image BDA0000152347340000112 in the original)
(3) The object distribution module runs on the main core and distributes objects at initialization. It assigns each task to the corresponding slave core according to the FLAG value and TaskDest, and updates FLAG and the successor TaskDest in the object table of the agent manager on the corresponding slave core.
In this application example, the object distribution module first reads the tasks in the recording module in order. For instance, the TaskDest of task t00 is 0, so the task is assigned to the main core; the TaskDest of task t11 is 1, so the task is assigned to slave core DSP1, and FLAG and the successor TaskDest are updated in the object table of the agent manager on DSP1. The final execution graph is shown in Fig. 5b.
(4) The agent manager module manages the runtime system. It is present on the main core and on each slave core as the agent of the scheduling system, and comprises an object table, an actuator and a type selector, wherein:
The object table contains all the information about the objects mapped onto the core.
The actuator comprises 9 steps, as explained in application example one; the execution order is shown in Fig. 2.
In this application example, the type selector in the agent manager module selects the corresponding actuator functions according to the FLAG value in the object table. The FLAG of task t00 is 0100, so the actuator executes only functions 3, 4, 5 and 6 and skips functions 1, 2, 7, 8 and 9; the FLAG of task t11 is 1100, so the actuator executes only functions 1, 2, 3, 4, 5, 6 and 8 and skips functions 7 and 9.
The present invention utilizes the heterogeneous polynuclear SOC (system on a chip), comprise 1 Reduced Instruction Set Computer (Reduced Instruction Set Computer, RISC) processor and 8 digital signal processor (Digita1 Signal Processors, DSPs) multinuclear RED (1*RISC+8*DSP) platform that forms is tested, the RED platform structure as shown in Figure 3, experimental result is as shown in table 4.
Table 4 adopts parallel processing apparatus of the present invention and is the call overhead comparison sheet that adopts parallel processing apparatus of the present invention
The experimental results show that, by reducing call overhead, overall system efficiency improved by 36.35% in application example one and by 29.28% in application example two.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (3)

1. A fine-granularity scientific computation parallel processing device based on a heterogeneous multi-core chip, applied to a heterogeneous multi-core chip comprising one main core and at least one slave core, characterized in that it comprises an interface module, a recording module, an object distribution module and an agent manager module, wherein:
the interface module runs on the main core, implements the protocol between the application program and the operating system, generates a task type identifier FLAG according to the data dependence relationships of the objects, and passes it to the recording module;
the recording module runs on the main core and records the information about objects defined by the protocol between the application program and the operating system; the record comprises the task type identifier FLAG and the target processor number TaskDest of the subsequent object, determined according to a data flow model;
the object distribution module runs on the main core and distributes objects at initialization; it assigns each task to the corresponding slave core according to its FLAG value and TaskDest, and updates the FLAG and TaskDest in the object table of the agent manager on that slave core;
the agent manager module resides on the main core and on each slave core as the agent of the parallel processing device, manages the runtime system, and comprises an object table, an actuator and a type selector.
2. The fine-granularity scientific computation parallel processing device based on a heterogeneous multi-core chip according to claim 1, characterized in that the task type identifier FLAG is defined as four flag values, which correspond in order to inter-core producer, inter-core consumer, intra-core producer and intra-core consumer; a flag value of 1 indicates that the corresponding dependence exists, and a flag value of 0 indicates that it does not.
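For illustration only, the four flag values can be decoded as below. The sketch assumes that FLAG is written as a four-character string of 0s and 1s, read left to right in the order given above; the names `TaskFlag`, `parse_flag` and `format_flag` are hypothetical and do not appear in the original.

```python
from typing import NamedTuple


class TaskFlag(NamedTuple):
    """The four dependence flags, in the order given in the text."""
    inter_core_producer: bool
    inter_core_consumer: bool
    intra_core_producer: bool
    intra_core_consumer: bool


def parse_flag(text):
    """Decode a string such as "1100" into a TaskFlag (1 = dependence exists)."""
    if len(text) != 4 or set(text) - {"0", "1"}:
        raise ValueError(f"FLAG must be four characters of 0 or 1, got {text!r}")
    return TaskFlag(*(ch == "1" for ch in text))


def format_flag(flag):
    """Encode a TaskFlag back into its four-character string form."""
    return "".join("1" if bit else "0" for bit in flag)


flag = parse_flag("1100")
print(flag.inter_core_producer, flag.intra_core_consumer)  # True False
print(format_flag(flag))                                   # 1100
```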
3. The fine-granularity scientific computation parallel processing device based on a heterogeneous multi-core chip according to claim 1, characterized in that task execution by the actuator comprises a task activation phase, a task execution phase and a task synchronization phase, specifically comprising the following steps:
the task activation phase comprises: checking whether the input data buffer is ready,
and providing feedback for the received data produced by the inter-core preceding object;
the task execution phase comprises: executing the task,
judging whether the task has finished,
if it has not finished, continuing execution,
if it has finished, entering the task synchronization phase;
the task synchronization phase comprises the following steps: synchronizing with the subsequent object to determine whether the output data buffer is valid;
if it is valid, transferring the data of the inter-core object;
judging whether the data transfer of the inter-core object has completed and, if so, setting the input data buffer of the subsequent inter-core object to valid and the output data buffer of the inter-core preceding object to valid;
judging whether feedback from the subsequent object has been obtained and, if so, setting the input data buffer of the subsequent intra-core object to valid and the output data buffer of the intra-core preceding object to valid.
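The three phases can be sketched as a heavily simplified, single-threaded model. All names here (`TaskObject`, `activate`, `execute`, `synchronize`, `run`) are hypothetical, the data transfer is reduced to a placeholder, and the updates to the preceding objects' output buffers and the wait for feedback from the subsequent object are omitted, so this shows only the order of the phases and not the actual device.

```python
from dataclasses import dataclass, field


@dataclass
class TaskObject:
    """Hypothetical stand-in for one object handled by the actuator."""
    name: str
    work_left: int = 1            # illustrative unit of remaining work
    input_ready: bool = False     # input data buffer state
    output_valid: bool = True     # output data buffer state
    trace: list = field(default_factory=list)


def activate(task):
    """Activation phase: check the input buffer, then acknowledge the data."""
    if not task.input_ready:
        return False
    task.trace.append("feedback")  # feedback for data from the predecessor
    return True


def execute(task):
    """Execution phase: keep executing until the task has finished."""
    while task.work_left > 0:
        task.work_left -= 1
    task.trace.append("executed")


def synchronize(task, successor):
    """Synchronization phase: transfer data and mark the successor's input valid."""
    if not task.output_valid:
        return False
    task.trace.append("transfer")  # placeholder for the actual data transfer
    successor.input_ready = True
    return True


def run(task, successor):
    """Run one object through the three phases and report the outcome."""
    if not activate(task):
        return "waiting"
    execute(task)
    return "done" if synchronize(task, successor) else "blocked"


producer = TaskObject("t00", work_left=2, input_ready=True)
consumer = TaskObject("t11")
print(run(producer, consumer))   # done
print(consumer.input_ready)      # True
```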
CN2012101057224A 2012-04-11 2012-04-11 Fine granularity scientific computation parallel processing device on basis of heterogenous multi-core chip Pending CN103377032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101057224A CN103377032A (en) 2012-04-11 2012-04-11 Fine granularity scientific computation parallel processing device on basis of heterogenous multi-core chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012101057224A CN103377032A (en) 2012-04-11 2012-04-11 Fine granularity scientific computation parallel processing device on basis of heterogenous multi-core chip

Publications (1)

Publication Number Publication Date
CN103377032A true CN103377032A (en) 2013-10-30

Family

ID=49462201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101057224A Pending CN103377032A (en) 2012-04-11 2012-04-11 Fine granularity scientific computation parallel processing device on basis of heterogenous multi-core chip

Country Status (1)

Country Link
CN (1) CN103377032A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110058092A (en) * 2009-11-25 2011-06-01 한양대학교 산학협력단 Pipeline multi-core system and method for efficient task allocation in the system
US20110231856A1 (en) * 2010-03-16 2011-09-22 Samsung Electronics Co., Ltd System and method for dynamically managing tasks for data parallel processing on multi-core system
CN102207883A (en) * 2011-06-01 2011-10-05 华中科技大学 Transaction scheduling method of heterogeneous distributed real-time system
CN102360309A (en) * 2011-09-29 2012-02-22 中国科学技术大学苏州研究院 Scheduling system and scheduling execution method of multi-core heterogeneous system on chip

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIONGLI GU et al.: "An efficient scheduler of RTOS for multi/many-core system", COMPUTERS AND ELECTRICAL ENGINEERING *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699464B (en) * 2015-03-26 2017-12-26 中国人民解放军国防科学技术大学 A kind of instruction level parallelism dispatching method based on dependence grid
CN104699464A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Dependency mesh based instruction-level parallel scheduling method
CN109074701A (en) * 2016-03-18 2018-12-21 捷德货币技术有限责任公司 Device and method for assessing the sensing data of valuable document
CN108021431A (en) * 2016-11-04 2018-05-11 广东亿迅科技有限公司 Method and its system based on web data interactive maintenance Hive
CN107341053B (en) * 2017-06-01 2020-12-15 深圳大学 Heterogeneous multi-core programmable system and memory configuration and programming method of computing unit thereof
CN107341053A (en) * 2017-06-01 2017-11-10 深圳大学 The programmed method of heterogeneous polynuclear programmable system and its memory configurations and computing unit
CN107608784A (en) * 2017-06-28 2018-01-19 西安微电子技术研究所 A kind of multi-modal dispatching method of mass data flow under multi-core DSP
CN107608784B (en) * 2017-06-28 2020-06-09 西安微电子技术研究所 Multi-mode scheduling method for mass data stream under multi-core DSP
CN112416053A (en) * 2019-08-23 2021-02-26 北京希姆计算科技有限公司 Synchronizing signal generating circuit and chip of multi-core architecture and synchronizing method and device
WO2021036421A1 (en) * 2019-08-23 2021-03-04 北京希姆计算科技有限公司 Multi-core synchronization signal generation circuit, chip, and synchronization method and device
CN112416053B (en) * 2019-08-23 2023-11-17 北京希姆计算科技有限公司 Synchronous signal generating circuit, chip and synchronous method and device of multi-core architecture
CN110908797A (en) * 2019-11-07 2020-03-24 浪潮电子信息产业股份有限公司 Call request data processing method, device, equipment, storage medium and system
CN110908797B (en) * 2019-11-07 2023-09-15 浪潮电子信息产业股份有限公司 Call request data processing method, device, equipment, storage medium and system
WO2021218492A1 (en) * 2020-04-29 2021-11-04 北京希姆计算科技有限公司 Task allocation method and apparatus, electronic device, and computer readable storage medium
CN112035578A (en) * 2020-11-06 2020-12-04 北京谷数科技股份有限公司 Data parallel processing method and device based on many-core processor
CN113419119A (en) * 2021-06-02 2021-09-21 中电科思仪科技股份有限公司 Parallel phase noise measurement method based on multi-core DSP
CN114443139A (en) * 2022-01-27 2022-05-06 上海壁仞智能科技有限公司 Method, system, apparatus and medium for converting sequential code into parallel code
CN114443139B (en) * 2022-01-27 2023-06-30 上海壁仞智能科技有限公司 Method, system, device and medium for converting sequential code into parallel code
CN116257222A (en) * 2023-02-28 2023-06-13 中国人民解放军战略支援部队信息工程大学 Classical-quantum collaborative computing programming method and model based on task flow
CN116257222B (en) * 2023-02-28 2024-05-28 中国人民解放军战略支援部队信息工程大学 Classical-quantum collaborative computing programming method and model based on task flow
CN117825934A (en) * 2024-03-05 2024-04-05 上海励驰半导体有限公司 Test method, test system, electronic device and program product

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20131030