CN108920413B - Convolutional neural network multi-core parallel computing method facing GPDSP - Google Patents

Convolutional neural network multi-core parallel computing method facing GPDSP

Info

Publication number
CN108920413B
CN108920413B
Authority
CN
China
Prior art keywords
data
core
data buffer
dsp core
gpdsp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810689646.3A
Other languages
Chinese (zh)
Other versions
CN108920413A (en)
Inventor
刘仲
郭阳
扈啸
田希
陈海燕
陈跃跃
孙永节
王丽萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201810689646.3A priority Critical patent/CN108920413B/en
Publication of CN108920413A publication Critical patent/CN108920413A/en
Application granted granted Critical
Publication of CN108920413B publication Critical patent/CN108920413B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • G06F15/8061Details on data memory access
    • G06F15/8069Details on data memory access using a cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention discloses a GPDSP-oriented convolutional neural network multi-core parallel computing method, comprising the following steps: S1, a CPU core constructs two data buffers and a weight data buffer in off-chip memory; S2, the CPU core merges a specified number of convolution kernel data and stores the result in the weight data buffer; S3, the CPU core reads in a specified number of frames of image data to be computed, merges them, and transfers them to a free data buffer; S4, if the DSP cores are idle and the data of a data buffer is ready, the addresses are transferred to the DSP cores; S5, the DSP cores perform convolutional neural network computation in parallel; S6, the current computation result is output; S7, steps S3-S6 are repeated until all computation is completed. The invention can give full play to the performance and multi-level parallelism of the CPU core and the DSP cores in the GPDSP and realize efficient convolutional neural network computation.

Description

GPDSP-oriented convolutional neural network multi-core parallel computing method
Technical field
The present invention relates to the field of deep learning, and more particularly to a convolutional neural network multi-core parallel computing method for a GPDSP (General-Purpose Digital Signal Processor).
Background art
Deep learning models based on convolutional neural networks (Convolutional Neural Networks, CNN) have achieved remarkable results in many areas such as image recognition and classification, machine translation, automatic text processing, speech recognition, autonomous driving, and video analysis, and have become a research hotspot in each of these fields. A convolutional neural network is a deep feedforward neural network, usually composed of alternating convolutional layers, activation layers, and pooling layers, in which the convolutional layers perform feature extraction by convolving convolution kernels with the input features, thereby learning the features of each class. Convolutional-layer computation accounts for 90% of the computation of the whole network structure, so optimizing and accelerating the convolutional layers is the key to improving CNN computing performance.
To improve CNN performance, ever deeper and more complex network structures are continually being proposed, typified by LeNet, AlexNet, VGGNet, GoogLeNet, and so on. As network scale grows, the number of network parameters also grows, and large-scale CNN computation places ever higher demands on processor performance and data memory bandwidth. Industry currently relies mainly on high-performance GPUs to meet CNN computing requirements, or even designs dedicated CNN processors to accelerate CNN computation. However, the computing performance of high-performance GPUs is limited and their CNN computing efficiency still leaves room for improvement; in particular they cannot satisfy the performance requirements of large-scale CNNs, while dedicated CNN processors are costly and complicated to implement.
A GPDSP is a system with powerful computing capability, comprising a CPU core unit and a DSP core unit. The CPU core unit is mainly responsible for general transaction management, including storage management, file control, process scheduling, and interrupt handling, and provides full support for a general-purpose operating system. The DSP core unit contains several 64-bit vector processing arrays with powerful computing capability, to support compute-intensive tasks. The powerful computing capability of the GPDSP makes it a promising platform for accelerating CNN computation. However, the GPDSP is a heterogeneous multi-core processor comprising CPU cores and DSP cores, with a multi-level storage hierarchy including register files, scalar memories, on-chip vector array memories, on-chip shared storage arrays, and off-chip DDR memory, so existing CNN computing methods cannot be applied directly on a GPDSP. To realize CNN computation on a GPDSP, problems remain such as how to map the CNN computation onto the CPU core and the multiple DSP cores containing 64-bit vector processing arrays, and how to exploit the multi-level parallelism of the GPDSP. At present there is no effective solution for GPDSP-based CNN computation, and a GPDSP-oriented CNN multi-core parallel computing method is urgently needed, to improve CNN computing efficiency using the architectural features and multi-level parallelism of the GPDSP.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems of the prior art, the present invention provides a GPDSP-oriented convolutional neural network multi-core parallel computing method that is simple in principle and convenient to operate, gives full play to the performance and multi-level parallelism of the CPU core and DSP cores in a GPDSP, and achieves high computing efficiency and good performance.
To solve the above technical problems, the technical solution proposed by the present invention is:
A GPDSP-oriented convolutional neural network multi-core parallel computing method, the steps of which include:
S1. The CPU core in the GPDSP constructs, in off-chip DDR memory, two data buffers for storing input image data and one weight data buffer for storing convolution kernel data;
S2. The CPU core merges a specified number of convolution kernel data according to the number of images that SIMD can process in parallel, generates the convolution kernel data required for the computation, and stores it in the weight data buffer;
S3. The CPU core monitors the idle state of the two data buffers; if a data buffer is free, the CPU core reads in a specified number of frames of image data to be computed, merges them to generate the image data required for the computation, and transfers them to the free data buffer;
S4. The CPU core judges the idle state of each DSP core in the GPDSP and the data state of the two data buffers; if it is determined that each DSP core is idle and the data of a target data buffer is ready, the addresses of the target data buffer and the weight data buffer are transferred to each DSP core to start the DSP core computation;
S5. According to the received addresses, the DSP cores perform convolutional neural network computation in parallel on the image data in the target data buffer;
S6. The CPU core monitors the computation state of the two data buffers and the DSP cores, and when it detects that the data in the two data buffers has been processed and the DSP core computation has ended, outputs the current computation result;
S7. Steps S3 to S6 are repeated until the computation of all image data is completed.
As a further improvement of the present invention: the size of the data buffer in step S1 is configured according to the number of DSP cores p in the GPDSP, the number of input image channels c, and the number of images d that SIMD can process in parallel.
As a further improvement of the present invention: the size of the data buffer is configured as the storage capacity of n frames of input image data, where n = p*c*d and d = 64/w, w being the data bit width of the image elements to be computed.
As a further improvement of the present invention: step S1 further includes setting two data buffer state flags, each indicating whether the data of the corresponding data buffer is ready, and a DSP core computation state flag indicating whether the DSP cores are idle;
In step S4, the CPU core of the GPDSP judges whether the DSP cores are idle according to the DSP core computation state flag, and judges whether the data of the two data buffers is ready according to the data buffer state flags;
In step S6, when the CPU core detects that the data processing of a data buffer is finished, the corresponding data buffer state flag is set; when the CPU core detects that the DSP core computation has ended, the corresponding DSP core computation state flag is set.
As a further improvement of the present invention: the merging in step S2 specifically comprises merging d different convolution kernel data to generate the convolution kernel data required for the computation, where d is the number of images that SIMD in the GPDSP can process in parallel.
As a further improvement of the present invention: the merging in step S3 specifically comprises merging d frames of input image data to generate the image data required for the computation, where d is the number of images that SIMD in the GPDSP can process in parallel.
As a further improvement of the present invention: the specific steps of transferring to the free data buffer in step S3 are: the data buffer is divided in advance into p storage blocks in one-to-one correspondence with the DSP cores, where p is the number of DSP cores in the GPDSP; when the image data required for the computation is generated, the generated image data is transferred to the storage blocks in turn.
As a further improvement of the present invention, the steps of performing parallel convolutional neural network computation in step S5 comprise:
S51. The DSP cores process the n frames of image data in the target data buffer in parallel, each DSP core handling c*d images; each DSP core computes the start address of the image data it must process from the start address of the target data buffer and its own core ID;
S52. Each DSP core sets up an on-chip weight data buffer in its own vector storage array; DSP core 0 reads the convolution kernel data from the weight data buffer in off-chip DDR memory and broadcasts it to the on-chip weight data buffers in the vector storage arrays of all DSP cores; each DSP core reads the corresponding input image data in parallel, according to the start address computed in step S51, into its own scalar memory buffer;
S53. Each DSP core performs convolutional neural network computation in parallel on the input image data in its own scalar memory buffer and the convolution kernel data in the on-chip weight data buffer of its own vector storage array;
S54. After the convolution kernel data in the on-chip weight data buffer of the vector storage array of each DSP core has been fully used for computation, the cores wait at a synchronization point; DSP core 0 of the GPDSP judges whether all DSP cores have arrived, and if so goes to step S52 to continue the remaining convolutional neural network computation;
S55. Steps S52 to S54 are repeated until this round of convolutional neural network computation is completed.
As a further improvement of the present invention, the specific steps of performing parallel convolutional neural network computation in step S53 are:
S531. The W-bit convolution kernel data containing d convolution kernel data generated by the merging is expanded in turn into d W-bit convolution kernel data, where W is the bit width of the vector processing array in the GPDSP;
S532. The d W-bit convolution kernel data expanded in step S531 are used in turn with the W-bit image data containing d frames of image data generated by the merging to perform SIMD multiply-add computation, completing the current convolutional neural network computation.
Compared with the prior art, the advantages of the present invention are:
1) The GPDSP-oriented CNN multi-core parallel computing method of the present invention divides the CNN computing task according to the architectural features of the GPDSP. The CPU core runs the operating system and is responsible for reading in the input image data, for formatting and merging the image data and the weight data, and for scheduling the computing tasks and synchronizing status data; the DSP cores run the parallel CNN computation kernel program, continuously obtaining new computing tasks from the CPU core and reporting the results back to the CPU core. This fully exploits the general-purpose computing of the CPU core and the powerful vectorized computing capability of the DSP cores, and realizes close cooperation between the CPU core and the DSP cores, so that CNN multi-core parallel computation is carried out efficiently.
2) Based on the architectural features of the GPDSP, the method uses efficient CPU-DSP cooperative computing to map the CNN computation onto the CPU core and the multiple DSP cores of the GPDSP. It makes full use of the general-purpose computing of the CPU core and the powerful parallel computing and high-bandwidth vector data load capability of the DSP core vector processing arrays, gives full play to the multi-level parallelism of the GPDSP, and is applicable to efficient parallel computation of large-scale CNNs.
3) The method further sets two data buffer state flags, indicating whether the data of the two data buffers is ready, and a DSP core computation state flag, indicating whether the DSP cores are idle; the CPU core controls the execution of the computing tasks by monitoring these state flags, which further improves the efficient cooperation between the CPU core and the DSP cores and the efficiency of the CNN multi-core parallel computation.
Description of the drawings
Fig. 1 is a schematic diagram of the simplified memory access structure model of the GPDSP used in this embodiment.
Fig. 2 is a schematic flow diagram of the GPDSP-oriented CNN multi-core parallel computing method of this embodiment.
Fig. 3 is a schematic flow diagram of the specific implementation of the parallel CNN computation in this embodiment.
Fig. 4 is a schematic diagram of the principle of the SIMD parallel CNN computation in a specific embodiment of the present invention.
Fig. 5 is a schematic flow diagram of the implementation of the CNN multi-core parallel computing method in a specific embodiment of the present invention.
Specific embodiments
The invention is further described below with reference to the drawings and specific preferred embodiments, without thereby limiting the scope of protection of the invention.
The simplified memory access structure model of the GPDSP used in this embodiment is shown in Fig. 1. The system comprises a CPU core unit and a DSP core unit, where the DSP core unit includes several 64-bit vector processing array computing units with dedicated scalar memories and vector array memories, and the CPU core unit and the DSP core unit share a large-capacity on-chip shared storage and off-chip DDR memory. That is, the GPDSP contains multiple DSP cores with 64-bit vector processing arrays, which can process data in parallel via SIMD.
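Purely as an illustrative summary (the names below are ours, not from the patent), the storage levels this hierarchy comprises can be listed as:

```c
/* Illustrative summary of the GPDSP storage hierarchy described above;
 * the type and constant names are ours, not from the patent. */
typedef enum {
    MEM_REGISTER_FILE,   /* per-core register files                   */
    MEM_SCALAR,          /* dedicated scalar memory per DSP core      */
    MEM_VECTOR_ARRAY,    /* on-chip vector array memory per DSP core  */
    MEM_ON_CHIP_SHARED,  /* shared storage, CPU core + DSP cores      */
    MEM_OFF_CHIP_DDR     /* large-capacity off-chip DDR memory        */
} gpdsp_mem_level_t;
```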
As shown in Fig. 2, the steps of the GPDSP-oriented CNN multi-core parallel computing method of this embodiment include:
S1. The CPU core in the GPDSP constructs, in off-chip DDR memory, two data buffers (input1 and input2) for storing input image data and one weight data buffer for storing convolution kernel data.
The size of the data buffers is configured according to the number p of DSP cores with 64-bit vector processing arrays in the GPDSP, the number c of input image channels, and the number d of images that SIMD can process in parallel.
In a concrete embodiment, the size of each data buffer is configured as the storage capacity of n frames of input image data, with n = p*c*d, where d = 64/w and w is the data bit width of the image elements to be computed; w = 64, 32, 16, 8, 4, or 2, for image elements of 64, 32, 16, 8, 4, or 2 bits respectively. From the bit width w of the image elements to be computed, the number of images d that SIMD can process in parallel is determined as d = 64/w, and n = p*c*d follows.
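As a minimal sketch of this sizing rule (the type and function names are our own, not from the patent), the buffer dimensions follow directly from p, c, and w:

```c
#include <stddef.h>

/* Minimal sketch of the buffer sizing in step S1; names are ours. */
typedef struct {
    int p;  /* number of DSP cores in the GPDSP         */
    int c;  /* number of input image channels           */
    int w;  /* bit width of one image element (64 .. 2) */
} gpdsp_config_t;

/* d = 64 / w: images one 64-bit SIMD word can hold. */
static int simd_images(const gpdsp_config_t *cfg) {
    return 64 / cfg->w;
}

/* n = p * c * d: image frames one data buffer must hold. */
static int buffer_frames(const gpdsp_config_t *cfg) {
    return cfg->p * cfg->c * simd_images(cfg);
}

/* Buffer size in bytes for frames of rows * cols elements; total bits
 * are divided by 8 so sub-byte widths (w = 4, 2) pack correctly. */
static size_t buffer_bytes(const gpdsp_config_t *cfg, int rows, int cols) {
    return ((size_t)buffer_frames(cfg) * rows * cols * cfg->w) / 8;
}
```

For example, with p = 8 cores, c = 3 channels, and w = 32 (so d = 2), each buffer holds n = 48 frames.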
S2. The CPU core merges a specified number of convolution kernel data according to the number of images d that SIMD can process in parallel, generates the convolution kernel data required for the computation, and stores it in the weight data buffer.
The merging specifically means merging d different convolution kernel data, according to the bit width of the image data to be computed, to generate the convolution kernel data required for the computation, where d is the number of images that SIMD in the GPDSP can process in parallel. For example, if d = 2 and the image data to be computed is 32-bit, two different convolution kernels are merged, the high 32 bits and the low 32 bits of each 64-bit word each holding one 32-bit kernel element; if d = 4 and the image data to be computed is 16-bit, each 64-bit word holds four 16-bit kernel elements; and so on.
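A minimal sketch of the d = 2 merge described above follows; the function names, and the choice of which kernel occupies the high half, are ours and not fixed by the patent:

```c
#include <stdint.h>

/* Sketch of the d = 2 kernel merge: one 32-bit element from each of two
 * kernels packed into a 64-bit word. Which kernel occupies the high half
 * is our choice; the patent only requires one element per half. */
static uint64_t merge_kernel_pair(uint32_t elem_a, uint32_t elem_b) {
    return ((uint64_t)elem_b << 32) | (uint64_t)elem_a;
}

/* Merge two kernels of `len` elements each into `out[len]`. */
static void merge_kernels_d2(const uint32_t *ka, const uint32_t *kb,
                             uint64_t *out, int len) {
    for (int i = 0; i < len; i++)
        out[i] = merge_kernel_pair(ka[i], kb[i]);
}
```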
S3. The CPU core monitors the idle state of the two data buffers; if a data buffer is free, the CPU core reads in a specified number of frames of image data to be computed, merges them to generate the image data required for the computation, and transfers them to the free data buffer. The image data may be input image data from an externally connected camera or image data from another data source.
Since the GPDSP contains multiple DSP cores with 64-bit vector processing arrays, data can be processed in parallel via SIMD. The merging specifically means merging d frames of input image data, according to the bit width of the image data to be computed, to generate the image data required for the computation. For example, if the image data to be computed is 32-bit, two images are merged, the data of one image being stored in the high 32 bits and the data of the other in the low 32 bits of each 64-bit word; if the image data to be computed is 16-bit, four frames of image data are merged; and so on.
The specific steps of transferring the image data to the free data buffer are: the data buffer is divided in advance, according to the number and order of the DSP cores, into p adjacent storage blocks in one-to-one correspondence with the DSP cores, where p is the number of DSP cores in the GPDSP; when the image data required for the computation is generated, the generated image data is transferred to the storage blocks in turn. That is, among the p storage blocks, the data of one channel is transferred to each storage block in turn, in channel order, each transfer carrying the d merged frames of image data.
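A hypothetical sketch of the resulting address layout (the names are ours): each DSP core owns one contiguous block of c*d frames, so the start address each core derives in step S51 below follows directly from its core ID:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-core addressing for the p storage blocks of one data
 * buffer; names are illustrative, not from the patent. */
static uint64_t core_block_addr(uint64_t buf_base,    /* buffer start in DDR */
                                size_t   frame_bytes, /* one image frame     */
                                int c, int d, int core_id) {
    size_t block_bytes = (size_t)c * d * frame_bytes; /* per-core share */
    return buf_base + (uint64_t)core_id * block_bytes;
}
```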
S4. The CPU core judges the idle state of each DSP core in the GPDSP and the data state of the two data buffers; if it is determined that each DSP core is idle and the data of a target data buffer is ready, the addresses of the target data buffer and the weight data buffer are transferred to each DSP core, starting the DSP cores' CNN computation on the image data in that buffer.
S5. According to the received addresses, the DSP cores perform CNN computation in parallel on the image data in the target data buffer.
As shown in Fig. 3, the steps of the parallel CNN computation in this embodiment include:
S51. The DSP cores process the n frames of image data in the target data buffer in parallel, each DSP core handling c*d images; each DSP core computes the start address of the image data it must process from the start address of the target data buffer and its own core ID, as in the addressing sketch above;
S52. Each DSP core sets up an on-chip weight data buffer in its own vector storage array; DSP core 0 reads the convolution kernel data from the weight data buffer in off-chip DDR memory and broadcasts it to the on-chip weight data buffers in the vector storage arrays of all DSP cores; each DSP core reads the corresponding input image data in parallel, according to the start address computed in step S51, into its own scalar memory buffer;
S53. Each DSP core performs CNN computation in parallel on the input image data in its own scalar memory buffer and the convolution kernel data in the on-chip weight data buffer of its own vector storage array;
S54. After the convolution kernel data in the on-chip weight data buffer of the vector storage array of each DSP core has been fully used for computation, the cores wait at a synchronization point (a sketch of such a barrier follows step S55); DSP core 0 of the GPDSP judges whether all DSP cores have arrived, and if so goes to step S52 to continue the remaining CNN computation;
S55. Steps S52 to S54 are repeated until this round of CNN computation is completed.
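As referenced in step S54, below is a minimal sketch of such a barrier, assuming a shared counter visible to all DSP cores; the C11 atomics only stand in for the GPDSP's actual synchronization primitives:

```c
#include <stdatomic.h>

/* Sketch of the S54 synchronization: every core increments a shared
 * counter; DSP core 0 waits until all cores have arrived, then releases
 * the barrier by advancing a generation count. Assumes a counter in
 * memory visible to all cores; not the GPDSP's actual primitive. */
static atomic_int arrived;
static atomic_int generation;

void barrier_wait(int core_id, int num_cores) {
    int gen = atomic_load(&generation);
    atomic_fetch_add(&arrived, 1);
    if (core_id == 0) {
        /* DSP core 0 judges whether all DSP cores have reached S54. */
        while (atomic_load(&arrived) < num_cores) { /* spin */ }
        atomic_store(&arrived, 0);
        atomic_fetch_add(&generation, 1);  /* release the other cores */
    } else {
        while (atomic_load(&generation) == gen) { /* spin */ }
    }
}
```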
The specific steps of the parallel CNN computation in step S53 are:
S531. The 64-bit convolution kernel data containing d kernel elements generated by the merging is expanded in turn into d 64-bit convolution kernel data. For example, when d = 2, the low 32 bits are expanded into a 64-bit kernel value A and the high 32 bits into another 64-bit kernel value B, where the high 32 bits of A are identical to the low 32 bits before expansion and the low 32 bits of B are identical to the high 32 bits before expansion, i.e., each kernel element is broadcast into both halves of its word; other values of d follow the same principle.
S532. The d 64-bit convolution kernel data expanded in step S531 are used in turn with the 64-bit image data containing d frames of image data generated by the merging to perform SIMD multiply-add computation, completing the current CNN computation.
As shown in Fig. 4, in a concrete embodiment the present invention performs SIMD parallel CNN computation with 2 convolution kernels and 2 frames of image data as follows:
Step 1: the 64-bit convolution kernel data R containing 2 kernel elements is expanded in turn into two 64-bit kernel values A and B;
Step 2: the two 64-bit kernel values A and B generated above are used in turn with the 64-bit image data D containing 2 frames of image data to perform SIMD multiply-add computation, i.e., A, D and E perform a SIMD multiply-add computation, and B, D and F perform a SIMD multiply-add computation.
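The following scalar sketch mimics this d = 2 computation, reusing the names R, D, A, B, E, and F from Fig. 4; the two 32-bit lanes of each 64-bit word are updated independently, with lane arithmetic wrapping modulo 2^32 as a stand-in for the GPDSP's SIMD multiply-accumulate instruction:

```c
#include <stdint.h>

/* Scalar simulation of the d = 2 SIMD multiply-add of Fig. 4. */

/* S531: expand packed kernel word R into A (low element broadcast into
 * both halves) and B (high element broadcast into both halves). */
static void expand_kernel(uint64_t R, uint64_t *A, uint64_t *B) {
    uint32_t lo = (uint32_t)R;
    uint32_t hi = (uint32_t)(R >> 32);
    *A = ((uint64_t)lo << 32) | lo;
    *B = ((uint64_t)hi << 32) | hi;
}

/* Lane-wise acc += k * x on the two 32-bit lanes of a 64-bit word. */
static uint64_t simd_madd2(uint64_t acc, uint64_t k, uint64_t x) {
    uint32_t lo = (uint32_t)acc + (uint32_t)k * (uint32_t)x;
    uint32_t hi = (uint32_t)(acc >> 32)
                + (uint32_t)(k >> 32) * (uint32_t)(x >> 32);
    return ((uint64_t)hi << 32) | lo;
}

/* S532 / Fig. 4: D holds one pixel from each of the two images; E and F
 * accumulate the results for kernel values A and B respectively. */
static void conv_step(uint64_t R, uint64_t D, uint64_t *E, uint64_t *F) {
    uint64_t A, B;
    expand_kernel(R, &A, &B);
    *E = simd_madd2(*E, A, D);
    *F = simd_madd2(*F, B, D);
}
```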
S6. The CPU core monitors the computation state of the two data buffers and the DSP cores; when it detects that the data in the two data buffers has been processed and the DSP core computation has ended, it outputs the current computation result.
S7. Steps S3 to S6 are repeated until the computation of all image data is completed.
The above method of this embodiment divides the CNN computing task according to the architectural features of the GPDSP. The CPU core runs the operating system and is responsible for reading in the input image data, for formatting and merging the image data and the weight data, and for scheduling the computing tasks and synchronizing status data; the DSP cores run the parallel CNN computation kernel program, continuously obtaining new computing tasks from the CPU core and reporting the results to the CPU core. This fully exploits the general-purpose computing of the CPU core and the powerful vectorized computing capability of the DSP cores, and realizes close cooperation between the CPU core and the DSP cores, so that CNN multi-core parallel computation is carried out efficiently.
In a concrete embodiment of the present invention, step S1 further includes setting two data buffer state flags (flag1 and flag2) to indicate whether the data of the two data buffers (input1 and input2) is ready, and a DSP core computation state flag (flag3) to indicate whether the DSP cores are idle. In step S4 the CPU core of the GPDSP judges whether the DSP cores are idle according to the DSP core computation state flag, and judges whether the data of the two data buffers is ready according to the data buffer state flags; in step S6, when the CPU core detects that the data processing of a data buffer is finished, the corresponding data buffer state flag is set, and when it detects that the DSP core computation has ended, the corresponding DSP core computation state flag is set. Configuring these state flags further improves the computing efficiency. As shown in Fig. 5, the detailed steps of the CNN multi-core parallel computation with the state flags configured are as follows, each step following the same principles as above:
Step 1: the CPU core of the GPDSP constructs two data buffers input1 and input2 and one weight data buffer in off-chip DDR memory, the size of each data buffer being the storage capacity of n frames of input images; at the same time, two state flags flag1 and flag2 are configured to indicate whether the data of the two data buffers input1 and input2 is ready, and a DSP core computation state flag flag3 to indicate whether the DSP cores are idle.
Step 2: the CPU core of the GPDSP merges and formats d different convolution kernel data according to the number of images d that SIMD can process in parallel, generates the convolution kernel data required for this computation, and stores it in the weight data buffer.
Step 3: the CPU core of the GPDSP monitors the data buffer flags (flag1 and flag2); if a data buffer is free, the CPU core reads in d frames of image data, merges and formats them, generates the image data required for this computation, and transfers it in order to the free data buffer; when the data buffer is full, the corresponding flag is set to 1.
Step 4: the CPU core of the GPDSP judges whether the DSP cores are idle according to the DSP core computation state flag (flag3), and whether the data buffer data is ready according to the data buffer state flags (flag1 and flag2); if the DSP cores are idle and the data of a data buffer is ready, the data buffer address and the weight data buffer address are transferred to the DSP cores, starting the DSP cores' CNN computation on that buffer's image data.
Step 5: the multiple DSP cores of the GPDSP perform parallel CNN computation on the n frames of image data in the data buffer; when the DSP cores complete the input-layer computation of this CNN pass, the data-buffer data-processing-finished mark is set; the DSP cores continue the subsequent CNN computation, and when the whole of this CNN computation is completed, the DSP-core computation-finished mark is set.
Step 6: when the CPU core of the GPDSP detects the data-buffer data-processing-finished mark, it sets the corresponding state flag to 0; when it detects the DSP-core computation-finished mark, it sets the corresponding DSP core computation state flag to 0; the CPU core then outputs the computation result.
Step 7: steps 3 to 6 are repeated until all computation is completed.
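A hypothetical CPU-side scheduling loop tying steps 3 to 6 together is sketched below. The flag variables mirror flag1/flag2/flag3 of this embodiment; the helper functions are assumptions, and letting the DSP side clear the flags directly (standing in for the finish marks of steps 5 and 6) is our simplification:

```c
#include <stdbool.h>

/* Hypothetical CPU-side scheduler; flag1/flag2/flag3 come from the
 * embodiment above, everything else is our sketch. Flags are cleared by
 * the DSP side when a buffer is consumed / the computation ends. */
typedef struct {
    volatile int flag1, flag2;   /* input1 / input2 data ready      */
    volatile int flag3;          /* DSP cores busy (1) or idle (0)  */
    void *input[2];              /* the two data buffers in DDR     */
    void *weights;               /* the weight data buffer in DDR   */
} sched_state_t;

extern bool more_images(void);                /* assumed input source   */
extern void fill_buffer(void *buf);           /* step 3: merge + copy   */
extern void start_dsp(void *buf, void *wts);  /* step 4: pass addresses */
extern void emit_result(void);                /* step 6: output result  */

void cpu_schedule(sched_state_t *s) {
    int next = 0;                /* which buffer to fill next       */
    int started = 0;             /* rounds handed to the DSP cores  */
    while (more_images()) {
        volatile int *ready = (next == 0) ? &s->flag1 : &s->flag2;

        /* Step 3: wait until this buffer is free, then fill it. */
        while (*ready) { /* spin: cleared when the buffer is consumed */ }
        fill_buffer(s->input[next]);
        *ready = 1;

        /* Steps 4/6: wait until the cores are idle; at that point the
         * previous round, if any, is finished and its result goes out. */
        while (s->flag3) { /* spin: cleared when computation ends */ }
        if (started > 0) emit_result();

        s->flag3 = 1;
        start_dsp(s->input[next], s->weights);
        started++;
        next ^= 1;               /* alternate input1 and input2 */
    }
    while (s->flag3) { /* drain the final round */ }
    if (started > 0) emit_result();
}
```

The double buffering lets the CPU core fill one buffer while the DSP cores compute on the other, which is the overlap the two-buffer design above is meant to achieve.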
The above are merely preferred embodiments of the present invention and do not limit the present invention in any form. Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any simple modifications, equivalent changes, and variations made to the above embodiments in accordance with the technical spirit of the present invention, without departing from the content of the technical solution of the present invention, fall within the scope of protection of the technical solution of the present invention.

Claims (9)

1. A GPDSP-oriented convolutional neural network multi-core parallel computing method, characterized in that the steps comprise:
S1. The CPU core in the GPDSP constructs, in off-chip DDR memory, two data buffers for storing input image data and one weight data buffer for storing convolution kernel data;
S2. The CPU core merges a specified number of convolution kernel data according to the number of images that SIMD can process in parallel, generates the convolution kernel data required for the computation, and stores it in the weight data buffer;
S3. The CPU core monitors the idle state of the two data buffers; if a data buffer is free, the CPU core reads in a specified number of frames of image data to be computed, merges them to generate the image data required for the computation, and transfers them to the free data buffer;
S4. The CPU core judges the idle state of each DSP core in the GPDSP and the data state of the two data buffers; if it is determined that each DSP core is idle and the data of a target data buffer is ready, the addresses of the target data buffer and the weight data buffer are transferred to each DSP core to start the DSP core computation;
S5. According to the received addresses, the DSP cores perform convolutional neural network computation in parallel on the image data in the target data buffer;
S6. The CPU core monitors the computation state of the two data buffers and the DSP cores, and when it detects that the data in the two data buffers has been processed and the DSP core computation has ended, outputs the current computation result;
S7. Steps S3 to S6 are repeated until the computation of all image data is completed.
2. The GPDSP-oriented convolutional neural network multi-core parallel computing method according to claim 1, characterized in that the size of the data buffer in step S1 is configured according to the number of DSP cores p in the GPDSP, the number of input image channels c, and the number of images d that SIMD can process in parallel.
3. The GPDSP-oriented convolutional neural network multi-core parallel computing method according to claim 2, characterized in that the size of the data buffer is configured as the storage capacity of n frames of input image data, where n = p*c*d, d = 64/w, and w is the data bit width of the image elements to be computed.
4. The GPDSP-oriented convolutional neural network multi-core parallel computing method according to claim 1, 2 or 3, characterized in that step S1 further comprises setting two data buffer state flags, each indicating whether the data of the corresponding data buffer is ready, and a DSP core computation state flag indicating whether the DSP cores are idle;
in step S4, the CPU core of the GPDSP judges whether the DSP cores are idle according to the DSP core computation state flag, and judges whether the data of the two data buffers is ready according to the data buffer state flags;
in step S6, when the CPU core detects that the data processing of a data buffer is finished, the corresponding data buffer state flag is set; when the CPU core detects that the DSP core computation has ended, the corresponding DSP core computation state flag is set.
5. The GPDSP-oriented convolutional neural network multi-core parallel computing method according to claim 1, 2 or 3, characterized in that the merging in step S2 specifically comprises merging d different convolution kernel data to generate the convolution kernel data required for the computation, where d is the number of images that SIMD in the GPDSP can process in parallel.
6. The GPDSP-oriented convolutional neural network multi-core parallel computing method according to claim 1, 2 or 3, characterized in that the merging in step S3 specifically comprises merging d frames of input image data to generate the image data required for the computation, where d is the number of images that SIMD in the GPDSP can process in parallel.
7. The GPDSP-oriented convolutional neural network multi-core parallel computing method according to claim 6, characterized in that the specific steps of transferring to the free data buffer in step S3 are: the data buffer is divided in advance into p storage blocks in one-to-one correspondence with the DSP cores, where p is the number of DSP cores in the GPDSP; when the image data required for the computation is generated, the generated image data is transferred to the storage blocks in turn.
8. The GPDSP-oriented convolutional neural network multi-core parallel computing method according to claim 2 or 3, characterized in that the steps of performing parallel convolutional neural network computation in step S5 comprise:
S51. The DSP cores process the n frames of image data in the target data buffer in parallel, each DSP core handling c*d images; each DSP core computes the start address of the image data it must process from the start address of the target data buffer and its own core ID;
S52. Each DSP core sets up an on-chip weight data buffer in its own vector storage array; DSP core 0 reads the convolution kernel data from the weight data buffer in off-chip DDR memory and broadcasts it to the on-chip weight data buffers in the vector storage arrays of all DSP cores; each DSP core reads the corresponding input image data in parallel, according to the start address computed in step S51, into its own scalar memory buffer;
S53. Each DSP core performs convolutional neural network computation in parallel on the input image data in its own scalar memory buffer and the convolution kernel data in the on-chip weight data buffer of its own vector storage array;
S54. After the convolution kernel data in the on-chip weight data buffer of the vector storage array of each DSP core has been fully used for computation, the cores wait at a synchronization point; DSP core 0 of the GPDSP judges whether all DSP cores have arrived, and if so goes to step S52 to continue the remaining convolutional neural network computation;
S55. Steps S52 to S54 are repeated until this round of convolutional neural network computation is completed.
9. The GPDSP-oriented convolutional neural network multi-core parallel computing method according to claim 8, characterized in that the specific steps of performing parallel convolutional neural network computation in step S53 are:
S531. The W-bit convolution kernel data containing d convolution kernel data generated by the merging is expanded in turn into d W-bit convolution kernel data, where W is the bit width of the vector processing array in the GPDSP;
S532. The d W-bit convolution kernel data expanded in step S531 are used in turn with the W-bit image data containing d frames of image data generated by the merging to perform SIMD multiply-add computation, completing the current convolutional neural network computation.
CN201810689646.3A 2018-06-28 2018-06-28 Convolutional neural network multi-core parallel computing method facing GPDSP Active CN108920413B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810689646.3A CN108920413B (en) 2018-06-28 2018-06-28 Convolutional neural network multi-core parallel computing method facing GPDSP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810689646.3A CN108920413B (en) 2018-06-28 2018-06-28 Convolutional neural network multi-core parallel computing method facing GPDSP

Publications (2)

Publication Number Publication Date
CN108920413A CN108920413A (en) 2018-11-30
CN108920413B true CN108920413B (en) 2019-08-09

Family

ID=64421783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810689646.3A Active CN108920413B (en) 2018-06-28 2018-06-28 Convolutional neural network multi-core parallel computing method facing GPDSP

Country Status (1)

Country Link
CN (1) CN108920413B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858622B (en) * 2019-01-31 2021-03-02 瑞芯微电子股份有限公司 Data handling circuit and method for deep learning neural network
CN109886395B (en) * 2019-03-06 2020-11-24 上海熠知电子科技有限公司 Data reading method for multi-core image processing convolutional neural network
CN109976893A * 2019-03-29 2019-07-05 北京润科通用技术有限公司 Sequential control method and device for a real-time operating system
CN109858472B (en) * 2019-04-09 2023-08-04 武汉领普科技有限公司 Embedded real-time humanoid detection method and device
CN110489356B (en) * 2019-08-06 2022-02-22 上海商汤智能科技有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN113095503B (en) * 2020-01-09 2024-05-03 北京君正集成电路股份有限公司 System for realizing high efficiency of detection model
CN113095471B (en) * 2020-01-09 2024-05-07 北京君正集成电路股份有限公司 Method for improving efficiency of detection model
CN113111995A (en) * 2020-01-09 2021-07-13 北京君正集成电路股份有限公司 Method for shortening model reasoning and model post-processing operation time
CN111897579B (en) * 2020-08-18 2024-01-30 腾讯科技(深圳)有限公司 Image data processing method, device, computer equipment and storage medium
CN112068955B (en) * 2020-08-21 2023-10-27 北京科技大学 Communication optimization method in heterogeneous multi-core platform processor and electronic equipment
CN112101284A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Image recognition method, training method, device and system of image recognition model
CN113469350B (en) * 2021-07-07 2023-03-24 武汉魅瞳科技有限公司 Deep convolutional neural network acceleration method and system suitable for NPU

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709462A (en) * 2016-12-29 2017-05-24 天津中科智能识别产业技术研究院有限公司 Indoor positioning method and device
CN107657581A (en) * 2017-09-28 2018-02-02 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6925641B1 (en) * 2000-02-04 2005-08-02 Xronix Communications, Inc. Real time DSP load management system
CN102591657B (en) * 2011-12-29 2014-06-25 东南大学 Graphical user interface (GUI) system achieving method based on collaboration mechanism of central processing unit (CPU) and digital signal processor (DSP)
KR101834195B1 (en) * 2012-03-15 2018-04-13 삼성전자주식회사 System and Method for Balancing Load on Multi-core Architecture
CN104615584B * 2015-02-06 2017-12-22 中国人民解放军国防科学技术大学 Vectorized method for solving large-scale triangular linear equation systems for GPDSP
CN106228238B (en) * 2016-07-27 2019-03-22 中国科学技术大学苏州研究院 Accelerate the method and system of deep learning algorithm on field programmable gate array platform
CN106959937B * 2017-03-30 2019-03-29 中国人民解放军国防科学技术大学 Vectorized implementation method of convolution matrix for GPDSP
CN107301456B (en) * 2017-05-26 2020-05-12 中国人民解放军国防科学技术大学 Deep neural network multi-core acceleration implementation method based on vector processor
CN107862378B (en) * 2017-12-06 2020-04-24 芯原微电子(上海)股份有限公司 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal
CN108205702B (en) * 2017-12-29 2020-12-01 中国人民解放军国防科技大学 Parallel processing method for multi-input multi-output matrix convolution
CN107885700B (en) * 2017-12-29 2021-05-14 中国人民解放军国防科技大学 Multi-core implementation method for large-scale matrix convolution

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709462A (en) * 2016-12-29 2017-05-24 天津中科智能识别产业技术研究院有限公司 Indoor positioning method and device
CN107657581A (en) * 2017-09-28 2018-02-02 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An efficient SIMD parallel memory structure for radix-2 FFT algorithms; Chen Haiyan et al.; Acta Electronica Sinica; 2016-02-29 (No. 2); 241-246 *

Also Published As

Publication number Publication date
CN108920413A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN108920413B (en) Convolutional neural network multi-core parallel computing method facing GPDSP
JP7166389B2 (en) Systems and integrated circuits for bit-serial computation in neural networks
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
TWI749249B (en) Chip device, chip, intelligent device and operation method of the neural network
CN107657581B (en) Convolutional neural network CNN hardware accelerator and acceleration method
CN109543832B (en) Computing device and board card
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN107679620A (en) Artificial neural network processing unit
CN107704922A (en) Artificial neural network processing unit
CN108805797A (en) Optimized computing hardware for machine learning operation
CN109558937A Neural network system and operating method of neural network system
CN101398753A (en) System, method and computer program product for performing a scan operation
CN108388537A Convolutional neural network accelerator and method
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN101717817A (en) Method for accelerating RNA secondary structure prediction based on stochastic context-free grammar
US11972348B2 (en) Texture unit circuit in neural network processor
WO2022226721A1 (en) Matrix multiplier and method for controlling matrix multiplier
CN110362780A Big data tensor canonical decomposition computing method based on the Sunway many-core processor
CN115880132A (en) Graphics processor, matrix multiplication task processing method, device and storage medium
WO2020253383A1 (en) Streaming data processing method based on many-core processor, and computing device
CN109711540B (en) Computing device and board card
CN115221102B (en) Method for optimizing convolution operation of system-on-chip and related product
CN116888591A (en) Matrix multiplier, matrix calculation method and related equipment
CN103871032A (en) Image enhancement method for Wallis filter based on GPU (Graphics Processing Unit)
CN108090865A (en) The in-orbit real-time streaming processing method of optical satellite remote sensing image and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant