CN109753319A - A kind of device and Related product of release dynamics chained library - Google Patents
- Publication number
- CN109753319A (application CN201811629632.9A)
- Authority
- CN
- China
- Prior art keywords
- circuit
- processing circuit
- thread
- dynamic link
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Stored Programmes (AREA)
- Advance Control (AREA)
Abstract
This application provides a device for releasing a dynamic link library, and a related product. The device is applied to a processor unit. The processor unit configures a first process, which includes a first thread and a second thread. The first thread is called to load the Caffe dynamic link library, use the Caffe dynamic link library, and perform a destruction operation on a first object. After the first thread has completed the destruction operation on the first object, the second thread is called to release the Caffe dynamic link library from memory. In this way, both the loading and the release of the Caffe dynamic link library can be realized, avoiding persistent memory occupation and thereby improving operation speed.
Description
Technical field
This application relates to the technical field of information processing, and in particular to a device for releasing a dynamic link library and a related product.
Background technique
With the rapid development of software and hardware technology, dynamic link library files (Dynamic Link Library, DLL) are widely used. Taking the dynamic link library of the convolutional neural network framework Caffe (Convolutional Architecture for Fast Feature Embedding) as an example: the Caffe dynamic link library depends on third-party dynamic link libraries and is mainly used in video and image processing applications. During loading it occupies a large amount of memory on the terminal.
In the prior art, an application program links the Caffe dynamic link library directly. When the application starts, the processor loads the Caffe dynamic link library into memory, and only when the application exits can the processor unload it. This means that as long as the application is running, the Caffe dynamic link library always occupies terminal memory. Since the memory on a terminal is limited, this effectively reduces the processing speed of the terminal. How to realize both the loading and the release of the Caffe dynamic link library, so that terminal memory is not permanently occupied by it while a program runs, is therefore a research hotspot for those skilled in the art.
Summary of the invention
The embodiments of the present application provide a device for releasing a dynamic link library, and a related product. While an application program is running, the Caffe dynamic link library can be both loaded and released, so that terminal memory is not permanently occupied by the Caffe dynamic link library and operation speed is improved.
In a first aspect, an embodiment of the present application provides a device for releasing a dynamic link library. The device is applied to a processor unit. The processor unit is configured to receive a first load request for a first dynamic link library file, where the first dynamic link library file is used to realize a first function of a first application program. The processor unit is configured to configure a first process; the first process includes a first thread and a second thread.
The processor unit is further configured to call the first thread to load the Caffe dynamic link library into memory according to the first load request, and to create a first object.
The processor unit is further configured to call the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, to perform destruction on the first object.
The processor unit is further configured to, after the first thread has performed destruction on the first object, call the second thread to release the Caffe dynamic link library from the memory.
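The first-aspect flow can be sketched as a two-thread sequence. The following is a minimal illustrative simulation, not the patented implementation: one thread stands in for the first thread (load, use, destruct) and a second for the second thread (release), with a shared list recording the order of events. All names are hypothetical.

```python
import threading

events = []                     # records the order of operations
destructed = threading.Event()  # signals that the first object was destructed

def first_thread():
    events.append("load Caffe dynamic link library")  # dlopen-like step
    events.append("create first object")
    events.append("execute first function")
    events.append("destruct first object")            # destructor runs here
    destructed.set()                                  # only now may release happen

def second_thread():
    destructed.wait()  # block until destruction has finished
    events.append("release Caffe dynamic link library")  # dlclose-like step

t1 = threading.Thread(target=first_thread)
t2 = threading.Thread(target=second_thread)
t2.start()
t1.start()
t1.join()
t2.join()

print(events[-1])  # release happens strictly last
```

The event on which the second thread waits is what guarantees that the release step can never overtake the destruction step, regardless of thread scheduling.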
In a second aspect, an embodiment of the present application provides a machine learning arithmetic unit. The machine learning arithmetic unit includes the device for releasing a dynamic link library described in the first aspect, where the device includes one or more MLU computing units. The machine learning arithmetic unit is configured to obtain input data and control information for an operation from other processing devices, to execute a specified machine learning operation, and to pass the execution result to the other processing devices through an I/O interface.
When the machine learning arithmetic unit includes multiple MLU computing units, the MLU computing units may be connected to one another through a specific structure and transmit data between themselves.
Specifically, the multiple MLU computing units may interconnect and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations. The multiple MLU computing units may share the same control system or have their own control systems; they may share memory or have their own memories; and their interconnection may be any interconnection topology.
In a third aspect, an embodiment of the present application provides a combined processing device. The combined processing device includes the machine learning arithmetic unit described in the second aspect, a universal interconnect interface, and other processing devices. The machine learning arithmetic unit interacts with the other processing devices to jointly complete the operation specified by the user. The combined processing device may also include a storage device, which is connected to the machine learning arithmetic unit and the other processing devices respectively and is used to save data of the machine learning arithmetic unit and the other processing devices.
In a fourth aspect, an embodiment of the present application provides a neural network chip. The chip includes the device for releasing a dynamic link library described in the first aspect, the machine learning arithmetic unit described in the second aspect, or the combined processing device described in the third aspect.
In a fifth aspect, an embodiment of the present application provides a neural network chip package structure, which includes the neural network chip described in the fourth aspect.
In a sixth aspect, an embodiment of the present application provides a board, which includes the neural network chip package structure described in the fifth aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device, which includes the neural network chip described in the fourth aspect or the board described in the sixth aspect.
In an eighth aspect, an embodiment of the present application further provides a method for releasing a dynamic link library. The method is applied to a device for releasing a dynamic link library, and the device includes a processor unit. The method includes:
the processor unit receives a first load request for a first dynamic link library file, where the first dynamic link library file is used to realize a first function of a first application program; the processor unit configures a first process, and the first process includes a first thread and a second thread;
the processor unit calls the first thread to load the Caffe dynamic link library into memory according to the first load request, and creates a first object;
the processor unit calls the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, performs destruction on the first object;
after the first thread has performed destruction on the first object, the processor unit calls the second thread to release the Caffe dynamic link library from the memory.
In some embodiments, the electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, camera, server, cloud server, webcam, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
In some embodiments, the vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, or range hood; the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound instrument, and/or an electrocardiograph.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a structural schematic diagram of a device for releasing a dynamic link library provided by an embodiment of the present application;
Fig. 2 is a structure chart of an MLU computing unit provided by an embodiment of the present application;
Fig. 3 is a structure chart of a main processing circuit provided by an embodiment of the present application;
Fig. 4 is a structure chart of another MLU computing unit provided by an embodiment of the present application;
Fig. 5 is a structure chart of yet another MLU computing unit provided by an embodiment of the present application;
Fig. 6 is a structural schematic diagram of a tree module provided by an embodiment of the present application;
Fig. 7 is a structure chart of a further MLU computing unit provided by an embodiment of the present application;
Fig. 8 is a structure chart of still another MLU computing unit provided by an embodiment of the present application;
Fig. 9 is a structure chart of an MLU computing unit provided by another embodiment of the present application;
Fig. 10 is a structure chart of a combined processing device provided by an embodiment of the present application;
Fig. 11 is a structure chart of another combined processing device provided by an embodiment of the present application;
Fig. 12 is a structural schematic diagram of a board provided by an embodiment of the present application.
Specific embodiment
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The terms "first", "second", "third", "fourth", and so on in the description, claims, and drawings of the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the description do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
First, the device for releasing a dynamic link library used in this application is introduced. Referring to Fig. 1, a device for releasing a dynamic link library is provided. The device includes a processor unit 13.
The processor unit 13 is configured to receive a first load request for a first dynamic link library file, where the first dynamic link library file is used to realize a first function of a first application program. The processor unit is configured to configure a first process; the first process includes a first thread and a second thread.
In embodiments of the present invention, a dynamic link library file (DLL) cannot run independently by itself; it is mainly responsible for providing certain services to application programs. That is, an application program needs the dynamic link library file to be loaded in order to complete certain specific functions. For example, the first dynamic link library file involved in embodiments of the present invention can be used to implement the face recognition function of the Alipay application.
In embodiments of the present invention, the first dynamic link library file referred to herein is stored in the Caffe dynamic link library. In practical applications, a dynamic link library file may include functions under the Caffe framework.
In a specific implementation, the functions under the Caffe framework may include Caffe Blob functions, Caffe Layer functions, and Caffe Net functions. A Blob is used to store, exchange, and process the data and derivative information of forward and backward iterations in the network. A Layer is used to perform calculations and may include nonlinear operations such as convolution (convolve), pooling (pool), inner product (inner product), rectified-linear, and sigmoid, and may also include element-level data transformations, normalization (normalize), data loading (load data), and losses such as classification (softmax) and hinge. In a specific implementation, each Layer defines three important operations: setup, forward, and backward. Setup is used to reset the layers and the connections between them when the model is initialized; forward receives input data from the bottom layer and, after calculation, sends the output to the top layer; backward, given the output gradient of the top layer, calculates the gradient of its input and transmits it to the bottom layer. For example, Layers may include Data Layer, Convolution Layer, Pooling Layer, InnerProduct Layer, ReLU Layer, Sigmoid Layer, LRN Layer, Dropout Layer, SoftmaxWithLoss Layer, Softmax Layer, Accuracy Layer, and so on. A Net begins with a data layer, that is, loading data from disk, and ends with a loss layer, that is, calculating the objective function for tasks such as classification and reconstruction. Specifically, a Net is a directed acyclic computation graph composed of a series of Layers; Caffe retains all intermediate values in the computation graph to ensure the accuracy of forward and backward iterations.
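The setup/forward/backward contract of a Layer can be illustrated in miniature. The class below is a hypothetical Python analogue of the three operations, not Caffe's actual C++ API; the layer's elementwise arithmetic is chosen purely for illustration:

```python
class InnerProductLayer:
    """Hypothetical analogue of the three operations a Caffe Layer defines."""

    def setup(self, weight):
        # model-initialization-time configuration of the layer
        self.weight = weight

    def forward(self, bottom):
        # receive input from the bottom layer, send output to the top layer
        return [self.weight * x for x in bottom]

    def backward(self, top_grad):
        # given the top layer's output gradient, compute the input gradient
        return [self.weight * g for g in top_grad]

layer = InnerProductLayer()
layer.setup(weight=2.0)
top = layer.forward([1.0, 2.0, 3.0])           # [2.0, 4.0, 6.0]
bottom_grad = layer.backward([1.0, 1.0, 1.0])  # [2.0, 2.0, 2.0]
```

Note that, as in the text, backward is the mirror of forward: it consumes a gradient flowing down from the top layer and produces a gradient for the bottom layer.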
The processor unit 13 is further configured to call the first thread to load the Caffe dynamic link library into memory according to the first load request, and to create a first object.
As mentioned above, the first process includes a first thread and a second thread. Specifically, the first process is the process configured by the processor unit for the first application program.
In a specific implementation, the first thread loads the Caffe dynamic link library into memory by executing the dlopen function, and creates an object for the first process, namely the first object. Here, the dlopen function opens the specified dynamic link library in the specified mode and loads it into memory. After the dynamic link library has been loaded into memory, a handle is returned to the calling process.
The processor unit 13 is further configured to call the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, to perform destruction on the first object.
As mentioned above, the first dynamic link library file itself cannot run independently. By calling the first thread to execute the first dynamic link library, the first dynamic link library file realizes the function it provides for the application program. After that function has finished executing, a destruction operation is performed on the object created by the first process (namely the first object). For example, suppose the first dynamic link library file is used to provide a face recognition function for the WeChat application: the first thread is called to execute the first dynamic library file, and after the WeChat application has successfully recognized the identity of the current object to be identified, destruction is performed on the first object.
In embodiments of the present invention, the destruction operation on the first object is performed by executing a destructor. Specifically, a destructor is the opposite of a constructor: when an object ends its life cycle (for example, when the function in which the object was created has finished), the system automatically executes the destructor. Destructors are often used for cleanup work; for example, if a block of memory was opened with "new" when the object was created, it should be released with "delete" in the destructor before exit.
In embodiments of the present invention, the program code of the destructor is stored in the Caffe dynamic link library. It can be understood that after the first object has been destructed, the memory space occupied by the first object can be released.
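The constructor/destructor pairing described above can be shown in miniature. The class below is a hypothetical Python analogue: `__init__` acquires a resource (as "new" would in C++) and `__del__` releases it (as "delete" in a destructor would). In CPython, `__del__` runs as soon as the last reference to the object disappears:

```python
released = []

class FirstObject:
    """Hypothetical analogue of an object whose destructor frees its memory."""

    def __init__(self):
        # constructor: stands in for memory opened with "new"
        self.buffer = bytearray(1024)

    def __del__(self):
        # destructor: cleanup work, releasing what the constructor acquired
        self.buffer = None
        released.append("first object destructed")

obj = FirstObject()
del obj          # life cycle ends; CPython runs __del__ immediately
print(released)  # ['first object destructed']
```

The point mirrored from the text is that the destructor body is code: if that code lived in an unloaded library, running it at end-of-life would fail.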
The processor unit 13 is further configured to, after the first thread has performed destruction on the first object, call the second thread to release the Caffe dynamic link library from the memory.
Specifically, after the first thread has performed the destruction operation on the first object, the second thread is called to execute the dlclose function, which releases the Caffe dynamic link library from memory. Here, the dlclose function is used to unload a dynamic link library that was opened earlier.
Ordinarily, only a single thread is used. That single thread executes the dlopen function, opening the specified dynamic link library in the specified mode and loading it into memory; the same thread then uses the Caffe dynamic link library to provide the corresponding function for the application program; and after the function has finished executing, the thread executes the dlclose function and exits. At that point, however, the destructor has not yet been executed. Because the Caffe dynamic link library has already been released, the corresponding code has been unloaded from memory by the time the destructor executes, which causes the program to crash. Moreover, in practice the destructor is executed after all user code, so with only a single thread there is never an opportunity to call dlclose after the destructor has run. Therefore, by implementing the embodiments of the present invention, both the loading and the release of the Caffe dynamic link library can be realized while the application program is running, so that terminal memory is not permanently occupied by the Caffe dynamic link library and operation speed is improved.
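The single-thread hazard just described is purely an ordering problem, which can be simulated without real dlopen/dlclose. In this hypothetical sketch, running the destructor after the library's code has been "unloaded" raises an error, while the ordering of the embodiment (destruct first, then release) does not; all function names are illustrative:

```python
library_loaded = False

def dlopen_sim():
    global library_loaded
    library_loaded = True   # library code now "in memory"

def dlclose_sim():
    global library_loaded
    library_loaded = False  # library code unloaded

def destructor_sim():
    # the destructor's code lives inside the library: it must still be loaded
    if not library_loaded:
        raise RuntimeError("destructor code already unloaded -> crash")

# Prior-art single-thread ordering: dlclose runs before the destructor.
dlopen_sim()
dlclose_sim()
try:
    destructor_sim()
    crashed = False
except RuntimeError:
    crashed = True  # the program would crash here

# Ordering of the embodiment: destruct first, then release.
dlopen_sim()
destructor_sim()  # safe: library still in memory
dlclose_sim()
print(crashed)    # True — the prior-art ordering fails
```

The simulation makes the rationale for the second thread concrete: something still has to call dlclose after the destructor, and in the single-thread scheme nothing can.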
In one embodiment, the Caffe dynamic link library involved in this application can be used simultaneously by multiple processes of the same application program, where the multiple processes call different dynamic link library files. In this case, the processor unit is further configured to configure a second process, and the second process includes a third thread and a fourth thread.
The processor unit is further configured to receive a second load request for a second dynamic link library file while receiving the first load request for the first dynamic link library file, where the second dynamic link library file is used to realize the first function of the first application program.
The processor unit is further configured to call the third thread to load the Caffe dynamic link library into memory according to the second load request, and to create a second object.
The processor unit is further configured to call the third thread to execute the second dynamic link library file and, after the first function of the first application program has been executed, to perform destruction on the second object.
The processor unit is further configured to, after the third thread has performed destruction on the second object, call the fourth thread to release the Caffe dynamic link library from the memory.
In this application scenario, the first process and the second process are two processes configured by the processor unit for the first application program. They are mutually independent: when the first process is running, it does not affect the operation of the second process, and when the two processes load the Caffe dynamic link library simultaneously, the memory corresponding to each process does not intersect.
In embodiments of the present invention, for the specific implementation of the first process, please refer to the foregoing description, which is not repeated here. As for the second process, it includes the third thread and the fourth thread: the third thread executes "load the Caffe dynamic link library — use the Caffe dynamic link library — perform destruction on the second object", and the fourth thread releases the Caffe dynamic link library from memory after the third thread has performed destruction on the second object. In this case, when the first dynamic link library file and the second dynamic link library file realize the same function of the first application program — for example, the face recognition function of the Alipay application — this can be regarded as a second verification of the current object to be identified, thereby further improving security.
In one embodiment, the Caffe dynamic link library involved in this application can be called simultaneously by different application programs, with different application programs calling different dynamic link library files. In this case, the processor unit is further configured to configure a third process, and the third process includes a fifth thread and a sixth thread.
The processor unit is further configured to receive a third load request for a third dynamic link library file while receiving the first load request for the first dynamic link library file, where the third dynamic link library file is used to realize a second function of a second application program.
The processor unit is further configured to call the fifth thread to load the Caffe dynamic link library into memory according to the third load request, and to create a third object.
The processor unit is further configured to call the fifth thread to execute the third dynamic link library file and, after the second function of the second application program has been executed, to perform destruction on the third object.
The processor unit is further configured to, after the fifth thread has performed destruction on the third object, call the sixth thread to release the Caffe dynamic link library from the memory.
In this application scenario, the first process is the process configured by the processor unit for the first application program, and the third process is the process configured by the processor unit for the second application program. The first process and the third process are mutually independent: when the first process is running, it does not affect the operation of the third process, and when the two processes load the Caffe dynamic link library simultaneously, the memory corresponding to each process does not intersect.
In embodiments of the present invention, for the specific implementation of the first process, please refer to the foregoing description, which is not repeated here. As for the third process, it includes the fifth thread and the sixth thread: the fifth thread executes "load the Caffe dynamic link library — use the Caffe dynamic link library — perform destruction on the third object", and the sixth thread releases the Caffe dynamic link library from memory after the fifth thread has performed destruction on the third object. In this way, real-time sharing of the Caffe dynamic link library by different application programs can be realized, as can the different functions of different application programs.
In one embodiment, as shown in Fig. 1, the device for releasing a dynamic link library further includes an MLU (Machine Learning Processing Unit) computing unit, where the processor unit 13 is connected to the MLU computing unit.
As mentioned above, each dynamic link library file in the Caffe dynamic link library includes functions under the Caffe framework.
The processor unit is further configured to, during loading of the Caffe dynamic link library, input the functions under the Caffe framework to the MLU computing unit. The MLU computing unit is configured to perform calculations according to the functions under the Caffe framework and the operation instructions, obtain a calculation result, and send the calculation result to the processor unit.
The processor unit is further configured to receive the calculation result.
By implementing the embodiments of the present invention, the loading speed when loading the Caffe dynamic link library can be improved.
In a specific implementation, the MLU computing unit includes a controller unit 11 and an arithmetic unit 12, where the controller unit 11 is connected to the arithmetic unit 12, and the arithmetic unit 12 includes one main processing circuit and multiple slave processing circuits.
The controller unit 11 is configured to obtain input data and a calculation instruction, where the input data includes the function data under the Caffe framework. In an optional scheme, the input data and the calculation instruction may be obtained through a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins.
The calculation instruction includes, but is not limited to, a forward operation instruction, a reverse training instruction, or another neural network operation instruction such as a convolution operation instruction; the specific embodiments of the present application do not limit the specific form of the calculation instruction.
The controller unit 11 is further configured to parse the calculation instruction to obtain multiple operation instructions, and to send the multiple operation instructions and the input data to the main processing circuit.
The main processing circuit 101 is configured to perform preamble processing on the input data and to transmit data and operation instructions between itself and the multiple slave processing circuits.
The multiple slave processing circuits 102 are configured to execute intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit, obtain multiple intermediate results, and transfer the multiple intermediate results to the main processing circuit.
The main processing circuit 101 performs subsequent processing on the multiple intermediate results to obtain the calculation result of the calculation instruction.
The technical solution provided by this application arranges the arithmetic unit in a one-master-multiple-slaves structure. For the calculation instruction of a forward operation, the data can be split according to that calculation instruction, so that the multiple slave processing circuits can perform parallel operations on the computation-intensive part. This improves operation speed, saves operation time, and in turn reduces power consumption.
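The one-master-multiple-slaves split described above follows a scatter/compute/gather shape, which this hypothetical simulation sketches: a "main processing circuit" splits the input among thread-based "slave processing circuits", each computes a partial (intermediate) result in parallel, and the main circuit combines them. The circuit names and the sum-of-squares workload are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def slave_circuit(chunk):
    # each slave processing circuit computes an intermediate result
    return sum(x * x for x in chunk)

def main_circuit(data, num_slaves=4):
    # preamble processing: split the input data among the slave circuits
    chunks = [data[i::num_slaves] for i in range(num_slaves)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        intermediates = list(pool.map(slave_circuit, chunks))
    # subsequent processing: combine the intermediate results
    return sum(intermediates)

result = main_circuit(list(range(10)))
print(result)  # 285 == 0^2 + 1^2 + ... + 9^2
```

The design point mirrored here is that only the computation-heavy middle step is parallelized; the split and the combine stay with the master.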
In one embodiment, when the device for releasing a dynamic link library includes a processor unit and an MLU computing unit, the device can perform machine learning calculations. In an optional embodiment, the machine learning calculation may include a convolutional neural network calculation. The above input data may include the functions under the Caffe framework, input neuron data, and weight data, where the functions under the Caffe framework may include Caffe Blob, Caffe Layer, and Caffe Net functions; the above calculation result may specifically be the result of the convolutional neural network operation, namely output neuron data.
The operation in the neural network may be one layer of the neural network. For a multilayer neural network, the implementation process is as follows. In a forward operation, after the execution of one layer of the artificial neural network completes, the operation instruction of the next layer takes the output neurons calculated in the arithmetic unit as the input neurons of the next layer for its operation (or performs certain operations on those output neurons and then uses them as the input neurons of the next layer), and at the same time the weights are replaced with the weights of the next layer. In a reverse operation, after the reverse operation of one layer of the artificial neural network completes, the operation instruction of the next layer takes the input-neuron gradients calculated in the arithmetic unit as the output-neuron gradients of the next layer for its operation (or performs certain operations on those input-neuron gradients and then uses them as the output-neuron gradients of the next layer), while the weights are likewise replaced with the weights of the next layer.
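The layer-to-layer hand-off just described — the output neurons of one layer becoming the input neurons of the next, with the weight swapped at each step — can be sketched as a simple loop. The numbers and the elementwise "layer" are illustrative stand-ins, not the real arithmetic unit:

```python
def forward_layer(inputs, weight):
    # one layer of the forward operation: hypothetical elementwise compute
    return [weight * x for x in inputs]

layer_weights = [2.0, 0.5, 3.0]  # one weight per layer, swapped each step
neurons = [1.0, 2.0]             # input neurons of the first layer

for w in layer_weights:
    # output neurons of this layer become input neurons of the next layer
    neurons = forward_layer(neurons, w)

print(neurons)  # [3.0, 6.0]: 1*2*0.5*3 = 3, 2*2*0.5*3 = 6
```

A reverse operation would run the analogous loop over the layers in the opposite order, passing gradients instead of neuron values.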
In a wherein embodiment, above-mentioned MLU computing unit can also include: the storage unit 10 and direct memory
Access unit 50, storage unit 10 may include: register, one or any combination in caching, specifically, the caching,
For storing the computations;The register, for storing the input data and scalar;The caching is scratchpad
Caching.Direct memory access unit 50 is used to read from storage unit 10 or storing data.
Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111 and a storage queue unit 113;
the instruction storage unit 110 is used for storing computation instructions associated with the artificial neural network operation;
the instruction processing unit 111 is used for parsing the computation instruction to obtain multiple operation instructions;
the storage queue unit 113 is used for storing an instruction queue, the instruction queue including multiple operation instructions or computation instructions to be executed in the front-to-back order of the queue.
For example, in an optional technical solution, the main operation processing circuit may also include a controller unit, and that controller unit may include a master instruction processing unit, specifically used for decoding instructions into micro-instructions. Of course, in another optional solution, the slave operation processing circuit may also include another controller unit, and that controller unit includes a slave instruction processing unit, specifically used for receiving and processing micro-instructions. The above micro-instruction may be a next-level instruction of an instruction; the micro-instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals for each component, each unit or each processing circuit.
In an optional solution, the structure of the computation instruction may be as shown in Table 1 below.
Table 1
Operation code | Register or immediate | Register/immediate | ... |
The ellipsis in the table above indicates that multiple registers or immediates may be included.
In an optional solution, the computation instruction may include one or more operation domains and one operation code. The computation instruction may include a neural network operation instruction. Taking a neural network operation instruction as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3 and register number 4 may each be an operation domain, wherein each of register number 0, register number 1, register number 2, register number 3 and register number 4 may be the number of one or more registers. For details, refer to Table 2:
Table 2
The above registers may be off-chip memory; of course, in practical applications they may also be on-chip memory, used for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1. For example, when n=1 the data is 1-dimensional data, i.e., a vector; when n=2 it is 2-dimensional data, i.e., a matrix; and when n=3 or more it is a multidimensional tensor.
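The opcode-plus-operation-domain format of Table 1 can be illustrated with a hypothetical software encoding. All names below (the opcode values, the `Operand` layout) are invented for this sketch and are not part of the disclosed instruction set:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical encoding: one operation code followed by operation
// domains, each of which is either a register number or an immediate.
enum class Opcode : uint8_t { MatMul, Accumulate, Activate };

struct Operand {
    bool is_immediate;   // true: literal value; false: register number
    uint32_t value;
};

struct Instruction {
    Opcode opcode;
    std::vector<Operand> operands;  // the "..." of Table 1: any number
};

// A decoder would resolve register operands against a register file
// before dispatching on the opcode.
uint32_t resolve(const Operand& op, const std::vector<uint32_t>& regs) {
    return op.is_immediate ? op.value : regs.at(op.value);
}
```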
In another alternative embodiment, the arithmetic unit 12, as shown in Fig. 2, may include a main processing circuit 101 and multiple slave processing circuits 102. In one embodiment, as shown in Fig. 2, the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected with the other adjacent slave processing circuits, and the main processing circuit is connected with k slave processing circuits among the multiple slave processing circuits, the k slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row and the m slave processing circuits of the 1st column. It should be noted that the k slave processing circuits shown in Fig. 2 include only the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row and the m slave processing circuits of the 1st column; that is, the k slave processing circuits are the slave processing circuits among the multiple slave processing circuits that are directly connected with the main processing circuit.
The k slave processing circuits are used for forwarding data and instructions between the main processing circuit and the multiple slave processing circuits.
Optionally, as shown in Fig. 3, the main processing circuit may further include one of, or any combination of: a conversion processing circuit 110, an activation processing circuit 111 and an addition processing circuit 112;
the conversion processing circuit 110 is used for performing, on the data block or intermediate result received by the main processing circuit, an exchange between a first data structure and a second data structure (e.g., conversion between continuous data and discrete data), or an exchange between a first data type and a second data type (e.g., conversion between fixed-point type and floating-point type);
the activation processing circuit 111 is used for performing the activation operation on data in the main processing circuit;
the addition processing circuit 112 is used for performing an addition operation or an accumulation operation.
The main processing circuit is used for determining that the input neurons are broadcast data and the weights are distribution data, distributing the distribution data into multiple data blocks, and sending at least one data block among the multiple data blocks and at least one operation instruction among the multiple operation instructions to the slave processing circuits;
the multiple slave processing circuits are used for performing operations on the received data blocks according to the operation instructions to obtain intermediate results, and transmitting the operation results to the main processing circuit;
the main processing circuit is used for processing the multiple intermediate results sent by the slave processing circuits to obtain the result of the computation instruction, and sending the result of the computation instruction to the controller unit.
The slave processing circuit includes: a multiplication processing circuit;
the multiplication processing circuit is used for performing a product operation on the received data block to obtain a product result;
a forwarding processing circuit (optional) is used for forwarding the received data block or the product result;
an accumulation processing circuit is used for performing an accumulation operation on the product result to obtain the intermediate result.
In another embodiment, the operation instruction is a computation instruction such as a matrix-multiply-matrix instruction, an accumulation instruction or an activation instruction.
The specific calculation method of the MLU computing unit shown in Fig. 1 is illustrated below through a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be: s = s(∑wxi + b), i.e., the weight w is multiplied by the input data xi, the products are summed, the bias b is added, and then the activation operation s(h) is performed to obtain the final output result s.
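The formula s = s(∑wxi + b) can be written out directly. A minimal C++ sketch, assuming a sigmoid as the activation s(h) (the patent does not fix a particular activation):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// s = s(sum_i(w_i * x_i) + b): multiply weights by inputs, sum,
// add the bias, then apply the activation s(h) (sigmoid assumed here).
double neuron_output(const std::vector<double>& w,
                     const std::vector<double>& x, double b) {
    double h = b;
    for (size_t i = 0; i < w.size(); ++i) h += w[i] * x[i];
    return 1.0 / (1.0 + std::exp(-h));  // activation s(h)
}
```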
In an optional embodiment, as shown in Fig. 4, the arithmetic unit includes: a tree module 40, the tree module including: a root port 401 and multiple branch ports 404. The root port of the tree module is connected with the main processing circuit, and the multiple branch ports of the tree module are respectively connected with one of the multiple slave processing circuits.
The above tree module has a transmitting-receiving function: for example, as shown in Fig. 4, the tree module is in the transmitting function, and as shown in Fig. 5, the tree module is in the receiving function.
The tree module is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the multiple slave processing circuits.
Optionally, the tree module is an optional component of the MLU computing unit. It may include at least one layer of nodes, each node being a line structure with a forwarding function; the node itself may have no computing function. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module may be an n-ary tree structure, for example, the binary tree structure shown in Fig. 6, and of course may also be a ternary tree structure, where n may be an integer greater than or equal to 2. This embodiment of the application does not limit the specific value of the above n; the number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the second-to-last layer, for example, the nodes of the last layer shown in Fig. 6.
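The forwarding-only nodes described above can be sketched in software: each node only relays data from the root port toward the leaves, without computing on it. An illustrative C++ sketch (the `Node`/`broadcast` names are invented for the example; requires C++17 for a `std::vector` of an incomplete type):

```cpp
#include <cassert>
#include <vector>

// Forwarding-only tree node: relays a value from the root port to its
// child ports without computing on it (the node has no compute).
struct Node {
    std::vector<Node> children;
};

// Broadcast from the root to every leaf; returns how many leaf
// "slave processing circuits" received the value.
int broadcast(const Node& n, double value, std::vector<double>& leaves) {
    if (n.children.empty()) { leaves.push_back(value); return 1; }
    int count = 0;
    for (const auto& c : n.children) count += broadcast(c, value, leaves);
    return count;
}
```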
Optionally, the above arithmetic unit may carry separate caches. As shown in Fig. 7, it may include: a neuron cache unit 63, which caches the input neuron vector data and the output neuron value data of the slave processing circuit. As shown in Fig. 8, the arithmetic unit may further include: a weight cache unit 64, used for caching the weight data needed by the slave processing circuit in the calculation process.
In an alternative embodiment, the arithmetic unit 12, as shown in Fig. 9, may include a branch processing circuit 103; the specific connection structure is as shown in Fig. 9, wherein
the main processing circuit 101 is connected with one or more branch processing circuits 103, and each branch processing circuit 103 is connected with one or more slave processing circuits 102;
the branch processing circuit 103 is used for forwarding data or instructions between the main processing circuit 101 and the slave processing circuits 102.
In an alternative embodiment, taking the fully-connected operation in the neural network operation as an example, the process may be: y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias scalar, and f is the activation function, which may specifically be any one of the sigmoid, tanh, relu and softmax functions. Assuming a binary tree structure with 8 slave processing circuits, the method may be implemented as follows:
the controller unit obtains the input neuron matrix x, the weight matrix w and the fully-connected operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the fully-connected operation instruction to the main processing circuit;
the main processing circuit determines that the input neuron matrix x is broadcast data and the weight matrix w is distribution data, splits the weight matrix w into 8 sub-matrices, then distributes the 8 sub-matrices to the 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;
the slave processing circuits perform in parallel the multiplication and accumulation operations of the 8 sub-matrices with the input neuron matrix x to obtain 8 intermediate results, and send the 8 intermediate results to the main processing circuit;
the main processing circuit sorts the 8 intermediate results to obtain the operation result of wx, performs the bias-b operation on the operation result and then performs the activation operation to obtain the final result y, and sends the final result y to the controller unit; the controller unit outputs the final result y or stores it in the storage unit.
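The split-distribute-accumulate flow above can be sketched as host code. A minimal C++ sketch, not the claimed circuit: each "slave processing circuit" is modeled as a thread that multiplies its row-wise sub-matrix of w by the broadcast input x, and the "main processing circuit" then adds the bias and applies the activation (relu assumed):

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// y = f(w*x + b), with w split row-wise across `slaves` workers.
// Each worker writes its intermediate results into disjoint rows of y,
// so the results land back in sorted (row) order.
std::vector<double> fully_connected(const Matrix& w,
                                    const std::vector<double>& x,
                                    double b, unsigned slaves = 8) {
    std::vector<double> y(w.size());
    std::vector<std::thread> pool;
    for (unsigned s = 0; s < slaves; ++s)
        pool.emplace_back([&, s] {
            for (size_t r = s; r < w.size(); r += slaves) {  // this slave's rows
                double acc = 0.0;
                for (size_t i = 0; i < x.size(); ++i) acc += w[r][i] * x[i];
                y[r] = acc;  // intermediate result
            }
        });
    for (auto& t : pool) t.join();
    for (auto& v : y) v = std::max(v + b, 0.0);  // bias, then activation
    return y;
}
```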
The method by which the MLU computing unit shown in Fig. 1 executes the neural network forward operation instruction may specifically be as follows:
The controller unit extracts the neural network forward operation instruction, and the operation domain and at least one operation code corresponding to the neural network operation instruction, from the instruction storage unit; the controller unit transmits the operation domain to the data access unit and sends the at least one operation code to the arithmetic unit.
The controller unit extracts the weight w and the bias b corresponding to the operation domain from the storage unit (when b is 0, the bias b does not need to be extracted), and transmits the weight w and the bias b to the main processing circuit of the arithmetic unit; the controller unit extracts the input data Xi from the storage unit and sends the input data Xi to the main processing circuit.
The main processing circuit determines a multiplication operation according to the at least one operation code, determines the input data Xi as broadcast data, determines the weight data as distribution data, and splits the weight w into n data blocks.
The instruction processing unit of the controller unit determines a multiplication instruction, a bias instruction and an accumulation instruction according to the at least one operation code, and sends the multiplication instruction, the bias instruction and the accumulation instruction to the main processing circuit; the main processing circuit sends the multiplication instruction and the input data Xi to the multiple slave processing circuits in a broadcast manner, and distributes the n data blocks to the multiple slave processing circuits (for example, with n slave processing circuits, each slave processing circuit is sent one data block); the multiple slave processing circuits perform, according to the multiplication instruction, the multiplication operation of the input data Xi with the received data block to obtain an intermediate result, and send the intermediate result to the main processing circuit; the main processing circuit performs, according to the accumulation instruction, the accumulation operation on the intermediate results sent by the multiple slave processing circuits to obtain an accumulation result, performs the bias-b operation on the accumulation result according to the bias instruction to obtain the final result, and sends the final result to the controller unit.
In addition, the order of the addition operation and the multiplication operation may be exchanged.
The technical solution provided by the present application realizes the multiplication operation and the bias operation of the neural network through one instruction, i.e., the neural network operation instruction; the intermediate results of the neural network calculation need not be stored or extracted, reducing the storage and extraction operations of intermediate data. It therefore has the advantages of reducing the corresponding operation steps and improving the calculation efficiency of the neural network.
The present application also discloses a machine learning arithmetic device, which includes the device for releasing a dynamic link library, wherein the device for releasing a dynamic link library includes one or more MLU computing units. The machine learning arithmetic device is used for obtaining data to be operated on and control information from other processing devices, performing the specified machine learning operation, and passing the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, a camera, a display, a mouse, a keyboard, a network card, a wifi interface and a server. When more than one MLU computing unit is included, the multiple MLU computing units can be linked through a specific structure and transmit data, for example interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, they may share the same control system or have independent control systems; they may share memory, or each may have its own memory. In addition, their interconnection mode may be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through a PCIE interface.
The present application also discloses a combined processing device, which includes the above machine learning arithmetic device, a universal interconnection interface and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation specified by the user. Fig. 10 is a schematic diagram of the combined processing device.
The other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU) and a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, including data transfer, and complete basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to jointly complete processing tasks.
The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device obtains the required input data from the other processing devices and writes it into the on-chip storage device of the machine learning arithmetic device; it may obtain control instructions from the other processing devices and write them into the on-chip control cache of the machine learning arithmetic device; it may also read the data in the storage module of the machine learning arithmetic device and transmit it to the other processing devices.
Optionally, the structure, as shown in Fig. 11, may also include a storage device, the storage device being connected with the machine learning arithmetic device and the other processing devices respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing devices, and is particularly suitable for data whose required operations cannot be fully held in the internal storage of the machine learning arithmetic device or the other processing devices.
The combined processing device can be used as an SOC system-on-chip of equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the die area of the control part, improving the processing speed and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected with certain components of the equipment. The certain components include, for example, a camera, a display, a mouse, a keyboard, a network card and a wifi interface.
In some embodiments, a chip is also claimed, which includes the above machine learning arithmetic device or combined processing device.
In some embodiments, a chip packaging structure is claimed, which includes the above chip.
In some embodiments, a board is claimed, which includes the above chip packaging structure. Referring to Fig. 12, Fig. 12 provides a board; in addition to including the above chip 389, the board may also include other supporting components, including but not limited to: a memory device 390, an interface device 391 and a control device 392;
The memory device 390 is connected with the chip in the chip packaging structure through a bus and is used for storing data. The memory device may include multiple groups of storage units 393. Each group of storage units is connected with the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without raising the clock frequency. DDR allows data to be read on both the rising edge and the falling edge of the clock pulse. The speed of DDR is twice that of standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units. Each group of storage units may include multiple DDR4 particles (chips). In one embodiment, the chip interior may include four 72-bit DDR4 controllers; of the above 72 bits, 64 bits are used for transmitting data and 8 bits are used for ECC check. It can be understood that when DDR4-3200 particles are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
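The 25600 MB/s figure follows directly from the transfer rate and the payload bus width: 3200 MT/s times 8 payload bytes per transfer (the 8 ECC bits of the 72-bit controller carry no payload). A small C++ check of that arithmetic:

```cpp
#include <cassert>

// Theoretical DDR bandwidth: transfer rate (MT/s) times the payload
// data-bus width in bytes (the ECC bits are excluded).
constexpr unsigned bandwidth_mb_s(unsigned mega_transfers, unsigned data_bits) {
    return mega_transfers * (data_bits / 8);  // MB/s
}

// DDR4-3200 on a 64-bit payload bus matches the figure in the text.
static_assert(bandwidth_mb_s(3200, 64) == 25600, "25600 MB/s");
```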
In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transmit data twice within one clock cycle. A controller for controlling the DDR is provided in the chip, for controlling the data transmission and data storage of each storage unit.
The interface device is electrically connected with the chip in the chip packaging structure. The interface device is used for realizing data transmission between the chip and external equipment (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface; the data to be processed is transmitted by the server to the chip through the standard PCIE interface, realizing data transfer. Preferably, when PCIE 3.0 X16 interface transmission is used, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the specific form of such other interfaces, as long as the interface unit can realize the transfer function. In addition, the calculation result of the chip is transmitted back to the external equipment (such as a server) by the interface device.
The control device is electrically connected with the chip. The control device is used for monitoring the state of the chip. Specifically, the chip may be electrically connected with the control device through an SPI interface. The control device may include a micro controller unit (MCU). The chip may include multiple processing chips, multiple processing cores or multiple processing circuits, and can drive multiple loads; therefore, the chip may be in different working states such as multi-load and light-load. The regulation of the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip can be realized through the control device.
In some embodiments, an electronic device is claimed, which includes the above board.
The electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicle includes an aircraft, a ship and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric light, a gas stove and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound instrument and/or an electrocardiograph.
It should be noted that, for the foregoing method embodiments, for simplicity of description they are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described action sequence, because according to the present application some steps may be performed in other sequences or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in this specification are alternative embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference can be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed device may be realized in other ways. For example, the device embodiments described above are merely exemplary; for instance, the division of the units is only a logical function division, and there may be other division manners in actual implementation, such as combining multiple units or components, integrating them into another system, or ignoring or not executing some features. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be realized in the form of hardware, or in the form of a software program module.
If the integrated unit is realized in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of each embodiment of the present application. The aforementioned memory includes: a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a mobile hard disk, a magnetic disk, an optical disk and other various media that can store program code.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing relevant hardware; the program can be stored in a computer-readable memory, and the memory may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.
The embodiments of the present application have been described in detail above. Specific examples are used herein to expound the principle and implementation of the present application, and the description of the above embodiments is only used to help understand the method of the present application and its core ideas. At the same time, those skilled in the art may make changes in the specific implementation and application scope according to the ideas of the present application. In conclusion, the contents of this specification should not be construed as limiting the present application.
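The two-phase flow recited above, where one thread loads and uses the dynamic link library and destructs the objects created from it, and a second thread only then releases the library from memory, has a familiar POSIX analogue. The following is a hedged C++ sketch of that analogue only, not the claimed apparatus; it assumes a Linux system where glibc's `libm.so.6` is available to stand in for the library being loaded:

```cpp
#include <dlfcn.h>   // POSIX dynamic linking: dlopen/dlsym/dlclose
#include <thread>

// First thread: load the shared library, resolve and use a symbol,
// and destruct any objects created from the library. Second thread:
// release the library from memory, only after the first has finished.
double run_load_use_release() {
    void* lib = nullptr;
    double result = 0.0;
    std::thread first([&] {
        lib = dlopen("libm.so.6", RTLD_LAZY);            // load phase
        if (!lib) return;
        auto cosine =
            reinterpret_cast<double (*)(double)>(dlsym(lib, "cos"));
        if (cosine) result = cosine(0.0);                // use the library
        // ... objects created from the library are destructed here ...
    });
    first.join();  // release must not start before destruction is done
    std::thread second([&] { if (lib) dlclose(lib); });  // release phase
    second.join();
    return result;
}
```

Joining the first thread before spawning the second enforces the ordering the claims require: the library is never unmapped while objects created from it are still alive.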
Claims (22)
1. A device for releasing a dynamic link library, characterized in that the device is applied to a processor unit;
the processor unit is used for receiving a first load request for a first dynamic link library file, wherein the first dynamic link library file is used for realizing a first function of a first application program; the processor unit is used for configuring a first process, the first process including a first thread and a second thread;
the processor unit is also used for calling the first thread to load a Caffe dynamic link library into memory according to the first load request, and creating a first object;
the processor unit is also used for calling the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, performing destruction on the first object;
the processor unit is also used for calling the second thread to release the Caffe dynamic link library from the memory after the first thread has performed destruction on the first object.
2. The device according to claim 1, characterized in that the processor unit is also used for configuring a second process, the second process including a third thread and a fourth thread;
the processor unit is also used for receiving a second load request for a second dynamic link library file while receiving the first load request for the first dynamic link library file, wherein the second dynamic link library file is used for realizing the first function of the first application program;
the processor unit is also used for calling the third thread to load the Caffe dynamic link library into memory according to the second load request, and creating a second object;
the processor unit is also used for calling the third thread to execute the second dynamic link library file and, after the first function of the first application program has been executed, performing destruction on the second object;
the processor unit is also used for calling the fourth thread to release the Caffe dynamic link library from the memory after the third thread has performed destruction on the second object.
3. The device according to claim 1, characterized in that the processor unit is also used for configuring a third process, the third process including a fifth thread and a sixth thread;
the processor unit is also used for receiving a third load request for a third dynamic link library file while receiving the first load request for the first dynamic link library file, wherein the third dynamic link library file is used for realizing a second function of a second application program;
the processor unit is also used for calling the fifth thread to load the Caffe dynamic link library into memory according to the third load request, and creating a third object;
the processor unit is also used for calling the fifth thread to execute the third dynamic link library file and, after the second function of the second application program has been executed, performing destruction on the third object;
the processor unit is also used for calling the sixth thread to release the Caffe dynamic link library from the memory after the fifth thread has performed destruction on the third object.
4. The device according to claim 1, characterized in that the device further includes: an MLU computing unit; each dynamic link library file in the Caffe dynamic link library includes functions under the Caffe framework;
the processor unit is also used for, in the process of loading the Caffe dynamic link library, inputting the functions under the Caffe framework into the MLU computing unit, wherein the MLU computing unit is used for calculating according to the functions under the Caffe framework and operation instructions to obtain a calculation result, and sending the calculation result to the processor unit;
the processor unit is also used for receiving the calculation result.
5. The device according to claim 4, characterized in that the MLU computing unit includes a controller unit and an arithmetic unit; the arithmetic unit includes: a main processing circuit and multiple slave processing circuits;
the controller unit is used for obtaining input data and a computation instruction, wherein the input data includes the function data under the Caffe framework;
the controller unit is also used for parsing the computation instruction to obtain multiple operation instructions, and sending the multiple operation instructions and the input data to the main processing circuit;
the main processing circuit is used for performing pre-processing on the input data and transmitting data and operation instructions with the multiple slave processing circuits;
the multiple slave processing circuits are used for performing intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit to obtain multiple intermediate results, and transmitting the multiple intermediate results to the main processing circuit;
the main processing circuit is used for performing subsequent processing on the multiple intermediate results to obtain the calculation result of the computation instruction.
6. The device according to claim 5, characterized in that the arithmetic unit includes: a tree module, the tree module including: a root port and multiple branch ports, the root port of the tree module being connected with the main processing circuit, and the multiple branch ports of the tree module being respectively connected with one of the multiple slave processing circuits;
the tree module is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the multiple slave processing circuits.
7. The device according to claim 5, characterized in that the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit being connected with at least one slave processing circuit;
the main processing circuit is specifically used for determining that the input neurons are broadcast data and the weights are distribution data, distributing one piece of distribution data into multiple data blocks, and sending at least one data block among the multiple data blocks, the broadcast data and at least one operation instruction among multiple operation instructions to the branch processing circuit;
the branch processing circuit is used for forwarding the data blocks, the broadcast data and the operation instructions between the main processing circuit and the multiple slave processing circuits;
the multiple slave processing circuits are used for performing operations on the received data blocks and broadcast data according to the operation instructions to obtain intermediate results, and transmitting the intermediate results to the branch processing circuit;
the main processing circuit is used for performing subsequent processing on the intermediate results sent by the branch processing circuit to obtain the result of the computation instruction, and sending the result of the computation instruction to the controller unit.
8. The device according to claim 5, wherein the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to K slave processing circuits of the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row, and the m slave processing circuits of the 1st column;
The K slave processing circuits are configured to forward data and instructions between the main processing circuit and the plurality of slave processing circuits;
The main processing circuit is configured to determine that input neurons are broadcast data and weights are distribution data, to divide one distribution data block into a plurality of data blocks, and to send at least one of the plurality of data blocks and at least one of the plurality of operation instructions to the K slave processing circuits;
The K slave processing circuits are configured to forward the data between the main processing circuit and the plurality of slave processing circuits;
The plurality of slave processing circuits are configured to perform operations on the received data blocks according to the operation instructions to obtain intermediate results, and to transmit the operation results to the K slave processing circuits;
The main processing circuit is configured to perform subsequent processing on the intermediate results sent by the K slave processing circuits to obtain the result of the computation instruction, and to send the result of the computation instruction to the controller unit.
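Under the array layout stated in this claim, the K slave processing circuits wired directly to the main processing circuit are the border positions of row 1, row m, and column 1 (the two shared corner positions counted once). A small sketch enumerating those positions, purely for illustration (`k_circuits` is an invented helper, not part of the claim), could look like:

```cpp
#include <cassert>
#include <set>
#include <utility>

// Enumerate the (row, col) positions, 1-indexed, of the K slave processing
// circuits in an m x n array: the whole first row, the whole m-th row, and
// the whole first column. A std::set deduplicates the shared corners.
std::set<std::pair<int, int>> k_circuits(int m, int n) {
    std::set<std::pair<int, int>> k;
    for (int j = 1; j <= n; ++j) {
        k.insert({1, j});   // 1st row
        k.insert({m, j});   // m-th row
    }
    for (int i = 1; i <= m; ++i)
        k.insert({i, 1});   // 1st column
    return k;
}
```

For a 3 x 4 array this yields 9 positions: 4 in the first row, 4 in the last row, and 1 additional circuit in the first column, since the two corner circuits already belong to the rows.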
9. A machine learning computing device, wherein the machine learning computing device includes the device for releasing a dynamic link library according to any one of claims 1-8, the device for releasing a dynamic link library including one or more MLU computing units; the machine learning computing device is configured to obtain input data to be operated on and control information from other processing devices, to execute a specified machine learning operation, and to pass the execution result to the other processing devices through an I/O interface;
When the machine learning computing device includes a plurality of MLU computing units, the plurality of MLU computing units can be connected through a specific structure and transmit data;
Wherein the plurality of MLU computing units are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus to support machine learning operations at a larger scale; the plurality of MLU computing units either share a same control system or have their own control systems; the plurality of MLU computing units either share memory or have their own memories; and the interconnection mode of the plurality of MLU computing units is any interconnection topology.
10. A combined processing device, wherein the combined processing device includes the machine learning computing device according to claim 9, a universal interconnection interface, and other processing devices;
The machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user.
11. The combined processing device according to claim 10, further including a storage device, the storage device being connected to the machine learning computing device and the other processing devices respectively, and configured to save data of the machine learning computing device and the other processing devices.
12. A neural network chip, wherein the neural network chip includes the machine learning computing device according to claim 9 or the combined processing device according to claim 10.
13. An electronic device, wherein the electronic device includes the chip according to claim 12.
14. A board card, wherein the board card includes: a memory device, an interface device, a control device, and the neural network chip according to claim 12;
Wherein the neural network chip is connected to the memory device, the control device, and the interface device respectively;
The memory device is configured to store data;
The interface device is configured to implement data transmission between the chip and an external device;
The control device is configured to monitor a state of the chip.
15. A method for releasing a dynamic link library, wherein the method is applied to a device for releasing a dynamic link library, the device including a processor unit; the method includes:
The processor unit receives a first load request for a first dynamic link library file, wherein the first dynamic link library file is used to implement a first function of a first application; the processor unit is configured to configure a first process, the first process including a first thread and a second thread;
The processor unit calls the first thread to load the Caffe dynamic link library into memory according to the first load request, and creates a first object;
The processor unit calls the first thread to execute the first dynamic link library file, and after the first function of the first application has been executed, performs destruction on the first object;
After the first thread has completed the destruction of the first object, the processor unit calls the second thread to release the Caffe dynamic link library from the memory.
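On a POSIX system, the sequence this claim describes — a first thread loads the library and destructs the object it created, and only afterwards a second thread unloads the library — maps naturally onto `dlopen`/`dlclose`. The sketch below is a minimal illustration under that assumption; `run_once` and `FirstObject` are invented stand-ins and do not reproduce the actual Caffe loading logic:

```cpp
#include <cassert>
#include <dlfcn.h>
#include <thread>

// Stand-in for the claimed "first object"; its destructor may still call
// code from the loaded library, which is why the library must not be
// released until destruction has finished.
struct FirstObject {
    ~FirstObject() { /* cleanup that may touch library code */ }
};

int run_once(const char* lib_path) {
    void* handle = nullptr;
    std::thread first_thread([&] {
        handle = dlopen(lib_path, RTLD_NOW);  // load the library into memory
        if (!handle) return;
        FirstObject obj;                      // create the first object
        // ... call the library's function here ...
    });                                       // obj destructed on scope exit
    first_thread.join();                      // destruction is now complete
    std::thread second_thread([&] {
        if (handle) dlclose(handle);          // release the library from memory
    });
    second_thread.join();
    return handle ? 0 : -1;
}
```

The `join()` between the two threads is what enforces the claimed ordering: the second thread cannot call `dlclose` while destructor code from the library might still be running.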
16. The method according to claim 15, wherein the method further includes: the processor unit configures a second process, the second process including a third thread and a fourth thread;
While receiving the first load request for the first dynamic link library file, the processor unit receives a second load request for a second dynamic link library file, wherein the second dynamic link library file is used to implement the first function of the first application;
The processor unit calls the third thread to load the Caffe dynamic link library into memory according to the second load request, and creates a second object;
The processor unit calls the third thread to execute the second dynamic link library file, and after the first function of the first application has been executed, performs destruction on the second object;
After the third thread has completed the destruction of the second object, the processor unit calls the fourth thread to release the Caffe dynamic link library from the memory.
17. The method according to claim 15, wherein the method further includes: the processor unit configures a third process, the third process including a fifth thread and a sixth thread;
While receiving the first load request for the first dynamic link library file, the processor unit receives a third load request for a third dynamic link library file, wherein the third dynamic link library file is used to implement a second function of a second application;
The processor unit calls the fifth thread to load the Caffe dynamic link library into memory according to the third load request, and creates a third object;
The processor unit calls the fifth thread to execute the third dynamic link library file, and after the second function of the second application has been executed, performs destruction on the third object;
After the fifth thread has completed the destruction of the third object, the processor unit calls the sixth thread to release the Caffe dynamic link library from the memory.
18. The method according to claim 15, wherein the device further includes an MLU computing unit, and each dynamic link library file in the Caffe dynamic link library includes functions under the Caffe framework;
During loading of the Caffe dynamic link library, the processor unit inputs the functions under the Caffe framework to the MLU computing unit, wherein the MLU computing unit is configured to perform calculations according to the functions under the Caffe framework and operation instructions to obtain a calculation result, and to send the calculation result to the processor unit;
The processor unit receives the calculation result.
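The offload pattern in this claim — the processor unit hands framework functions and input data to the computing unit and receives a calculation result back — can be modeled by a minimal interface. `MluUnit` is an invented stand-in for illustration, not a real driver API:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical model of the MLU computing unit: it receives a framework
// function plus input data, performs the calculation, and returns the
// result to the caller (the "processor unit" of the claim).
struct MluUnit {
    int compute(const std::function<int(const std::vector<int>&)>& fn,
                const std::vector<int>& input) const {
        return fn(input);  // the unit evaluates the function on the input data
    }
};
```

A processor-side caller would then construct an `MluUnit`, pass it a function together with the data, and read the returned result, mirroring the send/receive exchange the claim describes.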
19. The method according to claim 18, wherein the MLU computing unit includes a controller unit and an arithmetic unit; the arithmetic unit includes a main processing circuit and a plurality of slave processing circuits;
The controller unit is configured to obtain input data and a computation instruction, wherein the input data includes function data under the Caffe framework;
The controller unit is further configured to parse the computation instruction into a plurality of operation instructions, and to send the plurality of operation instructions and the input data to the main processing circuit;
The main processing circuit is configured to perform preamble processing on the input data and to transmit data and operation instructions with the plurality of slave processing circuits;
The plurality of slave processing circuits are configured to execute intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the main processing circuit;
The main processing circuit is configured to perform subsequent processing on the plurality of intermediate results to obtain a result of the computation instruction.
20. The method according to claim 19, wherein the arithmetic unit includes a tree module; the tree module includes one root port and a plurality of branch ports; the root port of the tree module is connected to the main processing circuit, and the plurality of branch ports of the tree module are each connected to one of the plurality of slave processing circuits;
The tree module is configured to forward data blocks, weights, and operation instructions between the main processing circuit and the plurality of slave processing circuits.
21. The method according to claim 19, wherein the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
The main processing circuit is specifically configured to determine that input neurons are broadcast data and weights are a distribution data block, to divide one distribution data block into a plurality of data blocks, and to send at least one of the plurality of data blocks, the broadcast data, and at least one of the plurality of operation instructions to the branch processing circuits;
The branch processing circuits are configured to forward the data blocks, broadcast data, and operation instructions between the main processing circuit and the plurality of slave processing circuits;
The plurality of slave processing circuits are configured to perform operations on the received data blocks and broadcast data according to the operation instructions to obtain intermediate results, and to transmit the intermediate results to the branch processing circuits;
The main processing circuit is configured to perform subsequent processing on the intermediate results sent by the branch processing circuits to obtain the result of the computation instruction, and to send the result of the computation instruction to the controller unit.
22. The method according to claim 19, wherein the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to K slave processing circuits of the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row, and the m slave processing circuits of the 1st column;
The K slave processing circuits are configured to forward data and instructions between the main processing circuit and the plurality of slave processing circuits;
The main processing circuit is configured to determine that input neurons are broadcast data and weights are distribution data, to divide one distribution data block into a plurality of data blocks, and to send at least one of the plurality of data blocks and at least one of the plurality of operation instructions to the K slave processing circuits;
The K slave processing circuits are configured to forward the data between the main processing circuit and the plurality of slave processing circuits;
The plurality of slave processing circuits are configured to perform operations on the received data blocks according to the operation instructions to obtain intermediate results, and to transmit the operation results to the K slave processing circuits;
The main processing circuit is configured to perform subsequent processing on the intermediate results sent by the K slave processing circuits to obtain the result of the computation instruction, and to send the result of the computation instruction to the controller unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811629632.9A CN109753319B (en) | 2018-12-28 | 2018-12-28 | Device for releasing dynamic link library and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109753319A true CN109753319A (en) | 2019-05-14 |
CN109753319B CN109753319B (en) | 2020-01-17 |
Family
ID=66403216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811629632.9A Active CN109753319B (en) | 2018-12-28 | 2018-12-28 | Device for releasing dynamic link library and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109753319B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110187959A (en) * | 2019-06-04 | 2019-08-30 | 北京慧眼智行科技有限公司 | A kind of dynamic link library multithreading call method and system |
CN111796941A (en) * | 2020-07-06 | 2020-10-20 | 北京字节跳动网络技术有限公司 | Memory management method and device, computer equipment and storage medium |
WO2022006728A1 (en) * | 2020-07-07 | 2022-01-13 | 深圳元戎启行科技有限公司 | Method for managing vehicle-mounted hardware device, and information processing system, vehicle-mounted terminal and storage medium |
CN115227255A (en) * | 2022-07-29 | 2022-10-25 | 四川大学华西医院 | Remote electrocardiogram display method and system based on canvas technology |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561763A (en) * | 2009-04-30 | 2009-10-21 | 腾讯科技(北京)有限公司 | Method and device for realizing dynamic-link library |
US20120304162A1 (en) * | 2010-02-23 | 2012-11-29 | Fujitsu Limited | Update method, update apparatus, and computer product |
CN104572275B (en) * | 2013-10-23 | 2017-12-29 | 华为技术有限公司 | A kind of process loading method, apparatus and system |
2018-12-28: Application CN201811629632.9A filed in China; granted as CN109753319B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN109753319B (en) | 2020-01-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543832A (en) | A kind of computing device and board | |
CN109522052A (en) | A kind of computing device and board | |
CN109753319A (en) | A kind of device and Related product of release dynamics chained library | |
CN109685201A (en) | Operation method, device and Related product | |
CN109657782A (en) | Operation method, device and Related product | |
US11080593B2 (en) | Electronic circuit, in particular capable of implementing a neural network, and neural system | |
CN109740739A (en) | Neural computing device, neural computing method and Related product | |
CN110163362A (en) | A kind of computing device and method | |
CN109375951A (en) | A kind of device and method for executing full articulamentum neural network forward operation | |
CN109740754A (en) | Neural computing device, neural computing method and Related product | |
CN110096310A (en) | Operation method, device, computer equipment and storage medium | |
CN110059797A (en) | A kind of computing device and Related product | |
CN109670581A (en) | A kind of computing device and board | |
CN110119807A (en) | Operation method, device, computer equipment and storage medium | |
CN109739703A (en) | Adjust wrong method and Related product | |
CN111353591A (en) | Computing device and related product | |
CN110059809A (en) | A kind of computing device and Related product | |
CN111368981B (en) | Method, apparatus, device and storage medium for reducing storage area of synaptic connections | |
CN109670578A (en) | Neural network first floor convolution layer data processing method, device and computer equipment | |
CN109711540A (en) | A kind of computing device and board | |
CN109726800A (en) | Operation method, device and Related product | |
CN111381882B (en) | Data processing device and related product | |
CN109740729A (en) | Operation method, device and Related product | |
CN111368967A (en) | Neural network computing device and method | |
CN110472734A (en) | A kind of computing device and Related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: 100000 Room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant after: Zhongke Cambrian Technology Co., Ltd
Address before: 100000 Room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.
GR01 | Patent grant | ||