CN109753319A - A kind of device and Related product of release dynamics chained library - Google Patents
- Publication number
- CN109753319A (application CN201811629632.9A)
- Authority
- CN
- China
- Prior art keywords
- circuit
- processing circuit
- thread
- dynamic link
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Stored Programmes (AREA)
- Advance Control (AREA)
Abstract
This application provides a device for releasing a dynamic link library, and a related product. The device is applied to a processor unit. The processor unit configures a first process, which includes a first thread and a second thread. The first thread is called to load the Caffe dynamic link library, use the Caffe dynamic link library, and perform a destruction operation on a first object. After the first thread has completed the destruction operation on the first object, the second thread is called to release the Caffe dynamic link library from memory. In this way, both the loading and the release of the Caffe dynamic link library can be realized, avoiding persistent memory occupation and thereby improving operation speed.
Description
Technical field
This application relates to the technical field of information processing, and in particular to a device for releasing a dynamic link library and a related product.
Background technique
With the rapid development of software and hardware technology, dynamic link library files (Dynamic Link Library, DLL) are widely used. Taking the dynamic link library of the convolutional neural network framework Caffe (Convolutional Architecture for Fast Feature Embedding) as an example: the Caffe dynamic link library depends on third-party dynamic link libraries and is mainly used in video and image processing applications. During loading it occupies a large amount of memory on the terminal.
In the prior art, an application program links the Caffe dynamic link library directly. When the application starts, the processor loads the Caffe dynamic link library into memory, and only when the application exits can the processor unload it. This means that as long as the application is running, the Caffe dynamic link library always occupies terminal memory. Since the memory on a terminal is limited, this effectively reduces the processing speed of the terminal. How to realize both the loading and the release of the Caffe dynamic link library, so that terminal memory is not permanently occupied by it while a program runs, is therefore a research hotspot for those skilled in the art.
Summary of the invention
The embodiments of the present application provide a device for releasing a dynamic link library, and a related product. While an application program is running, the Caffe dynamic link library can be both loaded and released, so that terminal memory is not permanently occupied by the Caffe dynamic link library and operation speed is improved.
In a first aspect, an embodiment of the present application provides a device for releasing a dynamic link library. The device is applied to a processor unit. The processor unit is configured to receive a first load request for a first dynamic link library file, where the first dynamic link library file is used to realize a first function of a first application program. The processor unit is configured to configure a first process; the first process includes a first thread and a second thread.
The processor unit is further configured to call the first thread to load the Caffe dynamic link library into memory according to the first load request, and to create a first object.
The processor unit is further configured to call the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, to perform destruction on the first object.
The processor unit is further configured to, after the first thread has performed destruction on the first object, call the second thread to release the Caffe dynamic link library from the memory.
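The first-aspect flow can be sketched as a two-thread sequence. The following is a minimal illustrative simulation, not the patented implementation: one thread stands in for the first thread (load, use, destruct) and a second for the second thread (release), with a shared list recording the order of events. All names are hypothetical.

```python
import threading

events = []                     # records the order of operations
destructed = threading.Event()  # signals that the first object was destructed

def first_thread():
    events.append("load Caffe dynamic link library")  # dlopen-like step
    events.append("create first object")
    events.append("execute first function")
    events.append("destruct first object")            # destructor runs here
    destructed.set()                                  # only now may release happen

def second_thread():
    destructed.wait()  # block until destruction has finished
    events.append("release Caffe dynamic link library")  # dlclose-like step

t1 = threading.Thread(target=first_thread)
t2 = threading.Thread(target=second_thread)
t2.start()
t1.start()
t1.join()
t2.join()

print(events[-1])  # release happens strictly last
```

The event on which the second thread waits is what guarantees that the release step can never overtake the destruction step, regardless of thread scheduling.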
In a second aspect, an embodiment of the present application provides a machine learning arithmetic unit. The machine learning arithmetic unit includes the device for releasing a dynamic link library described in the first aspect, where the device includes one or more MLU computing units. The machine learning arithmetic unit is configured to obtain input data and control information for an operation from other processing devices, to execute a specified machine learning operation, and to pass the execution result to the other processing devices through an I/O interface.
When the machine learning arithmetic unit includes multiple MLU computing units, the MLU computing units may be connected to one another through a specific structure and transmit data between themselves.
Specifically, the multiple MLU computing units may interconnect and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations. The multiple MLU computing units may share the same control system or have their own control systems; they may share memory or have their own memories; and their interconnection may be any interconnection topology.
In a third aspect, an embodiment of the present application provides a combined processing device. The combined processing device includes the machine learning arithmetic unit described in the second aspect, a universal interconnect interface, and other processing devices. The machine learning arithmetic unit interacts with the other processing devices to jointly complete the operation specified by the user. The combined processing device may also include a storage device, which is connected to the machine learning arithmetic unit and the other processing devices respectively and is used to save data of the machine learning arithmetic unit and the other processing devices.
In a fourth aspect, an embodiment of the present application provides a neural network chip. The chip includes the device for releasing a dynamic link library described in the first aspect, the machine learning arithmetic unit described in the second aspect, or the combined processing device described in the third aspect.
In a fifth aspect, an embodiment of the present application provides a neural network chip package structure, which includes the neural network chip described in the fourth aspect.
In a sixth aspect, an embodiment of the present application provides a board, which includes the neural network chip package structure described in the fifth aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device, which includes the neural network chip described in the fourth aspect or the board described in the sixth aspect.
In an eighth aspect, an embodiment of the present application further provides a method for releasing a dynamic link library. The method is applied to a device for releasing a dynamic link library, and the device includes a processor unit. The method includes:
the processor unit receives a first load request for a first dynamic link library file, where the first dynamic link library file is used to realize a first function of a first application program; the processor unit configures a first process, and the first process includes a first thread and a second thread;
the processor unit calls the first thread to load the Caffe dynamic link library into memory according to the first load request, and creates a first object;
the processor unit calls the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, performs destruction on the first object;
after the first thread has performed destruction on the first object, the processor unit calls the second thread to release the Caffe dynamic link library from the memory.
In some embodiments, the electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, camera, server, cloud server, webcam, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
In some embodiments, the vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, or range hood; the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound instrument, and/or an electrocardiograph.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a structural schematic diagram of a device for releasing a dynamic link library provided by an embodiment of the present application;
Fig. 2 is a structure chart of an MLU computing unit provided by an embodiment of the present application;
Fig. 3 is a structure chart of a main processing circuit provided by an embodiment of the present application;
Fig. 4 is a structure chart of another MLU computing unit provided by an embodiment of the present application;
Fig. 5 is a structure chart of yet another MLU computing unit provided by an embodiment of the present application;
Fig. 6 is a structural schematic diagram of a tree module provided by an embodiment of the present application;
Fig. 7 is a structure chart of a further MLU computing unit provided by an embodiment of the present application;
Fig. 8 is a structure chart of still another MLU computing unit provided by an embodiment of the present application;
Fig. 9 is a structure chart of an MLU computing unit provided by another embodiment of the present application;
Fig. 10 is a structure chart of a combined processing device provided by an embodiment of the present application;
Fig. 11 is a structure chart of another combined processing device provided by an embodiment of the present application;
Fig. 12 is a structural schematic diagram of a board provided by an embodiment of the present application.
Specific embodiment
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The terms "first", "second", "third", "fourth", and so on in the description, claims, and drawings of the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the description do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
First, the device for releasing a dynamic link library used in this application is introduced. Referring to Fig. 1, a device for releasing a dynamic link library is provided. The device includes a processor unit 13.
The processor unit 13 is configured to receive a first load request for a first dynamic link library file, where the first dynamic link library file is used to realize a first function of a first application program. The processor unit is configured to configure a first process; the first process includes a first thread and a second thread.
In embodiments of the present invention, a dynamic link library file (DLL) cannot run independently by itself; it is mainly responsible for providing certain services to application programs. That is, an application program needs the dynamic link library file to be loaded in order to complete certain specific functions. For example, the first dynamic link library file involved in embodiments of the present invention can be used to implement the face recognition function of the Alipay application.
In embodiments of the present invention, the first dynamic link library file referred to herein is stored in the Caffe dynamic link library. In practical applications, a dynamic link library file may include functions under the Caffe framework.
In a specific implementation, the functions under the Caffe framework may include Caffe Blob functions, Caffe Layer functions, and Caffe Net functions. A Blob is used to store, exchange, and process the data and derivative information of forward and backward iterations in the network. A Layer is used to perform calculations and may include nonlinear operations such as convolution (convolve), pooling (pool), inner product (inner product), rectified-linear, and sigmoid, and may also include element-level data transformations, normalization (normalize), data loading (load data), and losses such as classification (softmax) and hinge. In a specific implementation, each Layer defines three important operations: setup, forward, and backward. Setup is used to reset the layers and the connections between them when the model is initialized; forward receives input data from the bottom layer and, after calculation, sends the output to the top layer; backward, given the output gradient of the top layer, calculates the gradient of its input and transmits it to the bottom layer. For example, Layers may include Data Layer, Convolution Layer, Pooling Layer, InnerProduct Layer, ReLU Layer, Sigmoid Layer, LRN Layer, Dropout Layer, SoftmaxWithLoss Layer, Softmax Layer, Accuracy Layer, and so on. A Net begins with a data layer, that is, loading data from disk, and ends with a loss layer, that is, calculating the objective function for tasks such as classification and reconstruction. Specifically, a Net is a directed acyclic computation graph composed of a series of Layers; Caffe retains all intermediate values in the computation graph to ensure the accuracy of forward and backward iterations.
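The setup/forward/backward contract of a Layer can be illustrated in miniature. The class below is a hypothetical Python analogue of the three operations, not Caffe's actual C++ API; the layer's elementwise arithmetic is chosen purely for illustration:

```python
class InnerProductLayer:
    """Hypothetical analogue of the three operations a Caffe Layer defines."""

    def setup(self, weight):
        # model-initialization-time configuration of the layer
        self.weight = weight

    def forward(self, bottom):
        # receive input from the bottom layer, send output to the top layer
        return [self.weight * x for x in bottom]

    def backward(self, top_grad):
        # given the top layer's output gradient, compute the input gradient
        return [self.weight * g for g in top_grad]

layer = InnerProductLayer()
layer.setup(weight=2.0)
top = layer.forward([1.0, 2.0, 3.0])           # [2.0, 4.0, 6.0]
bottom_grad = layer.backward([1.0, 1.0, 1.0])  # [2.0, 2.0, 2.0]
```

Note that, as in the text, backward is the mirror of forward: it consumes a gradient flowing down from the top layer and produces a gradient for the bottom layer.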
The processor unit 13 is further configured to call the first thread to load the Caffe dynamic link library into memory according to the first load request, and to create a first object.
As mentioned above, the first process includes a first thread and a second thread. Specifically, the first process is the process configured by the processor unit for the first application program.
In a specific implementation, the first thread loads the Caffe dynamic link library into memory by executing the dlopen function, and creates an object for the first process, namely the first object. Here, the dlopen function opens the specified dynamic link library in the specified mode and loads it into memory. After the dynamic link library has been loaded into memory, a handle is returned to the calling process.
The processor unit 13 is further configured to call the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, to perform destruction on the first object.
As mentioned above, the first dynamic link library file itself cannot run independently. By calling the first thread to execute the first dynamic link library, the first dynamic link library file realizes the function it provides for the application program. After that function has finished executing, a destruction operation is performed on the object created by the first process (namely the first object). For example, suppose the first dynamic link library file is used to provide a face recognition function for the WeChat application: the first thread is called to execute the first dynamic library file, and after the WeChat application has successfully recognized the identity of the current object to be identified, destruction is performed on the first object.
In embodiments of the present invention, the destruction operation on the first object is performed by executing a destructor. Specifically, a destructor is the opposite of a constructor: when an object ends its life cycle (for example, when the function in which the object was created has finished), the system automatically executes the destructor. Destructors are often used for cleanup work; for example, if a block of memory was opened with "new" when the object was created, it should be released with "delete" in the destructor before exit.
In embodiments of the present invention, the program code of the destructor is stored in the Caffe dynamic link library. It can be understood that after the first object has been destructed, the memory space occupied by the first object can be released.
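The constructor/destructor pairing described above can be shown in miniature. The class below is a hypothetical Python analogue: `__init__` acquires a resource (as "new" would in C++) and `__del__` releases it (as "delete" in a destructor would). In CPython, `__del__` runs as soon as the last reference to the object disappears:

```python
released = []

class FirstObject:
    """Hypothetical analogue of an object whose destructor frees its memory."""

    def __init__(self):
        # constructor: stands in for memory opened with "new"
        self.buffer = bytearray(1024)

    def __del__(self):
        # destructor: cleanup work, releasing what the constructor acquired
        self.buffer = None
        released.append("first object destructed")

obj = FirstObject()
del obj          # life cycle ends; CPython runs __del__ immediately
print(released)  # ['first object destructed']
```

The point mirrored from the text is that the destructor body is code: if that code lived in an unloaded library, running it at end-of-life would fail.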
The processor unit 13 is further configured to, after the first thread has performed destruction on the first object, call the second thread to release the Caffe dynamic link library from the memory.
Specifically, after the first thread has performed the destruction operation on the first object, the second thread is called to execute the dlclose function, which releases the Caffe dynamic link library from memory. Here, the dlclose function is used to unload a dynamic link library that was opened earlier.
Ordinarily, only a single thread is used. That single thread executes the dlopen function, opening the specified dynamic link library in the specified mode and loading it into memory; the same thread then uses the Caffe dynamic link library to provide the corresponding function for the application program; and after the function has finished executing, the thread executes the dlclose function and exits. At that point, however, the destructor has not yet been executed. Because the Caffe dynamic link library has already been released, the corresponding code has been unloaded from memory by the time the destructor executes, which causes the program to crash. Moreover, in practice the destructor is executed after all user code, so with only a single thread there is never an opportunity to call dlclose after the destructor has run. Therefore, by implementing the embodiments of the present invention, both the loading and the release of the Caffe dynamic link library can be realized while the application program is running, so that terminal memory is not permanently occupied by the Caffe dynamic link library and operation speed is improved.
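The single-thread hazard just described is purely an ordering problem, which can be simulated without real dlopen/dlclose. In this hypothetical sketch, running the destructor after the library's code has been "unloaded" raises an error, while the ordering of the embodiment (destruct first, then release) does not; all function names are illustrative:

```python
library_loaded = False

def dlopen_sim():
    global library_loaded
    library_loaded = True   # library code now "in memory"

def dlclose_sim():
    global library_loaded
    library_loaded = False  # library code unloaded

def destructor_sim():
    # the destructor's code lives inside the library: it must still be loaded
    if not library_loaded:
        raise RuntimeError("destructor code already unloaded -> crash")

# Prior-art single-thread ordering: dlclose runs before the destructor.
dlopen_sim()
dlclose_sim()
try:
    destructor_sim()
    crashed = False
except RuntimeError:
    crashed = True  # the program would crash here

# Ordering of the embodiment: destruct first, then release.
dlopen_sim()
destructor_sim()  # safe: library still in memory
dlclose_sim()
print(crashed)    # True — the prior-art ordering fails
```

The simulation makes the rationale for the second thread concrete: something still has to call dlclose after the destructor, and in the single-thread scheme nothing can.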
In one embodiment, the Caffe dynamic link library involved in this application can be used simultaneously by multiple processes of the same application program, where the multiple processes call different dynamic link library files. In this case, the processor unit is further configured to configure a second process, and the second process includes a third thread and a fourth thread.
The processor unit is further configured to receive a second load request for a second dynamic link library file while receiving the first load request for the first dynamic link library file, where the second dynamic link library file is used to realize the first function of the first application program.
The processor unit is further configured to call the third thread to load the Caffe dynamic link library into memory according to the second load request, and to create a second object.
The processor unit is further configured to call the third thread to execute the second dynamic link library file and, after the first function of the first application program has been executed, to perform destruction on the second object.
The processor unit is further configured to, after the third thread has performed destruction on the second object, call the fourth thread to release the Caffe dynamic link library from the memory.
In this application scenario, the first process and the second process are two processes configured by the processor unit for the first application program. They are mutually independent: when the first process is running, it does not affect the operation of the second process, and when the two processes load the Caffe dynamic link library simultaneously, the memory corresponding to each process does not intersect.
In embodiments of the present invention, for the specific implementation of the first process, please refer to the foregoing description, which is not repeated here. As for the second process, it includes the third thread and the fourth thread: the third thread executes "load the Caffe dynamic link library — use the Caffe dynamic link library — perform destruction on the second object", and the fourth thread releases the Caffe dynamic link library from memory after the third thread has performed destruction on the second object. In this case, when the first dynamic link library file and the second dynamic link library file realize the same function of the first application program — for example, the face recognition function of the Alipay application — this can be regarded as a second verification of the current object to be identified, thereby further improving security.
In one embodiment, the Caffe dynamic link library involved in this application can be called simultaneously by different application programs, with different application programs calling different dynamic link library files. In this case, the processor unit is further configured to configure a third process, and the third process includes a fifth thread and a sixth thread.
The processor unit is further configured to receive a third load request for a third dynamic link library file while receiving the first load request for the first dynamic link library file, where the third dynamic link library file is used to realize a second function of a second application program.
The processor unit is further configured to call the fifth thread to load the Caffe dynamic link library into memory according to the third load request, and to create a third object.
The processor unit is further configured to call the fifth thread to execute the third dynamic link library file and, after the second function of the second application program has been executed, to perform destruction on the third object.
The processor unit is further configured to, after the fifth thread has performed destruction on the third object, call the sixth thread to release the Caffe dynamic link library from the memory.
In this application scenario, the first process is the process configured by the processor unit for the first application program, and the third process is the process configured by the processor unit for the second application program. The first process and the third process are mutually independent: when the first process is running, it does not affect the operation of the third process, and when the two processes load the Caffe dynamic link library simultaneously, the memory corresponding to each process does not intersect.
In embodiments of the present invention, for the specific implementation of the first process, please refer to the foregoing description, which is not repeated here. As for the third process, it includes the fifth thread and the sixth thread: the fifth thread executes "load the Caffe dynamic link library — use the Caffe dynamic link library — perform destruction on the third object", and the sixth thread releases the Caffe dynamic link library from memory after the fifth thread has performed destruction on the third object. In this way, real-time sharing of the Caffe dynamic link library by different application programs can be realized, as can the different functions of different application programs.
In one embodiment, as shown in Fig. 1, the device for releasing a dynamic link library further includes an MLU (Machine Learning Processing Unit) computing unit, where the processor unit 13 is connected to the MLU computing unit.
As mentioned above, each dynamic link library file in the Caffe dynamic link library includes functions under the Caffe framework.
The processor unit is further configured to, during loading of the Caffe dynamic link library, input the functions under the Caffe framework to the MLU computing unit. The MLU computing unit is configured to perform calculations according to the functions under the Caffe framework and the operation instructions, obtain a calculation result, and send the calculation result to the processor unit.
The processor unit is further configured to receive the calculation result.
By implementing the embodiments of the present invention, the loading speed when loading the Caffe dynamic link library can be improved.
In a specific implementation, the MLU computing unit includes a controller unit 11 and an arithmetic unit 12, where the controller unit 11 is connected to the arithmetic unit 12, and the arithmetic unit 12 includes one main processing circuit and multiple slave processing circuits.
The controller unit 11 is configured to obtain input data and a calculation instruction, where the input data includes the function data under the Caffe framework. In an optional scheme, the input data and the calculation instruction may be obtained through a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins.
The calculation instruction includes, but is not limited to, a forward operation instruction, a reverse training instruction, or another neural network operation instruction such as a convolution operation instruction; the specific embodiments of the present application do not limit the specific form of the calculation instruction.
The controller unit 11 is further configured to parse the calculation instruction to obtain multiple operation instructions, and to send the multiple operation instructions and the input data to the main processing circuit.
The main processing circuit 101 is configured to perform preamble processing on the input data and to transmit data and operation instructions between itself and the multiple slave processing circuits.
The multiple slave processing circuits 102 are configured to execute intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit, obtain multiple intermediate results, and transfer the multiple intermediate results to the main processing circuit.
The main processing circuit 101 performs subsequent processing on the multiple intermediate results to obtain the calculation result of the calculation instruction.
The technical solution provided by this application arranges the arithmetic unit in a one-master-multiple-slaves structure. For the calculation instruction of a forward operation, the data can be split according to that calculation instruction, so that the multiple slave processing circuits can perform parallel operations on the computation-intensive part. This improves operation speed, saves operation time, and in turn reduces power consumption.
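The one-master-multiple-slaves split described above follows a scatter/compute/gather shape, which this hypothetical simulation sketches: a "main processing circuit" splits the input among thread-based "slave processing circuits", each computes a partial (intermediate) result in parallel, and the main circuit combines them. The circuit names and the sum-of-squares workload are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def slave_circuit(chunk):
    # each slave processing circuit computes an intermediate result
    return sum(x * x for x in chunk)

def main_circuit(data, num_slaves=4):
    # preamble processing: split the input data among the slave circuits
    chunks = [data[i::num_slaves] for i in range(num_slaves)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        intermediates = list(pool.map(slave_circuit, chunks))
    # subsequent processing: combine the intermediate results
    return sum(intermediates)

result = main_circuit(list(range(10)))
print(result)  # 285 == 0^2 + 1^2 + ... + 9^2
```

The design point mirrored here is that only the computation-heavy middle step is parallelized; the split and the combine stay with the master.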
In one embodiment, when the device for releasing a dynamic link library includes a processor unit and an MLU computing unit, the device can perform machine learning calculations. In an optional embodiment, the machine learning calculation may include a convolutional neural network calculation. The above input data may include the functions under the Caffe framework, input neuron data, and weight data, where the functions under the Caffe framework may include Caffe Blob, Caffe Layer, and Caffe Net functions; the above calculation result may specifically be the result of the convolutional neural network operation, namely output neuron data.
The operation in the neural network may be one layer of the neural network. For a multilayer neural network, the implementation process is as follows. In a forward operation, after the execution of one layer of the artificial neural network completes, the operation instruction of the next layer takes the output neurons calculated in the arithmetic unit as the input neurons of the next layer for its operation (or performs certain operations on those output neurons and then uses them as the input neurons of the next layer), and at the same time the weights are replaced with the weights of the next layer. In a reverse operation, after the reverse operation of one layer of the artificial neural network completes, the operation instruction of the next layer takes the input-neuron gradients calculated in the arithmetic unit as the output-neuron gradients of the next layer for its operation (or performs certain operations on those input-neuron gradients and then uses them as the output-neuron gradients of the next layer), while the weights are likewise replaced with the weights of the next layer.
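The layer-to-layer hand-off just described — the output neurons of one layer becoming the input neurons of the next, with the weight swapped at each step — can be sketched as a simple loop. The numbers and the elementwise "layer" are illustrative stand-ins, not the real arithmetic unit:

```python
def forward_layer(inputs, weight):
    # one layer of the forward operation: hypothetical elementwise compute
    return [weight * x for x in inputs]

layer_weights = [2.0, 0.5, 3.0]  # one weight per layer, swapped each step
neurons = [1.0, 2.0]             # input neurons of the first layer

for w in layer_weights:
    # output neurons of this layer become input neurons of the next layer
    neurons = forward_layer(neurons, w)

print(neurons)  # [3.0, 6.0]: 1*2*0.5*3 = 3, 2*2*0.5*3 = 6
```

A reverse operation would run the analogous loop over the layers in the opposite order, passing gradients instead of neuron values.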
In a wherein embodiment, above-mentioned MLU computing unit can also include: the storage unit 10 and direct memory
Access unit 50, storage unit 10 may include: register, one or any combination in caching, specifically, the caching,
For storing the computations;The register, for storing the input data and scalar;The caching is scratchpad
Caching.Direct memory access unit 50 is used to read from storage unit 10 or storing data.
Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111 and a storage queue unit 113;
the instruction storage unit 110 is used for storing computation instructions associated with the artificial neural network operation;
the instruction processing unit 111 is used for parsing the computation instruction to obtain multiple operation instructions;
the storage queue unit 113 is used for storing an instruction queue, the instruction queue including multiple operation instructions or computation instructions to be executed in the front-to-back order of the queue.
For example, in an optional technical solution, the main operation processing circuit may also include a controller unit, and that controller unit may include a master instruction processing unit, specifically used for decoding instructions into micro-instructions. Of course, in another optional solution, the slave operation processing circuit may also include another controller unit, and that controller unit includes a slave instruction processing unit, specifically used for receiving and processing micro-instructions. The above micro-instruction may be a next-level instruction of an instruction; the micro-instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals for each component, each unit or each processing circuit.
In an optional solution, the structure of the computation instruction may be as shown in Table 1 below.
Table 1
Operation code | Register or immediate | Register/immediate | ... |
The ellipsis in the table above indicates that multiple registers or immediates may be included.
In an optional solution, the computation instruction may include one or more operation domains and one operation code. The computation instruction may include a neural network operation instruction. Taking a neural network operation instruction as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3 and register number 4 may each be an operation domain, wherein each of register number 0, register number 1, register number 2, register number 3 and register number 4 may be the number of one or more registers. For details, refer to Table 2:
Table 2
The above registers may be off-chip memory; of course, in practical applications they may also be on-chip memory, used for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1. For example, when n=1 the data is 1-dimensional data, i.e., a vector; when n=2 it is 2-dimensional data, i.e., a matrix; and when n=3 or more it is a multidimensional tensor.
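The opcode-plus-operation-domain format of Table 1 can be illustrated with a hypothetical software encoding. All names below (the opcode values, the `Operand` layout) are invented for this sketch and are not part of the disclosed instruction set:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical encoding: one operation code followed by operation
// domains, each of which is either a register number or an immediate.
enum class Opcode : uint8_t { MatMul, Accumulate, Activate };

struct Operand {
    bool is_immediate;   // true: literal value; false: register number
    uint32_t value;
};

struct Instruction {
    Opcode opcode;
    std::vector<Operand> operands;  // the "..." of Table 1: any number
};

// A decoder would resolve register operands against a register file
// before dispatching on the opcode.
uint32_t resolve(const Operand& op, const std::vector<uint32_t>& regs) {
    return op.is_immediate ? op.value : regs.at(op.value);
}
```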
In another alternative embodiment, the arithmetic unit 12, as shown in Fig. 2, may include a main processing circuit 101 and multiple slave processing circuits 102. In one embodiment, as shown in Fig. 2, the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected with the other adjacent slave processing circuits, and the main processing circuit is connected with k slave processing circuits among the multiple slave processing circuits, the k slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row and the m slave processing circuits of the 1st column. It should be noted that the k slave processing circuits shown in Fig. 2 include only the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row and the m slave processing circuits of the 1st column; that is, the k slave processing circuits are the slave processing circuits among the multiple slave processing circuits that are directly connected with the main processing circuit.
The k slave processing circuits are used for forwarding data and instructions between the main processing circuit and the multiple slave processing circuits.
Optionally, as shown in Fig. 3, the main processing circuit may further include one of, or any combination of: a conversion processing circuit 110, an activation processing circuit 111 and an addition processing circuit 112;
the conversion processing circuit 110 is used for performing, on the data block or intermediate result received by the main processing circuit, an exchange between a first data structure and a second data structure (e.g., conversion between continuous data and discrete data), or an exchange between a first data type and a second data type (e.g., conversion between fixed-point type and floating-point type);
the activation processing circuit 111 is used for performing the activation operation on data in the main processing circuit;
the addition processing circuit 112 is used for performing an addition operation or an accumulation operation.
The main processing circuit is used for determining that the input neurons are broadcast data and the weights are distribution data, distributing the distribution data into multiple data blocks, and sending at least one data block among the multiple data blocks and at least one operation instruction among the multiple operation instructions to the slave processing circuits;
the multiple slave processing circuits are used for performing operations on the received data blocks according to the operation instructions to obtain intermediate results, and transmitting the operation results to the main processing circuit;
the main processing circuit is used for processing the multiple intermediate results sent by the slave processing circuits to obtain the result of the computation instruction, and sending the result of the computation instruction to the controller unit.
The slave processing circuit includes: a multiplication processing circuit;
the multiplication processing circuit is used for performing a product operation on the received data block to obtain a product result;
a forwarding processing circuit (optional) is used for forwarding the received data block or the product result;
an accumulation processing circuit is used for performing an accumulation operation on the product result to obtain the intermediate result.
In another embodiment, the operation instruction is a computation instruction such as a matrix-multiply-matrix instruction, an accumulation instruction or an activation instruction.
The specific calculation method of the MLU computing unit shown in Fig. 1 is illustrated below through a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be: s = s(∑wxi + b), i.e., the weight w is multiplied by the input data xi, the products are summed, the bias b is added, and then the activation operation s(h) is performed to obtain the final output result s.
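The formula s = s(∑wxi + b) can be written out directly. A minimal C++ sketch, assuming a sigmoid as the activation s(h) (the patent does not fix a particular activation):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// s = s(sum_i(w_i * x_i) + b): multiply weights by inputs, sum,
// add the bias, then apply the activation s(h) (sigmoid assumed here).
double neuron_output(const std::vector<double>& w,
                     const std::vector<double>& x, double b) {
    double h = b;
    for (size_t i = 0; i < w.size(); ++i) h += w[i] * x[i];
    return 1.0 / (1.0 + std::exp(-h));  // activation s(h)
}
```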
In an optional embodiment, as shown in Fig. 4, the arithmetic unit includes: a tree module 40, the tree module including: a root port 401 and multiple branch ports 404. The root port of the tree module is connected with the main processing circuit, and the multiple branch ports of the tree module are respectively connected with one of the multiple slave processing circuits.
The above tree module has a transmitting-receiving function: for example, as shown in Fig. 4, the tree module is in the transmitting function, and as shown in Fig. 5, the tree module is in the receiving function.
The tree module is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the multiple slave processing circuits.
Optionally, the tree module is an optional component of the MLU computing unit. It may include at least one layer of nodes, each node being a line structure with a forwarding function; the node itself may have no computing function. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module may be an n-ary tree structure, for example, the binary tree structure shown in Fig. 6, and of course may also be a ternary tree structure, where n may be an integer greater than or equal to 2. This embodiment of the application does not limit the specific value of the above n; the number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the second-to-last layer, for example, the nodes of the last layer shown in Fig. 6.
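The forwarding-only nodes described above can be sketched in software: each node only relays data from the root port toward the leaves, without computing on it. An illustrative C++ sketch (the `Node`/`broadcast` names are invented for the example; requires C++17 for a `std::vector` of an incomplete type):

```cpp
#include <cassert>
#include <vector>

// Forwarding-only tree node: relays a value from the root port to its
// child ports without computing on it (the node has no compute).
struct Node {
    std::vector<Node> children;
};

// Broadcast from the root to every leaf; returns how many leaf
// "slave processing circuits" received the value.
int broadcast(const Node& n, double value, std::vector<double>& leaves) {
    if (n.children.empty()) { leaves.push_back(value); return 1; }
    int count = 0;
    for (const auto& c : n.children) count += broadcast(c, value, leaves);
    return count;
}
```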
Optionally, the above arithmetic unit may carry separate caches. As shown in Fig. 7, it may include: a neuron cache unit 63, which caches the input neuron vector data and the output neuron value data of the slave processing circuit. As shown in Fig. 8, the arithmetic unit may further include: a weight cache unit 64, used for caching the weight data needed by the slave processing circuit in the calculation process.
In an alternative embodiment, the arithmetic unit 12, as shown in Fig. 9, may include a branch processing circuit 103; the specific connection structure is as shown in Fig. 9, wherein
the main processing circuit 101 is connected with one or more branch processing circuits 103, and each branch processing circuit 103 is connected with one or more slave processing circuits 102;
the branch processing circuit 103 is used for forwarding data or instructions between the main processing circuit 101 and the slave processing circuits 102.
In an alternative embodiment, taking the fully-connected operation in the neural network operation as an example, the process may be: y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias scalar, and f is the activation function, which may specifically be any one of the sigmoid, tanh, relu and softmax functions. Assuming a binary tree structure with 8 slave processing circuits, the method may be implemented as follows:
the controller unit obtains the input neuron matrix x, the weight matrix w and the fully-connected operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the fully-connected operation instruction to the main processing circuit;
the main processing circuit determines that the input neuron matrix x is broadcast data and the weight matrix w is distribution data, splits the weight matrix w into 8 sub-matrices, then distributes the 8 sub-matrices to the 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;
the slave processing circuits perform in parallel the multiplication and accumulation operations of the 8 sub-matrices with the input neuron matrix x to obtain 8 intermediate results, and send the 8 intermediate results to the main processing circuit;
the main processing circuit sorts the 8 intermediate results to obtain the operation result of wx, performs the bias-b operation on the operation result and then performs the activation operation to obtain the final result y, and sends the final result y to the controller unit; the controller unit outputs the final result y or stores it in the storage unit.
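The split-distribute-accumulate flow above can be sketched as host code. A minimal C++ sketch, not the claimed circuit: each "slave processing circuit" is modeled as a thread that multiplies its row-wise sub-matrix of w by the broadcast input x, and the "main processing circuit" then adds the bias and applies the activation (relu assumed):

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// y = f(w*x + b), with w split row-wise across `slaves` workers.
// Each worker writes its intermediate results into disjoint rows of y,
// so the results land back in sorted (row) order.
std::vector<double> fully_connected(const Matrix& w,
                                    const std::vector<double>& x,
                                    double b, unsigned slaves = 8) {
    std::vector<double> y(w.size());
    std::vector<std::thread> pool;
    for (unsigned s = 0; s < slaves; ++s)
        pool.emplace_back([&, s] {
            for (size_t r = s; r < w.size(); r += slaves) {  // this slave's rows
                double acc = 0.0;
                for (size_t i = 0; i < x.size(); ++i) acc += w[r][i] * x[i];
                y[r] = acc;  // intermediate result
            }
        });
    for (auto& t : pool) t.join();
    for (auto& v : y) v = std::max(v + b, 0.0);  // bias, then activation
    return y;
}
```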
The method by which the MLU computing unit shown in Fig. 1 executes the neural network forward operation instruction may specifically be as follows:
The controller unit extracts the neural network forward operation instruction, and the operation domain and at least one operation code corresponding to the neural network operation instruction, from the instruction storage unit; the controller unit transmits the operation domain to the data access unit and sends the at least one operation code to the arithmetic unit.
The controller unit extracts the weight w and the bias b corresponding to the operation domain from the storage unit (when b is 0, the bias b does not need to be extracted), and transmits the weight w and the bias b to the main processing circuit of the arithmetic unit; the controller unit extracts the input data Xi from the storage unit and sends the input data Xi to the main processing circuit.
The main processing circuit determines a multiplication operation according to the at least one operation code, determines the input data Xi as broadcast data, determines the weight data as distribution data, and splits the weight w into n data blocks.
The instruction processing unit of the controller unit determines a multiplication instruction, a bias instruction and an accumulation instruction according to the at least one operation code, and sends the multiplication instruction, the bias instruction and the accumulation instruction to the main processing circuit; the main processing circuit sends the multiplication instruction and the input data Xi to the multiple slave processing circuits in a broadcast manner, and distributes the n data blocks to the multiple slave processing circuits (for example, with n slave processing circuits, each slave processing circuit is sent one data block); the multiple slave processing circuits perform, according to the multiplication instruction, the multiplication operation of the input data Xi with the received data block to obtain an intermediate result, and send the intermediate result to the main processing circuit; the main processing circuit performs, according to the accumulation instruction, the accumulation operation on the intermediate results sent by the multiple slave processing circuits to obtain an accumulation result, performs the bias-b operation on the accumulation result according to the bias instruction to obtain the final result, and sends the final result to the controller unit.
In addition, the order of the addition operation and the multiplication operation may be exchanged.
The technical solution provided by the present application realizes the multiplication operation and the bias operation of the neural network through one instruction, i.e., the neural network operation instruction; the intermediate results of the neural network calculation need not be stored or extracted, reducing the storage and extraction operations of intermediate data. It therefore has the advantages of reducing the corresponding operation steps and improving the calculation efficiency of the neural network.
The present application also discloses a machine learning arithmetic device, which includes the device for releasing a dynamic link library, wherein the device for releasing a dynamic link library includes one or more MLU computing units. The machine learning arithmetic device is used for obtaining data to be operated on and control information from other processing devices, performing the specified machine learning operation, and passing the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, a camera, a display, a mouse, a keyboard, a network card, a wifi interface and a server. When more than one MLU computing unit is included, the multiple MLU computing units can be linked through a specific structure and transmit data, for example interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, they may share the same control system or have independent control systems; they may share memory, or each may have its own memory. In addition, their interconnection mode may be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through a PCIE interface.
The present application also discloses a combined processing device, which includes the above machine learning arithmetic device, a universal interconnection interface and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation specified by the user. Fig. 10 is a schematic diagram of the combined processing device.
The other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU) and a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, including data transfer, and complete basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to jointly complete processing tasks.
The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device obtains the required input data from the other processing devices and writes it into the on-chip storage device of the machine learning arithmetic device; it may obtain control instructions from the other processing devices and write them into the on-chip control cache of the machine learning arithmetic device; it may also read the data in the storage module of the machine learning arithmetic device and transmit it to the other processing devices.
Optionally, the structure, as shown in Fig. 11, may also include a storage device, the storage device being connected with the machine learning arithmetic device and the other processing devices respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing devices, and is particularly suitable for data whose required operations cannot be fully held in the internal storage of the machine learning arithmetic device or the other processing devices.
The combined processing device can be used as an SOC system-on-chip of equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the die area of the control part, improving the processing speed and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected with certain components of the equipment. The certain components include, for example, a camera, a display, a mouse, a keyboard, a network card and a wifi interface.
In some embodiments, a chip is also claimed, which includes the above machine learning arithmetic device or combined processing device.
In some embodiments, a chip packaging structure is claimed, which includes the above chip.
In some embodiments, a board is claimed, which includes the above chip packaging structure. Referring to Fig. 12, Fig. 12 provides a board; in addition to including the above chip 389, the board may also include other supporting components, including but not limited to: a memory device 390, an interface device 391 and a control device 392;
The memory device 390 is connected with the chip in the chip packaging structure through a bus and is used for storing data. The memory device may include multiple groups of storage units 393. Each group of storage units is connected with the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without raising the clock frequency. DDR allows data to be read on both the rising edge and the falling edge of the clock pulse. The speed of DDR is twice that of standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units. Each group of storage units may include multiple DDR4 particles (chips). In one embodiment, the chip interior may include four 72-bit DDR4 controllers; of the above 72 bits, 64 bits are used for transmitting data and 8 bits are used for ECC check. It can be understood that when DDR4-3200 particles are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
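The 25600 MB/s figure follows directly from the transfer rate and the payload bus width: 3200 MT/s times 8 payload bytes per transfer (the 8 ECC bits of the 72-bit controller carry no payload). A small C++ check of that arithmetic:

```cpp
#include <cassert>

// Theoretical DDR bandwidth: transfer rate (MT/s) times the payload
// data-bus width in bytes (the ECC bits are excluded).
constexpr unsigned bandwidth_mb_s(unsigned mega_transfers, unsigned data_bits) {
    return mega_transfers * (data_bits / 8);  // MB/s
}

// DDR4-3200 on a 64-bit payload bus matches the figure in the text.
static_assert(bandwidth_mb_s(3200, 64) == 25600, "25600 MB/s");
```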
In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transmit data twice within one clock cycle. A controller for controlling the DDR is provided in the chip, for controlling the data transmission and data storage of each storage unit.
The interface device is electrically connected with the chip in the chip packaging structure. The interface device is used for realizing data transmission between the chip and external equipment (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface; the data to be processed is transmitted by the server to the chip through the standard PCIE interface, realizing data transfer. Preferably, when PCIE 3.0 X16 interface transmission is used, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the specific form of such other interfaces, as long as the interface unit can realize the transfer function. In addition, the calculation result of the chip is transmitted back to the external equipment (such as a server) by the interface device.
The control device is electrically connected with the chip. The control device is used for monitoring the state of the chip. Specifically, the chip may be electrically connected with the control device through an SPI interface. The control device may include a micro controller unit (MCU). The chip may include multiple processing chips, multiple processing cores or multiple processing circuits, and can drive multiple loads; therefore, the chip may be in different working states such as multi-load and light-load. The regulation of the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip can be realized through the control device.
In some embodiments, an electronic device is claimed, which includes the above board.
The electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicle includes an aircraft, a ship and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric light, a gas stove and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound instrument and/or an electrocardiograph.
It should be noted that, for the foregoing method embodiments, for simplicity of description they are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described action sequence, because according to the present application some steps may be performed in other sequences or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in this specification are alternative embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference can be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed device may be realized in other ways. For example, the device embodiments described above are merely exemplary; for instance, the division of the units is only a logical function division, and there may be other division manners in actual implementation, such as combining multiple units or components, integrating them into another system, or ignoring or not executing some features. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be realized in the form of hardware, or in the form of a software program module.
If the integrated unit is realized in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of each embodiment of the present application. The aforementioned memory includes: a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a mobile hard disk, a magnetic disk, an optical disk and other various media that can store program code.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing relevant hardware; the program can be stored in a computer-readable memory, and the memory may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.
The embodiments of the present application have been described in detail above. Specific examples are used herein to expound the principle and implementation of the present application, and the description of the above embodiments is only used to help understand the method of the present application and its core ideas. At the same time, those skilled in the art may make changes in the specific implementation and application scope according to the ideas of the present application. In conclusion, the contents of this specification should not be construed as limiting the present application.
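The two-phase flow recited above, where one thread loads and uses the dynamic link library and destructs the objects created from it, and a second thread only then releases the library from memory, has a familiar POSIX analogue. The following is a hedged C++ sketch of that analogue only, not the claimed apparatus; it assumes a Linux system where glibc's `libm.so.6` is available to stand in for the library being loaded:

```cpp
#include <dlfcn.h>   // POSIX dynamic linking: dlopen/dlsym/dlclose
#include <thread>

// First thread: load the shared library, resolve and use a symbol,
// and destruct any objects created from the library. Second thread:
// release the library from memory, only after the first has finished.
double run_load_use_release() {
    void* lib = nullptr;
    double result = 0.0;
    std::thread first([&] {
        lib = dlopen("libm.so.6", RTLD_LAZY);            // load phase
        if (!lib) return;
        auto cosine =
            reinterpret_cast<double (*)(double)>(dlsym(lib, "cos"));
        if (cosine) result = cosine(0.0);                // use the library
        // ... objects created from the library are destructed here ...
    });
    first.join();  // release must not start before destruction is done
    std::thread second([&] { if (lib) dlclose(lib); });  // release phase
    second.join();
    return result;
}
```

Joining the first thread before spawning the second enforces the ordering the claims require: the library is never unmapped while objects created from it are still alive.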
Claims (22)
1. A device for releasing a dynamic link library, characterized in that the device is applied to a processor unit;
the processor unit is used for receiving a first load request for a first dynamic link library file, wherein the first dynamic link library file is used for realizing a first function of a first application program; the processor unit is used for configuring a first process, the first process including a first thread and a second thread;
the processor unit is also used for calling the first thread to load a Caffe dynamic link library into memory according to the first load request, and creating a first object;
the processor unit is also used for calling the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, performing destruction on the first object;
the processor unit is also used for calling the second thread to release the Caffe dynamic link library from the memory after the first thread has performed destruction on the first object.
2. The device according to claim 1, characterized in that the processor unit is also used for configuring a second process, the second process including a third thread and a fourth thread;
the processor unit is also used for receiving a second load request for a second dynamic link library file while receiving the first load request for the first dynamic link library file, wherein the second dynamic link library file is used for realizing the first function of the first application program;
the processor unit is also used for calling the third thread to load the Caffe dynamic link library into memory according to the second load request, and creating a second object;
the processor unit is also used for calling the third thread to execute the second dynamic link library file and, after the first function of the first application program has been executed, performing destruction on the second object;
the processor unit is also used for calling the fourth thread to release the Caffe dynamic link library from the memory after the third thread has performed destruction on the second object.
3. The device according to claim 1, characterized in that the processor unit is also used for configuring a third process, the third process including a fifth thread and a sixth thread;
the processor unit is also used for receiving a third load request for a third dynamic link library file while receiving the first load request for the first dynamic link library file, wherein the third dynamic link library file is used for realizing a second function of a second application program;
the processor unit is also used for calling the fifth thread to load the Caffe dynamic link library into memory according to the third load request, and creating a third object;
the processor unit is also used for calling the fifth thread to execute the third dynamic link library file and, after the second function of the second application program has been executed, performing destruction on the third object;
the processor unit is also used for calling the sixth thread to release the Caffe dynamic link library from the memory after the fifth thread has performed destruction on the third object.
4. The device according to claim 1, characterized in that the device further includes: an MLU computing unit; each dynamic link library file in the Caffe dynamic link library includes functions under the Caffe framework;
the processor unit is also used for, in the process of loading the Caffe dynamic link library, inputting the functions under the Caffe framework into the MLU computing unit, wherein the MLU computing unit is used for calculating according to the functions under the Caffe framework and operation instructions to obtain a calculation result, and sending the calculation result to the processor unit;
the processor unit is also used for receiving the calculation result.
5. The device according to claim 4, characterized in that the MLU computing unit includes a controller unit and an arithmetic unit; the arithmetic unit includes: a main processing circuit and multiple slave processing circuits;
the controller unit is used for obtaining input data and a computation instruction, wherein the input data includes the function data under the Caffe framework;
the controller unit is also used for parsing the computation instruction to obtain multiple operation instructions, and sending the multiple operation instructions and the input data to the main processing circuit;
the main processing circuit is used for performing pre-processing on the input data and transmitting data and operation instructions with the multiple slave processing circuits;
the multiple slave processing circuits are used for performing intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit to obtain multiple intermediate results, and transmitting the multiple intermediate results to the main processing circuit;
the main processing circuit is used for performing subsequent processing on the multiple intermediate results to obtain the calculation result of the computation instruction.
6. The device according to claim 5, characterized in that the arithmetic unit includes: a tree module, the tree module including: a root port and multiple branch ports, the root port of the tree module being connected with the main processing circuit, and the multiple branch ports of the tree module being respectively connected with one of the multiple slave processing circuits;
the tree module is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the multiple slave processing circuits.
7. The device according to claim 5, characterized in that the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit being connected with at least one slave processing circuit;
the main processing circuit is specifically used for determining that the input neurons are broadcast data and the weights are distribution data, distributing one piece of distribution data into multiple data blocks, and sending at least one data block among the multiple data blocks, the broadcast data and at least one operation instruction among multiple operation instructions to the branch processing circuit;
the branch processing circuit is used for forwarding the data blocks, the broadcast data and the operation instructions between the main processing circuit and the multiple slave processing circuits;
the multiple slave processing circuits are used for performing operations on the received data blocks and broadcast data according to the operation instructions to obtain intermediate results, and transmitting the intermediate results to the branch processing circuit;
the main processing circuit is used for performing subsequent processing on the intermediate results sent by the branch processing circuit to obtain the result of the computation instruction, and sending the result of the computation instruction to the controller unit.
8. The device according to claim 5, wherein the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to K slave processing circuits of the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row, and the m slave processing circuits of the 1st column;
The K slave processing circuits are configured to forward data and instructions between the main processing circuit and the plurality of slave processing circuits;
The main processing circuit is configured to determine that input neurons are broadcast data and weights are distribution data, to divide one distribution data block into a plurality of data blocks, and to send at least one of the plurality of data blocks and at least one of the plurality of operation instructions to the K slave processing circuits;
The K slave processing circuits are configured to forward the data between the main processing circuit and the plurality of slave processing circuits;
The plurality of slave processing circuits are configured to perform operations on the received data blocks according to the operation instructions to obtain intermediate results, and to transmit the operation results to the K slave processing circuits;
The main processing circuit is configured to perform subsequent processing on the intermediate results sent by the K slave processing circuits to obtain the result of the computation instruction, and to send the result of the computation instruction to the controller unit.
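Under the array layout stated in this claim, the K slave processing circuits wired directly to the main processing circuit are the border positions of row 1, row m, and column 1 (the two shared corner positions counted once). A small sketch enumerating those positions, purely for illustration (`k_circuits` is an invented helper, not part of the claim), could look like:

```cpp
#include <cassert>
#include <set>
#include <utility>

// Enumerate the (row, col) positions, 1-indexed, of the K slave processing
// circuits in an m x n array: the whole first row, the whole m-th row, and
// the whole first column. A std::set deduplicates the shared corners.
std::set<std::pair<int, int>> k_circuits(int m, int n) {
    std::set<std::pair<int, int>> k;
    for (int j = 1; j <= n; ++j) {
        k.insert({1, j});   // 1st row
        k.insert({m, j});   // m-th row
    }
    for (int i = 1; i <= m; ++i)
        k.insert({i, 1});   // 1st column
    return k;
}
```

For a 3 x 4 array this yields 9 positions: 4 in the first row, 4 in the last row, and 1 additional circuit in the first column, since the two corner circuits already belong to the rows.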
9. A machine learning computing device, wherein the machine learning computing device includes the device for releasing a dynamic link library according to any one of claims 1-8, the device for releasing a dynamic link library including one or more MLU computing units; the machine learning computing device is configured to obtain input data to be operated on and control information from other processing devices, to execute a specified machine learning operation, and to pass the execution result to the other processing devices through an I/O interface;
When the machine learning computing device includes a plurality of MLU computing units, the plurality of MLU computing units can be connected through a specific structure and transmit data;
Wherein the plurality of MLU computing units are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus to support machine learning operations at a larger scale; the plurality of MLU computing units either share a same control system or have their own control systems; the plurality of MLU computing units either share memory or have their own memories; and the interconnection mode of the plurality of MLU computing units is any interconnection topology.
10. A combined processing device, wherein the combined processing device includes the machine learning computing device according to claim 9, a universal interconnection interface, and other processing devices;
The machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user.
11. The combined processing device according to claim 10, further including a storage device, the storage device being connected to the machine learning computing device and the other processing devices respectively, and configured to save data of the machine learning computing device and the other processing devices.
12. A neural network chip, wherein the neural network chip includes the machine learning computing device according to claim 9 or the combined processing device according to claim 10.
13. An electronic device, wherein the electronic device includes the chip according to claim 12.
14. A board card, wherein the board card includes: a memory device, an interface device, a control device, and the neural network chip according to claim 12;
Wherein the neural network chip is connected to the memory device, the control device, and the interface device respectively;
The memory device is configured to store data;
The interface device is configured to implement data transmission between the chip and an external device;
The control device is configured to monitor a state of the chip.
15. A method for releasing a dynamic link library, wherein the method is applied to a device for releasing a dynamic link library, the device including a processor unit; the method includes:
The processor unit receives a first load request for a first dynamic link library file, wherein the first dynamic link library file is used to implement a first function of a first application; the processor unit is configured to configure a first process, the first process including a first thread and a second thread;
The processor unit calls the first thread to load the Caffe dynamic link library into memory according to the first load request, and creates a first object;
The processor unit calls the first thread to execute the first dynamic link library file, and after the first function of the first application has been executed, performs destruction on the first object;
After the first thread has completed the destruction of the first object, the processor unit calls the second thread to release the Caffe dynamic link library from the memory.
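On a POSIX system, the sequence this claim describes — a first thread loads the library and destructs the object it created, and only afterwards a second thread unloads the library — maps naturally onto `dlopen`/`dlclose`. The sketch below is a minimal illustration under that assumption; `run_once` and `FirstObject` are invented stand-ins and do not reproduce the actual Caffe loading logic:

```cpp
#include <cassert>
#include <dlfcn.h>
#include <thread>

// Stand-in for the claimed "first object"; its destructor may still call
// code from the loaded library, which is why the library must not be
// released until destruction has finished.
struct FirstObject {
    ~FirstObject() { /* cleanup that may touch library code */ }
};

int run_once(const char* lib_path) {
    void* handle = nullptr;
    std::thread first_thread([&] {
        handle = dlopen(lib_path, RTLD_NOW);  // load the library into memory
        if (!handle) return;
        FirstObject obj;                      // create the first object
        // ... call the library's function here ...
    });                                       // obj destructed on scope exit
    first_thread.join();                      // destruction is now complete
    std::thread second_thread([&] {
        if (handle) dlclose(handle);          // release the library from memory
    });
    second_thread.join();
    return handle ? 0 : -1;
}
```

The `join()` between the two threads is what enforces the claimed ordering: the second thread cannot call `dlclose` while destructor code from the library might still be running.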
16. The method according to claim 15, wherein the method further includes: the processor unit configures a second process, the second process including a third thread and a fourth thread;
While receiving the first load request for the first dynamic link library file, the processor unit receives a second load request for a second dynamic link library file, wherein the second dynamic link library file is used to implement the first function of the first application;
The processor unit calls the third thread to load the Caffe dynamic link library into memory according to the second load request, and creates a second object;
The processor unit calls the third thread to execute the second dynamic link library file, and after the first function of the first application has been executed, performs destruction on the second object;
After the third thread has completed the destruction of the second object, the processor unit calls the fourth thread to release the Caffe dynamic link library from the memory.
17. The method according to claim 15, wherein the method further includes: the processor unit configures a third process, the third process including a fifth thread and a sixth thread;
While receiving the first load request for the first dynamic link library file, the processor unit receives a third load request for a third dynamic link library file, wherein the third dynamic link library file is used to implement a second function of a second application;
The processor unit calls the fifth thread to load the Caffe dynamic link library into memory according to the third load request, and creates a third object;
The processor unit calls the fifth thread to execute the third dynamic link library file, and after the second function of the second application has been executed, performs destruction on the third object;
After the fifth thread has completed the destruction of the third object, the processor unit calls the sixth thread to release the Caffe dynamic link library from the memory.
18. The method according to claim 15, wherein the device further includes an MLU computing unit, and each dynamic link library file in the Caffe dynamic link library includes functions under the Caffe framework;
During loading of the Caffe dynamic link library, the processor unit inputs the functions under the Caffe framework to the MLU computing unit, wherein the MLU computing unit is configured to perform calculations according to the functions under the Caffe framework and operation instructions to obtain a calculation result, and to send the calculation result to the processor unit;
The processor unit receives the calculation result.
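The offload pattern in this claim — the processor unit hands framework functions and input data to the computing unit and receives a calculation result back — can be modeled by a minimal interface. `MluUnit` is an invented stand-in for illustration, not a real driver API:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical model of the MLU computing unit: it receives a framework
// function plus input data, performs the calculation, and returns the
// result to the caller (the "processor unit" of the claim).
struct MluUnit {
    int compute(const std::function<int(const std::vector<int>&)>& fn,
                const std::vector<int>& input) const {
        return fn(input);  // the unit evaluates the function on the input data
    }
};
```

A processor-side caller would then construct an `MluUnit`, pass it a function together with the data, and read the returned result, mirroring the send/receive exchange the claim describes.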
19. The method according to claim 18, wherein the MLU computing unit includes a controller unit and an arithmetic unit; the arithmetic unit includes a main processing circuit and a plurality of slave processing circuits;
The controller unit is configured to obtain input data and a computation instruction, wherein the input data includes function data under the Caffe framework;
The controller unit is further configured to parse the computation instruction into a plurality of operation instructions, and to send the plurality of operation instructions and the input data to the main processing circuit;
The main processing circuit is configured to perform preamble processing on the input data and to transmit data and operation instructions with the plurality of slave processing circuits;
The plurality of slave processing circuits are configured to execute intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the main processing circuit;
The main processing circuit is configured to perform subsequent processing on the plurality of intermediate results to obtain a result of the computation instruction.
20. The method according to claim 19, wherein the arithmetic unit includes a tree module; the tree module includes one root port and a plurality of branch ports; the root port of the tree module is connected to the main processing circuit, and the plurality of branch ports of the tree module are each connected to one of the plurality of slave processing circuits;
The tree module is configured to forward data blocks, weights, and operation instructions between the main processing circuit and the plurality of slave processing circuits.
21. The method according to claim 19, wherein the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
The main processing circuit is specifically configured to determine that input neurons are broadcast data and weights are a distribution data block, to divide one distribution data block into a plurality of data blocks, and to send at least one of the plurality of data blocks, the broadcast data, and at least one of the plurality of operation instructions to the branch processing circuits;
The branch processing circuits are configured to forward the data blocks, broadcast data, and operation instructions between the main processing circuit and the plurality of slave processing circuits;
The plurality of slave processing circuits are configured to perform operations on the received data blocks and broadcast data according to the operation instructions to obtain intermediate results, and to transmit the intermediate results to the branch processing circuits;
The main processing circuit is configured to perform subsequent processing on the intermediate results sent by the branch processing circuits to obtain the result of the computation instruction, and to send the result of the computation instruction to the controller unit.
22. The method according to claim 19, wherein the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to K slave processing circuits of the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row, and the m slave processing circuits of the 1st column;
The K slave processing circuits are configured to forward data and instructions between the main processing circuit and the plurality of slave processing circuits;
The main processing circuit is configured to determine that input neurons are broadcast data and weights are distribution data, to divide one distribution data block into a plurality of data blocks, and to send at least one of the plurality of data blocks and at least one of the plurality of operation instructions to the K slave processing circuits;
The K slave processing circuits are configured to forward the data between the main processing circuit and the plurality of slave processing circuits;
The plurality of slave processing circuits are configured to perform operations on the received data blocks according to the operation instructions to obtain intermediate results, and to transmit the operation results to the K slave processing circuits;
The main processing circuit is configured to perform subsequent processing on the intermediate results sent by the K slave processing circuits to obtain the result of the computation instruction, and to send the result of the computation instruction to the controller unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811629632.9A CN109753319B (en) | 2018-12-28 | 2018-12-28 | Device for releasing dynamic link library and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109753319A true CN109753319A (en) | 2019-05-14 |
CN109753319B CN109753319B (en) | 2020-01-17 |
Family
ID=66403216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811629632.9A Active CN109753319B (en) | 2018-12-28 | 2018-12-28 | Device for releasing dynamic link library and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109753319B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110187959A (en) * | 2019-06-04 | 2019-08-30 | 北京慧眼智行科技有限公司 | A kind of dynamic link library multithreading call method and system |
CN111796941A (en) * | 2020-07-06 | 2020-10-20 | 北京字节跳动网络技术有限公司 | Memory management method and device, computer equipment and storage medium |
WO2022006728A1 (en) * | 2020-07-07 | 2022-01-13 | 深圳元戎启行科技有限公司 | Method for managing vehicle-mounted hardware device, and information processing system, vehicle-mounted terminal and storage medium |
CN115227255A (en) * | 2022-07-29 | 2022-10-25 | 四川大学华西医院 | Remote electrocardiogram display method and system based on canvas technology |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561763A (en) * | 2009-04-30 | 2009-10-21 | 腾讯科技(北京)有限公司 | Method and device for realizing dynamic-link library |
US20120304162A1 (en) * | 2010-02-23 | 2012-11-29 | Fujitsu Limited | Update method, update apparatus, and computer product |
CN104572275B (en) * | 2013-10-23 | 2017-12-29 | 华为技术有限公司 | A kind of process loading method, apparatus and system |
2018-12-28: Application CN201811629632.9A filed in China; granted as CN109753319B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN109753319B (en) | 2020-01-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543832A (en) | A kind of computing device and board | |
CN109522052A (en) | A kind of computing device and board | |
CN109753319A (en) | A kind of device and Related product of release dynamics chained library | |
CN109685201A (en) | Operation method, device and Related product | |
CN109657782A (en) | Operation method, device and Related product | |
US11080593B2 (en) | Electronic circuit, in particular capable of implementing a neural network, and neural system | |
CN109740739A (en) | Neural computing device, neural computing method and Related product | |
CN110163362A (en) | A kind of computing device and method | |
CN109375951A (en) | A kind of device and method for executing full articulamentum neural network forward operation | |
CN109740754A (en) | Neural computing device, neural computing method and Related product | |
CN110096310A (en) | Operation method, device, computer equipment and storage medium | |
CN110059797A (en) | A kind of computing device and Related product | |
CN109670581A (en) | A kind of computing device and board | |
CN110119807A (en) | Operation method, device, computer equipment and storage medium | |
CN109739703A (en) | Adjust wrong method and Related product | |
CN111353591A (en) | Computing device and related product | |
CN110059809A (en) | A kind of computing device and Related product | |
CN111368981B (en) | Method, apparatus, device and storage medium for reducing storage area of synaptic connections | |
CN109670578A (en) | Neural network first floor convolution layer data processing method, device and computer equipment | |
CN109711540A (en) | A kind of computing device and board | |
CN109726800A (en) | Operation method, device and Related product | |
CN111381882B (en) | Data processing device and related product | |
CN109740729A (en) | Operation method, device and Related product | |
CN111368967A (en) | Neural network computing device and method | |
CN110472734A (en) | A kind of computing device and Related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: 100000 Room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant after: Zhongke Cambrian Technology Co., Ltd
Address before: 100000 Room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.
GR01 | Patent grant | ||