WO2023124361A1 - Chip, accelerator card, electronic device and data processing method - Google Patents

Chip, accelerator card, electronic device and data processing method

Info

Publication number
WO2023124361A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing
sub
processing unit
networks
network parameters
Prior art date
Application number
PCT/CN2022/124257
Other languages
English (en)
French (fr)
Inventor
冷祥纶
张国栋
李冰
赵月新
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023124361A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • the present disclosure relates to the field of chip technology, and in particular to a chip, an accelerator card, electronic equipment and a data processing method.
  • neural networks have been widely used in various fields such as image processing, fault diagnosis, and video security. Among them, most application scenarios may require the cooperation of multiple neural networks to solve a complex problem.
  • the terminal usually needs to serially dispatch each neural network to the GPU or accelerator card through the Host CPU during implementation. It is difficult to realize the pipeline processing of each neural network, and the processing efficiency is low.
  • an embodiment of the present disclosure provides a chip, the chip including: a control unit, configured to schedule the network parameters of one group of sub-networks among multiple groups of sub-networks included in a neural network to a first processing unit; and a first processing unit, configured to perform first processing on video frames to be processed in a video based on the network parameters of the sub-network dispatched to the processing unit, to obtain a first video frame; wherein the control unit is further configured to dispatch the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit when the first processing is completed.
  • the first processing unit is configured to send an interrupt signal to the control unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; wherein the control unit is configured to dispatch the network parameters of the next group of sub-networks to the first processing unit when the interrupt signal is received.
  • the number of the first processing units is greater than 1, and each first processing unit is configured to perform the first processing on the video frames to be processed based on the network parameters scheduled to the processing unit;
  • the control unit is used for dispatching the network parameters of two adjacent groups of sub-networks to different first processing units respectively.
  • each of the plurality of first processing units is configured to send an interrupt signal to the control unit when the first processing of the processing unit is completed, so that the control unit dispatches the network parameters of the next group of sub-networks to another first processing unit.
  • the control unit is further configured to send, to the first processing unit, an enable signal for enabling the first processing unit to perform the first processing.
  • the chip further includes: a second processing unit, configured to perform second processing on the first video frame to obtain and output the second video frame.
  • the chip further includes: a third processing unit, configured to perform third processing on the video frame to be processed and output it to the first processing unit, so that the first processing unit performs the first processing on the video frame to be processed after the third processing.
  • an embodiment of the present disclosure provides an accelerator card, which includes: a memory unit for storing the network parameters of each group of sub-networks among the multiple groups of sub-networks included in the neural network; and the chip according to any embodiment of the present disclosure.
  • an embodiment of the present disclosure provides an electronic device, the electronic device including: the accelerator card described in the second aspect; and an external processing unit, configured to output the network parameters of each group of sub-networks among the multiple groups of sub-networks included in the neural network to the memory unit.
  • an embodiment of the present disclosure provides a data processing method, applied to the control unit in the chip described in any embodiment of the present disclosure; the method includes: dispatching the network parameters of one group of sub-networks among the multiple groups of sub-networks included in the neural network to the first processing unit; and, when the first processing is completed, dispatching the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit.
  • dispatching the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit includes: receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; and, in response to the interrupt signal, dispatching the network parameters of the next group of sub-networks to the first processing unit.
  • when the number of first processing units is greater than 1, the control unit schedules the network parameters of two adjacent groups of sub-networks to different first processing units, so that each of the multiple first processing units performs the first processing on the video frame to be processed based on the network parameters scheduled to the processing unit.
  • dispatching the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit includes: receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; and, in response to the interrupt signal, dispatching the network parameters of the next group of sub-networks to first processing units among the plurality of first processing units other than the first processing unit that sent the interrupt signal.
  • the method further includes: sending an enable signal to the first processing unit to enable the first processing unit to perform the first processing.
  • an embodiment of the present disclosure provides a non-transitory computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the steps in the method described in any embodiment of the present disclosure are implemented.
  • the control unit in the chip controls and schedules each sub-network of the neural network to the first processing unit for AI computation (including training and/or inference); when the first processing of one group of sub-networks is completed, the control unit dispatches the network parameters of the next group of sub-networks to the first processing unit for the next stage of first processing, so that the multiple groups of sub-networks included in the neural network are processed in a pipelined manner, thereby improving processing efficiency.
  • FIG. 1 is a schematic diagram of a chip of an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a data processing pipeline according to an embodiment of the present disclosure
  • Fig. 3 is a schematic diagram of a data processing pipeline according to another embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of data flow between multiple processing units according to an embodiment of the present disclosure.
  • Fig. 5 is a schematic diagram of an accelerator card shown according to an exemplary embodiment of the present disclosure.
  • Fig. 6 is a schematic diagram of an electronic device according to an exemplary embodiment of the present disclosure.
  • Fig. 7 is a flow chart showing a data processing method according to an exemplary embodiment of the present disclosure.
  • Fig. 8 is a block diagram of a data processing device according to an exemplary embodiment of the present disclosure.
  • first, second, third, etc. may be used in the present disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word “if” as used herein may be interpreted as “at” or “when” or “in response to a determination.”
  • neural networks have been widely used in various fields such as image processing, fault diagnosis, and video security.
  • most application scenarios may require the cooperation of multiple neural networks to solve a complex problem.
  • for example, when performing an image quality assessment task, it may be necessary to combine several neural networks that extract features of an image in different quality dimensions with a neural network that scores those features, in order to obtain a score value that characterizes the image quality.
  • the terminal usually needs to serially dispatch each neural network to the GPU or accelerator card through the Host CPU during implementation, and it is difficult to realize the pipeline processing of each neural network.
  • the terminal generally uses a GPU or accelerator card to implement AI training (that is, the process of training a neural network with sample data) or AI inference (reasoning is the process of drawing conclusions from existing facts and knowledge according to a certain strategy, and AI inference is the process of using artificial intelligence algorithms to process input data).
  • the data of each neural network is generally stored in the host main memory, and the host CPU serially dispatches each neural network to the GPU or accelerator card when in use.
  • for example, the Host CPU first dispatches the video frames to be processed and the network parameters of the neural network used to detect the face region in the video frames to the GPU or accelerator card; after the GPU or accelerator card completes model inference, the Host CPU then dispatches the network parameters of the neural network used for facial local feature analysis to the GPU or accelerator card, and so on. In this way, it is difficult to realize pipelined processing among the neural networks, which easily affects processing efficiency.
  • FIG. 1 is a schematic diagram of a chip according to an exemplary embodiment of the present disclosure. The chip includes:
  • the control unit 102 is configured to dispatch the network parameters of a group of sub-networks in the multiple groups of sub-networks included in the neural network to the first processing unit 101;
  • the first processing unit 101 is configured to perform first processing on video frames to be processed in the video based on network parameters of the sub-network dispatched to the processing unit, to obtain a first video frame.
  • the control unit 102 is further configured to dispatch network parameters of a next group of sub-networks in the multiple groups of sub-networks to the first processing unit when the first processing is completed.
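  • The scheduling behaviour described above can be pictured with a short sketch. The following snippet is an illustrative assumption only (the names ControlUnit, FirstProcessingUnit, load and first_processing are not from the patent): the control unit dispatches one group of sub-network parameters at a time and dispatches the next group once the first processing of the current group is complete.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class SubNetworkGroup:
    """One group of sub-networks: an identifier plus its network parameters."""
    name: str
    apply: Callable[[List[str]], List[str]]   # stands in for "first processing"

class FirstProcessingUnit:
    """A compute unit (CU) that runs whichever group is currently dispatched to it."""
    def __init__(self) -> None:
        self.current: Optional[SubNetworkGroup] = None

    def load(self, group: SubNetworkGroup) -> None:
        self.current = group                   # network parameters dispatched to this CU

    def first_processing(self, frames: List[str]) -> List[str]:
        return self.current.apply(frames)      # first processing -> "first video frames"

class ControlUnit:
    """Dispatches one group at a time; the next group is dispatched on completion."""
    def run(self, cu: FirstProcessingUnit,
            groups: List[SubNetworkGroup], frames: List[str]) -> List[str]:
        data = frames
        for group in groups:
            cu.load(group)                     # schedule this group's parameters
            data = cu.first_processing(data)   # completion triggers the next dispatch
        return data

if __name__ == "__main__":
    groups = [SubNetworkGroup("face_detection", lambda fs: [f + ">det" for f in fs]),
              SubNetworkGroup("face_attributes", lambda fs: [f + ">attr" for f in fs])]
    print(ControlUnit().run(FirstProcessingUnit(), groups, ["frame1", "frame2"]))
```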
  • the aforementioned chip may be, for example, a data processing chip such as an AI chip or a graphics processor chip.
  • the above-mentioned chip can be applied in an accelerator card.
  • the following uses the AI accelerator card as an example for illustration.
  • An AI accelerator card is a processor product specially designed to accelerate the execution of AI algorithms.
  • the AI accelerator card can be a circuit board including hardware modules such as chips for AI operations and communication interfaces.
  • the chip used for AI computation may be any one of a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit), or another type of chip, which is not limited in the present disclosure.
  • the AI accelerator card can be inserted into a slot and communicate with the Host CPU through a communication interface such as PCIe (Peripheral Component Interconnect Express); of course, in other embodiments, the AI accelerator card may also use other types of interfaces, such as a QPI (Quick Path Interconnect) interface, as the interaction channel with the Host CPU.
  • the AI accelerator card may include a memory unit for storing network parameters of each group of sub-networks in the multiple groups of sub-networks included in the neural network.
  • the above-mentioned memory unit is a memory integrated on the AI accelerator card, which is an integral part of the memory of the AI accelerator card.
  • the memory unit can be DRAM (Dynamic Random Access Memory) or SDRAM (Synchronous Dynamic Random Access Memory), or another type of memory, which is not limited in the present disclosure.
  • Each group of sub-networks can be an independent neural network, or include one or more network layers of the neural network.
  • the network parameters of each sub-network may include operators (for example, addition operators, multiplication operators, etc.) and parameters required by the operators to perform operations.
  • a set of sub-networks can map inputs to outputs through their corresponding network parameters.
  • a group of sub-networks can realize, for example, a function, wherein at least one sub-network is included in a group of sub-networks.
  • the number of sub-networks included in each set of sub-networks may be the same or different. Different sub-networks can achieve different functions.
  • a set of neural networks for performing face recognition tasks in videos can include a sub-network for face detection and a sub-network for facial local feature analysis, etc.
  • the type of a sub-network can be CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory) network, etc.; the types of different sub-networks may be the same or different, which is not limited in the present disclosure.
  • the above-mentioned memory unit may be provided with multiple partitions, and network parameters of different sub-networks are stored in different partitions. That is to say, the storage space of the memory unit can adopt a hierarchical partition structure, so as to realize the partition storage management of the network parameters of each sub-network.
  • a partition table can be created that records information such as the starting address, size, and identifier of the corresponding sub-network for each partition, so that other units can look up the partition table to determine the storage address of a specified sub-network and then read the network parameters of that sub-network according to the storage address; alternatively, different partition identifiers can be set for different sub-networks, and the memory unit determines the partition path according to a sub-network's partition identifier and stores the network parameters of the sub-network in the partition corresponding to the partition path. Correspondingly, other units can also find the network parameters of a specified sub-network in the memory unit according to its partition identifier. In other embodiments, other storage management methods may also be set according to the needs of specific scenarios.
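  • As an illustration of the partition-table idea described above, the following sketch is a minimal assumption (not the patent's actual data layout; the field and function names are hypothetical): each partition records a start address, size and sub-network identifier, so another unit can resolve a sub-network identifier to the address of its network parameters.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class PartitionEntry:
    start_address: int   # byte offset of the partition within the memory unit
    size: int            # partition size in bytes
    subnetwork_id: str   # identifier of the sub-network stored in this partition

# one entry per partition; different sub-networks live in different partitions
partition_table: Dict[str, PartitionEntry] = {
    "subnet1": PartitionEntry(0x0000_0000, 64 * 1024 * 1024, "subnet1"),
    "subnet2": PartitionEntry(0x0400_0000, 64 * 1024 * 1024, "subnet2"),
}

def locate_parameters(subnetwork_id: str) -> Tuple[int, int]:
    """Look up where the specified sub-network's parameters are stored."""
    entry = partition_table[subnetwork_id]
    return entry.start_address, entry.size

print(locate_parameters("subnet2"))   # (67108864, 67108864)
```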
  • the above-mentioned first processing unit 101 may be a functional module in an AI accelerator card for performing AI computing tasks, and the first processing unit 101 may be called a computing unit (Computing Unit, CU).
  • the first processing unit 101 may include multiple subunits, such as a subunit performing addition operations, a subunit performing multiplication operations, and a subunit performing convolution-multiplication operations. These subunits may be called reconfigurable clusters, and the reconfigurable clusters can be combined, through reconfiguration, into different computing paths to perform different tasks.
  • the first processing performed by the first processing unit on the video frame to be processed based on the network parameters may refer to that the first processing unit uses a corresponding sub-network to perform corresponding processing on the video frame to be processed.
  • the specific content of the first processing may be determined according to the function implemented by the corresponding sub-network. For example, one group of sub-networks may be used to perform feature extraction on the video frame to be processed to obtain a feature map corresponding to the video frame; in this case, the first processing refers to feature extraction on the video frame to obtain the corresponding feature map. Another group of sub-networks may be used to perform target detection on a target object in the video frame based on the feature map, to obtain a target detection result corresponding to the video frame; in this case, the first processing refers to target detection on the target object in the video frame to obtain the target detection result. A further group of sub-networks may be used to recognize the target object in the video frame based on the target detection result corresponding to the video frame, to obtain the category of the target object; in this case, the first processing refers to recognizing the target object based on the target detection result to obtain its category.
  • the AI computing task performed by the first processing unit may be AI training or AI inference; for brevity, the following description uses AI inference as an example.
  • the operation of the first processing unit to perform the first processing on the video frame to be processed based on the network parameters of the sub-network is executed under the control and scheduling of the control unit.
  • the first processing unit can sequentially use a suitable sub-network to process the video frames.
  • the control unit may be an MCU (Microcontroller Unit). That is to say, different from the AI accelerator card that is used purely as a slave device in the related art, the embodiment of the present disclosure adds an MCU to the AI accelerator card to replace the Host CPU in controlling and scheduling each sub-network.
  • the control unit may also be a processor chip with architectures such as ARM, RISC-V, and PowerPC.
  • the control unit may control and schedule the sub-networks dispatched to the first processing unit by sending an enable signal to the first processing unit.
  • the enable signal is similar to a trigger signal, through which the first processing unit can be triggered to read network parameters, and then perform first processing on the video frame to be processed based on the read network parameters.
  • the control unit configures, for the first processing unit, the storage address of the network parameters to be read, so that the first processing unit reads the corresponding data.
  • the first processing unit may be configured to send an interrupt signal to the control unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; the control unit is configured to dispatch the network parameters of the next group of sub-networks to the first processing unit when the interrupt signal is received. That is to say, when the processing of the video frame to be processed based on one group of sub-networks is completed, the first processing unit can notify the control unit by sending an interrupt signal, so that the control unit schedules the next group of sub-networks.
  • in this way, the AI accelerator card can enable the AI inference of the second group of sub-networks immediately after completing the AI inference of the first group of sub-networks, and so on, until the AI inference of the last group of sub-networks is completed, thereby realizing pipelined processing of the AI inference of each group of sub-networks and improving processing efficiency.
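  • The interrupt-driven hand-off described above can be sketched as follows. This is a hedged illustration only (ComputeUnit, on_interrupt, enable and the parameter names are assumptions, not the chip's real interface): the CU raises an interrupt when its first processing completes, and the control unit's handler responds by dispatching the next group of sub-network parameters and re-enabling the CU.

```python
from collections import deque

class ComputeUnit:
    """Stand-in for a first processing unit (CU)."""
    def __init__(self):
        self.params = None
        self.on_interrupt = None        # completion callback, set by the control unit

    def load(self, params):
        self.params = params            # network parameters dispatched to the CU

    def enable(self):
        print(f"first processing with {self.params}")
        self.on_interrupt()             # signal completion back to the control unit

class ControlUnit:
    def __init__(self, groups):
        self.pending = deque(groups)    # groups of sub-network parameters, in order
        self.cu = None

    def start(self, cu):
        self.cu = cu
        cu.on_interrupt = self.handle_interrupt
        self.dispatch_next()

    def dispatch_next(self):
        if self.pending:
            self.cu.load(self.pending.popleft())
            self.cu.enable()            # enable signal triggers the first processing

    def handle_interrupt(self):
        # interrupt received: the current group's first processing is complete
        self.dispatch_next()

ControlUnit(["model1_params", "model2_params"]).start(ComputeUnit())
```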
  • the number of first processing units included in the chip may be N, where N may be 1, that is, one CU may complete the AI inference of all sub-networks included in the neural network; or N may be greater than or equal to 2, that is, the AI inference of the multiple groups of sub-networks can be completed by multiple CUs.
  • each first processing unit can be used to perform the first processing on the video frames to be processed based on the network parameters dispatched to the processing unit, and the control unit can be used to dispatch the network parameters of two adjacent groups of sub-networks to different first processing units respectively.
  • each first processing unit can send an interrupt signal to the control unit when the first processing of the processing unit is completed, so that the control unit can schedule the network parameters of the next group of sub-networks to another first processing unit.
  • other first processing units may continue to process the video frames processed by this processing unit based on the network parameters of the next group of sub-networks. In this way, the orderly scheduling of each sub-network can be realized, and the dynamic selection of the next group of sub-network models can be realized.
  • in some embodiments, when the next group of sub-network models is determined, that is, in the case of a fixed pipeline, the network parameters of each group of sub-networks may also be preloaded to the corresponding first processing units. In this way, when the first processing of one processing unit is completed, another first processing unit may continue to process the video frame based on the network parameters of the next group of sub-networks, thereby realizing orderly scheduling of each sub-network.
  • the value of N may correspond to the number of sub-networks included in the neural network.
  • for example, the AI accelerator card may include two first processing units, each of which processes the video frames to be processed based on the network parameters of one group of sub-networks, with different processing units corresponding to different sub-networks. After the first processing unit finishes processing a group of video frames based on the first group of sub-networks, this group of video frames can be transferred to the second processing unit, which processes them based on the second group of sub-networks; at this time, the first processing unit starts to process the next group of video frames based on the first group of sub-networks, thereby realizing pipelined processing of multiple groups of video frames.
  • as shown in FIG. 2, the Host CPU can transmit the video code stream to be processed to the memory unit of the AI accelerator card through the PCIe interface, and the video frames to be processed are obtained after the code stream is decoded by a video decoding unit (video codec). The MCU can schedule the video frames to be processed to the first processing unit CU1 and, at the same time, schedule the network parameters of the first group of sub-networks (model1) to CU1, so that CU1 processes the video frames to be processed based on model1. The processing results obtained by CU1 can be output to the second processing unit CU2; meanwhile, the MCU schedules the network parameters of the second group of sub-networks (model2) to CU2, so that CU2 processes the output of CU1 based on model2 to obtain the final output result. In this way, pipelined processing of model1 and model2 is realized through CU1 and CU2.
  • the network parameters of some of the sub-networks can be dispatched to the first processing units in sequence, and when one or some of the first processing units complete the first processing, the network parameters of the remaining sub-networks are dispatched to the first processing units that have completed the first processing.
  • the number n of video frames to be processed at one time may be set, and when a first processing unit finishes processing n video frames, it is determined that the first processing unit has completed the first processing.
  • network parameters of another group of sub-networks may be dispatched to the first processing unit.
  • FIG. 3 is a schematic diagram of a process of dispatching sub-networks to CUs to process video frames according to an embodiment of the present disclosure, in which there are three CUs in the AI accelerator card, namely CU1, CU2 and CU3, and the neural network includes 6 sub-networks, namely sub-network 1, sub-network 2, sub-network 3, sub-network 4, sub-network 5 and sub-network 6 (marked as M1, M2, M3, M4, M5 and M6 in sequence in the figure); it is assumed that each sub-network constitutes one group of sub-networks.
  • the steps of each CU processing a group of video frames to be processed in the video can be expressed as:
  • In the first stage, sub-network 1 is dispatched to CU1, and CU1 processes video frame 1 based on sub-network 1, while CU2 and CU3 are in an idle state (in FIG. 3, "M1:V1" indicates that video frame 1 is processed based on sub-network 1, and "Empty" means idle, i.e. not performing any processing);
  • In the second stage, after processing video frame 1, CU1 passes video frame 1 to CU2 and sub-network 2 is dispatched to CU2; at this time, CU1 processes video frame 2 based on sub-network 1, CU2 processes video frame 1 based on sub-network 2, and CU3 is in an idle state;
  • In the third stage, after processing video frame 1, CU2 passes video frame 1 to CU3 and sub-network 3 is dispatched to CU3, and after processing video frame 2, CU1 passes video frame 2 to CU2; at this time, CU1 processes video frame 3 based on sub-network 1, CU2 processes video frame 2 based on sub-network 2, and CU3 processes video frame 1 based on sub-network 3;
  • In the fourth stage, after processing video frame 1, CU3 passes video frame 1 to CU1 and sub-network 4 is dispatched to CU1; after processing video frame 2, CU2 passes video frame 2 to CU3; and after processing video frame 3, CU1 passes video frame 3 to CU2. At this time, CU1 processes video frame 1 based on sub-network 4, CU2 processes video frame 3 based on sub-network 2, and CU3 processes video frame 2 based on sub-network 3;
  • In the fifth stage, after processing video frame 2, CU3 passes video frame 2 to CU1; after processing video frame 3, CU2 passes video frame 3 to CU3; and after processing video frame 1, CU1 passes video frame 1 to CU2 and sub-network 5 is dispatched to CU2. At this time, CU1 processes video frame 2 based on sub-network 4, CU2 processes video frame 1 based on sub-network 5, and CU3 processes video frame 3 based on sub-network 3;
  • In the sixth stage, after processing video frame 3, CU3 passes video frame 3 to CU1; after processing video frame 1, CU2 passes video frame 1 to CU3 and sub-network 6 is dispatched to CU3; and after processing video frame 2, CU1 passes video frame 2 to CU2. At this time, CU1 processes video frame 3 based on sub-network 4, CU2 processes video frame 2 based on sub-network 5, and CU3 processes video frame 1 based on sub-network 6;
  • In the seventh stage, after CU3 completes the processing of video frame 1, model inference over all sub-networks has been completed for video frame 1, and video frame 1 is stored in memory; after processing video frame 2, CU2 passes video frame 2 to CU3; after processing video frame 3, CU1 passes video frame 3 to CU2, and sub-network 1 is rescheduled to CU1. At this time, CU1 processes video frame 4 based on sub-network 1, CU2 processes video frame 3 based on sub-network 5, and CU3 processes video frame 2 based on sub-network 6;
  • In the eighth stage, the process continues in the same manner until model inference over all sub-networks is completed for all video frames to be processed.
  • the above video frame 1, video frame 2 and video frame 3 belong to the same group of video frames, and the number of video frames in each group of video frames is the set number of video frames to be processed at one time.
  • each CU starts to process the set number of video frames based on the next sub-network scheduled to it when it has finished processing the set number of video frames based on the sub-network currently scheduled to it; and when one of the set of video frames has completed model inference over all sub-networks, if there is an idle CU, the idle CU starts processing the next set of video frames. It can be seen that a pipeline is formed between the sub-networks, which realizes parallel processing of multiple groups of video frames and improves processing efficiency. A minimal simulation of this schedule is sketched below.
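  • The stage table above can be reproduced by a small simulation. The sketch below is an assumption-laden illustration (not firmware): three CUs, six sub-networks assigned round-robin (M1/M4 to CU1, M2/M5 to CU2, M3/M6 to CU3), each CU switching to its next sub-network after processing a set of three frames, and each processed frame routed to the CU that holds the next sub-network. Running it prints the same stage-by-stage assignments as described above for stages 1 through 7.

```python
from collections import deque

NUM_CUS, FRAMES_PER_SET = 3, 3
SUBNETS = [1, 2, 3, 4, 5, 6]                     # sub-networks M1..M6

# adjacent sub-networks go to different CUs: CU1 -> M1,M4; CU2 -> M2,M5; CU3 -> M3,M6
assigned = {cu: [m for m in SUBNETS if (m - 1) % NUM_CUS == cu] for cu in range(NUM_CUS)}

class CU:
    def __init__(self, cu_id):
        self.models = deque(assigned[cu_id])     # order in which sub-networks are scheduled here
        self.current = self.models[0]
        self.done_with_current = 0
        self.queue = deque()                     # (frame, required sub-network) waiting here

    def step(self):
        """Process at most one queued frame whose required sub-network is loaded here."""
        if self.queue and self.queue[0][1] == self.current:
            frame, m = self.queue.popleft()
            self.done_with_current += 1
            if self.done_with_current == FRAMES_PER_SET:   # finished the set of frames
                self.models.rotate(-1)                     # schedule the next sub-network here
                self.current = self.models[0]
                self.done_with_current = 0
            return frame, m
        return None

cus = [CU(i) for i in range(NUM_CUS)]
next_frame = 1
for stage in range(1, 8):
    # a new frame enters the pipeline at CU1 while CU1 is running sub-network 1
    if cus[0].current == 1 and not cus[0].queue and next_frame <= 6:
        cus[0].queue.append((next_frame, 1)); next_frame += 1
    results = [cu.step() for cu in cus]
    for res in results:                          # route each output to the next sub-network's CU
        if res and res[1] < len(SUBNETS):
            cus[res[1] % NUM_CUS].queue.append((res[0], res[1] + 1))
    print(f"stage {stage}:",
          ["Empty" if r is None else f"M{r[1]}:V{r[0]}" for r in results])
```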
  • in some scenarios, it is also necessary to post-process the video frames processed by one group of sub-networks before they are processed by the next group of sub-networks. In the related art, this processing is generally implemented by the Host CPU, which means the Host CPU has to read the video frames processed by a group of sub-networks from the memory of the GPU/AI accelerator card, and write them back into that memory after processing, so that the GPU/AI accelerator card can process the post-processed video frames based on the next group of sub-networks. This affects the pipelining between the sub-networks.
  • the chip of the present disclosure may further include a second processing unit, which is configured to perform second processing on the first video frame to obtain a second video frame, and to output the second video frame to the memory unit for storage.
  • the second processing, that is, the above-mentioned post-processing, may include at least one of cropping, sharpening, rotation, scaling, transparency processing, and other processing. That is to say, in the present disclosure, a second processing unit is added to the AI accelerator card to replace the Host CPU in performing model post-processing tasks, which can achieve better pipelining.
  • the above-mentioned second processing unit may be configured to perform image segmentation on the first video frame to obtain an image including the target object in the first video frame, and output the image to the memory unit.
  • the target object can be a complete object, such as a person, a basketball, a tree, etc., or a partial part of an object, such as a human face, human eyes, tree branches, etc.
  • the target object may be determined according to the processing objects of the next group of sub-networks.
  • for example, the target object can be a human face; in this case, the second processing unit can perform image segmentation on the first video frame, crop out an image containing the face region in the first video frame, and output the image to the memory unit for processing and use by the next group of sub-networks.
  • the chip of the present disclosure may include a third processing unit, and the third processing unit is used to perform third processing on the video frame to be processed and output it to the first processing unit, so that the first processing unit performs the first processing on the video frame to be processed after the third processing.
  • the third processing may also be referred to as pre-processing or pre-processing, and may include at least one of image data pre-processing such as adjusting brightness, adjusting contrast, adjusting size, image segmentation, and normalization.
  • the present disclosure adds a third processing unit to the AI accelerator card to replace the Host CPU in performing model pre-processing tasks, so that "model pre-processing - model inference - model post-processing" can be executed inside the AI accelerator card, achieving better pipelining and at the same time reducing the use of Host CPU resources and the frequency of Host CPU accesses to the memory of the AI accelerator card.
  • the chip of the embodiment of the present disclosure can implement a pipelined process of pre-processing, model processing, and post-processing.
  • as shown in FIG. 4, the video frame pic1 obtained after video decoding can be output to CU1, which can include a pre-processing unit (which can be regarded as a third processing unit), a model processing unit (which can be regarded as a first processing unit) and a post-processing unit (which can be regarded as a second processing unit); CU1 performs pre-processing, model processing based on the first group of sub-networks, and post-processing in sequence. The video frame pic2 output by the post-processing unit in CU1 can be further output to CU2 (which, similar to CU1, may also include a pre-processing unit, a model processing unit, and a post-processing unit). Similarly, CU2 sequentially performs pre-processing, model processing based on the second group of sub-networks, and post-processing, and outputs the processing result, and so on until the final output result is obtained. It should be noted that the video frame output by the post-processing unit in one CU can be used as the video frame to be processed by the pre-processing unit in the next CU.
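  • As a compact illustration of the per-CU flow just described (an assumption-based sketch; the lambdas merely tag what each stage would do, and names such as make_cu are hypothetical), each CU composes a pre-processing unit (third processing), a model processing unit (first processing) and a post-processing unit (second processing), and the post-processed output of one CU is the input of the next CU's pre-processing unit.

```python
from typing import Callable

def make_cu(pre: Callable[[str], str], model: Callable[[str], str],
            post: Callable[[str], str]) -> Callable[[str], str]:
    """Compose third processing -> first processing -> second processing inside one CU."""
    def run(frame: str) -> str:
        return post(model(pre(frame)))
    return run

# CU1: decoded frame pic1 -> pre-processed -> model1 -> cropped frame pic2
cu1 = make_cu(pre=lambda f: f + "|resize",
              model=lambda f: f + "|model1",
              post=lambda f: f + "|crop")
# CU2: pic2 -> pre-processed -> model2 -> final post-processed result
cu2 = make_cu(pre=lambda f: f + "|normalise",
              model=lambda f: f + "|model2",
              post=lambda f: f + "|score")

pic1 = "pic1"
pic2 = cu1(pic1)            # output of CU1's post-processing unit
print(cu2(pic2))            # pic1|resize|model1|crop|normalise|model2|score
```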
  • the disclosed chip can also reduce the occupation of Host CPU resources in terms of video decoding.
  • the chip of the present disclosure may further include a video decoding unit, which may be used to perform video decoding on the video under the control of the control unit, obtain a video frame to be processed, and output the video frame to be processed to the memory unit.
  • the video decoding unit may include a video codec, that is, a program or device capable of compressing or decompressing video.
  • the Host CPU only needs to transfer the video to be processed to the memory of the AI accelerator card; the control unit in the AI accelerator card can then control the video decoding unit to decode the video and output the decoded video frames to be processed to the memory unit for subsequent processing by the first processing unit, the second processing unit, and the third processing unit. In this way, the video decoding task is offloaded to the AI accelerator card side, reducing the access frequency between the Host CPU and the memory of the AI accelerator card.
  • the video decoding unit may send a first interrupt signal to the control unit when the video decoding is completed, so that the control unit, in response to the first interrupt signal, dispatches the network parameters to, for example, the first processing unit. That is to say, the video decoding unit can notify the control unit that video decoding is completed in the form of an interrupt signal, thereby improving the internal processing efficiency of the AI accelerator card.
  • control unit may immediately control the video decoding unit to start video decoding after the AI accelerator card is started. In some other examples, the control unit may also control the video decoding unit to start video decoding in response to receiving a control instruction sent by the external processing unit.
  • the external processing unit here can be Host CPU. In other embodiments, the external processing unit may also be other Device (device) terminals other than the AI accelerator card, such as an external device connected to the terminal where the AI accelerator card is located.
  • the external processing unit sends a control command to the control unit to trigger the control unit to control the video decoding unit to start video decoding. In this way, the video decoding function of the AI accelerator card can be enabled when needed, and then the AI reasoning function of the subsequent AI accelerator card can be triggered.
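  • The start-up sequence just described can be sketched as follows (a hypothetical illustration; the class and method names are assumptions): a control command from the external processing unit triggers the control unit to start video decoding, and the decoder's completion interrupt in turn triggers the first dispatch of network parameters.

```python
class AcceleratorCardController:
    """Stand-in for the control unit inside the AI accelerator card."""

    def on_control_command(self, video: str) -> None:
        # control instruction received from the external processing unit (e.g. Host CPU)
        self.start_video_decoding(video)

    def start_video_decoding(self, video: str) -> None:
        frames = [f"{video}:frame{i}" for i in range(1, 4)]   # stand-in for the video codec
        self.on_first_interrupt(frames)                       # decoding complete -> interrupt

    def on_first_interrupt(self, frames: list) -> None:
        # first interrupt signal received: dispatch the first group's network parameters
        print(f"decoded {len(frames)} frames; dispatching sub-network parameters to the CU")

AcceleratorCardController().on_control_command("video_stream")
```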
  • Host CPU is used to refer to the host processor at the Host side.
  • the host processor may also be a processor other than a CPU, such as an MPU (Microprocessor Unit), etc., which is not limited in the present disclosure.
  • the chip provided by the embodiment of the present disclosure is applied to an accelerator card.
  • the memory unit of the accelerator card stores the network parameters of each group of sub-networks in the neural network.
  • the chip, through its control unit, controls and schedules each sub-network of the neural network to perform AI inference in the first processing unit. In this way, the task of controlling and scheduling each sub-network is offloaded to the accelerator card, and the data transfer between sub-networks no longer needs to rely on the control and scheduling of the Host CPU, realizing pipelined processing among the multiple groups of sub-networks included in the neural network and thereby improving processing efficiency.
  • the neural network in the embodiment of the present disclosure may be a neural network for face detection, and the neural network may include a sub-network for target tracking and a sub-network for attribute analysis of face images.
  • the solutions of the embodiments of the present disclosure may also be used in other application scenarios, which will not be listed here.
  • the chip of this embodiment is applied to an AI accelerator card, and the AI accelerator card is inserted into a slot of a motherboard of a terminal.
  • the scenario of this embodiment is a scenario where the terminal performs face recognition on the received video by using neural network technology.
  • multiple steps such as video decoding, face region recognition, and facial local feature analysis are required to extract features for the final face recognition, which involves N (N is a positive integer greater than 1) groups of sub-networks. Taking the case where the N groups of sub-networks include sub-network 1 and sub-network 2, with each sub-network constituting one group, as an example: sub-network 1 is a neural network for face region recognition, and sub-network 2 is a neural network for facial local feature analysis.
  • the AI accelerator card is generally used as a pure slave device to assist the work of the Host CPU.
  • the process of the terminal executing AI reasoning includes:
  • Host CPU decodes the video to get frame data
  • Host CPU downloads the frame data and subnetwork 1 to the memory of the AI accelerator card;
  • the AI accelerator card performs model inference on the frame data through the subnetwork 1, and stores the result in the memory;
  • Host CPU reads the output result of subnetwork 1 from the memory of the AI accelerator card;
  • Host CPU processes the results output by subnetwork 1;
  • Host CPU downloads the processed data and subnetwork 2 to the memory of the AI accelerator card;
  • the AI accelerator card performs model reasoning on the processed data through the sub-network 2, and stores the results in the memory;
  • the Host CPU reads the inference results stored in the memory of the AI accelerator card.
  • the improved AI accelerator card includes MCU (ie, the above-mentioned control unit), DRAM (ie, the above-mentioned memory unit), Video Codec (ie, the above-mentioned video decoding unit), PCIe and Several CUs (may include the above-mentioned first processing unit).
  • DRAM is the memory of the AI accelerator card and is used to store the network parameters of the neural network and the video frames before and after processing; the MCU is used to schedule the network parameters of the models to the CUs and to control and schedule the Video Codec and JPEG Codec; each CU is used to perform model inference on video frames based on the neural network.
  • the CU also includes a pre-processing unit and a post-processing unit.
  • the pre-processing unit is responsible for the pre-processing tasks of the model, and the post-processing unit is used for the post-processing of the model.
  • the Video Codec is a video codec used to decode video; PCIe is the interaction channel between the AI accelerator card and the Host CPU, used to implement command delivery and data transmission between the AI accelerator card and the Host CPU, and the video code stream can also be transmitted to the memory of the AI accelerator card through PCIe.
  • the above-mentioned modules are interconnected through an NoC (Network on Chip) and other interconnect buses and memory.
  • the process for the terminal to perform AI reasoning includes:
  • Host CPU transmits the video to be processed, the network parameters of subnetwork 1 and the parameters of subnetwork 2 to the memory of the AI accelerator card;
  • the MCU in the AI accelerator card controls the Video Codec to decode the video in the memory, controls the pre-processing unit to perform data pre-processing (for example, size adjustment, image segmentation and data normalization) on the decoded video frames, and then stores the obtained large-frame images in the memory;
  • the MCU in the AI accelerator card controls the CU to read the decoded large frame image and transmit it to sub-network 1, and then enable sub-network 1 to start the first stage of model reasoning, and the obtained result data is transmitted to the post-processing unit;
  • the MCU in the AI accelerator card controls the post-processing unit to segment the output data of sub-network 1 to obtain a small frame image that meets the requirements (for example, remove background features and retain local features of the face, etc.), and store the small frame image into memory;
  • the MCU in the AI accelerator card controls the CU to read the small frame image and transmit it to the subnetwork 2, and then enables the subnetwork 2 to start the second stage of model reasoning, and the obtained result data is transmitted to the post-processing unit;
  • the post-processing unit performs post-processing and obtains the final processing result.
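  • A condensed sketch of the improved flow listed above is given below for illustration (all function names are hypothetical stand-ins, not the card's firmware): after the Host CPU has transferred the video and the parameters of sub-network 1 and sub-network 2 into card memory, the MCU orchestrates decoding, pre-processing, the two inference stages and post-processing entirely inside the accelerator card.

```python
from typing import List

def decode(video: str) -> List[str]:                       # Video Codec
    return [f"{video}:frame{i}" for i in range(1, 3)]

def preprocess(frame: str) -> str:                         # pre-processing unit (third processing)
    return frame + "|resized+normalised"

def infer(frame: str, subnet: str) -> str:                 # CU model inference (first processing)
    return frame + f"|{subnet}"

def postprocess(frame: str, step: str) -> str:             # post-processing unit (second processing)
    return frame + f"|{step}"

def mcu_pipeline(video: str) -> List[str]:
    results = []
    for large_frame in map(preprocess, decode(video)):
        stage1 = infer(large_frame, "subnet1")              # face-region recognition
        small_frame = postprocess(stage1, "face_crop")      # keep the local face features
        stage2 = infer(small_frame, "subnet2")              # facial local feature analysis
        results.append(postprocess(stage2, "final"))        # final processing result
    return results

print(mcu_pipeline("video"))
```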
  • it should be noted that the above description takes sub-network 1 and sub-network 2 as examples; in practical applications, the number of groups of sub-networks can be greater than or equal to 3, that is, there can be more groups of neural networks, and for sub-networks after sub-network 2, the model inference process is similar to that of sub-network 2 and is not repeated in this embodiment.
  • in this embodiment, different CUs are responsible for different sub-networks, forming a hardware pipeline operation, and the MCU is responsible for the control and scheduling of the pipeline work. In this way, "model pre-processing - model inference - model post-processing" is executed inside the AI accelerator card, achieving better pipelined processing; moreover, different video frames can be pipelined among multiple CUs.
  • the present disclosure also provides embodiments of the AI accelerator card and its corresponding devices.
  • FIG. 5 is a schematic diagram of an accelerator card according to an exemplary embodiment of the present disclosure, and the accelerator card includes:
  • the memory unit 501 is used to store the network parameters of the neural network.
  • the aforementioned chip 502 may be the chip described in any of the preceding embodiments.
  • FIG. 6 is a schematic diagram of an electronic device according to an exemplary embodiment of the present disclosure. The electronic device includes an accelerator card 601 and an external processing unit 602, wherein the accelerator card 601 may be the accelerator card in any of the foregoing embodiments, and the external processing unit 602 may be the Host CPU in the foregoing embodiments, which is used to output the network parameters of the neural network to the memory unit.
  • an embodiment of the present disclosure also provides a data processing method, the method is applied to the control unit in the chip described in any embodiment of the present disclosure; the method includes:
  • Step 701: scheduling the network parameters of one group of sub-networks among the multiple groups of sub-networks included in the neural network to the first processing unit;
  • Step 702: when the first processing is completed, dispatching the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit.
  • dispatching the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit includes: receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; and, in response to the interrupt signal, dispatching the network parameters of the next group of sub-networks to the first processing unit.
  • when the number of first processing units is greater than 1, the control unit schedules the network parameters of two adjacent groups of sub-networks to different first processing units, so that each of the multiple first processing units performs the first processing on the video frame to be processed based on the network parameters scheduled to the processing unit.
  • dispatching the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit includes: receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; and, in response to the interrupt signal, dispatching the network parameters of the next group of sub-networks to first processing units among the plurality of first processing units other than the first processing unit that sent the interrupt signal.
  • the method further includes: sending an enable signal to the first processing unit for enabling the first processing unit to perform the first processing.
  • an embodiment of the present disclosure also provides a data processing device, the device is applied to the control unit in the chip described in any embodiment of the present disclosure; the device includes:
  • the first scheduling module 801 is configured to schedule the network parameters of a group of sub-networks in the multiple groups of sub-networks included in the neural network to the first processing unit;
  • the second scheduling module 802 is configured to schedule the network parameters of the next group of sub-networks in the multiple groups of sub-networks to the first processing unit when the first processing is completed.
  • the second scheduling module is configured to: receive an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; and, in response to the interrupt signal, dispatch the network parameters of the next group of sub-networks to the first processing unit.
  • when the number of first processing units is greater than 1, each first processing unit among the plurality of first processing units is configured to perform the first processing on the video frames to be processed based on the network parameters scheduled to the processing unit; and the network parameters of two adjacent groups of sub-networks are dispatched to different first processing units.
  • the second scheduling module is configured to: receive an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; and, in response to the interrupt signal, dispatch the network parameters of the next group of sub-networks to first processing units among the plurality of first processing units other than the first processing unit that sent the interrupt signal.
  • the apparatus further includes: a sending module, configured to send an enable signal to the first processing unit to enable the first processing unit to perform the first processing.
  • since the device embodiment basically corresponds to the method embodiment, for related parts, reference may be made to the description of the method embodiment.
  • the device embodiments described above are only illustrative; the modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in this specification, which can be understood and implemented by those skilled in the art without creative effort.
  • the embodiment of the present specification also provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the steps in the method described in any embodiment of the present disclosure are implemented.
  • Computer-readable media including both permanent and non-permanent, removable and non-removable media, can be implemented by any method or technology for storage of information.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridge, tape magnetic disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media excludes transitory computer-readable media, such as modulated data signals and carrier waves.
  • a typical implementing device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or a combination of any of these devices.
  • each module may be integrated into the same or multiple software and/or hardware implementations. Part or all of the modules can also be selected according to actual needs to achieve the purpose of the solution of this embodiment. It can be understood and implemented by those skilled in the art without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Credit Cards Or The Like (AREA)

Abstract

Embodiments of the present disclosure provide a chip, an accelerator card, an electronic device, a data processing method and apparatus, and a non-transitory computer-readable storage medium. A control unit within the chip controls the scheduling of each sub-network of a neural network to a first processing unit for AI computation. When the first processing of one group of sub-networks is completed, the control unit schedules the network parameters of the next group of sub-networks to the first processing unit for the next stage of first processing, so that the multiple groups of sub-networks included in the neural network are processed in a pipelined manner, thereby improving processing efficiency.

Description

Chip, accelerator card, electronic device and data processing method
Cross-Reference Statement
This application claims priority to Chinese Patent Application No. 202111658144.2, filed on December 30, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of chip technology, and in particular to a chip, an accelerator card, an electronic device and a data processing method.
Background
In recent years, neural networks have been widely used in various fields such as image processing, fault diagnosis, and video security. In most application scenarios, multiple neural networks may need to cooperate to solve a complex problem. However, in practical applications, a terminal usually needs to serially schedule each neural network to a GPU or accelerator card through the Host CPU, making it difficult to realize pipelined processing of the neural networks and resulting in low processing efficiency.
Summary
In a first aspect, an embodiment of the present disclosure provides a chip, including: a control unit, configured to schedule the network parameters of one group of sub-networks among multiple groups of sub-networks included in a neural network to a first processing unit; and a first processing unit, configured to perform first processing on video frames to be processed in a video based on the network parameters of the sub-network scheduled to the processing unit, to obtain a first video frame; wherein the control unit is further configured to schedule the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit when the first processing is completed.
In some embodiments, the first processing unit is configured to send an interrupt signal to the control unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; wherein the control unit is configured to schedule the network parameters of the next group of sub-networks to the first processing unit when the interrupt signal is received.
In some embodiments, the number of first processing units is greater than 1, and each first processing unit is configured to perform the first processing on the video frames to be processed based on the network parameters scheduled to the processing unit; the control unit is configured to schedule the network parameters of two adjacent groups of sub-networks to different first processing units respectively.
In some embodiments, each of the multiple first processing units is configured to send an interrupt signal to the control unit when the first processing of the processing unit is completed, so that the control unit schedules the network parameters of the next group of sub-networks to another first processing unit.
In some embodiments, the control unit is further configured to send, to the first processing unit, an enable signal for enabling the first processing unit to perform the first processing.
In some embodiments, the chip further includes: a second processing unit, configured to perform second processing on the first video frame to obtain and output a second video frame.
In some embodiments, the chip further includes: a third processing unit, configured to perform third processing on the video frame to be processed and output it to the first processing unit, so that the first processing unit performs the first processing on the video frame to be processed after the third processing.
In a second aspect, an embodiment of the present disclosure provides an accelerator card, including: a memory unit for storing the network parameters of each group of sub-networks among the multiple groups of sub-networks included in the neural network; and the chip according to any embodiment of the present disclosure.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: the accelerator card according to the second aspect; and an external processing unit, configured to output the network parameters of each group of sub-networks among the multiple groups of sub-networks included in the neural network to the memory unit.
In a fourth aspect, an embodiment of the present disclosure provides a data processing method, applied to the control unit in the chip according to any embodiment of the present disclosure; the method includes: scheduling the network parameters of one group of sub-networks among the multiple groups of sub-networks included in the neural network to the first processing unit; and, when the first processing is completed, scheduling the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit.
In some embodiments, scheduling the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit when the first processing is completed includes: receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; and, in response to the interrupt signal, scheduling the network parameters of the next group of sub-networks to the first processing unit.
In some embodiments, when the number of first processing units is greater than 1, the control unit schedules the network parameters of two adjacent groups of sub-networks to different first processing units respectively, so that each of the multiple first processing units performs the first processing on the video frame to be processed based on the network parameters scheduled to the processing unit.
In some embodiments, scheduling the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit when the first processing is completed includes: receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing of the video frame to be processed, based on the network parameters of the sub-network currently scheduled to the processing unit, is completed; and, in response to the interrupt signal, scheduling the network parameters of the next group of sub-networks to first processing units among the multiple first processing units other than the first processing unit that sent the interrupt signal.
In some embodiments, the method further includes: sending, to the first processing unit, an enable signal for enabling the first processing unit to perform the first processing.
In a fifth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method according to any embodiment of the present disclosure.
According to the chip, accelerator card, electronic device, and data processing method and apparatus provided by the embodiments of the present disclosure, a control unit within the chip controls the scheduling of each sub-network of the neural network to the first processing unit for AI computation (including training and/or inference). When the first processing of one group of sub-networks is completed, the control unit schedules the network parameters of the next group of sub-networks to the first processing unit for the next stage of first processing, so that the multiple groups of sub-networks included in the neural network are processed in a pipelined manner, thereby improving processing efficiency.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
FIG. 1 is a schematic diagram of a chip according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a data processing pipeline according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a data processing pipeline according to another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of data flow among multiple processing units according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an accelerator card according to an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an electronic device according to an exemplary embodiment of the present disclosure;
FIG. 7 is a flow chart of a data processing method according to an exemplary embodiment of the present disclosure;
FIG. 8 is a block diagram of a data processing apparatus according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in detail herein, with examples shown in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terms used in the present disclosure are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. The singular forms "a", "said" and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of multiple items, or any combination of at least two of multiple items.
It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various kinds of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present disclosure, and to make the above objectives, features and advantages of the embodiments of the present disclosure more apparent and easier to understand, the technical solutions in the embodiments of the present disclosure are further described in detail below with reference to the accompanying drawings.
In recent years, neural networks have been widely used in fields such as image processing, fault diagnosis and video security, and in most application scenarios several neural networks may need to cooperate to solve one complex problem. For example, when performing an image quality assessment task, several neural networks capable of extracting features of an image in different quality dimensions may need to be combined with a neural network capable of scoring those features, in order to obtain a score value characterizing the image quality.
In practical applications, a terminal usually has to rely on the Host CPU to schedule the neural networks serially to a GPU or an accelerator card, which makes it difficult to pipeline the neural networks.
Take face recognition on a video to be processed using multiple neural networks as an example. Face features usually have to be extracted through multiple steps such as video decoding, video frame analysis and local face feature analysis before the final face recognition can be achieved, where steps such as video frame analysis and local face feature analysis can be implemented with multiple neural networks. In such a scenario, the terminal generally relies on a GPU or an accelerator card to perform AI training (the process of training a neural network with sample data) or AI inference (inference is the process of deriving conclusions from existing facts and knowledge according to some strategy, and AI inference is the process of processing input data with artificial intelligence algorithms). In the related art, the data of each neural network is generally stored in the Host main memory, and the Host CPU schedules the neural networks serially to the GPU or accelerator card when they are used. For example, the Host CPU first schedules the video frames to be processed and the network parameters of the neural network for detecting face regions in the video frames to the GPU or accelerator card; after the GPU or accelerator card finishes the model inference, the Host CPU then schedules the network parameters of the neural network for local face feature analysis to the GPU or accelerator card, and so on. As a result, it is difficult to pipeline the neural networks, which tends to impair processing efficiency.
In view of this, an embodiment of the present disclosure provides a chip to solve the above problem. FIG. 1 is a schematic diagram of a chip according to an exemplary embodiment of the present disclosure. The chip includes:
a control unit 102, configured to schedule network parameters of one group of sub-networks among multiple groups of sub-networks included in a neural network to a first processing unit 101; and
the first processing unit 101, configured to perform first processing on a video frame to be processed in a video based on the network parameters of the sub-networks scheduled to this processing unit, to obtain a first video frame.
The control unit 102 is further configured to, when the first processing is completed, schedule network parameters of a next group of sub-networks among the multiple groups of sub-networks to the first processing unit.
The above chip may be, for example, a data processing chip such as an AI chip or a graphics processor chip, and may be applied to an accelerator card. The following description takes an AI accelerator card as an example.
An AI accelerator card is a processor product specifically designed to accelerate the execution of AI algorithms, and may be a circuit board that includes hardware modules such as a chip for AI computation and a communication interface. The chip for AI computation may be any one of a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit), or may be another type of chip, which is not limited by the present disclosure. The AI accelerator card may be inserted into a slot and communicate with the Host CPU through a communication interface, for example a PCIe (Peripheral Component Interconnect Express) interface; of course, in other embodiments, the AI accelerator card may also use another type of interface, such as a QPI (Quick Path Interconnect) interface, as the interaction channel with the Host CPU.
The AI accelerator card may include a memory unit, configured to store network parameters of each group of sub-networks among the multiple groups of sub-networks included in the neural network. The memory unit is a memory integrated on the AI accelerator card and forms part of the AI accelerator card's memory; it may be a DRAM (Dynamic Random Access Memory) or an SDRAM (Synchronous Dynamic Random Access Memory), or may be another type of memory, which is not limited by the present disclosure.
Each group of sub-networks may be an independent neural network, or may include one or more network layers of a neural network. The network parameters of each sub-network may include operators (for example, addition operators, multiplication operators, etc.) and the parameters required by those operators for computation. A group of sub-networks can map an input to an output through its corresponding network parameters, and may, for example, implement one function; a group of sub-networks includes at least one sub-network. Where the neural network includes multiple groups of sub-networks, the number of sub-networks in each group may be the same or different. Different sub-networks may implement different functions; for example, a set of neural networks for performing a face recognition task on a video may include a sub-network for face detection, a sub-network for local face feature analysis, and so on.
It should be noted that the type of a sub-network may be any one of a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, etc.; the types of different sub-networks may be the same or different, which is not limited by the present disclosure.
In some examples, the above memory unit may be provided with multiple partitions, and the network parameters of different sub-networks are stored in different partitions. That is, the storage space of the memory unit may adopt a hierarchical partition structure to implement partitioned storage management of the network parameters of the sub-networks.
Further, to facilitate reading the network parameters of a specified sub-network, a partition table may be created that records, for each partition, information such as its start address, its size and the identifier of the corresponding sub-network, so that other units can look up the partition table to determine the storage address of the specified sub-network and then read its network parameters from that address. Alternatively, different partition identifiers may be set for different sub-networks; the memory unit determines a partition path according to a sub-network's partition identifier and stores that sub-network's network parameters in the partition corresponding to the partition path, and, correspondingly, other units can also find the network parameters of a specified sub-network in the memory unit according to its partition identifier. Of course, in other embodiments, other storage management schemes may be configured according to the needs of the specific scenario.
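Purely as an illustrative sketch of the partition-table scheme described above (the class names, sub-network identifiers and addresses below are hypothetical and are not part of the disclosure), a partition table could map a sub-network identifier to the start address and size of its partition as follows:

```python
from dataclasses import dataclass

@dataclass
class Partition:
    subnet_id: str   # identifier of the sub-network stored in this partition
    start_addr: int  # start address of the partition in card memory
    size: int        # size of the partition in bytes

class PartitionTable:
    """Hypothetical partition table mapping a sub-network to its storage location."""
    def __init__(self):
        self._entries = {}

    def register(self, subnet_id, start_addr, size):
        self._entries[subnet_id] = Partition(subnet_id, start_addr, size)

    def lookup(self, subnet_id):
        # Another unit resolves the sub-network to its storage address here,
        # then reads the network parameters starting from that address.
        return self._entries[subnet_id]

table = PartitionTable()
table.register("subnet_1", start_addr=0x0000_0000, size=64 * 1024 * 1024)
table.register("subnet_2", start_addr=0x0400_0000, size=32 * 1024 * 1024)
print(hex(table.lookup("subnet_2").start_addr))
```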
The first processing unit 101 may be a functional module in the AI accelerator card that executes AI computation tasks, and may be referred to as a Computing Unit (CU). The first processing unit 101 may include multiple sub-units, such as a sub-unit for addition, a sub-unit for multiplication and a sub-unit for convolution; these sub-units may be called reconfigurable clusters, and they can be combined and reconfigured into different computation paths to execute different tasks.
The first processing unit performing first processing on the video frame to be processed based on the network parameters may mean that the first processing unit uses the corresponding sub-network to process the video frame. The specific content of the first processing may be determined by the function implemented by the corresponding sub-network. For example, one group of sub-networks may be used to perform feature extraction on the video frame to be processed to obtain its feature map, in which case the first processing refers to feature extraction to obtain the feature map; another group of sub-networks may be used to perform target detection on a target object in the video frame based on the feature map to obtain a target detection result, in which case the first processing refers to target detection; yet another group of sub-networks may be used to recognize the target object in the video frame based on the target detection result to obtain the category of the target object, in which case the first processing refers to recognition based on the target detection result.
It should be noted that in this embodiment the AI computation task executed by the first processing unit may be AI training or AI inference; for brevity, the following description uses AI inference.
In this embodiment, the operation in which the first processing unit performs first processing on the video frame based on the network parameters of a sub-network is executed under the control and scheduling of the control unit. Through the control unit's scheduling, the first processing unit can process video frames with suitable sub-networks in an orderly manner. Optionally, the control unit may be an MCU (Microcontroller Unit). In other words, unlike the related art in which the AI accelerator card acts purely as a slave device, the embodiments of the present disclosure add an MCU to the AI accelerator card to replace the Host CPU in controlling and scheduling the sub-networks. Of course, in other embodiments, the control unit may also be a processor chip of an architecture such as ARM, RISC-V or PowerPC.
In some examples, the control unit may control and schedule the sub-networks downloaded to the first processing unit by sending an enable signal to the first processing unit. The enable signal is similar to a trigger signal; it triggers the first processing unit to read the network parameters and then perform the first processing on the video frame based on the read parameters. Optionally, after enabling the first processing unit, the control unit may configure, for the first processing unit, the storage address of the network parameters to be read, so that the first processing unit reads the corresponding data from that storage address.
To ensure reasonable scheduling of the multiple groups of sub-networks by the control unit, in some examples the first processing unit may be configured to send an interrupt signal to the control unit when the first processing performed on the video frame based on the network parameters of the sub-networks currently scheduled to this processing unit is completed, and the control unit is configured to schedule the network parameters of the next group of sub-networks to the first processing unit upon receiving the interrupt signal. In other words, when the processing of the video frame based on one group of sub-networks is completed, the first processing unit can notify the control unit by sending an interrupt signal, so that the control unit schedules the network parameters of the next group of sub-networks and the first processing unit can then process the video frame based on them. In this way, the AI accelerator card can start the AI inference of the second group of sub-networks immediately after completing the AI inference of the first group, and so on until the AI inference of the last group of sub-networks is completed, thereby pipelining the groups of sub-networks and improving processing efficiency.
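The interrupt-driven scheduling described above can be pictured with the following minimal sketch, in which a software queue stands in for the hardware interrupt line; all names (FirstProcessingUnit, load, enable, etc.) are hypothetical rather than taken from the disclosure:

```python
import queue
import threading

class FirstProcessingUnit:
    """Hypothetical stand-in for a CU that runs one group of sub-network parameters."""
    def __init__(self, interrupts):
        self.interrupts = interrupts
        self.params = None

    def load(self, params):
        self.params = params                     # parameters scheduled by the control unit

    def enable(self, frames):
        # The enable signal triggers the first processing; when it finishes,
        # the CU notifies the control unit with an interrupt.
        def run():
            processed = [f"{self.params}({frame})" for frame in frames]
            self.interrupts.put(("done", processed))
        threading.Thread(target=run).start()

def control_unit(subnet_groups, frames):
    interrupts = queue.Queue()
    cu = FirstProcessingUnit(interrupts)
    for params in subnet_groups:                 # schedule one group at a time
        cu.load(params)
        cu.enable(frames)
        _, frames = interrupts.get()             # on interrupt, schedule the next group
    return frames

print(control_unit(["subnet_1", "subnet_2"], ["frame_1", "frame_2"]))
```

Because the control loop blocks on the interrupt, the next group of network parameters is scheduled only after the current first processing has finished, mirroring the behaviour described above.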
The number of first processing units included in the chip may be N, where N may be 1, that is, one CU completes the AI inference of all the sub-networks included in the neural network; or N may be greater than or equal to 2, that is, multiple CUs complete the AI inference of the multiple groups of sub-networks.
Where the chip includes multiple first processing units, each first processing unit may be configured to perform the first processing on the video frame to be processed based on the network parameters scheduled to this processing unit, and the control unit may be configured to schedule the network parameters of two adjacent groups of sub-networks to different first processing units respectively.
In one implementation, each first processing unit may send an interrupt signal to the control unit when the first processing of this processing unit is completed, so that the control unit schedules the network parameters of the next group of sub-networks to another first processing unit. In this way, the other first processing unit can continue processing the video frame finished by this processing unit based on the network parameters of the next group of sub-networks. This achieves orderly scheduling of the sub-networks and dynamic selection of the next group of sub-network models.
In another implementation, where the next group of sub-network models is fixed, i.e. in the case of a fixed pipeline, the network parameters of each group of sub-networks may also be pre-loaded onto the corresponding first processing units. In this way, when the first processing of one processing unit is completed, another first processing unit can continue processing the video frame based on the network parameters of the next group of sub-networks, thereby achieving orderly scheduling of the sub-networks.
In some examples, the value of N may correspond to the number of sub-networks included in the neural network. For example, if the neural network includes 2 groups of sub-networks, the AI accelerator card may also include 2 first processing units, each of which processes the video frames based on the network parameters of one group of sub-networks, with different first processing units corresponding to different sub-networks. In this way, after the first first processing unit has processed a group of video frames based on the first group of sub-networks, that group of video frames can be passed to the second first processing unit, which processes them based on the second group of sub-networks, while the first first processing unit starts processing the next group of video frames based on the first group of sub-networks, thereby pipelining multiple groups of video frames.
Referring to FIG. 2, where both the number of first processing units and the number of groups of sub-networks are 2, the Host CPU may transfer the video bitstream to be processed to the memory unit of the AI accelerator card through the PCIe interface, and the bitstream is decoded by a video decoding unit (video codec) to obtain the video frames to be processed. The MCU may schedule the video frames to the first first processing unit (CU1) and also schedule the network parameters of the first group of sub-networks (model1) to CU1, so that CU1 processes the video frames based on model1. The processing result obtained by CU1 may be output to the second first processing unit (CU2), and the MCU may also schedule the network parameters of the second group of sub-networks (model2) to CU2, so that CU2 processes the output of CU1 based on model2 to obtain the final output. In the above embodiment, model1 and model2 are pipelined through CU1 and CU2.
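The two-stage pipelining of FIG. 2 can be illustrated with the following minimal sketch, where model1 and model2 are hypothetical placeholder functions rather than the actual sub-networks, and the sequential loop only illustrates the staging order, not true hardware parallelism:

```python
def model1(frame):
    return f"model1({frame})"     # first processing on CU1 (first group of sub-networks)

def model2(frame):
    return f"model2({frame})"     # first processing on CU2 (second group of sub-networks)

def two_stage_pipeline(frames):
    in_flight = None              # result of CU1 waiting to enter CU2
    outputs = []
    for frame in frames + [None]: # one extra step to drain the pipeline
        if in_flight is not None:
            outputs.append(model2(in_flight))   # CU2 works on the previous frame...
        in_flight = model1(frame) if frame is not None else None
        # ...while CU1 works on the current one
    return outputs

print(two_stage_pipeline(["frame_1", "frame_2", "frame_3"]))
```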
In addition, where the number of groups of sub-networks is larger than the number of first processing units, the network parameters of some of the sub-networks may first be scheduled in turn to the first processing units, and when one or more of the first processing units complete the first processing, the network parameters of the remaining sub-networks are scheduled to the first processing units that have completed the first processing. In some embodiments, a number n of video frames to be processed at one time may be set; when a first processing unit has processed n video frames, that first processing unit is determined to have completed the first processing. When a first processing unit has processed the set number of video frames based on one group of sub-networks, the network parameters of another group of sub-networks can be scheduled to that first processing unit.
FIG. 3 is a schematic diagram of a flow of scheduling sub-networks to CUs to process video frames according to an embodiment of the present disclosure, in which the AI accelerator card has 3 CUs, namely CU1, CU2 and CU3, and the neural network includes 6 sub-networks, namely sub-network 1 to sub-network 6 (denoted M1 to M6 in the figure), each sub-network constituting one group. As can be seen from FIG. 3, the steps in which the CUs process one group of video frames to be processed in the video can be expressed as follows:
Stage 1: sub-network 1 is scheduled to CU1, and CU1 processes video frame 1 based on sub-network 1, while CU2 and CU3 are idle (in FIG. 3, "M1:V1" denotes processing video frame 1 based on sub-network 1, and "idle" denotes that no processing is being performed);
Stage 2: after CU1 finishes processing video frame 1, it passes video frame 1 to CU2, and sub-network 2 is scheduled to CU2; at this point, CU1 processes video frame 2 based on sub-network 1, CU2 processes video frame 1 based on sub-network 2, and CU3 is idle;
Stage 3: after CU2 finishes processing video frame 1, it passes video frame 1 to CU3, and sub-network 3 is scheduled to CU3; after CU1 finishes processing video frame 2, it passes video frame 2 to CU2; at this point, CU1 processes video frame 3 based on sub-network 1, CU2 processes video frame 2 based on sub-network 2, and CU3 processes video frame 1 based on sub-network 3;
Stage 4: after CU3 finishes processing video frame 1, it passes video frame 1 to CU1, and sub-network 4 is scheduled to CU1; after CU2 finishes processing video frame 2, it passes video frame 2 to CU3; after CU1 finishes processing video frame 3, it passes video frame 3 to CU2; at this point, CU1 processes video frame 1 based on sub-network 4, CU2 processes video frame 3 based on sub-network 2, and CU3 processes video frame 2 based on sub-network 3;
Stage 5: after CU3 finishes processing video frame 2, it passes video frame 2 to CU1; after CU2 finishes processing video frame 3, it passes video frame 3 to CU3; after CU1 finishes processing video frame 1, it passes video frame 1 to CU2, and sub-network 5 is scheduled to CU2; at this point, CU1 processes video frame 2 based on sub-network 4, CU2 processes video frame 1 based on sub-network 5, and CU3 processes video frame 3 based on sub-network 3;
Stage 6: after CU3 finishes processing video frame 3, it passes video frame 3 to CU1; after CU2 finishes processing video frame 1, it passes video frame 1 to CU3, and sub-network 6 is scheduled to CU3; after CU1 finishes processing video frame 2, it passes video frame 2 to CU2; at this point, CU1 processes video frame 3 based on sub-network 4, CU2 processes video frame 2 based on sub-network 5, and CU3 processes video frame 1 based on sub-network 6;
Stage 7: after CU3 finishes processing video frame 1, the model inference of all sub-networks for video frame 1 is completed, and video frame 1 is stored in memory; after CU2 finishes processing video frame 2, it passes video frame 2 to CU3; after CU1 finishes processing video frame 3, it passes video frame 3 to CU2, and sub-network 1 is scheduled to CU1 again; at this point, CU1 processes video frame 4 based on sub-network 1, CU2 processes video frame 3 based on sub-network 5, and CU3 processes video frame 2 based on sub-network 6;
Stage 8: and so on, until the model inference of all sub-networks has been completed for all video frames to be processed.
Video frame 1, video frame 2 and video frame 3 above belong to the same group of video frames, and the number of video frames in each group equals the set number of video frames to be processed at one time. When a CU has finished processing the set number of video frames based on the sub-network currently scheduled to it, it starts processing that set number of video frames based on the next sub-network scheduled to it; moreover, when some of the video frames in one batch have completed the model inference of all sub-networks and a CU is idle, that idle CU starts processing the video frames of the next batch. It can be seen that a pipeline is formed among the sub-networks, achieving parallel processing of multiple groups of video frames and improving processing efficiency.
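Purely as an illustrative sketch (not part of the original disclosure), the schedule of FIG. 3 can be reproduced with the following snippet, assuming 3 CUs, 6 groups of sub-networks, a group size of 3 video frames, and that sub-network s is always scheduled onto CU ((s-1) mod 3)+1:

```python
NUM_CUS = 3
NUM_SUBNETS = 6

def schedule(num_frames, num_stages):
    """Return, per stage, what each CU is doing, following the FIG. 3 pipeline."""
    table = {t: {cu: "idle" for cu in range(1, NUM_CUS + 1)}
             for t in range(1, num_stages + 1)}
    for frame in range(1, num_frames + 1):
        group, k = divmod(frame - 1, NUM_CUS)       # frame group and position in group
        for s in range(1, NUM_SUBNETS + 1):
            stage = NUM_SUBNETS * group + s + k     # stage at which subnet s runs on this frame
            cu = (s - 1) % NUM_CUS + 1              # subnet s is scheduled onto this CU
            if stage <= num_stages:
                table[stage][cu] = f"M{s}:V{frame}"
    return table

for stage, row in schedule(num_frames=6, num_stages=8).items():
    print(f"stage {stage}: " + ", ".join(f"CU{cu}={job}" for cu, job in row.items()))
```

Running the snippet prints, for example, "stage 4: CU1=M4:V1, CU2=M2:V3, CU3=M3:V2", matching the fourth stage in the walkthrough above.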
In practical applications, the video frames processed by one group of sub-networks often need to be further processed (referred to as post-processing). In the related art, since the GPU or AI accelerator card is limited by its hardware and does not support general-purpose processing operations, this post-processing is generally performed by the Host CPU. As a result, the Host CPU has to read the video frames processed by one group of sub-networks from the memory of the GPU/AI accelerator card, process them, and store them back into the memory of the GPU/AI accelerator card, so that the GPU/AI accelerator card can process the post-processed video frames with the next group of sub-networks. This impairs the pipelining between the sub-networks.
In view of this, in some examples, the chip of the present disclosure may further include a second processing unit, configured to perform second processing on the first video frame to obtain a second video frame, and to output the second video frame to the memory unit for storage. The second processing is the above post-processing and may include at least one of cropping, sharpening, rotation, scaling, transparency processing and other processing operations. In other words, the present disclosure adds a second processing unit to the AI accelerator card to perform model post-processing tasks instead of the Host CPU, which enables better pipelining.
In an optional embodiment, the second processing unit may be configured to perform image segmentation on the first video frame to obtain an image including a target object in the first video frame, and to output the image to the memory unit. The target object may be a complete object, such as a person, a basketball or a tree, or a local part of an object, such as a face, an eye or a branch. Optionally, the target object may be determined according to the processing object of the next group of sub-networks. For example, where the next group of sub-networks is a neural network for local face feature analysis, the target object may be a face, and the second processing unit may perform image segmentation on the first video frame, crop out an image including the face region in the first video frame, and output the image to the memory unit for use by the next group of sub-networks.
Correspondingly, in practical applications, the decoded video frames often need data pre-processing before being input into the neural network for AI inference. Likewise, in the related art this data pre-processing is generally performed by the Host CPU, which impairs the pipelining between the sub-networks.
To better pipeline the sub-networks, in some examples the chip of the present disclosure may include a third processing unit, configured to perform third processing on the video frame to be processed and output it to the first processing unit, so that the first processing unit performs the first processing on the video frame that has undergone the third processing. The third processing may also be called pre-processing and may include at least one of image data pre-processing operations such as brightness adjustment, contrast adjustment, resizing, image segmentation and normalization. In other words, the present disclosure adds a third processing unit to the AI accelerator card to perform model pre-processing tasks instead of the Host CPU, so that "model pre-processing - model inference - model post-processing" is executed inside the AI accelerator card, achieving better pipelining while reducing the occupation of Host CPU resources and the frequency of Host CPU accesses to the AI accelerator card's memory.
Referring to FIG. 4, in embodiments that include pre-processing and post-processing, the chip of the embodiments of the present disclosure can pipeline pre-processing, model processing and post-processing. In some embodiments, a video frame pic1 obtained by video decoding may be output to CU1, which may include a pre-processing unit (which can be regarded as a third processing unit), a model processing unit (which can be regarded as a first processing unit) and a post-processing unit (which can be regarded as a second processing unit), and which performs pre-processing, model processing based on the first group of sub-networks, and post-processing in sequence. The video frame pic2 output by the post-processing unit of CU1 may be further output to the pre-processing unit of CU2 (which, like CU1, may include a pre-processing unit, a model processing unit and a post-processing unit). Similarly, CU2 may perform pre-processing, model processing based on the second group of sub-networks, and post-processing in sequence, and output its processing result, and so on until the final output is obtained. It should be noted that the video frame output by the post-processing unit of one CU may serve as the video frame to be processed by the pre-processing unit of the next CU.
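A minimal sketch of the per-CU "pre-processing - model processing - post-processing" chain of FIG. 4 is given below; every function is a hypothetical placeholder for the corresponding hardware unit and is not taken from the disclosure:

```python
def make_cu(name, subnet):
    def pre(frame):                  # third processing: e.g. resize / normalize
        return f"pre[{name}]({frame})"
    def model(frame):                # first processing with this CU's sub-network
        return f"{subnet}({frame})"
    def post(frame):                 # second processing: e.g. crop the target region
        return f"post[{name}]({frame})"
    def run(frame):
        return post(model(pre(frame)))
    return run

cu1 = make_cu("CU1", "subnet_1")
cu2 = make_cu("CU2", "subnet_2")

pic1 = "decoded_frame"
pic2 = cu1(pic1)        # output of CU1's post-processing unit...
result = cu2(pic2)      # ...becomes the frame to be processed by CU2
print(result)
```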
In addition to model pre-processing and post-processing tasks, the chip of the present disclosure can also reduce the occupation of Host CPU resources in video decoding. In some examples, the chip of the present disclosure may further include a video decoding unit, configured to decode the video under the control of the control unit to obtain the video frames to be processed and output them to the memory unit. The video decoding unit may include a video codec, i.e. a program or device capable of compressing or decompressing video.
With this arrangement, the Host CPU only needs to transfer the video to be processed to the memory of the AI accelerator card; the control unit inside the AI accelerator card can then control the video decoding unit to decode the video and output the decoded video frames to the memory unit for subsequent processing by the first, second and third processing units. In this way, the video decoding task is offloaded to the AI accelerator card, reducing the access frequency between the Host CPU and the memory of the AI accelerator card.
It should be noted that, in an optional embodiment, the video decoding unit may send a first interrupt signal to the control unit when the video decoding is completed, so that the control unit, in response to the first interrupt signal, schedules network parameters to, for example, the first processing unit. In other words, the video decoding unit can notify the control unit of the completion of video decoding in the form of an interrupt signal, which improves the processing efficiency inside the AI accelerator card.
In some examples, the control unit may control the video decoding unit to start video decoding immediately after the AI accelerator card is started. In other examples, the control unit may control the video decoding unit to start video decoding in response to receiving a control instruction sent by an external processing unit. The external processing unit here may be the Host CPU; in other embodiments, it may also be another device other than the AI accelerator card, such as an external device connected to the terminal in which the AI accelerator card is located. The external processing unit sends a control instruction to the control unit to trigger the control unit to control the video decoding unit to start video decoding, so that the video decoding function of the AI accelerator card, and in turn its subsequent AI inference function, can be enabled when needed. It should be noted that in the present disclosure "Host CPU" refers to the main processor on the Host side; in practical applications, the main processor may also be a processor of a type other than a CPU, such as an MPU (Microprocessor Unit), which is not limited by the present disclosure.
The chip provided by the embodiments of the present disclosure is applied to an accelerator card whose memory unit stores the network parameters of each group of sub-networks among the multiple groups of sub-networks included in the neural network, and the chip controls, through the control unit inside the chip, the scheduling of each sub-network of the neural network to the first processing unit for AI inference. In this way, the task of controlling and scheduling the sub-networks is offloaded to the accelerator card, the data transfer between the sub-networks no longer depends on the control and scheduling of the Host CPU, and the multiple groups of sub-networks included in the neural network are processed as a pipeline, thereby improving processing efficiency.
In some practical application scenarios, the neural network of the embodiments of the present disclosure may be a neural network for face detection, which may include a sub-network for target tracking and a sub-network for attribute analysis of face images. Of course, in addition to the above application scenario, the solutions of the embodiments of the present disclosure can also be used in other application scenarios, which are not enumerated here one by one.
To describe the chip of the embodiments of the present disclosure in more detail, a specific embodiment is introduced next.
The chip of this embodiment is applied to an AI accelerator card inserted into a motherboard slot of a terminal. The scenario of this embodiment is that the terminal performs face recognition on a received video using neural network technology. In this scenario, features have to be extracted through multiple steps such as video decoding, face region recognition and local face feature analysis before the final face recognition can be achieved, which involves N (N being a positive integer greater than 1) groups of sub-networks. The following description takes the case where the N groups of sub-networks include sub-network 1 and sub-network 2, each sub-network constituting one group, as an example, where sub-network 1 is a neural network for face region recognition and sub-network 2 is a neural network for local face feature analysis.
In the related art, the AI accelerator card generally acts as a pure slave device to assist the Host CPU, and the process in which the terminal performs AI inference includes:
the Host CPU decodes the video to obtain frame data;
the Host CPU downloads the frame data and sub-network 1 into the memory of the AI accelerator card;
the AI accelerator card performs model inference on the frame data with sub-network 1 and stores the result in memory;
the Host CPU reads the result output by sub-network 1 from the memory of the AI accelerator card;
the Host CPU processes the result output by sub-network 1;
the Host CPU downloads the processed data and sub-network 2 into the memory of the AI accelerator card;
the AI accelerator card performs model inference on the processed data with sub-network 2 and stores the result in memory;
the Host CPU reads the inference result stored in the memory of the AI accelerator card.
The above flow involves a large number of memory accesses between the Host CPU and the AI accelerator card, occupies considerable Host CPU resources, cannot pipeline the different sub-networks, and has low AI inference efficiency.
In this embodiment, improvements are made inside the AI accelerator card. The improved AI accelerator card includes an MCU (i.e. the above control unit), a DRAM (i.e. the above memory unit), a Video Codec (i.e. the above video decoding unit), PCIe, and several CUs (which may include the above first processing unit). The DRAM is the memory of the AI accelerator card and is used to store the network parameters of the neural network and the video frames before and after processing; the MCU is used to schedule the model's network parameters to the CUs and to control and schedule the Video Codec and the JPEG Codec; each CU is used to perform model inference on video frames based on the neural network and also includes a pre-processing unit responsible for the model's pre-processing tasks and a post-processing unit responsible for the model's post-processing tasks; the Video Codec is a video codec used to decode the video; PCIe is the interaction channel between the AI accelerator card and the Host CPU, used for command issuing and data transfer between them, and the video bitstream can also be transferred to the memory of the AI accelerator card through PCIe. In addition, inside the AI accelerator card, the above modules are interconnected with each other and with the memory through an interconnect bus such as a NoC (Network On Chip).
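For illustration only, the building blocks of the improved AI accelerator card enumerated above can be summarized in the following hypothetical description; the field names and values are placeholders rather than real specifications from the disclosure:

```python
from dataclasses import dataclass, field

@dataclass
class ComputeUnit:
    pre_processing: bool = True      # handles model pre-processing tasks
    post_processing: bool = True     # handles model post-processing tasks

@dataclass
class AcceleratorCard:
    control_unit: str = "MCU"                # schedules sub-networks and codecs
    memory_unit: str = "DRAM"                # stores network parameters and frames
    video_decoder: str = "Video Codec"       # decodes the input video bitstream
    host_interface: str = "PCIe"             # command and data path to the Host CPU
    interconnect: str = "NoC"                # on-chip network linking the modules
    compute_units: list = field(default_factory=lambda: [ComputeUnit(), ComputeUnit()])

print(AcceleratorCard())
```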
In this embodiment, the process in which the terminal performs AI inference includes the following steps (a minimal end-to-end sketch is given after the list):
the Host CPU transfers the video to be processed, the network parameters of sub-network 1 and the parameters of sub-network 2 to the memory of the AI accelerator card;
the MCU in the AI accelerator card controls the Video Codec to decode the video in memory, controls the pre-processing unit to perform data pre-processing on the decoded video frames (for example, resizing, image segmentation and data normalization), and then stores the resulting full-frame images in memory;
the MCU in the AI accelerator card controls a CU to read the decoded full-frame images and transfer them to sub-network 1, then enables sub-network 1 to start the first stage of model inference, and the resulting data is transferred to the post-processing unit;
the MCU in the AI accelerator card controls the post-processing unit to perform image segmentation on the output data of sub-network 1 to obtain cropped frame images that meet the requirements (for example, removing background features while keeping local face features), and stores the cropped frame images in memory;
the MCU in the AI accelerator card controls a CU to read the cropped frame images and transfer them to sub-network 2, then enables sub-network 2 to start the second stage of model inference, and the resulting data is transferred to the post-processing unit;
the post-processing unit performs post-processing and obtains the final processing result.
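The minimal end-to-end sketch referred to above is given below; every function is a hypothetical placeholder standing in for a hardware block of the accelerator card, and the frame contents are dummy strings:

```python
def video_codec_decode(stream):
    return [f"frame_{i}" for i in range(1, 4)]          # decoded frames

def pre_process(frame):
    return f"resized_normalized({frame})"               # resize / split / normalize

def run_subnet(subnet, data):
    return f"{subnet}({data})"                          # first processing on a CU

def post_process_crop(data):
    return f"face_crop({data})"                         # keep the local face features

def accelerator_card_inference(stream):
    results = []
    for frame in video_codec_decode(stream):            # MCU drives the Video Codec
        big = pre_process(frame)                        # full-frame image in card memory
        det = run_subnet("subnet_1", big)               # stage 1: face region recognition
        small = post_process_crop(det)                  # cropped frame image in card memory
        feat = run_subnet("subnet_2", small)            # stage 2: local face feature analysis
        results.append(feat)                            # final post-processed result
    return results

print(accelerator_card_inference("video_bitstream"))
```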
It should be noted that this embodiment takes sub-network 1 and sub-network 2 as an example, but in practical applications the number of groups of sub-networks may be greater than or equal to 3, i.e. there may be multiple groups of neural networks, and the model inference process for the sub-networks after sub-network 2 is similar to that of sub-network 2 and is not repeated here. Moreover, in this embodiment, different CUs are responsible for different sub-networks, forming a hardware pipeline whose control and scheduling is handled by the MCU.
The solution of this embodiment has at least the following advantages:
(1) services such as video decoding, image encoding, model pre-processing and model post-processing are offloaded to the AI accelerator card, and the Host CPU only needs to transfer the relevant models and the video bitstream to the AI accelerator card, which greatly reduces the workload of the Host CPU;
(2) all models are downloaded to the AI accelerator card at once, and pipelined processing among multiple models is achieved through the control and scheduling of the MCU in the AI accelerator card; moreover, "model pre-processing - model inference - model post-processing" is executed inside the AI accelerator card, achieving better pipelining, and different video frames can be pipelined across multiple CUs;
(3) data transfer between models no longer depends on the control and scheduling of the Host CPU, which reduces the frequency of Host CPU accesses to the memory of the AI accelerator card and thus further improves processing efficiency.
Experiments show that, compared with the terminal display result obtained with the related-art solution, the terminal display result obtained with the solution of this embodiment is smoother, and frame dropping is effectively avoided.
Corresponding to the foregoing chip embodiments, the present disclosure also provides embodiments of an AI accelerator card and a corresponding device. FIG. 5 is a schematic diagram of an accelerator card according to an exemplary embodiment of the present disclosure. The accelerator card includes:
a memory unit 501, configured to store network parameters of a neural network; and
a chip 502.
The chip 502 may be the chip according to any of the foregoing embodiments.
FIG. 6 is a schematic diagram of an electronic device according to an exemplary embodiment of the present disclosure. The electronic device includes an accelerator card 601 and an external processing unit 602, where the accelerator card 601 may be the accelerator card of any of the foregoing embodiments, and the external processing unit 602 may be the Host CPU in the foregoing embodiments, configured to output the network parameters of the neural network to the memory unit.
For the implementation of the functions and roles of the components in the above accelerator card and electronic device, reference may be made to the implementation of the corresponding components in the above chip, which is not repeated here. It should be noted that the other improvements corresponding to the above chip also apply to the accelerator card and the electronic device.
The technical solutions of the present application have been described above through different examples; however, new embodiments formed by combining some or all of these examples are also included in this specification and regarded as part of this specification.
Referring to FIG. 7, an embodiment of the present disclosure further provides a data processing method, applied to the control unit in the chip according to any embodiment of the present disclosure. The method includes:
Step 701: scheduling network parameters of one group of sub-networks among multiple groups of sub-networks included in a neural network to the first processing unit;
Step 702: when the first processing is completed, scheduling network parameters of a next group of sub-networks among the multiple groups of sub-networks to the first processing unit.
In some embodiments, scheduling the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit when the first processing is completed includes: receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing performed on the video frame to be processed based on the network parameters of the sub-networks currently scheduled to this processing unit is completed; and, in response to the interrupt signal, scheduling the network parameters of the next group of sub-networks to the first processing unit.
In some embodiments, when the number of first processing units is greater than 1, the control unit schedules the network parameters of two adjacent groups of sub-networks to different first processing units respectively, so that each of the plurality of first processing units performs the first processing on the video frame to be processed based on the network parameters scheduled to this processing unit.
In some embodiments, scheduling the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit when the first processing is completed includes: receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing performed on the video frame to be processed based on the network parameters of the sub-networks currently scheduled to this processing unit is completed; and, in response to the interrupt signal, scheduling the network parameters of the next group of sub-networks to first processing units other than the first processing unit that sent the interrupt signal among the plurality of first processing units.
In some embodiments, the method further includes: sending, to the first processing unit, an enable signal for causing the first processing unit to perform the first processing.
For the steps executed by the control unit in the above method embodiments, reference may be made to the functions executed by the control unit in the foregoing chip embodiments, which are not repeated here.
Referring to FIG. 8, an embodiment of the present disclosure further provides a data processing apparatus, applied to the control unit in the chip according to any embodiment of the present disclosure. The apparatus includes:
a first scheduling module 801, configured to schedule network parameters of one group of sub-networks among multiple groups of sub-networks included in a neural network to a first processing unit;
a second scheduling module 802, configured to, when the first processing is completed, schedule network parameters of a next group of sub-networks among the multiple groups of sub-networks to the first processing unit.
In some embodiments, the second scheduling module is configured to: receive an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing performed on the video frame to be processed based on the network parameters of the sub-networks currently scheduled to this processing unit is completed; and, in response to the interrupt signal, schedule the network parameters of the next group of sub-networks to the first processing unit.
In some embodiments, when the number of first processing units is greater than 1, each of the plurality of first processing units is configured to perform the first processing on the video frame to be processed based on the network parameters scheduled to this processing unit, and the network parameters of two adjacent groups of sub-networks are scheduled to different first processing units.
In some embodiments, when the number of first processing units is greater than 1, the second scheduling module is configured to: receive an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing performed on the video frame to be processed based on the network parameters of the sub-networks currently scheduled to this processing unit is completed; and, in response to the interrupt signal, schedule the network parameters of the next group of sub-networks to first processing units other than the first processing unit that sent the interrupt signal among the plurality of first processing units.
In some embodiments, the apparatus further includes a sending module, configured to send, to the first processing unit, an enable signal for causing the first processing unit to perform the first processing.
For the apparatus embodiments, since they basically correspond to the method embodiments, reference may be made to the relevant descriptions of the method embodiments. The apparatus embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical modules, i.e. they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solutions of this specification, and those of ordinary skill in the art can understand and implement them without creative effort.
An embodiment of this specification further provides a computer-readable storage medium having a computer program stored thereon, and when the program is executed by a processor, the steps of the method according to any embodiment of the present disclosure are implemented. Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
From the description of the above implementations, those skilled in the art can clearly understand that the embodiments of the present disclosure can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present disclosure or in certain parts thereof.
The modules or units illustrated in the above embodiments may specifically be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail transceiver, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separated, and when implementing the solutions of the embodiments of the present disclosure, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purposes of the solutions of the embodiments, and those of ordinary skill in the art can understand and implement them without creative effort.
The above are only specific implementations of the embodiments of the present disclosure. It should be pointed out that those of ordinary skill in the art can also make several improvements and modifications without departing from the principles of the embodiments of the present disclosure, and these improvements and modifications should also be regarded as falling within the protection scope of the embodiments of the present disclosure.

Claims (15)

  1. A chip, characterized in that the chip comprises:
    a control unit, configured to schedule network parameters of one group of sub-networks among multiple groups of sub-networks included in a neural network to a first processing unit; and
    the first processing unit, configured to perform first processing on a video frame to be processed in a video based on the network parameters of the sub-networks scheduled to this processing unit, to obtain a first video frame,
    wherein the control unit is further configured to, when the first processing is completed, schedule network parameters of a next group of sub-networks among the multiple groups of sub-networks to the first processing unit.
  2. The chip according to claim 1, characterized in that the first processing unit is configured to: when the first processing performed on the video frame to be processed based on the network parameters of the sub-networks currently scheduled to this processing unit is completed, send an interrupt signal to the control unit;
    wherein the control unit is configured to, upon receiving the interrupt signal, schedule the network parameters of the next group of sub-networks to the first processing unit.
  3. The chip according to claim 1 or 2, characterized in that the number of first processing units is greater than 1, and each first processing unit is configured to perform the first processing on the video frame to be processed based on the network parameters scheduled to this processing unit;
    the control unit is configured to schedule the network parameters of two adjacent groups of sub-networks to different first processing units respectively.
  4. The chip according to claim 3, characterized in that each first processing unit is configured to:
    when the first processing of this processing unit is completed, send an interrupt signal to the control unit, so that the control unit schedules the network parameters of the next group of sub-networks to another first processing unit.
  5. The chip according to any one of claims 1 to 4, characterized in that the control unit is further configured to:
    send, to the first processing unit, an enable signal for causing the first processing unit to perform the first processing.
  6. The chip according to any one of claims 1 to 5, characterized in that the chip further comprises:
    a second processing unit, configured to perform second processing on the first video frame to obtain and output a second video frame.
  7. The chip according to any one of claims 1 to 6, characterized in that the chip further comprises:
    a third processing unit, configured to perform third processing on the video frame to be processed and output it to the first processing unit, so that the first processing unit performs the first processing on the video frame to be processed that has undergone the third processing.
  8. An accelerator card, characterized in that the accelerator card comprises:
    a memory unit, configured to store network parameters of each group of sub-networks among multiple groups of sub-networks included in a neural network; and
    the chip according to any one of claims 1 to 7.
  9. An electronic device, characterized in that the electronic device comprises:
    the accelerator card according to claim 8; and
    an external processing unit, configured to output the network parameters of each group of sub-networks among the multiple groups of sub-networks included in the neural network to the memory unit.
  10. A data processing method, characterized in that the data processing method is applied to the control unit in the chip according to any one of claims 1 to 7; the method comprises:
    scheduling network parameters of one group of sub-networks among multiple groups of sub-networks included in a neural network to the first processing unit;
    when the first processing is completed, scheduling network parameters of a next group of sub-networks among the multiple groups of sub-networks to the first processing unit.
  11. The data processing method according to claim 10, characterized in that, when the first processing is completed, scheduling the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit comprises:
    receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing performed on the video frame to be processed based on the network parameters of the sub-networks currently scheduled to this processing unit is completed;
    in response to the interrupt signal, scheduling the network parameters of the next group of sub-networks to the first processing unit.
  12. The data processing method according to claim 10 or 11, characterized in that, when the number of first processing units is greater than 1, the control unit schedules the network parameters of two adjacent groups of sub-networks to different first processing units respectively, so that each of the plurality of first processing units performs the first processing on the video frame to be processed based on the network parameters scheduled to this processing unit.
  13. The data processing method according to claim 12, wherein, when the first processing is completed, scheduling the network parameters of the next group of sub-networks among the multiple groups of sub-networks to the first processing unit comprises:
    receiving an interrupt signal sent by the first processing unit, wherein the interrupt signal is sent by the first processing unit when the first processing performed on the video frame to be processed based on the network parameters of the sub-networks currently scheduled to this processing unit is completed;
    in response to the interrupt signal, scheduling the network parameters of the next group of sub-networks to first processing units other than the first processing unit that sent the interrupt signal among the plurality of first processing units.
  14. The data processing method according to any one of claims 10 to 13, further comprising: sending, to the first processing unit, an enable signal for causing the first processing unit to perform the first processing.
  15. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, the steps of the data processing method according to any one of claims 10 to 14 are implemented.
PCT/CN2022/124257 2021-12-30 2022-10-10 Chip, accelerator card, electronic device and data processing method WO2023124361A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111658144.2 2021-12-30
CN202111658144.2A CN114330675A (zh) 2021-12-30 Chip, accelerator card, electronic device and data processing method

Publications (1)

Publication Number Publication Date
WO2023124361A1 true WO2023124361A1 (zh) 2023-07-06

Family

ID=81018318

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124257 WO2023124361A1 (zh) 2021-12-30 2022-10-10 芯片、加速卡、电子设备和数据处理方法

Country Status (2)

Country Link
CN (1) CN114330675A (zh)
WO (1) WO2023124361A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330675A (zh) 2021-12-30 2022-04-12 上海阵量智能科技有限公司 Chip, accelerator card, electronic device and data processing method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985451A (zh) * 2018-06-29 2018-12-11 百度在线网络技术(北京)有限公司 Data processing method and device based on AI chip
US20200364544A1 (en) * 2019-05-17 2020-11-19 Aspiring Sky Co. Limited Multiple accelerators for neural network
CN111339027A (zh) * 2020-02-25 2020-06-26 中国科学院苏州纳米技术与纳米仿生研究所 Automatic design method for a reconfigurable artificial intelligence core and a heterogeneous multi-core chip
CN111626414A (zh) * 2020-07-30 2020-09-04 电子科技大学 Dynamic multi-precision neural network acceleration unit
CN113033785A (zh) * 2021-02-26 2021-06-25 上海阵量智能科技有限公司 Chip, neural network training system, memory management method and apparatus, and device
CN114330675A (zh) * 2021-12-30 2022-04-12 上海阵量智能科技有限公司 Chip, accelerator card, electronic device and data processing method

Also Published As

Publication number Publication date
CN114330675A (zh) 2022-04-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913625

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE