CN114611681A - Heterogeneous system and method for neural network reasoning - Google Patents

Heterogeneous system and method for neural network reasoning

Info

Publication number
CN114611681A
Authority
CN
China
Prior art keywords
processor
heterogeneous
processed
signal
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011449706.8A
Other languages
Chinese (zh)
Inventor
Inventor not announced
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd
Priority to CN202011449706.8A
Publication of CN114611681A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities

Abstract

The present disclosure provides a heterogeneous system and method for neural network inference that may be implemented in a computing device, where the computing device may be included in a combined processing device that may also include a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices.

Description

Heterogeneous system and method for neural network reasoning
Technical Field
The present disclosure relates to computer systems, and more particularly, to heterogeneous neural network systems.
Background
In a heterogeneous platform based on neural network accelerator cards, the decoded data output by a decoder may reside on the GPU side or the CPU side, while the neural network generally needs to run on the GPU side. For image data, the input size and image format required by the neural network usually differ from those of the image directly output by the decoder, so the input image needs to be processed in a first stage (pre-processing), and the output data of the neural network needs to be processed in a second stage (post-processing). In the prior art, the decoded data of the heterogeneous platform is generally stored on the GPU, pre-processing is completed by the CPU, and after the GPU performs inference, the CPU performs post-processing analysis on the inference result of the model. Alternatively, the pre-processing is implemented as one layer of the neural network and performed by the neural network itself.
The prior art has the following defects: 1) only a single CPU or GPU performs the pre- and post-processing, so the advantages of the heterogeneous platform are not fully utilized and the application is limited; 2) the repeated copy operations between the heterogeneous platforms occupy extra data bandwidth, causing a data transmission bottleneck and reducing the utilization efficiency of the neural network accelerator card; 3) the above defects are more pronounced in multi-stage detection scenarios and can greatly affect the overall performance of the program.
Disclosure of Invention
At least one purpose of the present disclosure is to overcome the defect that prior-art heterogeneous platforms do not fully utilize the advantages of the platform and are limited in application.
According to a first aspect of the present disclosure, there is provided a heterogeneous system for neural network inference, comprising: a first scheduler, a first processor, a first heterogeneous processor, and a first inference device, wherein the first processor is configured to: receive a signal to be processed, and perform first preprocessing on the signal to be processed to obtain a first preprocessing result; the first heterogeneous processor is configured to: receive the signal to be processed, and perform second preprocessing on the signal to be processed to obtain a second preprocessing result; and/or receive the first preprocessing result from the first processor to obtain a second preprocessing result; the first scheduler is configured to schedule the first processor and/or the first heterogeneous processor to process the signal to be processed; and the first inference device is configured to receive the second preprocessing result and perform first-level inference to obtain a first-level inference result.
According to a second aspect of the present disclosure, there is provided a method for neural network inference in a heterogeneous system, wherein the heterogeneous system includes a first scheduler, a first processor, a first heterogeneous processor and a first inference device, the method comprising: scheduling, by the first scheduler, the first processor and/or the first heterogeneous processor to process a signal to be processed; receiving the signal to be processed by the first processor, and performing first preprocessing on the signal to be processed to obtain a first preprocessing result; by the first heterogeneous processor: receiving the signal to be processed and performing second preprocessing on the signal to be processed to obtain a second preprocessing result, and/or receiving the first preprocessing result from the first processor to obtain a second preprocessing result; and receiving the second preprocessing result by the first inference device and performing first-level inference to obtain a first-level inference result.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
The technical effects of the present disclosure include at least one of the following: the technical solution of the present disclosure can automatically select the pre- and post-processing platform according to the operating load, and can flexibly switch the platform on which pre- and post-processing run as required, so it has wide application value. In addition, the technical solution of the present disclosure reduces CPU resource consumption, reduces the number of data copies between the CPU and the memory of the heterogeneous processor, reduces data bandwidth pressure, and improves overall performance.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
FIG. 1 illustrates a heterogeneous system for neural network inference, according to one embodiment of the present disclosure;
FIG. 2a shows a schematic diagram of the internal structure of a heterogeneous processor to which the method of the present disclosure may be applied; FIG. 2b shows a schematic of the structure of a heterogeneous processor to which the method of the present disclosure may be applied;
FIG. 3 illustrates another embodiment of a heterogeneous system according to an embodiment of the present disclosure;
FIG. 4 illustrates a method of neural network inference in a heterogeneous system, according to one embodiment of the present disclosure;
FIG. 5 shows a combined processing device; and
fig. 6 illustrates an exemplary board card.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments that can be derived by a person skilled in the art from the embodiments disclosed herein without creative effort shall fall within the scope of protection of the present disclosure.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
Fig. 1 illustrates a heterogeneous system for neural network inference, according to one embodiment of the present disclosure, including: a first scheduler 110, a first processor 130, a first heterogeneous processor 150, and an inference device 170.
It is noted that, in the above terminology, the first processor 130 and the first heterogeneous processor 150 are heterogeneous with respect to each other, that is, they employ different processor architectures and exhibit different performance when processing different data. It should also be understood that although the expressions first processor 130 and first heterogeneous processor 150 are used above, the roles of the two may be interchanged; for example, the first processor 130 may be one or more general purpose processors, such as a CPU, and the first heterogeneous processor 150 may be one or more special purpose processors, such as a graphics processing unit GPU or a machine learning processing unit MLU; alternatively, the first heterogeneous processor 150 may be one or more general purpose processors, such as a CPU, and the first processor 130 may be one or more special purpose processors, such as a graphics processing unit GPU or a machine learning processing unit MLU. The above names do not constitute any limitation on the protection scope of the present application.
As shown in fig. 1, the first processor 130 may be configured to: receive a signal to be processed, and perform first preprocessing on the signal to be processed to obtain a first preprocessing result; the first heterogeneous processor 150 may be configured to: receive the signal to be processed, and perform second preprocessing on the signal to be processed to obtain a second preprocessing result; and/or receive the first preprocessing result from the first processor 130 to obtain a second preprocessing result; the first scheduler 110 may be configured to schedule the first processor 130 and/or the first heterogeneous processor 150 to process the signal to be processed; and the first inference device 170 is configured to receive the second preprocessing result and perform first-level inference to obtain a first-level inference result.
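To make the data flow just described concrete, the following Python sketch chains the two possible preprocessing paths to the first-level inference. It is a minimal, non-authoritative illustration: the callables and names (scheduler.pick, cpu_preprocess, hp_preprocess, copy_to_device, infer) are assumptions introduced here and are not interfaces defined by this disclosure.

```python
def run_first_stage(signal, scheduler, cpu_preprocess, hp_preprocess,
                    copy_to_device, infer):
    """One pass of the first stage: preprocessing, then first-level inference.

    `scheduler` stands in for the first scheduler 110, `cpu_preprocess` for the
    first processor 130, `hp_preprocess` for the first heterogeneous processor
    150, and `infer` for the first inference device 170 (all names assumed).
    """
    target = scheduler.pick(signal)
    if target == "heterogeneous":
        # Second preprocessing directly on the heterogeneous processor,
        # with no extra copy between the two sides.
        second_result = hp_preprocess(signal)
    else:
        # First preprocessing on the general-purpose processor; the first
        # preprocessing result is then transferred/copied to the heterogeneous
        # side, where it serves as the second preprocessing result.
        second_result = copy_to_device(cpu_preprocess(signal))
    # The first inference device consumes the second preprocessing result.
    return infer(second_result)
```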
The "signal to be processed" described above may refer to decoded video, audio, image, and other data, and the technical content of the present disclosure does not impose any limitation on the type of signal to be processed.
The above "to-be-processed signal" may be from various sources, for example, the first processor 130 may receive the to-be-processed signal from other external processors, encoders, etc., and may also receive the signal from the first heterogeneous processor 150; the first heterogeneous processor 150 can receive the signal to be processed from other external processors, encoders, etc., and also can receive the signal to be processed from the first processor 130. The technical content of the present disclosure does not impose any limitation on the source of the signal to be processed.
In fig. 1, although it is shown that the first processor 130 and the first heterogeneous processor 150 both receive the to-be-processed signals, the two to-be-processed signals may be the same signal or may be from different external components.
Generally, the first heterogeneous processor 150 (e.g., MLU) receives the signals to be processed, and then the first processor 130 or the first heterogeneous processor 150 processes the signals to be processed according to actual requirements.
It should be understood that there may be a plurality of situations in which the first heterogeneous processor 150 "receives" the signal to be processed: in one situation the "signal to be processed" is produced by a processor other than the first heterogeneous processor 150, and in another situation the "signal to be processed" is generated by the first heterogeneous processor 150 itself. The term "receive" is used herein only to indicate that the signal to be processed enters the next processing stage and does not imply that the signal is processed by a different physical entity.
Thus, in the present disclosure, the first scheduler 110 may schedule the first processor 130 or the first heterogeneous processor 150 to process the signal to be processed according to actual requirements, or may schedule the first processor 130 and the first heterogeneous processor 150 to process the signal to be processed together.
In fig. 1, the first heterogeneous processor 150 represented by a solid-line box and the first heterogeneous processor 150 represented by a dashed-line box may be the same heterogeneous processor, or may be heterogeneous processors having the same function but different entities. Specifically, in the present disclosure, the first heterogeneous processor 150 may receive a signal to be processed, obtain a second preprocessing result after the signal is processed, and send the second preprocessing result to the first inference device 170 for inference processing; on the other hand, the first processor 130 may receive the signal to be processed, obtain a first preprocessing result after the signal to be processed is processed, transfer or copy the first preprocessing result to the first heterogeneous processor 150, and then send a second preprocessing result of the first heterogeneous processor 150 to the first inference device 170 for inference processing.
In fig. 1, the first preprocessing result and the second preprocessing result are indicated by different names, but they are substantially equivalent or the same, that is, the first preprocessing result obtained after processing by the first processor 130 is equivalent or identical to the second preprocessing result obtained after processing by the first heterogeneous processor 150. The output of the first processor 130 and the output of the first heterogeneous processor 150 are described herein by different names only to distinguish their processing entities, and this does not constitute any limitation on the preprocessing result.
According to an embodiment of the present disclosure, the first scheduler 110 in the present disclosure may be configured to preferentially schedule the first heterogeneous processor 150 to process the signal to be processed.
In this embodiment, the first heterogeneous processor 150 can directly process the signal to be processed without the signal being sent to the first processor 130. The advantage of this is that the first processor 130, e.g. a general purpose processor such as a CPU, is usually not specifically designed for particular task types, so its processing efficiency for these task types is not high enough; in addition, if the first heterogeneous processor 150 receives the signal to be processed but the first processor 130 performs the preprocessing, the signal needs to be transferred or copied to the first processor 130, and the first preprocessing result obtained after processing needs to be transferred or copied back to the first heterogeneous processor 150, which on the one hand reduces the speed of signal processing and on the other hand occupies extra bandwidth due to the transfer and copying of the signal, easily causing a data communication bottleneck.
Although it may be preferable to employ a dedicated processor for preprocessing the signal to be processed, in some particular cases, the general-purpose processor, i.e., the first processor 130, may be scheduled first for processing the signal to be processed.
According to an embodiment of the present disclosure, the first scheduler 110 may be configured to: detect a load of the first heterogeneous processor 150; and dynamically adjust the proportion of the signal to be processed that is handled by the first heterogeneous processor 150 according to the load of the first heterogeneous processor 150.
In this embodiment, the first scheduler 110 may first check the load of the first heterogeneous processor 150, and if the load of the first heterogeneous processor 150 is too high or the processing capability is insufficient, a part or all of the tasks may be scheduled to the first processor 130 for processing; if the processing power of the first heterogeneous processor 150 increases or the amount of data that needs to be processed decreases, all or a portion of the tasks may be scheduled to be processed at the first heterogeneous processor 150. In other words, the first scheduler 110 may dynamically adjust the ratio of the amount of tasks processed by the first processor 130 to the amount of tasks processed by the first heterogeneous processor 150 according to the load of the first heterogeneous processor 150, the stronger the processing power of the first heterogeneous processor 150, the lower the ratio; otherwise, the higher.
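A minimal sketch of this proportional split is given below, assuming the load is reported as a fraction of the heterogeneous processor's capacity; the linear policy and the function name are illustrative assumptions rather than a prescription of this disclosure.

```python
def split_pending_signals(signals, hp_load, hp_capacity=1.0):
    """Divide pending signals between the heterogeneous processor and the CPU.

    The idler the heterogeneous processor, the larger its share; as its load
    approaches capacity, more of the work is shifted to the first processor.
    """
    headroom = max(0.0, 1.0 - hp_load / hp_capacity)  # 1.0 means fully idle
    n_hp = int(round(len(signals) * headroom))        # share kept on processor 150
    return signals[:n_hp], signals[n_hp:]             # (heterogeneous share, CPU share)

# Example: at 75% load, roughly a quarter of an 8-signal batch stays on the
# heterogeneous processor and the rest is scheduled to the first processor.
hp_part, cpu_part = split_pending_signals(list(range(8)), hp_load=0.75)
```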
Fig. 2a shows a schematic diagram of the internal structure of a heterogeneous processor to which the method of the present disclosure can be applied. The heterogeneous processor may be, for example, an Artificial Intelligence (AI) chip or AI processor.
An Artificial Intelligence (AI) chip accelerates data computation and reduces memory access delay. The AI chip adopts a multi-core processor architecture, supports up to 16-core parallel computation, and adds a storage unit core (also called an on-chip storage unit) to accelerate data reading, thereby alleviating the memory access bottleneck between the processor cores of the AI chip and the DDR (also called an off-chip storage unit). This provides users with stronger computing capability in scenarios such as deep learning and network computing.
The AI chip has 16 processor cores in total for executing computing tasks. Every 4 processor cores constitute one processor group, i.e. 4 processor groups in total. There is a storage unit core within each processor group. The storage unit core is mainly used for data exchange between the shared storage unit inside the processor group and the processor cores, and for data exchange between processor groups. When the storage unit core and a processor core access the DDR simultaneously, arbitration by a multiplexer guarantees that only one group of buses accesses the DDR.
Fig. 2b shows a schematic structural diagram of a heterogeneous processor to which the method of the present disclosure may be applied. The heterogeneous processor may be, for example, an AI chip.
The DDR of the AI chip adopts a Non-Uniform Memory Access (NUMA) architecture, and each processor group can access different DDR channels through NOC0, but with different delays for different DDR channels. Each processor group corresponds to one DDR channel with the lowest access delay, and the access delay of the other channels is relatively long. As shown in the structure diagram of the processor groups and DDR in fig. 2b, the latency of processor group 0, processor group 1, processor group 2 and processor group 3 accessing the corresponding DDR0, DDR1, DDR2 and DDR3 is the lowest. That is, each processor core accesses the DDR channel with the lowest access delay for its own processor group.
Because the access bandwidth inside a processor group is higher than the access bandwidth between a processor core and the DDR, the AI chip can have the processor cores access the shared storage unit inside the processor group to reduce direct access of the processor cores to the DDR, thereby improving data throughput.
When 4-core parallel computing is required, the storage unit core may broadcast data from the shared storage unit to the 4 processor cores within the processor group simultaneously (via NOC1) for data computation. Compared with a mode in which all processor cores read data through the DDR, this can reduce memory access delay and optimize computing performance.
As computing demands increase, the 16 processor cores may need to process multiple computing tasks simultaneously. Direct access of the processor cores to the DDR inevitably causes data access delay, resulting in problems such as low computing speed. The AI chip avoids direct communication between the 16 processor cores and the DDR through data exchange between the processor groups, thereby reducing the delay of data access.
As shown in fig. 2a and 2b, for a heterogeneous processor that is a multi-core processor, each of its parts such as a processor group, a processor core and a storage unit core may be occupied, which reduces the overall processing capability of the heterogeneous processor; when these computing resources are released, the overall processing capability of the heterogeneous processor increases.
Thus, the processing capability of the first heterogeneous processor 150 can be determined according to a plurality of parameters or criteria, and according to an embodiment of the present disclosure, the processing capability of the first heterogeneous processor 150 can be determined according to parameters such as the core occupancy rate, the memory occupancy rate, and the board temperature of the first heterogeneous processor 150. If at least one of the core occupancy, the memory occupancy, and the board temperature of the first heterogeneous processor 150 exceeds the corresponding threshold, the ratio of the first heterogeneous processor 150 to process the to-be-processed signal may be decreased, and the ratio of the first processor 130 to process the to-be-processed signal may be increased.
The core mentioned above may refer to the processor cores and processor groups shown in fig. 2a and fig. 2b, and the memory occupancy may refer to the occupancy of the storage unit core shown in fig. 2b. In addition, since an increase of processing tasks in the first heterogeneous processor 150 also causes an increase of physical parameters such as the board temperature, it is also possible to determine whether the processing capability of the first heterogeneous processor 150 is sufficient by monitoring the board temperature.
Corresponding thresholds can be set for these parameters respectively. Once a parameter exceeds its corresponding threshold, the tasks in the first heterogeneous processor 150 can be partially or completely transferred to the first processor 130, such as a CPU, for processing, so as to reduce the burden of the first heterogeneous processor 150; and when these parameters are below, or fall back below, the respective thresholds, more tasks may be processed by the first heterogeneous processor 150 to reduce, for example, CPU burden and overhead. This transfer of tasks is bidirectional and adjustable.
In the above technical solution, the proportion of the signal to be processed that is handled by the first heterogeneous processor 150 may be adjusted according to whether one or more of the above parameters exceed the corresponding thresholds. According to an embodiment of the present disclosure, the burden of the first heterogeneous processor 150 may also be adjusted according to whether a weighted average of the core occupancy, the memory occupancy, and the board temperature of the first heterogeneous processor 150 exceeds a predetermined threshold. Specifically, in response to the weighted average of the core occupancy, the memory occupancy, and the board temperature of the first heterogeneous processor 150 exceeding the predetermined threshold, the proportion of the signal to be processed handled by the first heterogeneous processor 150 may be decreased, and the proportion handled by the first processor 130 may be increased.
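The weighted-average criterion can be written in a few lines. In the sketch below the weights and the threshold are placeholder values (the disclosure leaves their choice to the practitioner), and the board temperature is assumed to have been normalised into the range 0 to 1 before being combined with the occupancy ratios.

```python
# Placeholder weights; their concrete values are left to the practitioner.
WEIGHTS = {"core_occupancy": 0.4, "memory_occupancy": 0.4, "board_temperature": 0.2}

def hp_overloaded(metrics, threshold=0.8, weights=WEIGHTS):
    """True when the weighted average of the heterogeneous processor's core
    occupancy, memory occupancy and (normalised) board temperature exceeds the
    predetermined threshold, i.e. when its share of the signals to be
    processed should be reduced in favour of the first processor."""
    score = sum(w * metrics[name] for name, w in weights.items())
    return score > threshold
```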
A respective weight may be set for each parameter so that the burden on the heterogeneous processor is adjusted based on a comprehensive evaluation of these parameters. Such weights can be set by those skilled in the art according to the actual application scenario.
Having introduced the above situation where the burden of the first processor 130 and the first heterogeneous processor 150 is adjusted according to some parameters of the first heterogeneous processor 150, according to an embodiment of the present disclosure, the first processor 130 may also be scheduled to process the signal to be processed in response to a mismatch between the type of the signal to be processed and the first heterogeneous processor 150.
The mismatch mentioned above means that the signal or task to be processed does not match the first heterogeneous processor 150, or that the first heterogeneous processor 150 is not suitable for processing the signal or task. For example, for some complex graphics calculations, the first heterogeneous processor 150 may not be suitable, in which case such a signal to be processed may be scheduled to the first processor 130 for processing. It is to be understood that complex graphics calculations are only an example; conversely, if certain tasks are better suited to the first heterogeneous processor 150, the first heterogeneous processor 150 may be preferentially scheduled to process those tasks or signals to be processed.
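Type-based dispatch can be as simple as a lookup. The task-type names below are hypothetical examples that echo the image operations listed later in this disclosure; they are not a fixed taxonomy.

```python
# Hypothetical set of task types the heterogeneous processor handles well;
# anything outside it is scheduled to the general-purpose processor instead.
SUITED_TO_HETEROGENEOUS = {"color_space_conversion", "resize", "flip", "affine_transform"}

def pick_target(task_type):
    return "heterogeneous" if task_type in SUITED_TO_HETEROGENEOUS else "cpu"
```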
Further, the first inference device 170 may be a heterogeneous processor that has the same function as the first heterogeneous processor 150 but is a different physical entity, or it may be the same physical processor as the first heterogeneous processor 150. In this document, different names are used only to indicate their functions at different stages. According to a preferred embodiment of the present disclosure, the first inference device 170 and the first heterogeneous processor 150 are the same physical processor; in this preferred embodiment, the first inference device 170 itself can complete the preprocessing without an additional heterogeneous processor, thereby reducing the number of data copies and reducing the data bandwidth pressure.
Fig. 3 illustrates another embodiment of a heterogeneous system according to an embodiment of the present disclosure.
As shown in fig. 3, the heterogeneous system of the present disclosure may further include a second scheduler 210, a second processor 230, a second heterogeneous processor 250, and a second inference apparatus 270.
It is noted that in the above terminology, the second processor 230 and the second heterogeneous processor 250 are heterogeneous, that is, they use different processor architectures and have different performance behaviors when processing different data. It should also be understood that although the expressions second processor 230 and second heterogeneous processor 250 are used above, the roles of the two may be interchanged, for example, second processor 230 may be one or more general purpose processors, such as a CPU, and second heterogeneous processor 250 may be one or more special purpose processors, such as a graphics processing unit GPU or a machine learning processing unit MLU; alternatively, the second heterogeneous processor 250 may be one or more general purpose processors, such as a CPU, and the second processor 230 may be one or more special purpose processors, such as a graphics processing unit GPU or a machine learning processing unit MLU, or the like. The above-mentioned names do not constitute any limitation to the protection of the present application.
As shown in fig. 3, the second processor 230 may be configured to receive the primary inference result and perform first post-processing on it to obtain a first post-processing result; the second heterogeneous processor 250 is configured to receive the primary inference result and perform second post-processing on it to obtain a second post-processing result, and/or to receive the first post-processing result from the second processor 230 to obtain a second post-processing result; the second scheduler 210 is configured to schedule the second processor 230 and/or the second heterogeneous processor 250 to process the primary inference result; and the second inference device 270 is configured to receive the second post-processing result and perform second-level inference to obtain a second-level inference result.
The "primary inference result" above may be an output from the first inference means 170 as shown in fig. 1, which may be further processed and inferred.
In general, the primary inference results may be received by the second heterogeneous processor 250 (e.g., MLU) and then processed by the second processor 230 or by the second heterogeneous processor 250 based on actual needs.
It is to be understood that the second heterogeneous processor 250 and the first heterogeneous processor 150 described herein can be different or the same heterogeneous processor. There may be multiple instances of "receiving" the primary inference result as described herein, one instance being where the "primary inference result" is generated by a first heterogeneous processor 150 that is different from a second heterogeneous processor 250, and another instance being where the "primary inference result" is generated by the second heterogeneous processor 250 itself. The term "receive" is used herein only to indicate that the primary inference result has entered the next processing stage, and does not indicate that processing is being performed by a different physical entity.
Thus, in the present disclosure, the second scheduler 210 may be adopted to schedule the second processor 230 or the second heterogeneous processor 250 to process the primary inference result according to actual requirements, or may schedule the second processor 230 and the second heterogeneous processor 250 to process the primary inference result together.
In fig. 3, the second heterogeneous processor 250 represented by a solid-line box and the second heterogeneous processor 250 represented by a dashed-line box may be the same heterogeneous processor, or may be heterogeneous processors having the same function but different entities. Specifically, in the present disclosure, the second heterogeneous processor 250 may receive the first-stage inference result, obtain a second post-processing result after the processing, and send the second post-processing result to the second inference device 270 for inference processing; on the other hand, the second processor 230 may receive the first-stage inference result, obtain a first post-processing result after the processing, transfer or copy the first post-processing result to the second heterogeneous processor 250, and then send the second post-processing result of the second heterogeneous processor 250 to the second inference device 270 for inference processing.
In fig. 3, although the first post-processing result and the second post-processing result are denoted by different names, they are substantially equivalent or the same, that is, the first post-processing result obtained after processing by the second processor 230 is equivalent or identical to the second post-processing result obtained after processing by the second heterogeneous processor 250. The output of the second processor 230 and the output of the second heterogeneous processor 250 are described herein by different names only to distinguish their processing entities, and this does not constitute any limitation on the post-processing result.
According to an embodiment of the present disclosure, the second scheduler 210 in the present disclosure may be configured to preferentially schedule the second heterogeneous processor 250 to process the first-level inference result.
In this embodiment, the first-level inference result may be directly processed by the second heterogeneous processor 250 without being sent to the second processor 230. The advantage of this is that the second processor 230, e.g. a general purpose processor such as a CPU, is usually not specifically designed for particular task types, so its processing efficiency for these task types is not high enough; in addition, if the second heterogeneous processor 250 receives the first-level inference result but the second processor 230 performs the post-processing, the result needs to be transferred or copied to the second processor 230, and the first post-processing result obtained after processing needs to be transferred or copied back to the second heterogeneous processor 250, which on the one hand reduces the processing speed and on the other hand occupies extra bandwidth due to the transfer and copying, easily causing a data communication bottleneck.
Although a dedicated processor may be preferentially employed for post-processing the first inference result, in certain specific cases a general purpose processor, i.e. the second processor 230, may be scheduled first for processing the first inference result.
According to an embodiment of the present disclosure, the second scheduler 210 may be configured to: detecting a load of the second heterogeneous processor 250; and dynamically adjusting the proportion of the second heterogeneous processor 250 processing the first inference result according to the load of the second heterogeneous processor 250.
In this embodiment, the second scheduler 210 may first check the load of the second heterogeneous processor 250, and if the load of the second heterogeneous processor 250 is too high or its processing capability is insufficient, part or all of the tasks may be scheduled to the second processor 230 for processing; if the processing capability of the second heterogeneous processor 250 increases or the amount of data to be processed decreases, all or part of the tasks may be scheduled to the second heterogeneous processor 250 for processing. In other words, the second scheduler 210 may dynamically adjust the ratio of the amount of tasks processed by the second processor 230 to the amount of tasks processed by the second heterogeneous processor 250 according to the load of the second heterogeneous processor 250: the stronger the processing capability of the second heterogeneous processor 250, the lower the ratio; the weaker, the higher.
The second heterogeneous processor 250 may be as shown in fig. 2a and 2b. For a heterogeneous processor that is a multi-core processor, its parts such as a processor group, a processor core and a storage unit core are likely to be occupied, which reduces the overall processing capability of the heterogeneous processor; when these computing resources are released, the overall processing capability of the heterogeneous processor increases.
Thus, the processing capability of the second heterogeneous processor 250 can be judged according to a plurality of parameters or criteria, and according to one embodiment of the present disclosure, the processing capability of the second heterogeneous processor 250 can be judged according to parameters such as the core occupancy rate, the memory occupancy rate, and the board temperature of the second heterogeneous processor 250. If at least one of the core occupancy rate, the memory occupancy rate, and the board temperature of the second heterogeneous processor 250 exceeds the corresponding threshold, the ratio of the second heterogeneous processor 250 to process the first inference result may be decreased, and the ratio of the second processor 230 to process the first inference result may be increased.
In addition, since the increase of the processing task in the second heterogeneous processor 250 also causes the increase of the physical parameters such as the board temperature, it is also possible to determine whether the processing capability of the second heterogeneous processor 250 is sufficient by monitoring the board temperature.
Corresponding thresholds can be set for these parameters respectively. Once a parameter exceeds its corresponding threshold, the tasks in the second heterogeneous processor 250 can be partially or completely transferred to the second processor 230, such as a CPU, for processing, so as to reduce the burden of the second heterogeneous processor 250; and when these parameters are below, or fall back below, the respective thresholds, more tasks may be processed by the second heterogeneous processor 250 to reduce, for example, CPU burden and overhead. This transfer of tasks is bidirectional and adjustable.
In the above technical solution, the proportion of the second heterogeneous processor 250 to process the first inference result may be adjusted according to whether one or more of the above parameters exceed the corresponding threshold, and according to an embodiment of the present disclosure, the burden of the second heterogeneous processor 250 may also be adjusted according to whether a weighted average of the core occupancy rate, the memory occupancy rate, and the board temperature of the second heterogeneous processor 250 exceeds a predetermined threshold. Specifically, in response to the weighted average of the core occupancy, the memory occupancy, and the board temperature of the second heterogeneous processor 250 exceeding a predetermined threshold, the ratio of the second heterogeneous processor 250 to process the first inference result may be decreased, and the ratio of the second processor 230 to process the first inference result may be increased.
A respective weight may be set for each parameter to adjust the burden on the heterogeneous processor based on a comprehensive evaluation of those parameters. The setting of such weights can be set by those skilled in the art according to the actual application scenario.
Having introduced above how the burden of the second processor 230 and the second heterogeneous processor 250 is adjusted according to certain parameters of the second heterogeneous processor 250, according to an embodiment of the present disclosure, the second processor 230 may also be scheduled to process the first-level inference result in response to a mismatch between the type of the first-level inference result and the second heterogeneous processor 250.
The mismatch mentioned above means that the first-level inference result or task does not match the second heterogeneous processor 250, or that the second heterogeneous processor 250 is not suitable for processing it. For example, for some complex graphics calculations, the second heterogeneous processor 250 may not be suitable, in which case such a first-level inference result may be dispatched to the second processor 230 for processing. It is to be understood that complex graphics calculations are only an example; conversely, if certain tasks are better suited to the second heterogeneous processor 250, the second heterogeneous processor 250 may be preferentially scheduled to process those tasks or inference results.
Further, the second inference device 270 may be a heterogeneous processor that has the same function as the second heterogeneous processor 250 but is a different physical entity, or it may be the same physical processor as the second heterogeneous processor 250. In this document, different names are used only to indicate their functions at different stages.
It should be understood that, in the above description, it is assumed that the first processor 130 and the second processor 230 may be the same or physically the same processor (e.g., CPU), the first heterogeneous processor 150, the second heterogeneous processor 250, the first inference device 170, and the second inference device 270 may be physically the same processor (e.g., GPU), and they are referred to by different names only for indicating their different functions. When different functions are on the same physical entity, the overhead of extra data copy and the like can be reduced, and the overall operation efficiency of the system is improved.
Fig. 4 illustrates a method of neural network inference in a heterogeneous system, wherein the heterogeneous system includes a first scheduler, a first processor, a first heterogeneous processor, and an inference apparatus, the method comprising: scheduling, by a first scheduler, the first processor and/or a first heterogeneous processor to process the signal to be processed in operation S410; in operation S420, receiving the signal to be processed by a first processor, and performing a first preprocessing on the signal to be processed to obtain a first preprocessing result; in operation S430, receiving, by the first heterogeneous processor, the signal to be processed, and performing a second pre-processing on the signal to be processed to obtain a second pre-processing result; and/or, in operation S440, receiving, by the first heterogeneous processor, the first pre-processing result from the first processor to obtain a second pre-processing result; in operation S450, the second pre-processing result is received through the first inference apparatus to perform a primary inference, so as to obtain a primary inference result.
It should be understood that the operations shown in fig. 4 are only operations at different components; the operations do not necessarily have a sequential relationship, nor are they all necessarily required, and the corresponding operations may occur according to the direction of signal flow.
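Putting the two stages together, the pipelines of fig. 1, fig. 3 and fig. 4 reduce to two symmetric passes. The sketch below is a hedged illustration in which every callable is an assumption standing in for the corresponding component (schedulers 110/210, processors 130/230, heterogeneous processors 150/250, inference devices 170/270).

```python
def two_stage_inference(signal, stage1, stage2):
    """stage1 and stage2 are dicts of callables standing in for the components
    of fig. 1 and fig. 3 respectively; each stage schedules, pre/post-processes
    on one side or the other, and then runs its inference."""
    def run(stage, data):
        if stage["scheduler"](data) == "heterogeneous":
            processed = stage["hp_process"](data)                   # on 150 / 250
        else:
            processed = stage["copy"](stage["cpu_process"](data))   # on 130 / 230, then copied
        return stage["infer"](processed)                            # on 170 / 270

    first_level_result = run(stage1, signal)   # operations S410 to S450 of fig. 4
    return run(stage2, first_level_result)     # post-processing and second-level inference
```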
The present disclosure also provides an electronic device, comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
The heterogeneous system of the present disclosure supports a plurality of image processing algorithms, such as a color space conversion algorithm, an image size scaling algorithm, an image horizontal and vertical flipping algorithm, an image affine transformation algorithm, and the like, when performing pre-and post-processing.
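As a concrete but non-binding illustration of the operations listed above, the host-side equivalents are commonly written with OpenCV; on the heterogeneous processor they would instead be executed by its own kernels, which this disclosure does not name. The network input size used below is an assumption.

```python
import cv2
import numpy as np

def preprocess_frame(frame, net_size=(224, 224)):
    """Typical pre-processing of a decoded BGR frame before inference:
    colour-space conversion, scaling to the network input size, a horizontal
    flip and an affine (rotation) transform, mirroring the algorithm list above."""
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)        # colour space conversion
    resized = cv2.resize(rgb, net_size)                  # image size scaling
    flipped = cv2.flip(resized, 1)                       # horizontal flip
    m = cv2.getRotationMatrix2D((net_size[0] / 2, net_size[1] / 2), 10, 1.0)
    return cv2.warpAffine(flipped, m, net_size)          # affine transformation

# e.g. on a dummy 1080p frame:
dummy = np.zeros((1080, 1920, 3), dtype=np.uint8)
net_input = preprocess_frame(dummy)
```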
The technical solution of the present disclosure can automatically select the pre- and post-processing platform according to the operating load, and can flexibly switch the platform on which pre- and post-processing run as required, so it has wide application value. In addition, the technical solution of the present disclosure reduces CPU resource consumption, reduces the number of data copies between the CPU and the memory of the heterogeneous processor, reduces data bandwidth pressure, and improves overall performance.
The first scheduler 110 and the second scheduler 210 in the present disclosure may be implemented by means of hardware and software. The technical solution of the present disclosure can be applied to the field of artificial intelligence and implemented in an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
Fig. 5 shows a combined processing device 500 that includes the computing device 502 described above, a universal interconnect interface 504, and other processing devices 506. The computing device according to the present disclosure interacts with the other processing devices to jointly complete operations specified by a user. Fig. 5 is a schematic diagram of the combined processing device.
The other processing devices include one or more types of general purpose/special purpose processors such as central processing units (CPUs), graphics processing units (GPUs), neural network processors, and the like; the number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, performing data transfer and basic control of the machine learning computing device such as starting and stopping; the other processing devices may also cooperate with the machine learning computing device to complete computing tasks.
The universal interconnect interface is used for transferring data and control instructions between the computing device (including, for example, a machine learning computing device) and the other processing devices. The computing device acquires required input data from the other processing devices and writes it into a storage device on the computing device chip; it can obtain control instructions from the other processing devices and write them into a control cache on the computing device chip; it can also read data in a storage module of the computing device and transmit the data to the other processing devices.
Optionally, the structure may further comprise a storage device 508, which is connected to the computing device and the other processing device, respectively. The storage device is used for storing data in the computing device and the other processing devices, and is particularly suitable for storing all data which cannot be stored in the internal storage of the computing device or the other processing devices.
The combined processing device can be used as an SOC (system on chip) for devices such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed and reducing the overall power consumption. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card or a WiFi interface.
In some embodiments, the disclosure also discloses a chip packaging structure, which includes the chip.
In some embodiments, the disclosure also discloses a board card comprising the chip packaging structure. Referring to fig. 6, an exemplary board card is provided that may include other kits in addition to the chip 602, including but not limited to: a memory device 604, an interface device 606, and a control device 608.
The memory device is connected with the chip in the chip packaging structure through a bus and used for storing data. The memory device may include a plurality of groups of memory cells 610. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of the memory cells, and each group of the memory cells may include a plurality of DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. In one embodiment, each group of the memory cells includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip to control data transmission to and data storage in each memory cell.
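The quoted width split and the double data rate make the per-channel bandwidth arithmetic straightforward; the clock frequency in the example below is a hypothetical figure, not one stated in this disclosure.

```python
DATA_BITS, ECC_BITS = 64, 8          # split of each 72-bit DDR4 controller
assert DATA_BITS + ECC_BITS == 72

def ddr_peak_bandwidth_gb_s(clock_mhz, data_bits=DATA_BITS):
    """Peak payload bandwidth of one DDR channel: data is transferred on both
    the rising and the falling clock edge, i.e. twice per clock cycle."""
    return 2 * clock_mhz * 1e6 * data_bits / 8 / 1e9

# Hypothetical example: a 1600 MHz DDR4 clock gives 2 * 1600e6 * 8 B = 25.6 GB/s.
peak = ddr_peak_bandwidth_gb_s(1600)
```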
The interface device is electrically connected with the chip in the chip packaging structure. The interface device is used to enable data transfer between the chip and an external device 612, such as a server or a computer. For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transmitted by the server to the chip through the standard PCIE interface to implement the data transfer. In another embodiment, the interface device may also be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the calculation results of the chip are transmitted back to the external device (e.g., the server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single chip microcomputer (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads; therefore, the chip can be in different working states such as heavy load and light load. The control device can regulate and control the working states of the plurality of processing chips, the plurality of processing cores and/or the plurality of processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
Electronic devices or apparatuses include data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, cell phones, automobile data recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, optical, acoustic, magnetic or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present disclosure. The aforementioned memory includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk or an optical disk, and other media capable of storing program code.
The foregoing detailed description of the embodiments of the present disclosure is presented for purposes of illustration only; it is exemplary and is not intended to be exhaustive or to limit the disclosure to the precise forms described. For those skilled in the art, variations may be made to the specific embodiments and the scope of application based on the ideas of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (11)

1. A heterogeneous system for neural network inference, comprising: a first scheduler, a first processor, a first heterogeneous processor, and a first inference apparatus, wherein
the first processor is configured to: receive a signal to be processed, and perform first preprocessing on the signal to be processed to obtain a first preprocessing result;
the first heterogeneous processor is configured to:
receive the signal to be processed, and perform second preprocessing on the signal to be processed to obtain a second preprocessing result; and/or
receive the first preprocessing result from the first processor to obtain a second preprocessing result;
the first scheduler is configured to schedule the first processor and/or the first heterogeneous processor to process the signal to be processed; and
the first inference apparatus is configured to receive the second preprocessing result to perform first-level inference, so as to obtain a first-level inference result.
2. The heterogeneous system of claim 1, wherein the first scheduler is configured to preferentially schedule the first heterogeneous processor to process the signal to be processed.
3. The heterogeneous system of claim 1 or 2, wherein the first scheduler is configured to:
detect a load of the first heterogeneous processor; and
dynamically adjust, according to the load of the first heterogeneous processor, the proportion of the signal to be processed that is processed by the first heterogeneous processor.
4. The heterogeneous system of claim 3, wherein dynamically adjusting the proportion of the signal to be processed that is processed by the first heterogeneous processor according to the load of the first heterogeneous processor comprises:
in response to at least one of a core occupancy rate, a memory occupancy rate, and a board temperature of the first heterogeneous processor exceeding its corresponding threshold, reducing the proportion of the signal to be processed that is processed by the first heterogeneous processor and increasing the proportion processed by the first processor.
5. The heterogeneous system of claim 3, wherein dynamically adjusting the proportion of the signal to be processed that is processed by the first heterogeneous processor according to the load of the first heterogeneous processor comprises:
in response to a weighted average of the core occupancy rate, the memory occupancy rate, and the board temperature of the first heterogeneous processor exceeding a preset threshold, reducing the proportion of the signal to be processed that is processed by the first heterogeneous processor and increasing the proportion processed by the first processor.
6. The heterogeneous system of claim 3, wherein dynamically adjusting the proportion of the signal to be processed that is processed by the first heterogeneous processor according to the load of the first heterogeneous processor comprises:
scheduling the first processor to process the signal to be processed in response to the type of the signal to be processed not matching the first heterogeneous processor.
7. The heterogeneous system of any of claims 1 to 6, wherein the first processor is one or more general-purpose processors and the first heterogeneous processor is one or more special-purpose processors, preferably a graphics processing unit (GPU) or a machine learning processing unit.
8. The heterogeneous system of any of claims 1 to 7, further comprising: a second scheduler, a second processor, a second heterogeneous processor, and a second inference apparatus, wherein
the second processor is configured to receive the first-level inference result and perform first post-processing on the first-level inference result to obtain a first post-processing result;
the second heterogeneous processor is configured to:
receive the first-level inference result, and perform second post-processing on the first-level inference result to obtain a second post-processing result; and/or
receive the first post-processing result from the second processor to obtain a second post-processing result;
the second scheduler is configured to schedule the second processor and/or the second heterogeneous processor to process the first-level inference result; and
the second inference apparatus is configured to receive the second post-processing result to perform second-level inference, so as to obtain a second-level inference result.
9. A method of neural network inference in a heterogeneous system, wherein the heterogeneous system includes a first scheduler, a first processor, a first heterogeneous processor, and a first inference apparatus, the method comprising:
scheduling, by the first scheduler, the first processor and/or the first heterogeneous processor to process a signal to be processed;
receiving, by the first processor, the signal to be processed, and performing first preprocessing on the signal to be processed to obtain a first preprocessing result;
by the first heterogeneous processor:
receiving the signal to be processed, and performing second preprocessing on the signal to be processed to obtain a second preprocessing result; and/or
receiving the first preprocessing result from the first processor to obtain a second preprocessing result; and
receiving, by the first inference apparatus, the second preprocessing result to perform first-level inference, so as to obtain a first-level inference result.
10. An electronic device, comprising:
one or more processors; and
memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of claim 9.
11. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of claim 9.
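By way of illustration only, and not as part of the claims or the original disclosure, the load-aware scheduling recited in claims 2 to 6 could be sketched in Python along the following lines. All identifiers, weights, thresholds, and the step size below (HeterogeneousLoad, FirstScheduler, adjust_ratio, dispatch, and their defaults) are assumptions introduced here for clarity and do not appear in the patent.

# Hypothetical sketch of the first scheduler's load-aware dispatch (claims 2-6).
# Every name, weight, and threshold below is an illustrative assumption.
from dataclasses import dataclass


@dataclass
class HeterogeneousLoad:
    core_occupancy: float      # fraction in [0.0, 1.0]
    memory_occupancy: float    # fraction in [0.0, 1.0]
    board_temperature: float   # degrees Celsius


class FirstScheduler:
    def __init__(self, step=0.1, weights=(0.4, 0.3, 0.3),
                 avg_threshold=0.8, temp_ceiling=85.0):
        # Claim 2: by default, signals are scheduled preferentially to the
        # first heterogeneous processor (ratio 1.0 means all of them).
        self.hetero_ratio = 1.0
        self.step = step
        self.weights = weights
        self.avg_threshold = avg_threshold
        self.temp_ceiling = temp_ceiling

    def adjust_ratio(self, load: HeterogeneousLoad) -> float:
        # Claim 5: combine core occupancy, memory occupancy, and board
        # temperature into one weighted average; temperature is normalized
        # against an assumed ceiling so the three terms are comparable.
        w_core, w_mem, w_temp = self.weights
        weighted_avg = (w_core * load.core_occupancy
                        + w_mem * load.memory_occupancy
                        + w_temp * load.board_temperature / self.temp_ceiling)
        if weighted_avg > self.avg_threshold:
            # Overloaded: shift part of the signal stream to the first processor.
            self.hetero_ratio = max(0.0, self.hetero_ratio - self.step)
        else:
            # Headroom available: shift work back to the heterogeneous processor.
            self.hetero_ratio = min(1.0, self.hetero_ratio + self.step)
        return self.hetero_ratio

    def dispatch(self, signal_type: str, supported_types: set) -> str:
        # Claim 6: a signal whose type the heterogeneous processor does not
        # support is scheduled to the first (general-purpose) processor.
        if signal_type not in supported_types:
            return "first_processor"
        return "first_heterogeneous_processor"

Under this reading, the per-metric thresholds of claim 4 and the weighted average of claim 5 are two interchangeable overload tests feeding the same ratio update, while the type check of claim 6 bypasses the ratio entirely.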
CN202011449706.8A 2020-12-09 2020-12-09 Heterogeneous system and method for neural network reasoning Pending CN114611681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011449706.8A CN114611681A (en) 2020-12-09 2020-12-09 Heterogeneous system and method for neural network reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011449706.8A CN114611681A (en) 2020-12-09 2020-12-09 Heterogeneous system and method for neural network reasoning

Publications (1)

Publication Number Publication Date
CN114611681A true CN114611681A (en) 2022-06-10

Family

ID=81855905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011449706.8A Pending CN114611681A (en) 2020-12-09 2020-12-09 Heterogeneous system and method for neural network reasoning

Country Status (1)

Country Link
CN (1) CN114611681A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination