CN111176731B

CN111176731B - Method, system, equipment and medium for improving chip computing performance

Info

Publication number: CN111176731B
Application number: CN201911385640.8A
Authority: CN
Inventors: 李拓
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2019-12-29
Filing date: 2019-12-29
Publication date: 2023-01-06
Anticipated expiration: 2039-12-29
Also published as: CN111176731A

Abstract

The invention discloses a method, a system, equipment and a storage medium for improving the computing performance of a chip, wherein the method comprises the following steps: the general processor core decomposes the calculation task into a plurality of parallel subtasks and distributes the plurality of subtasks to the parallel control array; the parallel control array preprocesses a plurality of subtasks and distributes the preprocessed subtasks to the calculation acceleration unit array corresponding to the parallel control array; judging whether each computing unit in the computing acceleration unit array can process the distributed subtasks or not; and in response to a compute unit in the compute acceleration array being unable to process the assigned subtask, assigning the subtask to another compute unit. The method, the system, the equipment and the medium for improving the computing performance of the chip carry out data preprocessing by adding the parallel control array, and the computing acceleration unit array only carries out special computation, thereby improving the computing performance.

Description

Method, system, equipment and medium for improving chip computing performance

Technical Field

The present invention relates to the field of computing, and more particularly, to a method, a system, a computer device, and a readable medium for improving computing performance of a chip.

Background

The most popular coprocessor for accelerating AI (Artificial Intelligence) computation is the GPU (Graphics Processing Unit). By integrating thousands of parallel computing cores, a GPU can provide the strongest AI computing performance, but it has drawbacks in price and power consumption. Fully customized AI chips, whose architectures are optimized for specific scenarios, are therefore used in AI fields that are sensitive to cost and power consumption, while semi-customized AI chips based on FPGAs (Field Programmable Gate Arrays) are mainly used in experimental fields and in fields with less stringent performance requirements.

In AI workloads, the massively parallel computations of deep learning algorithms require accelerators. These computations are individually simple but highly parallel, and the data bit widths used for computation and for storing intermediate results are often customized and small (e.g., 8 bits or 16 bits) for efficiency. Because of these characteristics, a general-purpose processor such as a CPU, whose individual cores are relatively complex, is inefficient at AI computation. Solutions that integrate one or a few processor cores in an AI acceleration chip must weigh the high IP cost and the chip resources and area occupied by the processor cores themselves. Such solutions also struggle to be general: it is difficult for them to support algorithms of different scales in different application scenarios with high efficiency.

Disclosure of Invention

In view of this, an object of the embodiments of the present invention is to provide a method, a system, a computer device, and a computer readable storage medium for improving the computing performance of a chip, in which a parallel control array is added to perform data preprocessing, and a computing acceleration unit array only performs special computation, so as to improve the computing performance.

Based on the above object, an aspect of the embodiments of the present invention provides a method for improving computing performance of a chip, including the following steps: the general processor core decomposes a computing task into a plurality of parallel subtasks and distributes the subtasks to a parallel control array; the parallel control array preprocesses a plurality of subtasks and distributes the preprocessed subtasks to a calculation acceleration unit array corresponding to the parallel control array; determining whether each compute unit in the compute acceleration unit array is capable of processing the assigned subtask; and in response to a compute unit in the compute acceleration unit array being unable to process the assigned subtask, assign the subtask to another compute unit.

In some embodiments, further comprising: sending a processing result to a memory or the general purpose processor core in response to a compute unit in the compute acceleration unit array being able to process the assigned subtask.

In some embodiments, sending the processing result to the memory or the general purpose processor core in response to the compute unit in the compute acceleration unit array being able to process the assigned subtask comprises: judging whether the amount of intermediate data generated in the subtask processing process exceeds a threshold value; and in response to the amount of intermediate data generated during the subtask processing not exceeding a threshold, writing the intermediate data to a cache of the parallel control array.

In some embodiments, further comprising: in response to the quantity of the intermediate data generated in the subtask processing process exceeding a threshold value, writing the intermediate data exceeding the threshold value into an additional cache mounted on the computing acceleration unit array.

In some embodiments, sending the processing result to the memory or the general purpose processor core in response to the compute unit in the compute acceleration unit array being able to process the assigned subtask comprises: judging whether the data in the calculation process can be transmitted through the network of the calculation acceleration unit array; and responding to the situation that data in the computing process cannot be transmitted through the network of the computing acceleration unit array, and transmitting the data through the processor cores in the parallel control array.

In another aspect of the embodiments of the present invention, a system for improving computing performance of a chip is further provided, including: a general processor core module configured to decompose a computing task into a plurality of parallel subtasks and distribute the plurality of subtasks to a parallel control array module; the parallel control array module, configured to preprocess the plurality of subtasks and distribute the preprocessed subtasks to a compute acceleration unit array module corresponding to the parallel control array module; a judging module configured to judge whether each compute unit in the compute acceleration unit array module is capable of processing the assigned subtask; and the compute acceleration unit array module, configured to assign the subtask to another compute unit in response to a compute unit in the compute acceleration unit array module being unable to process the assigned subtask.

In some embodiments, the compute accelerator array module is further configured to send a processing result to a memory or the general purpose processor core in response to a compute unit in the compute accelerator array being able to process the assigned subtask.

In some embodiments, the compute acceleration unit array module is further configured to determine whether an amount of intermediate data generated during the sub-task processing exceeds a threshold and, in response to the amount of intermediate data generated during the sub-task processing not exceeding the threshold, write the intermediate data to the cache of the parallel control array.

In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method as above.

In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program that implements the above method steps when executed by a processor.

The invention has the following beneficial technical effects: the parallel control array is added for data preprocessing, the calculation accelerating unit array only carries out special calculation, the calculation performance is improved, in addition, the chip architecture fully considers the universality of the chip, and the architecture with flexible expansion of multiple cores can support the design and expansion of the chip under various different scene requirements.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.

FIG. 1 is a schematic diagram of an embodiment of a method for improving computing performance of a chip according to the present invention;

FIG. 2 is a diagram of a chip architecture according to an embodiment of the method for improving the computing performance of a chip provided by the present invention;

FIG. 3 is a schematic diagram of a hardware structure of an embodiment of the method for improving the computing performance of a chip according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two non-identical entities or two non-identical parameters that share the same name. "First" and "second" are used merely for convenience of description, should not be construed as limiting the embodiments of the present invention, and are not explained again in the following embodiments.

In view of the above, a first aspect of the embodiments of the present invention provides an embodiment of a method for improving computing performance of a chip. Fig. 1 is a schematic diagram illustrating an embodiment of a method for improving computing performance of a chip according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:

S1, decomposing a computing task into a plurality of parallel subtasks by a general processor core, and distributing the plurality of subtasks to a parallel control array;

S2, preprocessing the plurality of subtasks by the parallel control array, and distributing the preprocessed subtasks to a compute acceleration unit array corresponding to the parallel control array;

S3, judging whether each compute unit in the compute acceleration unit array can process the assigned subtask; and

S4, in response to a compute unit in the compute acceleration unit array being unable to process the assigned subtask, assigning the subtask to another compute unit.
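The four steps can be illustrated with a minimal software sketch. Everything named here is hypothetical and not taken from the patent: the `Subtask` and `ComputeUnit` classes, the string-valued `mode` used to model the fixed computation mode of a unit, the integer cast standing in for preprocessing, and the summation standing in for the special computation. The patent describes hardware modules; this only mirrors the control flow.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    mode: str   # kind of computation required (assumed representation)
    data: list


@dataclass
class ComputeUnit:
    uid: int
    mode: str   # the single fixed mode chosen at design time (assumed)

    def can_process(self, sub):
        # S3: a unit can only handle subtasks that match its preset mode.
        return sub.mode == self.mode

    def run(self, sub):
        # Placeholder for the specialised computation.
        return sum(sub.data)


def decompose(mode, data, parts):
    """S1: split one task into at most `parts` parallel subtasks."""
    size = -(-len(data) // parts)  # ceiling division
    return [Subtask(mode, data[i:i + size]) for i in range(0, len(data), size)]


def preprocess(sub):
    """S2: a control core prepares the data (here: cast to integers)."""
    return Subtask(sub.mode, [int(x) for x in sub.data])


def dispatch(subtasks, units):
    """S3/S4: try the paired unit first, otherwise reassign to a capable one."""
    results = []
    for i, sub in enumerate(subtasks):
        target = units[i % len(units)]
        if not target.can_process(sub):
            target = next((u for u in units if u.can_process(sub)), None)
            if target is None:
                raise RuntimeError(f"no unit supports mode {sub.mode!r}")
        results.append((target.uid, target.run(sub)))
    return results


units = [ComputeUnit(0, "conv"), ComputeUnit(1, "sum"), ComputeUnit(2, "sum")]
subs = [preprocess(s) for s in decompose("sum", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)]
results = dispatch(subs, units)
```

In this run the first subtask is paired with unit 0, which only supports "conv", so it is reassigned to unit 1; the other two subtasks stay on their paired units.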

RISC-V is a processor Instruction Set Architecture (ISA) with full open-source and architecturally simple features. On the basis, processor control cores of various sizes and complexity can be customized without paying extra IP cost, and the cores using the architecture do not need to consider the instruction requirements under other compatible applications, so that the chip area and the power consumption can be maximally reduced.

Fig. 2 is a diagram illustrating a chip architecture of an embodiment of the method for improving the computing performance of a chip according to the present invention. As shown in fig. 2, the chip architecture mainly includes a general processor core module, a parallel control array module, and a compute acceleration unit array module. Wherein the general purpose processor core module comprises a full function RISC-V processor core capable of running general purpose software, including an operating system, a network stack and non-critical control/configuration software. The parallel control array module uses hundreds or even thousands of lightweight RISC-V processor cores to build a 2D Mesh network, so that fine-grained data and thread-level parallelism are fully utilized, and balance between flexibility and efficiency is realized. The computing acceleration unit array module comprises customized computing units for efficient special computing, but has poor flexibility and needs to work in combination with a general processor core module or a parallel control array module.

The whole chip architecture comprises three interconnection buses. The AXI bus mounts the general processor core module and the interface for data transmission with the external host. The RISC-V processor bus interconnects the general processor core module and the parallel processor array module and mounts the memory interfaces that both layers can use. Note that not all processor cores of the two layers need to be mounted on this bus, since doing so can easily reduce efficiency through data congestion; processor cores that are not mounted on the bus can access it indirectly through the connections within each layer. The arrays in the parallel processor array module correspond one-to-one to the arrays in the compute acceleration unit array module, and each processor core is directly connected to its corresponding compute acceleration unit. The compute acceleration unit bus connects the general processor core module and the compute acceleration unit array module and can mount DMA (direct memory access) and additional cache resources.

The general processor core module adopts cores of the RISC-V instruction set architecture and, because RISC-V cores are small in area, low in power consumption, and open source, can adopt a freely scalable multi-core scheme. In the chip design stage, the required number of cores can be determined according to the requirements of the application scenario, and the chip and software frameworks do not need to be modified. In contrast, with non-open-source processor cores that lack a mature multi-core solution, changing the number of cores often forces architectural changes; and even with processor cores that have a mature multi-core solution (such as ARM), cost and area limit the choice of the number of cores.

The parallel processor array module is essentially a 2D Mesh many-core network composed of RISC-V cores, and it processes the tasks that the compiler at the general processor layer has divided to a fine granularity. To achieve this function efficiently, the array has the following characteristics. First, each processor is functionally as simple as possible to reduce area and power consumption: only the most basic functions such as data transmission and caching are retained, data preprocessing functions can be added on this basis according to the requirements of the actual design, and the special computation is carried out in the compute units. Second, the cache in each processor core is opened to the compute units, and the cache across the whole parallel array is shared, so that the cache resources of the whole chip are utilized to the maximum extent. Third, the array is interconnected by a 2D Mesh network that is independent of the functions of the processor cores, so that a mix of customized processor cores can be selected for the array as required, which ensures maximum functional flexibility.
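As a rough illustration of what a 2D Mesh interconnect implies, the following sketch computes the direct neighbors of a core and a hop-by-hop path between two cores. The patent does not specify a routing algorithm; dimension-ordered (X-then-Y) routing is assumed here only because it is a common choice for meshes, and the function names and coordinate scheme are hypothetical.

```python
def mesh_neighbors(x, y, width, height):
    """Return the coordinates directly linked to core (x, y) in a 2D mesh."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < width and 0 <= cy < height]


def xy_route(src, dst):
    """Assumed dimension-ordered route: move along x first, then along y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


corner = mesh_neighbors(0, 0, 4, 4)
route = xy_route((0, 0), (2, 1))
```

Because the links depend only on coordinates and not on what each core does, cores with different customized functions can occupy any position in the grid, which matches the third characteristic described above.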

The compute acceleration unit array module comprises customized compute modules. Each unit can only perform computation in the fixed mode determined when it was designed, and the storage of results and most data transmission are controlled by the general processor core or the parallel processor array, which keeps the array units as small as possible. Like the parallel processor array module, the compute acceleration unit array is also a 2D Mesh network, so that cooperative intermediate data can be transmitted when needed. The compute acceleration unit array and the parallel processor array correspond one-to-one and are connected pairwise, so in an ideal situation data could be transmitted through the processors without using the network of the compute acceleration units. In AI computation, however, and particularly in deep learning networks, a large number of iterative computations are performed in which the output of one compute unit is directly the input of the next, and transmitting such data without the participation of the processors is more efficient. In most application scenarios, the compute acceleration units do not need dedicated cache resources, because a small amount of intermediate data can be cached in the corresponding parallel processor array. In some special situations, for example when the amount of data to be cached is too large, separate cache resources and a DMA for accessing them can be mounted on the compute unit bus.

An AI computing task is decomposed into fine-grained parallel tasks by the general processor core layer and distributed to the parallel processor array through the RISC-V processor bus. The processor cores in the array perform the necessary preprocessing on the tasks and data and then send them to the corresponding compute acceleration units. For the data transmission required during computation, the relatively fixed part of the algorithm is transmitted directly through the network of the compute acceleration units, and the rest is transmitted through the corresponding processor cores in the parallel processor array. Finally, the result is sent to memory through the RISC-V processor bus or sent directly to the general processor core.
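The efficiency argument for chained computations can be made concrete with a simple link-count model. The model is an assumption, not something stated in the patent: it supposes that a handoff through the processors crosses three links (unit to its paired core, core to the neighboring core, neighboring core to its unit), while a handoff over the compute acceleration unit network crosses one.

```python
def link_traversals_via_processors(n_units):
    """Links crossed when each handoff detours through the paired cores."""
    # Assumed path per handoff: unit -> core, core -> next core, next core -> unit.
    return 3 * max(n_units - 1, 0)


def link_traversals_direct(n_units):
    """Links crossed when each unit forwards its output to the next unit."""
    return max(n_units - 1, 0)


via_cores = link_traversals_via_processors(4)
direct = link_traversals_direct(4)
```

Under this model a chain of four units needs nine link traversals through the processors but only three over the direct network, and the direct path also leaves the control cores free for preprocessing.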

The general purpose processor core breaks up the computational task into multiple parallel sub-tasks and distributes the multiple sub-tasks to the parallel control arrays. The general processor core divides the calculation task into fine-grained tasks, for convenience of description, the divided tasks are called subtasks, and each subtask is distributed to a corresponding parallel control array.

The parallel control array preprocesses a plurality of subtasks, and distributes the preprocessed subtasks to the calculation accelerating unit array corresponding to the parallel control array.

It is judged whether each compute unit in the compute acceleration unit array can process the assigned subtask, that is, whether the computation mode preset for the compute unit is able to process the assigned subtask. In response to a compute unit in the compute acceleration unit array being unable to process the assigned subtask, the subtask is assigned to another compute unit. In some embodiments, the method further comprises: sending a processing result to a memory or the general purpose processor core in response to a compute unit in the compute acceleration unit array being able to process the assigned subtask.

In some embodiments, sending the processing result to the memory or the general purpose processor core in response to the compute unit in the compute acceleration unit array being able to process the assigned subtask comprises: judging whether the amount of intermediate data generated in the subtask processing process exceeds a threshold value; and in response to the quantity of intermediate data generated in the sub-task processing process not exceeding a threshold value, writing the intermediate data into a cache of the parallel control array.
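The threshold check can be sketched as follows. This is a minimal illustration under stated assumptions: the amount of intermediate data is measured as an item count, the threshold is a plain integer, and the dictionary keys are hypothetical labels for the cache of the parallel control array and the additional cache mounted on the compute acceleration unit array. The overflow branch follows the embodiment in which the data exceeding the threshold is written to the additional cache.

```python
def place_intermediate(data, threshold):
    """Decide where intermediate data is written, based on a size threshold."""
    return {
        # Data up to the threshold stays in the control-array cache.
        "control_array_cache": list(data[:threshold]),
        # Only the part exceeding the threshold goes to the additional cache.
        "additional_cache": list(data[threshold:]),
    }


small = place_intermediate([1, 2, 3], threshold=4)
large = place_intermediate([1, 2, 3, 4, 5], threshold=3)
```

When the data fits, the additional cache is left empty, which corresponds to the common case in which the compute acceleration units need no cache resources of their own.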

In some embodiments, sending the processing result to the memory or the general purpose processor core in response to the compute unit in the compute acceleration unit array being able to process the assigned subtask comprises: judging whether the data in the calculation process can be transmitted through the network of the calculation accelerating unit array; and responding to the fact that data in the computing process cannot be transmitted through the network of the computing acceleration unit array, and transmitting the data through the processor cores in the parallel control array.
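The choice of transmission path can be sketched as a lookup. The assumption here is that the links available in the compute acceleration unit network are known in advance as a set of (source, destination) pairs; the set `ACCEL_LINKS`, the unit identifiers, and the returned labels are all hypothetical.

```python
# Hypothetical set of transfers the accelerator network can carry directly.
ACCEL_LINKS = {(0, 1), (1, 2)}


def choose_path(src, dst, accel_links=ACCEL_LINKS):
    """Pick the accelerator network when it can carry the transfer."""
    if (src, dst) in accel_links:
        return "accelerator_network"
    # Otherwise fall back to the processor cores of the parallel control array.
    return "control_array_cores"


fixed = choose_path(0, 1)
fallback = choose_path(0, 2)
```

This keeps the fixed, repetitive transfers of an algorithm on the direct network and routes only the irregular ones through the processor cores.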

It should be particularly noted that, the steps in the embodiments of the method for improving the computing performance of the chip described above may be mutually intersected, replaced, added, or deleted, and therefore, these methods for improving the computing performance of the chip by reasonable permutation and combination conversion also belong to the scope of the present invention, and should not limit the scope of the present invention to the embodiments.

In view of the above object, a second aspect of the embodiments of the present invention provides a system for improving computing performance of a chip, comprising: a general processor core module configured to decompose a computing task into a plurality of parallel subtasks and distribute the plurality of subtasks to a parallel control array module; the parallel control array module, configured to preprocess the plurality of subtasks and distribute the preprocessed subtasks to a compute acceleration unit array module corresponding to the parallel control array module; a judging module configured to judge whether each compute unit in the compute acceleration unit array module is capable of processing the assigned subtask; and the compute acceleration unit array module, configured to assign the subtask to another compute unit in response to a compute unit in the compute acceleration unit array module being unable to process the assigned subtask.

In some embodiments, the compute accelerator array module is further configured to determine whether data in a compute process can be transmitted over a network of the compute accelerator array; and responding to the situation that data in the computing process cannot be transmitted through the network of the computing acceleration unit array, and transmitting the data through the processor cores in the parallel control array.

In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executable by the processor to perform the steps of: s1, decomposing a computing task into a plurality of parallel subtasks by a general processor core, and distributing the plurality of subtasks to a parallel control array; s2, preprocessing a plurality of subtasks by the parallel control array, and distributing the preprocessed subtasks to a calculation acceleration unit array corresponding to the parallel control array; s3, judging whether each computing unit in the computing acceleration unit array can process the distributed subtasks or not; and S4, responding to the computing unit in the computing acceleration unit array being incapable of processing the distributed subtasks, distributing the subtasks to other computing units.

In some embodiments, sending the processing result to the memory or the general purpose processor core in response to the compute unit in the compute acceleration unit array being able to process the assigned subtask comprises: judging whether the quantity of intermediate data generated in the subtask processing process exceeds a threshold value; and in response to the amount of intermediate data generated during the subtask processing not exceeding a threshold, writing the intermediate data to a cache of the parallel control array.

In some embodiments, sending the processing result to the memory or the general purpose processor core in response to the compute unit in the compute acceleration unit array being able to process the assigned subtask comprises: judging whether the data in the calculation process can be transmitted through the network of the calculation acceleration unit array; and responding to the fact that data in the computing process cannot be transmitted through the network of the computing acceleration unit array, and transmitting the data through the processor cores in the parallel control array.

Fig. 3 is a schematic diagram of a hardware structure of an embodiment of the method for improving computing performance of a chip according to the present invention.

Taking the apparatus shown in fig. 3 as an example, the apparatus includes a processor 301 and a memory 302, and may further include: an input device 303 and an output device 304.

The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or other means, and fig. 3 illustrates the connection by a bus as an example.

The memory 302 is a non-volatile computer-readable storage medium, and can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the method for improving the computing performance of the chip in the embodiment of the present application. The processor 301 executes various functional applications of the server and data processing by running the nonvolatile software programs, instructions and modules stored in the memory 302, that is, implements the method for improving chip computing performance of the above method embodiment.

The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of a method of improving the computational performance of the chip, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 302 optionally includes memory located remotely from processor 301, which may be connected to a local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 303 may receive information such as a user name and a password that are input. The output means 304 may comprise a display device such as a display screen.

Program instructions/modules corresponding to one or more methods for improving the computing performance of the chip are stored in the memory 302, and when being executed by the processor 301, the method for improving the computing performance of the chip in any of the above-mentioned method embodiments is executed.

Any embodiment of the computer device executing the method for improving the chip computing performance can achieve the same or similar effects as any corresponding method embodiment.

The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the method as above.

Finally, it should be noted that, as those skilled in the art will understand, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program instructing related hardware. The program of the method for improving the computing performance of a chip can be stored in a computer-readable storage medium, and when executed, the program can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.

Furthermore, the methods disclosed according to embodiments of the invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. When executed by a processor, the computer program performs the above-described functions defined in the methods disclosed in embodiments of the invention.

Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.

Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as Synchronous RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.

The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.

The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, where the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of an embodiment of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A method for improving the computing performance of a chip, characterized by comprising the following steps:

a general purpose processor core decomposes a computing task into a plurality of parallel subtasks and distributes the plurality of subtasks to a parallel control array;

the parallel control array preprocesses the plurality of subtasks and distributes the preprocessed subtasks to a compute acceleration unit array corresponding to the parallel control array;

determining whether each compute unit in the compute acceleration unit array is capable of processing the assigned subtask; and

in response to a compute unit in the compute acceleration unit array being unable to process the assigned subtask, assigning the subtask to another compute unit;

the method further comprises the following steps:

in response to a compute unit in the compute acceleration unit array being able to process the assigned subtask, sending a processing result to a memory or the general purpose processor core;

wherein sending a processing result to a memory or the general purpose processor core in response to a compute unit in the compute acceleration unit array being able to process the assigned subtask comprises:

judging whether the amount of intermediate data generated during the subtask processing exceeds a threshold; and

in response to the amount of intermediate data generated during the subtask processing not exceeding the threshold, writing the intermediate data into a cache of the parallel control array.
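By way of illustration only, and without limiting the claim, the decompose, preprocess, dispatch and reassign steps recited above can be sketched in software. The following is a minimal sketch under stated assumptions: the names `ComputeUnit`, `decompose`, `preprocess` and `dispatch`, the round-robin initial assignment, the "supported operation" test used to decide whether a unit can process a subtask, and the filtering done in `preprocess` are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ComputeUnit:
    """Hypothetical stand-in for one unit of the compute acceleration unit array."""
    name: str
    supported_ops: Set[str]

    def can_process(self, subtask: dict) -> bool:
        # Assumption: a unit can process a subtask if it supports the operation.
        return subtask["op"] in self.supported_ops

    def process(self, subtask: dict) -> tuple:
        # Placeholder for the specialised computation; here it only sums the data.
        return (self.name, subtask["op"], sum(subtask["data"]))


def decompose(task: dict, parts: int) -> List[dict]:
    """General purpose processor core: split one task into parallel subtasks."""
    data = task["data"]
    step = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [{"op": task["op"], "data": data[i:i + step]}
            for i in range(0, len(data), step)]


def preprocess(subtask: dict) -> dict:
    """Parallel control array: illustrative preprocessing (drop missing values)."""
    return {"op": subtask["op"],
            "data": [x for x in subtask["data"] if x is not None]}


def dispatch(subtasks: List[dict], units: List[ComputeUnit]) -> List[tuple]:
    """Assign each preprocessed subtask; reassign when the unit cannot process it."""
    results = []
    for i, raw in enumerate(subtasks):
        sub = preprocess(raw)
        unit: Optional[ComputeUnit] = units[i % len(units)]  # assumed round-robin
        if not unit.can_process(sub):
            # Reassignment step: look for another unit able to process the subtask.
            unit = next((u for u in units if u.can_process(sub)), None)
        if unit is None:
            raise RuntimeError("no compute unit can process subtask %d" % i)
        results.append(unit.process(sub))
    return results
```

In this sketch, a task whose operation is unsupported by the first unit is silently moved to the next capable unit, which corresponds to the "assigning the subtask to another compute unit" step; a hardware implementation would make this decision in the control logic rather than in a Python loop.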

2. The method of claim 1, further comprising:

in response to the amount of intermediate data generated during the subtask processing exceeding the threshold, writing the intermediate data exceeding the threshold into an additional cache mounted on the compute acceleration unit array.
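By way of illustration only, the threshold test of claims 1 and 2 can be sketched as a placement function. This is a sketch under assumptions: the name `place_intermediate` is hypothetical, the amount of data is measured as a number of items, and "the intermediate data exceeding the threshold" is read here as the portion beyond the threshold, which is one possible reading and not a statement of how the claim must be construed.

```python
from typing import List, Sequence, Tuple


def place_intermediate(data: Sequence, threshold: int) -> Tuple[List, List]:
    """Split intermediate data between the two caches.

    Returns (control_array_cache, additional_cache), where the first list
    stands for the cache of the parallel control array and the second for
    the additional cache mounted on the compute acceleration unit array.
    """
    if len(data) <= threshold:
        # Not exceeding the threshold: everything goes to the control array cache.
        return list(data), []
    # Exceeding the threshold: the excess portion goes to the additional cache
    # (assumed interpretation, see lead-in).
    return list(data[:threshold]), list(data[threshold:])
```

For example, with a threshold of 3 items, five items of intermediate data would be split three to the control array cache and two to the additional cache.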

3. The method of claim 1, wherein sending a processing result to a memory or the general purpose processor core in response to a compute unit in the compute acceleration unit array being able to process the assigned subtask comprises:

judging whether data in the computing process can be transmitted through the network of the compute acceleration unit array; and

in response to the data in the computing process being unable to be transmitted through the network of the compute acceleration unit array, transmitting the data through the processor cores in the parallel control array.
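By way of illustration only, the routing decision of claim 3 can be sketched as follows. The representation of the array network as a set of bidirectional links between named compute units, and the names `choose_path`, `ARRAY_NETWORK` and `CONTROL_CORE`, are assumptions made for this sketch and do not appear in the disclosure.

```python
from typing import FrozenSet, Set

ARRAY_NETWORK = "array_network"  # network of the compute acceleration unit array
CONTROL_CORE = "control_core"    # processor cores in the parallel control array


def choose_path(src: str, dst: str, links: Set[FrozenSet[str]]) -> str:
    """Pick the path that carries data from compute unit `src` to `dst`.

    `links` holds the (assumed) direct, bidirectional connections available
    in the array network, each stored as a frozenset of two unit names.
    """
    if frozenset((src, dst)) in links:
        # The array network can carry the data directly.
        return ARRAY_NETWORK
    # Otherwise fall back to the processor cores of the parallel control array.
    return CONTROL_CORE
```

The fallback keeps the compute units free of general routing logic, which is consistent with the stated aim that the compute acceleration unit array performs only specialised computation.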

4. A system for improving computational performance of a chip, comprising:

a general purpose processor core module configured to decompose a computing task into a plurality of parallel subtasks and distribute the plurality of subtasks to a parallel control array module;

the parallel control array module, configured to preprocess the plurality of subtasks and distribute the preprocessed subtasks to a compute acceleration unit array module corresponding to the parallel control array module;

a judging module configured to judge whether each compute unit in the compute acceleration unit array module is capable of processing the assigned subtask; and

the compute acceleration unit array module, configured to assign the subtask to another compute unit in response to a compute unit in the compute acceleration unit array module being unable to process the assigned subtask;

the compute acceleration unit array module is further configured to send a processing result to a memory or the general purpose processor core module in response to a compute unit in the compute acceleration unit array module being able to process the assigned subtask;

the compute acceleration unit array module is further configured to determine whether the amount of intermediate data generated during the subtask processing exceeds a threshold, and to write the intermediate data into a cache of the parallel control array module in response to the amount of intermediate data generated during the subtask processing not exceeding the threshold.

5. A computer device, comprising:

at least one processor; and

a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, implementing the steps of the method of any one of claims 1 to 3.

6. A computer-readable storage medium storing a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.