CN111639045B - Data processing method, device, medium and equipment - Google Patents

Data processing method, device, medium and equipment

Info

Publication number
CN111639045B
Authority
CN
China
Prior art keywords
data
instruction
processing
execution unit
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010495340.1A
Other languages
Chinese (zh)
Other versions
CN111639045A (en)
Inventor
曹文慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Horizon Shanghai Artificial Intelligence Technology Co Ltd
Original Assignee
Horizon Shanghai Artificial Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Horizon Shanghai Artificial Intelligence Technology Co Ltd filed Critical Horizon Shanghai Artificial Intelligence Technology Co Ltd
Priority to CN202010495340.1A priority Critical patent/CN111639045B/en
Publication of CN111639045A publication Critical patent/CN111639045A/en
Application granted granted Critical
Publication of CN111639045B publication Critical patent/CN111639045B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/36Handling requests for interconnection or transfer for access to common bus or bus system
    • G06F13/362Handling requests for interconnection or transfer for access to common bus or bus system with centralised access control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

Disclosed are a data processing method, apparatus, medium, and device, wherein the method comprises: processing, by a computing processing unit located on a bus, data currently transmitted on the bus to obtain a data processing result; determining a destination address corresponding to the data processing result; and transmitting the data processing result to the destination address through the bus. The technical scheme provided by the present disclosure helps improve data processing efficiency and device robustness.

Description

Data processing method, device, medium and equipment
Technical Field
The present disclosure relates to data processing technology, and in particular, to a data processing method, a data processing apparatus, a storage medium, and an electronic device.
Background
Currently, existing data processing methods generally proceed as follows: data is read from a storage device, the read data is processed by a data processing unit (e.g., a CPU or an image processor), and the processing result is written back to the storage device. Even when the processing itself is simple, all three steps of reading, processing, and writing back must still be performed.
For such simple data processing, how to streamline the processing flow and improve data processing efficiency is a technical problem worth attention.
Disclosure of Invention
The present disclosure has been made in order to solve the above technical problems. The embodiment of the disclosure provides a data processing method, a data processing device, a storage medium and electronic equipment.
According to an aspect of the embodiments of the present disclosure, there is provided a data processing method, including: processing the data transmitted in the current bus by a calculation processing unit on the bus to obtain a data processing result; determining a destination address corresponding to the data processing result; and transmitting the data processing result to the destination address through the bus.
According to still another aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including: the computing processing unit is positioned on the bus and is used for processing the data transmitted in the current bus to obtain a data processing result; the target address determining module is used for determining a target address corresponding to the data processing result obtained by the computing processing unit; and the transmission module is used for transmitting the data processing result obtained by the calculation processing unit to the destination address determined by the destination address determining module through the bus.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for implementing the above method.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method described above.
According to the data processing method and apparatus provided by the embodiments of the present disclosure, a computing processing unit is arranged on the bus and made to process the data currently transmitted on the bus, so that some simple computing processes can be completed during data transmission. This avoids certain round trips of reading data from the storage device, having a data processing unit perform the corresponding computation on the read data, and writing the result back to the storage device. Furthermore, for each data processing unit (such as an image processor or a neural network accelerator) connected to the bus, some simple and common computations can be extracted from each data processing unit and completed by the computing processing unit arranged on the bus, which reduces the data processing pressure on each data processing unit and improves the utilization of the computing processing unit. The technical scheme provided by the present disclosure therefore helps improve data processing efficiency and device robustness.
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing embodiments thereof in more detail with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, not to limit the disclosure. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a schematic illustration of a scenario in which the present disclosure is applicable;
FIG. 2 is a flow chart of one embodiment of a data processing method of the present disclosure;
FIG. 3 is a schematic diagram of one embodiment of a first execution unit of the present disclosure;
FIG. 4 is a schematic diagram of one embodiment of a third execution unit of the present disclosure;
FIG. 5 is a schematic diagram of a further embodiment of a third execution unit of the present disclosure;
FIG. 6 is a schematic diagram of one embodiment of a second execution unit of the present disclosure;
FIG. 7 is a schematic diagram of one embodiment of a fourth execution unit of the present disclosure;
FIG. 8 is a schematic diagram of a further embodiment of a fourth execution unit of the present disclosure;
FIG. 9 is a schematic diagram of one embodiment of a fourth execution unit and a fifth execution unit of the present disclosure;
FIG. 10 is a schematic diagram of one embodiment of a fifth execution unit of the present disclosure;
FIG. 11 is a schematic diagram of one embodiment of a downsampling process of the present disclosure;
FIG. 12 is a schematic diagram illustrating the structure of one embodiment of a data processing apparatus of the present disclosure;
fig. 13 is a block diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in this disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate: A alone, both A and B, or B alone. In addition, the character "/" in this disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the present disclosure are applicable to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, or server, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Summary of the disclosure
In the process of implementing the present disclosure, the inventors found that image processing often involves many pixel-wise computing operations. These operations are generally computationally simple, involve only a small amount of auxiliary data (such as data in a weight matrix), and are highly regular.
For image processing, if the device still adopts the existing data processing mode to perform the calculation processing operation, the process of reading data from the storage device of the device, sending the data to a calculation unit such as a central processing unit or a coprocessor, and writing the calculation result back to the storage device is needed, which is not beneficial to improving the data processing efficiency.
Buses in devices typically transmit data in parallel. If simple computing operations such as addition, multiplication, and comparison are performed on the multiple data items transmitted in parallel while the bus is transferring them, data processing can be completed with essentially no additional time, memory, or bus-bandwidth overhead. That is, the present disclosure can realize data processing without reading data from a storage device, having a data processing unit in the apparatus perform the computation on the read data, and writing the computation result back to the corresponding storage device, thereby helping improve data processing efficiency.
Exemplary overview
The data processing technology of the present disclosure can be applied to devices such as computers, smartphones, tablet computers, and vehicle-mounted systems. An example is shown in fig. 1.
In fig. 1, an apparatus 100 that can be used for image processing includes a bus 101 and N coprocessors. The N coprocessors are each connected to bus 101, where N is a positive integer.
Bus 101 is provided with a computing processing unit 1011, which may be referred to as a vector processing unit. Since the bit width of bus 101 is generally larger than the bit width of the individual data items transferred on it, multiple data items are generally transferred in parallel on bus 101.
The N coprocessors in FIG. 1 are coprocessors 102-1, 102-2, ..., and 102-N, respectively. At least one of the N coprocessors may be a neural network accelerator (Neural Network Accelerator), and at least one may be an image processor (Image Processor). For example, coprocessor 102-1 in FIG. 1 is a neural network accelerator, while coprocessor 102-2 is an image processor.
The image processor in the device 100 may cooperate with at least one neural network accelerator and a computing processing unit 1011 on the bus 101 to perform a series of computing operations to efficiently implement neural network-based image processing. For example, in the image processing, after the data output by the co-processor 102-2 is uploaded to the bus 101, the calculation processing unit 1011 on the bus 101 may directly perform corresponding processing, such as nonlinear activation processing, tone mapping processing, image brightness adjustment processing, normalization processing, image downsampling processing, or pooling processing, for the data transmitted in the bus 101; the data processed by the computing processing unit 1011 is continuously transmitted to a corresponding storage area through the bus 101 and stored; when other coprocessors (e.g., coprocessors 102-1, etc.) need to perform corresponding processing on the data stored in the memory area, the corresponding data may be read from the corresponding memory area via bus 101.
Exemplary method
FIG. 2 is a flow chart of one embodiment of a data processing method of the present disclosure. The method as shown in fig. 2 may include: s200, S201, and S202. The steps are described below with reference to fig. 1.
S200, processing the data transmitted in the current bus through a computing processing unit on the bus to obtain a data processing result.
Bus 101 in this disclosure may be a parallel data bus. The data bit width of bus 101 (i.e., the bus bandwidth) may be M bits, where M is a positive integer. Assuming the bit width of a single data item transmitted on bus 101 is n bits, n is typically smaller than the bus data bit width M. M may be an integer multiple of n; that is, the number of data items transferred by bus 101 in a single transfer may be P, where P = M/n.
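The lane arithmetic above can be sketched in a few lines. This is an illustrative sketch, not from the patent; the bus and data widths used below are assumed example values.

```python
# How many n-bit data items fit in one M-bit bus transfer (P = M / n).
# Assumes, as the text does, that M is an integer multiple of n.
def lanes_per_transfer(bus_width_bits: int, data_width_bits: int) -> int:
    assert bus_width_bits % data_width_bits == 0, "M must be a multiple of n"
    return bus_width_bits // data_width_bits

# e.g., a 32-bit bus carrying 8-bit data moves P = 4 items in parallel
print(lanes_per_transfer(32, 8))  # -> 4
```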
The computing unit 1011 in the present disclosure may be configured to recognize a data processing instruction received by the computing unit and perform corresponding computing processing on data to be processed transmitted in the current bus 101 according to the data processing instruction. In one example, the computing processing unit 1011 in the present disclosure may receive data processing instructions via the bus 101. In another example, the computing processing unit 1011 of the present disclosure may also receive data processing instructions via a newly added control line.
The computing processing unit 1011 of the present disclosure may treat the P data items transmitted in parallel on bus 101 as a vector, with the P data items regarded as the P elements of that vector, and process all P items simultaneously. For this reason, the computing processing unit 1011 of the present disclosure may also be referred to as a vector processing unit.
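The "bus word as a vector" view can be made concrete with a small sketch. This is a hypothetical software model (the `pack`/`unpack` helpers and the lane layout are invented for illustration), not the patent's hardware implementation.

```python
def unpack(word: int, n: int, p: int) -> list[int]:
    """Split an M-bit bus word into P n-bit lane values (lane 0 = LSBs)."""
    mask = (1 << n) - 1
    return [(word >> (i * n)) & mask for i in range(p)]

def pack(lanes: list[int], n: int) -> int:
    """Reassemble P n-bit lane values into one bus word."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & ((1 << n) - 1)) << (i * n)
    return word

# e.g., one 32-bit transfer viewed as four 8-bit lanes; add a constant
# to every lane "simultaneously", as the vector unit would
word = pack([1, 2, 3, 4], 8)
result = pack([(v + 10) & 0xFF for v in unpack(word, 8, 4)], 8)
print(unpack(result, 8, 4))  # -> [11, 12, 13, 14]
```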
S201, determining a destination address corresponding to the data processing result.
The destination address in the present disclosure may be a storage address for data processing results. The present disclosure may determine, according to a first destination address transmitted in the bus 101, a destination address (i.e., a second destination address) corresponding to a data processing result. The first destination address may be the same as the second destination address or may be different from the second destination address.
S202, transmitting the data processing result to the destination address through a bus.
The computing unit 1011 in the present disclosure may output a destination address (i.e., the second destination address) and a data processing result, where the destination address and the data processing result output by the computing unit 1011 are both transmitted in the bus 101, and each element connected to the bus 101 may determine, according to the destination address transmitted in the current bus 101, whether to perform storage processing on the data processing result transmitted in the current bus 101, so that the data processing result may be finally stored in the corresponding destination address.
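The address check described in S202 can be sketched as follows. This is a hedged software analogy (the `MemoryRegion` class and its address ranges are invented for illustration): every element on the bus sees the destination address and the result, but only the element whose address range contains the destination stores it.

```python
class MemoryRegion:
    """A hypothetical element on the bus owning an address range."""
    def __init__(self, base: int, size: int):
        self.base, self.size = base, size
        self.store: dict[int, int] = {}

    def claims(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size

    def write(self, addr: int, value: int) -> None:
        self.store[addr] = value

def broadcast(regions: list[MemoryRegion], dest_addr: int, result: int) -> None:
    """All elements see (dest_addr, result); only the owner stores it."""
    for r in regions:
        if r.claims(dest_addr):
            r.write(dest_addr, result)

sram = MemoryRegion(0x1000, 0x100)
dram = MemoryRegion(0x8000, 0x1000)
broadcast([sram, dram], 0x1010, 42)
print(sram.store)  # -> {4112: 42}  (0x1010 falls in sram's range)
```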
By providing the computing processing unit 1011 on bus 101 and having this vector processing unit process the data currently transmitted on bus 101, some simple computing processes can be completed during data transmission, avoiding certain round trips of reading data from the storage device, having a data processing unit perform the corresponding computation on the read data, and writing the result back to the storage device. Furthermore, for each data processing unit (for example, an image processor or a neural network accelerator) connected to bus 101, some simple and common computations can be extracted and completed by the computing processing unit 1011 on the bus, which helps reduce the data processing pressure on each data processing unit and improve the utilization of the computing processing unit 1011. The technical scheme provided by the present disclosure therefore helps improve data processing efficiency and device robustness.
In an alternative example, the computing processing unit 1011 in this disclosure includes an instruction module and a plurality of execution units. The instruction module is typically coupled to each execution unit. The instruction module may receive data processing instructions through bus 101 or through a newly added control line. It is configured to recognize a received data processing instruction and, according to that instruction, control the execution units to perform the corresponding computing processing on the data currently transmitted on bus 101. Each execution unit generally executes the corresponding operation under the control of the instruction module. When bus 101 is used to transmit data processing instructions, the bus of the present disclosure additionally carries these instructions compared with an existing bus; that is, the instruction module in the computing processing unit 1011 may, according to a data processing instruction transmitted on bus 101, control at least one execution unit in the computing processing unit 1011 to perform computing processing on the data transmitted in parallel on the current bus 101.
Alternatively, the type of execution unit included in the computing unit 1011 in the present disclosure may be set according to actual requirements, for example, the type of execution unit is generally related to the type of data processing that the computing unit 1011 actually needs to implement.
By arranging the instruction module and the plurality of execution units in the calculation processing unit 1011, the structure of the calculation processing unit 1011 can be flexibly arranged based on the actual requirements of calculation processing and the specific situation of the bus 101, thereby being beneficial to improving the practicability and flexibility of the calculation processing unit 1011.
In one alternative example, the computing processing unit 1011 of the present disclosure may include a variety of different types of execution units; for example, it may include one or more of a first execution unit, a second execution unit, a third execution unit, a fourth execution unit, and a fifth execution unit. The first execution unit performs mapping processing and may be referred to as a data mapping processing unit. The second execution unit performs multiplication operations and may be referred to as a vector multiplication processing unit. The third execution unit performs addition operations and may be referred to as a vector addition processing unit. The fourth execution unit performs maximum-value operations and may be referred to as a vector maximum processing unit. The fifth execution unit performs gating (selection) operations and may be referred to as a data gating processing unit.
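The lane-wise semantics of the multiplication, addition, and maximum units can be sketched as follows. These function names and semantics are assumed for illustration, not taken from the patent; each operates element-by-element on the P items of one bus transfer.

```python
# Assumed lane-wise behavior of the second (multiply), third (add),
# and fourth (maximum) execution units on two P-element vectors.
def vec_mul(a: list[int], b: list[int]) -> list[int]:
    return [x * y for x, y in zip(a, b)]

def vec_add(a: list[int], b: list[int]) -> list[int]:
    return [x + y for x, y in zip(a, b)]

def vec_max(a: list[int], b: list[int]) -> list[int]:
    return [max(x, y) for x, y in zip(a, b)]

lanes = [1, 2, 3, 4]            # one bus transfer viewed as a 4-element vector
weights = [2, 2, 2, 2]
print(vec_mul(lanes, weights))       # -> [2, 4, 6, 8]
print(vec_max(lanes, [3, 3, 3, 3]))  # -> [3, 3, 3, 4]
```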
Optionally, the instruction module in the computing unit 1011 recognizes the data processing instruction after receiving the data processing instruction transmitted in the bus 101 or the control line, and controls at least one of the first execution unit, the second execution unit, the third execution unit, the fourth execution unit, and the fifth execution unit in the computing unit 1011 to process the data transmitted in the current bus 101 according to the recognition result. The different types of execution units in the disclosure can independently realize data processing and can also cooperate with each other to realize corresponding data processing.
Alternatively, there may be multiple execution units of each type in the present disclosure, and the number of each type is generally related to the data bit width of bus 101 and the bit width of a single data item. For example, when the data bit width of bus 101 is 32 bits and the bit width of a single data item is 8 bits, the computing processing unit 1011 may include 4 first execution units, 4 second execution units, at least two third execution units, at least three fourth execution units, at least two fifth execution units based on four-way gating, and so on.
The present disclosure can realize various types of data processing by providing various types of execution units in the computation processing unit 1011 and controlling different execution units based on data processing instructions using the instruction module, thereby contributing to an improvement in the data processing capability of the computation processing unit 1011.
In one alternative example, the data processing instructions of the present disclosure may include: write mapping table instructions and data mapping instructions. Wherein a write mapping table instruction may refer to an instruction for causing mapping table information to be written into a corresponding register. The data mapping instruction may refer to an instruction that causes data to be mapped based on mapping table information. The present disclosure may utilize write mapping table instructions and data mapping instructions to implement a data mapping process.
Optionally, the first execution unit in the present disclosure includes a plurality of Memory units (such as Static Random-Access Memory (SRAM)), and each of the Memory units is configured to store the mapping table information. All the storage units in the first execution unit may store the same mapping table information at the same time, or may store different mapping table information at the same time, and each storage unit typically stores a complete piece of mapping table information. The present disclosure may control, by an instruction module, the first execution unit to write mapping table information transmitted in the current bus 101 into each storage unit of the first execution unit based on a write mapping table instruction in the data processing instruction, and control, by the instruction module, the first execution unit to perform mapping processing on data to be mapped transmitted in the current bus 101 according to the mapping table information stored in each storage unit based on the data mapping instruction in the data processing instruction.
In a specific example, the instruction module in the computing processing unit 1011 recognizes the data processing instruction after receiving the data processing instruction transmitted in the bus 101 or the control line. If the identification result is that the data processing instruction is a write mapping table instruction, the instruction module controls a first execution unit in the computing processing unit 1011 to execute a storage unit write operation, and the first execution unit writes mapping table information transmitted in the current bus 101 into each storage unit of the first execution unit according to the write control of the instruction module. If the identification result is that the data processing instruction is a data mapping instruction, the instruction module controls a first execution unit in the computing processing unit 1011 to execute a table lookup operation, and the first execution unit determines mapping data corresponding to each of the data to be mapped transmitted in the current bus 101 according to the table lookup control of the instruction module and mapping table information stored in each storage unit, and outputs the mapping data.
Alternatively, the present disclosure may use the first execution unit to implement mapping-related functions such as nonlinear activation processing in a neural network or tone mapping. In one example, the write mapping table instruction of the present disclosure may be embodied as a write nonlinear activation mapping table instruction, and correspondingly the data mapping instruction may be embodied as a nonlinear activation instruction. In that case, the instruction module may, based on the write nonlinear activation mapping table instruction, control the first execution unit to write the nonlinear activation mapping table information transmitted in parallel on the current bus 101 into its storage units, and may, based on the nonlinear activation instruction, control the first execution unit to perform nonlinear activation mapping processing on the data to be mapped transmitted in parallel on the current bus 101 according to the nonlinear activation mapping table information stored in its storage units.
As another example, the write mapping table instruction of the present disclosure may be embodied as a write tone mapping table instruction, and correspondingly, the data mapping instruction may be embodied as a tone mapping instruction, where the present disclosure may control, by using the instruction module, the first execution unit to write tone mapping table information transmitted in the current bus 101 into the storage unit thereof based on the write tone mapping table instruction in the data processing instruction, and control, by using the instruction module, the first execution unit to perform tone mapping processing on the data to be mapped, which is transmitted in parallel in the current bus 101, according to the tone mapping table information stored in the storage unit thereof based on the tone mapping instruction in the data processing instruction.
It should be noted that, the present disclosure may not distinguish between the write nonlinear activation mapping table instruction and the write tone mapping table instruction, that is, the write nonlinear activation mapping table instruction and the write tone mapping table instruction are the same instruction, that is, the write mapping table instruction. Accordingly, the present disclosure may not distinguish between the nonlinear-active instruction and the tone-mapping instruction, i.e., the nonlinear-active instruction and the tone-mapping instruction are the same instruction, i.e., the data-mapping instruction. That is, the present disclosure may utilize write mapping table instructions and data mapping instructions to accomplish a nonlinear activation process or tone mapping process.
In one example, whether it performs nonlinear activation processing or tone mapping processing, the first execution unit may implement y_i = f(x_i), where x_i may be the i-th data input in parallel in the bus 101, and y_i may be the mapping data obtained after the first execution unit performs nonlinear activation processing or tone mapping processing on x_i. In theory, the first execution unit in this disclosure may simulate any mapping processing operation using the mapping table information. The mapping table information may be mapping table information based on nonlinear activation, mapping table information based on tone mapping, etc.
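The relation y_i = f(x_i) can be sketched in software as a 256-entry lookup table. The following is a minimal Python model (not the patent's hardware, and the ReLU-style activation is only an illustrative choice of f), assuming signed 8-bit data:

```python
def build_lut(f):
    """Precompute mapping table information: one output byte per possible input byte."""
    lut = []
    for code in range(256):
        x = code - 256 if code >= 128 else code  # interpret the byte as signed int8
        y = max(-128, min(127, f(x)))            # saturate the result to int8 range
        lut.append(y & 0xFF)                     # store it back as a byte
    return lut

def lookup(lut, byte):
    """Table lookup operation: a single read replaces evaluating f at run time."""
    return lut[byte & 0xFF]

# Example: a ReLU-style nonlinear activation mapping table
relu_lut = build_lut(lambda x: max(0, x))
print(lookup(relu_lut, 5))     # 5   (relu(5) = 5)
print(lookup(relu_lut, 0xFB))  # 0   (0xFB is -5 as int8; relu(-5) = 0)
```

Because the table enumerates every possible input byte, any 8-bit-to-8-bit mapping f can be "simulated" this way, which is the point made in the paragraph above.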
Assuming that the data bit width of the bus 101 is 32 bits and the bit width of a single data is 8 bits, one example of the first execution unit is shown in fig. 3 under the above assumption.
In fig. 3, the first execution unit includes 4 groups of units, each group comprising one control subunit (i.e., ctrl0 through ctrl3 in fig. 3) and one storage unit. Since a LUT (Look-Up Table) is stored in each storage unit, LUT0 through LUT3 are used in fig. 3 to denote the 4 storage units. Ctrl0 and LUT0 in fig. 3 form the first group of units, ctrl1 and LUT1 the second group, ctrl2 and LUT2 the third group, and ctrl3 and LUT3 the fourth group.
Upon receiving a command transmitted by the instruction module based on a write mapping table instruction, ctrl0 through ctrl3 in fig. 3 provide write enable signals to LUT0 through LUT3, respectively. Each of ctrl0 through ctrl3 provides the mapping table information transmitted in parallel in the current bus 101, together with the addresses corresponding to that mapping table information, to its LUT, so that a complete piece of mapping table information is written into each of LUT0, LUT1, LUT2, and LUT3. For some applications, four identical pieces of mapping table information are stored in LUT0, LUT1, LUT2, and LUT3. For other applications, the four pieces of mapping table information stored in LUT0, LUT1, LUT2, and LUT3 are not exactly the same. For example, if the mapping table information corresponding to each of the different channels of the neural network differs, four different pieces of mapping table information may be stored in LUT0, LUT1, LUT2, and LUT3.
Upon receiving a command transmitted by the instruction module based on a data mapping instruction, ctrl0 through ctrl3 in fig. 3 each use one piece of the data to be mapped transmitted in parallel in the current bus 101 (the first through fourth data, respectively) to look up LUT0 through LUT3, obtain the mapping data corresponding to that data, and output the mapping data. The four pieces of mapping data output by the first execution unit continue to be transmitted in parallel in the bus 101, so that the data completes nonlinear activation processing, tone mapping processing, or the like during its transmission over the bus 101.
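The four-lane lookup described above can be sketched as follows. This is a hypothetical Python model, assuming a 32-bit bus word holds four 8-bit lanes and each lane has its own per-channel LUT (here, two illustrative tables: a doubling table and an identity table):

```python
def map_word(word32, luts):
    """Split a 32-bit bus word into four 8-bit lanes, look each lane up in its own
    LUT (LUT0..LUT3), and repack the four mapped bytes into a 32-bit result."""
    out = 0
    for lane in range(4):
        byte = (word32 >> (8 * lane)) & 0xFF   # extract this lane's byte
        out |= luts[lane][byte] << (8 * lane)  # mapped byte goes back to the same lane
    return out

identity = list(range(256))                 # pass-through mapping table
double = [(2 * v) & 0xFF for v in range(256)]  # example per-channel table

# Lanes 0 and 2 use the doubling table; lanes 1 and 3 pass through unchanged.
word = 0x04030201
print(hex(map_word(word, [double, identity, double, identity])))  # 0x4060202
```

All four lookups are independent, which is what lets the hardware map the four bus lanes in the same cycle.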
By storing mapping table information in the plurality of storage units of the first execution unit and completing the mapping of multiple data transmitted in parallel in the current bus 101 simultaneously through table lookup operations, the present disclosure can offload mapping-related processing in the neural network, such as nonlinear activation and tone mapping, from computing units such as the image processor and the neural network accelerator, thereby improving the efficiency of the mapping processing and reducing the computing load of the computing units.
In one alternative example, the data processing instructions of the present disclosure may include an add instruction. The add instruction may refer to an instruction for performing an add operation. The present disclosure may utilize an add instruction to implement an operation of adding at least two data. Specifically, the present disclosure may control, through the instruction module, the connection state of the multi-stage adding units in the third execution unit based on an add instruction in the data processing instruction, and control each addend to be provided to a primary adding unit, so that adding processing is performed by each adding unit. An addend in the present disclosure may be the data to be processed currently transmitted in the bus 101, or may be an output of at least one execution unit in the computing processing unit 1011. The primary adding stage in the present disclosure typically comprises a plurality of parallel adding units. The present disclosure may control the connection state of the multi-stage adding units through bypass signals.
The present disclosure may implement an addition operation of three or more of the plurality of data by means of the multi-stage adding units in the third execution unit. However, the present disclosure does not exclude the case in which the third execution unit comprises only a single stage of adding units; in that case, the third execution unit may implement pairwise addition of the plurality of data.
Assuming that the data bit width of the bus 101 is 32 bits and the bit width of a single data is 8 bits, an example of the third execution unit is shown in fig. 4 under the above assumption.
In fig. 4, the third execution unit includes 3 adding units, namely adding unit 400, adding unit 401, and adding unit 402, and each adding unit may be placed in a bypass state or a non-bypass state by its bypass signal. In the case where the present disclosure includes only this one third execution unit, the adding units 400 and 401 are primary adding units, and the adding unit 402 is a secondary adding unit. b0, b1, b2, and b3 are four data to be processed currently transmitted in parallel in the bus 101, and the present disclosure can complete various addition operations using the 3 adding units shown in fig. 4.
For example, the instruction module in the present disclosure controls the adding unit 402 to be in a bypass state, and controls the adding unit 400 and the adding unit 401 to be in a non-bypass state, so that the third execution unit outputs the sum of b0 and b1 and the sum of b2 and b3.
For another example, the instruction module in the present disclosure causes the third execution unit to output the sum of b0, b1, b2, and b3 by controlling the adding unit 400, the adding unit 401, and the adding unit 402 to be in a non-bypass state.
For another example, the instruction module in the present disclosure controls the adding unit 400 to be in a bypass state, and controls the adding unit 401 and the adding unit 402 to be in a non-bypass state, so that the third execution unit outputs the sum of b0, b2, and b3, while b1 is passed through unchanged.
For another example, the instruction module in the present disclosure controls the adding unit 401 to be in a bypass state, and controls the adding unit 400 and the adding unit 402 to be in a non-bypass state, so that the third execution unit outputs the sum of b0, b1, and b2, while b3 is passed through unchanged.
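The four bypass configurations above can be checked with a small Python model of fig. 4 (a sketch, not the patent's circuit; the treatment of bypassed values as pass-through outputs is an assumption consistent with the examples in the text):

```python
def add_core(b, bypass400=False, bypass401=False, bypass402=False):
    """Model of the three adding units in fig. 4. A non-bypassed unit sums its
    two inputs; a bypassed unit passes its inputs through unchanged."""
    b0, b1, b2, b3 = b
    s400 = [b0 + b1] if not bypass400 else [b0, b1]  # primary adding unit 400
    s401 = [b2 + b3] if not bypass401 else [b2, b3]  # primary adding unit 401
    if bypass402:
        return s400 + s401                 # e.g. (b0+b1) and (b2+b3)
    # Adding unit 402 sums the first value from each branch; any extra
    # bypassed value is passed through alongside the sum.
    return [s400[0] + s401[0]] + s400[1:] + s401[1:]

print(add_core([1, 2, 3, 4]))                 # [10]     all units summing
print(add_core([1, 2, 3, 4], bypass402=True)) # [3, 7]   two pairwise sums
print(add_core([1, 2, 3, 4], bypass400=True)) # [8, 2]   b0+b2+b3, b1 passed through
```

Each line of output corresponds to one of the examples in the text.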
Assuming that the data bit width of the bus 101 is 128 bits and the bit width of a single data is 8 bits, under the above assumption, if three add units shown in fig. 4 are taken as one add core module, one example of the third execution unit is shown in fig. 5.
In fig. 5, the third execution unit includes four add core modules and three adding units. The four add core modules are add core module 0, add core module 1, add core module 2, and add core module 3. The three adding units are adding unit 500, adding unit 501, and adding unit 502. The structures of add core module 0 through add core module 3 may each be as shown in fig. 4.
The adding units 400 and 401 in all the add core modules in fig. 5 are primary adding units, the adding units 402 in all the add core modules are secondary adding units, the adding units 500 and 501 are tertiary adding units, and the adding unit 502 is a fourth-stage adding unit.
The instruction module in the present disclosure may, by controlling the adding unit 500 to be in a non-bypass state, enable the adding unit 500 to sum the output of add core module 0 (e.g., the sum of the 1st to 4th of the 16 data to be processed transmitted in parallel in the current bus 101) and the output of add core module 1 (e.g., the sum of the 5th to 8th of those data).
Likewise, by controlling the adding unit 501 to be in a non-bypass state, the instruction module may enable the adding unit 501 to sum the output of add core module 2 (e.g., the sum of the 9th to 12th of the 16 data to be processed) and the output of add core module 3 (e.g., the sum of the 13th to 16th of those data).
By controlling the adding unit 502 to be in a non-bypass state, the instruction module may enable the adding unit 502 to sum the output of the adding unit 500 (e.g., the sum of the 1st to 8th of the 16 data to be processed) and the output of the adding unit 501 (e.g., the sum of the 9th to 16th of those data).
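With all units in the non-bypass state, fig. 5 reduces to a 16-input adder tree. A minimal Python sketch of that reduction (an illustrative model, not the patent's hardware):

```python
def sum16(data):
    """Fig. 5 as a reduction tree: each add core module sums 4 of the 16 bus
    values (two first-stage adds plus one second-stage add), units 500 and 501
    combine the core outputs pairwise, and unit 502 produces the final sum."""
    assert len(data) == 16
    core = [sum(data[i:i + 4]) for i in range(0, 16, 4)]  # add core modules 0..3
    s500 = core[0] + core[1]   # sum of data 1..8
    s501 = core[2] + core[3]   # sum of data 9..16
    return s500 + s501         # adding unit 502

print(sum16(list(range(1, 17))))  # 136
```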
By utilizing the multi-stage adding units in the third execution unit, the present disclosure can not only complete various forms of addition on multiple data transmitted in parallel in the current bus 101 simultaneously, but also offload part of the addition operations in the neural network from computing units such as the image processor and the neural network accelerator, thereby helping to improve addition efficiency and reduce the computing load of the computing units.
In one alternative example, the data processing instructions of the present disclosure may include: a write coefficient instruction and a data multiplication instruction. The instruction for writing coefficients may refer to an instruction for writing coefficients (i.e., a multiplier) of a multiplication operation into a corresponding memory location. The data multiplication instruction may refer to an instruction that multiplies another multiplier by a coefficient in a storage unit. The present disclosure may utilize a write coefficient instruction and a data multiplication instruction to implement a data multiplication operation.
Optionally, the second execution unit in the present disclosure includes a plurality of registers (e.g., Reg0, Reg1, Reg2, and Reg3 in fig. 6 described below), and each register in the second execution unit is used to store a coefficient. All registers in the second execution unit may store the same coefficient at the same time, or may store different coefficients at the same time. The present disclosure may control, through the instruction module, the second execution unit to write each coefficient transmitted in parallel in the current bus 101 into each of its registers based on a write coefficient instruction in the data processing instruction, and control, through the instruction module, the second execution unit to perform multiplication on the data to be multiplied transmitted in the current bus 101, or the data to be multiplied output by other execution units, according to the coefficients stored in each register, based on a data multiplication instruction in the data processing instruction.
In a specific example, the instruction module in the computing processing unit 1011 recognizes a data processing instruction after receiving it over the bus 101 or a control line. If the recognition result is that the data processing instruction is a write coefficient instruction, the instruction module controls the second execution unit in the computing processing unit 1011 to execute a register write operation, and the second execution unit writes each coefficient transmitted in the current bus 101 into each of its registers according to the write control of the instruction module. If the recognition result is that the data processing instruction is a data multiplication instruction, the instruction module controls the second execution unit in the computing processing unit 1011 to execute a multiplication operation, and the second execution unit multiplies the coefficients stored in each register with the data to be multiplied transmitted in the current bus 101 (or the data to be multiplied output by other execution units) according to the multiplication control of the instruction module, and outputs the multiplication results.
Alternatively, the present disclosure may implement functions related to multiplication processing, such as normalization processing or image brightness adjustment processing in a neural network, with the second execution unit.
In one example, the write coefficient instruction of the present disclosure may be embodied as a write normalization processing coefficient instruction, and the corresponding data multiplication instruction may be embodied as a normalization processing instruction. The present disclosure may control, through the instruction module, the second execution unit to write the coefficients for normalization processing transmitted in parallel in the current bus 101 into its registers, respectively, based on the write normalization processing coefficient instruction in the data processing instruction, and control, through the instruction module, the second execution unit to multiply each coefficient stored in its registers with each multiplier transmitted in parallel in the current bus 101, respectively, based on the normalization processing instruction in the data processing instruction; the second execution unit outputs the results of the multiplication processing, thereby implementing the normalization processing.
As another example, the write coefficient instruction of the present disclosure may be embodied as a write brightness adjustment coefficient instruction, and the corresponding data multiplication instruction may be embodied as a brightness adjustment processing instruction, and the present disclosure may control the second execution unit to write the coefficient for brightness adjustment, which is transmitted in parallel in the current bus 101, into its register based on the write brightness adjustment coefficient instruction in the data processing instruction, and control the second execution unit to multiply each coefficient stored in its register with each multiplier, which is transmitted in parallel in the current bus 101, respectively, based on the brightness adjustment processing instruction in the data processing instruction, through the instruction module, and the second execution unit outputs the result of the multiplication processing, thereby implementing the image brightness adjustment processing.
It should be noted that, the disclosure may not distinguish between the write normalization processing coefficient instruction and the write luminance adjustment coefficient instruction, that is, the write normalization processing coefficient instruction and the write luminance adjustment coefficient instruction are the same instruction, that is, the write coefficient instruction. Accordingly, the present disclosure may not distinguish between the normalization processing instruction and the brightness adjustment processing instruction, that is, the normalization processing instruction and the brightness adjustment processing instruction are the same instruction, that is, the data multiplication instruction. That is, the present disclosure can complete the normalization process or the image brightness adjustment process using the write coefficient instruction and the data multiplication instruction.
In one example, the normalization process may be denoted as c_i = a_i × b_i + β. The present disclosure may strip the addition operation from the normalization process, i.e., the normalization process in this disclosure can be denoted as c_i = a_i × b_i, thereby conforming to the expression of the image brightness adjustment processing. Thus, whether for the normalization processing or the image brightness adjustment processing of the present disclosure, the second execution unit can implement c_i = a_i × b_i, where a_i may be the coefficient stored in the i-th register, b_i may be the i-th multiplier transmitted in parallel in the current bus 101, and c_i may be the product obtained after the second execution unit performs the normalization processing or the brightness adjustment processing on a_i and b_i.
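The per-lane product c_i = a_i × b_i can be sketched in one line of Python (an illustrative model; the coefficient and multiplier values are made up):

```python
def multiply_lanes(coeffs, multipliers):
    """Per-lane product c_i = a_i * b_i: a_i is the coefficient held in the i-th
    register, b_i the i-th multiplier arriving in parallel on the bus."""
    return [a * b for a, b in zip(coeffs, multipliers)]

# Four coefficients latched in Reg0..Reg3, four multipliers on the bus:
print(multiply_lanes([2, 3, 4, 5], [10, 10, 10, 10]))  # [20, 30, 40, 50]
```

The same operation serves both uses: for normalization the a_i are per-channel scale factors, for brightness adjustment they are gain values.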
Assuming that the data bit width of the bus 101 is 32 bits and the bit width of the single data is 8 bits, the second execution unit may include: 4 registers and 4 multiplier units. Assuming that the data bit width of the bus 101 is 128 bits and the bit width of the single data is 8 bits, the second execution unit may include, under the above assumption: 16 registers and 16 multiplier units.
Assuming that the data bit width of the bus 101 is 32 bits and the bit width of a single data is 8 bits, an example of the second execution unit is shown in fig. 6 under the above assumption.
In fig. 6, the second execution unit includes 4 groups of units, each group comprising one multiplier unit (i.e., multiplier unit 600, 601, 602, or 603 in fig. 6) and one register (i.e., Reg0, Reg1, Reg2, or Reg3 in fig. 6). Each register stores one coefficient, and the coefficients stored in different registers may be the same or different. The multiplier unit 600 and Reg0 in fig. 6 form the first group of units, the multiplier unit 601 and Reg1 the second group, the multiplier unit 602 and Reg2 the third group, and the multiplier unit 603 and Reg3 the fourth group.
The instruction module can clear the coefficients stored in each register by controlling Clr and En of Reg0, Reg1, Reg2, and Reg3 to be active simultaneously. The instruction module can cause coefficients to be written into the registers by controlling Set and En of Reg0, Reg1, Reg2, and Reg3 to be active simultaneously, e.g., writing the four coefficients transmitted in parallel in the current bus 101 into the registers. By controlling Clr, Set, and En of Reg0, Reg1, Reg2, and Reg3 to be inactive simultaneously, the instruction module can cause the 4 multipliers transmitted in parallel in the current bus 101 to be provided to the multiplier units 600, 601, 602, and 603, respectively, so that the 4 multiplier units simultaneously complete the products of coefficients and multipliers and output 4 products, completing the normalization processing or brightness adjustment processing. Note that, in fig. 6, Reg0, Reg1, Reg2, and Reg3 may also be controlled by Clr and Set alone, without the En control signal.
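The Clr/Set/En behavior of one coefficient register can be sketched as a toy Python model (the exact semantics, e.g. En gating both Clr and Set and Clr taking priority, are assumptions consistent with the description above, not the patent's circuit):

```python
class CoeffReg:
    """Toy model of one coefficient register (Reg0..Reg3) with Clr/Set/En controls."""
    def __init__(self):
        self.value = 0

    def tick(self, clr, set_, en, bus_byte):
        if en and clr:
            self.value = 0          # clear the stored coefficient
        elif en and set_:
            self.value = bus_byte   # latch the coefficient from the bus lane
        # all controls inactive: register holds; the bus lane feeds the multiplier

r = CoeffReg()
r.tick(clr=False, set_=True, en=True, bus_byte=7)   # write coefficient 7
print(r.value)                                      # 7
r.tick(clr=True, set_=False, en=True, bus_byte=9)   # clear
print(r.value)                                      # 0
```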
By storing coefficients in the plurality of registers of the second execution unit and performing multiplications with the plurality of multiplier units simultaneously, the present disclosure can not only complete the multiplication of multiple multipliers transmitted in parallel in the current bus 101 at the same time, but also offload multiplication-related processing in the neural network, such as normalization processing and image brightness adjustment processing, from computing units such as the image processor and the neural network accelerator, thereby helping to improve multiplication efficiency and reduce the computing load of the computing units.
In one alternative example, the data processing instructions of the present disclosure may include a determine maximum value instruction. The determine maximum value instruction may refer to an instruction for comparing the sizes of a plurality of data to determine the data having the largest value among them. The present disclosure may utilize the determine maximum value instruction to implement a value comparison operation on at least two data.
Optionally, the present disclosure may control the connection state of the multi-stage comparators in the fourth execution unit through the instruction module based on a determine maximum value instruction in the data processing instruction, and control the data to be compared to be provided, group by group, to the primary comparators, with comparison processing performed by the comparators to obtain the maximum value among 2^N data, where N is an integer greater than or equal to 1 and 2^N is less than or equal to the number of data to be compared. The data to be compared in the present disclosure may be the data to be compared transmitted in the current bus 101, or may be data output by at least one execution unit in the computing processing unit 1011. The primary comparator stage in the present disclosure typically includes a plurality of parallel comparators. The present disclosure may control the connection state of the multi-stage comparators through bypass signals.
The present disclosure may implement a value comparison of three or more of the plurality of data by means of the multi-stage comparators in the fourth execution unit. However, the present disclosure does not exclude the case in which the fourth execution unit comprises only a single stage of comparators; in that case, the fourth execution unit may implement a pairwise value comparison of the plurality of data.
Assuming that the data bit width of the bus 101 is 32 bits and the bit width of a single data is 8 bits, an example of the fourth execution unit is shown in fig. 7 under the above assumption.
In fig. 7, the fourth execution unit includes 3 comparators, namely comparator 700, comparator 701, and comparator 702, and each comparator may be placed in a bypass state or a non-bypass state by its bypass signal. In the case where the present disclosure includes only this one fourth execution unit, the comparators 700 and 701 are primary comparators, and the comparator 702 is a secondary comparator. b0, b1, b2, and b3 are four data to be compared currently transmitted in parallel in the bus 101, and the present disclosure can complete various comparison operations using the 3 comparators shown in fig. 7.
For example, the instruction module in the present disclosure controls the comparator 702 to be in a bypass state and controls the comparators 700 and 701 to be in a non-bypass state, so that the fourth execution unit determines the comparison result of b0 and b1 (i.e., the maximum and minimum of b0 and b1) and the comparison result of b2 and b3 (i.e., the maximum and minimum of b2 and b3).
For another example, the instruction module in the present disclosure controls the comparators 700, 701, and 702 to be in a non-bypass state, so that the fourth execution unit determines the comparison result of b0, b1, b2, and b3, that is, the maximum value of b0, b1, b2, and b3.
For another example, the instruction module in the present disclosure controls the comparator 700 to be in a bypass state and controls the comparators 701 and 702 to be in a non-bypass state, so that the fourth execution unit determines the comparison result of b0, b2, and b3, that is, the maximum value of b0, b2, and b3.
For another example, the instruction module in the present disclosure controls the comparator 701 to be in a bypass state and controls the comparators 700 and 702 to be in a non-bypass state, so that the fourth execution unit determines the comparison result of b0, b1, and b2, that is, the maximum value of b0, b1, and b2.
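These bypass configurations mirror the adder-tree examples. A minimal Python model of fig. 7 (a sketch, not the patent's circuit; for brevity each non-bypassed comparator forwards only the larger of its inputs, and bypassed values pass through):

```python
def compare_core(b, bypass700=False, bypass701=False, bypass702=False):
    """Model of the three comparators in fig. 7 with bypass control."""
    b0, b1, b2, b3 = b
    s700 = [max(b0, b1)] if not bypass700 else [b0, b1]  # primary comparator 700
    s701 = [max(b2, b3)] if not bypass701 else [b2, b3]  # primary comparator 701
    if bypass702:
        return s700 + s701                 # e.g. max(b0,b1) and max(b2,b3)
    # Comparator 702 compares the first value from each branch; any extra
    # bypassed value is passed through alongside the winner.
    return [max(s700[0], s701[0])] + s700[1:] + s701[1:]

print(compare_core([1, 9, 4, 7]))                 # [9]     max of all four
print(compare_core([1, 9, 4, 7], bypass702=True)) # [9, 7]  two pairwise maxima
print(compare_core([1, 9, 4, 7], bypass700=True)) # [7, 9]  max(b0,b2,b3), b1 passed through
```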
It should be specifically noted that the present disclosure may further use the fourth execution unit to sort a plurality of data to be compared, so as to obtain a sorting result of the plurality of data to be compared. For example, the fourth execution unit in fig. 7 may instead include 5 comparators: two primary comparators, two secondary comparators, and one tertiary comparator. The maximum values determined by the two primary comparators may be provided to one of the secondary comparators, and the minimum values determined by the two primary comparators may be provided to the other secondary comparator; the minimum value determined by the former secondary comparator and the maximum value determined by the latter secondary comparator may then be provided to the tertiary comparator, so that the sorting result of the four data to be compared is obtained from the comparison results of the two secondary comparators and the tertiary comparator. Furthermore, the present disclosure may use the fourth execution unit to obtain only the maximum value and the minimum value of a plurality of data to be compared. For example, the fourth execution unit in fig. 7 may instead include 4 comparators: two primary comparators and two secondary comparators. The maximum values determined by the two primary comparators may be provided to one of the secondary comparators, from whose output the maximum of the four data to be compared is obtained; the minimum values determined by the two primary comparators may be provided to the other secondary comparator, from whose output the minimum of the four data to be compared is obtained.
Assuming that the data bit width of the bus 101 is 128 bits and the bit width of a single data is 8 bits, under the above assumption, if three comparators shown in fig. 7 are taken as one comparison core module, one example of the fourth execution unit is shown in fig. 8.
In fig. 8, the fourth execution unit includes four comparison core modules and three comparators. The four comparison core modules are comparison core module 0, comparison core module 1, comparison core module 2, and comparison core module 3. The three comparators are comparator 800, comparator 801, and comparator 802. The structures of comparison core module 0 through comparison core module 3 may each be as shown in fig. 7.
The comparators 700 and 701 in all the comparison core modules in fig. 8 are primary comparators, the comparators 702 in all the comparison core modules are secondary comparators, the comparators 800 and 801 are tertiary comparators, and the comparator 802 is a fourth-stage comparator.
The instruction module in the present disclosure may, by controlling the comparator 800 to be in a non-bypass state, enable the comparator 800 to select the larger of the maximum value determined by comparison core module 0 (e.g., the maximum of the 1st to 4th of the 16 data to be compared transmitted in parallel in the current bus 101) and the maximum value determined by comparison core module 1 (e.g., the maximum of the 5th to 8th of those data).
Likewise, by controlling the comparator 801 to be in a non-bypass state, the instruction module may enable the comparator 801 to select the larger of the maximum value determined by comparison core module 2 (e.g., the maximum of the 9th to 12th of the 16 data to be compared) and the maximum value determined by comparison core module 3 (e.g., the maximum of the 13th to 16th of those data).
By controlling the comparators 800, 801, and 802 to be in a non-bypass state, the instruction module in the present disclosure may enable the comparator 802 to select the larger of the maximum value determined by the comparator 800 (e.g., the maximum of the 1st to 8th of the 16 data to be compared transmitted in parallel in the current bus 101) and the maximum value determined by the comparator 801 (e.g., the maximum of the 9th to 16th of those data), that is, the maximum value among the 16 data to be compared.
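With everything in the non-bypass state, fig. 8 reduces to a 16-input max tree, directly analogous to the adder tree of fig. 5. A minimal Python sketch:

```python
def max16(data):
    """Fig. 8 as a comparator tree: each comparison core module finds the maximum
    of 4 of the 16 bus values, comparators 800 and 801 combine the core maxima
    pairwise, and comparator 802 yields the overall maximum."""
    assert len(data) == 16
    core = [max(data[i:i + 4]) for i in range(0, 16, 4)]  # comparison core modules 0..3
    return max(max(core[0], core[1]),   # comparator 800 / 801
               max(core[2], core[3]))   # comparator 802 takes the two survivors

print(max16([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]))  # 9
```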
By using the multi-stage comparators in the fourth execution unit, the present disclosure can not only complete multiple types of comparison operations on the multiple data transmitted in parallel in the current bus 101, but can also offload part of the comparison operations in a neural network from computing units such as an image processor or a neural network accelerator, thereby helping to improve the efficiency of the comparison operations and reduce the computing load of those computing units.
In one alternative example, the data processing instructions in the present disclosure may include: a determine maximum position instruction. The determine maximum position instruction may refer to an instruction for comparing the magnitudes of a plurality of data to determine the coordinates of the datum having the largest value among the plurality of data. The present disclosure may utilize the determine maximum position instruction to locate the maximum value in a numerical comparison of at least two data.
Optionally, the present disclosure may control, by the instruction module, the connection state of the multi-stage comparators in the fourth execution unit based on the determine maximum position instruction in the data processing instruction, control the data to be compared to be provided group by group to the primary comparators, and perform the comparison processing by the comparators, to obtain the maximum value among 2^N data. The instruction module in the present disclosure may then control the gating state of at least one multiplexer in the fifth execution unit based on the determine maximum position instruction, and execute the output control of the coordinates of the maximum value through the multiplexer. At least a portion of the data paths in the computation processing unit 1011 of the present disclosure are typically provided with initial coordinates, which are typically carried along during the data comparison, so that the coordinates output by the multiplexer may be the initial coordinates of the corresponding data path. N in the present disclosure is an integer greater than or equal to 1, and 2^N is less than or equal to the number of data to be compared. The data to be compared in the present disclosure may be the data to be compared transmitted in the current bus 101, or may be data output by at least one execution unit in the computation processing unit 1011. The primary comparators in the present disclosure typically comprise a plurality of parallel comparators. The present disclosure may control the connection state of the multi-stage comparators through a bypass signal. In addition, a multiplexer typically includes multiple inputs, and the gating state of the multiplexer may refer to one of its inputs being gated through to the output.
Assuming that the data bit width of the bus 101 is 32 bits and the bit width of a single datum is 8 bits, fig. 9 shows an example of maximum position location (Argmax) realized, under the above assumption, by combining the comparators in the fourth execution unit with a multiplexer in the fifth execution unit.
In fig. 9, the output of the fourth execution unit is taken as the input of a multiplexer (MUX) in the fifth execution unit, and the multiplexer can output the coordinates of the maximum value among b0, b1, b2, and b3.
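A rough software analogue of the Argmax path of fig. 9 — assuming, as described above, that each data path carries an initial coordinate that travels with the winning value — might look as follows; all names are illustrative.

```python
def cmp_with_coord(a, b):
    # each operand is a (value, coordinate) pair; the larger value wins,
    # and its coordinate travels with it through the comparator stage
    return a if a[0] >= b[0] else b

def argmax4(b):
    pairs = [(v, i) for i, v in enumerate(b)]  # attach initial coordinates 0..3
    m0 = cmp_with_coord(pairs[0], pairs[1])    # first comparator stage
    m1 = cmp_with_coord(pairs[2], pairs[3])
    value, coord = cmp_with_coord(m0, m1)      # second comparator stage
    return coord                               # the MUX outputs this coordinate

print(argmax4([12, 45, 7, 30]))  # → 1
```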
By using the comparators in the fourth execution unit together with the multiplexers in the fifth execution unit, the present disclosure can not only perform positioning processing on the maximum value of the plurality of data transmitted in parallel in the current bus 101, but can also offload comparison and positioning processing in a neural network from computing units such as an image processor or a neural network accelerator, thereby helping to improve the efficiency of the comparison-and-positioning operation and reduce the computing load of those computing units.
In one alternative example, the data processing instructions in the present disclosure may include: a data strobe instruction. The data strobe instruction may refer to an instruction for determining which of the plurality of input data of the fifth execution unit can be output by the fifth execution unit. That is, the present disclosure may determine the output data of the fifth execution unit using the data strobe instruction. The fifth execution unit in the present disclosure may include one or more multiplexers; for example, the fifth execution unit includes four multiplexers. The instruction module may control the gating state of each of the multiplexers. Specifically, the present disclosure may control, by the instruction module, the gating state of at least one multiplexer in the fifth execution unit based on the data strobe instruction in the data processing instruction, control the data to be processed to be provided to the multiplexer, and perform the data output control through the multiplexer. The data to be processed may include: the data to be processed currently transmitted in the bus 101 and/or the output of at least one execution unit of the computation processing unit 1011.
Optionally, for any multiplexer in the fifth execution unit, the instruction module in the disclosure may control whether any one of the input data in the multiplexer is output through a data strobe instruction. Assuming that the data bit width of the bus 101 is 32 bits and the bit width of the single data is 8 bits, the fifth execution unit may include: 4 multiplexers. Assuming that the data bit width of the bus 101 is 128 bits and the bit width of the single data is 8 bits, the fifth execution unit may include: 16 multiplexers. An example of a fifth execution unit comprising 4 multiplexers is shown in fig. 10.
In fig. 10, the fifth execution unit includes: MUX0 (multiplexer), MUX1, MUX2, and MUX3. The instruction module controls whether each multiplexer is in a gated state or a non-gated state by sending a Sel signal (i.e., a data strobe signal) to each multiplexer, respectively. When a multiplexer is in the gated state, its input is output; when it is in the non-gated state, its input is not output.
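A minimal software sketch of this gating behavior — assuming, for illustration, that a non-gated MUX simply produces no output — is given below; the input values and Sel pattern are invented.

```python
# Illustrative model of MUX0..MUX3 in fig. 10; the Sel values are hypothetical.
def mux(sel, value):
    return value if sel else None  # non-gated state: the input is not output

inputs = [5, 6, 7, 8]             # one input per MUX0..MUX3
sel = [True, False, True, False]  # Sel signals sent by the instruction module
outputs = [mux(s, v) for s, v in zip(sel, inputs)]
print(outputs)  # → [5, None, 7, None]
```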
By controlling the gating state of the multiplexers in the fifth execution unit, the present disclosure can flexibly produce output in cooperation with the other execution units, so that the calculation processing unit 1011 accurately outputs its calculation results. For example, in cooperation with the third execution unit, the calculation processing unit 1011 can output an addition result; for another example, in cooperation with the fourth execution unit, the calculation processing unit 1011 can output the maximum value of a plurality of data.
In an alternative example, some execution units in the computing processing unit 1011 of the present disclosure may cause the computing processing unit 1011 to exhibit data dimension reduction after performing the corresponding operation, that is, fewer data are output than are input. For example, the addition processing performed by the third execution unit causes data dimension reduction in the computing processing unit 1011. For another example, the maximum value processing performed by the fourth execution unit causes data dimension reduction in the computing processing unit 1011. The instruction module in the present disclosure may ensure that the output of the computing processing unit 1011 is correctly written into the storage device by gating the address and data. In addition, data dimension reduction helps to reduce the storage space occupied by the data, save data writing time, and reduce the bandwidth consumed by data transmission.
Optionally, the instruction module in the present disclosure may include an address and byte control module (CTRL). The instruction module may control the address and byte control module to output a DSEL (data select) instruction, and the DSEL instruction may include: a byte enable output signal and an address signal. The byte enable output signal can be used to control the gating state of at least one multiplexer in the fifth execution unit, so that a particular output of the third execution unit can be output by the multiplexer. The address signal can be used to control the storage address corresponding to the data output by the multiplexer, so that a correct storage address can be set for the addition processing result of the third execution unit, and likewise for the maximum value processing result of the fourth execution unit.
Alternatively, the present disclosure determines the second destination address of each output datum of the calculation processing unit 1011 based on the first destination address of each input datum of the calculation processing unit 1011. If the number of input data of the calculation processing unit 1011 is exactly the same as the number of its output data, the first destination addresses are generally the same as the second destination addresses; if the number of input data of the calculation processing unit 1011 is greater than the number of its output data, the first destination addresses generally differ from the second destination addresses, for example, the second destination addresses are a subset of the first destination addresses.
It should be noted that the computing processing unit 1011 in the present disclosure may not include an address and byte control module. In that case, the data output by the computing processing unit 1011 may include invalid data. The present disclosure may filter out the invalid data by software or other means after the corresponding data are output by the computing processing unit 1011, thereby ensuring the accuracy of the data written into the storage device.
Alternatively, the present disclosure may cause the computing processing unit 1011 to realize a plurality of functions by combining the processing performed by the respective execution units in the computing processing unit 1011. For example, the present disclosure may implement average pooling processing or downsampling processing using the third and fifth execution units in the calculation processing unit 1011. For another example, the present disclosure may implement maximum value pooling processing with the fourth and fifth execution units in the calculation processing unit 1011. Taking the specific procedure of downsampling processing as an example, a function that can be realized by the calculation processing unit 1011 of the present disclosure is described below with reference to fig. 11.
In fig. 11, it is assumed that the data transmitted in parallel in the current bus 101 includes values of 16 pixels, which are respectively the values of 8 pixels in the 0 th row and the values of 8 pixels in the 1 st row in the image. The values of 8 pixels in the 0 th row are P00, P01, P02, P03, P04, P05, P06 and P07, respectively. The values of 8 pixels in the 1 st row are P10, P11, P12, P13, P14, P15, P16 and P17 respectively. It is assumed that downsampling is performed by a factor of 2 in both the row and column directions of the image.
The present disclosure may provide the values of the 16 pixels to the 8 primary adding units in the third execution unit in the order of P00, P01, P10, P11, P02, P03, P12, P13, P04, P05, P14, P15, P06, P07, P16, and P17, and the results output by the 8 primary adding units are s01, s02, s03, s04, s05, s06, s07, and s08, respectively. Wherein s01 and s02 are supplied to the first secondary adding unit, s03 and s04 are supplied to the second secondary adding unit, s05 and s06 are supplied to the third secondary adding unit, and s07 and s08 are supplied to the fourth secondary adding unit. The outputs of the four secondary adding units are s1, s2, s3, and s4, respectively. After the addition processing by the secondary adding units, the data to be stored in the storage device are reduced from 16 to 4; therefore, the transmission positions and storage addresses of these 4 data need to be adjusted. The address and byte control module in this disclosure can adjust the transmission positions and storage addresses of s1, s2, s3, and s4 through the DSEL instruction, so that s1, s2, s3, and s4 are stored in the correct storage locations of the storage device.
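Under the assumptions of fig. 11, the add tree can be sketched numerically as follows. The pixel values are invented for illustration; for 2× downsampling by averaging, each block sum would additionally be multiplied by 1/4, e.g., by the second execution unit.

```python
# Illustrative 2x2 block summation for 2x downsampling (values are made up).
row0 = [10, 20, 30, 40, 50, 60, 70, 80]   # P00..P07
row1 = [11, 21, 31, 41, 51, 61, 71, 81]   # P10..P17

# feed order P00, P01, P10, P11, P02, P03, P12, P13, ...
feed = []
for c in range(0, 8, 2):
    feed += [row0[c], row0[c + 1], row1[c], row1[c + 1]]

# eight primary adding units produce s01..s08
s_primary = [feed[i] + feed[i + 1] for i in range(0, 16, 2)]

# four secondary adding units produce s1..s4 (one sum per 2x2 pixel block)
s_secondary = [s_primary[i] + s_primary[i + 1] for i in range(0, 8, 2)]
print(s_secondary)  # → [62, 142, 222, 302]
```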
Specifically, the present disclosure generally uses the storage address corresponding to each input datum of the computing processing unit 1011 to determine the storage address corresponding to each output datum of the computing processing unit 1011. The storage address here is a storage address of the storage device. For example, the present disclosure sets the storage addresses of s1, s2, s3, and s4 on the basis of the storage addresses respectively corresponding to the values of the 16 pixel points; for instance, the present disclosure may use the storage addresses of the first 4 of the 16 pixel values as the storage addresses of s1, s2, s3, and s4. If no data dimension reduction occurs in the calculation processing unit 1011, the storage address of each datum input to the calculation processing unit 1011 is the storage address of the corresponding datum output by the calculation processing unit 1011.
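The address selection just described can be sketched as below. The base address and the 1-byte spacing (matching 8-bit data) are assumptions made purely for illustration.

```python
# Illustrative address selection: 16 input data reduce to 4 output data, and
# the storage addresses of the first 4 inputs are reused for the outputs.
first_dst = [0x1000 + i for i in range(16)]  # one address per input datum
num_outputs = 4                              # s1..s4 after dimension reduction
second_dst = first_dst[:num_outputs]
print([hex(a) for a in second_dst])  # → ['0x1000', '0x1001', '0x1002', '0x1003']
```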
Exemplary apparatus
FIG. 12 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure. The apparatus of this embodiment may be used to implement the corresponding method embodiments of the present disclosure. The apparatus as shown in fig. 12 may include: a calculation processing unit 1200, a determination destination address module 1201 and a transmission module 1202.
The computing processing unit 1200 is located on the bus 101. The computing processing unit 1200 is configured to process the data transmitted in the current bus 101 to obtain a data processing result.
The destination address determining module 1201 is configured to determine a destination address corresponding to a data processing result obtained by the computing processing unit 1200. The destination address determining module 1201 may be located inside the computing processing unit 1200 or may be provided independently of the computing processing unit 1200. In one example, the determine destination address module 1201 may be an address and byte control module in the method embodiments described above.
The transmission module 1202 is configured to transmit the data processing result obtained by the calculation processing unit 1200 to the destination address determined by the destination address determining module 1201 through the bus 101.
Optionally, the computing processing unit 1200 in the present disclosure may include an instruction module and at least one execution unit. The instruction module can receive the data processing instruction, identify the received data processing instruction, and control at least one execution unit to process the data transmitted in the current bus 101 according to the identification result. For example, the instruction module, upon receiving the data processing instruction, controls at least one of a first execution unit for implementing mapping, a second execution unit for implementing multiplication, a third execution unit for implementing addition, a fourth execution unit for implementing maximum value, and a fifth execution unit for implementing gating in the calculation processing unit 1200 according to the data processing instruction, and processes the data transferred in the current bus 101.
In one example, the instruction module controls, according to the write mapping table instruction in the data processing instruction, the first execution unit to write the mapping table information transmitted in the current bus 101 into a register of the first execution unit, and controls, according to the data mapping instruction in the data processing instruction, the first execution unit to perform mapping processing on the data to be mapped transmitted in the current bus 101 according to the mapping table information stored in the current register. For example, the instruction module may control, according to the nonlinear activation instruction in the data processing instruction, the first execution unit to perform nonlinear activation mapping processing on the data to be mapped transmitted in the current bus 101 according to the nonlinear activation mapping table information stored in the current register. For another example, the instruction module may control, according to the tone mapping instruction in the data processing instruction, the first execution unit to perform tone mapping processing on the data to be mapped transmitted in the current bus 101 according to the tone mapping table information stored in the current register.
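The write-then-map behavior of the first execution unit can be sketched as a simple lookup table. The table contents (a ReLU-like activation over 8-bit data) are an invented example and do not reflect the actual mapping table format.

```python
# "Write mapping table": a 256-entry LUT for 8-bit data, here a ReLU-like
# activation centered at 128 (an illustrative, made-up mapping).
lut = [max(i - 128, 0) for i in range(256)]

# "Data mapping": each datum on the bus indexes the register-held table.
bus_data = [100, 128, 200, 255]
mapped = [lut[d] for d in bus_data]
print(mapped)  # → [0, 0, 72, 127]
```

Tone mapping would use the same mechanism with a different table written into the register.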
As another example, the instruction module may control the connection state of the multi-stage adding units in the third execution unit according to the addition instruction in the data processing instruction, control the addends to be provided to the primary adding units respectively, and perform the addition processing by the adding units. The addends may include: the data to be processed currently transmitted in the bus 101 and/or the output of at least one execution unit of the computation processing unit. In addition, the primary adding units usually comprise a plurality of parallel adding units.
In yet another example, the instruction module may control, according to the write coefficient instruction in the data processing instruction, the second execution unit to write the coefficient transmitted in the current bus 101 into a register of the second execution unit, and may further control, according to the data multiplication instruction in the data processing instruction, the second execution unit to multiply each multiplier by the coefficient stored in the register. The multipliers may include: the data to be multiplied currently transmitted in the bus 101 and/or the output of at least one execution unit of the calculation processing unit. For example, the instruction module may control, according to the write normalization processing coefficient instruction in the data processing instruction, the second execution unit to write the coefficient for normalization processing transmitted in the current bus 101 into the register of the second execution unit, and may control, according to the normalization processing instruction in the data processing instruction, each multiplier transmitted in parallel in the current bus 101 to be provided to each multiplication unit in the second execution unit, each multiplication unit multiplying its multiplier by the coefficient for normalization processing stored in the register.
For another example, the instruction module may control, according to the write brightness adjustment coefficient instruction in the data processing instruction, the second execution unit to write the coefficient for brightness adjustment transmitted in the current bus 101 into the register of the second execution unit, and may control, according to the brightness adjustment processing instruction in the data processing instruction, each multiplier transmitted in parallel in the current bus 101 to be provided to each multiplication unit in the second execution unit, each multiplication unit multiplying its multiplier by the coefficient for brightness adjustment.
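The second execution unit's behavior can be sketched as below; the coefficient value, the saturation to 8 bits, and the data are all assumptions made for illustration.

```python
# "Write brightness adjustment coefficient": a coefficient held in a register.
coef = 2

# "Brightness adjustment processing": each multiplier transmitted in parallel
# is multiplied by the register-held coefficient, saturating at 255 (8-bit).
bus_data = [10, 100, 130, 200]
adjusted = [min(x * coef, 255) for x in bus_data]
print(adjusted)  # → [20, 200, 255, 255]
```

Normalization processing would use the same multiply path with a normalization coefficient written into the register instead.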
In another example, the instruction module may control the connection state of the multi-stage comparators in the fourth execution unit according to the determine maximum value instruction in the data processing instructions, control the data to be compared to be provided group by group to the primary comparators, and perform the comparison processing by the comparators, so that the present disclosure may obtain the maximum value among 2^N data. Wherein N is an integer greater than or equal to 1, and 2^N is less than or equal to the number of data to be compared; the data to be compared may include: the data to be compared currently transmitted in the bus 101 and/or the output of at least one execution unit of the computation processing unit.
In yet another example, the instruction module may control the connection state of the multi-stage comparators in the fourth execution unit according to the determine maximum position instruction in the data processing instruction, control the data to be compared transmitted in the current bus 101 to be provided group by group to the primary comparators, and perform the comparison processing by the comparators, so that the present disclosure may obtain the maximum value among 2^N data. The instruction module may further control the gating state of at least one multiplexer in the fifth execution unit according to the determine maximum position instruction in the data processing instruction, and execute the output control of the coordinates of the maximum value through the multiplexer. Wherein N is an integer greater than or equal to 1, and 2^N is less than or equal to the number of data to be compared.
In yet another example, the instruction module may control a gating state of at least one of the multiplexers in the fifth execution unit according to a data gating instruction in the data processing instruction, and control data to be processed to be supplied to the multiplexer, and the data output control is performed by the multiplexer. The data to be processed may include: the data to be processed currently transmitted in the bus 101 and/or the output of at least one execution unit of the computation processing units.
Alternatively, the determining destination address module 1201 in the present disclosure may determine, according to the first destination addresses corresponding to the addends provided to the third execution unit, the second destination addresses corresponding to the addition processing results output by the third execution unit. The determining destination address module 1201 may also determine, according to the first destination addresses corresponding to the data to be compared provided to the fourth execution unit, the second destination address corresponding to the maximum value output by the fourth execution unit.
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 13. Fig. 13 shows a block diagram of an electronic device according to an embodiment of the disclosure. As shown in fig. 13, electronic device 131 includes one or more processors 1311 and memory 1312.
Processor 1311 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities and may control other components in electronic device 131 to perform desired functions.
Memory 1312 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example: random Access Memory (RAM) and/or cache, etc. The nonvolatile memory may include, for example: read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer readable storage medium that can be executed by the processor 1311 to implement the data processing methods and/or other desired functions of the various embodiments of the present disclosure described above. Various contents such as an input signal, a signal component, a noise component, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 131 may further include: an input device 1313 and an output device 1314, among other components, interconnected by a bus system and/or another form of connecting mechanism (not shown). The input device 1313 may include, for example, a keyboard, a mouse, and the like. The output device 1314 can output various information to the outside and may include, for example, a display, speakers, a printer, a communication network and the remote output devices connected thereto, and the like.
Of course, only some of the components of the electronic device 131 relevant to the present disclosure are shown in fig. 13 for simplicity, components such as buses, input/output interfaces, and the like being omitted. In addition, the electronic device 131 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform steps in a data processing method according to various embodiments of the present disclosure described in the "exemplary methods" section of the present description.
The computer program product may write program code for performing the operations of embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform steps in a data processing method according to various embodiments of the present disclosure described in the above "exemplary method" section of the present disclosure.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, so that the same or similar parts between the embodiments are mutually referred to. For system embodiments, the description is relatively simple as it essentially corresponds to method embodiments, and reference should be made to the description of method embodiments for relevant points.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," "having," and the like are open words that mean "including but not limited to" and may be used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or", unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the apparatus, devices and methods of the present disclosure, components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects, and the like, will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

Claims (8)

1. A data processing method, comprising:
processing, by a computing processing unit on a bus, data transmitted in the current bus to obtain a data processing result;
determining a destination address corresponding to the data processing result; and
transmitting the data processing result to the destination address through the bus;
wherein the processing, by the computing processing unit on the bus, of the data transmitted in the current bus comprises:
receiving a data processing instruction via an instruction module in the computing processing unit;
controlling, by the instruction module, a first execution unit for performing mapping processing to write mapping table information transmitted in the current bus into a storage unit of the first execution unit, based on a write mapping table instruction in the data processing instruction; and
controlling, by the instruction module, the first execution unit to perform mapping processing on data to be mapped transmitted in the current bus according to the mapping table information stored in the storage unit, based on a data mapping instruction in the data processing instruction.
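To make the instruction-driven flow recited in claim 1 concrete, the following is a minimal software sketch; all class, method, and opcode names here are hypothetical illustrations, not part of the claimed hardware. An instruction module decodes a write mapping table instruction to load table information into the execution unit's storage, then decodes a data mapping instruction to map bus data through that stored table.

```python
class MappingExecutionUnit:
    """First execution unit: a storage unit holding a mapping table, applied to bus data."""

    def __init__(self):
        self.storage = {}  # the unit's storage for mapping table information

    def write_table(self, entries):
        # write mapping table information (arriving over the bus) into storage
        self.storage = dict(entries)

    def map_data(self, data):
        # map each datum to be mapped through the stored table
        return [self.storage[x] for x in data]


class InstructionModule:
    """Decodes data processing instructions and drives the execution unit."""

    def __init__(self, unit):
        self.unit = unit

    def execute(self, opcode, payload):
        if opcode == "WRITE_MAP_TABLE":   # write mapping table instruction
            self.unit.write_table(payload)
        elif opcode == "MAP_DATA":        # data mapping instruction
            return self.unit.map_data(payload)
        else:
            raise ValueError(f"unknown instruction: {opcode}")


im = InstructionModule(MappingExecutionUnit())
im.execute("WRITE_MAP_TABLE", {0: 0, 1: 10, 2: 20})  # table information from the bus
result = im.execute("MAP_DATA", [2, 0, 1])           # data to be mapped from the bus
```

Because the table lives in the unit's own storage, a single write mapping table instruction can serve many subsequent data mapping instructions without re-transmitting the table.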
2. The method of claim 1, wherein processing, by at least one execution unit of the computing processing unit, the data transmitted in the current bus based on the data processing instruction further comprises:
controlling, by the instruction module, based on the data processing instruction, at least one of a second execution unit for implementing multiplication, a third execution unit for implementing addition, a fourth execution unit for implementing a maximum-value operation, and a fifth execution unit for implementing gating in the computing processing unit to process the data transmitted in the current bus.
3. The method of claim 1, wherein the controlling, by the instruction module, the first execution unit to perform mapping processing on the data to be mapped transmitted in the current bus according to the mapping table information stored in the storage unit, based on the data mapping instruction in the data processing instruction, comprises:
controlling, by the instruction module, the first execution unit to perform nonlinear activation mapping processing on the data to be mapped transmitted in the current bus according to nonlinear activation mapping table information stored in the storage unit, based on a nonlinear activation instruction in the data processing instruction; or
controlling, by the instruction module, the first execution unit to perform tone mapping processing on the data to be mapped transmitted in the current bus according to tone mapping table information stored in the storage unit, based on a tone mapping instruction in the data processing instruction.
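The two alternatives of claim 3 differ only in which table the storage unit holds: a nonlinear activation table or a tone mapping table, each applied by the same lookup mechanism. A hedged illustration follows; the quantized ReLU and gamma curve below are example table contents chosen for the sketch, not tables specified by the patent.

```python
def build_relu_table(bits=4):
    # example nonlinear activation mapping table: quantized ReLU over signed `bits`-bit inputs
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1))
    return {x: max(0, x) for x in range(lo, hi)}

def build_tone_table(gamma=2.0, levels=16):
    # example tone mapping table: a gamma curve over `levels` intensity codes
    return {x: round(((x / (levels - 1)) ** (1.0 / gamma)) * (levels - 1))
            for x in range(levels)}

def apply_table(table, data):
    # the first execution unit simply looks each datum up in whichever table is stored
    return [table[x] for x in data]

activated = apply_table(build_relu_table(), [-3, 0, 5])  # nonlinear activation instruction
toned = apply_table(build_tone_table(), [0, 15])         # tone mapping instruction
```

Table lookup lets one execution unit serve arbitrarily shaped nonlinear functions, since the function's shape is data (the table), not logic.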
4. The method of claim 1, wherein the processing of the data transmitted in the current bus further comprises:
controlling, by the instruction module, a second execution unit to write a coefficient transmitted in the current bus into a register of the second execution unit, based on a write coefficient instruction in the data processing instruction; and
controlling, by the instruction module, the second execution unit to perform multiplication processing on a multiplier according to the coefficient stored in the register, based on a data multiplication instruction in the data processing instruction;
wherein the multiplier comprises: data to be multiplied transmitted in the current bus and/or an output of at least one execution unit of the computing processing unit.
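A minimal sketch of claim 4's coefficient-register multiplication (class and method names are hypothetical): the coefficient is written into the register once and then reused across data multiplication instructions, so it need not travel with every operand.

```python
class MultiplyExecutionUnit:
    """Second execution unit: multiplies incoming multipliers by a stored coefficient."""

    def __init__(self):
        self.coeff = 1  # register written by the write coefficient instruction

    def write_coefficient(self, value):
        # coefficient transmitted in the current bus
        self.coeff = value

    def multiply(self, multipliers):
        # multipliers may be bus data and/or outputs of other execution units
        return [m * self.coeff for m in multipliers]


mul = MultiplyExecutionUnit()
mul.write_coefficient(3)
out = mul.multiply([1, 2, 4])
```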
5. The method of claim 1, wherein the processing of the data transmitted in the current bus further comprises:
controlling, by the instruction module, a connection state of multi-stage adder units in a third execution unit based on an addition instruction in the data processing instruction, so that each addend is respectively provided to a first-stage adder unit and addition processing is performed by each adder unit;
wherein the addend comprises: data to be added transmitted in the current bus and/or an output of at least one execution unit of the computing processing unit; and the first-stage adder unit comprises a plurality of adder units arranged in parallel.
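The multi-stage adder of claim 5 can be illustrated as a reduction tree: the first stage holds several parallel adders fed by the addends, and each later stage sums pairs of the previous stage's outputs until one result remains. This sketch assumes two-input adder units; the patent does not fix the fan-in.

```python
def adder_tree(addends):
    """Sum `addends` through successive stages of parallel two-input adder units."""
    stage = list(addends)
    while len(stage) > 1:
        if len(stage) % 2:
            stage.append(0)  # pad an odd count so every adder unit has two inputs
        # one stage of parallel two-input adder units
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]


total = adder_tree([1, 2, 3, 4, 5])
```

With n addends, the tree needs about log2(n) stages instead of n-1 sequential additions, which is why hardware adders are commonly staged this way.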
6. A data processing apparatus comprising:
a computing processing unit, located on a bus, configured to process data transmitted in the current bus to obtain a data processing result;
a destination address determining module configured to determine a destination address corresponding to the data processing result obtained by the computing processing unit; and
a transmission module configured to transmit, through the bus, the data processing result obtained by the computing processing unit to the destination address determined by the destination address determining module;
wherein the computing processing unit is configured to:
receive a data processing instruction via an instruction module in the computing processing unit;
control, by the instruction module, a first execution unit for performing mapping processing to write mapping table information transmitted in the current bus into a storage unit of the first execution unit, based on a write mapping table instruction in the data processing instruction; and
control, by the instruction module, the first execution unit to perform mapping processing on data to be mapped transmitted in the current bus according to the mapping table information stored in the storage unit, based on a data mapping instruction in the data processing instruction.
7. A computer-readable storage medium storing a computer program for performing the method of any one of claims 1-5.
8. An electronic device, the electronic device comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-5.
CN202010495340.1A 2020-06-03 2020-06-03 Data processing method, device, medium and equipment Active CN111639045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010495340.1A CN111639045B (en) 2020-06-03 2020-06-03 Data processing method, device, medium and equipment


Publications (2)

Publication Number Publication Date
CN111639045A CN111639045A (en) 2020-09-08
CN111639045B true CN111639045B (en) 2023-10-13

Family

ID=72331379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010495340.1A Active CN111639045B (en) 2020-06-03 2020-06-03 Data processing method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN111639045B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN86101043A (en) * 1985-02-12 1987-02-04 得克萨斯仪器公司 Microprocessor using block move instruction
JPH06168116A (en) * 1992-11-30 1994-06-14 Mitsubishi Electric Corp Microcomputer
CN101529387A (en) * 2006-11-09 2009-09-09 索尼计算机娱乐公司 Multiprocessor system, its control method, and information recording medium
CN108401584B (en) * 2013-11-11 2015-07-15 中国电子科技集团公司第五十四研究所 A kind of synchrotron suitable for baseband processing chip
CN105892989A (en) * 2016-03-28 2016-08-24 中国科学院计算技术研究所 Neural network accelerator and operational method thereof
CN109886399A (en) * 2019-02-13 2019-06-14 上海燧原智能科技有限公司 A kind of tensor processing unit and method
CN110968532A (en) * 2018-09-29 2020-04-07 上海寒武纪信息科技有限公司 Data transmission method and related product

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN86101043A (en) * 1985-02-12 1987-02-04 得克萨斯仪器公司 Microprocessor using block move instruction
JPH06168116A (en) * 1992-11-30 1994-06-14 Mitsubishi Electric Corp Microcomputer
US5729706A (en) * 1992-11-30 1998-03-17 Mitsubishi Denki Kabushiki Kaisha Microcomputer with improved data processing and data transfer capabilities
CN101529387A (en) * 2006-11-09 2009-09-09 索尼计算机娱乐公司 Multiprocessor system, its control method, and information recording medium
CN108401584B (en) * 2013-11-11 2015-07-15 中国电子科技集团公司第五十四研究所 A kind of synchrotron suitable for baseband processing chip
CN105892989A (en) * 2016-03-28 2016-08-24 中国科学院计算技术研究所 Neural network accelerator and operational method thereof
CN110968532A (en) * 2018-09-29 2020-04-07 上海寒武纪信息科技有限公司 Data transmission method and related product
CN109886399A (en) * 2019-02-13 2019-06-14 上海燧原智能科技有限公司 A kind of tensor processing unit and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qian Yi; Wang Qin; Wu Wei; Liu Jinlong. Research and Implementation of a Neural Network Parallel MIMD Processor. Journal of University of Electronic Science and Technology of China, 2008, (Issue 06), full text. *

Also Published As

Publication number Publication date
CN111639045A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN110597559B (en) Computing device and computing method
CN109523020A (en) A kind of arithmetic unit and method
CN107957976B (en) Calculation method and related product
CN112712172B (en) Computing device, method, integrated circuit and apparatus for neural network operations
CN111381871A (en) Operation method, device and related product
CN112835551B (en) Data processing method for processing unit, electronic device, and computer-readable storage medium
CN107957977B (en) Calculation method and related product
CN107943756B (en) Calculation method and related product
CN110337636A (en) Data transfer device and device
CN109711540B (en) Computing device and board card
CN111353124A (en) Operation method, operation device, computer equipment and storage medium
CN111639045B (en) Data processing method, device, medium and equipment
CN104793919A (en) Montgomery modular multiplication device and embedded security chip with same
CN115827555B (en) Data processing method, computer device, storage medium, and multiplier structure
CN111966473A (en) Operation method and device of linear regression task and electronic equipment
CN114239803B (en) Compiling method and device of neural network model, electronic equipment and storage medium
CN112801276B (en) Data processing method, processor and electronic equipment
CN112036561B (en) Data processing method, device, electronic equipment and storage medium
CN110673802B (en) Data storage method and device, chip, electronic equipment and board card
CN112596881A (en) Storage component and artificial intelligence processor
US6981012B2 (en) Method and circuit for normalization of floating point significants in a SIMD array MPP
US20230259581A1 (en) Method and apparatus for floating-point data type matrix multiplication based on outer product
CN117707472A (en) Data processing method, device, chip and electronic equipment
CN112148371B (en) Data operation method, device, medium and equipment based on single-instruction multi-data stream
CN113094171B (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant