CN114596184A - Method, device and storage medium for accumulating image data

Method, device and storage medium for accumulating image data

Info

Publication number
CN114596184A
CN114596184A (application CN202011411000.2A)
Authority
CN
China
Prior art keywords
data
group
image data
module
data group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011411000.2A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd
Priority to CN202011411000.2A
Publication of CN114596184A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/06 - Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G06F 7/08 - Sorting, i.e. grouping record carriers in numerical or other ordered sequence according to the classification of at least some of the information they carry
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/50 - Adding; Subtracting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method, a device and a storage medium for accumulating image data. The processing device of the invention is included in an integrated circuit device that also comprises a general interconnection interface and a computing device. The computing device interacts with the processing device to jointly complete computing operations specified by the user. The integrated circuit device may further comprise a storage device, connected to the computing device and the processing device, respectively, for storing data of the computing device and the processing device.

Description

Method, device and storage medium for accumulating image data
Technical Field
The present invention relates generally to the field of computers. More particularly, the present invention relates to a method, apparatus, and storage medium for floating-point number accumulation operations.
Background
With the development of computer technology, neural networks are increasingly applied to the processing of audio, video and other data, especially image data. The amount of data to be computed keeps growing, and accumulation is widely performed in various neural network operators. However, the biggest problem in the accumulation operation is that, in the accumulation of floating-point data, precision is lost because a large number "swallows" a small one (absorption error).
Therefore, a precision-compensation method of comparable effect that does not sacrifice performance is urgently needed.
Disclosure of Invention
To at least partially solve the technical problems mentioned in the background, aspects of the present invention provide a method, an apparatus, and a readable storage medium for accumulating image data.
In one aspect, the present disclosure discloses a method of accumulating image data, the method comprising: acquiring a plurality of image data to form a data group; sorting the data in the data group according to the magnitude of the numerical value; adding the data with the largest numerical value and the data with the smallest numerical value in the data group, and pairing and adding the remaining data according to this rule (second largest with second smallest, and so on) to obtain a plurality of intermediate data; updating the data group according to the plurality of intermediate data; and repeatedly executing the sorting, adding and updating steps with the updated data group until only one intermediate data is generated in the adding step, that intermediate data being the accumulation result of the data group.
In another aspect, the present invention discloses an apparatus for computing accumulated image data, the apparatus comprising: an acquisition module for acquiring a plurality of image data to form a data group; a sorting module for sorting the data in the data group according to the numerical value; a calculation module for adding the data with the largest numerical value and the data with the smallest numerical value in the data group, and pairing and adding the remaining data according to this rule to obtain a plurality of intermediate data; and an updating module for updating the data group according to the plurality of intermediate data; wherein the sorting, adding and updating steps are repeatedly executed with the updated data group until only one intermediate data is generated in the adding step, that intermediate data being the accumulation result of the data group.
In another aspect, the present invention discloses a computer readable storage medium having stored thereon computer program instructions, which, when executed by a processing device, perform the aforementioned method.
The image data are divided into a plurality of data groups. By adjusting the accumulation order of the data within each group, no extra memory is consumed, and the whole computation remains a single-step addition, so computational complexity does not increase and the performance of the whole algorithm does not degrade. Most importantly, by adjusting the accumulation order the invention ensures that data of the same order of magnitude are added in each accumulation, so that a large number never swallows a small one, effectively solving the precision-loss problem in accumulation scenarios.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
fig. 1 is a schematic structural diagram showing a board card according to an embodiment of the present invention;
FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 3 illustrates a flow diagram of a method of image data accumulation according to an embodiment of the present invention; and
fig. 4 is a schematic diagram of an apparatus illustrating an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Fig. 1 shows an application scenario of the present invention: a schematic structural diagram of a board card 10. As shown in fig. 1, the board card 10 includes a chip 101, a System-on-Chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial-intelligence arithmetic unit that supports various deep-learning and machine-learning algorithms, such as image segmentation and video encoding/decoding algorithms, and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining. Deep learning in particular is widely applied in the field of cloud intelligence, and one notable characteristic of cloud intelligence applications is the large volume of image data, which places high demands on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a Wi-Fi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a memory device 104 for storing data, such as video sequences and image data, comprising one or more memory units 105. The memory device 104 is connected to the control device 106 and the chip 101 through a bus and transfers data with them. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
Fig. 2 is a structural diagram showing the combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations. It is mainly implemented as a single-core or multi-core intelligent processor that performs deep-learning or machine-learning computations, and it can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input video image data from the processing device 203 via the interface device 202 and write the video image data to a storage device on the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into a control cache on the computing device 201. Alternatively, the interface device 202 may read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) or another general-purpose and/or special-purpose processor, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present invention may be considered on its own to have a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The DRAM 204 is used for storing the data to be processed. It is a DDR memory, typically 16 GB or larger, that stores data of the computing device 201 and/or the processing device 203.
The accumulation operation exists widely in various neural network operators, such as the softmax, batchnorm, l1loss, l2loss and reduce-sum operators. Its basic process is: sum = A1 + A2 + A3 + ... + An, where A1, A2, A3, ..., An are floating-point data and sum denotes the accumulated result of the floating-point data.
The accumulation scenario is quite common in neural network applications, such as the convolutional neural networks (CNNs) widely used in image classification tasks. Generally, during network training, the neural network adjusts the parameters of each layer using input training samples, so that the result computed by the network is as close as possible to the true result. Neural network training comprises forward propagation and backward propagation. Forward propagation computes an input training sample through each layer of the network based on the existing model, gradually extracting the input feature map into abstract features. Backward propagation computes a loss function from the forward-propagation result and the true value, and updates the parameters by gradient descent, using the chain rule to compute the partial derivative of the loss function with respect to each parameter. Training then continues with the updated parameters and is repeated many times until the forward-propagation result finally meets expectations.
In this embodiment, an epoch refers to one training pass over all training samples; the set of training samples is the training set, and training on one batch (of batch-size samples) is one iteration. For example, if the training set has 1000 training samples and the batch size is set to 10, each iteration trains on 10 samples and one epoch comprises 100 iterations. In practice, the training of a neural network model may run through many epochs, each time computing the loss function at the last layer. Taking the commonly used L2Loss as an example, its core operation is summation, and the amount of data accumulated is large, often on the order of billions of values.
For such large-scale data accumulation, a naive sequential accumulation eventually adds a partial sum on the order of billions to tiny values, producing the precision-loss phenomenon of a large number swallowing a small one. Absorption means that when two numbers of very different magnitude are added, the trailing mantissa digits of the smaller number are easily truncated, causing a precision error. For example, consider adding two 32-bit floating-point numbers, which carry 6-7 significant decimal digits. Specifically, a float has a 23-bit mantissa, and 2^23 = 8388608, a seven-digit number; this means a float carries at most 7 significant decimal digits, of which only 6 are guaranteed, so float precision is 6-7 significant digits. Assume two floating-point numbers a = 123456 and b = 2.189. The sum a + b should be 123458.189, but since only 6-7 significant digits are available, 123458.189 is not necessarily obtained and the trailing digits may be truncated; more precisely, the 8 in the hundredths place is not necessarily lost, but the 9 in the thousandths place certainly is. One addition has thus already produced an error of at least 0.009, and 1000 such additions accumulate an error of 9, which is clearly unacceptable. Therefore, precision compensation is required in floating-point accumulation operations.
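The absorption effect is easy to reproduce. Below is a minimal sketch, assuming NumPy's float32 type; the values and variable names are illustrative only and are not taken from the patent.

import numpy as np

# Single addition: trailing digits of the small addend are truncated.
a = np.float32(123456.0)
b = np.float32(2.189)
print(a + b)  # prints 123458.19, not exactly 123458.189

# The error compounds: each tiny addend is swallowed entirely.
total = np.float32(1.0e9)
for _ in range(1000):
    total += np.float32(0.009)
print(total)  # still 1e+09, although 9.0 should have been added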
If the common Kahan accumulation algorithm is adopted, Kahan first needs to occupy 4 blocks of memory, of which 3 are used to store intermediate temporary variables and only one stores valid data, wasting memory space. Second, as the Kahan formula shows, completing the whole Kahan algorithm involves 3 computation steps and 1 copy step; compared with the original single-step addition, the Kahan algorithm increases computational complexity, and in a hardware scenario bottlenecked by computation this means the performance of the whole algorithm drops to roughly a quarter.
In particular, the Kahan summation algorithm has the drawback that 3 intermediate variables c, y and t are needed to hold intermediate results. This drawback is not prominent in scalar computation, but in vector computation three on-chip buffers, each equal to the size of the input data, are required to hold the intermediate variables. Since the Kahan algorithm requires three extra buffers for intermediate variables, for each block of input-data size there are 3 corresponding intermediate-variable blocks, so the on-chip space must be divided into 4 parts, of which only one actually stores the data being computed; that is, the on-chip space available for input data shrinks to a quarter. For parallel computing, on-chip space to some extent determines computation performance, so this directly reduces computation performance to roughly a quarter as well.
By adjusting the accumulation order, the invention needs no extra memory, and the whole computation remains a single-step addition, so computational complexity does not increase and the performance of the whole algorithm does not degrade. Most importantly, by adjusting the accumulation order the invention ensures that data of the same order of magnitude are added in each accumulation, so that a large number never swallows a small one, effectively solving the precision-loss problem in accumulation scenarios.
This embodiment provides a scheme for accumulating image data based on the aforementioned hardware environment.
FIG. 3 is a flow chart of a method for image data accumulation according to an embodiment of the invention. The method is applied to a single processor core or multiple processor cores in the board card 10 or the combined processing device 20.
Step 301, acquiring a plurality of image data to form a data group. When processing data using a neural network, data preprocessing is first performed on a large amount of acquired data to form a data group. Wherein step 301 comprises:
step 302, classifying the plurality of image data into at least one level group by order of magnitude.
An order of magnitude is a scale or level of size, with a fixed ratio maintained between adjacent levels; the ratios generally used are 10, 2, 1000 and 1024. Unless otherwise specified, scientific notation with base 10 is used, so the order of magnitude is the power of 10. For example: 1.0e9 denotes 10 to the 9th power, order of magnitude 9; 1.0e-6 denotes 10 to the negative 6th power, order of magnitude -6; 3.2e8 denotes 3.2 times 10 to the 8th power. Classifying image data by order of magnitude means that image data of the same order of magnitude are placed in the same level group, and image data of different orders of magnitude are placed in different level groups. For example, if a set of image data are all of the form 1.ne9 (n a positive integer), their order of magnitude is 9; they are all of the same order and are placed in one level group. As another example, if a group of image data includes values of the forms 1.ne3, 1.me9 and 1.pe6 (n, m, p positive integers), the group contains data of orders 3, 9 and 6 and can be divided into three level groups accordingly.
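As an illustration of step 302, the classification into level groups can be sketched as follows. This is a minimal Python sketch with illustrative names (classify_by_magnitude is ours, not the patent's); it assumes nonzero inputs and base-10 orders of magnitude.

import math
from collections import defaultdict

def classify_by_magnitude(values):
    # Values sharing the same base-10 order of magnitude form one level group.
    groups = defaultdict(list)
    for v in values:
        order = math.floor(math.log10(abs(v)))  # e.g. 3.2e8 -> 8, 1.0e-6 -> -6
        groups[order].append(v)
    return groups

print(dict(classify_by_magnitude([1.1e9, 1.5e9, 1.2e3, 1.7e6])))
# {9: [1100000000.0, 1500000000.0], 3: [1200.0], 6: [1700000.0]}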
Step 303, determining whether all the level groups have been designated. "Designated" means that a divided level group is designated as a data group, on which the subsequent computation is performed. Level groups that have already been designated are not designated again.
If not all the level groups have been designated, step 304 is executed to designate one of the at least one level group as the data group. The manner of designating a level group is not limited: a level group may be designated randomly, or in a certain order according to the actual situation, such as from large to small by order of magnitude, or by the parity of the order of magnitude. For example, suppose a group of image data is divided into six level groups, level group one through level group six, none of which has been designated. Any one of the six may be designated as the data group first; they may also be designated in order of the level groups, from small to large, or in other ways.
Further, a plurality of level groups may be designated as corresponding data groups simultaneously, and after designation the data groups may be processed in parallel (executing the steps after step 304; the specific steps are described later). For example, the six level groups of the image data are simultaneously designated as six corresponding data groups, data group one through data group six; the combined processing device 20 includes a plurality of processor cores, each processor core processes one data group, the plurality of processor cores process the plurality of data groups in parallel, and each core produces the corresponding final intermediate result for its data group.
Then, in step 305, the data in the data group are sorted by numerical value. The data in the one or more designated data groups are sorted by size, in increasing or decreasing order. Different data groups may be sorted in different ways; for example, data group one may be sorted in increasing order while data group two is sorted in decreasing order. The invention does not limit the sorting method.
Step 306, adding the data with the largest numerical value and the data with the smallest numerical value in the data group, and pairing and adding the remaining data according to this rule, to obtain a plurality of intermediate data.
To avoid the precision loss caused by an overly large difference between addends during accumulation, the invention adds the data with the largest value to the data with the smallest value in the data group, the data with the second largest value to the data with the second smallest value, and so on, so that the resulting intermediate data differ as little as possible and precision loss in the subsequent accumulation is reduced.
For example, after step 305 a data group A1, A2, A3, ..., An is obtained, arranged from small to large. In step 306: A1' = A1 + An; A2' = A2 + A(n-1); A3' = A3 + A(n-2); ...; A(n/2)' = A(n/2) + A(n/2+1), yielding n/2 intermediate data (or (n/2) + 1 when n is odd, the middle element passing through unpaired).
Step 307, judging whether the number of intermediate data is one. If not, that is, if there is more than one intermediate data, the data in the data group have not all been combined, and step 308 is executed to update the data group according to the plurality of intermediate data, that is, to replace all data in the data group with the intermediate data obtained in step 306.
Then the flow returns to step 305, and steps 305, 306 and 308 are repeatedly performed on the data in the updated data group.
The data group from step 305 above is still used as an example. The first round of accumulation yields the intermediate data A1', A2', A3', ..., A(n/2)', which means the data in the data group have not all been combined, so the data group is updated according to these intermediate data; the data in the updated data group are these intermediate data, and the updated data group is added round after round until, in step 306, the last round of accumulation yields a single intermediate data sum, which is the accumulated sum of all data in the data group. At this point only one intermediate data is generated in the adding step, and it is the accumulation result of the data group. Since step 307 determining that only 1 intermediate result remains indicates that all data of this group have been accumulated and the next level group needs to be processed, the flow returns to step 303 to judge whether any level group has not yet been designated. If all the level groups have been designated, step 309 is executed to sum the accumulation results of all the level groups to obtain the accumulation result of the image data.
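Steps 305 to 308 for a single data group can be sketched as follows; this is a minimal Python sketch whose function and variable names are ours, not the patent's. The middle element of an odd-length group carries over unpaired, matching the n/2 or (n/2) + 1 intermediate-data count noted above.

def accumulate_group(group):
    data = list(group)
    while len(data) > 1:
        data.sort()                       # step 305: sort by value
        half = len(data) // 2
        paired = [data[i] + data[-1 - i]  # step 306: largest + smallest,
                  for i in range(half)]   # second largest + second smallest, ...
        if len(data) % 2 == 1:
            paired.append(data[half])     # odd count: middle element carries over
        data = paired                     # step 308: update the data group
    return data[0]                        # step 307: one intermediate data left

print(accumulate_group([1.3e9, 1.1e9, 1.2e9, 1.4e9, 1.05e9]))  # 6.05e9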
In more detail, when all level groups have been designated, the data of every level group have been processed through steps 305-307, and each level group finally yields a corresponding intermediate result. Because the level groups were divided by order of magnitude, the data in each level group have a different order of magnitude, and the final intermediate data obtained from the data groups corresponding to different level groups also differ in order of magnitude. For example, computing 1.0e9 + 1.0e-6 + 1.0e-6 naively yields 1000000000.0000019, while the true value is 1000000000.000002, i.e. an error. Therefore, the final intermediate data of different orders of magnitude are summed based on the Kahan summation algorithm to obtain the accumulation result of the plurality of image data.
The Kahan summation algorithm is another method that avoids a large number swallowing a small one during addition and reduces precision loss. It keeps an extra number that remembers which fraction was truncated. Performing the same computation as before, assume a = 123456 and b = 2.189, and compute a + b. The Kahan summation algorithm does this: sum = a + b (inaccurate), and temp = (a + b) - a - b; temp may yield a result other than 0, here equal to -0.009, the truncated fraction. This error is saved in a temporary variable and can be compensated for in the next addition, updating sum.
It can also be understood this way: because the running sum is too large, it crowds out the precision of the small addend. The lost fraction seems negligible at the moment, but over many iterations such fractions can accumulate into a large error; therefore the error fraction is stored separately in temp, avoiding the problem of the small number being swallowed by the large one.
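For reference, the classic Kahan (compensated) summation used in step 309 to combine the per-group results can be sketched as follows. The variable names follow the c, y, t convention mentioned earlier; this is the generic textbook formulation, not code from the patent.

def kahan_sum(values):
    total = 0.0
    c = 0.0                  # compensation: the running truncated fraction
    for x in values:
        y = x - c            # re-inject the fraction lost last time
        t = total + y        # the large total swallows the low bits of y
        c = (t - total) - y  # recover exactly what was swallowed
        total = t
    return total

print(kahan_sum([1.0e9, 1.0e-6, 1.0e-6]))  # 1000000000.000002, the true value above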
In an alternative embodiment, the method of accumulating image data described above may be applied in neural network computations.
Fig. 4 shows another embodiment of the present invention: an apparatus 400 for image data accumulation. The apparatus 400 comprises an obtaining module 401, a sorting module 402, a calculating module 403 and an updating module 404. The obtaining module 401 is configured to acquire a plurality of image data to form a data group; the sorting module 402 is configured to sort the data in the data group according to the magnitude of the numerical value; the calculating module 403 is configured to add the data with the largest numerical value and the data with the smallest numerical value in the data group, and to pair and add the remaining data according to this rule to obtain a plurality of intermediate data; and the updating module 404 is configured to update the data group according to the plurality of intermediate data.
The sorting, calculating and updating modules 402, 403, 404 repeat the sorting, adding and updating operations with the updated data group until the calculating module 403 generates only one intermediate data, which is the accumulated result of the data group.
In one possible embodiment, the obtaining module 401 includes a classifying module 411 and a determining module 412. The classifying module 411 is configured to classify the plurality of image data into at least one level group by order of magnitude; the determining module 412 is configured to determine whether all the level groups have been designated, and if not, to designate one of the at least one level group as the data group.
In an alternative embodiment, the apparatus 400 further comprises an accumulation module 405: if all the level groups have been designated, the accumulation module 405 accumulates all the accumulation results based on the Kahan summation algorithm to obtain the accumulation result of the plurality of image data. The apparatus 400 may be applied to neural network computations.
When large-scale data are accumulated, in order to avoid a large number swallowing a small one, the image data are divided into a plurality of levels by order of magnitude, each level containing data of the same order of magnitude. The data of each level are accumulated by pairing the maximum with the minimum value in each group, the second largest with the second smallest, and so on, so that the one or more intermediate values after each round of accumulation are approximately equal; this reduces precision loss during accumulation and improves the accuracy of the computation.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein can be practiced in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of the connection relationships between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solution described in the embodiments of the present invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
Clause A1, a method of accumulating image data, the method comprising: acquiring a plurality of image data to form a data group; sorting the data in the data group according to the magnitude of the numerical value; adding the data with the largest numerical value and the data with the smallest numerical value in the data group, and pairing and adding the remaining data according to this rule to obtain a plurality of intermediate data; and
updating the data group according to the plurality of intermediate data; and repeatedly executing the sorting, adding and updating steps with the updated data group until only one intermediate data is generated in the adding step, the intermediate data being the accumulation result of the data group.
Clause A2, the method of clause A1, wherein the obtaining step comprises: classifying the plurality of image data into at least one level group by order of magnitude; and judging whether all the level groups have been designated, and if not, designating one of the at least one level group as the data group.
Clause A3, the method of clause A2, further comprising: if all the level groups have been designated, accumulating all the accumulation results based on the Kahan summation algorithm to obtain the accumulation result of the plurality of image data.
Clause A4, the method of clause A3, applied in neural network computing.
Clause A5, an apparatus for computing accumulated image data, wherein the apparatus comprises: an acquisition module for acquiring a plurality of image data to form a data group; a sorting module for sorting the data in the data group according to the numerical value; a calculation module for adding the data with the largest numerical value and the data with the smallest numerical value in the data group, and pairing and adding the remaining data according to this rule to obtain a plurality of intermediate data; and an updating module for updating the data group according to the plurality of intermediate data; wherein the sorting, calculation and updating modules repeatedly execute the sorting, adding and updating operations with the updated data group until the calculation module generates only one intermediate data, the intermediate data being the accumulation result of the data group.
Clause A6, the apparatus of clause A5, wherein the obtaining module comprises: a classification module for classifying the plurality of image data into at least one level group by order of magnitude; and a judging module for judging whether all the level groups have been designated, and if not, designating one of the at least one level group as the data group.
Clause A7, the apparatus of clause A6, wherein the apparatus further comprises an accumulation module that, if all the level groups have been designated, accumulates all accumulation results based on the Kahan summation algorithm to obtain the accumulation result of the plurality of image data.
Clause A8, the apparatus of clause A7, wherein the apparatus is applied to neural network computing.
Clause A9, a computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, implement the method of any one of clauses A1 to A4.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (9)

1. A method of accumulating image data, the method comprising:
acquiring a plurality of image data to form a data group;
sorting the data in the data group according to the numerical value;
adding the data with the largest numerical value and the data with the smallest numerical value in the data group, and pairing and adding the remaining data according to this rule to obtain a plurality of intermediate data; and
updating the data group according to the plurality of intermediate data;
and repeatedly executing the sorting, adding and updating steps by using the updated data group until only one intermediate data is generated in the adding step, wherein the intermediate data is an accumulation result of the data group.
2. The method of claim 1, wherein the obtaining step comprises:
classifying the plurality of image data into at least one level group by order of magnitude;
and judging whether all the level groups are designated, if not, designating one of the at least one level group as the data group.
3. The method of claim 2, further comprising:
and if all the level groups have been designated, accumulating all the accumulation results based on the Kahan summation algorithm to obtain the accumulation result of the plurality of image data.
4. The method according to any one of claims 1-3, wherein the method is applied in neural network computing.
5. An apparatus for computing accumulated image data, the apparatus comprising:
the acquisition module is used for acquiring a plurality of image data to form a data group;
the sorting module is used for sorting the data in the data group according to the numerical value;
the calculation module is used for adding the data with the largest numerical value and the data with the smallest numerical value in the data group, and for pairing and adding the remaining data according to this rule to obtain a plurality of intermediate data; and
the updating module is used for updating the data group according to the plurality of intermediate data;
the sorting, calculating and updating modules repeatedly execute the sorting, adding and updating operations with the updated data group until the calculating module generates only one intermediate data, wherein the intermediate data is the accumulation result of the data group.
6. The apparatus of claim 5, wherein the obtaining module comprises:
a classification module to classify the plurality of image data into at least one level group by order of magnitude;
and the judging module is used for judging whether all the level groups are designated or not, and if not, designating one of the at least one level group as the data group.
7. The apparatus of claim 6, further comprising an accumulation module that accumulates all accumulation results based on the Kahan summation algorithm to obtain an accumulation result for the plurality of image data if all of the level groups have been designated.
8. The apparatus of claim 7, wherein the apparatus is applied to neural network computations.
9. A computer readable storage medium having stored thereon computer program instructions for accumulating image data, wherein the computer program instructions, when executed by a server, implement the method of any of claims 1 to 4.
CN202011411000.2A 2020-12-04 2020-12-04 Method, device and storage medium for accumulating image data Pending CN114596184A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011411000.2A CN114596184A (en) 2020-12-04 2020-12-04 Method, device and storage medium for accumulating image data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011411000.2A CN114596184A (en) 2020-12-04 2020-12-04 Method, device and storage medium for accumulating image data

Publications (1)

Publication Number Publication Date
CN114596184A (en) 2022-06-07

Family

ID=81802994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011411000.2A Pending CN114596184A (en) 2020-12-04 2020-12-04 Method, device and storage medium for accumulating image data

Country Status (1)

Country Link
CN (1) CN114596184A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination