WO2022001438A1 - A computing device, integrated circuit chip, board, apparatus and computing method - Google Patents

A computing device, integrated circuit chip, board, apparatus and computing method

Info

Publication number
WO2022001438A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
result
circuit
component
operated
Prior art date
Application number
PCT/CN2021/094467
Other languages
English (en)
French (fr)
Inventor
陶劲桦
喻歆
刘少礼
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司
Priority to US18/003,687 (published as US20230305840A1)
Publication of WO2022001438A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • G06F9/30014Arithmetic instructions with variable precision
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F2015/761Indexing scheme relating to architectures of general purpose stored programme computers
    • G06F2015/763ASIC

Definitions

  • This disclosure relates generally to the field of data processing. More specifically, the present disclosure relates to a computing device, an integrated circuit chip, a board, an apparatus, and a computing method.
  • Processors may handle data of different bit widths, but the data bit width they can process is often limited.
  • For example, the data bit width that can usually be processed is no more than 16 bits, such as 16-bit integer data.
  • How to enable a processor with a limited bit width to process data of a larger bit width therefore becomes a technical problem that needs to be solved.
  • To this end, the present disclosure proposes, in various aspects, to use small-bit-width components (i.e., data with fewer bits) of large-bit-width data (i.e., data with more bits) in place of the large-bit-width data itself to participate in the calculation.
  • In a first aspect, the present disclosure provides a computing device comprising an arithmetic circuit configured to: receive a plurality of data to be operated on associated with an operation instruction, wherein at least one of the data to be operated on is characterized by two or more components, the at least one data to be operated on has a source data bit width, each component has a respective target data bit width, and the target data bit width is smaller than the source data bit width; and use the two or more components in place of the characterized data to be operated on to perform the operation specified by the operation instruction, so as to obtain two or more intermediate results.
  • the computing device further includes a combining circuit configured to: combine the above-mentioned intermediate results to obtain a final result; and a storage circuit configured to store the above-mentioned intermediate results and/or final results.
  • the present disclosure provides an integrated circuit chip comprising the computing device of the aforementioned first aspect.
  • the present disclosure provides an integrated circuit board including the integrated circuit chip of the aforementioned second aspect.
  • the present disclosure provides a computing device including the board of the aforementioned third aspect.
  • the present disclosure provides a method performed by a computing device.
  • The method includes: receiving a plurality of data to be operated on associated with an operation instruction, wherein at least one data to be operated on is characterized by two or more components, the at least one data to be operated on has a source data bit width, each component has a respective target data bit width, and the target data bit width is smaller than the source data bit width; using the two or more components in place of the characterized data to be operated on to perform the operation specified by the operation instruction, so as to obtain two or more intermediate results; and combining the above intermediate results to obtain a final result.
  • The solution of the present disclosure uses the small-bit-width components of large-bit-width data to participate in the calculation in place of the large-bit-width data itself.
  • In artificial intelligence application scenarios such as neural network computing, or in other general scenarios, the calculation is therefore not limited by the processing bit width of the processor, and the computing power of the processor can be fully utilized.
  • the solution of the present disclosure can also perform computation by replacing large-bit-width data with at least two small-bit-width components, thereby simplifying the computing of the neural network and improving the computing efficiency.
  • FIG. 1 is a simplified block diagram illustrating a computing device according to an embodiment of the present disclosure
  • FIG. 2 is a detailed block diagram illustrating a computing device according to an embodiment of the present disclosure
  • FIG. 3 is a detailed block diagram illustrating a computing device according to an embodiment of the present disclosure.
  • FIG. 4 is a detailed block diagram illustrating a computing device according to an embodiment of the present disclosure.
  • FIG. 5 is a detailed block diagram illustrating a computing device according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart illustrating a computing method of a computing device according to an embodiment of the present disclosure
  • FIG. 7 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • Similarly, the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined”, “in response to the determination”, “once the [described condition or event] is detected”, or “in response to detection of the [described condition or event]”.
  • As mentioned above, the present disclosure proposes, in various aspects, a scheme of using the small-bit-width components of large-bit-width data in place of the large-bit-width data to participate in the calculation. Since at least two small-bit-width data are used to represent the large-bit-width data and perform the operation in its place, the operation results obtained with the small-bit-width data need to be combined to obtain the final result.
  • The disclosed scheme overcomes the obstacle of limited processor bit width by characterizing, for example, large-bit-width (e.g., 24-bit) data with at least two small-bit-width (e.g., 16-bit and 8-bit) data items (or components).
  • the solutions of the present disclosure are particularly suitable for operation processing involving multiplication operations, such as multiplication or multiply-add operations, which may include, for example, convolution operations. Therefore, the solutions of the present disclosure can be used to perform neural network operations, especially to process weight data and neuron data to obtain desired operation results.
  • the neural network is a convolutional neural network for an image
  • the weight data may be convolution kernel data
  • the neuron data may be, for example, pixel data of an image or output data after a previous layer operation.
  • FIG. 1 is a simplified block diagram illustrating a computing device 100 in accordance with an embodiment of the present disclosure.
  • The computing device 100 can be used for operation processing of large-bit-width data in various application scenarios, such as artificial intelligence applications including neural network operations, or general-purpose scenarios that need to decompose large-bit-width data into small-bit-width data for computing.
  • the computing device 100 includes an arithmetic circuit 110 , a combinational circuit 120 and a storage circuit 130 .
  • the operational circuit 110 may be configured to: receive a plurality of data to be operated on associated with an operational instruction, wherein at least one of the data to be operated on is characterized by two or more components.
  • the at least one data to be operated has a source data bit width, and each component has a respective target data bit width, wherein the target data bit width is smaller than the source data bit width.
  • In some scenarios, the data bit width of the data to be operated on may exceed the processing bit width of the operation circuit. Based on this, the data to be operated on with a large bit width (the source data bit width) can be decomposed into two or more components with small bit widths (the target data bit width).
  • the decomposition of the data to be operated on can be accomplished based on various existing and/or future developed data decomposition techniques to decompose into two or more components.
  • the number of components used to characterize the data to be operated on may be determined based at least in part on the source data bit width of the data to be operated on and the data bit width supported by the operational circuit.
  • The target data bit width may be determined based at least in part on the data bit width supported by the arithmetic circuit. For example, when the data to be operated on has a data bit width of 24 bits and the operation circuit supports a maximum data bit width of 16 bits, the data to be operated on can, in one example, be decomposed into two components with unequal target data bit widths, namely: an 8-bit high-order component and a 16-bit low-order component, or a 16-bit high-order component and an 8-bit low-order component.
  • In another example, the data to be operated on can be decomposed into 3 components with the same target data bit width, namely: an 8-bit high-order component, an 8-bit middle-order component, and an 8-bit low-order component.
  • the present disclosure has no limitation in this respect, as long as the target data bit width of the components obtained by decomposing meets the processing bit width limitation of the operation circuit.
  • the data to be calculated can be decomposed into multiple components according to the required number of components and the target data bit width of each component, wherein each component has a corresponding component value and component scaling factor.
  • a possible data decomposition method is briefly described below by taking the decomposition of a large bit-width data into two small-bit-width components as an example, but those skilled in the art can understand that the present disclosure is not limited in this respect.
  • the large bit width data is decomposed into two components: a first component and a second component.
  • the first component may be a high digital component or a low digital component; correspondingly, the second component may be a low digital component or a high digital component.
  • The component scaling factor of each component can be determined based on the target data bit width of each component and/or the bit position of each component in the data before decomposition (the large-bit-width data). For example, when the target data bit width of the first component (in this example, the high-order component) is n1 and the target data bit width of the second component (in this example, the low-order component) is n2, then when the sign bit included in n2 is disregarded, the component scaling factor of the first component may be 2^(n2-1); in contrast, when the sign bit included in n2 is taken into account, the component scaling factor of the first component may be 2^n2. In general, the component scaling factor of the low-order component defaults to 1.
  • First, a calculation may be performed on the large-bit-width data to be decomposed to obtain the component value of the first component.
  • The first component is characterized by its component value and the corresponding component scaling factor.
  • Next, a calculation may be performed based on the large-bit-width data to be decomposed and the previously obtained component value of the first component, so as to obtain the component value of the second component.
  • For example, the component value of the second component can be obtained by subtracting the first component from the data to be decomposed.
  • Here the first component is the product of its component value and the corresponding component scaling factor.
  • In this way, the large-bit-width data can be decomposed into two components, each of which has a corresponding component value and component scaling factor.
  • When more components are required, the above method can be performed iteratively until the desired number of components is obtained. For example, for data with a 24-bit data bit width, when it is determined that it needs to be decomposed into 3 components, all with an 8-bit data bit width, the 24-bit data can first be decomposed into a first component with an 8-bit data bit width and an intermediate component with a 16-bit data bit width. Next, the above steps are repeated for the 16-bit intermediate component, which is further decomposed into a second component with an 8-bit data bit width and a third component with an 8-bit data bit width.
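The decomposition just described, including its iterative application, can be sketched as follows. This is a minimal Python sketch assuming unsigned data and power-of-two component scaling factors; the function name `decompose` is hypothetical, and sign-bit handling is not modeled:

```python
def decompose(value, widths):
    """Split a non-negative integer into components, most significant first.

    widths lists the target data bit width of each component, high to low;
    their sum is the source data bit width.  Each component is returned as a
    (component_value, component_scaling_factor) pair such that
    value == sum(v * s for v, s in components).
    """
    components = []
    shift = sum(widths)                        # source data bit width
    for w in widths:
        shift -= w                             # bit position of this component
        v = (value >> shift) & ((1 << w) - 1)  # extract the w-bit component value
        components.append((v, 1 << shift))     # low-order component scale is 1
    return components

# A 24-bit value split into a 16-bit high and an 8-bit low component,
# and alternatively into three 8-bit components (the iterative case).
x = 0xABCDEF
assert decompose(x, [16, 8]) == [(0xABCD, 256), (0xEF, 1)]
assert decompose(x, [8, 8, 8]) == [(0xAB, 65536), (0xCD, 256), (0xEF, 1)]
```

Each recombination check confirms that the sum of component value times scaling factor reproduces the source data.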
  • the arithmetic circuit 110 may be further configured to perform the operation specified by the operation instruction using the received two or more components in place of the represented data to be operated on to obtain two or more intermediate results.
  • the operation circuit 110 may be configured to perform a specified operation on two or more components of one data to be operated on and corresponding data of other data to be operated on, respectively, and output the corresponding operation result to the combination circuit 120 .
  • other data to be operated on may include one or more data to be operated on.
  • Each data to be operated may have different data bit widths.
  • When the data bit width of the data to be operated on satisfies the processing bit width limitation of the operation circuit, it may not need to be decomposed, and the original data may be used in the operation.
  • the corresponding data of these other data to be operated may include any one of the following: original data of the data to be operated, or at least one component representing the data to be operated.
  • the arithmetic circuit 110 can use the received data to perform an operation specified by the arithmetic instruction, thereby obtaining two or more intermediate results, and output to the combination circuit 120 .
  • the operation circuit 110 can perform the specified operation in the order in which the two or more components are received, thereby obtaining intermediate results in sequence and outputting them to the combination circuit 120 .
  • the order of the components may include, for example, high-order to low-order, or low-order to high-order.
  • the combining circuit 120 may be configured to combine the intermediate results input from the arithmetic circuit 110 to obtain a final result. As mentioned above, since at least one data to be operated on uses its two or more components to replace the operation, the operation performed using each component obtains an intermediate result, which needs to be combined to obtain the final result.
  • In some embodiments, the combining circuit may be further configured to perform a weighted combination of these operation results, as intermediate results, to obtain a final result. The components that participate in the operation in place of the original data to be operated on have corresponding component values and component scaling factors, and during operation the operation circuit 110 may use only the component values to obtain an intermediate result; the combining circuit 120 may therefore take the component scaling factors of the components participating in the operation into account to weight the intermediate results.
  • combinational circuits will be described in detail later based on several embodiments.
  • Computing device 100 may also include storage circuitry 130 configured to store the above-described intermediate and/or final results.
  • the result obtained by the operation circuit 110 using the components to perform the operation is an intermediate result
  • these intermediate results need to be combined.
  • Cyclic combination, such as weighted accumulation, can be performed as the intermediate results are generated, so these intermediate results can be stored temporarily or long-term using a storage circuit.
  • the intermediate result and the final result may share storage space in the storage circuit, thereby saving storage space.
  • the storage circuit 130 can also be used to store other data and information, for example, intermediate data generated during the operation of the operation circuit 110 that needs to be stored, and the present disclosure is not limited in this respect.
  • FIG. 2 is a detailed block diagram illustrating a computing device 200 according to an embodiment of the present disclosure.
  • As mentioned above, the solution of the present disclosure is particularly suitable for arithmetic processing involving multiplication operations. Therefore, in this embodiment, the operation circuit 210 of the computing device 200 may specifically be implemented as a multiplication circuit 211 or a multiply-add circuit 212.
  • the multiply-add circuit 212 can be used to implement a convolution operation, for example.
  • the component participating in the operation instead of the original data to be operated has a corresponding component value and a component scaling factor
  • the component scaling factor is associated with the digit position of the component in the represented data to be operated
  • the multiplication circuit 211 or the multiplication and addition circuit 212 may only use the component values to perform the operation to obtain the operation result as an intermediate result. The effects of the component scaling factors can then be processed by the combinational circuit 220 .
  • the combining circuit 220 may include a weighting circuit 221 and an adding circuit 222 .
  • The weighting circuit 221 may be configured to use a weighting factor to weight the current operation result of the operation circuit 210 (such as the product result of the multiplication circuit 211 or the multiply-add result of the multiply-add circuit 212), or the previous combination result of the combining circuit 220.
  • For different operation results, the weighting factor can also be different.
  • the weighting factors are determined based, at least in part, on component scaling factors of components that generate corresponding operation results.
  • the addition circuit 222 may be configured to accumulate the weighted result with other intermediate results to obtain a final result.
  • FIG. 3 is a detailed block diagram illustrating a computing device 300 according to an embodiment of the present disclosure.
  • an implementation of the weighting circuit 221 of FIG. 2 is further shown.
  • the weighted object is the current operation result of the operation circuit 210 .
  • the weighting circuit 321 may be configured to multiply the operation result of the operation circuit 310 by a first weighting factor to obtain a weighting result.
  • the first weighting factor may be the product of the component scaling factors of the components corresponding to the operation result.
  • the addition circuit 322 may be configured to accumulate the obtained weighted result with the previous addition result of the addition circuit 322 .
  • the operation instruction specifies to perform a multiplication operation on data A and data B of a large bit width.
  • Each of data A and data B has been decomposed into two components in advance.
  • data A and data B can be represented as: A = a1*scaleA1 + a0*scaleA0 and B = b1*scaleB1 + b0*scaleB0, where:
  • a1 and a0 are the component values of the high-digital component and the low-digital component of the data A, respectively; scaleA1 and scaleA0 are the corresponding component scaling factors, respectively.
  • b1 and b0 are the component values of the high-digit component and the low-digit component of the data B, respectively; scaleB1 and scaleB0 are the corresponding component scaling factors, respectively.
  • In this case, 4 multiplications need to be performed. Regardless of the order in which these 4 multiplications are performed, the final result can be obtained simply by adjusting the weighting factors accordingly.
  • the above multiplication operation can be expressed as:
  • A*B = (a1*scaleA1+a0*scaleA0)*(b1*scaleB1+b0*scaleB0)
  • the arithmetic circuit 310 can perform multiplication operations between the values of the respective components, which are respectively a1*b1, a1*b0, a0*b1 and a0*b0 in this example.
  • the corresponding first weighting factors of the above four intermediate results are respectively: scaleA1*scaleB1, scaleA1*scaleB0, scaleA0*scaleB1, and scaleA0*scaleB0.
  • the weighting circuit 321 uses the corresponding first weighting factor to weight the above-mentioned intermediate results respectively.
  • the addition circuit 322 may add the weighted intermediate results to obtain the final result.
  • some component scaling factors may have a value of 1, eg, scaleA0 or scaleB0 may be 1.
  • the corresponding multiplication may be omitted, for example, the calculation of scaleA1*scaleB0, scaleA0*scaleB1 and scaleA0*scaleB0 may be omitted.
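The weighted combination in this multiplication example can be sketched as follows; the concrete component values and the power-of-two scaling factors are hypothetical, chosen only so the result can be checked against a direct multiplication:

```python
# Hypothetical component values for two operands decomposed as in the
# example: 16-bit high components with scaling factor 256, 8-bit low
# components with scaling factor 1.
a1, a0, scaleA1, scaleA0 = 0x1234, 0x56, 256, 1
b1, b0, scaleB1, scaleB0 = 0x0ABC, 0xDE, 256, 1
A = a1 * scaleA1 + a0 * scaleA0   # the represented large-bit-width data A
B = b1 * scaleB1 + b0 * scaleB0   # the represented large-bit-width data B

# The operation circuit multiplies component values only; the combining
# circuit weights each intermediate result by its first weighting factor
# (the product of the corresponding component scaling factors) and
# accumulates.  The order of the four products does not matter.
intermediates = [
    (a1 * b1, scaleA1 * scaleB1),
    (a1 * b0, scaleA1 * scaleB0),
    (a0 * b1, scaleA0 * scaleB1),
    (a0 * b0, scaleA0 * scaleB0),
]
final = sum(tmp * w for tmp, w in intermediates)
assert final == A * B
```

Since scaleA0 and scaleB0 are 1 here, three of the four first weighting factors reduce to scaleA1, scaleB1 and 1, matching the simplification noted above.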
  • the operation instruction specifies to perform a convolution operation on data A and data B with a large bit width, where data A may be, for example, neurons in a neural network operation, and data B may be a weight in a neural network operation.
  • Each of data A and data B has been decomposed into two components in advance.
  • data A and data B can be represented as: A = a1*scaleA1 + a0*scaleA0 and B = b1*scaleB1 + b0*scaleB0, where:
  • a1 and a0 are the component values of the high-digital component and the low-digital component of the data A, respectively; scaleA1 and scaleA0 are the corresponding component scaling factors, respectively.
  • b1 and b0 are the component values of the high-digit component and the low-digit component of the data B, respectively; scaleB1 and scaleB0 are the corresponding component scaling factors, respectively.
  • In this case, 4 convolution operations need to be performed. Regardless of the order in which the four convolution operations are performed, the final operation result can be obtained simply by adjusting the weighting factors accordingly.
  • The following shows the operation process in an order that operates on the low-order bits first and then the high-order bits:
  • conv represents the convolution operation
  • tmp0, tmp1, tmp2 and tmp3 are the convolution results of the four convolution operations, respectively
  • W00, W10, W01 and W11 are the corresponding weighting factors
  • p0, p1, p2 and p3 are the combined results after weighted combination. It can be understood that p0 is the first combined result; because there is no previous combined data, p0 directly corresponds to the weighted result.
  • The following shows the operation process in an order that operates on the high-order bits first and then the low-order bits:
  • the weighting factor may be the product of the component scaling factors of the components corresponding to the convolution result.
  • Here too, some component scaling factors may have a value of 1; e.g., scaleA0 or scaleB0 may be 1.
  • the corresponding multiplication may be omitted, for example, the calculation of W00, W10 and W01 may be omitted, thereby improving the calculation efficiency.
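The low-order-first combination above (tmp0..tmp3, W00..W11, p0..p3) can be sketched as follows; as an assumption for this sketch, the convolution is modeled as a 1-D dot product and the component values are hypothetical:

```python
def conv(x, y):
    """Stand-in for the convolution: a 1-D dot product."""
    return sum(xi * yi for xi, yi in zip(x, y))

# Hypothetical component values for the elements of data A and data B;
# scaleA0 = scaleB0 = 1 as in the example, so W00, W10 and W01 simplify.
a1, a0 = [3, 1], [200, 50]          # A[i] = a1[i]*scaleA1 + a0[i]
b1, b0 = [2, 4], [10, 99]           # B[i] = b1[i]*scaleB1 + b0[i]
scaleA1 = scaleB1 = 256
A = [h * scaleA1 + l for h, l in zip(a1, a0)]
B = [h * scaleB1 + l for h, l in zip(b1, b0)]

# Low-order bits first: tmp0..tmp3 are the four convolution results,
# W00..W11 the corresponding weighting factors (products of the component
# scaling factors), and p0..p3 the running combined results.
tmp0, W00 = conv(a0, b0), 1
tmp1, W10 = conv(a1, b0), scaleA1
tmp2, W01 = conv(a0, b1), scaleB1
tmp3, W11 = conv(a1, b1), scaleA1 * scaleB1
p0 = W00 * tmp0                      # first combination: just the weighted result
p1 = p0 + W10 * tmp1
p2 = p1 + W01 * tmp2
p3 = p2 + W11 * tmp3                 # final result
assert p3 == conv(A, B)
```

Because the vector elements do not affect each other, each `conv` call here could equally be computed element-wise in parallel, as noted below.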
  • Either or both of data A and data B in the above two examples may be scalars or vectors.
  • each element in the vector is decomposed into two or more components, and these elements participate in the operation instead. Since the elements of the vector do not affect each other, the operations involving the elements can be processed in parallel, thereby improving the operation efficiency.
  • In the embodiment of FIG. 3, since the first weighting factor directly corresponds to the product of the component scaling factors of the components of each intermediate result/operation result, the weighted results can be directly accumulated to obtain the final result.
  • the embodiment of FIG. 3 is not limited by the operation order of the operation circuit 310 and/or the output order of the intermediate results.
  • FIG. 4 shows another implementation of the weighting circuit 221 of FIG. 2 .
  • the order of operations is optimized for high-order bits followed by low-order bits.
  • the weighting object of the weighting circuit is the previous operation result of the combinational circuit.
  • the weighting circuit 421 may be configured to multiply the previous addition result of the adding circuit 422 by a second weighting factor to obtain a weighting result.
  • the second weighting factor is the ratio of the scaling factor of the previous operation result of the operation circuit 410 to the scaling factor of the current operation result, wherein the scaling factor of the operation result is determined by the component scaling factor of the component corresponding to the operation result.
  • the addition circuit 422 may be configured to accumulate the weighting result of the weighting circuit 421 and the current operation result of the operation circuit 410 .
  • H00, H11, H22 and H33 are the corresponding weighting factors respectively.
  • the weighting factors can be determined as follows:
  • H33 = (scaleA1*scaleB1)/(scaleA0*scaleB1);
  • H22 = (scaleA0*scaleB1)/(scaleA1*scaleB0);
  • H11 = (scaleA1*scaleB0)/(scaleA0*scaleB0);
  • H00 = scaleA0*scaleB0.
  • the weighting factor H00 corresponds to the scaling factor of the last operation result tmp0.
  • At the time of the last weighting, the operation of the operation circuit 410 for the operation instruction has ended, and there is currently no new operation result.
  • In this case, the scaling factor of the current operation result can be set to 1, so that the last weighting factor still corresponds to the ratio of the scaling factor of the previous operation result to the scaling factor of the current operation result.
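The high-order-first, Horner-style combination with the second weighting factors H33, H22, H11 and H00 can be sketched as follows; the component values are hypothetical, convolution is replaced by scalar multiplication for brevity, and the scaling factors are chosen so that all the ratios are integers:

```python
# Hypothetical component values; high components use scaling factor 256,
# low components use scaling factor 1.
a1, a0, scaleA1, scaleA0 = 7, 13, 256, 1
b1, b0, scaleB1, scaleB0 = 5, 21, 256, 1
A = a1 * scaleA1 + a0 * scaleA0
B = b1 * scaleB1 + b0 * scaleB0

# Operation results in high-order-first order, produced from component
# values only (convolution replaced by scalar multiplication here).
tmp3 = a1 * b1
tmp2 = a0 * b1
tmp1 = a1 * b0
tmp0 = a0 * b0

# Second weighting factors: each is the ratio of the previous operation
# result's scaling factor to the current one; the scaling factor of the
# (nonexistent) result after tmp0 is taken as 1, giving H00.
H33 = (scaleA1 * scaleB1) // (scaleA0 * scaleB1)
H22 = (scaleA0 * scaleB1) // (scaleA1 * scaleB0)
H11 = (scaleA1 * scaleB0) // (scaleA0 * scaleB0)
H00 = scaleA0 * scaleB0

# Before each new result is added, the previous sum is multiplied by the
# corresponding second weighting factor (a Horner-style evaluation).
acc = tmp3
acc = acc * H33 + tmp2
acc = acc * H22 + tmp1
acc = acc * H11 + tmp0
acc = acc * H00
assert acc == A * B
```

This scheme multiplies the accumulated sum rather than each intermediate result, which is why it is tied to the high-order-first operation order.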
  • some component scaling factors may have a value of 1, eg, scaleA0 or scaleB0 may be 1.
  • When the operation instruction is a multiplication or multiply-add instruction, if any of the data involved in the operation is zero, the result must be zero. In this case, such zero data need not participate in the calculation; correspondingly, the corresponding operation circuit can be turned off so that no operation is performed and the result is output directly, thereby saving operation power consumption and also saving computing and/or storage resources.
  • FIG. 5 shows a detailed block diagram of a computing device 500 according to an embodiment of the present disclosure.
  • To this end, a first comparison circuit 513 is added to the operation circuit 510. The comparison circuit 513 can be configured to determine whether any of the data on which the specified operation is to be performed is zero. It can be understood that this data may include any of the following: original data of the data to be operated on, or a component representing the data to be operated on. If the data is zero, the specified operation for this data is omitted, and the circuit can skip directly to the operation on the next data. Otherwise, the specified operation continues to be performed using this data, as previously described.
  • a second comparison circuit 523 may be provided in the combinational circuit 520 .
  • The second comparison circuit 523 may be configured to: determine whether a received intermediate result is zero; and, if the intermediate result is zero, omit performing the combining process on that intermediate result; otherwise, continue to use the intermediate result for combination processing as described above. Similar to the above, this processing method can save computing power consumption and also save computing and/or storage resources.
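A minimal sketch of how the two comparison circuits might skip zero data, assuming scalar multiplication in place of the full multiply-add path; the function name `multiply_components` is hypothetical:

```python
def multiply_components(components_a, components_b):
    """Multiply two decomposed operands, skipping zero data.

    Each argument is a list of (component_value, component_scaling_factor)
    pairs.  Zero operands are skipped before the multiply (first comparison
    circuit), and zero intermediate results are skipped before combination
    (second comparison circuit).
    """
    total = 0
    for va, sa in components_a:
        for vb, sb in components_b:
            if va == 0 or vb == 0:       # first comparison circuit
                continue                 # skip the specified operation
            tmp = va * vb                # intermediate result
            if tmp == 0:                 # second comparison circuit
                continue                 # skip the combining process
            total += tmp * (sa * sb)     # weight and accumulate
    return total

# A's low component is zero, so half of the multiplies are skipped.
assert multiply_components([(3, 256), (0, 1)], [(2, 256), (5, 1)]) \
       == (3 * 256) * (2 * 256 + 5)
```

Skipped products contribute zero anyway, so the final result is unchanged while the corresponding operations and combinations are never performed.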
  • FIG. 6 shows a flowchart of a computing method 600 performed by a computing device according to an embodiment of the present disclosure.
  • The computing method 600 can be used for operation processing of large-bit-width data in various application scenarios, such as artificial intelligence applications including neural network operations, or general-purpose scenarios that need to decompose large-bit-width data into small-bit-width data for computing.
  • step S610 a plurality of data to be operated on associated with the operation instruction is received, wherein at least one data to be operated is represented by two or more components.
  • the at least one data to be operated on has a source data bit width, each component has a respective target data bit width, and the target data bit width is smaller than the source data bit width.
  • the method 600 may further include step S615.
  • step S615 for example, the first comparison circuit 513 in FIG. 5 is used to determine whether any of the data on which the operation is to be performed is zero.
  • the data may include any of the following: raw data of the data to be operated on, or components representing the data to be operated on.
  • In step S620, the operation specified by the operation instruction is performed using the received two or more components in place of the characterized data to be operated on, to obtain two or more intermediate results.
  • If the data is zero, in step S620 the operation specified by the operation instruction is not performed using that data, and processing proceeds directly to the next operation. When the operation is a multiplication or multiply-add operation and either operand is zero, the result is zero, so the specified operation can be omitted for zero data, thereby saving computing resources and reducing power consumption.
  • performing the specified operation may include: performing the specified operation on two or more components of one data to be operated on and corresponding data of other data to be operated, respectively, to obtain corresponding operation results.
  • the other data to be operated on may include one or more data to be operated on.
  • the corresponding data of these other data to be calculated may include any one of the following: original data of the data to be calculated, or at least one component representing the data to be calculated.
  • the method 600 may further include step S625.
  • In step S625, the second comparison circuit 523 in FIG. 5 determines whether the intermediate result on which the combining process is to be performed is zero. If the intermediate result is zero, the method 600 can skip step S630; that is, the combining process is not performed using that intermediate result, and the combination of the next intermediate result proceeds directly, thereby saving computing resources and reducing power consumption.
  • step S630 the intermediate results obtained in step S620 may be combined to obtain a final result.
  • combining the intermediate results may include: performing a weighted combination on the operation results output in step S620 to obtain a final result.
  • the calculation method 600 of the embodiments of the present disclosure is especially suitable for operation processing involving multiplication, such as multiplication or multiply-add operations; the multiply-add operations may include, for example, convolution operations. Since each component that participates in the operation in place of the original data to be operated on has a corresponding component value and a component scaling factor, and the component scaling factor is associated with the digit position of the component in the characterized data to be operated on, when performing operations such as multiplication or multiply-add operations, only the component values may be used to perform the operation, obtaining the operation result as an intermediate result. The effect of the component scaling factors can be addressed in the subsequent combination of results.
  • performing the specified operation may include: performing the operation using the component values to obtain an operation result.
  • performing the weighted combination may include: using a weighting factor to perform a weighted combination of the current operation result and the previous combination result, wherein the weighting factor is at least partially based on the component scaling factor of the component corresponding to the current operation result.
  • weighted combinations can be adopted based on the order of operations of the components, such as from low-order to high-order, or from high-order to low-order.
  • performing the weighted combination in step S630 may include: multiplying the operation result of step S620 by a first weighting factor to obtain a weighted result, where the first weighting factor is the product of the component scaling factors of the components corresponding to the current operation result; and accumulating the weighted result with the previous combined result.
  • performing the weighted combination in step S630 may include: multiplying the previous combined result by a second weighting factor to obtain a weighted result, where the second weighting factor is the ratio of the scaling factor of the previous operation result to the scaling factor of the current operation result, and the scaling factor of an operation result is determined by the component scaling factors of the components corresponding to that operation result; and accumulating the weighted result with the current operation result of step S620.
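The two weighting schemes of step S630 can be sketched as follows. This is a hypothetical Python illustration with made-up names; the intermediate results are assumed to be processed from high-order to low-order so that the scale ratios in the second scheme are whole numbers:

```python
def combine_absolute(results, scales):
    """Scheme 1 (first weighting factor): weight each intermediate result
    by the product of its component scaling factors, then accumulate."""
    total = 0
    for r, s in zip(results, scales):
        total += r * s
    return total

def combine_relative(results, scales):
    """Scheme 2 (second weighting factor): multiply the running sum by the
    ratio of the previous result's scale to the current result's scale,
    then add the raw current result. A final multiply by the last scale
    restores absolute units (a no-op when the sequence ends at scale 1)."""
    total = 0
    prev_scale = scales[0]
    for r, s in zip(results, scales):
        total = total * (prev_scale // s) + r
        prev_scale = s
    return total * scales[-1]
```

Both schemes produce the same final result; for example, with intermediate results `[6, 3, 2]` at scales `[256, 16, 1]`, each returns `6*256 + 3*16 + 2 = 1586`.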
  • FIG. 7 is a structural diagram illustrating a combined processing apparatus 700 according to an embodiment of the present disclosure.
  • the combined processing device 700 includes a computing processing device 702 , an interface device 704 , other processing devices 706 and a storage device 708 .
  • one or more computing devices 710 may be included in the computing processing device, and the computing devices may be configured to perform the operations described herein in conjunction with FIGS. 1-6 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • When multiple computing devices are implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core, the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • the processors may include, but are not limited to, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a computing device for artificial intelligence, such as neural network operations) and external data and control, performing basic controls including, but not limited to, data movement and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into the control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in the internal or on-chip storage device of the computing processing device or the other processing device.
  • the present disclosure also discloses a chip (eg, chip 802 shown in FIG. 8).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 7 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 806 shown in FIG. 8 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
  • the chip may also integrate other processing units (such as video codecs) and interface modules (such as DRAM interfaces).
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 8 .
  • FIG. 8 is a schematic structural diagram illustrating a board 800 according to an embodiment of the present disclosure.
  • the board includes a storage device 804 for storing data, which includes one or more storage units 810 .
  • the storage device can be connected and data transferred with the control device 808 and the chip 802 described above through, for example, a bus.
  • the board also includes an external interface device 806, which is configured for data relay or transfer between the chip (or a chip in a chip package structure) and an external device 812 (such as a server or a computer).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • an electronic device or apparatus, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips, and/or one or more of the above-mentioned combined processing devices.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, webcams, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or devices of the present disclosure can also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care and other fields. Further, the electronic device or apparatus of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge and terminal scenarios.
  • an electronic device or apparatus with high computing power according to the solution of the present disclosure can be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption can be applied to a terminal device and/or an edge device (such as a smartphone or camera).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments; that is, the actions or modules involved therein are not necessarily required for the realization of one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in the present disclosure also have different emphases. In view of this, for the parts that are not described in detail in a certain embodiment of the present disclosure, those skilled in the art can refer to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. On this basis, when the aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions to cause a computer device (e.g., a personal computer, a server or a network device, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, a CD, or any other medium that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which can be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, etc.
  • a computing device comprising:
  • arithmetic circuit configured for:
  • at least one data to be operated on is characterized by two or more components, the at least one data to be operated on has a source data bit width, each of the components has a respective target data bit width, and the target data bit width is smaller than the source data bit width;
  • a combinational circuit configured for:
  • storage circuitry configured to store the intermediate result and/or the final result.
  • the operation circuit is configured to perform the operation on the two or more components of one data to be operated on with corresponding data of other data to be operated on, respectively, and output the corresponding operation result to the combination circuit;
  • the combining circuit is configured to perform a weighted combination of the operation results to obtain a final result.
  • Clause 3 The computing device of clause 2, wherein the other data to be operated on includes one or more data to be operated on, and its corresponding data includes any of the following: the original data of the data to be operated on, or at least one component characterizing the data to be operated on.
  • Clause 4 The computing device of any of clauses 2-3, wherein the operation instructions comprise instructions involving a multiply operation or a multiply-add operation, and the operation circuit comprises a multiply operation circuit or a multiply-add operation circuit.
  • each of the components has a component value and a component scale factor, the component scale factor being associated with a bit position of the corresponding component in the characterized data to be operated on;
  • the operation circuit is configured to perform the operation using the component values to obtain an operation result
  • the combining circuit is configured to perform a weighted combination of a current operation result of the operation circuit with a previous combining result of the combining circuit using a weighting factor, wherein the weighting factor is determined based at least in part on the component scaling factor of the component corresponding to the operation result.
  • the weighting circuit is configured to multiply the operation result of the operation circuit by a first weighting factor to obtain a weighting result, wherein the first weighting factor is a product of component scaling factors of components corresponding to the operation result;
  • the addition circuit is configured to accumulate the weighted result with a previous addition result of the addition circuit.
  • Clause 7 The computing device of clause 5, wherein the combining circuit comprises a weighting circuit and an adding circuit,
  • the weighting circuit is configured to multiply the previous addition result of the addition circuit by a second weighting factor to obtain a weighted result, wherein the second weighting factor is the ratio of the scaling factor of the previous operation result of the operation circuit to the scaling factor of the current operation result, wherein the scaling factor of an operation result is determined by the component scaling factors of the components corresponding to that operation result;
  • the addition circuit is configured to accumulate the weighted result and the current operation result of the operation circuit.
  • Clause 8 The computing device of any of clauses 4-6, wherein the operational circuit further comprises a first comparison circuit configured to:
  • the number of components used to characterize the at least one data to be operated on is determined based at least in part on the source data bit width and the data bit width supported by the operational circuit;
  • the target data bit width is determined based at least in part on data bit widths supported by the arithmetic circuit.
  • the operation circuit is further configured to perform the operation specified by the operation instruction in the order in which the two or more components are received, wherein the order includes: from high-order to low-order, or from low-order to high-order.
  • Clause 12 The computing device according to any one of clauses 1-11, wherein the data to be operated on is a vector, and executing the operation specified by the operation instruction comprises:
  • the operations are performed in parallel among the elements in the vector.
  • Clause 15 A computing device comprising the board of clause 14.
  • a method performed by a computing device comprising:
  • At least one data to be operated on is characterized by two or more components, the at least one data to be operated on has a source data bit width, each of the components having a respective target data bit width, and the target data bit width is smaller than the source data bit width;
  • Executing the operation specified by the operation instruction includes:
  • Combining the intermediate results includes:
  • a weighted combination is performed on the operation results to obtain a final result.
  • Clause 18 The method of clause 17, wherein the operation instructions comprise instructions involving a multiply operation or a multiply-add operation.
  • each of the components has a component value and a component scale factor, the component scale factor being associated with a bit position of the corresponding component in the characterized data to be operated on;
  • the execution of the operation specified by the operation instruction includes:
  • the performing weighted combination includes:
  • the current operation result is weighted and combined with the previous combination result using a weighting factor, wherein the weighting factor is determined based at least in part on the component scaling factor of the component corresponding to the operation result.
  • Clause 20 The method of Clause 19, wherein said performing weighted combining comprises:
  • the first weighting factor is a product of component scaling factors of components corresponding to the operation result
  • the weighted result is accumulated with the previous combined result.
  • Clause 21 The method of Clause 19, wherein said performing weighted combining comprises:
  • the second weighting factor is the ratio of the scaling factor of the previous operation result to the scaling factor of the current operation result, wherein the scaling factor of an operation result is determined by the component scaling factors of the components corresponding to that operation result;
  • the operation specified by the operation instruction is not performed using the component.


Abstract

The present disclosure discloses a computing device, an integrated circuit chip, a board, a device, and a method. The computing device may be included in a combined processing apparatus, which may further include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing apparatus may further include a storage device, which is connected to the computing device and the other processing devices, respectively, and is used to store data of the computing device and the other processing devices. The solution of the present disclosure can perform operation processing using at least two pieces of small-bit-width data that characterize large-bit-width data, so that the processing capability of a processor is not limited by bit width. Abstract drawing: FIG. 7

Description

A computing device, integrated circuit chip, board, device and computing method
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application No. 2020106108072, filed on June 29, 2020 and entitled "A computing device, integrated circuit chip, board, device and computing method", the entire content of which is incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates generally to the field of data processing. More specifically, the present disclosure relates to a computing device, an integrated circuit chip, a board, a device, and a computing method.
BACKGROUND
At present, the data bit widths processed by different types of processors may differ. For a processor that performs operations on a specific data type, the data bit width it can process is often limited. For example, a fixed-point arithmetic unit can typically process data of no more than 16 bits, such as 16-bit integer data. However, in order to save computing cost and overhead and to improve computing efficiency, how to enable a processor with a limited bit width to process data of larger bit widths becomes a technical problem that needs to be solved.
SUMMARY
In order to solve at least the technical problems mentioned above, the present disclosure proposes, in multiple aspects, a solution in which small-bit-width components (i.e., data with fewer bits) of large-bit-width data (i.e., data with more bits) participate in the computation in place of the large-bit-width data. With the computing solution of the present disclosure, at least two pieces of small-bit-width data can be used to characterize large-bit-width data and to perform operation processing in its place, so that in scenarios where the processing bit width of a processor is limited, the processor can still be used to complete computations on large-bit-width data.
In a first aspect, the present disclosure provides a computing device, comprising: an operation circuit configured to: receive a plurality of data to be operated on associated with an operation instruction, wherein at least one data to be operated on is characterized by two or more components, the at least one data to be operated on has a source data bit width, each component has a respective target data bit width, and the target data bit width is smaller than the source data bit width; and perform the operation specified by the operation instruction using the two or more components in place of the characterized data to be operated on, so as to obtain two or more intermediate results. The computing device further comprises: a combining circuit configured to combine the intermediate results to obtain a final result; and a storage circuit configured to store the intermediate results and/or the final result.
In a second aspect, the present disclosure provides an integrated circuit chip comprising the computing device of the aforementioned first aspect.
In a third aspect, the present disclosure provides an integrated circuit board comprising the integrated circuit chip of the aforementioned second aspect.
In a fourth aspect, the present disclosure provides a computing apparatus comprising the board of the aforementioned third aspect.
In a fifth aspect, the present disclosure provides a method performed by a computing device. The method comprises: receiving a plurality of data to be operated on associated with an operation instruction, wherein at least one data to be operated on is characterized by two or more components, the at least one data to be operated on has a source data bit width, each component has a respective target data bit width, and the target data bit width is smaller than the source data bit width; performing the operation specified by the operation instruction using the two or more components in place of the characterized data to be operated on, so as to obtain two or more intermediate results; and combining the intermediate results to obtain a final result.
Through the computing device, integrated circuit chip, board, computing apparatus and method provided above, the solution of the present disclosure uses the small-bit-width components of large-bit-width data to participate in the computation in place of the large-bit-width data, so that in artificial intelligence application scenarios including, for example, neural network operations, or in other general-purpose scenarios, the computing power of the processor can be brought into full play without being limited by its processing bit width. Further, in scenarios such as neural network operations, the solution of the present disclosure can also simplify the computation of the neural network and improve computing efficiency by performing the computation with at least two small-bit-width components in place of the large-bit-width data.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects, features and advantages of the exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts, in which:
FIG. 1 is a simplified block diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 2 is a detailed block diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 3 is a detailed block diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 4 is a detailed block diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 5 is a detailed block diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 6 is a flowchart illustrating a computing method of a computing device according to an embodiment of the present disclosure;
FIG. 7 is a structural diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure; and
FIG. 8 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present disclosure.
It should be understood that terms such as "first" and "second" in the claims, specification and drawings of the present disclosure are used to distinguish different objects, rather than to describe a particular order. The terms "comprise" and "include" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terms used in this specification of the present disclosure are only for the purpose of describing particular embodiments and are not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
As mentioned above, to address the problem of the limited processing bit width of processors, the present disclosure proposes, in multiple aspects, a solution in which small-bit-width components of large-bit-width data participate in the computation in place of the large-bit-width data. Since at least two pieces of small-bit-width data are used to characterize the large-bit-width data and to perform the operation processing in its place, the operation results obtained using the small-bit-width data need to be combined to obtain the final result. By having large-bit-width (e.g., 24-bit) data characterized by at least two pieces of small-bit-width (e.g., 16-bit and 8-bit) data (also referred to as components), the solution of the present disclosure overcomes the obstacle of the limited bit width of the processor. Further, by substituting small-bit-width data/components into the operation, the computational complexity is simplified, thereby improving the computing efficiency of, for example, neural network computations. Still further, since a large-bit-width data operation is decomposed into multiple small-bit-width data operations, the processing circuits can perform the corresponding processing in parallel, further improving computing efficiency. The solution of the present disclosure is especially suitable for operation processing involving multiplication, such as multiplication or multiply-add operations; the multiply-add operations may include, for example, convolution operations. Therefore, the solution of the present disclosure can be used to perform neural network operations, in particular to process weight data and neuron data to obtain the desired operation result. For example, when the neural network is a convolutional neural network for images, the weight data may be convolution kernel data, and the neuron data may be, for example, pixel data of an image or output data of a preceding layer operation.
Specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1 is a simplified block diagram illustrating a computing device 100 according to an embodiment of the present disclosure. In one or more embodiments, the computing device 100 can be used for the operation processing of large-bit-width data in various application scenarios, such as artificial intelligence applications including neural network operations, or general-purpose scenarios in which large-bit-width data needs to be decomposed into small-bit-width data for computation.
As shown in FIG. 1, the computing device 100 includes an operation circuit 110, a combining circuit 120, and a storage circuit 130.
In some embodiments, the operation circuit 110 may be configured to receive a plurality of data to be operated on associated with an operation instruction, wherein at least one data to be operated on is characterized by two or more components. The at least one data to be operated on has a source data bit width, and each component has a respective target data bit width, the target data bit width being smaller than the source data bit width.
As mentioned above, the data bit width of the data to be operated on may exceed the processing bit width of the operation circuit. On this basis, the data to be operated on with a large bit width (the source data bit width) can be decomposed into, and characterized by, two or more components of small bit width (the target data bit width).
The decomposition of the data to be operated on into two or more components can be implemented based on various existing and/or future-developed data decomposition techniques.
In some embodiments, the number of components used to characterize the data to be operated on may be determined based at least in part on the source data bit width of the data to be operated on and the data bit width supported by the operation circuit. In still other embodiments, the target data bit width may be determined based at least in part on the data bit width supported by the operation circuit. For example, when the data to be operated on has a 24-bit data width while the operation circuit supports at most a 16-bit data width, in one example the data to be operated on can be decomposed into two components with unequal target data bit widths, namely an 8-bit high-order component and a 16-bit low-order component, or a 16-bit high-order component and an 8-bit low-order component. In another example, the data to be operated on can be decomposed into three components with equal target data bit widths, namely an 8-bit high-order component, an 8-bit middle-order component, and an 8-bit low-order component. The present disclosure is not limited in this respect, as long as the target data bit widths of the resulting components satisfy the processing bit width limit of the operation circuit.
The data to be operated on can be decomposed into multiple components according to the required number of components and the target data bit width of each component, where each component has a corresponding component value and component scaling factor. A possible data decomposition method is briefly described below, taking the decomposition of one large-bit-width data into two small-bit-width components as an example; those skilled in the art can understand, however, that the present disclosure is not limited in this respect.
In one example, the large-bit-width data is decomposed into two components: a first component and a second component. The first component may be a high-order component or a low-order component; correspondingly, the second component may be a low-order component or a high-order component.
First, the component scaling factor of each component can be determined based on the target data bit width of each component and/or the digit position of each component in the data before decomposition (the large-bit-width data). For example, when the target data bit width of the first component (in this example, the high-order component) is n1 and the target data bit width of the second component (in this example, the low-order component) is n2, the component scaling factor of the first component may be 2^(n2-1) when the sign bit included in n2 is not taken into account. In contrast, when the sign bit included in n2 is taken into account, the component scaling factor of the first component may be 2^n2. Typically, the component scaling factor of the low-order component defaults to 1.
Next, using the component scaling factor of the first component, a computation can be performed on the large-bit-width data to be decomposed, so as to obtain the component value of the first component. The first component is characterized by this component value and the corresponding component scaling factor.
Then, a computation can be performed based on the large-bit-width data to be decomposed and the previously obtained component value of the first component, so as to obtain the component value of the second component. In one example, when the data bit width of the second component does not contain a sign bit, for example when the most significant bit of the second component's data bit width is not a sign bit, the value of the second component can be obtained by subtracting the first component from the large-bit-width data to be decomposed. Here, the first component is the product of the component value of the first component and the corresponding component scaling factor.
In the manner described above, the large-bit-width data can be decomposed into two components, each having a corresponding component value and component scaling factor. When the data needs to be decomposed into more than two components, the above method can be performed iteratively until the required number of components is obtained. For example, for data with a 24-bit data width, when it is determined that it needs to be decomposed into three components, all of which have an 8-bit data width, the 24-bit data can first be decomposed through the above steps into a first component with an 8-bit data width and an intermediate second component with a 16-bit data width. Then, the above steps are repeated for the intermediate second component with the 16-bit data width, so as to further decompose it into a second component with an 8-bit data width and a third component with an 8-bit data width.
Those skilled in the art can understand that various kinds of processing can also be adopted to optimize the data decomposition method, and the present disclosure is not limited in this respect; it is only necessary to receive the decomposed components for the specified operation.
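As a minimal sketch of the two-step decomposition described above (assuming unsigned components with the sign bit ignored for simplicity; the function and variable names are made up for illustration), splitting one value into a high-order and a low-order component might look like:

```python
def decompose(value, low_bits):
    """Split `value` into (high_value, high_scale) and (low_value, 1).

    Following the two-step method above: first derive the high
    component's scaling factor from the low component's bit width,
    then recover the low component by subtracting the (rescaled)
    high component from the original value.
    """
    high_scale = 1 << low_bits              # scaling factor of the high component
    high_value = value >> low_bits          # component value of the high component
    low_value = value - high_value * high_scale  # low component, scaling factor 1
    return (high_value, high_scale), (low_value, 1)
```

For a 24-bit value split into a 8-bit high part and a 16-bit low part, `decompose(0xABCDEF, 16)` yields `(0xAB, 65536)` and `(0xCDEF, 1)`, and the original value is recovered as `0xAB * 65536 + 0xCDEF`. Decomposing into three 8-bit components would apply the same function again to the 16-bit remainder, as the text describes.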
Continuing with FIG. 1, in some embodiments, the operation circuit 110 may be further configured to perform the operation specified by the operation instruction using the received two or more components in place of the characterized data to be operated on, so as to obtain two or more intermediate results.
Specifically, the operation circuit 110 may be configured to perform the specified operation on the two or more components of one data to be operated on with the corresponding data of the other data to be operated on, respectively, and to output the corresponding operation results to the combining circuit 120.
Depending on the specific operation instruction, the other data to be operated on may include one or more data to be operated on. Each data to be operated on may have a different data bit width. When the data bit width of the data to be operated on satisfies the processing bit width limit of the operation circuit, no decomposition is needed, and the original data can be used in the operation. On the other hand, although some data to be operated on are decomposed into multiple components, in some cases only one or some of the components may be needed for the operation. Therefore, in such cases, the corresponding data of these other data to be operated on may include any of the following: the original data of the data to be operated on, or at least one component characterizing the data to be operated on.
The operation circuit 110 can use the received data to perform the operation specified by the operation instruction, thereby obtaining two or more intermediate results, which are output to the combining circuit 120. Those skilled in the art can understand that the operation circuit 110 can perform the specified operation in the order in which the two or more components are received, so as to obtain the intermediate results in sequence and output them to the combining circuit 120. The order of these components may include, for example: from high-order to low-order, or from low-order to high-order.
In some embodiments, the combining circuit 120 may be configured to combine the intermediate results input from the operation circuit 110 to obtain the final result. As mentioned above, since at least one data to be operated on is replaced in the operation by its two or more components, the results obtained by performing the operation with the components are intermediate results, which need to be combined to obtain the final result.
In some embodiments, the combining circuit may be further configured to perform a weighted combination of these operation results serving as intermediate results, so as to obtain the final result. Since the components that participate in the operation in place of the original data to be operated on have corresponding component values and component scaling factors, and the operation circuit 110 may use only the component values in the operation to obtain the intermediate results, the combining circuit 120 can take into account the component scaling factors of the components participating in the operation when performing the weighted combination of the intermediate results. Various implementations of the combining circuit will be described in detail below based on several embodiments.
The computing device 100 may further include a storage circuit 130 configured to store the above intermediate results and/or final result. As mentioned above, since the results obtained by the operation circuit 110 performing operations with the components are intermediate results, these intermediate results need to be combined. During the combination, cyclic combination, such as weighted accumulation, can be performed as the intermediate results are produced; therefore, the storage circuit can be used to store these intermediate results temporarily or for a long time. Preferably, in some embodiments, the intermediate results and the final result can share the storage space in the storage circuit, thereby saving storage space. Those skilled in the art can understand that the storage circuit 130 can also be used to store other data and information, for example, intermediate data that is generated during the operation of the operation circuit 110 and needs to be stored; the present disclosure is not limited in this respect.
图2是示出根据本披露实施例的计算装置200的详细框图。如前所述,本披露的方案尤其适合于涉及乘法运算的运算处理。因此,在此实施例中,计算装置200的运算电路210具体可以实现为乘法电路211或乘加电路212。乘加电路212例如可以用来实现卷积运算。
由于代替原始待运算数据来参与运算的分量具有相应的分量数值和分量缩放因子,而分量缩放因子与该分量在所表征的待运算数据中的数位位置相关联,因此在涉及到乘法一类的运算时,例如乘法运算或乘加运算,乘法电路211或乘加电路212在运算时可以只使用分量数值进行运算以获得运算结果,作为中间结果。分量缩放因子的影响可以随后通过组合电路220来处理。
如图2所示,组合电路220可以包括加权电路221和加法电路222。加权电路221可以配置用于利用加权因子,对运算电路210的当前运算结果,例如乘法电路211的乘积结果或乘加电路212的乘加结果,或者组合电路220的前一次的组合结果进行加权处理。取决于加权对象的不同,加权因子也可以不同。在一些实施例中,加权因子至少部分基于生成对应的运算结果的分量的分量缩放因子而确定。加法电路222可以配置用于对加权后的结果与其他中间结果进行累加,以获得最终结果。
以下针对不同加权对象的情况,分别描述图2的组合电路220中加权电路221的可能实现方式。
图3是示出根据本披露实施例的计算装置300的详细框图。在此实施例中，进一步示出了图2的加权电路221的一种实现。在此实现中，加权的对象是运算电路210的当前运算结果。
如图3所示,加权电路321可以配置用于将运算电路310的运算结果乘以第一加权因子,以得到加权结果。当运算电路310的运算为乘法运算或乘加运算时,第一加权因子可以是对应于该运算结果的分量的分量缩放因子之积。本领域技术人员可以理解,对于不同的运算结果,第一加权因子也可能不同。此时,加法电路322可以配置用于将得到的加权结果与加法电路322的前一次加法结果进行累加。
以下以两个数据的运算为例,来进一步描述图3所示实施例的具体实现。
在一个示例中,假设运算指令指定对大位宽的数据A和数据B执行乘法运算。数据A和数据B的每个已经预先被分解为两个分量。例如,数据A和数据B可以分别表示为:
A=a1*scaleA1+a0*scaleA0
B=b1*scaleB1+b0*scaleB0
其中,a1、a0分别是数据A的高数位分量和低数位分量的分量数值;scaleA1和scaleA0分别是对应的分量缩放因子。类似地,b1、b0分别是数据B的高数位分量和低数位分量的分量数值;scaleB1和scaleB0分别是对应的分量缩放因子。在此示例中,当使用分量来代替数据A和B进行乘法运算时,需要执行4次乘法运算。无论以何种顺序执行这4次乘法运算,只需要相应地调整加权因子,就可以获得最终运算结果。
例如,上述乘法运算可以表示为:
A*B=(a1*scaleA1+a0*scaleA0)*(b1*scaleB1+b0*scaleB0)
=(a1*b1)*(scaleA1*scaleB1)+(a1*b0)*(scaleA1*scaleB0)+
(a0*b1)*(scaleA0*scaleB1)+(a0*b0)*(scaleA0*scaleB0)
从上述表述可知,运算电路310可以进行各分量数值之间的乘法运算,在此示例中分别为a1*b1、a1*b0、a0*b1和a0*b0。另外,在此示例中,上述四个中间结果的对应第一加权因子分别是:scaleA1*scaleB1、scaleA1*scaleB0、scaleA0*scaleB1和scaleA0*scaleB0。加权电路321利用对应的第一加权因子,分别对上述中间结果进行加权。加法电路322可以对加权后的中间结果进行相加,以得到最终结果。
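基于上述展开式，可以用如下假设性的 Python 草图示意该运算流程：运算电路只对分量数值做乘法得到中间结果，组合电路再以第一加权因子（分量缩放因子之积）加权累加得到最终结果。其中的数值与位宽仅为示例，并非本披露的实际电路实现：

```python
# 假设数据A、B各分解为高/低两个8位分量（数值仅为示意）
a1, a0, scaleA1, scaleA0 = 0x12, 0x34, 256, 1   # A = 0x1234
b1, b0, scaleB1, scaleB0 = 0x56, 0x78, 256, 1   # B = 0x5678

partials = []
for av, ascale in ((a1, scaleA1), (a0, scaleA0)):
    for bv, bscale in ((b1, scaleB1), (b0, scaleB0)):
        tmp = av * bv               # 运算电路：只用分量数值相乘，得到中间结果
        w = ascale * bscale         # 第一加权因子 = 分量缩放因子之积
        partials.append(tmp * w)    # 加权电路对中间结果加权
result = sum(partials)              # 加法电路对加权后的中间结果累加

assert result == 0x1234 * 0x5678    # 等于原始大位宽数据的乘积
```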
在一些实施例中,有些分量缩放因子的值为1,例如scaleA0或scaleB0可能为1。此时,在计算第一加权因子时,可以省略对应的乘法,例如可以省略对scaleA1*scaleB0、scaleA0*scaleB1和scaleA0*scaleB0的计算。
在另一示例中,假设运算指令指定对大位宽的数据A和数据B执行卷积运算,其中数据A例如可以是神经网络运算中的神经元,数据B可以是神经网络运算中的权值。数据A和数据B的每个已经预先被分解为两个分量。例如,数据A和数据B可以分别表示为:
A=a1*scaleA1+a0*scaleA0
B=b1*scaleB1+b0*scaleB0
其中，a1、a0分别是数据A的高数位分量和低数位分量的分量数值；scaleA1和scaleA0分别是对应的分量缩放因子。类似地，b1、b0分别是数据B的高数位分量和低数位分量的分量数值；scaleB1和scaleB0分别是对应的分量缩放因子。在此示例中，当使用分量来代替数据A和B进行卷积运算时，需要执行4次卷积运算。无论以何种顺序执行这4次卷积运算，只需要相应地调整加权因子，就可以获得最终运算结果。
在一个示例中,以先低数位后高数位的运算顺序,示出了运算过程:
a0(conv)b0->tmp0,tmp0*W00->p0
a1(conv)b0->tmp1,tmp1*W10+p0->p1
a0(conv)b1->tmp2,tmp2*W01+p1->p2
a1(conv)b1->tmp3,tmp3*W11+p2->p3
其中,conv代表卷积运算,tmp0、tmp1、tmp2和tmp3分别为四次卷积运算的卷积结果,W00、W10、W01和W11分别是对应的加权因子,p0、p1、p2和p3是加权组合后的组合结果。可以理解,p0是首次组合结果,因为不存在上一组合数据,因此p0直接对应于加权后的结果。
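上述先低数位后高数位的组合过程可以用下面的假设性 Python 草图示意。为简明起见，这里以标量乘法代替卷积运算conv，只演示加权累加的顺序（符号含义与上文相同，数值仅为示例）：

```python
# 以标量乘法代替卷积(conv)，示意先低数位后高数位的加权累加顺序
a1, a0, b1, b0 = 0x12, 0x34, 0x56, 0x78
sA1, sA0, sB1, sB0 = 256, 1, 256, 1
# 第一加权因子 = 对应分量的分量缩放因子之积
W00, W10, W01, W11 = sA0 * sB0, sA1 * sB0, sA0 * sB1, sA1 * sB1

tmp0 = a0 * b0; p0 = tmp0 * W00        # 首次组合，不存在上一组合数据
tmp1 = a1 * b0; p1 = tmp1 * W10 + p0
tmp2 = a0 * b1; p2 = tmp2 * W01 + p1
tmp3 = a1 * b1; p3 = tmp3 * W11 + p2   # p3即最终结果

assert p3 == (a1 * sA1 + a0 * sA0) * (b1 * sB1 + b0 * sB0)
```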
在另一示例中,以先高数位后低数位的运算顺序,示出了运算过程:
a1(conv)b1->tmp3,tmp3*W11->p3
a0(conv)b1->tmp2,tmp2*W01+p3->p2
a1(conv)b0->tmp1,tmp1*W10+p2->p1
a0(conv)b0->tmp0,tmp0*W00+p1->p0
其中,各符号的含义与前面相同。可以理解,p3是首次组合结果,因为不存在上一组合数据,因此p3直接对应于加权后的结果。
在上述两个示例中,加权因子都可以是对应卷积结果的分量的分量缩放因子之积。例如,
W00=scaleA0*scaleB0;
W10=scaleA1*scaleB0;
W01=scaleA0*scaleB1;
W11=scaleA1*scaleB1。
与前面类似地,在一些实施例中,有些分量缩放因子的值为1,例如scaleA0或scaleB0可能为1。此时,在计算第一加权因子时,可以省略对应的乘法,例如可以省略对W00、W10和W01的计算,由此提高计算效率。
上述两个示例中的数据A和数据B中任一或二者都可以是标量或向量。当数据是向量时,向量中各元素被分解为两个或多个分量,代替这些元素参与运算。由于向量的各元素之间不会相互影响,因此,涉及各元素的运算可以并行处理,由此提高运算效率。
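向量情形可以用如下假设性 Python 草图示意：各元素独立分解、独立运算，彼此之间没有数据依赖，因此可以并行处理（此处为顺序执行的示意，函数名 decompose2 为说明而设）：

```python
# 示意：向量各元素独立分解为高/低两个8位分量并运算，元素间互不影响
def decompose2(v, bits=8):
    high, low = v >> bits, v & ((1 << bits) - 1)
    return (high, 1 << bits), (low, 1)

A = [0x1234, 0x0000, 0x00FF]
B = [0x0042, 0x5678, 0x0100]
out = []
for x, y in zip(A, B):            # 每个元素位置的运算相互独立，可并行执行
    (x1, sx1), (x0, sx0) = decompose2(x)
    (y1, sy1), (y0, sy0) = decompose2(y)
    out.append(x1*y1*sx1*sy1 + x1*y0*sx1*sy0
               + x0*y1*sx0*sy1 + x0*y0*sx0*sy0)

assert out == [x * y for x, y in zip(A, B)]
```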
此外，从上面的运算过程可以看出，无论各个分量之间的运算采用何种顺序，由于第一加权因子都直接对应于各中间结果的分量的分量缩放因子之积，因此直接对加权后的结果进行累加就可以获得最终结果。也就是说，图3所示的实施方式不受限于运算电路310的运算顺序和/或中间结果的输出顺序。
图4示出了图2的加权电路221的另一种实现。在此实现中，针对先高数位后低数位的运算顺序进行了优化。这种情况下，加权电路的加权对象是组合电路的前一次组合结果。
如图4所示,加权电路421可以配置用于将加法电路422的前一次加法结果乘以第二加权因子,以得到加权结果。此时,第二加权因子为运算电路410的前一运算结果的缩放因子与当前运算结果的缩放因子之比,其中运算结果的缩放因子由对应于运算结果的分量的分量缩放因子来确定。本领域技术人员可以理解,对于每次组合,第二加权因子可能不同。此时,加法电路422可以配置用于将加权电路421的加权结果与运算电路410的当前运算结果进行累加。
同样以前面的数据A和数据B的卷积运算为例,来进一步描述图4所示实施例的具体实现。
按照先高数位后低数位的运算顺序,其运算过程如下:
a1(conv)b1->tmp3,tmp3->p3
a0(conv)b1->tmp2,tmp2+p3*H33->p2
a1(conv)b0->tmp1,tmp1+p2*H22->p1
a0(conv)b0->tmp0,tmp0+p1*H11->p0
p0=p0*H00
其中,各符号含义与前面相同,H00、H11、H22和H33分别是对应的加权因子。在此示例中,加权因子可以按如下确定:
H33=(scaleA1*scaleB1)/(scaleA0*scaleB1);
H22=(scaleA0*scaleB1)/(scaleA1*scaleB0);
H11=(scaleA1*scaleB0)/(scaleA0*scaleB0);
H00=scaleA0*scaleB0。
从上面运算过程可以看出,最后还需要将组合结果进行再次加权,该加权因子H00对应于最后一次运算结果tmp0的缩放因子。此时运算电路410针对该运算指令的运算已经结束,当前没有运算结果。为了统一加权因子的计算,可以将当前运算结果的缩放因子设置为1,从而最后一个加权的加权因子仍然对应于前一运算结果的缩放因子与当前运算结果的缩放因子之比。
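上述采用第二加权因子（缩放因子之比）的组合过程可以用如下假设性 Python 草图示意。同样以标量乘法代替卷积；由于此处缩放因子均为2的幂，其比值为整数：

```python
# 以标量乘法代替卷积，示意图4中用第二加权因子(缩放因子之比)的组合方式
a1, a0, b1, b0 = 0x12, 0x34, 0x56, 0x78
sA1, sA0, sB1, sB0 = 256, 1, 256, 1
# 各次运算结果的缩放因子（按先高数位后低数位的顺序）
s3, s2, s1, s0 = sA1 * sB1, sA0 * sB1, sA1 * sB0, sA0 * sB0
# 第二加权因子 = 前一运算结果缩放因子 / 当前运算结果缩放因子
H33, H22, H11, H00 = s3 // s2, s2 // s1, s1 // s0, s0

p3 = a1 * b1                 # tmp3 -> p3，首次组合无上一组合数据
p2 = a0 * b1 + p3 * H33
p1 = a1 * b0 + p2 * H22
p0 = a0 * b0 + p1 * H11
p0 = p0 * H00                # 最后以tmp0的缩放因子再次加权

assert p0 == (a1 * sA1 + a0 * sA0) * (b1 * sB1 + b0 * sB0)
```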
类似地,在一些实施例中,有些分量缩放因子的值为1,例如scaleA0或scaleB0可能为1。此时,在计算第二加权因子时,可以省略对应的乘法,例如可以省略对scaleA1*scaleB0、scaleA0*scaleB1和scaleA0*scaleB0的计算,同时还可以省略最后一步组合结果的加权,也即p0=p0*H00,由此提高计算效率。
从上述运算过程可以看出,高数位分量的分量数值的运算结果是通过多次加权逐步增大的,因此可以避免出现相加的两个数相差较大时,例如一个很大数与一个很小的数相加,可能由于对阶步骤而出现的精度丢失现象。
在一些实施例中,当运算指令为乘法运算或乘加运算指令时,若参与运算的数据中任一为零,则结果必然为零,此时,可以无需这种为零的数据参与计算,相应地可以关闭当前运算电路不进行运算,直接输出结果,从而节省运算功耗,也可以节省计算和/或存储资源。
图5示出了本披露实施例的计算装置500的详细框图。在此实施例中,运算电路510中增加了第一比较电路513,该比较电路513可以配置用于判断即将对其执行指定运算的数据中任一是否为零。可以理解,此数据可以包括以下任一:待运算数据的原始数据、或表征待运算数据的分量。如果该数据为零,则省略针对此数据执行所指定的运算,可以直接跳到下一数据的运算。否则,如前所描述地继续使用此数据执行所指定的运算。
备选地或附加地,可以在组合电路520中设置第二比较电路523。该第二比较电路523可以配置用于:判断接收的中间结果是否为零;以及若中间结果为零,则省略针对该中间结果执行组合处理;否则,如前所描述地继续使用该中间结果进行组合处理。与上面类似地,这种处理方式可以节省运算功耗,也可以节省计算和/或存储资源。
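第一比较电路与第二比较电路的跳零思想可以用如下假设性 Python 草图示意（函数名 multiply_accumulate 仅为说明而设，并非实际电路实现）：

```python
def multiply_accumulate(pairs):
    """对(x, y, w)三元组做加权乘累加；为零的操作数直接跳过运算。

    仅示意第一/第二比较电路的跳零处理，以节省运算功耗和计算资源。
    """
    acc = 0
    for x, y, w in pairs:
        if x == 0 or y == 0:      # 第一比较电路：任一数据为零则省略该次运算
            continue              # 结果必然为零，直接继续下一数据
        tmp = x * y               # 中间结果
        if tmp == 0:              # 第二比较电路：中间结果为零则省略组合处理
            continue
        acc += tmp * w            # 否则继续加权组合
    return acc

assert multiply_accumulate([(0, 5, 2), (3, 4, 2), (1, 0, 9)]) == 24
```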
图6示出了根据本披露实施例的由计算装置执行的计算方法600的流程图。如前所述，在一个或多个实施例中，该计算方法600可以用于各类应用场景中的大位宽数据运算处理，例如包括神经网络运算的人工智能应用，或需要将大位宽数据分解为小位宽数据以用于计算的通用场景。
如图6所示,在步骤S610中,接收与运算指令关联的多个待运算数据,其中至少一个待运算数据由两个或更多个分量来表征。该至少一个待运算数据具有源数据位宽,每个分量具有各自的目标数据位宽,并且目标数据位宽小于源数据位宽。
可选地,在一些实施例中,当运算指令涉及乘法运算或乘加运算(例如,卷积运算)时,方法600可以进一步包括步骤S615。在步骤S615中,例如通过图5中的第一比较电路513来判断即将对其执行运算的数据中任一是否为零。该数据可以包括以下任一:待运算数据的原始数据、或表征待运算数据的分量。
若数据均不为零,则方法600前进到步骤S620,在此使用所接收的两个或更多个分量代替所表征的待运算数据来执行运算指令所指定的运算,以获得两个或更多个中间结果。
若该数据中任一为零,则方法600可以跳过步骤S620,也即不使用此数据来执行运算指令所指定的运算,直接继续下一运算。因为当运算为乘法运算或乘加运算时,运算中的任一方为零,都将导致结果为零,因此可以省略针对该为零的数据执行所指定的运算,由此节省计算资源,并且可以降低功耗。
继续步骤S620,执行所指定的运算可以包括:将一个待运算数据的两个或更多个分量分别与其他待运算数据的对应数据执行所指定运算,以获得对应的运算结果。如前面所提到的,其他待运算数据可以包括一个或多个待运算数据。并且,这些其他待运算数据的对应数据可以包括以下任一:待运算数据的原始数据、或表征待运算数据的至少一个分量。
可选地,在一些实施例中,方法600可以进一步包括步骤S625。在步骤S625中,例如通过图5中的第二比较电路523来判断即将对其执行组合处理的中间结果是否为零。若该中间结果为零,则方法600可以跳过步骤S630,也即不使用此中间结果来执行组合处理,直接继续下一中间结果的组合,由此可以节省计算资源,并且可以降低功耗。
最后,在步骤S630中,可以将步骤S620得到的中间结果进行组合,以获得最终结果。在一些实施例中,将中间结果进行组合可以包括:对步骤S620输出的运算结果执行加权组合,以得到最终结果。
本披露实施例的计算方法600尤其适用于涉及乘法运算的运算处理,例如乘法或乘加运算,乘加运算例如可以包括卷积运算。由于代替原始待运算数据来参与运算的分量具有相应的分量数值和分量缩放因子,而分量缩放因子与该分量在所表征的待运算数据中的数位位置相关联,因此在涉及到乘法一类的运算时,例如乘法运算或乘加运算,可以只使用分量数值进行运算以获得运算结果,作为中间结果。分量缩放因子的影响可以在随后的结果组合中处理。
例如,在一些实施例中,在步骤S620中,执行所指定的运算可以包括:利用分量数值来执行该运算以获取运算结果。进一步地,在步骤S630中,执行加权组合可以包括:利用加权因子,将当前运算结果与前一次的组合结果进行加权组合,其中加权因子至少部分基于对应于当前运算结果的分量的分量缩放因子而确定。
如前面所提到的,基于分量的运算顺序,例如从低数位到高数位,或者从高数位到低数位,可以采取不同的加权组合方式。
在一些实施例中,在步骤S630中执行加权组合可以包括:将步骤S620的运算结果乘以第一加权因子,以得到加权结果,其中第一加权因子为对应于当前运算结果的分量的分量缩放因子之积;以及将该加权结果与前一次的组合结果进行累加。
在另一些实施例中,在步骤S630中执行加权组合可以包括:将前一次的组合结果乘以第二加权因子,以得到加权结果,其中第二加权因子为前一运算结果的缩放因子与当前运算结果的缩放因子之比,其中运算结果的缩放因子由对应于运算结果的分量的分量缩放因子而确定;以及将加权结果与步骤S620的当前运算结果进行累加。
上面已经参考流程图描述本披露实施例的计算装置所执行的计算方法。本领域技术人员可以理解,由于将大位宽的数据运算分解为多次小位宽的数据运算,因此,上述方法的各步骤之间可以并行执行相应处理,进一步提高计算效率。
图7是示出根据本披露实施例的一种组合处理装置700的结构图。如图7中所示,该组合处理装置700包括计算处理装置702、接口装置704、其他处理装置706和存储装置708。根据不同的应用场景,计算处理装置中可以包括一个或多个计算装置710,该计算装置可以配置用于执行本文结合附图1-6所描述的操作。
在不同的实施例中,本披露的计算处理装置可以配置成执行用户指定的操作。在示例性的应用中,该计算处理装置可以实现为单核人工智能处理器或者多核人工智能处理器。类似地,包括在计算处理装置内的一个或多个计算装置可以实现为人工智能处理器核或者人工智能处理器核的部分硬件结构。当多个计算装置实现为人工智能处理器核或人工智能处理器核的部分硬件结构时,就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。
在示例性的操作中,本披露的计算处理装置可以通过接口装置与其他处理装置进行交互,以共同完成用户指定的操作。根据实现方式的不同,本披露的其他处理装置可以包括中央处理器(Central Processing Unit,CPU)、图形处理器(Graphics Processing Unit,GPU)、人工智能处理器等通用和/或专用处理器中的一种或多种类型的处理器。这些处理器可以包括但不限于数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算处理装置和其他处理装置共同考虑时,二者可以视为形成异构多核结构。
在一个或多个实施例中,该其他处理装置可以作为本披露的计算处理装置(其可以具体化为人工智能例如神经网络运算的相关运算装置)与外部数据和控制的接口,执行包括但不限于数据搬运、对计算装置的开启和/或停止等基本控制。在另外的实施例中,其他处理装置也可以和该计算处理装置协作以共同完成运算任务。
在一个或多个实施例中，该接口装置可以用于在计算处理装置与其他处理装置间传输数据和控制指令。例如，该计算处理装置可以经由所述接口装置从其他处理装置中获取输入数据，写入该计算处理装置片上的存储装置(或称存储器)。进一步，该计算处理装置可以经由所述接口装置从其他处理装置中获取控制指令，写入计算处理装置片上的控制缓存中。替代地或可选地，接口装置也可以读取计算处理装置的存储装置中的数据并传输给其他处理装置。
附加地或可选地,本披露的组合处理装置还可以包括存储装置。如图中所示,该存储装置分别与所述计算处理装置和所述其他处理装置连接。在一个或多个实施例中,存储装置可以用于保存所述计算处理装置和/或所述其他处理装置的数据。例如,该数据可以是在计算处理装置或其他处理装置的内部或片上存储装置中无法全部保存的数据。
在一些实施例中，本披露还公开了一种芯片(例如图8中示出的芯片802)。在一种实现中，该芯片是一种系统级芯片(System on Chip，SoC)，并且集成有一个或多个如图7中所示的组合处理装置。该芯片可以通过对外接口装置(如图8中示出的对外接口装置806)与其他相关部件相连接。该相关部件可以例如是摄像头、显示器、鼠标、键盘、网卡或Wi-Fi接口。在一些应用场景中，该芯片上可以集成有其他处理单元(例如视频编解码器)和/或接口模块(例如DRAM接口)等。在一些实施例中，本披露还公开了一种芯片封装结构，其包括了上述芯片。在一些实施例中，本披露还公开了一种板卡，其包括上述的芯片封装结构。下面将结合图8对该板卡进行详细描述。
图8是示出根据本披露实施例的一种板卡800的结构示意图。如图8中所示,该板卡包括用于存储数据的存储器件804,其包括一个或多个存储单元810。该存储器件可以通过例如总线等方式与控制器件808和上文所述的芯片802进行连接和数据传输。进一步,该板卡还包括对外接口装置806,其配置用于芯片(或芯片封装结构中的芯片)与外部设备812(例如服务器或计算机等)之间的数据中继或转接功能。例如,待处理的数据可以由外部设备通过对外接口装置传递至芯片。又例如,所述芯片的计算结果可以经由所述对外接口装置传送回外部设备。根据不同的应用场景,所述对外接口装置可以具有不同的接口形式,例如其可以采用标准PCIE接口等。
在一个或多个实施例中,本披露板卡中的控制器件可以配置用于对所述芯片的状态进行调控。为此,在一个应用场景中,该控制器件可以包括单片机(Micro Controller Unit,MCU),以用于对所述芯片的工作状态进行调控。
根据上述结合图7和图8的描述,本领域技术人员可以理解本披露也公开了一种电子设备或装置,其可以包括一个或多个上述板卡、一个或多个上述芯片和/或一个或多个上述组合处理装置。
根据不同的应用场景，本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆；所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机；所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步，本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中，根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器)，而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中，云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容，从而可以根据终端设备和/或边缘端设备的硬件信息，从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源，以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行划分,而实际实现时也可以有另外的划分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在一些实现场景中,上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时,所述集成的单元可以存储在计算机可读取存储器中。基于此,当本披露的方案以软件产品(例如计算机可读存储介质)的形式体现时,该软件产品可以存储在存储器中,其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在另外一些实现场景中，上述集成的单元也可以采用硬件的形式实现，即为具体的硬件电路，其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件，而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此，本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现，例如CPU、GPU、FPGA、DSP和ASIC等。进一步，前述的存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等)，其例如可以是可变电阻式存储器(Resistive Random Access Memory，RRAM)、动态随机存取存储器(Dynamic Random Access Memory，DRAM)、静态随机存取存储器(Static Random Access Memory，SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory，EDRAM)、高带宽存储器(High Bandwidth Memory，HBM)、混合存储器立方体(Hybrid Memory Cube，HMC)、ROM和RAM等。
虽然本文已经示出和描述了本披露的多个实施例,但对于本领域技术人员显而易见的是,这样的实施例只是以示例的方式来提供。本领域技术人员可以在不偏离本披露思想和精神的情况下想到许多更改、改变和替代的方式。应当理解的是在实践本披露的过程中,可以采用对本文所描述的本披露实施例的各种替代方案。所附权利要求书旨在限定本披露的保护范围,并因此覆盖这些权利要求范围内的等同或替代方案。
依据以下条款可更好地理解前述内容:
条款1.一种计算装置,包括:
运算电路,其配置用于:
接收与运算指令关联的多个待运算数据,其中至少一个待运算数据由两个或更多个分量来表征,所述至少一个待运算数据具有源数据位宽,每个所述分量具有各自的目标数据位宽,并且所述目标数据位宽小于所述源数据位宽;以及
使用所述两个或更多个分量代替所表征的待运算数据来执行所述运算指令所指定的运算,以获得两个或更多个中间结果;
组合电路,其配置用于:
将所述中间结果进行组合,以获得最终结果;以及
存储电路,其配置用于存储所述中间结果和/或所述最终结果。
条款2.根据条款1所述的计算装置,其中,
所述运算电路配置用于将一个待运算数据的所述两个或更多个分量分别与其他待运算数据的对应数据执行所述运算,并将对应的运算结果输出到所述组合电路;并且
所述组合电路配置用于将所述运算结果进行加权组合,以得到最终结果。
条款3.根据条款2所述的计算装置,其中所述其他待运算数据包括一个或多个待运算数据,并且其对应数据包括以下任一:待运算数据的原始数据、或表征待运算数据的至少一个分量。
条款4.根据条款2-3任一所述的计算装置,其中,所述运算指令包括涉及乘法运算或者乘加运算的指令,并且所述运算电路包括乘法运算电路或者乘加运算电路。
条款5.根据条款4所述的计算装置,其中,每个所述分量具有分量数值和分量缩放因子,所述分量缩放因子与对应分量在所表征的待运算数据中的数位位置相关联;
其中所述运算电路配置用于利用所述分量数值执行所述运算以获取运算结果;并且
所述组合电路配置用于利用加权因子，将所述运算电路的当前运算结果与所述组合电路的前一次的组合结果进行加权组合，其中所述加权因子至少部分基于对应于所述运算结果的分量的分量缩放因子而确定。
条款6.根据条款5所述的计算装置,其中所述组合电路包括加权电路和加法电路,
所述加权电路配置用于将所述运算电路的运算结果乘以第一加权因子,以得到加权结果,其中所述第一加权因子为对应于所述运算结果的分量的分量缩放因子之积;并且
所述加法电路配置用于将所述加权结果与所述加法电路的前一次加法结果进行累加。
条款7.根据条款5所述的计算装置,其中所述组合电路包括加权电路和加法电路,
所述加权电路配置用于将所述加法电路的前一次加法结果乘以第二加权因子,以得到加权结果,其中所述第二加权因子为所述运算电路的前一运算结果的缩放因子与当前运算结果的缩放因子之比,其中所述运算结果的缩放因子由对应于所述运算结果的分量的分量缩放因子而确定;并且
所述加法电路配置用于将所述加权结果与所述运算电路的当前运算结果进行累加。
条款8.根据条款4-6任一所述的计算装置,其中,所述运算电路还包括第一比较电路,所述第一比较电路配置用于:
判断即将对其执行所述运算的数据中任一是否为零,其中所述数据包括以下任一:待运算数据的原始数据、或表征待运算数据的分量;以及
如果所述数据为零,则省略针对所述数据执行所述运算指令所指定的运算;
否则,使用所述数据执行所述运算指令所指定的运算。
条款9.根据条款1-8任一所述的计算装置,其中,所述组合电路还包括第二比较电路,所述第二比较电路配置用于:
判断接收的所述中间结果是否为零;以及
若所述中间结果为零,则省略针对所述中间结果执行所述组合;
否则,使用所述中间结果进行所述组合。
条款10.根据条款1-9任一所述的计算装置,其中,
用于表征所述至少一个待运算数据的分量的数量是至少部分基于所述源数据位宽和所述运算电路所支持的数据位宽而确定的;和/或
所述目标数据位宽是至少部分基于所述运算电路所支持的数据位宽而确定的。
条款11.根据条款1-10任一所述的计算装置,其中,
所述运算电路进一步配置用于按接收所述两个或更多个分量的顺序来执行所述运算指令所指定的运算，其中所述顺序包括：从高数位到低数位，或从低数位到高数位。
条款12.根据条款1-11任一所述的计算装置,其中,所述待运算数据为向量,并且执行所述运算指令所指定的运算包括:
在所述向量中的元素之间,并行地执行所述运算。
条款13.一种集成电路芯片,包括根据条款1-12任一所述的计算装置。
条款14.一种集成电路板卡,包括根据条款13所述的集成电路芯片。
条款15.一种计算设备,包括根据条款14所述的板卡。
条款16.一种由计算装置执行的方法,所述方法包括:
接收与运算指令关联的多个待运算数据,其中至少一个待运算数据由两个或更多个分量来表征,所述至少一个待运算数据具有源数据位宽,每个所述分量具有各自的目标数据位宽,并且所述目标数据位宽小于所述源数据位宽;
使用所述两个或更多个分量代替所表征的待运算数据来执行所述运算指令所指定的运算,以获得两个或更多个中间结果;以及
将所述中间结果进行组合,以获得最终结果。
条款17.根据条款16所述的方法,其中,
执行所述运算指令所指定的运算包括:
将一个待运算数据的所述两个或更多个分量分别与其他待运算数据的对应数据执行所述运算,以获得对应的运算结果;并且
将所述中间结果进行组合包括:
对所述运算结果执行加权组合,以得到最终结果。
条款18.根据条款17所述的方法，其中，所述运算指令包括涉及乘法运算或者乘加运算的指令。
条款19.根据条款18所述的方法,其中,每个所述分量具有分量数值和分量缩放因子,所述分量缩放因子与对应分量在所表征的待运算数据中的数位位置相关联;
所述执行所述运算指令所指定的运算包括:
利用所述分量数值执行所述运算以获取运算结果;并且
所述执行加权组合包括:
利用加权因子,将当前运算结果与前一次的组合结果进行加权组合,其中所述加权因子至少部分基于对应于所述运算结果的分量的分量缩放因子而确定。
条款20.根据条款19所述的方法,其中所述执行加权组合包括:
将所述运算结果乘以第一加权因子,以得到加权结果,其中所述第一加权因子为对应于所述运算结果的分量的分量缩放因子之积;以及
将所述加权结果与前一次的组合结果进行累加。
条款21.根据条款19所述的方法,其中所述执行加权组合包括:
将前一次的组合结果乘以第二加权因子,以得到加权结果,其中所述第二加权因子为前一运算结果的缩放因子与当前运算结果的缩放因子之比,其中运算结果的缩放因子由对应于所述运算结果的分量的分量缩放因子而确定;以及
将所述加权结果与当前运算结果进行累加。
条款22.根据条款16-21任一所述的方法,还包括:
判断即将对其执行所述运算的数据中任一是否为零,其中所述数据包括以下任一:待运算数据的原始数据、或表征待运算数据的分量;以及
如果所述数据为零，则省略针对所述数据执行所述运算指令所指定的运算；
否则，使用所述数据执行所述运算指令所指定的运算。
以上对本披露实施例进行了详细介绍，本文中应用了具体个例对本披露的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本披露的方法及其核心思想。同时，本领域技术人员依据本披露的思想，基于本披露的具体实施方式及应用范围上做出的改变或变形之处，都属于本披露保护的范围。综上所述，本说明书内容不应理解为对本披露的限制。

Claims (22)

  1. 一种计算装置,包括:
    运算电路,其配置用于:
    接收与运算指令关联的多个待运算数据,其中至少一个待运算数据由两个或更多个分量来表征,所述至少一个待运算数据具有源数据位宽,每个所述分量具有各自的目标数据位宽,并且所述目标数据位宽小于所述源数据位宽;以及
    使用所述两个或更多个分量代替所表征的待运算数据来执行所述运算指令所指定的运算,以获得两个或更多个中间结果;
    组合电路,其配置用于:
    将所述中间结果进行组合,以获得最终结果;以及
    存储电路,其配置用于存储所述中间结果和/或所述最终结果。
  2. 根据权利要求1所述的计算装置,其中,
    所述运算电路配置用于将一个待运算数据的所述两个或更多个分量分别与其他待运算数据的对应数据执行所述运算,并将对应的运算结果输出到所述组合电路;并且
    所述组合电路配置用于将所述运算结果进行加权组合,以得到最终结果。
  3. 根据权利要求2所述的计算装置,其中所述其他待运算数据包括一个或多个待运算数据,并且其对应数据包括以下任一:待运算数据的原始数据、或表征待运算数据的至少一个分量。
  4. 根据权利要求2-3任一所述的计算装置,其中,所述运算指令包括涉及乘法运算或者乘加运算的指令,并且所述运算电路包括乘法运算电路或者乘加运算电路。
  5. 根据权利要求4所述的计算装置,其中,每个所述分量具有分量数值和分量缩放因子,所述分量缩放因子与对应分量在所表征的待运算数据中的数位位置相关联;
    其中所述运算电路配置用于利用所述分量数值执行所述运算以获取运算结果;并且
    所述组合电路配置用于利用加权因子,将所述运算电路的当前运算结果与所述组合电路的前一次的组合结果进行加权组合,其中所述加权因子至少部分基于对应于所述运算结果的分量的分量缩放因子而确定。
  6. 根据权利要求5所述的计算装置,其中所述组合电路包括加权电路和加法电路,
    所述加权电路配置用于将所述运算电路的运算结果乘以第一加权因子,以得到加权结果,其中所述第一加权因子为对应于所述运算结果的分量的分量缩放因子之积;并且
    所述加法电路配置用于将所述加权结果与所述加法电路的前一次加法结果进行累加。
  7. 根据权利要求5所述的计算装置,其中所述组合电路包括加权电路和加法电路,
    所述加权电路配置用于将所述加法电路的前一次加法结果乘以第二加权因子，以得到加权结果，其中所述第二加权因子为所述运算电路的前一运算结果的缩放因子与当前运算结果的缩放因子之比，其中所述运算结果的缩放因子由对应于所述运算结果的分量的分量缩放因子而确定；并且
    所述加法电路配置用于将所述加权结果与所述运算电路的当前运算结果进行累加。
  8. 根据权利要求4-6任一所述的计算装置,其中,所述运算电路还包括第一比较电路,所述第一比较电路配置用于:
    判断即将对其执行所述运算的数据中任一是否为零,其中所述数据包括以下任一:待运算数据的原始数据、或表征待运算数据的分量;以及
    如果所述数据为零,则省略针对所述数据执行所述运算指令所指定的运算;
    否则,使用所述数据执行所述运算指令所指定的运算。
  9. 根据权利要求1-8任一所述的计算装置,其中,所述组合电路还包括第二比较电路,所述第二比较电路配置用于:
    判断接收的所述中间结果是否为零;以及
    若所述中间结果为零,则省略针对所述中间结果执行所述组合;
    否则,使用所述中间结果进行所述组合。
  10. 根据权利要求1-9任一所述的计算装置,其中,
    用于表征所述至少一个待运算数据的分量的数量是至少部分基于所述源数据位宽和所述运算电路所支持的数据位宽而确定的;和/或
    所述目标数据位宽是至少部分基于所述运算电路所支持的数据位宽而确定的。
  11. 根据权利要求1-10任一所述的计算装置,其中,
    所述运算电路进一步配置用于按接收所述两个或更多个分量的顺序来执行所述运算指令所指定的运算，其中所述顺序包括：从高数位到低数位，或从低数位到高数位。
  12. 根据权利要求1-11任一所述的计算装置,其中,所述待运算数据为向量,并且执行所述运算指令所指定的运算包括:
    在所述向量中的元素之间,并行地执行所述运算。
  13. 一种集成电路芯片,包括根据权利要求1-12任一所述的计算装置。
  14. 一种集成电路板卡,包括根据权利要求13所述的集成电路芯片。
  15. 一种计算设备,包括根据权利要求14所述的板卡。
  16. 一种由计算装置执行的方法,所述方法包括:
    接收与运算指令关联的多个待运算数据,其中至少一个待运算数据由两个或更多个分量来表征,所述至少一个待运算数据具有源数据位宽,每个所述分量具有各自的目标数据位宽,并且所述目标数据位宽小于所述源数据位宽;
    使用所述两个或更多个分量代替所表征的待运算数据来执行所述运算指令所指定的运算,以获得两个或更多个中间结果;以及
    将所述中间结果进行组合,以获得最终结果。
  17. 根据权利要求16所述的方法,其中,
    执行所述运算指令所指定的运算包括:
    将一个待运算数据的所述两个或更多个分量分别与其他待运算数据的对应数据执行所述运算,以获得对应的运算结果;并且
    将所述中间结果进行组合包括:
    对所述运算结果执行加权组合,以得到最终结果。
  18. 根据权利要求17所述的方法，其中，所述运算指令包括涉及乘法运算或者乘加运算的指令。
  19. 根据权利要求18所述的方法,其中,每个所述分量具有分量数值和分量缩放因子,所述分量缩放因子与对应分量在所表征的待运算数据中的数位位置相关联;
    所述执行所述运算指令所指定的运算包括:
    利用所述分量数值执行所述运算以获取运算结果;并且
    所述执行加权组合包括:
    利用加权因子,将当前运算结果与前一次的组合结果进行加权组合,其中所述加权因子至少部分基于对应于所述运算结果的分量的分量缩放因子而确定。
  20. 根据权利要求19所述的方法,其中所述执行加权组合包括:
    将所述运算结果乘以第一加权因子,以得到加权结果,其中所述第一加权因子为对应于所述运算结果的分量的分量缩放因子之积;以及
    将所述加权结果与前一次的组合结果进行累加。
  21. 根据权利要求19所述的方法,其中所述执行加权组合包括:
    将前一次的组合结果乘以第二加权因子,以得到加权结果,其中所述第二加权因子为前一运算结果的缩放因子与当前运算结果的缩放因子之比,其中运算结果的缩放因子由对应于所述运算结果的分量的分量缩放因子而确定;以及
    将所述加权结果与当前运算结果进行累加。
  22. 根据权利要求16-21任一所述的方法,还包括:
    判断即将对其执行所述运算的数据中任一是否为零,其中所述数据包括以下任一:待运算数据的原始数据、或表征待运算数据的分量;以及
    如果所述数据为零，则省略针对所述数据执行所述运算指令所指定的运算；
    否则，使用所述数据执行所述运算指令所指定的运算。
PCT/CN2021/094467 2020-06-29 2021-05-18 一种计算装置、集成电路芯片、板卡、设备和计算方法 WO2022001438A1 (zh)
