CN114115995A - Artificial intelligence chip, operation board card, data processing method and electronic equipment - Google Patents

Artificial intelligence chip, operation board card, data processing method and electronic equipment

Info

Publication number
CN114115995A
CN114115995A (application CN202010877800.7A)
Authority
CN
China
Prior art keywords
data
format
continuous
buffer
conversion circuit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010877800.7A
Other languages
Chinese (zh)
Inventor
何轲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010877800.7A priority Critical patent/CN114115995A/en
Publication of CN114115995A publication Critical patent/CN114115995A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements
    • G06F 9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F 9/30123 Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Processing (AREA)

Abstract

An AI chip, an operation board card, a data processing method and an electronic device are provided to improve the processing efficiency of the AI chip. The AI chip includes a first pipeline module comprising a conversion circuit and a plurality of buffers connected to the conversion circuit. The conversion circuit is used to acquire data of a first characteristic to be converted and convert it into data of a second characteristic, where the data of the second characteristic is suitable for operation by the arithmetic circuit; the data of the second characteristic is written into the plurality of buffers alternately. The arithmetic circuit is used to read the data of the second characteristic from a full buffer when one of the plurality of buffers is full, and operate on it to obtain result data. In the embodiments of the application, a dedicated hardware module is added to support the data conversion function, which simplifies operator calls on the software side, improves overall performance, and reduces call overhead and call flow.

Description

Artificial intelligence chip, operation board card, data processing method and electronic equipment
Technical Field
The application relates to the field of artificial intelligence chips, in particular to an artificial intelligence chip, an operation board card, a data processing method and electronic equipment.
Background
Currently, with the popularization of intelligent devices, artificial intelligence (AI) technology is developing rapidly. Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
As an important research direction in the field of artificial intelligence, deep learning (for example, a convolutional neural network) is trained continuously on a large number of samples to extract their feature information. When a new test sample is encountered, it can be identified by comparing its feature information against the learned features; applying such a model to a machine gives the machine human-like recognition capability. The stages of sample training and information extraction involve a very large amount of computation, and an AI chip is required to operate on large amounts of data during this process.
As shown in fig. 1, the AI chip includes a data preprocessing module, a buffer (cache), and an arithmetic circuit. The arithmetic circuit places requirements on the data to be processed, for example on the format of the data (e.g., the data may be required to be in a 5D format). Therefore, when the data preprocessing module performs data conversion, software intervention is required to call multiple operators, which are mapped to the data preprocessing module so that it converts the data. On the software side, each such call is regarded as an extra operator call, and the increased number of operator calls raises the execution overhead on the software side and affects overall performance. In addition, in this mode of mapping operator calls to the data preprocessing module, the arithmetic circuit can only begin calculating after the data preprocessing module has converted all the data; the arithmetic circuit therefore idles during part of the execution, which reduces the operational efficiency of the chip.
Disclosure of Invention
The embodiment of the application provides an artificial intelligence chip, an operation board card, a data processing method and electronic equipment, which are used for improving the processing efficiency of the artificial intelligence chip.
In a first aspect, an embodiment of the present application provides an artificial intelligence chip. The AI chip includes a first pipeline module comprising a conversion circuit and a plurality of buffers connected to the conversion circuit. The conversion circuit is used to acquire data of a first characteristic to be converted and convert it into data of a second characteristic, where the data of the second characteristic is suitable for operation by the arithmetic circuit. The conversion circuit writes the data of the second characteristic into the plurality of buffers alternately: data is first written into a first buffer; when that buffer is full, writing continues into a second buffer, and so on until the last buffer is full, after which the state switches back and the cycle repeats from the first buffer. The arithmetic circuit is used to read the data of the second characteristic from a full buffer whenever one of the plurality of buffers becomes full, operate on that data to obtain result data, and output the result data.
In this example, a first pipeline module is added to the AI chip; the first pipeline module includes a conversion circuit and a plurality of buffers, where the conversion circuit supports, for example, a continuity conversion and/or a format conversion function. In the application, a dedicated hardware module is added to support the data conversion function: the conversion circuit writes the data of the second characteristic into the plurality of buffers alternately, and the plurality of buffers implement a ping-pong cache mechanism. When one of the buffers is full, the arithmetic circuit is activated to operate on the data it reads. Operators therefore no longer need to be called repeatedly to preprocess data to meet the arithmetic circuit's requirements, which simplifies operator calls on the software side, improves overall performance, and reduces call overhead and call flow. Moreover, because the conversion circuit writes converted data into the buffers alternately and the arithmetic circuit reads directly from whichever buffer is full, fine-grained pipeline operation is realized on top of the ping-pong cache mechanism: the arithmetic circuit and the conversion circuit process data in parallel, and the arithmetic circuit incurs no waiting time. Compared with the traditional method, in which all data must be preprocessed before any operation begins, this greatly improves the processing efficiency of the chip.
In one optional implementation, the conversion circuit is a first format conversion circuit, and the buffer is a conversion format buffer; the chip also comprises a second pipeline module connected with the arithmetic circuit, wherein the second pipeline module comprises a second format conversion circuit and a plurality of target format buffers connected with the second format conversion circuit; the first format conversion circuit is used for converting the data in the first format into the data in the second format and alternately writing the data in the second format into the plurality of conversion format buffers; the operation circuit is used for reading the second format data from the conversion format buffer when one of the conversion format buffers is full, and processing the second format data to obtain the result data of the second format; and the second format conversion circuit is used for converting the result data in the second format into the result data in the first format and alternately writing the result data in the first format into the plurality of target format buffers.
In this example, two pipeline modules are added: the first pipeline module includes a first format conversion circuit, and the second pipeline module includes a second format conversion circuit; that is, dedicated circuits supporting format conversion are added to implement the format conversion function. Format conversion is achieved without calling a format conversion operator, which simplifies operator calls on the software side, improves overall performance, and reduces call overhead and call flow. Fine-grained pipeline operation is realized based on the ping-pong cache mechanism: the first format conversion circuit, the arithmetic circuit and the second format conversion circuit process data in parallel, the arithmetic circuit incurs no waiting time, and the processing efficiency of the chip is greatly improved.
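The two-stage flow described above (first format conversion, operation, second format conversion) can be sketched in NumPy. This is a minimal illustration, not the patent's implementation: the function names, the choice of C0 = 4, and the square-root operation standing in for the arithmetic circuit are all assumptions.

```python
import numpy as np

def nchw_to_5d(x, c0=4):
    """First format conversion circuit: 4D NCHW -> 5D NC1HWC0 (zero-padded)."""
    n, c, h, w = x.shape
    c1 = -(-c // c0)                          # ceil(C / C0)
    p = np.zeros((n, c1 * c0, h, w), x.dtype)
    p[:, :c] = x                              # zero padding along C for alignment
    return p.reshape(n, c1, c0, h, w).transpose(0, 1, 3, 4, 2)

def fived_to_nchw(y, c):
    """Second format conversion circuit: 5D NC1HWC0 -> 4D NCHW (drop padding)."""
    n, c1, h, w, c0 = y.shape
    return y.transpose(0, 1, 4, 2, 3).reshape(n, c1 * c0, h, w)[:, :c]

x = np.random.rand(1, 6, 2, 2).astype(np.float32)  # first-format (4D) input, C=6
y = nchw_to_5d(x)            # first pipeline module converts to the 5D layout
r = np.sqrt(y)               # arithmetic circuit operates on second-format data
out = fived_to_nchw(r, c=6)  # second pipeline module converts results back
print(out.shape)             # (1, 6, 2, 2): result data back in the first format
```

Because the element-wise operation commutes with the layout change, `out` equals `np.sqrt(x)`, which is the point of the round trip: software sees first-format data on both ends while the hardware computes in its preferred layout.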
In an optional implementation manner, the first pipeline module further includes a continuous conversion circuit and a plurality of continuous data buffers connected to the continuous conversion circuit; the continuous conversion circuit is used for converting discontinuous data into continuous data and alternately writing the continuous data into a plurality of continuous data buffers; the first format conversion circuit is further configured to read continuous data from the full continuous data buffer when one of the plurality of continuous data buffers is full, the continuous data being data in a first format, and convert the data in the first format into data in a second format.
In this example, the first pipeline module further includes a continuous conversion circuit. When the data to be converted is discontinuous data in the first format, it is first converted into continuous data by the continuous conversion circuit; data continuity conversion is achieved without calling a continuity conversion operator, which simplifies operator calls on the software side, improves overall performance, and reduces call overhead and call flow. Fine-grained pipeline operation is realized using the ping-pong cache mechanism: the continuous conversion circuit and the first format conversion circuit process data in parallel, the arithmetic circuit incurs no waiting time, and the processing efficiency of the chip is greatly improved.
In an alternative implementation, the conversion circuit is a continuous conversion circuit, and the buffer is a continuous data buffer; the continuous conversion circuit is used to convert discontinuous data into continuous data and write the continuous data into the plurality of continuous data buffers alternately; the arithmetic circuit is used to read continuous data from the full continuous data buffer when one of the plurality of continuous data buffers is full, operate on the continuous data to obtain result data, and output the result data.
In this example, when the data to be converted is discontinuous data, it is converted into continuous data by the continuous conversion circuit; data continuity conversion is achieved without calling a continuity conversion operator, which simplifies operator calls on the software side, improves overall performance, and reduces call overhead and call flow. Fine-grained pipeline operation is realized using the ping-pong cache mechanism: the continuous conversion circuit and the arithmetic circuit process data in parallel, the arithmetic circuit incurs no waiting time, and the processing efficiency of the chip is greatly improved.
In an alternative implementation, the arithmetic circuit includes a convolution calculation unit and/or a vector calculation unit.
In an optional implementation manner, the chip further includes a first buffer and a second buffer; the first buffer is used for outputting data to be converted to the first pipeline module; and the second buffer is used for buffering the result data. In the method, the first buffer and the second buffer are added, the first pipeline module reads data from the first buffer, and the second pipeline module outputs result data to the second buffer.
In a second aspect, the present application provides a data processing method applied to an artificial intelligence chip, where the chip includes a first pipeline module and an arithmetic circuit connected in sequence, and the first pipeline module includes a conversion circuit and a plurality of buffers connected to the conversion circuit. The method includes: the conversion circuit acquires data of a first characteristic to be converted and converts it into data of a second characteristic, where the data of the second characteristic is suitable for operation by the arithmetic circuit; the data of the second characteristic is written into the plurality of buffers alternately; when one of the plurality of buffers is full, the arithmetic circuit reads the data of the second characteristic from the full buffer, operates on it to obtain result data, and outputs the result data.
In this example, a first pipeline module is added to the AI chip; the first pipeline module includes a conversion circuit and a plurality of buffers, where the conversion circuit supports, for example, a continuity conversion and/or a format conversion function. In the application, a dedicated hardware module is added to support the data conversion function: the conversion circuit writes the data of the second characteristic into the plurality of buffers alternately, and the plurality of buffers implement a ping-pong cache mechanism. When one of the buffers is full, the arithmetic circuit is activated to operate on the data it reads. Operators do not need to be called repeatedly to preprocess data to meet the arithmetic circuit's requirements, which simplifies operator calls on the software side, improves overall performance, and reduces call overhead and call flow. Moreover, because the conversion circuit writes converted data into the buffers alternately and the arithmetic circuit reads directly from whichever buffer is full, fine-grained pipeline operation is realized on top of the ping-pong cache mechanism: the arithmetic circuit and the conversion circuit process data in parallel, and the arithmetic circuit incurs no waiting time. Compared with the traditional method, in which all data must be preprocessed before any operation begins, this greatly improves the processing efficiency of the chip.
In an alternative implementation, the conversion circuit is a first format conversion circuit, and the buffer is a conversion format buffer; the chip further includes a second pipeline module connected to the arithmetic circuit, where the second pipeline module includes a second format conversion circuit and a plurality of target format buffers connected to the second format conversion circuit. The first format conversion circuit converts the data in the first format into data in the second format and writes the data in the second format into the plurality of conversion format buffers alternately; when one of the conversion format buffers is full, the arithmetic circuit reads the data in the second format from the full conversion format buffer and processes it to obtain result data in the second format; the second format conversion circuit converts the result data in the second format into result data in the first format and writes the result data in the first format into the plurality of target format buffers alternately.
In this example, two pipeline modules are added: the first pipeline module includes a first format conversion circuit, and the second pipeline module includes a second format conversion circuit; that is, dedicated circuits supporting format conversion are added to implement the format conversion function. Format conversion is achieved without calling a format conversion operator, which simplifies operator calls on the software side, improves overall performance, and reduces call overhead and call flow. Fine-grained pipeline operation is realized based on the ping-pong cache mechanism: the first format conversion circuit, the arithmetic circuit and the second format conversion circuit process data in parallel, the arithmetic circuit incurs no waiting time, and the processing efficiency of the chip is greatly improved.
In an optional implementation manner, the first pipeline module further includes a continuous conversion circuit and a plurality of continuous data buffers connected to the continuous conversion circuit; the method may further comprise: the continuous conversion circuit converts discontinuous data into continuous data and alternately writes the continuous data into a plurality of continuous data buffers; when one of the plurality of continuous data buffers is full, the first format conversion circuit reads continuous data from the full continuous data buffer, the continuous data being data in the first format.
In this example, the first pipeline module further includes a continuous conversion circuit. When the data to be converted is discontinuous data in the first format, it is first converted into continuous data by the continuous conversion circuit; data continuity conversion is achieved without calling a continuity conversion operator, which simplifies operator calls on the software side, improves overall performance, and reduces call overhead and call flow. Fine-grained pipeline operation is realized using the ping-pong cache mechanism: the continuous conversion circuit and the first format conversion circuit process data in parallel, the arithmetic circuit incurs no waiting time, and the processing efficiency of the chip is greatly improved.
In an alternative implementation, the conversion circuit is a continuous conversion circuit, and the buffer is a continuous data buffer; the continuous conversion circuit converts discontinuous data into continuous data and alternately writes the continuous data into a plurality of continuous data buffers; when one of the plurality of continuous data buffers is full, the arithmetic circuit reads continuous data from the full continuous data buffer, performs arithmetic on the continuous data to obtain result data, and outputs the result data.
In this example, when the data to be converted is discontinuous data, it is converted into continuous data by the continuous conversion circuit; data continuity conversion is achieved without calling a continuity conversion operator, which simplifies operator calls on the software side, improves overall performance, and reduces call overhead and call flow. Fine-grained pipeline operation is realized using the ping-pong cache mechanism: the continuous conversion circuit and the arithmetic circuit process data in parallel, the arithmetic circuit incurs no waiting time, and the processing efficiency of the chip is greatly improved.
In a third aspect, an embodiment of the present application further provides an artificial intelligence operation board, including a communication interface and the artificial intelligence chip according to any implementation of the first aspect, where the communication interface is used to connect to a host.
In a fourth aspect, an embodiment of the present application further provides an electronic device, including a processor, a memory coupled to the processor, and the artificial intelligence operation board of the third aspect; the processor and the memory perform data transmission with the artificial intelligence operation board through a communication interface.
In a fifth aspect, the present application provides a chip system, where the chip system includes a processor and the artificial intelligence chip of the first aspect, and the processor and the artificial intelligence chip perform data transmission. In one possible design, the chip system further includes a memory for storing data to be converted by the artificial intelligence chip and result data after the operation of the artificial intelligence chip.
Drawings
FIG. 1 is a schematic diagram of an example of an artificial intelligence chip in a conventional method;
FIG. 2 is a schematic structural diagram of an example of an artificial intelligence chip in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of another example of an artificial intelligence chip in an embodiment of the present application;
FIG. 4 is a flowchart illustrating steps of an example of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating steps of another example of a data processing method in an embodiment of the present application;
FIG. 6 is a schematic flow chart illustrating steps of another example of a data processing method in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another example of an artificial intelligence chip in an embodiment of the present application;
FIG. 8 is a flowchart illustrating steps of another example of a data processing method in an embodiment of the present application;
FIG. 9 is a flowchart illustrating steps of another example of a data processing method in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of another example of an artificial intelligence chip in an embodiment of the present application;
FIG. 11 is a flowchart illustrating steps of another example of a data processing method according to an embodiment of the present application;
FIG. 12 is a flowchart illustrating steps of another example of a data processing method according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an example of an artificial intelligence operation board in the embodiment of the present application;
FIG. 14 is a schematic structural diagram of an example of an electronic device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings. The term "and/or" in the present application describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this application generally indicates that the associated objects before and after it are in an "or" relationship.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein are capable of operation in other sequences than described or illustrated herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
For a better understanding of the present application, the terms used in this application are first explained.
Format of data (format): multidimensional data is stored as a multidimensional array. In software, the data format is generally a 4D format, meaning that data is stored in a four-dimensional array. For example, the feature map of a convolutional neural network is stored in a four-dimensional array whose four dimensions are the batch size (batch, N), the feature map height (height, H), the feature map width (width, W), and the feature map channels (channels, C). Since data can only be stored linearly, these four dimensions must be laid out in a particular order, and different deep learning frameworks store feature map data in different orders. For example, the order may be [batch, channels, height, width], i.e., the NCHW format (a 4D format), or [batch, height, width, channels], i.e., the NHWC format (another 4D format). Owing to the size of the matrix calculation unit in the chip design (for example, 16 × 16 matrix calculation may be supported), the arithmetic circuit operates on data in a 5D format (or 6D, or another format); it can be understood that the 5D format carries hardware information. Here, the 5D format refers to the five-dimensional data format NC1HWC0. The fifth dimension "C0" is strongly related to data alignment: "C1" and "C0" are split from the "C" dimension of "NCHW", with C1 = C/C0; if C is not exactly divisible, zero padding is needed at the end for alignment. The rough disassembly process is: (1) partition the "NCHW" data along the C dimension to obtain C1 pieces of "NHWC0"; (2) arrange the C1 pieces of "NHWC0" contiguously in memory, which yields "NC1HWC0".
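The two-step NC1HWC0 disassembly above can be sketched in a few lines of NumPy. This is a minimal illustration assuming C0 = 16; the function name is hypothetical, not from the patent.

```python
import numpy as np

def nchw_to_nc1hwc0(x: np.ndarray, c0: int = 16) -> np.ndarray:
    """Convert an NCHW tensor to the NC1HWC0 5D layout.

    The C dimension is split into C1 blocks of size C0; if C is not a
    multiple of C0, the last block is zero-padded for alignment.
    """
    n, c, h, w = x.shape
    c1 = (c + c0 - 1) // c0            # ceil(C / C0)
    padded = np.zeros((n, c1 * c0, h, w), dtype=x.dtype)
    padded[:, :c, :, :] = x            # step (1): partition C, zero-pad the tail
    # step (2): (N, C1*C0, H, W) -> (N, C1, C0, H, W) -> (N, C1, H, W, C0)
    return padded.reshape(n, c1, c0, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 20 * 3 * 3, dtype=np.float32).reshape(2, 20, 3, 3)
y = nchw_to_nc1hwc0(x, c0=16)
print(y.shape)  # (2, 2, 3, 3, 16): C=20 splits into C1=2 blocks, padded to 32
```

With C = 20 and C0 = 16, the second C1 block holds only 4 real channels; its remaining 12 C0 slots are the alignment zeros the definition describes.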
Continuity of data: when the arithmetic circuit performs data operations, it places certain requirements on the data to be operated on; in particular, the data must be continuous. Continuous here means that the order of elements is continuous: if the storage order of the elements in the underlying one-dimensional array of a tensor is consistent with the element order of the tensor unrolled row-first into one dimension, the data is continuous; otherwise, the data is considered discontinuous. The underlying implementation of a multidimensional tensor is a one-dimensional array in contiguous memory; the tensor stores the shape of the multidimensional array in its meta-information, and when an element is accessed, the multidimensional index is converted into an offset relative to the start of the one-dimensional array. The offset between adjacent elements along a dimension is called the step size (stride). After some tensor operations, the positions of neighboring elements change, i.e., the data is no longer continuous. As an example, consider a two-dimensional array t:
t = [[0, 1, 2], [3, 4, 5], [6, 7, 8]], formula (1)
Converting the two-dimensional array t into a one-dimensional array row-first gives:
[0,1,2,3,4,5,6,7,8], formula (2)
If the actual storage form of the two-dimensional array is consistent with formula (2), the data is continuous; accessing the next element in a row of the matrix is achieved by offsetting by 1 position (stride 1).
If the actual storage form of the two-dimensional array is as follows:
[0,3,6,1,4,7,2,5,8], formula (3)
Since the above formula (3) does not match formula (2), the data is discontinuous; accessing the next element in a row of the matrix now requires offsetting by several positions (stride 3 in this 3 × 3 example).
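NumPy exposes exactly this stride-based behaviour, so the continuous and discontinuous cases above can be reproduced directly; the `ascontiguousarray` call at the end plays, in software, the role the continuous conversion circuit plays in hardware.

```python
import numpy as np

# The 3x3 array t stored row-major: the underlying 1-D buffer is
# [0, 1, ..., 8], matching formula (2), so the data is continuous.
t = np.arange(9).reshape(3, 3)
print(t.flags['C_CONTIGUOUS'])   # True

# Transposing reorders the logical elements without moving the buffer,
# so reading t.T row by row yields the order of formula (3):
# the view is discontinuous.
u = t.T
print(u.flags['C_CONTIGUOUS'])   # False
print(list(u.reshape(-1)))       # [0, 3, 6, 1, 4, 7, 2, 5, 8]

# Continuity conversion copies the elements into a fresh row-major buffer.
v = np.ascontiguousarray(u)
print(v.flags['C_CONTIGUOUS'])   # True
```

Inspecting `t.strides` and `u.strides` shows the stride swap that makes the transpose discontinuous: the per-row and per-column byte offsets are exchanged while the buffer stays put.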
Data of the first characteristic: "characteristic" covers format and/or continuity. The data of the first characteristic may be data in a first format (e.g., a 4D format); or it may be discontinuous data; or it may be both data in the first format and discontinuous data.
Data of the second characteristic: likewise, the data of the second characteristic may be data in a second format (e.g., a 5D format); or it may be continuous data; or it may be both data in the second format and continuous data.
The embodiment of the application provides an AI chip. The AI chip includes a first pipeline module and an arithmetic circuit connected in sequence, and the first pipeline module includes a conversion circuit and a plurality of buffers connected to the conversion circuit. When the arithmetic circuit performs a data operation, it imposes certain requirements on the data to be operated on; illustratively, there are requirements on the format of the data and/or on the continuity of the data. The conversion circuit alternately writes data of the second characteristic into the plurality of buffers. Writing data into the plurality of buffers alternately means that data is first written into the first buffer; when that buffer is full, writing continues into the second buffer, and so on until the last buffer is full, after which the state switches and writing starts again from the first buffer, then the second buffer, and so on. For example, when the number of buffers is 2, data is written into the second buffer after the first buffer is full, and once the second buffer is full, the state switches and data is written into the first buffer again. When one of the plurality of buffers is full, the arithmetic circuit reads the data of the second characteristic from the full buffer, performs an operation on that data to obtain result data, and outputs the result data. For example, when the first buffer is full, the arithmetic circuit reads data from the first buffer, and when the second buffer is full, the arithmetic circuit reads data from the second buffer.
In the embodiment of the present application, a first pipeline module is added to the AI chip. The first pipeline module includes a conversion circuit and a plurality of buffers; illustratively, the conversion circuit supports continuity conversion and/or format conversion. In this application, a dedicated hardware module with a data conversion function is added: the conversion circuit writes data of the second characteristic into the plurality of buffers alternately, and the plurality of buffers implement a ping-pong cache mechanism. When one of the plurality of buffers is full, the arithmetic circuit is started to operate on the data read from it. As a result, operators do not need to be called many times on the software side to preprocess data to meet the arithmetic circuit's requirements; operator invocation on the software side is simplified, the calling overhead and flow are reduced, and overall performance is improved. Moreover, because the conversion circuit writes converted data into the plurality of buffers alternately and the arithmetic circuit reads directly from whichever buffer is full, fine-grained pipeline operation is realized on top of the ping-pong cache mechanism: the arithmetic circuit and the conversion circuit process data in parallel, and the arithmetic circuit incurs no waiting time. Compared with the traditional method, in which all data must be preprocessed before any operation starts, this greatly improves the processing efficiency of the chip.
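A behavioral sketch of this alternating (ping-pong) scheme follows; the buffer size and the `process` callback are illustrative placeholders for the hardware, and in the real chip the fill and the drain happen concurrently, whereas this sequential model only shows the alternation:

```python
BUF_SIZE = 4  # assumed buffer capacity, for illustration only

def ping_pong_pipeline(stream, process):
    """Model of the conversion circuit alternately filling two buffers;
    `process` stands in for the arithmetic circuit and is invoked as
    soon as one buffer is full, while writing switches to the other."""
    buffers = [[], []]
    active = 0                      # index of the buffer being written
    results = []
    for item in stream:
        buffers[active].append(item)
        if len(buffers[active]) == BUF_SIZE:
            # buffer full: hand it to the "arithmetic circuit",
            # then switch writing to the other buffer
            results.append(process(buffers[active]))
            buffers[active] = []
            active ^= 1
    if buffers[active]:             # flush a partially filled tail
        results.append(process(buffers[active]))
    return results

# eight items, buffer size 4: two full buffers are handed over in turn
assert ping_pong_pipeline(range(8), sum) == [6, 22]
```

The `active ^= 1` switch corresponds to the state switch described above for the two-buffer case.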
In the embodiment of the present application, different implementations are possible according to the specific function of the conversion circuit. Illustratively: 1. When the data is discontinuous and its format is a first format (e.g., a 4D format), the conversion circuit may be a data format conversion circuit, and the first pipeline module further includes a continuous conversion circuit; the data format conversion circuit converts the format of the data, and the continuous conversion circuit converts discontinuous data into continuous data. 2. When the data is continuous, only the format of the data needs to be converted; in this implementation, the conversion circuit is a format conversion circuit. 3. When the data is discontinuous but already in the required format, only continuity conversion is needed and no format conversion is required; in this case, the conversion circuit is a continuous conversion circuit.
In the first example, when the data is discontinuous and in the first format, the first pipeline module includes a format conversion circuit and a continuous conversion circuit. Referring to fig. 2, the AI chip includes a first pipeline module 201, an arithmetic circuit 202, and a second pipeline module 203 connected in sequence. Optionally, the AI chip may further include a first buffer 204 and a second buffer 205. One end of the first buffer 204 is connected to the memory 206, and the other end of the first buffer 204 is connected to the first pipeline module 201. One end of the second buffer 205 is connected to the second pipeline module 203, and the other end of the second buffer 205 is connected to the memory 206. The memory 206 may be a memory external to the AI chip or a memory internal to it. The memory may be a dynamic random access memory (DRAM); for example, the memory includes, but is not limited to, synchronous dynamic random access memory (SDRAM), double data rate SDRAM (DDR SDRAM), and the like.
Referring to fig. 3, the first pipeline module 201 includes a first format conversion circuit 2012 and a plurality of conversion format buffers connected to it, and a continuous conversion circuit 2011 and a plurality of continuous data buffers connected to it. The second pipeline module 203 includes a second format conversion circuit 2031 and a plurality of target format buffers connected to the second format conversion circuit 2031. The output format of the first format conversion circuit 2012 is determined by the storage format that the arithmetic circuit requires for its data.
Illustratively, the plurality of consecutive data buffers includes at least a first consecutive data buffer (designated "1-buffer _ 0") 20111 and a second consecutive data buffer (designated "1-buffer _ 1") 20112.
The plurality of translation format buffers includes at least a first translation format buffer (designated "2-buffer _ 0") 20121 and a second translation format buffer (designated "2-buffer _ 1") 20122.
The plurality of target format buffers includes at least a first target format buffer (denoted as "3-buffer_0") 20311 and a second target format buffer (denoted as "3-buffer_1") 20312.
In this example, the number of continuous data buffers, conversion format buffers, and target format buffers is 2 each, but in practical applications the number and size of the continuous data buffers, the conversion format buffers, and the target format buffers are not limited; the size and number of each kind of buffer may be determined according to the processing capability of the arithmetic circuit. In this example, the plurality of continuous data buffers constitute a ping-pong cache, the plurality of conversion format buffers constitute a ping-pong cache, and the plurality of target format buffers also constitute a ping-pong cache. Through the ping-pong cache mechanism, the data processing of the AI chip achieves fine-grained pipelining, i.e., there is no need for one circuit (such as the continuous conversion circuit) to finish processing all data before the next circuit (such as the format conversion circuit) starts. The continuous conversion circuit, the first format conversion circuit, the arithmetic circuit, and the second format conversion circuit remain in a parallel computation state most of the time, which greatly improves the processing efficiency of the AI chip.
The continuous conversion circuit 2011 is configured to read data from the first buffer, convert discontinuous data into continuous data, and alternately write continuous data into the plurality of continuous data buffers.
The AI chip can receive an instruction sent by an upper computer and perform parameter configuration according to the instruction. The configuration parameters are sent to the continuous conversion circuit 2011, the first format conversion circuit 2012, and the second format conversion circuit 2031. For example, a stride parameter of the continuous conversion circuit 2011 is configured, and the continuous conversion circuit 2011 converts discontinuous data into continuous data according to the stride. Referring to formula (3) above, if the discontinuous data is [0,3,6,1,4,7,2,5,8], the continuous conversion circuit 2011 needs to convert it into continuous data: it reads the data according to the stride parameter (here, stride 3), and the data as read out is [0,1,2,3,4,5,6,7,8].
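A minimal software model of this strided read-out, using stride 3 to match the 3x3 example of formula (3); the function and its signature are illustrative, not the circuit's actual interface:

```python
def make_continuous(data, stride):
    """Software model of the continuous conversion circuit: read the
    buffer at the configured stride, starting from each offset in turn,
    so that the elements come out in row-major (continuous) order."""
    out = []
    for start in range(stride):
        out.extend(data[start::stride])
    return out

# the discontinuous storage of formula (3) becomes the continuous formula (2)
assert make_continuous([0, 3, 6, 1, 4, 7, 2, 5, 8], stride=3) == list(range(9))
```

In hardware the same effect is achieved by address generation rather than by copying through a Python list, but the access pattern is the one shown.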
The first format conversion circuit 2012 is configured to read data of the first format from a continuous data buffer, convert the data of the first format into data of the second format, and alternately write the data of the second format into the plurality of conversion format buffers. For example, the first format conversion circuit 2012 converts data in the first format (e.g., 4D data) into data in the second format (e.g., 5D data) according to a format parameter (e.g., a 5D parameter); the arithmetic circuit stores data in the second format (e.g., the 5D format). The purpose of converting data of the first format into data of the second format is to adapt the data to the storage format of the arithmetic circuit, so that the arithmetic circuit can operate on it.
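As an illustration of what such a 4D-to-5D conversion might look like, the sketch below blocks the channel axis of an NCHW tensor into an (N, C1, H, W, C0) layout; the patent does not specify the concrete 4D/5D layouts, so this NC1HWC0-style layout, the function name, and the `c0` parameter are assumptions:

```python
import numpy as np

def nchw_to_5d(x, c0):
    """Hypothetical first-format (NCHW) to second-format conversion:
    split the C axis into C1 blocks of C0 elements and move C0 to the
    innermost axis, giving an (N, C1, H, W, C0) layout."""
    n, c, h, w = x.shape
    assert c % c0 == 0, "in practice the channel axis would be padded"
    return x.reshape(n, c // c0, c0, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 8 * 3 * 3).reshape(2, 8, 3, 3)   # 4D input
y = nchw_to_5d(x, c0=4)
assert y.shape == (2, 2, 3, 3, 4)                  # 5D output
# channel block 0 of y holds channels 0..3 of x
assert (y[:, 0].transpose(0, 3, 1, 2) == x[:, :4]).all()
```

The point of such blocking is that the innermost C0 axis matches the vector width of the processing units, which is why the arithmetic circuit stores data this way.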
The arithmetic circuit 202 is configured to, when one conversion format buffer of the plurality of conversion format buffers is full, read the data of the second format from the full conversion format buffer and process it to obtain result data in the second format.
The arithmetic circuit may include a plurality of processing units (PEs). Optionally, the processing unit may be a convolution calculation unit, and/or a vector calculation unit.
The second format conversion circuit 2031 is configured to receive the result data in the second format from the arithmetic circuit, convert it into result data in the first format, and alternately write the result data in the first format into the plurality of target format buffers. The purpose of converting the result data from the second format back into the first format is to adapt the format of the output data to the data format expected by the subsequent flow.
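Assuming, for illustration, that the second format is a blocked (N, C1, H, W, C0) channel layout (the patent does not name the concrete layouts), the 5D-to-4D direction performed by the second format conversion circuit can be sketched as:

```python
import numpy as np

def five_d_to_4d(y):
    """Hypothetical second-format to first-format conversion: merge the
    C1 and C0 axes of an (N, C1, H, W, C0) tensor back into one channel
    axis, yielding NCHW."""
    n, c1, h, w, c0 = y.shape
    return y.transpose(0, 1, 4, 2, 3).reshape(n, c1 * c0, h, w)

# round trip: block an NCHW tensor, then merge it back
x = np.arange(2 * 8 * 3 * 3).reshape(2, 8, 3, 3)
y = x.reshape(2, 2, 4, 3, 3).transpose(0, 1, 3, 4, 2)   # blocked form
assert (five_d_to_4d(y) == x).all()
```

The round trip shows that the pair of conversions is lossless, which is what lets the first and second format conversion circuits bracket the arithmetic circuit transparently.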
The second buffer is used for reading the result data in the first format from a target format buffer and outputting it. Data processing is performed with a tensor as the unit, and after conversion the data occupies more space than the original data, generally exceeding the capacity of a single buffer. In this application, the continuous conversion circuit is connected to a plurality of continuous data buffers, and as soon as part of the data has been processed (i.e., when the first continuous data buffer is full), the next circuit (the first format conversion circuit) can be started. Similarly, when one of the conversion format buffers is full, the arithmetic circuit can be started. In this way, the continuous conversion circuit, the first format conversion circuit, the arithmetic circuit, and the second format conversion circuit process data in parallel, so the amount of data that needs to be held in any buffer is not large. In this application, the first buffer and the second buffer are added: the first pipeline module reads data from the first buffer, and the second pipeline module outputs result data to the second buffer.
In order to distinguish the format conversion circuit in the first pipeline module (which may also be referred to as the front pipeline module) from the format conversion circuit in the second pipeline module (which may also be referred to as the rear pipeline module), the former is referred to as the "first format conversion circuit" and the latter as the "second format conversion circuit". The first format conversion circuit converts data in the first format (e.g., 4D) into data in the second format (e.g., 5D), and the second format conversion circuit converts data in the second format (e.g., 5D) into data in the first format (e.g., 4D). Likewise, to distinguish the buffers of the first format conversion circuit from the buffers of the second format conversion circuit, the buffers of the first format conversion circuit are referred to as "conversion format buffers" and the buffers of the second format conversion circuit as "target format buffers".
In this example, the AI chip includes a first pipeline module, an arithmetic circuit, and a second pipeline module. The first pipeline module includes a continuous conversion circuit, a plurality of continuous data buffers connected to the continuous conversion circuit, a first format conversion circuit, and a plurality of conversion format buffers connected to the first format conversion circuit. The second pipeline module includes a second format conversion circuit and a plurality of target format buffers connected to the second format conversion circuit. First, in the conventional method, for example in the adaptation process of an AI framework with a dynamic graph architecture (such as PyTorch), a large number of format conversion operators need to be inserted, and the software side needs to be aware of the storage format implemented by the hardware, which increases the difficulty and complexity of software development. In this application, data continuity conversion is realized by the continuous conversion circuit and data format conversion by the first and second format conversion circuits, so the number of operator calls on the software side is reduced and the overall performance of the system is improved. Secondly, on the hardware side, the ping-pong cache mechanism enables a fine-grained pipelined data processing mechanism: each circuit processes in parallel without waiting time, the total execution time is essentially the same as the time of the operation itself (such as the convolution computation), each circuit stays in a parallel computation state, and the data processing efficiency of the chip is improved.
Finally, the ping-pong cache structures in the first pipeline module and the second pipeline module only buffer part of the data, so the area of each buffer is small, and the pipeline mechanism can be implemented with little area overhead.
In this embodiment, referring to fig. 4, a data processing method is applied to the AI chip corresponding to fig. 2 and 3, and the flow of steps executed by each circuit may be as follows:
step 401, the continuous converting circuit converts discontinuous data into continuous data and alternately writes the continuous data into a plurality of continuous data buffers.
Specifically, please refer to fig. 5.
S10, the continuous conversion circuit reads data from the first buffer (cache _1) and writes continuous first format data into the first continuous data buffer (1-buffer _ 0). When the first continuous data buffer (1-buffer _0) is full, the first format conversion circuit is activated.
S11, the continuous converting circuit continuously writes the continuous first format data into the second continuous data buffer (1-buffer _ 1). Execution continues with S14.
Step 402, when one continuous data buffer of the plurality of continuous data buffers is full, the first format conversion circuit reads continuous data from the full continuous data buffer, the continuous data being data in the first format.
S12, the first format conversion circuit reads the first format data from the first continuous data buffer (1-buffer _0), converts the first format data into the second format data, and writes the second format data into the first conversion format buffer (2-buffer _ 0). Execution continues with S13.
It should be noted that there is no timing dependency between step S11 and step S12; they are executed synchronously. That is, the continuous conversion circuit and the first format conversion circuit process data at the same time: the continuous conversion circuit does not need to finish reading all data before format conversion begins. As soon as the continuous conversion circuit has read a part of the data, i.e., when 1-buffer_0 is first full, the first format conversion circuit can be started to process the data in 1-buffer_0 while the continuous conversion circuit continues to read data. Since the two circuits process data synchronously, the data processing efficiency of the AI chip is improved.
Step 403, the first format conversion circuit converts the data in the first format into the data in the second format; and the second format data is alternately written into the plurality of converted format buffers.
S13, when the second continuous data buffer (1-buffer _1) is full, the first format conversion circuit is started. The first format conversion circuit reads the first format data from the second continuous data buffer (1-buffer _1), converts the first format data into the second format data, and writes the second format data into the second conversion format buffer (2-buffer _ 1).
S14, the continuous converting circuit continuously writes the continuous first format data into the first continuous data buffer (1-buffer _ 0).
It should be noted that the above-mentioned steps S13 and S14 have no timing context, and the steps S13 and S14 are executed synchronously, that is, the sequential conversion circuit and the first format conversion circuit can perform data processing synchronously.
The above-described S11-S14 are repeatedly performed. And the continuous conversion circuit alternately writes the read continuous first format data into 1-buffer _0 and 1-buffer _1, and the first format conversion circuit alternately reads the continuous first format data from 1-buffer _0 and 1-buffer _1, so that a fine-grained pipeline mechanism is realized.
S15, when the first conversion format buffer (2-buffer _0) is full, the arithmetic circuit is started. The arithmetic circuit reads the data of the second format from the first conversion format buffer (2-buffer _0), and then, the arithmetic circuit performs an arithmetic operation (e.g., a convolution operation) on the data of the second format to obtain result data (the result data is the data of the second format). Execution continues with S17.
S16, the first format conversion circuit continues writing the data of the second format into the second conversion format buffer (2-buffer _ 1). Execution continues with S18.
It should be noted that there is no timing dependency between step S15 and step S16; they are executed synchronously. That is, the first format conversion circuit and the arithmetic circuit process data at the same time: it is not necessary for the first format conversion circuit to convert all data before the arithmetic operation begins. As soon as the first format conversion circuit has converted a part of the data, i.e., when the first conversion format buffer (2-buffer_0) is full, the arithmetic circuit can be started to process the data in 2-buffer_0 while the first format conversion circuit continues to convert data. Since the two circuits process data synchronously, the data processing efficiency of the AI chip is improved.
Step 404, the arithmetic circuit reads the data in the second format from the conversion format buffer and processes it to obtain result data in the second format.
S17, when the second conversion format buffer (2-buffer_1) is full, the arithmetic circuit is started; the arithmetic circuit reads the data in the second format from the second conversion format buffer (2-buffer_1) and then performs an operation (e.g., a convolution operation) on it to obtain result data (the result data is in the second format). Execution continues with S20.
S18, the first format conversion circuit continues writing the data of the second format into the first conversion format buffer (2-buffer _ 0).
It should be noted that steps S17 and S18 are executed synchronously, and their order is not limited. In this example, the first pipeline module includes a plurality of conversion format buffers, and reading and writing proceed alternately across them; it is not necessary for the first format conversion circuit to convert all data and then output the data in the second format to the arithmetic circuit for the convolution operation. When the first conversion format buffer is full, the arithmetic circuit can be started: it reads the data in the second format from that buffer and performs the convolution operation on it, while the first format conversion circuit continues to convert data in parallel. That is, the first format conversion circuit and the arithmetic circuit operate on data synchronously, realizing fine-grained pipeline operation, saving data processing time of the AI chip, and improving data processing efficiency.
The above steps S15-S18 are repeated until the data processing is completed.
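The overlap of S11 through S18 can be modeled in software with one thread per circuit and bounded queues standing in for the pairs of ping-pong buffers; this is a behavioral sketch rather than the hardware design, and the chunk size and the two stage functions (standing in for continuity conversion and for format conversion plus operation) are placeholders:

```python
import threading
import queue

CHUNK = 3  # assumed ping-pong buffer capacity, in elements

def stage(fn, inq, outq):
    """Run one pipeline circuit as a thread: take a full 'buffer'
    (chunk) from upstream, process it, and hand it downstream."""
    def run():
        while True:
            chunk = inq.get()
            if chunk is None:          # end-of-data marker
                outq.put(None)
                return
            outq.put(fn(chunk))
    t = threading.Thread(target=run)
    t.start()
    return t

# Queues of depth 2 model the pairs of ping-pong buffers between circuits.
q01, q12, q23 = (queue.Queue(maxsize=2) for _ in range(3))

threads = [
    stage(lambda c: sorted(c), q01, q12),           # continuity conversion (model)
    stage(lambda c: [x * 10 for x in c], q12, q23)  # format conversion + operation (model)
]

data = [2, 0, 1, 5, 3, 4]
for i in range(0, len(data), CHUNK):
    q01.put(data[i:i + CHUNK])       # alternate writes, one chunk per buffer
q01.put(None)

results = []
while (chunk := q23.get()) is not None:
    results.extend(chunk)
for t in threads:
    t.join()
# chunks flow through both stages while later chunks are still being filled
```

Because each queue holds at most two chunks, a downstream stage starts as soon as one "buffer" is full while the upstream stage keeps filling the other, which is exactly the fine-grained overlap the steps above describe.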
Next, a description is given of a process of data processing by the second format conversion circuit in the second pipeline module:
in step 405, the second format conversion circuit converts the result data in the second format into the result data in the first format, and alternately writes the result data in the first format into the plurality of target format buffers.
Specifically, please refer to fig. 6.
S20, the arithmetic circuit outputs the result data of the second format to the second format conversion circuit, and activates the second format conversion circuit.
S21, the second format conversion circuit converts the result data of the second format to obtain the result data of the first format, and writes the result data of the first format into the first target format buffer (3-buffer _ 0).
S23, when the first target format buffer (3-buffer_0) is full, the second buffer (cache_2) is started, and the second buffer reads the result data in the first format from the first target format buffer (3-buffer_0). Execution continues with S26.
S24, the second format conversion circuit writes the result data of the first format into the second target format buffer (3-buffer _ 1). Execution continues with S25.
Step S23 is executed in synchronization with step S24.
S25, when the second target format buffer (3-buffer _1) is full, the second format conversion circuit writes the result data of the first format into the first target format buffer (3-buffer _ 0).
S26, the second buffer reads the result data in the first format from the second target format buffer (3-buffer_1).
Step S25 is executed in synchronization with step S26.
S21-S26 are repeatedly executed until the data processing is completed.
In this example, first, data continuity conversion is realized by the continuous conversion circuit and data format conversion by the first and second format conversion circuits, so the number of operator calls is reduced on the software side and the overall performance of the system is improved. Secondly, on the hardware side, the ping-pong cache mechanism enables a fine-grained pipelined data processing mechanism: no circuit needs waiting time, the total execution time is essentially the same as the time of the operation itself (such as the convolution computation), each circuit stays in a parallel computation state, and the data processing efficiency of the chip is improved. Finally, the ping-pong cache structures in the first pipeline module and the second pipeline module only buffer part of the data, so the area of each buffer is small, and the pipeline mechanism can be implemented with little area overhead.
In the second example, when the data acquired by the AI chip is continuous data, only the data format needs to be converted; in this implementation, the conversion circuit is a format conversion circuit for converting the data format. The difference between this example and the first implementation is that the first pipeline module does not include a continuous conversion circuit, and the conversion circuit is the first format conversion circuit. Referring to fig. 2 and 7, the first pipeline module 701 includes a first format conversion circuit 7012 and a plurality of conversion format buffers connected to it. The plurality of conversion format buffers used to implement the ping-pong cache mechanism includes at least a first conversion format buffer (2-buffer_0) 70121 and a second conversion format buffer (2-buffer_1) 70122. The second pipeline module 703 includes a second format conversion circuit 7031 and a plurality of target format buffers. The plurality of target format buffers includes at least a first target format buffer (3-buffer_0) 70311 and a second target format buffer (3-buffer_1) 70312.
The first format conversion circuit 7012 is configured to read data in a first format from a first buffer (cache _1), convert the data in the first format into data in a second format, and write the data in the second format into a plurality of conversion format buffers alternately.
The arithmetic circuit 702 is configured to, when one of the plurality of conversion format buffers is full, read the data in the second format from the full conversion format buffer and process it to obtain result data in the second format.
The second format conversion circuit 7031 is configured to receive the result data in the second format from the arithmetic circuit 702, convert it into result data in the first format, and alternately write the result data in the first format into the plurality of target format buffers. The purpose of converting the result data from the second format into the first format is to adapt the format of the output data to the data format expected by the subsequent flow.
And the second buffer is used for reading the result data in the first format from the fully written target format buffer and outputting the result data in the first format.
Referring to fig. 8, the data processing method in this example includes the following steps:
step 801, a first format conversion circuit converts data in a first format into data in a second format; and alternately writing the second format data into the plurality of converted format buffers.
Specifically, please refer to fig. 9.
S31, the first format conversion circuit reads the data of the first format from the first buffer (cache _1), converts the data of the first format into the data of the second format, and writes the data of the second format into the first conversion format buffer (2-buffer _ 0).
S32, when the first conversion format buffer (2-buffer _0) is full, the arithmetic circuit is started. The arithmetic circuit reads the data of the second format from the first conversion format buffer (2-buffer _0), and then, the arithmetic circuit performs an arithmetic operation (e.g., a convolution operation) on the data of the second format to obtain result data (the result data is the data of the second format). Execution continues with S34.
S33, the first format conversion circuit continues writing the data of the second format into the second conversion format buffer (2-buffer _ 1). Execution continues with S35.
Steps S32 and S33 are executed synchronously; please refer to the related description of steps S15 and S16 in the first example, which is not repeated here.
Step 802, the arithmetic circuit reads the second format data from the transform format buffer, and processes the second format data to obtain the result data of the second format.
S34, when the second format conversion buffer (2-buffer _1) is full, the arithmetic circuit is started, the arithmetic circuit reads the data in the second format from the second format conversion buffer (2-buffer _1), and then the arithmetic circuit performs an arithmetic operation (e.g., a convolution operation) on the data in the second format to obtain result data (the result data is the data in the second format).
S35, the first format conversion circuit continues writing the data of the second format into the first conversion format buffer (2-buffer _ 0).
S34 and S35 are executed synchronously, please refer to the related description of step S17 and step S18 in example 1, which is not described herein again.
Step 803, the second format conversion circuit converts the result data of the second format into the result data of the first format, and alternately writes the result data of the first format into the plurality of target format buffers.
Please refer to the description related to steps S20-S26 in example 1, which is not repeated here.
In the third example, when the data acquired by the AI chip is discontinuous data whose format already meets the requirement of the arithmetic circuit, only the continuity of the data needs to be converted, without converting its format. The differences between this example and the first implementation are that the first pipeline module does not include a first format conversion circuit, the conversion circuit is a continuous conversion circuit, and the AI chip does not include a second pipeline module.
Referring to fig. 10, the AI chip includes a first buffer 901, a first pipeline module 902, an arithmetic circuit 903 and a second buffer 904 connected in sequence. The first pipeline module 902 includes a sequential conversion circuit 9021 and a plurality of sequential data buffers, for example, the plurality of sequential data buffers includes at least a first sequential data buffer (1-buffer _0)90211 and a second sequential data buffer (1-buffer _1) 90212.
The continuous conversion circuit 9021 is configured to read data from the first buffer 901, convert discontinuous data into continuous data, and alternately write continuous data into a plurality of continuous data buffers.
An arithmetic circuit 903, configured to, when one of the multiple continuous data buffers is full, read continuous data from the full continuous data buffer, perform an arithmetic operation on the continuous data to obtain result data, and output the result data to the second buffer 904.
Referring to fig. 11, a description will be given of a data processing procedure in this example:
step 1101, the continuous converting circuit converts discontinuous data into continuous data and alternately writes the continuous data into a plurality of continuous data buffers.
Specifically, please refer to fig. 12.
S40: the continuous conversion circuit reads data from the first buffer (cache_1) and writes continuous data into the first continuous data buffer (1-buffer_0). When the first continuous data buffer (1-buffer_0) is full, the arithmetic circuit is started.
S41: the continuous conversion circuit then writes continuous data into the second continuous data buffer (1-buffer_1). Execution continues with S42.
S42: when the second continuous data buffer (1-buffer_1) is full, the continuous conversion circuit writes continuous data into the first continuous data buffer (1-buffer_0) again.
S41 and S42 are repeatedly performed.
Step 1102: when one of the plurality of continuous data buffers is full, the arithmetic circuit reads the continuous data from the full buffer, performs an operation on the continuous data to obtain result data, and outputs the result data to the second buffer (cache_2).
S43: when the first continuous data buffer (1-buffer_0) is full, the arithmetic circuit reads the continuous data from it, performs an operation (such as a convolution operation) to obtain result data, and outputs the result data to the second buffer.
S41 and S43 are executed in parallel.
S44: when the second continuous data buffer (1-buffer_1) is full, the arithmetic circuit reads the continuous data from it, performs an operation (such as a convolution operation) to obtain result data, and outputs the result data to the second buffer.
S42 and S44 are executed in parallel. S43 and S44 are performed repeatedly.
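The alternation of S41 through S44 can be sketched in software as a minimal, single-threaded model of ping-pong buffering (in the actual hardware the converter fills one buffer while the arithmetic circuit drains the other in parallel; here the two sides simply alternate, and the summing "operation" stands in for the convolution):

```python
def pingpong_pipeline(chunks, compute):
    """Model S40-S44: incoming continuous-data chunks are written
    alternately into two buffers (1-buffer_0 and 1-buffer_1); whenever
    one buffer is full, `compute` consumes it while the converter moves
    on to fill the other buffer."""
    buffers = [None, None]  # buffers[0] = 1-buffer_0, buffers[1] = 1-buffer_1
    results = []
    for i, chunk in enumerate(chunks):
        idx = i % 2                            # alternate between the two buffers
        buffers[idx] = chunk                   # converter fills buffer `idx`
        results.append(compute(buffers[idx]))  # arithmetic circuit drains it
    return results

# Hypothetical operation: sum each chunk (a stand-in for convolution).
out = pingpong_pipeline([[1, 2], [3, 4], [5, 6]], sum)
# out == [3, 7, 11]
```

The design point the ping-pong scheme captures is that neither side ever waits for an empty or full buffer: each iteration, the buffer just filled is consumed while its twin is being refilled.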
In this example, because the data acquired by the AI chip is discontinuous data whose format already meets the requirements of the arithmetic circuit, only the continuity of the data needs to be converted and the format does not, which shortens the data processing flow. Moreover, since continuity conversion is performed by the continuous conversion circuit and format conversion by the first and second format conversion circuits, the number of operator invocations on the software side is reduced, improving overall system performance. On the hardware side, a ping-pong buffering mechanism implements a fine-grained pipeline: no circuit waits idle, the total execution time is essentially the same as the time of the operation alone (such as the convolution calculation), and all circuits remain in a parallel computing state, improving the data processing efficiency of the chip.
Referring to fig. 13, an artificial intelligence operation board 1300 includes a communication interface 1301 and the artificial intelligence chip 1302 of any of the above examples. The communication interface may be a Peripheral Component Interconnect Express (PCIE) interface used to connect to a host.
Referring to fig. 14, an embodiment of the present application further provides an electronic device 1400, which may be a server or a terminal device.
The electronic device is described here taking a server as an example. Servers may vary considerably in configuration and performance, and may include one or more central processing units (CPUs) 1422 (e.g., one or more processors), memory 1432, one or more storage media 1430 (e.g., one or more mass storage devices) storing applications 1442 or data 1444, and an artificial intelligence operation board 1460. The memory 1432 and the storage medium 1430 may be transient or persistent storage. The program stored on the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processor 1422 may communicate with the storage medium 1430 to execute, on the server, the series of instruction operations in the storage medium 1430.
In this application, the artificial intelligence operation board 1460 reads data to be converted from the memory, or outputs result data after operation to the memory. The AI chip is mounted as a coprocessor to a host CPU (also referred to as the main CPU); the CPU may allocate data processing tasks to the artificial intelligence chip and may send configuration parameters for continuity conversion and/or configuration parameters for format conversion to the AI chip. The continuous conversion circuit and/or the format conversion circuits (e.g., the first format conversion circuit and the second format conversion circuit) in the AI chip may convert the data according to these configuration parameters.
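As a purely hypothetical sketch of what such host-to-chip configuration parameters might carry (the field names, values, and the "NC1HWC0" layout string below are illustrative assumptions, not taken from the disclosure):

```python
from dataclasses import dataclass

@dataclass
class ContinuityConfig:
    # Parameters the host CPU might send so the continuous conversion
    # circuit can gather strided data (names are illustrative).
    src_addr: int   # base address of the discontinuous data
    stride: int     # byte distance between consecutive elements
    count: int      # number of elements to gather

@dataclass
class FormatConfig:
    # Parameters for the format conversion circuits, e.g. converting a
    # framework layout into a chip-native layout (both strings assumed).
    src_format: str
    dst_format: str

# The CPU could send one of each, depending on which conversions apply.
cfg = (ContinuityConfig(src_addr=0x1000, stride=64, count=128),
       FormatConfig(src_format="NCHW", dst_format="NC1HWC0"))
```

In this sketch the chip would receive only conversions whose configuration is present, matching the description's "continuity conversion and/or format conversion".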
The server may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441 such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
The processor mentioned in any of the above may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs of the methods described above.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (12)

1. An artificial intelligence chip, comprising: a first pipeline module and an arithmetic circuit connected in sequence, wherein the first pipeline module comprises a conversion circuit and a plurality of buffers connected to the conversion circuit;
the conversion circuit is configured to acquire data of a first characteristic to be converted, convert the data of the first characteristic into data of a second characteristic, wherein the data of the second characteristic is suitable for operation by the arithmetic circuit, and alternately write the data of the second characteristic into the plurality of buffers;
and the arithmetic circuit is configured to, when one of the plurality of buffers is full, read the data of the second characteristic from the full buffer, perform an operation on the data of the second characteristic to obtain result data, and output the result data.
2. The chip of claim 1, wherein the conversion circuit is a first format conversion circuit, and the buffer is a conversion format buffer; the chip also comprises a second pipeline module connected with the arithmetic circuit, wherein the second pipeline module comprises a second format conversion circuit and a plurality of target format buffers connected with the second format conversion circuit;
the first format conversion circuit is used for converting the data in the first format into the data in the second format and alternately writing the data in the second format into the plurality of conversion format buffers;
the arithmetic circuit is used for reading second format data from the conversion format buffer when one of the conversion format buffers is full, and processing the second format data to obtain result data of a second format;
the second format conversion circuit is used for converting the result data of the second format into the result data of the first format and alternately writing the result data of the first format into the target format buffers.
3. The chip of claim 2, wherein the first pipeline module further comprises a continuous conversion circuit and a plurality of continuous data buffers coupled to the continuous conversion circuit;
the continuous conversion circuit is used for converting discontinuous data into continuous data and alternately writing the continuous data into the plurality of continuous data buffers;
the first format conversion circuit is further configured to, when one of the plurality of continuous data buffers is full, read continuous data from the full continuous data buffer, where the continuous data is data in a first format, and convert the data in the first format into data in a second format.
4. The chip of claim 1, wherein the conversion circuit is a continuous conversion circuit and the buffer is a continuous data buffer;
the continuous conversion circuit is used for converting discontinuous data into continuous data and alternately writing the continuous data into the plurality of continuous data buffers;
the operation circuit is configured to, when one of the plurality of continuous data buffers is full, read continuous data from the full continuous data buffer, perform an operation on the continuous data to obtain result data, and output the result data.
5. The chip according to any one of claims 1 to 4, wherein the arithmetic circuitry comprises a convolution calculation unit and/or a vector calculation unit.
6. The chip of any of claims 1-4, wherein the chip further comprises a first buffer and a second buffer;
the first buffer is used for outputting data to be converted to the first pipeline module;
the second buffer is used for buffering the result data.
7. A data processing method, applied to an artificial intelligence chip, wherein the chip comprises a first pipeline module and an arithmetic circuit connected in sequence, the first pipeline module comprises a conversion circuit and a plurality of buffers connected to the conversion circuit, and the method comprises:
the conversion circuit acquires data of a first characteristic to be converted, and converts the data of the first characteristic into data of a second characteristic, wherein the data of the second characteristic is applicable to the operation of the operation circuit; and writing the second characteristic data alternately into the plurality of buffers;
when one of the plurality of buffers is full, the arithmetic circuit reads the data of the second characteristic from the full buffer and carries out arithmetic operation on the data of the second characteristic to obtain result data; and outputting the result data.
8. The method of claim 7, wherein the conversion circuit is a first format conversion circuit, and the buffer is a conversion format buffer; the chip also comprises a second pipeline module connected with the arithmetic circuit, wherein the second pipeline module comprises a second format conversion circuit and a plurality of target format buffers connected with the second format conversion circuit;
the conversion circuit acquires data of a first characteristic to be converted, converts the data of the first characteristic into data of a second characteristic, and alternately writes the data of the second characteristic into the plurality of buffers, and comprises:
the first format conversion circuit converts the data in the first format into the data in the second format; and the second format data is written into the plurality of conversion format buffers alternately;
when one of the plurality of buffers is full, the operation circuit reads the data of the second characteristic from the full buffer, and performs an operation on the data of the second characteristic to obtain result data, including:
when one of the plurality of conversion format buffers is full, the arithmetic circuit reads second format data from the full conversion format buffer, and processes the second format data to obtain result data of a second format;
the method further comprises the following steps:
the second format conversion circuit converts the result data of the second format into result data of a first format and alternately writes the result data of the first format into the plurality of target format buffers.
9. The method of claim 8, wherein the first pipeline module further comprises a continuous conversion circuit and a plurality of continuous data buffers coupled to the continuous conversion circuit;
the method further comprises the following steps:
the continuous conversion circuit converts discontinuous data into continuous data and alternately writes the continuous data into the plurality of continuous data buffers;
when one of the plurality of continuous data buffers is full, the first format conversion circuit reads continuous data from the full continuous data buffer, and the continuous data is data in the first format.
10. The method of claim 7, wherein the conversion circuit is a continuous conversion circuit, and the buffer is a continuous data buffer;
the conversion circuit acquires data of a first characteristic to be converted, converts the data of the first characteristic into data of a second characteristic, and alternately writes the data of the second characteristic into the plurality of buffers, and comprises:
the continuous conversion circuit converts discontinuous data into continuous data and alternately writes the continuous data into the plurality of continuous data buffers;
when one of the plurality of buffers is full, the operation circuit reads the data of the second characteristic from the full buffer, and performs an operation on the data of the second characteristic to obtain result data, including:
when one of the plurality of continuous data buffers is full, the arithmetic circuit reads continuous data from the full continuous data buffer, performs arithmetic on the continuous data to obtain result data, and outputs the result data.
11. An artificial intelligence operation board card, comprising a communication interface and the artificial intelligence chip according to any one of claims 1 to 6, wherein the communication interface is configured to connect to a host.
12. An electronic device, comprising: a processor, a memory coupled to the processor, and the artificial intelligence operation board of claim 11, the processor and the memory communicating data with the artificial intelligence operation board through a communication interface.
CN202010877800.7A 2020-08-27 2020-08-27 Artificial intelligence chip, operation board card, data processing method and electronic equipment Pending CN114115995A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010877800.7A CN114115995A (en) 2020-08-27 2020-08-27 Artificial intelligence chip, operation board card, data processing method and electronic equipment


Publications (1)

Publication Number Publication Date
CN114115995A 2022-03-01

Family

ID=80374424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010877800.7A Pending CN114115995A (en) 2020-08-27 2020-08-27 Artificial intelligence chip, operation board card, data processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN114115995A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5594916A (en) * 1990-01-24 1997-01-14 Hitachi, Ltd. Neural network processing system using semiconductor memories and processing paired data in parallel
US20190065208A1 (en) * 2017-08-31 2019-02-28 Cambricon Technologies Corporation Limited Processing device and related products
US20190073584A1 (en) * 2016-04-15 2019-03-07 Cambricon Technologies Corporation Limited Apparatus and methods for forward propagation in neural networks supporting discrete data
US20190087708A1 (en) * 2017-09-21 2019-03-21 Raytheon Company Neural network processor with direct memory access and hardware acceleration circuits
CN109858622A (en) * 2019-01-31 2019-06-07 福州瑞芯微电子股份有限公司 The data of deep learning neural network carry circuit and method
CN109960673A (en) * 2017-12-14 2019-07-02 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product
CN110968285A (en) * 2018-09-28 2020-04-07 上海寒武纪信息科技有限公司 Signal processing device and related product
CN116724316A (en) * 2020-12-31 2023-09-08 华为技术有限公司 Model processing method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination