WO2021027973A1 - 数据同步方法及装置以及相关产品 - Google Patents

数据同步方法及装置以及相关产品 Download PDF

Info

Publication number
WO2021027973A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
tensor
descriptor
synchronized
synchronization
Prior art date
Application number
PCT/CN2020/111291
Other languages
English (en)
French (fr)
Inventor
曾洪博
王秉睿
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司 filed Critical 中科寒武纪科技股份有限公司
Publication of WO2021027973A1 publication Critical patent/WO2021027973A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17325Synchronisation; Hardware support therefor

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a data synchronization method and device and related products.
  • the present disclosure proposes a data synchronization technical solution.
  • a data synchronization method, applied to a first processor, including: determining the data characteristics of the tensor data to be synchronized according to the descriptor of the tensor data, the descriptor being used to indicate the shape of the tensor data to be synchronized; and generating a state query instruction according to the data characteristics of the tensor data and sending it to the second processor.
  • the state query instruction is used to instruct the second processor to determine the synchronizable data amount for the tensor data and to generate a synchronization state instruction.
  • a data synchronization method, applied to a second processor, including: upon receiving a state query instruction from a first processor, parsing the state query instruction to obtain the data characteristics of the tensor data to be synchronized; determining, according to the data characteristics, the descriptor of the tensor data to be synchronized, the descriptor being used to indicate the shape of the tensor data to be synchronized; determining, according to the descriptor of the tensor data, the synchronizable data amount for the tensor data; and generating a synchronization state instruction according to the data characteristics of the tensor data and the synchronizable data amount, and sending the synchronization state instruction to the first processor.
  • the synchronization state instruction is used to instruct the first processor to determine the first sub-data of the tensor data, where the data amount of the first sub-data corresponds to the synchronizable data amount.
  • a data synchronization device, applied to a first processor, including: a feature determination module, configured to determine the data characteristics of the tensor data to be synchronized according to the descriptor of the tensor data, the descriptor being used to indicate the shape of the tensor data to be synchronized; and a query instruction generation and sending module, configured to generate a state query instruction according to the data characteristics of the tensor data and send it to the second processor, the state query instruction being used to instruct the second processor to determine the synchronizable data amount for the tensor data and generate a synchronization state instruction.
  • a data synchronization device, applied to a second processor, including: a query instruction parsing module, configured to parse the state query instruction from the first processor to obtain the data characteristics of the tensor data to be synchronized;
  • a second descriptor determination module, configured to determine the descriptor of the tensor data to be synchronized according to the data characteristics, the descriptor being used to indicate the shape of the tensor data to be synchronized;
  • a data amount determination module, configured to determine the synchronizable data amount for the tensor data according to the descriptor of the tensor data;
  • a state instruction generation and sending module, configured to generate a synchronization state instruction according to the data characteristics of the tensor data and the synchronizable data amount, and send the synchronization state instruction to the first processor, the synchronization state instruction being used to instruct the first processor to determine the first sub-data of the tensor data, where the data amount of the first sub-data corresponds to the synchronizable data amount.
  • an artificial intelligence chip including the data synchronization device as described above.
  • an electronic device including the artificial intelligence chip as described above.
  • a board card, comprising: a storage device, an interface device, a control device, and the artificial intelligence chip as described above; wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to realize data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
  • the sender can determine the data characteristics of the tensor data according to the descriptor, and generate and send a state query instruction according to the data characteristics to instruct the receiver to feed back the synchronizable data amount according to the state query instruction, thereby realizing partial synchronization of tensor data, reducing synchronization overhead without changing the instruction structure, and improving the efficiency of data synchronization.
  • Fig. 1 shows a schematic diagram of a processing system of a data synchronization method according to an embodiment of the present disclosure.
  • Fig. 2 shows a flowchart of a data synchronization method according to an embodiment of the present disclosure.
  • Fig. 3 shows a flowchart of a data synchronization method according to an embodiment of the present disclosure.
  • Fig. 4 shows a schematic diagram of data storage space of a data synchronization method according to an embodiment of the present disclosure.
  • Fig. 5 shows a block diagram of a data synchronization device according to an embodiment of the present disclosure.
  • Fig. 6 shows a block diagram of a data synchronization device according to an embodiment of the present disclosure.
  • Fig. 7 shows a structural block diagram of a board according to an embodiment of the present disclosure.
  • the term "if" can be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
  • similarly, the phrase "if determined" or "if [the described condition or event] is detected" can be interpreted as "once determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]", depending on the context.
  • the data synchronization method according to the embodiment of the present disclosure can be applied to any processor of a processing system (for example, an artificial intelligence chip) including multiple processors (multi-core).
  • the processor may be a general-purpose processor, such as a CPU (Central Processing Unit, central processing unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations.
  • Artificial intelligence operations may include machine learning operations, brain-like operations, etc. Among them, machine learning operations include neural network operations, k-means operations, and support vector machine operations.
  • the artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processing unit), and an FPGA (Field-Programmable Gate Array) chip.
  • the processor mentioned in the present disclosure may include multiple processing units, and each processing unit can independently run the various tasks assigned to it, such as convolution operation tasks, pooling tasks, or fully connected tasks.
  • the present disclosure does not limit the processing unit and the tasks executed by the processing unit.
  • Fig. 1 shows a schematic diagram of a processing system of a data synchronization method according to an embodiment of the present disclosure.
  • the processing system 100 includes multiple processors 101 and a memory 102.
  • the multiple processors 101 are used to execute instruction sequences.
  • the memory 102 is used to store data, and may include random access memory (RAM) and a register file.
  • the multiple processors 101 in the processing system 100 can not only share part of the storage space, for example, share part of the RAM storage space and the register file, but also have their own storage space at the same time.
  • Fig. 2 shows a flowchart of a data synchronization method according to an embodiment of the present disclosure. As shown in Figure 2, the method is applied to the first processor (any processor in the processing system), and the method includes:
  • step S11: determining the data characteristics of the tensor data according to the descriptor of the tensor data to be synchronized, the descriptor being used to indicate the shape of the tensor data to be synchronized;
  • step S12: generating a state query instruction according to the data characteristics of the tensor data and sending the state query instruction to the second processor, the state query instruction being used to instruct the second processor to determine the synchronizable data amount for the tensor data and generate a synchronization state instruction.
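  • The two steps above can be sketched as follows. This is an illustrative Python sketch, not the disclosed implementation: the names TensorDescriptor and StateQueryInstruction, and the specific fields, are assumptions for the example only.

```python
# Hypothetical sketch of steps S11-S12 on the first processor (sender):
# derive data characteristics from a tensor descriptor, then package them
# into a state query instruction for the second processor.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TensorDescriptor:
    ident: int                 # descriptor identifier (e.g. a number)
    shape: Tuple[int, ...]     # size in each dimension of the tensor
    base_address: int          # reference address of the data reference point

@dataclass
class StateQueryInstruction:
    tensor_id: int
    shape: Tuple[int, ...]
    address: int

def make_state_query(desc: TensorDescriptor) -> StateQueryInstruction:
    # S11: determine the data characteristics from the descriptor.
    # S12: build the state query instruction carrying those characteristics.
    return StateQueryInstruction(desc.ident, desc.shape, desc.base_address)

query = make_state_query(TensorDescriptor(ident=7, shape=(20, 10), base_address=0x1000))
```

  • In practice the instruction may carry only part of the data characteristics (for example, just the identifier), as discussed later; the full set is shown here only to make the mapping from descriptor to instruction explicit.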
  • Tensors can be of different dimensions.
  • a scalar can be regarded as a 0-dimensional tensor
  • a vector can be regarded as a 1-dimensional tensor
  • a matrix can be regarded as a tensor of 2 or more dimensions.
  • the shape of a tensor includes information such as the number of dimensions of the tensor and the size of each dimension. For example, for a tensor:
  • the shape can be described by the descriptor as (2, 4), that is, two parameters indicating that the tensor is a two-dimensional tensor whose first dimension (columns) has size 2 and whose second dimension (rows) has size 4. It should be noted that the present disclosure does not limit the manner in which the descriptor indicates the shape of the tensor. When tensor data is stored in memory, its shape cannot be determined from its data address (or storage area) alone, nor can the relationship between multiple pieces of tensor data be determined; as a result, access efficiency is low and the complexity of data synchronization is greater.
  • a descriptor (tensor descriptor) can be set to indicate the shape of tensor data (N-dimensional tensor data).
  • the value of N can be determined according to the dimensionality (order) of the tensor data, or can be set according to the needs of the tensor data.
  • the tensor data is three-dimensional tensor data
  • the descriptor can be used to indicate the shape of the three-dimensional tensor data in three dimensions (such as offset, size, etc.). It should be understood that those skilled in the art can set the value of N according to actual needs, which is not limited in the present disclosure.
  • the descriptor may include an identifier, content, and the like.
  • the identifier of the descriptor may be used to distinguish descriptors, and may be, for example, a number; the content of the descriptor may include at least one shape parameter representing the shape of the tensor data (for example, the size in each dimension of the tensor), and may also include at least one address parameter representing the address of the tensor data (for example, the reference address of the data reference point).
  • the present disclosure does not limit the specific parameters included in the content of the descriptor.
  • by using the descriptor to indicate tensor data, the shape of the tensor data can be expressed, and related information such as the relationship between multiple pieces of tensor data can be determined, which improves the efficiency of access to tensor data and thereby reduces the complexity of data synchronization.
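  • As a minimal illustration of how a descriptor's shape and address parameters determine the tensor's storage footprint, consider the following sketch (the helper names and the element size are assumptions, not part of the disclosure):

```python
# Illustrative sketch: from a descriptor's shape parameters and base address,
# the element count and occupied address interval of the tensor can be derived.
from functools import reduce
from operator import mul

def element_count(shape):
    """Total number of elements implied by the shape parameters."""
    return reduce(mul, shape, 1)

def address_range(base_address, shape, element_size):
    """Address interval [start, end) occupied by the tensor data."""
    return base_address, base_address + element_count(shape) * element_size

# The (2, 4) two-dimensional tensor from the example above, with assumed
# 4-byte elements stored from base address 0x0:
start, end = address_range(0x0, (2, 4), 4)
```

  • This is exactly the information that a raw data address alone cannot convey, which is why the descriptor is needed to relate addresses to shapes.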
  • data synchronization between multiple processors may be required; for example, the calculation result of processor A1 may be synchronized to processor A2 to be used as input data for another operation.
  • a descriptor-based data synchronization mechanism can be used to achieve data synchronization.
  • the non-shared storage space of each processor that can be allocated to the tensor data to be synchronized may be limited, so that overall synchronization of the tensor data cannot be achieved at once.
  • partial synchronization of tensor data can be performed, and the entire tensor data synchronization process can be realized through multiple partial synchronizations.
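  • The idea of completing a full synchronization through multiple partial synchronizations can be sketched as follows; the per-round capacity and sizes are invented for illustration only:

```python
# Minimal sketch: a tensor larger than the receiver's allocatable space is
# synchronized over several rounds, each bounded by that space.

def partial_sync_rounds(total_size, receiver_capacity_per_round):
    """Return the per-round synchronized data amounts until the tensor is done."""
    rounds = []
    remaining = total_size
    while remaining > 0:
        # Each round transfers at most what the receiver can currently hold.
        amount = min(remaining, receiver_capacity_per_round)
        rounds.append(amount)
        remaining -= amount
    return rounds

# A 200-element tensor synchronized through a 64-element buffer:
rounds = partial_sync_rounds(200, 64)
```

  • The rest of the method describes the instruction exchange that drives each such round.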
  • when the sender of data synchronization has tensor data to be synchronized, for example when an operation is completed and its result (that is, tensor data) is obtained, the sender can query the state of the receiver to determine the amount of data that the receiver's unshared storage space can currently allocate to the tensor data, so as to perform partial synchronization of the tensor data.
  • the first processor of the multiple processors is the sender of data synchronization
  • the second processor is the receiver of data synchronization.
  • the first processor and the second processor may each be any of the multiple processors.
  • the second processor may be of the same type as the first processor or of a different type, which is not restricted in the present disclosure.
  • the descriptor of the tensor data may be acquired.
  • the descriptor may be a registered (created) descriptor used to indicate the shape of the tensor data, or a new descriptor may be registered (created) according to the shape parameter of the tensor data, which is not limited in the present disclosure.
  • the first processor may determine the data characteristic of the tensor data.
  • the data feature may include at least one of the identification (for example, data number), shape, source, and storage address of the tensor data.
  • the data characteristics of the tensor data to be synchronized may include information such as the identification, shape, source, and address of the tensor data.
  • the data source of the tensor data is the Kth sender (the Kth processor)
  • the data source of the tensor data is the result of the convolution operation numbered 200
  • the address of the tensor data is a specific address area (for example, addresses ADDR0-ADDR127)
  • the shape of the tensor data is a specified shape (for example, a two-dimensional tensor of 20*10), etc.
  • Those skilled in the art can set the data characteristics of the tensor data to be synchronized according to the actual situation, which is not limited in the present disclosure.
  • the first processor may generate a state query instruction and send the state query instruction to the second processor with which synchronization is to be performed.
  • the state query instruction may include only part of the data characteristics, such as the identifier of the tensor data; it may also include more data characteristics, for example the identifier and the storage address of the tensor data, which are used to instruct the second processor to determine the descriptor of the tensor data to be synchronized.
  • the present disclosure does not limit the specific content included in the status query instruction.
  • if the state query instruction includes the identifier of the tensor data, the second processor may determine the tensor data to be synchronized according to the identifier, and register or obtain the descriptor indicating the tensor data to be synchronized. If the state query instruction includes more data features (identifier, storage address, etc.), the second processor can register a descriptor indicating the tensor data according to the data features in the instruction.
  • the second processor may determine the space that can be allocated to the tensor data corresponding to the descriptor, and thus determine the synchronizable data amount of the tensor data. According to the synchronizable data amount and the data characteristics, the second processor can generate and send a synchronization state instruction, so that the first processor can determine the tensor data to be synchronized and the amount of data that can be synchronized this time.
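  • A hedged sketch of this receiver-side determination: the synchronizable amount is bounded both by the still-missing part of the tensor and by the space the receiver can allocate. The instruction shape and bookkeeping fields below are assumptions for illustration:

```python
# Sketch: the second processor (receiver) computes the synchronizable data
# amount and packages it with the tensor's identifier into a synchronization
# state instruction.
from dataclasses import dataclass

@dataclass
class SyncStateInstruction:
    tensor_id: int
    synchronizable_amount: int

def build_sync_state(tensor_id, tensor_total, already_stored, free_space):
    # Only the part of the tensor not yet received needs synchronizing,
    # and no more than the receiver's currently allocatable space.
    remaining = tensor_total - already_stored
    return SyncStateInstruction(tensor_id, min(remaining, free_space))

# 200-unit tensor, 128 units already received, 100 units of free space:
state = build_sync_state(tensor_id=7, tensor_total=200, already_stored=128, free_space=100)
```

  • Capping by the remaining amount (not just the free space) avoids asking the sender for data that has already been synchronized.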
  • in this way, the sender can determine the data characteristics of the tensor data according to the descriptor, and generate and send a state query instruction according to the data characteristics to instruct the receiver to feed back its own state (that is, the synchronizable data amount) according to the state query instruction, thereby achieving partial synchronization of tensor data, reducing synchronization overhead without changing the instruction structure, and improving the efficiency of data synchronization.
  • the method further includes:
  • determining the first sub-data of the tensor data according to the descriptor and the synchronizable data amount; and, according to the first sub-data, generating a synchronization instruction and sending the synchronization instruction to the second processor to instruct the second processor to acquire the first sub-data.
  • when the first processor receives a synchronization state instruction from the second processor, it can parse the instruction to obtain its content, that is, the data characteristics of the tensor data to be synchronized and the synchronizable data amount.
  • according to the data characteristics, the descriptor of the tensor data to be synchronized can be determined, so as to determine the tensor data to be synchronized; and the part of the data that can be synchronized this time, that is, the first sub-data, is determined from the tensor data according to the synchronizable data amount.
  • the data amount of the first sub-data may correspond to the synchronizable data amount, for example, the data amount of the first sub-data is less than or equal to the synchronizable data amount.
  • if none of the tensor data has been synchronized, data of the synchronizable amount can be selected from the tensor data as the first sub-data; if part of the tensor data is unsynchronized and the amount of that unsynchronized part is greater than the synchronizable data amount, data of the synchronizable amount can be selected from the unsynchronized part (that is, the second sub-data of the tensor data) as the first sub-data; if the amount of the unsynchronized part is less than or equal to the synchronizable data amount, the unsynchronized part can be directly used as the first sub-data. It should be understood that those skilled in the art can determine the first sub-data according to the actual situation, which is not limited in the present disclosure.
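  • The selection rule just described can be sketched as follows, under the assumption (made only for this example) that the unsynchronized part of the tensor is tracked as one contiguous range:

```python
# Sketch: cut the first sub-data for this round out of the unsynchronized
# range (the second sub-data), capped by the synchronizable data amount.

def select_first_sub_data(unsynced_start, unsynced_end, synchronizable_amount):
    """Return the (start, end) range of the first sub-data for this round."""
    unsynced_len = unsynced_end - unsynced_start
    if unsynced_len <= synchronizable_amount:
        # The whole unsynchronized part fits: use it directly.
        return unsynced_start, unsynced_end
    # Otherwise take only as much as the receiver can hold this round.
    return unsynced_start, unsynced_start + synchronizable_amount

# 150 elements still unsynchronized, receiver can take 64 this round:
first = select_first_sub_data(50, 200, 64)
```

  • Any other bookkeeping of synchronized versus pending data (for example, per-chunk states) would apply the same cap.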
  • the synchronization state instruction may also include a range of partial data of the tensor data to be synchronized, such as a storage address range of the partial sub-data, etc., so as to specify to obtain the partial data to be synchronized.
  • the first processor may directly determine the first sub-data to be synchronized according to the range of the partial data.
  • the first processor may generate a synchronization instruction according to the first sub-data and send the synchronization instruction to the second processor.
  • the instruction may include the data characteristics of the tensor data to be synchronized and the first sub-data.
  • after receiving the synchronization instruction, the second processor can parse the instruction to determine the data characteristics of the tensor data to be synchronized and the first sub-data of the tensor data, determine the descriptor according to the data characteristics, determine the tensor data to be synchronized according to the descriptor, and store the first sub-data of the tensor data in its own non-shared storage space.
  • in this way, the descriptor of the tensor data and the synchronizable data amount can be determined according to the synchronization state instruction from the second processor, the sub-data of this synchronization can be determined according to the synchronizable data amount, and the synchronization instruction can be generated and sent based on the sub-data, so that the receiver can obtain the sub-data synchronized this time, thereby reducing synchronization overhead and improving the efficiency of data synchronization.
  • the step of determining the first sub-data of the tensor data according to the descriptor and the amount of synchronizable data may include:
  • when the first processor receives the synchronization state instruction from the second processor, it can determine the second sub-data in the to-be-synchronized state according to the state of the data in the tensor data; according to the second sub-data and the synchronizable data amount indicated by the synchronization state instruction, the first sub-data synchronized this time can be determined.
  • if the data amount of the second sub-data is greater than the synchronizable data amount, the first sub-data synchronized this time can be selected from the second sub-data; if the data amount of the second sub-data is less than or equal to the synchronizable data amount, the second sub-data can be directly used as the first sub-data.
  • part of the data synchronized this time can be determined, so as to achieve partial synchronization of tensor data and improve the efficiency of data synchronization.
  • the method further includes: changing the state of the first sub-data of the tensor data from the pending state to the synchronized state.
  • after generating the synchronization instruction, the first processor can change the state of the data in the tensor data, that is, change the state of the first sub-data from the to-be-synchronized state to the synchronized state.
  • in this way, at the next synchronization, the data to be synchronized can be determined from the partial data in the to-be-synchronized state, thereby avoiding repeated synchronization of data and improving the efficiency of data synchronization.
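  • The state transition can be sketched with per-chunk bookkeeping; the chunk granularity and state names are assumptions for illustration only:

```python
# Sketch: track each chunk of the tensor as "pending" (to-be-synchronized)
# or "synced"; after a round, flip the sent range so it is never re-sent.

PENDING, SYNCED = "pending", "synced"

def mark_synchronized(states, start, end):
    """Flip chunks [start, end) from pending to synced; reject double-sync."""
    for i in range(start, end):
        if states[i] != PENDING:
            raise ValueError(f"chunk {i} already synchronized")
        states[i] = SYNCED

states = [PENDING] * 8
mark_synchronized(states, 0, 3)        # first partial synchronization
next_pending = states.index(PENDING)   # where the next round starts
```

  • Rejecting a second transition on the same chunk is one way to make "avoiding repeated synchronization" checkable rather than implicit.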
  • Fig. 3 shows a flowchart of a data synchronization method according to an embodiment of the present disclosure. As shown in FIG. 3, the method is applied to the second processor, and the method includes:
  • step S31: upon receiving the state query instruction from the first processor, parsing the state query instruction to obtain the data characteristics of the tensor data to be synchronized;
  • step S32: determining the descriptor of the tensor data to be synchronized according to the data characteristics, the descriptor being used to indicate the shape of the tensor data to be synchronized;
  • step S33: determining the synchronizable data amount for the tensor data according to the descriptor of the tensor data;
  • step S34: generating a synchronization state instruction according to the data characteristics of the tensor data and the synchronizable data amount, and sending the synchronization state instruction to the first processor, the synchronization state instruction being used to instruct the first processor to determine the first sub-data of the tensor data, where the data amount of the first sub-data corresponds to the synchronizable data amount.
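  • Steps S31-S34 can be sketched end to end as follows. The registry, message shapes, and element size are all assumptions made for this example, not details from the disclosure:

```python
# Hedged sketch of the second processor's handling of a state query:
# parse the query (S31), find or register the descriptor (S32), compute the
# synchronizable amount (S33), and build the synchronization state reply (S34).

descriptor_registry = {}   # tensor_id -> shape; assumed lookup structure

def handle_state_query(query, free_space, element_size=4):
    tensor_id, shape = query["tensor_id"], query["shape"]
    # S32: register the descriptor if it does not exist yet.
    descriptor_registry.setdefault(tensor_id, shape)
    # S33: the synchronizable amount is bounded by the tensor's total size
    # and by the space the receiver can currently allocate.
    total_bytes = element_size
    for dim in shape:
        total_bytes *= dim
    synchronizable = min(total_bytes, free_space)
    # S34: reply carries the data characteristics plus the amount.
    return {"tensor_id": tensor_id, "synchronizable_amount": synchronizable}

reply = handle_state_query({"tensor_id": 7, "shape": (20, 10)}, free_space=512)
```

  • A (20, 10) tensor of 4-byte elements needs 800 bytes, so with 512 bytes of free space only a partial synchronization is offered.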
  • when the sender of data synchronization has tensor data to be synchronized, the sender can query the state of the receiver.
  • the first processor (sender) can generate and send a state query instruction
  • the second processor receives the state query instruction in step S31, it can parse the instruction and determine the data characteristics of the tensor data to be synchronized.
  • the data feature may include at least one of the identification (for example, data number), shape, source, and storage address of the tensor data.
  • the second processor may determine the descriptor of the tensor data to be synchronized according to the data characteristics.
  • the descriptor may be a registered (created) descriptor used to indicate the shape of the tensor data, or a new descriptor may be registered (created) according to the shape parameter of the tensor data, which is not limited in the present disclosure.
  • the second processor may determine the tensor data to be synchronized according to the descriptor, and determine the amount of data that can be accommodated by the portion of its own non-shared storage space allocatable to the tensor data, that is, the synchronizable data amount, for partial synchronization of the tensor data.
  • the second processor may generate a synchronization state instruction according to the determined synchronizable data amount and the data characteristics of the tensor data, and send it to the first processor, so that the first processor can determine the amount of data that can be synchronized this time. After determining the part of the data (that is, the first sub-data) that can be synchronized this time, the first processor may generate a synchronization instruction and send the synchronization instruction to the second processor.
  • the instruction may include the data characteristics of the tensor data to be synchronized and the first sub-data.
  • in this way, the sender can query the state of the receiver, and the receiver can determine and report its own state (that is, the synchronizable data amount) after receiving the state query instruction, achieving partial synchronization of tensor data through this interaction and improving the efficiency of data synchronization.
  • the method further includes:
  • upon receiving the synchronization instruction from the first processor, storing the first sub-data of the tensor data according to the synchronization instruction.
  • when the second processor receives a synchronization instruction, it can parse the instruction to determine the data characteristics of the tensor data to be synchronized and the first sub-data synchronized this time; find the descriptor of the tensor data to be synchronized according to the data characteristics; then determine the tensor data to be synchronized according to the descriptor, and store the first sub-data of the tensor data in its own unshared storage space.
  • the receiver can determine the descriptor according to the synchronization instruction and obtain the sub-data synchronized this time, thereby reducing synchronization overhead and improving the efficiency of data synchronization.
  • the receiver of data synchronization can also initiate a partial synchronization request for tensor data; that is, the receiver sends a synchronization request instruction, which can indicate the descriptor of the tensor data to be synchronized and the synchronizable data amount for the tensor data, that is, the amount of data that the receiver's unshared storage space allocatable to the tensor data can accommodate.
  • a data synchronization method is also provided, applied to the first processor, and the method includes:
  • upon receiving a synchronization request instruction from the second processor, determining the first sub-data of the tensor data according to the descriptor of the tensor data and the synchronizable data amount; and, according to the first sub-data, generating a synchronization instruction and sending the synchronization instruction to the second processor to instruct the second processor to acquire the first sub-data.
  • the receiver of data synchronization can initiate a partial synchronization request for tensor data; that is, the receiver sends a synchronization request instruction, which can indicate the data characteristics of the tensor data to be synchronized and the synchronizable data amount for the tensor data, that is, the amount of data that the receiver's unshared storage space allocatable to the tensor data can accommodate.
  • the first processor of the multiple processors is the sender of data synchronization
  • the second processor is the receiver of data synchronization.
  • the first processor and the second processor may each be any of the multiple processors.
  • the second processor may be of the same type as the first processor or of a different type, which is not restricted in the present disclosure.
  • when the first processor receives a synchronization request instruction from the second processor, it can parse the instruction to obtain its content, that is, the data characteristics of the tensor data to be synchronized and the synchronizable data amount.
  • the data feature may include at least one of the identification (for example, data number), shape, source, and storage address of the tensor data.
  • the data characteristics of the tensor data to be synchronized may include information such as the identification, shape, source, and address of the tensor data.
  • the data source of the tensor data is the Kth sender (the Kth processor)
  • the data source of the tensor data is the result of the convolution operation numbered 200
  • the address of the tensor data is a specific address area (for example, addresses ADDR0-ADDR127)
  • the shape of the tensor data is a specified shape (for example, a two-dimensional tensor of 20*10), etc.
  • Those skilled in the art can set the data characteristics of the tensor data to be synchronized according to the actual situation, which is not limited in the present disclosure.
  • the first processor may determine the descriptor of the tensor data to be synchronized according to the data characteristics, further determine the tensor data to be synchronized according to the descriptor, and determine from the tensor data, according to the synchronizable data amount, the part of the data that can be synchronized this time, that is, the first sub-data.
  • the data amount of the first sub-data may correspond to the synchronizable data amount, for example, the data amount of the first sub-data is less than or equal to the synchronizable data amount.
  • if none of the tensor data has been synchronized, data of the synchronizable amount can be selected from the tensor data as the first sub-data; if part of the tensor data is unsynchronized and the amount of that unsynchronized part is greater than the synchronizable data amount, data of the synchronizable amount can be selected from the unsynchronized part (that is, the second sub-data of the tensor data) as the first sub-data; if the amount of the unsynchronized part is less than or equal to the synchronizable data amount, the unsynchronized part can be directly used as the first sub-data. It should be understood that those skilled in the art can determine the first sub-data according to the actual situation, which is not limited in the present disclosure.
  • the synchronization request instruction may also include the range of the partial data of the tensor data to be synchronized, such as the storage address range of the partial sub-data, so as to specify the partial data to be obtained and synchronized.
  • the first processor may directly determine the first sub-data to be synchronized according to the range of the partial data.
  • the first processor may generate a synchronization instruction according to the first sub-data and send the synchronization instruction to the second processor.
  • the instruction may include the data characteristics of the tensor data to be synchronized and the first sub-data.
  • the second processor can parse the instruction to determine the data features and the first sub-data, thereby determining the descriptor based on the data features, determining the tensor data to be synchronized based on the descriptor, and storing the first sub-data of the tensor data in its own non-shared storage space.
  • the receiver can send a synchronization request instruction to actively request synchronization of part of the data, and the sender can determine the sub-data to be synchronized this time according to the amount of data the receiver can synchronize, generate a synchronization instruction based on the sub-data, and send it to enable the receiver to obtain the sub-data synchronized this time, thereby reducing synchronization overhead without changing the instruction structure and improving the efficiency of data synchronization.
  • determining the first sub-data of the tensor data according to the descriptor of the tensor data and the amount of synchronizable data may include:
  • when the first processor receives the synchronization request instruction from the second processor, it can determine the second sub-data in the to-be-synchronized state according to the state of the data in the tensor data; according to the second sub-data and the synchronizable data amount indicated by the synchronization request instruction, it can determine the first sub-data synchronized this time.
  • if the data amount of the second sub-data is greater than the synchronizable data amount, the first sub-data synchronized this time can be selected from the second sub-data; if the data amount of the second sub-data is less than or equal to the synchronizable data amount, the second sub-data can be directly used as the first sub-data.
  • part of the data synchronized this time can be determined, so as to achieve partial synchronization of tensor data and improve the efficiency of data synchronization.
  • the method further includes: changing the state of the first sub-data of the tensor data from the to-be-synchronized state to the synchronized state.
  • the first processor can change the state of the data in the tensor data, that is, change the state of the first sub-data from the to-be-synchronized state to the synchronized state.
  • the next synchronized data can be determined from the partial data in the to-be-synchronized state, thereby avoiding repeated synchronization of data and improving the efficiency of data synchronization.
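The state tracking described above can be sketched as follows (an illustrative model with hypothetical names; an actual processor would track states per block or range in hardware or firmware, not per element in a Python list):

```python
class SyncStateTracker:
    """Track which parts of a tensor are still to be synchronized,
    so already-synchronized data is never sent twice."""
    TO_BE_SYNCED, SYNCED = 0, 1

    def __init__(self, total_elements):
        self.state = [self.TO_BE_SYNCED] * total_elements

    def mark_synced(self, start, length):
        # Change the first sub-data of this round from the
        # to-be-synchronized state to the synchronized state.
        for i in range(start, start + length):
            self.state[i] = self.SYNCED

    def unsynced_ranges(self):
        # Second sub-data: maximal runs still in the to-be-synced state,
        # returned as (start, length) pairs.
        ranges, run_start = [], None
        for i, s in enumerate(self.state):
            if s == self.TO_BE_SYNCED and run_start is None:
                run_start = i
            elif s == self.SYNCED and run_start is not None:
                ranges.append((run_start, i - run_start))
                run_start = None
        if run_start is not None:
            ranges.append((run_start, len(self.state) - run_start))
        return ranges
```

Each round, the sender draws the next sub-data from `unsynced_ranges()` and calls `mark_synced` afterwards, so repeated synchronization of the same data is avoided.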
  • a data synchronization method is also provided, applied to the second processor, and the method includes:
  • the synchronization request instruction is used to instruct the first processor to determine, according to the synchronization request instruction, the tensor data to be synchronized and the first sub-data of the tensor data, where the data amount of the first sub-data corresponds to the synchronizable data amount.
  • the receiver of data synchronization may initiate a partial synchronization request for tensor data.
  • the descriptor of the tensor data can be determined.
  • the descriptor may be a registered (created) descriptor used to indicate the shape of the tensor data, or a new descriptor may be registered (created) according to the shape parameter of the tensor data, which is not limited in the present disclosure.
  • the second processor may determine the data characteristics of the tensor data according to the descriptor, such as at least one of the identification (for example, a data number), shape, source, and storage address of the tensor data.
  • the second processor can determine the amount of space that its non-shared storage space can allocate for the tensor data, that is, the synchronizable data amount.
  • the second processor may generate a synchronization request instruction and send the instruction.
  • the synchronization request instruction may be used to instruct the first processor to determine, according to the instruction, the tensor data to be synchronized and the first sub-data of the tensor data.
  • when the sender of the data synchronization (i.e., the first processor) receives the synchronization request instruction, it can parse the instruction to determine the data characteristics of the tensor data to be synchronized and the synchronizable data amount; determine the descriptor of the tensor data to be synchronized according to the data characteristics; determine the tensor data to be synchronized according to the descriptor; and, according to the synchronizable data amount, determine from the tensor data the part of the data that can be synchronized this time, that is, the first sub-data.
  • the data amount of the first sub-data may correspond to the synchronizable data amount, for example, the data amount of the first sub-data is less than or equal to the synchronizable data amount.
  • if none of the tensor data has been synchronized, data of the synchronizable data amount can be selected from the tensor data as the first sub-data; if part of the tensor data is unsynchronized and the amount of the unsynchronized partial data is greater than the synchronizable data amount, data of the synchronizable data amount can be selected from the unsynchronized partial data (that is, the second sub-data of the tensor data) as the first sub-data; if the amount of the unsynchronized partial data is less than or equal to the synchronizable data amount, the unsynchronized partial data can be directly used as the first sub-data. It should be understood that those skilled in the art can determine the first sub-data according to the actual situation, which is not limited in the present disclosure.
  • the synchronization request instruction may also include the range of the partial data of the tensor data to be synchronized, such as the descriptor content or the storage address range of the partial sub-data, so as to specify the partial data to be synchronized.
  • the receiver can initiate a partial synchronization request of the tensor data, so that the sender can determine the sub-data to be synchronized this time, thereby improving the efficiency of data synchronization.
  • the method further includes:
  • the first sub-data of the tensor data is stored.
  • the first processor may generate and send a synchronization instruction according to the data characteristics of the tensor data and the first sub-data.
  • the second processor receives the synchronization instruction, it can parse the instruction to determine the data characteristics of the tensor data to be synchronized and the first sub-data of the tensor data synchronized this time; determine the descriptor according to the data characteristics, and then Determine the tensor data to be synchronized according to the descriptor, and store the first sub-data of the tensor data in its own non-shared storage space.
  • the receiver can determine the descriptor according to the synchronization instruction and obtain the sub-data of this synchronization, thereby reducing synchronization overhead, improving the efficiency of data synchronization, and achieving instruction compatibility during instruction transmission and processing.
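The request/synchronize round trip described above can be sketched with plain dictionaries standing in for the instructions (all field and function names here are illustrative assumptions, not the instruction encoding of the disclosure):

```python
def make_sync_request(data_features, synchronizable_amount):
    # Receiver side: actively request partial synchronization of a tensor,
    # reporting how much data its non-shared storage space can accept.
    return {"op": "sync_request",
            "features": data_features,  # e.g. id, shape, source, address
            "synchronizable_amount": synchronizable_amount}

def make_sync_instruction(request, first_sub_data):
    # Sender side: answer with the sub-data chosen for this round; its
    # amount must not exceed what the receiver reported.
    assert len(first_sub_data) <= request["synchronizable_amount"]
    return {"op": "sync",
            "features": request["features"],
            "first_sub_data": first_sub_data}

def handle_sync_instruction(instruction, non_shared_storage):
    # Receiver side: resolve the tensor from the data features and store
    # the first sub-data in its own non-shared storage space (modeled
    # here as a dict keyed by the tensor identification).
    tensor_id = instruction["features"]["id"]
    non_shared_storage.setdefault(tensor_id, []).extend(
        instruction["first_sub_data"])
```

Running several such round trips accumulates the sub-data of each round until the whole tensor has been transferred.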
  • the identifier and content of the descriptor can be stored in the descriptor storage space, which can be a storage space in the internal memory of the processor (such as a register, an on-chip SRAM, or another media cache).
  • the data storage space of the tensor data indicated by the descriptor may be a storage space in the internal memory of the processor (for example, on-chip cache) or an external memory (off-chip memory) connected to the processor.
  • the data address in the data storage space may be an actual physical address or a virtual address.
  • the present disclosure does not limit the location of the descriptor storage space and the data storage space, and the type of data address.
  • the identifier and content of the descriptor and the tensor data indicated by the descriptor can be located in the same area.
  • a continuous area of the on-chip cache can be used to store the relevant content of the descriptor
  • the address is ADDR0-ADDR1023, where the address ADDR0-ADDR31 can be used to store the identifier of the descriptor, the address ADDR32-ADDR63 can be used to store the content of the descriptor, and the address ADDR64-ADDR1023 can be used to store the tensor data indicated by the descriptor.
  • the address ADDR is not limited to one bit or one byte; it is used here to represent one address unit. Those skilled in the art can determine the storage area and its addresses according to the actual situation, which is not limited in the present disclosure.
  • the identifier and content of the descriptor and the tensor data indicated by the descriptor can be stored separately in different areas of the internal memory.
  • a register can be used as the descriptor storage space to store the identifier and content of the descriptor, and the on-chip cache can be used as the data storage space to store the tensor data indicated by the descriptor.
  • a special register (SR) dedicated to the descriptor can also be set, and the data in the descriptor can be an immediate value or can be obtained from a special register.
  • the number of the register can be used to represent the identifier of the descriptor. For example, when the number of the register is 0, the identifier of the stored descriptor is 0.
  • an area can be allocated in the cache space according to the size of the tensor data indicated by the descriptor (for example, a tensor cache unit can be created in the cache for each tensor data) for storing the tensor data. It should be understood that a preset cache space may also be used to store the tensor data, which is not limited in the present disclosure.
  • the identifier and content of the descriptor can be stored in the internal memory, and the tensor data indicated by the descriptor can be stored in the external memory.
  • a method of storing the identifier and content of the descriptor on the chip, and storing the tensor data indicated by the descriptor off the chip may be adopted.
  • the data address of the data storage space corresponding to the descriptor may be a fixed address.
  • a separate data storage space can be divided for tensor data, and the starting address of each tensor data in the data storage space corresponds to the identifier of the descriptor in a one-to-one correspondence.
  • the processor can determine the data address of the tensor data based on the content of the descriptor.
  • the descriptor may also be used to indicate the address of N-dimensional tensor data, where the content of the descriptor may also include at least one address parameter representing the address of the tensor data.
  • tensor data is three-dimensional data.
  • the content of the descriptor may include one address parameter indicating the address of the tensor data, such as the start address of the tensor data; or it may include multiple address parameters of the address of the tensor data, such as the start address of the tensor data plus an address offset, or address parameters of the tensor data in each dimension.
  • the address parameter of the tensor data includes a reference address of the data reference point of the descriptor in the data storage space of the tensor data.
  • the reference address can be different according to the change of the data reference point.
  • the present disclosure does not limit the selection of data reference points.
  • the reference address may include the start address of the data storage space.
  • the reference address of the descriptor is the starting address of the data storage space.
  • the reference address of the descriptor is the physical address of the data block in the data storage space.
  • the shape parameter of the tensor data includes at least one of the following: the size of the data storage space of the tensor data in at least one of the N dimensional directions; the size of the storage area in at least one of the N dimensional directions; the offset of the storage area in at least one of the N dimensional directions; the positions, relative to the data reference point, of at least two vertices at diagonal positions of the N dimensional directions; and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address. Here, the data description position is the mapping position of a point or region in the tensor data indicated by the descriptor.
  • for example, when the tensor data indicated by the descriptor is three-dimensional spatial data, the shape of the tensor data can be represented by three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be the position of a point or region, expressed in the coordinates (x, y, z), in the three-dimensional space to which the tensor data is mapped.
  • Fig. 4 shows a schematic diagram of data storage space of a data synchronization method according to an embodiment of the present disclosure.
  • the data storage space 21 stores two-dimensional data in a row-first manner, addressable by (x, y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the start address of the data storage space 21 is PA_start (the reference address), which is the physical address of the first data block 22.
  • the data block 23 is part of the data in the data storage space 21, the offset 25 in the X axis direction is represented as offset_x, the offset 24 in the Y axis direction is represented as offset_y, and the size in the X axis direction is represented Is size_x, and the size in the Y-axis direction is expressed as size_y.
  • the data reference point of the descriptor can be the first data block of the data storage space 21, and the reference address of the descriptor is then the start address PA_start of the data storage space 21. The content of the descriptor of the data block 23 can be determined by combining PA_start with the size ori_x of the data storage space 21 on the X axis, the size ori_y on the Y axis, the offset offset_y of the data block 23 in the Y-axis direction, the offset offset_x in the X-axis direction, the size size_x in the X-axis direction, and the size size_y in the Y-axis direction.
  • the descriptor describes a two-dimensional space
  • those skilled in the art can set the dimension represented by the content of the descriptor according to the actual situation, which is not limited in the present disclosure.
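For illustration, the content of such a descriptor can be modeled as a record holding the parameters named above (a sketch only; the disclosure does not specify how the content is encoded in the descriptor storage space, and the field names simply follow the text):

```python
from dataclasses import dataclass

@dataclass
class Descriptor2D:
    """Content of a descriptor for a 2-D data block in a row-first
    data storage space, per the parameters used in the text."""
    pa_start: int  # reference address: start of the data storage space
    ori_x: int     # storage-space size per row (X-axis direction)
    ori_y: int     # total number of rows (Y-axis direction)
    offset_x: int  # offset of the data block in the X-axis direction
    offset_y: int  # offset of the data block in the Y-axis direction
    size_x: int    # size of the data block in the X-axis direction
    size_y: int    # size of the data block in the Y-axis direction
```

For instance, `Descriptor2D(pa_start=0x1000, ori_x=64, ori_y=32, offset_x=8, offset_y=4, size_x=16, size_y=8)` would describe a 16x8 block at offset (8, 4) inside a 64x32 storage space.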
  • the content of the descriptor of the tensor data may be determined based on the reference address of the data reference point of the descriptor in the data storage space and the positions, relative to the data reference point, of at least two vertices at diagonal positions of the N dimensional directions.
  • the reference address PA_base of the data reference point of the descriptor in the data storage space and the positions of two diagonal vertices relative to the data reference point can be used to determine the content of the descriptor of the data block 23 in FIG. 2.
  • first, one datum (for example, the datum at position (2, 2)) can be selected as the data reference point in the data storage space 21, and the physical address of this datum in the data storage space is used as the reference address PA_base; then, the positions of at least two diagonal vertices of the data block 23 relative to the data reference point are determined, for example, the vertices in the upper-left-to-lower-right diagonal direction, where the relative position of the upper-left vertex is (x_min, y_min) and the relative position of the lower-right vertex is (x_max, y_max). The content of the descriptor of the data block 23 can then be determined according to the reference address PA_base, the relative position (x_min, y_min), and the relative position (x_max, y_max).
  • the content of the descriptor of the tensor data may be determined based on the reference address of the data reference point of the descriptor in the data storage space and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address.
  • the mapping relationship between the data description position and the data address can be set according to actual needs. For example, when the tensor data indicated by the descriptor is three-dimensional spatial data, the function f(x, y, z) can be used to define the mapping relationship between the data description position and the data address.
  • mapping relationship between the data description location and the data address can be set according to the actual situation, which is not limited in the present disclosure.
  • PA2(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)    (4)
  • the processor can calculate the data address of the tensor data indicated by the descriptor in the data storage space according to the content of the descriptor, and then perform corresponding processing (such as data operation, data synchronization, etc.) according to the address, Therefore, the complexity of data access can be reduced, and the processing efficiency of the processor can be improved.
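Formula (4) above can be checked with a short function (row-first storage as in Fig. 4; the names follow the text, and the function itself is an illustrative sketch, with addresses counted in the text's address units):

```python
def pa2(pa_start, ori_x, offset_x, offset_y, x_q, y_q):
    """Data address of the point (x_q, y_q) of the tensor data indicated
    by the descriptor, per formula (4): skip (offset_y + y_q - 1) full
    rows of ori_x units, then move (offset_x + x_q) units into the row."""
    return pa_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)
```

For example, with PA_start = 1000, ori_x = 64, offset_x = 8, offset_y = 4, the point (0, 1) maps to 1000 + 4 * 64 + 8 = 1264.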
  • partial synchronization of tensor data can be achieved when the space of the receiver of the data synchronization is insufficient, and synchronization of the entire tensor data is achieved through multiple partial synchronizations, thereby avoiding the overall synchronization failure or synchronization delay of the tensor data caused by insufficient space and improving the efficiency of data synchronization. In addition, a descriptor indicating the shape of the tensor data is set, and the tensor data is determined according to the descriptor during the data synchronization process, so that the synchronization overhead is reduced, the complexity of data access is reduced, and instruction compatibility during instruction transfer and processing is realized.
  • although the steps in the flowchart are displayed in sequence according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless specifically stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same moment, but may be executed at different moments, and their execution order is not necessarily sequential: they may be performed in turn or alternately with at least a part of other steps, or with sub-steps or stages of other steps.
  • Fig. 5 shows a block diagram of a data synchronization device according to an embodiment of the present disclosure.
  • the data synchronization device is applied to the first processor.
  • the data synchronization device includes:
  • the feature determining module 51 is configured to determine the data feature of the tensor data according to the descriptor of the tensor data to be synchronized, and the descriptor is used to indicate the shape of the tensor data to be synchronized;
  • the query instruction generating and sending module 52 is configured to generate a state query instruction according to the data characteristics of the tensor data and send the state query instruction to the second processor, and the state query instruction is used to instruct the second processor to determine the synchronizable data amount of the tensor data and generate a synchronization state instruction.
  • the device further includes:
  • the state instruction analysis module is used to analyze the synchronization state instruction when receiving the synchronization state instruction from the second processor, and determine the data characteristics of the tensor data to be synchronized and the amount of data that can be synchronized,
  • the first descriptor determining module is configured to determine the descriptor of the tensor data to be synchronized according to the data characteristics
  • a data determining module configured to determine the first sub-data of the tensor data according to the descriptor and the synchronizable data amount, and the data amount of the first sub-data corresponds to the synchronizable data amount;
  • the synchronization instruction generating and sending module is configured to generate a synchronization instruction according to the first sub-data and send the synchronization instruction to the second processor to instruct the second processor to acquire the first sub-data.
  • the data determining module includes:
  • the first determining sub-module is configured to determine the tensor data to be synchronized and the second sub-data in the to-be-synchronized state in the tensor data according to the descriptor;
  • the second determining sub-module is configured to determine the first sub-data according to the second sub-data and the amount of synchronizable data.
  • the device further includes:
  • the state change module is used to change the state of the first sub-data of the tensor data from the to-be-synchronized state to the synchronized state.
  • Fig. 6 shows a block diagram of a data synchronization device according to an embodiment of the present disclosure.
  • the data synchronization device is applied to the second processor.
  • the data synchronization device includes:
  • the query instruction parsing module 61 is configured to, when receiving the state query instruction from the first processor, analyze the state query instruction to obtain the data characteristics of the tensor data to be synchronized;
  • the second descriptor determining module 62 is configured to determine a descriptor of the tensor data to be synchronized according to the data characteristics, and the descriptor is used to indicate the shape of the tensor data to be synchronized;
  • the data amount determination module 63 is configured to determine the synchronizable data amount for the tensor data according to the descriptor of the tensor data;
  • the state instruction generating and sending module 64 is configured to generate a synchronization state instruction according to the data characteristics of the tensor data and the synchronizable data amount and send the synchronization state instruction to the first processor, where the synchronization state instruction is used to instruct the first processor to determine the first sub-data of the tensor data, and the data amount of the first sub-data corresponds to the synchronizable data amount.
  • the device further includes:
  • a synchronization instruction parsing module configured to, when receiving a synchronization instruction from the first processor, analyze the synchronization instruction to obtain the data characteristics of the tensor data to be synchronized and the first sub-data of the tensor data;
  • the third descriptor determining module is configured to determine the descriptor of the tensor data according to the data characteristics
  • the data storage module is configured to store the first sub-data of the tensor data according to the descriptor of the tensor data.
  • the foregoing device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of the units/modules in the foregoing embodiment is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules, or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated units/modules can be implemented in the form of hardware or software program modules.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • the artificial intelligence processor may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and so on.
  • the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer readable memory.
  • the technical solution of the present disclosure essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory, A number of instructions are included to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes: a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media that can store program codes.
  • an artificial intelligence chip is also disclosed, which includes the above-mentioned data synchronization device.
  • a board card is disclosed, which includes a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip, wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
  • Fig. 7 shows a structural block diagram of a board according to an embodiment of the present disclosure.
  • the board may include other supporting components in addition to the chip 389 described above.
  • the supporting components include, but are not limited to: a storage device 390, Interface device 391 and control device 392;
  • the storage device 390 is connected to the artificial intelligence chip through a bus for storing data.
  • the storage device may include multiple groups of storage units 393. Each group of the storage unit and the artificial intelligence chip are connected through a bus. It can be understood that each group of the storage unit may be DDR SDRAM (English: Double Data Rate SDRAM, double-rate synchronous dynamic random access memory).
  • the storage device may include 4 groups of the storage units. Each group of the storage units may include a plurality of DDR4 chips.
  • the artificial intelligence chip may include four 72-bit DDR4 controllers. In each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC verification. It can be understood that when DDR4-3200 chips are used in each group of the storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
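The 25600 MB/s figure follows from the transfer rate and the 64-bit payload width (a quick arithmetic check; the 8 ECC bits carry no payload data):

```python
def ddr_theoretical_bandwidth_mb_s(transfers_per_second, data_bits):
    """Theoretical DDR bandwidth in MB/s (1 MB = 10**6 bytes).
    DDR4-3200 performs 3.2e9 transfers per second on a 64-bit data path."""
    bytes_per_transfer = data_bits // 8
    return transfers_per_second * bytes_per_transfer // 10**6
```

With 3,200,000,000 transfers per second and a 64-bit data path, this gives 25600 MB/s per channel, matching the value stated above.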
  • each group of the storage unit includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transmit data twice in one clock cycle.
  • a controller for controlling the DDR is provided in the chip for controlling the data transmission and data storage of each storage unit.
  • the interface device is electrically connected with the artificial intelligence chip.
  • the interface device is used to implement data transmission between the artificial intelligence chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be another interface. The present disclosure does not limit the specific form of the other interfaces mentioned above, as long as the interface unit can realize the transfer function.
  • the calculation result of the artificial intelligence chip is still transmitted by the interface device back to an external device (such as a server).
  • the control device is electrically connected with the artificial intelligence chip.
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the artificial intelligence chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, MCU).
  • the artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the artificial intelligence chip can be in different working states such as multi-load and light load.
  • the control device can realize the regulation of the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
  • an electronic device which includes the aforementioned artificial intelligence chip.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the transportation means include airplanes, ships, and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance, B-ultrasound and/or electrocardiograph.
  • a data synchronization method which is applied to a first processor, includes:
  • the state query instruction is used to instruct the second processor to determine the synchronizable data amount for the tensor data and generate a synchronization state instruction.
  • the first sub-data According to the first sub-data, generate a synchronization instruction and send the synchronization instruction to the second processor to instruct the second processor to acquire the first sub-data.
  • determining the first sub-data of the tensor data according to the descriptor and the amount of synchronizable data includes:
  • Clause A4: the method according to clause A2 or clause A3, further including:
  • the state of the first sub-data of the tensor data is changed from the pending state to the synchronized state.
  • a data synchronization method applied to a second processor, including:
  • the synchronization state instruction is used to instruct the first processor to determine the first sub-data of the tensor data, where the data amount of the first sub-data corresponds to the synchronizable data amount.
  • the first sub-data of the tensor data is stored.
  • a data synchronization device applied to a first processor, and the device includes:
  • a feature determining module configured to determine the data feature of the tensor data according to the descriptor of the tensor data to be synchronized, and the descriptor is used to indicate the shape of the tensor data to be synchronized;
  • the query instruction generating and sending module is used to generate a state query instruction according to the data characteristics of the tensor data and send the state query instruction to the second processor, and the state query instruction is used to instruct the second processor to determine the synchronizable data amount of the tensor data and generate a synchronization state instruction.
  • the state instruction analysis module is used to parse the synchronization state instruction when receiving the synchronization state instruction from the second processor, and determine the data characteristics of the tensor data to be synchronized and the synchronizable data amount.
  • the first descriptor determining module is configured to determine the descriptor of the tensor data to be synchronized according to the data characteristics
  • a data determining module configured to determine the first sub-data of the tensor data according to the descriptor and the synchronizable data amount, and the data amount of the first sub-data corresponds to the synchronizable data amount;
  • the synchronization instruction generating and sending module is configured to generate a synchronization instruction according to the first sub-data and send the synchronization instruction to the second processor to instruct the second processor to acquire the first sub-data.
  • the first determining sub-module is configured to determine the tensor data to be synchronized and the second sub-data in the to-be-synchronized state in the tensor data according to the descriptor;
  • the second determining sub-module is configured to determine the first sub-data according to the second sub-data and the amount of synchronizable data.
  • the state change module is used to change the state of the first sub-data of the tensor data from the pending state to the synchronized state.
  • a data synchronization device applied to a second processor including:
  • the query instruction parsing module is configured to, when receiving the state query instruction from the first processor, analyze the state query instruction to obtain the data characteristics of the tensor data to be synchronized;
  • the second descriptor determining module is configured to determine the descriptor of the tensor data to be synchronized according to the data characteristics, and the descriptor is used to indicate the shape of the tensor data to be synchronized;
  • a data amount determination module configured to determine the synchronizable data amount for the tensor data according to the descriptor of the tensor data
  • the state instruction generating and sending module is configured to generate a synchronization state instruction according to the data characteristics of the tensor data and the synchronizable data amount, and send the synchronization state instruction to the first processor; the synchronization state instruction is used to instruct the first processor to determine the first sub-data of the tensor data, where the data amount of the first sub-data corresponds to the synchronizable data amount.
  • a synchronization instruction parsing module configured to, when receiving a synchronization instruction from the first processor, analyze the synchronization instruction to obtain the data characteristics of the tensor data to be synchronized and the first sub-data of the tensor data;
  • the third descriptor determining module is configured to determine the descriptor of the tensor data according to the data characteristics
  • the data storage module is configured to store the first sub-data of the tensor data according to the descriptor of the tensor data.
  • Clause A14 An electronic device that includes the artificial intelligence chip as described in Clause A13.
  • a board card, the board card includes: a storage device, an interface device, a control device, and the artificial intelligence chip as described in clause A13;
  • the artificial intelligence chip is connected to the storage device, the control device and the interface device respectively;
  • the storage device is used to store data
  • the interface device is used to implement data transmission between the artificial intelligence chip and external equipment
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the storage device includes: multiple groups of storage units, each group of storage units is connected to the artificial intelligence chip through a bus, and the storage units are: DDR SDRAM;
  • the chip includes: a DDR controller for controlling data transmission and data storage of each storage unit;
  • the interface device is: a standard PCIE interface.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请涉及一种数据同步方法及装置以及相关产品,该方法应用于第一处理器,包括:根据待同步的张量数据的描述符,确定该张量数据的数据特征,描述符用于指示待同步的张量数据的形状;根据该张量数据的数据特征,生成状态查询指令并向第二处理器发送状态查询指令,该状态查询指令用于指示第二处理器确定针对张量数据的可同步数据量并生成同步状态指令。通过以上方法,可以提高数据同步的效率。

Description

数据同步方法及装置以及相关产品 技术领域
本公开涉及计算机技术领域,尤其涉及一种数据同步方法及装置以及相关产品。
背景技术
随着人工智能技术的不断发展,其应用领域越来越广泛,在图像识别、语音识别、自然语言处理等领域中都得到了良好的应用。然而,随着人工智能算法的复杂度提高,需要处理的数据量和数据维度都在不断增大,通常需要多核和/或多芯片进行数据处理。在进行核间或芯片间的数据同步时,采用相关技术的同步方式的同步开销较大,处理效率较低。
发明内容
有鉴于此,本公开提出了一种数据同步技术方案。
根据本公开的一方面,提供了一种数据同步方法,所述方法应用于第一处理器,包括:根据待同步的张量数据的描述符,确定所述张量数据的数据特征,所述描述符用于指示待同步的张量数据的形状;根据所述张量数据的数据特征,生成状态查询指令并向第二处理器发送所述状态查询指令,所述状态查询指令用于指示第二处理器确定针对所述张量数据的可同步数据量并生成同步状态指令。
根据本公开的一方面,提供了一种数据同步方法,所述方法应用于第二处理器,包括:在接收到来自第一处理器的状态查询指令时,解析所述状态查询指令,得到待同步的张量数据的数据特征;根据所述数据特征,确定待同步的张量数据的描述符,所述描述符用于指示待同步的张量数据的形状;根据所述张量数据的描述符,确定针对所述张量数据的可同步数据量;根据所述张量数据的数据特征及所述可同步数据量,生成同步状态指令并向所述第一处理器发送所述同步状态指令,所述同步状态指令用于指示所述第一处理器确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应。
根据本公开的另一方面,提供了一种数据同步装置,所述装置应用于第一处理器,包括:特征确定模块,用于根据待同步的张量数据的描述符,确定所述张量数据的数据特征,所述描述符用于指示待同步的张量数据的形状;查询指令生成及发送模块,用于根据所述张量数据的数据特征,生成状态查询指令并向第二处理器发送所述状态查询指令,所述状态查询指令用于指示第二处理器确定针对所述张量数据的可同步数据量并生成同步状态指令。
根据本公开的另一方面,提供了一种数据同步装置,所述装置应用于第二处理器,包括:查询指令解析模块,用于在接收到来自第一处理器的状态查询指令时,解析所述状态查询指令,得到待同步的张量数据的数据特征;第二描述符确定模块,用于根据所述数据特征,确定待同步的张量数据的描述符,所述描述符用于指示待同步的张量数据的形状;数据量确定模块,用于根据所述张量数据的描述符,确定针对所述张量数据的可同步数据量;状态指令生成及发送模块,用于根据所述张量数据的数据特征及所述可同步数据量,生成同步状态指令并向所述第一处理器发送所述同步状态指令,所述同步状态指令用于指示所述第一处理器确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应。
根据本公开的另一方面,提供了一种人工智能芯片,所述芯片包括如上所述的数据同步装置。
根据本公开的另一方面,提供了一种电子设备,所述电子设备包括如上所述的人工智能芯片。
根据本公开的另一方面,提供了一种板卡,所述板卡包括:存储器件、接口装置和控制器件以及如上所述的人工智能芯片;其中,所述人工智能芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;所述存储器件,用于存储数据;所述接口装置,用于实现所述人工智能芯片与外部设备之间的数据传输;所述控制器件,用于对所述人工智能芯片的状态进行监控。
根据本公开的实施例，通过设定指示张量数据的形状的描述符，发送方能够根据描述符确定张量数据的数据特征，根据数据特征生成并发送状态查询指令，以指示接收方根据状态查询指令反馈可同步的数据量，从而实现张量数据的部分同步，在不改变指令结构的情况下减少同步开销，提高数据同步的效率。
通过权利要求中的技术特征进行推导，能够达到对应背景技术中的技术问题的有益效果。根据下面参考附图对示例性实施例的详细说明，本公开的其它特征及方面将变得清楚。
附图说明
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本公开的示例性实施例、特征和方面,并且用于解释本公开的原理。
图1示出根据本公开实施例的数据同步方法的处理系统的示意图。
图2示出根据本公开实施例的数据同步方法的流程图。
图3示出根据本公开实施例的数据同步方法的流程图。
图4示出根据本公开实施例的数据同步方法的数据存储空间的示意图。
图5示出根据本公开实施例的数据同步装置的框图。
图6示出根据本公开实施例的数据同步装置的框图。
图7示出根据本公开实施例的板卡的结构框图。
具体实施方式
下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本公开一部分实施例,而不是全部的实施例。基于本公开中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本公开保护的范围。
应当理解,本公开的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本公开的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本公开说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本公开。如在本公开说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本公开说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
根据本公开实施例的数据同步方法可应用于包括多个处理器(多核)的处理系统(例如人工智能芯片)的任意一个处理器中。该处理器可以是通用处理器,例如CPU(Central Processing Unit,中央处理器),也可以是用于执行人工智能运算的人工智能处理器(IPU)。人工智能运算可包括机器学习运算,类脑运算等。其中,机器学习运算包括神经网络运算、k-means运算、支持向量机运算等。该人工智能处理器可例如包括GPU(Graphics Processing Unit,图形处理单元)、NPU(Neural-Network Processing Unit,神经网络处理单元)、DSP(Digital Signal Process,数字信号处理单元)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)芯片中的一种或组合。本公开对处理器的具体类型不作限制。此外,处理系统中的多个处理器的类型可以相同或不同,本公开对此不作限制。
在一种可能的实现方式中,本公开中所提及的处理器可包括多个处理单元,每个处理单元可以独立运行所分配到的各种任务,如:卷积运算任务、池化任务或全连接任务等。本公开对处理单元及处理单元所运行的任务不作限制。
图1示出根据本公开实施例的数据同步方法的处理系统的示意图。如图1所示,处理系统100包括多个处理器101以及存储器102,多个处理器101用于执行指令序列,存储器102用于存储数据,可包括随机存储器(RAM,Random Access Memory)和寄存器堆。处理系统100中的多个处理器101既可共用部分存储空间,例如共用部分RAM存储空间和寄存器堆,又可同时拥有各自的存储空间。
图2示出根据本公开实施例的数据同步方法的流程图。如图2所示,该方法应用于第一处理器(处理系统中的任意一个处理器),该方法包括:
在步骤S11中:根据待同步的张量数据的描述符,确定所述张量数据的数据特征,所述描述符用于指示待同步的张量数据的形状;
在步骤S12中:根据所述张量数据的数据特征,生成状态查询指令并向第二处理器发送所述状态查询指令,所述状态查询指令用于指示第二处理器确定针对所述张量数据的可同步数据量并生成同步状态指令。
举例来说,待同步的数据可包括N维的张量数据(N为大于或等于零的整数,例如N=1、2或3),其中,张量可以包含多种形式的数据组成方式,张量可以是不同维度的,比如标量可以看作是0维张量,向量可以看作1维张量,而矩阵可以是2维或2维以上的张量。张量的形状包括张量的维度、张量各个维度的尺寸等信息。例如对于张量:
T = [[a11, a12, a13, a14], [a21, a22, a23, a24]]
该张量的形状可以被描述符描述为(2,4),也即通过两个参数表示该张量为二维张量,且该张量的第一维度(列)的尺寸为2、第二维度(行)的尺寸为4。需要说明的是,本公开对于描述符指示张量形状的方式并不做限定。在存储器中存储张量数据时,根据其数据地址(或存储区域)无法确定张量数据的形状,进而也无法确定多个张量数据之间相互关系等相关信息,导致处理器对张量数据的存取效率较低,在进行数据同步时的复杂度也较大。
在该情况下,可设定描述符(张量描述符)来指示张量数据(N维的张量数据)的形状。其中,N的取值可根据张量数据的维数(阶数)来确定,也可以根据张量数据的使用需要进行设定。例如,在N的取值为3时,张量数据为三维的张量数据,描述符可用来指示该三维的张量数据在三个维度方向上的形状(例如偏移量、尺寸等)。应当理解,本领域技术人员可以根据实际需要对N的取值进行设置,本公开对此不作限制。
在一种可能的实现方式中,描述符可包括标识和内容等,描述符的标识可用于对描述符进行区分,例如为编号;描述符的内容可包括表示张量数据的形状的至少一个形状参数(例如张量的各个维度方向上的尺寸等),还可以包括表示张量数据的地址的至少一个地址参数(例如数据基准点的基准地址)。本公开对描述符的内容包括的具体参数不作限制。
通过采用描述符来指示张量数据的方式,能够表达张量数据的形状,进而也能够确定多个张量数据之间的相互关系等相关信息,提高对张量数据的存取效率,从而降低数据同步时的复杂度。
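作为示意，上述"标识＋内容（形状参数、地址参数）"的描述符结构可以用如下 Python 草图表示。其中 TensorDescriptor 及各字段命名均为本示例的假设，并非本公开限定的实现：

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TensorDescriptor:
    """张量描述符示意：标识 + 内容（形状参数与地址参数）。"""
    desc_id: int             # 描述符的标识，例如编号
    shape: Tuple[int, ...]   # 形状参数：各维度方向上的尺寸
    base_addr: int = 0       # 地址参数：数据基准点的基准地址

    @property
    def ndim(self) -> int:
        """张量的维数 N，由形状参数的个数确定。"""
        return len(self.shape)

# 例如上文中形状为 (2, 4) 的二维张量，其描述符可记作：
d = TensorDescriptor(desc_id=0, shape=(2, 4), base_addr=0x1000)
```

通过这样的结构，多个处理器只需交换描述符的标识和内容，即可约定同一块张量数据的形状与位置。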
在一种可能的实现方式中,在数据处理过程中,可能需要进行多个处理器(例如人工智能芯片的多个核)之间的数据同步,例如将处理器A1的运算结果同步到处理器A2中作为另一项运算的输入数据。在该情况下,可以采用基于描述符的数据同步机制实现数据同步。
在一种可能的实现方式中，各个处理器的非共用存储空间中可分配给待同步的张量数据的空间可能有限，无法实现张量数据的整体同步。在该情况下，可进行张量数据的部分同步，通过多次的部分同步来实现整个张量数据的同步过程。
在一种可能的实现方式中,在数据同步的发送方有待同步的张量数据,例如在完成一项运算得到运算结果(也即张量数据)时,可由发送方查询接收方的状态,确定数据同步的接收方的非共用存储空间能够分配给该张量数据的空间所能容纳的数据量,以便进行张量数据的部分同步。
在一种可能的实现方式中，可以设定多个处理器中的第一处理器是数据同步的发送方，第二处理器是数据同步的接收方。第一处理器和第二处理器均为多个处理器中的任意处理器，第二处理器可与第一处理器的类型相同或不同，本公开对第一处理器和第二处理器的类型不作限制。
在一种可能的实现方式中,当第一处理器确定存在待同步的张量数据时,可以获取该张量数据的描述符。该描述符可以是已注册(创建)的用于指示该张量数据的形状的描述符,也可以根据该张量数据的形状参数注册(创建)新的描述符,本公开对此不作限制。
在一种可能的实现方式中,在步骤S11中,根据该张量数据的描述符,第一处理器可确定该张量数据的数据特征。该数据特征可包括张量数据的标识(例如数据编号)、形状、来源、存储地址等信息中的至少一种。
在一种可能的实现方式中,待同步的张量数据的数据特征可包括张量数据的标识、形状、来源、地址等信息。例如,该张量数据的数据来源为第K个发送方(第K个处理器)、该张量数据的数据来源为编号200的卷积操作的运算结果、该张量数据的地址为特定的地址区域(例如地址ADDR0-ADDR127)、该张量数据的形状为指定的形状(例如20*10的二维张量)等。本领域技术人员可根据实际情况设定待同步的张量数据的数据特征,本公开对此不作限制。
在一种可能的实现方式中,在步骤S12中,根据该张量数据的数据特征,第一处理器可生成状态查询指令并向待同步的第二处理器发送该状态查询指令。如果第二处理器中已具有该张量数据的信息(例如已注册有指示该待同步的张量数据的描述符),则状态查询指令可仅包括部分数据特征,例如张量数据的标识,以指示第二处理器根据该张量数据的标识确定待同步的张量数据的描述符;如果第二处理器中不具有该张量数据的信息,则同步指令可包括更多的数据特征,例如张量数据的标识及存储地址等,以指示第二处理器确定待同步的张量数据的描述符。本公开对状态查询指令包括的具体内容不作限制。
在一种可能的实现方式中,如果状态查询指令包括张量数据的标识,则第二处理器可根据该标识确定待同步的张量数据,并注册或获取指示该待同步的张量数据的描述符。如果状态查询指令包括更多的数据特征(标识及存储地址等),则第二处理器可根据指令中的数据特征注册指示该张量数据的描述符。
在一种可能的实现方式中,在确定待同步的张量数据的描述符后,第二处理器可确定能够分配给描述符对应的张量数据的空间,确定针对该张量数据的可同步数据量。根据可同步数据量及数据特征,第二处理器可生成并发送同步状态指令,以使得第一处理器能够确定待同步的张量数据以及本次同步的可同步数据量。
根据本公开实施例的数据同步方法,通过设定指示张量数据的形状的描述符,发送方能够根据描述符确定张量数据的数据特征,根据数据特征生成并发送状态查询指令,以指示接收方根据状态查询指令反馈自身状态(也即可同步的数据量),从而实现张量数据的部分同步,在不改变指令结构的情况下减少同步开销,提高数据同步的效率。
在一种可能的实现方式中,所述方法还包括:
在接收到来自所述第二处理器的同步状态指令时,解析所述同步状态指令,确定待同步的张量数据的数据特征及可同步数据量,
根据所述数据特征,确定待同步的张量数据的描述符;
根据所述描述符及所述可同步数据量,确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应;
根据所述第一子数据,生成同步指令并向所述第二处理器发送所述同步指令,以指示所述第二处理器获取所述第一子数据。
举例来说,第一处理器在接收到来自第二处理器的同步状态指令时,可对该指令进行解析以得到该指令的内容,也即待同步的张量数据的数据特征及可同步数据量。根据该数据特征,可确定待同步的张量数据的描述符,从而确定待同步的张量数据;并根据可同步数据量从该张量数据中确定出本次可同步的部分数据,也即第一子数据。该第一子数据的数据量可与可同步数据量相对应,例如第一子数据的数据量小于或等于所述可同步数据量。
在一种可能的实现方式中,如果该张量数据的全部数据均未同步,则可从该张量数据中选择可同步数据量的数据作为第一子数据;如果该张量数据的部分数据未同步,且未同步的部分数据的数据量大于可同步数据量,则可从未同步的部分数据(也即该张量数据的第二子数据)中选择可同步数据量的数据作为第一子数据;如果未同步的部分数据的数据量小于或等于可同步数据量,则可将未同步的部分数据直接作为第一子数据,应当理解,本领域技术人员可根据实际情况确定第一子数据,本公开对此不作限制。
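上述"根据可同步数据量从未同步数据中确定第一子数据"的选取规则，可用如下 Python 片段示意（函数名为本示例假设，仅为一种可能实现的草图）：

```python
def pick_first_sub_amount(unsynced_amount: int, capacity: int) -> int:
    """确定本次同步的第一子数据的数据量。

    若未同步数据量大于可同步数据量，则只取可同步数据量的数据；
    否则将未同步的部分数据全部作为第一子数据。
    """
    return min(unsynced_amount, capacity)

# 未同步 1024 个元素、接收方本次可容纳 256 个元素时，本次同步 256 个
amount = pick_first_sub_amount(1024, 256)
```

当未同步数据量小于可同步数据量时（例如 `pick_first_sub_amount(100, 256)`），则一次即可同步剩余全部数据。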
在一种可能的实现方式中，同步状态指令中也可包括待同步的张量数据的部分数据的范围，例如该部分子数据的存储地址范围等，以便指定获取待同步的部分数据。第一处理器可根据该部分数据的范围直接确定待同步的第一子数据。
在一种可能的实现方式中,第一处理器可根据第一子数据生成同步指令并向第二处理器发送同步指令。该指令中可包括待同步的张量数据的数据特征及第一子数据。第二处理器在接收到同步指令后,可解析该指令以确定待同步的张量数据的数据特征及张量数据的第一子数据,根据数据特征确定描述符,根据描述符确定待同步的张量数据,并将张量数据的第一子数据存储到自身的非共用存储空间中。
通过这种方式,能够根据来自发送方的同步状态指令中确定张量数据的描述符及可同步数据量,根据可同步数据量确定本次同步的子数据,根据该子数据生成并发送同步指令,以使接收方获取本次同步的子数据,从而减少同步开销,提高数据同步的效率。
在一种可能的实现方式中,根据所述描述符及所述可同步数据量,确定所述张量数据的第一子数据的步骤,可包括:
根据所述描述符,确定待同步的张量数据以及所述张量数据中处于待同步状态的第二子数据;
根据所述第二子数据及所述可同步数据量,确定第一子数据。
举例来说,可设定张量数据中数据的状态,将已同步的部分数据设定为已同步,并将未同步的部分数据设定为待同步。在该情况下,当第一处理器接收到来自第二处理器的同步状态指令时,可根据张量数据中数据的状态,确定出处于待同步状态的第二子数据;根据第二子数据以及同步状态指令所指示的可同步数据量,可确定本次同步的第一子数据。
在一种可能的实现方式中,如果第二子数据的数据量大于可同步数据量,则可从第二子数据中选择出本次同步的第一子数据;如果第二子数据的数据量小于或等于可同步数据量,则可将第二子数据直接作为第一子数据。
通过这种方式,可确定出本次同步的部分数据,以便实现张量数据的部分同步,提高数据同步的效率。
在一种可能的实现方式中,所述方法还包括:将所述张量数据的第一子数据的状态由待同步状态变更为已同步状态。
举例来说,第一处理器在根据张量数据的第一子数据生成并发送同步指令,使得第二处理器实现张量数据的第一子数据的同步后,第一处理器可对张量数据中数据的状态进行变更,也即,将第一子数据的状态由待同步状态变更为已同步状态。这样,在下一次查询第二处理器的状态并接收到第二处理器的同步状态指令时,可以从处于待同步状态的部分数据中确定下一次同步的数据,从而避免数据的重复同步,提高数据同步的效率。
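将已同步部分标记为"已同步"、通过多次部分同步完成整个张量的过程，可以用如下草图模拟（每轮容量与总量均为假设值）：

```python
def partial_sync_rounds(total: int, capacity_per_round: int) -> int:
    """模拟多次部分同步：每轮从待同步数据中取出第一子数据，
    同步后将其状态置为已同步，返回同步完整个张量所需的轮数。"""
    synced = 0   # 已同步状态的数据量
    rounds = 0
    while synced < total:
        # 本轮的第一子数据量：未同步数据量与可同步数据量取小
        first_sub = min(total - synced, capacity_per_round)
        synced += first_sub   # 状态由待同步变更为已同步
        rounds += 1
    return rounds

# 张量共 1000 个元素、接收方每轮可容纳 300 个元素时，需 4 轮
rounds = partial_sync_rounds(total=1000, capacity_per_round=300)
```

由于每轮只从处于待同步状态的数据中选取，已同步的数据不会被重复传输。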
图3示出根据本公开一实施例的数据同步方法的流程图。如图3所示,该方法应用于第二处理器,该方法包括:
在步骤S31中,在接收到来自第一处理器的状态查询指令时,解析所述状态查询指令,得到待同步的张量数据的数据特征;
在步骤S32中,根据所述数据特征,确定待同步的张量数据的描述符,所述描述符用于指示待同步的张量数据的形状;
在步骤S33中,根据所述张量数据的描述符,确定针对所述张量数据的可同步数据量;
在步骤S34中，根据所述张量数据的数据特征及所述可同步数据量，生成同步状态指令并向所述第一处理器发送所述同步状态指令，所述同步状态指令用于指示所述第一处理器确定所述张量数据的第一子数据，所述第一子数据的数据量与所述可同步数据量相对应。
举例来说，在数据同步的发送方有待同步的张量数据时，发送方可查询接收方的状态。第一处理器(发送方)可生成并发送状态查询指令，第二处理器在步骤S31中接收到该状态查询指令时，可解析该指令，以确定待同步的张量数据的数据特征。该数据特征可包括张量数据的标识(例如数据编号)、形状、来源、存储地址等信息中的至少一种。
在一种可能的实现方式中,在步骤S32中,第二处理器可根据数据特征,确定待同步的张量数据的描述符。该描述符可以是已注册(创建)的用于指示该张量数据的形状的描述符,也可以根据该张量数据的形状参数注册(创建)新的描述符,本公开对此不作限制。
在一种可能的实现方式中,在步骤S33中,第二处理器可根据该描述符确定待同步的张量数据,并确定自身的非共用存储空间能够分配给该张量数据的空间所能容纳的数据量,即可同步数据量,以便进行张量数据的部分同步。
在一种可能的实现方式中,在步骤S34中,第二处理器可根据确定出的可同步数据量及该张量数据的数据特征,生成并向第一处理器发送同步状态指令,以指示第一处理器确定本次同步的可同步数据量。第一处理器在确定本次可同步的部分数据(也即第一子数据)后,可生成同步指令并向第二处理器发送同步指令。该指令中可包括待同步的张量数据的数据特征及第一子数据。
通过这种方式,可由发送方查询接收方的状态,接收方在接收到状态查询指令后确定并回复自身的状态(即可同步数据量),通过交互实现张量数据的部分同步,提高数据同步的效率。
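发送方查询、接收方反馈可同步数据量、发送方下发同步指令的三步交互，可用如下简化模拟表示。消息用字典表示，字段命名为本示例的假设，并非本公开限定的指令格式：

```python
def run_sync(sender_data: list, receiver_capacity: int) -> list:
    """简化模拟状态查询指令、同步状态指令与同步指令的交互，
    直至接收方获得完整的张量数据。"""
    received = []                                    # 接收方已存储的数据
    while len(received) < len(sender_data):
        query = {"op": "STATE_QUERY", "tensor_id": 0}        # 状态查询指令
        status = {"op": "SYNC_STATE",                        # 同步状态指令
                  "tensor_id": query["tensor_id"],
                  "capacity": receiver_capacity}             # 可同步数据量
        start = len(received)                                # 待同步状态的起点
        first_sub = sender_data[start:start + status["capacity"]]
        sync = {"op": "SYNC", "tensor_id": 0, "data": first_sub}  # 同步指令
        received.extend(sync["data"])                        # 接收方存储第一子数据
    return received

out = run_sync(list(range(10)), receiver_capacity=4)
```

三轮交互（4＋4＋2 个元素）即可完成 10 个元素的张量的整体同步。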
在一种可能的实现方式中,所述方法还包括:
在接收到来自所述第一处理器的同步指令时,解析所述同步指令,得到待同步的张量数据的数据特征及所述张量数据的第一子数据;
根据所述数据特征,确定所述张量数据的描述符;
根据所述张量数据的描述符,存储所述张量数据的第一子数据。
举例来说,第二处理器在接收到同步指令时,可解析该指令以确定待同步的张量数据的数据特征及本次同步的该张量数据的第一子数据;根据该数据特征查找到待同步的张量数据的描述符;进而根据描述符确定待同步的张量数据,并将张量数据的第一子数据存储到自身的非共用存储空间中。
通过这种方式,接收方能够根据同步指令确定描述符并获取本次同步的子数据,从而减少同步开销,提高数据同步的效率。
在一种可能的实现方式中,也可由数据同步的接收方发起对张量数据的部分同步请求,也即接收方发出描述符同步请求指令,该指令中可指示待同步的张量数据的描述符以及针对该张量数据的可同步数据量,也即接收方的非共用存储空间能够分配给该张量数据的空间所能容纳的数据量。
在一种可能的实现方式中,还提供了一种数据同步方法,应用于第一处理器,该方法包括:
在接收到来自所述第二处理器的同步请求指令时,解析所述同步请求指令,得到待同步的张量数据的数据特征以及针对所述张量数据的可同步数据量;
根据所述数据特征,确定待同步的张量数据的描述符,所述描述符用于指示待同步的张量数据的形状;
根据所述张量数据的描述符及所述可同步数据量,确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应;
根据所述第一子数据,生成同步指令并向所述第二处理器发送所述同步指令,以指示所述第二处理器获取所述第一子数据。
举例来说,可由数据同步的接收方发起对张量数据的部分同步请求,也即接收方发出同步请求指令,该指令中可指示待同步的张量数据的数据特征以及针对该张量数据的可同步数据量,也即接收方的非共用存储空间能够分配给该张量数据的空间所能容纳的数据量。
在一种可能的实现方式中，可以设定多个处理器中的第一处理器是数据同步的发送方，第二处理器是数据同步的接收方。第一处理器和第二处理器均为多个处理器中的任意处理器，第二处理器可与第一处理器的类型相同或不同，本公开对第一处理器和第二处理器的类型不作限制。
在一种可能的实现方式中,第一处理器在接收到来自第二处理器的同步请求指令时,可对该指令进行解析以得到该指令的内容,也即待同步的张量数据的数据特征和可同步数据量。该数据特征可包括张量数据的标识(例如数据编号)、形状、来源、存储地址等信息中的至少一种。
在一种可能的实现方式中,待同步的张量数据的数据特征可包括张量数据的标识、形状、来源、地址等信息。例如,该张量数据的数据来源为第K个发送方(第K个处理器)、该张量数据的数据来源为编号200的卷积操作的运算结果、该张量数据的地址为特定的地址区域(例如地址ADDR0-ADDR127)、该张量数据的形状为指定的形状(例如20*10的二维张量)等。本领域技术人员可根据实际情况设定待同步的张量数据的数据特征,本公开对此不作限制。
在一种可能的实现方式中,第一处理器可根据数据特征确定出待同步的张量数据的描述符;进而根据描述符确定待同步的张量数据,并根据可同步数据量从该张量数据中确定出本次可同步的部分数据,也即第一子数据。该第一子数据的数据量可与所述可同步数据量相对应,例如第一子数据的数据量小于或等于所述可同步数据量。
在一种可能的实现方式中,如果该张量数据的全部数据均未同步,则可从该张量数据中选择可同步数据量的数据作为第一子数据;如果该张量数据的部分数据未同步,且未同步的部分数据的数据量大于可同步数据量,则可从未同步的部分数据(也即该张量数据的第二子数据)中选择可同步数据量的数据作为第一子数据;如果未同步的部分数据的数据量小于或等于可同步数据量,则可将未同步的部分数据直接作为第一子数据,应当理解,本领域技术人员可根据实际情况确定第一子数据,本公开对此不作限制。
在一种可能的实现方式中，同步请求指令中也可包括待同步的张量数据的部分数据的范围，例如该部分子数据的存储地址范围等，以便指定获取待同步的部分数据。第一处理器可根据该部分数据的范围直接确定待同步的第一子数据。
在一种可能的实现方式中,第一处理器可根据第一子数据生成同步指令并向第二处理器发送同步指令。该指令中可包括待同步的张量数据的数据特征及第一子数据。第二处理器在接收到同步指令后,可解析该指令以确定该数据特征及第一子数据,从而根据该数据特征确定描述符,根据描述符确定待同步的张量数据,并将张量数据的第一子数据存储到自身的非共用存储空间中。
通过这种方式,接收方能够发出同步请求指令以主动请求同步部分数据,发送方可根据接收方的可同步数据量确定本次同步的子数据,根据该子数据生成并发送同步指令,以使接收方获取本次同步的子数据,从而在不改变指令结构的情况下减少同步开销,提高数据同步的效率。
在一种可能的实现方式中,根据所述张量数据的描述符及所述可同步数据量,确定所述张量数据的第一子数据,可包括:
根据所述张量数据的描述符,确定所述张量数据以及所述张量数据中处于待同步状态的第二子数据;
根据所述第二子数据及所述可同步数据量,确定第一子数据。
举例来说,可设定张量数据中数据的状态,将已同步的部分数据设定为已同步,并将未同步的部分数据设定为待同步。在该情况下,当第一处理器接收到来自第二处理器的同步请求指令时,可根据张量数据中数据的状态,确定出处于待同步状态的第二子数据;根据第二子数据以及同步请求指令所指示的可同步数据量,可确定本次同步的第一子数据。
在一种可能的实现方式中,如果第二子数据的数据量大于可同步数据量,则可从第二子数据中选择出本次同步的第一子数据;如果第二子数据的数据量小于或等于可同步数据量,则可将第二子数据直接作为第一子数据。
通过这种方式,可确定出本次同步的部分数据,以便实现张量数据的部分同步,提高数据同步的效率。
在一种可能的实现方式中，所述方法还包括：将所述张量数据的第一子数据的状态由待同步状态变更为已同步状态。
举例来说，第一处理器在根据张量数据的第一子数据生成并发送同步指令，使得第二处理器实现张量数据的第一子数据的同步后，第一处理器可对张量数据中数据的状态进行变更，也即，将第一子数据的状态由待同步状态变更为已同步状态。这样，在下一次接收到第二处理器的同步请求指令时，可以从处于待同步状态的部分数据中确定下一次同步的数据，从而避免数据的重复同步，提高数据同步的效率。
在一种可能的实现方式中,还提供了一种数据同步方法,应用于第二处理器,该方法包括:
根据待同步的张量数据的描述符,确定所述张量数据的数据特征以及针对所述张量数据的可同步数据量,所述描述符用于指示待同步的张量数据的形状;
根据所述数据特征以及所述可同步数据量,生成同步请求指令并向第一处理器发送所述同步请求指令,所述同步请求指令用于指示第一处理器根据所述同步请求指令确定待同步的张量数据以及所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应。
举例来说,数据同步的接收方(即第二处理器)可发起对张量数据的部分同步请求。在第二处理器中存在待同步的张量数据时,可确定张量数据的描述符。该描述符可以是已注册(创建)的用于指示该张量数据的形状的描述符,也可以根据该张量数据的形状参数注册(创建)新的描述符,本公开对此不作限制。
在一种可能的实现方式中,第二处理器可根据该描述符确定张量数据的数据特征,例如张量数据的标识(例如数据编号)、形状、来源、存储地址等信息中的至少一种。并且,第二处理器可以确定自身的非共用存储空间能够分配给该张量数据的空间所能容纳的数据量,也即可同步数据量。
在一种可能的实现方式中，根据该张量数据的数据特征及可同步数据量，第二处理器可生成同步请求指令并发送该指令。该同步请求指令可用于指示第一处理器根据该指令确定待同步的张量数据以及所述张量数据的第一子数据。
在一种可能的实现方式中，数据同步的发送方(即第一处理器)在接收到同步请求指令时，可对该指令进行解析，确定出待同步的张量数据的数据特征以及可同步数据量；根据数据特征，确定待同步的张量数据的描述符；根据描述符确定出待同步的张量数据，并根据可同步数据量从该张量数据中确定出本次可同步的部分数据，也即第一子数据。该第一子数据的数据量可与所述可同步数据量相对应，例如第一子数据的数据量小于或等于所述可同步数据量。
在一种可能的实现方式中,如果该张量数据的全部数据均未同步,则可从该张量数据中选择可同步数据量的数据作为第一子数据;如果该张量数据的部分数据未同步,且未同步的部分数据的数据量大于可同步数据量,则可从未同步的部分数据(也即该张量数据的第二子数据)中选择可同步数据量的数据作为第一子数据;如果未同步的部分数据的数据量小于或等于可同步数据量,则可将未同步的部分数据直接作为第一子数据,应当理解,本领域技术人员可根据实际情况确定第一子数据,本公开对此不作限制。
在一种可能的实现方式中,同步请求指令中也可包括待同步的张量数据的部分数据的范围,例如该部分子数据的描述符内容或存储地址范围等,以便指定获取待同步的部分数据。
通过这种方式,能够由接收方发起张量数据的部分同步请求,使得发送方确定本次同步的子数据,从而提高数据同步的效率。
在一种可能的实现方式中,所述方法还包括:
在接收到来自所述第一处理器的同步指令时,解析所述同步指令,得到待同步的张量数据的数据特征及所述张量数据的第一子数据;
根据所述数据特征,确定所述张量数据的描述符;
根据所述张量数据的描述符,存储所述张量数据的第一子数据。
举例来说，第一处理器可根据张量数据的数据特征及第一子数据生成并发送同步指令。第二处理器在接收到该同步指令时，可解析该指令以确定待同步的张量数据的数据特征及本次同步的该张量数据的第一子数据；根据数据特征确定描述符，进而根据描述符确定待同步的张量数据，并将张量数据的第一子数据存储到自身的非共用存储空间中。
通过这种方式,接收方能够根据同步指令确定描述符并获取本次同步的子数据,从而减少同步开销,提高数据同步的效率,并且实现了指令传递及处理过程中的指令兼容。
在一种可能的实现方式中,描述符的标识和内容可存储在描述符存储空间中,该描述符存储空间可以为处理器的内部存储器(例如寄存器、片上的SRAM或其他介质缓存等)中的存储空间。描述符所指示的张量数据的数据存储空间可为处理器的内部存储器(例如片上缓存)或与处理器连接的外部存储器(片下存储器)中的存储空间。数据存储空间中的数据地址可以为实际的物理地址或虚拟地址。本公开对描述符存储空间及数据存储空间的位置以及数据地址的类型不作限制。
在一种可能的实现方式中，描述符的标识、内容以及描述符所指示的张量数据可以位于同一块区域，例如，可使用片上缓存的一块连续区域存储描述符的相关内容，其地址为ADDR0-ADDR1023，其中，地址ADDR0-ADDR31可用于存储描述符的标识，地址ADDR32-ADDR63可用于存储描述符的内容，地址ADDR64-ADDR1023可用于存储描述符指示的张量数据。其中，地址ADDR并不限于1位或一个字节，此处用来表示一个地址，是一个地址单位。本领域技术人员可以根据实际情况确定存储区域及其地址，本公开对此不作限制。
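以上片上缓存连续区域的划分（ADDR0-ADDR31 存标识、ADDR32-ADDR63 存内容、ADDR64-ADDR1023 存张量数据），可用如下地址常量示意（地址以示例中的"地址单位"计，函数名为本示例假设）：

```python
# 描述符存储区域划分示意（左闭右开，单位为示例中的地址单位）
ID_REGION      = range(0, 32)      # ADDR0-ADDR31：描述符的标识
CONTENT_REGION = range(32, 64)     # ADDR32-ADDR63：描述符的内容
DATA_REGION    = range(64, 1024)   # ADDR64-ADDR1023：描述符指示的张量数据

def region_of(addr: int) -> str:
    """返回某地址单位所属的区域名称。"""
    if addr in ID_REGION:
        return "id"
    if addr in CONTENT_REGION:
        return "content"
    if addr in DATA_REGION:
        return "data"
    raise ValueError("地址超出 ADDR0-ADDR1023 范围")
```

三个区域首尾相接、互不重叠，因而根据任一地址即可唯一确定其所属区域。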
在一种可能的实现方式中,描述符的标识、内容以及描述符所指示的张量数据可以分开存储在内部存储器的不同区域,例如,可以将寄存器作为描述符存储空间,在寄存器中存储描述符的标识及内容,将片上缓存作为数据存储空间,存储描述符所指示的张量数据。
在一种可能的实现方式中,还可以设置专门供描述符使用的专用寄存器(SR),描述符中的数据可以是立即数也可以从专用寄存器中获取。在使用寄存器存储描述符的标识和内容时,可以使用寄存器的编号来表示描述符的标识,例如,寄存器的编号为0时,其存储的描述符的标识为0。当寄存器中的描述符有效时,可根据描述符所指示的张量数据的大小在缓存空间中分配一块区域(例如在缓存中为每个张量数据创建一个张量缓存单元)用于存储该张量数据。应当理解,也可以采用预设的缓存空间存储该张量数据,本公开对此不作限制。
在一种可能的实现方式中,描述符的标识及内容可存储在内部存储器,描述符所指示的张量数据可存储在外部存储器。例如,可以采用在片上存储描述符的标识及内容、在片下存储描述符所指示的张量数据的方式。
在一种可能的实现方式中,与描述符对应的数据存储空间的数据地址可以是固定地址。例如,可以为张量数据划分单独的数据存储空间,每个张量数据在数据存储空间的起始地址与描述符的标识一一对应。在这种情况下,处理器根据描述符的内容即可确定张量数据的数据地址。
在一种可能的实现方式中,在与描述符对应的数据存储空间的数据地址为可变地址时,所述描述符还可用于指示N维的张量数据的地址,其中,所述描述符的内容还可包括表示张量数据的地址的至少一个地址参数。例如,张量数据为3维数据,在描述符指向该张量数据的地址时,描述符的内容可包括表示该张量数据的地址的一个地址参数,例如张量数据的起始地址,也可以包括该张量数据的地址的多个地址参数,例如张量数据的起始地址+地址偏移量,或张量数据基于各维度的地址参数。本领域技术人员可以根据实际需要对地址参数进行设置,本公开对此不作限制。
在一种可能的实现方式中,所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。其中,基准地址可根据数据基准点的变化而不同。本公开对数据基准点的选取不作限制。
在一种可能的实现方式中,所述基准地址可包括所述数据存储空间的起始地址。在描述符的数据基准点是数据存储空间的第一个数据块时,描述符的基准地址即为数据存储空间的起始地址。在描述符的数据基准点是数据存储空间中第一个数据块以外的其他数据时,描述符的基准地址即为该数据块在数据存储空间中的物理地址。
在一种可能的实现方式中，所述张量数据的形状参数包括以下至少一种：所述张量数据的数据存储空间在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系。其中，数据描述位置是描述符所指示的张量数据中的点或区域的映射位置，例如，张量数据为3维数据时，描述符可使用三维空间坐标(x,y,z)来表示该张量数据的形状，该张量数据的数据描述位置可以是使用三维空间坐标(x,y,z)表示的、该张量数据映射在三维空间中的点或区域的位置。
应当理解,本领域技术人员可以根据实际情况选择表示张量数据的形状参数,本公开对此不作限制。
图4示出根据本公开一实施例的数据同步方法的数据存储空间的示意图。如图4所示,数据存储空间21采用行优先的方式存储了一个二维数据,可通过(x,y)来表示(其中,X轴水平向右,Y轴垂直向下),X轴方向上的尺寸(每行的尺寸)为ori_x(图中未示出),Y轴方向上的尺寸(总行数)为ori_y(图中未示出),数据存储空间21的起始地址PA_start(基准地址)为第一个数据块22的物理地址。数据块23是数据存储空间21中的部分数据,其在X轴方向上的偏移量25表示为offset_x,在Y轴方向上的偏移量24表示为offset_y,在X轴方向上的尺寸表示为size_x,在Y轴方向上的尺寸表示为size_y。
在一种可能的实现方式中,使用描述符来定义数据块23时,描述符的数据基准点可使用数据存储空间21的第一个数据块,描述符的基准地址为数据存储空间21的起始地址PA_start,然后可以结合数据存储空间21在X轴的尺寸ori_x、在Y轴上的尺寸ori_y,以及数据块23在Y轴方向的偏移量offset_y、X轴方向上的偏移量offset_x、X轴方向上的尺寸size_x以及Y轴方向上的尺寸size_y来确定数据块23的描述符的内容。
在一种可能的实现方式中,可以使用下述公式(1)来表示描述符的内容:
{ ori_x, ori_y, PA_start, offset_x, offset_y, size_x, size_y }  (1)
应当理解,虽然上述示例中,描述符描述的是二维空间,但本领域技术人员可以根据实际情况对描述符的内容表示的维度进行设置,本公开对此不作限制。
在一种可能的实现方式中,可根据所述描述符的数据基准点在所述数据存储空间中的基准地址、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置,确定所述张量数据的描述符的内容。
举例来说，可以使用描述符的数据基准点在数据存储空间中的基准地址PA_base，以及对角位置的两个顶点相对于数据基准点的位置，确定出图4中数据块23的描述符的内容。首先，确定描述符的数据基准点以及其在数据存储空间中的基准地址PA_base，例如，可以在数据存储空间21中选取一个数据(例如，位置为(2,2)的数据)作为数据基准点，将该数据在数据存储空间中的物理地址作为基准地址PA_base；然后，确定数据块23的对角位置的至少两个顶点相对于数据基准点的位置，例如，使用左上至右下方向的对角位置顶点相对于数据基准点的位置，其中，左上角顶点的相对位置为(x_min,y_min)，右下角顶点的相对位置为(x_max,y_max)，然后可以根据基准地址PA_base、左上角顶点的相对位置(x_min,y_min)以及右下角顶点的相对位置(x_max,y_max)确定出数据块23的描述符的内容。
在一种可能的实现方式中,可以使用下述公式(2)来表示描述符的内容:
{ PA_base, (x_min, y_min), (x_max, y_max) }  (2)
应当理解,虽然上述示例中使用左上角和右下角两个顶点来确定描述符的内容,但本领域技术人员可以根据实际需要对至少两个顶点的具体顶点进行设置,本公开对此不作限制。
在一种可能的实现方式中，可根据所述描述符的数据基准点在所述数据存储空间中的基准地址，以及所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系，确定所述张量数据的描述符的内容。其中，数据描述位置与数据地址之间的映射关系可以根据实际需要进行设定，例如，描述符所指示的张量数据为三维空间数据时，可使用函数f(x,y,z)来定义数据描述位置与数据地址之间的映射关系。
在一种可能的实现方式中,可以使用下述公式(3)来表示描述符的内容:
{ PA_base, f(x, y, z) }  (3)
应当理解,本领域技术人员可以根据实际情况对数据描述位置与数据地址之间的映射关系进行设定,本公开对此不作限制。
在采用公式(1)表示描述符的内容的情况下，对于张量数据中的任意一个数据点，设其数据描述位置为(x_q, y_q)，那么，该数据点在数据存储空间中的数据地址PA2_(x,y)可以使用下述公式(4)来确定：
PA2_(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)  (4)
通过这种方式,处理器可以根据描述符的内容计算出描述符所指示的张量数据在数据存储空间中的数据地址,进而根据该地址执行对应的处理(例如数据运算、数据同步等),从而可降低数据存取的复杂度,提高处理器的处理效率。
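公式(4)的数据地址计算可直接写成如下函数（行优先存储，参数含义与图4一致，具体数值仅为示例）：

```python
def pa2(pa_start: int, ori_x: int,
        offset_x: int, offset_y: int,
        x_q: int, y_q: int) -> int:
    """按公式(4)计算数据描述位置 (x_q, y_q) 对应的数据地址：
    PA2 = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)
    """
    return pa_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)

# 例：起始地址 100，每行 8 个地址单位，偏移量 (2, 3)，数据点 (1, 1)
addr = pa2(pa_start=100, ori_x=8, offset_x=2, offset_y=3, x_q=1, y_q=1)
# addr = 100 + (3 + 1 - 1) * 8 + (2 + 1) = 127
```

可见只要描述符的内容（PA_start、ori_x、offset_x、offset_y 等）确定，任意数据点的地址即可直接算出，无需逐点记录地址。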
根据本公开实施例的数据同步方法,能够在数据同步的接收方空间不足时实现张量数据的部分同步,通过多次的部分同步来实现整个张量数据的同步,从而避免了在空间不足的情况下张量数据整体同步失败或同步延迟等问题,提高了数据同步的效率;并且设定有指示张量数据的形状的描述符,在数据同步过程中根据描述符来确定张量数据,从而减少了同步开销,降低了数据存取的复杂度,并且实现了指令传递及处理过程中的指令兼容。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本公开并不受所描述的动作顺序的限制,因为依据本公开,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本公开所必须的。
进一步需要说明的是,虽然流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
图5示出根据本公开实施例的数据同步装置的框图。该数据同步装置应用于第一处理器,如图5所示,该数据同步装置包括:
特征确定模块51,用于根据待同步的张量数据的描述符,确定所述张量数据的数据特征,所述描述符用于指示待同步的张量数据的形状;
查询指令生成及发送模块52,用于根据所述张量数据的数据特征,生成状态查询指令并向第二处理器发送所述状态查询指令,所述状态查询指令用于指示第二处理器确定针对所述张量数据的可同步数据量并生成同步状态指令。
在一种可能的实现方式中,所述装置还包括:
状态指令解析模块,用于在接收到来自所述第二处理器的同步状态指令时,解析所述同步状态指令,确定待同步的张量数据的数据特征及可同步数据量,
第一描述符确定模块,用于根据所述数据特征,确定待同步的张量数据的描述符;
数据确定模块,用于根据所述描述符及所述可同步数据量,确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应;
同步指令生成及发送模块,用于根据所述第一子数据,生成同步指令并向所述第二处理器发送所述同步指令,以指示所述第二处理器获取所述第一子数据。
在一种可能的实现方式中,所述数据确定模块包括:
第一确定子模块,用于根据所述描述符,确定待同步的张量数据以及所述张量数据中处于待同步状态的第二子数据;
第二确定子模块,用于根据所述第二子数据及所述可同步数据量,确定第一子数据。
在一种可能的实现方式中,所述装置还包括:
状态变更模块,用于将所述张量数据的第一子数据的状态由待同步状态变更为已同步状态。
图6示出根据本公开实施例的数据同步装置的框图。该数据同步装置应用于第二处理器,如图6所示,该数据同步装置包括:
查询指令解析模块61,用于在接收到来自第一处理器的状态查询指令时,解析所述状态查询指令,得到待同步的张量数据的数据特征;
第二描述符确定模块62,用于根据所述数据特征,确定待同步的张量数据的描述符,所述描述符用于指示待同步的张量数据的形状;
数据量确定模块63,用于根据所述张量数据的描述符,确定针对所述张量数据的可同步数据量;
状态指令生成及发送模块64,用于根据所述张量数据的数据特征及所述可同步数据量,生成同步状态指令并向所述第一处理器发送所述同步状态指令,所述同步状态指令用于指示所述第一处理器确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应。
在一种可能的实现方式中,所述装置还包括:
同步指令解析模块,用于在接收到来自所述第一处理器的同步指令时,解析所述同步指令,得到待同步的张量数据的数据特征及所述张量数据的第一子数据;
第三描述符确定模块,用于根据所述数据特征,确定所述张量数据的描述符;
数据存储模块,用于根据所述张量数据的描述符,存储所述张量数据的第一子数据。
应该理解,上述的装置实施例仅是示意性的,本公开的装置还可通过其它的方式实现。例如,上述实施例中所述单元/模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。例如,多个单元、模块或组件可以结合,或者可以集成到另一个系统,或一些特征可以忽略或不执行。
另外,若无特别说明,在本公开各个实施例中的各功能单元/模块可以集成在一个单元/模块中,也可以是各个单元/模块单独物理存在,也可以两个或两个以上单元/模块集成在一起。上述集成的单元/模块既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元/模块如果以硬件的形式实现时,该硬件可以是数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于晶体管,忆阻器等等。若无特别说明,所述人工智能处理器可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。若无特别说明,所述存储单元可以是任何适当的磁存储介质或者磁光存储介质,比如,阻变式存储器RRAM(Resistive Random Access Memory)、动态随机存取存储器DRAM(Dynamic Random Access Memory)、静态随机存取存储器SRAM(Static Random-Access Memory)、增强动态随机存取存储器EDRAM(Enhanced Dynamic Random Access Memory)、高带宽内存HBM(High-Bandwidth Memory)、混合存储立方HMC(Hybrid Memory Cube)等等。
所述集成的单元/模块如果以软件程序模块的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储器中。基于这样的理解，本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储器中，包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。而前述的存储器包括：U盘、只读存储器(ROM，Read-Only Memory)、随机存取存储器(RAM，Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在一种可能的实现方式中,还公开了一种人工智能芯片,其包括了上述数据同步装置。
在一种可能的实现方式中,还公开了一种板卡,其包括存储器件、接口装置和控制器件以及上述人工智能芯片;其中,所述人工智能芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;所述存储器件,用于存储数据;所述接口装置,用于实现所述人工智能芯片与外部设备之间的数据传输;所述控制器件,用于对所述人工智能芯片的状态进行监控。
图7示出根据本公开实施例的板卡的结构框图,参阅图7,上述板卡除了包括上述芯片389以外,还可以包括其他的配套部件,该配套部件包括但不限于:存储器件390、接口装置391和控制器件392;
所述存储器件390与所述人工智能芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元393。每一组所述存储单元与所述人工智能芯片通过总线连接。可以理解,每一组所述存储单元可以是DDR SDRAM(英文:Double Data Rate SDRAM,双倍速率同步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中,所述存储装置可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒(芯片)。在一个实施例中,所述人工智能芯片内部可以包括4个72位DDR4控制器,上述72位DDR4控制器中64bit用于传输数据,8bit用于ECC校验。可以理解,当每一组所述存储单元中采用DDR4-3200颗粒时,数据传输的理论带宽可达到25600MB/s。
在一个实施例中,每一组所述存储单元包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于对每个所述存储单元的数据传输与数据存储的控制。
所述接口装置与所述人工智能芯片电连接。所述接口装置用于实现所述人工智能芯片与外部设备(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以为标准PCIE接口。比如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。优选的,当采用PCIE 3.0X 16接口传输时,理论带宽可达到16000MB/s。在另一个实施例中,所述接口装置还可以是其他的接口,本公开并不限制上述其他的接口的具体表现形式,所述接口单元能够实现转接功能即可。另外,所述人工智能芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
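上述 25600MB/s 与 16000MB/s 两个理论带宽可按如下方式粗略验证（PCIe 的 128b/130b 编码开销按近似处理，仅为示意计算）：

```python
# DDR4-3200：3200 MT/s × 64 bit 数据位宽 ÷ 8 bit/Byte = 25600 MB/s
ddr4_bw = 3200 * 64 // 8

# PCIe 3.0 x16：8 GT/s × 16 lane × (128/130 编码效率) ÷ 8 ≈ 15754 MB/s，
# 约合文中所述 16000 MB/s 的理论带宽
pcie_bw = 8000 * 16 * 128 / 130 / 8
```

两个数值分别与文中存储单元和接口装置的理论带宽相符。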
所述控制器件与所述人工智能芯片电连接。所述控制器件用于对所述人工智能芯片的状态进行监控。具体的，所述人工智能芯片与所述控制器件可以通过SPI接口电连接。所述控制器件可以包括单片机(Micro Controller Unit，MCU)。如所述人工智能芯片可以包括多个处理芯片、多个处理核或多个处理电路，可以带动多个负载。因此，所述人工智能芯片可以处于多负载和轻负载等不同的工作状态。通过所述控制器件可以实现对所述人工智能芯片中多个处理芯片、多个处理核和/或多个处理电路的工作状态的调控。
在一种可能的实现方式中,公开了一种电子设备,其包括了上述人工智能芯片。电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。上述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
依据以下条款可更好地理解前述内容:
条款A1、一种数据同步方法,所述方法应用于第一处理器,包括:
根据待同步的张量数据的描述符，确定所述张量数据的数据特征，所述描述符用于指示待同步的张量数据的形状；
根据所述张量数据的数据特征,生成状态查询指令并向第二处理器发送所述状态查询指令,所述状态查询指令用于指示第二处理器确定针对所述张量数据的可同步数据量并生成同步状态指令。
条款A2、根据条款A1所述的方法,所述方法还包括:
在接收到来自所述第二处理器的同步状态指令时,解析所述同步状态指令,确定待同步的张量数据的数据特征及可同步数据量,
根据所述数据特征,确定待同步的张量数据的描述符;
根据所述描述符及所述可同步数据量,确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应;
根据所述第一子数据,生成同步指令并向所述第二处理器发送所述同步指令,以指示所述第二处理器获取所述第一子数据。
条款A3、根据条款A2所述的方法,根据所述描述符及所述可同步数据量,确定所述张量数据的第一子数据,包括:
根据所述描述符,确定待同步的张量数据以及所述张量数据中处于待同步状态的第二子数据;
根据所述第二子数据及所述可同步数据量,确定第一子数据。
条款A4、根据条款A2或条款A3所述的方法,所述方法还包括:
将所述张量数据的第一子数据的状态由待同步状态变更为已同步状态。
条款A5、一种数据同步方法,所述方法应用于第二处理器,包括:
在接收到来自第一处理器的状态查询指令时,解析所述状态查询指令,得到待同步的张量数据的数据特征;
根据所述数据特征,确定待同步的张量数据的描述符,所述描述符用于指示待同步的张量数据的形状;
根据所述张量数据的描述符,确定针对所述张量数据的可同步数据量;
根据所述张量数据的数据特征及所述可同步数据量,生成同步状态指令并向所述第一处理器发送所述同步状态指令,所述同步状态指令用于指示所述第一处理器确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应。
条款A6、根据条款A5所述的方法,所述方法还包括:
在接收到来自所述第一处理器的同步指令时,解析所述同步指令,得到待同步的张量数据的数据特征及所述张量数据的第一子数据;
根据所述数据特征,确定所述张量数据的描述符;
根据所述张量数据的描述符,存储所述张量数据的第一子数据。
条款A7、一种数据同步装置,所述装置应用于第一处理器,所述装置包括:
特征确定模块,用于根据待同步的张量数据的描述符,确定所述张量数据的数据特征,所述描述符用于指示待同步的张量数据的形状;
查询指令生成及发送模块,用于根据所述张量数据的数据特征,生成状态查询指令并向第二处理器发送所述状态查询指令,所述状态查询指令用于指示第二处理器确定针对所述张量数据的可同步数据量并生成同步状态指令。
条款A8、根据条款A7所述的装置,所述装置还包括:
状态指令解析模块,用于在接收到来自所述第二处理器的同步状态指令时,解析所述同步状态指令,确定待同步的张量数据的数据特征及可同步数据量,
第一描述符确定模块,用于根据所述数据特征,确定待同步的张量数据的描述符;
数据确定模块,用于根据所述描述符及所述可同步数据量,确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应;
同步指令生成及发送模块,用于根据所述第一子数据,生成同步指令并向所述第二处理器发送所述同步指令,以指示所述第二处理器获取所述第一子数据。
条款A9、根据条款A8所述的装置,所述数据确定模块包括:
第一确定子模块,用于根据所述描述符,确定待同步的张量数据以及所述张量数据中处于待同步状态的第二子数据;
第二确定子模块,用于根据所述第二子数据及所述可同步数据量,确定第一子数据。
条款A10、根据条款A8或条款A9所述的装置,所述装置还包括:
状态变更模块,用于将所述张量数据的第一子数据的状态由待同步状态变更为已同步状态。
条款A11、一种数据同步装置,所述装置应用于第二处理器,包括:
查询指令解析模块,用于在接收到来自第一处理器的状态查询指令时,解析所述状态查询指令,得到待同步的张量数据的数据特征;
第二描述符确定模块,用于根据所述数据特征,确定待同步的张量数据的描述符,所述描述符用于指示待同步的张量数据的形状;
数据量确定模块,用于根据所述张量数据的描述符,确定针对所述张量数据的可同步数据量;
状态指令生成及发送模块,用于根据所述张量数据的数据特征及所述可同步数据量,生成同步状态指令并向所述第一处理器发送所述同步状态指令,所述同步状态指令用于指示所述第一处理器确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应。
条款A12、根据条款A11所述的装置,所述装置还包括:
同步指令解析模块,用于在接收到来自所述第一处理器的同步指令时,解析所述同步指令,得到待同步的张量数据的数据特征及所述张量数据的第一子数据;
第三描述符确定模块,用于根据所述数据特征,确定所述张量数据的描述符;
数据存储模块,用于根据所述张量数据的描述符,存储所述张量数据的第一子数据。
条款A13、一种人工智能芯片,所述芯片包括如条款A7-条款A12中任意一项所述的数据同步装置。
条款A14、一种电子设备,所述电子设备包括如条款A13所述的人工智能芯片。
条款A15、一种板卡,所述板卡包括:存储器件、接口装置和控制器件以及如条款A13所述的人工智能芯片;
其中,所述人工智能芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;
所述存储器件,用于存储数据;
所述接口装置,用于实现所述人工智能芯片与外部设备之间的数据传输;
所述控制器件,用于对所述人工智能芯片的状态进行监控。
条款A16、根据条款A15所述的板卡,
所述存储器件包括:多组存储单元,每一组所述存储单元与所述人工智能芯片通过总线连接,所述存储单元为:DDR SDRAM;
所述芯片包括:DDR控制器,用于对每个所述存储单元的数据传输与数据存储的控制;
所述接口装置为:标准PCIE接口。
以上对本公开实施例进行了详细介绍,本文中应用了具体个例对本公开的原理及实施方式进行了阐述,以上实施例的说明仅用于帮助理解本公开的方法及其核心思想。同时,本领域技术人员依据本公开的思想,基于本公开的具体实施方式及应用范围上做出的改变或变形之处,都属于本公开保护的范围。综上所述,本说明书内容不应理解为对本公开的限制。

Claims (16)

  1. 一种数据同步方法,其特征在于,所述方法应用于第一处理器,包括:
    根据待同步的张量数据的描述符,确定所述张量数据的数据特征,所述描述符用于指示待同步的张量数据的形状;
    根据所述张量数据的数据特征,生成状态查询指令并向第二处理器发送所述状态查询指令,所述状态查询指令用于指示第二处理器确定针对所述张量数据的可同步数据量并生成同步状态指令。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    在接收到来自所述第二处理器的同步状态指令时,解析所述同步状态指令,确定待同步的张量数据的数据特征及可同步数据量,
    根据所述数据特征,确定待同步的张量数据的描述符;
    根据所述描述符及所述可同步数据量,确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应;
    根据所述第一子数据,生成同步指令并向所述第二处理器发送所述同步指令,以指示所述第二处理器获取所述第一子数据。
  3. 根据权利要求2所述的方法,其特征在于,根据所述描述符及所述可同步数据量,确定所述张量数据的第一子数据,包括:
    根据所述描述符,确定待同步的张量数据以及所述张量数据中处于待同步状态的第二子数据;
    根据所述第二子数据及所述可同步数据量,确定第一子数据。
  4. 根据权利要求2或3所述的方法,其特征在于,所述方法还包括:
    将所述张量数据的第一子数据的状态由待同步状态变更为已同步状态。
  5. 一种数据同步方法,其特征在于,所述方法应用于第二处理器,包括:
    在接收到来自第一处理器的状态查询指令时,解析所述状态查询指令,得到待同步的张量数据的数据特征;
    根据所述数据特征,确定待同步的张量数据的描述符,所述描述符用于指示待同步的张量数据的形状;
    根据所述张量数据的描述符,确定针对所述张量数据的可同步数据量;
    根据所述张量数据的数据特征及所述可同步数据量,生成同步状态指令并向所述第一处理器发送所述同步状态指令,所述同步状态指令用于指示所述第一处理器确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应。
  6. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    在接收到来自所述第一处理器的同步指令时,解析所述同步指令,得到待同步的张量数据的数据特征及所述张量数据的第一子数据;
    根据所述数据特征,确定所述张量数据的描述符;
    根据所述张量数据的描述符,存储所述张量数据的第一子数据。
  7. 一种数据同步装置,其特征在于,所述装置应用于第一处理器,所述装置包括:
    特征确定模块,用于根据待同步的张量数据的描述符,确定所述张量数据的数据特征,所述描述符用于指示待同步的张量数据的形状;
    查询指令生成及发送模块,用于根据所述张量数据的数据特征,生成状态查询指令并向第二处理器发送所述状态查询指令,所述状态查询指令用于指示第二处理器确定针对所述张量数据的可同步数据量并生成同步状态指令。
  8. 根据权利要求7所述的装置,其特征在于,所述装置还包括:
    状态指令解析模块,用于在接收到来自所述第二处理器的同步状态指令时,解析所述同步状态指令,确定待同步的张量数据的数据特征及可同步数据量,
    第一描述符确定模块,用于根据所述数据特征,确定待同步的张量数据的描述符;
    数据确定模块,用于根据所述描述符及所述可同步数据量,确定所述张量数据的第一子数据,所述第一子数据的数据量与所述可同步数据量相对应;
    同步指令生成及发送模块,用于根据所述第一子数据,生成同步指令并向所述第二处理器发送所述同步指令,以指示所述第二处理器获取所述第一子数据。
  9. 根据权利要求8所述的装置,其特征在于,所述数据确定模块包括:
    第一确定子模块,用于根据所述描述符,确定待同步的张量数据以及所述张量数据中处于待同步状态的第二子数据;
    第二确定子模块,用于根据所述第二子数据及所述可同步数据量,确定第一子数据。
  10. 根据权利要求8或9所述的装置,其特征在于,所述装置还包括:
    状态变更模块,用于将所述张量数据的第一子数据的状态由待同步状态变更为已同步状态。
  11. A data synchronization apparatus, applied to a second processor, comprising:
    a query instruction parsing module configured to, when a state query instruction from a first processor is received, parse the state query instruction to obtain a data feature of tensor data to be synchronized;
    a second descriptor determination module configured to determine, according to the data feature, a descriptor of the tensor data to be synchronized, wherein the descriptor is used to indicate a shape of the tensor data to be synchronized;
    a data amount determination module configured to determine, according to the descriptor of the tensor data, a synchronizable data amount for the tensor data;
    a state instruction generation and sending module configured to generate, according to the data feature of the tensor data and the synchronizable data amount, a synchronization state instruction and send the synchronization state instruction to the first processor, wherein the synchronization state instruction is used to instruct the first processor to determine first sub-data of the tensor data, and a data amount of the first sub-data corresponds to the synchronizable data amount.
  12. The apparatus according to claim 11, further comprising:
    a synchronization instruction parsing module configured to, when a synchronization instruction from the first processor is received, parse the synchronization instruction to obtain the data feature of the tensor data to be synchronized and the first sub-data of the tensor data;
    a third descriptor determination module configured to determine the descriptor of the tensor data according to the data feature;
    a data storage module configured to store the first sub-data of the tensor data according to the descriptor of the tensor data.
  13. An artificial intelligence chip, comprising the data synchronization apparatus according to any one of claims 7 to 12.
  14. An electronic device, comprising the artificial intelligence chip according to claim 13.
  15. A board card, comprising: a storage device, an interface device, a control device, and the artificial intelligence chip according to claim 13;
    wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively;
    the storage device is configured to store data;
    the interface device is configured to implement data transfer between the artificial intelligence chip and an external device; and
    the control device is configured to monitor a state of the artificial intelligence chip.
  16. The board card according to claim 15, wherein
    the storage device comprises a plurality of groups of memory units, each group of the memory units is connected to the artificial intelligence chip via a bus, and the memory units are DDR SDRAM;
    the chip comprises a DDR controller configured to control data transfer to and data storage in each of the memory units; and
    the interface device is a standard PCIE interface.
PCT/CN2020/111291 2019-08-09 2020-08-26 Data synchronization method and device, and related product WO2021027973A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910735424.5A CN112347027A (zh) 2019-08-09 2019-08-09 Data synchronization method and device, and related product
CN201910735424.5 2019-08-09

Publications (1)

Publication Number Publication Date
WO2021027973A1 true WO2021027973A1 (zh) 2021-02-18

Family

ID=74366944

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111291 WO2021027973A1 (zh) Data synchronization method and device, and related product

Country Status (2)

Country Link
CN (1) CN112347027A (zh)
WO (1) WO2021027973A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114489790A (zh) * 2020-11-13 2022-05-13 Cambricon Technologies Corporation Limited Data processing device, data processing method, and related product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080005296A1 (en) * 2006-05-08 2008-01-03 Cisco Technology, Inc. Method and apparatus for synchronizing use of buffer descriptor entries
CN101950282A * 2010-08-30 2011-01-19 Institute of Computing Technology, Chinese Academy of Sciences Multiprocessor system and synchronization engine thereof
CN103338144A * 2013-05-30 2013-10-02 Huawei Software Technologies Co., Ltd. Session data synchronization method and device
CN109656566A * 2018-12-14 2019-04-19 Beijing Zhongke Cambricon Technology Co., Ltd. Method for obtaining an executable file of a heterogeneous computing system, running method, and related products
CN109886399A * 2019-02-13 2019-06-14 Shanghai Enflame Technology Co., Ltd. Tensor processing device and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9785565B2 (en) * 2014-06-30 2017-10-10 Microunity Systems Engineering, Inc. System and methods for expandably wide processor instructions
CN107103004B * 2016-02-23 2020-11-06 Advanced New Technologies Co., Ltd. Data processing method, device, and system in a web page

Also Published As

Publication number Publication date
CN112347027A (zh) 2021-02-09

Similar Documents

Publication Publication Date Title
CN110096310B (zh) Operation method and device, computer equipment, and storage medium
WO2021027972A1 (zh) Data synchronization method and device, and related product
EP3825842B1 (en) Data processing method and apparatus, and related product
US11687339B2 (en) Data processing method and apparatus, and related product
US20240004650A1 (en) Data processing method and apparatus, and related product
US20240111536A1 (en) Data processing apparatus and related products
WO2021027973A1 (zh) Data synchronization method and device, and related product
WO2021018313A1 (zh) Data synchronization method and device, and related product
WO2021223642A1 (zh) Data processing method and device, and related product
CN111047005A (zh) Operation method and device, computer equipment, and storage medium
WO2021082723A1 (zh) Computing device
CN112347026B (zh) Data synchronization method and device, and related product
CN112395008A (zh) Operation method and device, computer equipment, and storage medium
US20240126553A1 (en) Data processing method and apparatus, and related product
CN111061507A (zh) Operation method and device, computer equipment, and storage medium
WO2021223638A1 (zh) Data processing method and device, and related product
CN112347185A (zh) Data synchronization method and device, and related product
WO2021223644A1 (zh) Data processing method and device, and related product
CN112395002B (zh) Operation method and device, computer equipment, and storage medium
CN111338694B (zh) Operation method and device, computer equipment, and storage medium
CN111275197B (zh) Operation method and device, computer equipment, and storage medium
CN111831722A (zh) Data synchronization method and device, and related product
CN111062483A (zh) Operation method and device, computer equipment, and storage medium
CN111124497A (zh) Operation method and device, computer equipment, and storage medium
CN113807507A (zh) Data processing method and device, and related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20852291

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20852291

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 310822)