WO2023284745A1 - Data processing method, system, and related device - Google Patents

Data processing method, system, and related device

Info

Publication number
WO2023284745A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
processor
operation set
tensor
operator
Application number
PCT/CN2022/105221
Other languages
English (en)
French (fr)
Inventor
李超
何轲
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP22841380.3A (published as EP4361812A1)
Publication of WO2023284745A1
Priority to US18/410,757 (published as US20240143397A1)

Classifications

    • G06F9/544 Buffers; Shared memory; Pipes (interprogram communication)
    • G06F12/0223 User address space allocation, e.g. contiguous or non-contiguous base addressing
    • G06F9/5016 Allocation of resources, the resource being the memory
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present invention relates to the technical field of processors, and in particular to a data processing method, system, and related device.
  • PyTorch is an open-source machine learning library used in applications such as natural language processing.
  • The PyTorch framework supports view-class framework operations on a source tensor to obtain a view tensor, which can effectively reduce the performance overhead caused by explicit tensor data copies; the source tensor and the view tensor share the same memory.
  • The operators corresponding to view-class framework operations are called view-class framework operators; they mainly include the reshape class and the non-contiguous class.
  • Reshape-class operators include framework operators such as view, view_as, squeeze, unsqueeze, and flatten. The view tensors generated by these operators are called contiguous tensors: when the elements of the view tensor are expanded in row-major order, the corresponding memory addresses are consistent with the source tensor and are contiguously distributed.
  • Non-contiguous-class operators include framework operators such as transpose, narrow, and expand.
  • The view tensors generated by these operators are called non-contiguous tensors: when the elements of the view tensor are expanded in row-major order, the corresponding memory addresses differ from those of the source tensor and are non-contiguously distributed over the shared memory.
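  • As a minimal PyTorch illustration of the distinction above (standard torch API calls, not code from the patent):

```python
import torch

src = torch.arange(12).reshape(3, 4)    # source tensor, row-major contiguous
v1 = src.view(4, 3)                     # reshape-class view: still contiguous
v2 = src.transpose(0, 1)                # non-contiguous-class view

print(v1.is_contiguous())               # True
print(v2.is_contiguous())               # False
print(v2.data_ptr() == src.data_ptr())  # True: both views share the source memory
```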
  • In current applications it is usually necessary to convert non-contiguous tensors into contiguous tensors. The current conversion scheme mainly copies the non-contiguous tensor from the device side (for example, a neural-network processing unit (NPU) chip) to the host and, after the host completes the conversion, copies it back to the device side. In summary, in current schemes for converting non-contiguous tensors into contiguous tensors, the conversion efficiency is low, the hardware requirements on the device are high, and the performance consumption is large. Therefore, how to improve tensor conversion efficiency, reduce the hardware dependence of the conversion process, and improve conversion performance is a problem that urgently needs to be solved.
  • The embodiments of the present invention disclose a data processing method, system, and related device. By deriving non-contiguous scenes, in particular recursively deriving scenes in which multiple view-class operations are fused, the non-contiguous scenes are extracted one by one and the operation matching each non-contiguous scene is determined; the determined operations are then executed in sequence to complete the conversion. This can effectively improve the efficiency of converting non-contiguous tensors into contiguous tensors, reduce the dependence on device hardware, and improve conversion performance.
  • In a first aspect, the present application provides a data processing method executed by a data processing system that includes a processor and a computing core. The method includes: the processor acquires metadata of first data and metadata of second data, where the second data is obtained from the first data through a first operation set, the first operation set includes at least two first operations, and the memory addresses corresponding to elements at adjacent positions in each row of the second data are non-contiguous; the processor identifies the first operation set according to the metadata of the second data and determines each first operation in the first operation set; the processor determines a second operation set matching the first operation set, where each first operation in the first operation set has a matching second operation in the second operation set; and the computing core obtains third data according to the first data and the second operation set, where the memory addresses corresponding to elements at adjacent positions in each row of the third data are contiguous.
  • Optionally, the first data can be the source tensor, the second data can be the non-contiguous tensor obtained after the source tensor undergoes the first operation set, the third data can be the contiguous tensor obtained after the second operation set is executed, and the first operation set can consist of non-contiguous framework operators such as transpose, narrow, and expand.
  • In the embodiments of this application, the processor analyzes the metadata of the non-contiguous tensor and recursively deduces the scene in which the non-contiguous tensor was generated, thereby determining the series of operations performed on the source tensor; it then determines the multiple tensor boost engine (TBE) operators matching that series of operations, and the computing core executes those TBE operators on the source tensor in turn to complete the conversion to a contiguous tensor. This improves conversion efficiency, reduces the performance dependence on the chip's AI CPU, and effectively improves conversion performance.
  • With reference to the first aspect, in a possible implementation, the processor sequentially identifies the first operations included in the first operation set according to preset priorities; for each identified first operation, the processor determines the feature scene corresponding to that first operation and pushes the feature scenes onto a scene information stack in order.
  • In the embodiments of this application, when the processor deduces and identifies the first operations in the first operation set, it follows a preset priority order: for example, it may first identify the transpose operation, then the slice operation, and finally the deformation operation, identifying the first operations included in the first operation set one by one. This can effectively reduce fusion interference among the multiple first operations and improve recognition accuracy and efficiency.
  • With reference to the first aspect, in a possible implementation, the processor judges whether the metadata of the second data matches at least one piece of characteristic information of the first operation to be identified, and if it matches, the processor determines the first operation to be identified, where the metadata of the second data includes the shape, stride, and storage_offset of the second data.
  • In the solution provided by this application, the processor identifies operations by comparing the metadata of the second data with the characteristic information of the first operation to be identified. No strict one-to-one correspondence is required: matching one piece or part of the characteristic information is enough to determine the first operation to be identified, which can effectively reduce fusion interference among the multiple first operations and improve recognition efficiency.
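  • A sketch of this relaxed metadata matching, using the transpose characteristic described later in this document (hypothetical helper; the patent's actual judgment conditions are internal to its implementation):

```python
def looks_transposed(stride, storage_offset):
    """Scene check: a transpose keeps storage_offset at 0 and leaves the
    stride sequence non-monotonic; matching this characteristic alone is
    treated as sufficient (relaxed matching)."""
    non_monotonic = any(a < b for a, b in zip(stride, stride[1:]))
    return storage_offset == 0 and non_monotonic

print(looks_transposed((1, 4), 0))  # True: transposed 3x4 tensor
print(looks_transposed((4, 1), 0))  # False: contiguous row-major layout
```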
  • With reference to the first aspect, in a possible implementation, the processor traverses an operator information library that includes multiple tensor boost engine (TBE) operators; for each first operation identified from the first operation set, the processor determines the operator in the operator information library with the same characteristics as that first operation to be the second operation matching it, and pushes the second operations onto an operator information stack in order.
  • In the solution provided by this application, after the processor determines the series of first operations performed on the source tensor, it can further search the current operator information library for each first operation; when the characteristics described by an operator's metadata are the same as those corresponding to the operation, that operator can be determined to be of the same type as the operation, that is, the TBE operator matching it, thereby yielding the second operation set.
  • With reference to the first aspect, in a possible implementation, the processor issues a conversion command to the computing core, where the conversion command includes the second operation set and instructs the computing core to perform operations on the first data according to the second operation set to obtain the third data.
  • In the solution provided by this application, after the processor finds the multiple TBE operators matching the first operation set in the operator information library, it notifies the computing core to execute those TBE operators on the source tensor, thereby obtaining a contiguous tensor in which the memory addresses corresponding to elements at adjacent positions in each row are contiguous. In this way the tensor conversion can be completed without relying on the AI CPU, reducing the dependence on chip hardware.
  • With reference to the first aspect, in a possible implementation, the processor constructs fourth data, where the metadata of the fourth data is the same as that of the first data and the fourth data shares the same memory with the first data; the computing core then executes the second operations in the second operation set on the fourth data in sequence to obtain the third data.
  • In the solution provided by this application, before the computing core performs the computation, the processor can obtain, from the determined TBE operator, the input parameter information required by that operator. The input parameter information includes an input tensor, which can be a temporary contiguous tensor constructed by sharing memory; the metadata of this temporary contiguous tensor is the same as that of the source tensor. Only after the temporary contiguous tensor has been constructed can the computing core execute the corresponding operations, which ensures that it correctly executes the corresponding TBE operators and completes the tensor conversion.
  • With reference to the first aspect, in a possible implementation, the first operation set includes the transpose operator, the slice (narrow) operator, and the expand operator.
  • With reference to the first aspect, in a possible implementation, the system includes a host and a chip, where the processor is located in the host and the computing core is located in the chip.
  • With reference to the first aspect, in a possible implementation, the chip includes at least one of a neural-network processing unit (NPU), a graphics processing unit (GPU), a tensor processing unit (TPU), or a deep learning processor (DPU).
  • In the solution provided by this application, the recursive derivation of the scene in which a combined non-contiguous tensor was generated can be completed either by the host in the data processing system or by the chip in the data processing system. Regardless of which completes the recursive scene derivation, the computing core ultimately executes the TBE operators to convert the non-contiguous tensor into a contiguous tensor, thereby reducing data copies and the hardware dependence on the AI CPU and improving conversion efficiency and performance.
  • In a second aspect, the present application provides a data processing system that includes a processor and a computing core.
  • The processor is configured to: acquire metadata of first data and metadata of second data, where the second data is obtained from the first data through a first operation set, the first operation set includes at least two first operations, and the memory addresses corresponding to elements at adjacent positions in each row of the second data are non-contiguous; identify the first operation set according to the metadata of the second data and determine each first operation in the first operation set; and determine a second operation set matching the first operation set, where each first operation in the first operation set has a matching second operation in the second operation set.
  • The computing core is configured to obtain third data according to the first data and the second operation set, where the memory addresses corresponding to elements at adjacent positions in each row of the third data are contiguous.
  • It should be understood that the chip may include multiple processors and computing cores at the same time, which can execute their respective tasks in parallel without affecting or interfering with each other; this application does not limit the number of processors and computing cores of the chip.
  • With reference to the second aspect, in a possible implementation, the processor is specifically configured to: sequentially identify the first operations included in the first operation set according to preset priorities; and, for each identified first operation, determine the feature scene corresponding to that first operation and push the feature scenes onto a scene information stack in order.
  • With reference to the second aspect, in a possible implementation, the processor is specifically configured to: judge whether the metadata of the second data matches at least one piece of characteristic information of the first operation to be identified, and if so, determine the first operation to be identified, where the metadata of the second data includes the shape, stride, and storage_offset of the second data.
  • With reference to the second aspect, in a possible implementation, the processor is specifically configured to: traverse an operator information library that includes multiple TBE operators; and, for each first operation identified from the first operation set, determine the operator in the operator information library with the same characteristics as that first operation to be the second operation matching it, and push the second operations onto an operator information stack in order.
  • With reference to the second aspect, in a possible implementation, the processor is further configured to issue a conversion command to the computing core, where the conversion command includes the second operation set and instructs the computing core to perform operations on the first data according to the second operation set to obtain the third data.
  • With reference to the second aspect, in a possible implementation, the processor is further configured to construct fourth data, where the metadata of the fourth data is the same as that of the first data and the fourth data shares the same memory with the first data; the computing core is further configured to execute the second operations in the second operation set on the fourth data to obtain the third data.
  • With reference to the second aspect, in a possible implementation, the first operation set includes the transpose operator, the slice (narrow) operator, and the expand operator.
  • With reference to the second aspect, in a possible implementation, the processor is located in a host of the system, and the computing core is located in a chip of the system.
  • With reference to the second aspect, in a possible implementation, the chip includes at least one of an NPU, a GPU, a TPU, or a DPU.
  • In a third aspect, the present application provides a chip that includes a processor and a computing core configured in the same way as in the second aspect: the processor acquires the metadata of the first data and of the second data, identifies the first operation set, and determines the matching second operation set, and the computing core obtains the third data according to the first data and the second operation set, where the memory addresses corresponding to elements at adjacent positions in each row of the third data are contiguous.
  • In a fourth aspect, the present application provides a computing device including the data processing system provided by any one of the implementations of the first aspect above.
  • In a fifth aspect, the present application provides a computer storage medium storing a computer program which, when executed by a processor, implements the method provided by the first aspect or any implementation of the first aspect.
  • In a sixth aspect, the present application provides a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method provided by the first aspect or any implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a tensor conversion provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a system structure provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of splitting a combined non-contiguous scene provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of converting a combined non-contiguous tensor into a contiguous tensor provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a first operation set provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of the input parameter information of a transpose operator provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of combined non-contiguous scene recognition and extraction provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of another tensor conversion process provided by an embodiment of the present application;
  • FIG. 10 is a schematic diagram of a chip structure provided by an embodiment of the present application.
  • Metadata is data that describes actual data; it carries the attribute information of the actual data and can be, for example, the file name of the actual data or a storage address pointer to the actual data.
  • For example, the metadata of a tensor can describe characteristic information such as the tensor's shape, number of dimensions, and format.
  • The metadata can also have a corresponding identifier used to identify it.
  • The metadata and its corresponding identifier can form a key-value pair, where each key-value pair includes a keyword (key) and the value corresponding to that key; the value is the metadata itself, and the key identifies the value.
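  • For example, such key-value metadata might be laid out as follows (illustrative only; the patent does not prescribe a concrete encoding):

```python
tensor_metadata = {
    "shape": (3, 4),        # key "shape" identifies the shape value
    "stride": (4, 1),       # key "stride" identifies the stride value
    "storage_offset": 0,    # key "storage_offset" identifies the offset value
}
```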
  • A host, which can also be called a client, is a computer system that is connected to a hard disk, hard disk subsystem, or file server and that can store data and perform I/O access. Specifically, it can include physical machines, virtual machines, containers, and the like, and is used to communicate with a device and process data; examples include application servers, multiprocessor machines, workstations, and personal computers.
  • A device is a processing chip that integrates modules such as multiply-accumulate, activation function, two-dimensional data operation, and decompression, and that can accelerate neural network operations and effectively improve their efficiency; examples include NPUs, GPUs, TPUs, and DPUs. A processing chip can contain multiple processors and computing cores, which can execute their tasks in parallel.
  • Currently, to reduce the performance consumption caused by explicit tensor data copies, the PyTorch framework supports view-class framework operations on a source tensor to obtain a view tensor; the elements in the source tensor and the elements in the view tensor share the same block of memory.
  • In actual data processing, however, it is usually necessary to convert a non-contiguous tensor into a contiguous tensor for further processing. For example, the compute unified device architecture (CUDA) computes the memory address of each element of the non-contiguous tensor; relying on load and store instructions, the processing chip (for example, a GPU) can access elements at arbitrary memory locations and store them into a designated contiguous memory area, ensuring that when the elements of the non-contiguous tensor are expanded in row-major order the corresponding memory addresses are contiguous, thereby completing the conversion of the non-contiguous tensor.
  • However, many current processing chips cannot perform efficient data copies according to the above data migration logic; for example, the NPU cannot complete the conversion from non-contiguous tensors to contiguous tensors in the above manner.
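  • The per-element address computation described above can be modeled in plain Python (an illustrative model of strided addressing, not CUDA code):

```python
from itertools import product

def to_contiguous(storage, shape, stride, storage_offset):
    """Copy a strided (possibly non-contiguous) view out of `storage`
    into a new row-major buffer, one element at a time."""
    out = []
    for idx in product(*(range(n) for n in shape)):
        addr = storage_offset + sum(i * s for i, s in zip(idx, stride))
        out.append(storage[addr])
    return out

storage = list(range(12))                  # backing memory of a 3x4 tensor
# View = transpose of that tensor: shape (4, 3), stride (1, 4)
print(to_contiguous(storage, (4, 3), (1, 4), 0))
# [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```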
  • For such processing chips, the conversion to a contiguous tensor usually needs to be completed with the help of a host. As shown in FIG. 1, the host 110 and the device 120 are directly connected through a network or a peripheral component interconnect express (PCIe) interface; the host 110 can be a server, and the device 120 can be an NPU accelerator card inserted into that server. First, the host 110 sends a stream synchronization command to the device 120, which blocks the execution of all tasks on the current stream. After receiving the command, the device 120 copies the non-contiguous tensor to the host 110; the host 110 computes the memory address of each element from the information of the current non-contiguous tensor and then copies each element to a designated memory area using the CPU's load/store instructions.
  • After the copy is completed, the non-contiguous tensor has been converted into a contiguous tensor, which is then copied back to the device 120; finally, the host 110 sends an end-stream-synchronization command to the device 120 to release the related resources. It can be seen that converting a non-contiguous tensor blocks the normal execution of other tasks; moreover, the tensor data must be copied back and forth between the host and the device, the copy efficiency is low, the performance consumption on the host is large, and the performance of the entire conversion process is poor.
  • In addition, the non-contiguous tensor may be a tensor obtained after the source tensor has undergone multiple view-class framework operations. These operations superimpose on and interfere with one another, which makes it harder to identify and deduce the scene in which such non-contiguous tensors were generated; only the processor in the host or device can perform the conversion, the capability of the computing cores in the device cannot be fully utilized, and the overall conversion efficiency is low.
  • Based on the above, the present application provides a data processing method that uses a processor to deduce the scene in which a non-contiguous tensor was generated, in particular to recursively deduce combined non-contiguous scenes in which multiple view-class operations are fused, and to determine an operation set matching the combined non-contiguous scene; the AI Core then executes the operation set in sequence to re-copy the data, thereby converting the non-contiguous tensor into a contiguous tensor, effectively improving conversion efficiency, reducing the dependence on device hardware (especially the AI CPU), and improving conversion performance.
  • The technical solutions of the embodiments of this application can be applied to any system that needs to convert non-contiguous tensors, and they are especially suitable for scenarios with combined non-contiguous tensors and low dependence on the AI CPU.
  • Referring to FIG. 2, FIG. 2 is a schematic diagram of a system structure provided by the present application.
  • As shown in FIG. 2, the system includes a host 210 and a chip 220.
  • The host 210 may include a hardware layer and a software layer; the software layer includes a guest operating system 2110 and a task scheduler 2120, and the hardware layer includes hardware such as one or more processors and memory.
  • The chip 220 may be at least one of a neural-network processing unit (NPU), a graphics processing unit (GPU), a tensor processing unit (TPU), or a deep learning processor (DPU), and it also includes a hardware layer and a software layer.
  • Its hardware layer includes hardware such as one or more processors (for example, the AI CPU 2220), one or more computing cores (for example, the AI Core 2230), and memory; its software layer includes various processing units (for example, the I/O processing unit 2210) to handle the processes related to converting non-contiguous tensors into contiguous tensors. The host 210 and the chip 220 can be connected through an interface.
  • In some embodiments, the chip 220 and the host 210 may be located on different devices; in other embodiments, the chip 220 may be mounted on the host 210 in the form of a plug-in card.
  • The host 210 is used to cooperate with the chip to complete the conversion of non-contiguous tensors.
  • The processor 2130 may be a CPU, or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • A general-purpose processor may be a microprocessor or any conventional processor.
  • The memory 2140 can be used to store the operator information library. It can include read-only memory and random access memory, can be volatile or non-volatile memory, or can include both.
  • The non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • When a tensor needs to be converted, the task scheduler 2120 in the host 210 sends the tensor conversion task to the processor 2130, and the processor 2130 extracts the metadata of the current non-contiguous tensor from the memory 2140; the metadata includes information such as the shape, stride, and memory offset of the non-contiguous tensor. The processor then analyzes the metadata and performs the derivation in the preset priority order: for each feature scene it derives (for example, a transpose scene), it extracts the scene, pushes it onto the scene information stack, and determines the first operation corresponding to that scene (that is, the transpose operation). The processor 2130 then traverses the operator information library in the memory 2140 to find the TBE operator of the same type as the first operation, obtains the information required by that TBE operator, and pushes it onto the operator information stack.
  • In a specific embodiment of the present application, the derivation process is shown in FIG. 3.
  • The source tensor becomes a non-contiguous view tensor after n view-class operations; the non-contiguous view tensor is then identified step by step in the preset priority order, each pass identifying one view-class operation, so that a feature scene is extracted and the corresponding TBE operator is matched, and they are pushed onto the scene information stack and the operator information stack respectively.
  • Then the processor 2130 sends instructions to the chip 220 through the PCIe interface, builds a temporary contiguous tensor in the chip 220 by sharing the non-contiguous tensor's memory, and delivers the TBE operators in the operator information stack to the AI Core 2230; the AI Core 2230 schedules the TBE operators in the operator information stack in sequence to operate on the temporary contiguous tensor, re-copying each element of the non-contiguous tensor in the memory 2240 to a designated area to obtain a contiguous tensor.
  • The memory addresses corresponding to two adjacent elements in each row of the contiguous tensor are contiguous in the memory 2240; the specific conversion process is shown in FIG. 4.
  • After the construction of the temporary contiguous tensor is completed, the first single-operator non-contiguous scene is popped from the scene information stack and rebuilt, and the corresponding TBE operator is taken from the operator information stack to perform the conversion; this continues until both the scene information stack and the operator information stack are empty, at which point rebuilding stops and the resulting tensor is the target contiguous view tensor.
  • It can be seen that the known non-contiguous tensor and source tensor are used to recursively deduce the scene in which the non-contiguous tensor was generated, so that the operation corresponding to each non-contiguous scene can be derived; according to each operation, the matching TBE operator can then be mapped from the operator information library and pushed onto the operator information stack. Finally, the AI Core executes the multiple TBE operators in the operator information stack to generate the contiguous tensor, without depending on hardware such as the AI CPU, which can effectively improve conversion efficiency and conversion performance.
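  • The stack-based derive-then-replay flow can be sketched as follows (hypothetical data structures and a single recognized scene for brevity; a simplification of the patent's flow, not its implementation):

```python
import torch

def row_major_stride(shape):
    stride, acc = [], 1
    for n in reversed(shape):
        stride.append(acc)
        acc *= n
    return tuple(reversed(stride))

def derive_and_convert(noncontig, src_shape):
    scene_stack, op_stack = [], []

    # Derivation phase (processor): recognize scenes by priority.
    stride = noncontig.stride()
    if any(a < b for a, b in zip(stride, stride[1:])):   # transpose feature
        scene_stack.append("transpose")
        op_stack.append(lambda t: t.transpose(0, 1))     # matching operator

    # Execution phase (computing core): build a temporary contiguous tensor
    # sharing the non-contiguous tensor's memory, then replay the operators.
    temp = torch.as_strided(noncontig, src_shape, row_major_stride(src_shape))
    while op_stack:
        scene_stack.pop()
        temp = op_stack.pop()(temp).contiguous()         # re-copy into new memory
    return temp

src = torch.arange(12.).reshape(3, 4)
out = derive_and_convert(src.transpose(0, 1), (3, 4))
print(out.is_contiguous())                               # True
print(torch.equal(out, src.transpose(0, 1)))             # True
```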
  • Referring to FIG. 5, FIG. 5 shows a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • The method can be applied to the data processing system shown in FIG. 2 above.
  • The method may specifically include the following steps.
  • S501: The processor acquires metadata of the first data and metadata of the second data.
  • The first data may be source tensor data.
  • The source tensor is an n-dimensional data structure whose specific forms include scalars, vectors, and matrices; for example, a 0-dimensional tensor is a scalar.
  • The metadata of the source tensor is the data describing the source tensor, including the tensor's shape, stride, storage_offset, and so on.
  • The second data can be a non-contiguous tensor: when its elements are expanded in row-major order, the memory addresses corresponding to elements at adjacent positions in each row are non-contiguous.
  • The processor can be the processor 2130 in the host shown in FIG. 2 above, in which case the scene derivation in step S502 and the operator mapping and matching in step S503 are completed by the host; the processor can also be the AI CPU 2220 shown in FIG. 2 above, in which case the scene derivation in step S502 and the operator mapping and matching in step S503 are completed by the chip. This application does not limit this.
  • The second data is obtained from the first data through the first operation set; that is, the above non-contiguous tensor is obtained by performing a series of first operations on the source tensor.
  • The first operation is a non-contiguous operation.
  • Exemplarily, the first operation performs, on the source tensor, the operation corresponding to a non-contiguous view-class framework operator.
  • The non-contiguous view-class framework operators include transpose, narrow, expand, and so on.
  • As shown in FIG. 6, the source tensor is a contiguous tensor.
  • The non-contiguous operation unit 620 performs the first operation set on the source tensor 610; that is, the slice operation, the deformation operation, and the transpose operation are performed on the source tensor in sequence to generate the non-contiguous view tensor 630.
  • The view information of the non-contiguous view tensor 630 is inconsistent with its source information: compared with the source tensor 610, the shape and stride in the view information of the non-contiguous view tensor 630 have clearly changed.
  • Since the view tensor is obtained by performing non-contiguous combined view-class framework operations on the source tensor, only the metadata of the source tensor is changed; the source tensor and the view tensor still share the same memory, that is, the elements of the source tensor are the same as the elements of the view tensor and they occupy the same memory.
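  • A combined chain of this kind can be reproduced with standard PyTorch calls (illustrative shapes, with the operation order adjusted to transpose-then-slice so that every step remains a true view):

```python
import torch

src = torch.arange(24.).reshape(4, 6)        # source tensor (contiguous)
view = src.transpose(0, 1).narrow(0, 0, 4)   # combined non-contiguous view

print(view.is_contiguous())                  # False
print(view.data_ptr() == src.data_ptr())     # True: same memory, only metadata changed
print(src.shape, src.stride())               # torch.Size([4, 6]) (6, 1)
print(view.shape, view.stride())             # torch.Size([4, 4]) (1, 6)
```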
  • S502: The processor identifies the first operation set according to the metadata of the second data and determines each first operation in the first operation set.
  • After the processor obtains the metadata of the source tensor and the metadata of the non-contiguous tensor, it analyzes the characteristics of the non-contiguous tensor (such as its shape, stride, and storage_offset) to determine the series of non-contiguous operations that were performed on the source tensor.
  • Specifically, the processor sequentially identifies the first operations included in the first operation set according to preset priorities, and for each identified first operation the processor determines the corresponding feature scene and pushes the feature scenes onto the scene information stack in order.
  • The processor performs feature scene recognition in the preset priority order: for example, if the transpose operation has the first priority, the slice operation the second priority, and the deformation operation the third priority, then during scene recognition the processor first identifies the transpose scene, then the slice scene, and finally the deformation scene.
  • In addition, when identifying a scene, the processor relaxes the judgment conditions corresponding to that scene: not all of the scene's judgment conditions need to be met, only some of them. For example, if a feature scene has three judgment conditions and the metadata of the non-contiguous tensor satisfies one of them, the existence of that feature scene can already be determined. It is easy to understand that appropriately relaxing the scene determination conditions can further eliminate the interference of scene fusion and improve the efficiency and accuracy of scene recognition.
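  • The priority-ordered, relaxed recognition might be organized as below (hypothetical feature tables and metadata fields; the patent's real judgment conditions are those of its scenes):

```python
# Scenes are checked in priority order; a scene counts as recognized when at
# least one of its characteristic conditions holds (relaxed matching).
SCENE_PRIORITY = [
    ("transpose", [lambda m: any(a < b for a, b in zip(m["stride"], m["stride"][1:]))]),
    ("slice",     [lambda m: m["storage_offset"] > 0,
                   lambda m: m["numel"] < m["storage_numel"]]),
    ("deformation", [lambda m: len(m["shape"]) != len(m["src_shape"])]),
]

def recognize_scenes(meta):
    scene_stack = []
    for name, conditions in SCENE_PRIORITY:
        if any(cond(meta) for cond in conditions):
            scene_stack.append(name)   # push the extracted feature scene
    return scene_stack

meta = {"shape": (4, 3), "stride": (1, 4), "storage_offset": 0,
        "numel": 12, "storage_numel": 12, "src_shape": (3, 4)}
print(recognize_scenes(meta))          # ['transpose']
```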
  • Exemplarily, for the non-contiguous view tensor 630 described above, the processor analyzes its metadata and finds that the stride information is non-monotonic, from which it can determine that a transpose scene and a transpose operation exist.
  • Similarly, other non-contiguous scenes and non-contiguous operations can be derived through the same logic.
  • For example, if the processor analyzes the shape information and finds that the number of elements along some axes of the shape has decreased, the processor can determine that a slice scene and a slice operation exist.
  • Each time a feature scene is identified, feature scene extraction is performed; that is, the combined non-contiguous scene is divided into two parts: the first part is the reconstructed recognized scene, and the other part is the remaining scene, which may itself still be a combined non-contiguous scene. If the remaining scene is still a combined non-contiguous scene, identification continues until the remaining scene can no longer be divided.
  • In addition, a scene information stack is constructed, and each extracted feature scene is pushed onto it in order; each feature scene corresponds to one first operation, that is, to one non-contiguous view operation.
  • S503: The processor determines a second operation set that matches the first operation set.
  • While the processor recursively deduces the scene in which the non-contiguous tensor was generated, each time it identifies a non-contiguous operation and extracts the feature scene it needs to further determine the second operation matching that non-contiguous operation and feature scene, that is, to determine whether there is a tensor boost engine (TBE) operator whose characteristics are the same as those corresponding to the non-contiguous operation and feature scene.
  • It should be noted that each non-contiguous operation corresponds to a TBE operator; for example, the transpose operation corresponds to the transpose operator, the slice operation corresponds to the slice operator, and the deformation operation corresponds to the broadcast operator.
  • The processor searches for the TBE operator matching a non-contiguous operation by traversing the current operator information library.
  • Exemplarily, the first operation set includes a transpose operation, and the characteristics of the transpose operator are:
  • the stride information of the tensor is non-monotonic;
  • the shape and stride information of the specified axes of the tensor are exchanged;
  • the storage_offset of the tensor is 0 and remains unchanged.
  • The stride information of the non-contiguous view tensor described above is likewise non-monotonic; therefore, the processor can determine that the TBE operator matching the transpose operation is transpose, and it directly determines the transpose operator in the operator information library as the operator matching the non-contiguous operation (that is, the transpose operation).
  • After the processor determines the TBE operator matching the non-contiguous operation, it can obtain the input parameter information required by that TBE operator.
  • As shown in FIG. 7, taking the transpose operator as an example, the input parameter information includes the input tensor, the result tensor, and the transposed axis information.
  • The input tensor is a temporary contiguous tensor constructed through shared memory; the result tensor, which receives the execution result of the operator, is a new empty contiguous tensor created from the view information of the non-contiguous tensor; and the transposed axis information is the same as the transposed axis information corresponding to the deduced transpose operation.
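  • In PyTorch terms, assembling these three inputs could look like this (an illustrative sketch; the real TBE operator signature is internal to the patent's system):

```python
import torch

def build_transpose_params(noncontig, src_shape, src_stride, axes):
    # Input tensor: temporary contiguous tensor sharing the source memory.
    input_t = torch.as_strided(noncontig, src_shape, src_stride)
    # Result tensor: new empty contiguous tensor built from the view information.
    result_t = torch.empty(noncontig.shape, dtype=noncontig.dtype)
    # Transposed axis information: taken from the deduced transpose operation.
    return input_t, result_t, axes

src = torch.arange(12.).reshape(3, 4)
nc = src.transpose(0, 1)                          # non-contiguous view
inp, res, axes = build_transpose_params(nc, (3, 4), (4, 1), (0, 1))
res.copy_(inp.transpose(*axes))                   # stand-in for the TBE transpose
print(res.is_contiguous(), torch.equal(res, nc))  # True True
```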
  • After the processor determines the second operation corresponding to each first operation and finds the matching TBE operator, it builds an operator information stack and pushes the derived TBE operators onto it in order.
  • It should be noted that if, when determining a second operation, no matching TBE operator is found in the operator information library, R&D personnel can write a TBE operator matching that second operation and add it to the operator information library. Being able to change the operator implementation at the software level in this way can effectively expand the applicable scenarios, improve the flexibility of the conversion, give full play to the performance of the AI Core, and relieve the hardware dependence on the AI CPU.
  • In a specific embodiment of the present application, as shown in FIG. 8, the source tensor 810 undergoes the slice operation to obtain the non-contiguous tensor 820, then the deformation operation to obtain the non-contiguous tensor 830, and finally the transpose operation to obtain the non-contiguous tensor 840.
  • During the recursive derivation, the transpose scene is identified first and the first iteration of the derivation is performed: for the non-contiguous tensor 840, the transposed non-contiguous tensor 850 and the remaining non-contiguous tensor 860 are constructed; the transposed non-contiguous tensor 850 is pushed onto the scene information stack, and its matching TBE operator is found in the operator information library and pushed onto the operator information stack. Then the slice operation is identified and the second iteration of the derivation is performed, constructing the sliced non-contiguous tensor 870 and the remaining non-contiguous tensor 880; the sliced non-contiguous tensor 870 is pushed onto the scene information stack, and its matching TBE operator is found in the operator information library and pushed onto the operator information stack. Since the remaining non-contiguous tensor 880 is not a combined non-contiguous tensor but a deformation-type non-contiguous tensor, the iteration stops at this point and the derivation process ends.
  • It should be noted that the above scene construction follows a fixed scene refresh strategy: the view information of the transposed non-contiguous tensor 850 is kept consistent with the view information of the non-contiguous tensor 840, the view information of the remaining non-contiguous tensor 860 is kept consistent with the source information of the transposed non-contiguous tensor 850, and the source information of the remaining non-contiguous tensor 860 is kept consistent with the source information of the non-contiguous tensor 840; the scene information refresh strategy of each iteration is the same.
  • S504: The computing core obtains the third data according to the first data and the second operation set.
  • After the processor has pushed the TBE operators onto the operator information stack, it sends the operator information stack to the computing core (AI Core), and the AI Core executes the TBE operators in the operator information stack in sequence to obtain the contiguous tensor.
  • Specifically, a TBE operator exists in the operator information library in the form of a file that records the operator's input parameter information; the processor sends the file corresponding to the TBE operator to the AI Core, and the AI Core outputs the contiguous tensor by executing that file.
  • It should be noted that before the processor sends the TBE operators to the AI Core, it constructs a temporary contiguous tensor by sharing memory with the non-contiguous view tensor.
  • The metadata of the temporary contiguous tensor is the same as the metadata of the source tensor, and it shares the same memory with the source tensor; that is, the temporary contiguous tensor can be understood as a restoration of the source tensor.
  • Of course, the temporary contiguous tensor can also be constructed in other ways, which is not limited in this application.
  • In a specific embodiment of the present application, as shown in FIG. 9, the processor constructs a temporary contiguous tensor as the input tensor 910 according to the non-contiguous view tensor 630; the input tensor 910 and the non-contiguous view tensor 630 share the same memory. First, the deformation scene is popped from the scene information stack and the corresponding TBE operator, such as the broadcast operator 920, is called from the operator information stack; the broadcast operator computes on the input tensor 910, and the result is a contiguous tensor. Then the slice scene is popped from the scene information stack and the corresponding slice operator 930 is called from the operator information stack; the slice operator computes on the first result, and that result is also a contiguous tensor. Finally, the transpose scene is popped from the scene information stack and the corresponding transpose operator 940 is called from the operator information stack; the transpose operator 940 computes on the second result, yielding the output tensor 950, which is a contiguous tensor (that is, the output tensor 950 can also be called the contiguous tensor 950).
  • For the contiguous tensor 950, when the elements of the tensor are expanded in row-major order, the memory addresses corresponding to elements at adjacent positions in each row are contiguous.
  • In this embodiment of the present application, when the AI Core executes the TBE operators, it delimits a new memory area in the main memory and migrates the elements of the source tensor into that memory area in order, according to the memory reading order determined by the contiguous tensor, thereby ensuring that when the contiguous tensor is expanded in row-major order, the memory addresses of adjacent elements are contiguous.
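  • The end-to-end effect can be checked against PyTorch's own conversion (a two-operator chain for brevity; `contiguous()` stands in for the AI Core's re-copy into a newly delimited memory area):

```python
import torch

src = torch.arange(6.).reshape(2, 3)
nc = src.narrow(1, 0, 2).transpose(0, 1)        # combined non-contiguous view

temp = torch.as_strided(nc, (2, 3), (3, 1))     # restoration of the source tensor
out = temp.narrow(1, 0, 2).transpose(0, 1).contiguous()  # replay, then re-copy

print(out.stride())                             # (2, 1): row-major contiguous
print(torch.equal(out, nc))                     # True: same elements, contiguous layout
```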
  • It can be seen that implementing this embodiment of the present invention does not require changing the semantics of the view-class framework operators of the PyTorch framework itself.
  • The non-contiguous operations are determined by recursively deriving the scene in which the non-contiguous tensor was generated, and the matching TBE operators are then determined.
  • The computing core AI Core executes the TBE operators to generate contiguous tensors with a contiguous memory distribution. This does not depend on the hardware performance of the AI CPU, improves conversion efficiency and conversion performance, is more flexible at the software level and easy to extend, and can give full play to the performance of the computing core AI Core.
  • It should be noted that the above method is not only applicable to the PyTorch framework: for other AI frameworks with non-contiguous operations such as transpose and slice, the method provided by the present invention can likewise be used to perform scene derivation and, according to the derivation result, complete the conversion of non-contiguous tensors, especially of combined non-contiguous tensors.
  • The above description uses a first operation set consisting of the slice, deformation, and transpose operations merely as an example to illustrate how to perform the recursive derivation of combined scenes and how to complete the conversion to a contiguous tensor according to the derivation result; first operation sets consisting of other operations can be recursively derived and converted in the same way.
  • In addition, the data processing method provided by this application can be widely used in PyTorch model training and inference scenarios, where it can significantly improve the efficiency of training and inference, reduce training time, and accelerate model training. It can be understood that when model training involves converting non-contiguous tensors into contiguous tensors, this application uses the AI Core to execute the TBE operators that perform the memory copies for the conversion; compared with performing the memory copies on the host, this reduces the latency of copying data back and forth and improves conversion efficiency, thereby effectively improving model training and inference efficiency and producing considerable commercial value.
  • Referring to FIG. 10, FIG. 10 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • The chip 10 includes a processor 11 and a computing core 12. The chip 10 can be an NPU chip, a GPU chip, a TPU chip, or another AI chip, and it can include multiple processors and computing cores that execute their respective tasks in parallel; FIG. 10 takes one of each as an example.
  • For the functions of the chip described in this embodiment of the present invention, reference may be made to the relevant descriptions in the embodiments shown in FIG. 2 to FIG. 9, and details are not repeated here.
  • An embodiment of the present application also provides a computing device. The computing device may be the host in the data processing system shown in FIG. 2 above, with a chip integrated on the host in the form of a plug-in card; the host and the chip can cooperate to execute the relevant steps described in the embodiments shown in FIG. 2 to FIG. 9, which are not repeated here.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, some or all of the steps described in any one of the above method embodiments can be implemented.
  • An embodiment of the present application also provides a computer program product which, when run on a computer or a processor, causes the computer or processor to execute one or more steps of any one of the above methods. If the component modules of the above device are implemented in the form of software functional units and sold or used as independent products, they can be stored in the computer-readable storage medium.
  • It should be understood that the serial numbers of the above processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
  • It should be understood that the disclosed systems, devices, and methods may be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • The division of the units is only a logical functional division; in actual implementation there may be other ways of dividing them. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
  • In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • If the above functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical discs.
  • The modules in the device of the embodiments of the present application can be combined, divided, and deleted according to actual needs.

Abstract

A data processing method and system, and a related device. The data processing system includes a processor and a computing core, and the method includes: the processor obtains metadata of first data and metadata of second data (S501), where the second data is obtained from the first data through a first operation set, the first operation set includes at least two first operations, and the memory addresses corresponding to elements at adjacent positions in each row of the second data are non-contiguous; the processor identifies the first operation set according to the metadata of the second data, and the processor determines a second operation set matching the first operation set (S503); the computing core obtains third data according to the first data and the second operation set (S504), where the memory addresses corresponding to elements at adjacent positions in each row of the third data are contiguous. The above method can reduce the dependence on device hardware and, in scenarios where multiple view-class operations are fused, improve the efficiency and performance of converting non-contiguous tensors into contiguous tensors.

Description

Data processing method, system, and related device
This application claims priority to Chinese Patent Application No. 202110792976.7, filed with the China National Intellectual Property Administration on July 14, 2021 and entitled "Data processing method, system, and related device", and to Chinese Patent Application No. 202111221037.3, filed with the China National Intellectual Property Administration on October 20, 2021 and entitled "Data processing method, system, and related device", both of which are incorporated herein by reference in their entireties.
技术领域
本发明涉及处理器技术领域,尤其涉及一种数据处理方法、系统及相关设备。
背景技术
Pytorch是一个开源的机器学习库,用于自然语言处理等应用程序,Pytorch框架支持对源张量的视图类框架操作以得到视图张量,从而可以有效减少显示的张量数据拷贝所带来的性能消耗,其中,源张量和视图张量共享同一内存。
视图类框架操作所对应的算子可以称为视图类框架算子,其主要包括重塑(reshape)类和非连续类,reshape类算子包括view、view_as、squeeze、unsqueeze、flatten等框架算子,这些算子对应生成的视图张量称为连续张量,其特征在于视图张量的元素按照行优先展开时,对应的索引内存与源张量一致,是连续分布的。非连续类算子包括转置(transpose)、切片(narrow)、扩张(expand)等框架算子,这些算子对应生成的视图张量称为非连续张量,其特征在于视图张量的元素按照行优先展开时,对应的索引内存与源张量不同,在共享内存上呈现非连续分布。
在当前应用中,通常都需要将非连续张量转换为连续张量,目前的转换方案主要是将非连续张量从设备侧(例如神经网络处理器(neural-network processing unit,NPU)芯片)拷贝至主机,由主机完成转换后再将其拷贝回设备侧。综上,在目前对非连续张量转换为连续张量的方案中,其转换效率较低,对设备的硬件要求较高,性能消耗较大。
因此,如何提高张量转换效率,减小转换过程中对设备的硬件依赖,提高设备转换性能是目前亟待解决的问题。
发明内容
本发明实施例公开了一种数据处理方法、系统及相关设备,通过对非连续场景推导,特别是对多个视图类操作融合的场景进行递归推导,从而将非连续场景逐个提取出来,并确定与每个非连续场景匹配的操作,最后依次执行所确定的操作从而完成转换过程,可以有效提升非连续张量转换为连续张量的转换效率,减小对设备硬件的依赖,提高转换性能。
第一方面,本申请提供了一种数据处理方法,所述方法由数据处理系统执行,所述系统包括处理器和计算核,该方法包括:所述处理器获取第一数据的元数据和第二数据的元数据,所述第二数据由所述第一数据经过第一操作集得到,所述第一操作集包括至少两个第一操作,所述第二数据中每一行相邻位置的元素对应的内存地址是非连续的;所述处理器根据所述第二数据的元数据对所述第一操作集进行识别,确定所述第一操作集中的每个第一操作;所述处理器确定与所述第一操作集匹配的第二操作集,所述第一操作集中的每一个第一操作在所述第二操作集中存在与之匹配的第二操作;所述计算核根据所述第一数据和所述第二操作集 得到第三数据,所述第三数据中的每一行相邻位置的元素对应的内存地址是连续的。
可选的,第一数据可以是源张量,第二数据可以是源张量经过第一操作集后得到的非连续张量,第三数据可以是源张量经过第二操作集后得到的连续张量,第一操作集可以是非连续类框架算子,例如transpose、narrow、expand等。
在本申请实施例中,处理器通过对非连续张量的元数据进行分析,对非连续张量的产生场景进行递归推导,从而确定对源张量所执行的一系列操作,进而确定与该一系列操作匹配的多个张量加速引擎(tensor boost engine,TBE)算子,由计算核依次对源张量执行该多个TBE算子完成转连续过程,这样,可以提升转换效率,减小对芯片的AI CPU的性能依赖,有效提高转换性能。
结合第一方面,在第一方面一种可能的实现方式中,所述处理器按照预设优先级对所述第一操作集中包括的第一操作依次进行识别;所述处理器根据每一次识别出的第一操作,确定与所述第一操作对应的特征场景,并将所述特征场景依次放入场景信息栈中。
在本申请实施例中,处理器在对第一操作集中的第一操作进行推导和识别时,按照预设优先级顺序,例如可以先识别转置操作,再识别切片操作,最后识别形变操作,对第一操作集所包含的第一操作进行一一识别,这样可以有效减小多个第一操作之间的融合干扰,提高识别的准确性和识别效率。
结合第一方面,在第一方面一种可能的实现方式中,所述处理器判断所述第二数据的元数据与待识别的第一操作的至少一个特征信息是否匹配,并在匹配的情况下,所述处理器确定所述待识别的第一操作,其中,所述第二数据的元数据包括所述第二数据的形状shape、步幅stride和内存偏移量storage_offset。
在本申请提供的方案中,处理器通过比对第二数据的元数据与待识别的第一操作的特征信息进行识别,不需要进行严格的一一对应,仅需要匹配其中的一个或部分特征信息即可确定待识别的第一操作,这样可以有效减小多个第一操作之间的融合干扰,提高识别效率。
结合第一方面,在第一方面一种可能的实现方式中,所述处理器遍历算子信息库,所述算子信息库包括多个张量加速引擎TBE算子;所述处理器针对每一个从所述第一操作集中识别出的第一操作,将所述算子信息库中与所述第一操作特征相同的算子确定为与所述第一操作匹配的第二操作,并将所述第二操作依次放入算子信息栈中。
在本申请提供的方案中,处理器在确定对源张量所执行的一系列第一操作之后,针对每一个第一操作,可以进一步在当前算子信息库中进行寻找,当查找到某个算子的元数据所描述的特征与该操作所对应的特征相同时,可以确定该算子与该操作类型相同,即该算子为与该操作匹配的TBE算子,从而得到第二操作集。
结合第一方面,在第一方面一种可能的实现方式中,所述处理器将转换命令下发至所述计算核,所述转换命令包括所述第二操作集,所述转换命令用于指示所述计算核根据所述第二操作集对所述第一数据执行运算得到所述第三数据。
在本申请提供的方案中,处理器在算子信息库中查找到与第一操作集匹配的多个TBE算子后,通知计算核对源张量执行该多个TBE算子,从而得到连续张量,该张量中每一行相邻位置的元素对应的索引内存是连续的,这样可以不必依赖于AI CPU就能完成张量转换过程,减小了对芯片硬件的依赖。
结合第一方面,在第一方面一种可能的实现方式中,所述处理器构建第四数据,所述第四数据与所述第一数据的元数据相同,所述第四数据与所述第一数据共享同一内存;所述计算核对所述第四数据依次执行所述第二操作集中的第二操作,得到所述第三数据。
在本申请提供的方案中,计算核在进行计算之前,处理器根据确定的TBE算子可以获取到该算子所需要的入参信息,该入参信息中包括输入张量,该输入张量可以采用共享内存的方式构建的临时连续张量,这个临时连续张量的元数据与源张量的元数据相同,在完成临时连续张量构建后,计算核方可以执行相应的运算,从而保证计算核能够正确的执行相应的TBE算子,完成张量转换过程。
结合第一方面,在第一方面一种可能的实现方式中,所述第一操作集包括转置transpose算子、切片narrow算子、扩张expand算子。
结合第一方面,在第一方面一种可能的实现方式中,所述系统包括主机和芯片,所述处理器位于所述主机中,所述计算核位于所述芯片中。
结合第一方面,在第一方面一种可能的实现方式中,所述芯片包括神经网络处理器NPU、图形处理器GPU、张量处理单元TPU、深度学习处理器DPU中的至少一个。
在本申请提供的方案中,对组合类非连续张量的产生场景进行递归推导的过程可以由数据处理系统中的主机完成,也可以由数据处理系统中的芯片完成,不管是通过主机完成场景递归推导还是通过芯片完成场景递归推导,最终都将由计算核执行TBE算子完成非连续张量转换为连续张量,从而减少数据拷贝和对AI CPU的硬件依赖,提升转换效率和转换性能。
第二方面,本申请提供了一种数据处理系统,该系统包括:处理器和计算核,
所述处理器,用于获取第一数据的元数据和第二数据的元数据,所述第二数据由所述第一数据经过第一操作集得到,所述第一操作集包括至少两个第一操作,所述第二数据中每一行相邻位置的元素对应的内存地址是非连续的;根据所述第二数据的元数据对所述第一操作集进行识别,确定所述第一操作集中的每个第一操作;确定与所述第一操作集匹配的第二操作集,所述第一操作集中的每一个第一操作在所述第二操作集中存在与之匹配的第二操作;
所述计算核,用于根据所述第一数据和所述第二操作集得到第三数据,所述第三数据中每一行相邻位置的元素对应的内存地址是连续的。
应理解,该芯片可以同时包括多个处理器和计算核,它们可以并行执行各自的任务,互不影响和干涉,本申请对芯片的处理器和计算核数量不作限定。
结合第二方面,在第二方面一种可能的实现方式中,所述处理器,具体用于:按照预设优先级对所述第一操作集中包括的第一操作依次进行识别;根据每一次识别出的第一操作,确定与所述第一操作对应的特征场景,并将所述特征场景依次放入场景信息栈中。
结合第二方面,在第二方面一种可能的实现方式中,所述处理器,具体用于:判断所述第二数据的元数据与待识别的第一操作的至少一个特征信息是否匹配,并在匹配的情况下,确定所述待识别的第一操作,其中,所述第二数据的元数据包括所述第二数据的形状shape、步幅stride和内存偏移量storage_offset。
结合第二方面,在第二方面一种可能的实现方式中,所述处理器,具体用于:遍历算子信息库,所述算子信息库包括多个TBE算子;针对每一个从所述第一操作集中识别出的第一操作,将所述算子信息库中与所述第一操作特征相同的算子确定为与所述第一操作匹配的第二操作,并将所述第二操作依次放入算子信息栈中。
结合第二方面,在第二方面一种可能的实现方式中,所述处理器,还用于将转换命令下发至所述计算核,所述转换命令包括所述第二操作集,所述转换命令用于指示所述计算核根据所述第二操作集对所述第一数据执行运算得到所述第三数据。
结合第二方面,在第二方面一种可能的实现方式中,所述处理器,还用于构建第四数据,所述第四数据与所述第一数据的元数据相同,所述第四数据与所述第一数据共享同一内存; 所述计算核,还用于对所述第四数据执行所述第二操作集中的第二操作,得到所述第三数据。
With reference to the second aspect, in a possible implementation of the second aspect, the first operation set includes a transpose operator, a narrow (slice) operator, and an expand operator.
With reference to the second aspect, in a possible implementation of the second aspect, the processor is located in a host of the system, and the compute core is located in a chip of the system.
With reference to the second aspect, in a possible implementation of the second aspect, the chip includes at least one of an NPU, a GPU, a TPU, or a DPU.
According to a third aspect, this application provides a chip, including a processor and a compute core.
The processor is configured to: obtain metadata of first data and metadata of second data, where the second data is obtained from the first data through a first operation set, the first operation set includes at least two first operations, and memory addresses corresponding to elements at adjacent positions in each row of the second data are non-contiguous; identify the first operation set based on the metadata of the second data, and determine each first operation in the first operation set; and determine a second operation set matching the first operation set, where each first operation in the first operation set has a matching second operation in the second operation set.
The compute core is configured to obtain third data based on the first data and the second operation set, where memory addresses corresponding to elements at adjacent positions in each row of the third data are contiguous.
According to a fourth aspect, this application provides a computing device, including the data processing system provided in any one of the implementations of the first aspect.
According to a fifth aspect, this application provides a computer storage medium. The computer storage medium stores a computer program that, when executed by a processor, implements the method provided in the first aspect or any one of its implementations.
According to a sixth aspect, this application provides a computer program product. The computer program includes instructions that, when the computer program is executed by a computer, enable the computer to perform the method provided in the first aspect or any one of its implementations.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Clearly, the accompanying drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of tensor conversion according to an embodiment of this application;
FIG. 2 is a schematic diagram of a system structure according to an embodiment of this application;
FIG. 3 is a schematic diagram of splitting a combined non-contiguous scenario according to an embodiment of this application;
FIG. 4 is a schematic diagram of converting a combined non-contiguous tensor to a contiguous tensor according to an embodiment of this application;
FIG. 5 is a schematic flowchart of a data processing method according to an embodiment of this application;
FIG. 6 is a schematic diagram of a first operation set according to an embodiment of this application;
FIG. 7 is a schematic diagram of input parameter information of a transpose operator according to an embodiment of this application;
FIG. 8 is a schematic diagram of identifying and extracting combined non-contiguous scenarios according to an embodiment of this application;
FIG. 9 is a schematic diagram of another tensor conversion process according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of a chip according to an embodiment of this application.
DETAILED DESCRIPTION OF EMBODIMENTS
The following clearly and completely describes the technical solutions in the embodiments of this application with reference to the accompanying drawings. Clearly, the described embodiments are only some rather than all of the embodiments of this application.
First, some terms and related technologies involved in this application are explained with reference to the accompanying drawings, to facilitate understanding by a person skilled in the art.
Metadata is data that describes actual data and is used to describe attribute information of the actual data; it may be, for example, the file name of the actual data or a pointer to the storage address of the actual data. For example, the metadata of a tensor may describe characteristics of the tensor such as its shape, number of dimensions, and format. Metadata may also have a corresponding identifier used to identify that metadata; the metadata and its identifier may form a key-value pair, where each pair includes a key and the value corresponding to that key, the value being the metadata itself and the key identifying the value.
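For illustration only (not part of the claimed embodiment), the following PyTorch snippet shows the three metadata fields used throughout this description (shape, stride, storage_offset) expressed as key-value pairs:

```python
import torch

src = torch.arange(12).reshape(3, 4)    # contiguous source tensor
view = src.transpose(0, 1)              # view-class op: shares src's memory

# The view tensor's metadata, expressed as key-value pairs
meta = {
    "shape": tuple(view.shape),               # (4, 3)
    "stride": view.stride(),                  # (1, 4): non-monotonic
    "storage_offset": view.storage_offset(),  # 0
}
print(meta, view.is_contiguous())             # ... False
```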
A host, which may also be called a client, is a computer system that is connected to hard disks, hard disk subsystems, or file servers and that can store data and perform I/O access. It may specifically include a physical machine, a virtual machine, a container, or the like, and is used to communicate with a device and perform data processing; examples include an application server, a multiprocessor machine, a workstation, and a personal computer.
A device is a processing chip integrating modules such as multiply-accumulate, activation functions, two-dimensional data operations, and decompression. It can accelerate neural network operations and effectively improve their efficiency; examples include an NPU, a GPU, a TPU, and a DPU. A processing chip may contain multiple processors and compute cores, which can execute their respective tasks in parallel.
At present, to reduce the performance overhead caused by explicit tensor data copies, the PyTorch framework supports view-class framework operations on a source tensor to obtain a view tensor; the elements of the source tensor and of the view tensor share the same memory. In actual data processing, however, a non-contiguous tensor usually needs to be converted into a contiguous tensor for further processing. For example, the compute unified device architecture (CUDA) computes the memory address of each element of the non-contiguous tensor; relying on load and store instructions, a processing chip (for example, a GPU) can access an element at an arbitrary memory location and store it into a designated contiguous memory region, ensuring that when the elements of the non-contiguous tensor are laid out in row-major order, the corresponding memory addresses are contiguous, thereby completing the conversion of the non-contiguous tensor. However, many current processing chips cannot perform efficient data copies following this data migration logic; for example, an NPU cannot convert a non-contiguous tensor into a contiguous tensor in the above manner. For such processing chips, the conversion to contiguity usually needs to be completed with the help of the host. As shown in FIG. 1, a host 110 and a device 120 are directly connected through a network or a peripheral component interconnect express (PCIe) interface; the host 110 may be a server, and the device 120 may be an NPU accelerator card inserted in the server. First, the host 110 delivers a stream synchronization instruction to the device 120, which blocks the execution of all tasks on the current stream. After receiving the instruction, the device 120 copies the non-contiguous tensor to the host 110. The host 110 computes the memory address of each element based on the information of the current non-contiguous tensor and then copies each element to a designated memory region using the CPU's load/store instructions. After the copy is complete, the non-contiguous tensor has been converted into a contiguous tensor, which is then copied back to the device 120. Finally, the host 110 delivers an end-stream-synchronization instruction to the device 120 to release the related resources. It can be seen that converting the non-contiguous tensor blocks the normal execution of other tasks; moreover, the tensor data needs to be copied back and forth between the host and the device, the copy efficiency is low, the performance cost to the host is high, and the overall conversion performance is poor.
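As a rough illustration of the generic per-element address computation described above (a sketch only; the actual CUDA/CPU implementations operate on raw memory with load/store instructions), the flat address of each view element can be recomputed from the view metadata and gathered into freshly allocated contiguous storage:

```python
import torch

def gather_contiguous(storage: torch.Tensor, shape, stride, storage_offset):
    """Recompute each element's flat address from view metadata and copy it
    into newly allocated row-major (contiguous) memory."""
    flat_idx = torch.zeros(shape, dtype=torch.long)
    for dim, (n, s) in enumerate(zip(shape, stride)):
        r = torch.arange(n).reshape(
            [n if d == dim else 1 for d in range(len(shape))])
        flat_idx = flat_idx + r * s                 # broadcast per-axis offsets
    return storage[storage_offset + flat_idx.reshape(-1)].reshape(shape)

src = torch.arange(12)                        # 1-D storage
view = src.reshape(3, 4).transpose(0, 1)      # non-contiguous view
out = gather_contiguous(src, view.shape, view.stride(), view.storage_offset())
assert torch.equal(out, view) and out.is_contiguous()
```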
In addition, a non-contiguous tensor may also result from applying multiple view-class framework operations to a source tensor. These operations overlap and interfere with one another, which makes it harder to identify and derive the scenario that produced such a non-contiguous tensor. In that case, only the processor in the host or the device can be used to perform the conversion to contiguity, the capability of the compute cores in the device cannot be fully utilized, and the overall conversion efficiency is low.
On this basis, this application provides a data processing method. A processor is used to derive the scenario that produced the non-contiguous tensor, in particular to recursively derive a combined non-contiguous scenario in which multiple view-class operations are fused, and to determine the operation set matching that combined non-contiguous scenario. An AI Core then executes the operation set in sequence to re-copy the data, thereby converting the non-contiguous tensor into a contiguous tensor, effectively improving conversion efficiency, reducing dependence on device hardware, especially on the AI CPU, and improving conversion performance.
The technical solutions of the embodiments of this application can be applied to any system that needs to convert non-contiguous tensors, and are particularly suitable for scenarios involving combined non-contiguous tensors with low dependence on the AI CPU.
Refer to FIG. 2, which is a schematic diagram of a system structure provided by this application. As shown in FIG. 2, the system includes a host 210 and a chip 220. The host 210 may include a hardware layer and a software layer: the software layer includes a guest operating system 2110 and a task scheduler 2120, and the hardware layer includes one or more processors, memory, and other hardware. The chip 220 may be at least one of an NPU, a GPU, a TPU, or a DPU. It also includes a hardware layer and a software layer: the hardware layer includes one or more processors (such as an AI CPU 2220), one or more compute cores (AI Core 2230), memory, and other hardware, and the software layer includes various processing units (for example, an I/O processing unit 2210) to handle the procedures related to converting a non-contiguous tensor into a contiguous tensor. The host 210 and the chip 220 may be connected through an interface. In some embodiments, the chip 220 may be located on a different device from the host 210; in other embodiments, the chip 220 may be mounted on the host 210 as a plug-in card.
The host 210 is configured to cooperate with the chip to complete the conversion of non-contiguous tensors. The processor 2130 may be a CPU, or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 2140 may be used to store the operator information library. It may include read-only memory and random access memory, may be volatile or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically EPROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
In a possible implementation, the task scheduler 2120 in the host 210 sends a tensor conversion task to the processor 2130. The processor 2130 extracts the metadata of the current non-contiguous tensor from the memory 2140, including information such as the tensor's shape, stride, and memory offset, then analyzes this metadata and performs derivation in a preset priority order. For example, it may first determine whether the stride information is non-monotonic; if it is non-monotonic, the processor determines that a transpose scenario exists, extracts that scenario, pushes it into the scenario information stack, and determines the first operation (that is, the transpose operation) corresponding to the transpose scenario. The processor 2130 then traverses the operator information library in the memory 2140, finds the TBE operator of the same type as this first operation, obtains the information required by that TBE operator, and pushes it into the operator information stack. After that, the residual scenario is identified and derived again, and the derived non-contiguous scenarios and TBE operators are pushed into the scenario information stack and the operator information stack in turn, until the residual scenario can no longer be split. The derivation process is shown in FIG. 3: the source tensor passes through n view-class operations to produce a non-contiguous view tensor, which is then identified step by step in the preset priority order; each step identifies one view-class operation, extracts the characteristic scenario, matches the corresponding TBE operator, and pushes them into the scenario information stack and the operator information stack respectively. Finally, the processor 2130 delivers an instruction to the chip 220 through the PCIe interface; in the chip 220, a temporary contiguous tensor is constructed by sharing the memory of the non-contiguous tensor, and the TBE operators in the operator information stack are delivered to the AI Core 2230. The AI Core 2230 schedules the TBE operators in the operator information stack in sequence to operate on the temporary contiguous tensor, re-copying each element of the non-contiguous tensor in the memory 2240 to a designated region to obtain a contiguous tensor, in which the memory addresses of two adjacent elements in each row are contiguous in the memory 2240. The specific conversion process is shown in FIG. 4: after the temporary contiguous tensor is constructed, the first single-operator non-contiguous scenario is popped from the scenario information stack, the first non-contiguous scenario is reconstructed, and the corresponding TBE operator is called from the operator information stack to perform the conversion, until both the scenario information stack and the operator information stack are empty, at which point reconstruction stops; the resulting tensor is the target contiguous view tensor.
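The stack-driven control flow just described can be sketched as follows (illustrative Python; recognize_one and operator_library are assumptions standing in for the scenario-recognition logic and the TBE operator records, which this sketch does not implement):

```python
def derive(view_meta, source_meta, recognize_one, operator_library):
    """Recursively peel recognized scenarios off a combined non-contiguous
    view until the residual scenario can no longer be split.

    recognize_one(meta) -> (scene_name, residual_meta) or None   (assumed)
    operator_library: maps scene_name -> matching TBE operator record
    """
    scene_stack, op_stack = [], []
    residual = view_meta
    while residual != source_meta:
        recognized = recognize_one(residual)
        if recognized is None:                    # residual no longer splittable
            break
        scene, residual = recognized
        scene_stack.append(scene)                 # scenario information stack
        op_stack.append(operator_library[scene])  # operator information stack
    return scene_stack, op_stack
```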
It can be seen that, during tensor conversion, the known non-contiguous tensor and source tensor are used to recursively derive the scenario that produced the non-contiguous tensor, so that the operation corresponding to each non-contiguous scenario can be derived. Based on each operation, the matching TBE operator can be mapped from the operator information library and pushed into the operator information stack. Finally, the AI Core executes the TBE operators in the operator information stack to generate the contiguous tensor. This does not depend on the performance of hardware such as the AI CPU, and can effectively improve conversion efficiency and performance.
With reference to the schematic diagram of the chip shown in FIG. 2, the data processing method provided by the embodiments of this application is described below. Refer to FIG. 5, which is a schematic flowchart of a data processing method provided by an embodiment of this application; the method can be applied to the data processing system shown in FIG. 2. The method may specifically include the following steps.
S501: The processor obtains the metadata of the first data and the metadata of the second data.
Specifically, in this embodiment of this application, the first data may be source tensor data. A source tensor is an n-dimensional data structure whose concrete forms include scalars, vectors, matrices, and so on; for example, a 0-dimensional tensor is a scalar. The metadata of the source tensor is the data describing it, including the tensor's shape, stride, storage_offset, and so on. Correspondingly, the second data may be a non-contiguous tensor: when its elements are laid out in row-major order, the memory addresses corresponding to elements at adjacent positions in each row are non-contiguous.
In addition, the processor may be the processor 2130 in the host shown in FIG. 2, in which case the scenario derivation in step S502 and the operator mapping and matching in step S503 are completed by the host; or the processor may be the AI CPU 2220 shown in FIG. 2, in which case the scenario derivation in step S502 and the operator mapping and matching in step S503 are completed by the chip. This application does not limit this.
It should be noted that the second data is obtained from the first data through the first operation set, that is, the above non-contiguous tensor is obtained by performing a series of first operations on the source tensor. A first operation is a non-contiguity-producing operation; for example, under the PyTorch framework, a first operation is an operation corresponding to a non-contiguous view-class framework operator applied to the source tensor. Optionally, the non-contiguous view-class framework operators include transpose, narrow, expand, and the like.
By way of example, take a first operation set that includes a slice operation, a reshape operation, and a transpose operation. As shown in FIG. 6, the source tensor 610 includes view information and source information, which are identical: the shape of both is {8, 451143, 4} and the stride of both is {1804572, 4, 1}; this source tensor is a contiguous tensor. The non-contiguity operation unit 620 now applies the first operation set to the source tensor 610, that is, performs the slice, reshape, and transpose operations on it in sequence, producing a non-contiguous view tensor 630. It can be seen that the view information of the non-contiguous view tensor 630 is inconsistent with its source information, and that compared with the source tensor 610, the shape and stride in the view information of the non-contiguous view tensor 630 have changed significantly.
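A smaller PyTorch analogue of such a chain of view-class operations (shapes chosen for readability, not the FIG. 6 shapes) shows how the view metadata diverges from the source while the memory stays shared:

```python
import torch

src = torch.arange(24).reshape(2, 3, 4)   # contiguous source tensor
v1 = src.transpose(0, 1)                  # transpose: permutes the stride
v2 = v1.narrow(2, 0, 2)                   # narrow (slice): shrinks one axis

print(src.stride(), v2.shape, v2.stride())  # (12, 4, 1) vs (3, 2, 2) / (4, 12, 1)
print(v2.is_contiguous())                   # False: a combined view
print(v2.data_ptr() == src.data_ptr())      # True: memory is still shared
```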
It should be understood that, when the combined non-contiguous view-class framework operations are performed on the source tensor to obtain the view tensor, only the metadata of the source tensor is changed; the source tensor and the view tensor still share the same memory, that is, the elements of the source tensor and of the view tensor are the same and occupy the same memory.
S502: The processor identifies the first operation set based on the metadata of the second data, and determines each first operation in the first operation set.
Specifically, after obtaining the metadata of the source tensor and of the non-contiguous tensor, the processor analyzes the individual characteristics of the non-contiguous tensor (such as the above shape, stride, and storage_offset) to determine the series of non-contiguity operations performed on the source tensor.
In a possible implementation, the processor identifies the first operations included in the first operation set in sequence according to preset priorities; based on each identified first operation, the processor determines the characteristic scenario corresponding to that first operation and pushes the characteristic scenarios into the scenario information stack in sequence.
Specifically, because combined view-class operations overlap and interfere with one another, to reduce the mutual influence among individual view-class operations and improve the accuracy of identification and derivation, the processor can perform characteristic-scenario identification in a preset priority order. For example, if the transpose operation has the first priority, the slice operation the second priority, and the reshape operation the third priority, the processor will identify the transpose scenario first, then the slice scenario, and finally the reshape scenario.
It is worth noting that when identifying each scenario, the processor needs to relax the judgment conditions for that scenario: not all of the scenario's judgment conditions need to be satisfied, only some of them. For example, if a characteristic scenario has three judgment conditions and the metadata of the non-contiguous tensor satisfies one of them, the existence of that characteristic scenario can be determined. It is easy to understand that appropriately relaxing the judgment conditions can further exclude the interference of scenario fusion and improve the efficiency and accuracy of scenario identification.
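A minimal sketch of this relaxed matching in Python (the metadata values and the three conditions below are illustrative assumptions, not the embodiment's actual judgment rules):

```python
def scene_matches(metadata, conditions):
    # Relaxed judgment: the scenario is confirmed as soon as ANY of its
    # characteristic conditions holds, rather than requiring all of them.
    return any(condition(metadata) for condition in conditions)

# Three hypothetical judgment conditions for some scenario; the scenario
# is confirmed because the first condition already matches.
meta = {"shape": (4, 3), "stride": (1, 4), "storage_offset": 0}
conditions = [
    lambda m: any(a < b for a, b in zip(m["stride"], m["stride"][1:])),
    lambda m: m["storage_offset"] > 0,
    lambda m: len(m["shape"]) > 3,
]
print(scene_matches(meta, conditions))  # True
```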
By way of example, as shown in FIG. 6 above, given the non-contiguous view tensor 630 and the source tensor 610, the processor needs to derive the non-contiguity operation unit 620. By analyzing the stride information of the non-contiguous view tensor 630 and finding that it is non-monotonic, the processor can determine that a transpose scenario and a transpose operation exist.
Similarly, other non-contiguity scenarios and operations can be derived with the same logic. For example, when the non-contiguity operation is narrow, the processor analyzes the shape information and finds that the number of elements along certain axes has decreased, so the processor can determine that a slice scenario and a slice operation exist.
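These two checks can be written down directly from the metadata (a simplification; the embodiment combines several characteristics per scenario, and the sample dicts below are illustrative):

```python
def looks_like_transpose(view_meta):
    # Transpose scenario: the stride sequence is no longer monotonically
    # non-increasing, i.e. some pair of axes was permuted.
    s = view_meta["stride"]
    return any(a < b for a, b in zip(s, s[1:]))

def looks_like_narrow(view_meta, source_meta):
    # Slice (narrow) scenario: some axis of the view is shorter than the
    # corresponding axis of the source.
    return any(v < s for v, s in zip(view_meta["shape"], source_meta["shape"]))

view = {"shape": (3, 2, 2), "stride": (4, 12, 1), "storage_offset": 0}
src = {"shape": (3, 2, 4), "stride": (4, 12, 1), "storage_offset": 0}
print(looks_like_transpose(view), looks_like_narrow(view, src))  # True True
```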
After the processor completes scenario identification, it needs to extract the characteristic scenario, that is, to split a combined non-contiguous scenario into two parts: the first part is the constructed recognized scenario, and the other part is the residual scenario. The residual scenario may still be a combined non-contiguous scenario; if so, identification continues until the residual scenario can no longer be split. A scenario information stack is constructed, and each extracted characteristic scenario is pushed into it in turn; each characteristic scenario corresponds to one first operation, that is, to one non-contiguous view operation.
S503: The processor determines the second operation set matching the first operation set.
Specifically, the processor recursively derives the scenarios that produced the non-contiguous tensor. Each time a non-contiguity operation is identified and its characteristic scenario extracted, the processor needs to further determine the second operation matching that operation and scenario, that is, to determine whether there is a tensor boost engine (TBE) operator whose characteristics are the same as those corresponding to the non-contiguity operation and characteristic scenario.
It should be understood that TBE operators are written in the TBE language and can be directly invoked and executed by the compute core AI Core to generate a contiguous tensor. In this embodiment of this application, each non-contiguity operation corresponds to one TBE operator; for example, the transpose operation corresponds to the transpose operator, the slice operation to the slice operator, and the reshape operation to the broadcast operator.
The processor traverses the current operator information library to find the TBE operator matching the non-contiguity operation. For example, in the scenario shown in FIG. 6 above, the first operation set includes a transpose operation. The characteristics of the transpose operator are: the tensor's stride information is non-monotonic; the shape and stride information of the specified axes of the tensor are permuted; and the tensor's storage_offset is 0 and unchanged. The stride information of the non-contiguous view tensor is likewise non-monotonic; therefore, the processor can determine that the TBE operator matching the transpose operation is transpose, and directly determines the transpose operator in the operator information library as the operator matching the non-contiguity operation (that is, the transpose operation).
Further, after determining the TBE operator matching the non-contiguity operation, the processor can obtain the input parameter information required by that TBE operator. By way of example, FIG. 7 is a schematic diagram of the input parameter information required by the transpose operator: its input parameters include the input tensor, the result tensor, and the transposed axis information, where the input tensor is a temporary contiguous tensor constructed by sharing memory, the result tensor receives the operator's execution result and is an empty contiguous tensor newly created from the view information of the non-contiguous tensor, and the transposed axis information is the same as the transpose axis information corresponding to the derived transpose operation.
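The parameter bundle of FIG. 7 might be modeled as follows (a sketch; the field names are illustrative and are not the actual TBE operator interface):

```python
from dataclasses import dataclass
from typing import Tuple
import torch

@dataclass
class TransposeOperatorArgs:
    input_tensor: torch.Tensor    # temp contiguous tensor sharing source memory
    result_tensor: torch.Tensor   # empty contiguous tensor shaped like the view
    perm: Tuple[int, ...]         # axis permutation recovered by derivation

def build_transpose_args(temp_src, view_shape, perm):
    # The result tensor is newly allocated from the view information.
    result = torch.empty(view_shape, dtype=temp_src.dtype)
    return TransposeOperatorArgs(temp_src, result, tuple(perm))
```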
After the processor determines the second operation corresponding to each first operation and finds the matching TBE operator, the processor constructs the operator information stack and pushes the derived TBE operators into it in sequence.
It should be noted that if, after determining a second operation, the processor does not find a matching TBE operator in the operator information library, developers can write a TBE operator matching the second operation and add it to the operator information library. Changing the operator implementation at the software level in this way can effectively extend the applicable scenarios, improve conversion flexibility, give full play to the performance of the AI Core, and remove the hardware dependence on the AI CPU.
To further explain how combined non-contiguous scenarios are identified and characteristic scenarios extracted, refer to FIG. 8. As shown in FIG. 8, the source tensor 810 undergoes a slice operation to produce the non-contiguous tensor 820, then a reshape operation to produce the non-contiguous tensor 830, and finally a transpose operation to produce the non-contiguous tensor 840. For the non-contiguous tensor 840, its view information and source information are compared and analyzed, and the transpose scenario is identified first; a first round of iterative derivation then constructs, from the non-contiguous tensor 840, a transpose non-contiguous tensor 850 and a residual non-contiguous tensor 860. The transpose non-contiguous tensor 850 is pushed into the scenario information stack, its matching TBE operator is found in the operator information library, and that operator is pushed into the operator information stack. Next, the slice operation is identified and a second round of iterative derivation is performed: from the residual non-contiguous tensor 860, a slice non-contiguous tensor 870 and a residual non-contiguous tensor 880 are constructed; the slice non-contiguous tensor 870 is pushed into the scenario information stack, and its matching TBE operator is found in the operator information library and pushed into the operator information stack. Since the residual non-contiguous tensor 880 is not a combined non-contiguous tensor but a reshape non-contiguous tensor, the iteration stops and the derivation ends.
It should be noted that the above scenario construction follows a fixed scenario refresh strategy. For example, the view information of the transpose non-contiguous tensor 850 is the same as the view information of the non-contiguous tensor 840; the view information of the residual non-contiguous tensor 860 is the same as the source information of the transpose non-contiguous tensor 850; and the source information of the residual non-contiguous tensor 860 is the same as the source information of the non-contiguous tensor 840. The scenario information refresh strategy is the same in every iteration.
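The refresh strategy can be written down compactly (a sketch; the SceneMeta container and the recognized scene's source information are assumptions of this illustration):

```python
from dataclasses import dataclass

@dataclass
class SceneMeta:
    view: dict      # shape / stride / storage_offset seen by the consumer
    source: dict    # shape / stride / storage_offset of the underlying memory

def refresh(combined: SceneMeta, recognized_source: dict):
    # Fixed refresh rule from the text: the recognized scene inherits the
    # combined view info; the residual's view equals the recognized scene's
    # source info; the residual inherits the combined source info.
    recognized = SceneMeta(view=combined.view, source=recognized_source)
    residual = SceneMeta(view=recognized_source, source=combined.source)
    return recognized, residual
```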
S504: The compute core obtains the third data based on the first data and the second operation set.
Specifically, after pushing the TBE operators into the operator information stack, the processor delivers the operator information stack to the compute core AI Core, and the compute core AI Core obtains the contiguous tensor by executing the TBE operators in the operator information stack in sequence.
It should be understood that TBE operators exist in the operator information library in the form of files, which record the operators' input parameter information. The processor sends the file corresponding to a TBE operator to the compute core AI Core, and the compute core AI Core executes that file to output the contiguous tensor.
In a possible implementation, before delivering the TBE operators to the compute core AI Core, the processor constructs a temporary contiguous tensor by sharing memory with the non-contiguous view tensor. The metadata of this temporary contiguous tensor is the same as that of the source tensor, and it shares the same memory with the source tensor; that is, the temporary contiguous tensor can be understood as a restoration of the source tensor. Of course, the temporary contiguous tensor can also be constructed in other ways, which this application does not limit.
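One way to picture this construction in PyTorch (a sketch under the assumptions that the source was contiguous and started at storage offset 0; as_strided here stands in for the embodiment's shared-memory construction):

```python
import torch

def row_major_strides(shape):
    strides, acc = [], 1
    for n in reversed(shape):
        strides.append(acc)
        acc *= n
    return tuple(reversed(strides))

def temp_contiguous_from_view(view, src_shape):
    # Alias the view's storage with the source tensor's (contiguous)
    # metadata: same memory, restored row-major layout.
    return view.as_strided(src_shape, row_major_strides(src_shape), 0)

src = torch.arange(24).reshape(2, 3, 4)
view = src.transpose(0, 1).narrow(2, 0, 2)      # combined non-contiguous view
restored = temp_contiguous_from_view(view, (2, 3, 4))
assert torch.equal(restored, src) and restored.data_ptr() == src.data_ptr()
```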
By way of example, in the scenario shown in FIG. 6 above, the processor constructs a temporary contiguous tensor as the input tensor 910 based on the non-contiguous view tensor 630; as shown in FIG. 9, the input tensor 910 corresponds to the non-contiguous view tensor 630. First, the reshape scenario is popped from the scenario information stack and the corresponding TBE operator, for example the broadcast operator 920, is called from the operator information stack to compute on the input tensor 910; the result is a contiguous tensor. Then the slice scenario is popped from the scenario information stack and the corresponding slice operator 930 is called from the operator information stack to compute on the first result; this result is likewise a contiguous tensor. Finally, the transpose scenario is popped from the scenario information stack and the corresponding transpose operator 940 is called from the operator information stack to compute on the second result, yielding the output tensor 950. The output tensor 950 is a contiguous tensor, and may also be called contiguous tensor 950: when its elements are laid out in row-major order, the memory addresses corresponding to elements at adjacent positions in each row are contiguous.
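The replay of FIG. 9 is, in effect, a pop-and-execute loop over the two stacks. The following sketch uses plain Python callables as toy stand-ins for the dispatched broadcast/slice/transpose kernels (illustrative only; shapes and operators are assumptions):

```python
import torch

def replay(temp_src, scene_stack, op_stack):
    # Pop scenarios and matched operators in LIFO order; every
    # intermediate result is itself a contiguous tensor.
    out = temp_src
    while scene_stack:
        scene_stack.pop()
        op = op_stack.pop()
        out = op(out)
    return out

scenes = ["transpose", "slice", "reshape"]          # push order of derivation
ops = [lambda t: t.permute(1, 0).contiguous(),      # matched transpose
       lambda t: t[:, :2].contiguous(),             # matched slice
       lambda t: t.reshape(4, 6).contiguous()]      # matched reshape
result = replay(torch.arange(24).reshape(2, 12), scenes, ops)
print(result.shape, result.is_contiguous())         # torch.Size([2, 4]) True
```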
It is worth noting that when executing the TBE operators, the compute core AI Core determines a new memory region in main memory and migrates the elements of the source tensor into it in order, following the memory read pattern determined by the contiguous tensor, thereby ensuring that when the contiguous tensor is laid out in row-major order, the memory addresses of adjacent elements are contiguous.
It can be seen that implementing this embodiment of the present invention does not require changing the semantics of the view-class framework operators of the PyTorch framework itself. The non-contiguity operations are determined by recursively deriving the scenarios that produced the non-contiguous tensor, the matching TBE operators are then determined, and finally the compute core AI Core executes the TBE operators to generate a contiguous tensor whose elements are contiguously distributed in memory. This does not depend on AI CPU hardware performance, improves conversion efficiency and performance, is more flexible at the software level, is easy to extend, and can give full play to the performance of the compute core AI Core.
It should be understood that the above method is not limited to the PyTorch framework. For any other AI framework with non-contiguity operations such as transpose and slice, the method provided by the present invention can be used to reverse-derive the scenario and, based on the derivation result, complete the conversion of the non-contiguous tensor, especially of combined non-contiguous tensors.
It should also be understood that the above uses a first operation set including a slice operation, a reshape operation, and a transpose operation merely as an example to explain how recursive derivation of combined scenarios is performed and how the tensor is converted to contiguity based on the derivation result; for first operation sets including other non-contiguity operations, recursive derivation and tensor conversion can be performed in the same way.
The data processing method provided in this application can be widely applied to PyTorch model training and inference scenarios, and can significantly improve the efficiency of model training and inference, reduce training time, and accelerate model training. It can be understood that if model training involves converting non-contiguous tensors into contiguous tensors, in this application the AI Core executes the TBE operators to perform the memory copy and realize the conversion to contiguity. Compared with having the host perform the memory copy, this reduces the latency of copying data back and forth and improves conversion efficiency, thereby effectively improving model training and inference efficiency and producing great commercial value.
The method of the embodiments of this application has been described in detail above. To facilitate better implementation of the above solutions of the embodiments of this application, related devices for implementing the above solutions are correspondingly provided below.
Refer to FIG. 10, which is a schematic structural diagram of a chip according to an embodiment of this application. As shown in FIG. 10, the chip 10 includes a processor 11 and a compute core 12. The chip 10 may be an NPU chip, a GPU chip, a TPU chip, or another AI chip, and may contain multiple processors and compute cores that can execute their respective tasks in parallel; FIG. 10 takes one of each as an example. For the functions of the chip described in this embodiment of the present invention, refer to the related descriptions in the embodiments of the invention described in FIG. 2 to FIG. 9; details are not repeated here.
An embodiment of this application provides a computing device. The computing device may be the host in the data processing system shown in FIG. 2, with the chip integrated on the host as a plug-in card. The host and the chip can cooperate to perform the operations described in the embodiments of the invention in FIG. 2 to FIG. 9; details are not repeated here.
An embodiment of this application further provides a computer-readable storage medium storing a computer program; when the program is executed by a processor, some or all of the steps of any one of the above method embodiments can be implemented.
An embodiment of this application further provides a computer program product; when it runs on a computer or processor, the computer or processor performs one or more steps of any of the above methods. If the constituent modules of the above devices are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium.
In the above embodiments, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, refer to the related descriptions in other embodiments.
It should be understood that "first", "second", "third", "fourth", and the various numerals in this document are merely distinctions made for convenience of description and are not intended to limit the scope of this application.
It should be understood that the term "and/or" in this document merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: only A exists, both A and B exist, and only B exists. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects.
It should also be understood that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of units is merely a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of this application essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The steps in the methods of the embodiments of this application may be reordered, combined, and deleted according to actual needs.
The modules in the apparatuses of the embodiments of this application may be combined, divided, and deleted according to actual needs.
In conclusion, the above embodiments are merely intended to describe the technical solutions of this application rather than to limit them. Although this application is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some technical features thereof, without departing from the scope of the technical solutions of the embodiments of this application.

Claims (22)

  1. A data processing method, wherein the method is performed by a data processing system, the data processing system comprises a processor and a compute core, and the method comprises:
    obtaining, by the processor, metadata of first data and metadata of second data, wherein the second data is obtained from the first data through a first operation set, the first operation set comprises at least two first operations, and memory addresses corresponding to elements at adjacent positions in each row of the second data are non-contiguous;
    identifying, by the processor, the first operation set based on the metadata of the second data, and determining each first operation in the first operation set;
    determining, by the processor, a second operation set matching the first operation set, wherein each first operation in the first operation set has a matching second operation in the second operation set; and
    obtaining, by the compute core, third data based on the first data and the second operation set, wherein memory addresses corresponding to elements at adjacent positions in each row of the third data are contiguous.
  2. The method according to claim 1, wherein the identifying, by the processor, the first operation set based on the metadata of the second data and determining each first operation in the first operation set comprises:
    identifying, by the processor, the first operations comprised in the first operation set in sequence according to preset priorities; and
    determining, by the processor based on each identified first operation, a characteristic scenario corresponding to the first operation, and pushing the characteristic scenarios into a scenario information stack in sequence.
  3. The method according to claim 2, wherein the identifying, by the processor, the first operations comprised in the first operation set according to preset priorities comprises:
    determining, by the processor, whether the metadata of the second data matches at least one piece of characteristic information of a first operation to be identified, and if so, determining, by the processor, the first operation to be identified, wherein the metadata of the second data comprises a shape, a stride, and a storage_offset of the second data.
  4. The method according to any one of claims 1 to 3, wherein the determining, by the processor, a second operation set matching the first operation set comprises:
    traversing, by the processor, an operator information library, wherein the operator information library comprises a plurality of tensor boost engine (TBE) operators; and
    for each first operation identified from the first operation set, determining, by the processor, an operator in the operator information library having the same characteristics as the first operation as the second operation matching the first operation, and pushing the second operations into an operator information stack in sequence.
  5. The method according to any one of claims 1 to 4, wherein before the compute core obtains the third data, the method further comprises:
    delivering, by the processor, a conversion command to the compute core, wherein the conversion command comprises the second operation set and instructs the compute core to perform operations on the first data according to the second operation set to obtain the third data.
  6. The method according to any one of claims 1 to 5, wherein the obtaining, by the compute core, the third data comprises:
    constructing, by the processor, fourth data, wherein the fourth data has the same metadata as the first data and shares the same memory with the first data; and
    executing, by the compute core, the second operations in the second operation set on the fourth data in sequence to obtain the third data.
  7. The method according to any one of claims 1 to 6, wherein the first operation set comprises a transpose operator, a narrow (slice) operator, and an expand operator.
  8. The method according to any one of claims 1 to 7, wherein the data processing system further comprises a host and a chip, the processor is located in the host, and the compute core is located in the chip.
  9. The method according to claim 8, wherein the chip comprises at least one of a neural-network processing unit (NPU), a graphics processing unit (GPU), a tensor processing unit (TPU), or a deep learning processing unit (DPU).
  10. A data processing system, comprising a processor and a compute core, wherein:
    the processor is configured to: obtain metadata of first data and metadata of second data, wherein the second data is obtained from the first data through a first operation set, the first operation set comprises at least two first operations, and memory addresses corresponding to elements at adjacent positions in each row of the second data are non-contiguous; identify the first operation set based on the metadata of the second data, and determine each first operation in the first operation set; and determine a second operation set matching the first operation set, wherein each first operation in the first operation set has a matching second operation in the second operation set; and
    the compute core is configured to obtain third data based on the first data and the second operation set, wherein memory addresses corresponding to elements at adjacent positions in each row of the third data are contiguous.
  11. The data processing system according to claim 10, wherein the processor is specifically configured to:
    identify the first operations comprised in the first operation set in sequence according to preset priorities; and
    determine, based on each identified first operation, a characteristic scenario corresponding to the first operation, and push the characteristic scenarios into a scenario information stack in sequence.
  12. The data processing system according to claim 10 or 11, wherein the processor is specifically configured to:
    determine whether the metadata of the second data matches at least one piece of characteristic information of a first operation to be identified, and if so, determine the first operation to be identified, wherein the metadata of the second data comprises a shape, a stride, and a storage_offset of the second data.
  13. The data processing system according to any one of claims 10 to 12, wherein the processor is specifically configured to:
    traverse an operator information library, wherein the operator information library comprises a plurality of TBE operators; and
    for each first operation identified from the first operation set, determine an operator in the operator information library having the same characteristics as the first operation as the second operation matching the first operation, and push the second operations into an operator information stack in sequence.
  14. The data processing system according to any one of claims 10 to 13, wherein:
    the processor is further configured to deliver a conversion command to the compute core, wherein the conversion command comprises the second operation set and instructs the compute core to perform operations on the first data according to the second operation set to obtain the third data.
  15. The data processing system according to any one of claims 10 to 14, wherein:
    the processor is further configured to construct fourth data, wherein the fourth data has the same metadata as the first data and shares the same memory with the first data; and
    the compute core is further configured to execute the second operations in the second operation set on the fourth data to obtain the third data.
  16. The data processing system according to any one of claims 10 to 15, wherein the first operation set comprises a transpose operator, a narrow (slice) operator, and an expand operator.
  17. The data processing system according to any one of claims 10 to 16, wherein the data processing system further comprises a host and a chip, the processor is located in the host, and the compute core is located in the chip.
  18. The system according to claim 17, wherein the chip comprises at least one of a neural-network processing unit (NPU), a graphics processing unit (GPU), a tensor processing unit (TPU), or a deep learning processing unit (DPU).
  19. A chip, comprising a processor and a compute core, wherein:
    the processor is configured to: obtain metadata of first data and metadata of second data, wherein the second data is obtained from the first data through a first operation set, the first operation set comprises at least two first operations, and memory addresses corresponding to elements at adjacent positions in each row of the second data are non-contiguous; identify the first operation set based on the metadata of the second data, and determine each first operation in the first operation set; and determine a second operation set matching the first operation set, wherein each first operation in the first operation set has a matching second operation in the second operation set; and
    the compute core is configured to obtain third data based on the first data and the second operation set, wherein memory addresses corresponding to elements at adjacent positions in each row of the third data are contiguous.
  20. A computing device, comprising:
    the data processing system according to any one of claims 1 to 9.
  21. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 9.
  22. A computer program, wherein the computer program comprises instructions that, when the computer program is executed by a processor, cause the processor to perform the method according to any one of claims 1 to 9.