WO2021109703A1 - Data processing method, chip, device, and storage medium - Google Patents

Data processing method, chip, device, and storage medium

Info

Publication number
WO2021109703A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
unit
instruction
parallel control
processing
Prior art date
Application number
PCT/CN2020/118893
Other languages
English (en)
French (fr)
Inventor
孟玉
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2021109703A1 publication Critical patent/WO2021109703A1/zh
Priority to US17/502,218 priority Critical patent/US20220035745A1/en

Classifications

    • G06F 9/3885: Concurrent instruction execution, e.g. pipeline or look ahead, using a plurality of independent parallel functional units
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06F 12/0875: Addressing of a memory level in which access to the desired data requires associative addressing means, e.g. caches, with dedicated cache, e.g. instruction or stack
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06F 2212/1016: Indexing scheme relating to memory systems; providing a specific technical effect: performance improvement
    • G06F 2212/452: Caching of specific data in cache memory: instruction code
    • G06F 2212/454: Caching of specific data in cache memory: vector or matrix data
    • G06F 2212/608: Details of cache memory relating to cache mapping
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of computer technology, and in particular to a data processing method, chip, device, and storage medium.
  • the processor in the computer device can handle a large number of computing tasks.
  • the data moving unit in the processor moves the picture data from outside the processor to the processor, and the processing unit in the processor processes the picture data.
  • the embodiments of the present application provide a data processing method, chip, device, and storage medium, which can improve processing efficiency.
  • the technical solution is as follows:
  • In one aspect, a data processing method is provided, which is applied to a computer device, and the method includes:
  • according to the parallel control instruction, read the first data cached in the data cache space, process the read first data, and output the processed first data to the data cache space;
  • at the same time, move second data from the data storage space to the data cache space, where the second data is the next piece of data after the first data.
  • In another aspect, a data processing chip is provided, which includes: an instruction processing unit, a data processing unit, a data moving unit, and a data caching unit;
  • the instruction processing unit is configured to read parallel control instructions, and simultaneously send the parallel control instructions to the data processing unit and the data moving unit;
  • the data processing unit is configured to read the first data buffered by the data buffer unit according to the parallel control instruction, process the read first data, and output the processed first data to the data cache unit;
  • the data moving unit is configured to simultaneously move second data from a data storage unit located outside the chip to the data cache unit according to the parallel control instruction, where the second data is the next piece of data after the first data.
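The benefit of issuing the processing and moving instructions simultaneously can be illustrated with a small timing model. This is a sketch under our own assumptions (the function names and the assumption that the move and process phases overlap fully are illustrative, not from the patent): with double buffering, moving piece i+1 hides behind processing piece i.

```python
def serial_time(n_pieces, t_move, t_process):
    """Baseline: each piece is moved, then processed, with no overlap."""
    return n_pieces * (t_move + t_process)

def pipelined_time(n_pieces, t_move, t_process):
    """Parallel control: while piece i is processed, piece i+1 is moved.
    Only the first move and the last process are not overlapped."""
    if n_pieces == 0:
        return 0
    return t_move + (n_pieces - 1) * max(t_move, t_process) + t_process
```

With 6 pieces and equal move/process times of 1 unit each, the serial schedule takes 12 units while the pipelined schedule takes 7 — the processing unit waits for the moving unit only once, before the first piece.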
  • the data processing unit includes at least one of a convolution engine or a pooling engine.
  • In another aspect, a computer device is provided, which includes a processor and a data storage unit, and the processor includes an instruction processing unit, a data processing unit, a data moving unit, and a data caching unit;
  • the instruction processing unit is used to read parallel control instructions;
  • the data processing unit is configured to read the first data buffered by the data buffer unit according to the parallel control instruction, process the read first data, and output the processed first data to the data cache unit;
  • the data moving unit is configured to simultaneously move the second data from the data storage unit to the data cache unit according to the parallel control instruction, where the second data is the next piece of data after the first data.
  • the computer device includes an instruction storage unit
  • the processor includes an instruction cache unit
  • the instruction processing unit is configured to read the parallel control instructions in the instruction storage unit;
  • the read parallel control instructions are moved to the instruction cache unit for caching to obtain an instruction cache queue;
  • the parallel control instructions are read from the instruction cache queue according to the instruction cache sequence.
  • the parallel control instruction includes a data processing instruction and a data movement instruction;
  • the instruction processing unit is configured to extract the data processing instruction and the data movement instruction from the parallel control instruction;
  • the instruction processing unit is further configured to send the data processing instruction to the data processing unit, and at the same time send the data movement instruction to the data movement unit;
  • the data processing unit is configured to read the first data buffered in the data buffer unit according to the data processing instruction, process the read first data, and output the processed first data to the data buffer unit;
  • the data moving unit is configured to move the second data from the data storage unit to the data cache unit according to the data moving instruction.
  • the instruction processing unit is configured to extract valid field indication information from the parallel control instruction; determine a first valid field and a second valid field in the parallel control instruction according to the valid field indication information; read the first valid field from the parallel control instruction to obtain the data processing instruction; and read the second valid field from the parallel control instruction to obtain the data movement instruction.
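One way to read the two valid fields out of a single instruction word is plain bit slicing. The layout below (bit offsets and widths passed as the indication information) is an assumption for illustration only; the patent does not specify a concrete encoding.

```python
def decode_parallel_instruction(word, fields):
    """Extract two sub-instructions from one parallel control instruction word.

    fields: the valid-field indication information, modeled here as a pair of
    (bit_offset, bit_width) tuples -- a hypothetical layout.
    Returns (data_processing_instruction, data_movement_instruction).
    """
    (off1, width1), (off2, width2) = fields
    processing = (word >> off1) & ((1 << width1) - 1)  # first valid field
    movement = (word >> off2) & ((1 << width2) - 1)    # second valid field
    return processing, movement
```

For example, with an 8-bit processing field at offset 0 and an 8-bit movement field at offset 8, the word 0x2A1B decodes to the pair (0x1B, 0x2A).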
  • the computer device further includes a splitting unit configured to obtain the data to be processed, split the data to be processed according to the cache capacity of the data cache unit to obtain multiple pieces of data, and store the data sequence composed of the multiple pieces of data in the data storage unit.
  • the data to be processed is image data; the data processing unit is configured to, according to the parallel control instruction, read the first data buffered by the data buffer unit based on the neural network model, process the read first data, and output the processed first data to the data buffer unit.
  • the data processing unit is configured to perform data processing in parallel according to data processing instructions corresponding to each layer in the neural network model.
  • the neural network model includes a convolutional layer and a pooling layer
  • the data processing unit is configured to receive the data processing instruction corresponding to the convolutional layer and the data processing instruction corresponding to the pooling layer;
  • according to the data processing instruction corresponding to the convolutional layer, read the first data cached in the data cache space, perform convolution processing on the first data, and output the convolution-processed first data to the data cache unit;
  • according to the data processing instruction corresponding to the pooling layer, read the convolution-processed third data already cached by the data caching unit, perform pooling processing on the convolution-processed third data, and output the pooled third data to the data caching unit, where the third data is the previous piece of data before the first data among the multiple pieces of data.
  • In another aspect, a computer-readable storage medium is provided, in which at least one piece of program code is stored, and the at least one piece of program code is loaded and executed by a processor to implement the operations performed in the data processing method described above.
  • the data processing operation and the data moving operation are executed simultaneously according to the parallel control instruction, and the time for the data processing operation to wait for the data moving operation is reduced as much as possible, thereby improving the speed and efficiency of data processing.
  • the data being processed is data that has already been moved to the data cache space, so it can be processed without waiting for the data moving process; this reduces the dependence of the data processing process on the data moving process and improves the processing speed and processing efficiency.
  • FIG. 1 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of another computer device provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of another computer device provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of another computer device provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of another computer device provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a convolution process provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of multiple pieces of data after splitting according to an embodiment of the present application.
  • FIG. 9 is a flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a control instruction provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a control instruction provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an instruction cache unit provided by an embodiment of the present application.
  • FIG. 13 is a flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 14 is a flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a processor provided by an embodiment of the present application.
  • FIG. 16 is a flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 17 is a flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 18 is a flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 19 is a schematic diagram of a chip provided by an embodiment of the present application.
  • FIG. 20 is a schematic diagram of a chip provided by an embodiment of the present application.
  • FIG. 21 is a structural block diagram of a terminal provided by an embodiment of the present application.
  • Fig. 22 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • The terms "first", "second", etc. used in this application can describe various concepts, but unless otherwise specified, these concepts are not limited by these terms; the terms are only used to distinguish one concept from another.
  • For example, without departing from the scope of this application, first data may be referred to as second data, and similarly, second data may be referred to as first data.
  • "At least one" includes one, two, or more; "multiple" includes two or more; "each" refers to every one of the corresponding multiple; and "any one" refers to any one of the multiple. For example, if multiple units include 3 units, "each" refers to every one of the 3 units, and "any one" refers to any one of the 3 units, that is, the first, the second, or the third.
  • Artificial Intelligence (AI) is the theory and technology of using digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the embodiments of the present application can use the above-mentioned artificial intelligence technology for data processing, and use the data processing method provided in the present application to improve the processing speed and processing efficiency.
  • the data processing method of the present application is described in detail through the following embodiments.
  • The data processing method provided in the embodiments of the present application is applied to computer equipment, which includes electronic products such as mobile phones, tablets, smart terminals, robots, computers, printers, scanners, telephones, driving recorders, navigators, cameras, watches, earphones, and wearable devices; various means of transportation such as airplanes, ships, and vehicles; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; various medical equipment such as nuclear magnetic resonance instruments and electrocardiographs; or servers, for example, a single server, a server cluster composed of several servers, or a cloud computing service center.
  • For example, the electrocardiograph captures the user's electrocardiogram image and uses a trained neural network to analyze the obtained electrocardiogram image to determine whether the user has a heart problem.
  • Using the data processing method provided in the embodiments of this application, after the electrocardiogram image is obtained, the processor executes the calculation steps that the neural network needs to perform.
  • the instruction processing unit in the processor sends control instructions to the data processing unit and the data moving unit at the same time.
  • the data processing unit and the data moving unit run in parallel; that is, the process of moving the current ECG image to the processor and the process of processing the previous ECG image already moved to the processor are executed at the same time, which avoids, as much as possible, the data processing unit waiting for the data moving unit to move the ECG image, thereby improving the processing speed and processing efficiency.
  • the data processing method provided in the embodiment of the present application can be applied to any scenario where data is processed, which is not limited in the embodiment of the present application.
  • the computer device includes a processor 1 and a data storage unit 201.
  • the processor 1 includes an instruction processing unit 101, a data processing unit 102, a data moving unit 103, and a data caching unit 104.
  • the instruction processing unit 101 is connected to the data processing unit 102 and the data moving unit 103
  • the data caching unit 104 is connected to the data processing unit 102 and the data moving unit 103
  • the data storage unit 201 is connected to the data moving unit 103.
  • the instruction processing unit 101 can send control instructions to the data processing unit 102 and the data moving unit 103 at the same time.
  • the data storage unit 201 can store multiple pieces of data.
  • the computer device further includes an instruction storage unit 202
  • the processor 1 further includes an instruction cache unit 105
  • the instruction processing unit 101 is configured to move at least one control instruction stored in the instruction storage unit 202 to the instruction cache unit 105.
  • the computer device further includes a splitting unit 203, which is located outside the processor 1, and can split the data to be processed into multiple pieces of data and store the split pieces of data in the data storage unit 201.
  • the data processing unit 102 includes at least one data processing subunit, and each data processing subunit is used to process data differently.
  • the data moving unit includes at least one data moving subunit, and the data moving process of each data moving subunit is different.
  • Taking a data moving unit that includes three data moving sub-units as an example: the first data moving sub-unit is used to move data from the data storage unit 201 to the data caching unit 104; the second data moving sub-unit is used to move data from the data caching unit 104 to the data storage unit 201; and the third data moving sub-unit is used to move data from a first position of the data caching unit 104 to a second position of the data caching unit 104.
  • the data processing subunit is a calculation engine, for example, a convolution engine, a pooling engine, etc.
  • the data movement subunit is a movement engine, for example, a load engine, a store engine, a move engine, etc.
  • the computer device includes an instruction processing unit 101, a data processing unit 102, a data moving unit 103, a data caching unit 104, an instruction caching unit 105, a data storage unit 201, an instruction storage unit 202, and a splitting unit 203.
  • the data processing unit 102 includes at least one data processing sub-unit
  • the data moving unit includes at least one data moving sub-unit.
  • Fig. 6 is a flowchart of a data processing method provided by an embodiment of the present application.
  • the execution subject of the embodiment of the present application is a computer device as shown in any one of Figs. 1 to 5. Referring to Fig. 6, the method includes:
  • the splitting unit splits the data to be processed into multiple pieces of data, and stores the multiple pieces of data after the split in the data storage unit.
  • the data to be processed is data that the processor needs to process.
  • the data to be processed includes any one or more forms of data such as picture data, audio data, and text data, which is not limited in the embodiments of the present application.
  • the data to be processed is split into multiple pieces of data, and each piece of the split data is moved and processed separately; because the amount of data moved each time becomes smaller, the moving speed becomes faster.
  • the splitting unit splitting the data to be processed into multiple pieces of data includes: if the data to be processed is image data, dividing the data to be processed into multiple pieces of data evenly according to the size of the data to be processed, For example, the size of the picture data is 128*128*3, and the picture data is split to obtain 16 picture data with a size of 32*32*3.
  • splitting the to-be-processed data into multiple pieces of data evenly refers to: splitting the to-be-processed data according to a target quantity, so that the data to be processed is split into the target quantity of pieces; for example, if the target quantity is 100, the data to be processed is equally divided into 100 pieces of data regardless of its size.
  • splitting the to-be-processed data into multiple pieces of data evenly refers to: splitting the to-be-processed data according to the target size, so that each piece of data after the split is not larger than the target size, so that the split data Data can be moved smoothly.
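The even split of image data can be sketched as follows. The tile-coordinate representation and the requirement that the tile size divide the image evenly are illustrative assumptions, not details from the patent:

```python
def split_image(width, height, channels, tile_w, tile_h):
    """Split width*height*channels picture data into tile_w*tile_h*channels tiles.

    Returns a list of (x, y, tile_w, tile_h, channels) tile descriptors.
    """
    if width % tile_w or height % tile_h:
        raise ValueError("tile size must divide the image evenly")
    return [(x, y, tile_w, tile_h, channels)
            for y in range(0, height, tile_h)
            for x in range(0, width, tile_w)]
```

Splitting 128*128*3 picture data into 32*32*3 tiles yields 16 pieces, matching the example above.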
  • the data to be processed is audio data, which is divided according to the reference duration.
  • the data to be processed is audio data with a duration of 1 minute.
  • the reference duration is 10 seconds
  • the data to be processed is split according to the reference duration to obtain 6 pieces of audio data with a duration of 10 seconds.
  • alternatively, the data is divided according to sentences, and each resulting piece of data includes at least one sentence. For example, there is a certain time interval between two adjacent sentences; therefore, according to the segments of the data to be processed that do not contain voice data output by the target object, the multiple sentences in the data to be processed can be separated to obtain multiple pieces of data, each of which includes one sentence. Optionally, the target object is a person, or any other object in the environment that can produce voice data.
  • the data to be processed is text data.
  • the data to be processed is equally divided into multiple pieces of data.
  • the data volume of each piece of data does not exceed the target data volume; or, the data to be processed is split into multiple pieces of data by dividing methods such as by paragraph or by sentence. For example, each paragraph of text in the data to be processed is divided into one piece of data to obtain multiple pieces of data; or each sentence in the data to be processed is divided into one piece of data to obtain multiple pieces of data.
  • the splitting unit splits the data to be processed into multiple pieces of data, the multiple pieces of data are input to the processor, and the multiple pieces of data are processed by the processor in turn.
  • the splitting unit splits the data to be processed based on the processor's configuration information. For example, if the configuration information is the cache capacity of the data cache unit in the processor, the splitting unit obtains the data to be processed and splits it according to the cache capacity of the data cache unit, so that after splitting, the data volume of each piece of data does not exceed the cache capacity.
  • the splitting unit splits the data to be processed into multiple pieces of data according to the splitting rule, where the splitting rule indicates that the data volume of any piece of data obtained by splitting is not greater than the cache of the data caching unit capacity.
  • the split rule is a rule specified according to configuration information.
  • the configuration information of the processor is the processing capacity of the data processing unit in the processor, or is the bandwidth in the processor.
  • the configuration information includes one or more types of information.
  • the cache capacity of the data cache unit is the total cache capacity of the data cache unit.
  • the splitting unit splits the to-be-processed data into multiple pieces of data according to splitting rules, including: the splitting unit splits the data to be processed into multiple pieces of data according to the total buffer capacity of the data buffer unit, and the data volume of each piece of data is not greater than the total buffer capacity of the data buffer unit.
  • the total cache capacity of the data cache unit is 15KB
  • the data volume of the data to be processed is 85KB.
  • the data to be processed is divided into 6 pieces of data with the same amount of data, or the data to be processed is divided into 6 pieces of data with different amounts of data; for example, the data volumes of the 6 pieces of data are 15KB, 15KB, 15KB, 15KB, 15KB, and 10KB, respectively.
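Splitting by a fixed capacity can be sketched as below. The greedy strategy (fill each piece to the capacity, with a smaller remainder at the end) is the one that produces the 15KB/10KB split in the example; the text also allows an equal-sized split.

```python
def split_by_capacity(total_kb, capacity_kb):
    """Greedily cut the data into pieces no larger than capacity_kb."""
    pieces = []
    remaining = total_kb
    while remaining > 0:
        piece = min(capacity_kb, remaining)
        pieces.append(piece)
        remaining -= piece
    return pieces
```

For 85KB of data and a 15KB total buffer capacity, this returns [15, 15, 15, 15, 15, 10] — the 6 pieces in the example above.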
  • the data cache unit may need to cache other data in addition to the data input to the processor.
  • the cache capacity of the data cache unit is the remaining cache capacity of the current data cache unit.
  • the splitting unit splits the data to be processed into multiple pieces of data according to the splitting rules, including: the splitting unit splits the data to be processed into multiple pieces of data according to the remaining cache capacity of the data cache unit, and the data volume of each piece of data is not greater than the remaining cache capacity of the data cache unit.
  • the data cache unit can cache both the data that the data processing unit needs to process and the processed data output by the data processing unit. Because the instruction control unit in the embodiments of the present application simultaneously sends parallel control instructions to the data processing unit and the data moving unit, the data moving unit moves the data to be processed to the data cache unit while the data processing unit processes the data already stored there. Therefore, the data cache unit must cache at least the data being moved in by the data moving unit, the data to be processed by the data processing unit, and the data to be output by the data processing unit.
  • the splitting unit splits the to-be-processed data into multiple pieces of data according to the splitting rules, including: the splitting unit splits the data to be processed into multiple pieces of data according to the input data volume and the output data volume of the data processing unit, so that the data caching unit can cache at least two pieces of input data and one piece of output data of the data processing unit.
  • For example, the cache capacity of the data cache unit is 30KB. If data with a data volume of 10KB is input to the data processing unit, and the data output after processing by the data processing unit is also 10KB, then the data to be processed is split into multiple pieces of data, and the data volume of each piece of data does not exceed 10KB.
  • when the data caching unit caches data, different types of data can be stored in different storage spaces, so that the data can be managed separately.
  • the data to be processed by the data processing unit is stored in the first storage space
  • the processed data is stored in the second storage space.
  • the splitting unit splits the data to be processed into multiple pieces of data according to the splitting rules, including: determining the maximum data amount of each piece of data according to the sizes of the first storage space and the second storage space, and splitting the data to be processed into multiple pieces of data, where the data volume of each piece of data is not greater than the maximum data amount.
  • the first storage space is an input storage space of the data processing unit
  • the second storage space is an output storage space of the data processing unit.
  • After processing, the amount of data may change. For example, suppose the capacity of the first storage space is 16KB, the capacity of the second storage space is 8KB, and processing doubles the data volume. For the second storage space to hold the processed data, the maximum piece size is 4KB, so each piece obtained by splitting does not exceed 4KB.
  • As another example, the capacity of the first storage space is 16KB and the capacity of the second storage space is 16KB. Because data processing and data moving run in parallel, the first storage space must store the piece being processed by the data processing unit and also reserve room for the piece being moved in by the data moving unit; it therefore needs to hold at least two pieces, so each piece obtained by splitting does not exceed 8KB.
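Both examples follow from the same two constraints: the input space must hold two pieces at once (double buffering) and the output space must hold one processed piece. A sketch, with an illustrative growth-factor parameter standing in for the change in data volume:

```python
def max_piece_kb(input_space_kb, output_space_kb, growth=1.0):
    """Upper bound on piece size: the input space must hold two pieces
    at once, and the output space one piece of size growth * piece."""
    return min(input_space_kb / 2, output_space_kb / growth)

assert max_piece_kb(16, 8, growth=2.0) == 4.0   # first example above
assert max_piece_kb(16, 16, growth=1.0) == 8.0  # second example above
```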
  • For example, the processor is an AI chip used to perform the various calculations of a neural network model. Taking a data processing unit that performs convolution operations as an example, the splitting of the data to be processed into multiple pieces is described below:
  • The data to be processed is 2048*2048*64 (width*height*channel) picture data, the pad (expansion) is 1, and a convolution operation is performed with 32 groups of 3*3 (width*height) convolution kernels with a stride (step length) of 1.
  • In the AI chip, the storage space for input data has a capacity of 16KB*32 and the storage space for output data has a capacity of 16KB*32. Because data processing and data moving run in parallel, the storage space for input data is divided into two parts.
  • The data volume of the output data is the same as that of the input data, so the size of the storage space for input data can serve as the basis for data splitting, that is, 8KB*32 per piece.
  • The data is split according to a 60*60*64 specification, where each 60*60*64 piece consumes a capacity of 7.2KB*32.
  • The sum of the data volumes of any two pieces obtained by this split is less than the total capacity of the storage space for input data.
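The 7.2KB*32 figure can be checked by spreading the 64 channels evenly over the 32 banks, assuming one byte per element (the byte width is an assumption, not stated in the embodiment):

```python
def tile_bytes_per_bank(width, height, channels, banks=32):
    """Bytes one tile occupies in each memory bank, assuming 1 byte
    per element and channels spread evenly across the banks."""
    return width * height * channels // banks

per_bank = tile_bytes_per_bank(60, 60, 64)
assert per_bank == 7200            # 7.2KB in each of the 32 banks
assert 2 * per_bank <= 16 * 1024   # two tiles fit in the 16KB*32 input space
```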
  • Splitting the data to be processed into multiple pieces of data means splitting it into multiple tiles (tile data).
  • the process is shown in Figure 8.
  • The size of the first tile is 60*60*64. The size of the second to 35th tiles is also 60*60*64; considering the sliding-window characteristics of convolution, each of these tiles overlaps the previous tile by two columns, so the new area in each tile is 60*58. The size of the 36th tile is 60*20*64; it likewise overlaps the previous tile by two columns, so the new area in this tile is 60*18.
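The tile boundaries follow mechanically from the tile width and the two-column halo that a 3*3 stride-1 convolution needs. A sketch (the padded width of 2050 columns, 2048 plus one padding column per side, is inferred from the tile counts above):

```python
def split_tiles(total_cols, tile_cols, overlap):
    """Column ranges of successive tiles; each tile after the first
    re-reads `overlap` columns of its predecessor."""
    tiles, start = [], 0
    while start + tile_cols < total_cols:
        tiles.append((start, start + tile_cols))
        start += tile_cols - overlap
    tiles.append((start, total_cols))      # last, possibly narrower tile
    return tiles

tiles = split_tiles(2050, 60, 2)
assert len(tiles) == 36                    # matches the 36 tiles above
assert tiles[1] == (58, 118)               # two columns shared with tile 1
assert tiles[35][1] - tiles[35][0] == 20   # last tile is 60*20*64
```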
  • The splitting rule in the embodiment of this application is configured by the developer when configuring the processor, and is determined according to at least one of: the capacity of the data cache unit in the processor, the requirements of the data processing unit on the cache, the requirements of the data moving unit on the cache, or the change in data volume before and after processing. Once the processor is configured, the data it needs to process is also determined, that is, the type and amount of the data to be processed are determined, so the splitting unit can split the to-be-processed data into multiple pieces according to the splitting rule configured by the developer.
  • the data sequence composed of the multiple pieces of data is stored in the data storage space.
  • Optionally, the data sequence is shown in FIG. The data sequence can also be arranged in other ways, and the embodiment of the present application does not limit the form of the data sequence.
  • the instruction processing unit moves at least one parallel control instruction stored in the instruction storage unit to the instruction cache unit.
  • the instruction storage unit is located outside the processor.
  • the instruction storage unit is a memory, or a storage medium, or is another type of storage unit.
  • The instruction storage unit is used to store multiple parallel control instructions, and these instructions instruct the processor to process the multiple pieces of data obtained by splitting one piece of to-be-processed data. The multiple parallel control instructions can be reused; that is, whenever the pieces obtained by splitting any piece of to-be-processed data are processed, the parallel control instructions stored in the instruction storage unit can be used again.
  • Optionally, the multiple parallel control instructions stored in the instruction storage unit are stored when the instruction storage unit is configured; optionally, they are input through the instruction management program.
  • the management program is a program used to manage the instruction storage unit.
  • the instruction management program can add, delete, or modify instructions in the instruction storage unit.
  • the instruction management program can reset the instructions in the instruction storage unit.
  • After configuration, the parallel control instructions can be read from the instruction storage unit. For example, take the data processing unit processing the first 6 pieces of data obtained by splitting. As shown in Fig. 9, the first parallel control instruction stored in the instruction storage unit instructs the data moving unit to move the first piece of data A from the data storage unit to the data cache unit;
  • the second parallel control instruction instructs the data moving unit to move the second piece of data B from the data storage unit to the data cache unit, and instructs the data processing unit to read the cached first piece of data A, process it, and output the processed first piece of data A to the data cache unit;
  • the third parallel control instruction instructs the data moving unit to move the third piece of data C to the data cache unit, instructs the data processing unit to read the cached second piece of data B, process it, and output the processed second piece of data B to the data cache unit, and instructs the data moving unit to move the processed first piece of data A from the data cache unit to the data storage unit; for the remaining parallel control instructions, refer to Fig. 9, which will not be repeated in the embodiment of the present application.
  • That is, under the third parallel control instruction, the data moving unit moves the third piece of data C from the data storage unit to the data cache unit and moves the processed first piece of data A from the data cache unit to the data storage unit.
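The schedule in Fig. 9 is a three-stage software pipeline: instruction i loads piece i, processes piece i-1, and writes back piece i-2. A sketch that generates that schedule (the operation labels are illustrative):

```python
def pipeline_schedule(num_pieces):
    """Operations triggered by each parallel control instruction
    (1-based): load piece i, process piece i-1, store piece i-2."""
    schedule = []
    for i in range(1, num_pieces + 3):     # two extra steps drain the pipeline
        ops = []
        if i <= num_pieces:
            ops.append(f"load {i}")
        if 1 <= i - 1 <= num_pieces:
            ops.append(f"process {i - 1}")
        if 1 <= i - 2 <= num_pieces:
            ops.append(f"store {i - 2}")
        schedule.append(ops)
    return schedule

s = pipeline_schedule(6)
assert s[0] == ["load 1"]
assert s[2] == ["load 3", "process 2", "store 1"]  # the third instruction
assert s[-1] == ["store 6"]
```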
  • In some embodiments, the data moving unit includes a first data moving subunit and a second data moving subunit: the first data moving subunit moves data from the data storage unit to the data cache unit, and the second data moving subunit moves data from the data cache unit to the data storage unit.
  • Any one of the multiple parallel control instructions stored in the instruction storage unit includes the valid fields for the operations that the controlled units perform. For example, as shown in FIG. , the parallel control instruction 1 includes a second valid field, which instructs the data moving unit to move the first piece of data A from the data storage unit to the data cache unit; the parallel control instruction 2 includes a first valid field and a second valid field, where the second valid field instructs the data moving unit to move the second piece of data B from the data storage unit to the data cache unit, and the first valid field instructs the data processing unit to read the first piece of data A cached by the data cache unit, process it, and output the processed first piece of data A to the data cache unit.
  • The valid fields carried by parallel control instructions differ. Therefore, when the control objects of parallel control instruction 1 and parallel control instruction 2 differ, the formats of the two instructions also differ.
  • each parallel control instruction includes valid field indication information and multiple valid fields.
  • The valid field indication information is used to indicate which valid fields in the current parallel control instruction are valid, that is, which units need to execute operations under the parallel control instruction.
  • Each valid field defines the information required by the corresponding unit to perform operations.
  • each parallel control instruction is shown in Figure 11.
  • The header of the parallel control instruction is the valid field indication information, followed by multiple fields. If the parallel control instruction 1 is used to control the data moving unit to move data 1 from the data storage unit to the data cache unit, the second valid field of the parallel control instruction 1 defines the movement parameters, such as the start address of data 1, the end address of data 1, and the length of the moved data; the other fields of the parallel control instruction 1 are filled with default values and define no parameters, indicating that they are invalid fields.
  • The parallel control instruction N is used to control the first data moving subunit to move data N from the data storage unit to the data cache unit, at the same time control the data processing unit to read data N-1 stored in the data cache unit, process data N-1, and output the processed data N-1 to the data cache unit, and at the same time control the second data moving subunit to move the processed data N-2 from the data cache unit to the data storage unit. Corresponding parameters are defined in the first, second, and fourth fields of the parallel control instruction N, which are valid fields; the other fields are filled with default values and are invalid fields, where N is any integer greater than 2.
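A minimal sketch of this instruction layout: a header bitmask marks the valid fields, followed by the fields themselves. The field names and bit ordering are illustrative assumptions, not the patent's encoding:

```python
# Illustrative field order; a set bit in the header marks the field valid.
FIELD_NAMES = ["process", "load", "field3", "store", "field5", "field6"]

def encode(fields):
    """fields: dict name -> params. Returns (header_mask, field_list);
    unused fields are filled with a default (None)."""
    mask, body = 0, []
    for bit, name in enumerate(FIELD_NAMES):
        if name in fields:
            mask |= 1 << bit
            body.append(fields[name])
        else:
            body.append(None)          # default-filled, invalid field
    return mask, body

def valid_fields(mask):
    return [n for bit, n in enumerate(FIELD_NAMES) if mask >> bit & 1]

# Instruction N: the first, second, and fourth fields are valid
mask, _ = encode({"process": {"piece": "N-1"},
                  "load": {"piece": "N"},
                  "store": {"piece": "N-2"}})
assert valid_fields(mask) == ["process", "load", "store"]
```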
  • The instruction cache unit is a unit inside the processor; compared with the instruction storage unit, it is characterized by higher cost, smaller storage capacity, and larger bandwidth. The instruction storage unit is located outside the processor; it is characterized by low cost and large storage capacity, and its bandwidth is smaller than that of the instruction cache unit. Therefore, during the operation of the processor, moving at least one parallel control instruction from the instruction storage unit to the instruction cache unit ensures that the instruction cache unit can supply parallel control instructions continuously and seamlessly, enabling the data processing unit and the data moving unit to receive the parallel control instructions in time.
  • the instruction control unit moves the parallel control instruction to be executed from the instruction storage unit to the instruction cache unit.
  • After a parallel control instruction is moved from the instruction storage unit to the instruction cache unit, the instruction does not disappear from the instruction storage unit, so the parallel control instruction can still be moved from the instruction storage unit when the multiple pieces obtained by splitting the next piece of to-be-processed data are processed.
  • Optionally, the instruction processing unit moves at least one parallel control instruction stored in the instruction storage unit to the instruction cache unit as follows: it reads the parallel control instructions in the instruction storage unit and, in reading order, moves them to the instruction cache unit for caching, obtaining an instruction cache queue. Subsequently, the parallel control instructions are read from the instruction cache queue in the order in which they were cached.
  • the instruction cache queue is a queue located in the instruction cache unit and containing at least one instruction.
  • the instruction buffer unit is a block-shaped FIFO (First Input First Output, first-in first-out queue) structure.
  • The instruction processing unit moves the parallel control instructions from the instruction storage unit to the instruction cache unit, where each data block of the instruction cache unit can store one parallel control instruction, and each parallel control instruction is used to control multiple units to perform the corresponding operations.
  • For example, each parallel control instruction occupies 64B, and the instruction cache unit includes 8 data blocks, each storing one parallel control instruction; when all 8 blocks hold parallel control instructions, they occupy 512B in total. The instruction cache unit can therefore be kept small, the overall cost of the processor stays low, and the design is not affected by a growing number of parallel control instructions, maintaining good scalability.
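A sketch of such a block-shaped FIFO, with the 8-block, 64B-per-block sizing from the example (the class and method names are illustrative):

```python
from collections import deque

class InstructionCache:
    """Block-shaped FIFO: a fixed number of blocks, one parallel
    control instruction per block."""
    def __init__(self, blocks=8, block_size=64):
        self.capacity, self.block_size = blocks, block_size
        self.fifo = deque()

    def push(self, instruction: bytes) -> bool:
        if len(self.fifo) == self.capacity or len(instruction) > self.block_size:
            return False               # cache full, or instruction too long
        self.fifo.append(instruction)
        return True

    def pop(self) -> bytes:
        return self.fifo.popleft()     # the longest-stored (first-in) instruction

cache = InstructionCache()
for i in range(8):
    assert cache.push(bytes([i]) * 64)
assert not cache.push(bytes(64))       # a 9th instruction must wait
assert cache.pop() == bytes(64)        # bytes([0]) * 64: first in, first out
```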
  • The instruction storage unit stores the multiple parallel control instructions required for processing the multiple pieces of data after splitting. Compared with storing these instructions in the instruction cache unit, the cost is greatly reduced. Moreover, as the computational complexity of the processor and the amount of data to be processed gradually increase, more and more parallel control instructions are required; using a storage medium outside the processor to store them saves the cost of internal processor storage, and the two-level instruction storage also better solves the problem that numerous and lengthy parallel control instructions occupy too much storage space.
  • the instruction processing unit reads the parallel control instruction from at least one instruction stored in the instruction cache unit, and sends the parallel control instruction to the data processing unit and the data movement unit at the same time.
  • The instruction processing unit reading the parallel control instruction from the at least one instruction stored in the instruction cache unit includes: the instruction processing unit reads the parallel control instruction with the longest storage time from the at least one instruction stored in the instruction cache unit, and sends that instruction to the data processing unit and the data moving unit at the same time.
  • the instruction cache unit has a block FIFO structure, the instruction cache unit stores parallel control instructions, and the instruction processing unit reads the parallel control instruction that first enters the instruction cache unit.
  • the instruction processing unit reading the parallel control instruction from at least one instruction stored in the instruction cache unit includes: reading the parallel control instruction from the instruction cache queue according to the instruction cache sequence.
  • When the instruction processing unit sends parallel control instructions to the data processing unit and the data moving unit at the same time, the instructions sent to the two units are either the same or different.
  • the parallel control instruction sent by the instruction processing unit to the data processing unit and the data movement unit at the same time is the same.
  • the parallel control instruction is shown in Figure 11 and includes valid field indication information and multiple valid fields.
  • The data processing unit receives the parallel control instruction, determines the valid field corresponding to the data processing unit according to the valid field indication information in the instruction, and executes the operation indicated by that valid field; the data moving unit receives the parallel control instruction, determines the valid field corresponding to the data moving unit according to the valid field indication information, and executes the operation indicated by that valid field.
  • In other embodiments, the parallel control instructions that the instruction processing unit sends to the data processing unit and the data moving unit at the same time are different: after the instruction processing unit reads the parallel control instruction, it extracts the data processing instruction and the data moving instruction from it, sends the data processing instruction to the data processing unit, and at the same time sends the data moving instruction to the data moving unit.
  • Extracting the data processing instruction and the data movement instruction in the parallel control instruction includes: extracting effective field indication information in the parallel control instruction; according to the effective field indication information, determining the first effective field and the second effective field in the parallel control instruction, The first valid field and the second valid field are read from the parallel control instruction to obtain a data processing instruction and a data movement instruction.
  • That is, the first valid field is read from the parallel control instruction to obtain the data processing instruction, and the second valid field is read from the parallel control instruction to obtain the data moving instruction. Therefore, the data processing unit and the data moving unit can directly perform the corresponding operations according to the received instructions.
  • The valid field indication information is used to indicate which of the multiple fields of the parallel control instruction are valid fields, and the unit or subunit that receives an instruction is determined according to the valid fields.
  • For example, the parallel control instruction includes valid field indication information and 6 fields, and the 6 fields correspond to 6 subunits respectively. If the valid field indication information indicates that the first field and the third field are valid fields, the instruction processing unit sends a control instruction carrying the first field to the subunit corresponding to the first field, and at the same time sends a control instruction carrying the third field to the subunit corresponding to the third field.
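This dispatch step can be sketched as a bitmask decode (the subunit names are illustrative):

```python
def dispatch(indication, fields, subunits):
    """Send each valid field to the subunit it corresponds to;
    bit i of `indication` marks field i+1 as valid."""
    return {unit: fields[i]
            for i, unit in enumerate(subunits)
            if indication >> i & 1}

subunits = ["mover_in", "processor", "mover_out", "sub4", "sub5", "sub6"]
fields = [f"field{i + 1}" for i in range(6)]
# the indication information marks the first and third fields valid
assert dispatch(0b101, fields, subunits) == {
    "mover_in": "field1", "mover_out": "field3"}
```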
  • The instruction processing unit sends parallel control instructions to the data processing unit and the data moving unit at the same time as follows: for any target data in the data storage unit, the instruction processing unit simultaneously sends the parallel control instruction matching that target data to the data processing unit and the data moving unit.
  • The parallel control instruction matching the target data is an instruction that instructs the data moving unit to move the target data from the data storage unit to the data cache unit and instructs the data processing unit to process the previous piece of data of the target data.
  • That is, for any target data in the data storage unit, the instruction processing unit sends a data processing instruction to the data processing unit and at the same time sends a data moving instruction to the data moving unit, where the data processing instruction instructs the data processing unit to process the previous piece of data of the target data, and the data moving instruction instructs the data moving unit to move the target data from the data storage unit to the data cache unit.
  • Steps 604 to 609 describe the process by which the data processing unit and the data moving unit run in parallel.
  • the data processing unit reads the first data buffered by the data buffer unit according to the parallel control instruction, processes the first data, and outputs the processed first data to the data buffer unit.
  • The first data is any one of the multiple pieces of data obtained by splitting. Optionally, a data sequence composed of the multiple pieces is stored in the data storage space, and the first data is any piece of data in that sequence.
  • Optionally, the parallel control instruction includes valid field indication information and multiple valid fields, and the data processing unit obtains the first valid field in the parallel control instruction according to the valid field indication information. That is, the data processing unit extracts the valid field indication information in the parallel control instruction, determines the first valid field according to it, reads the first data cached by the data cache unit according to the first valid field, processes the first data, and outputs the processed first data to the data cache unit.
  • Optionally, the parallel control instruction is a first control instruction, and the first control instruction carries a first valid field for controlling the data processing unit, so that the data processing unit can, according to the first valid field, read the first data cached by the data cache unit, process the first data, and output the processed first data to the data cache unit.
  • Optionally, the data processing unit reads the first data cached by the data cache unit according to the first valid field in one of two ways: the first valid field indicates the cache location of the first data, and the data processing unit reads the first data according to that cache location; or, the first valid field instructs the data processing unit to start, and the data processing unit itself reads the cached first data from the first storage space of the data cache unit. The data processing unit is configured with a program that reads the data cached by the data cache unit, processes the data, and outputs the processed data to the data cache unit; "the first valid field instructs the data processing unit to start" means that the first valid field instructs the data processing unit to run that program.
  • The data processing unit processes the first data according to the first valid field in one of two ways: the first valid field indicates the processing operation that the data processing unit needs to perform, and the data processing unit executes the indicated operation on the first data; or, the first valid field instructs the data processing unit to start, and the data processing unit starts and processes the first data according to its configured processing operation.
  • The data processing unit outputs the processed first data to the data cache unit in one of two ways: the first valid field indicates the location of the second storage space, and the data processing unit outputs the processed first data to the second storage space of the data cache unit according to the first valid field; or, if the first valid field does not indicate the location of the second storage space, the data processing unit automatically outputs the processed first data to the second storage space of the data cache unit.
  • the data moving unit moves the second data to be processed from the data storage unit to the data cache unit according to the parallel control instruction, and the second data is the next data of the first data.
  • the second data is the next data of the first data among the split pieces of data.
  • the first data is Tile1 and the second data is Tile2.
  • a data sequence composed of multiple pieces of data is stored in the data storage space, and the second data is the next piece of data of the first data in the data sequence.
  • Optionally, the parallel control instruction includes valid field indication information and multiple valid fields, and the data moving unit obtains the second valid field in the parallel control instruction according to the valid field indication information. That is, the data moving unit extracts the valid field indication information in the parallel control instruction, determines the second valid field according to it, and moves the second data to be processed from the data storage unit to the data cache unit according to the second valid field.
  • Optionally, the parallel control instruction is a second control instruction, and the second control instruction carries a second valid field for controlling the data moving unit, so that the data moving unit can move the second data to be processed from the data storage unit to the data cache unit according to the second valid field.
  • Optionally, the second valid field includes at least one of: the start storage location of the second data in the data storage unit, the end storage location of the second data in the data storage unit, the target location in the data cache unit to which the second data is moved, or the data length of the second data. The second data occupies a certain storage space in the data storage unit, and that storage space can be accurately determined from the start storage location and the end storage location of the second data in the data storage unit.
  • The data moving unit moves the second data to be processed from the data storage unit to the data cache unit according to the second valid field as follows: the data moving unit reads the second data from the data storage unit according to the start storage location of the second data in the data storage unit and the data length, and moves the second data to the target location in the data cache unit.
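As a minimal sketch of such a move, with the two storage units modeled as byte arrays (the function and parameter names are illustrative):

```python
def move_piece(src, start, length, dst, target):
    """Copy `length` bytes of a piece from its start storage location
    in the data storage unit into the data cache unit at `target`."""
    dst[target:target + length] = src[start:start + length]

storage = bytearray(range(16))   # stand-in for the data storage unit
cache = bytearray(8)             # stand-in for the data cache unit
move_piece(storage, start=4, length=4, dst=cache, target=0)
assert cache[:4] == bytearray([4, 5, 6, 7])
```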
  • In some embodiments, the data moving unit includes multiple subunits, and step 605 is completed by the first data moving subunit in the data moving unit.
  • the data moving unit moves the processed third data from the data cache unit to the data storage unit according to the parallel control instruction, where the third data is the previous data of the first data.
  • The third data is the previous piece of the first data among the split pieces of data; for example, if the first data is Tile3, the third data is Tile2.
  • Optionally, a data sequence composed of the multiple pieces of data after splitting is stored in the data storage space, and the third data is the previous piece of the first data in that data sequence.
  • Optionally, the parallel control instruction includes valid field indication information and multiple valid fields. The data moving unit determines the corresponding valid fields according to the valid field indication information of the parallel control instruction and obtains the third valid field, which is the field used to control the data moving unit to operate on processed data. That is, the data moving unit extracts the valid field indication information in the parallel control instruction, determines the third valid field according to it, and moves the processed third data from the data cache unit to the data storage unit according to the third valid field.
  • Optionally, the parallel control instruction is a data moving instruction, the data moving instruction carries a third valid field for controlling the data moving unit, and the data moving unit moves the processed third data from the data cache unit to the data storage unit according to the third valid field.
  • Optionally, the third valid field includes at least one of: the start storage location of the third data in the data cache unit, the end storage location of the third data in the data cache unit, the data length of the processed third data, or the target location of the processed third data in the data storage unit. The third data occupies a certain storage space in the data cache unit, and that storage space can be accurately determined from the start storage location and the end storage location of the third data in the data cache unit.
  • The data moving unit moves the processed third data from the data cache unit to the data storage unit according to the third valid field as follows: the data moving unit reads the processed third data from the data cache unit according to its start storage location in the data cache unit and its data length, and moves the processed third data to the target location in the data storage unit.
  • In some embodiments, step 606 is completed by the second data moving subunit in the data moving unit.
  • After the instruction processing unit sends a parallel control instruction to the data processing unit and the data moving unit, that single instruction may not complete all the operations to be performed on the data to be processed, and the instruction processing unit needs to continue sending parallel control instructions to the data processing unit and the data moving unit.
  • The time at which the instruction processing unit again sends a parallel control instruction to the data processing unit and the data moving unit simultaneously is when both units have completed their work under the previous parallel control instruction, as described in the following steps 607 to 609.
  • The data moving unit in the embodiment of the application also moves the processed data output by the data processing unit from the data cache unit to the data storage unit. Therefore, the data storage unit is used not only to store the multiple pieces obtained by splitting the data to be processed, but also to store the data processed by the data processing unit.
  • In some embodiments, the data processing unit needs to process the data multiple times, or the data processing unit includes multiple data processing subunits and the data needs to pass through them in sequence. Therefore, the second data is the data to be processed, or the data output after the previous processing by the data processing unit, or the processed data output by a certain data processing subunit.
  • after the data processing unit outputs the processed first data to the data buffer unit, the data processing unit sends a first completion message to the instruction processing unit.
  • the first completion message is used to indicate that the data processing unit has completed the operation.
  • the first completion message carries the identifier of the data processing unit, so that the instruction processing unit determines, according to the first completion message, that the data processing unit has completed the operation.
  • the identifier of the data processing unit is an identifier that uniquely identifies the data processing unit, for example, the number of the data processing unit, the name of the data processing unit, and so on.
  • after the data moving unit moves the second data from the data storage unit to the data cache unit and moves the processed third data from the data cache unit to the data storage unit, the data moving unit sends a second completion message to the instruction processing unit.
  • the second completion message is used to indicate that the data movement unit has completed the operation.
  • the second completion message carries the identifier of the data moving unit, so that the instruction processing unit determines, according to the second completion message, that the data moving unit has completed the operation.
  • the identifier of the data moving unit is an identifier that uniquely identifies the data moving unit, and may be the number of the data moving unit, the name of the data moving unit, and so on.
  • if the parallel control instruction only instructs the data moving unit to move the second data from the data storage unit to the data cache unit, the data moving unit sends the second completion message to the instruction processing unit after moving the second data from the data storage unit to the data cache unit. If the parallel control instruction only instructs the data moving unit to move the processed third data from the data buffer unit to the data storage unit, the data moving unit sends the second completion message to the instruction processing unit after moving the processed third data from the data buffer unit to the data storage unit.
  • if the parallel control instruction instructs both operations, the data moving unit sends the second completion message to the instruction processing unit after moving the second data from the data storage unit to the data buffer unit and moving the processed third data from the data buffer unit to the data storage unit.
  • after receiving the first completion message and the second completion message, the instruction processing unit simultaneously sends the next parallel control instruction matching the fourth data to the data processing unit and the data moving unit, where the fourth data is the next piece of data after the second data among the multiple pieces of split data.
  • after the data moving unit moves the second data from the data storage unit to the data cache unit, the data moving unit needs to continue to move the next piece of data after the second data, that is, the fourth data. Therefore, the instruction processing unit needs to read the next parallel control instruction matching the fourth data and send it to the data processing unit and the data moving unit at the same time. The next parallel control instruction is used to control the data moving unit to move the fourth data from the data storage unit to the data cache unit; it is also used to control the data processing unit to read the second data that has been cached by the data cache unit, process the second data, and output the processed second data to the data cache unit; and it is used to control the data moving unit to move the processed first data from the data cache unit to the data storage unit.
  • the multiple parallel control instructions stored in the instruction storage unit are used to control the processor to complete the processing of the multiple pieces of data obtained by splitting the data to be processed, so the multiple parallel control instructions are arranged in sequence according to the processing flow of the processor.
  • when the instruction processing unit moves at least one parallel control instruction from the instruction storage unit to the instruction cache unit, it also moves the instructions sequentially in order. Therefore, the arrangement order of the at least one parallel control instruction in the instruction cache unit matches the processing flow of the processor, and the instruction processing unit can directly read the parallel control instruction with the longest current storage time in the instruction cache unit; that parallel control instruction is the next parallel control instruction matching the fourth data. The flow of the above-mentioned data processing method is shown in FIG. 13 and FIG. 14.
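  • the dispatch-and-completion handshake described above can be pictured with a minimal simulation. The class and function names below, and the use of the unit identifier itself as the completion message, are illustrative assumptions; the patent only specifies that each completion message carries the unique identifier of its unit and that the next parallel control instruction is sent only after both completion messages arrive.

```python
from collections import deque

class Unit:
    """One execution unit (e.g. the data processing unit or the data
    moving unit); uid is the identifier carried in its completion message."""
    def __init__(self, uid):
        self.uid = uid
        self.log = []

    def execute(self, instruction):
        self.log.append(instruction)
        return self.uid  # completion message: the unit's unique identifier

def dispatch(instructions, units):
    """Send each parallel control instruction to all units at once; the
    next instruction is sent only after every unit reports completion."""
    queue = deque(instructions)  # same order as the instruction cache unit
    dispatched = []
    while queue:
        instr = queue.popleft()  # read the longest-cached instruction first
        completions = {unit.execute(instr) for unit in units}
        # Both completion messages received -> safe to send the next one.
        assert completions == {unit.uid for unit in units}
        dispatched.append(instr)
    return dispatched
```

  In this sketch every unit receives every instruction; in the chip, the valid fields inside the instruction determine which unit actually has work to do.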
  • step 601 provided in the embodiment of this application is an optional execution step.
  • in some embodiments, the split unit is not used to split each piece of data to be processed into multiple pieces of data; instead, multiple pieces of data to be processed are processed directly. Although the process of moving the data to be processed may be relatively slow, because data processing and data moving are executed in parallel, waiting for the data moving process to finish before executing the data processing process is avoided as much as possible, which still improves the processing efficiency of the processor. Therefore, the embodiment of the present application does not limit whether the data to be processed is split.
  • the above-mentioned data processing process is the data processing process of the neural network model.
  • the data processing unit reading the first data cached in the data cache space according to the parallel control instruction, processing the read first data, and outputting the processed first data to the data cache space includes: the data processing unit reads the first data cached in the data cache space based on the parallel control instruction and the neural network model, processes the first data, and outputs the processed first data to the data cache space.
  • the data processing unit further includes a plurality of data processing subunits, and the data processing of the neural network model is realized by the plurality of data processing subunits.
  • the data processing subunits simultaneously receive the data processing instructions corresponding to the data processing subunits, and perform data processing in parallel according to the data processing instructions.
  • the first data processing subunit is used to perform convolution processing on the data
  • the second data processing subunit is used to perform pooling processing on the data. If the first data processing subunit receives the corresponding data processing instruction, it reads the first data buffered by the data buffer unit, performs convolution processing on the first data, and outputs the convolution-processed first data to the data buffer unit.
  • at the same time, the second data processing subunit receives the corresponding data processing instruction, reads the convolution-processed third data that has been cached by the data buffer unit, performs pooling processing on the convolution-processed third data, and outputs the pooled third data to the data buffer unit.
  • the third data is the previous data of the first data among the multiple pieces of data after splitting.
  • the data processing method provided by the embodiments of the present application sends parallel control instructions to the data processing unit and the data moving unit at the same time, so that the data processing unit and the data moving unit execute in parallel: while the data moving unit performs the current move, the data processing unit can process the data moved last time, without waiting for the data moving unit to complete the current move.
  • the data processing process therefore no longer depends on the data moving process, which improves the processing speed and processing efficiency.
  • in the related art, the data to be processed is compressed and cropped to reduce the amount of data to be processed, but processing such as compression and cropping causes some information in the picture to be lost.
  • the embodiment of this application does not perform processing such as compression or cropping on the data to be processed. Therefore, the information in the data to be processed is not lost, which ensures the accuracy of the processing result of the processor.
  • in the related art, a cache unit with a larger capacity is configured for the processor, so that the processor can accommodate the data to be processed and the data to be processed can be moved entirely within the processor, thereby speeding up the movement; or a higher bus bandwidth is used in the processor, such as HBM (High Bandwidth Memory), that is, the time consumed by data movement is reduced by improving data transmission efficiency.
  • however, configuring a larger-capacity cache unit for the processor results in an increase in the area of the processor and a significant increase in cost; the cost increase brought by the use of high-bandwidth memory is also obvious.
  • the data processing method provided by the embodiment of the present application splits the data to be processed into multiple pieces of data, processes the multiple pieces of data, and executes the data moving process and the data processing process in parallel, avoiding as much as possible the waiting of the data processing process on the data moving process, thereby reducing the impact of data moving on the processing speed of the processor. Therefore, the embodiment of the present application does not impose higher requirements on the cache unit and bandwidth of the processor, and will not increase the cost of the processor.
  • the embodiment of this application can send parallel control instructions to the data processing unit and the data moving unit of the processor at the same time through one instruction processing unit, so that only one instruction processing unit sends instructions to two or more units at the same time to make them work in parallel. Multiple instruction processing units are not required, nor is interactive scheduling between multiple instruction processing units, and the implementation cost is low.
  • in addition, there is no need for software-hardware interaction, which avoids the performance loss of the processor caused by software-hardware interaction.
  • an instruction storage unit outside the processor stores the multiple parallel control instructions corresponding to the multiple pieces of data obtained by splitting the data to be processed. Compared with storing the multiple parallel control instructions in the instruction cache unit, the cost can be greatly reduced.
  • as the computational complexity of the processor gradually increases and the amount of data to be processed gradually increases, more and more parallel control instructions are required. Using a storage unit outside the processor to store the multiple parallel control instructions saves internal storage cost of the processor, and the two-level instruction storage also solves the problem that the parallel control instructions occupy more storage space because there are more parallel control instructions and the parallel control instructions are longer.
  • the processor is an AI chip, as shown in FIG. 15, the AI chip is used to perform calculation tasks performed by a neural network model in an artificial intelligence application.
  • the data processing unit of the processor includes a convolution processing sub-unit, and the data moving unit includes a load (loading) moving sub-unit and a store (storing) moving sub-unit.
  • the data to be processed is split into 1296 tiles as an example.
  • the specific content of the data to be processed is not fixed, but the type and size of the data to be processed by the processor are fixed. Therefore, after the data to be processed is split into multiple pieces of data, the size of each piece of data and the number of pieces are also fixed, and the number of parallel control instructions used to process the split pieces of data is also fixed. For example, when the number of tiles after splitting is 1296, the number of parallel control instructions is 1298. In the embodiment of the present application, 6 tiles and 8 parallel control instructions are used to illustrate the convolution processing process of the processor as an example, as shown in FIG. 16.
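  • the relationship between the tile count and the instruction count above can be checked with a small calculation. Reading the figures as a fully overlapped pipeline of load, convolution, and store stages (the formula below is our reading of the examples, not stated explicitly in the patent):

```python
def instruction_count(num_tiles, num_stages):
    # The first tile needs (num_stages - 1) extra instructions to fill the
    # pipeline; afterwards every instruction advances each stage by one tile.
    return num_tiles + num_stages - 1

# 6 tiles through load -> convolution -> store (3 stages): 8 instructions.
print(instruction_count(6, 3))     # 8
# 1296 tiles through the same 3 stages: 1298 instructions.
print(instruction_count(1296, 3))  # 1298
```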
  • the first parallel control instruction is used to control the work of the load transfer subunit
  • the second parallel control instruction is used to control the work of the load transfer subunit and the convolution processing subunit.
  • the 3rd to 6th parallel control instructions are used to control the work of the load transfer subunit, the convolution processing subunit, and the store transfer subunit.
  • the 7th parallel control instruction is used to control the work of the convolution processing subunit and the store transfer subunit.
  • the eighth parallel control instruction is used to control the work of the store moving subunit.
  • the load transfer subunit is used to sequentially move the split multiple data from the data storage unit to the data cache unit
  • the store transfer subunit is used to move the data processed by the convolution processing subunit from the data cache unit to the data storage unit.
  • the processor completes the convolution processing of the 6 tiles, and moves the processed 6 tiles to the data storage unit.
  • the data processing unit further includes any of the other processing subunits, such as a pooling processing subunit, a fully connected processing subunit, and the like.
  • taking the case where the data processing unit further includes a pooling processing subunit as an example, the processing process of the processor is described.
  • after the processor completes the convolution processing of the 6 tiles, the processed 6 tiles are moved to the data storage unit; therefore, it is also necessary to move the 6 convolution-processed tiles from the data storage unit back to the data buffer unit in turn, and sequentially perform pooling processing on the 6 convolution-processed tiles.
  • the pooling process is similar to the convolution process, and will not be repeated here.
  • in another implementation, the processor performs convolution processing and pooling processing on the multiple pieces of split data in a pipelined manner: after a tile is convolution-processed, the convolution-processed tile is used as the input of the pooling processing subunit; after the tile is pooled, the pooled tile is moved to the data storage unit.
  • this convolution and pooling process of the processor is exemplified in FIG. 17, where the data moving unit also includes a move (moving) subunit, which is used to move data in the data buffer unit from a first target location to a second target location.
  • the first parallel control instruction is used to control the work of the load transfer subunit
  • the second parallel control instruction is used to control the load transfer subunit and the convolution processing subunit
  • the third parallel control instruction is used to control the load transfer subunit.
  • the fourth parallel control instruction is used to control the work of the load transfer subunit, the convolution processing subunit, the move transfer subunit, and the pooling processing subunit.
  • the 5th and 6th parallel control instructions are used to control the work of the load transfer subunit, the convolution processing subunit, the move transfer subunit, the pooling processing subunit, and the store transfer subunit.
  • the seventh parallel control instruction is used to control the work of the convolution processing subunit, the move moving subunit, the pooling processing subunit, and the store moving subunit.
  • the eighth parallel control instruction is used to control the work of the move moving subunit, the pooling processing subunit, and the store moving subunit.
  • the 9th parallel control instruction is used to control the work of the pooling processing subunit and the store moving subunit.
  • the tenth parallel control instruction is used to control the work of the store moving subunit.
  • the move subunit is used to move the data in the output storage space corresponding to the convolution processing subunit to the input storage space corresponding to the pooling processing unit.
  • the store moving subunit is used to move the pooled data output by the pooling processing subunit from the data cache unit to the data storage unit.
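  • the ten instructions above follow one pipeline pattern: during instruction i, stage s of the pipeline works on tile i - s. A sketch that reproduces which subunits each instruction controls (the stage names and function are ours; only the resulting schedule is taken from the description above):

```python
def pipeline_schedule(num_tiles, stages):
    """For each parallel control instruction, list the subunits it
    controls; stage s works on tile (i - s) during instruction i."""
    num_instructions = num_tiles + len(stages) - 1
    schedule = []
    for i in range(num_instructions):
        active = [stage for s, stage in enumerate(stages)
                  if 0 <= i - s < num_tiles]
        schedule.append(active)
    return schedule

stages = ["load", "conv", "move", "pool", "store"]
for i, active in enumerate(pipeline_schedule(6, stages), start=1):
    print(i, active)
```

  Running this prints that instruction 1 controls only the load subunit, instructions 5 and 6 control all five subunits, and instruction 10 controls only the store subunit, matching the list above.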
  • FIG. 18 is a flowchart of a data processing method provided by an embodiment of the present application.
  • the execution subject of the embodiment of the present application is any computer device. Referring to FIG. 18, the method includes:
  • the data to be processed is split into multiple pieces of data, and the multiple pieces of data after the split are subsequently processed.
  • the split pieces of data are sequentially moved from the data storage space to the data cache space. Therefore, in a possible implementation, the data to be processed is split into multiple pieces of data according to a split rule, and the split rule indicates that the data volume of any piece of split data is not greater than the cache capacity of the data cache space.
  • after the data to be processed is obtained, the data to be processed is split according to the cache capacity of the data cache space to obtain multiple pieces of data, and a data sequence composed of the multiple pieces of data is stored in the data storage space.
  • the data to be processed is feature map data.
  • the data to be processed is split into multiple pieces of data according to the cache capacity of the data cache space.
  • the cache capacity of the data cache space is 16KB
  • the size of the feature map data is 128*128*128; the feature map data is split, and the multiple pieces of data obtained are 16 pieces of 32*32*128 data.
  • the data to be processed is image data.
  • the data to be processed is divided into multiple pieces of data evenly according to the buffer capacity of the data buffer space.
  • the buffer capacity of the data buffer space is 16KB
  • the size of the image data is 128*128*3; the picture data is split, and the multiple pieces of data obtained are 16 pieces of 32*32*3 picture data.
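  • both examples above can be sketched as a spatial tiling: the 128x128 plane is cut into a 4x4 grid of 32x32 tiles while the channel dimension is kept whole. The 32x32 tile side is an assumption chosen to reproduce the 16-piece result; the patent does not specify the exact split rule.

```python
def split_spatial(height, width, channels, tile_h, tile_w):
    """Split an H x W x C tensor into a grid of tile_h x tile_w x C tiles,
    returning (top, left, tile_h, tile_w, channels) descriptors."""
    assert height % tile_h == 0 and width % tile_w == 0
    tiles = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            tiles.append((top, left, tile_h, tile_w, channels))
    return tiles

# 128*128*128 feature map data -> 16 pieces of 32*32*128 data.
print(len(split_spatial(128, 128, 128, 32, 32)))  # 16
# 128*128*3 picture data -> 16 pieces of 32*32*3 picture data.
print(len(split_spatial(128, 128, 3, 32, 32)))    # 16
```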
  • the data storage space is the space on any data storage unit on the computer device, and the embodiment of the present application does not limit the data storage space.
  • the instruction cache space caches multiple parallel control instructions.
  • the multiple parallel control instructions in the instruction cache space are obtained from the instruction storage space, and the instruction storage space stores multiple parallel control instructions used to indicate the processing of the split data. The multiple parallel control instructions are arranged in sequence according to the indicated processing order, and are sequentially cached in the instruction cache space according to that order, so that the parallel control instruction with the longest cache time can be read from the instruction cache space.
  • reading the parallel control instruction includes: reading the parallel control instructions in the instruction storage space; moving the read parallel control instructions to the instruction cache space for caching according to the reading order to obtain an instruction cache queue; and reading the parallel control instructions from the instruction cache queue according to the instruction cache order.
  • the instruction cache queue is a queue located in the instruction cache space and containing at least one instruction.
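  • the two-level instruction storage described here behaves like a bounded FIFO: instructions move from the instruction storage space into the instruction cache space in order, and the instruction with the longest cache time is read first. A minimal sketch, where the cache capacity value is purely illustrative:

```python
from collections import deque

def run_instruction_queue(storage, cache_capacity):
    """Move instructions from storage into a bounded cache queue in order,
    always executing the instruction that has been cached the longest."""
    storage = deque(storage)  # instruction storage space (ordered)
    cache = deque()           # instruction cache queue
    executed = []
    while storage or cache:
        # Refill the cache up to its capacity, preserving order.
        while storage and len(cache) < cache_capacity:
            cache.append(storage.popleft())
        executed.append(cache.popleft())  # longest-cached instruction
    return executed
```

  Because both levels preserve order, the execution order always equals the storage order, regardless of the cache capacity.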
  • the parallel control instruction includes a data processing instruction and a data movement instruction.
  • the data processing instruction carries a first valid field
  • the data movement instruction carries a second valid field, so that the data processing instruction and the data movement instruction are used to indicate different operations.
  • the first valid field is used to indicate the operation of processing the first data
  • the second valid field is used to indicate the operation of caching the second data.
  • the method further includes: extracting the valid field indication information in the parallel control instruction; determining the first valid field and the second valid field in the parallel control instruction according to the valid field indication information; and reading the first valid field and the second valid field from the parallel control instruction, thereby obtaining the data processing instruction and the data movement instruction.
  • the data processing instruction is obtained by reading the first valid field
  • the data movement instruction is obtained by reading the second valid field.
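  • the valid-field mechanism can be pictured as a small header flagging which fields of a parallel control instruction are present. The dictionary encoding and field contents below are purely illustrative; the patent does not specify a bit layout.

```python
def split_instruction(instruction):
    """Split one parallel control instruction into a data processing
    instruction (first valid field) and a data movement instruction
    (second valid field), using the indication flags in the header."""
    indication = instruction["valid_fields"]  # e.g. {"process", "move"}
    processing_instr = instruction["first_field"] if "process" in indication else None
    movement_instr = instruction["second_field"] if "move" in indication else None
    return processing_instr, movement_instr

# Hypothetical instruction that carries both valid fields.
instr = {
    "valid_fields": {"process", "move"},
    "first_field": {"op": "conv", "src": "cache", "dst": "cache"},
    "second_field": {"op": "load", "src": "storage", "dst": "cache"},
}
proc, move = split_instruction(instr)
```

  An instruction whose header flags only one field (for example, the first or last instruction of the pipeline) yields `None` for the other, so the corresponding unit simply has no work for that cycle.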
  • the instruction cache space is an instruction cache unit
  • the instruction storage space is an instruction storage unit
  • the first data is any piece of data in the multiple pieces of data after the split. Since the data sequence composed of the multiple pieces of data after the split is stored in the data storage space, the first data is any piece of data in the data sequence.
  • the data to be processed is picture data
  • the first data is one piece of data among multiple pieces of data after the data to be processed is split. Therefore, the first data is a small picture.
  • reading the first data that has been cached in the data cache space, processing the read first data, and outputting the processed first data to the data cache space includes: according to the parallel control instruction and based on the neural network model, reading the first data cached in the data cache space, processing the first data, and outputting the processed first data to the data cache space.
  • the parallel control instruction is a data processing instruction.
  • according to the parallel control instruction, the first data cached in the data cache space is read, the read first data is processed, and the processed first data is output to the data cache space, which includes: reading the first data cached in the data cache space according to the first valid field carried by the data processing instruction, processing the read first data, and outputting the processed first data to the data cache space.
  • in a possible implementation, according to the parallel control instruction, reading the first data cached in the data cache space, processing the read first data, and outputting the processed first data to the data cache space includes: extracting the valid field indication information in the parallel control instruction; determining the first valid field in the parallel control instruction according to the valid field indication information; and, according to the first valid field, reading the first data cached in the data cache space, processing the read first data, and outputting the processed first data to the data cache space.
  • after the above operations are performed, the next parallel control instruction is read. After the next parallel control instruction is read, the second data cached in the data cache space is read according to the next parallel control instruction, the second data is processed, and the processed second data is output to the data cache space; at the same time, according to the next parallel control instruction, the fourth data is moved from the data storage space to the data cache space, and the processed first data is moved from the data cache space to the data storage space. After these operations, the process of reading the next parallel control instruction and executing operations according to it is repeated until the processing of the multiple pieces of split data is completed or the multiple parallel control instructions in the instruction storage space have all been executed once.
  • the above data processing method is applied to the neural network model.
  • according to the parallel control instruction and based on the neural network model, the first data that has been cached in the data cache space is read, the first data is processed, and the processed first data is output to the data cache space.
  • the neural network model includes multiple layers.
  • data processing is performed in parallel according to the data processing instructions corresponding to each layer in the neural network model, that is, each layer in the neural network model is performed in parallel data processing.
  • taking the convolutional layer and the pooling layer as an example, the parallel data processing of the layers in the neural network model is explained: the data processing instructions corresponding to the convolutional layer and the pooling layer are received at the same time. Based on the data processing instruction of the convolutional layer, the first data cached in the data cache space is read, convolution processing is performed on the first data, and the convolution-processed first data is output to the data cache space; at the same time, based on the data processing instruction of the pooling layer, the convolution-processed third data that has been cached in the data cache space is read, pooling processing is performed on the convolution-processed third data, and the pooled third data is output to the data cache space, thereby achieving the parallel operation of the convolutional layer and the pooling layer.
  • the data processing method provided by the embodiments of the present application reads the parallel control instruction and simultaneously executes the data processing operation and the data moving operation according to the parallel control instruction, reducing as much as possible the time the data processing operation waits for the data moving operation, thereby increasing the speed and efficiency of data processing.
  • the data processed this time is the data moved last time, so it can be processed without waiting for the current data moving process to complete, reducing the dependence of the data processing process on the data moving process and improving the processing speed and processing efficiency.
  • in the related art, in order to avoid the time-consuming process of moving large amounts of data to be processed, the data to be processed is compressed and cropped to reduce the amount of data, but processing such as compression and cropping may cause some information in the picture to be lost. The embodiment of the present application does not perform processing such as compression or cropping on the data to be processed, so the information in the data to be processed is not lost, which ensures the accuracy of the processing result of the processor.
  • in the related art, a data cache space with a larger cache capacity is configured so that the moving speed can be accelerated during data processing; or a higher bus bandwidth, such as HBM (High Bandwidth Memory), is used, that is, the time consumed by data movement is reduced by improving the efficiency of data transmission.
  • the data processing method provided in the embodiment of the application splits the data to be processed into multiple pieces of data, processes the split pieces of data, and executes the data moving process and the data processing process in parallel, avoiding as much as possible the waiting of the data processing process on the data moving process, thereby reducing the impact of data moving on the processing speed. Therefore, the embodiment of the present application does not impose higher requirements on the data cache space and bandwidth, and will not increase the cost.
  • in the related art, one instruction can only be used to instruct one operation to be performed. Therefore, to perform two or more operations, software-hardware interactive control is required through the synchronization and scheduling of multiple instructions; for example, multiple instruction processing units make the two operations as parallel as possible according to a complex scheduling and synchronization mechanism.
  • in the embodiment of the present application, the parallel control instruction can be read, and data processing and data moving are performed at the same time according to the parallel control instruction.
  • the cost is small.
  • no software and hardware interaction is required, which avoids the loss of data processing performance caused by the software and hardware interaction.
  • the embodiment of the present application also provides a data processing chip.
  • the chip is installed in any computer device to realize the data processing function of the computer device.
  • the chip 1900 includes an instruction processing unit 1901.
  • the instruction processing unit 1901 is configured to read the parallel control instruction, and send the parallel control instruction to the data processing unit 1902 and the data moving unit 1903 at the same time;
  • the data processing unit 1902 is configured to read the first data buffered by the data buffer unit 1904 according to the parallel control instruction, process the read first data, and output the processed first data to the data buffer unit;
  • the data moving unit 1903 is configured to simultaneously move the second data to be processed from the data storage unit outside the chip to the data cache unit 1904 according to the parallel control instruction, where the second data is the next piece of data after the first data.
  • the parallel control instruction includes a data processing instruction and a data movement instruction. The instruction processing unit 1901 is used to extract the data processing instruction and the data movement instruction in the parallel control instruction; the data processing unit 1902 is used to read the first data cached in the data buffer unit according to the data processing instruction, process the read first data, and output the processed first data to the data buffer unit; the data moving unit 1903 is used to move the second data from the data storage unit to the data cache unit according to the data movement instruction.
  • the instruction processing unit 1901 extracts the data processing instruction and the data movement instruction in the parallel control instruction by: extracting the valid field indication information in the parallel control instruction; determining the first valid field and the second valid field in the parallel control instruction according to the valid field indication information; and reading the first valid field and the second valid field from the parallel control instruction to obtain the data processing instruction and the data movement instruction.
  • the first valid field is read from the parallel control instruction to obtain the data processing instruction; the second valid field is read from the parallel control instruction to obtain the data movement instruction.
  • the chip further includes an instruction cache unit 1905; the instruction processing unit 1901 is configured to read parallel control instructions in an instruction storage unit located outside the chip; The read parallel control instructions are sequentially moved to the instruction cache unit 1905 for caching to obtain an instruction cache queue; the parallel control instructions are read from the instruction cache queue according to the instruction cache sequence.
  • the instruction cache queue is a queue located in the instruction cache unit 1905 and containing at least one instruction.
  • after the data processing unit 1902 outputs the processed data to the data cache unit, the data moving unit 1903 also moves the data output to the data cache unit to a data storage unit outside the chip.
  • the data moving unit 1903 is configured to move the processed third data from the data cache unit 1904 to a data storage unit outside the processor according to the parallel control instruction, the third data being the previous piece of data before the first data.
  • the data processing unit 1902 is configured to send a first completion message to the instruction processing unit 1901 after outputting the processed first data to the data cache unit 1904; the data moving unit 1903 is used to send a second completion message to the instruction processing unit 1901 after moving the second data from the data storage unit outside the processor to the data cache unit 1904; the instruction processing unit 1901 is used to, after receiving the first completion message and the second completion message, send the next parallel control instruction, which matches the fourth data, to the data processing unit 1902 and the data moving unit 1903 at the same time, the fourth data being the next piece of data after the second data among the plurality of pieces of data.
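The completion-message handshake above can be sketched in software as a barrier: the instruction processing unit issues the next parallel control instruction only after both the data processing unit and the data moving unit report completion. This is a minimal illustration, not the chip's actual implementation; all class and method names are invented for the sketch:

```python
import threading

class InstructionProcessingUnit:
    """Sketch of the first/second completion-message barrier described above."""

    def __init__(self):
        self.processing_done = threading.Event()  # "first completion message"
        self.moving_done = threading.Event()      # "second completion message"
        self.issued = []

    def on_first_completion(self):
        # called by the data processing unit after outputting processed data
        self.processing_done.set()

    def on_second_completion(self):
        # called by the data moving unit after moving the next piece in
        self.moving_done.set()

    def run(self, instructions, execute):
        for instr in instructions:
            self.processing_done.clear()
            self.moving_done.clear()
            self.issued.append(instr)
            execute(instr, self)          # both units work, then signal back
            self.processing_done.wait()   # barrier: the next parallel control
            self.moving_done.wait()       # instruction is issued only after both
```

A caller would pass an `execute` callback that dispatches the instruction to both units; each unit signals its completion message when done.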
  • the chip 1900 is an artificial intelligence (AI) chip, and the data to be processed is image data; the data processing unit 1902 is configured to read, according to the parallel control instruction, the first data cached in the data cache unit 1904; the data processing unit 1902 is also used to process the first data based on a neural network model and output the processed first data to the data storage unit, which is any data storage unit located outside the AI chip.
  • the data processing unit 1902 includes multiple data processing sub-units
  • the data moving unit 1903 includes at least one of a load engine, a store engine, and a move engine;
  • the load engine is used to move the data to be processed from the data storage unit to the data cache unit; any data processing subunit is used to read the data cached in the data cache unit, process the data, and output the processed data to the output storage unit corresponding to that data processing subunit;
  • the move engine is used to move the processed data of each of the multiple data processing subunits, except the last data processing subunit, from the corresponding output storage unit to the input storage unit of the next data processing subunit;
  • the store engine is used to move the data processed by the last data processing subunit from the output storage unit corresponding to the last data processing subunit to the data storage unit.
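The load/move/store engine roles above can be modeled as a chain of processing subunits with explicit hand-offs. This is a simplified sequential sketch (on the chip the engines operate concurrently, and all names are illustrative):

```python
def run_chain(tiles, subunits):
    """Model of the engine flow: a load engine brings each tile into the
    first subunit's input buffer, a move engine forwards intermediate
    outputs between adjacent subunits, and a store engine writes the last
    subunit's output back to external storage."""
    storage_out = []
    for tile in tiles:
        buf = tile                        # load engine: external storage -> cache
        for process in subunits[:-1]:
            buf = process(buf)            # subunit writes its output buffer
            # move engine: output buffer of this subunit -> input of the next
        buf = subunits[-1](buf)           # last data processing subunit
        storage_out.append(buf)           # store engine: cache -> external storage
    return storage_out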
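The load/move/store engine roles above can be modeled as a chain of processing subunits with explicit hand-offs. This is a simplified sequential sketch (on the chip the engines operate concurrently, and all names are illustrative):

```python
def run_chain(tiles, subunits):
    """Model of the engine flow: a load engine brings each tile into the
    first subunit's input buffer, a move engine forwards intermediate
    outputs between adjacent subunits, and a store engine writes the last
    subunit's output back to external storage."""
    storage_out = []
    for tile in tiles:
        buf = tile                        # load engine: external storage -> cache
        for process in subunits[:-1]:
            buf = process(buf)            # subunit writes its output buffer
            # move engine: output buffer of this subunit -> input of the next
        buf = subunits[-1](buf)           # last data processing subunit
        storage_out.append(buf)           # store engine: cache -> external storage
    return storage_out
```

With two toy subunits (an increment stage and a doubling stage), each tile passes through both in order.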
  • the data processing unit 1902 includes at least one of a convolution engine or a pooling engine.
  • FIG. 21 is a structural block diagram of a terminal provided by an embodiment of the present application.
  • the terminal 2100 is used to perform the steps performed by the terminal in the above-mentioned embodiment.
  • the terminal is a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop or a desktop computer.
  • the terminal 2100 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.
  • the terminal 2100 includes a processor 2101 and a memory 2102.
  • the processor 2101 includes one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the processor 2101 is implemented in hardware in the form of at least one of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array).
  • the processor 2101 includes a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, and is also called a CPU (Central Processing Unit);
  • the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 2101 is integrated with a GPU (Graphics Processing Unit), and the GPU is responsible for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 2101 further includes an AI (Artificial Intelligence) processor, and the AI processor is used to process computing operations related to machine learning.
  • the memory 2102 includes one or more computer-readable storage media, which are non-transitory.
  • the memory 2102 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 2102 is used to store at least one program code, and the at least one program code is used to be executed by the processor 2101 to implement the methods provided in the method embodiments of the present application. Data processing method.
  • the terminal 2100 may optionally further include: a peripheral device interface 2103 and at least one peripheral device.
  • the processor 2101, the memory 2102, and the peripheral device interface 2103 can be connected through a bus or a signal line.
  • Each peripheral device is connected to the peripheral device interface 2103 through a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 2104, a display screen 2105, a camera component 2106, an audio circuit 2107, a positioning component 2108, or a power supply 2109.
  • the peripheral device interface 2103 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 2101 and the memory 2102.
  • the processor 2101, the memory 2102, and the peripheral device interface 2103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 2101, the memory 2102, and the peripheral device interface 2103 can be implemented on a separate chip or circuit board, which is not limited in the embodiments of the present application.
  • the radio frequency circuit 2104 is used to receive and transmit RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 2104 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 2104 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 2104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and so on.
  • the radio frequency circuit 2104 can communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocol includes but is not limited to: the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks.
  • the radio frequency circuit 2104 also includes a circuit related to NFC (Near Field Communication), which is not limited in this application.
  • the display screen 2105 is used to display UI (User Interface, user interface).
  • the UI includes graphics, text, icons, videos, and any combination of them.
  • the display screen 2105 also has the ability to collect touch signals on or above the surface of the display screen 2105.
  • the touch signal is input to the processor 2101 as a control signal for processing.
  • the display screen 2105 is also used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • one display screen 2105 is provided on the front panel of the terminal 2100; in other embodiments, there are at least two display screens 2105, which are respectively provided on different surfaces of the terminal 2100 or adopt a folding design;
  • the display screen 2105 may be a flexible display screen, which is arranged on a curved surface or a folding surface of the terminal 2100.
  • the display screen 2105 is also set as a non-rectangular irregular pattern, that is, an irregular screen.
  • the display 2105 is made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
  • the camera assembly 2106 is used to collect images or videos.
  • the camera assembly 2106 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
  • the camera assembly 2106 also includes a flash.
  • the flash is a single-color temperature flash, or a dual-color temperature flash. Dual color temperature flash refers to a combination of warm light flash and cold light flash used for light compensation under different color temperatures.
  • the audio circuit 2107 includes a microphone and a speaker.
  • the microphone is used to collect sound waves from the user and the environment, and convert the sound waves into electrical signals and input them to the processor 2101 for processing, or input to the radio frequency circuit 2104 to implement voice communication.
  • the microphone is an array microphone or an omnidirectional collection microphone.
  • the speaker is used to convert the electrical signal from the processor 2101 or the radio frequency circuit 2104 into sound waves.
  • the speaker is a traditional thin-film speaker or a piezoelectric ceramic speaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as distance measurement.
  • the audio circuit 2107 also includes a headphone jack.
  • the positioning component 2108 is used to locate the current geographic location of the terminal 2100 to implement navigation or LBS (Location Based Service, location-based service).
  • the positioning component 2108 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • the power supply 2109 is used to supply power to various components in the terminal 2100.
  • the power source 2109 is alternating current, direct current, disposable batteries or rechargeable batteries.
  • the rechargeable battery supports wired charging or wireless charging.
  • the rechargeable battery is also used to support fast charging technology.
  • the structure shown in FIG. 21 does not constitute a limitation on the terminal 2100, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
  • FIG. 22 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 2200 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPUs) 2201 and one or more memories 2202, where at least one piece of program code is stored in the memory 2202 and is loaded and executed by the processor 2201 to implement the methods provided by the foregoing method embodiments.
  • the server may also have components such as a wired or wireless network interface, a keyboard, an input and output interface for input and output, and the server may also include other components for implementing device functions, which will not be described in detail here.
  • the server 2200 may be used to execute the steps executed by the server in the foregoing data processing method.
  • An embodiment of the present application also provides a computer device that includes a processor and a memory, and at least one piece of program code is stored in the memory.
  • the program code is loaded by the processor to perform the operations executed in the data processing method of the foregoing embodiments.
  • the computer device includes a processor and a data storage unit, and the processor includes: an instruction processing unit, a data processing unit, a data movement unit, and a data cache unit;
  • the data processing unit is configured to read the first data buffered by the data buffer unit according to the parallel control instruction, process the read first data, and output the processed first data to the data buffer unit;
  • the data moving unit is used to simultaneously move, according to the parallel control instruction, the second data from the data storage unit to the data cache unit, the second data being the next piece of data after the first data.
  • the computer device includes an instruction storage unit
  • the processor includes an instruction cache unit, and the instruction processing unit is used to read the parallel control instructions in the instruction storage unit, move the read parallel control instructions to the instruction cache unit for caching in the order in which they were read to obtain an instruction cache queue, and read parallel control instructions from the instruction cache queue in the order in which they were cached.
  • the parallel control instruction includes data processing instructions and data movement instructions; the instruction processing unit is used to extract data processing instructions and data movement instructions in the parallel control instructions;
  • the instruction processing unit is also used for sending data processing instructions to the data processing unit and at the same time sending data moving instructions to the data moving unit;
  • the data processing unit is configured to read the cached first data in the data cache unit according to the data processing instruction, process the read first data, and output the processed first data to the data cache unit;
  • the data moving unit is used to move the second data from the data storage unit to the data cache unit according to the data moving instruction.
  • the instruction processing unit is used to extract the valid field indication information in the parallel control instruction; determine the first valid field and the second valid field in the parallel control instruction according to the valid field indication information; read the first valid field from the parallel control instruction to obtain the data processing instruction; and read the second valid field from the parallel control instruction to obtain the data movement instruction.
  • the computer device further includes a splitting unit, which is used to obtain the data to be processed, split the data to be processed according to the cache capacity of the data cache unit to obtain multiple pieces of data, and store the data sequence composed of the multiple pieces of data in the data storage unit.
  • the data to be processed is image data; the data processing unit is used to read, according to the parallel control instruction and based on the neural network model, the first data cached by the data cache unit, process the read first data, and output the processed first data to the data cache unit.
  • the data processing unit is configured to perform data processing in parallel according to data processing instructions corresponding to each layer in the neural network model.
  • the neural network model includes a convolutional layer and a pooling layer, and a data processing unit is used to receive data processing instructions corresponding to the convolutional layer and data processing instructions corresponding to the pooling layer;
  • according to the data processing instruction corresponding to the convolution layer, read the first data cached in the data cache space based on the convolution layer, perform convolution processing on the first data, and output the convolution-processed first data to the data cache unit;
  • at the same time, according to the data processing instruction corresponding to the pooling layer, read the convolution-processed third data cached by the data cache unit based on the pooling layer, perform pooling processing on the convolution-processed third data, and output the pooling-processed third data to the data cache unit, the third data being the previous piece of data before the first data among the multiple pieces of data.
  • the embodiment of the present application also provides a computer-readable storage medium, in which at least one piece of program code is stored, and the at least one piece of program code is loaded and executed by a processor to implement the following operations:
  • according to the parallel control instruction, read the cached first data in the data cache space, process the read first data, and output the processed first data to the data cache space;
  • the second data is moved from the data storage space to the data cache space, and the second data is the next data of the first data.
  • the at least one piece of program code is loaded and executed by the processor to implement the following operations:
  • the parallel control instructions are read from the instruction cache queue according to the instruction cache sequence.
  • the parallel control instruction includes a data processing instruction and a data movement instruction; the at least one program code is also loaded and executed by the processor to implement the following operations:
  • according to the data processing instruction, read the cached first data in the data cache space, process the read first data, and output the processed first data to the data cache space;
  • the second data is moved from the data storage space to the data cache space.
  • the at least one piece of program code is loaded and executed by the processor to implement the following operations:
  • according to the valid field indication information, determine the first valid field and the second valid field in the parallel control instruction;
  • the first valid field is read from the parallel control instruction to obtain the data processing instruction; the second valid field is read from the parallel control instruction to obtain the data movement instruction.
  • the at least one piece of program code is also loaded and executed by the processor to implement the following operations:
  • the data to be processed is image data; the at least one piece of program code is loaded and executed by the processor to implement the following operations:
  • according to the parallel control instruction and based on the neural network model, read the first data cached in the data cache space; process the read first data, and output the processed first data to the data cache space.
  • the at least one piece of program code is also loaded and executed by the processor to implement the following operations:
  • the neural network model includes a convolutional layer and a pooling layer, and the at least one piece of program code is also loaded and executed by the processor to implement the following operations:
  • according to the data processing instruction corresponding to the convolution layer, read the first data cached in the data cache space based on the convolution layer, perform convolution processing on the first data, and output the convolution-processed first data to the data cache unit;
  • at the same time, according to the data processing instruction corresponding to the pooling layer, read the convolution-processed third data cached by the data cache unit based on the pooling layer, perform pooling processing on the convolution-processed third data, and output the pooling-processed third data to the data cache unit, the third data being the previous piece of data before the first data among the multiple pieces of data.
  • the embodiment of the present application also provides a computer program, in which at least one piece of program code is stored, and the at least one piece of program code is loaded and executed by a processor to implement the following operations:
  • according to the parallel control instruction, read the cached first data in the data cache space, process the read first data, and output the processed first data to the data cache space;
  • the second data is moved from the data storage space to the data cache space, and the second data is the next data of the first data.
  • the at least one piece of program code is loaded and executed by the processor to implement the following operations:
  • the parallel control instructions are read from the instruction cache queue according to the instruction cache sequence.
  • the parallel control instruction includes a data processing instruction and a data movement instruction; the at least one program code is also loaded and executed by the processor to implement the following operations:
  • according to the data processing instruction, read the cached first data in the data cache space, process the read first data, and output the processed first data to the data cache space;
  • according to the data movement instruction, the second data is moved from the data storage space to the data cache space.
  • the at least one piece of program code is loaded and executed by the processor to implement the following operations:
  • according to the valid field indication information, determine the first valid field and the second valid field in the parallel control instruction;
  • the first valid field is read from the parallel control instruction to obtain the data processing instruction; the second valid field is read from the parallel control instruction to obtain the data movement instruction.
  • the at least one piece of program code is also loaded and executed by the processor to implement the following operations:
  • the data to be processed is image data; the at least one piece of program code is loaded and executed by the processor to implement the following operations:
  • according to the parallel control instruction and based on the neural network model, read the first data cached in the data cache space; process the read first data, and output the processed first data to the data cache space.
  • the at least one piece of program code is also loaded and executed by the processor to implement the following operations:
  • the at least one piece of program code is also loaded and executed by the processor to implement the following operations:
  • the data processing instruction corresponding to the convolution layer based on the convolution layer, read the first data cached in the data cache space, perform convolution processing on the first data, and output the convolution processed first data to the data cache unit ;
  • the data processing instructions corresponding to the pooling layer based on the pooling layer, read the convolution processed third data that has been cached by the data cache unit, and perform pooling processing on the convolution processed third data, and pool the The transformed third data is output to the data buffer unit, and the third data is the previous data of the first data in the plurality of data.
  • the program may be stored in a computer-readable storage medium.
  • the storage medium can be a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method, chip, device, and storage medium, belonging to the field of computer technology. The method includes: reading a parallel control instruction; according to the parallel control instruction, reading first data cached in a data cache space, processing the read first data, and outputting the processed first data to the data cache space; and at the same time, according to the parallel control instruction, moving second data from a data storage space to the data cache space, the second data being the next piece of data after the first data. By reading a parallel control instruction and performing data processing and data movement simultaneously according to it, the time that data processing spends waiting for data movement is reduced, improving processing speed and processing efficiency.

Description

Data processing method, chip, device, and storage medium
This application claims priority to Chinese Patent Application No. 201911235760.X, filed on December 5, 2019 and entitled "Data processing method, chip, device, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technology, and in particular to a data processing method, chip, device, and storage medium.
Background
A processor in a computer device can handle a large number of computing tasks. For example, a data moving unit in the processor moves image data from outside the processor into the processor, and a processing unit in the processor processes the image data.
At present, as computer devices keep developing and processing performance keeps improving, data volumes are growing larger and larger. The data movement process is time-consuming, and the data processing process depends on the data movement process, which results in slow processing speed and low processing efficiency of the processor.
Summary
Embodiments of this application provide a data processing method, chip, device, and storage medium, which can improve processing efficiency. The technical solutions are as follows:
In one aspect, a data processing method is provided, applied to a computer device, the method including:
reading a parallel control instruction;
according to the parallel control instruction, reading first data cached in a data cache space, processing the read first data, and outputting the processed first data to the data cache space;
and at the same time, according to the parallel control instruction, moving second data from the data storage space to the data cache space, the second data being the next piece of data after the first data.
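The overlapped read-process-move loop of this method can be modeled as a two-stage software pipeline. The sketch below is a sequential simulation only (on the chip the two steps execute concurrently, and the function names are illustrative):

```python
def pipelined_process(storage, process):
    """Per iteration: the first data (already in the cache) is processed
    while the second data (the next piece) is moved from storage into the
    cache, so processing never waits for a move after the initial load."""
    outputs = []
    cache = storage[0] if storage else None       # initial move into the cache
    for i in range(len(storage)):
        first = cache                             # piece already in the cache
        second = storage[i + 1] if i + 1 < len(storage) else None
        # On the chip these two steps happen at the same time:
        outputs.append(process(first))            # data processing unit
        cache = second                            # data moving unit
    return outputs
```

For example, processing three pieces with a toy `process` function yields one output per piece, in order.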
In another aspect, a data processing chip is provided, the chip including: an instruction processing unit, a data processing unit, a data moving unit, and a data cache unit;
the instruction processing unit is configured to read a parallel control instruction and send the parallel control instruction to the data processing unit and the data moving unit at the same time;
the data processing unit is configured to, according to the parallel control instruction, read first data cached in the data cache unit, process the read first data, and output the processed first data to the data cache unit;
the data moving unit is configured to simultaneously move, according to the parallel control instruction, second data from a data storage unit located outside the chip to the data cache unit, the second data being the next piece of data after the first data.
In a possible implementation, the data processing unit includes at least one of a convolution engine or a pooling engine.
In another aspect, a computer device is provided, the computer device including: a processor and a data storage unit, the processor including: an instruction processing unit, a data processing unit, a data moving unit, and a data cache unit;
the instruction processing unit is configured to read a parallel control instruction;
the data processing unit is configured to, according to the parallel control instruction, read first data cached in the data cache unit, process the read first data, and output the processed first data to the data cache unit;
the data moving unit is configured to, at the same time and according to the control instruction, move second data from the data storage unit to the data cache unit, the second data being the next piece of data after the first data.
In a possible implementation, the computer device includes an instruction storage unit and the processor includes an instruction cache unit; the instruction processing unit is configured to read parallel control instructions in the instruction storage unit, move the read parallel control instructions to the instruction cache unit for caching in the order in which they were read to obtain an instruction cache queue, and read parallel control instructions from the instruction cache queue in the order in which they were cached.
In a possible implementation, the parallel control instruction includes a data processing instruction and a data movement instruction; the instruction processing unit is configured to extract the data processing instruction and the data movement instruction from the parallel control instruction;
the instruction processing unit is further configured to send the data processing instruction to the data processing unit and, at the same time, send the data movement instruction to the data moving unit;
the data processing unit is configured to, according to the data processing instruction, read the first data cached in the data cache unit, process the read first data, and output the processed first data to the data cache unit;
the data moving unit is configured to move the second data from the data storage unit to the data cache unit according to the data movement instruction.
In a possible implementation, the instruction processing unit is configured to extract valid field indication information in the parallel control instruction; determine a first valid field and a second valid field in the parallel control instruction according to the valid field indication information; read the first valid field from the parallel control instruction to obtain the data processing instruction; and read the second valid field from the parallel control instruction to obtain the data movement instruction.
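The valid-field extraction described above can be sketched as follows. The instruction layout used here (an indication set naming which fields are valid, plus two payload fields) is a hypothetical encoding for illustration only; the patent does not define a concrete binary format:

```python
def extract_subinstructions(parallel_instruction):
    """Split a parallel control instruction into its data processing
    instruction (first valid field) and data movement instruction
    (second valid field), guided by the valid-field indication info."""
    indication = parallel_instruction["valid_fields"]   # e.g. {"process", "move"}
    processing = parallel_instruction["field1"] if "process" in indication else None
    movement = parallel_instruction["field2"] if "move" in indication else None
    return processing, movement
```

An instruction whose indication marks only the movement field would yield a data movement instruction and no processing instruction.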
In a possible implementation, the computer device further includes a splitting unit, the splitting unit being configured to obtain data to be processed; split the data to be processed according to the cache capacity of the data cache unit to obtain multiple pieces of data; and store the data sequence composed of the multiple pieces of data in the data storage unit.
In a possible implementation, the data to be processed is image data; the data processing unit is configured to, according to the parallel control instruction and based on a neural network model, read the first data cached in the data cache unit, process the read first data, and output the processed first data to the data cache unit.
In a possible implementation, the data processing unit is configured to perform data processing in parallel according to the data processing instructions respectively corresponding to the layers in the neural network model.
In a possible implementation, the neural network model includes a convolution layer and a pooling layer, and the data processing unit is configured to receive the data processing instruction corresponding to the convolution layer and the data processing instruction corresponding to the pooling layer;
according to the data processing instruction corresponding to the convolution layer, read the first data cached in the data cache space based on the convolution layer, perform convolution processing on the first data, and output the convolution-processed first data to the data cache unit;
and at the same time, according to the data processing instruction corresponding to the pooling layer, read the convolution-processed third data cached by the data cache unit based on the pooling layer, perform pooling processing on the convolution-processed third data, and output the pooling-processed third data to the data cache unit, the third data being the previous piece of data before the first data among the multiple pieces of data.
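The convolution/pooling overlap described above is itself a two-stage pipeline: while the convolution layer processes piece i, the pooling layer processes the convolved output of piece i-1 (the "third data"). A minimal sequential model, with `conv` and `pool` standing in for the real engines:

```python
def conv_pool_pipeline(tiles, conv, pool):
    """Each step convolves the current tile while pooling the previous
    tile's convolution output; one extra step drains the pipeline."""
    convolved_prev = None          # conv output of the previous tile
    outputs = []
    for tile in list(tiles) + [None]:
        convolved_new = conv(tile) if tile is not None else None
        if convolved_prev is not None:
            # pooling runs "at the same time" as the convolution above
            outputs.append(pool(convolved_prev))
        convolved_prev = convolved_new
    return outputs
```

With toy stages (convolution as +1, pooling as *2), three tiles produce three pooled outputs, each lagging the convolution by one step.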
In yet another aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing at least one piece of program code, the at least one piece of program code being loaded and executed by a processor to implement the operations performed in the data processing method as described.
The beneficial effects brought by the technical solutions provided in the embodiments of this application include at least the following:
By reading a parallel control instruction, the data processing operation and the data movement operation are performed simultaneously according to the parallel control instruction, which minimizes the time the data processing operation spends waiting for the data movement operation, thereby improving the speed and efficiency of data processing. Moreover, the data being processed has already been moved into the data storage space and can be processed without waiting for the data movement process, which reduces the dependence of the data processing process on the data movement process and improves processing speed and processing efficiency.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the accompanying drawings in the following description are only some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a computer device provided by an embodiment of this application;
FIG. 2 is a schematic diagram of another computer device provided by an embodiment of this application;
FIG. 3 is a schematic diagram of another computer device provided by an embodiment of this application;
FIG. 4 is a schematic diagram of another computer device provided by an embodiment of this application;
FIG. 5 is a schematic diagram of another computer device provided by an embodiment of this application;
FIG. 6 is a flowchart of a data processing method provided by an embodiment of this application;
FIG. 7 is a schematic diagram of convolution processing provided by an embodiment of this application;
FIG. 8 is a schematic diagram of multiple pieces of split data provided by an embodiment of this application;
FIG. 9 is a flowchart of a data processing method provided by an embodiment of this application;
FIG. 10 is a schematic diagram of a control instruction provided by an embodiment of this application;
FIG. 11 is a schematic diagram of a control instruction provided by an embodiment of this application;
FIG. 12 is a schematic diagram of an instruction cache unit provided by an embodiment of this application;
FIG. 13 is a flowchart of a data processing method provided by an embodiment of this application;
FIG. 14 is a flowchart of a data processing method provided by an embodiment of this application;
FIG. 15 is a schematic diagram of a processor provided by an embodiment of this application;
FIG. 16 is a flowchart of a data processing method provided by an embodiment of this application;
FIG. 17 is a flowchart of a data processing method provided by an embodiment of this application;
FIG. 18 is a flowchart of a data processing method provided by an embodiment of this application;
FIG. 19 is a schematic diagram of a chip provided by an embodiment of this application;
FIG. 20 is a schematic diagram of a chip provided by an embodiment of this application;
FIG. 21 is a structural block diagram of a terminal provided by an embodiment of this application;
FIG. 22 is a schematic structural diagram of a server provided by an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
It can be understood that the terms "first", "second", and the like used in this application may be used herein to describe various concepts, but unless otherwise specified, these concepts are not limited by these terms. These terms are only used to distinguish one concept from another. For example, without departing from the scope of this application, first data may be referred to as second data, and similarly, second data may be referred to as first data.
As used in this application, "at least one" includes one, two, or more; "multiple" includes two or more; "each" refers to each of the corresponding multiple; and "any" refers to any one of the multiple. For example, if multiple units include three units, "each" refers to each of the three units, and "any" refers to any one of the three, such as the first, the second, or the third.
Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include several major directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
The embodiments of this application can use the above artificial intelligence technologies for data processing, and use the data processing method provided by this application to improve processing speed and processing efficiency. The data processing method of this application is described in detail through the following embodiments.
Optionally, the data processing method provided by the embodiments of this application is applied to a computer device, where the computer device includes various electronic products such as mobile phones, tablet computers, smart terminals, robots, computers, printers, scanners, telephones, driving recorders, navigators, cameras, video cameras, watches, earphones, and wearable devices; or various vehicles such as airplanes, ships, and automobiles; or various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; or various medical devices such as magnetic resonance imaging machines and electrocardiographs; or servers, for example, the computer device is a server, a server cluster composed of several servers, or a cloud computing service center.
Taking application to an electrocardiograph as an example:
The electrocardiograph captures an electrocardiogram (ECG) image of a user and analyzes the acquired ECG image with a trained neural network to determine whether the user has a heart problem. Using the data processing method provided by the embodiments of this application, after the ECG image is acquired, a processor is used to execute the operation steps required by the neural network. The instruction processing unit in the processor sends control instructions to the data processing unit and the data moving unit at the same time, and the data processing unit and the data moving unit run in parallel; that is, the process of moving an ECG image into the processor and the process of processing the previous ECG image already moved into the processor are performed simultaneously, which prevents, as far as possible, the data processing unit from waiting for the data moving unit to move ECG images, improving processing speed and processing efficiency.
The data processing method provided by the embodiments of this application can be applied to any scenario in which data is processed, which is not limited by the embodiments of this application.
As shown in FIG. 1, the computer device includes a processor 1 and a data storage unit 201. The processor 1 includes an instruction processing unit 101, a data processing unit 102, a data moving unit 103, and a data cache unit 104. The instruction processing unit 101 is connected to the data processing unit 102 and the data moving unit 103, respectively; the data cache unit 104 is connected to the data processing unit 102 and the data moving unit 103, respectively; and the data storage unit 201 is connected to the data processing unit 102 and the data moving unit 103, respectively. The instruction processing unit 101 can send control instructions to the data processing unit 102 and the data moving unit 103 at the same time. The data storage unit 201 can store multiple pieces of data.
In another possible implementation, as shown in FIG. 2, the computer device further includes an instruction storage unit 202, and the processor 1 further includes an instruction cache unit 105. The instruction processing unit 101 is configured to move at least one control instruction stored in the instruction storage unit 202 to the instruction cache unit 105.
In another possible implementation, as shown in FIG. 3, the computer device further includes a splitting unit 203, which is located outside the processor 1, can split the data to be processed into multiple pieces of data, and can also store the multiple pieces of split data in the data storage unit 201.
In another possible implementation, as shown in FIG. 4, the data processing unit 102 includes at least one data processing subunit, each of which performs different processing on data. Optionally, the data moving unit includes at least one data moving subunit, each of which performs a different data moving process. Taking a data moving unit including three data moving subunits as an example: the first data moving subunit is used to move data from the data storage unit 201 to the data cache unit 104; the second data moving subunit is used to move data from the data cache unit 104 to the data storage unit 201; and the third data moving subunit is used to move data from a first position of the data cache unit 104 to a second position of the data cache unit 104. Optionally, each data processing subunit is a compute engine, for example, a convolution engine or a pooling engine, and each data moving subunit is a moving engine, for example, a load engine, a store engine, or a move engine.
For example, as shown in FIG. 5, the computer device includes an instruction processing unit 101, a data processing unit 102, a data moving unit 103, a data cache unit 104, an instruction cache unit 105, a data storage unit 201, an instruction storage unit 202, and a splitting unit 203. The data processing unit 102 includes at least one data processing subunit, and the data moving unit includes at least one data moving subunit.
FIG. 6 is a flowchart of a data processing method provided by an embodiment of this application. The execution subject of this embodiment is the computer device shown in any one of FIGS. 1 to 5. Referring to FIG. 6, the method includes:
601. The splitting unit splits the data to be processed into multiple pieces of data and stores the multiple pieces of split data in the data storage unit.
The data to be processed is data that the processor needs to process. Optionally, the data to be processed includes any one or more forms of data such as image data, audio data, and text data; the embodiments of this application do not limit the data to be processed.
If the amount of data to be processed is large, moving the data will be time-consuming. Therefore, in the embodiments of this application, the data to be processed is split into multiple pieces of data, and each of the multiple pieces of split data is processed separately; since the amount of data moved each time becomes smaller, the moving speed becomes faster.
In a possible implementation, the splitting unit splitting the data to be processed into multiple pieces of data includes: if the data to be processed is image data, evenly splitting the data to be processed into multiple pieces of data according to the size of the data to be processed. For example, if the size of the image data is 128*128*3, the image data is split into 16 pieces of image data each of size 32*32*3.
Optionally, evenly splitting the data to be processed into multiple pieces of data means splitting the data to be processed into a target number of pieces according to the target number. For example, if the target number is 100, the data to be processed is evenly split into 100 pieces of data regardless of its size.
Optionally, evenly splitting the data to be processed into multiple pieces of data means splitting the data to be processed according to a target size, so that each piece of split data is no larger than the target size and the split data can be moved smoothly.
Optionally, the data to be processed is audio data, which is divided according to a reference duration. For example, the data to be processed is audio data with a duration of 1 minute; when the reference duration is 10 seconds, the data to be processed is split according to the reference duration to obtain 6 pieces of audio data each 10 seconds long. Alternatively, the data is divided by sentences, and each of the multiple pieces of data obtained includes at least one sentence. For example, there is a certain time interval between two adjacent sentences; therefore, the multiple sentences in the data to be processed can be split apart according to the segments that do not include speech data output by the target object, obtaining multiple pieces of data, each of which includes one sentence. Optionally, the target object is a person or any other object in the environment capable of inputting speech data.
Optionally, the data to be processed is text data, which is evenly divided into multiple pieces of data according to a target data amount, where the data amount of each piece does not exceed the target data amount; alternatively, the data to be processed is split into multiple pieces of data by paragraph, sentence, or another division method. For example, each paragraph of text in the data to be processed is divided into one piece of data to obtain multiple pieces of data; or each sentence in the data to be processed is divided into one piece of data to obtain multiple pieces of data.
Since the splitting unit splits the data to be processed into multiple pieces of data, the multiple pieces of data are input to the processor, and the processor processes them in sequence. Optionally, the splitting unit splits the data to be processed according to the configuration information of the processor. For example, the configuration information is the cache capacity of the data cache unit in the processor; the splitting unit obtains the data to be processed and splits it according to the cache capacity of the data cache unit, obtaining multiple pieces of split data, where the data amount of each piece does not exceed the cache capacity. In a possible implementation, the splitting unit splits the data to be processed into multiple pieces of data according to a splitting rule, where the splitting rule indicates that the data amount of any piece of split data is not greater than the cache capacity of the data cache unit. Optionally, the splitting rule is a rule specified according to the configuration information.
It should be noted that, in a possible implementation, the configuration information of the processor is the processing throughput of the data processing unit in the processor, or the bandwidth in the processor, or the like. Optionally, the configuration information includes one or more kinds of information.
Optionally, the cache capacity of the data cache unit is the total cache capacity of the data cache unit. In a possible implementation, the splitting unit splitting the data to be processed into multiple pieces of data according to the splitting rule includes: the splitting unit splits the data to be processed into multiple pieces of data according to the total cache capacity of the data cache unit, where the data amount of each piece is not greater than the total cache capacity of the data cache unit. For example, if the total cache capacity of the data cache unit is 15 KB and the data amount of the data to be processed is 85 KB, the data to be processed is split into 6 pieces of data of the same data amount, or into 6 pieces of data of different data amounts, for example, 15 KB, 15 KB, 15 KB, 15 KB, 15 KB, and 10 KB.
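The capacity-bounded split above (85 KB into pieces of at most 15 KB) can be sketched as a simple greedy loop; the function name is illustrative:

```python
def split_by_capacity(total_kb, capacity_kb):
    """Split a data amount into pieces none of which exceeds the cache
    capacity, filling each piece to capacity except possibly the last."""
    pieces = []
    remaining = total_kb
    while remaining > 0:
        piece = min(capacity_kb, remaining)   # never exceed the cache capacity
        pieces.append(piece)
        remaining -= piece
    return pieces
```

Applied to the example in the text, 85 KB with a 15 KB cache yields five 15 KB pieces and one 10 KB piece.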
在一种可能实现方式中,数据缓存单元除了需要缓存输入处理器的数据之外,可能还需要缓存其他数据,可选地,数据缓存单元的缓存容量是当前数据缓存单元的剩余缓存容量,在一种可能实现方式中,拆分单元按照拆分规则,将待处理数据拆分为多条数据,包括:拆分单元按照数据缓存单元的剩余缓存容量,将待处理数据拆分为多条数据,每条数据的数据量不大于数据缓存单元的剩余缓存容量。
另外,数据缓存单元能够缓存数据处理单元需要进行处理的数据,还能够缓存数据处理单元输出的处理后的数据,并且,本申请实施例中指令处理单元会同时向数据处理单元和数据搬移单元发送并行控制指令,使得数据处理单元在对数据缓存单元已存储的数据进行处理时,数据搬移单元能够同时将待处理的数据搬移至数据缓存单元,因此,数据缓存单元至少需要缓存数据搬移单元将要搬移的数据、数据处理单元将要处理的数据和数据处理单元将要输出的数据。
因此,在一种可能实现方式中,拆分单元按照拆分规则,将待处理数据拆分为多条数据,包括:拆分单元按照数据处理单元的输入数据的数据量和该数据处理单元的输出数据的数据量,将待处理数据拆分为多条数据,以使数据缓存单元能够至少缓存数据处理单元的两条输入数据和一条输出数据。
例如,数据缓存单元的缓存容量为30KB,若将数据量为10KB的数据输入至数据处理单元,由数据处理单元对该数据进行处理之后,输出的数据也为10KB,那么,将待处理数据拆分为多条数据,每条数据的数据量不超过10KB。
另外,数据缓存单元在缓存数据时,能够将不同种类的数据存储在不同的存储空间,以便对数据进行区分管理。例如,将数据处理单元待处理的数据存储在第一存储空间,将处理后的数据存储在第二存储空间。在一种可能实现方式中,拆分单元按照拆分规则,将待处理数据拆分为多条数据,包括:根据第一存储空间和第二存储空间的大小,确定每条数据的最大数据量,将待处理数据拆分为多条数据,每条数据的数据量不大于该最大数据量。可选地,第一存储空间为数据处理单元的输入存储空间,第二存储空间为数据处理单元的输出存储空间。
例如,对于数据处理单元处理前的数据和处理后的数据,数据的数据量可能会发生变化。第一存储空间的容量为16KB,第二存储空间的容量为8KB,假设数据处理后,数据的数据量会变成处理前的两倍,若要使第二存储空间能够容纳处理后的数据,那么最大数据量为4KB,拆分得到的数据不超过4KB。
又如,第一存储空间的容量为16KB,第二存储空间的容量为16KB,由于数据处理和数据搬移的并行执行,因此第一存储空间需要存储数据处理单元即将处理的数据,还需要为数据搬移单元即将搬移的数据预留空间,因此,第一存储空间需要至少容纳两条数据,那么拆分得到的数据不超过8KB。
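上述两个例子中确定最大数据量的约束(输入存储空间需容纳两条数据、输出存储空间需容纳一条处理后的数据)可以用如下Python代码示意,其中growth表示处理前后数据量的变化倍数,该函数为本文假设的示例:

```python
def max_tile_kb(input_space_kb, output_space_kb, growth=1.0):
    """返回拆分时每条数据允许的最大数据量(KB)。
    输入存储空间需同时容纳"正在处理"和"正在搬入"的两条数据,
    输出存储空间需容纳一条处理后的数据,其数据量约为输入的growth倍。"""
    by_input = input_space_kb // 2             # 双缓冲:输入空间至少容纳两条数据
    by_output = int(output_space_kb / growth)  # 输出空间容纳一条处理后的数据
    return min(by_input, by_output)

print(max_tile_kb(16, 8, growth=2.0))   # 4,对应正文第一个例子
print(max_tile_kb(16, 16, growth=1.0))  # 8,对应正文第二个例子
```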
在一种可能实现方式中,处理器是用于执行神经网络模型的各种计算过程的一个AI芯片,以数据处理单元为进行卷积运算的单元为例,对将待处理数据拆分为多条数据进行说明:
例如,待处理数据为2048*2048*64(width*height*channel,宽*长*通道)的图片数据,pad(扩展)为1,在AI芯片中通过32组3*3(width*height,宽*长)的卷积核进行stride(步长)为1的卷积运算。AI芯片内输入数据的存储空间的容量为16KB*32,输出数据的存储空间的容量为16KB*32。由于数据处理和数据搬移的并行执行,将输入数据的存储空间分成两部分。如图7所示,该卷积计算的过程中,输出数据的数据量和输入数据的数据量相同,那么能够以输入数据的存储空间的大小作为数据拆分的依据,即8KB*32为拆分得到的数据的最大数据量。可选地,按照60*60*64的规格进行拆分,其中每条60*60*64的数据消耗7.2KB*32的容量。这样拆分得到的两条数据的数据量之和小于输入数据的存储空间的总容量。
在一种可能实现方式中,将待处理数据拆分为多条数据是将待处理数据拆分为多条Tile(瓦片数据),可选地,将待处理数据拆分为多条Tile的过程如图8所示,图8为拆分后的多条Tile的示意图,以图8的第一行为例,总共拆分为36条Tile;以整个待处理数据为例,总共拆分为36*36=1296条Tile。
以第一行的Tile为例,其中,第1条Tile的大小为60*60*64;第2至35条Tile的大小为60*60*64,考虑到卷积的滑窗特点,第2条至第35条中的每条Tile与前一条Tile均有2列交叠,每条Tile中新的区域大小为60*58。第36条Tile的大小为60*20*64,考虑到卷积的滑窗特点,该Tile和前一条Tile有2列交叠,因此,该Tile中新的区域大小为60*18。
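上述按60列宽、2列交叠进行滑窗式拆分的过程,可以用如下Python代码示意(其中列数2050为含pad后的总列数,函数为本文为便于理解所作的假设):

```python
def tile_columns(total_cols, tile_cols, overlap):
    """按卷积滑窗特点生成每条Tile覆盖的列区间[start, end):
    相邻Tile之间交叠overlap列,最后一条Tile可能较窄。"""
    tiles = []
    start = 0
    while start + overlap < total_cols:
        end = min(start + tile_cols, total_cols)
        tiles.append((start, end))
        if end == total_cols:
            break
        start = end - overlap          # 下一条Tile与当前Tile交叠overlap列
    return tiles

tiles = tile_columns(2050, 60, 2)      # 2048列加pad共2050列
print(len(tiles))                      # 36,与图8中每行36条Tile一致
print(tiles[-1])                       # (2030, 2050),最后一条Tile宽20
```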
需要说明的是,在一种可能实现方式中,本申请实施例中的拆分规则,由开发人员在配置处理器的时候,根据处理器中数据缓存单元的容量、数据处理单元对缓存的要求、数据搬移单元对缓存的要求、或者数据处理前后数据量的变化中的至少一项确定,并将该拆分规则配置到处理器中,由于处理器配置好之后,该处理器需要处理的数据也确定了,即待处理数据的类型和数据量确定了,因此拆分单元能够根据开发人员配置的拆分规则将待处理数据拆分为多条数据。
另外,在将待处理数据进行拆分得到拆分后的多条数据之后,将该多条数据组成的数据序列存储到数据存储空间中,可选地,该数据序列如图8所示,可选地,该数据序列还能够按照其他方式进行排列,本申请实施例对数据序列的形式不做限定。
602、指令处理单元将指令存储单元中存储的至少一个并行控制指令搬移至指令缓存单元。
其中,指令存储单元位于处理器之外,可选地,指令存储单元为存储器,或者为存储介质,或者为其他类型的存储单元。指令存储单元用于存储多个并行控制指令,该多个并行控制指令用于指示处理器对一条待处理数据拆分得到的多条数据进行处理,其中,该多条并行控制指令能够循环利用,也即是对每条待处理数据拆分得到的多条数据进行处理时,均能够使用指令存储单元存储的多个并行控制指令。
可选地,指令存储单元存储的多个并行控制指令是在配置该指令存储单元时存储的;可选地,该指令存储单元中的多个并行控制指令是通过指令管理程序输入的,该指令管理程序是用于管理该指令存储单元的程序,例如,该指令管理程序能够对指令存储单元中的指令进行增加、删除、或者修改等。又如,该指令管理程序能够重置该指令存储单元中的指令。
在需要处理器对数据进行处理时,能够从指令存储单元中读取并行控制指令。例如,以数据处理单元对拆分得到的前6条数据进行处理为例进行说明,如图9所示,指令存储单元中存储的第一个并行控制指令用于指示数据搬移单元将第一条数据A从数据存储单元搬移至数据缓存单元;第二个并行控制指令用于指示数据搬移单元将第二条数据B从数据存储单元搬移至数据缓存单元,并指示数据处理单元读取数据缓存单元已缓存的第一条数据A,对第一条数据A进行处理,将处理后的第一条数据A输出至数据缓存单元;第三个并行控制指令用于指示数据搬移单元将第三条数据C搬移至数据缓存单元,并指示数据处理单元读取数据缓存单元已缓存的第二条数据B,对第二条数据B进行处理,将处理后的第二条数据B输出至数据缓存单元,并指示数据搬移单元将数据缓存单元处理后的第一条数据A从数据缓存单元搬移至数据存储单元;其余并行控制指令参考图9,本申请实施例对此不再一一赘述。
其中,数据搬移单元在第三个并行控制指令的指示下,将第三条数据C从数据存储单元搬移至数据缓存单元,还将处理后的第一条数据A从数据缓存单元搬移至数据存储单元。
在一种可能实现方式中,数据搬移单元包括第一数据搬移子单元和第二数据搬移子单元,由第一数据搬移子单元将数据从数据存储单元搬移至数据缓存单元,由第二数据搬移子单元将数据从数据缓存单元搬移至数据存储单元。
另外,指令存储单元存储的多个并行控制指令的格式相同,或者不同,在一种可能实现方式中,指令存储单元存储的多个并行控制指令中的任一个并行控制指令包括:控制单元执行操作的有效字段。例如,如图10所示,并行控制指令1包括第二有效字段,该第二有效字段用于指示数据搬移单元将第一条数据A从数据存储单元搬移至数据缓存单元;并行控制指令2包括第一有效字段和第二有效字段,第二有效字段用于指示数据搬移单元将第二条数据B从数据存储单元搬移至数据缓存单元,第一有效字段用于指示数据处理单元读取数据缓存单元已缓存的第一条数据A,对该第一条数据A进行处理,将处理后的第一条数据A输出至数据缓存单元。由于并行控制指令1和并行控制指令2携带的有效字段不同,控制对象也不同,因此,并行控制指令1和并行控制指令2的格式也不相同。
在一种可能实现方式中,指令存储单元存储的多个并行控制指令的格式相同。每个并行控制指令包括有效字段指示信息和多个有效字段,其中,有效字段指示信息为处理器中各个单元的控制指示,用于指示当前并行控制指令中有效的有效字段,以及哪些单元需要在该并行控制指令下执行操作。每个有效字段定义了对应的单元执行操作所需的信息。
例如,每个并行控制指令如图11所示,并行控制指令的头部为有效字段指示信息,指示信息后面还有多个字段,若并行控制指令1用于控制数据搬移单元将数据1从数据存储单元搬移至数据缓存单元,并行控制指令1的第二有效字段中定义有搬移参数,例如,数据1的起始地址、数据1的终止地址、搬移的数据长度等。可选地,并行控制指令1的其他字段填充有默认值,而未定义相应的参数,表示该其他字段为无效字段。
例如,并行控制指令N用于控制第一数据搬移子单元将数据N从数据存储单元搬移至数据缓存单元,同时控制数据处理单元读取数据缓存单元已存储的数据N-1,对数据N-1进行处理,并将处理后的数据N-1输出至数据缓存单元、同时控制第二数据搬移子单元将处理后的数据N-2从数据缓存单元搬移至数据存储单元,该并行控制指令N的第一字段、第二字段和第四字段中定义有相应的参数,为有效字段,其他字段中填充有默认值,为无效字段,其中,N为大于2的任一整数。
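这种"头部有效字段指示信息+若干字段"的指令格式,可以用如下Python代码示意其组织方式。字段编号与各单元的对应关系、字段内容均为本文为便于理解所作的假设:

```python
from dataclasses import dataclass, field

@dataclass
class ParallelCtrlInstr:
    """并行控制指令的极简示意:valid_mask为头部的有效字段指示信息(位图),
    fields为字段编号到字段参数的映射,未置位的字段视为填充默认值的无效字段。"""
    valid_mask: int
    fields: dict = field(default_factory=dict)

    def is_valid(self, idx):
        return bool(self.valid_mask >> idx & 1)

# 示意并行控制指令N:第一、第二、第四字段有效(假设字段0=数据处理单元,
# 字段1=第一数据搬移子单元,字段3=第二数据搬移子单元)
instr_n = ParallelCtrlInstr(
    valid_mask=0b1011,
    fields={0: {"op": "process"},
            1: {"src": "ddr", "dst": "buf", "len": 4096},
            3: {"src": "buf", "dst": "ddr", "len": 4096}})
```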
其中,指令缓存单元为处理器内部的一个单元,其特点为成本高、存储容量小、带宽相对于指令存储单元较大;指令存储单元位于处理器之外,其特点为成本低,存储容量大,带宽相对于指令缓存单元较小。因此,在处理器运行过程中,将指令存储单元中存储的至少一个并行控制指令搬移至指令缓存单元中,能够保证指令缓存单元持续无缝地供给并行控制指令,使数据处理单元和数据搬移单元及时接收到并行控制指令。
也即是,指令控制单元将即将要执行的并行控制指令,从指令存储单元搬移至指令缓存单元,可选地,在将并行控制指令从指令存储单元搬移至指令缓存单元之后,指令存储单元中的并行控制指令不会消失,以便于在对下一待处理数据拆分后的多条数据进行处理时,依旧从指令存储单元中搬移该并行控制指令。
在一种可能实现方式中,指令处理单元将指令存储单元中存储的至少一个并行控制指令搬移至指令缓存单元,包括:读取指令存储单元中的并行控制指令,按照读取顺序将读取到的并行控制指令搬移到指令缓存单元进行缓存,得到指令缓存队列。后续按照指令缓存顺序从该指令缓存队列中读取并行控制指令。其中,指令缓存队列为位于指令缓存单元中、且包含至少一个指令的队列。
可选地,指令缓存单元为一个块状FIFO(First Input First Output,先入先出队列)结构。当FIFO中存在空闲空间时,指令处理单元将并行控制指令从指令存储单元向指令缓存单元搬移;其中,指令缓存单元的每个数据块能够存储一个并行控制指令,每个并行控制指令用于控制多个单元执行相应的操作。如图12所示,每个并行控制指令占用64B,指令缓存单元包括8个数据块,每个数据块均用于存储一个并行控制指令,8个数据块均存储有并行控制指令时,共占用512B,因此,处理器的整体代价较小,不受并行控制指令较多的影响,保持了较好的可扩展性。
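上述块状FIFO结构可以用如下Python代码示意,深度8与正文示例一致,其余细节为本文的假设:

```python
from collections import deque

class InstrFifo:
    """块状FIFO指令缓存的极简示意:共depth个数据块,
    每个数据块缓存一条并行控制指令,先入先出。"""
    def __init__(self, depth=8):
        self.depth = depth
        self.blocks = deque()

    def has_room(self):                # 存在空闲数据块时才允许搬入
        return len(self.blocks) < self.depth

    def push(self, instr):             # 指令处理单元从指令存储单元搬入
        if not self.has_room():
            raise OverflowError("指令缓存单元已满")
        self.blocks.append(instr)

    def pop(self):                     # 读取缓存时间最长的并行控制指令
        return self.blocks.popleft()
```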
本申请实施例通过指令存储单元,存储用于处理拆分后的多条数据所需的多个并行控制指令,相对于将该多个并行控制指令存储在指令缓存单元来说,能够大大降低成本,并且,由于处理器的运算复杂程度逐渐加深、待处理数据的数据量逐渐加大,需要的并行控制指令也越来越多,利用处理器之外的存储介质来存储多个并行控制指令,能够节省处理器内部存储的成本,并且通过指令存储单元和指令缓存单元的两级存储,也较好地解决了并行控制指令较多和并行控制指令较长,而导致的并行控制指令占用存储空间较多的问题。
603、指令处理单元从指令缓存单元存储的至少一条指令中读取并行控制指令,同时向数据处理单元和数据搬移单元发送并行控制指令。
在一种可能实现方式中,指令处理单元从指令缓存单元存储的至少一条指令中读取并行控制指令包括:指令处理单元从指令缓存单元存储的至少一条指令中读取存储时间最长的并行控制指令,同时向数据处理单元和数据搬移单元发送该指令缓存单元中存储时间最长的并行控制指令。
例如,指令缓存单元为块状FIFO结构,指令缓存单元存储有并行控制指令,指令处理单元会读取最先进入该指令缓存单元的并行控制指令。
在另一种可能实现方式中,指令处理单元从指令缓存单元存储的至少一条指令中读取并行控制指令包括:按照指令缓存顺序从指令缓存队列中读取并行控制指令。
可选地,指令处理单元同时向数据处理单元和数据搬移单元发送并行控制指令时,向数据处理单元和数据搬移单元发送的并行控制指令相同;或者向数据处理单元和数据搬移单元发送的并行控制指令不同。
在一种可能实现方式中,指令处理单元同时向数据处理单元和数据搬移单元发送的并行控制指令相同,该并行控制指令如图11所示,包括有效字段指示信息和多个有效字段,数据处理单元接收到该并行控制指令,根据该并行控制指令中的有效字段指示信息,确定数据处理单元对应的有效字段,按照该有效字段中的相应指示执行操作;数据搬移单元接收到该并行控制指令,根据该并行控制指令中的有效字段指示信息,确定数据搬移单元对应的有效字段,按照该有效字段中的相应指示执行操作。
在另一种可能实现方式中,指令处理单元同时向数据处理单元和数据搬移单元发送的并行控制指令不同,可选地,指令处理单元读取并行控制指令之后,提取该并行控制指令中的数据处理指令和数据搬移指令,向数据处理单元发送数据处理指令,同时向数据搬移单元发送数据搬移指令。
提取该并行控制指令中的数据处理指令和数据搬移指令包括:提取并行控制指令中的有效字段指示信息;根据该有效字段指示信息,确定该并行控制指令中第一有效字段和第二有效字段,从并行控制指令中读取第一有效字段和第二有效字段,得到数据处理指令和数据搬移指令。可选地,从并行控制指令中读取第一有效字段,得到数据处理指令;从并行控制指令中读取第二有效字段,得到数据搬移指令。从而数据处理单元和数据搬移单元能够直接根据接收到的数据处理指令和数据搬移指令,执行相应的操作。
其中,有效字段指示信息用于指示并行控制指令的多个字段中哪些字段为有效字段,并根据有效字段确定接收指令的单元或者子单元。例如,并行控制指令包括有效字段指示信息和6个字段,该6个字段分别与6个子单元匹配,若有效字段指示信息指示第一个字段和第三个字段为有效字段,向第一个字段对应的子单元发送携带该第一个字段的控制指令,同时向第三个字段对应的子单元发送携带该第三个字段的控制指令。
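按有效字段指示信息把一条并行控制指令拆成发往各个单元的子指令,可以用如下Python代码示意;其中字段编号到单元的映射为本文的假设:

```python
def extract_subinstrs(valid_mask, fields, unit_of_field):
    """根据有效字段指示信息(位图),返回"单元 -> 该单元的控制指令"的映射,
    仅包含有效字段对应的单元。"""
    subinstrs = {}
    for idx, unit in unit_of_field.items():
        if valid_mask >> idx & 1:      # 该字段为有效字段
            subinstrs[unit] = fields[idx]
    return subinstrs

# 对应正文示例:6个字段中第一个和第三个字段为有效字段
subs = extract_subinstrs(
    0b000101,
    {0: {"op": "load", "len": 4096}, 2: {"op": "conv"}},
    {0: "load引擎", 1: "store引擎", 2: "卷积引擎",
     3: "池化引擎", 4: "move引擎", 5: "全连接引擎"})
print(sorted(subs))  # ['load引擎', '卷积引擎']
```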
另外,指令处理单元同时向数据处理单元和数据搬移单元发送并行控制指令,包括:指令处理单元根据数据存储单元中的任一目标数据,同时向数据处理单元和数据搬移单元发送与目标数据匹配的并行控制指令。
其中,与目标数据匹配的并行控制指令是指:指示数据搬移单元将目标数据从数据存储单元搬移至数据缓存单元,以及指示数据处理单元处理目标数据的上一条数据的指令。
可选地,指令处理单元根据数据存储单元中的任一目标数据,同时向数据处理单元和数据搬移单元发送与目标数据匹配的并行控制指令,包括:指令处理单元向数据处理单元发送数据处理指令,同时向数据搬移单元发送数据搬移指令,数据处理指令指示对目标数据的上一条数据进行处理,数据搬移指令指示将目标数据从数据存储单元搬移至数据缓存单元。
需要说明的是,为了使得数据处理单元和数据搬移单元能够并行处理数据,需要在处理器开始工作时,先由数据搬移单元将第一条数据从数据存储单元搬移至数据缓存单元,后续数据处理单元和数据搬移单元才能并行运行,因此,指令处理单元在处理器开始工作时,先向数据搬移单元发送一个并行控制指令,控制数据搬移单元搬移数据,后续才同时向数据处理单元和数据搬移单元发送并行控制指令,使得数据处理单元和数据搬移单元并行运行,下述步骤604至步骤609对数据处理单元和数据搬移单元并行运行的过程进行说明。
604、数据处理单元根据并行控制指令,读取数据缓存单元已缓存的第一数据,对第一数据进行处理,将处理后的第一数据输出至数据缓存单元。
其中,第一数据为拆分后的多条数据中的任一条数据,在一种可能实现方式中,将拆分后的多条数据组成的数据序列存储在数据存储空间中,第一数据为数据序列中的任一条数据。
可选地,该并行控制指令包括有效字段指示信息和多个有效字段,数据处理单元在接收到并行控制指令后,根据并行控制指令的有效字段指示信息,获取该并行控制指令中的第一有效字段。
在一种可能实现方式中,数据处理单元根据并行控制指令,读取数据缓存单元已缓存的第一数据,对第一数据进行处理,将处理后的第一数据输出至数据缓存单元,包括:数据处理单元提取并行控制指令中的有效字段指示信息,根据该有效字段指示信息确定第一有效字段,根据第一有效字段,读取数据缓存单元已缓存的第一数据,对第一数据进行处理,将处理后的第一数据输出至数据缓存单元。
可选地,该并行控制指令为第一控制指令,该第一控制指令携带用于控制数据处理单元的第一有效字段,数据处理单元能够根据第一有效字段,读取数据缓存单元已缓存的第一数据,对第一数据进行处理,将处理后的第一数据输出至数据缓存单元。
其中,数据处理单元根据第一有效字段,读取数据缓存单元已缓存的第一数据,包括:第一有效字段指示第一数据的缓存位置,数据处理单元根据第一数据的缓存位置,读取该第一数据;或者,第一有效字段指示数据处理单元启动,数据处理单元自行从数据缓存单元的第一存储空间读取已缓存的第一数据。可选地,数据处理单元中配置有读取数据缓存单元已缓存的数据、对数据进行处理、以及将处理后的数据输出至数据缓存单元的程序,第一有效字段指示数据处理单元启动是指:第一有效字段指示数据处理单元运行该程序。
另外,数据处理单元根据第一有效字段,对第一数据进行处理包括:该第一有效字段指示数据处理单元需要执行的处理操作,数据处理单元执行该第一有效字段指示的处理操作,对第一数据进行处理;或者,该第一有效字段指示数据处理单元启动,则数据处理单元启动并按照配置的处理操作,对该第一数据进行处理。
另外,数据处理单元将处理后的第一数据输出至数据缓存单元,包括:第一有效字段指示第二存储空间的位置,数据处理单元根据第一有效字段将处理后的第一数据输出至数据缓存单元的第二存储空间;或者,第一有效字段未指示第二存储空间的位置,数据处理单元自行将处理后的第一数据输出至数据缓存单元的第二存储空间。
605、数据搬移单元根据并行控制指令,将待处理的第二数据从数据存储单元搬移至数据缓存单元,第二数据为第一数据的下一条数据。
其中,第二数据为拆分后的多条数据中第一数据的下一条数据,例如,如图8所示,第一数据为Tile1,第二数据为Tile2,在一种可能实现方式中,多条数据组成的数据序列存储在数据存储空间中,第二数据为数据序列中第一数据的下一条数据。
可选地,该并行控制指令包括有效字段指示信息和多个有效字段,数据搬移单元在接收到并行控制指令后,能够根据并行控制指令的有效字段指示信息,获取该并行控制指令中的第二有效字段。在一种可能实现方式中,数据搬移单元根据并行控制指令,将待处理的第二数据从数据存储单元搬移至数据缓存单元,包括:数据搬移单元提取并行控制指令中的有效字段指示信息,根据该有效字段指示信息,确定并行控制指令中的第二有效字段,根据该第二有效字段,将待处理的第二数据从数据存储单元搬移至数据缓存单元。
可选地,该并行控制指令为第二控制指令,该第二控制指令携带用于控制数据搬移单元的第二有效字段,数据搬移单元能够根据第二有效字段,将待处理的第二数据从数据存储单元搬移至数据缓存单元。
可选地,第二有效字段包括第二数据在数据存储单元中的起始存储位置、第二数据在数据存储单元中的终止存储位置、第二数据搬移至数据缓存单元的目标位置、或者第二数据的数据长度等中的至少一项。其中,第二数据在数据存储单元中会占用一定的存储空间,通过第二数据在数据存储单元中的起始存储位置和终止存储位置能够准确地确定该第二数据所占用的存储空间。数据搬移单元根据第二有效字段,将待处理的第二数据从数据存储单元搬移至数据缓存单元,包括:数据搬移单元根据第二数据在数据存储单元中的起始存储位置以及第二数据的数据长度,从数据存储单元中读取第二数据,根据目标位置,将第二数据搬移至该目标位置处。
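依据起始存储位置、数据长度和目标位置完成一次搬移的过程,可以用如下Python代码示意,用bytearray模拟数据存储单元和数据缓存单元(函数与数据内容均为本文假设的示例):

```python
def dma_move(src_mem, dst_mem, src_start, length, dst_start):
    """按第二有效字段给出的搬移参数,把src_mem中从src_start起、
    长度为length的数据搬移到dst_mem的dst_start位置。"""
    dst_mem[dst_start:dst_start + length] = src_mem[src_start:src_start + length]

ddr = bytearray(b"....HELLO.......")   # 模拟数据存储单元
buf = bytearray(16)                    # 模拟数据缓存单元
dma_move(ddr, buf, src_start=4, length=5, dst_start=0)
print(bytes(buf[:5]))                  # b'HELLO'
```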
可选地,数据搬移单元包括多个子单元,因此,步骤605通过数据搬移单元中的第一数据搬移子单元完成。
606、数据搬移单元根据并行控制指令,将处理后的第三数据从数据缓存单元搬移至数据存储单元,第三数据为第一数据的上一条数据。
其中,第三数据为拆分后的多条数据中第一数据的上一条数据,如图8所示,第一数据为Tile3时,第三数据为Tile2。在一种可能实现方式中,将拆分后的多条数据组成的数据序列存储在数据存储空间中,因此,第三数据是数据序列中第一数据的上一条数据。
可选地,该并行控制指令包括有效字段指示信息和多个有效字段,数据搬移单元在接收到并行控制指令后,能够根据并行控制指令的有效字段指示信息,确定相应的有效字段,并获取其中的第三有效字段,该第三有效字段为用于控制该数据搬移单元对处理后的数据执行操作的字段。在一种可能实现方式中,数据搬移单元提取并行控制指令中的有效字段指示信息,根据该有效字段指示信息,确定该并行控制指令中的第三有效字段,根据第三有效字段,将处理后的第三数据从数据缓存单元搬移至数据存储单元。
可选地,该并行控制指令为数据搬移指令,该数据搬移指令携带用于控制数据搬移单元的第三有效字段,数据搬移单元根据第三有效字段,将处理后的第三数据从数据缓存单元搬移至数据存储单元。
可选地,第三有效字段包括第三数据在数据缓存单元的起始存储位置、第三数据在数据缓存单元的终止存储位置、处理后的第三数据的数据长度、以及处理后的第三数据在数据存储单元的目标位置等中的至少一项。其中,第三数据在数据缓存单元中会占用一定的存储空间,通过第三数据在数据缓存单元中的起始存储位置和终止存储位置能够准确地确定该第三数据所占用的存储空间。数据搬移单元根据第三有效字段,将处理后的第三数据从数据缓存单元搬移至数据存储单元,包括:数据搬移单元根据处理后的第三数据在数据缓存单元中的起始存储位置以及处理后的第三数据的数据长度,从数据缓存单元中读取处理后的第三数据,将处理后的第三数据搬移至数据存储单元中的目标位置处。
另外,数据搬移单元包括多个子单元,因此,步骤606通过数据搬移单元中的第二数据搬移子单元完成。
需要说明的是,指令处理单元同时向数据处理单元和数据搬移单元发送一条并行控制指令,可能并不能指示处理器完成对待处理数据所要执行的所有操作,还需要继续向数据处理单元和数据搬移单元发送并行控制指令。可选地,指令处理单元再次同时向数据处理单元和数据搬移单元发送并行控制指令的时机为:数据处理单元和数据搬移单元已根据上一条并行控制指令完成工作,下面以步骤607至步骤609进行说明。
需要说明的是,本申请实施例中数据搬移单元还将数据处理单元输出的处理后的数据,从数据缓存单元搬移至数据存储单元,因此,数据存储单元中不仅用于存储待处理数据拆分得到的多条数据,还用于存储数据处理单元处理后的数据,而在处理器对数据进行处理的过程中,需要数据处理单元对数据进行多次处理,或者,数据处理单元包括多个数据处理子单元,需要通过多个数据处理子单元对数据先后进行处理,因此,第二数据为待处理数据,或者为数据处理单元上一次处理后输出的数据,或者为数据处理单元中某一数据处理子单元输出的处理后的数据。
607、在数据处理单元将处理后的第一数据输出至数据缓存单元之后,数据处理单元向指令处理单元发送第一完成消息。
其中,第一完成消息用于指示数据处理单元已执行完操作,可选地,该第一完成消息携带数据处理单元的标识,以便指令处理单元根据该第一完成消息确定数据处理单元已完成操作。该数据处理单元的标识为确定唯一数据处理单元的标识,例如,该标识为该数据处理单元的编号、该数据处理单元的名称等。
608、在数据搬移单元将第二数据从数据存储单元搬移至数据缓存单元之后,且将处理后的第三数据从数据缓存单元搬移至数据存储单元之后,数据搬移单元向指令处理单元发送第二完成消息。
其中,第二完成消息用于指示数据搬移单元已执行完操作,可选地,该第二完成消息携带数据搬移单元的标识,以便指令处理单元根据该第二完成消息确定数据搬移单元已完成操作。该数据搬移单元的标识为确定唯一数据搬移单元的标识,可以为该数据搬移单元的编号、该数据搬移单元的名称等。
需要说明的是,若并行控制指令仅指示数据搬移单元将第二数据从数据存储单元搬移至数据缓存单元,在数据搬移单元将第二数据从数据存储单元搬移至数据缓存单元之后,数据搬移单元向指令处理单元发送第二完成消息。若并行控制指令仅指示数据搬移单元将处理后的第三数据从数据缓存单元搬移至数据存储单元,数据搬移单元将处理后的第三数据从数据缓存单元搬移至数据存储单元之后,数据搬移单元向指令处理单元发送第二完成消息。若并行控制指令指示数据搬移单元将第二数据从数据存储单元搬移至数据缓存单元,且将处理后的第三数据从数据缓存单元搬移至数据存储单元,在数据搬移单元将第二数据从数据存储单元搬移至数据缓存单元,且将处理后的第三数据从数据缓存单元搬移至数据存储单元之后,数据搬移单元向指令处理单元发送第二完成消息。
609、指令处理单元接收到第一完成消息和第二完成消息后,同时向数据处理单元和数据搬移单元发送与第四数据匹配的下一并行控制指令,第四数据为拆分后的多条数据中第二数据的下一条数据。
在数据搬移单元将第二数据从数据存储单元搬移至数据缓存单元之后,数据搬移单元需要继续搬移第二数据的下一条数据,也即是第四数据,因此,指令处理单元需要读取与第四数据匹配的下一并行控制指令,同时向数据处理单元和数据搬移单元发送该下一并行控制指令,该下一并行控制指令用于控制数据搬移单元将第四数据从数据存储单元搬移至数据缓存单元,还用于控制数据处理单元读取数据缓存单元已缓存的第二数据,对第二数据进行处理,将处理后的第二数据输出至数据缓存单元;还用于控制数据搬移单元将处理后的第一数据从数据缓存单元搬移至数据存储单元。
另外,指令存储单元中存储的多条并行控制指令用于控制处理器完成对一条待处理数据拆分得到的多条数据的处理操作,因此,多条并行控制指令是按照处理器的处理流程依次排列的,而指令处理单元将指令存储单元中的至少一条并行控制指令搬移至指令缓存单元时,也是按照顺序依次搬移的,指令缓存单元中至少一条并行控制指令的排列顺序与处理器的处理流程是匹配的,因此,指令处理单元能够直接从指令缓存单元中读取当前存储时间最长的并行控制指令,该并行控制指令即为与第四数据匹配的下一并行控制指令。其中,上述数据处理方法的流程如图13和图14所示。
需要说明的是,本申请实施例提供的步骤601为可选执行步骤,在一种可能实现方式中,计算机设备需要处理多条待处理数据的情况下,不使用拆分单元将每条待处理数据拆分为多条数据,而是直接对多条待处理数据进行处理,虽然搬移待处理数据的过程可能相对较慢,但是由于将数据处理和数据搬移并行执行,尽可能地避免了在等待数据搬移过程执行之后再执行数据处理过程,还是提高了处理器的处理效率,因此,本申请实施例对是否拆分待处理数据不做限定。
需要说明的是,上述数据处理的过程为神经网络模型的数据处理过程。在一种可能实现方式中,数据处理单元根据并行控制指令,读取数据缓存空间中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间,包括:数据处理单元根据并行控制指令,基于神经网络模型,读取数据缓存空间已缓存的第一数据;对第一数据进行处理,将处理后的第一数据输出至数据缓存空间。
另外,数据处理单元还包括多个数据处理子单元,通过该多个数据处理子单元来实现神经网络模型的数据处理。在一种可能实现方式中,数据处理子单元同时接收到各个数据处理子单元对应的数据处理指令,根据该数据处理指令并行进行数据处理。
例如,第一数据处理子单元用于对数据进行卷积处理,第二数据处理子单元用于对数据进行池化处理。若第一数据处理子单元接收到对应的数据处理指令,则读取数据缓存单元已缓存的第一数据,对第一数据进行卷积处理,将卷积处理后的第一数据输出至数据缓存单元。同时,第二数据处理子单元接收到对应的数据处理指令,读取数据缓存单元已缓存的卷积处理后的第三数据,对卷积处理后的第三数据进行池化处理,将池化处理后的第三数据输出至数据缓存单元。其中第三数据为拆分后的多条数据中第一数据的上一条数据。
本申请实施例提供的数据处理方法,通过同时向数据处理单元和数据搬移单元发送并行控制指令,使得数据处理单元和数据搬移单元能够并行执行,并且,数据处理单元本次处理的是数据搬移单元上一次搬移的数据,数据处理单元无需等待数据搬移单元本次搬移完成,即可进行处理,数据处理过程不再依赖于数据搬移过程,提高了处理速度和处理效率。
另外相关技术中,为了避免数据搬移单元在搬移较大的待处理数据时,耗时较久,会将待处理的数据进行压缩、裁剪等处理,来减小待处理数据的数据量,但是压缩、裁剪等处理会导致图片中的某些信息丢失,而本申请实施例并未对待处理数据进行压缩、裁剪等处理,因此,并未丢失待处理数据中的信息,保证了处理器处理结果的准确性。
另外,相关技术中,为了避免数据搬移单元在搬移较大的待处理数据耗时较久,会为处理器配置容量更大的缓存单元,使得处理器能够容纳待处理的数据,处理器在进行数据处理时,待处理的数据可以仅在处理器中搬移,从而加快搬移速度;或者,在处理中使用更高的总线带宽,如采用HBM(High Bandwidth Memory,高带宽存储器),也即是通过提高数据传输效率来减小数据搬移的耗时。但是为处理器配置容量更大的缓存单元,会导致处理器的面积增加,成本也随之显著增加;采用高带宽存储器带来的成本增加也比较明显。
而本申请实施例提供的数据处理方法,是将待处理数据拆分为多条数据,对该多条数据进行处理,并且通过将数据搬移过程和数据处理过程并行执行,尽可能地避免数据处理过程对数据搬移过程的等待时间,从而减少数据搬移对处理器处理速度的影响。因此,本申请实施例对处理器的缓存单元和带宽没有较高的要求,不会增加处理器的成本。
另外相关技术中,在控制两个单元或者多个单元并行处理时,由于一个指令只能控制一个单元执行一个操作,因此,若想要控制两个单元或者多个单元并行处理时,需要软件和硬件交互控制,通过多种指令的同步和调度,例如,多个指令处理单元根据复杂的调度和同步机制,尽可能地使得两个单元或者多个单元并行处理。
而本申请实施例通过一个指令处理单元,即可同时向处理器的数据处理单元和数据搬移单元发送并行控制指令,使得数据处理单元和数据搬移单元并行处理,仅通过一个指令处理单元即可实现同时向两个或者多个单元发送指令,无需多个指令处理单元,同时也不需要多个指令处理单元之间的交互调度,实现代价小。并且,在处理器的处理过程中,无需软件和硬件交互,避免了软件和硬件交互带来的处理器的性能损失。
另外,本申请实施例通过处理器之外的指令存储单元,存储用于处理待处理数据拆分后的多条数据的多个并行控制指令,相对于将该多个并行控制指令存储在指令缓存单元来说,可以大大降低成本,并且,由于处理器的运算复杂程度逐渐加深、待处理数据的数据量逐渐加大,需要的并行控制指令也越来越多,利用处理器之外的存储单元来存储多个并行控制指令,能够节省处理器内部存储的成本,并且通过指令存储单元和指令缓存单元的两级存储,也较好地解决了并行控制指令较多和并行控制指令较长导致的并行控制指令占用存储空间较多的问题。
需要说明的是,在一种可能实现方式中,处理器为AI芯片,如图15所示,该AI芯片用于执行人工智能应用中神经网络模型所执行的计算任务。以处理器的数据处理单元包括卷积处理子单元,数据搬移单元包括load(加载)搬移子单元和store(存储)搬移子单元,将待处理数据拆分为1296条Tile为例进行说明。
在处理器投入使用后,待处理数据的具体内容不是固定的,但是该处理器的待处理数据的类型和尺寸是固定的。因此,将待处理数据拆分为多条数据后,每条数据的尺寸和多条数据的数量也是固定的,因此,用于处理拆分后的多条数据的并行控制指令的数量也是固定的,如拆分后的多条Tile为1296条Tile,并行控制指令为1298个。本申请实施例以多条Tile为6条Tile,并行控制指令为8个,对处理器的卷积处理过程进行示例性说明,如图16所示。
在人工智能应用对待处理数据进行分析的情况下,第1个并行控制指令用于控制load搬移子单元工作,第2个并行控制指令用于控制load搬移子单元和卷积处理子单元工作,第3个至第6个并行控制指令用于控制load搬移子单元、卷积处理子单元和store搬移子单元工作,第7个并行控制指令用于控制卷积处理子单元和store搬移子单元工作,第8个并行控制指令用于控制store搬移子单元工作。其中,load搬移子单元用于将拆分后的多个数据依次从数据存储单元搬移至数据缓存单元中,store搬移子单元用于将卷积处理子单元处理后的数据从数据缓存单元搬移至数据存储单元。
因此,8个并行控制指令执行完毕之后,处理器完成了对6条Tile的卷积处理,并将处理后的6条Tile搬移至数据存储单元中。
可选地,数据处理单元还包括池化处理子单元、全连接处理子单元等其他处理子单元中的任一子单元。本申请实施例以数据处理单元还包括池化处理子单元为例,对处理器的处理过程进行说明。
由于在处理器完成了对6条Tile的卷积处理后,将处理后的6条Tile搬移至数据存储单元中,因此,还需要将通过卷积处理后的6条Tile从数据存储单元依次搬移至数据缓存单元中,对通过卷积处理后的6条Tile依次进行池化处理,该池化处理过程与卷积处理过程类似,在此不再一一赘述。
在另一种可能实现方式中,处理器对拆分后的多条数据进行卷积处理和池化处理,在将一条Tile进行卷积处理之后,将通过卷积处理后的Tile作为池化处理子单元的输入,再对该Tile进行池化处理之后,将池化处理后的该Tile搬移至数据存储单元中。
以多条Tile为6条Tile,并行控制指令为10个,对处理器的卷积和池化处理过程进行示例性说明,如图17所示,其中,数据搬移单元还包括move(搬移)搬移子单元,该move搬移子单元用于将数据缓存单元中的数据从第一目标位置搬移至第二目标位置。
第1个并行控制指令用于控制load搬移子单元工作,第2个并行控制指令用于控制load搬移子单元和卷积处理子单元工作,第3个并行控制指令用于控制load搬移子单元、卷积处理子单元和move搬移子单元工作,第4个并行控制指令用于控制load搬移子单元、卷积处理子单元、move搬移子单元和池化处理子单元工作,第5个至第6个并行控制指令用于控制load搬移子单元、卷积处理子单元、move搬移子单元、池化处理子单元和store搬移子单元工作。第7个并行控制指令用于控制卷积处理子单元、move搬移子单元、池化处理子单元和store搬移子单元工作。第8个并行控制指令用于控制move搬移子单元、池化处理子单元和store搬移子单元工作。第9个并行控制指令用于控制池化处理子单元和store搬移子单元工作。第10个并行控制指令用于控制store搬移子单元工作。
其中,move搬移子单元用于将卷积处理子单元对应的输出存储空间的数据搬移至池化处理单元对应的输入存储空间。store搬移子单元用于将池化处理子单元输出的池化后的数据从数据缓存单元搬移至数据存储单元中。
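图16与图17所示的流水调度规律可以统一用如下Python代码示意:第k条数据在第k+i条并行控制指令中经过第i个流水级,n条数据经过m个流水级共需n+m-1条并行控制指令(3级流水对应6条数据8个指令,5级流水对应6条数据10个指令)。该函数为本文为便于理解所作的示例:

```python
def multi_stage_schedule(n_tiles, stages):
    """返回每条并行控制指令中处于工作状态的流水级列表。
    第i个流水级(从0起)在第t条指令(从0起)中处理第t-i条数据,
    条件是 0 <= t - i < n_tiles。"""
    total = n_tiles + len(stages) - 1
    schedule = []
    for t in range(total):
        active = [s for i, s in enumerate(stages) if 0 <= t - i < n_tiles]
        schedule.append(active)
    return schedule

sched = multi_stage_schedule(6, ["load", "conv", "move", "pool", "store"])
print(len(sched))    # 10,与图17中10个并行控制指令一致
print(sched[0])      # ['load'],第1个指令仅load搬移子单元工作
print(sched[-1])     # ['store'],第10个指令仅store搬移子单元工作
```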
另外,上述实施例提供的数据处理方法不仅能够应用在计算机设备的处理器中,还能够应用于其他部件中,本申请实施例对此不做限定。图18是本申请实施例提供的一种数据处理方法的流程图,本申请实施例的执行主体为任一计算机设备,参见图18,该方法包括:
1801、对待处理数据进行拆分,得到拆分后的多条数据,将该多条数据存储于数据存储空间。
如果待处理数据的数据量较大,为了减少每次数据搬移所消耗的时间,将待处理数据拆分为多条数据,并且在后续对拆分后的多条数据进行处理时,会将拆分后的多条数据依次从数据存储空间搬移至数据缓存空间,因此,在一种可能实现方式中,按照拆分规则,将待处理数据拆分为多条数据,该拆分规则指示拆分得到的任一条数据的数据量不大于数据缓存空间的缓存容量。
例如,获取待处理数据,根据数据缓存空间的缓存容量对待处理数据进行拆分,得到多条数据,将多条数据组成的数据序列存储在数据存储空间中。
可选地,待处理数据为特征图数据,按照数据缓存空间的缓存容量,将待处理数据拆分为多条数据,例如,数据缓存空间的缓存容量为16KB,特征图数据的尺寸为128*128*128,将特征图数据进行拆分,得到的多条数据为16条32*32*128的特征图数据。
可选地,待处理数据为图片数据,按照数据缓存空间的缓存容量,将待处理数据平均拆分为多条数据,例如,数据缓存空间的缓存容量为16KB,图片数据的尺寸为128*128*3,将图片数据进行拆分,得到的多条数据为16条32*32*3的图片数据。
其中,数据存储空间是计算机设备上任一数据存储单元上的空间,本申请实施例对数据存储空间不做限定。
1802、读取并行控制指令。
其中,指令缓存空间缓存有多个并行控制指令,可选地,指令缓存空间中的多个并行控制指令是从指令存储空间中获取的,指令存储空间中存储有用于指示对拆分后的多条数据进行处理的多条并行控制指令,该多条并行控制指令按照指示的先后顺序依次排列,将该多条并行控制指令按照排列顺序依次缓存至指令缓存空间中,从而读取在指令缓存空间缓存时间最长的并行控制指令。
在一种可能实现方式中,读取并行控制指令包括:读取指令存储空间中的并行控制指令;按照读取顺序将读取到的并行控制指令搬移到指令缓存空间进行缓存,得到指令缓存队列;按照指令缓存顺序从指令缓存队列中读取并行控制指令。其中,指令缓存队列为位于指令缓存空间中、且包含至少一个指令的队列。
在一种可能实现方式中,并行控制指令包括数据处理指令和数据搬移指令,数据处理指令携带第一有效字段,数据搬移指令携带第二有效字段,从而数据处理指令和数据搬移指令用于指示不同的操作,例如,第一有效字段用于指示执行对第一数据进行处理的操作,第二有效字段用于指示执行缓存第二数据的操作。
在一种可能实现方式中,在读取并行控制指令之后,该方法还包括:提取并行控制指令中的有效字段指示信息;根据该有效字段指示信息,确定并行控制指令中第一有效字段和第二有效字段;从并行控制指令中读取第一有效字段和第二有效字段,从而得到数据处理指令和数据搬移指令。可选地,通过读取第一有效字段,得到数据处理指令,通过读取第二有效字段,得到数据搬移指令。
在一种可能实现方式中,指令缓存空间为指令缓存单元,指令存储空间为指令存储单元。
1803、根据并行控制指令,读取数据缓存空间已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间。
第一数据是拆分后的多条数据中的任一条数据,由于拆分后的多条数据组成的数据序列存储在数据存储空间中,因此,第一数据是数据序列中的任一条数据。
在一种可能实现方式中,待处理数据为图片数据,第一数据为待处理数据被拆分后的多条数据中的一条数据,因此第一数据为一个小图片。
在一种可能实现方式中,根据并行控制指令,读取数据缓存空间已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间,包括:根据并行控制指令,基于神经网络模型,读取数据缓存空间已缓存的第一数据;并对该第一数据进行处理,将处理后的第一数据输出至数据缓存空间。
在一种可能实现方式中,并行控制指令为数据处理指令,根据并行控制指令,读取数据缓存空间已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间,包括:根据数据处理指令携带的第一有效字段,读取数据缓存空间已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间。
在另一种可能实现方式中,根据并行控制指令,读取数据缓存空间已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间,包括:提取该并行控制指令中的有效字段指示信息;根据该有效字段指示信息,确定该并行控制指令中第一有效字段,根据第一有效字段,读取数据缓存空间已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间。
1804、同时,根据并行控制指令,将第二数据从数据存储空间搬移至数据缓存空间,该第二数据为第一数据的下一条数据。
1805、同时,根据并行控制指令,将处理后的第三数据从数据缓存空间搬移至数据存储空间,第三数据为第一数据的上一条数据。
1806、将处理后的第一数据输出至数据缓存空间之后,将第二数据从数据存储空间搬移至数据缓存空间,且将处理后的第三数据从数据缓存空间搬移至数据存储空间之后,读取与第四数据匹配的下一并行控制指令,第四数据为第二数据的下一条数据。
读取下一并行控制指令之后,根据该下一并行控制指令,读取数据缓存空间已缓存的第二数据,对该第二数据进行处理,将处理后的第二数据输出至数据缓存空间;同时,根据该下一并行控制指令,将第四数据从数据存储空间搬移至数据缓存空间,同时,根据该下一并行控制指令,将处理后的第一数据从数据缓存空间搬移至数据存储空间,在执行完上述操作之后,重复执行读取下一并行控制指令、根据下一并行控制指令执行操作的过程,直至拆分后的多条数据处理完成或者直至指令存储空间中的多条并行控制指令均执行一遍。
需要说明的是,上述数据处理方法应用于神经网络模型中,在一种可能实现方式中,根据并行控制指令,基于神经网络模型,读取数据缓存空间已缓存的第一数据,对第一数据进行处理,将处理后的第一数据输出至数据缓存空间。
另外,神经网络模型包括多个层,在一种可能实现方式中,根据神经网络模型中各层分别对应的数据处理指令,并行进行数据处理,也即是神经网络模型中的每个层并行进行数据处理。
例如,以神经网络模型包括卷积层和池化层为例,对神经网络模型中的每个层并行进行数据处理进行说明,同时接收到卷积层对应的数据处理指令和池化层对应的数据处理指令;基于卷积层,读取数据缓存空间已缓存的第一数据,对该第一数据进行卷积处理,将卷积处理后的第一数据输出至数据缓存空间;同时,基于池化层,读取数据缓存空间已缓存的卷积处理后的第三数据,对卷积处理后的第三数据进行池化处理,将池化处理后的第三数据输出至数据缓存空间,实现了卷积层和池化层的并行运行。
本申请实施例提供的数据处理方法,通过读取并行控制指令,从而根据并行控制指令同时执行数据处理操作和数据搬移操作,尽可能地减少了数据处理操作等待数据搬移操作的时长,从而提高了数据处理的速度和效率。并且,本次处理的是数据搬移单元上一次搬移的数据,无需等待数据搬移过程,即可进行处理,减少了数据处理过程对数据搬移过程的依赖,提高了处理速度和处理效率。
另外相关技术中,为了避免在搬移较大的待处理数据时,耗时较久,会将待处理的数据进行压缩、裁剪等处理,来减小待处理数据的数据量,但是压缩、裁剪等处理会导致图片中的某些信息丢失,而本申请实施例并未对待处理数据进行压缩、裁剪等处理,因此,并未丢失待处理数据中的信息,保证了处理器处理结果的准确性。
另外,相关技术中,为了避免在搬移较大的待处理数据耗时较久,会设置更大缓存容量的数据缓存空间,这样在进行数据处理时,可以加快搬移速度;或者,使用更高的总线带宽,如采用HBM(High Bandwidth Memory,高带宽存储器),也即是通过提高数据传输效率来减小数据搬移的耗时。但是设置更大缓存容量的缓存空间,会导致成本显著增加;采用高带宽存储器带来的成本增加也比较明显。
而本申请实施例提供的数据处理方法,是将待处理数据拆分为多条数据,对该拆分后的多条数据进行处理,并且通过将数据搬移过程和数据处理过程并行执行,尽可能地避免数据处理过程对数据搬移过程的等待时间,从而减少数据搬移对处理速度的影响。因此,本申请实施例对数据缓存空间和带宽没有较高的要求,不会增加成本。
另外相关技术中,一个指令只能用于指示执行一个操作,因此,若想要执行两个或者两个以上的操作时,需要软件和硬件交互控制,通过多种指令的同步和调度,例如,多个指令处理单元根据复杂的调度和同步机制,尽可能地使得两个操作并行进行。
而本申请实施例通过读取并行控制指令,根据该并行控制指令,同时执行数据处理和数据搬移,无需多个指令处理单元,同时也不需要多个指令处理单元之间的交互调度,实现代价小。并且,在数据处理过程中,无需软件和硬件交互,避免了软件和硬件交互带来的数据处理性能损失。
另外,本申请实施例还提供了一种数据处理芯片,可选地,该芯片安装在任一计算机设备中,实现该计算机设备的数据处理功能,如图19所示,该芯片1900包括指令处理单元1901、数据处理单元1902、数据搬移单元1903和数据缓存单元1904;该指令处理单元1901,用于读取并行控制指令,同时向该数据处理单元1902和该数据搬移单元1903发送该并行控制指令;该数据处理单元1902,用于根据该并行控制指令,读取数据缓存单元1904已缓存的第一数据,对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存单元;该数据搬移单元1903,用于同时根据该并行控制指令,将待处理的第二数据从位于该芯片之外的数据存储单元搬移至该数据缓存单元1904,该第二数据为第一数据的下一条数据。
在一种可能实现方式中,并行控制指令包括数据处理指令和数据搬移指令;该指令处理单元1901,用于提取该并行控制指令中的数据处理指令和数据搬移指令;该数据处理单元1902,用于根据该数据处理指令,读取该数据缓存单元中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存单元;该数据搬移单元1903,用于根据该数据搬移指令,将该第二数据从数据存储单元搬移到该数据缓存单元。
其中,指令处理单元1901提取并行控制指令中的数据处理指令和数据搬移指令,是通过提取并行控制指令中的有效字段指示信息实现的;根据该有效字段指示信息,确定该并行控制指令中第一有效字段和第二有效字段;从该并行控制指令中读取第一有效字段和第二有效字段,得到数据处理指令和数据搬移指令。可选地,从并行控制指令中读取第一有效字段,得到数据处理指令;从并行控制指令中读取第二有效字段,得到数据搬移指令。
如图19所示,在一种可能实现方式中,该芯片还包括指令缓存单元1905;该指令处理单元1901,用于读取位于芯片之外的指令存储单元中的并行控制指令;按照读取顺序将读取到的并行控制指令搬移到指令缓存单元1905进行缓存,得到指令缓存队列;按照指令缓存顺序从该指令缓存队列中读取并行控制指令。其中,指令缓存队列为位于指令缓存单元1905中、且包含至少一个指令的队列。
在数据处理单元1902将处理后的数据输出至数据缓存单元后,还由数据搬移单元1903将输出至数据缓存单元的数据,搬移至位于芯片之外的数据存储单元。在一种可能实现方式中,该数据搬移单元1903,用于根据该并行控制指令,将处理后的第三数据从该数据缓存单元1904搬移至该芯片之外的数据存储单元,该第三数据为该第一数据的上一条数据。
在一种可能实现方式中,该数据处理单元1902,用于将该处理后的第一数据输出至该数据缓存单元1904之后,向该指令处理单元1901发送第一完成消息;该数据搬移单元1903,用于将该第二数据从位于该芯片之外的数据存储单元搬移至该数据缓存单元1904之后,向该指令处理单元1901发送第二完成消息;该指令处理单元1901,用于接收到该第一完成消息和该第二完成消息后,同时向该数据处理单元1902和该数据搬移单元1903发送与第四数据匹配的下一并行控制指令,该第四数据为该多条数据中该第二数据的下一条数据。
在一种可能实现方式中,如图20所示,该芯片1900为人工智能AI芯片,待处理数据为图片数据;数据处理单元1902,用于根据并行控制指令,读取数据缓存单元1904中已缓存的第一数据;数据处理单元1902,还用于基于神经网络模型,对第一数据进行处理,将处理后的第一数据输出至数据缓存单元1904,处理后的第一数据再由数据搬移单元1903搬移至数据存储单元,该数据存储单元为位于AI芯片之外的任一数据存储单元。
其中,该数据处理单元1902包括多个数据处理子单元,该数据搬移单元1903包括加载load引擎、存储store引擎和搬移move引擎中的至少一项;该load引擎,用于将待处理数据从该数据存储单元搬移至该数据缓存单元;任一数据处理子单元,用于读取该数据缓存单元中已缓存的数据,对该数据进行处理,将处理后的数据输出至该任一数据处理子单元对应的输出存储单元;该move引擎,用于将多个数据处理子单元中,除最后一个数据处理子单元之外的其他数据处理子单元处理后的数据从该输出存储单元搬移至下一数据处理子单元对应的输入存储单元;该store引擎,用于将该最后一个数据处理子单元处理后的数据从最后一个数据处理子单元对应的输出存储单元搬移至该数据存储单元。
在一种可能实现方式中,如图20所示,该数据处理单元1902包括卷积引擎或池化引擎中的至少一项。
图21是本申请实施例提供的一种终端的结构框图。该终端2100用于执行上述实施例中终端执行的步骤,例如,该终端是便携式移动终端,比如:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端2100还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
终端2100包括有:处理器2101和存储器2102。
处理器2101包括一个或多个处理核心,比如4核心处理器、8核心处理器等。可选地,处理器2101采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。可选地,处理器2101包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器2101中集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。在一些实施例中,处理器2101还包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器2102包括一个或多个计算机可读存储介质,该计算机可读存储介质是非暂态的。存储器2102还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器2102中的非暂态的计算机可读存储介质用于存储至少一个程序代码,该至少一个程序代码用于被处理器2101所执行以实现本申请中方法实施例提供的数据处理方法。
在一些实施例中,终端2100还可选包括有:外围设备接口2103和至少一个外围设备。处理器2101、存储器2102和外围设备接口2103之间能够通过总线或信号线相连。各个外围设备通过总线、信号线或电路板与外围设备接口2103相连。可选地,外围设备包括:射频电路2104、显示屏2105、摄像头组件2106、音频电路2107、定位组件2108或者电源2109中的至少一种。
外围设备接口2103可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器2101和存储器2102。在一些实施例中,处理器2101、存储器2102和外围设备接口2103被集成在同一芯片或电路板上;在一些其他实施例中,处理器2101、存储器2102和外围设备接口2103中的任意一个或两个可以在单独的芯片或电路板上实现,本申请实施例对此不加以限定。
射频电路2104用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路2104通过电磁信号与通信网络以及其他通信设备进行通信。射频电路2104将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路2104包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。射频电路2104可以通过至少一种无线通信协议来与其它终端进行通信。该无线通信协议包括但不限于:万维网、城域网、内联网、各代移动通信网络(2G、3G、4G及5G)、无线局域网和/或WiFi(Wireless Fidelity,无线保真)网络。在一些实施例中,射频电路2104还包括NFC(Near Field Communication,近距离无线通信)有关的电路,本申请对此不加以限定。
显示屏2105用于显示UI(User Interface,用户界面)。该UI包括图形、文本、图标、视频及其它们的任意组合。当显示屏2105是触摸显示屏时,显示屏2105还具有采集在显示屏2105的表面或表面上方的触摸信号的能力。该触摸信号作为控制信号输入至处理器2101进行处理。此时,显示屏2105还用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。在一些实施例中,显示屏2105为一个,设置终端2100的前面板;在另一些实施例中,显示屏2105为至少两个,分别设置在终端2100的不同表面或呈折叠设计;在再一些实施例中,显示屏2105可以是柔性显示屏,设置在终端2100的弯曲表面上或折叠面上。甚至,显示屏2105还设置成非矩形的不规则图形,也即异形屏。显示屏2105采用LCD(Liquid Crystal Display,液晶显示屏)、OLED(Organic Light-Emitting Diode,有机发光二极管)等材质制备。
摄像头组件2106用于采集图像或视频。可选地,摄像头组件2106包括前置摄像头和后置摄像头。通常,前置摄像头设置在终端的前面板,后置摄像头设置在终端的背面。在一些实施例中,后置摄像头为至少两个,分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种,以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及VR(Virtual Reality,虚拟现实)拍摄功能或者其它融合拍摄功能。在一些实施例中,摄像头组件2106还包括闪光灯。可选地,闪光灯是单色温闪光灯,或者是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合,用于不同色温下的光线补偿。
音频电路2107包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器2101进行处理,或者输入至射频电路2104以实现语音通信。出于立体声采集或降噪的目的,麦克风可以为多个,分别设置在终端2100的不同部位。可选地,麦克风是阵列麦克风或全向采集型麦克风。扬声器则用于将来自处理器2101或射频电路2104的电信号转换为声波。可选地,扬声器是传统的薄膜扬声器,或者是压电陶瓷扬声器。当扬声器是压电陶瓷扬声器时,不仅能够将电信号转换为人类可听见的声波,还能够将电信号转换为人类听不见的声波以进行测距等用途。在一些实施例中,音频电路2107还包括耳机插孔。
定位组件2108用于定位终端2100的当前地理位置,以实现导航或LBS(Location Based Service,基于位置的服务)。定位组件2108可以是基于美国的GPS(Global Positioning System,全球定位系统)、中国的北斗系统或俄罗斯的格雷纳斯系统或欧盟的伽利略系统的定位组件。
电源2109用于为终端2100中的各个组件进行供电。电源2109是交流电、直流电、一次性电池或可充电电池。当电源2109包括可充电电池时,该可充电电池支持有线充电或无线充电。该可充电电池还用于支持快充技术。
本领域技术人员可以理解,图21中示出的结构并不构成对终端2100的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
图22是本申请实施例提供的一种服务器的结构示意图,该服务器2200可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(Central Processing Units,CPU)2201和一个或一个以上的存储器2202,其中,存储器2202中存储有至少一条程序代码,至少一条程序代码由处理器2201加载并执行以实现上述各个方法实施例提供的方法。当然,该服务器还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器还可以包括其他用于实现设备功能的部件,在此不做赘述。
服务器2200可以用于执行上述数据处理方法中服务器所执行的步骤。
本申请实施例还提供了一种计算机设备,该计算机设备包括处理器和存储器,该存储器中存储有至少一条程序代码,该程序代码由该处理器加载并执行上述实施例的数据处理方法中所执行的操作。
例如,在一种可能实现方式中,计算机设备包括:处理器和数据存储单元,处理器包括:指令处理单元、数据处理单元、数据搬移单元和数据缓存单元;
指令处理单元,用于读取并行控制指令;
数据处理单元,用于根据并行控制指令,读取数据缓存单元已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存单元;
数据搬移单元,用于同时根据并行控制指令,将第二数据从数据存储单元搬移至数据缓存单元,第二数据为第一数据的下一条数据。
在一种可能实现方式中,计算机设备包括指令存储单元,处理器包括指令缓存单元,指令处理单元,用于读取指令存储单元中的并行控制指令;按照读取顺序将读取到的并行控制指令搬移到指令缓存单元进行缓存,得到指令缓存队列;按照指令缓存顺序从指令缓存队列中读取并行控制指令。
在一种可能实现方式中,并行控制指令中包括数据处理指令和数据搬移指令;指令处理单元,用于提取并行控制指令中的数据处理指令和数据搬移指令;
指令处理单元,还用于向数据处理单元发送数据处理指令,同时向数据搬移单元发送数据搬移指令;
数据处理单元,用于根据数据处理指令,读取数据缓存单元中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存单元;
数据搬移单元,用于根据数据搬移指令,将第二数据从数据存储单元搬移到数据缓存单元。
在一种可能实现方式中,指令处理单元,用于提取并行控制指令中的有效字段指示信息;根据有效字段指示信息,确定并行控制指令中第一有效字段和第二有效字段;从并行控制指令中读取第一有效字段,得到数据处理指令;从并行控制指令中读取第二有效字段,得到数据搬移指令。
在一种可能实现方式中,计算机设备还包括拆分单元,拆分单元,用于获取待处理数据;根据数据缓存单元的缓存容量对待处理数据进行拆分,得到拆分后的多条数据;将多条数据组成的数据序列存储在数据存储单元。
在一种可能实现方式中,待处理数据为图片数据;数据处理单元,用于根据并行控制指令,基于神经网络模型,读取数据缓存单元已缓存的第一数据;并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存单元。
在一种可能实现方式中,数据处理单元,用于根据神经网络模型中各层分别对应的数据处理指令,并行进行数据处理。
在一种可能实现方式中,神经网络模型包括卷积层和池化层,数据处理单元,用于接收卷积层对应的数据处理指令和池化层对应的数据处理指令;
根据卷积层对应的数据处理指令,基于卷积层,读取数据缓存单元已缓存的第一数据,对第一数据进行卷积处理,将卷积处理后的第一数据输出至数据缓存单元;
同时,根据池化层对应的数据处理指令,基于池化层,读取数据缓存单元已缓存的卷积处理后的第三数据,对卷积处理后的第三数据进行池化处理,将池化处理后的第三数据输出至数据缓存单元,第三数据为多条数据中第一数据的上一条数据。
本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有至少一条程序代码,该至少一条程序代码由处理器加载并执行以实现如下操作:
读取并行控制指令;
根据并行控制指令,读取数据缓存空间中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间;
同时,根据并行控制指令,将第二数据从数据存储空间搬移到数据缓存空间,第二数据为第一数据的下一条数据。
在一种可能实现方式中,该至少一条程序代码由处理器加载并执行以实现如下操作:
读取指令存储空间中的并行控制指令;
按照读取顺序将读取到的并行控制指令搬移到指令缓存空间进行缓存,得到指令缓存队列;
按照指令缓存顺序从指令缓存队列中读取并行控制指令。
在一种可能实现方式中,并行控制指令中包括数据处理指令和数据搬移指令;该至少一条程序代码还由处理器加载并执行以实现如下操作:
提取并行控制指令中的数据处理指令和数据搬移指令;
根据数据处理指令,读取数据缓存空间中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间;
根据数据搬移指令,将第二数据从数据存储空间搬移到数据缓存空间。
在一种可能实现方式中,该至少一条程序代码由处理器加载并执行以实现如下操作:
提取并行控制指令中的有效字段指示信息;
根据有效字段指示信息,确定并行控制指令中第一有效字段和第二有效字段;
从并行控制指令中读取第一有效字段,得到数据处理指令;从并行控制指令中读取第二有效字段,得到数据搬移指令。
在一种可能实现方式中,该至少一条程序代码还由处理器加载并执行以实现如下操作:
获取待处理数据;
根据数据缓存空间的缓存容量对待处理数据进行拆分,得到拆分后的多条数据;
将多条数据组成的数据序列存储在数据存储空间。
在一种可能实现方式中,待处理数据为图片数据;该至少一条程序代码由处理器加载并执行以实现如下操作:
根据并行控制指令,基于神经网络模型,读取数据缓存空间已缓存的第一数据;并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间。
在一种可能实现方式中,该至少一条程序代码还由处理器加载并执行以实现如下操作:
根据神经网络模型中各层分别对应的数据处理指令,并行进行数据处理。
在一种可能实现方式中,神经网络模型包括卷积层和池化层,该至少一条程序代码还由处理器加载并执行以实现如下操作:
接收卷积层对应的数据处理指令和池化层对应的数据处理指令;
根据卷积层对应的数据处理指令,基于卷积层,读取数据缓存空间已缓存的第一数据,对第一数据进行卷积处理,将卷积处理后的第一数据输出至数据缓存空间;
同时,根据池化层对应的数据处理指令,基于池化层,读取数据缓存空间已缓存的卷积处理后的第三数据,对卷积处理后的第三数据进行池化处理,将池化处理后的第三数据输出至数据缓存空间,第三数据为多条数据中第一数据的上一条数据。
本申请实施例还提供了一种计算机程序,该计算机程序中存储有至少一条程序代码,该至少一条程序代码由处理器加载并执行以实现如下操作:
读取并行控制指令;
根据并行控制指令,读取数据缓存空间中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间;
同时,根据并行控制指令,将第二数据从数据存储空间搬移到数据缓存空间,第二数据为第一数据的下一条数据。
在一种可能实现方式中,该至少一条程序代码由处理器加载并执行以实现如下操作:
读取指令存储空间中的并行控制指令;
按照读取顺序将读取到的并行控制指令搬移到指令缓存空间进行缓存,得到指令缓存队列;
按照指令缓存顺序从指令缓存队列中读取并行控制指令。
在一种可能实现方式中,并行控制指令中包括数据处理指令和数据搬移指令;该至少一条程序代码还由处理器加载并执行以实现如下操作:
提取并行控制指令中的数据处理指令和数据搬移指令;
根据并行控制指令,读取数据缓存空间中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间,包括:
根据数据处理指令,读取数据缓存空间中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间;
根据并行控制指令,将第二数据从数据存储空间搬移到数据缓存空间,包括:
根据数据搬移指令,将第二数据从数据存储空间搬移到数据缓存空间。
在一种可能实现方式中,该至少一条程序代码由处理器加载并执行以实现如下操作:
提取并行控制指令中的有效字段指示信息;
根据有效字段指示信息,确定并行控制指令中第一有效字段和第二有效字段;
从并行控制指令中读取第一有效字段,得到数据处理指令;从并行控制指令中读取第二有效字段,得到数据搬移指令。
在一种可能实现方式中,该至少一条程序代码还由处理器加载并执行以实现如下操作:
获取待处理数据;
根据数据缓存空间的缓存容量对待处理数据进行拆分,得到拆分后的多条数据;
将多条数据组成的数据序列存储在数据存储空间。
在一种可能实现方式中,待处理数据为图片数据;该至少一条程序代码由处理器加载并执行以实现如下操作:
根据并行控制指令,基于神经网络模型,读取数据缓存空间已缓存的第一数据;并对读取的第一数据进行处理,将处理后的第一数据输出至数据缓存空间。
在一种可能实现方式中,该至少一条程序代码还由处理器加载并执行以实现如下操作:
根据神经网络模型中各层分别对应的数据处理指令,并行进行数据处理。
在一种可能实现方式中,该至少一条程序代码还由处理器加载并执行以实现如下操作:
接收卷积层对应的数据处理指令和池化层对应的数据处理指令;
根据卷积层对应的数据处理指令,基于卷积层,读取数据缓存空间已缓存的第一数据,对第一数据进行卷积处理,将卷积处理后的第一数据输出至数据缓存空间;
同时,根据池化层对应的数据处理指令,基于池化层,读取数据缓存空间已缓存的卷积处理后的第三数据,对卷积处理后的第三数据进行池化处理,将池化处理后的第三数据输出至数据缓存空间,第三数据为多条数据中第一数据的上一条数据。
本领域普通技术人员能够理解实现上述实施例的全部或部分步骤通过硬件来完成,也能够通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (17)

  1. 一种数据处理方法,应用于计算机设备,所述方法包括:
    读取并行控制指令;
    根据所述并行控制指令,读取数据缓存空间中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至所述数据缓存空间;
    同时,根据所述并行控制指令,将第二数据从数据存储空间搬移到所述数据缓存空间,所述第二数据为所述第一数据的下一条数据。
  2. 根据权利要求1所述的方法,其中,所述读取并行控制指令,包括:
    读取指令存储空间中的并行控制指令;
    按照读取顺序将读取到的并行控制指令搬移到指令缓存空间进行缓存,得到指令缓存队列;
    按照指令缓存顺序从所述指令缓存队列中读取并行控制指令。
  3. 根据权利要求1所述的方法,其中,所述并行控制指令中包括数据处理指令和数据搬移指令;
    所述读取并行控制指令之后,所述方法还包括:
    提取所述并行控制指令中的所述数据处理指令和所述数据搬移指令;
    所述根据所述并行控制指令,读取所述数据缓存空间中已缓存的第一数据,并对读取的第一数据进行处理,将所述处理后的第一数据输出至所述数据缓存空间,包括:
    根据所述数据处理指令,读取所述数据缓存空间中已缓存的所述第一数据,并对读取的所述第一数据进行处理,将所述处理后的第一数据输出至所述数据缓存空间;
    所述根据所述并行控制指令,将第二数据从数据存储空间搬移到所述数据缓存空间,包括:
    根据所述数据搬移指令,将所述第二数据从所述数据存储空间搬移到所述数据缓存空间。
  4. 根据权利要求3所述的方法,其中,所述提取所述并行控制指令中的所述数据处理指令和所述数据搬移指令,包括:
    提取所述并行控制指令中的有效字段指示信息;
    根据所述有效字段指示信息,确定所述并行控制指令中的第一有效字段和第二有效字段;
    从所述并行控制指令中读取所述第一有效字段,得到所述数据处理指令;从所述并行控制指令中读取所述第二有效字段,得到所述数据搬移指令。
  5. 根据权利要求1所述的方法,其中,所述方法还包括:
    获取待处理数据;
    根据所述数据缓存空间的缓存容量对所述待处理数据进行拆分,得到拆分后的多条数据;
    将所述多条数据组成的数据序列存储在所述数据存储空间。
  6. 根据权利要求1-5中任一项所述的方法,其中,所述待处理数据为图片数据;所述根据所述并行控制指令,读取数据缓存空间中已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至所述数据缓存空间,包括:
    根据所述并行控制指令,基于神经网络模型,读取所述数据缓存空间已缓存的所述第一数据;并对读取的所述第一数据进行处理,将所述处理后的第一数据输出至所述数据缓存空间。
  7. 根据权利要求6所述的方法,其中,所述方法还包括:
    根据所述神经网络模型中各层分别对应的数据处理指令,并行进行数据处理。
  8. 根据权利要求7所述的方法,其中,所述神经网络模型包括卷积层和池化层,所述根据所述神经网络模型中各层分别对应的数据处理指令,并行进行数据处理,包括:
    接收所述卷积层对应的数据处理指令和所述池化层对应的数据处理指令;
    根据所述卷积层对应的数据处理指令,基于所述卷积层,读取所述数据缓存空间已缓存的第一数据,对所述第一数据进行卷积处理,将卷积处理后的第一数据输出至所述数据缓存空间;
    同时,根据所述池化层对应的数据处理指令,基于所述池化层,读取所述数据缓存空间已缓存的卷积处理后的第三数据,对所述卷积处理后的第三数据进行池化处理,将池化处理后的第三数据输出至所述数据缓存空间,所述第三数据为所述多条数据中所述第一数据的上一条数据。
  9. 一种数据处理芯片,所述芯片包括:指令处理单元、数据处理单元、数据搬移单元和数据缓存单元;
    所述指令处理单元,用于读取并行控制指令,同时向所述数据处理单元和所述数据搬移单元发送所述并行控制指令;
    所述数据处理单元,用于根据所述并行控制指令,读取所述数据缓存单元已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至所述数据缓存单元;
    所述数据搬移单元,用于同时根据所述并行控制指令,将第二数据从位于所述芯片之外的数据存储单元搬移至所述数据缓存单元,所述第二数据为所述第一数据的下一条数据。
  10. 根据权利要求9所述的芯片,其中,所述芯片还包括指令缓存单元;
    所述指令处理单元,用于读取位于所述芯片之外的指令存储单元中的并行控制指令;
    所述指令处理单元,用于按照读取顺序将读取到的并行控制指令搬移到所述指令缓存单元进行缓存,得到指令缓存队列;
    所述指令处理单元,用于按照指令缓存顺序从所述指令缓存队列中读取并行控制指令。
  11. 根据权利要求9所述的芯片,其中,所述并行控制指令包括数据处理指令和数据搬移指令;
    所述指令处理单元,用于提取所述并行控制指令中的所述数据处理指令和所述数据搬移指令;
    所述数据处理单元,用于根据所述数据处理指令,读取所述数据缓存单元中已缓存的所述第一数据,并对读取的所述第一数据进行处理,将所述处理后的第一数据输出至所述数据缓存单元;
    所述数据搬移单元,用于根据所述数据搬移指令,将所述第二数据从所述数据存储单元搬移到所述数据缓存单元。
  12. 根据权利要求11所述的芯片,其中,所述指令处理单元,用于提取所述并行控制指令中的有效字段指示信息;根据所述有效字段指示信息,确定所述并行控制指令中的第一有效字段和第二有效字段;从所述并行控制指令中读取所述第一有效字段,得到所述数据处理指令;从所述并行控制指令中读取所述第二有效字段,得到所述数据搬移指令。
  13. 根据权利要求9所述的芯片,其中,所述待处理数据为图片数据,所述芯片为人工智能AI芯片;
    所述数据处理单元,用于根据所述并行控制指令,基于神经网络模型,读取所述数据缓存单元中已缓存的所述第一数据;并对所述第一数据进行处理,将处理后的第一数据输出至所述数据缓存单元。
  14. 根据权利要求13所述的芯片,其中,所述数据处理单元包括多个数据处理子单元,所述数据搬移单元包括加载load引擎、存储store引擎或搬移move引擎中的至少一项;
    所述load引擎,用于将待处理数据从所述数据存储单元搬移至所述数据缓存单元;
    任一数据处理子单元,用于读取所述数据缓存单元中已缓存的数据,对所述数据进行处理,将处理后的数据输出至所述任一数据处理子单元对应的输出存储空间;
    所述move引擎,用于将所述多个数据处理子单元中,除最后一个数据处理子单元之外的其他数据处理子单元处理后的数据从所述输出存储空间搬移至下一数据处理子单元对应的输入存储空间;
    所述store引擎,用于将所述最后一个数据处理子单元处理后的数据从所述最后一个数据处理子单元对应的输出存储空间搬移至所述数据存储单元。
  15. 根据权利要求13所述的芯片,其中,所述数据处理单元包括卷积引擎或池化引擎中的至少一项。
  16. 一种计算机设备,所述计算机设备包括:处理器和数据存储单元,所述处理器包括:指令处理单元、数据处理单元、数据搬移单元和数据缓存单元;
    所述指令处理单元,用于读取并行控制指令;
    所述数据处理单元,用于根据所述并行控制指令,读取所述数据缓存单元已缓存的第一数据,并对读取的第一数据进行处理,将处理后的第一数据输出至所述数据缓存单元;
    所述数据搬移单元,用于同时根据所述并行控制指令,将第二数据从所述数据存储单元搬移至所述数据缓存单元,所述第二数据为所述第一数据的下一条数据。
  17. 一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行以实现如权利要求1至8任一项所述的数据处理方法中所执行的操作。
PCT/CN2020/118893 2019-12-05 2020-09-29 数据处理方法、芯片、设备及存储介质 WO2021109703A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/502,218 US20220035745A1 (en) 2019-12-05 2021-10-15 Data processing method and chip, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911235760.X 2019-12-05
CN201911235760.XA CN111045732B (zh) 2019-12-05 2019-12-05 数据处理方法、芯片、设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/502,218 Continuation US20220035745A1 (en) 2019-12-05 2021-10-15 Data processing method and chip, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021109703A1 (zh) 2021-06-10

Family

ID=70234743

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118893 WO2021109703A1 (zh) 2019-12-05 2020-09-29 Data processing method, chip, device and storage medium

Country Status (3)

Country Link
US (1) US20220035745A1 (zh)
CN (1) CN111045732B (zh)
WO (1) WO2021109703A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111045732B (zh) * 2019-12-05 2023-06-09 腾讯科技(深圳)有限公司 Data processing method, chip, device and storage medium
CN111562948B (zh) * 2020-06-29 2020-11-10 深兰人工智能芯片研究院(江苏)有限公司 System and method for parallelizing serial tasks in a real-time image processing system
CN111651207B (zh) * 2020-08-06 2020-11-17 腾讯科技(深圳)有限公司 Neural network model computing chip, method, apparatus, device, and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205012A (zh) * 2014-06-26 2015-12-30 北京兆易创新科技股份有限公司 Data reading method and device
CN105868121A (zh) * 2016-03-28 2016-08-17 联想(北京)有限公司 Information processing method and electronic device
CN107728939A (zh) * 2017-09-26 2018-02-23 郑州云海信息技术有限公司 Linux-based IO scheduling method, apparatus, device, and storage medium
CN108334474A (zh) * 2018-03-05 2018-07-27 山东领能电子科技有限公司 Data-parallel deep learning processor architecture and method
CN111045732A (zh) * 2019-12-05 2020-04-21 腾讯科技(深圳)有限公司 Data processing method, chip, device and storage medium

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2539974B2 (ja) * 1991-11-20 1996-10-02 富士通株式会社 Register read control method in an information processing apparatus
DE69739404D1 (de) * 1997-12-10 2009-06-25 Hitachi Ltd Optimized memory access method
JP2004289294A (ja) * 2003-03-19 2004-10-14 Fujitsu Ltd Data processing system, data processing apparatus, and data processing method
US20070276989A1 (en) * 2006-05-29 2007-11-29 Sandisk Il Ltd. Predictive data-loader
CN101276294B (zh) * 2008-05-16 2010-10-13 杭州华三通信技术有限公司 Parallel processing method and apparatus for heterogeneous data
CN102207916B (zh) * 2011-05-30 2013-10-30 西安电子科技大学 Multi-core shared memory control device based on instruction prefetching
CN104978282B (zh) * 2014-04-04 2019-10-01 上海芯豪微电子有限公司 Cache system and method
US9632715B2 (en) * 2015-08-10 2017-04-25 International Business Machines Corporation Back-up and restoration of data between volatile and flash memory
US10430081B2 (en) * 2016-06-28 2019-10-01 Netapp, Inc. Methods for minimizing fragmentation in SSD within a storage system and devices thereof
US11010431B2 (en) * 2016-12-30 2021-05-18 Samsung Electronics Co., Ltd. Method and apparatus for supporting machine learning algorithms and data pattern matching in ethernet SSD
CN108536473B (zh) * 2017-03-03 2021-02-23 华为技术有限公司 Data reading method and apparatus
CN109219805B (zh) * 2017-05-08 2023-11-10 华为技术有限公司 Memory access method for a multi-core system, related apparatus, system, and storage medium
CN108805267B (zh) * 2018-05-28 2021-09-10 重庆大学 Data processing method for convolutional neural network hardware acceleration
CN110209472B (zh) * 2018-08-29 2023-04-07 腾讯科技(深圳)有限公司 Task data processing method and board card
CN109934339B (zh) * 2019-03-06 2023-05-16 东南大学 General convolutional neural network accelerator based on a one-dimensional systolic array
CN110222005A (zh) * 2019-07-15 2019-09-10 北京一流科技有限公司 Data processing system for heterogeneous architectures and method thereof
WO2021024792A1 (ja) * 2019-08-05 2021-02-11 日立オートモティブシステムズ株式会社 Vehicle control device, update program, program update system, and writing device

Also Published As

Publication number Publication date
CN111045732A (zh) 2020-04-21
CN111045732B (zh) 2023-06-09
US20220035745A1 (en) 2022-02-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20897505

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20897505

Country of ref document: EP

Kind code of ref document: A1