CN116521088A - Data processing method, device, equipment and storage medium

Data processing method, device, equipment and storage medium

Info

Publication number
CN116521088A
Authority
CN
China
Prior art keywords
data
preset
model
weight
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310614809.2A
Other languages
Chinese (zh)
Inventor
李强
田超
纪纲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310614809.2A
Publication of CN116521088A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601: Interfaces specially adapted for storage systems
    • G06F 3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0646: Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F 3/0647: Migration mechanisms
    • G06F 3/0655: Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F 3/0656: Data buffering arrangements
    • G06F 3/0668: Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671: In-line storage system
    • G06F 3/0673: Single storage device
    • G06F 3/0679: Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides a data processing method, apparatus, device, and storage medium, and relates to the technical field of artificial intelligence, in particular to the technical fields of neural networks, speech recognition, image processing, text processing, and chips. The specific implementation scheme is as follows: model input data of a target model are obtained, where the target model comprises at least two sub-models and the model structure data of each sub-model are stored in a first memory of a preset chip; the weight data of the corresponding sub-model are moved in batches from a second memory of the preset chip to the first memory by a secondary core of the preset chip; and the input data of the preset chip are processed by the main core of the preset chip using the weight data and the corresponding model structure data in the first memory, where the second memory has a larger capacity but a slower read/write speed than the first memory. With this technical scheme, the model achieves a better operation effect on the terminal and the timeliness of data processing is improved.

Description

Data processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the fields of neural networks, speech recognition, image processing, text processing, and chip technology.
Background
Currently, neural network models are widely used in fields such as speech recognition and image processing. There are mainly two application modes for neural network models. In the first, the model is deployed on a server: the terminal uploads the collected input data to the server, and the server processes the input data with the model and returns the processing result to the terminal. In the second, the model is deployed locally on the terminal, and the terminal processes the data with the model directly.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a data processing method including:
obtaining model input data of a target model, wherein the target model comprises at least two sub-models, model structure data of the sub-models are stored in a first memory of a preset chip in a preset chip set, and the first memory comprises a dynamic random access memory;
moving, through a secondary core in the preset chip, weight data of the corresponding sub-model in batches from a second memory in the preset chip to the first memory, wherein the second memory has a larger capacity and a slower read/write speed than the first memory;
and processing, through the main core in the preset chip and using the weight data and the corresponding model structure data in the first memory, the input data of the preset chip, wherein the input data of the first preset chip in the preset chip set comprise the model input data, and the input data of a non-first preset chip comprise the output data of the preset chip having an association relationship with the non-first preset chip.
According to another aspect of the present disclosure, there is provided a data processing apparatus including:
the model input data acquisition module is used for acquiring model input data of a target model, wherein the target model comprises at least two sub-models, model structure data of the sub-models are stored in a first memory of a preset chip in a preset chip set, and the first memory comprises a dynamic random access memory;
a data moving module, configured to move, through a secondary core in the preset chip, the weight data of the corresponding sub-model in batches from a second memory in the preset chip to the first memory, wherein the second memory has a larger capacity and a slower read/write speed than the first memory;
a data processing module, configured to process, through the main core in the preset chip and using the weight data and the corresponding model structure data in the first memory, the input data of the preset chip, wherein the input data of the first preset chip in the preset chip set comprise the model input data, and the input data of a non-first preset chip comprise the output data of the preset chip having an association relationship with the non-first preset chip.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least two multi-core processor chips; and
a memory communicatively coupled to the at least two multi-core processor chips; wherein
the memory stores instructions executable by the at least two multi-core processor chips to enable the at least two multi-core processor chips to perform the methods described in the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any embodiment of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are provided for a better understanding of the present solution and do not constitute a limitation of the present disclosure. In the drawings:
FIG. 1 is a flow chart of a data processing method provided in accordance with an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a chip operation process provided in accordance with an embodiment of the present disclosure;
FIG. 3 is a flow chart of yet another data processing method provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a flow chart of another data processing method provided in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic illustration of a single layer process provided in accordance with an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a data processing apparatus provided according to an embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing a data processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Currently, neural network models are widely used in fields such as speech recognition and image processing. There are mainly two application modes: in the first, the model is deployed on a server, the terminal uploads the collected input data, and the server processes the data with the model and returns the result; in the second, the model is deployed locally on the terminal, which processes the data directly. Taking a speech recognition scene as an example: in the first mode, the terminal device collects speech signals and uploads them to a cloud server, which analyzes the speech content and returns it to the terminal device over the network. Since the cloud server can support a large, complex model, the recognition effect is good; however, this mode is limited by factors such as network quality, and real-time performance is hard to guarantee. In the second mode, the terminal device recognizes locally without interacting with a cloud server, so the response speed is high; however, limited by the hardware performance and cost of the terminal device, the prior art can hardly support complex models, only simple models can be used for recognition, and a good recognition effect is hard to guarantee. It can be seen that for schemes applying a neural network model locally on the terminal, a good recognition effect is currently difficult to achieve, and the cost is high.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present disclosure. The embodiment is applicable to cases where a neural network model is run locally on an electronic device. The method may be performed by a data processing apparatus, which may be implemented in hardware and/or software and configured in an electronic device. The electronic device may be a mobile terminal such as a mobile phone, a smart watch, a tablet computer, or a personal digital assistant, or a terminal such as a personal computer (Personal Computer, PC). Referring to Fig. 1, the method specifically includes the following:
s101, obtaining model input data of a target model, wherein the target model comprises at least two sub-models, model structure data of the sub-models are stored in a first memory of a preset chip in a preset chip set, and the first memory comprises a dynamic random access memory;
s102, moving weight data of a corresponding sub-model from a second memory in a preset chip to the first memory in batches through a secondary core in the preset chip, wherein the second memory has larger capacity and slower reading and writing speed compared with the first memory;
s103, processing input data of the preset chips by the aid of the weight data and corresponding model structure data in the first memory through the main cores in the preset chips, wherein the input data of a first preset chip in the preset chip set comprises the model input data, and the input data of a non-first preset chip comprises output data of a preset chip with an association relation with the non-first preset chip.
The target model may specifically be a neural network model; its function is not limited and may be, for example, speech recognition, text processing, or image processing. The target model is split into at least two sub-models so that the data processing of different sub-models can be carried out by different chips, avoiding the situation where a single chip can hardly carry a large or complex neural network model and the processing effect suffers. The scale of the target model in the embodiments of the present disclosure may be above a preset scale threshold, where the scale can be determined from the number of model layers and/or the amount of weight data. The specific way of splitting the target model is not limited and can be determined from the structural characteristics of the model. For example, a target model with an encoder-decoder structure can be split into a first sub-model for the encoder structure and a second sub-model for the decoder structure. As a concrete case, the main operation content of a speech recognition system comprises four parts: signal processing, feature extraction, the Encoder, and the Decoder; in the embodiments of the present disclosure the Encoder and the Decoder are collectively called the speech recognition model, and the target model can be this speech recognition model.
In the embodiments of the present disclosure, a preset chip set is configured in the electronic device. The preset chip set includes a plurality of preset chips, which can be connected in a cascade or similar manner, and each sub-model can correspond to one or more preset chips. Each preset chip is a multi-core chip and may be a multi-core low-cost chip. To control cost, a multi-core low-cost chip generally lacks a large-capacity high-speed storage device; for example, the dynamic random access memory (Dynamic Random Access Memory, DRAM) in such a chip has a small capacity and can hardly satisfy the storage requirement of a large-scale neural network model. In the embodiments of the present disclosure, the preset chip includes a first memory and a second memory. The first memory includes a DRAM, which has a high read/write speed and is refreshed periodically. The second memory has a larger capacity and a slower read/write speed than the first memory, i.e., it is a high-capacity low-speed storage device, which effectively controls chip cost; for example, the second memory may be a pseudo-static random access memory (Pseudo Static Random Access Memory, PSRAM). The number of first memories is generally 1; the number of second memories may be 1 or more. For a chip carrying a neural network model, the data to be stored generally include model structure data and weight data; the amount of model structure data is small compared with the weight data, so the model size mainly depends on the amount of weight data. In the embodiments of the present disclosure, the model structure data of a sub-model are stored in the first memory of a preset chip, and the weight data are stored in the larger second memory. When a single sub-model corresponds to one preset chip, the first memory of that chip stores the sub-model's model structure data and the second memory stores its weight data; when a single sub-model corresponds to multiple preset chips, the first memory of each chip stores part of the sub-model's model structure data, and the second memory stores the part of the weight data corresponding to that part of the model structure data. The multiple cores in a preset chip are configured as a main core and secondary cores: the main core is generally a single core and performs the data processing based on the sub-model, i.e., the model operations, while the remaining cores are secondary cores used to move weight data, and may also be called mover cores. For example, a preset chip with 2 cores has 1 main core and 1 secondary core; a preset chip with 4 cores has 1 main core and 3 secondary cores.
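For concreteness, the division of roles just described could be sketched in C as follows; all field names, sizes, and the one-main/three-secondary split are assumptions taken from the examples above, not the patent's actual firmware:
```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of one preset chip as described above (every name
 * here is an assumption made for this example). */
typedef struct {
    uint8_t *first_mem;          /* first memory: DRAM, fast but small       */
    size_t   first_mem_size;
    uint8_t *second_mem[2];      /* second memories: e.g. PSRAM, large, slow */
    size_t   second_mem_size[2];
    int      main_core_id;       /* one main core runs the sub-model         */
    int      mover_core_ids[3];  /* remaining secondary cores move weights   */
} preset_chip_t;
```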
For example, since the amount of weight data is generally large, the number of preset chips corresponding to one sub-model may be determined from the amount of its weight data and the storage capacity of a single preset chip's second memories. If the weight data amount of a sub-model is A and the total storage capacity of all second memories of a single preset chip is B, the number of corresponding preset chips can be determined by rounding A/B up, i.e., the number of preset chips is (A + (B - 1)) / B.
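A minimal sketch of this rounding rule, assuming A and B are integer byte counts:
```c
#include <stddef.h>

/* Number of preset chips needed for a sub-model with weight_bytes (A) of
 * weights when the second memories of one chip hold cap_bytes (B) in total:
 * ceiling division computed as (A + (B - 1)) / B, per the example above. */
size_t chips_needed(size_t weight_bytes, size_t cap_bytes) {
    return (weight_bytes + cap_bytes - 1) / cap_bytes;
}
```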
For example, because the capacity of the first memory is limited, the secondary cores may move the weight data from the second memory to the first memory in batches, so that the main core can use the batch of weight data that has already been moved for data processing. The DRAM is refreshed periodically, and weight data already used by the main core can be overwritten by newly moved weight data, so that weight data moved in multiple batches can reuse the limited storage space of the first memory.
Fig. 2 is a schematic diagram of a chip operation process according to an embodiment of the present disclosure, taking the operation process of a single preset chip as an example. Suppose the preset chip includes a secondary core a and a secondary core b as well as a second memory A and a second memory B; the main core of the preset chip holds all or part of the model structure data of a certain sub-model, and the weight data corresponding to that model structure data are stored in the second memory A and the second memory B. The weight data to be moved in the current batch may be stored in the second memory A and/or the second memory B; the secondary core a and the secondary core b may move them in parallel from the second memory A and/or the second memory B to the first memory. After the movement completes, the main core can perform data processing using the moved weight data in the first memory and the corresponding model structure data, and the processed data can be cached in the first memory. It should be noted that Fig. 2 is only schematic; the number of secondary cores, the number of second memories, and so on are not limited.
For example, the preset chips in the preset chip set are connected in cascade. For the first preset chip, the input data may be the model input data of the target model; for example, the model input data may be voice data to be recognized obtained through signal processing and feature extraction. For a non-first preset chip, i.e., a preset chip in the set other than the first one, its input data include the output data of the preset chip having an association relationship with it, where the association relationship can be determined from the model structure of the target model. The model input data may be raw data collected by the electronic device, or data obtained by preprocessing the raw data; the preprocessing depends on the specific model function. Taking a speech recognition scene as an example, the speech recognition system comprises a preprocessing part and a speech recognition model; the target model is the speech recognition model, and the preprocessing may include speech signal processing, speech feature extraction, and so on. The data processing involved in preprocessing can be completed by a preprocessing chip, which may be a single-core or multi-core chip and is not specifically limited. Optionally, the raw data are preprocessed by a main core and secondary cores in the preprocessing chip to obtain the model input data of the target model. The advantage of this arrangement is that both preprocessing and model processing use multi-core chips, which reduces the complexity of the chip-related design in the electronic device.
According to the technical scheme provided by the embodiments of the present disclosure, the target model is split, and several low-cost multi-core chips in the electronic device perform the data processing of the respective sub-models. During data processing, the secondary cores in a chip move the weight data in batches from the low-speed large-capacity memory to the high-speed small-capacity memory and supply them to the main core responsible for data processing. In this way the model achieves a better operation effect on the terminal and the timeliness of data processing is improved; moreover, no large-capacity DRAM needs to be integrated in the terminal, which lowers the manufacturing difficulty of the DRAM and of the terminal while keeping the terminal's manufacturing cost in check. For speech recognition scenes, the scheme supports running a complex speech recognition model on the terminal, ensures the timeliness and effect of speech recognition, and reduces the manufacturing difficulty of terminals that need to support speech recognition.
Fig. 3 is a flowchart of yet another data processing method according to an embodiment of the present disclosure. This embodiment provides an alternative scheme based on the foregoing embodiments and further describes the movement of weight data and the working process of the main core. Referring to Fig. 3, the method includes:
S301, obtaining model input data of a target model.
S302, determining, through the main core in a preset chip, the weights to be moved for the current batch of the corresponding sub-model, and sending a first move instruction for the current batch's weights to the corresponding secondary core.
The weight data are moved in multiple batches, and the amount of weight data to move per batch can be preset. The main core determines the weights to be moved for the current batch and then sends the first move instruction to the secondary core; in other words, the main core initiates the weight-moving task of each batch. The move instruction may include identification information of the weights to be moved, such as a weight parameter name or a weight number, so that on receiving the instruction the secondary core knows exactly which weight data to move. Optionally, when there are multiple secondary cores, corresponding first move instructions may be sent to each of them, with different instructions carrying different identification information. Alternatively, the first move instruction may be sent to one secondary core, which coordinates with the other secondary cores to complete the movement of the weight data for the current batch.
S303, moving, through a secondary core in the preset chip and according to the first move instruction, the weight data corresponding to the current batch's weights to be moved from a second memory in the preset chip to the first memory.
For example, the secondary core may store in advance a mapping between weight identification information and the second memories, indicating which weight parameters each second memory stores. After receiving the first move instruction, the secondary core can query this mapping with the identification information in the instruction to determine which second memory the weight data should be read from, and then write the read weight data into the first memory.
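A hedged sketch of that lookup-and-copy step; the table layout, function name, and parameters are illustrative assumptions, not structures from the patent:
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed mapping entry: which second memory holds a given weight, and where. */
typedef struct {
    int    weight_id;   /* identifier carried in the move instruction */
    int    second_mem;  /* index of the second memory that stores it  */
    size_t offset;      /* location inside that second memory         */
    size_t size;        /* number of bytes of weight data             */
} weight_map_entry_t;

/* Secondary core: for each weight named in the move instruction, look up the
 * mapping, read the data from the indicated second memory, and write it into
 * the first memory at the next free position. */
size_t handle_move_instruction(const weight_map_entry_t *map, size_t map_len,
                               uint8_t *const second_mem[], uint8_t *first_mem,
                               size_t dst, const int *weight_ids, size_t n_ids) {
    for (size_t i = 0; i < n_ids; i++)
        for (size_t j = 0; j < map_len; j++)
            if (map[j].weight_id == weight_ids[i]) {
                memcpy(first_mem + dst,
                       second_mem[map[j].second_mem] + map[j].offset,
                       map[j].size);
                dst += map[j].size;
                break;
            }
    return dst;  /* next free offset in the first memory */
}
```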
S304, after determining, through the main core in the preset chip, that the weight data corresponding to the current batch's weights to be moved have been moved, processing the current data to be processed using the current batch's weight data and the corresponding model structure data in the first memory.
The current data to be processed comprises input data of the preset chip or intermediate data processed by the preset chip.
For example, after determining that the current batch's weight data have been moved, the main core may perform the corresponding operation, such as a dot-product operation, using the current batch's weight data in the first memory. The model structure of a sub-model may comprise one or more neuron layers; the specific flow of the data to be processed within the preset chip is not limited and can be determined from the connection relationships of the layers in the model structure. If the current batch's weight data correspond to the first layer of the model structure in the preset chip, the current data to be processed may include the input data of the preset chip, and may also include data fed back to the first layer after processing by a non-first layer (intermediate data processed by the preset chip). If the current batch's weight data correspond to a non-first layer, the current data to be processed may include data passed to that layer after processing by other layers connected to it (intermediate data processed by the preset chip), and may also include the input data of the preset chip. The input data and intermediate data of the preset chip can be cached in the first memory.
According to the technical scheme provided by the embodiments of the present disclosure, the main core initiates each batch's weight-moving task and checks whether it has completed, and starts the corresponding data processing once it has. The main core can thus control the weight-moving tasks and the data processing tasks more rationally, so that the two kinds of tasks cooperate better, the storage space of the first memory is used reasonably, and the accuracy and efficiency of data processing are ensured.
In an alternative embodiment, before the main core in the preset chip processes the current data to be processed using the current batch's weight data and the corresponding model structure data in the first memory, the method further includes: determining, through the main core, the weights to be moved for the next batch of the corresponding sub-model, and sending a second move instruction for the next batch's weights to the corresponding secondary core. The advantage of this arrangement is that the main core's data processing and the secondary core's data moving can proceed in parallel, reducing the time the main core waits for the next batch of weight data and further ensuring data processing efficiency.
That is, after determining that the weight data corresponding to the current batch's weights to be moved have been moved, the main core first sends the second move instruction and then processes the current data to be processed using the current batch's weight data and the corresponding model structure data in the first memory.
In an alternative embodiment, the first memory includes a first weight buffer and a second weight buffer, where at any time one of them stores weight data whose movement has completed and the other stores weight data currently being moved. The advantage of this arrangement is that the weight data being used by the main core are never overwritten by the writing of the next batch's weight data, ensuring the accuracy of data processing.
Illustratively, suppose the movement of the weight data is divided into 3 batches. After the 1st batch of weight data is moved into the first weight buffer, the main core sends the move instruction for the 2nd batch; while the main core processes data using the weight data in the first weight buffer, the secondary core moves the 2nd batch of weight data into the second weight buffer. After the processing for the 1st batch is completed, the main core checks whether the 2nd batch has finished moving; if so, it sends the move instruction for the 3rd batch, and the secondary core moves the 3rd batch of weight data into the first weight buffer, overwriting the already-used 1st batch.
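A minimal sketch of this ping-pong scheme from the main core's point of view; the start_move/move_done/compute primitives are assumptions standing in for whatever signaling the cores actually share:
```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed primitives; none of these names come from the patent. */
extern void start_move(int batch, uint8_t *dst); /* tell secondary cores to move a batch */
extern bool move_done(int batch);                /* has that batch finished moving?      */
extern void compute(const uint8_t *weights);     /* main-core operation on one batch     */

/* Main core: alternate the two weight buffers so the batch currently being
 * computed on is never overwritten by the batch currently being moved in. */
void run_layer(uint8_t *buf_a, uint8_t *buf_b, int n_batches) {
    uint8_t *bufs[2] = { buf_a, buf_b };
    start_move(0, bufs[0]);
    for (int b = 0; b < n_batches; b++) {
        while (!move_done(b)) { /* busy-wait until batch b is in place */ }
        if (b + 1 < n_batches)
            start_move(b + 1, bufs[(b + 1) % 2]); /* prefetch into the other buffer */
        compute(bufs[b % 2]);                     /* consume the finished batch     */
    }
}
```
With 3 batches this reproduces the sequence above: batch 1 lands in the first buffer, batch 2 moves into the second buffer while batch 1 is computed on, and batch 3 then overwrites the used first buffer.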
In an alternative embodiment, the preset chip set includes first-stage preset chips and second-stage preset chips; the first-stage preset chips correspond to the first sub-model, i.e., the encoder structure, and the second-stage preset chips correspond to the second sub-model, i.e., the decoder structure. In this way, the target model can be reasonably split according to the structural characteristics of the encoder-decoder, and the preset chips in the set are organized into stages accordingly.
In an alternative embodiment, the number of first-stage preset chips is at least two, and/or the number of second-stage preset chips is at least two. When there are at least two first-stage preset chips, there is unidirectional data interaction between different first-stage preset chips; when there are at least two second-stage preset chips, there is bidirectional data interaction between at least two second-stage preset chips. The data interaction between preset chips of the same stage is thus set up reasonably according to the model structure they carry, which helps control chip cost. For the encoder, the layers are usually in a serial relationship, so different first-stage preset chips are also serial, i.e., the data interaction is unidirectional. For the decoder, there may be data interaction between different layers, and hence between different second-stage preset chips.
In an alternative embodiment, the weight data of the same layer in the sub-model is divided into at least two batches for movement. Therefore, the requirement on the capacity of the DRAM can be further reduced, and the manufacturing difficulty of the DRAM is reduced.
In an alternative embodiment, the target model comprises a speech recognition model. A high-quality, low-latency speech recognition scheme can thus be realized locally on the electronic device, reducing its manufacturing difficulty while effectively controlling cost.
In an alternative embodiment, the input data of the preset chip correspond to a preset number of frames of the voice data to be recognized. Processing the voice data in batches of a certain number of frames, i.e., in an accumulation mode rather than a single-frame mode, effectively reduces the number of weight-data moves. The voice data to be recognized are usually streaming data. For example, with single-frame processing, a single layer needs all of its weight data to process one frame; if those weights must be moved in 4 batches, then after the current frame is processed the layer's weights must be moved again, in another 4 batches, for the next frame, so processing 3 frames requires 12 moves in total. With multi-frame processing, e.g., 3 frames at a time, each batch of moved weights is applied to all 3 frames in turn before the next batch is moved in, so only 4 moves are needed in total, as sketched after this paragraph.
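In code form, the move-count arithmetic of this example (the batch and frame counts are the ones assumed above):
```c
/* Move counts for one layer: per-frame processing repeats every weight batch
 * for every frame, while accumulated multi-frame processing moves each batch
 * only once per pass. */
int moves_single_frame(int weight_batches, int frames) {
    return weight_batches * frames;   /* 4 batches x 3 frames = 12 moves */
}
int moves_accumulated(int weight_batches) {
    return weight_batches;            /* 4 batches -> 4 moves            */
}
```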
In an alternative embodiment, the preset number of frames is an integer multiple of a preset parameter value of the target model, the preset parameter value including the batch size. This improves calculation efficiency. The batch size is the maximum amount of data input into the model for a single operation; for example, if the batch size is 3 frames but only 2 frames are accumulated (i.e., the preset number of frames is 2), the slot for a 3rd frame cannot be effectively utilized.
In an alternative implementation, the preset number of frames is determined with the target that the difference between the time to move a single layer's weight data and the main core's single-layer operation time in the preset chip is smaller than a first preset value. The preset number of frames can thus be determined reasonably, balancing the number of weight-data moves against the timeliness of speech recognition.
For example, assume that the amount of weight data required for one layer's operation is M, the number of secondary cores is N, and the moving speed of a single secondary core is V0. The time required to move the layer's weight data to the target area (the first memory) is denoted T0 and can be expressed as T0 = M / (N × V0). In the layer operation, the computation of the main core is normally positively correlated with M and also with the preset number of frames; by choosing a reasonable preset number of frames, the main core's single-layer operation time T1 can be made approximately equal to T0, i.e., T0 − T1 is smaller than the first preset value.
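As a sketch of how such a frame count might be chosen, assuming (beyond the text) that the main core's compute time is linear in both the weight amount and the frame count:
```c
/* Move time for one layer's weights, per the text above: T0 = M / (N * V0). */
double move_time_s(double m_bytes, int n_mover_cores, double v0_bytes_per_s) {
    return m_bytes / (n_mover_cores * v0_bytes_per_s);
}

/* Assumed linear compute model: T1 grows with both the weight amount M and
 * the number of accumulated frames. Pick the largest frame count (kept an
 * integer multiple of the batch size, per the earlier embodiment) whose T1
 * does not exceed T0, so that T0 - T1 stays small. */
int pick_preset_frames(double m_bytes, int n, double v0,
                       double sec_per_byte_per_frame,
                       int batch_size, int max_frames) {
    double t0 = move_time_s(m_bytes, n, v0);
    int best = batch_size;  /* never go below one batch of frames */
    for (int f = batch_size; f <= max_frames; f += batch_size)
        if (m_bytes * sec_per_byte_per_frame * f <= t0)
            best = f;
    return best;
}
```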
In an alternative embodiment, the amount of weight data moved per batch is determined with the target that the difference between the single-batch move time and the main core's single-batch operation time in the preset chip is smaller than a second preset value. This reasonably balances the secondary cores' cost of moving weights against the main core's overhead of initiating the move tasks.
For example, the smaller the amount of weight data per batch, the shorter the main core waits for a single batch to finish moving, i.e., the better the move time is covered by the operation time; but the number of moves grows correspondingly, and since the main core must initiate each move task and check whether it has completed, a smaller per-batch amount increases the main core's overhead. Making the single-batch move time approximately equal to the single-batch operation time reduces both the main core's wait for weights after finishing a batch of operations and the secondary cores' wait for the main core to send the next move instruction.
Fig. 4 is a flowchart of another data processing method according to an embodiment of the present disclosure. This embodiment provides an alternative scheme based on the foregoing embodiments and is further described taking a speech recognition model as an example. Referring to Fig. 4, the method includes:
S401, preprocessing an original voice signal through the main core and secondary cores in a preprocessing chip to obtain the voice data to be recognized for the voice recognition model, and sending the voice data to be recognized to a first-stage preset chip.
Illustratively, the main operation content of the speech recognition system comprises four parts: signal processing, feature extraction, the Encoder, and the Decoder; in the embodiments of the present disclosure the Encoder and the Decoder are collectively called the speech recognition model. The model size mainly depends on the weight data, while the signal processing and feature extraction operations involve no weights and can be regarded as the preprocessing part. The Encoder mainly comprises operations such as a fully connected layer (Full), a batch normalization layer (Batch Norm), a Long Short-Term Memory (LSTM) layer, and an aligned connectionist temporal classification layer (Align Connectionist Temporal Classification, Align CTC). The Decoder mainly includes an embedding layer (Embedding), several LSTM layers and a layer normalization layer (LayerNorm), decoding operations, and so on.
Illustratively, the speech recognition system is split into a three-stage implementation. The first stage is the preprocessing part, mainly realizing the signal processing and feature extraction operations; it is implemented with a multi-core preprocessing chip, which can be denoted the preprocessing-stage chip. The second stage is the Encoder part, corresponding to the first sub-model and mainly realizing the Encoder operations; it is implemented with a multi-core preset chip, which can be denoted the first-stage preset chip. The third stage is the Decoder part, mainly realizing the Decoder operations; it is implemented with a multi-core preset chip, which can be denoted the second-stage preset chip. The hardware of the preprocessing chip and the preset chips can be identical. The number of preprocessing chips may be one or more, and is typically one.
In the preprocessing chip, multiple cores participate in the data operations, i.e., several cores together perform the signal processing and feature extraction. Specifically, a single-frame processing mode can be adopted: as soon as the electronic device collects or receives a frame of the original voice signal (raw data), that frame is processed, and the processing result is used as model input data of the target model, i.e., it is transmitted to the first-stage preset chip. The transmission mode is not limited; for example, a serial peripheral interface (Serial Peripheral Interface, SPI) based mode can be adopted.
S402, moving, through the secondary cores in the first-stage preset chip, the weight data of the corresponding first sub-model in batches from the second memory in the first-stage preset chip to the first memory; processing, through the main core in the first-stage preset chip and using the weight data and the corresponding model structure data in the first memory, the voice data to be recognized to obtain intermediate processing data; and sending the intermediate processing data to the second-stage preset chip.
The first-stage preset chip adopts a multi-frame processing mode: the Encoder operation is carried out after the voice data to be recognized (model input data) sent by the preprocessing chip reach the preset number of frames. The preset number of frames is an integer multiple of the batch size of the speech recognition model, such as 1 or 2 times; if the batch size is 3 frames and the multiple is 2, the preset number of frames is 6. Optionally, before the preset number of frames is reached, the weights to be moved for the current batch of the corresponding first sub-model may already be determined and a move instruction for them sent to the corresponding secondary core. The communication mode between the main core and the secondary cores is not limited; for example, an inter-process communication (Inter-Process Communication, IPC) mode may be used. The secondary core moves the corresponding weight data from the second memory in the first-stage preset chip to the first memory according to the move instruction; after the main core in the first-stage preset chip confirms that the current batch's weight data have been moved, it sends the move instruction for the next batch to the secondary core and processes the current data to be processed using the current batch's weight data and the corresponding model structure data in the first memory.
Fig. 5 is a schematic diagram of a single-layer operation process according to an embodiment of the present disclosure. As shown in Fig. 5, taking one layer's dot-product operation as an example: the dot-product operation of the current layer starts and is split across multiple passes, each pass corresponding to one batch of weight-data movement, with the amount moved per batch determined in advance. For example, the layer's total weight data are 20×1024, divided into 4 moves of 5×1024 each. The main core sends the current batch's move instruction to the secondary cores, and all secondary cores in the preset chip start moving weight data. During the move, the main core checks whether the current batch's weight data have finished moving; if not, it waits and checks again. Once the move completes, the main core sends the next batch's move instruction to the secondary cores and performs the dot product for the current batch, e.g., the dot product of 6 frames of voice data with the 5×1024 weights. After that computation, it checks whether the layer's dot-product operation is complete, i.e., whether the dot products for all divided batches have been computed; if not, the next batch becomes the new current batch and the main core confirms whether its weight data have finished moving, until the layer's dot-product task is finally completed.
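Using the figure's numbers (weights of 20×1024 moved as four batches of 5×1024, applied to 6 buffered frames), the per-batch dot product on the main core might look like the following sketch; the 16-bit fixed-point types and row-major layout are assumptions:
```c
#include <stdint.h>

#define FRAMES        6     /* accumulated frames processed per pass          */
#define FEAT          1024  /* feature width, assumed from the 5x1024 example */
#define ROWS_PER_MOVE 5     /* weight rows moved per batch (5 x 1024)         */

/* Main core: dot product of the current weight batch against all buffered
 * frames; out[f][r] holds one output value per frame and weight row. The
 * caller invokes this once per batch (4 times for the 20x1024 layer). */
void dot_product_batch(const int16_t weights[ROWS_PER_MOVE][FEAT],
                       const int16_t frames[FRAMES][FEAT],
                       int32_t out[FRAMES][ROWS_PER_MOVE]) {
    for (int f = 0; f < FRAMES; f++)
        for (int r = 0; r < ROWS_PER_MOVE; r++) {
            int32_t acc = 0;
            for (int k = 0; k < FEAT; k++)
                acc += (int32_t)frames[f][k] * weights[r][k];
            out[f][r] = acc;
        }
}
```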
S403, moving, through the secondary cores in the second-stage preset chip, the weight data of the corresponding second sub-model in batches from the second memory in the second-stage preset chip to the first memory, and processing, through the main core in the second-stage preset chip and using the weight data and the corresponding model structure data in the first memory, the intermediate processing data to obtain a speech recognition result.
The second-stage preset chip also adopts a multi-frame processing mode and performs the Decoder operation after the intermediate processing data sent by the first-stage preset chip reach the preset number of frames. For the single-layer operation process in the second-stage preset chip, refer to Fig. 5, which is not repeated here. When there are multiple second-stage preset chips, bidirectional data interaction exists between at least two of them owing to the characteristics of the Decoder structure.
According to the technical scheme provided by the embodiments of the present disclosure, in a speech recognition application scene the speech recognition system deployed locally on the terminal device is split into a three-stage implementation: a preprocessing-stage chip, first-stage preset chips for the Encoder operations, and second-stage preset chips for the Decoder operations, each being a low-cost multi-core chip. The preprocessing-stage chip adopts single-frame processing while the preset chips adopt multi-frame processing. During processing, the main core of a preset chip initiates each batch's weight-moving task, the secondary cores move the weight data in batches from the low-speed large-capacity memory to the high-speed small-capacity memory, and the main core checks whether each moving task has completed; once it has, the main core initiates the next batch's weight-moving task and starts the data processing of the current batch. The main core can thus control the weight-moving and data processing tasks rationally, so that, with cost effectively controlled, a high-quality speech recognition system is deployed on the terminal device and the effect and timeliness of speech recognition are ensured.
Fig. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. This embodiment is applicable to cases where a neural network model is run locally on an electronic device. The apparatus can be implemented in hardware and/or software and configured in an electronic device. Referring to Fig. 6, the data processing apparatus 600 includes:
the model input data obtaining module 601 is configured to obtain model input data of a target model, where the target model includes at least two sub-models, and model structure data of the sub-models is stored in a first memory of a preset chip in a preset chip set, and the first memory includes a dynamic random access memory;
a data moving module 602, configured to move, through a secondary core in the preset chip, weight data of the corresponding sub-model in batches from a second memory in the preset chip to the first memory, where the second memory has a larger capacity and a slower read/write speed than the first memory;
a data processing module 603, configured to process, through a main core in the preset chip and using the weight data and the corresponding model structure data in the first memory, the input data of the preset chip, where the input data of the first preset chip in the preset chip set include the model input data, and the input data of a non-first preset chip include the output data of the preset chip having an association relationship with the non-first preset chip.
According to the technical scheme provided by the embodiments of the present disclosure, the target model is split, and several low-cost multi-core chips in the electronic device perform the data processing of the respective sub-models; during data processing, the secondary cores in each chip move the weight data in batches from the low-speed large-capacity memory to the high-speed small-capacity memory and supply them to the main core responsible for data processing, so that cost, model operation effect, and timeliness are all well balanced.
In an alternative embodiment, the data moving module includes:
a first instruction sending unit, configured to determine, through the main core in the preset chip, the weights to be moved for the current batch of the corresponding sub-model, and send a first move instruction for the current batch's weights to the corresponding secondary core;
wherein the data processing module includes:
a weight moving unit, configured to move, through a secondary core in the preset chip and according to the first move instruction, the weight data corresponding to the current batch's weights to be moved from a second memory in the preset chip to the first memory;
a data processing unit, configured to, after the main core in the preset chip determines that the weight data corresponding to the current batch's weights to be moved have been moved, process the current data to be processed using the current batch's weight data and the corresponding model structure data in the first memory, where the current data to be processed include the input data of the preset chip or intermediate data processed by the preset chip.
In an alternative embodiment, the apparatus further comprises:
a second instruction sending unit, configured to determine, through the main core in the preset chip, the weights to be moved for the next batch of the corresponding sub-model, and send a second move instruction for the next batch's weights to the corresponding secondary core, before the main core processes the current data to be processed using the current batch's weight data and the corresponding model structure data in the first memory.
In an alternative embodiment, the first memory includes a first weight buffer and a second weight buffer, where at any time one of them stores weight data whose movement has completed and the other stores weight data currently being moved.
In an alternative embodiment, the model input data acquisition module is specifically configured to:
and preprocessing the original data through a main core and a secondary core in the preprocessing chip to obtain model input data of the target model.
In an alternative embodiment, the preset chip set includes a first-level preset chip and a second-level preset chip; the first-stage preset chip corresponds to a first sub-model of the encoder structure, and the second-stage preset chip corresponds to a second sub-model of the decoder structure.
In an alternative embodiment, the number of the first-stage preset chips is at least two, and/or the number of the second-stage preset chips is at least two;
under the condition that the number of the first-stage preset chips is at least two, unidirectional data interaction exists among different first-stage preset chips; and under the condition that the number of the second-level preset chips is at least two, two-way data interaction exists between at least two second-level preset chips.
In an alternative embodiment, the weight data of the same layer in the sub-model is divided into at least two batches for movement.
In an alternative embodiment, the target model comprises a speech recognition model.
In an alternative embodiment, the input data of the preset chip corresponds to a preset number of frames of voice data to be recognized.
In an alternative embodiment, the preset number of frames is an integer multiple of a preset parameter value of the target model, the preset parameter value including a batch size.
In an alternative implementation, the preset number of frames is determined with the target that the difference between the time to move a single layer's weight data and the main core's single-layer operation time in the preset chip is smaller than a first preset value.
In an alternative implementation, the amount of weight data moved per batch is determined with the target that the difference between the single-batch move time and the main core's single-batch operation time in the preset chip is smaller than a second preset value.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of any personal user information involved all comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 7, the device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, such as the data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system or a server combined with a blockchain.
Artificial intelligence is the discipline that studies how to make a computer mimic certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, and planning), and it involves techniques at both the hardware level and the software level. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
Cloud computing refers to a technical system in which an elastically extensible pool of shared physical or virtual resources is accessed through a network, where the resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed in an on-demand, self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for technical applications such as artificial intelligence and blockchain, as well as for model training.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions provided by the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (28)

1. A data processing method, comprising:
obtaining model input data of a target model, wherein the target model comprises at least two sub-models, model structure data of the sub-models are stored in a first memory of a preset chip in a preset chip set, and the first memory comprises a dynamic random access memory;
moving, through a secondary core in the preset chip, weight data of a corresponding sub-model from a second memory in the preset chip to the first memory in batches, wherein, compared with the first memory, the second memory has a larger capacity and a slower read/write speed; and
processing, through a main core in the preset chip, input data of the preset chip by using the weight data and the corresponding model structure data in the first memory, wherein input data of a first preset chip in the preset chip set comprises the model input data, and input data of a non-first preset chip comprises output data of a preset chip having an association relationship with the non-first preset chip.
2. The method of claim 1, wherein the moving, through the secondary core in the preset chip, the weight data of the corresponding sub-model from the second memory in the preset chip to the first memory in batches comprises:
determining, through the main core in the preset chip, a weight to be moved of a current batch of the corresponding sub-model, and sending a first moving instruction for the weight to be moved of the current batch to the corresponding secondary core; and
moving, through the secondary core in the preset chip according to the first moving instruction, weight data corresponding to the weight to be moved of the current batch from the second memory in the preset chip to the first memory;
wherein the processing, through the main core in the preset chip, the input data of the preset chip by using the weight data and the corresponding model structure data in the first memory comprises:
after determining, through the main core in the preset chip, that the movement of the weight data corresponding to the weight to be moved of the current batch is completed, processing current data to be processed by using the weight data of the current batch and the corresponding model structure data in the first memory, wherein the current data to be processed comprises the input data of the preset chip or intermediate data processed by the preset chip.
3. The method of claim 2, further comprising, before the processing, through the main core in the preset chip, the current data to be processed by using the weight data of the current batch and the corresponding model structure data in the first memory:
determining, through the main core in the preset chip, a weight to be moved of a next batch of the corresponding sub-model, and sending a second moving instruction for the weight to be moved of the next batch to the corresponding secondary core.
4. A method according to claim 3, wherein the first memory includes a first weight buffer and a second weight buffer, one of the first weight buffer and the second weight buffer is used for storing the moved weight data, and the other is used for storing the weight data being moved.
5. The method of claim 1, wherein the obtaining model input data for the target model comprises:
preprocessing the original data through a main core and a secondary core in a preprocessing chip to obtain the model input data of the target model.
6. The method of claim 1, wherein the preset chip set comprises a first-stage preset chip and a second-stage preset chip; the first-stage preset chip corresponds to a first sub-model having an encoder structure, and the second-stage preset chip corresponds to a second sub-model having a decoder structure.
7. The method of claim 6, wherein the number of first-stage preset chips is at least two, and/or the number of second-stage preset chips is at least two;
under the condition that the number of first-stage preset chips is at least two, unidirectional data interaction exists between different first-stage preset chips; and under the condition that the number of second-stage preset chips is at least two, bidirectional data interaction exists between at least two second-stage preset chips.
8. The method of claim 1, wherein the weight data of the same layer in the sub-model is divided into at least two batches for movement.
9. The method of any one of claims 1-8, wherein the target model comprises a speech recognition model.
10. The method of claim 9, wherein the input data of the preset chip corresponds to a preset number of frames of voice data to be recognized.
11. The method of claim 10, wherein the preset number of frames is an integer multiple of a preset parameter value of the target model, the preset parameter value comprising a batch size.
12. The method of claim 10, wherein the preset number of frames is determined with the goal that a difference between a moving duration of a single layer's weight data and a single-layer operation duration of a main core in the preset chip is smaller than a first preset value.
13. The method of claim 8, wherein the data amount of weight data moved in a single batch is determined with the goal that a difference between a single-batch moving duration and a single-batch operation duration of a main core in the preset chip is smaller than a second preset value.
14. A data processing apparatus comprising:
the model input data acquisition module is used for acquiring model input data of a target model, wherein the target model comprises at least two sub-models, model structure data of the sub-models are stored in a first memory of a preset chip in a preset chip set, and the first memory comprises a dynamic random access memory;
the data moving module is used for moving, through a secondary core in the preset chip, the weight data of the corresponding sub-model from a second memory in the preset chip to the first memory in batches, wherein, compared with the first memory, the second memory has a larger capacity and a slower read/write speed;
the data processing module is used for processing the input data of the preset chip through the main core in the preset chip and by utilizing the weight data and the corresponding model structure data in the first memory, wherein the input data of the first preset chip in the preset chip set comprises the model input data, and the input data of the non-first preset chip comprises the output data of the preset chip with an association relation with the non-first preset chip.
15. The apparatus of claim 14, wherein the data mover module comprises:
the first instruction sending unit is used for determining, through the main core in the preset chip, the weight to be moved of the current batch of the corresponding sub-model, and sending a first moving instruction for the weight to be moved of the current batch to the corresponding secondary core;
the weight moving unit is used for moving the weight data corresponding to the weight to be moved of the current batch from a second memory in the preset chip to the first memory according to the first moving instruction through a secondary core in the preset chip;
wherein the data processing module comprises:
the data processing unit is used for processing, after the main core in the preset chip determines that the movement of the weight data corresponding to the weight to be moved of the current batch is completed, the current data to be processed by using the weight data of the current batch and the corresponding model structure data in the first memory, wherein the current data to be processed comprises the input data of the preset chip or intermediate data processed by the preset chip.
16. The apparatus of claim 15, further comprising:
the second instruction sending unit is used for determining, through the main core in the preset chip, the weight to be moved of the next batch of the corresponding sub-model, and sending a second moving instruction for the weight to be moved of the next batch to the corresponding secondary core, before the main core in the preset chip processes the current data to be processed by using the weight data of the current batch and the corresponding model structure data in the first memory.
17. The apparatus of claim 16, wherein the first memory includes a first weight buffer and a second weight buffer, one of the first weight buffer and the second weight buffer is used for storing the moved weight data, and the other is used for storing the weight data being moved.
18. The apparatus of claim 14, wherein the model input data acquisition module is specifically configured to:
preprocessing the original data through a main core and a secondary core in a preprocessing chip to obtain the model input data of the target model.
19. The apparatus of claim 14, wherein the preset chip set comprises a first-stage preset chip and a second-stage preset chip; the first-stage preset chip corresponds to a first sub-model having an encoder structure, and the second-stage preset chip corresponds to a second sub-model having a decoder structure.
20. The apparatus of claim 19, wherein the number of first-stage preset chips is at least two, and/or the number of second-stage preset chips is at least two;
under the condition that the number of first-stage preset chips is at least two, unidirectional data interaction exists between different first-stage preset chips; and under the condition that the number of second-stage preset chips is at least two, bidirectional data interaction exists between at least two second-stage preset chips.
21. The apparatus of claim 14, wherein the weight data of the same layer in the sub-model is divided into at least two batches for movement.
22. The apparatus of any one of claims 14-21, wherein the target model comprises a speech recognition model.
23. The apparatus of claim 22, wherein the input data of the preset chip corresponds to a preset number of frames of voice data to be recognized.
24. The apparatus of claim 23, wherein the preset number of frames is an integer multiple of a preset parameter value of the target model, the preset parameter value comprising a batch size.
25. The apparatus of claim 23, wherein the preset number of frames is determined with the goal that a difference between a moving duration of a single layer's weight data and a single-layer operation duration of a main core in the preset chip is smaller than a first preset value.
26. The apparatus of claim 21, wherein the data amount of weight data moved in a single batch is determined with the goal that a difference between a single-batch moving duration and a single-batch operation duration of a main core in the preset chip is smaller than a second preset value.
27. An electronic device, comprising:
at least two multi-core processor chips; and
a memory communicatively coupled to the at least two multi-core processor chips; wherein
the memory stores instructions executable by the at least two multi-core processor chips to enable the at least two multi-core processor chips to perform the method of any one of claims 1-13.
28. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-13.
CN202310614809.2A 2023-05-26 2023-05-26 Data processing method, device, equipment and storage medium Pending CN116521088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310614809.2A CN116521088A (en) 2023-05-26 2023-05-26 Data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310614809.2A CN116521088A (en) 2023-05-26 2023-05-26 Data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116521088A true CN116521088A (en) 2023-08-01

Family

ID=87406417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310614809.2A Pending CN116521088A (en) 2023-05-26 2023-05-26 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116521088A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117291240A (en) * 2023-11-24 2023-12-26 芯来智融半导体科技(上海)有限公司 Convolutional neural network accelerator and electronic device
CN117291240B (en) * 2023-11-24 2024-03-15 芯来智融半导体科技(上海)有限公司 Convolutional neural network accelerator and electronic device

Similar Documents

Publication Publication Date Title
CN112966522B (en) Image classification method and device, electronic equipment and storage medium
US11551068B2 (en) Processing system and method for binary weight convolutional neural network
CN112561079A (en) Distributed model training apparatus, method and computer program product
US20230090590A1 (en) Speech recognition and codec method and apparatus, electronic device and storage medium
US11967150B2 (en) Parallel video processing systems
EP4287074A1 (en) Mixture-of-experts model implementation method and system, electronic device, and storage medium
US20220374678A1 (en) Method for determining pre-training model, electronic device and storage medium
CN116521088A (en) Data processing method, device, equipment and storage medium
US20220391780A1 (en) Method of federated learning, electronic device, and storage medium
CN112508768A (en) Single-operator multi-model pipeline reasoning method, system, electronic equipment and medium
CN113570033A (en) Neural network processing unit, neural network processing method and device
US20220358955A1 (en) Method for detecting voice, method for training, and electronic devices
CN113360683B (en) Method for training cross-modal retrieval model and cross-modal retrieval method and device
CN115170815A (en) Method, device and medium for processing visual task and training model
CN114840734A (en) Training method of multi-modal representation model, cross-modal retrieval method and device
CN114494814A (en) Attention-based model training method and device and electronic equipment
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
US20230317058A1 (en) Spoken language processing method and apparatus, and storage medium
CN113361621B (en) Method and device for training model
CN114647610A (en) Voice chip implementation method, voice chip and related equipment
CN114120416A (en) Model training method and device, electronic equipment and medium
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN113570034B (en) Processing device, neural network processing method and device
CN112380158B (en) Deep learning-oriented computing platform
CN116402141B (en) Model reasoning method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination