WO2024160186A1

WO2024160186A1 - Model training method and related device

Info

Publication number: WO2024160186A1
Application number: PCT/CN2024/074585
Authority: WO
Inventors: 黄文勇; 郑念祖; 陈晓
Original assignee: 华为技术有限公司
Priority date: 2023-01-30
Filing date: 2024-01-30
Publication date: 2024-08-08
Also published as: CN116312489A

Abstract

A model training method and apparatus, a storage medium, and a program product. The method comprises: acquiring training data associated with a speech task, wherein the training data comprises first text and first speech (401); inputting the first speech into a first model, so as to obtain a first speech feature (405); inputting the first text into a second model, so as to obtain a second speech feature (406); inputting the first speech feature into a first model to be trained, so as to obtain a processing result of the first speech (407); inputting the second speech feature into the first model to be trained, so as to obtain a processing result of the first text (408); and on the basis of the processing result of the first speech and the processing result of the first text, training the first model to be trained, so as to obtain a third model, wherein a model formed by means of the first model and the third model is used for completing the speech task (409).

Description

A model training method and related equipment

This application claims priority to Chinese patent application No. 202310127981.5, filed with the State Intellectual Property Office on January 30, 2023 and entitled “A model training method and related equipment”, the entire contents of which are incorporated herein by reference.

Technical Field

The embodiments of the present application relate to the field of artificial intelligence (AI) technology, and in particular to a model training method and related equipment.

Background Art

With the rapid development of AI technology, smart terminals often have end-to-end (E2E) voice models, which can be used to complete the user's voice tasks. That is, when the user inputs voice into the voice model, the voice model can process the user's voice to obtain the processing result of the user's voice, which is used as the result of the user's voice task.

When training a speech model, it is often necessary to obtain a large amount of training data, that is, a large amount of speech together with its labels (the actual processing results of that speech). Labeled speech, however, is often difficult to collect. At present, a speech synthesis system is usually used to convert labeled text into labeled speech. Since text is easier to collect, a sufficient amount of speech can be obtained through the speech synthesis system as training data.

However, due to the performance limitations of the speech synthesis system, text often cannot be converted into correct speech, and the converted speech often differs from the correct speech to some degree. A speech model trained on such incorrect speech therefore has insufficient performance and cannot accurately complete the user's speech tasks, which degrades the user experience.

Summary of the invention

The embodiments of the present application provide a model training method and related equipment. The trained speech model can have excellent performance, so it can accurately complete the user's speech task, thereby improving the user experience.

A first aspect of an embodiment of the present application provides a model training method, the method comprising:

After the speech task that the user needs to complete is determined, training data related to the speech task can be obtained. The training data usually includes multiple first speeches and multiple first texts, and the first speeches and the first texts are independent of each other in content. It is worth noting that the actual processing results (labels) of the first speeches and the actual processing results of the first texts are known.

For any first speech among them, the first speech can be input into the first model (also referred to as the trained speech precoding model) so that the first model processes the first speech to obtain the first speech feature of the first speech. The first speech feature is then input into the first model to be trained (also referred to as the task backbone model to be trained) so that the first model to be trained processes the first speech feature to obtain the processing result of the first speech. Likewise, for any first text, the first text can be input into the second model (also referred to as the trained text mapping model) so that the second model processes the first text to obtain the second speech feature of the first text. The second speech feature is then input into the first model to be trained so that the first model to be trained processes the second speech feature to obtain the processing result of the first text.

Next, the first model to be trained can be trained based on the processing result of the first speech and the processing result of the first text, so as to obtain a third model (also called a trained task backbone model, which is a trained neural network model). In this way, the model constructed by the first model and the third model can be used as the final model for completing the user's speech task.

It can be seen from the above method that: after determining the user's voice task, the training data associated with the voice task can be obtained first, and the training data includes a first text and a first voice. Then, the first voice can be processed by the first model to obtain a first voice feature, and then the first voice feature can be processed by the first model to be trained to obtain a processing result of the first voice. In addition, the first text can be processed by the second model to obtain a second voice feature, and then the second voice feature can be processed by the first model to be trained to obtain a processing result of the first text. Finally, based on the processing result of the first voice and the processing result of the first text, the first model to be trained is trained to obtain a third model. In this way, the final model composed of the first model and the third model can be used to complete the user's voice task. In the above process, the training data of the first model to be trained is speech features (i.e., the first speech features and the second speech features), and the speech features can be derived from speech (i.e., the first speech) or text (i.e., the first text). Since converting text into speech features is easier to implement than converting text into speech, the text can often be converted into correct speech features. The speech model trained based on the correct speech features (i.e., the final model constructed by the first model and the third model) can have excellent performance, so it can accurately complete the user's speech task, thereby improving the user experience.
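
The data flow summarized above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names, the two-dimensional features and the arithmetic inside each function are hypothetical stand-ins, not the models of the application. The point it shows is that speech and text both arrive at the task backbone as features of the same shape.

```python
from typing import List, Sequence

Feature = List[float]


def first_model(speech: Sequence[float]) -> Feature:
    """Toy stand-in for the trained speech precoding model (hypothetical)."""
    mean = sum(speech) / len(speech)
    peak = max(abs(sample) for sample in speech)
    return [mean, peak]


def second_model(text: str) -> Feature:
    """Toy stand-in for the trained text mapping model (hypothetical).

    It emits a feature with the same dimensionality as first_model.
    """
    codes = [ord(ch) / 128.0 for ch in text]
    return [sum(codes) / len(codes), max(codes)]


def backbone(feature: Feature, weights: Feature) -> float:
    """Toy task backbone: the component that is trained in the method."""
    return sum(w * x for w, x in zip(weights, feature))


def process_speech(speech: Sequence[float], weights: Feature) -> float:
    # first speech -> first speech feature -> processing result
    return backbone(first_model(speech), weights)


def process_text(text: str, weights: Feature) -> float:
    # first text -> second speech feature -> processing result
    return backbone(second_model(text), weights)


weights = [0.5, -0.25]  # arbitrary illustrative backbone weights
speech_result = process_speech([0.1, -0.4, 0.3], weights)
text_result = process_text("turn on the light", weights)
```

Because both paths end in the same `backbone` call, labeled text can supplement labeled speech without ever being synthesized into a waveform.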

In a possible implementation, the first model to be trained is trained based on the processing result of the first speech and the processing result of the first text, and the third model is obtained, including: based on the processing result of the first speech and the actual processing result of the first speech, a first loss is obtained, and the first loss is used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech; based on the processing result of the first text and the actual processing result of the first text, a second loss is obtained, and the second loss is used to indicate the difference between the processing result of the first text and the actual processing result of the first text; based on the first loss and the second loss, the parameters of the first model to be trained are updated until the model training conditions are met to obtain the third model. In the above implementation, since the training process of the first model to be trained is multiple rounds of iterations, and each round of iteration uses a first speech or a first text for model training, for the convenience of explanation, two adjacent rounds of iterations are introduced, one of which is called the current round of iteration, and the other is called the next round of iteration. A first speech is used in the current round of iteration, and a first text is used in the next round of iteration. After entering the current round of iteration, the first speech can be input into the first model so that the first model processes the first speech, thereby obtaining the first speech feature of the first speech. Then, the first speech feature of the first speech can be input into the first model to be trained so that the first model to be trained processes the first speech feature of the first speech, thereby obtaining the processing result of the first speech. 
After the processing result of the first speech is obtained, the first loss of the first speech can be calculated from the processing result of the first speech and the actual processing result of the first speech; the first loss indicates the difference between the two. The parameters of the first model to be trained can then be updated based on the first loss, which yields the first model to be trained with updated parameters and completes the current round of iteration. After entering the next round of iteration, the first text can be input into the second model so that the second model processes the first text, thereby obtaining the second speech feature of the first text. Next, the second speech feature of the first text can be input into the first model to be trained, so that the first model to be trained processes the second speech feature of the first text, thereby obtaining the processing result of the first text. After the processing result of the first text is obtained, the second loss of the first text can be calculated from the processing result of the first text and the actual processing result of the first text; the second loss indicates the difference between the two. The parameters of the first model to be trained, as already updated in the current round, can then be updated again based on the second loss, which completes the next round of iteration. 
Subsequent rounds of iteration can then be entered, that is, the first model to be trained with the again-updated parameters continues to be trained using the next first speech or the next first text until the model training conditions are met, thereby obtaining the third model.
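
The alternating rounds described above can be sketched as follows. This is a minimal sketch under assumptions that are not in the application: the backbone is a two-weight linear model, both losses are squared errors, the update is plain gradient descent, and the training condition is a fixed number of rounds. All names and data are hypothetical.

```python
from typing import List, Tuple

Feature = List[float]


def first_model(speech: List[float]) -> Feature:
    """Toy trained speech encoder (hypothetical); its parameters are not updated."""
    return [sum(speech) / len(speech), max(speech) - min(speech)]


def second_model(text: str) -> Feature:
    """Toy trained text mapping model (hypothetical); its parameters are not updated."""
    return [len(text) / 10.0, text.count(" ") / 5.0]


def predict(weights: Feature, feature: Feature) -> float:
    """Toy backbone (first model to be trained): a linear map."""
    return sum(w * x for w, x in zip(weights, feature))


def update(weights: Feature, feature: Feature, label: float, lr: float) -> Feature:
    """One round: squared-error loss, then a gradient step on the backbone only."""
    error = predict(weights, feature) - label
    return [w - lr * 2.0 * error * x for w, x in zip(weights, feature)]


def train_backbone(speech_data: List[Tuple[List[float], float]],
                   text_data: List[Tuple[str, float]],
                   rounds: int = 400, lr: float = 0.05) -> Feature:
    weights = [0.0, 0.0]
    for r in range(rounds):
        if r % 2 == 0:  # current round: one labeled first speech (first loss)
            speech, label = speech_data[(r // 2) % len(speech_data)]
            weights = update(weights, first_model(speech), label, lr)
        else:           # next round: one labeled first text (second loss)
            text, label = text_data[(r // 2) % len(text_data)]
            weights = update(weights, second_model(text), label, lr)
    return weights


def total_loss(weights: Feature, speech_data, text_data) -> float:
    """Sum of first losses and second losses over all training items."""
    loss = sum((predict(weights, first_model(s)) - y) ** 2 for s, y in speech_data)
    loss += sum((predict(weights, second_model(t)) - y) ** 2 for t, y in text_data)
    return loss


speech_data = [([0.0, 1.0], 1.0), ([0.5, 0.5], 0.0)]  # made-up labeled speech
text_data = [("turn on", 1.0), ("stop", 0.0)]          # made-up labeled text
trained = train_backbone(speech_data, text_data)
```

The two encoders only produce features; every gradient step touches `weights` alone, which mirrors the statement that only the first model to be trained is updated.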

In a possible implementation, the training data also includes a second text and a second voice, and the second text corresponds to the second voice. The method also includes: inputting the second voice into the first model to obtain a third voice feature; inputting the second text into the second model to be trained to obtain a fourth voice feature; based on the third voice feature and the fourth voice feature, the second model to be trained is trained to obtain a second model. In the above implementation, the training data also includes a plurality of voice-text matching pairs, a voice-text matching pair includes a second voice and a second text corresponding to the second voice, that is, the second voice and the second text corresponding to the second voice are matched in content. Among the plurality of voice-text matching pairs, for any voice-text matching pair, the second voice in the voice-text matching pair can be input into the first model so that the first model processes the second voice, thereby obtaining the third voice feature of the second voice. At the same time, the second text in the voice-text matching pair can also be input into the second model to be trained (also referred to as a text mapping model to be trained), so that the second model to be trained processes the second text, thereby obtaining the fourth voice feature of the second text. Next, the second model to be trained may be trained based on the third speech feature of the second speech and the fourth speech feature of the second text, thereby obtaining a second model.

In one possible implementation, the second model to be trained is trained based on the third voice feature and the fourth voice feature to obtain the second model, including: obtaining a third loss based on the third voice feature and the fourth voice feature, the third loss being used to indicate the difference between the third voice feature and the fourth voice feature; based on the third loss, updating the parameters of the second model to be trained until the model training conditions are met to obtain the second model. In the aforementioned implementation, since the training process of the second model to be trained is a multi-round iteration, and each round of iteration uses a speech-text matching pair for model training, for the convenience of explanation, one of the iterations is used for introduction, and the iteration is referred to as the current round iteration, and a speech-text matching pair is used in the current round iteration. After entering the current round iteration, the speech-text matching pair can be used as the training object. The second speech is input into the first model so that the first model processes the second speech, thereby obtaining a third speech feature of the second speech. At the same time, the second text in the speech-text matching pair can also be input into the second model to be trained so that the second model to be trained processes the second text, thereby obtaining a fourth speech feature of the second text.

After obtaining the third voice feature of the second speech and the fourth voice feature of the second text, the third voice feature of the second speech and the fourth voice feature of the second text can be calculated to obtain the third loss of the speech-text matching pair, and the third loss of the speech-text matching pair is used to indicate the difference between the third voice feature corresponding to the second speech and the fourth voice feature corresponding to the second text. After obtaining the third loss of the speech-text matching pair, the parameters of the second model to be trained can be updated based on the third loss of the speech-text matching pair, and the current round of iteration is completed. Then, the next round of iteration can be entered, that is, the second model to be trained with the updated parameters is continued to be trained using the next speech-text matching pair until the model training conditions are met, thereby obtaining the second model.
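
The mapping learning rounds described above can be sketched as follows. The sketch assumes a mean squared error as the third loss, a small linear text mapping model, and plain gradient descent for a fixed number of rounds; none of these choices comes from the application, and every name and data item is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def first_model(speech: Vector) -> Vector:
    """Toy trained speech encoder (hypothetical); it only supplies targets."""
    return [sum(speech) / len(speech), max(speech) - min(speech)]


def text_inputs(text: str) -> Vector:
    """Toy numeric view of a text (hypothetical); the last entry is a bias term."""
    vowels = sum(ch in "aeiou" for ch in text)
    return [len(text) / 10.0, vowels / 5.0, 1.0]


def second_model(weights: Matrix, text: str) -> Vector:
    """Toy text mapping model to be trained: a linear map from text to a feature."""
    x = text_inputs(text)
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def third_loss(third_feature: Vector, fourth_feature: Vector) -> float:
    """Mean squared difference between the two features (an assumed loss)."""
    diffs = [(a - b) ** 2 for a, b in zip(third_feature, fourth_feature)]
    return sum(diffs) / len(diffs)


def train_mapping(pairs: List[Tuple[Vector, str]],
                  rounds: int = 500, lr: float = 0.1) -> Matrix:
    weights = [[0.0, 0.0, 0.0] for _ in range(2)]
    for r in range(rounds):
        speech, text = pairs[r % len(pairs)]     # one speech-text matching pair
        target = first_model(speech)              # third speech feature
        output = second_model(weights, text)      # fourth speech feature
        x = text_inputs(text)
        for i, row in enumerate(weights):
            # Gradient of the mean squared error with respect to row i
            scale = 2.0 * (output[i] - target[i]) / len(target)
            weights[i] = [w - lr * scale * xi for w, xi in zip(row, x)]
    return weights


def mean_loss(weights: Matrix, pairs: List[Tuple[Vector, str]]) -> float:
    losses = [third_loss(first_model(s), second_model(weights, t)) for s, t in pairs]
    return sum(losses) / len(losses)


pairs = [([0.1, 0.3, 0.2], "hello"), ([0.9, 0.5, 0.7], "good morning")]  # made-up pairs
mapping = train_mapping(pairs)
```

Once trained this way, the mapping model can be applied to text that has no paired speech, which is what the later backbone training relies on.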

In one possible implementation, the second model is any one of the following: a diffusion generation model, a generative adversarial network model, a sequence-to-sequence model, and the like.

In one possible implementation, a portion of the fourth model may be cut out to serve as the first model, for example, certain layers in the fourth model, etc. The fourth model is usually a model for completing speech recognition (also referred to as a trained speech recognition model), or the fourth model is a speech pre-training model, etc.
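
As a rough illustration of cutting out part of a model, the sketch below treats the fourth model as an ordered list of layers and keeps only the leading layers as the first model. The layer functions, the number of layers kept and all names are hypothetical; the application does not specify which layers are taken.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def scale(factor: float) -> Layer:
    return lambda xs: [factor * x for x in xs]


def shift(offset: float) -> Layer:
    return lambda xs: [x + offset for x in xs]


def relu(xs: Vector) -> Vector:
    return [max(0.0, x) for x in xs]


def run(layers: List[Layer], xs: Vector) -> Vector:
    """Apply the layers in order."""
    for layer in layers:
        xs = layer(xs)
    return xs


def cut_first_model(layers: List[Layer], num_layers: int) -> List[Layer]:
    """Keep the first num_layers layers; the remaining layers are discarded."""
    return layers[:num_layers]


# Toy "fourth model": three feature layers followed by a task-specific output layer.
fourth_model: List[Layer] = [scale(2.0), shift(-1.0), relu, scale(0.5)]
first_model = cut_first_model(fourth_model, 3)

features = run(first_model, [0.0, 2.0])      # intermediate features
full_output = run(fourth_model, [0.0, 2.0])  # output of the whole fourth model
```

Here `features` is what the retained layers produce, while `full_output` also passes through the discarded final layer.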

In a possible implementation, the speech task is any one of the following: speech translation, speech recognition, speech command, and speech dialogue, etc.

The second aspect of an embodiment of the present application provides a model training device, which includes: an acquisition module for acquiring training data associated with a speech task, the training data including a first text and a first speech; a first processing module for inputting the first speech into a first model to obtain a first speech feature; a second processing module for inputting the first text into a second model to obtain a second speech feature; a third processing module for inputting the first speech feature into a first model to be trained to obtain a processing result of the first speech; a fourth processing module for inputting the second speech feature into the first model to be trained to obtain a processing result of the first text; a first training module for training the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain a third model, and the model composed of the first model and the third model is used to complete the speech task.

It can be seen from the above device that after determining the user's voice task, the training data associated with the voice task can be obtained first, and the training data includes a first text and a first voice. Then, the first voice can be processed by the first model to obtain a first voice feature, and then the first voice feature can be processed by the first model to be trained to obtain the processing result of the first voice. In addition, the first text can be processed by the second model to obtain a second voice feature, and then the second voice feature can be processed by the first model to be trained to obtain the processing result of the first text. Finally, based on the processing result of the first voice and the processing result of the first text, the first model to be trained is trained to obtain a third model. In this way, the final model composed of the first model and the third model can be used to complete the user's voice task. In the above process, the training data of the first model to be trained is speech features (i.e., the first speech features and the second speech features), and the speech features can be derived from both speech (i.e., the first speech) and text (i.e., the first text). Since converting text into speech features is easier to implement than converting text into speech, it is often possible to convert text into correct speech features. The speech model trained based on the correct speech features (i.e., the final model constructed by the first model and the third model) can have excellent performance, so it can accurately complete the user's speech tasks, thereby improving the user experience.

In one possible implementation, the first training module is used to: obtain a first loss based on a processing result of a first speech and an actual processing result of the first speech, the first loss being used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech; obtain a second loss based on a processing result of a first text and an actual processing result of the first text, the second loss being used to indicate the difference between the processing result of the first text and the actual processing result of the first text; and update the parameters of the first model to be trained based on the first loss and the second loss until the model training conditions are met, thereby obtaining a third model.

In one possible implementation, the training data also includes a second text and a second voice, the second text corresponds to the second voice, and the device also includes: a fifth processing module, used to input the second voice into the first model to obtain a third voice feature; a sixth processing module, used to input the second text into the second model to be trained to obtain a fourth voice feature; and a second training module, used to train the second model to be trained based on the third voice feature and the fourth voice feature to obtain a second model.

In one possible implementation, the second training module is used to: obtain a third loss based on a third speech feature and a fourth speech feature, where the third loss is used to indicate the difference between the third speech feature and the fourth speech feature; and update the parameters of the second model to be trained based on the third loss until the model training conditions are met to obtain the second model.

In one possible implementation, the second model is any one of the following: a diffusion generation model, a generative adversarial network model, and a sequence-to-sequence model.

In one possible implementation, the first model is a part of a fourth model, and the fourth model is a model for completing speech recognition, or the fourth model is a speech pre-training model.

In a possible implementation, the speech task is any one of the following: speech translation, speech recognition, speech command, and speech dialogue.

A third aspect of an embodiment of the present application provides a model training device, which includes a memory and a processor; the memory stores code, and the processor is configured to execute the code. When the code is executed, the model training device is used to execute the method described in the first aspect or any possible implementation method of the first aspect.

A fourth aspect of an embodiment of the present application provides a circuit system, which includes a processing circuit, and the processing circuit is configured to execute the method described in the first aspect or any possible implementation manner of the first aspect.

A fifth aspect of an embodiment of the present application provides a chip system, which includes a processor for calling a computer program or computer instructions stored in a memory so that the processor executes the method described in the first aspect or any possible implementation method of the first aspect.

In a possible implementation manner, the processor is coupled to the memory through an interface.

In a possible implementation, the chip system also includes a memory, in which a computer program or computer instructions are stored.

A sixth aspect of an embodiment of the present application provides a computer storage medium, which stores a computer program. When the program is executed by a computer, the computer implements the method described in the first aspect or any possible implementation method of the first aspect.

A seventh aspect of the embodiments of the present application provides a computer program product, which stores instructions. When the instructions are executed by a computer, the computer implements the method described in the first aspect or any possible implementation method of the first aspect.

In an embodiment of the present application, after determining the user's voice task, training data associated with the voice task may be obtained first, and the training data includes a first text and a first voice. Then, the first voice may be processed by the first model to obtain a first voice feature, and then the first voice feature may be processed by the first model to be trained to obtain a processing result of the first voice. Furthermore, the first text may be processed by the second model to obtain a second voice feature, and then the second voice feature may be processed by the first model to be trained to obtain a processing result of the first text. Finally, based on the processing result of the first voice and the processing result of the first text, the first model to be trained is trained to obtain a third model. In this way, the final model composed of the first model and the third model can be used to complete the user's voice task. In the above process, the training data of the first model to be trained is speech features (i.e., the first speech features and the second speech features), and the speech features can be derived from both speech (i.e., the first speech) and text (i.e., the first text). Since converting text into speech features is easier to implement than converting text into speech, it is often possible to convert text into correct speech features. The speech model trained based on the correct speech features (i.e., the final model constructed by the first model and the third model) can have excellent performance, so it can accurately complete the user's speech tasks, thereby improving the user experience.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a structure of an artificial intelligence main framework;

FIG. 2a is a schematic diagram of a structure of a model training system provided in an embodiment of the present application;

FIG. 2b is another schematic diagram of the structure of the model training system provided in an embodiment of the present application;

FIG. 2c is a schematic diagram of a device related to model training provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of the architecture of the system 100 provided in an embodiment of the present application;

FIG. 4 is a flow chart of a model training method provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of a mapping learning stage provided in an embodiment of the present application;

FIG. 6 is a schematic diagram of a text-speech multi-modal learning stage provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of an application example of the mapping learning stage provided in an embodiment of the present application;

FIG. 8 is a schematic diagram of an application example of the text-speech multi-modal learning stage provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of a structure of a model training device provided in an embodiment of the present application;

FIG. 10 is a schematic diagram of a structure of a training device provided in an embodiment of the present application;

FIG. 11 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.

DETAILED DESCRIPTION

The terms "first", "second", etc. in the specification and claims of the present application and in the above-mentioned drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that terms used in this way are interchangeable under appropriate circumstances; this is merely the manner adopted in the embodiments of the present application to distinguish objects having the same attributes. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product or device comprising a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product or device.

With the rapid development of AI technology, smart terminals often have end-to-end (E2E) speech models, which can be used to complete the user's speech tasks, such as speech translation, speech recognition, speech commands and speech dialogue. That is to say, when the user inputs speech into the speech model, the speech model can process the user's speech to obtain a processing result, such as the translated text obtained by recognizing and translating the user's speech, the text obtained by recognizing the user's speech, the command determined by recognizing the user's speech, or the answer found by recognizing the user's speech.

When training a speech model, it is often necessary to obtain a large amount of training data, that is, a large amount of speech together with its labels (the actual processing results of that speech). Labeled speech, however, is often difficult to collect. In order to collect a certain amount of labeled speech, related technologies usually use a text-to-speech (TTS) system to convert labeled text into labeled speech. Since text is easier to collect, a sufficient amount of speech can be obtained through the speech synthesis system as training data. However, due to the performance limitations of the speech synthesis system, the converted speech often differs from the correct speech, so a speech model trained on it has insufficient performance and cannot accurately complete the user's speech tasks.

In order to solve the above problems, the embodiments of the present application provide a new model training method, which can be implemented in combination with artificial intelligence (AI) technology. AI technology is a technical discipline that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence. AI technology obtains the best results by sensing the environment, acquiring knowledge and using knowledge. In other words, artificial intelligence technology is a branch of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence. Using artificial intelligence for data processing is a common application of artificial intelligence.

First, the overall workflow of the artificial intelligence system is described. Please refer to FIG. 1, which is a schematic diagram of a structure of an artificial intelligence main framework. The main framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data-information-knowledge-wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementation) to the industrial ecology of the system.

(1) Infrastructure

The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by a basic platform. Communication with the outside world takes place through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC and FPGA); the basic platform includes a distributed computing framework, networks and other related platform guarantees and support, which can include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to obtain data, and the data is provided for calculation to the smart chips in the distributed computing system provided by the basic platform.

(2) Data

The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, speech and text, as well as Internet of Things data from traditional devices, including business data of existing systems and perception data such as force, displacement, liquid level, temperature and humidity.

(3) Data processing

Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.

Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.

Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formalized information to perform machine thinking and solve problems based on reasoning control strategies. Typical functions are search and matching.

Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.

(4) General capabilities

After the data has undergone the data processing mentioned above, some general capabilities can be further formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.

(5) Smart products and industry applications

Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, which productizes intelligent information decision-making and realizes practical applications. Its application areas mainly include: smart terminals, smart transportation, smart medical care, autonomous driving, smart cities, etc.

Next, several application scenarios of this application are introduced.

Figure 2a is a schematic structural diagram of a model training system provided in an embodiment of the present application. The model training system includes a user device and a data processing device. The user device includes an intelligent terminal such as a mobile phone, a personal computer, or an information processing center. The user device is the initiator of model training: the user usually initiates a model training request through the user device.

The above-mentioned data processing device can be a device or server with a data processing function, such as a cloud server, a network server, an application server, or a management server. The data processing device receives requests from the intelligent terminal through an interactive interface, and then performs machine learning, deep learning, search, reasoning, decision-making and other processing through a memory for storing data and a processor for processing data. The memory in the data processing device is a general term that covers local storage and a database for storing historical data. The database can be on the data processing device or on other network servers.

In the model training system shown in Figure 2a, the user device can receive the user's instructions, determine the voice task that the user needs to complete, and then initiate a request to the data processing device, so that the data processing device performs model training for the voice task determined by the user device, using the training data associated with the voice task, thereby obtaining a model for completing the voice task. Exemplarily, after receiving the user's instructions, the user device can determine the voice task that the user needs to complete based on the user's instructions. Then, the user device can initiate a model training request to the data processing device, so that the data processing device can obtain the training data associated with the voice task based on the model training request, use the training data to complete the training of the model to be trained, thereby obtaining a model for completing the voice task, and return the model to the user device, so that the user device uses the model to complete the voice task that the user needs to complete.

In Figure 2a, the data processing device can execute the model training method of an embodiment of the present application.

Figure 2b is another structural diagram of the model training system provided in an embodiment of the present application. In Figure 2b, the user device directly serves as the data processing device. The user device can directly obtain instructions from the user, and the instructions are processed by the hardware of the user device itself. The specific process is similar to that of Figure 2a; please refer to the above description, which is not repeated here.

In the model training system shown in Figure 2b, the user device can receive the user's instruction, and the user device can obtain the voice task that the user needs to complete and the training data associated with the voice task based on the user's instruction. Then, the user device can use the training data to complete the training of the model to be trained, thereby obtaining a model for completing the voice task. In this way, the user device can use the model to complete the voice task that the user needs to complete in the subsequent process.

In Figure 2b, the user device itself can execute the model training method of the embodiment of the present application.

Figure 2c is a schematic diagram of the relevant equipment for model training provided in an embodiment of the present application.

The user device in the above Figures 2a and 2b can specifically be the local device 301 or the local device 302 in Figure 2c, and the data processing device in Figure 2a can specifically be the training device 210 in Figure 2c, wherein the data storage system 250 can store the data to be processed of the training device 210, and the data storage system 250 can be integrated on the training device 210, and can also be set on the cloud or other network servers.

The processors in Figures 2a and 2b can perform data training/machine learning/deep learning through a neural network model or other models (for example, a model based on a support vector machine, etc.), and use the data to ultimately train or learn a model that can be used to complete speech tasks.

Figure 3 is a schematic diagram of the architecture of the system 100 provided in an embodiment of the present application. In Figure 3, the training device 120 is configured with an input/output (I/O) interface 112 for information interaction with an external device. A user can input instructions to the I/O interface 112 through a client device 140. The instructions in the embodiment of the present application may include: various tasks to be scheduled, callable resources, other parameters, etc.

First, the training device 120 may determine the voice task that the user needs to complete based on the instruction input by the user.

Next, the training device 120 can train a corresponding model/rule based on the training data associated with the voice task for the voice task that the user needs to complete, and the corresponding model/rule can be used to achieve the voice task that the user needs to complete, thereby providing the user with the required voice task result. The training data can be obtained in a variety of ways: for example, during the process of the computing module 111 performing model training and other related processing, the training device 120 can call the data, code, etc. in the data storage system 150 for the corresponding processing, and can also store the model obtained by the corresponding training in the data storage system 150. For another example, during the process of the computing module 111 performing model training and other related processing, the training device 120 can obtain training data from the database 130, and these training data are usually from the training samples collected by the data acquisition device 160.

Finally, the I/O interface 112 returns the trained model to the client device 140, thereby providing it to the user so that the user can complete the speech task that he needs to complete.

In the case shown in Figure 3, the user can manually give instructions, which can be operated through the interface provided by the I/O interface 112. In another case, the client device 140 can automatically send instructions to the I/O interface 112. If the client device 140 needs the user's authorization to send instructions automatically, the user can set the corresponding permissions in the client device 140. The user can view the results output by the training device 120 on the client device 140, and the specific presentation form can be display, sound, action, etc. The client device 140 can also serve as a data collection terminal, collect various data as new sample data under the user's instruction, and store them in the database 130.

It is worth noting that Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application. The positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation. For example, in Figure 3, the data storage system 150 is an external memory relative to the training device 120. In other cases, the data storage system 150 can also be placed in the training device 120.

The embodiment of the present application also provides a chip, which includes a neural network processor (NPU). The chip can be set in the training device 120 shown in Figure 3 to complete the training work of the training device 120 and output the target model/rule.

The neural network processor (NPU) is mounted on the main central processing unit (host CPU) as a coprocessor, and the main CPU assigns tasks to it. The core part of the NPU is the operation circuit; the controller controls the operation circuit to fetch data from a memory (the weight memory or the input memory) and perform operations.

In some implementations, the operation circuit includes multiple processing units (processing engines, PEs) internally. In some implementations, the operation circuit is a two-dimensional systolic array. The operation circuit can also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit is a general-purpose matrix processor.

For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit takes the corresponding data of matrix B from the weight memory and caches it on each PE in the operation circuit. The operation circuit takes the matrix A data from the input memory and performs matrix operations with matrix B. The partial results or final results of the matrix are stored in the accumulator.
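By way of a non-limiting illustration, the data flow described above can be sketched in software. The sketch below is not the NPU's actual circuit; the function name `matmul_accumulate` and the list-of-lists matrix representation are assumptions made only for this example. It holds matrix B fixed, walks through the data of matrix A, and adds each partial product into an accumulator until the final result C is complete.

```python
def matmul_accumulate(a, b):
    """Compute C = A x B by adding partial products into an accumulator."""
    n, k_dim, m = len(a), len(b), len(b[0])
    if any(len(row) != k_dim for row in a):
        raise ValueError("inner dimensions of A and B must match")
    # The accumulator holds partial results until every k-step has been added.
    acc = [[0.0] * m for _ in range(n)]
    for k in range(k_dim):  # each k-step contributes one partial product
        for i in range(n):
            for j in range(m):
                acc[i][j] += a[i][k] * b[k][j]
    return acc


A = [[1.0, 2.0], [3.0, 4.0]]  # input matrix (illustrative values)
B = [[5.0, 6.0], [7.0, 8.0]]  # weight matrix (illustrative values)
C = matmul_accumulate(A, B)
```

After the loop over k finishes, the accumulator holds the final result; before that, it holds only partial results.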

The vector calculation unit can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. For example, the vector calculation unit can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc.

In some implementations, the vector calculation unit can store the processed output vector in the unified memory. For example, the vector calculation unit can apply a nonlinear function to the output of the operation circuit, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit generates a normalized value, a merged value, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit, for example for use in a subsequent layer in a neural network.
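As a minimal sketch of this post-processing step, assuming ReLU and sigmoid as example nonlinear functions (the source does not name a specific function), a vector of accumulated values can be turned into activation values as follows. The helper names are hypothetical.

```python
import math


def relu(values):
    """Element-wise max(0, v); one possible nonlinear function."""
    return [max(0.0, v) for v in values]


def sigmoid(values):
    """Element-wise logistic function; another possible nonlinear function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


accumulated = [-2.0, 0.0, 3.5]    # illustrative accumulator output
activations = relu(accumulated)   # could feed a subsequent layer
```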

The unified memory is used to store input data and output data.

The direct memory access controller (DMAC) transfers the input data in the external memory to the input memory and/or the unified memory, stores the weight data in the external memory into the weight memory, and stores the data in the unified memory into the external memory.

The bus interface unit (BIU) is used to enable interaction between the main CPU, the DMAC and the instruction fetch buffer through the bus.

The instruction fetch buffer is connected to the controller and is used to store the instructions used by the controller.

The controller is used to call the instructions cached in the instruction fetch buffer to control the working process of the computing accelerator.

Generally, the unified memory, the input memory, the weight memory, and the instruction fetch buffer are all on-chip memories. The external memory is a memory outside the NPU, and may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or other readable and writable memory.

Since the embodiments of the present application involve the application of a large number of neural networks, in order to facilitate understanding, the relevant terms and related concepts such as neural networks involved in the embodiments of the present application are first introduced below.

(1) Neural Network

A neural network may be composed of neural units, and a neural unit may refer to an operation unit with xs and intercept 1 as input, and the output of the operation unit may be:

$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

Where s=1, 2, ...n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into the output signal. The output signal of the activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the characteristics of the local receptive field. The local receptive field can be an area composed of several neural units.
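For illustration only, a single neural unit as described above (a weighted sum of the inputs xs plus the bias b, passed through the activation function f) can be sketched as follows. The function names and the numeric values are assumptions of this example; the sigmoid is used because the text names it as one possible activation function.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(xs, ws, b, f=sigmoid):
    """Return f(sum_s W_s * x_s + b) for one neural unit."""
    if len(xs) != len(ws):
        raise ValueError("each input needs exactly one weight")
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return f(z)


# Illustrative values: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
out = neural_unit([1.0, 2.0], [0.5, -0.25], 0.0)
```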

The work of each layer in the neural network can be described by the mathematical expression y=a(Wx+b): From a physical level, the work of each layer in the neural network can be understood as completing the transformation from the input space to the output space (i.e., the row space to the column space of the matrix) through five operations on the input space (the set of input vectors). These five operations include: 1. Dimension increase/reduction; 2. Zoom in/out; 3. Rotation; 4. Translation; 5. "Bending". Among them, operations 1, 2, and 3 are completed by Wx, operation 4 is completed by +b, and operation 5 is implemented by a(). The word "space" is used here because the classified object is not a single thing, but a class of things, and space refers to the collection of all individuals of this class of things. Among them, W is a weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how to transform the space. The purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by many layers of vectors W). Therefore, the training process of a neural network is essentially about learning how to control spatial transformations, or more specifically, learning the weight matrix.
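The expression y=a(Wx+b) can be sketched for one layer as below. This is only an illustration: the name `layer_forward`, the choice of tanh for a(), and the numeric values are assumptions. Here W is written as a list of rows, so a 2x3 W reduces a three-dimensional input to a two-dimensional output (dimension reduction and scaling by Wx, translation by +b, "bending" by a()).

```python
import math


def layer_forward(w, x, b, a=math.tanh):
    """Return y = a(Wx + b), where w is a list of rows of the weight matrix."""
    return [
        a(sum(wij * xj for wij, xj in zip(row, x)) + bi)
        for row, bi in zip(w, b)
    ]


W = [[1.0, 0.0, 0.0],   # keeps the first input component
     [0.0, 2.0, 0.0]]   # doubles the second component; the third is dropped
x = [1.0, 1.0, 1.0]
b = [0.0, -2.0]         # translation applied after Wx
y = layer_forward(W, x, b)  # two outputs: tanh(1.0) and tanh(0.0)
```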

Because we want the output of the neural network to be as close as possible to the value we really want to predict, we can compare the current network's predicted value with the target value we really want, and then update the weight vector of each layer of the neural network based on the difference between the two (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer in the neural network). For example, if the network's predicted value is high, adjust the weight vector to make it predict a lower value, and keep adjusting until the neural network can predict the target value we really want. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value", which is the loss function or objective function, which are important equations used to measure the difference between the predicted value and the target value. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so the training of the neural network becomes a process of minimizing this loss as much as possible.
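As a small illustration of a loss function, the sketch below uses the mean squared error, which is only one possible choice and is an assumption of this example. A prediction that is far from the target yields a higher loss than a prediction that is close to it.

```python
def squared_error_loss(predicted, target):
    """Mean squared difference between predicted values and target values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


target = [1.0, 0.0]                                   # illustrative target
loss_far = squared_error_loss([0.0, 1.0], target)     # poor prediction
loss_near = squared_error_loss([0.9, 0.1], target)    # better prediction
```

Training then amounts to adjusting the weights so that this value becomes as small as possible.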

(2) Back propagation algorithm

Neural networks can use the error back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward propagation of the input signal to the output generates an error loss, and the error loss information is back-propagated to update the parameters in the initial neural network model, so that the error loss converges. The back propagation algorithm is a backward pass driven by the error loss, which aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
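A minimal sketch of back propagation is given below, assuming a toy network with two scalar weights, y = w2 * tanh(w1 * x), a squared-error loss, and plain gradient descent. All names, the learning rate, and the numeric values are assumptions of this example. The forward pass produces the error loss, and the backward pass applies the chain rule from the output back to each weight.

```python
import math


def train_step(w1, w2, x, t, lr=0.1):
    """One forward pass, one backward pass, and one parameter update."""
    # Forward pass: the input signal is propagated to the output.
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - t) ** 2
    # Backward pass: the error is propagated from the output to each weight.
    d_y = 2.0 * (y - t)               # dLoss/dy
    d_w2 = d_y * h                    # dLoss/dw2
    d_h = d_y * w2                    # dLoss/dh
    d_w1 = d_h * (1.0 - h * h) * x    # dLoss/dw1, using tanh'(z) = 1 - tanh(z)^2
    return w1 - lr * d_w1, w2 - lr * d_w2, loss


w1, w2 = 0.5, 0.5   # initialization before the first update
losses = []
for _ in range(200):
    w1, w2, loss = train_step(w1, w2, x=1.0, t=0.8)
    losses.append(loss)
```

The recorded losses shrink over the iterations, which corresponds to the error loss converging.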

The method provided in the present application is described below from the training side of the neural network and the application side of the neural network.

The model training method provided in the embodiment of the present application involves the processing of data sequences, and can be specifically applied to methods such as data training, machine learning, and deep learning, to perform symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on training data (for example, the training data associated with the speech task in the embodiment of the present application, including the first text, the first speech, the second text, and the second speech, etc.). Finally, a trained neural network is obtained (for example, a model composed of the first model and the third model in the embodiment of the present application); and the task processing method provided in the embodiment of the present application can use the above-mentioned trained neural network to input the user's input data (for example, voice, etc.) into the trained neural network to obtain output data, thereby completing the task that the user needs to complete. It should be noted that the model training method and task processing method provided in the embodiment of the present application are inventions based on the same concept, and can also be understood as two parts in a system, or two stages of an overall process: such as the model training stage and the model application stage.

Figure 4 is a flow chart of a model training method provided in an embodiment of the present application. As shown in Figure 4, the method includes:

401. Acquire training data associated with a speech task, where the training data includes a first speech, a first text, a second speech, and a second text.

In this embodiment, after determining the voice task that the user needs to complete (for example, voice translation, voice recognition, voice command, voice dialogue, etc.), training data related to the voice task can be obtained. It should be noted that the training data usually includes data from two learning stages. The data from the text-speech multimodal learning stage is used to train the first model to be trained (also referred to as the task backbone model to be trained, which is a neural network model that needs to be trained), and the data from the mapping learning stage is used to train the second model to be trained (also referred to as the text mapping model to be trained, which is a neural network model that needs to be trained). Among them, the data from the text-speech multimodal learning stage may include multiple first voices and multiple first texts, and the multiple first voices and multiple first texts are independent of each other in content. It is worth noting that the labels of the multiple first voices (i.e., the real processing results of the multiple first voices) and the labels of the multiple first texts (i.e., the real processing results of the multiple first texts) are known. The data from the mapping learning stage may include multiple voice-text matching pairs, and a voice-text matching pair includes a second voice and a second text corresponding to the second voice, that is, the second voice and the second text corresponding to the second voice are matched in content.

Since the model training process in this embodiment includes a mapping learning stage and a text-speech multimodal learning stage, after obtaining the training data, the mapping learning stage can be performed first.

For example, after determining that the task backbone model to be trained and the text mapping model to be trained need to be trained, training data related to the user's speech task can be obtained. The training data includes data from the mapping learning stage and data from the text-speech multimodal learning stage, where the data from the mapping learning stage can be expressed as D = {(S1, T1), (S2, T2), (S3, T3), ..., (Si, Ti), ... (SN, TN)}, where (Si, Ti) represents the i-th speech-text matching pair (i = 1, ..., N, N≥2), Si represents the i-th speech in the mapping learning stage, and Ti represents the i-th text in the mapping learning stage. The data in the text-speech multimodal learning stage can be expressed as DS = {(S`1, Y1), (S`2, Y2), (S`3, Y3), ..., (S`j, Yj), ... (S`M, YM)}, and DT = {(T`1, Y`1), (T`2, Y`2), (T`3, Y`3), ..., (T`k, Y`k), ... (T`P, Y`P)}, where S`j represents the j-th speech in the text-speech multimodal learning stage (j = 1, ..., M, M ≥ 2), Yj represents the label of the j-th speech, T`k represents the k-th text in the text-speech multimodal learning stage (k = 1, ..., P, P ≥ 2), and Y`k represents the label of the k-th text.
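One possible in-memory layout for the three data sets D, DS and DT is sketched below. The class names, field types (speech as a list of samples, text and labels as strings) and the sample values are hypothetical and chosen only for illustration; the only constraint taken from the text is that each set contains at least two items.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MatchedPair:
    """(S_i, T_i): a speech and a text whose contents match."""
    speech: List[float]
    text: str


@dataclass
class LabeledSpeech:
    """(S`_j, Y_j): a speech and its known label."""
    speech: List[float]
    label: str


@dataclass
class LabeledText:
    """(T`_k, Y`_k): a text and its known label."""
    text: str
    label: str


@dataclass
class TrainingData:
    mapping_pairs: List[MatchedPair]     # D, used in the mapping learning stage
    labeled_speech: List[LabeledSpeech]  # DS, used in the multimodal stage
    labeled_text: List[LabeledText]      # DT, used in the multimodal stage

    def validate(self) -> None:
        """Check N >= 2, M >= 2 and P >= 2."""
        sets = (("D", self.mapping_pairs), ("DS", self.labeled_speech),
                ("DT", self.labeled_text))
        for name, items in sets:
            if len(items) < 2:
                raise ValueError(f"{name} needs at least two samples")


# Placeholder values for illustration only.
data = TrainingData(
    mapping_pairs=[MatchedPair([0.1, 0.3], "hello"),
                   MatchedPair([0.4, 0.8], "world")],
    labeled_speech=[LabeledSpeech([0.2, 0.1], "label a"),
                    LabeledSpeech([0.5, 0.7], "label b")],
    labeled_text=[LabeledText("good morning", "label c"),
                  LabeledText("good night", "label d")],
)
data.validate()
```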

402. Input the second speech into the first model to obtain a third speech feature.

403. Input the second text into the second model to be trained to obtain a fourth speech feature.

404. Train the second model to be trained based on the third speech feature and the fourth speech feature to obtain a second model.

After entering the mapping learning stage, among multiple speech-text matching pairs, for any speech-text matching pair, the second speech in the speech-text matching pair can be input into the first model (also referred to as a trained speech precoding model, which is a trained neural network model), so that the first model processes the second speech (for example, feature extraction, etc.), thereby obtaining the third speech feature of the second speech. At the same time, the second text in the speech-text matching pair can also be input into the second model to be trained, so that the second model to be trained processes the second text (for example, feature extraction, etc.), thereby obtaining the fourth speech feature of the second text.

Next, the second model to be trained can be trained based on the third speech feature of the second speech and the fourth speech feature of the second text to obtain a second model (also referred to as a trained text mapping model, which is a trained neural network model).

Specifically, the second model to be trained can be trained in the following manner to obtain the second model:

(1) After entering the mapping learning stage, since the training process of the second model to be trained is a multi-round iteration, and each round of iteration uses a speech-text matching pair for model training, for the convenience of explanation, one of the iterations is introduced and the iteration is called the current iteration. A speech-text matching pair is used in the current iteration. After entering the current iteration, the second speech in the speech-text matching pair can be input into the first model so that the first model processes the second speech to obtain the third speech feature of the second speech. At the same time, the second text in the speech-text matching pair can also be input into the second model to be trained so that the second model to be trained processes the second text to obtain the fourth speech feature of the second text.

(2) After obtaining the third speech feature of the second speech and the fourth speech feature of the second text, a preset first loss function can be used to calculate the third speech feature of the second speech and the fourth speech feature of the second text, thereby obtaining a third loss of the speech-text matching pair, and the third loss of the speech-text matching pair is used to indicate the difference between the third speech feature corresponding to the second speech and the fourth speech feature corresponding to the second text.

(3) After obtaining the third loss of the speech-text matching pair, the parameters of the second model to be trained can be updated based on the third loss of the speech-text matching pair, and the current round of iteration is completed. Then, the next round of iteration can be entered, that is, the second model to be trained with the updated parameters can be continuously trained using the next speech-text matching pair until the model training conditions are met (for example, the loss reaches convergence, etc.), thereby obtaining the second model.

Still as in the above example, as shown in Figure 5 (Figure 5 is a schematic diagram of the mapping learning stage provided in an embodiment of the present application), after entering the mapping learning stage, Si in (Si, Ti) can be input into the trained speech precoding model to obtain the speech latent space representation (i.e., speech feature) Hi of Si, and Ti in (Si, Ti) can be input into the text mapping model to be trained to obtain the speech latent space representation H`i of Ti. Then, with the purpose of maximizing the similarity between Hi and H`i, the mapping learning loss function can be used to calculate Hi and H`i to obtain the loss Li, which is used to represent the difference between Hi and H`i. Subsequently, Li can be used to update the parameters of the text mapping model to be trained, and then (Si+1, Ti+1) can be used to continue training the text mapping model after the updated parameters until the loss converges, thereby obtaining the trained text mapping model.
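The mapping learning stage can be sketched as the toy training loop below. This is a sketch under stated assumptions, not the method of the embodiment: `speech_encoder` is a fixed stand-in for the trained speech precoding model, `text_inputs` is a hypothetical text featurization, the text mapping model is reduced to a linear map, and the mean squared error is assumed as the mapping learning loss (the text only requires that the loss reflect the difference between Hi and H`i). Only the text mapping model is updated; the speech precoding model stays unchanged.

```python
def speech_encoder(speech):
    """Stand-in for the trained (frozen) speech precoding model."""
    return [sum(speech) / len(speech), max(speech)]


def text_inputs(text, vocab):
    """Hypothetical featurization: count of each vocabulary token in the text."""
    tokens = text.split()
    return [float(tokens.count(v)) for v in vocab]


class TextMappingModel:
    """Trainable linear map from text inputs to the speech feature space."""

    def __init__(self, in_dim, out_dim):
        self.w = [[0.0] * in_dim for _ in range(out_dim)]

    def forward(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

    def update(self, x, h_speech, lr):
        """Compute the loss L_i between H_i and H`_i, then update the weights."""
        h_text = self.forward(x)
        diff = [ht - hs for ht, hs in zip(h_text, h_speech)]
        loss = sum(d * d for d in diff) / len(diff)
        for o, d in enumerate(diff):
            for i, xi in enumerate(x):
                self.w[o][i] -= lr * 2.0 * d * xi / len(diff)
        return loss


vocab = ["hello", "world"]                                # toy vocabulary
pairs = [([0.1, 0.3], "hello"), ([0.4, 0.8], "world")]    # toy (S_i, T_i) pairs
model = TextMappingModel(len(vocab), 2)
history = []
for _ in range(100):
    for speech, text in pairs:          # one matching pair per iteration
        h_i = speech_encoder(speech)    # H_i from the frozen model
        history.append(model.update(text_inputs(text, vocab), h_i, lr=0.2))
```

After training, `model.forward(text_inputs("hello", vocab))` is close to `speech_encoder([0.1, 0.3])`, which is the sense in which text is mapped into the speech latent space.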

405. Input the first speech into the first model to obtain a first speech feature.

406. Input the first text into the second model to obtain a second speech feature.

407. Input the first speech feature into the first model to be trained to obtain a processing result of the first speech.

408. Input the second speech feature into the first model to be trained to obtain a processing result of the first text.

409. Based on the processing result of the first text and the processing result of the first speech, the first model to be trained is trained to obtain a third model. The model composed of the first model and the third model is used to complete the speech task.

After entering the text-speech multimodal learning stage, among the multiple first voices and the multiple first texts, for any first voice, the first voice can be input into the first model so that the first model processes the first voice (for example, feature extraction, etc.) to obtain the first voice feature of the first voice, and the first voice feature of the first voice is input into the first model to be trained so that the first model to be trained processes the first voice feature of the first voice (for example, feature extraction, etc.) to obtain the processing result of the first voice. Of course, for any first text, the first text can also be input into the second model so that the second model processes the first text (for example, feature extraction, etc.) to obtain the second voice feature of the first text, and the second voice feature of the first text is input into the first model to be trained so that the first model to be trained processes the second voice feature of the first text (for example, feature extraction, etc.) to obtain the processing result of the first text.

Specifically, the first model to be trained can be trained in the following manner to obtain the third model:

(1) After entering the text-speech multimodal learning stage, since the training process of the first model to be trained is a multi-round iteration, and each iteration uses a first speech or a first text for model training, for the convenience of explanation, two adjacent iterations are introduced, one of which is called the current iteration, and the other is called the next iteration. A first speech is used in the current iteration, and a first text is used in the next iteration. After entering the current iteration, the first speech can be input into the first model so that the first model processes the first speech, thereby obtaining the first speech feature of the first speech. Then, the first speech feature of the first speech can be input into the first model to be trained so that the first model to be trained processes the first speech feature of the first speech, thereby obtaining the processing result of the first speech.

(2) After obtaining the processing result of the first speech, the preset second loss function can be used to calculate the processing result of the first speech and the actual processing result (label) of the first speech, thereby obtaining the first loss of the first speech, which is used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech. Then, the parameters of the first model to be trained can be updated based on the first loss of the first speech, thereby obtaining the first model to be trained with updated parameters, and thus the current round of iteration is completed.

(3) After entering the next round of iteration, the first text can be input into the second model so that the second model processes the first text, thereby obtaining the second speech feature of the first text. Then, the second speech feature of the first text can be input into the first model to be trained so that the first model to be trained processes the second speech feature of the first text, thereby obtaining the processing result of the first text.

(4) After obtaining the processing result of the first text, the second loss function can be used to calculate the processing result of the first text and the actual processing result (label) of the first text, thereby obtaining the second loss of the first text, which is used to indicate the difference between the processing result of the first text and the actual processing result of the first text. Then, the parameters of the first model to be trained, as updated in the current round of iteration, can be updated again based on the second loss of the first text, thereby obtaining the first model to be trained with parameters updated again, and the next round of iteration is completed.

(5) Subsequently, a further round of iteration can be entered, that is, the first model to be trained with parameters updated again continues to be trained using the next first speech or the next first text until the model training conditions are met (for example, the loss reaches convergence, etc.), thereby obtaining the third model.

Still as in the above example, as shown in Figure 6 (Figure 6 is a schematic diagram of the text-speech multimodal learning stage provided in an embodiment of the present application), after entering the text-speech multimodal learning stage, in the current round of iteration, S`j can be input into the trained speech precoding model to obtain the speech latent space representation H``j of S`j, and then H``j is input into the task backbone model to be trained to obtain the processing result Y``j of S`j. Then, the text-speech multimodal learning loss function can be used to calculate Y``j and the label Yj of S`j to obtain the loss LSj, which is used to indicate the difference between Y``j and Yj. Subsequently, LSj can be used to update the parameters of the task backbone model to be trained to obtain the task backbone model after the updated parameters. At this point, the current round of iteration is completed.

In the next round of iteration, T`k can be input into the trained second model (the model that maps text to a speech latent space representation) to obtain the speech latent space representation H```k of T`k, and then H```k can be input into the task backbone model with updated parameters to obtain the processing result Y```k of T`k. Then, the text-speech multimodal learning loss function can be applied to Y```k and the label Y`k of T`k to obtain the loss LTk, which indicates the difference between Y```k and Y`k. Subsequently, LTk can be used to update the parameters of the task backbone model again, yielding the task backbone model with parameters updated a second time. At this point, the next round of iteration is completed.

In the next iteration, S`j+1 or T`k+1 can be used to continue training the task backbone model after updating the parameters again until the loss converges, thereby obtaining the trained task backbone model. In this way, after the trained speech precoding model and the trained task backbone model are spliced together, the obtained model is the final model used to complete the user's speech task.
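The alternating iterations described above can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the implementation of the embodiment: the function names (`pre_encode_speech`, `map_text`, `backbone`, `sgd_step`, `train_backbone`), the linear backbone, the squared-error loss, and the toy data are all hypothetical. The two frozen functions stand in for the trained speech precoding model and the trained text-to-latent model, and only the backbone weights are updated.

```python
def pre_encode_speech(speech):
    """Frozen stand-in for the trained speech precoding model (hypothetical)."""
    return [2.0 * x for x in speech]


def map_text(text):
    """Frozen stand-in for the trained text-to-latent model (hypothetical)."""
    return [2.0 * x + 0.1 for x in text]


def backbone(weights, latent):
    """Toy linear task backbone: dot product of weights and latent features."""
    return sum(w * h for w, h in zip(weights, latent))


def sgd_step(weights, latent, label, lr):
    """One gradient step on the squared error between prediction and label."""
    error = backbone(weights, latent) - label
    updated = [w - lr * 2.0 * error * h for w, h in zip(weights, latent)]
    return updated, error * error


def train_backbone(speech_items, text_items, epochs=200, lr=0.01):
    """Alternate speech and text iterations; only the backbone is updated."""
    weights = [0.0, 0.0]
    loss_s = loss_t = float("inf")
    for _ in range(epochs):
        for (speech, y_s), (text, y_t) in zip(speech_items, text_items):
            # Speech iteration: S -> pre-encoder -> backbone -> loss -> update.
            weights, loss_s = sgd_step(weights, pre_encode_speech(speech), y_s, lr)
            # Text iteration: T -> text-to-latent model -> backbone -> loss -> update.
            weights, loss_t = sgd_step(weights, map_text(text), y_t, lr)
    return weights, loss_s, loss_t


# Toy labels generated by the "true" backbone weights [1.0, -1.0] (assumption).
speech_items = [([1.0, 0.0], 2.0), ([0.0, 1.0], -2.0)]
text_items = [([1.0, 0.0], 2.0), ([0.0, 1.0], -2.0)]
weights, loss_s, loss_t = train_backbone(speech_items, text_items)
```

In this sketch the stopping condition is a fixed number of epochs; the embodiment instead stops when the model training conditions are met, for example when the loss converges.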

It should be understood that, in this embodiment, the second model can be any one of the following: a diffusion model, a generative adversarial network (GAN) model, a sequence-to-sequence model, and the like.

It should also be understood that in this embodiment, the first model is a part of the fourth model, that is, a part of the fourth model can be cut out to serve as the first model, for example, certain layers in the fourth model, etc. The fourth model is usually a model for completing speech recognition (i.e., a trained speech recognition model, a trained neural network model), or the fourth model is a speech pre-training model, etc.

In order to further understand the model training method provided by the embodiment of the present application, the method is further introduced below in conjunction with a specific application example. As shown in Figures 7 and 8 (Figure 7 is a schematic diagram of an application example of the mapping learning stage provided by the embodiment of the present application, and Figure 8 is a schematic diagram of an application example of the text-speech multimodal learning stage provided by the embodiment of the present application), the application example includes:

Assuming that the user's speech task is speech translation, the trained speech recognition model can be obtained first. The speech recognition model is used to convert Chinese speech into Chinese text. In the speech recognition model, the first 4 layers of the model can be taken as the trained speech precoding model.

In order to train a model that can convert Chinese speech into English text, the diffusion model to be trained and the speech translation backbone model to be trained can be obtained first. A batch of training data can be obtained first, which contains Chinese speech and Chinese text (the two are corresponding), and then the text is input into the diffusion model to be trained to obtain the corresponding speech latent space representation, and the speech is input into the trained speech precoding model to obtain the corresponding speech latent space representation. Then, the parameters of the diffusion model can be updated based on these speech latent space representations, thereby completing the training of the diffusion model and obtaining a trained diffusion model.
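The mapping learning step above can be illustrated with a deliberately simplified sketch. It assumes a two-parameter linear mapper in place of the diffusion model and a mean squared error as the loss between the two latent representations; the names `pre_encode_speech`, `mapper`, `third_loss`, and `train_mapper`, as well as the toy data, are hypothetical. The pre-encoder is frozen, and only the mapper parameters are updated.

```python
def pre_encode_speech(speech):
    """Frozen stand-in for the trained speech precoding model (hypothetical)."""
    return [0.5 * x + 1.0 for x in speech]


def mapper(params, text):
    """Toy text-to-latent mapper; a linear map replaces the diffusion model."""
    scale, bias = params
    return [scale * t + bias for t in text]


def third_loss(predicted, target):
    """Mean squared difference between the two latent representations."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def train_mapper(pairs, steps=2000, lr=0.05):
    """Fit the mapper so its output matches the frozen pre-encoder output."""
    scale, bias = 0.0, 0.0
    for _ in range(steps):
        for text, speech in pairs:
            target = pre_encode_speech(speech)
            predicted = mapper((scale, bias), text)
            n = len(target)
            # Gradients of the mean squared error with respect to scale and bias.
            grad_scale = sum(2.0 * (p - q) * t
                             for p, q, t in zip(predicted, target, text)) / n
            grad_bias = sum(2.0 * (p - q) for p, q in zip(predicted, target)) / n
            scale -= lr * grad_scale
            bias -= lr * grad_bias
    return scale, bias


# Toy corresponding (text, speech) pairs; here speech = 2 * text (assumption).
pairs = [([0.0, 1.0, 2.0], [0.0, 2.0, 4.0]), ([1.0, 3.0], [2.0, 6.0])]
scale, bias = train_mapper(pairs)
```

With this toy data the pre-encoder output equals `text + 1`, so the mapper should approach `scale = 1` and `bias = 1`, at which point the loss between the two latent representations is close to zero.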

Next, another batch of training data can be obtained, which also includes Chinese speech and Chinese text (the correct English text of the Chinese speech and the correct English text of the Chinese text are known). Then, the Chinese speech can be input into the trained speech precoding model to obtain the corresponding speech latent space representation, and this speech latent space representation is input into the speech translation backbone model to be trained to obtain the corresponding predicted English text. The Chinese text can also be input into the trained diffusion model to obtain the corresponding speech latent space representation, and this speech latent space representation is input into the speech translation backbone model to be trained to obtain the corresponding predicted English text. Since the correct English text of the Chinese speech and the correct English text of the Chinese text are known, the parameters of the speech translation backbone model can be updated based on the predicted English text and the correct English text, thereby completing the training of the speech translation backbone model and obtaining the trained speech translation backbone model.

At this point, the output of the speech precoding model can be connected to the input of the speech translation backbone model, and the resulting final model can convert Chinese speech into English text.
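The two structural operations in this application example, taking the first 4 layers of a trained model as the speech precoding model and connecting its output to the backbone input, amount to slicing and function composition. The sketch below is only an illustration: the 6-layer model, the arithmetic performed by each layer, and the names `compose`, `recognition_layers`, `speech_pre_encoder`, `translation_backbone`, and `final_model` are hypothetical placeholders.

```python
from functools import reduce


def compose(layers):
    """Chain callables so that each layer's output feeds the next layer."""
    def run(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return run


# Hypothetical 6-layer "speech recognition model"; layer k simply adds k.
recognition_layers = [lambda x, k=k: x + k for k in range(1, 7)]

# The first 4 layers serve as the speech precoding model.
speech_pre_encoder = compose(recognition_layers[:4])


def translation_backbone(latent):
    """Toy stand-in for the trained speech translation backbone model."""
    return latent * 2


# Connect the pre-encoder output to the backbone input.
final_model = compose([speech_pre_encoder, translation_backbone])
```

Calling `final_model(x)` first runs the four retained layers and then the backbone, mirroring the spliced model described above.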

The above is a detailed description of the model training method provided in the embodiment of the present application. The following is an introduction to the model training device provided in the embodiment of the present application. FIG. 9 is a structural schematic diagram of the model training device provided in the embodiment of the present application. As shown in FIG. 9 , the device includes:

An acquisition module 901, configured to acquire training data associated with a speech task, where the training data includes a first text and a first speech;

A first processing module 902, configured to input the first speech into a first model to obtain a first speech feature;

A second processing module 903, configured to input the first text into a second model to obtain a second speech feature;

A third processing module 904, configured to input the first speech feature into the first model to be trained to obtain a processing result of the first speech;

A fourth processing module 905, configured to input the second speech feature into the first model to be trained to obtain a processing result of the first text;

A first training module 906, configured to train the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain a third model, where the model composed of the first model and the third model is used to complete the speech task.

In one possible implementation, the first training module 906 is used to: obtain a first loss based on the processing result of the first speech and the actual processing result of the first speech, and the first loss is used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech; obtain a second loss based on the processing result of the first text and the actual processing result of the first text, and the second loss is used to indicate the difference between the processing result of the first text and the actual processing result of the first text; based on the first loss and the second loss, update the parameters of the first model to be trained until the model training conditions are met to obtain a third model.

In one possible implementation, the second training module is used to: obtain a third loss based on the third speech feature and the fourth speech feature, where the third loss is used to indicate the difference between the third speech feature and the fourth speech feature; and, based on the third loss, update the parameters of the second model to be trained until the model training conditions are met, to obtain the second model.

It should be noted that the information interaction, execution process, etc. between the modules/units of the above-mentioned device are based on the same concept as the method embodiment of the present application, and the technical effects they bring are the same as those of the method embodiment of the present application. The specific contents can be referred to the description in the method embodiment shown above in the embodiment of the present application, and will not be repeated here.

The embodiment of the present application also relates to a training device, and FIG. 10 is a structural diagram of the training device provided by the embodiment of the present application. As shown in FIG. 10, the training device 1000 can be specifically manifested as a mobile phone, a tablet, a laptop computer, an intelligent wearable device, a server, etc., which is not limited here. Among them, the function of model training in the corresponding embodiment of FIG. 4 can be implemented on the training device 1000. Specifically, the training device 1000 includes: a receiver 1001, a transmitter 1002, a processor 1003 and a memory 1004 (wherein the number of processors 1003 in the training device 1000 can be one or more, and one processor is taken as an example in FIG. 10), wherein the processor 1003 may include an application processor 10031 and a communication processor 10032. In some embodiments of the present application, the receiver 1001, the transmitter 1002, the processor 1003 and the memory 1004 may be connected via a bus or other means.

The memory 1004 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1003. A portion of the memory 1004 may also include a non-volatile random access memory (NVRAM). The memory 1004 stores processor and operation instructions, executable modules or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.

The processor 1003 controls the operation of the training device. In a specific application, the various components of the training device are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, and a status signal bus in addition to a data bus. However, for the sake of clarity, various buses are referred to as bus systems in the figure.

The method disclosed in the above embodiments of the present application can be applied to the processor 1003, or implemented by the processor 1003. The processor 1003 can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 1003 or by instructions in the form of software. The processor 1003 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and can further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1003 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application can be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1004, and the processor 1003 reads the information in the memory 1004 and completes the steps of the above method in combination with its hardware.

The receiver 1001 can be used to receive input digital or character information and generate signal input related to the relevant settings and function control of the training device. The transmitter 1002 can be used to output digital or character information through the first interface; the transmitter 1002 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1002 can also include a display device such as a display screen.

In an embodiment of the present application, in one case, the processor 1003 is used to complete model training through the model training architecture in the embodiment corresponding to Figure 4, and to provide the resulting final model to the user device, so that the user can use the final model on the user device to complete the required speech task, that is, to execute the task processing method.

An embodiment of the present application also relates to a computer storage medium, in which a program for signal processing is stored. When the program is run on a computer, the computer executes the steps executed by the aforementioned execution device, or the computer executes the steps executed by the aforementioned training device.

The present application also relates to a computer program product, wherein the computer program product stores instructions which, when executed by a computer, cause the computer to execute the steps executed by the aforementioned execution device, or cause the computer to execute the steps executed by the aforementioned training device.

The training device or terminal device provided in the embodiment of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, wherein the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored in the storage unit so that the chip in the execution device executes the data processing method described in the above embodiment, or so that the chip in the training device executes the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc. The storage unit may also be a storage unit located outside the chip in the wireless access device end, such as a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM), etc.

Specifically, please refer to FIG. 11 , which is a schematic diagram of the structure of a chip provided in an embodiment of the present application. The chip can be a neural network processor NPU 1100. NPU 1100 is mounted on the host CPU (Host CPU) as a coprocessor, and tasks are assigned by the Host CPU. The core part of the NPU is the operation circuit 1103, which is controlled by the controller 1104 to extract matrix data from the memory and perform multiplication operations.

In some implementations, the operation circuit 1103 includes multiple processing units (Process Engine, PE) inside. In some implementations, the operation circuit 1103 is a two-dimensional systolic array. The operation circuit 1103 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1103 is a general-purpose matrix processor.

For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit takes the corresponding data of matrix B from the weight memory 1102 and caches it on each PE in the operation circuit. The operation circuit takes the matrix A data from the input memory 1101 and performs matrix operations with matrix B, and the partial results or final results of the matrix are stored in the accumulator 1108.
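The data flow in this example, with matrix B held fixed, matrix A streamed in, and partial results summed in an accumulator, can be mimicked in software. The sketch below is a plain Python illustration and does not model the actual operation circuit; the function name `matmul_with_accumulator` and the `tile` parameter, which controls how many inner-dimension terms are processed per pass, are assumptions made for illustration.

```python
def matmul_with_accumulator(a, b, tile=2):
    """Compute C = A x B by accumulating partial sums tile by tile."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # The accumulator holds partial results until every tile has been added.
    accumulator = [[0.0] * cols for _ in range(rows)]
    for start in range(0, inner, tile):
        stop = min(start + tile, inner)
        for i in range(rows):
            for j in range(cols):
                # Partial dot product over the current slice of the inner dimension.
                accumulator[i][j] += sum(a[i][k] * b[k][j]
                                         for k in range(start, stop))
    return accumulator


matrix_a = [[1, 2, 3], [4, 5, 6]]        # input matrix A (2 x 3)
matrix_b = [[7, 8], [9, 10], [11, 12]]   # weight matrix B (3 x 2)
matrix_c = matmul_with_accumulator(matrix_a, matrix_b)
```

After the first pass the accumulator holds only partial results; it holds the final matrix C once all slices of the inner dimension have been processed.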

The unified memory 1106 is used to store input data and output data. The weight data is directly transferred to the weight memory 1102 through the direct memory access controller (DMAC) 1105. The input data is also transferred to the unified memory 1106 through the DMAC.

The bus interface unit (BIU) 1013 is used for interaction among the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1109. Specifically, the instruction fetch buffer 1109 obtains instructions from the external memory through the bus interface unit 1013, and the DMAC 1105 obtains the original data of the input matrix A or the weight matrix B from the external memory through the bus interface unit 1013.

DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1106 or to transfer weight data to the weight memory 1102 or to transfer input data to the input memory 1101.

The vector calculation unit 1107 includes multiple operation processing units, and when necessary, further processes the output of the operation circuit 1103, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization, pixel-level summation, upsampling of the predicted label plane, etc.

In some implementations, the vector calculation unit 1107 can store the processed output vector to the unified memory 1106. For example, the vector calculation unit 1107 can apply a linear function or a nonlinear function to the output of the operation circuit 1103, for example, linear interpolation of the predicted label plane extracted by the convolution layer, or, as another example, a function applied to a vector of accumulated values to generate an activation value. In some implementations, the vector calculation unit 1107 generates a normalized value, a pixel-level summed value, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1103, for example, for use in a subsequent layer in a neural network.
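As a rough software analogy of this post-processing, the sketch below normalizes a vector of accumulated values and then applies a nonlinear function to produce activation values. The choice of batch-style normalization followed by ReLU, and the names `batch_normalize`, `relu`, and `vector_unit`, are assumptions for illustration; the text does not fix which functions the vector calculation unit applies.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Shift and scale the values to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


def relu(values):
    """Nonlinear function: negative entries are clamped to zero."""
    return [max(0.0, v) for v in values]


def vector_unit(accumulated):
    """Post-process the accumulated output into activation values."""
    return relu(batch_normalize(accumulated))


normalized = batch_normalize([1.0, 2.0, 3.0])
activations = vector_unit([1.0, 2.0, 3.0])
```

The resulting activation values could then be fed back as the input of a subsequent layer, as the paragraph above describes.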

An instruction fetch buffer 1109 connected to the controller 1104 is used to store instructions used by the controller 1104.

The unified memory 1106, the input memory 1101, the weight memory 1102, and the instruction fetch buffer 1109 are all on-chip memories. The external memory is private to the NPU hardware architecture.

The processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.

It should also be noted that the device embodiments described above are merely schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. In addition, in the drawings of the device embodiments provided by the present application, the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.

From the description of the above implementations, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and of course it can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function completed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software implementation is the better choice in most cases. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes a number of instructions that enable a computer device (which can be a personal computer, a training device, a network device, or the like) to execute the methods described in the embodiments of the present application.

In the above embodiments, all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof. When implemented by software, all or part of the embodiments may be implemented in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).

Claims

A model training method, characterized in that the method comprises:

Acquiring training data associated with a speech task, wherein the training data includes a first text and a first speech;

Inputting the first speech into a first model to obtain a first speech feature;

Inputting the first text into a second model to obtain a second speech feature;

Inputting the first speech feature into a first model to be trained to obtain a processing result of the first speech;

Inputting the second speech feature into the first model to be trained to obtain a processing result of the first text;

Based on the processing results of the first speech and the processing results of the first text, the first model to be trained is trained to obtain a third model, and the model composed of the first model and the third model is used to complete the speech task.
The method according to claim 1, characterized in that the step of training the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain the third model comprises:

Based on the processing result of the first speech and the actual processing result of the first speech, obtaining a first loss, wherein the first loss is used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech;

Based on the processing result of the first text and the actual processing result of the first text, obtaining a second loss, where the second loss is used to indicate the difference between the processing result of the first text and the actual processing result of the first text;

Based on the first loss and the second loss, the parameters of the first model to be trained are updated until the model training conditions are met, so as to obtain a third model.
The method according to claim 1 or 2, characterized in that the training data further includes a second text and a second speech, the second text corresponds to the second speech, and the method further includes:

Inputting the second speech into the first model to obtain a third speech feature;

Inputting the second text into a second model to be trained to obtain a fourth speech feature;

Based on the third speech feature and the fourth speech feature, the second model to be trained is trained to obtain the second model.
The method according to claim 3, characterized in that the step of training the second model to be trained based on the third speech feature and the fourth speech feature to obtain the second model comprises:

Based on the third speech feature and the fourth speech feature, obtaining a third loss, wherein the third loss is used to indicate a difference between the third speech feature and the fourth speech feature;

Based on the third loss, the parameters of the second model to be trained are updated until the model training conditions are met to obtain the second model.
The method according to any one of claims 1 to 4, characterized in that the second model is any one of the following: a diffusion generation model, a generative adversarial network model, and a sequence-to-sequence model.
The method according to any one of claims 1 to 5, characterized in that the first model is a part of a fourth model, the fourth model is a model for completing speech recognition, or the fourth model is a speech pre-training model.
The method according to any one of claims 1 to 6, characterized in that the speech task is any one of the following: speech translation, speech recognition, speech command and speech dialogue.
A model training device, characterized in that the device comprises:

An acquisition module, configured to acquire training data associated with a speech task, wherein the training data includes a first text and a first speech;

A first processing module, used for inputting the first speech into a first model to obtain a first speech feature;

A second processing module, used for inputting the first text into a second model to obtain a second speech feature;

A third processing module, used for inputting the first speech feature into a first model to be trained to obtain a processing result of the first speech;

a fourth processing module, configured to input the second speech feature into the first model to be trained to obtain a processing result of the first text;

The first training module is used to train the first model to be trained based on the processing results of the first speech and the processing results of the first text to obtain a third model. The model composed of the first model and the third model is used to complete the speech task.
The device according to claim 8, characterized in that the first training module is used to:

Based on the processing result of the first speech and the actual processing result of the first speech, obtaining a first loss, wherein the first loss is used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech;

Based on the processing result of the first text and the actual processing result of the first text, obtaining a second loss, where the second loss is used to indicate the difference between the processing result of the first text and the actual processing result of the first text;

Based on the first loss and the second loss, the parameters of the first model to be trained are updated until the model training conditions are met, so as to obtain a third model.
The device according to claim 8 or 9, characterized in that the training data further includes a second text and a second speech, the second text corresponds to the second speech, and the device further includes:

A fifth processing module, configured to input the second speech into the first model to obtain a third speech feature;

A sixth processing module, configured to input the second text into a second model to be trained to obtain a fourth speech feature;

The second training module is used to train the second model to be trained based on the third speech feature and the fourth speech feature to obtain the second model.
The device according to claim 10, characterized in that the second training module is used to:

Based on the third speech feature and the fourth speech feature, obtaining a third loss, wherein the third loss is used to indicate a difference between the third speech feature and the fourth speech feature;

Based on the third loss, the parameters of the second model to be trained are updated until the model training conditions are met to obtain the second model.
The device according to any one of claims 8 to 11, characterized in that the second model is any one of the following: a diffusion generation model, a generative adversarial network model, and a sequence-to-sequence model.
The device according to any one of claims 8 to 12, characterized in that the first model is a part of a fourth model, the fourth model is a model for completing speech recognition, or the fourth model is a speech pre-training model.
The device according to any one of claims 8 to 13, characterized in that the speech task is any one of the following: speech translation, speech recognition, speech command and speech dialogue.
A model training device, characterized in that the device includes a memory and a processor; the memory stores code, and the processor is configured to execute the code. When the code is executed, the model training device executes the method according to any one of claims 1 to 7.
A computer storage medium, characterized in that the computer storage medium stores one or more instructions, and when the instructions are executed by one or more computers, the one or more computers implement any one of the methods described in claims 1 to 7.
A computer program product, characterized in that the computer program product stores instructions, and when the instructions are executed by a computer, the computer implements the method according to any one of claims 1 to 7.