WO2024160186A1 - Model training method and related device - Google Patents
- Publication number
- WO2024160186A1 (application PCT/CN2024/074585)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- speech
- text
- trained
- voice
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
Definitions
- the embodiments of the present application relate to the field of artificial intelligence (AI) technology, and in particular to a model training method and related equipment.
- AI: artificial intelligence
- E2E voice models can be used to complete the user's voice tasks. That is, when the user inputs voice into the voice model, the voice model processes the user's voice to obtain a processing result, which serves as the result of the user's voice task.
- a speech synthesis system is usually used to convert labeled text into labeled speech. Since text is easier to collect, a sufficient amount of speech can be obtained through the speech synthesis system as training data.
- the embodiments of the present application provide a model training method and related equipment.
- the trained speech model can have excellent performance, so it can accurately complete the user's speech task, thereby improving the user experience.
- a first aspect of an embodiment of the present application provides a model training method, the method comprising:
- training data related to the speech task can be obtained.
- the training data usually includes multiple first voices and multiple first texts, and the multiple first voices and multiple first texts are independent of each other in content. It is worth noting that the actual processing results (labels) of the multiple first voices and the actual processing results of the multiple first texts are known.
- the first voice can be input into the first model (also referred to as the trained voice precoding model) so that the first model processes the first voice to obtain the first voice feature of the first voice, and the first voice feature of the first voice is input into the first model to be trained (also referred to as the task backbone model to be trained) so that the first model to be trained processes the first voice feature of the first voice to obtain the processing result of the first voice.
- the first model: also referred to as the trained voice precoding model
- the first model to be trained: also referred to as the task backbone model to be trained
- the first text can also be input into the second model (also referred to as the trained text mapping model) so that the second model processes the first text to obtain the second voice feature of the first text, and the second voice feature of the first text is input into the first model to be trained so that the first model to be trained processes the second voice feature of the first text to obtain the processing result of the first text.
- the second model: also referred to as the trained text mapping model
- the first model to be trained can be trained based on the processing result of the first speech and the processing result of the first text, so as to obtain a third model (also called a trained task backbone model, which is a trained neural network model).
- the third model: also called a trained task backbone model, which is a trained neural network model
- the training data associated with the voice task can be obtained first, and the training data includes a first text and a first voice.
- the first voice can be processed by the first model to obtain a first voice feature, and then the first voice feature can be processed by the first model to be trained to obtain a processing result of the first voice.
- the first text can be processed by the second model to obtain a second voice feature, and then the second voice feature can be processed by the first model to be trained to obtain a processing result of the first text.
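- finally, based on the processing result of the first voice and the processing result of the first text, the first model to be trained is trained to obtain a third model.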
- the training data of the first model to be trained is speech features (i.e., the first speech features and the second speech features), and the speech features can be derived from speech (i.e., the first speech) or text (i.e., the first text). Since converting text into speech features is easier to implement than converting text into speech, the text can often be converted into correct speech features.
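- the speech model trained based on the correct speech features (i.e., the final model constructed from the first model and the third model) can have excellent performance, so it can accurately complete the user's speech task, thereby improving the user experience.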
- the first model to be trained is trained based on the processing result of the first speech and the processing result of the first text, and the third model is obtained, including: based on the processing result of the first speech and the actual processing result of the first speech, a first loss is obtained, and the first loss is used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech; based on the processing result of the first text and the actual processing result of the first text, a second loss is obtained, and the second loss is used to indicate the difference between the processing result of the first text and the actual processing result of the first text; based on the first loss and the second loss, the parameters of the first model to be trained are updated until the model training conditions are met to obtain the third model.
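- the loss construction described above can be written compactly as follows. This is only a sketch under assumed notation, none of which appears in the application: s and t denote a first speech and a first text, y_s and y_t their actual processing results (labels), f_1 and f_2 the trained first and second models, g_θ the first model to be trained with parameters θ, ℓ a task-dependent loss function, and η a learning rate. A gradient-descent update is assumed here; the application only states that the parameters are updated based on the first loss and the second loss.

```latex
\begin{aligned}
\mathcal{L}_1 &= \ell\bigl(g_\theta(f_1(s)),\; y_s\bigr), \\
\mathcal{L}_2 &= \ell\bigl(g_\theta(f_2(t)),\; y_t\bigr), \\
\theta &\leftarrow \theta - \eta\, \nabla_\theta \mathcal{L}_k, \qquad k \in \{1, 2\}.
\end{aligned}
```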
- the training process of the first model to be trained consists of multiple rounds of iteration, and each round uses one first speech or one first text for model training. For ease of explanation, two adjacent rounds are introduced below: one is called the current round of iteration, and the other is called the next round of iteration.
- a first speech is used in the current round of iteration
- a first text is used in the next round of iteration.
- the first speech can be input into the first model so that the first model processes the first speech, thereby obtaining the first speech feature of the first speech.
- the first speech feature of the first speech can be input into the first model to be trained so that the first model to be trained processes the first speech feature of the first speech, thereby obtaining the processing result of the first speech.
- the first loss of the first speech can be calculated from the processing result of the first speech and the actual processing result of the first speech, and the first loss is used to indicate the difference between the two.
- the parameters of the first model to be trained can be updated based on the first loss of the first speech, thereby obtaining the first model to be trained after the updated parameters, and thus the current round of iteration is completed.
- the first text can be input into the second model so that the second model processes the first text, thereby obtaining the second speech feature of the first text.
- the second speech feature of the first text can be input into the first model to be trained, so that the first model to be trained processes the second speech feature of the first text, thereby obtaining the processing result of the first text.
- the second loss of the first text can be calculated from the processing result of the first text and the actual processing result of the first text, and the second loss is used to indicate the difference between the two.
- the parameters of the first model to be trained, which were already updated in the current round, can be updated again based on the second loss of the first text, thereby completing the next round of iteration.
- subsequent rounds of iteration can then be entered, that is, the first model to be trained, with its parameters updated again, can continue to be trained using the next first speech or the next first text until the model training conditions are met, thereby obtaining the third model.
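- the alternating rounds described above can be sketched in Python as follows. This is a minimal toy illustration, not the application's implementation: the summary-statistics feature extractors, the linear backbone, the squared-error loss, the strict speech/text alternation and all names (`first_model`, `second_model`, `Backbone`, `train_backbone`) are assumptions made for the example. The two feature extractors stay frozen; only the backbone is updated.

```python
FEATURE_DIM = 4

def _summary(values):
    """Toy feature: mean, mean square, max and min of a numeric sequence."""
    n = len(values)
    return [sum(values) / n, sum(v * v for v in values) / n, max(values), min(values)]

def first_model(speech):
    """Frozen stand-in for the trained first model: waveform -> first speech feature."""
    return _summary(speech)

def second_model(text):
    """Frozen stand-in for the trained second model: text -> second speech feature.
    Assumes ASCII text so that every value stays below 1."""
    return _summary([ord(c) / 128.0 for c in text])

class Backbone:
    """Toy 'first model to be trained': a linear regressor over speech features."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def forward(self, feature):
        return sum(wi * fi for wi, fi in zip(self.w, feature)) + self.b

    def update(self, feature, label, lr):
        """One gradient step on the squared error; returns the loss before the step."""
        pred = self.forward(feature)
        loss = (pred - label) ** 2
        grad = 2.0 * (pred - label)
        self.w = [wi - lr * grad * fi for wi, fi in zip(self.w, feature)]
        self.b -= lr * grad
        return loss

def train_backbone(speech_data, text_data, rounds=200, lr=0.05):
    """Alternate rounds: even rounds use a labeled speech, odd rounds a labeled text."""
    backbone = Backbone(FEATURE_DIM)
    history = []
    for step in range(rounds):
        if step % 2 == 0:
            speech, label = speech_data[(step // 2) % len(speech_data)]
            feature = first_model(speech)   # first speech feature
        else:
            text, label = text_data[(step // 2) % len(text_data)]
            feature = second_model(text)    # second speech feature
        history.append(backbone.update(feature, label, lr))
    return backbone, history
```

After training, the final model would be the composition of the frozen first model and the trained backbone, for example `lambda speech: backbone.forward(first_model(speech))`; the second model is needed only during training.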
- the training data also includes a second text and a second voice, and the second text corresponds to the second voice.
- the method also includes: inputting the second voice into the first model to obtain a third voice feature; inputting the second text into the second model to be trained to obtain a fourth voice feature; based on the third voice feature and the fourth voice feature, the second model to be trained is trained to obtain a second model.
- the training data also includes a plurality of voice-text matching pairs, a voice-text matching pair includes a second voice and a second text corresponding to the second voice, that is, the second voice and the second text corresponding to the second voice are matched in content.
- the second voice in the voice-text matching pair can be input into the first model so that the first model processes the second voice, thereby obtaining the third voice feature of the second voice.
- the second text in the voice-text matching pair can also be input into the second model to be trained (also referred to as a text mapping model to be trained), so that the second model to be trained processes the second text, thereby obtaining the fourth voice feature of the second text.
- the second model to be trained may be trained based on the third speech feature of the second speech and the fourth speech feature of the second text, thereby obtaining a second model.
- the second model to be trained is trained based on the third voice feature and the fourth voice feature to obtain the second model, including: obtaining a third loss based on the third voice feature and the fourth voice feature, the third loss being used to indicate the difference between the third voice feature and the fourth voice feature; based on the third loss, updating the parameters of the second model to be trained until the model training conditions are met to obtain the second model.
- the training process of the second model to be trained consists of multiple rounds of iteration, and each round uses one speech-text matching pair for model training. For ease of explanation, one round is introduced below; it is referred to as the current round of iteration and uses one speech-text matching pair.
- the speech-text matching pair can be used as the training object.
- the second speech is input into the first model so that the first model processes the second speech, thereby obtaining a third speech feature of the second speech.
- the second text in the speech-text matching pair can also be input into the second model to be trained so that the second model to be trained processes the second text, thereby obtaining a fourth speech feature of the second text.
- the third loss of the speech-text matching pair can be calculated from the third voice feature of the second speech and the fourth voice feature of the second text, and the third loss is used to indicate the difference between the third voice feature corresponding to the second speech and the fourth voice feature corresponding to the second text.
- the parameters of the second model to be trained can be updated based on the third loss of the speech-text matching pair, and the current round of iteration is completed. Then, the next round of iteration can be entered, that is, the second model to be trained with the updated parameters is continued to be trained using the next speech-text matching pair until the model training conditions are met, thereby obtaining the second model.
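- one round of this mapping learning can be sketched in Python as follows. The sketch is illustrative only and rests on several assumptions: the second model to be trained is replaced by a simple linear map rather than a diffusion, generative adversarial or sequence-to-sequence model, the third loss is taken to be a mean squared error, and the helper names (`speech_feature`, `text_encoding`, `TextMapper`, `train_step`) are hypothetical. The first model is frozen, so only the text mapper changes.

```python
def speech_feature(speech):
    """Frozen stand-in for the first model: second speech -> third speech feature."""
    n = len(speech)
    return [sum(speech) / n, sum(x * x for x in speech) / n]

def text_encoding(text):
    """Fixed toy encoding of the second text (hypothetical): scaled mean code and length."""
    return [sum(ord(c) for c in text) / (128.0 * len(text)), len(text) / 32.0]

class TextMapper:
    """Toy 'second model to be trained': linear map from text encoding to feature space."""

    def __init__(self, in_dim, out_dim):
        self.W = [[0.0] * in_dim for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + bias
                for row, bias in zip(self.W, self.b)]

def third_loss(third_feature, fourth_feature):
    """Assumed form of the third loss: mean squared difference of the two features."""
    diffs = [(a - b) ** 2 for a, b in zip(third_feature, fourth_feature)]
    return sum(diffs) / len(diffs)

def train_step(mapper, speech, text, lr=0.1):
    """One round on a speech-text matching pair; returns the loss before the update."""
    target = speech_feature(speech)      # third speech feature (fixed target)
    x = text_encoding(text)
    pred = mapper.forward(x)             # fourth speech feature
    loss = third_loss(target, pred)
    for i, (p, t) in enumerate(zip(pred, target)):
        grad = 2.0 * (p - t) / len(target)
        mapper.W[i] = [w - lr * grad * xi for w, xi in zip(mapper.W[i], x)]
        mapper.b[i] -= lr * grad
    return loss
```

Calling `train_step` repeatedly over the speech-text matching pairs until a stopping condition is met yields the trained text mapping model, which can then turn text-only training data into speech features.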
- the second model is any one of the following: a diffusion generation model, a generative adversarial network model, a sequence-to-sequence model, and the like.
- a portion of the fourth model may be cut out to serve as the first model, for example, certain layers in the fourth model, etc.
- the fourth model is usually a model for completing speech recognition (also referred to as a trained speech recognition model), or the fourth model is a speech pre-training model, etc.
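- cutting a portion out of the fourth model can be sketched as below. The application only says that certain layers are used; this example assumes, purely for illustration, that the fourth model is an ordered list of layer functions and that the leading layers are kept. The function name `build_first_model` and the toy layers are hypothetical.

```python
def build_first_model(fourth_model_layers, num_layers):
    """Keep the first `num_layers` layers of a trained model as a feature extractor."""
    kept = list(fourth_model_layers[:num_layers])

    def first_model(x):
        # Apply the retained layers in order; the discarded layers are never called.
        for layer in kept:
            x = layer(x)
        return x

    return first_model

# Toy 'fourth model': two feature layers followed by a task-specific output layer.
fourth_model = [
    lambda x: [2 * v for v in x],   # feature layer 1
    lambda x: [v + 1 for v in x],   # feature layer 2
    lambda x: max(x),               # output head, dropped below
]
first_model = build_first_model(fourth_model, 2)
```

Here `first_model([1, 2])` returns the intermediate representation produced by the two retained layers instead of the output of the original head.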
- the speech task is any one of the following: speech translation, speech recognition, speech command, and speech dialogue, etc.
- the second aspect of an embodiment of the present application provides a model training device, which includes: an acquisition module for acquiring training data associated with a speech task, the training data including a first text and a first speech; a first processing module for inputting the first speech into a first model to obtain a first speech feature; a second processing module for inputting the first text into a second model to obtain a second speech feature; a third processing module for inputting the first speech feature into a first model to be trained to obtain a processing result of the first speech; a fourth processing module for inputting the second speech feature into the first model to be trained to obtain a processing result of the first text; a first training module for training the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain a third model, and the model composed of the first model and the third model is used to complete the speech task.
- the training data associated with the voice task can be obtained first, and the training data includes a first text and a first voice. Then, the first voice can be processed by the first model to obtain a first voice feature, and then the first voice feature can be processed by the first model to be trained to obtain the processing result of the first voice. In addition, the first text can be processed by the second model to obtain a second voice feature, and then the second voice feature can be processed by the first model to be trained to obtain the processing result of the first text. Finally, based on the processing result of the first voice and the processing result of the first text, the first model to be trained is trained to obtain a third model.
- the training data of the first model to be trained is speech features (i.e., the first speech features and the second speech features), and the speech features can be derived from both speech (i.e., the first speech) and text (i.e., the first text). Since converting text into speech features is easier to implement than converting text into speech, it is often possible to convert text into correct speech features.
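- the speech model trained based on the correct speech features (i.e., the final model constructed from the first model and the third model) can have excellent performance, so it can accurately complete the user's speech task, thereby improving the user experience.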
- the first training module is used to: obtain a first loss based on a processing result of a first speech and an actual processing result of the first speech, the first loss being used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech; obtain a second loss based on a processing result of a first text and an actual processing result of the first text, the second loss being used to indicate the difference between the processing result of the first text and the actual processing result of the first text; and update the parameters of the first model to be trained based on the first loss and the second loss until the model training conditions are met, thereby obtaining a third model.
- the training data also includes a second text and a second voice
- the second text corresponds to the second voice
- the device also includes: a fifth processing module, used to input the second voice into the first model to obtain a third voice feature; a sixth processing module, used to input the second text into the second model to be trained to obtain a fourth voice feature; and a second training module, used to train the second model to be trained based on the third voice feature and the fourth voice feature to obtain a second model.
- the second training module is used to: obtain a third loss based on a third speech feature and a fourth speech feature, where the third loss is used to indicate the difference between the third speech feature and the fourth speech feature; and update the parameters of the second model to be trained based on the third loss until the model training conditions are met to obtain the second model.
- the second model is any one of the following: a diffusion generation model, a generative adversarial network model, and a sequence-to-sequence model.
- the first model is a part of a fourth model
- the fourth model is a model for completing speech recognition
- the fourth model is a speech pre-training model
- the speech task is any one of the following: speech translation, speech recognition, speech command, and speech dialogue.
- a third aspect of an embodiment of the present application provides a model training device, which includes a memory and a processor; the memory stores code, and the processor is configured to execute the code.
- the model training device is used to execute the method described in the first aspect or any possible implementation method of the first aspect.
- a fourth aspect of an embodiment of the present application provides a circuit system, which includes a processing circuit, and the processing circuit is configured to execute the method described in the first aspect or any possible implementation manner of the first aspect.
- a fifth aspect of an embodiment of the present application provides a chip system, which includes a processor for calling a computer program or computer instructions stored in a memory so that the processor executes the method described in the first aspect or any possible implementation method of the first aspect.
- the processor is coupled to the memory through an interface.
- the chip system also includes a memory, in which a computer program or computer instructions are stored.
- a sixth aspect of an embodiment of the present application provides a computer storage medium, which stores a computer program.
- when the program is executed by a computer, the computer implements the method described in the first aspect or any possible implementation method of the first aspect.
- a seventh aspect of the embodiments of the present application provides a computer program product, which stores instructions. When the instructions are executed by a computer, the computer implements the method described in the first aspect or any possible implementation method of the first aspect.
- training data associated with the voice task may be obtained first, and the training data includes a first text and a first voice.
- the first voice may be processed by the first model to obtain a first voice feature, and then the first voice feature may be processed by the first model to be trained to obtain a processing result of the first voice.
- the first text may be processed by the second model to obtain a second voice feature, and then the second voice feature may be processed by the first model to be trained to obtain a processing result of the first text.
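- finally, based on the processing result of the first voice and the processing result of the first text, the first model to be trained is trained to obtain a third model.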
- the training data of the first model to be trained is speech features (i.e., the first speech features and the second speech features), and the speech features can be derived from both speech (i.e., the first speech) and text (i.e., the first text). Since converting text into speech features is easier to implement than converting text into speech, it is often possible to convert text into correct speech features.
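- the speech model trained based on the correct speech features (i.e., the final model constructed from the first model and the third model) can have excellent performance, so it can accurately complete the user's speech task, thereby improving the user experience.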
- FIG. 1 is a schematic diagram of a structure of an artificial intelligence main framework.
- FIG. 2a is a schematic diagram of a structure of a model training system provided in an embodiment of the present application.
- FIG. 2b is another schematic diagram of the structure of the model training system provided in an embodiment of the present application.
- FIG. 2c is a schematic diagram of a device related to model training provided in an embodiment of the present application.
- FIG. 3 is a schematic diagram of the architecture of the system 100 provided in an embodiment of the present application.
- FIG. 4 is a flow chart of a model training method provided in an embodiment of the present application.
- FIG. 5 is a schematic diagram of a mapping learning phase provided in an embodiment of the present application.
- FIG. 6 is a schematic diagram of a text-speech multi-modal learning stage provided in an embodiment of the present application.
- FIG. 7 is a schematic diagram of an application example of the mapping learning phase provided in an embodiment of the present application.
- FIG. 8 is a schematic diagram of an application example of the text-speech multi-modal learning stage provided in an embodiment of the present application.
- FIG. 9 is a schematic diagram of a structure of a model training device provided in an embodiment of the present application.
- FIG. 10 is a schematic diagram of a structure of a training device provided in an embodiment of the present application.
- FIG. 11 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.
- the embodiments of the present application provide a model training method and related equipment.
- the trained speech model can have excellent performance, so it can accurately complete the user's speech task, thereby improving the user experience.
- E2E voice models can be used to complete user voice tasks, such as voice translation, voice recognition, voice commands, and voice conversations. That is, when the user inputs voice into the voice model, the voice model can process the user's voice to obtain a processing result, such as the translated text obtained by recognizing and translating the user's voice, the text obtained by recognizing the user's voice, the command determined by recognizing the user's voice, or the answer found by recognizing the user's voice.
- E2E: end-to-end
- TTS: text-to-speech
- AI technology is a technical discipline that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence. AI technology obtains the best results by sensing the environment, acquiring knowledge and using knowledge.
- artificial intelligence technology is a branch of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
- Using artificial intelligence for data processing is a common application of artificial intelligence.
- Figure 1 is a structural diagram of the main framework of artificial intelligence.
- the following is an explanation of the above artificial intelligence main framework from the two dimensions of "intelligent information chain" (horizontal axis) and "IT value chain" (vertical axis).
- the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has undergone a condensation process of "data-information-knowledge-wisdom".
- the "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (provision and processing technology implementation) to the industrial ecology of the system.
- the infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by the basic platform. It communicates with the outside world through sensors; computing power is provided by smart chips (CPU, NPU, GPU, ASIC, FPGA and other hardware acceleration chips); the basic platform includes distributed computing frameworks and networks and other related platform guarantees and support, which can include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
- the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
- the data involves graphics, images, voice, text, and IoT data of traditional devices, including business data of existing systems and perception data such as force, displacement, liquid level, temperature, and humidity.
- Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
- machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.
- Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formalized information to perform machine thinking and solve problems based on reasoning control strategies. Typical functions are search and matching.
- Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
- some general capabilities can be further formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
- Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, which productizes intelligent information decision-making and realizes practical applications. Its application areas mainly include: smart terminals, smart transportation, smart medical care, autonomous driving, smart cities, etc.
- FIG2a is a schematic diagram of a structure of a model training system provided in an embodiment of the present application, wherein the model training system includes a user device and a data processing device.
- the user device includes an intelligent terminal such as a mobile phone, a personal computer or an information processing center.
- the user device is the initiator of the model training; as the initiator of the model training request, the user usually initiates the request through the user device.
- the above-mentioned data processing device can be a device or server with data processing function such as a cloud server, a network server, an application server and a management server.
- the data processing device receives requests from the intelligent terminal through an interactive interface, and then performs machine learning, deep learning, search, reasoning, decision-making and other processing by means of a memory for storing data and a processor for data processing.
- the memory in the data processing device can be a general term, including local storage and a database for storing historical data.
- the database can be on the data processing device or on other network servers.
- the user device can receive the user's instructions and determine the voice task that the user needs to complete, and then initiate a request to the data processing device, so that the data processing device trains a model for the voice task obtained by the user device, using the training data associated with the voice task, thereby obtaining a model for completing the voice task.
- the user device can obtain the voice task that the user needs to complete based on the user's instructions.
- the user device can initiate a model training request to the data processing device, so that the data processing device can obtain the training data associated with the voice task based on the model training request, use the training data to complete the training of the model to be trained, thereby obtaining a model for completing the voice task, and return the model to the user device, so that the user device uses the model to complete the voice task that the user needs to complete.
- the data processing device can execute the model training method of an embodiment of the present application.
- Figure 2b is another structural diagram of the model training system provided in an embodiment of the present application.
- the user device directly serves as a data processing device.
- the user device can directly obtain instructions from the user and directly process them by the hardware of the user device itself.
- the specific process is similar to that of Figure 2a; refer to the above description, which is not repeated here.
- the user device can receive the user's instruction, and the user device can obtain the voice task that the user needs to complete and the training data associated with the voice task based on the user's instruction. Then, the user device can use the training data to complete the training of the model to be trained, thereby obtaining a model for completing the voice task. In this way, the user device can use the model to complete the voice task that the user needs to complete in the subsequent process.
- the user device itself can execute the model training method of the embodiment of the present application.
- Figure 2c is a schematic diagram of the relevant equipment for model training provided in an embodiment of the present application.
- the user device in the above Figures 2a and 2b can specifically be the local device 301 or the local device 302 in Figure 2c
- the data processing device in Figure 2a can specifically be the training device 210 in Figure 2c
- the data storage system 250 can store the data to be processed of the training device 210
- the data storage system 250 can be integrated on the training device 210, and can also be set on the cloud or other network servers.
- the processors in Figures 2a and 2b can perform data training/machine learning/deep learning through a neural network model or other models (for example, a model based on a support vector machine, etc.), and use the data to ultimately train or learn a model that can be used to complete speech tasks.
- FIG3 is a schematic diagram of the system 100 architecture provided in an embodiment of the present application.
- the training device 120 is configured with an input/output (I/O) interface 112 for information interaction with an external device.
- a user can input instructions to the I/O interface 112 through a client device 140.
- the instructions in the embodiment of the present application may include: various tasks to be scheduled, callable resources, other parameters, etc.
- the training device 120 may determine the voice task that the user needs to complete based on the instruction input by the user.
- the training device 120 can train a corresponding model/rule based on the training data associated with the voice task for the voice task that the user needs to complete, and the corresponding model/rule can be used to achieve the voice task that the user needs to complete, thereby providing the user with the required voice task result.
- the training data can be obtained in a variety of ways: for example, during the process of the computing module 111 performing model training and other related processing, the training device 120 can call the data, code, etc. in the data storage system 150 for the corresponding processing, and can also store the model obtained by the corresponding training in the data storage system 150. For another example, during the process of the computing module 111 performing model training and other related processing, the training device 120 can obtain training data from the database 130, and these training data are usually from the training samples collected by the data acquisition device 160.
- the I/O interface 112 returns the trained model to the client device 140, thereby providing it to the user so that the user can complete the speech task that he needs to complete.
- the user can manually give instructions, which can be operated through the interface provided by the I/O interface 112.
- the client device 140 can automatically send instructions to the I/O interface 112. If the client device 140 is required to automatically send instructions and needs to obtain the user's authorization, the user can set the corresponding permissions in the client device 140.
- the user can view the results output by the training device 120 on the client device 140, and the specific presentation form can be a specific method such as display, sound, action, etc.
- the client device 140 can also serve as a data collection terminal, collect various data as new sample data under the user's instruction, and store them in the database 130.
- Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application.
- the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
- the data storage system 150 is an external memory relative to the training device 120. In other cases, the data storage system 150 can also be placed in the training device 120.
- the embodiment of the present application also provides a chip, which includes a neural network processor NPU.
- the chip can be set in the training device 120 shown in FIG3 to complete the training work of the training device 120 and output the target model/rule.
- Neural network processor NPU is mounted on the main central processing unit (CPU) (host CPU) as a coprocessor, and the main CPU assigns tasks.
- the core part of NPU is the operation circuit, and the controller controls the operation circuit to extract data from the memory (weight memory or input memory) and perform operations.
- the arithmetic circuit includes multiple processing units (process engines, PEs) internally.
- the arithmetic circuit is a two-dimensional systolic array.
- the arithmetic circuit can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
- the arithmetic circuit is a general-purpose matrix processor.
- the operation circuit takes the corresponding data of matrix B from the weight memory and caches it on each PE in the operation circuit.
- the operation circuit takes the matrix A data from the input memory and performs matrix operations with matrix B.
- the partial results or final results of the matrix are stored in the accumulator.
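- as a software analogy only, the data flow described above (matrix B cached from the weight memory, matrix A read from the input memory, partial results summed in an accumulator) can be sketched as follows. The function name `matmul_accumulate` is hypothetical, and the sketch makes no claim about how the actual circuit is organized.

```python
def matmul_accumulate(a, b):
    """Multiply a (m x k) by b (k x n), summing partial products step by step."""
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]   # plays the role of the accumulator
    for step in range(k):                 # one partial product per step
        for i in range(m):
            for j in range(n):
                acc[i][j] += a[i][step] * b[step][j]
    return acc

A = [[1.0, 2.0], [3.0, 4.0]]   # "matrix A" (input data), illustrative values
B = [[5.0, 6.0], [7.0, 8.0]]   # "matrix B" (weights), illustrative values
C = matmul_accumulate(A, B)
```

- after the last step, `acc` holds the final result; before that, it holds the partial results mentioned above.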
- the vector calculation unit can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
- the vector calculation unit can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc.
- the vector computation unit can store the processed output vector to a unified buffer.
- the vector computation unit can apply a nonlinear function to the output of the computation circuit, such as a vector of accumulated values, to generate an activation value.
- the vector computation unit generates a normalized value, a merged value, or both.
- the processed output vector can be used as an activation input to the computation circuit, such as for use in a subsequent layer in a neural network.
- the unified memory is used to store input data and output data.
- through the direct memory access controller (DMAC), input data in the external memory is transferred to the input memory and/or the unified memory, the weight data in the external memory is stored in the weight memory, and the data in the unified memory is stored in the external memory.
- the bus interface unit (BIU) is used to enable interaction between the main CPU, DMAC and instruction fetch memory through the bus.
- an instruction fetch buffer is connected to the controller and is used to store the instructions used by the controller.
- the controller is used to call the instructions cached in the memory to control the working process of the computing accelerator.
- unified memory, input memory, weight memory, and instruction fetch memory are all on-chip memories.
- the external memory is a memory outside the NPU, and may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or other readable and writable memory.
- a neural network may be composed of neural units, and a neural unit may refer to an operation unit that takes xs and an intercept of 1 as input, and the output of the operation unit may be: $h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
- n is a natural number greater than 1
- Ws is the weight of xs
- b is the bias of the neural unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into the output signal.
- the output signal of the activation function can be used as the input of the next convolutional layer.
- the activation function can be a sigmoid function.
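- a minimal sketch of a single neural unit, assuming the usual weighted-sum form (f applied to the sum of Ws·xs over s = 1..n, plus b) and a sigmoid activation. The function names and the input values are illustrative only.

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, ws, b, f=sigmoid):
    """Output f(sum_s W_s * x_s + b) of a single neural unit."""
    return f(sum(w * x for w, x in zip(ws, xs)) + b)

# Illustrative inputs: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5
y = neural_unit([1.0, 2.0], [0.5, -0.25], 0.0)
```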
- a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected to the local receptive field of the previous layer to extract the characteristics of the local receptive field.
- the local receptive field can be an area composed of several neural units.
- space is used here because the classified object is not a single thing, but a class of things, and space refers to the collection of all individuals of this class of things.
- W is a weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
- the vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how to transform the space.
- the purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by many layers of vectors W). Therefore, the training process of a neural network is essentially about learning how to control spatial transformations, or more specifically, learning the weight matrix.
- Neural networks can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the forward transmission of the input signal to the output will generate error loss, and the error loss information is back-propagated to update the parameters in the initial neural network model, so that the error loss converges.
- the back propagation algorithm is a backward pass driven by the error loss, which aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
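- the idea of propagating the error loss backwards to update parameters can be illustrated on a single sigmoid unit. This is a toy sketch under assumed choices (squared-error loss, plain gradient descent, a learning rate of 0.5) that are not taken from the present application.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, target, lr=0.5):
    """One forward pass and one backpropagated update for y = sigmoid(w*x + b)."""
    y = sigmoid(w * x + b)               # forward transmission of the input
    loss = 0.5 * (y - target) ** 2       # error loss at the output
    dz = (y - target) * y * (1.0 - y)    # chain rule: d(loss)/d(w*x + b)
    return w - lr * dz * x, b - lr * dz, loss

w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.0, target=0.9)
    losses.append(loss)
```

- in this toy setting the recorded loss shrinks over the iterations, which is the convergence behaviour described above.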
- the method provided in the present application is described below from the training side of the neural network and the application side of the neural network.
- the model training method provided in the embodiment of the present application involves the processing of data sequences, and can be specifically applied to methods such as data training, machine learning, and deep learning, to perform symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on training data (for example, the training data associated with the speech task in the embodiment of the present application, including the first text, the first speech, the second text, and the second speech, etc.).
- a trained neural network for example, a model composed of the first model and the third model in the embodiment of the present application
- the task processing method provided in the embodiment of the present application can use the above-mentioned trained neural network to input the user's input data (for example, voice, etc.) into the trained neural network to obtain output data, thereby completing the task that the user needs to complete.
- the model training method and task processing method provided in the embodiment of the present application are inventions based on the same concept, and can also be understood as two parts in a system, or two stages of an overall process: such as the model training stage and the model application stage.
- FIG4 is a flow chart of a model training method provided in an embodiment of the present application. As shown in FIG4 , the method includes:
- Acquire training data associated with a speech task, where the training data includes a first speech, a first text, a second speech, and a second text.
- training data related to the voice task can be obtained.
- the training data usually includes data from two learning stages.
- the data from the text-speech multimodal learning stage is used to train the first model to be trained (also referred to as the task trunk model to be trained, which is a neural network model that needs to be trained), and the data from the mapping learning stage is used to train the second model to be trained (also referred to as the text mapping model to be trained, which is a neural network model that needs to be trained).
- the data from the text-speech multimodal learning stage may include multiple first voices and multiple first texts, and the multiple first voices and multiple first texts are independent of each other in content. It is worth noting that the labels of multiple first voices (i.e., the real processing results of multiple first voices) and the labels of multiple first texts (the real processing results of multiple first texts) are known.
- the data from the mapping learning stage may include multiple voice-text matching pairs, and a voice-text matching pair includes a second voice and a second text corresponding to the second voice, that is, the second voice and the second text corresponding to the second voice are matched in content.
- since the model training process in this embodiment includes a mapping learning stage and a text-speech multimodal learning stage, the mapping learning stage can be performed first after the training data is obtained.
- training data related to the user's speech task can be obtained.
- the second speech in the speech-text matching pair can be input into the first model (also referred to as a trained speech precoding model, which is a trained neural network model), so that the first model processes the second speech (for example, feature extraction, etc.), thereby obtaining the third speech feature of the second speech.
- the second text in the speech-text matching pair can also be input into the second model to be trained, so that the second model to be trained processes the second text (for example, feature extraction, etc.), thereby obtaining the fourth speech feature of the second text.
- the second model to be trained can be trained based on the third speech feature of the second speech and the fourth speech feature of the second text to obtain a second model (also referred to as a trained text mapping model, which is a trained neural network model).
- the second model to be trained can be trained in the following manner to obtain the second model:
- the training process of the second model to be trained consists of multiple rounds of iteration, and each round uses one speech-text matching pair for model training. For convenience of explanation, one of the rounds is described below and is called the current iteration.
- a speech-text matching pair is used in the current iteration.
- the second speech in the speech-text matching pair can be input into the first model so that the first model processes the second speech to obtain the third speech feature of the second speech.
- the second text in the speech-text matching pair can also be input into the second model to be trained so that the second model to be trained processes the second text to obtain a fourth speech feature of the second text.
- a preset first loss function can be used to calculate the third speech feature of the second speech and the fourth speech feature of the second text, thereby obtaining a third loss of the speech-text matching pair, and the third loss of the speech-text matching pair is used to indicate the difference between the third speech feature corresponding to the second speech and the fourth speech feature corresponding to the second text.
- the parameters of the second model to be trained can be updated based on the third loss of the speech-text matching pair, and the current round of iteration is completed. Then, the next round of iteration can be entered, that is, the second model to be trained with the updated parameters can be continuously trained using the next speech-text matching pair until the model training conditions are met (for example, the loss reaches convergence, etc.), thereby obtaining the second model.
- Figure 5 is a schematic diagram of the mapping learning stage provided in an embodiment of the present application
- Si in (Si, Ti) can be input into the trained speech precoding model to obtain the speech latent space representation (i.e., speech feature) Hi of Si
- Ti in (Si, Ti) can be input into the text mapping model to be trained to obtain the speech latent space representation Ĥi of Ti.
- the mapping learning loss function can be applied to Hi and Ĥi to obtain the loss Li, which is used to represent the difference between Hi and Ĥi.
- Li can be used to update the parameters of the text mapping model to be trained, and then (Si+1, Ti+1) can be used to continue training the text mapping model after the updated parameters until the loss converges, thereby obtaining the trained text mapping model.
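- the mapping learning stage described above can be sketched as a toy training loop. Everything here is a hypothetical stand-in: `precoder` plays the trained (frozen) speech precoding model, `text_mapper` plays the text mapping model to be trained with a single parameter `w`, the loss is assumed to be a mean squared error, and the data are made-up numbers rather than real speech or text.

```python
def precoder(speech):
    """Stand-in for the trained first model: speech S_i -> speech feature H_i."""
    return [2.0 * s for s in speech]

def text_mapper(text, w):
    """Stand-in for the second model to be trained: text T_i -> speech feature."""
    return [w * t for t in text]

def mse(h, h_hat):
    """Assumed mapping learning loss: mean squared difference of the features."""
    return sum((a - b) ** 2 for a, b in zip(h, h_hat)) / len(h)

# Hypothetical matched pairs (S_i, T_i); each toy "text" equals its "speech".
pairs = [([1.0, 2.0], [1.0, 2.0]), ([0.5, -1.0], [0.5, -1.0])]

w, lr = 0.0, 0.1
history = []
for epoch in range(50):
    for speech, text in pairs:            # one matching pair per iteration
        h = precoder(speech)              # H_i; the first model is not updated
        h_hat = text_mapper(text, w)      # speech feature predicted from text
        history.append(mse(h, h_hat))     # loss L_i
        grad = sum(2.0 * (b - a) * t for a, b, t in zip(h, h_hat, text)) / len(h)
        w -= lr * grad                    # update only the text mapper
```

- only the text mapper's parameter changes; the precoder stays fixed, so the text-derived features are pulled towards the speech-derived ones.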
- the first model to be trained is trained to obtain a third model.
- the model composed of the first model and the third model is used to complete the speech task.
- the first voice can be input into the first model so that the first model processes the first voice (for example, feature extraction, etc.) to obtain the first voice feature of the first voice
- the first voice feature of the first voice is input into the first model to be trained so that the first model to be trained processes the first voice feature of the first voice (for example, feature extraction, etc.) to obtain the processing result of the first voice.
- the first text can also be input into the second model so that the second model processes the first text (for example, feature extraction, etc.) to obtain the second voice feature of the first text
- the second voice feature of the first text is input into the first model to be trained so that the first model to be trained processes the second voice feature of the first text (for example, feature extraction, etc.) to obtain the processing result of the first text.
- the first model to be trained can be trained based on the processing result of the first speech and the processing result of the first text, so as to obtain a third model (also called a trained task backbone model, which is a trained neural network model).
- the first model to be trained can be trained in the following manner to obtain the third model:
- the training process of the first model to be trained consists of multiple rounds of iteration, and each round uses a first speech or a first text for model training. For convenience of explanation, two adjacent rounds are described below, one of which is called the current iteration and the other the next iteration.
- a first speech is used in the current iteration
- a first text is used in the next iteration.
- the first speech can be input into the first model so that the first model processes the first speech, thereby obtaining the first speech feature of the first speech.
- the first speech feature of the first speech can be input into the first model to be trained so that the first model to be trained processes the first speech feature of the first speech, thereby obtaining the processing result of the first speech.
- the preset second loss function can be applied to the processing result of the first speech and the actual processing result (label) of the first speech, thereby obtaining the first loss of the first speech, which is used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech.
- the parameters of the first model to be trained can be updated based on the first loss of the first speech, thereby obtaining the first model to be trained after the updated parameters, and thus the current round of iteration is completed.
- the first text can be input into the second model so that the second model processes the first text, thereby obtaining the second speech feature of the first text.
- the second speech feature of the first text can be input into the first model to be trained so that the first model to be trained processes the second speech feature of the first text, thereby obtaining the processing result of the first text.
- the second loss function can be applied to the processing result of the first text and the actual processing result (label) of the first text, thereby obtaining the second loss of the first text, which is used to indicate the difference between the processing result of the first text and the actual processing result of the first text.
- the parameters of the first model to be trained after the parameters are updated can be updated based on the second loss of the first text, thereby obtaining the first model to be trained after the parameters are updated again, and the next round of iteration is completed.
- a further iteration can then be entered, that is, the first model to be trained after the parameters are updated again continues to be trained using the next first speech or the next first text until the model training conditions are met (for example, the loss reaches convergence, etc.), thereby obtaining the third model.
- Figure 6 is a schematic diagram of the text-speech multimodal learning stage provided in an embodiment of the present application
- S′j can be input into the trained speech precoding model to obtain the speech latent space representation H′j of S′j, and then H′j is input into the task backbone model to be trained to obtain the processing result Ŷj of S′j.
- the text-speech multimodal learning loss function can be applied to Ŷj and the label Yj of S′j to obtain the loss LSj, which is used to indicate the difference between Ŷj and Yj.
- LSj can be used to update the parameters of the task backbone model to be trained to obtain the task backbone model with updated parameters.
- the current round of iteration is completed.
- T′k can be input into the trained text mapping model to obtain the speech latent space representation H′k of T′k, and then H′k can be input into the task backbone model with updated parameters to obtain the processing result Ŷk of T′k.
- the text-speech multimodal learning loss function can be applied to Ŷk and the label Yk of T′k to obtain the loss LTk, which is used to indicate the difference between Ŷk and Yk.
- LTk can be used to update the parameters of the task backbone model with updated parameters to obtain the task backbone model whose parameters have been updated again.
- the next round of iteration is completed.
- S′j+1 or T′k+1 can be used to continue training the task backbone model whose parameters have been updated again until the loss converges, thereby obtaining the trained task backbone model.
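- the alternating speech and text iterations of the text-speech multimodal learning stage can be sketched as follows. All names and numbers are hypothetical: `precoder` and `text_mapper` stand in for the two already-trained models, `backbone` stands in for the task backbone model to be trained with a single parameter `w`, and a squared-error loss is assumed.

```python
def precoder(speech):
    """Stand-in for the trained speech precoding model (kept fixed)."""
    return [2.0 * s for s in speech]

def text_mapper(text):
    """Stand-in for the trained text mapping model (kept fixed)."""
    return [2.0 * t for t in text]

def backbone(feature, w):
    """Stand-in for the task backbone model to be trained."""
    return w * sum(feature)

def update(w, feature, label, lr=0.01):
    """One iteration: predict, compute the loss against the label, update w."""
    pred = backbone(feature, w)
    loss = (pred - label) ** 2
    grad = 2.0 * (pred - label) * sum(feature)
    return w - lr * grad, loss

# Made-up samples with known labels; speech and text are independent in content.
speech_data = [([1.0, 2.0], 18.0), ([0.5, 0.5], 6.0)]
text_data = [([2.0, 1.0], 18.0), ([1.0, 0.0], 6.0)]

w = 0.0
losses = []
for epoch in range(100):
    for (speech, y_s), (text, y_t) in zip(speech_data, text_data):
        w, loss_s = update(w, precoder(speech), y_s)    # iteration on a speech
        w, loss_t = update(w, text_mapper(text), y_t)   # iteration on a text
        losses.append((loss_s, loss_t))
```

- both kinds of sample reach the backbone as speech features, so the same parameters are trained from speech and from text.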
- the obtained model is the final model used to complete the user's speech task.
- the second model can be any one of the following: a diffusion model, a generative adversarial nets (GAN) model, a sequence-to-sequence model, and the like.
- the first model is a part of the fourth model, that is, a part of the fourth model can be cut out to serve as the first model, for example, certain layers in the fourth model, etc.
- the fourth model is usually a model for completing speech recognition (i.e., a trained speech recognition model, a trained neural network model), or the fourth model is a speech pre-training model, etc.
- Figure 7 is a schematic diagram of an application example of the mapping learning stage provided by the embodiment of the present application, and Figure 8 is a schematic diagram of an application example of the text-speech multimodal learning stage provided by the embodiment of the present application.
- the application example includes:
- the trained speech recognition model can be obtained first.
- the speech recognition model is used to convert Chinese speech into Chinese text.
- the first 4 layers of the model can be taken as the trained speech precoding model.
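- taking the first layers of a trained model as the speech precoding model can be sketched as follows, assuming the model can be viewed as an ordered list of layers. The six toy layers and the helper names `make_layer` and `build_precoder` are hypothetical; a real speech recognition model would consist of trained neural network layers.

```python
def make_layer(scale):
    """Toy layer that scales every element; stands in for a trained layer."""
    return lambda xs: [scale * x for x in xs]

# Hypothetical stand-in for a trained speech recognition model with 6 layers.
fourth_model_layers = [make_layer(s) for s in (1.0, 2.0, 0.5, 3.0, 1.0, 2.0)]

def build_precoder(layers, num_layers=4):
    """Keep only the first `num_layers` layers as the speech precoding model."""
    kept = layers[:num_layers]
    def precoder(xs):
        for layer in kept:
            xs = layer(xs)
        return xs
    return precoder

precoder = build_precoder(fourth_model_layers)
features = precoder([1.0, 2.0])   # passes through layers 1 to 4 only
```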
- the diffusion model to be trained and the speech translation backbone model to be trained can be obtained first.
- a batch of training data can be obtained first, which contains Chinese speech and Chinese text (the two are corresponding), and then the text is input into the diffusion model to be trained to obtain the corresponding speech latent space representation, and the speech is input into the trained speech precoding model to obtain the corresponding speech latent space representation.
- the parameters of the diffusion model can be updated based on these speech latent space representations, thereby completing the training of the diffusion model and obtaining a trained diffusion model.
- another batch of training data can be obtained, which also includes Chinese speech and Chinese text (the correct English text of the Chinese speech and the correct English text of the Chinese text are known).
- the Chinese speech can be input into the trained speech precoding model to obtain the corresponding speech latent space representation, and then the speech latent space representation is input into the speech translation backbone model to be trained to obtain the corresponding predicted English text.
- the Chinese text can also be input into the trained diffusion model to obtain the corresponding speech latent space representation, and then the speech latent space representation is input into the speech translation backbone model to be trained to obtain the corresponding predicted English text. Since the correct English text of the Chinese speech and the correct English text of the Chinese text are known, the parameters of the speech translation backbone model can be updated based on the predicted English text and the correct English text, thereby completing the training of the speech translation backbone model and obtaining the trained speech translation backbone model.
- the output of the speech precoding model can be connected to the input of the speech translation backbone model, and the resulting final model can convert Chinese speech into English text.
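- connecting the output of the precoding model to the input of the backbone model amounts to function composition. In the sketch below, `compose`, `toy_precoder` and `toy_backbone` are hypothetical placeholders for the trained models and operate on made-up numbers.

```python
def compose(precoder, backbone):
    """Connect the precoder output to the backbone input to form the final model."""
    def final_model(speech):
        return backbone(precoder(speech))
    return final_model

# Hypothetical stand-ins; real models would be trained neural networks.
toy_precoder = lambda speech: [2.0 * s for s in speech]   # first model
toy_backbone = lambda feature: sum(feature)               # third model

final_model = compose(toy_precoder, toy_backbone)
result = final_model([1.0, 2.0])
```

- the text mapping model does not appear in the composed model: it is used only during training, while the deployed model takes speech as input.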
- training data associated with the voice task may be obtained first, and the training data includes a first text and a first voice.
- the first voice may be processed by the first model to obtain a first voice feature, and then the first voice feature may be processed by the first model to be trained to obtain a processing result of the first voice.
- the first text may be processed by the second model to obtain a second voice feature, and then the second voice feature may be processed by the first model to be trained to obtain a processing result of the first text.
- the first model to be trained is trained to obtain a third model.
- the training data of the first model to be trained is speech features (i.e., the first speech features and the second speech features), and the speech features can be derived from both speech (i.e., the first speech) and text (i.e., the first text). Since converting text into speech features is easier to implement than converting text into speech, it is often possible to convert text into correct speech features.
- the speech model trained based on the correct speech features i.e., the final model constructed by the first model and the third model
- FIG. 9 is a structural schematic diagram of the model training device provided in the embodiment of the present application. As shown in FIG. 9 , the device includes:
- an acquisition module 901, used to acquire training data associated with a speech task, where the training data includes a first text and a first speech;
- a first processing module 902, used to input the first speech into a first model to obtain a first speech feature;
- a second processing module 903, used to input the first text into a second model to obtain a second speech feature;
- a third processing module 904, used to input the first speech feature into the first model to be trained to obtain a processing result of the first speech;
- a fourth processing module 905, used to input the second speech feature into the first model to be trained to obtain a processing result of the first text;
- a first training module 906, used to train the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain a third model.
- the model composed of the first model and the third model is used to complete the speech task.
- the first training module 906 is used to: obtain a first loss based on the processing result of the first speech and the actual processing result of the first speech, and the first loss is used to indicate the difference between the processing result of the first speech and the actual processing result of the first speech; obtain a second loss based on the processing result of the first text and the actual processing result of the first text, and the second loss is used to indicate the difference between the processing result of the first text and the actual processing result of the first text; based on the first loss and the second loss, update the parameters of the first model to be trained until the model training conditions are met to obtain a third model.
- the training data also includes a second text and a second speech
- the second text corresponds to the second speech
- the device also includes: a fifth processing module, used to input the second speech into the first model to obtain a third speech feature; a sixth processing module, used to input the second text into the second model to be trained to obtain a fourth speech feature; and a second training module, used to train the second model to be trained based on the third speech feature and the fourth speech feature to obtain a second model.
- the second training module is used to: obtain a third loss based on the third speech feature and the fourth speech feature, where the third loss is used to indicate the difference between the third speech feature and the fourth speech feature; and, based on the third loss, update the parameters of the second model to be trained until the model training conditions are met, to obtain the second model.
- the second model is any one of the following: a diffusion generation model, a generative adversarial network model, and a sequence-to-sequence model.
- the first model is a part of a fourth model
- the fourth model is a model for completing speech recognition
- the fourth model is a speech pre-training model
- the speech task is any one of the following: speech translation, speech recognition, speech command, and speech dialogue.
- FIG. 10 is a structural diagram of the training device provided by the embodiment of the present application.
- the training device 1000 can be specifically manifested as a mobile phone, a tablet, a laptop computer, an intelligent wearable device, a server, etc., which is not limited here.
- the function of model training in the corresponding embodiment of FIG. 4 can be implemented on the training device 1000.
- the training device 1000 includes: a receiver 1001, a transmitter 1002, a processor 1003 and a memory 1004 (wherein the number of processors 1003 in the training device 1000 can be one or more, and one processor is taken as an example in FIG. 10), wherein the processor 1003 may include an application processor 10031 and a communication processor 10032.
- the receiver 1001, the transmitter 1002, the processor 1003 and the memory 1004 may be connected via a bus or other means.
- the memory 1004 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1003. A portion of the memory 1004 may also include a non-volatile random access memory (NVRAM).
- the memory 1004 stores processor and operation instructions, executable modules or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.
- the processor 1003 controls the operation of the training device.
- the various components of the training device are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, and a status signal bus in addition to a data bus.
- various buses are referred to as bus systems in the figure.
- the method disclosed in the above embodiment of the present application can be applied to the processor 1003, or implemented by the processor 1003.
- the processor 1003 can be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the hardware integrated logic circuit in the processor 1003 or the instruction in the form of software.
- the above processor 1003 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
- the processor 1003 can implement or execute the various methods, steps and logic block diagrams disclosed in the embodiment of the present application.
- the general-purpose processor can be a microprocessor, or the processor can be any conventional processor, etc.
- the steps of the method disclosed in the embodiment of the present application can be executed directly by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable memory, a register, etc.
- the storage medium is located in the memory 1004, and the processor 1003 reads the information in the memory 1004 and completes the steps of the above method in combination with its hardware.
- the receiver 1001 can be used to receive input digital or character information and generate signal input related to the relevant settings and function control of the training device.
- the transmitter 1002 can be used to output digital or character information through the first interface; the transmitter 1002 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1002 can also include a display device such as a display screen.
- the processor 1003 is used to complete model training through the model training architecture in the embodiment corresponding to Figure 4, and provide the obtained final model to the user device, so that the user can complete the voice task that the user needs to complete based on the final model through the user device, that is, execute the task processing method.
- An embodiment of the present application also relates to a computer storage medium, in which a program for signal processing is stored.
- when the program is run on a computer, the computer executes the steps executed by the aforementioned training device.
- the present application also relates to a computer program product, wherein the computer program product stores instructions, which are executed by a computer.
- the training device or terminal device provided in the embodiment of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, wherein the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc.
- the processing unit may execute the computer execution instructions stored in the storage unit so that the chip in the execution device executes the data processing method described in the above embodiment, or so that the chip in the training device executes the data processing method described in the above embodiment.
- the storage unit is a storage unit in the chip, such as a register, a cache, etc.
- the storage unit may also be a storage unit located outside the chip in the wireless access device end, such as a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM), etc.
- FIG. 11 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.
- the chip can be a neural network processor NPU 1100.
- NPU 1100 is mounted on the host CPU (Host CPU) as a coprocessor, and tasks are assigned by the Host CPU.
- the core part of the NPU is the operation circuit 1103, which is controlled by the controller 1104 to extract matrix data from the memory and perform multiplication operations.
- the operation circuit 1103 includes multiple processing units (Process Engine, PE) inside.
- the operation circuit 1103 is a two-dimensional systolic array.
- the operation circuit 1103 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
- the operation circuit 1103 is a general-purpose matrix processor.
- the operation circuit takes the corresponding data of matrix B from the weight memory 1102 and caches it on each PE in the operation circuit.
- the operation circuit takes the matrix A data from the input memory 1101 and performs matrix operations with matrix B, and the partial results or final results of the matrix are stored in the accumulator 1108.
- the unified memory 1106 is used to store input data and output data.
- the weight data is directly transferred to the weight memory 1102 through the direct memory access controller (DMAC) 1105.
- the input data is also transferred to the unified memory 1106 through the DMAC.
- the bus interface unit (BIU) 1013 is used for interaction between the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1109.
- the bus interface unit 1013 is used by the instruction fetch buffer 1109 to obtain instructions from the external memory, and is also used by the direct memory access controller 1105 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
- DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1106 or to transfer weight data to the weight memory 1102 or to transfer input data to the input memory 1101.
- the vector calculation unit 1107 includes multiple operation processing units, and when necessary, further processes the output of the operation circuit 1103, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization, pixel-level summation, upsampling of the predicted label plane, etc.
- the vector calculation unit 1107 can store the processed output vector to the unified memory 1106.
- the vector calculation unit 1107 can apply a linear function or a nonlinear function to the output of the operation circuit 1103, for example linear interpolation of the predicted label plane extracted by the convolution layer or, as another example, a function applied to a vector of accumulated values to generate an activation value.
- the vector calculation unit 1107 generates a normalized value, a pixel-level summed value, or both.
- the processed output vector can be used as an activation input to the operation circuit 1103, for example, for use in a subsequent layer in a neural network.
- An instruction fetch buffer 1109 connected to the controller 1104 is used to store instructions used by the controller 1104;
- Unified memory 1106, input memory 1101, weight memory 1102 and instruction fetch memory 1109 are all on-chip memories. External memories are private to the NPU hardware architecture.
- the processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
- the device embodiments described above are merely schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
- the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
- the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk, and which includes a number of instructions that enable a computer device (which can be a personal computer, a training device, or a network device, etc.) to execute the methods described in each embodiment of the present application.
- all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof.
- all or part of the embodiments may be implemented in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave).
- the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media.
- the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), etc.
Abstract
A model training method and apparatus, a storage medium, and a program product. The method comprises: acquiring training data associated with a speech task, wherein the training data comprises first text and first speech (401); inputting the first speech into a first model, so as to obtain a first speech feature (405); inputting the first text into a second model, so as to obtain a second speech feature (406); inputting the first speech feature into a first model to be trained, so as to obtain a processing result of the first speech (407); inputting the second speech feature into the first model to be trained, so as to obtain a processing result of the first text (408); and on the basis of the processing result of the first speech and the processing result of the first text, training the first model to be trained, so as to obtain a third model, wherein a model formed by means of the first model and the third model is used for completing the speech task (409).
Description
This application claims priority to Chinese patent application No. 202310127981.5, filed with the State Intellectual Property Office on January 30, 2023 and entitled “A model training method and related device”, the entire contents of which are incorporated herein by reference.
The embodiments of the present application relate to the field of artificial intelligence (AI) technology, and in particular to a model training method and a related device.
With the rapid development of AI technology, smart terminals often carry an end-to-end (E2E) speech model that can be used to complete a user's speech task. That is, after the user inputs speech into the speech model, the model processes the speech and outputs a processing result, which serves as the result of the user's speech task.
Training a speech model usually requires a large amount of training data, that is, a large amount of speech together with its labels (the actual processing results of that speech), but labeled speech is often difficult to collect. At present, a speech synthesis system is usually used to convert labeled text into labeled speech. Because text is easier to collect, a sufficient amount of speech can be obtained in this way as training data.
However, limited by the performance of the speech synthesis system, text often cannot be converted into correct speech, and there is usually a gap between the synthesized speech and the correct speech. A speech model trained on such insufficiently correct speech does not perform well enough and cannot accurately complete the user's speech task, which degrades the user experience.
Summary of the invention
The embodiments of the present application provide a model training method and a related device. The speech model obtained by this training can perform well, so it can accurately complete the user's speech task and thereby improve the user experience.
A first aspect of the embodiments of the present application provides a model training method, the method comprising:
After the speech task that the user needs to complete is determined, training data related to the speech task can be obtained. The training data usually contains multiple first speeches and multiple first texts, which are independent of each other in content. Note that the actual processing results (labels) of the first speeches and the actual processing results of the first texts are known.
For any one of the first speeches, the first speech can be input into the first model (also called the trained speech pre-encoding model), which processes it to obtain the first speech feature of the first speech. The first speech feature is then input into the first model to be trained (also called the task backbone model to be trained), which processes it to obtain the processing result of the first speech. Likewise, for any one of the first texts, the first text can be input into the second model (also called the trained text mapping model), which processes it to obtain the second speech feature of the first text. The second speech feature is then input into the first model to be trained, which processes it to obtain the processing result of the first text.
Next, the first model to be trained can be trained based on the processing result of the first speech and the processing result of the first text to obtain a third model (also called the trained task backbone model, which is a trained neural network model). The model constructed from the first model and the third model can then serve as the final model for completing the user's speech task.
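The data flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the three functions are toy stand-ins, and their names (`first_model`, `second_model`, `backbone`) and the scalar `weight` parameter are hypothetical and introduced only for this example.

```python
from typing import List, Sequence

Feature = List[float]


def first_model(speech: Sequence[float]) -> Feature:
    """Toy stand-in for the trained speech pre-encoding model (speech -> speech feature)."""
    # Assumption: the feature is the mean of each pair of adjacent samples.
    return [(speech[i] + speech[i + 1]) / 2 for i in range(0, len(speech) - 1, 2)]


def second_model(text: str) -> Feature:
    """Toy stand-in for the trained text mapping model (text -> speech feature)."""
    # Assumption: one scaled value per character.
    return [ord(ch) / 128.0 for ch in text]


def backbone(feature: Feature, weight: float) -> float:
    """Toy stand-in for the task backbone (first model to be trained / third model)."""
    return weight * sum(feature) / max(len(feature), 1)


def process_speech(speech: Sequence[float], weight: float) -> float:
    # Speech path: first model, then the backbone.
    return backbone(first_model(speech), weight)


def process_text(text: str, weight: float) -> float:
    # Text path: second model, then the same backbone.
    return backbone(second_model(text), weight)


def final_model(speech: Sequence[float], trained_weight: float) -> float:
    # The final model chains only the first model and the trained backbone.
    return process_speech(speech, trained_weight)
```

Both paths feed the same backbone, which is why text can supplement speech as training data without a speech synthesis step.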
It can be seen from the above method that, after the user's speech task is determined, training data associated with the speech task can be obtained first, where the training data includes a first text and a first speech. The first speech can then be processed by the first model to obtain a first speech feature, and the first speech feature can be processed by the first model to be trained to obtain a processing result of the first speech. In addition, the first text can be processed by the second model to obtain a second speech feature, and the second speech feature can be processed by the first model to be trained to obtain a processing result of the first text. Finally, the first model to be trained is trained based on the processing result of the first speech and the processing result of the first text to obtain a third model. The final model composed of the first model and the third model can then be used to complete the user's speech task. In this process, the training data of the first model to be trained is speech features (i.e., the first speech feature and the second speech feature), and the speech features can be derived from either speech (i.e., the first speech) or text (i.e., the first text). Because converting text into speech features is easier to implement than converting text into speech, text can usually be converted into correct speech features. The speech model trained on these correct speech features (i.e., the final model constructed from the first model and the third model) can perform well, so it can accurately complete the user's speech task and thereby improve the user experience.
In a possible implementation, training the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain the third model includes: obtaining a first loss based on the processing result of the first speech and the actual processing result of the first speech, where the first loss indicates the difference between the two; obtaining a second loss based on the processing result of the first text and the actual processing result of the first text, where the second loss indicates the difference between the two; and updating the parameters of the first model to be trained based on the first loss and the second loss until the model training conditions are met, to obtain the third model. In this implementation, the training of the first model to be trained proceeds over multiple rounds of iteration, and each round uses one first speech or one first text. For ease of explanation, consider two adjacent rounds, called the current round and the next round, where the current round uses a first speech and the next round uses a first text. In the current round, the first speech is input into the first model, which processes it to obtain the first speech feature of the first speech. The first speech feature is then input into the first model to be trained, which processes it to obtain the processing result of the first speech. The first loss is then computed from the processing result of the first speech and the actual processing result of the first speech, and indicates the difference between them. The parameters of the first model to be trained are updated based on the first loss, which completes the current round. In the next round, the first text is input into the second model, which processes it to obtain the second speech feature of the first text. The second speech feature is then input into the first model to be trained, which processes it to obtain the processing result of the first text. The second loss is then computed from the processing result of the first text and the actual processing result of the first text, and indicates the difference between them. The parameters of the already updated first model to be trained are updated again based on the second loss, which completes the next round. Training then continues in the same way, using the next first speech or the next first text, until the model training conditions are met, at which point the third model is obtained.
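The alternating iterations described above can be illustrated with a toy training loop. The sketch assumes a one-parameter linear backbone, a squared-error loss, plain gradient descent, and a fixed number of rounds as the training condition. None of these choices comes from the application, and all names are hypothetical.

```python
from typing import List, Tuple


def first_model(speech: List[float]) -> float:
    """Toy frozen speech pre-encoding model: speech -> scalar speech feature."""
    return sum(speech) / len(speech)


def second_model(text: str) -> float:
    """Toy frozen text mapping model: text -> scalar speech feature."""
    return len(text) / 10.0


def sgd_step(w: float, feature: float, label: float, lr: float) -> Tuple[float, float]:
    """One iteration: forward pass, loss, and a single parameter update."""
    prediction = w * feature                      # processing result
    loss = (prediction - label) ** 2              # difference from the actual result
    grad = 2.0 * (prediction - label) * feature   # d(loss)/dw
    return w - lr * grad, loss


def train_backbone(speech_data, text_data, lr: float = 0.1, rounds: int = 200) -> float:
    """Alternate speech iterations (first loss) and text iterations (second loss)."""
    w = 0.0  # the only parameter of the toy backbone (first model to be trained)
    for _ in range(rounds):
        for (speech, speech_label), (text, text_label) in zip(speech_data, text_data):
            # Current round: a first speech goes through the first model.
            w, first_loss = sgd_step(w, first_model(speech), speech_label, lr)
            # Next round: a first text goes through the second model.
            w, second_loss = sgd_step(w, second_model(text), text_label, lr)
    return w


# Toy labels generated by label = 3 * feature, so training should recover w close to 3.
speech_data = [([0.5, 1.5], 3.0), ([2.0, 2.0], 6.0)]
text_data = [("hello", 1.5), ("helloworld", 3.0)]
third_model_weight = train_backbone(speech_data, text_data)
```

Only the backbone weight changes during training. The first and second models stay fixed, which matches the description of both as already trained.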
In a possible implementation, the training data also includes a second text and a second speech, where the second text corresponds to the second speech, and the method also includes: inputting the second speech into the first model to obtain a third speech feature; inputting the second text into the second model to be trained to obtain a fourth speech feature; and training the second model to be trained based on the third speech feature and the fourth speech feature to obtain the second model. In this implementation, the training data also includes multiple speech-text matching pairs. Each pair contains a second speech and the second text corresponding to it, that is, the second speech and its second text match in content. For any one of the speech-text matching pairs, the second speech in the pair can be input into the first model, which processes it to obtain the third speech feature of the second speech. At the same time, the second text in the pair can be input into the second model to be trained (also called the text mapping model to be trained), which processes it to obtain the fourth speech feature of the second text. The second model to be trained can then be trained based on the third speech feature of the second speech and the fourth speech feature of the second text to obtain the second model.
In a possible implementation, training the second model to be trained based on the third speech feature and the fourth speech feature to obtain the second model includes: obtaining a third loss based on the third speech feature and the fourth speech feature, the third loss indicating the difference between the third speech feature and the fourth speech feature; and updating the parameters of the second model to be trained based on the third loss until the model training conditions are met, to obtain the second model. In this implementation, the training of the second model to be trained proceeds over multiple rounds of iteration, and each round uses one speech-text matching pair. For ease of description, one of these rounds is taken as an example and referred to as the current round of iteration, in which a particular speech-text matching pair is used. After entering the current round of iteration, the second speech in the speech-text matching pair can be input into the first model, which processes it to obtain the third speech feature of the second speech. At the same time, the second text in the speech-text matching pair can be input into the second model to be trained, which processes it to obtain the fourth speech feature of the second text.
After the third speech feature of the second speech and the fourth speech feature of the second text are obtained, a calculation can be performed on the two features to obtain the third loss of the speech-text matching pair. This third loss indicates the difference between the third speech feature corresponding to the second speech and the fourth speech feature corresponding to the second text. The parameters of the second model to be trained can then be updated based on the third loss of the speech-text matching pair, which completes the current round of iteration. The next round of iteration can then be entered, that is, the second model to be trained with the updated parameters continues to be trained using the next speech-text matching pair until the model training conditions are met, thereby obtaining the second model.
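The mapping-learning loop described above can be sketched as follows. This is a minimal illustration under assumptions that are not in the source: the first model is a fixed hypothetical function, the second model to be trained is a hypothetical two-parameter linear map, the third loss is assumed to be a squared error, each speech and text sample is reduced to a single made-up number, and the update is assumed to be plain gradient descent.

```python
def first_model(speech):
    """Frozen speech feature extractor (hypothetical); never updated here."""
    return 3.0 * speech + 1.0


def second_model(params, text):
    """Text mapping model to be trained (hypothetical linear form)."""
    a, b = params
    return a * text + b


def third_loss(third_feature, fourth_feature):
    """Squared difference between the two features (assumed loss form)."""
    return (fourth_feature - third_feature) ** 2


# Toy speech-text matching pairs; each number stands in for real data.
pairs = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (-1.0, -1.0)]

params = (0.0, 0.0)
lr = 0.05
for _ in range(500):
    for speech, text in pairs:               # one pair per round of iteration
        third = first_model(speech)          # third speech feature (target)
        fourth = second_model(params, text)  # fourth speech feature
        err = fourth - third
        a, b = params
        # Gradient of the squared error with respect to a and b.
        params = (a - lr * 2.0 * err * text, b - lr * 2.0 * err)

mean_loss = sum(
    third_loss(first_model(s), second_model(params, t)) for s, t in pairs
) / len(pairs)
```

Only the parameters of the text mapping model change; the first model serves purely as the source of target features.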
In a possible implementation, the second model is any one of the following: a diffusion generative model, a generative adversarial network model, a sequence-to-sequence model, and the like.
In a possible implementation, a part of a fourth model, for example certain layers of the fourth model, can be extracted to serve as the first model. The fourth model is usually a model used to complete speech recognition (also referred to as a trained speech recognition model), or the fourth model is a speech pre-training model, and so on.
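Taking part of the fourth model as the first model can be sketched as follows. The layer stack, the choice of keeping the first two layers, and the helper names `run` and `take_first_layers` are all hypothetical; the source only says that certain layers of the fourth model may be used.

```python
# Hypothetical "fourth model": an ordered stack of layers, each a function
# from a list of floats to a list of floats.
fourth_model_layers = [
    lambda x: [v * 2.0 for v in x],      # layer 0: lower feature layer
    lambda x: [v + 1.0 for v in x],      # layer 1: lower feature layer
    lambda x: [max(v, 0.0) for v in x],  # layer 2: upper layer
    lambda x: [sum(x)],                  # layer 3: task-specific output head
]


def run(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def take_first_layers(layers, k):
    """Return a new stack made of the first k layers (illustrative choice)."""
    return list(layers[:k])


# The first model reuses only the lower layers of the fourth model.
first_model_layers = take_first_layers(fourth_model_layers, 2)
speech = [0.5, -1.0]
speech_feature = run(first_model_layers, speech)
```

With this split, the extracted layers produce a feature vector rather than the fourth model's final output, which is what the downstream model consumes.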
In a possible implementation, the speech task is any one of the following: speech translation, speech recognition, speech command, speech dialogue, and the like.
A second aspect of the embodiments of the present application provides a model training apparatus, which includes: an acquisition module, configured to acquire training data associated with a speech task, the training data including a first text and a first speech; a first processing module, configured to input the first speech into a first model to obtain a first speech feature; a second processing module, configured to input the first text into a second model to obtain a second speech feature; a third processing module, configured to input the first speech feature into a first model to be trained to obtain a processing result of the first speech; a fourth processing module, configured to input the second speech feature into the first model to be trained to obtain a processing result of the first text; and a first training module, configured to train the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain a third model, where the model formed by the first model and the third model is used to complete the speech task.
It can be seen from the above apparatus that, after the user's speech task is determined, training data associated with the speech task can first be acquired, the training data including a first text and a first speech. Next, the first speech can be processed by the first model to obtain a first speech feature, and the first speech feature can then be processed by the first model to be trained to obtain the processing result of the first speech. In addition, the first text can be processed by the second model to obtain a second speech feature, and the second speech feature can then be processed by the first model to be trained to obtain the processing result of the first text. Finally, the first model to be trained is trained based on the processing result of the first speech and the processing result of the first text to obtain a third model. The final model formed by the first model and the third model can then be used to complete the user's speech task. In this process, the training data of the first model to be trained consists of speech features (namely the first speech feature and the second speech feature), and these speech features can come from speech (the first speech) as well as from text (the first text). Because converting text into speech features is easier to achieve than converting text into speech, text can usually be converted into correct speech features. A speech model trained on correct speech features (namely the final model formed by the first model and the third model) can achieve excellent performance and can therefore complete the user's speech task accurately, thereby improving the user experience.
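The data flow summarized above can be sketched as follows: speech and text take different routes to a speech feature, but both feed the same downstream model, and only the speech route is needed once training is done. Every function body here is a hypothetical placeholder chosen so the example runs; none of them reflects the actual models.

```python
def first_model(speech):
    """Hypothetical speech feature extractor: scales each frame."""
    return [2.0 * frame for frame in speech]


def second_model(text):
    """Hypothetical text mapping model: one feature value per character."""
    return [float(ord(ch) % 7) for ch in text]


def downstream_model(features):
    """Stand-in for the first model to be trained (later the third model)."""
    return sum(features) / len(features)


def final_model(speech):
    """Model used for the speech task: the first model, then the third model."""
    return downstream_model(first_model(speech))


# Training time: both branches produce speech features for the same model.
first_speech_feature = first_model([0.1, 0.2, 0.3])
second_speech_feature = second_model("abc")
speech_result = downstream_model(first_speech_feature)
text_result = downstream_model(second_speech_feature)
```

In this sketch the second model appears only on the training path; `final_model` does not call it.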
In a possible implementation, the first training module is configured to: obtain a first loss based on the processing result of the first speech and the actual processing result of the first speech, the first loss indicating the difference between the two; obtain a second loss based on the processing result of the first text and the actual processing result of the first text, the second loss indicating the difference between the two; and update the parameters of the first model to be trained based on the first loss and the second loss until the model training conditions are met, to obtain the third model.
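One way to update on both losses at once is to combine them into a single objective, sketched below. The weighted sum, the `text_weight` parameter, the squared-error loss form, and the one-parameter model are all assumptions made for illustration; the source states only that the update is based on the first loss and the second loss.

```python
def total_loss(speech_result, speech_target, text_result, text_target,
               text_weight=1.0):
    """First loss plus weighted second loss (assumed combination)."""
    first_loss = (speech_result - speech_target) ** 2
    second_loss = (text_result - text_target) ** 2
    return first_loss + text_weight * second_loss


def joint_step(w, speech_pair, text_pair, lr=0.05, text_weight=1.0):
    """One gradient step on the combined loss for the model y = w * feature."""
    (fs, ys), (ft, yt) = speech_pair, text_pair
    grad = (2.0 * (w * fs - ys) * fs
            + text_weight * 2.0 * (w * ft - yt) * ft)
    return w - lr * grad


# Made-up (feature, actual processing result) pairs.
speech_pair = (1.0, 2.0)  # first speech feature and its label
text_pair = (2.0, 4.0)    # second speech feature and its label

w = 0.0
before = total_loss(w * 1.0, 2.0, w * 2.0, 4.0)
for _ in range(100):
    w = joint_step(w, speech_pair, text_pair)
after = total_loss(w * 1.0, 2.0, w * 2.0, 4.0)
```

An alternative that is equally consistent with the text is to apply the two losses in separate, alternating updates rather than summing them.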
In a possible implementation, the training data further includes a second text and a second speech, the second text corresponding to the second speech, and the apparatus further includes: a fifth processing module, configured to input the second speech into the first model to obtain a third speech feature; a sixth processing module, configured to input the second text into a second model to be trained to obtain a fourth speech feature; and a second training module, configured to train the second model to be trained based on the third speech feature and the fourth speech feature to obtain the second model.
In a possible implementation, the second training module is configured to: obtain a third loss based on the third speech feature and the fourth speech feature, the third loss indicating the difference between the third speech feature and the fourth speech feature; and update the parameters of the second model to be trained based on the third loss until the model training conditions are met, to obtain the second model.
In a possible implementation, the second model is any one of the following: a diffusion generative model, a generative adversarial network model, and a sequence-to-sequence model.
In a possible implementation, the first model is a part of a fourth model, and the fourth model is a model used to complete speech recognition, or the fourth model is a speech pre-training model.
In a possible implementation, the speech task is any one of the following: speech translation, speech recognition, speech command, and speech dialogue.
A third aspect of the embodiments of the present application provides a model training apparatus, which includes a memory and a processor. The memory stores code, and the processor is configured to execute the code. When the code is executed, the model training apparatus performs the method described in the first aspect or in any possible implementation of the first aspect.
A fourth aspect of the embodiments of the present application provides a circuit system, which includes a processing circuit configured to perform the method described in the first aspect or in any possible implementation of the first aspect.
A fifth aspect of the embodiments of the present application provides a chip system, which includes a processor configured to call a computer program or computer instructions stored in a memory, so that the processor performs the method described in the first aspect or in any possible implementation of the first aspect.
In a possible implementation, the processor is coupled to the memory through an interface.
In a possible implementation, the chip system further includes a memory, in which a computer program or computer instructions are stored.
A sixth aspect of the embodiments of the present application provides a computer storage medium, which stores a computer program. When executed by a computer, the program causes the computer to implement the method described in the first aspect or in any possible implementation of the first aspect.
A seventh aspect of the embodiments of the present application provides a computer program product, which stores instructions. When executed by a computer, the instructions cause the computer to implement the method described in the first aspect or in any possible implementation of the first aspect.
In the embodiments of the present application, after the user's speech task is determined, training data associated with the speech task can first be acquired, the training data including a first text and a first speech. Next, the first speech can be processed by the first model to obtain a first speech feature, and the first speech feature can then be processed by the first model to be trained to obtain the processing result of the first speech. In addition, the first text can be processed by the second model to obtain a second speech feature, and the second speech feature can then be processed by the first model to be trained to obtain the processing result of the first text. Finally, the first model to be trained is trained based on the processing result of the first speech and the processing result of the first text to obtain a third model. The final model formed by the first model and the third model can then be used to complete the user's speech task. In this process, the training data of the first model to be trained consists of speech features (namely the first speech feature and the second speech feature), and these speech features can come from speech (the first speech) as well as from text (the first text). Because converting text into speech features is easier to achieve than converting text into speech, text can usually be converted into correct speech features. A speech model trained on correct speech features (namely the final model formed by the first model and the third model) can achieve excellent performance and can therefore complete the user's speech task accurately, thereby improving the user experience.
FIG. 1 is a schematic structural diagram of an artificial intelligence main framework;
FIG. 2a is a schematic structural diagram of a model training system provided in an embodiment of the present application;
FIG. 2b is another schematic structural diagram of the model training system provided in an embodiment of the present application;
FIG. 2c is a schematic diagram of devices related to model training provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of the architecture of the system 100 provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of a model training method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of the mapping learning stage provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of the text-speech multi-modal learning stage provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of an application example of the mapping learning stage provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of an application example of the text-speech multi-modal learning stage provided in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a training device provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a chip provided in an embodiment of the present application.
The embodiments of the present application provide a model training method and a related device. The speech model obtained through training can achieve excellent performance and can therefore complete the user's speech task accurately, thereby improving the user experience.
The terms "first", "second", and so on in the specification, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely the means of distinguishing objects with the same attributes when describing the embodiments of the present application. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product, or device.
With the rapid development of AI technology, smart terminals often carry an end-to-end (E2E) speech model that can be used to complete a user's speech tasks, for example, speech translation, speech recognition, speech commands, and speech dialogue. That is, after the user inputs speech into the speech model, the speech model processes the speech to obtain a processing result, for example, the translated text obtained by recognizing and translating the user's speech, the text obtained by recognizing the user's speech, the command determined by recognizing the user's speech, or the answer found by recognizing the user's speech.
Training a speech model usually requires a large amount of training data, that is, a large number of speech samples together with their labels (the actual processing results of the speech), but labeled speech is often difficult to collect. To collect enough labeled speech, the related art usually uses a text-to-speech (TTS) system to convert labeled text into labeled speech. Because text is relatively easy to collect, a sufficient amount of speech can be obtained through the speech synthesis system for use as training data.
However, limited by the performance of the speech synthesis system, text often cannot be converted into correct speech, and the converted speech usually deviates from the correct speech to some extent. A speech model trained on such insufficiently correct speech therefore performs poorly and cannot accurately complete the user's speech tasks, which degrades the user experience.
To solve the above problem, the embodiments of the present application provide a new model training method, which can be implemented in combination with artificial intelligence (AI) technology. AI technology is a technical discipline that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence; it obtains the best results by perceiving the environment, acquiring knowledge, and using knowledge. In other words, artificial intelligence technology is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Using artificial intelligence for data processing is a common application of artificial intelligence.
The overall workflow of an artificial intelligence system is described first. Referring to FIG. 1, which is a schematic structural diagram of an artificial intelligence main framework, the framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes from data acquisition to data processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a refinement of "data-information-knowledge-wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the provision and processing of technical implementations) to the industrial ecosystem of the system.
(1) Infrastructure
The infrastructure provides computing power for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform. Communication with the outside world takes place through sensors. Computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs). The basic platform includes platform guarantees and support related to distributed computing frameworks and networks, and can include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside world to obtain data, and the data is provided to the smart chips in the distributed computing system supplied by the basic platform for computation.
(2) Data
The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as Internet of Things data from traditional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and other methods.
Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on data.
Reasoning refers to the process, in a computer or intelligent system, of simulating human intelligent reasoning and using formalized information to carry out machine thinking and problem solving according to a reasoning control strategy. Its typical functions are search and matching.
Decision-making refers to the process of making decisions after intelligent information has been reasoned over, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data has undergone the data processing described above, some general capabilities can be formed based on the results of that processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Smart products and industry applications
Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They package the overall artificial intelligence solution, turning intelligent information decision-making into products and putting it into practical use. The main application fields include smart terminals, smart transportation, smart healthcare, autonomous driving, and smart cities.
Several application scenarios of the present application are introduced next.
FIG. 2a is a schematic structural diagram of a model training system provided in an embodiment of the present application. The model training system includes a user device and a data processing device. The user device includes a smart terminal such as a mobile phone, a personal computer, or an information processing center. The user device is the initiating end of model training: as the initiator of a model training request, the user usually initiates the request through the user device.
The data processing device can be a device or server with a data processing function, such as a cloud server, a network server, an application server, or a management server. The data processing device receives requests from the smart terminal through an interactive interface, and then performs processing such as machine learning, deep learning, search, reasoning, and decision-making by means of a memory that stores data and a processor that processes data. The memory in the data processing device is a general term that includes local storage and a database storing historical data; the database can be located on the data processing device or on another network server.
In the model training system shown in FIG. 2a, the user device can receive the user's instruction, determine the speech task that the user needs to complete, and then initiate a request to the data processing device, so that the data processing device executes a model training application for the speech task obtained by the user device and the training data associated with the speech task, thereby obtaining a model for completing the speech task. For example, after receiving the user's instruction, the user device can obtain, based on the instruction, the speech task that the user needs to complete. The user device can then initiate a model training request to the data processing device, so that the data processing device, based on the model training request, obtains the training data associated with the target task (here, the speech task), uses the training data to train the model to be trained, thereby obtaining a model for completing the speech task, and returns the model to the user device, so that the user device uses the model to complete the speech task that the user needs to complete.
In FIG. 2a, the data processing device can perform the model training method of the embodiments of the present application.
FIG. 2b is another schematic structural diagram of the model training system provided in an embodiment of the present application. In FIG. 2b, the user device itself serves as the data processing device: it obtains the instruction from the user directly and processes it with its own hardware. The specific process is similar to that of FIG. 2a; refer to the description above, which is not repeated here.
In the model training system shown in FIG. 2b, the user device can receive the user's instruction and, based on the instruction, obtain the speech task that the user needs to complete and the training data associated with the speech task. The user device can then use the training data to train the model to be trained, thereby obtaining a model for completing the speech task. The user device can subsequently use this model to complete the speech task that the user needs to complete.
In FIG. 2b, the user device itself can perform the model training method of the embodiments of the present application.
FIG. 2c is a schematic diagram of devices related to model training provided in an embodiment of the present application.
The user device in FIG. 2a and FIG. 2b can specifically be the local device 301 or the local device 302 in FIG. 2c, and the data processing device in FIG. 2a can specifically be the training device 210 in FIG. 2c. The data storage system 250 can store the data to be processed by the training device 210; it can be integrated on the training device 210, or it can be located on the cloud or on another network server.
The processors in FIG. 2a and FIG. 2b can perform data training, machine learning, or deep learning through a neural network model or another model (for example, a model based on a support vector machine), and use the data to finally train or learn a model that can be used to complete the speech task.
FIG. 3 is a schematic diagram of the architecture of the system 100 provided in an embodiment of this application. In FIG. 3, the training device 120 is configured with an input/output (I/O) interface 112 for exchanging information with external devices. A user can input instructions to the I/O interface 112 through the client device 140. In the embodiments of this application, the instructions may include the tasks to be scheduled, callable resources, and other parameters.
First, the training device 120 can determine, based on the instruction input by the user, the speech task that the user needs to complete.
Next, for the speech task that the user needs to complete, the training device 120 can train a corresponding model/rule based on the training data associated with the speech task. The model/rule can then be used to carry out the speech task and provide the user with the required result. The training data can be obtained in several ways. For example, when the computing module 111 performs model training or related processing, the training device 120 can call data, code, and the like in the data storage system 150 for that processing, and can also store the resulting trained model in the data storage system 150. As another example, when the computing module 111 performs model training or related processing, the training device 120 can obtain training data from the database 130; these training data usually come from training samples collected by the data acquisition device 160.
Finally, the I/O interface 112 returns the trained model to the client device 140, which provides it to the user so that the user can complete the required speech task.
In the case shown in FIG. 3, the user can give instructions manually through the interface provided by the I/O interface 112. In another case, the client device 140 can send instructions to the I/O interface 112 automatically. If automatic sending of instructions by the client device 140 requires the user's authorization, the user can set the corresponding permission in the client device 140. The user can view the results output by the training device 120 on the client device 140, and the results may be presented by display, sound, action, or other means. The client device 140 can also act as a data collection terminal that, at the user's direction, collects data as new sample data and stores it in the database 130.
It is worth noting that FIG. 3 is only a schematic diagram of one system architecture provided in an embodiment of this application, and the positional relationships among the devices, components, and modules shown in the figure do not constitute any limitation. For example, in FIG. 3 the data storage system 150 is an external memory relative to the training device 120; in other cases, the data storage system 150 may be placed in the training device 120.
An embodiment of this application further provides a chip that includes a neural network processor (NPU). The chip can be deployed in the training device 120 shown in FIG. 3 to perform the training work of the training device 120 and output the target model/rule.
The NPU is mounted as a coprocessor on the host central processing unit (host CPU), and the host CPU assigns tasks to it. The core of the NPU is the operation circuit; a controller controls the operation circuit to fetch data from memory (the weight memory or the input memory) and perform operations.
In some implementations, the operation circuit contains multiple processing units (process engine, PE). In some implementations, the operation circuit is a two-dimensional systolic array. The operation circuit may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit is a general-purpose matrix processor.
For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data of matrix B from the weight memory and caches it on each PE in the operation circuit. The operation circuit then fetches the data of matrix A from the input memory and performs a matrix operation with matrix B. The partial or final results of the resulting matrix are stored in an accumulator.
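The accumulation described above can be pictured with a small software sketch. It is only an illustration, not the NPU's actual data path: the function name `matmul_accumulate` and the `tile` parameter are assumptions introduced here, and a plain nested list stands in for the accumulator.

```python
def matmul_accumulate(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), adding partial products tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]   # stands in for the accumulator
    for start in range(0, k, tile):       # each pass contributes one partial result
        stop = min(start + tile, k)
        for i in range(m):
            for j in range(n):
                for s in range(start, stop):
                    acc[i][j] += a[i][s] * b[s][j]
    return acc                            # final result once every tile is added


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # input matrix A
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # weight matrix B
C = matmul_accumulate(A, B)               # output matrix C
print(C)  # [[4.0, 5.0], [10.0, 11.0]]
```

After the first tile the accumulator holds only a partial sum; the output matrix C is complete once the last tile has been added.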
The vector calculation unit can further process the output of the operation circuit, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. For example, the vector calculation unit can be used for the computation of non-convolutional/non-fully-connected (non-FC) layers in a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit can store the processed output vector in the unified buffer. For example, the vector calculation unit can apply a nonlinear function to the output of the operation circuit, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit, for example for use in a subsequent layer of the neural network.
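As a rough software analogy for this post-processing, the sketch below applies a nonlinear function to a vector of accumulated values and then pools the result. The choice of sigmoid, the window size, and the function names are illustrative assumptions, not details taken from the chip.

```python
import math


def sigmoid(values):
    """Apply a nonlinear function element-wise to produce activation values."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def max_pool_1d(values, window=2):
    """Merge neighbouring values by keeping the largest one in each window."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


accumulated = [-2.0, 0.0, 3.0, 1.0]   # hypothetical output of the operation circuit
activations = sigmoid(accumulated)    # activation values
pooled = max_pool_1d(activations)     # merged (pooled) values
```

The pooled vector could then be fed back as the activation input of a following layer, which is the reuse the paragraph describes.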
The unified memory is used to store input data and output data.
The direct memory access controller (DMAC) transfers input data from the external memory to the input memory and/or the unified memory, stores weight data from the external memory in the weight memory, and stores data from the unified memory in the external memory.
The bus interface unit (BIU) is used to implement interaction among the host CPU, the DMAC, and the instruction fetch buffer through a bus.
The instruction fetch buffer, connected to the controller, is used to store the instructions used by the controller.
The controller is used to invoke the instructions cached in the instruction fetch buffer to control the working process of the operation accelerator.
Generally, the unified memory, the input memory, the weight memory, and the instruction fetch buffer are all on-chip memories. The external memory is a memory outside the NPU, and may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
Because the embodiments of this application involve extensive use of neural networks, the related terms and concepts are introduced first for ease of understanding.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit may be:
$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$
Here, s = 1, 2, ..., n, where n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces nonlinearity into the neural network and converts the input signal of the neural unit into an output signal. The output signal of the activation function can serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by connecting many such neural units, that is, the output of one neural unit can be the input of another. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field; a local receptive field may be a region composed of several neural units.
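A single neural unit as described here can be written in a few lines. This is a minimal sketch assuming a sigmoid activation, which the text names as one option; the function name and the sample numbers are illustrative.

```python
import math


def neuron_output(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid."""
    z = sum(w_s * x_s for w_s, x_s in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Two inputs whose weighted sum is exactly zero, so the sigmoid returns 0.5.
out = neuron_output(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
print(out)  # 0.5
```

Connecting units so that one unit's `out` becomes an element of another unit's `x` gives the network structure described in the paragraph.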
The work of each layer in a neural network can be described by the mathematical expression y = a(Wx + b). At a physical level, the work of each layer can be understood as transforming the input space into the output space (that is, from the row space of a matrix to its column space) through five operations on the input space (the set of input vectors): 1. raising or reducing the dimension; 2. scaling up or down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by Wx, operation 4 by +b, and operation 5 by a(). The word "space" is used because the object being classified is not a single thing but a class of things, and the space is the set of all individuals of that class. W is a weight vector, and each value in the vector represents the weight of one neuron in that layer. The vector W determines the transformation from the input space to the output space described above; in other words, the weights W of each layer control how the space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained network (the weight matrices formed by the vectors W of many layers). The training process of a neural network is therefore essentially learning how to control the spatial transformation, and more specifically learning the weight matrices.
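To make the roles of Wx, +b, and a() concrete, here is a small sketch of one layer applied to a two-dimensional input. The particular matrix (a 90-degree rotation combined with a scaling by 2), the bias, and the use of tanh as the "bending" function are illustrative assumptions.

```python
import math


def layer(x, W, b, a):
    """Compute y = a(Wx + b) for a single input vector x."""
    linear = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]  # Wx
    shifted = [z + b_i for z, b_i in zip(linear, b)]                      # + b
    return [a(z) for z in shifted]                                        # a()


W = [[0.0, -2.0], [2.0, 0.0]]   # rotate by 90 degrees and scale by 2
b = [1.0, 0.0]                  # translate along the first axis
y = layer([1.0, 0.0], W, b, math.tanh)
```

With the identity function in place of `math.tanh`, the same call returns `[1.0, 2.0]`: the point (1, 0) is rotated and scaled to (0, 2) and then shifted to (1, 2). The activation then bends that result nonlinearly.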
Because the output of the neural network should be as close as possible to the value that is actually to be predicted, the predicted value of the current network can be compared with the desired target value, and the weight vector of each layer can then be updated according to the difference between the two. (There is usually an initialization step before the first update, in which parameters are preconfigured for each layer of the network.) For example, if the predicted value is too high, the weight vector is adjusted so that the prediction becomes lower, and the adjustment continues until the network can predict the desired target value. It is therefore necessary to define in advance how the difference between the predicted value and the target value is measured. This is the role of the loss function or objective function, which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the neural network becomes a process of reducing the loss as much as possible.
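The adjust-until-close loop described above can be sketched with a one-weight model. The squared-error loss, the learning rate, and the starting weight are assumptions chosen for illustration; the text does not prescribe a particular loss function.

```python
def squared_loss(pred, target):
    """A larger value means a larger gap between prediction and target."""
    return (pred - target) ** 2


w = 3.0                   # initial weight (the initialization step)
x, target = 2.0, 4.0      # one training input and its target value
lr = 0.05                 # learning rate (assumed)
losses = []
for _ in range(50):
    pred = w * x
    losses.append(squared_loss(pred, target))
    grad = 2.0 * (pred - target) * x   # derivative of the loss with respect to w
    w -= lr * grad                     # prediction too high -> w is lowered
```

Here the first prediction (6.0) is above the target, so each update lowers `w`, and the recorded losses shrink toward zero as `w` approaches 2.0.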
(2) Back propagation algorithm
A neural network can use the error back propagation (BP) algorithm to correct the parameters of the initial neural network model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, passing the input signal forward to the output produces an error loss, and the parameters of the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward pass driven by the error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.
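The forward and backward passes can be illustrated on a deliberately tiny network with one hidden unit and one output weight. This is a sketch under assumed choices (sigmoid hidden unit, squared-error loss, fixed learning rate), not the training procedure of the embodiments.

```python
import math


def forward(w1, w2, x):
    """Forward pass: input -> hidden activation -> output."""
    h = 1.0 / (1.0 + math.exp(-w1 * x))
    return h, w2 * h


def backward(w1, w2, x, target):
    """Backward pass: propagate the loss gradient from the output to each weight."""
    h, y = forward(w1, w2, x)
    dy = 2.0 * (y - target)          # dL/dy for L = (y - target) ** 2
    dw2 = dy * h                     # dL/dw2
    dh = dy * w2                     # dL/dh
    dw1 = dh * h * (1.0 - h) * x     # dL/dw1, using the sigmoid derivative
    return dw1, dw2


w1, w2, x, target = 0.5, -1.0, 1.5, 1.0
initial_loss = (forward(w1, w2, x)[1] - target) ** 2
for _ in range(200):
    dw1, dw2 = backward(w1, w2, x, target)
    w1 -= 0.1 * dw1
    w2 -= 0.1 * dw2
final_loss = (forward(w1, w2, x)[1] - target) ** 2
```

Each iteration performs one forward pass to obtain the error and one backward pass to distribute it, which is how the error loss is driven toward convergence.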
The method provided in this application is described below from the training side and the application side of the neural network.
The model training method provided in the embodiments of this application involves the processing of data sequences and can be applied to methods such as data training, machine learning, and deep learning. It performs symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on the training data (for example, the training data associated with the speech task in the embodiments of this application, including the first text, the first speech, the second text, and the second speech), and finally obtains a trained neural network (for example, the model composed of the first model and the third model in the embodiments of this application). The task processing method provided in the embodiments of this application can use this trained neural network: the user's input data (for example, speech) is input into the trained neural network to obtain output data, thereby completing the task required by the user. It should be noted that the model training method and the task processing method provided in the embodiments of this application are inventions based on the same concept. They can also be understood as two parts of one system or two stages of one overall process, namely the model training stage and the model application stage.
FIG. 4 is a schematic flowchart of the model training method provided in an embodiment of this application. As shown in FIG. 4, the method includes:
401. Obtain training data associated with a speech task, where the training data includes a first speech, a first text, a second speech, and a second text.
In this embodiment, after the speech task that the user needs to complete is determined (for example, speech translation, speech recognition, voice commands, or spoken dialogue), training data related to the speech task can be obtained. The training data usually contains data for two learning stages. The data of the text-speech multimodal learning stage is used to train the first model to be trained (also called the task backbone model to be trained, a neural network model that needs training). The data of the mapping learning stage is used to train the second model to be trained (also called the text mapping model to be trained, a neural network model that needs training). The data of the text-speech multimodal learning stage may contain multiple first speeches and multiple first texts, which are independent of one another in content. The labels of the first speeches (their true processing results) and the labels of the first texts (their true processing results) are known. The data of the mapping learning stage may contain multiple speech-text matching pairs. Each pair contains one second speech and the second text corresponding to it, that is, the second speech and its second text match in content.
Because the model training process in this embodiment includes a mapping learning stage and a text-speech multimodal learning stage, the mapping learning stage can be performed first once the training data has been obtained.
For example, after it is determined that the task backbone model to be trained and the text mapping model to be trained need to be trained, training data related to the user's speech task can be obtained. The training data includes the data of the mapping learning stage and the data of the text-speech multimodal learning stage. The data of the mapping learning stage can be expressed as D = {(S_1, T_1), (S_2, T_2), (S_3, T_3), ..., (S_i, T_i), ..., (S_N, T_N)}, where (S_i, T_i) is the i-th speech-text matching pair (i = 1, ..., N, N ≥ 2), S_i is the i-th speech in the mapping learning stage, and T_i is the i-th text in the mapping learning stage. The data of the text-speech multimodal learning stage can be expressed as DS = {(S'_1, Y_1), (S'_2, Y_2), (S'_3, Y_3), ..., (S'_j, Y_j), ..., (S'_M, Y_M)} and DT = {(T'_1, Y'_1), (T'_2, Y'_2), (T'_3, Y'_3), ..., (T'_k, Y'_k), ..., (T'_P, Y'_P)}, where S'_j is the j-th speech in the text-speech multimodal learning stage (j = 1, ..., M, M ≥ 2), Y_j is the label of the j-th speech, T'_k is the k-th text in the text-speech multimodal learning stage (k = 1, ..., P, P ≥ 2), and Y'_k is the label of the k-th text.
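One way to hold the three data sets in code is sketched below. The class name `TrainingData`, its field names, and the use of strings as placeholders for audio, text, and labels are all assumptions made for illustration; the only constraint taken from the text is that each set contains at least two items.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingData:
    pairs: List[Tuple[str, str]]             # D:  matched (speech, text) pairs
    labelled_speech: List[Tuple[str, str]]   # DS: (speech, label) items
    labelled_text: List[Tuple[str, str]]     # DT: (text, label) items

    def validate(self) -> None:
        """Check the size constraints N >= 2, M >= 2 and P >= 2."""
        for name, items in (("D", self.pairs),
                            ("DS", self.labelled_speech),
                            ("DT", self.labelled_text)):
            if len(items) < 2:
                raise ValueError(f"{name} needs at least two entries")


data = TrainingData(
    pairs=[("s1.wav", "hello"), ("s2.wav", "goodbye")],
    labelled_speech=[("a1.wav", "bonjour"), ("a2.wav", "merci")],
    labelled_text=[("good morning", "bonjour"), ("thank you", "merci")],
)
data.validate()
```

Keeping D separate from DS and DT mirrors the two stages: D is consumed only by the mapping learning stage, while DS and DT feed the text-speech multimodal learning stage.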
402. Input the second speech into the first model to obtain a third speech feature.
403. Input the second text into the second model to be trained to obtain a fourth speech feature.
404. Train the second model to be trained based on the third speech feature and the fourth speech feature to obtain a second model.
In the mapping learning stage, for any one of the speech-text matching pairs, the second speech of the pair can be input into the first model (also called the trained speech precoding model, a trained neural network model), so that the first model processes the second speech (for example, by feature extraction) to obtain the third speech feature of the second speech. At the same time, the second text of the pair can be input into the second model to be trained, so that the second model to be trained processes the second text (for example, by feature extraction) to obtain the fourth speech feature of the second text.
Next, the second model to be trained can be trained based on the third speech feature of the second speech and the fourth speech feature of the second text, thereby obtaining the second model (also called the trained text mapping model, a trained neural network model).
Specifically, the second model to be trained can be trained in the following manner to obtain the second model:
(1) The training of the second model to be trained proceeds over multiple iterations, and each iteration uses one speech-text matching pair. For ease of description, one iteration is taken as an example and called the current iteration, and it uses a particular speech-text matching pair. In the current iteration, the second speech of the pair can be input into the first model, so that the first model processes the second speech to obtain the third speech feature of the second speech. At the same time, the second text of the pair can be input into the second model to be trained, so that the second model to be trained processes the second text to obtain the fourth speech feature of the second text.
(2) After the third speech feature of the second speech and the fourth speech feature of the second text are obtained, a preset first loss function can be applied to them to obtain the third loss of the speech-text matching pair. The third loss indicates the difference between the third speech feature corresponding to the second speech and the fourth speech feature corresponding to the second text.
(3) After the third loss of the speech-text matching pair is obtained, the parameters of the second model to be trained can be updated based on it, which completes the current iteration. The next iteration can then begin, in which the next speech-text matching pair is used to continue training the second model to be trained with its updated parameters. This continues until the model training condition is met (for example, the loss converges), thereby obtaining the second model.
Continuing the above example, as shown in FIG. 5 (a schematic diagram of the mapping learning stage provided in an embodiment of this application), in the mapping learning stage S_i of (S_i, T_i) can be input into the trained speech precoding model to obtain the speech latent-space representation (that is, the speech feature) H_i of S_i, and T_i of (S_i, T_i) can be input into the text mapping model to be trained to obtain the speech latent-space representation H'_i of T_i. Then, with the goal of maximizing the similarity between H_i and H'_i, the mapping learning loss function can be applied to H_i and H'_i to obtain the loss L_i, which represents the difference between H_i and H'_i. L_i can then be used to update the parameters of the text mapping model to be trained, and (S_{i+1}, T_{i+1}) is used to continue training the text mapping model with its updated parameters until the loss converges, thereby obtaining the trained text mapping model.
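The mapping learning stage can be sketched with toy stand-ins. In the code below, speech and text samples are single numbers, the frozen first model is a fixed linear map, the text mapper is a trainable linear map, and a squared difference replaces the unspecified mapping learning loss; every one of these choices, along with all names, is an assumption made only to show the control flow.

```python
def speech_encoder(speech: float) -> float:
    """Frozen stand-in for the first model (hypothetical fixed linear map)."""
    return 2.0 * speech + 1.0


class TextMapper:
    """Trainable stand-in for the second model to be trained."""

    def __init__(self) -> None:
        self.a, self.c = 0.0, 0.0

    def __call__(self, text: float) -> float:
        return self.a * text + self.c


def train_text_mapper(pairs, epochs=100, lr=0.1):
    mapper = TextMapper()
    for _ in range(epochs):
        for speech, text in pairs:        # one matched pair per iteration
            h = speech_encoder(speech)    # third speech feature (encoder is not updated)
            h_text = mapper(text)         # fourth speech feature
            err = h_text - h              # loss for this pair is err ** 2
            mapper.a -= lr * 2.0 * err * text   # only the mapper's parameters change
            mapper.c -= lr * 2.0 * err
    return mapper


# Matched pairs: in this toy encoding, equal numbers mean "same content".
pairs = [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0), (0.5, 0.5)]
mapper = train_text_mapper(pairs)
```

After training, `mapper(t)` is close to `speech_encoder(t)` for matched content, which is the property the stage aims for: text is mapped into the same feature space that the speech encoder produces.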
405. Input the first speech into the first model to obtain a first speech feature.
406. Input the first text into the second model to obtain a second speech feature.
407. Input the first speech feature into the first model to be trained to obtain a processing result of the first speech.
408. Input the second speech feature into the first model to be trained to obtain a processing result of the first text.
409. Train the first model to be trained based on the processing result of the first text and the processing result of the first speech to obtain a third model, where the model composed of the first model and the third model is used to complete the speech task.
In the text-speech multimodal learning stage, for any one of the first speeches, the first speech can be input into the first model, so that the first model processes it (for example, by feature extraction) to obtain the first speech feature of the first speech. The first speech feature is then input into the first model to be trained, which processes it (for example, by feature extraction) to obtain the processing result of the first speech. Likewise, for any one of the first texts, the first text can be input into the second model, so that the second model processes it (for example, by feature extraction) to obtain the second speech feature of the first text. The second speech feature is then input into the first model to be trained, which processes it (for example, by feature extraction) to obtain the processing result of the first text.
Next, the first model to be trained can be trained based on the processing result of the first speech and the processing result of the first text, thereby obtaining the third model (also called the trained task backbone model, a trained neural network model). The model composed of the first model and the third model can then serve as the final model for completing the user's speech task.
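The composition of the first model and the third model into the final model can be pictured as simple function chaining. The helper `build_task_model` and both toy models below are hypothetical stand-ins introduced for illustration; real models would be neural networks operating on audio features.

```python
from typing import Callable, List


def build_task_model(first_model: Callable[[List[float]], List[float]],
                     third_model: Callable[[List[float]], str]
                     ) -> Callable[[List[float]], str]:
    """Chain the speech encoder and the task backbone into one callable."""
    def task_model(speech: List[float]) -> str:
        features = first_model(speech)    # speech -> speech features
        return third_model(features)      # speech features -> task result
    return task_model


def toy_first_model(speech: List[float]) -> List[float]:
    """Hypothetical encoder: a single feature, the mean magnitude of the signal."""
    return [sum(abs(v) for v in speech) / len(speech)]


def toy_third_model(features: List[float]) -> str:
    """Hypothetical backbone: a threshold on the single feature."""
    return "command: stop" if features[0] > 0.5 else "command: none"


task_model = build_task_model(toy_first_model, toy_third_model)
result = task_model([0.9, -0.8, 0.7])
print(result)  # command: stop
```

In this sketch the second model does not appear in the chained model, which matches the statement that the final model is composed of the first model and the third model; the text mapper is needed only to bring text data into training.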
Specifically, the first model to be trained can be trained in the following manner to obtain the third model:
(1) The training of the first model to be trained proceeds over multiple iterations, and each iteration uses one first speech or one first text. For ease of description, two adjacent iterations are taken as an example: one is called the current iteration and the other the next iteration. The current iteration uses a particular first speech, and the next iteration uses a particular first text. In the current iteration, the first speech can be input into the first model, so that the first model processes it to obtain the first speech feature of the first speech. The first speech feature can then be input into the first model to be trained, which processes it to obtain the processing result of the first speech.
(2) After the processing result of the first speech is obtained, a preset second loss function can be applied to the processing result of the first speech and the true processing result (label) of the first speech to obtain the first loss of the first speech. The first loss indicates the difference between the processing result of the first speech and its true processing result. The parameters of the first model to be trained can then be updated based on the first loss, which yields the first model to be trained with updated parameters and completes the current iteration.
(3) In the next iteration, the first text is input into the second model, which processes it to obtain the second speech feature of the first text. The second speech feature is then input into the first model to be trained, which processes it to obtain the processing result of the first text.
(4) After the processing result of the first text is obtained, the second loss function can be applied to the processing result of the first text and the actual processing result of the first text to obtain the second loss of the first text. The second loss indicates the difference between the processing result of the first text and its actual processing result. The parameters of the first model to be trained, which were already updated once, can then be updated based on the second loss, yielding the first model to be trained with parameters updated again. This completes the next iteration.
(5) Subsequently, the iteration after next can be entered: training of the first model to be trained continues with the next first speech or the next first text until the model training condition is met (for example, the loss converges), thereby obtaining the third model.
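As a non-limiting illustration, the alternating iterations of steps (1) to (5) can be sketched as follows. This is a toy under stated assumptions, not the implementation of the present application: the frozen first model, the frozen second model, and the trainable backbone are scalar stand-ins, the loss is a squared error, a fixed epoch count replaces the convergence check, and all names (`speech_encoder`, `text_to_feature`, `train_backbone`) are hypothetical.

```python
# Toy sketch of the alternating speech/text iterations.
# Every function here is a hypothetical scalar stand-in, not a real model.

def speech_encoder(speech):
    """Frozen stand-in for the first model: speech -> speech feature."""
    return 2.0 * speech


def text_to_feature(text):
    """Frozen stand-in for the second model: text -> speech feature."""
    return float(len(text))


def train_backbone(speech_data, text_data, lr=0.01, epochs=200):
    """Train a one-parameter backbone y = w * feature.

    speech_data: list of (speech, label) pairs
    text_data:   list of (text, label) pairs
    A fixed epoch count stands in for "until the loss converges".
    """
    w = 0.0
    for _ in range(epochs):
        for (speech, y_speech), (text, y_text) in zip(speech_data, text_data):
            # One iteration uses a speech sample, the following one a text sample.
            for feature, label in (
                (speech_encoder(speech), y_speech),
                (text_to_feature(text), y_text),
            ):
                prediction = w * feature
                loss_grad = 2.0 * (prediction - label) * feature  # d/dw of squared error
                w -= lr * loss_grad  # only the backbone parameter is updated
    return w


# Labels follow label = 3 * feature, so training should drive w toward 3.
speech_data = [(1.0, 6.0), (2.0, 12.0)]
text_data = [("ab", 6.0), ("abc", 9.0)]
w = train_backbone(speech_data, text_data)
```

The point of the sketch is that both kinds of sample reach the backbone as speech features, so a single parameter set is updated regardless of whether the feature came from speech or from text.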
Continuing the above example, as shown in Figure 6 (a schematic diagram of the text-speech multimodal learning stage provided in an embodiment of the present application), in the current iteration of the text-speech multimodal learning stage, S`j can be input into the trained speech precoding model to obtain the speech latent space representation H``j of S`j, and H``j is then input into the task backbone model to be trained to obtain the processing result Y``j of S`j. The text-speech multimodal learning loss function can then be applied to Y``j and the label Yj of S`j to obtain the loss LSj, which indicates the difference between Y``j and Yj. LSj is then used to update the parameters of the task backbone model to be trained, yielding the task backbone model with updated parameters. This completes the current iteration.
In the next iteration, T`k can be input into the trained second model from the mapping learning stage to obtain the speech latent space representation H```k of T`k, and H```k is then input into the task backbone model to obtain the processing result Y```k of T`k. The text-speech multimodal learning loss function can then be applied to Y```k and the label Y`k of T`k to obtain the loss LTk, which indicates the difference between Y```k and Y`k. LTk is then used to update the parameters of the task backbone model that was updated in the current iteration, yielding the task backbone model with parameters updated again. This completes the next iteration.
In the iteration after next, S`j+1 or T`k+1 can be used to continue training the task backbone model whose parameters were updated again, until the loss converges, yielding the trained task backbone model. Splicing the trained speech precoding model and the trained task backbone model together then gives the final model used to complete the user's speech task.
It should be understood that, in this embodiment, the second model may be any one of the following: a diffusion model, a generative adversarial network (GAN) model, a sequence-to-sequence model, and so on.
It should also be understood that, in this embodiment, the first model is a part of a fourth model; that is, a portion of the fourth model, for example several of its layers, can be extracted to serve as the first model. The fourth model is usually a model for speech recognition (i.e., a trained speech recognition model, which is a trained neural network model), or the fourth model is a speech pre-training model, and so on.
To further explain the model training method provided by the embodiments of the present application, the method is described below with reference to a specific application example. As shown in Figures 7 and 8 (Figure 7 is a schematic diagram of an application example of the mapping learning stage provided by an embodiment of the present application, and Figure 8 is a schematic diagram of an application example of the text-speech multimodal learning stage provided by an embodiment of the present application), the application example includes:
Assume that the user's speech task is speech translation. A trained speech recognition model, which converts Chinese speech into Chinese text, can be obtained first. The first 4 layers of the speech recognition model can be taken as the trained speech precoding model.
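Taking the first layers of a trained model as the speech precoding model can be sketched as below. The sketch assumes, purely for illustration, that a model is an ordered list of layer functions; `make_layer`, `run`, and the six scale factors are hypothetical stand-ins for a real trained speech recognition network.

```python
# Hypothetical sketch: a "model" is an ordered list of layer functions.

def make_layer(scale):
    """Return a stand-in layer that scales every element of its input."""
    return lambda x: [scale * v for v in x]


def run(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


# Stand-in for a trained six-layer speech recognition model.
asr_model = [make_layer(s) for s in (1.0, 2.0, 0.5, 3.0, 1.0, 4.0)]

# Keep the first 4 layers unchanged; they act as the speech precoding model.
speech_precoder = asr_model[:4]

latent = run(speech_precoder, [1.0, 2.0])
```

Because the slice reuses the already trained layers, the precoder needs no further training before it is used in the later stages.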
To train a model that converts Chinese speech into English text, a diffusion model to be trained and a speech translation backbone model to be trained can be obtained first. A batch of training data containing Chinese speech and the corresponding Chinese text is then obtained. The text is input into the diffusion model to be trained to obtain a speech latent space representation, and the speech is input into the trained speech precoding model to obtain the corresponding speech latent space representation. The parameters of the diffusion model can then be updated based on these two speech latent space representations, completing the training of the diffusion model and yielding the trained diffusion model.
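The mapping learning stage described above can be illustrated with a deliberately simplified sketch. It does not implement a diffusion model: the text-side model is replaced by a two-parameter linear map, the loss is a mean squared error between the two latent representations, and `precoder`, `train_mapping`, and the numeric token lists are hypothetical.

```python
# Simplified stand-in for the mapping learning stage: fit a text-side model so
# that its output matches the latent produced by the frozen speech precoder.

def precoder(speech):
    """Frozen stand-in for the trained speech precoding model."""
    return [0.5 * v for v in speech]


def train_mapping(pairs, lr=0.05, epochs=2000):
    """Fit latent = a * token + b with a mean squared error loss.

    pairs: list of (tokens, speech) where tokens and speech have equal length.
    """
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for tokens, speech in pairs:
            target = precoder(speech)           # latent from the speech side
            pred = [a * t + b for t in tokens]  # latent from the text side
            n = len(tokens)
            grad_a = sum(2 * (p - h) * t for p, h, t in zip(pred, target, tokens)) / n
            grad_b = sum(2 * (p - h) for p, h in zip(pred, target)) / n
            a -= lr * grad_a  # only the text-side model is updated
            b -= lr * grad_b
    return a, b


# Paired data built so that the speech latent equals 2 * token + 1.
pairs = [
    ([0.0, 1.0, 2.0], [2.0, 6.0, 10.0]),
    ([1.0, 3.0], [6.0, 14.0]),
]
a, b = train_mapping(pairs)
```

The speech precoder is never updated here; it only supplies the target that the text-side model learns to reproduce.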
Next, another batch of training data can be obtained, which also contains Chinese speech and Chinese text (the correct English text for the Chinese speech and the correct English text for the Chinese text are known). The Chinese speech is input into the trained speech precoding model to obtain a speech latent space representation, which is then input into the speech translation backbone model to be trained to obtain the predicted English text. Likewise, the Chinese text is input into the trained diffusion model to obtain a speech latent space representation, which is then input into the speech translation backbone model to be trained to obtain the predicted English text. Since the correct English text for both the Chinese speech and the Chinese text is known, the parameters of the speech translation backbone model can be updated based on the predicted English text and the correct English text, completing the training of the speech translation backbone model and yielding the trained speech translation backbone model.
At this point, the output of the speech precoding model can be connected to the input of the speech translation backbone model, and the resulting final model converts Chinese speech into English text.
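Connecting the output of one model to the input of the other amounts to function composition, as in the following sketch. The precoder, the backbone, and the two-word vocabulary are hypothetical toys chosen only to make the example runnable.

```python
# Hypothetical sketch: splice a precoder and a backbone into one final model.

def compose(precoder, backbone):
    """Connect the precoder output to the backbone input."""
    def final_model(speech):
        return backbone(precoder(speech))
    return final_model


def toy_precoder(speech):
    """Stand-in precoder: quantize each sample to an integer feature id."""
    return [int(sample) for sample in speech]


VOCAB = {0: "hello", 1: "world"}


def toy_backbone(features):
    """Stand-in backbone: map feature ids to English words."""
    return " ".join(VOCAB[f] for f in features)


speech_to_text = compose(toy_precoder, toy_backbone)
```

At inference time only speech is fed in, so the text-side model used during training is not part of the composed model.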
In the embodiments of the present application, after the user's speech task is determined, training data associated with the speech task can be obtained first, the training data including a first text and a first speech. The first speech is then processed by the first model to obtain a first speech feature, and the first speech feature is processed by the first model to be trained to obtain the processing result of the first speech. In addition, the first text is processed by the second model to obtain a second speech feature, and the second speech feature is processed by the first model to be trained to obtain the processing result of the first text. Finally, the first model to be trained is trained based on the processing result of the first speech and the processing result of the first text to obtain a third model. The final model composed of the first model and the third model can then be used to complete the user's speech task. In this process, the training data of the first model to be trained consists of speech features (i.e., the first speech feature and the second speech feature), and these features can come from speech (the first speech) as well as from text (the first text). Because converting text into speech features is easier than converting text into speech, the text can usually be converted into correct speech features. A speech model trained on correct speech features (i.e., the final model composed of the first model and the third model) can perform well and can therefore complete the user's speech task accurately, improving the user experience.
The above is a detailed description of the model training method provided in the embodiments of the present application. The model training apparatus provided in the embodiments of the present application is described below. Figure 9 is a schematic structural diagram of the model training apparatus provided in an embodiment of the present application. As shown in Figure 9, the apparatus includes:
an acquisition module 901, configured to acquire training data associated with a speech task, the training data including a first text and a first speech;
a first processing module 902, configured to input the first speech into a first model to obtain a first speech feature;
a second processing module 903, configured to input the first text into a second model to obtain a second speech feature;
a third processing module 904, configured to input the first speech feature into a first model to be trained to obtain a processing result of the first speech;
a fourth processing module 905, configured to input the second speech feature into the first model to be trained to obtain a processing result of the first text; and
a first training module 906, configured to train the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain a third model, where the model composed of the first model and the third model is used to complete the speech task.
In the embodiments of the present application, after the user's speech task is determined, training data associated with the speech task can be obtained first, the training data including a first text and a first speech. The first speech is then processed by the first model to obtain a first speech feature, and the first speech feature is processed by the first model to be trained to obtain the processing result of the first speech. In addition, the first text is processed by the second model to obtain a second speech feature, and the second speech feature is processed by the first model to be trained to obtain the processing result of the first text. Finally, the first model to be trained is trained based on the processing result of the first speech and the processing result of the first text to obtain a third model. The final model composed of the first model and the third model can then be used to complete the user's speech task. In this process, the training data of the first model to be trained consists of speech features (i.e., the first speech feature and the second speech feature), and these features can come from speech (the first speech) as well as from text (the first text). Because converting text into speech features is easier than converting text into speech, the text can usually be converted into correct speech features. A speech model trained on correct speech features (i.e., the final model composed of the first model and the third model) can perform well and can therefore complete the user's speech task accurately, improving the user experience.
In one possible implementation, the first training module 906 is configured to: obtain a first loss based on the processing result of the first speech and the actual processing result of the first speech, the first loss indicating the difference between the two; obtain a second loss based on the processing result of the first text and the actual processing result of the first text, the second loss indicating the difference between the two; and update the parameters of the first model to be trained based on the first loss and the second loss until the model training condition is met, to obtain the third model.
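One way to read "until the model training condition is met" is a loop that stops once the combined loss no longer changes, as sketched below. Summing the first loss and the second loss, the tolerance `tol`, and the names `make_step` and `train_until_converged` are assumptions made for illustration; the present application does not prescribe how the two losses are combined.

```python
# Hypothetical sketch: update on a first (speech) loss and a second (text) loss
# until their sum stops changing.

def make_step(speech_pair, text_pair, lr=0.05):
    """Build a training step for a one-parameter backbone y = w * feature.

    speech_pair / text_pair: (speech_feature, label) tuples.
    Returns the step function and the mutable state holding w.
    """
    state = {"w": 0.0}

    def step():
        losses = []
        grad = 0.0
        for feature, label in (speech_pair, text_pair):
            err = state["w"] * feature - label
            losses.append(err * err)      # squared-error loss for this sample
            grad += 2.0 * err * feature   # accumulate gradient over both losses
        state["w"] -= lr * grad
        return losses[0], losses[1]       # (first_loss, second_loss)

    return step, state


def train_until_converged(step_fn, tol=1e-6, max_iters=10_000):
    """Repeat step_fn until the summed loss changes by less than tol."""
    prev = float("inf")
    total = prev
    for iteration in range(1, max_iters + 1):
        first_loss, second_loss = step_fn()
        total = first_loss + second_loss
        if abs(prev - total) < tol:
            return iteration, total
        prev = total
    return max_iters, total


step, state = make_step(speech_pair=(1.0, 2.0), text_pair=(2.0, 4.0))
iterations, final_loss = train_until_converged(step)
```

Other stopping rules, such as a fixed iteration budget or a validation metric, would fit the same loop.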
In one possible implementation, the training data further includes a second text and a second speech, the second text corresponding to the second speech, and the apparatus further includes: a fifth processing module, configured to input the second speech into the first model to obtain a third speech feature; a sixth processing module, configured to input the second text into a second model to be trained to obtain a fourth speech feature; and a second training module, configured to train the second model to be trained based on the third speech feature and the fourth speech feature to obtain the second model.
In one possible implementation, the second training module is configured to: obtain a third loss based on the third speech feature and the fourth speech feature, the third loss indicating the difference between the third speech feature and the fourth speech feature; and update the parameters of the second model to be trained based on the third loss until the model training condition is met, to obtain the second model.
In one possible implementation, the second model is any one of the following: a diffusion model, a generative adversarial network model, or a sequence-to-sequence model.
In one possible implementation, the first model is a part of a fourth model, and the fourth model is a model for speech recognition, or the fourth model is a speech pre-training model.
In one possible implementation, the speech task is any one of the following: speech translation, speech recognition, speech command, or speech dialogue.
It should be noted that the information exchange and execution processes between the modules/units of the above apparatus are based on the same concept as the method embodiments of the present application and bring the same technical effects. For details, refer to the description in the method embodiments above, which is not repeated here.
The embodiments of the present application also relate to a training device. Figure 10 is a schematic structural diagram of the training device provided in an embodiment of the present application. As shown in Figure 10, the training device 1000 may take the form of a mobile phone, a tablet, a laptop computer, a smart wearable device, a server, or the like, which is not limited here. The training device 1000 can implement the model training function of the embodiment corresponding to Figure 4. Specifically, the training device 1000 includes a receiver 1001, a transmitter 1002, a processor 1003, and a memory 1004 (the training device 1000 may contain one or more processors 1003; Figure 10 takes one processor as an example), where the processor 1003 may include an application processor 10031 and a communication processor 10032. In some embodiments of the present application, the receiver 1001, the transmitter 1002, the processor 1003, and the memory 1004 may be connected via a bus or by other means.
The memory 1004 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1003. A portion of the memory 1004 may also include a non-volatile random access memory (NVRAM). The memory 1004 stores operation instructions for the processor, executable modules, or data structures, or a subset or an extended set thereof, where the operation instructions may include various instructions for implementing various operations.
The processor 1003 controls the operation of the training device. In a specific application, the components of the training device are coupled together through a bus system, which may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus. For clarity, however, the various buses are all referred to as the bus system in the figure.
The method disclosed in the above embodiments of the present application can be applied to the processor 1003 or implemented by the processor 1003. The processor 1003 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 1003 or by instructions in the form of software. The processor 1003 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The processor 1003 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or it may be any conventional processor. The steps of the method disclosed in the embodiments of the present application can be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1004, and the processor 1003 reads the information in the memory 1004 and completes the steps of the above method in combination with its hardware.
The receiver 1001 can be used to receive input numeric or character information and to generate signal input related to the settings and function control of the training device. The transmitter 1002 can be used to output numeric or character information through a first interface; the transmitter 1002 can also be used to send instructions to a disk group through the first interface to modify data in the disk group; and the transmitter 1002 may also include a display device such as a display screen.
In the embodiments of the present application, in one case, the processor 1003 is configured to complete model training through the model training architecture of the embodiment corresponding to Figure 4 and to provide the resulting final model to a user device, so that the user can complete the desired speech task on the user device based on the final model, that is, execute the task processing method.
The embodiments of the present application also relate to a computer storage medium that stores a program for signal processing. When the program runs on a computer, it causes the computer to perform the steps performed by the aforementioned training device.
The embodiments of the present application also relate to a computer program product that stores instructions. When executed by a computer, the instructions cause the computer to perform the steps performed by the aforementioned execution device, or cause the computer to perform the steps performed by the aforementioned training device.
The training device or terminal device provided in the embodiments of the present application may specifically be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute computer-executable instructions stored in a storage unit, so that the chip in the execution device performs the data processing method described in the above embodiments, or so that the chip in the training device performs the data processing method described in the above embodiments. Optionally, the storage unit is a storage unit inside the chip, such as a register or a cache. The storage unit may also be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, refer to Figure 11, which is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be a neural network processor, NPU 1100. The NPU 1100 is mounted on the host CPU as a coprocessor, and the host CPU assigns tasks to it. The core of the NPU is the operation circuit 1103; the controller 1104 controls the operation circuit 1103 to fetch matrix data from memory and perform multiplication.
In some implementations, the operation circuit 1103 internally includes multiple processing engines (PEs). In some implementations, the operation circuit 1103 is a two-dimensional systolic array. The operation circuit 1103 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1103 is a general-purpose matrix processor.
For example, assume an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data of matrix B from the weight memory 1102 and caches it on each PE in the operation circuit. The operation circuit then fetches the data of matrix A from the input memory 1101 and performs the matrix operation with matrix B; the partial or final results of the resulting matrix are stored in the accumulator 1108.
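The role of the accumulator in this example can be illustrated in software. The sketch below is not a model of the NPU hardware: it simply computes C = A x B by processing the inner dimension in tiles and adding each partial product into an accumulator matrix; the function name `matmul_accumulate` and the tile size are hypothetical.

```python
# Illustrative only: tiled matrix multiplication with an explicit accumulator.

def matmul_accumulate(A, B, tile=2):
    """Compute A x B, accumulating partial results tile by tile.

    A is n x k, B is k x m, both given as lists of rows.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0.0] * m for _ in range(n)]  # accumulator for partial results
    for k0 in range(0, k, tile):         # walk the inner dimension in tiles
        k1 = min(k0 + tile, k)
        for i in range(n):
            for j in range(m):
                for kk in range(k0, k1):
                    acc[i][j] += A[i][kk] * B[kk][j]
    return acc                           # final result once all tiles are done


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = matmul_accumulate(A, B)
```

After the first tile the accumulator holds only partial sums; it holds the final matrix C once the last tile has been added.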
The unified memory 1106 is used to store input data and output data. Weight data is transferred directly into the weight memory 1102 through the direct memory access controller (DMAC) 1105. Input data is likewise transferred into the unified memory 1106 through the DMAC.
BIU stands for Bus Interface Unit, that is, the bus interface unit 1013, which handles the interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1109.
The bus interface unit (BIU) 1013 is used by the instruction fetch buffer 1109 to obtain instructions from the external memory, and is also used by the direct memory access controller 1105 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data from the external DDR memory to the unified memory 1106, to transfer weight data to the weight memory 1102, or to transfer input data to the input memory 1101.
The vector calculation unit 1107 includes multiple operation processing units and, when necessary, further processes the output of the operation circuit 1103, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. It is mainly used for computation in the non-convolutional and non-fully-connected layers of a neural network, such as batch normalization, pixel-level summation, and upsampling of the predicted label plane.
In some implementations, the vector calculation unit 1107 can store the processed output vector in the unified memory 1106. For example, the vector calculation unit 1107 can apply a linear function or a nonlinear function to the output of the operation circuit 1103, such as linear interpolation of the predicted label plane extracted by a convolutional layer, or a function applied to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 1107 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1103, for example for use in a subsequent layer of the neural network.
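The kind of post-processing attributed to the vector calculation unit can be illustrated with two common operations applied to a vector of accumulated values. The choice of batch normalization without learned scale and shift, followed by a ReLU activation, is an assumption made for this sketch; the function names are hypothetical.

```python
# Illustrative only: normalize a vector of accumulated values, then apply a
# nonlinear function to produce activation values.
import math


def batch_norm(values, eps=1e-5):
    """Normalize to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def relu(values):
    """Nonlinear function: clamp negative entries to zero."""
    return [max(0.0, v) for v in values]


accumulated = [1.0, 2.0, 3.0, 4.0]   # stand-in for one row of accumulator output
normalized = batch_norm(accumulated)
activations = relu(normalized)       # could feed a subsequent layer
```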
The instruction fetch buffer 1109, connected to the controller 1104, is used to store instructions used by the controller 1104.
The unified memory 1106, the input memory 1101, the weight memory 1102, and the instruction fetch buffer 1109 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned anywhere above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling execution of the above programs.
It should also be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present application, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines.
From the description of the above implementations, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, or by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, and dedicated components. In general, any function performed by a computer program can readily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, for example analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is the better choice in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to execute the methods described in the embodiments of the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a training device or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).
Claims (17)
- A model training method, characterized in that the method comprises:
  acquiring training data associated with a speech task, wherein the training data includes a first text and a first speech;
  inputting the first speech into a first model to obtain a first speech feature;
  inputting the first text into a second model to obtain a second speech feature;
  inputting the first speech feature into a first model to be trained to obtain a processing result of the first speech;
  inputting the second speech feature into the first model to be trained to obtain a processing result of the first text;
  training the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain a third model, wherein a model composed of the first model and the third model is used to complete the speech task.
- The method according to claim 1, characterized in that training the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain the third model comprises:
  obtaining a first loss based on the processing result of the first speech and a ground-truth processing result of the first speech, wherein the first loss indicates the difference between the processing result of the first speech and the ground-truth processing result of the first speech;
  obtaining a second loss based on the processing result of the first text and a ground-truth processing result of the first text, wherein the second loss indicates the difference between the processing result of the first text and the ground-truth processing result of the first text;
  updating parameters of the first model to be trained based on the first loss and the second loss until a model training condition is met, to obtain the third model.
- The method according to claim 1 or 2, characterized in that the training data further includes a second text and a second speech, the second text corresponds to the second speech, and the method further comprises:
  inputting the second speech into the first model to obtain a third speech feature;
  inputting the second text into a second model to be trained to obtain a fourth speech feature;
  training the second model to be trained based on the third speech feature and the fourth speech feature to obtain the second model.
- The method according to claim 3, characterized in that training the second model to be trained based on the third speech feature and the fourth speech feature to obtain the second model comprises:
  obtaining a third loss based on the third speech feature and the fourth speech feature, wherein the third loss indicates the difference between the third speech feature and the fourth speech feature;
  updating parameters of the second model to be trained based on the third loss until a model training condition is met, to obtain the second model.
- The method according to any one of claims 1 to 4, characterized in that the second model is any one of the following: a diffusion generative model, a generative adversarial network model, or a sequence-to-sequence model.
- The method according to any one of claims 1 to 5, characterized in that the first model is a part of a fourth model, and the fourth model is a model for performing speech recognition, or the fourth model is a speech pre-training model.
- The method according to any one of claims 1 to 6, characterized in that the speech task is any one of the following: speech translation, speech recognition, speech command, or speech dialogue.
- A model training apparatus, characterized in that the apparatus comprises:
  an acquisition module, configured to acquire training data associated with a speech task, wherein the training data includes a first text and a first speech;
  a first processing module, configured to input the first speech into a first model to obtain a first speech feature;
  a second processing module, configured to input the first text into a second model to obtain a second speech feature;
  a third processing module, configured to input the first speech feature into a first model to be trained to obtain a processing result of the first speech;
  a fourth processing module, configured to input the second speech feature into the first model to be trained to obtain a processing result of the first text;
  a first training module, configured to train the first model to be trained based on the processing result of the first speech and the processing result of the first text to obtain a third model, wherein a model composed of the first model and the third model is used to complete the speech task.
- The apparatus according to claim 8, characterized in that the first training module is configured to:
  obtain a first loss based on the processing result of the first speech and a ground-truth processing result of the first speech, wherein the first loss indicates the difference between the processing result of the first speech and the ground-truth processing result of the first speech;
  obtain a second loss based on the processing result of the first text and a ground-truth processing result of the first text, wherein the second loss indicates the difference between the processing result of the first text and the ground-truth processing result of the first text;
  update parameters of the first model to be trained based on the first loss and the second loss until a model training condition is met, to obtain the third model.
- The apparatus according to claim 8 or 9, characterized in that the training data further includes a second text and a second speech, the second text corresponds to the second speech, and the apparatus further comprises:
  a fifth processing module, configured to input the second speech into the first model to obtain a third speech feature;
  a sixth processing module, configured to input the second text into a second model to be trained to obtain a fourth speech feature;
  a second training module, configured to train the second model to be trained based on the third speech feature and the fourth speech feature to obtain the second model.
- The apparatus according to claim 10, characterized in that the second training module is configured to:
  obtain a third loss based on the third speech feature and the fourth speech feature, wherein the third loss indicates the difference between the third speech feature and the fourth speech feature;
  update parameters of the second model to be trained based on the third loss until a model training condition is met, to obtain the second model.
- The apparatus according to any one of claims 8 to 11, characterized in that the second model is any one of the following: a diffusion generative model, a generative adversarial network model, or a sequence-to-sequence model.
- The apparatus according to any one of claims 8 to 12, characterized in that the first model is a part of a fourth model, and the fourth model is a model for performing speech recognition, or the fourth model is a speech pre-training model.
- The apparatus according to any one of claims 8 to 13, characterized in that the speech task is any one of the following: speech translation, speech recognition, speech command, or speech dialogue.
- A model training apparatus, characterized in that the apparatus comprises a memory and a processor; the memory stores code, the processor is configured to execute the code, and when the code is executed, the model training apparatus performs the method according to any one of claims 1 to 7.
- A computer storage medium, characterized in that the computer storage medium stores one or more instructions which, when executed by one or more computers, cause the one or more computers to implement the method according to any one of claims 1 to 7.
- A computer program product, characterized in that the computer program product stores instructions which, when executed by a computer, cause the computer to implement the method according to any one of claims 1 to 7.
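The two training stages recited in the method claims can be sketched with a deliberately tiny example. The sketch below is not the applicant's implementation: every model is reduced to a single scalar weight, squared error stands in for the unspecified loss functions, a fixed number of steps stands in for the "model training condition", and all names (`first_model`, `train_second_model`, `train_first_model_to_be_trained`) and sample values are hypothetical.

```python
def first_model(speech):
    """Frozen speech encoder (toy stand-in): speech -> speech feature."""
    return 0.5 * speech


def train_second_model(paired_data, lr=0.05, steps=200):
    """Stage A: train the text -> speech-feature model on paired data.

    The third loss is the squared difference between the third speech
    feature (from the second speech) and the fourth speech feature
    (from the corresponding second text).
    """
    w2 = 0.0  # parameter of the second model to be trained
    for _ in range(steps):
        for text, speech in paired_data:
            third_feature = first_model(speech)
            fourth_feature = w2 * text
            grad = 2.0 * (fourth_feature - third_feature) * text
            w2 -= lr * grad
    return w2


def train_first_model_to_be_trained(w2, speech_data, text_data,
                                    lr=0.05, steps=200):
    """Stage B: train the task model on both speech and text inputs.

    The first loss compares the speech branch with its ground truth and
    the second loss compares the text branch with its ground truth; the
    parameter is updated with the gradient of their sum.
    """
    w3 = 0.0  # parameter of the first model to be trained
    total_loss = 0.0
    for _ in range(steps):
        for (speech, speech_truth), (text, text_truth) in zip(speech_data,
                                                              text_data):
            first_feature = first_model(speech)   # first speech feature
            second_feature = w2 * text            # second speech feature
            first_result = w3 * first_feature
            second_result = w3 * second_feature
            first_loss = (first_result - speech_truth) ** 2
            second_loss = (second_result - text_truth) ** 2
            total_loss = first_loss + second_loss
            grad = (2.0 * (first_result - speech_truth) * first_feature
                    + 2.0 * (second_result - text_truth) * second_feature)
            w3 -= lr * grad
    return w3, total_loss


# Hypothetical paired (text, speech) samples for stage A.
w2 = train_second_model([(1.0, 4.0), (2.0, 8.0)])
# Hypothetical (input, ground truth) samples for stage B; the text sample
# does not need a matching speech recording.
w3, final_loss = train_first_model_to_be_trained(
    w2, speech_data=[(2.0, 3.0)], text_data=[(0.5, 3.0)])


def speech_task(speech):
    """Inference: the first model followed by the trained third model."""
    return w3 * first_model(speech)
```

The point of the design, as far as the claims describe it, is that the second model lets text-only samples reach the task model as speech-like features, so the task model can be trained on text as well as speech while inference still runs on speech alone.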
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310127981.5 | 2023-01-30 | ||
CN202310127981.5A CN116312489A (en) | 2023-01-30 | 2023-01-30 | Model training method and related equipment thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024160186A1 (en) | 2024-08-08 |
Family
ID=86795239
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2024/074585 WO2024160186A1 (en) | 2023-01-30 | 2024-01-30 | Model training method and related device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116312489A (en) |
WO (1) | WO2024160186A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116312489A (en) * | 2023-01-30 | 2023-06-23 | 华为技术有限公司 | Model training method and related equipment thereof |
CN118098235B (en) * | 2024-04-23 | 2024-08-23 | 荣耀终端有限公司 | Wake-up word recognition method, model training method and electronic equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170206891A1 (en) * | 2016-01-16 | 2017-07-20 | Genesys Telecommunications Laboratories, Inc. | Material selection for language model customization in speech recognition for speech analytics |
CN111261144A (en) * | 2019-12-31 | 2020-06-09 | 华为技术有限公司 | Voice recognition method, device, terminal and storage medium |
CN111477216A (en) * | 2020-04-09 | 2020-07-31 | 南京硅基智能科技有限公司 | Training method and system for pronunciation understanding model of conversation robot |
CN111540345A (en) * | 2020-05-09 | 2020-08-14 | 北京大牛儿科技发展有限公司 | Weakly supervised speech recognition model training method and device |
CN111583913A (en) * | 2020-06-15 | 2020-08-25 | 深圳市友杰智新科技有限公司 | Model training method and device for speech recognition and speech synthesis and computer equipment |
WO2020168752A1 (en) * | 2019-02-22 | 2020-08-27 | 平安科技(深圳)有限公司 | Speech recognition and speech synthesis method and apparatus based on dual learning |
CN111754985A (en) * | 2020-07-06 | 2020-10-09 | 上海依图信息技术有限公司 | Method and device for training voice recognition model and voice recognition |
CN116312489A (en) * | 2023-01-30 | 2023-06-23 | 华为技术有限公司 | Model training method and related equipment thereof |
-
2023
- 2023-01-30 CN CN202310127981.5A patent/CN116312489A/en active Pending
-
2024
- 2024-01-30 WO PCT/CN2024/074585 patent/WO2024160186A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
CN116312489A (en) | 2023-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2024160186A1 (en) | Model training method and related device | |
CN112183718A (en) | Deep learning training method and device for computing equipment | |
CN113065633B (en) | Model training method and associated equipment | |
WO2023284716A1 (en) | Neural network searching method and related device | |
WO2023020613A1 (en) | Model distillation method and related device | |
WO2022111387A1 (en) | Data processing method and related apparatus | |
WO2024213099A1 (en) | Data processing method and apparatus | |
CN113627163A (en) | Attention model, feature extraction method and related device | |
WO2024179485A1 (en) | Image processing method and related device thereof | |
WO2024179503A1 (en) | Speech processing method and related device | |
WO2024175014A1 (en) | Image processing method and related device thereof | |
WO2024199404A1 (en) | Consumption prediction method and related device | |
CN115238909A (en) | Data value evaluation method based on federal learning and related equipment thereof | |
WO2024188171A1 (en) | Image processing method and related device thereof | |
WO2024114659A1 (en) | Summary generation method and related device | |
WO2024140973A1 (en) | Action counting method and related device | |
WO2024109910A1 (en) | Generative model training method and apparatus and data conversion method and apparatus | |
WO2024067113A1 (en) | Action prediction method and related device thereof | |
WO2024061123A1 (en) | Image processing method and image processing related device | |
WO2024046144A1 (en) | Video processing method and related device thereof | |
WO2023185541A1 (en) | Model training method and related device | |
WO2023197857A1 (en) | Model partitioning method and related device thereof | |
WO2023045949A1 (en) | Model training method and related device | |
WO2023020185A1 (en) | Image classification method and related device | |
CN116739154A (en) | Fault prediction method and related equipment thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24749665; Country of ref document: EP; Kind code of ref document: A1 |