WO2022121257A1 - Model training method and apparatus, speech recognition method and apparatus, device, and storage medium - Google Patents

Model training method and apparatus, speech recognition method and apparatus, device, and storage medium Download PDF

Info

Publication number
WO2022121257A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
recognition model
speech recognition
sequences
sequence
Prior art date
Application number
PCT/CN2021/097411
Other languages
French (fr)
Chinese (zh)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022121257A1 publication Critical patent/WO2022121257A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Definitions

  • the present application relates to the technical field of model construction in artificial intelligence, and in particular, to a model training method, a speech recognition method, an apparatus, a device and a storage medium.
  • Automatic Speech Recognition is a technology that converts speech into text.
  • speech recognition is used in various industries related to the Internet, communication, smart home, etc.
  • speech recognition models are usually used for automatic speech recognition.
  • the text data sample is usually obtained in the following manner: organize a large number of people to listen to the speech data and transcribe the correct text data.
  • training the speech recognition model on more and more speech data and corresponding text data improves its accuracy, but this makes the labeling labor cost and resource investment time-consuming, expensive and inefficient.
  • the main purpose of the present application is to provide a model training method, speech recognition method, apparatus, equipment and storage medium, aiming at improving the training effect and training efficiency of the speech recognition model.
  • the present application provides a model training method, including:
  • the first training samples include a first speech sequence and a first text corresponding to the marked first speech sequence
  • the second training samples include a second speech sequence
  • the present application also provides a speech recognition method, comprising:
  • the target speech recognition model is obtained by training according to the above-mentioned model training method.
  • the present application also provides a model training device, the model training device comprising:
  • an acquisition module configured to acquire a plurality of first training samples and a plurality of second training samples, where the first training samples include a first speech sequence and a first text corresponding to the marked first speech sequence, and the second training samples include a second speech sequence;
  • a first training module configured to iteratively train a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model
  • a fusion module configured to fuse the first speech recognition model with the preset language model to obtain a second speech recognition model
  • an input module configured to input a plurality of the second speech sequences into the second speech recognition model, to obtain a second text and a fusion score corresponding to each of the second speech sequences;
  • a screening module configured to screen out a target speech sequence from the plurality of second speech sequences according to the fusion score of each of the second speech sequences
  • a second training module configured to iteratively train the second preset speech recognition model according to each of the target speech sequences, the second text corresponding to each of the target speech sequences, and a plurality of the first training samples, to obtain the target speech recognition model.
  • the present application also provides a computer device, the computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein when the computer program is executed by the processor, the steps of the above-mentioned model training method or speech recognition method are implemented.
  • the present application further provides a computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the steps of the above-mentioned model training method or speech recognition method are implemented.
  • the present application provides a model training method, a speech recognition method, an apparatus, a device and a storage medium.
  • the present application obtains a plurality of first training samples and a plurality of second training samples, wherein the first training samples include a first speech sequence and the annotated first text corresponding to the first speech sequence, and the second training samples include a second speech sequence; then, according to the plurality of first training samples, the first preset speech recognition model is iteratively trained to obtain a first speech recognition model.
  • the first speech recognition model is fused with a preset language model to obtain a second speech recognition model; multiple second speech sequences are then input into the second speech recognition model to obtain the second text and fusion score corresponding to each second speech sequence; according to the fusion score of each second speech sequence, the target speech sequences are screened out from the multiple second speech sequences; and according to each target speech sequence, the second text corresponding to each target speech sequence and the plurality of first training samples, the second preset speech recognition model is iteratively trained to obtain the target speech recognition model.
  • This application trains the "teacher-noisy student" self-training learning model through multiple labeled first training samples and multiple unlabeled second training samples, which can greatly improve the training effect of the speech recognition model, reduce the number of labeled training samples required, and improve the training efficiency of the speech recognition model.
  • FIG. 1 is a schematic flowchart of steps of a model training method provided by an embodiment of the present application
  • Fig. 2 is the sub-step flow schematic diagram of the model training method in Fig. 1;
  • FIG. 3 is a schematic diagram of a scenario for implementing the model training method provided by this embodiment
  • FIG. 4 is a schematic flowchart of steps of a speech recognition method provided by an embodiment of the present application.
  • FIG. 5 is a schematic block diagram of a model training apparatus provided by an embodiment of the present application.
  • Fig. 6 is a schematic block diagram of sub-modules of the model training device in Fig. 5;
  • FIG. 7 is a schematic block diagram of a speech recognition apparatus provided by an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
  • Embodiments of the present application provide a model training method, a speech recognition method, an apparatus, a device, and a storage medium.
  • the model training method can be applied to a terminal device or a server, and the terminal device can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device; the server can be a single server , or a server cluster consisting of multiple servers.
  • the following takes the model training method applied to the server as an example for explanation.
  • FIG. 1 is a schematic flowchart of steps of a model training method provided by an embodiment of the present application.
  • the model training method includes steps S101 to S106.
  • Step S101 Acquire a plurality of first training samples and a plurality of second training samples, where the first training samples include a first speech sequence and a first text corresponding to the marked first speech sequence, and the second training samples include a second speech sequence.
  • the first training sample includes the first speech sequence and the first text corresponding to the first speech sequence, where the first text is a label of the corresponding first speech sequence, and the second training sample includes the second speech sequence.
  • the first voice sequence and the second voice sequence are audio data
  • the first text corresponding to the first voice sequence is the text content recognized by the voice of the first voice sequence.
  • for example, if the first speech sequence is a song, the corresponding first text is its lyrics.
  • the Noisy Student Training (NST) model is a semi-supervised learning model composed of "teacher” and "student".
  • the teacher model (the first preset speech recognition model) is used to learn from the marked first training samples and to predict the unlabeled second training samples, obtaining labeled second training samples and the second text corresponding to each second training sample; the student model (the second preset speech recognition model) is then trained on the labeled first training samples, the labeled second training samples and the corresponding second texts, and the above process is iterated.
  • Through "teacher-noisy student" self-training learning, the training effect of the speech recognition model can be greatly improved.
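  • As an illustration, the iterative teacher/noisy-student procedure described above can be sketched as follows (a minimal Python sketch; the callables `train_model`, `pseudo_label` and `filter_samples` are hypothetical stand-ins for the training, prediction and screening steps of this application, not part of it):

```python
def noisy_student_training(labeled, unlabeled, train_model, pseudo_label,
                           filter_samples, rounds=3):
    """Minimal sketch of the teacher / noisy-student loop.

    labeled   : list of (speech_sequence, text) pairs
    unlabeled : list of speech sequences without labels
    """
    # Step 1: train the teacher on the labeled first training samples.
    teacher = train_model(labeled)
    for _ in range(rounds):
        # Step 2: teacher predicts pseudo-labels for the unlabeled samples.
        pseudo = [(x, pseudo_label(teacher, x)) for x in unlabeled]
        # Step 3: keep only high-quality pseudo-labeled samples.
        kept = filter_samples(pseudo)
        # Step 4: train the (noisy) student on labeled + pseudo-labeled data.
        student = train_model(labeled + kept)
        # Step 5: the student becomes the teacher for the next iteration.
        teacher = student
    return teacher
```

The loop terminates after a fixed number of rounds here; in practice a convergence or performance criterion such as the ones described below could be used instead.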
  • the total audio length of the plurality of first training samples is higher than a first preset time threshold, and the total audio length of the plurality of second training samples is higher than a second preset time threshold, which ensures the recognition accuracy of the speech recognition model obtained in subsequent training.
  • the second preset time threshold is higher than the first preset time threshold.
  • the first preset time threshold and the second preset time threshold may be set according to actual application scenarios; for example, the first preset time threshold is 100h and the second preset time threshold is 500h, which will not be repeated here.
  • the above-mentioned related information such as the first training samples and the second training samples can also be stored in a blockchain.
  • the blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms.
  • A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Step S102 Perform iterative training on a first preset speech recognition model according to a plurality of first training samples to obtain a first speech recognition model.
  • the first preset speech recognition model is the "teacher" model; a plurality of first training samples are input into the first preset speech recognition model to obtain a speech recognition result corresponding to each first training sample, and according to the speech recognition result and the corresponding first text of each first training sample, the parameters of the first preset speech recognition model are adjusted until a first speech recognition model whose performance meets the preset training conditions is obtained.
  • the preset training condition may be that the recognition accuracy is higher than the preset accuracy threshold.
  • both the preset training conditions and the preset accuracy threshold can be set according to actual application scenarios.
  • the preset accuracy threshold is 0.9, which is not specifically limited here.
  • the first preset speech recognition model is, for example, a LAS (Listen, Attend and Spell) model, and the LAS model includes a Listen (listening) layer, an Attend (attention) layer and a Spell (spelling) layer.
  • the first preset speech recognition model extracts the speech signal features of the input first training sample, encodes them through the Listen layer, attends to different parts of the input at different times through the Attend layer, and finally decodes through the Spell layer to obtain the speech recognition result of the first training sample.
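  • As an illustration of the Attend step, a single dot-product attention read over the Listen layer's encoder outputs can be sketched as follows (a toy numpy sketch under assumed dimensions; it is not the exact attention mechanism of this application):

```python
import numpy as np

def attend(decoder_state, encoder_outputs):
    """One dot-product attention step: weight the encoder outputs
    (Listen layer, shape (T, d)) by their relevance to the current
    decoder state (Spell layer, shape (d,)) and return a context vector."""
    scores = encoder_outputs @ decoder_state       # relevance per time step, (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over the T time steps
    context = weights @ encoder_outputs            # weighted sum of encodings, (d,)
    return context, weights
```

At each decoding time the resulting context vector is what lets the Spell layer focus on different parts of the input.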
  • data enhancement is performed on a plurality of first training samples; according to the plurality of first training samples after data enhancement, the first preset speech recognition model is iteratively trained until the first preset speech recognition model is Convergence to obtain the first speech recognition model.
  • the number of samples of the first training sample can be increased through data augmentation, for example by normalizing the channel length, superimposing clean audio and noise signals to synthesize noisy audio, or applying speed perturbation to the original audio.
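  • The noise-superposition and speed-perturbation augmentations mentioned above can be sketched as follows (a minimal numpy illustration; the SNR parameter and the simple index-grid resampling are assumptions, not specified by this application):

```python
import numpy as np

def add_noise(clean, noise, snr_db=10.0):
    """Superimpose a noise signal onto clean audio at a target
    signal-to-noise ratio (in dB) to synthesize noisy audio."""
    noise = np.resize(noise, clean.shape)          # cycle noise to clean length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def speed_perturb(audio, factor=1.1):
    """Crude speed perturbation: resample the waveform on a stretched
    index grid, shortening (factor > 1) or lengthening (factor < 1) it."""
    idx = np.arange(0, len(audio), factor)
    return np.interp(idx, np.arange(len(audio)), audio)
```

Each augmented waveform keeps the original transcript, so one labeled sample yields several training samples.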
  • iterative training of the first preset speech recognition model reference may be made to the foregoing embodiments.
  • the convergence of the first preset speech recognition model may be that the performance meets the preset training conditions, that the number of iterations is greater than a preset number of iterations, and/or that the iteration duration is greater than a preset iteration duration, etc., which is not specifically limited in this embodiment.
  • a more accurate speech recognition result improves the training effect of the target speech recognition model.
  • SpecAugment is used to perform data augmentation on the plurality of first training samples, and the robustness of the first speech recognition model is increased by adding noise to the first preset speech recognition model.
  • each first speech sequence is converted into a spectrogram, and time deformation, frequency masking and/or time masking are performed on the plurality of spectrograms by SpecAugment.
  • the spectrogram of the first speech sequence can be enhanced through SpecAugment, which can increase the training speed of the first speech recognition model on the first training samples, thereby improving the training efficiency of the target speech recognition model.
  • time warping of multiple spectrograms through SpecAugment refers to warping the mel spectrogram along the time axis: taking time as the x-axis and frequency as the y-axis, a random point on the horizontal line through the center of the mel spectrogram is displaced.
  • frequency masking of multiple spectrograms refers to masking the consecutive frequency channels [f0, f0+f) on the frequency axis of the mel spectrogram, where f is drawn from a uniform distribution from 0 to F; time masking refers to masking the consecutive time steps [t0, t0+t) on the time axis of the mel spectrogram, where t is drawn from a uniform distribution from 0 to T.
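  • The frequency and time masking described above can be sketched as follows (a minimal numpy illustration of SpecAugment-style masking; time warping is omitted, and the mask parameters F and T are illustrative values, not from this application):

```python
import numpy as np

def spec_augment(spec, F=8, T=20, rng=None):
    """Apply one frequency mask and one time mask to a (freq, time)
    log-mel spectrogram, returning a masked copy."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    f = rng.integers(0, F + 1)            # mask width, uniform in [0, F]
    f0 = rng.integers(0, n_freq - f + 1)
    spec[f0:f0 + f, :] = 0.0              # frequency mask over [f0, f0+f)
    t = rng.integers(0, T + 1)            # mask width, uniform in [0, T]
    t0 = rng.integers(0, n_time - t + 1)
    spec[:, t0:t0 + t] = 0.0              # time mask over [t0, t0+t)
    return spec
```

Because the masking is applied on the fly, each epoch sees a differently perturbed view of the same spectrogram.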
  • noise is added to the first preset speech recognition model; according to a plurality of first training samples, the first preset speech recognition model with noise added is iteratively trained until the first preset speech recognition model with noise added The model converges, and a first speech recognition model is obtained.
  • data enhancement is performed on a plurality of first training samples, and noise is added to the first preset speech recognition model; according to the plurality of first training samples after data enhancement, the noise-added first preset speech recognition model is iteratively trained until it converges, and the first speech recognition model is obtained.
  • Step S103 fuse the first speech recognition model with the preset language model to obtain a second speech recognition model.
  • the preset language model is a pre-trained language model (Language Model), and the preset language model is, for example, a statistical language model, a feedforward neural network language model, a recurrent neural network language model, and the like.
  • the obtained second speech recognition model has better performance, which is beneficial to improving the training effect of the target speech recognition model and makes the speech recognition accuracy of the target speech recognition model higher.
  • the data volume of the training samples of the language model is much larger than the data volume of the first training samples of the first speech recognition model
  • the fusion of the first speech recognition model and the preset language model helps the second speech recognition model perform semantic information modeling; the fusion methods include Voting (voting), Averaging (averaging), Bagging (bootstrap aggregating), Boosting, etc., which are not specifically limited in this embodiment.
  • Step S104 inputting multiple second speech sequences into the second speech recognition model to obtain the second text and fusion score corresponding to each second speech sequence.
  • a plurality of second training samples are input into the second speech recognition model, and a speech recognition result corresponding to each second speech sequence is obtained, and the speech recognition result includes the second text corresponding to the second speech sequence and the fusion score.
  • the second speech recognition model is a LAS (Listen, Attend and Spell) model, including a Listen (listen) layer, an Attend (attention) layer and a Spell (spelling) layer.
  • the second speech sequence is a sound signal feature vector x of length T. After the sound signal feature vector x of length T is input into the second speech recognition model, the Listen layer retains the content related to the sound signal and removes the content unrelated to the sound signal.
  • since the second speech recognition model is obtained by fusing the trained first speech recognition model (LAS) with the language model (LM), the weighted summation of the character distribution probabilities of the LAS model and the LM model yields the fusion score corresponding to the second speech sequence.
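  • The weighted summation of the two models' character distribution probabilities (often called shallow fusion) can be sketched as follows (a minimal illustration; the fusion weight `lam` is an assumed hyperparameter, and the per-character log-probabilities are taken as given):

```python
import numpy as np

def fusion_score(las_log_probs, lm_log_probs, lam=0.3):
    """Fusion score of a hypothesis text: the sum over its characters of
    the LAS log-probability plus lam times the LM log-probability."""
    las = np.asarray(las_log_probs, dtype=float)
    lm = np.asarray(lm_log_probs, dtype=float)
    return float(np.sum(las + lam * lm))
```

A higher fusion score indicates that both the acoustic model and the language model consider the hypothesis likely, which is what the screening step below exploits.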
  • Step S105 screen out the target speech sequence from the plurality of second speech sequences according to the fusion score of each second speech sequence.
  • For the second texts corresponding to the second speech sequences output by the second speech recognition model, it is necessary to filter out target speech sequences that meet preset conditions; according to the fusion score of each second speech sequence, target speech sequences can be selected from the multiple second speech sequences.
  • the screened-out target speech sequences can be used as high-quality training data for the "student" model (the second preset speech recognition model), thereby improving the training effect of the second preset speech recognition model.
  • step S105 includes: sub-steps S1051 to S1052 .
  • Sub-step S1051 Filter the multiple second voice sequences according to the preset score threshold and the fusion score of each second voice sequence to obtain multiple candidate voice sequences.
  • the preset score threshold can be flexibly set by the user: the second speech sequences whose fusion scores are greater than or equal to the preset score threshold are retained, and the second speech sequences whose fusion scores are less than the preset score threshold are filtered out, obtaining multiple candidate speech sequences. It should be noted that the second text corresponding to a second speech sequence with a high fusion score has a higher accuracy rate, so retaining such second speech sequences is conducive to screening out high-quality second speech sequences.
  • the speech recognition results of the second speech recognition model may be affected by factors such as sequence length, so that the accuracy of the second text and the fusion score corresponding to each second speech sequence are inconsistent. Therefore, the fusion score of each second speech sequence is regularized, the regularized fusion score is compared with the preset score threshold, the second speech sequences whose fusion scores are less than the preset score threshold are filtered out, and multiple high-quality candidate speech sequences are obtained.
  • the regularization formula is, for example: S̄_i = (S_i − α − β·l_i) / σ, where l is the character length of the second speech sequence, α and β are the parameters obtained by performing linear regression on (l_i, S_i) of the multiple second speech sequences, and σ is the calculated standard deviation of the residuals.
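  • Under the assumption that α and β come from a linear least-squares fit of fusion score against character length and σ is the standard deviation of the fit residuals (a reading of the regularization step above, not a formula confirmed by this application), the normalization can be sketched as:

```python
import numpy as np

def normalize_scores(lengths, scores):
    """Regularize fusion scores against character length: fit
    S ≈ alpha + beta * l by least squares, then return the residuals
    scaled by their standard deviation."""
    l = np.asarray(lengths, dtype=float)
    s = np.asarray(scores, dtype=float)
    beta, alpha = np.polyfit(l, s, 1)        # slope, intercept of the fit
    resid = s - (alpha + beta * l)           # length-corrected scores
    sigma = resid.std() + 1e-12
    return resid / sigma
```

The normalized scores are then comparable across sequences of different lengths, so a single preset score threshold can be applied to all of them.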
  • the preset score threshold may decrease as the number of iterations increases; as the preset score threshold becomes smaller and smaller during iterative training, more and more candidate speech sequences can be used as training samples for the target speech recognition model.
  • Sub-step S1052 Screen out the target speech sequence from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples.
  • selecting a target speech sequence from a plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples includes: generating a plurality of speech sequence sets according to the plurality of candidate speech sequences, wherein each the speech sequence sets include at least one candidate speech sequence; determine the probability distribution information of each speech sequence set; according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set, from the plurality of speech sequence sets Select the target speech sequence set.
  • the target speech sequence set includes at least one target speech sequence.
  • It should be noted that the distributions of the filtered candidate speech sequences may differ considerably from that of the first training samples, which would affect the performance of the second preset speech recognition model; therefore, a target speech sequence set whose probability distribution information is similar to that of the multiple first training samples is found from the multiple speech sequence sets, and at least one target speech sequence in the target speech sequence set is used as a training sample for the second preset speech recognition model, which can improve the performance of the second preset speech recognition model and thus the training effect of the target speech recognition model.
  • multiple batches are randomly selected from multiple candidate speech sequences to generate multiple speech sequence sets, and each batch includes at least one candidate speech sequence.
  • Each candidate speech sequence carries attribute information, and the attribute information carried by multiple candidate speech sequences can form the probability distribution information of a speech sequence set.
  • the probability distribution information is determined according to the specific service scenario; for example, the probability distribution information may be the length of the audio, the ratio of male and female speakers, the age of the speaker, the surrounding environment, etc. The probability distribution information corresponding to each speech sequence set is compared with the probability distribution information of the multiple first training samples to find the target speech sequence set whose probability distribution information approximates that of the multiple first training samples.
  • the K-L divergence of each speech sequence set is calculated according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set; according to the K-L divergence of each speech sequence set, a target speech sequence set is selected from the multiple speech sequence sets. The lower the K-L divergence of a speech sequence set, the closer its probability distribution information is to that of the multiple first training samples; the speech sequence set with the lowest K-L divergence is selected as the target speech sequence set, which includes at least one target speech sequence.
  • the K-L divergence (Kullback-Leibler divergence) is calculated as D_KL(P‖Q) = Σ_i P(i)·log(P(i)/Q(i)), where f(M(U)) denotes the speech sequence set, P(i) is the probability distribution information of the plurality of first training samples, and Q(i) is the probability distribution information of the speech sequence set.
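  • The K-L-divergence-based selection described above can be sketched as follows (a minimal numpy illustration; the distributions are assumed to be given as normalized histograms over a shared binning of the attribute information):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)).
    A small eps avoids log(0) for empty bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def select_target_set(labeled_dist, candidate_dists):
    """Return the index of the speech sequence set whose attribute
    distribution is closest (lowest K-L divergence) to that of the
    labeled first training samples."""
    divs = [kl_divergence(labeled_dist, q) for q in candidate_dists]
    return int(np.argmin(divs))
```

The set at the returned index is the target speech sequence set used to train the student model.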
  • Step S106 Perform iterative training on the second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and a plurality of first training samples to obtain a target speech recognition model.
  • the preset performance condition is determined according to the speech recognition accuracy and speech recognition speed of the student speech recognition model. In practical applications, the preset performance conditions may also be set according to actual application scenarios.
  • the second preset speech recognition model is initialized with a plurality of first training samples to ensure convergence on the training data. The initialized speech recognition model is then trained through multiple target speech sequences to obtain a target speech recognition model with a better training effect and high speech recognition accuracy.
  • the second preset speech recognition model is, for example, a LAS (Listen, Attend and Spell) model
  • the LAS model includes a Listen (listen) layer, an Attend (attention) layer and a Spell (spelling) layer.
  • multiple third training samples are generated according to each target speech sequence and the second text corresponding to each target speech sequence; a training sample set is obtained from the multiple third training samples and the multiple first training samples; and through the training sample set, iterative training is performed on the second preset speech recognition model until a preset condition is reached, and the target speech recognition model is obtained.
  • the preset conditions may be that the performance meets the preset training conditions, the number of iterations is greater than the preset number of iterations, and/or the iteration duration is greater than the preset iteration duration, etc., which are not specifically limited in this embodiment of the present application.
  • FIG. 3 is a schematic diagram of a scenario for implementing the model training method provided by this embodiment.
  • a plurality of first training samples and a plurality of second training samples are obtained, where the first training samples include the first speech sequence and the first text corresponding to the marked first speech sequence, and the second training samples include the second speech sequence. The plurality of first training samples are then input into the first preset speech recognition model 10 to iteratively train it and obtain the first speech recognition model 20; the preset language model 30 is fused with the first speech recognition model 20 to obtain the second speech recognition model 40; the second speech sequences in the plurality of second training samples are input into the second speech recognition model 40 to obtain the second text and fusion score corresponding to each second speech sequence; according to the fusion score of each second speech sequence, the target speech sequences are screened out from the multiple second speech sequences; and each target speech sequence, the second text corresponding to each target speech sequence and the plurality of first training samples are input into the second preset speech recognition model 50 to iteratively train it and obtain the target speech recognition model 60.
  • a plurality of first training samples and a plurality of second training samples are obtained, wherein the first training samples include the first speech sequence and the first text corresponding to the marked first speech sequence, and the second training samples include the second speech sequence; then, according to the plurality of first training samples, the first preset speech recognition model is iteratively trained to obtain the first speech recognition model, and the first speech recognition model and the preset language model are fused to obtain the second speech recognition model; multiple second speech sequences are then input into the second speech recognition model to obtain the second text and fusion score corresponding to each second speech sequence; according to the fusion score of each second speech sequence, the target speech sequences are selected from the plurality of second speech sequences; and the second preset speech recognition model is iteratively trained according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples, to obtain the target speech recognition model.
  • This application trains the "teacher-noisy student" self-training learning model through multiple labeled first training samples and multiple unlabeled second training samples, which can greatly improve the training effect of the speech recognition model, reduce the number of labeled training samples required, and improve the training efficiency of the speech recognition model.
  • FIG. 4 is a schematic flowchart of steps of a speech recognition method provided by an embodiment of the present application.
  • the speech recognition method includes steps S201 and S202.
  • Step S201: acquire a speech sequence to be recognized.
  • the speech sequence to be recognized is a piece of speech data sent by a user in a social application.
  • Step S202: perform speech recognition on the speech sequence through the target speech recognition model to obtain text information corresponding to the speech sequence.
  • the target speech recognition model is obtained by training according to the model training method described in the foregoing embodiment.
  • user A receives a voice sequence sent by user B through the social application of the terminal device, performs voice recognition on the voice sequence through the target voice recognition model, and obtains the text information "Hello" (speech recognition result).
  • In the speech recognition method provided by the above embodiment, a speech sequence to be recognized is acquired, and speech recognition is performed on it through the target speech recognition model described in the foregoing embodiments to obtain the text information corresponding to the speech sequence; because the target speech recognition model is trained as a "teacher-noisy student" self-training learning model, the accuracy of speech recognition can be effectively improved.
  • FIG. 5 is a schematic block diagram of a model training apparatus provided by an embodiment of the present application.
  • the model training apparatus 300 includes: an acquisition module 301 , a first training module 302 , a fusion module 303 , an input module 304 , a screening module 305 and a second training module 306 .
  • the acquisition module 301 is configured to acquire a plurality of first training samples and a plurality of second training samples, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence;
  • a first training module 302, configured to perform iterative training on a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model;
  • a fusion module 303, configured to fuse the first speech recognition model with the preset language model to obtain a second speech recognition model;
  • the input module 304 is used for inputting a plurality of second speech sequences into the second speech recognition model to obtain the second text and fusion score corresponding to each second speech sequence;
  • the screening module 305 is used for screening out the target speech sequence from the plurality of second speech sequences according to the fusion score of each second speech sequence;
  • the second training module 306 is configured to iteratively train the second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples, to obtain the target speech recognition model.
  • the screening module 305 includes:
  • the filtering sub-module 3051 is configured to filter the plurality of second speech sequences according to a preset score threshold and the fusion score of each second speech sequence to obtain a plurality of candidate speech sequences;
  • the screening sub-module 3052 is configured to screen out a target speech sequence from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples.
  • the screening sub-module 3052 is further configured to: divide the plurality of candidate speech sequences into a plurality of speech sequence sets, where each speech sequence set includes at least one candidate speech sequence; and select a target speech sequence set from the plurality of speech sequence sets.
  • the screening sub-module 3052 is further configured to select a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set.
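As an illustration only, the two-stage screening performed by sub-modules 3051 and 3052 can be sketched as below. Interpreting the "probability distribution information" as a histogram of sequence lengths is an assumption of this sketch; the embodiment does not fix a concrete statistic, and all names are hypothetical.

```python
from collections import Counter

# Hypothetical sketch of the screening module: sub-module 3051 filters
# pseudo-labeled sequences by a preset fusion-score threshold, and
# sub-module 3052 then selects target sequences whose length histogram
# roughly follows that of the labeled (first) training samples.

def filter_by_score(pseudo_labeled, threshold):
    """3051: keep (sequence, text, score) triples whose score clears the threshold."""
    return [(seq, text, score) for seq, text, score in pseudo_labeled
            if score >= threshold]

def select_by_distribution(candidates, labeled_lengths, per_bucket):
    """3052: greedily pick high-score candidates, capped per length bucket,
    so the selection mirrors the labeled data's length distribution."""
    wanted = Counter(labeled_lengths)
    picked, used = [], Counter()
    for seq, text, _ in sorted(candidates, key=lambda c: -c[2]):
        n = len(seq)
        if n in wanted and used[n] < per_bucket:
            picked.append((seq, text))
            used[n] += 1
    return picked

pseudo = [("abcd", "t1", 0.95), ("ab", "t2", 0.40),
          ("wxyz", "t3", 0.80), ("mn", "t4", 0.90)]
cands = filter_by_score(pseudo, 0.5)                 # drops ("ab", "t2", 0.40)
targets = select_by_distribution(cands, [4, 2], per_bucket=1)
```

The per-bucket cap is what keeps the student's pseudo-labeled data from being dominated by whatever the teacher happens to score highest, mirroring the distribution-matching intent of sub-module 3052.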
  • the first training module 302 is further configured to: perform data augmentation on the plurality of first training samples; and iteratively train the first preset speech recognition model according to the plurality of first training samples after data augmentation, until the first preset speech recognition model converges, to obtain the first speech recognition model.
  • the second training module 306 is further configured to iteratively train the second preset speech recognition model until a preset condition is reached, to obtain the target speech recognition model.
  • FIG. 7 is a schematic block diagram of a speech recognition apparatus provided by an embodiment of the present application.
  • the speech recognition device 400 includes:
  • the acquiring module 401 is used for acquiring the speech sequence to be recognized.
  • the recognition module 402 is configured to perform speech recognition on the speech sequence through a target speech recognition model to obtain text information corresponding to the speech sequence.
  • the target speech recognition model is obtained by training according to the model training method described in the foregoing embodiment.
  • the apparatuses provided in the above embodiments may be implemented in the form of a computer program, and the computer program may be executed on the computer device as shown in FIG. 8 .
  • FIG. 8 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
  • the computer device can be a server or a terminal device.
  • the computer device includes a processor, a memory and a network interface connected through a system bus, wherein the memory may include a storage medium and an internal memory, and the storage medium may be non-volatile or volatile.
  • the storage medium may store an operating system and a computer program.
  • the computer program includes program instructions that, when executed, can cause the processor to execute any model training method or speech recognition method.
  • the processor is used to provide computing and control capabilities to support the operation of the entire computer equipment.
  • the internal memory provides an environment for running the computer program in the storage medium; the computer program, when executed by the processor, causes the processor to execute any model training method or speech recognition method.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the processor may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • the processor is configured to run a computer program stored in the memory to implement the following steps:
  • the first training samples include a first speech sequence and a first text corresponding to the marked first speech sequence
  • the second training samples include a second speech sequence
  • when selecting the target speech sequence from the plurality of second speech sequences according to the fusion score of each second speech sequence, the processor is configured to implement:
  • a target speech sequence is selected from the plurality of candidate speech sequences.
  • when selecting the target speech sequence from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples, the processor is configured to implement:
  • each of the speech sequence sets includes at least one of the candidate speech sequences
  • a target speech sequence set is selected from the plurality of the speech sequence sets.
  • when selecting the target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set, the processor is configured to implement:
  • a target speech sequence set is selected from a plurality of the speech sequence sets.
  • when implementing the iterative training of the first preset speech recognition model according to the plurality of first training samples to obtain the first speech recognition model, the processor is configured to implement:
  • the first preset speech recognition model is iteratively trained according to the plurality of first training samples after data enhancement, until the first preset speech recognition model converges, and a first speech recognition model is obtained.
  • when implementing the iterative training of the second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples to obtain the target speech recognition model, the processor is configured to implement:
  • the second preset speech recognition model is iteratively trained until a preset condition is reached, and the target speech recognition model is obtained.
  • the processor is configured to run a computer program stored in a memory to implement the following steps:
  • the target speech recognition model is obtained by training according to the model training method described in the foregoing embodiment.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, wherein when the computer program is executed by a processor, the following steps are implemented:
  • the first training samples include a first speech sequence and a first text corresponding to the marked first speech sequence
  • the second training samples include a second speech sequence
  • the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.


Abstract

A model training method and apparatus, a speech recognition method and apparatus, a device, and a storage medium. The method comprises: performing iterative training on a first preset speech recognition model according to a plurality of first training samples to obtain a first speech recognition model (S102); fusing the first speech recognition model with a preset language model to obtain a second speech recognition model (S103); inputting second speech sequences in a plurality of second training samples into the second speech recognition model to obtain second text and a fusion score corresponding to each second speech sequence (S104); selecting target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence (S105); and performing iterative training on a second preset speech recognition model according to each target speech sequence, the second text corresponding to the target speech sequence, and the plurality of first training samples, to obtain a target speech recognition model (S106). The method allows for improvement of the speech recognition model training efficiency.

Description

Model training method, speech recognition method, apparatus, device and storage medium
This application claims priority to the Chinese patent application No. 202011453446.1, filed with the China Patent Office on December 11, 2020 and entitled "Model Training Method, Speech Recognition Method, Apparatus, Device and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of model construction in artificial intelligence, and in particular to a model training method, a speech recognition method, an apparatus, a device and a storage medium.
Background
Automatic Speech Recognition (ASR) is a technology that converts speech into text. As an important technology in the field of artificial intelligence, speech recognition is applied in industries related to the Internet, communications, smart homes and the like, and a speech recognition model is usually used to perform automatic speech recognition. To train a speech recognition model, a large amount of speech data, together with text data corresponding to the speech data, needs to be prepared. In the prior art, the text data samples are obtained by organizing a large number of people to listen to the speech data and write down the correct text. However, the inventors realized that, with advances in algorithms and computing power, speech recognition models allow more and more speech data and corresponding text data to be used in training to improve accuracy, which makes labor cost the bottleneck of resource investment: devoting large amounts of manual labor to labeling speech data is time-consuming, expensive and inefficient.
Summary of the Invention
The main purpose of the present application is to provide a model training method, a speech recognition method, an apparatus, a device and a storage medium, aiming to improve the training effect and training efficiency of a speech recognition model.
In a first aspect, the present application provides a model training method, including:
acquiring a plurality of first training samples and a plurality of second training samples, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence;
performing iterative training on a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model;
fusing the first speech recognition model with a preset language model to obtain a second speech recognition model;
inputting the plurality of second speech sequences into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence;
screening out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence;
performing iterative training on a second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples, to obtain a target speech recognition model.
In a second aspect, the present application further provides a speech recognition method, including:
acquiring a speech sequence to be recognized;
performing speech recognition on the speech sequence through a target speech recognition model to obtain text information corresponding to the speech sequence;
where the target speech recognition model is trained according to the model training method described above.
In a third aspect, the present application further provides a model training apparatus, including:
an acquisition module, configured to acquire a plurality of first training samples and a plurality of second training samples, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence;
a first training module, configured to perform iterative training on a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model;
a fusion module, configured to fuse the first speech recognition model with a preset language model to obtain a second speech recognition model;
an input module, configured to input the plurality of second speech sequences into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence;
a screening module, configured to screen out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence;
a second training module, configured to perform iterative training on a second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples, to obtain a target speech recognition model.
In a fourth aspect, the present application further provides a computer device, including a processor, a memory, and a computer program stored on the memory and executable by the processor, where the computer program, when executed by the processor, implements the steps of the model training method or the speech recognition method described above.
In a fifth aspect, the present application further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the model training method or the speech recognition method described above.
The present application provides a model training method, a speech recognition method, an apparatus, a device and a storage medium. A plurality of first training samples and a plurality of second training samples are acquired, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence. A first preset speech recognition model is iteratively trained according to the plurality of first training samples to obtain a first speech recognition model, which is fused with a preset language model to obtain a second speech recognition model. The plurality of second speech sequences are then input into the second speech recognition model to obtain the second text and fusion score corresponding to each second speech sequence; target speech sequences are screened out from the plurality of second speech sequences according to the fusion scores; and a second preset speech recognition model is iteratively trained according to each target speech sequence, the corresponding second text, and the plurality of first training samples to obtain a target speech recognition model. By training the "teacher-noisy student" self-training learning model with a plurality of labeled first training samples and a plurality of unlabeled second training samples, the present application can greatly improve the training effect of the speech recognition model, reduce the number of labeled training samples required, and improve the training efficiency of the speech recognition model.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of the steps of a model training method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of sub-steps of the model training method in FIG. 1;
FIG. 3 is a schematic diagram of a scenario for implementing the model training method provided by this embodiment;
FIG. 4 is a schematic flowchart of the steps of a speech recognition method provided by an embodiment of the present application;
FIG. 5 is a schematic block diagram of a model training apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic block diagram of sub-modules of the model training apparatus in FIG. 5;
FIG. 7 is a schematic block diagram of a speech recognition apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
The realization of the purpose, functional characteristics and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The flowcharts shown in the drawings are only illustrative and do not necessarily include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, combined or partially merged, so the actual execution order may change according to the actual situation. In addition, although functional modules are divided in the schematic diagrams of the apparatuses, in some cases the modules may be divided differently.
The embodiments of the present application provide a model training method, a speech recognition method, an apparatus, a device and a storage medium. The model training method can be applied to a terminal device or a server; the terminal device may be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant or a wearable device, and the server may be a single server or a server cluster composed of multiple servers. The following explanation takes the application of the model training method to a server as an example.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The embodiments described below and the features in the embodiments may be combined with each other without conflict.
Please refer to FIG. 1, which is a schematic flowchart of the steps of a model training method provided by an embodiment of the present application.
As shown in FIG. 1, the model training method includes steps S101 to S106.
Step S101: acquire a plurality of first training samples and a plurality of second training samples, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence.
Each first training sample includes a first speech sequence and a first text corresponding to the first speech sequence, the first text being the label of the corresponding first speech sequence; each second training sample includes a second speech sequence. It should be noted that the first speech sequence and the second speech sequence are audio data, and the first text corresponding to a first speech sequence is the text content recognized from that speech. For example, if the first speech sequence is a song, the corresponding first text is its lyrics.
The Noisy Student Training (NST) model is a semi-supervised learning model composed of a "teacher" and a "student". The teacher model (the first preset speech recognition model) learns from the labeled first training samples and predicts the unlabeled second training samples, yielding labeled second training samples with the corresponding second texts. The student model (the second preset speech recognition model) is then trained on the labeled first training samples, the labeled second training samples and the corresponding second texts, and the above process is iterated. Through "teacher-noisy student" self-training, the training effect of the speech recognition model can be greatly improved.
In one embodiment, the total audio length of the plurality of first training samples is greater than a first preset time threshold, and the total audio length of the plurality of second training samples is greater than a second preset time threshold, which helps ensure the recognition accuracy of the subsequently trained speech recognition model. Further, the second preset time threshold is greater than the first preset time threshold. In practical applications, both thresholds can be set according to the actual application scenario; for example, the first preset time threshold is 100 h and the second preset time threshold is 500 h, which will not be repeated here.
It should be noted that, to further ensure the privacy and security of the above-mentioned first training samples, second training samples and other related information, such information may also be stored in a node of a blockchain. The technical solution of the present application is also applicable to adding other data files stored on a blockchain. The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer and an application service layer.
Step S102: perform iterative training on the first preset speech recognition model according to the plurality of first training samples to obtain the first speech recognition model.
第一预设语音识别模型为“教师”模型,将多个第一训练样本输入至第一预设语音识别模型,获得每个第一训练样本各自对应的语音识别结果,并根据每个第一训练样本各自对 应的语音识别结果和对应的第一文本对第一预设语音识别模型的参数进行调整,直至获得性能符合预设训练条件的第一语音识别模型。The first preset speech recognition model is a "teacher" model, and a plurality of first training samples are input into the first preset speech recognition model to obtain a speech recognition result corresponding to each first training sample, and according to each first training sample, the corresponding speech recognition results are obtained. The speech recognition results corresponding to the training samples and the corresponding first text respectively adjust the parameters of the first preset speech recognition model until a first speech recognition model whose performance meets the preset training conditions is obtained.
For example, if the performance metric is recognition accuracy, the preset training condition may be that the recognition accuracy exceeds a preset accuracy threshold. It should be noted that both the preset training condition and the preset accuracy threshold can be set according to the actual application scenario; for example, the preset accuracy threshold may be 0.9, which is not specifically limited here.
The first preset speech recognition model is, for example, a LAS (Listen, Attend and Spell) model, which includes a Listen layer, an Attend layer, and a Spell layer. The first preset speech recognition model extracts speech-signal features from an input first training sample, encodes those features in the Listen layer, uses the Attend layer to attend to different parts of the input at different time steps, and finally decodes with the Spell layer to obtain the speech recognition result of the first training sample.
In one embodiment, data augmentation is performed on the plurality of first training samples, and the first preset speech recognition model is iteratively trained on the augmented samples until it converges, yielding the first speech recognition model. It should be noted that data augmentation increases the number of first training samples, for example through vocal-tract-length normalization, synthesizing noisy audio by superimposing clean audio with noise signals, or speed perturbation of the original audio. For the specific implementation of the iterative training, reference may be made to the foregoing embodiments; convergence of the first preset speech recognition model may mean that its performance meets the preset training condition, that the number of iterations exceeds a preset iteration count, and/or that the training time exceeds a preset duration, which is not specifically limited in this embodiment. Data augmentation injects noise into the inputs of the first preset speech recognition model, so that the subsequent second preset speech recognition model (the student model) is forced to work harder to learn the speech recognition results output by the first preset speech recognition model (the teacher model), improving the training effect of the target speech recognition model.
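As an illustration of one of the augmentation techniques mentioned above, the following is a minimal sketch of speed perturbation by resampling a waveform with NumPy; the function name, the perturbation factors, and the toy signal are illustrative assumptions, not the implementation used in this application.

```python
import numpy as np

def speed_perturb(waveform, factor):
    """Resample a 1-D waveform to simulate a speed change.

    factor > 1.0 speeds the audio up (fewer output samples),
    factor < 1.0 slows it down (more output samples).
    """
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal that map to each output sample.
    src = np.linspace(0.0, len(waveform) - 1, num=n_out)
    return np.interp(src, np.arange(len(waveform)), waveform)

wave = np.sin(np.linspace(0, 20 * np.pi, 1600))  # toy 1600-sample signal
fast = speed_perturb(wave, 1.1)   # ~10% faster -> shorter signal
slow = speed_perturb(wave, 0.9)   # ~10% slower -> longer signal
```

Both perturbed copies can then be added to the training set alongside the original, multiplying the effective number of first training samples.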
Further, SpecAugment is used to perform data augmentation on the plurality of first training samples; injecting noise into the first preset speech recognition model in this way increases the robustness of the first speech recognition model. Specifically, each first speech sequence is converted into a spectrogram, and time warping, frequency masking, and/or time masking are applied to the spectrograms by SpecAugment. Enhancing the spectrograms of the first speech sequences with SpecAugment before the iterative training can accelerate the training of the first speech recognition model on the first training samples, thereby improving the training efficiency of the target speech recognition model.
It should be noted that applying time warping to the spectrograms via SpecAugment means the following: for a mel spectrogram with τ time steps, with time on the x-axis and frequency on the y-axis, a random point on the horizontal line through the center of the spectrogram, within the time range (W, τ−W), is warped to the left or to the right. Applying frequency masking means that, on the frequency axis of consecutive mel spectrograms, the band [f0, f0+f] is masked, where f is drawn uniformly from 0 to a parameter F. Applying time masking means that, on the time axis of consecutive mel spectrograms, the span [t0, t0+t] is masked, where t is drawn uniformly from 0 to a parameter T. When SpecAugment is used to augment the plurality of first training samples, the robustness and performance of the first preset speech recognition model improve as iterative training proceeds; the strength of SpecAugment can then be increased to inject more noise into the model's inputs, further improving the training effect of the first speech recognition model.
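The frequency- and time-masking operations described above can be sketched as follows (a minimal NumPy version applied to a toy mel spectrogram; the array sizes and the parameter values F and T are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def freq_mask(spec, F):
    """Zero out a band [f0, f0+f) on the frequency axis (rows);
    f is drawn uniformly from 0..F."""
    out = spec.copy()
    f = rng.integers(0, F + 1)
    f0 = rng.integers(0, out.shape[0] - f + 1)
    out[f0:f0 + f, :] = 0.0
    return out

def time_mask(spec, T):
    """Zero out a span [t0, t0+t) on the time axis (columns);
    t is drawn uniformly from 0..T."""
    out = spec.copy()
    t = rng.integers(0, T + 1)
    t0 = rng.integers(0, out.shape[1] - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out

spec = np.ones((80, 200))   # toy mel spectrogram: 80 mel bins x 200 frames
aug = time_mask(freq_mask(spec, F=27), T=100)
```

Increasing F and T (and the number of masks applied) corresponds to the stronger SpecAugment setting mentioned above for later stages of training.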
In one embodiment, noise is added to the first preset speech recognition model, which is then iteratively trained on the plurality of first training samples until the noise-injected model converges, yielding the first speech recognition model.
Exemplarily, Dropout is used to add noise to the first preset speech recognition model: in each training pass, some hidden neurons of the neural network are randomly deactivated, their outputs are set to 0, and their weights are not updated. For example, with a dropout ratio p, each hidden neuron is deactivated with probability p. In noisy-student training, adding noise to the first preset speech recognition model via Dropout therefore forces the second speech recognition model (the student model) to work harder to learn the speech recognition results output by the first preset speech recognition model (the teacher model), improving the training effect of the target speech recognition model.
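The dropout mechanism described above can be sketched as follows. This is an illustrative NumPy version of the common "inverted" dropout variant, in which surviving units are rescaled by 1/(1−p) so the expected activation is unchanged; the rescaling is an assumption of this sketch, not stated in the application.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p):
    """Drop each unit with probability p (output 0, weight not updated);
    scale survivors by 1/(1-p) to keep the expected value unchanged."""
    keep = rng.random(activations.shape) >= p
    return activations * keep / (1.0 - p)

h = np.ones(10000)          # toy hidden-layer activations
out = dropout(h, p=0.3)     # dropout ratio p = 0.3
```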
In one embodiment, data augmentation is performed on the plurality of first training samples and noise is added to the first preset speech recognition model; the noise-injected model is then iteratively trained on the augmented first training samples until it converges, yielding the first speech recognition model. Augmenting the first training samples and adding noise to the first preset speech recognition model makes the parameters of the resulting first speech recognition model more accurate, thereby improving the training effect of the subsequent target speech recognition model.
Step S103: Fuse the first speech recognition model with a preset language model to obtain a second speech recognition model.
The preset language model is a pre-trained language model (LM), for example a statistical language model, a feed-forward neural-network language model, or a recurrent neural-network language model. Fusing the first speech recognition model with the preset language model yields a second speech recognition model with better performance, which helps improve the training effect of the target speech recognition model and makes its speech recognition more accurate.
In one embodiment, the amount of training data for the language model is far larger than that of the first training samples for the first speech recognition model, so fusing the first speech recognition model with the preset language model helps the second speech recognition model capture semantic information. Fusion methods include Voting, Averaging, Bagging (bootstrap aggregating), and Boosting, among others, which are not specifically limited in this embodiment.
Step S104: Input the plurality of second speech sequences into the second speech recognition model to obtain, for each second speech sequence, a corresponding second text and a fusion score.
The plurality of second training samples are input into the second speech recognition model to obtain a speech recognition result for each second speech sequence; the result includes the second text corresponding to the second speech sequence and a fusion score. The second speech recognition model thus predicts on the plurality of second speech sequences and outputs, for each of them, a second text and a fusion score, so that second speech sequences meeting a preset condition can be selected from among them.
Exemplarily, the second speech recognition model is a LAS (Listen, Attend and Spell) model, including a Listen layer, an Attend layer, and a Spell layer. The second speech sequence is an acoustic feature vector x of length T. After x is input into the second speech recognition model, the Listen layer retains the content relevant to the speech signal and removes noise unrelated to it; the Listen layer is, for example, a bidirectional LSTM network that outputs a feature sequence h = BiLSTM(x) of length T. In the Attend layer, a scaled-attention mechanism can be used: the hidden state S_t of the RNN in the Attend layer at the current time step is obtained, and the context vector at the current time step is computed from the feature sequence h output by the Listen layer and the hidden state S_t, i.e. C_t = Attention(S_t, h). In the Spell layer, an RNN is used as the decoder: given the hidden state, Spell-layer output, and context vector of the previous time step, the current hidden state is computed as S_t = RNN(S_{t-1}, Y_{t-1}, C_{t-1}), and the current output is then passed through a softmax network to produce the character distribution probability corresponding to the second speech sequence (the distribution probability of the second text), Y_t = CharacterDistribution(S_t). Since the second speech recognition model is obtained by fusing the trained first speech recognition model (LAS) with the language model (LM), the character distribution probabilities of the LAS model and the LM model can be combined by a weighted sum to obtain the fusion score of the second speech sequence. For example, the fusion score is S = log p(Y_t = k) = log p_LAS(Y_t = k) + β·log p_LM(Y_t = k), where β is a hyperparameter weighting the language-model contribution for the second speech sequence, k is the character with the highest distribution probability at time t, log p_LAS(Y_t = k) is the log character probability output by the LAS model for the second speech sequence, and log p_LM(Y_t = k) is the log character probability output by the LM model.
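The per-character fusion score S = log p_LAS(Y_t = k) + β·log p_LM(Y_t = k) described above can be sketched as follows; the toy four-character vocabulary, the probability values, and the β value are illustrative assumptions.

```python
import numpy as np

def fused_score(log_p_las, log_p_lm, beta):
    """Shallow-fusion score per character:
    S = log p_LAS(Y_t = k) + beta * log p_LM(Y_t = k)."""
    return log_p_las + beta * log_p_lm

# Toy distributions over a 4-character vocabulary at one decoding step t.
p_las = np.array([0.1, 0.6, 0.2, 0.1])   # acoustic (LAS) character distribution
p_lm  = np.array([0.2, 0.5, 0.2, 0.1])   # language-model character distribution
scores = fused_score(np.log(p_las), np.log(p_lm), beta=0.3)
k = int(np.argmax(scores))               # character chosen at time t
```

Summing these per-step scores over a hypothesis gives the sequence-level fusion score that step S105 below uses for filtering.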
Step S105: Select target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence.
Among the second texts corresponding to the second speech sequences output by the second speech recognition model, target speech sequences meeting a preset condition need to be selected. Using the fusion score of each second speech sequence, target speech sequences can be screened out from the plurality of second speech sequences; these target speech sequences can serve as high-quality training data for the "student" model (the second preset speech recognition model), thereby improving its training effect.
In one embodiment, as shown in FIG. 2, step S105 includes sub-steps S1051 to S1052.
Sub-step S1051: Filter the plurality of second speech sequences according to a preset score threshold and the fusion score of each second speech sequence, to obtain a plurality of candidate speech sequences.
In one embodiment, the preset score threshold can be set flexibly by the user: second speech sequences whose fusion score is greater than or equal to the preset score threshold are retained, and those whose fusion score is below it are discarded, yielding a plurality of candidate speech sequences. It should be noted that a second speech sequence with a high fusion score tends to have a more accurate corresponding second text, so retaining such sequences helps select high-quality second speech sequences.
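The threshold filtering in sub-step S1051 reduces to keeping every pseudo-labelled sequence whose fusion score reaches the threshold; a minimal sketch follows, where the sequence identifiers, scores, and threshold value are illustrative.

```python
def filter_by_score(sequences, scores, threshold):
    """Keep pseudo-labelled sequences whose fusion score is >= threshold."""
    return [seq for seq, s in zip(sequences, scores) if s >= threshold]

seqs = ["utt1", "utt2", "utt3", "utt4"]
fusion_scores = [-0.4, -1.8, -0.9, -2.5]   # log-domain scores, higher is better
kept = filter_by_score(seqs, fusion_scores, threshold=-1.0)
```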
In one embodiment, because the second speech sequences differ in sentence length, the recognition results of the second speech recognition model are affected, making the accuracy of the second texts and fusion scores inconsistent across sequences. Therefore, the fusion scores of the second speech sequences are first normalized, and the normalized scores are then compared with the preset score threshold to discard second speech sequences whose score falls below it, yielding a plurality of high-quality candidate speech sequences.
The normalization formula is:

S_norm = (S − μ·l − β) / σ

where l is the character length of the second speech sequence, μ and β are parameters obtained by linear regression over the pairs (l_i, S_i) of the plurality of second speech sequences, and σ is the standard deviation of the residuals S_i − μ·l_i − β. In some embodiments, the preset score threshold may decrease as training time increases; as the threshold becomes smaller over successive iterations, more and more candidate speech sequences become usable as training samples for the target speech recognition model.
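The length normalization described above can be sketched as follows, assuming μ and β come from a least-squares fit of score against character length and σ is the standard deviation of the fit residuals; the sample lengths and scores are illustrative.

```python
import numpy as np

def normalize_scores(scores, lengths):
    """Length-normalize fusion scores: fit S ~ mu*l + beta by linear
    regression, then standardize the residuals by their std sigma."""
    mu, beta = np.polyfit(lengths, scores, deg=1)   # slope, intercept
    residuals = scores - (mu * lengths + beta)
    sigma = residuals.std()
    return residuals / sigma

lengths = np.array([10.0, 20.0, 30.0, 40.0, 50.0])       # character lengths l_i
scores = np.array([-5.1, -9.8, -15.2, -19.9, -25.3])     # fusion scores S_i
norm = normalize_scores(scores, lengths)
```

After this step, one threshold applies uniformly regardless of sentence length, since the normalized scores are centered at zero with unit spread.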
Sub-step S1052: Select the target speech sequence from the plurality of candidate speech sequences according to probability distribution information of the plurality of first training samples.
In one embodiment, selecting target speech sequences from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples includes: generating a plurality of speech sequence sets from the candidate speech sequences, each set including at least one candidate speech sequence; determining probability distribution information for each speech sequence set; and selecting a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the first training samples and of each speech sequence set. The target speech sequence set includes at least one target speech sequence. It should be noted that the distributions of the filtered candidate speech sequences can differ considerably; using the filtered candidate speech sequences directly as training samples for the second preset speech recognition model would hurt its performance. Therefore, a target speech sequence set whose probability distribution approximates that of the first training samples is found among the speech sequence sets, and the target speech sequences in that set are used as training samples for the second preset speech recognition model, which improves the performance of the second preset speech recognition model and hence the training effect of the target speech recognition model.
Specifically, multiple batches are randomly drawn from the candidate speech sequences to generate the plurality of speech sequence sets, each batch including at least one candidate speech sequence. Each candidate speech sequence carries attribute information, and the attribute information of the candidate sequences in a set constitutes that set's probability distribution information. Which attributes the distribution describes depends on the specific business scenario, for example audio length, the ratio of male to female speakers, speaker age, or the surrounding environment. The probability distribution information of each speech sequence set is compared with that of the plurality of first training samples to find the target speech sequence set whose distribution approximates that of the first training samples.
In one embodiment, the K-L divergence of each speech sequence set is calculated from the probability distribution information of the plurality of first training samples and the probability distribution information of that set, and the target speech sequence set is selected from the plurality of speech sequence sets according to these K-L divergences. The lower the K-L divergence of a speech sequence set, the closer its probability distribution is to that of the first training samples; the set with the lowest K-L divergence is therefore selected as the target speech sequence set, which includes at least one target speech sequence. By computing the K-L divergence of each speech sequence set, the set whose distribution best approximates that of the first training samples can be found accurately.
The K-L divergence is calculated as follows:

D_KL(P‖Q) = Σ_i P(i)·log( P(i) / Q(i) )

where f(M(U)) denotes a speech sequence set, P(i) is the probability distribution information of the plurality of first training samples, and Q(i) is the probability distribution information of the speech sequence set.
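The K-L-divergence selection described above can be sketched as follows; the attribute histograms (e.g. over audio-length buckets) and the batch names are illustrative assumptions, and the batch whose distribution is closest to the labelled set (lowest divergence) is chosen.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Attribute distribution of the labelled first training samples.
p_labeled = np.array([0.5, 0.3, 0.2])

# Candidate batches of pseudo-labelled sequences, each with its own distribution.
batches = {
    "batch_a": np.array([0.48, 0.32, 0.20]),   # close to the labelled set
    "batch_b": np.array([0.20, 0.30, 0.50]),   # far from the labelled set
}
best = min(batches, key=lambda name: kl_divergence(p_labeled, batches[name]))
```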
Step S106: Iteratively train a second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples, to obtain the target speech recognition model.
After the plurality of target speech sequences are obtained, the plurality of first training samples are input into the speech recognition student model, which outputs first speech recognition results; the parameters of the second preset speech recognition model are adjusted according to the similarity between each first speech recognition result and the first text corresponding to that first training sample. When the adjusted second preset speech recognition model is determined to meet a preset performance condition, training stops and a trained initial speech recognition model is obtained; the initial speech recognition model is then trained on the plurality of target speech sequences to obtain the target speech recognition model. It should be noted that training the "teacher-noisy student" self-training model with a plurality of labeled first training samples and a plurality of unlabeled second training samples greatly improves the training effect of the speech recognition model, reduces the required amount of labeled training data, and improves the training efficiency of the speech recognition model.
The preset performance condition is determined from the speech recognition accuracy and speech recognition speed of the speech recognition student model; in practice, it can also be set according to the actual application scenario. Initializing the second preset speech recognition model with the plurality of first training samples ensures convergence of the training. Training the initial speech recognition model on the plurality of target speech sequences yields a target speech recognition model with a good training effect and high speech recognition accuracy.
In one embodiment, the second preset speech recognition model is, for example, a LAS (Listen, Attend and Spell) model, which includes a Listen layer, an Attend layer, and a Spell layer.
In one embodiment, a plurality of third training samples are generated from each target speech sequence and its corresponding second text; a training sample set is obtained from the plurality of third training samples and the plurality of first training samples; and the second preset speech recognition model is iteratively trained on this training sample set until a preset condition is reached, yielding the target speech recognition model. The preset condition may be that the performance meets the preset training condition, that the number of iterations exceeds a preset iteration count, and/or that the training time exceeds a preset duration, which is not specifically limited in this embodiment of the present application.
Referring to FIG. 3, FIG. 3 is a schematic diagram of a scenario for implementing the model training method provided by this embodiment.
As shown in FIG. 3, a plurality of first training samples and a plurality of second training samples are acquired; a first training sample includes a first speech sequence and the labeled first text corresponding to it, and a second training sample includes a second speech sequence. The first training samples are input into a first preset speech recognition model 10, which is iteratively trained to obtain a first speech recognition model 20. A preset language model 30 is fused with the first speech recognition model 20 to obtain a second speech recognition model 40. The second speech sequences of the second training samples are then input into the second speech recognition model 40 to obtain, for each second speech sequence, a corresponding second text and fusion score. Target speech sequences are screened out of the second speech sequences according to their fusion scores, and each target speech sequence, the second text corresponding to it, and the first training samples are input into a second preset speech recognition model 50, which is iteratively trained to obtain a target speech recognition model 60.
In the model training method provided by the above embodiment, a plurality of first training samples and a plurality of second training samples are acquired, where a first training sample includes a first speech sequence and the labeled first text corresponding to it, and a second training sample includes a second speech sequence. A first preset speech recognition model is iteratively trained on the first training samples to obtain a first speech recognition model, which is fused with a preset language model to obtain a second speech recognition model. The second speech sequences are input into the second speech recognition model to obtain, for each, a corresponding second text and fusion score; target speech sequences are selected from the second speech sequences according to these fusion scores; and a second preset speech recognition model is iteratively trained on each target speech sequence, its corresponding second text, and the first training samples to obtain the target speech recognition model. By training the "teacher-noisy student" self-training model with a plurality of labeled first training samples and a plurality of unlabeled second training samples, the present application greatly improves the training effect of the speech recognition model, reduces the required amount of labeled training data, and improves the training efficiency of the speech recognition model.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of the steps of a speech recognition method provided by an embodiment of the present application.
As shown in FIG. 4, the speech recognition method includes steps S201 to S202.
Step S201: Acquire a speech sequence to be recognized.
For example, the speech sequence to be recognized is a piece of voice data sent by a user in a social application.
Step S202: Perform speech recognition on the speech sequence using the target speech recognition model to obtain text information corresponding to the speech sequence.
The target speech recognition model is obtained by training according to the model training method described in the foregoing embodiments. For example, user A receives, through a social application on a terminal device, a speech sequence sent by user B; speech recognition is performed on the speech sequence by the target speech recognition model, and the text information "Hello" (the speech recognition result) is obtained.
上述实施例提供的语音识别方法,通过获取待识别的语音序列,并通过前述实施例所述的目标语音识别模型对语音序列进行语音识别,得到语音序列对应的文本信息,由于目标语音识别模型“教师-噪声学生”自训练学习模型进行训练得到的,可以有效的提高语音识别的准确性。The speech recognition method provided by the above-mentioned embodiment, by acquiring the speech sequence to be recognized, and performing speech recognition on the speech sequence through the target speech recognition model described in the foregoing embodiment, the text information corresponding to the speech sequence is obtained, because the target speech recognition model " The teacher-noise-student" self-training learning model can effectively improve the accuracy of speech recognition.
Please refer to FIG. 5, which is a schematic block diagram of a model training apparatus provided by an embodiment of the present application.
As shown in FIG. 5, the model training apparatus 300 includes: an acquisition module 301, a first training module 302, a fusion module 303, an input module 304, a screening module 305, and a second training module 306.
The acquisition module 301 is configured to acquire a plurality of first training samples and a plurality of second training samples, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence.
The first training module 302 is configured to iteratively train a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model.
The fusion module 303 is configured to fuse the first speech recognition model with a preset language model to obtain a second speech recognition model.
The input module 304 is configured to input the plurality of second speech sequences into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence.
The screening module 305 is configured to screen out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence.
The second training module 306 is configured to iteratively train a second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples to obtain the target speech recognition model.
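The fusion performed by module 303 is commonly realized as shallow fusion, where per-token log-probabilities from the speech recognition model and the language model are interpolated. The weight `lam` and the toy log-probabilities below are assumptions for illustration, not values given in this application.

```python
def fusion_score(asr_log_probs, lm_log_probs, lam=0.3):
    """Shallow-fusion score of one hypothesis: for each token,
    log P_asr(token) + lam * log P_lm(token), summed over tokens.
    lam is an assumed interpolation weight."""
    return sum(a + lam * b for a, b in zip(asr_log_probs, lm_log_probs))

# Rank two toy hypotheses by fused score (per-token log-probabilities).
hyps = {
    "hello": fusion_score([-0.1, -0.2], [-0.3, -0.1]),
    "hollow": fusion_score([-0.5, -0.9], [-1.2, -1.0]),
}
best = max(hyps, key=hyps.get)   # the second text returned for this sequence
```

The highest fused score selects the hypothesis that both models agree on, which is the role of the fusion score produced by module 304.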
In one embodiment, as shown in FIG. 6, the screening module 305 includes:
a filtering submodule 3051, configured to filter the plurality of second speech sequences according to a preset score threshold and the fusion score of each second speech sequence to obtain a plurality of candidate speech sequences; and
a screening submodule 3052, configured to screen out target speech sequences from the plurality of candidate speech sequences according to probability distribution information of the plurality of first training samples.
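A minimal sketch of the first screening stage performed by submodule 3051, assuming each decoding result is a `(sequence, text, fusion_score)` triple; the triple layout and the threshold value are illustrative, not specified by this application.

```python
def filter_by_score(scored_results, threshold):
    """First screening stage: keep only (sequence, text, fusion_score)
    triples whose fusion score meets the preset score threshold."""
    return [r for r in scored_results if r[2] >= threshold]

candidates = filter_by_score(
    [("s1", "hi", 0.9), ("s2", "??", 0.2), ("s3", "ok", 0.7)],
    threshold=0.5,
)
```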
In one embodiment, the screening submodule 3052 is further configured to:
generate a plurality of speech sequence sets according to the plurality of candidate speech sequences, where each speech sequence set includes at least one candidate speech sequence;
determine probability distribution information of each speech sequence set; and
select a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set.
In one embodiment, the screening submodule 3052 is further configured to:
calculate a K-L divergence of each speech sequence set according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set; and
select a target speech sequence set from the plurality of speech sequence sets according to the K-L divergence of each speech sequence set.
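The K-L divergence step above can be computed between the label distribution of the first training samples and that of each candidate set, selecting the set that diverges least. The discrete-distribution form and the toy distributions below are a sketch under that assumption.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions over the same
    categories; eps guards against zero probabilities in q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

labeled_dist = [0.5, 0.3, 0.2]      # distribution of the first training samples
candidate_sets = {
    "set_a": [0.5, 0.3, 0.2],       # matches the labeled distribution exactly
    "set_b": [0.1, 0.1, 0.8],
}
# The target speech sequence set is the one whose distribution diverges least.
target = min(candidate_sets, key=lambda k: kl_divergence(labeled_dist, candidate_sets[k]))
```

A divergence of zero means the candidate set's distribution matches the labeled data exactly, so `set_a` is selected here.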
In one embodiment, the first training module 302 is further configured to:
perform data augmentation on the plurality of first training samples; and
iteratively train the first preset speech recognition model according to the plurality of augmented first training samples until the first preset speech recognition model converges, to obtain the first speech recognition model.
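The application does not specify which data augmentation is applied; one common choice for speech features is frequency masking in the style of SpecAugment, sketched below on a toy frames-by-channels matrix (all names and sizes are illustrative).

```python
import random

def freq_mask(features, max_width=2, rng=None):
    """Zero a random contiguous band of channels in every frame,
    i.e. a frequency mask in the style of SpecAugment."""
    rng = rng or random.Random()
    n_channels = len(features[0])
    width = rng.randint(1, max_width)           # randint bounds are inclusive
    start = rng.randint(0, n_channels - width)
    return [
        [0.0 if start <= c < start + width else v for c, v in enumerate(frame)]
        for frame in features
    ]

feats = [[1.0, 2.0, 3.0, 4.0]] * 3   # 3 frames x 4 channels of toy features
masked = freq_mask(feats, max_width=2, rng=random.Random(0))
```

Time masking works the same way along the frame axis; both leave the sample's label unchanged while perturbing the input.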
In one embodiment, the second training module 306 is further configured to:
generate a plurality of third training samples according to each target speech sequence and the second text corresponding to each target speech sequence;
obtain a training sample set according to the plurality of third training samples and the plurality of first training samples; and
iteratively train the second preset speech recognition model on the training sample set until a preset condition is reached, to obtain the target speech recognition model.
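Assembling the training sample set described above can be sketched as follows; the pairing of sequences with their second texts follows the description, while the shuffle and all names are illustrative assumptions.

```python
import random

def build_training_set(first_samples, target_seqs, target_texts, seed=0):
    """Pair each target speech sequence with its second text to form the
    third training samples, then merge with the first samples and shuffle."""
    third_samples = list(zip(target_seqs, target_texts))
    training_set = list(first_samples) + third_samples
    random.Random(seed).shuffle(training_set)
    return training_set

train_set = build_training_set([("seq1", "hi")], ["seq2"], ["bye"])
```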
Please refer to FIG. 7, which is a schematic block diagram of a speech recognition apparatus provided by an embodiment of the present application.
As shown in FIG. 7, the speech recognition apparatus 400 includes:
an acquisition module 401, configured to acquire a speech sequence to be recognized; and
a recognition module 402, configured to perform speech recognition on the speech sequence through a target speech recognition model to obtain text information corresponding to the speech sequence.
The target speech recognition model is trained according to the model training method described in the foregoing embodiment.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing speech recognition method embodiment for the specific working processes of the modules and units of the speech recognition apparatus described above, which are not repeated here.
The apparatuses provided by the above embodiments may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 8.
Please refer to FIG. 8, which is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application. The computer device may be a server or a terminal device.
As shown in FIG. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a storage medium and an internal memory, and the storage medium may be non-volatile or volatile.
The storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause the processor to perform any one of the model training methods or speech recognition methods.
The processor is used to provide computing and control capabilities and to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the storage medium; when the computer program is executed by the processor, the processor performs any one of the model training methods or speech recognition methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the structure shown in FIG. 8 is merely a block diagram of the partial structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
acquiring a plurality of first training samples and a plurality of second training samples, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence;
iteratively training a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model;
fusing the first speech recognition model with a preset language model to obtain a second speech recognition model;
inputting the plurality of second speech sequences into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence;
screening out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence; and
iteratively training a second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples to obtain a target speech recognition model.
In one embodiment, when implementing the screening out of target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence, the processor is configured to implement:
filtering the plurality of second speech sequences according to a preset score threshold and the fusion score of each second speech sequence to obtain a plurality of candidate speech sequences; and
screening out target speech sequences from the plurality of candidate speech sequences according to probability distribution information of the plurality of first training samples.
In one embodiment, when implementing the screening out of target speech sequences from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples, the processor is configured to implement:
generating a plurality of speech sequence sets according to the plurality of candidate speech sequences, where each speech sequence set includes at least one candidate speech sequence;
determining probability distribution information of each speech sequence set; and
selecting a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set.
In one embodiment, when implementing the selecting of a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set, the processor is configured to implement:
calculating a K-L divergence of each speech sequence set according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set; and
selecting a target speech sequence set from the plurality of speech sequence sets according to the K-L divergence of each speech sequence set.
In one embodiment, when implementing the iterative training of the first preset speech recognition model according to the plurality of first training samples to obtain the first speech recognition model, the processor is configured to implement:
performing data augmentation on the plurality of first training samples; and
iteratively training the first preset speech recognition model according to the plurality of augmented first training samples until the first preset speech recognition model converges, to obtain the first speech recognition model.
In one embodiment, when implementing the iterative training of the second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples to obtain the target speech recognition model, the processor is configured to implement:
generating a plurality of third training samples according to each target speech sequence and the second text corresponding to each target speech sequence;
obtaining a training sample set according to the plurality of third training samples and the plurality of first training samples; and
iteratively training the second preset speech recognition model on the training sample set until a preset condition is reached, to obtain the target speech recognition model.
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
acquiring a speech sequence to be recognized; and
performing speech recognition on the speech sequence through a target speech recognition model to obtain text information corresponding to the speech sequence;
where the target speech recognition model is trained according to the model training method described in the foregoing embodiment.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing model training method or speech recognition method embodiments for the specific working process of the computer device described above, which is not repeated here.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the following steps:
acquiring a plurality of first training samples and a plurality of second training samples, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence;
iteratively training a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model;
fusing the first speech recognition model with a preset language model to obtain a second speech recognition model;
inputting the plurality of second speech sequences into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence;
screening out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence; and
iteratively training a second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples to obtain a target speech recognition model.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing model training method or speech recognition method embodiments for the specific working process of the computer-readable storage medium described above, which is not repeated here.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device.
It should be understood that the terms used in the specification of the present application are for the purpose of describing particular embodiments only and are not intended to limit the present application. As used in the specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" used in the specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, herein, the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes the element.
The serial numbers of the above embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments. The above are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A model training method, comprising:
    acquiring a plurality of first training samples and a plurality of second training samples, wherein the first training samples comprise a first speech sequence and a labeled first text corresponding to the first speech sequence, and the second training samples comprise a second speech sequence;
    iteratively training a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model;
    fusing the first speech recognition model with a preset language model to obtain a second speech recognition model;
    inputting the plurality of second speech sequences into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence;
    screening out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence; and
    iteratively training a second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples to obtain a target speech recognition model.
2. The model training method according to claim 1, wherein the screening out of target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence comprises:
    filtering the plurality of second speech sequences according to a preset score threshold and the fusion score of each second speech sequence to obtain a plurality of candidate speech sequences; and
    screening out target speech sequences from the plurality of candidate speech sequences according to probability distribution information of the plurality of first training samples.
3. The model training method according to claim 2, wherein the screening out of target speech sequences from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples comprises:
    generating a plurality of speech sequence sets according to the plurality of candidate speech sequences, wherein each speech sequence set comprises at least one candidate speech sequence;
    determining probability distribution information of each speech sequence set; and
    selecting a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set.
4. The model training method according to claim 3, wherein the selecting of a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set comprises:
    calculating a K-L divergence of each speech sequence set according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set; and
    selecting a target speech sequence set from the plurality of speech sequence sets according to the K-L divergence of each speech sequence set.
5. The model training method according to any one of claims 1 to 4, wherein the iterative training of a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model comprises:
    performing data augmentation on the plurality of first training samples; and
    iteratively training the first preset speech recognition model according to the plurality of augmented first training samples until the first preset speech recognition model converges, to obtain the first speech recognition model.
6. The model training method according to any one of claims 1 to 4, wherein the iterative training of the second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples to obtain a target speech recognition model comprises:
    generating a plurality of third training samples according to each target speech sequence and the second text corresponding to each target speech sequence;
    obtaining a training sample set according to the plurality of third training samples and the plurality of first training samples; and
    iteratively training the second preset speech recognition model on the training sample set until a preset condition is reached, to obtain the target speech recognition model.
7. A speech recognition method, comprising:
    acquiring a speech sequence to be recognized; and
    performing speech recognition on the speech sequence through a target speech recognition model to obtain text information corresponding to the speech sequence;
    wherein the target speech recognition model is trained according to the model training method of any one of claims 1 to 6.
8. A model training apparatus, comprising:
    an acquisition module, configured to acquire a plurality of first training samples and a plurality of second training samples, wherein the first training samples comprise a first speech sequence and a labeled first text corresponding to the first speech sequence, and the second training samples comprise a second speech sequence;
    a first training module, configured to iteratively train a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model;
    a fusion module, configured to fuse the first speech recognition model with a preset language model to obtain a second speech recognition model;
    an input module, configured to input the plurality of second speech sequences into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence;
    a screening module, configured to screen out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence; and
    a second training module, configured to iteratively train the second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples to obtain a target speech recognition model.
  9. A computer device, wherein the computer device includes a processor, a memory, and a computer program stored in the memory and executable by the processor, and when the computer program is executed by the processor, the following steps are implemented:
    acquiring a plurality of first training samples and a plurality of second training samples, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence;
    iteratively training a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model;
    fusing the first speech recognition model with a preset language model to obtain a second speech recognition model;
    inputting the plurality of second speech sequences into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence;
    screening out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence;
    iteratively training a second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples, to obtain a target speech recognition model.
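The claim's fusion of a recognition model with a language model is consistent with shallow fusion, where the recognizer's score for a hypothesis is combined with a language-model score, and the combined score doubles as a pseudo-label confidence. A minimal sketch; the interpolation weight `lam`, the per-token log-probability inputs, and the length normalization are assumptions not specified by the claims:

```python
def shallow_fusion_score(am_log_probs, lm_log_probs, lam=0.3):
    """Combine aligned per-token acoustic-model and language-model
    log-probabilities into one length-normalized fusion score."""
    if len(am_log_probs) != len(lm_log_probs):
        raise ValueError("sequences must be aligned")
    total = sum(am + lam * lm for am, lm in zip(am_log_probs, lm_log_probs))
    return total / len(am_log_probs)  # length normalization

def filter_by_score(hypotheses, threshold):
    """Keep hypotheses at or above a preset score threshold.
    hypotheses: list of (sequence_id, text, fusion_score)."""
    return [h for h in hypotheses if h[2] >= threshold]
```

With log-probabilities, scores are negative and closer to zero is better, so the preset threshold acts as a confidence floor on the pseudo-labels.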
  10. The computer device according to claim 9, wherein when implementing the screening out of target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence, the processor is configured to:
    filter the plurality of second speech sequences according to a preset score threshold and the fusion score of each second speech sequence to obtain a plurality of candidate speech sequences;
    screen out target speech sequences from the plurality of candidate speech sequences according to probability distribution information of the plurality of first training samples.
  11. The computer device according to claim 10, wherein when implementing the screening out of target speech sequences from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples, the processor is configured to:
    generate a plurality of speech sequence sets according to the plurality of candidate speech sequences, where each speech sequence set includes at least one candidate speech sequence;
    determine probability distribution information of each speech sequence set;
    select a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set.
  12. The computer device according to claim 11, wherein when implementing the selecting of a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set, the processor is configured to:
    calculate the K-L divergence of each speech sequence set according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set;
    select a target speech sequence set from the plurality of speech sequence sets according to the K-L divergence of each speech sequence set.
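Claim 12 picks the candidate set whose distribution best matches the labeled training data by Kullback-Leibler divergence. A sketch over discrete distributions follows; the choice of what to histogram (tokens, durations, and so on) is an assumption, as the claims only speak of probability distribution information:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions given as
    aligned probability lists; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def select_closest_set(train_dist, set_dists):
    """Return the index of the speech-sequence set whose
    distribution diverges least from the labeled-sample
    distribution."""
    divs = [kl_divergence(train_dist, d) for d in set_dists]
    return min(range(len(divs)), key=divs.__getitem__)
```

Minimizing K-L divergence against the labeled distribution biases the pseudo-labeled set toward the domain the supervised model was trained on, which is the apparent intent of the selection step.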
  13. The computer device according to any one of claims 9-12, wherein when implementing the iterative training of the first preset speech recognition model according to the plurality of first training samples to obtain the first speech recognition model, the processor is configured to:
    perform data augmentation on the plurality of first training samples;
    iteratively train the first preset speech recognition model according to the plurality of first training samples after data augmentation, until the first preset speech recognition model converges, to obtain the first speech recognition model.
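Claim 13 leaves the augmentation method open. For speech features a common choice is SpecAugment-style time and frequency masking, sketched below as one possible instantiation rather than the claimed method; the mask widths and the list-of-lists feature layout are illustrative assumptions:

```python
import random

def spec_augment(features, max_freq_mask=8, max_time_mask=10, seed=None):
    """Zero out one random frequency band and one random time span
    in a (time x freq) feature matrix given as a list of lists.
    Returns a masked copy; the original is left untouched."""
    rng = random.Random(seed)
    t, f = len(features), len(features[0])
    out = [row[:] for row in features]
    # frequency mask: a band of consecutive feature bins
    fw = rng.randint(0, min(max_freq_mask, f))
    f0 = rng.randint(0, f - fw)
    # time mask: a span of consecutive frames
    tw = rng.randint(0, min(max_time_mask, t))
    t0 = rng.randint(0, t - tw)
    for i in range(t):
        for j in range(f):
            if f0 <= j < f0 + fw or t0 <= i < t0 + tw:
                out[i][j] = 0.0
    return out
```

Masking forces the model to rely on context rather than any single band or frame, which is why this style of augmentation tends to improve convergence robustness on limited labeled data.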
  14. The computer device according to any one of claims 9-12, wherein when the computer program is executed by the processor, the following steps are further implemented:
    acquiring a speech sequence to be recognized;
    performing speech recognition on the speech sequence through the target speech recognition model to obtain text information corresponding to the speech sequence.
  15. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the following steps are implemented:
    acquiring a plurality of first training samples and a plurality of second training samples, where each first training sample includes a first speech sequence and a labeled first text corresponding to the first speech sequence, and each second training sample includes a second speech sequence;
    iteratively training a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model;
    fusing the first speech recognition model with a preset language model to obtain a second speech recognition model;
    inputting the plurality of second speech sequences into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence;
    screening out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence;
    iteratively training a second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples, to obtain a target speech recognition model.
  16. The computer-readable storage medium according to claim 15, wherein when implementing the screening out of target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence, the processor is configured to:
    filter the plurality of second speech sequences according to a preset score threshold and the fusion score of each second speech sequence to obtain a plurality of candidate speech sequences;
    screen out target speech sequences from the plurality of candidate speech sequences according to probability distribution information of the plurality of first training samples.
  17. The computer-readable storage medium according to claim 16, wherein when implementing the screening out of target speech sequences from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples, the processor is configured to:
    generate a plurality of speech sequence sets according to the plurality of candidate speech sequences, where each speech sequence set includes at least one candidate speech sequence;
    determine probability distribution information of each speech sequence set;
    select a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set.
  18. The computer-readable storage medium according to claim 17, wherein when implementing the selecting of a target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set, the processor is configured to:
    calculate the K-L divergence of each speech sequence set according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set;
    select a target speech sequence set from the plurality of speech sequence sets according to the K-L divergence of each speech sequence set.
  19. The computer-readable storage medium according to any one of claims 15-18, wherein when implementing the iterative training of the first preset speech recognition model according to the plurality of first training samples to obtain the first speech recognition model, the processor is configured to:
    perform data augmentation on the plurality of first training samples;
    iteratively train the first preset speech recognition model according to the plurality of first training samples after data augmentation, until the first preset speech recognition model converges, to obtain the first speech recognition model.
  20. The computer-readable storage medium according to any one of claims 15-18, wherein when the computer program is executed by the processor, the following steps are further implemented:
    acquiring a speech sequence to be recognized;
    performing speech recognition on the speech sequence through the target speech recognition model to obtain text information corresponding to the speech sequence.
PCT/CN2021/097411 2020-12-11 2021-05-31 Model training method and apparatus, speech recognition method and apparatus, device, and storage medium WO2022121257A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011453446.1A CN112435656B (en) 2020-12-11 2020-12-11 Model training method, voice recognition method, device, equipment and storage medium
CN202011453446.1 2020-12-11

Publications (1)

Publication Number Publication Date
WO2022121257A1

Family

ID=74691123

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097411 WO2022121257A1 (en) 2020-12-11 2021-05-31 Model training method and apparatus, speech recognition method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN112435656B (en)
WO (1) WO2022121257A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112435656B (en) * 2020-12-11 2024-03-01 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium
CN113129869B (en) * 2021-03-22 2022-01-28 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN113257235B (en) * 2021-04-30 2023-01-03 平安科技(深圳)有限公司 Model training method, voice recognition method, device, server and storage medium
CN113241062B (en) * 2021-06-01 2023-12-26 平安科技(深圳)有限公司 Enhancement method, device, equipment and storage medium for voice training data set
CN113314124B (en) * 2021-06-15 2022-03-25 宿迁硅基智能科技有限公司 Text output method and system, storage medium and electronic device
CN113327598B (en) * 2021-06-30 2023-11-14 北京有竹居网络技术有限公司 Model training method, voice recognition method, device, medium and equipment
CN113608664B (en) * 2021-07-26 2024-06-18 京东科技控股股份有限公司 Intelligent voice robot interaction effect optimization method and device and intelligent robot
CN113706172B (en) * 2021-08-30 2023-08-25 平安银行股份有限公司 Customer behavior-based complaint solving method, device, equipment and storage medium
CN113793604B (en) * 2021-09-14 2024-01-05 思必驰科技股份有限公司 Speech recognition system optimization method and device
CN115237182A (en) * 2022-07-29 2022-10-25 大连世有电力科技有限公司 Transformer temperature control system of low-power consumption wireless communication
CN117219067B (en) * 2023-09-27 2024-04-09 北京华星酷娱文化传媒有限公司 Method and system for automatically generating subtitles by short video based on speech understanding

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609672A (en) * 2009-07-21 2009-12-23 北京邮电大学 A kind of speech recognition semantic confidence feature extracting methods and device
CN107305575A (en) * 2016-04-25 2017-10-31 北京京东尚科信息技术有限公司 The punctuate recognition methods of human-machine intelligence's question answering system and device
CN108389576A (en) * 2018-01-10 2018-08-10 苏州思必驰信息科技有限公司 The optimization method and system of compressed speech recognition modeling
US20190272818A1 (en) * 2018-03-04 2019-09-05 International Business Machines Corporation Voice-transformation based data augmentation for prosodic classification
US20190385592A1 (en) * 2019-08-12 2019-12-19 Lg Electronics Inc. Speech recognition device and speech recognition method
CN111754985A (en) * 2020-07-06 2020-10-09 上海依图信息技术有限公司 Method and device for training voice recognition model and voice recognition
CN111933175A (en) * 2020-08-06 2020-11-13 北京中电慧声科技有限公司 Active voice detection method and system based on noise scene recognition
CN112435656A (en) * 2020-12-11 2021-03-02 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895932B (en) * 2018-08-24 2022-05-03 中国科学院声学研究所 Multi-language voice recognition method based on language type and voice content collaborative classification
CN110797016B (en) * 2019-02-26 2020-12-29 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110265001B (en) * 2019-05-06 2023-06-23 平安科技(深圳)有限公司 Corpus screening method and device for speech recognition training and computer equipment
CN110941945B (en) * 2019-12-02 2021-03-23 百度在线网络技术(北京)有限公司 Language model pre-training method and device
CN111583911B (en) * 2020-04-30 2023-04-14 深圳市优必选科技股份有限公司 Speech recognition method, device, terminal and medium based on label smoothing
CN111667818B (en) * 2020-05-27 2023-10-10 北京声智科技有限公司 Method and device for training wake-up model
CN111816165A (en) * 2020-07-07 2020-10-23 北京声智科技有限公司 Voice recognition method and device and electronic equipment

Also Published As

Publication number Publication date
CN112435656B (en) 2024-03-01
CN112435656A (en) 2021-03-02

Legal Events

Code 121: Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 21901970; Country of ref document: EP; Kind code of ref document: A1.
Code NENP: Non-entry into the national phase. Ref country code: DE.
Code 122: Ep: pct application non-entry in european phase. Ref document number: 21901970; Country of ref document: EP; Kind code of ref document: A1.