WO2022121257A1 - Model training method and apparatus, speech recognition method and apparatus, device and storage medium - Google Patents

Model training method and apparatus, speech recognition method and apparatus, device and storage medium

Info

Publication number
WO2022121257A1
WO2022121257A1 PCT/CN2021/097411 CN2021097411W WO2022121257A1 WO 2022121257 A1 WO2022121257 A1 WO 2022121257A1 CN 2021097411 W CN2021097411 W CN 2021097411W WO 2022121257 A1 WO2022121257 A1 WO 2022121257A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
recognition model
speech recognition
sequences
sequence
Prior art date
Application number
PCT/CN2021/097411
Other languages
English (en)
Chinese (zh)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022121257A1

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems

Definitions

  • the present application relates to the technical field of model construction in artificial intelligence, and in particular, to a model training method, a speech recognition method, an apparatus, a device and a storage medium.
  • Automatic Speech Recognition is a technology that converts speech into text.
  • speech recognition is used in various industries related to the Internet, communication, smart home, etc.
  • speech recognition models are usually used for automatic speech recognition.
  • the text data samples are usually obtained in the following manner: a large number of people are organized to listen to the speech data and write down the correct text data.
  • improving the accuracy of a speech recognition model requires training on ever more speech data and corresponding text data, which makes the labeling labor a large resource investment: it is time-consuming, expensive and inefficient.
  • the main purpose of the present application is to provide a model training method, speech recognition method, apparatus, equipment and storage medium, aiming at improving the training effect and training efficiency of the speech recognition model.
  • the present application provides a model training method, including:
  • the first training samples include a first speech sequence and a first text corresponding to the marked first speech sequence
  • the second training samples include a second speech sequence
  • the present application also provides a speech recognition method, comprising:
  • the target speech recognition model is obtained by training according to the above-mentioned model training method.
  • the present application also provides a model training device, the model training device comprising:
  • an acquisition module configured to acquire a plurality of first training samples and a plurality of second training samples, the first training samples including a first speech sequence and a first text corresponding to the marked first speech sequence, and the second training samples including a second speech sequence;
  • a first training module configured to iteratively train a first preset speech recognition model according to the plurality of first training samples to obtain a first speech recognition model
  • a fusion module configured to fuse the first speech recognition model with the preset language model to obtain a second speech recognition model
  • an input module configured to input a plurality of the second speech sequences into the second speech recognition model, to obtain a second text and a fusion score corresponding to each of the second speech sequences;
  • a screening module configured to screen out a target speech sequence from the plurality of second speech sequences according to the fusion score of each of the second speech sequences
  • a second training module configured to iteratively train the second preset speech recognition model according to each of the target speech sequences, the second text corresponding to each of the target speech sequences, and a plurality of the first training samples, to obtain the target speech recognition model.
  • the present application also provides a computer device, the computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein, when the computer program is executed by the processor, the steps of the above-mentioned model training method or speech recognition method are implemented.
  • the present application further provides a computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the steps of the above-mentioned model training method or speech recognition method are implemented.
  • the present application provides a model training method, a speech recognition method, an apparatus, a device and a storage medium.
  • the present application obtains a plurality of first training samples and a plurality of second training samples, where the first training samples include a first speech sequence and the marked first text corresponding to the first speech sequence, and the second training samples include a second speech sequence; then, according to the plurality of first training samples, a first preset speech recognition model is iteratively trained to obtain a first speech recognition model.
  • the first speech recognition model is fused with a preset language model to obtain a second speech recognition model; then the multiple second speech sequences are input into the second speech recognition model to obtain the second text and fusion score corresponding to each second speech sequence; according to the fusion score of each second speech sequence, target speech sequences are screened out from the multiple second speech sequences; and according to each target speech sequence, the second text corresponding to each target speech sequence and the plurality of first training samples, a second preset speech recognition model is iteratively trained to obtain the target speech recognition model.
  • This application trains a "teacher - noisy student" self-training learning model through multiple labeled first training samples and multiple unlabeled second training samples, which can greatly improve the training effect of the speech recognition model, reduce the number of labeled training samples required, and improve the training efficiency of the speech recognition model.
  • FIG. 1 is a schematic flowchart of steps of a model training method provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of sub-steps of the model training method in FIG. 1;
  • FIG. 3 is a schematic diagram of a scenario for implementing the model training method provided by this embodiment
  • FIG. 4 is a schematic flowchart of steps of a speech recognition method provided by an embodiment of the present application.
  • FIG. 5 is a schematic block diagram of a model training apparatus provided by an embodiment of the present application.
  • Fig. 6 is a schematic block diagram of sub-modules of the model training device in Fig. 5;
  • FIG. 7 is a schematic block diagram of a speech recognition apparatus provided by an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
  • Embodiments of the present application provide a model training method, a speech recognition method, an apparatus, a device, and a storage medium.
  • the model training method can be applied to a terminal device or a server; the terminal device can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant or a wearable device; the server can be a single server, or a server cluster consisting of multiple servers.
  • the following takes the model training method applied to the server as an example for explanation.
  • FIG. 1 is a schematic flowchart of steps of a model training method provided by an embodiment of the present application.
  • the model training method includes steps S101 to S106.
  • Step S101 Acquire a plurality of first training samples and a plurality of second training samples, where the first training samples include a first speech sequence and a first text corresponding to the marked first speech sequence, and the second training samples include a second speech sequence.
  • the first training sample includes the first speech sequence and the first text corresponding to the first speech sequence, where the first text is a label of the corresponding first speech sequence, and the second training sample includes the second speech sequence.
  • the first voice sequence and the second voice sequence are audio data
  • the first text corresponding to the first voice sequence is the text content recognized by the voice of the first voice sequence.
  • for example, if the first speech sequence is a song, the corresponding first text is its lyrics.
  • the Noisy Student Training (NST) model is a semi-supervised learning model composed of a "teacher" and a "student".
  • the teacher model (the first preset speech recognition model) learns from the marked first training samples and predicts the unlabeled second training samples, yielding pseudo-labeled second training samples with a second text corresponding to each second training sample; the student model (the second preset speech recognition model) is then trained on the marked first training samples together with the pseudo-labeled second training samples and their corresponding second texts, and the above process is iterated. Through this "teacher - noisy student" self-training learning, the training effect of the speech recognition model can be greatly improved.
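  • As an illustration of the above loop, the following Python sketch strings steps S102 to S106 together; train, fuse_with_lm, decode and select are hypothetical placeholders, not functions defined in the present application.

```python
from typing import Callable, List, Tuple

def noisy_student_training(
    labeled: List[Tuple[list, str]],   # (first speech sequence, first text) pairs
    unlabeled: List[list],             # second speech sequences
    train: Callable,                   # S102/S106: fits a model on (speech, text) pairs
    fuse_with_lm: Callable,            # S103: fuses a model with the language model
    decode: Callable,                  # S104: (model, speech) -> (text, fusion score)
    select: Callable,                  # S105: filters the pseudo-labeled data
    rounds: int = 3,
):
    teacher = train(labeled)                                    # step S102
    for _ in range(rounds):
        fused = fuse_with_lm(teacher)                           # step S103
        pseudo = [(x, *decode(fused, x)) for x in unlabeled]    # step S104
        chosen = [(x, t) for x, t, s in select(pseudo)]         # step S105
        teacher = train(labeled + chosen)                       # step S106: the noisy student becomes the next teacher
    return teacher
```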
  • the total audio length of the plurality of first training samples is higher than a first preset time threshold, and the total audio length of the plurality of second training samples is higher than a second preset time threshold, which helps ensure the speech recognition accuracy of the model obtained in subsequent training.
  • the second preset time threshold is higher than the first preset time threshold.
  • the first preset time threshold and the second preset time threshold may be set according to actual application scenarios; for example, the first preset time threshold is 100h and the second preset time threshold is 500h, which will not be repeated here.
  • the above-mentioned related information such as the first training samples and the second training samples can also be stored in a node of a blockchain.
  • the blockchain referred to in the present application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms.
  • a blockchain is essentially a decentralized database, consisting of a series of data blocks associated with each other by cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Step S102 Perform iterative training on a first preset speech recognition model according to a plurality of first training samples to obtain a first speech recognition model.
  • the first preset speech recognition model is the "teacher" model; a plurality of first training samples are input into the first preset speech recognition model to obtain the speech recognition result corresponding to each first training sample, and the parameters of the first preset speech recognition model are adjusted according to the speech recognition result corresponding to each first training sample and the corresponding first text, until a first speech recognition model whose performance meets the preset training conditions is obtained.
  • the preset training condition may be that the recognition accuracy is higher than the preset accuracy threshold.
  • both the preset training conditions and the preset accuracy threshold can be set according to actual application scenarios.
  • the preset accuracy threshold is 0.9, which is not specifically limited here.
  • the first preset speech recognition model is, for example, a LAS (Listen, Attend and Spell) model, and the LAS model includes a Listen (listening) layer, an Attend (attention) layer and a Spell (spelling) layer.
  • the first preset speech recognition model extracts the speech signal features of the input first training sample, encodes them through the Listen layer, attends to different parts of the input at different times through the Attend layer, and finally decodes through the Spell layer to obtain the speech recognition result of the first training sample.
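  • A rough PyTorch sketch of such a Listen-Attend-Spell model is shown below; the layer dimensions and the greedy decoding loop are assumptions made for the example only, not values taken from the application.

```python
import torch
import torch.nn as nn

class TinyLAS(nn.Module):
    """Minimal Listen-Attend-Spell sketch: encoder, dot-product attention, character decoder."""
    def __init__(self, feat_dim=80, hidden=256, vocab=30):
        super().__init__()
        self.listen = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)        # Listen: encode speech features
        self.key_proj = nn.Linear(2 * hidden, hidden)    # project encoder states to attention keys
        self.embed = nn.Embedding(vocab, hidden)         # previous output character
        self.spell = nn.LSTMCell(2 * hidden, hidden)     # Spell: character decoder
        self.out = nn.Linear(hidden, vocab)              # character distribution

    def forward(self, feats, max_len=20):
        keys = self.key_proj(self.listen(feats)[0])      # (B, T, hidden)
        B = feats.size(0)
        h = feats.new_zeros(B, keys.size(-1))
        c = torch.zeros_like(h)
        prev = torch.zeros(B, dtype=torch.long)          # assume index 0 is <sos>
        logits = []
        for _ in range(max_len):
            # Attend: dot-product attention over the encoder time steps
            att = torch.softmax(torch.bmm(keys, h.unsqueeze(2)).squeeze(2), dim=1)
            context = torch.bmm(att.unsqueeze(1), keys).squeeze(1)
            h, c = self.spell(torch.cat([self.embed(prev), context], 1), (h, c))
            step = self.out(h)                           # (B, vocab) character logits
            logits.append(step)
            prev = step.argmax(dim=1)                    # greedy decoding, for the sketch only
        return torch.stack(logits, dim=1)                # (B, max_len, vocab)

# e.g. TinyLAS()(torch.randn(2, 100, 80)) yields (2, 20, 30) character logits
```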
  • data augmentation is performed on the plurality of first training samples; according to the plurality of first training samples after data augmentation, the first preset speech recognition model is iteratively trained until the first preset speech recognition model converges, to obtain the first speech recognition model.
  • the number of samples of the first training samples can be increased through data augmentation, for example by normalizing the vocal tract length, superimposing clean audio and noisy audio signals to synthesize noisy audio, or applying speed perturbation to the original audio.
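  • A minimal sketch of the noisy-audio synthesis mentioned above is given below, assuming a standard target-SNR mixing recipe; mix_at_snr is an illustrative name, not an API from the application.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Superimpose a noise signal on clean audio at a target SNR in dB."""
    noise = np.resize(noise, clean.shape)        # tile or crop noise to the clean length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# speed perturbation would additionally resample the waveform,
# e.g. by factors such as 0.9 / 1.0 / 1.1 (assumed values)
```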
  • for the iterative training of the first preset speech recognition model, reference may be made to the foregoing embodiments.
  • the convergence of the first preset speech recognition model may be that the performance meets the preset training conditions, that the number of iterations is greater than a preset number of iterations, and/or that the iteration duration is greater than a preset iteration duration, etc., which is not specifically limited in this embodiment.
  • training on the augmented first training samples improves the speech recognition results and thus the training effect of the target speech recognition model.
  • SpecAugment is used to perform data augmentation on the plurality of first training samples, and the robustness of the first speech recognition model is increased by adding some noise to the first preset speech recognition model.
  • each first speech sequence is converted into a spectrogram, and time warping, frequency masking and/or time masking are performed on the plurality of spectrograms by SpecAugment.
  • enhancing the spectrograms of the first speech sequences through SpecAugment can increase the training speed of the first speech recognition model on the first training samples, thereby improving the training efficiency of the target speech recognition model.
  • time warping of the spectrograms through SpecAugment means that, taking time as the x-axis and frequency as the y-axis of the mel spectrogram, a point on a random horizontal line through the center of the mel spectrogram is displaced along the time axis; frequency masking means that consecutive mel frequency channels [f0, f0+f] are masked on the frequency axis, where f is drawn from a uniform distribution from 0 to F; time masking means that consecutive frames [t0, t0+t] are masked on the time axis, where t is drawn from a uniform distribution from 0 to T.
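  • The two masking operations can be sketched as follows; the default mask sizes F and T are assumptions, not parameters given in the application (time warping is omitted since it requires an interpolation step).

```python
import numpy as np

def spec_augment(mel: np.ndarray, F: int = 15, T: int = 40,
                 rng: np.random.Generator = None) -> np.ndarray:
    """Apply one frequency mask and one time mask to a (n_mels, n_frames) mel spectrogram."""
    if rng is None:
        rng = np.random.default_rng()
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    f = int(rng.integers(0, F + 1))                 # f ~ uniform(0, F)
    f0 = int(rng.integers(0, max(n_mels - f, 1)))
    mel[f0:f0 + f, :] = 0.0                         # mask mel channels [f0, f0 + f)
    t = int(rng.integers(0, T + 1))                 # t ~ uniform(0, T)
    t0 = int(rng.integers(0, max(n_frames - t, 1)))
    mel[:, t0:t0 + t] = 0.0                         # mask frames [t0, t0 + t)
    return mel
```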
  • noise is added to the first preset speech recognition model; according to the plurality of first training samples, the noise-added first preset speech recognition model is iteratively trained until it converges, and the first speech recognition model is obtained.
  • alternatively, data augmentation is performed on the plurality of first training samples and noise is added to the first preset speech recognition model; according to the plurality of first training samples after data augmentation, the noise-added first preset speech recognition model is iteratively trained until it converges, and the first speech recognition model is obtained.
  • Step S103 Fuse the first speech recognition model with the preset language model to obtain a second speech recognition model.
  • the preset language model is a pre-trained language model (Language Model), and the preset language model is, for example, a statistical language model, a feedforward neural network language model, a recurrent neural network language model, and the like.
  • the obtained second speech recognition model has better performance, which is beneficial to improving the training effect of the target speech recognition model and makes the speech recognition accuracy of the target speech recognition model higher.
  • since the data volume of the training samples of the language model is much larger than that of the first training samples of the first speech recognition model, fusing the first speech recognition model with the preset language model can help the second speech recognition model model semantic information; the fusion methods include Voting, Averaging, Bagging (bootstrap aggregating), Boosting, etc., which are not specifically limited in this embodiment.
  • Step S104 Input multiple second speech sequences into the second speech recognition model to obtain the second text and fusion score corresponding to each second speech sequence.
  • a plurality of second training samples are input into the second speech recognition model, and a speech recognition result corresponding to each second speech sequence is obtained, and the speech recognition result includes the second text corresponding to the second speech sequence and the fusion score.
  • the second speech recognition model is a LAS (Listen, Attend and Spell) model, including a Listen (listen) layer, an Attend (attention) layer and a Spell (spelling) layer.
  • the second speech sequence is a sound signal feature vector x of length T; after the sound signal feature vector x of length T is input into the second speech recognition model, the Listen layer retains the content related to the sound signal and removes the content that is not related to it.
  • since the second speech recognition model is obtained by fusing the trained first speech recognition model (LAS) with the language model (LM), a weighted summation of the character distribution probabilities of the LAS model and the LM model yields the fusion score corresponding to the second speech sequence.
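  • A minimal sketch of this weighted summation (shallow fusion) over one decoded hypothesis follows; the interpolation weight lam is an assumed value, not one given in the application.

```python
import numpy as np

def fusion_score(las_logp: np.ndarray, lm_logp: np.ndarray, lam: float = 0.3) -> float:
    """Weighted sum of the per-character log-probabilities assigned by the
    LAS model and the language model to one decoded hypothesis."""
    return float(np.sum(las_logp + lam * lm_logp))

# e.g. rescoring a beam of hypotheses and keeping the best second text:
# best = max(beam, key=lambda h: fusion_score(h.las_logp, h.lm_logp))
```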
  • Step S105 Screen out target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence.
  • for the second texts corresponding to the second speech sequences output by the second speech recognition model, it is necessary to filter out target speech sequences that meet preset conditions; according to the fusion score of each second speech sequence, target speech sequences can be selected from the multiple second speech sequences.
  • the screened-out target speech sequences can serve as high-quality training data for the "student" model (the second preset speech recognition model), thereby improving the training effect of the second preset speech recognition model.
  • step S105 includes: sub-steps S1051 to S1052 .
  • Sub-step S1051 Filter the multiple second speech sequences according to the preset score threshold and the fusion score of each second speech sequence to obtain multiple candidate speech sequences.
  • the preset score threshold can be flexibly set by the user; second speech sequences whose fusion score is greater than or equal to the preset score threshold are retained, and second speech sequences whose fusion score is less than the preset score threshold are filtered out, to obtain a plurality of candidate speech sequences. It should be noted that the second text corresponding to a second speech sequence with a high fusion score has a higher accuracy rate, so retaining the second speech sequences whose second text has a higher accuracy rate is conducive to screening out high-quality second speech sequences.
  • in practice, the speech recognition results of the second speech recognition model can be affected by factors such as the length of the speech sequence, so the accuracy of the second text and the fusion score are not consistent across the second speech sequences. Therefore, the fusion score of each second speech sequence is regularized, the regularized fusion score is compared with the preset score threshold, the second speech sequences whose fusion score is less than the preset score threshold are filtered out, and a plurality of high-quality candidate speech sequences are obtained.
  • the regularization formula is S'_i = (S_i - (α·l_i + β)) / σ, where S_i is the fusion score of the i-th second speech sequence, l_i is its character length, α and β are the parameters obtained by performing linear regression on the pairs (l_i, S_i) of the multiple second speech sequences, and σ is the standard deviation computed from the regression residuals.
  • the preset score threshold may decrease as the number of iterations increases; as the preset score threshold becomes smaller and smaller during iterative training, more and more candidate speech sequences can be used as training samples for the target speech recognition model.
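  • A sketch of this regularize-then-filter step is given below, assuming the residual-based normalization reconstructed above; np.polyfit stands in for the linear regression on (l_i, S_i).

```python
import numpy as np

def filter_candidates(seqs, texts, scores, threshold):
    """Keep second speech sequences whose length-regularized fusion score
    clears the (gradually decreasing) preset score threshold."""
    l = np.array([len(t) for t in texts], dtype=float)
    s = np.asarray(scores, dtype=float)
    alpha, beta = np.polyfit(l, s, 1)        # linear regression on the pairs (l_i, S_i)
    resid = s - (alpha * l + beta)
    sigma = resid.std() + 1e-12              # assumed: sigma taken from the residuals
    keep = (resid / sigma) >= threshold
    return [(x, t) for x, t, k in zip(seqs, texts, keep) if k]
```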
  • Sub-step S1052 according to the probability distribution information of the plurality of first training samples, screen out the target speech sequence from the plurality of candidate speech sequences.
  • selecting a target speech sequence from a plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples includes: generating a plurality of speech sequence sets according to the plurality of candidate speech sequences, wherein each speech sequence set includes at least one candidate speech sequence; determining the probability distribution information of each speech sequence set; and selecting the target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set.
  • the target speech sequence set includes at least one target speech sequence. It should be noted that the distributions of the filtered candidate speech sequences can differ considerably from that of the first training samples, which would affect the performance of the second preset speech recognition model; therefore, a target speech sequence set whose probability distribution information is similar to that of the plurality of first training samples is found among the multiple speech sequence sets, and the at least one target speech sequence in the target speech sequence set is used as training samples for the second preset speech recognition model, which can improve the performance of the second preset speech recognition model and thus the training effect of the target speech recognition model.
  • multiple batches are randomly selected from multiple candidate speech sequences to generate multiple speech sequence sets, and each batch includes at least one candidate speech sequence.
  • Each candidate speech sequence carries attribute information, and the attribute information carried by multiple candidate speech sequences can form the probability distribution information of a speech sequence set.
  • the probability distribution information is determined according to the specific service scenario; for example, the probability distribution information may be the length of the audio, the ratio of male to female speakers, the age of the speakers, the surrounding environment, etc. The probability distribution information corresponding to each speech sequence set is compared with the probability distribution information of the plurality of first training samples to find the target speech sequence set whose probability distribution information approximates that of the plurality of first training samples.
  • the K-L divergence of each speech sequence set is calculated according to the probability distribution information of the plurality of first training samples and the probability distribution information of each speech sequence set; according to the K-L divergence of each speech sequence set, a target speech sequence set is selected from the multiple speech sequence sets. The lower the K-L divergence of a speech sequence set, the closer its probability distribution information is to that of the plurality of first training samples; the speech sequence set with the lowest K-L divergence is selected as the target speech sequence set, which includes at least one target speech sequence.
  • the K-L divergence (Kullback-Leibler divergence) is computed as D_KL(P || Q) = Σ_i P(i)·log(P(i)/Q(i)), where P(i) is the probability distribution information of the plurality of first training samples and Q(i) is the probability distribution information of the speech sequence set f(M(U)).
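  • A minimal sketch of this K-L-divergence-based selection follows; the probability distribution information is assumed to be given as normalized histograms over the compared attributes.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def pick_target_set(labeled_dist: np.ndarray, set_dists: list) -> int:
    """Return the index of the speech sequence set whose attribute distribution
    (audio length, speaker gender ratio, age, environment, ...) is closest to
    that of the labeled first training samples."""
    return int(np.argmin([kl_divergence(labeled_dist, q) for q in set_dists]))
```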
  • Step S106 Perform iterative training on the second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and a plurality of first training samples to obtain a target speech recognition model.
  • the preset performance condition is determined according to the speech recognition accuracy and speech recognition speed of the speech recognition student model. In practical applications, the preset performance conditions may also be set according to actual application scenarios.
  • the second preset speech recognition model is first initialized (trained) with the plurality of first training samples to ensure convergence on the labeled training data; the initialized speech recognition model is then trained with the multiple target speech sequences, and a target speech recognition model with a better training effect and high speech recognition accuracy is obtained.
  • the second preset speech recognition model is, for example, a LAS (Listen, Attend and Spell) model
  • the LAS model includes a Listen (listen) layer, an Attend (attention) layer and a Spell (spelling) layer.
  • multiple third training samples are generated according to each target speech sequence and the second text corresponding to each target speech sequence; a training sample set is obtained from the multiple third training samples and the multiple first training samples; and the second preset speech recognition model is iteratively trained with the training sample set until a preset condition is reached, to obtain the target speech recognition model.
  • the preset conditions may be that the performance meets the preset training conditions, the number of iterations is greater than the preset number of iterations, and/or the iteration duration is greater than the preset iteration duration, etc., which are not specifically limited in this embodiment of the present application.
  • FIG. 3 is a schematic diagram of a scenario for implementing the model training method provided by this embodiment.
  • a plurality of first training samples and a plurality of second training samples are obtained, where the first training samples include the first speech sequence and the first text corresponding to the marked first speech sequence, and the second training samples include the second speech sequence; the plurality of first training samples are input into the first preset speech recognition model 10 to iteratively train it and obtain the first speech recognition model 20; the preset language model 30 is fused with the first speech recognition model 20 to obtain the second speech recognition model 40; the second speech sequences in the plurality of second training samples are then input into the second speech recognition model 40 to obtain the second text and fusion score corresponding to each second speech sequence; according to the fusion score of each second speech sequence, target speech sequences are screened out from the multiple second speech sequences; and each target speech sequence, the second text corresponding to it, and the plurality of first training samples are input into the second preset speech recognition model 50 to iteratively train it and obtain the target speech recognition model 60.
  • in the model training method provided by the above embodiment, a plurality of first training samples and a plurality of second training samples are obtained, where the first training samples include the first speech sequence and the first text corresponding to the marked first speech sequence, and the second training samples include the second speech sequence; then, according to the plurality of first training samples, the first preset speech recognition model is iteratively trained to obtain the first speech recognition model, and the first speech recognition model is fused with the preset language model to obtain the second speech recognition model; multiple second speech sequences are then input into the second speech recognition model to obtain the second text and fusion score corresponding to each second speech sequence; according to the fusion score of each second speech sequence, target speech sequences are selected from the plurality of second speech sequences; and the second preset speech recognition model is iteratively trained according to each target speech sequence, the second text corresponding to each target speech sequence, and the plurality of first training samples, to obtain the target speech recognition model.
  • This application trains a "teacher - noisy student" self-training learning model through multiple labeled first training samples and multiple unlabeled second training samples, which can greatly improve the training effect of the speech recognition model, reduce the number of labeled training samples required, and improve the training efficiency of the speech recognition model.
  • FIG. 4 is a schematic flowchart of steps of a speech recognition method provided by an embodiment of the present application.
  • the speech recognition method includes steps S201 and S202.
  • Step S201 Acquire a speech sequence to be recognized.
  • the speech sequence to be recognized is a piece of speech data sent by a user in a social application.
  • Step S202 Perform speech recognition on the speech sequence by using the target speech recognition model to obtain text information corresponding to the speech sequence.
  • the target speech recognition model is obtained by training according to the model training method described in the foregoing embodiment.
  • user A receives a voice sequence sent by user B through the social application of the terminal device, performs voice recognition on the voice sequence through the target voice recognition model, and obtains the text information "Hello" (speech recognition result).
  • the speech recognition method provided by the above embodiment acquires the speech sequence to be recognized and performs speech recognition on it through the target speech recognition model described in the foregoing embodiments to obtain the text information corresponding to the speech sequence; because the target speech recognition model is obtained through "teacher - noisy student" self-training learning, the accuracy of speech recognition can be effectively improved.
  • FIG. 5 is a schematic block diagram of a model training apparatus provided by an embodiment of the present application.
  • the model training apparatus 300 includes: an acquisition module 301 , a first training module 302 , a fusion module 303 , an input module 304 , a screening module 305 and a second training module 306 .
  • the acquisition module 301 is configured to acquire a plurality of first training samples and a plurality of second training samples, the first training samples include the first speech sequence and the first text corresponding to the marked first speech sequence, and the second training samples include the second speech sequence;
  • a first training module 302 configured to perform iterative training on a first preset speech recognition model according to a plurality of first training samples to obtain a first speech recognition model
  • a fusion module 303 configured to fuse the first speech recognition model with the preset language model to obtain a second speech recognition model
  • the input module 304 is used for inputting a plurality of second speech sequences into the second speech recognition model to obtain the second text and fusion score corresponding to each second speech sequence;
  • the screening module 305 is used for screening out the target speech sequence from the plurality of second speech sequences according to the fusion score of each second speech sequence;
  • the second training module 306 is configured to iteratively train the second preset speech recognition model according to each target speech sequence, the second text corresponding to each target speech sequence, and a plurality of first training samples to obtain the target speech recognition model .
  • the screening module 305 includes:
  • Filtering submodule 3051 configured to filter a plurality of the second speech sequences according to a preset score threshold and a fusion score of each of the second speech sequences to obtain a plurality of candidate speech sequences;
  • the screening sub-module 3052 is configured to screen out a target speech sequence from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples.
  • the screening sub-module 3052 is also used to:
  • each of the speech sequence sets includes at least one of the candidate speech sequences
  • a target speech sequence set is selected from the plurality of the speech sequence sets.
  • the screening sub-module 3052 is also used to:
  • a target speech sequence set is selected from a plurality of the speech sequence sets.
  • the first training module 302 is further used to:
  • the first preset speech recognition model is iteratively trained according to the plurality of first training samples after data enhancement, until the first preset speech recognition model converges, and a first speech recognition model is obtained.
  • the second training module 306 is also used to:
  • the second preset speech recognition model is iteratively trained until a preset condition is reached, and the target speech recognition model is obtained.
  • FIG. 7 is a schematic block diagram of a speech recognition apparatus provided by an embodiment of the present application.
  • the speech recognition device 400 includes:
  • the acquiring module 401 is used for acquiring the speech sequence to be recognized.
  • the recognition module 402 is configured to perform speech recognition on the speech sequence through a target speech recognition model to obtain text information corresponding to the speech sequence.
  • the target speech recognition model is obtained by training according to the model training method described in the foregoing embodiment.
  • the apparatuses provided in the above embodiments may be implemented in the form of a computer program, and the computer program may be executed on the computer device as shown in FIG. 8 .
  • FIG. 8 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
  • the computer device can be a server or a terminal device.
  • the computer device includes a processor, a memory and a network interface connected through a system bus, wherein the memory may include a storage medium and an internal memory, and the storage medium may be non-volatile or volatile.
  • the storage medium may store an operating system and a computer program.
  • the computer program includes program instructions that, when executed, can cause the processor to execute any model training method or speech recognition method.
  • the processor is used to provide computing and control capabilities to support the operation of the entire computer equipment.
  • the internal memory provides an environment for the running of the computer program in the storage medium; when the computer program is executed by the processor, the processor can be caused to execute any model training method or speech recognition method.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 8 is only a block diagram of a part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • the processor is configured to run a computer program stored in the memory to implement the following steps:
  • the first training samples include a first speech sequence and a first text corresponding to the marked first speech sequence
  • the second training samples include a second speech sequence
  • when the processor selects the target speech sequence from the plurality of second speech sequences according to the fusion score of each of the second speech sequences, the processor is configured to implement:
  • a target speech sequence is selected from the plurality of candidate speech sequences.
  • when the processor selects the target speech sequence from the plurality of candidate speech sequences according to the probability distribution information of the plurality of first training samples, the processor is configured to implement:
  • each of the speech sequence sets includes at least one of the candidate speech sequences
  • a target speech sequence set is selected from the plurality of the speech sequence sets.
  • when the processor selects the target speech sequence set from the plurality of speech sequence sets according to the probability distribution information of the plurality of first training samples and the probability distribution information of each of the speech sequence sets, it is configured to implement:
  • a target speech sequence set is selected from a plurality of the speech sequence sets.
  • when implementing the iterative training of the first preset speech recognition model according to the plurality of first training samples to obtain the first speech recognition model, the processor is configured to implement:
  • the first preset speech recognition model is iteratively trained according to the plurality of first training samples after data enhancement, until the first preset speech recognition model converges, and a first speech recognition model is obtained.
  • when the processor performs the iterative training of the second preset speech recognition model according to each of the target speech sequences, the second text corresponding to each of the target speech sequences, and a plurality of the first training samples to obtain the target speech recognition model, it is configured to implement:
  • the second preset speech recognition model is iteratively trained until a preset condition is reached, and the target speech recognition model is obtained.
  • the processor is configured to run a computer program stored in a memory to implement the following steps:
  • the target speech recognition model is obtained by training according to the model training method described in the foregoing embodiment.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, wherein when the computer program is executed by a processor, the following steps are implemented:
  • the first training samples include a first speech sequence and a first text corresponding to the marked first speech sequence
  • the second training samples include a second speech sequence
  • the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk equipped on the computer device, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a model training method and apparatus, a speech recognition method and apparatus, a device and a storage medium. The method comprises: performing iterative training on a first preset speech recognition model according to a plurality of first training samples to obtain a first speech recognition model (S102); fusing the first speech recognition model with a preset language model to obtain a second speech recognition model (S103); inputting second speech sequences in a plurality of second training samples into the second speech recognition model to obtain a second text and a fusion score corresponding to each second speech sequence (S104); selecting target speech sequences from the plurality of second speech sequences according to the fusion score of each second speech sequence (S105); and performing iterative training on a second preset speech recognition model according to each target speech sequence, the second text corresponding to the target speech sequence, and the plurality of first training samples, to obtain a target speech recognition model (S106). The method improves the training efficiency of the speech recognition model.
PCT/CN2021/097411 2020-12-11 2021-05-31 Model training method and apparatus, speech recognition method and apparatus, device and storage medium WO2022121257A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011453446.1 2020-12-11
CN202011453446.1A CN112435656B (zh) 2020-12-11 2020-12-11 模型训练方法、语音识别方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022121257A1 true WO2022121257A1 (fr) 2022-06-16

Family

ID=74691123

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097411 WO2022121257A1 (fr) 2021-05-31 Model training method and apparatus, speech recognition method and apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN112435656B (fr)
WO (1) WO2022121257A1 (fr)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112435656B (zh) * 2020-12-11 2024-03-01 平安科技(深圳)有限公司 模型训练方法、语音识别方法、装置、设备及存储介质
CN113129869B (zh) * 2021-03-22 2022-01-28 北京百度网讯科技有限公司 语音识别模型的训练与语音识别的方法、装置
CN113257235B (zh) * 2021-04-30 2023-01-03 平安科技(深圳)有限公司 模型训练方法、语音识别方法、装置、服务器及存储介质
CN113241062B (zh) * 2021-06-01 2023-12-26 平安科技(深圳)有限公司 语音训练数据集的增强方法、装置、设备及存储介质
CN114627874A (zh) 2021-06-15 2022-06-14 宿迁硅基智能科技有限公司 文本对齐方法、存储介质、电子装置
CN113327598B (zh) * 2021-06-30 2023-11-14 北京有竹居网络技术有限公司 模型的训练方法、语音识别方法、装置、介质及设备
CN113608664B (zh) * 2021-07-26 2024-06-18 京东科技控股股份有限公司 智能语音机器人交互效果优化方法、装置及智能机器人
CN113706172B (zh) * 2021-08-30 2023-08-25 平安银行股份有限公司 基于客户行为的投诉解决方法、装置、设备及存储介质
CN113793604B (zh) * 2021-09-14 2024-01-05 思必驰科技股份有限公司 语音识别系统优化方法和装置
CN115237182A (zh) * 2022-07-29 2022-10-25 大连世有电力科技有限公司 一种低功耗无线通讯的变压器温控系统
CN117219067B (zh) * 2023-09-27 2024-04-09 北京华星酷娱文化传媒有限公司 一种基于语音理解的短视频自动生成字幕的方法及系统

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609672A (zh) * 2009-07-21 2009-12-23 北京邮电大学 一种语音识别语义置信特征提取的方法和装置
CN107305575A (zh) * 2016-04-25 2017-10-31 北京京东尚科信息技术有限公司 人机智能问答系统的断句识别方法和装置
CN108389576A (zh) * 2018-01-10 2018-08-10 苏州思必驰信息科技有限公司 压缩后的语音识别模型的优化方法及系统
US20190272818A1 (en) * 2018-03-04 2019-09-05 International Business Machines Corporation Voice-transformation based data augmentation for prosodic classification
US20190385592A1 (en) * 2019-08-12 2019-12-19 Lg Electronics Inc. Speech recognition device and speech recognition method
CN111754985A (zh) * 2020-07-06 2020-10-09 上海依图信息技术有限公司 一种语音识别模型的训练以及语音识别的方法和装置
CN111933175A (zh) * 2020-08-06 2020-11-13 北京中电慧声科技有限公司 一种基于噪声场景识别的活动语音检测方法及系统
CN112435656A (zh) * 2020-12-11 2021-03-02 平安科技(深圳)有限公司 模型训练方法、语音识别方法、装置、设备及存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895932B (zh) * 2018-08-24 2022-05-03 中国科学院声学研究所 基于语言种类和语音内容协同分类的多语言语音识别方法
CN110797016B (zh) * 2019-02-26 2020-12-29 北京嘀嘀无限科技发展有限公司 一种语音识别方法、装置、电子设备及存储介质
CN110265001B (zh) * 2019-05-06 2023-06-23 平安科技(深圳)有限公司 用于语音识别训练的语料筛选方法、装置及计算机设备
CN110941945B (zh) * 2019-12-02 2021-03-23 百度在线网络技术(北京)有限公司 语言模型预训练方法和装置
CN111583911B (zh) * 2020-04-30 2023-04-14 深圳市优必选科技股份有限公司 基于标签平滑的语音识别方法、装置、终端及介质
CN111667818B (zh) * 2020-05-27 2023-10-10 北京声智科技有限公司 一种训练唤醒模型的方法及装置
CN111816165A (zh) * 2020-07-07 2020-10-23 北京声智科技有限公司 语音识别方法、装置及电子设备

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609672A (zh) * 2009-07-21 2009-12-23 北京邮电大学 一种语音识别语义置信特征提取的方法和装置
CN107305575A (zh) * 2016-04-25 2017-10-31 北京京东尚科信息技术有限公司 人机智能问答系统的断句识别方法和装置
CN108389576A (zh) * 2018-01-10 2018-08-10 苏州思必驰信息科技有限公司 压缩后的语音识别模型的优化方法及系统
US20190272818A1 (en) * 2018-03-04 2019-09-05 International Business Machines Corporation Voice-transformation based data augmentation for prosodic classification
US20190385592A1 (en) * 2019-08-12 2019-12-19 Lg Electronics Inc. Speech recognition device and speech recognition method
CN111754985A (zh) * 2020-07-06 2020-10-09 上海依图信息技术有限公司 一种语音识别模型的训练以及语音识别的方法和装置
CN111933175A (zh) * 2020-08-06 2020-11-13 北京中电慧声科技有限公司 一种基于噪声场景识别的活动语音检测方法及系统
CN112435656A (zh) * 2020-12-11 2021-03-02 平安科技(深圳)有限公司 模型训练方法、语音识别方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN112435656B (zh) 2024-03-01
CN112435656A (zh) 2021-03-02

Similar Documents

Publication Publication Date Title
WO2022121257A1 (fr) Model training method and apparatus, speech recognition method and apparatus, device and storage medium
JP7337953B2 (ja) 音声認識方法及び装置、ニューラルネットワークの訓練方法及び装置、並びにコンピュータープログラム
WO2018133761A1 (fr) Procédé et dispositif de dialogue homme-machine
CN110853666B (zh) 一种说话人分离方法、装置、设备及存储介质
WO2020253060A1 (fr) Procédé de reconnaissance vocale, procédé, appareil et dispositif d'apprentissage de modèle et support de stockage
WO2023103308A1 (fr) Procédé et appareil d'apprentissage de modèle, procédé et appareil de prédiction de texte, dispositif électronique et support
CN110390017B (zh) 基于注意力门控卷积网络的目标情感分析方法及系统
US11355097B2 (en) Sample-efficient adaptive text-to-speech
WO2022121251A1 (fr) Procédé et appareil d'entraînement de modèle de traitement de texte, dispositif informatique et support de stockage
CN116072098B (zh) 音频信号生成方法、模型训练方法、装置、设备和介质
WO2022121180A1 (fr) Procédé et appareil de formation de modèle, procédé de conversion de voix, dispositif, et support de stockage
WO2022178942A1 (fr) Procédé et appareil de reconnaissance d'émotion, dispositif informatique et support de stockage
WO2022141868A1 (fr) Procédé et appareil permettant d'extraire des caractéristiques de parole, terminal et support de stockage
US20210089909A1 (en) High fidelity speech synthesis with adversarial networks
Mai et al. Analyzing unaligned multimodal sequence via graph convolution and graph pooling fusion
CN111243574B (zh) 一种语音模型自适应训练方法、系统、装置及存储介质
CN116956835B (zh) 一种基于预训练语言模型的文书生成方法
Lee et al. Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities
CN113096647A (zh) 语音模型训练方法、装置和电子设备
JP2021081713A (ja) 音声信号を処理するための方法、装置、機器、および媒体
Jiang et al. RETRACTED ARTICLE: Intelligent online education system based on speech recognition with specialized analysis on quality of service
JP2018084627A (ja) 言語モデル学習装置およびそのプログラム
CN114065720A (zh) 会议纪要生成方法、装置、存储介质及电子设备
WO2024114303A1 (fr) Procédé et appareil de reconnaissance de phonèmes, dispositif électronique et support de stockage
CN111292715B (zh) 语音合成方法、装置、电子设备及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901970

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901970

Country of ref document: EP

Kind code of ref document: A1