CN114974221B - Speech recognition model training method and device and computer readable storage medium


Info

Publication number
CN114974221B
CN114974221B
Authority
CN
China
Prior art keywords
text
voice
error
speaker
correct
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210465435.8A
Other languages
Chinese (zh)
Other versions
CN114974221A (en)
Inventor
胡洪涛
徐景成
朱耀磷
彭成高
刘莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Internet Co Ltd
Priority to CN202210465435.8A
Publication of CN114974221A
Application granted
Publication of CN114974221B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Abstract

The application discloses a speech recognition model training method and device and a computer readable storage medium. The scheme provided by the application comprises the following steps: acquiring a user's feedback information on the voice recognition output of a target voice recognition model, wherein the feedback information comprises an error text of the voice recognition and a correct text corresponding to the error text; acquiring the speaker voice characteristics of the voice corresponding to the error text; determining updated training samples and corresponding labels based on the error text, the correct text corresponding to the error text and the speaker voice characteristics of the voice corresponding to the error text; and updating and training the target voice recognition model based on the updated training samples and corresponding labels.

Description

Speech recognition model training method and device and computer readable storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a method and apparatus for training a speech recognition model, and a computer readable storage medium.
Background
Speech recognition is a technology that converts speech into text, and a good speech recognition model requires thousands of hours of corpus for training. An existing speech recognition system generally remains unchanged once deployed; if it is to be updated, the following main methods are currently used: 1. purchasing data according to recognition performance, either commissioning a data company to customize data or directly buying an existing finished database; 2. manually re-labeling the data with poor recognition performance and then adding it back to the model for training.
The data obtained in this way is limited in speech duration and quantity, the whole process is long and incurs high time and monetary costs, and the improvement in recognition accuracy of the speech recognition model is therefore limited.
Disclosure of Invention
An object of the embodiments of the present application is to provide a speech recognition model training method and apparatus, and a computer readable storage medium, to solve the problems in existing speech recognition model training.
In order to solve the technical problems, the present specification is implemented as follows:
in a first aspect, a method for training a speech recognition model is provided, including:
acquiring feedback information of voice recognition output by a user on a target voice recognition model, wherein the feedback information comprises an error text of the voice recognition and a correct text corresponding to the error text;
acquiring the speaker voice characteristics of the voice corresponding to the error text;
determining an updated training sample and a corresponding label based on the error text, the correct text corresponding to the error text and the speaker voice characteristics of the voice corresponding to the error text;
and updating and training the target voice recognition model based on the updating and training sample and the corresponding label.
Optionally, determining the update training sample and the corresponding label based on the error text, the correct text corresponding to the error text, and the speaker speech feature of the speech corresponding to the error text includes:
calculating the confusion degree of the correct text corresponding to the error text;
screening a first correct text with the confusion degree exceeding a preset confusion degree threshold value and a first error text corresponding to the first correct text;
performing voice synthesis based on any collocation combination of the first correct texts with the speaker voice characteristics of the voices corresponding to the first error texts, and generating updated training samples;
and determining a label corresponding to the updated training sample based on the first correct text.
Optionally, the method further comprises:
crawling hot words from a target network;
matching the hotword with a training sample library of the target voice recognition model;
under the condition that the matching is unsuccessful, determining the hot word as a new word;
determining an updated training sample and corresponding label based on the erroneous text, the correct text corresponding to the erroneous text, and speaker speech characteristics of the speech corresponding to the erroneous text, comprising:
based on the new word, the wrong text, the correct text corresponding to the wrong text, and the speaker voice characteristics of the voice corresponding to the wrong text, the updated training sample and the corresponding label are determined.
Optionally, determining the update training sample and the corresponding label based on the new word, the error text, the correct text corresponding to the error text, and the speaker speech feature of the speech corresponding to the error text includes:
calculating the confusion degree of the correct text corresponding to the error text;
screening a first correct text with the confusion degree exceeding a preset confusion degree threshold value and a first error text corresponding to the first correct text;
performing voice synthesis based on any collocation combination of the first correct texts and the texts corresponding to the new words, respectively, with the speaker voice characteristics of the voices corresponding to the first error texts, to generate updated training samples;
and determining the label corresponding to the updated training sample based on the first correct text or the text of the new word.
Optionally, the confusion degree of the correct text corresponding to the error text is calculated by the following formula:

PPL(S) = P(W1, W2, …, Wk)^(-1/k)

wherein S represents a target correct text corresponding to a target error text, k represents the number of words included in the target correct text, and P(Wk) represents the sentence probability of the kth word included in the target correct text.
Optionally, before performing the speech synthesis, the method further comprises:
according to the speaker voice characteristics of the voice corresponding to the first error text, performing speaker clustering;
determining the number of speakers included in each clustered set after clustering;
and screening speaker voice characteristics of voices corresponding to the first error text in the clustering set with the number of speakers being lower than the preset number, and using the speaker voice characteristics for voice synthesis.
Optionally, the speaker clustering is performed according to speaker voice characteristics of the voice corresponding to the first error text, including:
calculating the similarity between the voice characteristics of the target speaker of the voice corresponding to a target first error text and the voice characteristics of each speaker in a training sample library of the target voice recognition model;
and clustering the target speakers to a clustering set to which the speakers with high voice feature similarity belong.
Optionally, determining the update training sample and the corresponding label based on the error text, the correct text corresponding to the error text, and the speaker speech feature of the speech corresponding to the error text includes:
according to the speaker voice characteristics of the voice corresponding to the error text, performing speaker clustering;
determining the number of speakers included in each clustered set after clustering;
screening speaker voice characteristics of voices corresponding to the second error text in the clustering set with the number of speakers lower than the preset number;
performing voice synthesis based on the correct texts corresponding to the error texts and the speaker voice characteristics of the voices corresponding to the second error texts, and generating updated training samples;
and determining the label corresponding to the updated training sample based on the correct text.
In a second aspect, there is provided a speech recognition model training apparatus comprising a memory and a processor electrically connected to the memory, the memory storing a computer program executable by the processor to perform the steps of the method according to the first aspect when the computer program is executed by the processor.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first aspect.
In the embodiment of the application, feedback information on the voice recognition output of a target voice recognition model is obtained from users, wherein the feedback information comprises an error text of the voice recognition and the correct text corresponding to the error text; the speaker voice characteristics of the voice corresponding to the error text are acquired; updated training samples and corresponding labels are determined based on the error text, the correct text corresponding to the error text and the speaker voice characteristics of the voice corresponding to the error text; and the target voice recognition model is updated and trained based on the updated training samples and corresponding labels. In this way, poorly recognized speech fed back by users is dynamically collected, speech synthesis is performed from the two dimensions of text and speaker voice characteristics to generate updated training samples, and these samples are added to the update training of the target voice recognition model in real time, so that faster, more timely and lower-cost corpus augmentation and model training can be realized and the recognition accuracy of the speech recognition model improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a flow chart of a speech recognition model training method according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating the steps of determining updated training samples and labels according to a first embodiment of the present application.
Fig. 3 is a flowchart illustrating the steps of determining updated training samples and labels according to a second embodiment of the present application.
Fig. 4 is a flowchart illustrating the steps of determining updated training samples and labels according to a third embodiment of the present application.
Fig. 5 is a block diagram showing the structure of a speech recognition model training apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application. The reference numerals in the present application are only used to distinguish the steps in the scheme, and are not used to limit the execution sequence of the steps, and the specific execution sequence controls the description in the specification.
In order to solve the problems in the prior art, an embodiment of the present application provides a method for training a speech recognition model, and fig. 1 is a schematic flow chart of the method for training a speech recognition model in the embodiment of the present application.
As shown in fig. 1, the method comprises the following steps:
step 102, obtaining feedback information of voice recognition output by a user on a target voice recognition model, wherein the feedback information comprises error text of the voice recognition and correct text corresponding to the error text;
step 104, obtaining the speaker voice characteristics of the voice corresponding to the error text;
step 106, determining an updated training sample and a corresponding label based on the error text, the correct text corresponding to the error text and the speaker voice characteristics of the voice corresponding to the error text;
and step 108, updating and training the target voice recognition model based on the updating and training sample and the corresponding label.
In step 102, the target speech recognition model is a generic speech recognition model for recognizing the input speech of different users and outputting speech recognition text to the corresponding users.
The feedback information is the feedback sent when the target user is not satisfied with the recognized text, for example when the voice recognition is inaccurate or not completely accurate; the user can mark the inaccurately recognized error text and annotate the corresponding correct text. The feedback information comprises the error text of the voice recognition output by the target voice recognition model and the correct text, marked by the user, corresponding to the error text.
The background collects users' feedback information in real time and stores the error texts of the voice recognition and the corresponding annotated correct texts in a database. In this way, multiple error texts and corresponding correct texts output by the target voice recognition model can be obtained from the feedback information of different users or from different feedback information of the same user.
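For concreteness, a minimal sketch of such a feedback store is given below; SQLite and the table layout are illustrative assumptions, since the patent only speaks of "a database":

```python
import sqlite3

def init_store(path: str = "feedback.db") -> sqlite3.Connection:
    """Create a minimal feedback table (illustrative schema)."""
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS feedback (
        error_text   TEXT,  -- text as (mis)recognized by the model
        correct_text TEXT   -- user-annotated correction
    )""")
    return con

def save_feedback(con: sqlite3.Connection, error_text: str, correct_text: str) -> None:
    """Store one user correction as it arrives from the feedback channel."""
    con.execute("INSERT INTO feedback VALUES (?, ?)", (error_text, correct_text))
    con.commit()
```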
In step 104, the voice corresponding to the error text is the voice of the user who fed back the target error text. The speaker voice characteristics include voiceprint vector information; the voiceprint vector information is related to information such as the speaker's age, gender, regional accent and timbre, and is a vector value, comprehensively determined from such information, that characterizes the speaker's pronunciation.
Under the condition that the target user feeds back the error text and the correct text of the voice recognition of the target voice recognition model, the voice characteristics of the speaker corresponding to the target user can be further obtained and stored in the database. Therefore, through the feedback information of different users, the voice characteristics of a plurality of different speakers with wrong recognition corresponding to the target voice recognition model can be obtained.
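The patent does not prescribe how the voiceprint vector is computed; as one hedged illustration, a pretrained speaker encoder such as Resemblyzer yields a fixed-size embedding per utterance:

```python
# Illustrative only: any speaker-embedding model could stand in here.
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def speaker_voice_feature(wav_path: str):
    """Return a voiceprint vector for one utterance of the feedback user."""
    wav = preprocess_wav(wav_path)       # load, resample and trim silence
    return encoder.embed_utterance(wav)  # 256-dim L2-normalized embedding
```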
In step 106, the correct texts and the speaker voice features may be obtained from the database, and the data of these two dimensions may be matched and combined arbitrarily for voice synthesis to generate updated training samples. For example, if the database includes correct text 1, correct text 2, … correct text n, and speaker voice feature 1, speaker voice feature 2, … speaker voice feature m, then the n correct texts and m speaker voice features can be paired and voice synthesis performed, generating multiple corpora, that is, multiple updated training samples for the target voice recognition model, where the correct text used to generate a target updated training sample is the label of that sample.
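A minimal sketch of this pairing step follows; `synthesize` is a stand-in for whatever TTS system performs the voice synthesis, and all names are assumptions:

```python
from itertools import product

def build_update_samples(correct_texts, speaker_features, synthesize):
    """Pair every correct text with every speaker voice feature.

    `synthesize(text, feature)` is assumed to return a waveform spoken
    in the voice described by `feature`; the text becomes the label.
    """
    samples = []
    for text, feature in product(correct_texts, speaker_features):
        audio = synthesize(text, feature)  # n x m synthesized utterances
        samples.append((audio, text))      # label = the correct text
    return samples
```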
In fact, a voice recognition error fed back by a user is not necessarily caused by the target voice recognition model; it may instead be caused by poor audio pickup on the user's device or by heavy background noise at the time. Therefore, to reduce resource waste, new corpus synthesis need not be performed for the error texts fed back by all users. The texts with voice recognition errors fed back by users can be analyzed, and only the truly valuable texts are screened out before speech is generated with the speaker voice features.
In one embodiment, as shown in fig. 2, step 106 determines an updated training sample and corresponding label based on the erroneous text, the correct text corresponding to the erroneous text, and the speaker speech characteristics of the speech corresponding to the erroneous text, including:
step 202, calculating the confusion degree of a correct text corresponding to the error text;
step 204, screening out a first correct text with the confusion degree exceeding a preset confusion degree threshold value and a first error text corresponding to the first correct text;
step 206, performing speech synthesis based on any collocation combination of the first correct texts with the speaker speech features of the speech corresponding to the first error texts, and generating updated training samples;
step 208, determining, based on the first correct text, a label corresponding to the updated training sample.
In this embodiment, in order to select texts accurately, the correct texts fed back by users for recognition errors are filtered by calculating their confusion degree (perplexity, PPL).
Optionally, the confusion degree of the correct text corresponding to the error text is calculated by the following formula (1):

PPL(S) = P(W1, W2, …, Wk)^(-1/k)    (1)

wherein S represents a target correct text corresponding to a target error text, k represents the number of words included in the target correct text, and P(Wk) represents the sentence probability of the kth word included in the target correct text.

The sentence S corresponding to the target correct text is a sequence of k words: S = W1, W2, …, Wk.

As can be seen from formula (1), PPL is the k-th root of the reciprocal of the sentence probability P(W1, W2, …, Wk). That is, the greater the sentence probability, the smaller the PPL, i.e., the less the speech recognition model is confused by the sentence and the better the modeling ability of the speech recognition model on the corresponding text.
Therefore, a PPL threshold is set, and each correct text whose PPL exceeds the threshold is selected; that is, the correct texts on which the target voice recognition model performs poorly (high PPL), together with the speaker voice characteristics of the speech corresponding to the associated error texts, are selected for speech synthesis to generate updated training samples for update training. This avoids resource waste and improves the data augmentation efficiency of the corpus samples.
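A small sketch of this screening step, assuming a language model that can return the per-word probabilities of a sentence (the patent does not name one):

```python
import math

def perplexity(word_probs):
    """PPL(S) = P(W1, ..., Wk)^(-1/k), computed in log space for stability."""
    k = len(word_probs)
    log_p = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p / k)

def screen_feedback(pairs, word_probs_fn, ppl_threshold):
    """Keep (error_text, correct_text) pairs whose correct text has high PPL."""
    return [(err, cor) for err, cor in pairs
            if perplexity(word_probs_fn(cor)) > ppl_threshold]
```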
In addition to obtaining text from the user feedback information for producing updated training samples, new words may be obtained from the network to augment the original training sample library of the target speech recognition model.
Optionally, the method further comprises: crawling hot words from a target network; matching the hotword with a training sample library of the target voice recognition model; and under the condition that the matching is unsuccessful, determining the hot word as a new word.
Determining an updated training sample and corresponding label based on the erroneous text, the correct text corresponding to the erroneous text, and speaker speech characteristics of the speech corresponding to the erroneous text, comprising: based on the new word, the wrong text, the correct text corresponding to the wrong text, and the speaker voice characteristics of the voice corresponding to the wrong text, the updated training sample and the corresponding label are determined.
Hot words, such as words with a high frequency of occurrence, are crawled from the network at regular intervals and matched against the original training sample library of the target voice recognition model. If the original training sample library lacks a hot word, the text of the hot word is stored in the database, and speech synthesis is performed on the hot-word texts together with the speaker voice characteristics of the speech corresponding to the error texts in the database, so as to generate updated training samples.
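The matching step can be as simple as a containment check against the library texts; the sketch below is one hedged reading, since the patent leaves the matching method open:

```python
def find_new_words(hotwords, sample_library_texts):
    """Return crawled hot words that appear in no training-library text."""
    return [w for w in hotwords
            if not any(w in text for text in sample_library_texts)]
```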
In another embodiment, as shown in fig. 3, determining the updated training samples and corresponding labels based on the new words, the erroneous text, the correct text corresponding to the erroneous text, and the speaker speech characteristics of the speech corresponding to the erroneous text, includes:
step 302, calculating the confusion degree of the correct text corresponding to the error text;
step 304, screening out a first correct text with the confusion degree exceeding a preset confusion degree threshold value and a first error text corresponding to the first correct text;
step 306, performing speech synthesis based on any collocation combination of the first correct texts and the texts corresponding to the new words, respectively, with the speaker speech features of the speech corresponding to the first error texts, to generate updated training samples;
step 308, determining a label corresponding to the updated training sample based on the first correct text or the text of the new word.
Steps 302 to 304 are the same as steps 202 to 204: valuable error texts are screened out through confusion degree calculation, which is not repeated here.
In step 306, in addition to the screened correct texts, new-word texts may be added to the texts used to generate updated training samples.
In this embodiment, each correct text, the text corresponding to each new word, and each speaker voice feature may be obtained from the database, and the correct texts and new-word texts are unified into a text dimension. The data of the text dimension and the speaker-voice-feature dimension are then matched and combined arbitrarily, and speech synthesis is performed to generate updated training samples. For example, if the database includes text 1, text 2, … text i, and speaker voice feature 1, speaker voice feature 2, … speaker voice feature m, the i texts and m speaker voice features can be paired and speech synthesis performed, generating multiple corpora, that is, multiple updated training samples for the target speech recognition model. When a target updated training sample is generated from a correct text, that correct text is its label; when it is generated from a new-word text, the new-word text is its label.
The above embodiments describe screening error texts from users' feedback information, or crawling new-word texts from the network, and arbitrarily matching and combining the corresponding texts with the speaker voice characteristics of the speech corresponding to the screened error texts to generate updated training samples. In one embodiment, the speaker voice features of the speech corresponding to the screened error texts can themselves be screened further.
Optionally, before performing the speech synthesis of the above embodiment, the method further includes:
according to the speaker voice characteristics of the voice corresponding to the first error text, performing speaker clustering;
determining the number of speakers included in each clustered set after clustering;
and screening speaker voice characteristics of voices corresponding to the first error text in the clustering set with the number of speakers being lower than the preset number, and using the speaker voice characteristics for voice synthesis.
Performing speaker clustering according to the speaker voice characteristics of the voice corresponding to the first error text includes: calculating the similarity between the voice characteristics of the target speaker of the voice corresponding to a target first error text and the voice characteristics of each speaker in the training sample library of the target voice recognition model; and clustering the target speaker into the cluster set to which the speakers with high voice feature similarity belong.
In the original training sample library of the target speech recognition model, the data are sorted by speaker, audio fragments of a specified length are segmented from the audio of each speaker, and the voiceprint vector information of all speaker audio fragments is extracted for hierarchical clustering. Specifically: each speaker's voiceprint vector is first treated as its own class, the similarity between every two classes is calculated, and the two most similar classes are merged into a new class. This calculate-similarity-then-merge process is repeated until even the closest two classes exceed the distance threshold, at which point clustering stops, thereby obtaining the speaker distribution in the original training library, including the number of speakers in each cluster set.
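A compact sketch of this agglomerative step, assuming the voiceprint vectors are stacked into a matrix; SciPy's hierarchical-clustering routines are used here purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_speakers(voiceprints: np.ndarray, distance_threshold: float):
    """Agglomeratively cluster voiceprint vectors; merging effectively stops
    once the closest remaining pair is farther apart than the threshold."""
    Z = linkage(voiceprints, method="average", metric="cosine")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")
    return labels  # labels[i] = cluster id of the i-th voiceprint
```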
After the first error texts of the feedback users and the corresponding speaker voice features are screened, the voiceprint vector information of each target feedback user is extracted and added to the above voiceprint hierarchical clustering, so that the target speaker corresponding to the target feedback user is clustered into the corresponding cluster set.
For a speech recognition model, recognition performance also degrades on speakers whose speech the original training sample library has not seen. Therefore, speakers can be screened further: if the speaker corresponding to a feedback user is clustered into a cluster set with few members in the original training sample library, that feedback user has few similar users in the library, so the speaker voice features are of high value and the original training sample library needs to be expanded with them; the texts screened in the previous step are then synthesized with these speaker voice features. Conversely, if the speaker corresponding to a feedback user is clustered into a cluster set with many members, the feedback user has many similar users in the original training sample library, and the speaker voice features of that user need not be synthesized with the corresponding texts to generate updated training samples.
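Continuing the sketch above, the screening by cluster size could look like this; the parameter names are assumptions:

```python
from collections import Counter

def screen_rare_speakers(labels, feedback_indices, preset_number: int):
    """Keep feedback speakers that fall into sparsely populated clusters."""
    cluster_sizes = Counter(labels)
    return [i for i in feedback_indices
            if cluster_sizes[labels[i]] < preset_number]
```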
In yet another embodiment, based on feedback of the user's speech recognition errors, only the speaker speech features corresponding to the speech recognition errors may be filtered to generate updated training samples.
As shown in fig. 4, step 106 determines an updated training sample and corresponding label based on the erroneous text, the correct text corresponding to the erroneous text, and the speaker speech characteristics of the speech corresponding to the erroneous text, including:
step 402, according to the speaker voice characteristics of the voice corresponding to the error text, performing speaker clustering;
step 404, determining the number of speakers included in each clustered set after clustering;
step 406, screening out speaker voice characteristics of voices corresponding to the second error text in the clustering set with the number of speakers being lower than the preset number;
step 408, performing speech synthesis based on the correct texts corresponding to the error texts and the speaker speech features of the speech corresponding to the second error texts, and generating updated training samples;
step 410, determining, based on the correct text, a label corresponding to the updated training sample.
Steps 402 to 406 are the same as the screening of the speaker voice features of the speech corresponding to the first error texts: the valuable speaker voice features among the feedback users are screened out through speaker clustering, which is not repeated here.
In step 408, the correct texts may be all the correct texts in the feedback information; they are combined with the speaker voice features screened in steps 402 to 406 for speech synthesis to generate updated training samples.
In the embodiment of the application, feedback information on the voice recognition output of a target voice recognition model is obtained from users, wherein the feedback information comprises an error text of the voice recognition and the correct text corresponding to the error text; the speaker voice characteristics of the voice corresponding to the error text are acquired; updated training samples and corresponding labels are determined based on the error text, the correct text corresponding to the error text and the speaker voice characteristics of the voice corresponding to the error text; and the target voice recognition model is updated and trained based on the updated training samples and corresponding labels. In this way, poorly recognized speech fed back by users is dynamically collected, speech synthesis is performed from the two dimensions of text and speaker voice characteristics to generate updated training samples, and these samples are added to the update training of the target voice recognition model in real time, so that faster, more timely and lower-cost corpus augmentation and model training can be realized and the recognition accuracy of the speech recognition model improved.
In addition, by filtering the texts and/or speaker voice characteristics fed back by users, the augmented updated training samples that are generated are guaranteed to be valuable, resource waste is avoided, data augmentation efficiency is improved, and the recognition performance of the speech recognition model can be correspondingly improved.
Optionally, the embodiment of the present application further provides a voice recognition model training device, and fig. 5 is a structural block diagram of the voice recognition model training device of the embodiment of the present application.
As shown in fig. 5, the speech recognition model training apparatus 2000 includes a memory 2200 and a processor 2400 electrically connected to the memory 2200. The memory 2200 stores a computer program executable by the processor 2400; when executed by the processor, the computer program implements the processes of any one of the foregoing speech recognition model training method embodiments and achieves the same technical effects, which are not repeated here to avoid redundancy.
The embodiment of the application further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements each process of any one of the foregoing embodiments of the speech recognition model training method, and can achieve the same technical effect, so that repetition is avoided, and no further description is given here. Wherein the computer readable storage medium is selected from Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (6)

1. A method for training a speech recognition model, comprising:
acquiring feedback information of voice recognition output by a user on a target voice recognition model, wherein the feedback information comprises an error text of the voice recognition and a correct text corresponding to the error text;
acquiring the speaker voice characteristics of the voice corresponding to the error text;
determining an updated training sample and a corresponding label based on the error text, the correct text corresponding to the error text and the speaker voice characteristics of the voice corresponding to the error text;
based on the update training sample and the corresponding label, performing update training on the target voice recognition model;
wherein, based on the wrong text, the correct text corresponding to the wrong text, the speaker voice feature of the voice corresponding to the wrong text, determining to update the training sample and the corresponding label comprises:
calculating the confusion degree of the correct text corresponding to the error text;
screening a first correct text with the confusion degree exceeding a preset confusion degree threshold value and a first error text corresponding to the first correct text;
performing voice synthesis based on any collocation combination of the first correct texts with the speaker voice characteristics of the voices corresponding to the first error texts, and generating updated training samples, wherein any collocation combination refers to the combination of any one of the plurality of first correct texts with any one of the speaker voice characteristics of the voices corresponding to the plurality of first error texts;
determining a label corresponding to the updated training sample based on the first correct text;
or,
determining an updated training sample and corresponding label based on the erroneous text, the correct text corresponding to the erroneous text, and speaker speech characteristics of the speech corresponding to the erroneous text, comprising:
according to the speaker voice characteristics of the voice corresponding to the error text, performing speaker clustering;
determining the number of speakers included in each clustered set after clustering;
screening speaker voice characteristics of voices corresponding to the second error text in the clustering set with the number of speakers lower than the preset number;
performing voice synthesis based on the correct texts corresponding to the error texts and the speaker voice characteristics of the voices corresponding to the second error texts, and generating updated training samples;
determining a label corresponding to the updated training sample based on the correct text;
the confusion degree of the correct text corresponding to the error text is calculated by the following formula:

PPL(S) = P(W1, W2, …, Wk)^(-1/k)

wherein S represents a target correct text corresponding to a target error text, k represents the number of words included in the target correct text, and P(Wk) represents the sentence probability of the kth word included in the target correct text.
2. The method as recited in claim 1, further comprising:
crawling hot words from a target network;
matching the hotword with a training sample library of the target voice recognition model;
under the condition that the matching is unsuccessful, determining the hot word as a new word;
determining an updated training sample and corresponding label based on the erroneous text, the correct text corresponding to the erroneous text, and speaker speech characteristics of the speech corresponding to the erroneous text, comprising:
based on the new words, the error text, the correct text corresponding to the error text and the speaker voice characteristics of the voice corresponding to the error text, determining an updated training sample and a corresponding label;
wherein, based on the new word, the error text, the correct text corresponding to the error text, the speaker voice feature of the voice corresponding to the error text, determining to update the training sample and the corresponding label comprises:
calculating the confusion degree of the correct text corresponding to the error text;
screening a first correct text with the confusion degree exceeding a preset confusion degree threshold value and a first error text corresponding to the first correct text;
performing voice synthesis based on any collocation combination of the first correct texts and the texts corresponding to the new words, respectively, with the speaker voice characteristics of the voices corresponding to the first error texts, and generating updated training samples, wherein any collocation combination refers to the combination of any text in a text dimension with any speaker voice characteristic of the voices corresponding to the plurality of first error texts, and the text dimension comprises the first correct texts and the texts corresponding to the new words;
and determining the label corresponding to the updated training sample based on the first correct text or the text of the new word.
3. The method of claim 1 or 2, further comprising, prior to performing speech synthesis:
according to the speaker voice characteristics of the voice corresponding to the first error text, performing speaker clustering;
determining the number of speakers included in each clustered set after clustering;
and screening out the speaker voice characteristics of the voices corresponding to the first error texts in the clustering set with the number of the speakers being lower than the preset number, and taking the speaker voice characteristics of the voices corresponding to the screened first error texts as the speaker voice characteristics of the voices corresponding to the plurality of first error texts for voice synthesis.
4. The method of claim 3, wherein performing speaker clustering based on speaker speech characteristics of the speech corresponding to the first erroneous text comprises:
calculating the similarity between the voice characteristics of a target speaker of the voice corresponding to the first error text of the target and the voice characteristics of each speaker in a training sample library of the target voice recognition model;
and clustering the target speakers to a clustering set to which the speakers with high voice feature similarity belong.
5. A speech recognition model training device, comprising: a memory and a processor electrically connected to the memory, the memory storing a computer program executable by the processor, the computer program implementing the steps of the method of any one of claims 1 to 4 when executed by the processor.
6. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
CN202210465435.8A 2022-04-29 2022-04-29 Speech recognition model training method and device and computer readable storage medium Active CN114974221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210465435.8A CN114974221B (en) 2022-04-29 2022-04-29 Speech recognition model training method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210465435.8A CN114974221B (en) 2022-04-29 2022-04-29 Speech recognition model training method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114974221A CN114974221A (en) 2022-08-30
CN114974221B true CN114974221B (en) 2024-01-19

Family

ID=82978352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210465435.8A Active CN114974221B (en) 2022-04-29 2022-04-29 Speech recognition model training method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114974221B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Optimize method, system, equipment and the storage medium of voice recognition acoustic model
CN108735220A (en) * 2018-04-11 2018-11-02 四川斐讯信息技术有限公司 A kind of language learning intelligent earphone, intelligent interactive system and man-machine interaction method
CN109949797A (en) * 2019-03-11 2019-06-28 北京百度网讯科技有限公司 A kind of generation method of training corpus, device, equipment and storage medium
CN110853628A (en) * 2019-11-18 2020-02-28 苏州思必驰信息科技有限公司 Model training method and device, electronic equipment and storage medium
CN110942772A (en) * 2019-11-21 2020-03-31 新华三大数据技术有限公司 Voice sample collection method and device
CN114141235A (en) * 2021-10-26 2022-03-04 招联消费金融有限公司 Voice corpus generation method and device, computer equipment and storage medium
CN114203166A (en) * 2021-12-10 2022-03-18 零犀(北京)科技有限公司 Method, device and equipment for generating training data based on man-machine conversation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9111540B2 (en) * 2009-06-09 2015-08-18 Microsoft Technology Licensing, Llc Local and remote aggregation of feedback data for speech recognition
US8738375B2 (en) * 2011-05-09 2014-05-27 At&T Intellectual Property I, L.P. System and method for optimizing speech recognition and natural language parameters with user feedback
KR102550932B1 (en) * 2017-12-29 2023-07-04 삼성전자주식회사 Method and apparatus for personalizing speech recognition model


Also Published As

Publication number Publication date
CN114974221A (en) 2022-08-30

Similar Documents

Publication Publication Date Title
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US6999925B2 (en) Method and apparatus for phonetic context adaptation for improved speech recognition
US7562014B1 (en) Active learning process for spoken dialog systems
CN1196105C (en) Extensible speech recongnition system that provides user audio feedback
EP0387602B1 (en) Method and apparatus for the automatic determination of phonological rules as for a continuous speech recognition system
US6839667B2 (en) Method of speech recognition by presenting N-best word candidates
US8195459B1 (en) Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments
WO2020036178A1 (en) Voice conversion learning device, voice conversion device, method, and program
WO2008069139A1 (en) Speech recognition system and speech recognition system program
CN110019741B (en) Question-answering system answer matching method, device, equipment and readable storage medium
JPH06250688A (en) Speech recognition device and label production
EP3940693A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
CN112925945A (en) Conference summary generation method, device, equipment and storage medium
US20070129946A1 (en) High quality speech reconstruction for a dialog method and system
US7844459B2 (en) Method for creating a speech database for a target vocabulary in order to train a speech recognition system
CN112259081B (en) Voice processing method and device
CN114974221B (en) Speech recognition model training method and device and computer readable storage medium
CN117292680A (en) Voice recognition method for power transmission operation detection based on small sample synthesis
Imperl et al. Clustering of triphones using phoneme similarity estimation for the definition of a multilingual set of triphones
CN113936642A (en) Pronunciation dictionary construction method, voice recognition method and related device
KR101397825B1 (en) Speech recognition system and method based on location information
JP4986301B2 (en) Content search apparatus, program, and method using voice recognition processing function
JP3461789B2 (en) Speech recognition device, speech recognition method, and program recording medium
CN115910070A (en) Voice recognition method, device, equipment and storage medium
CN114722246A (en) Data collection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant