CN110808049B - Voice annotation text correction method, computer device and storage medium - Google Patents


Info

Publication number
CN110808049B
Authority
CN
China
Prior art keywords
text
recognition result
adaptive
sub
model
Prior art date
Legal status
Active
Application number
CN201810792037.0A
Other languages
Chinese (zh)
Other versions
CN110808049A (en)
Inventor
黄石磊
刘轶
程刚
王昕
Current Assignee
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN201810792037.0A priority Critical patent/CN110808049B/en
Publication of CN110808049A publication Critical patent/CN110808049A/en
Application granted granted Critical
Publication of CN110808049B publication Critical patent/CN110808049B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice


Abstract

The application relates to a method for correcting voice annotation text, comprising the following steps: acquiring a corrected sub-text in a text to be detected, where the text to be detected is the highest-scoring recognition result in the initial word graph obtained by an initial speech recognizer recognizing the speech; determining an adaptive model for model adjustment in the initial speech recognizer based on the modification type of the corrected sub-text; and determining an adaptive recognition result through the adaptive model and updating the related information of the uncorrected sub-text in the text to be detected according to the adaptive recognition result. By learning the correction logic in the corrected sub-text, the method adjusts the model in the speech recognizer to obtain a new model, through which the uncorrected portion obtains a new recognition result; this reduces the workload of correcting the remaining annotated text and improves efficiency.

Description

Voice annotation text correction method, computer device and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, a computer device, and a storage medium for correcting a speech annotation text.
Background
Speech annotation (transcription) is the conversion of speech into text; the process can be divided into coarse annotation and fine annotation. At present, human annotators proofread the output of speech recognition, and large-scale data collection and analysis are used to make recognition more accurate. On one hand, voice annotation provides basic data resources for speech recognition and other speech applications; on the other hand, in a practical system, a result that depends entirely on speech recognition always contains some errors, so manual checking and correction are necessary, and improving the speed of this correction is a problem to be solved. Moreover, a high-accuracy recognition system generally requires a large volume of manually annotated data, so low annotation efficiency is a bottleneck.
Disclosure of Invention
In view of the above, it is desirable to provide a method, a computer device and a storage medium for correcting voice annotation text that can address the inefficiency of correcting such text.
A method of voice annotation text correction, the method comprising:
acquiring a corrected sub-text in a text to be detected; the text to be detected is a recognition result with the highest score in the initial word graph obtained by recognizing the voice by the initial voice recognizer;
determining an adaptive model for model adaptation in an initial speech recognizer based on a type of modification of the modified sub-text;
and determining an adaptive recognition result through the adaptive model, and updating the related information of the uncorrected sub-text in the text to be detected according to the adaptive recognition result.
In one embodiment, when the modification type of the modified sub-text is homophonic modification, determining an adaptive model for model adjustment in the initial speech recognizer based on the modification type of the modified sub-text comprises:
determining an adaptive language model by a language model generator from the corrected sub-text and an initial language model in an initial speech recognizer;
determining an adaptive recognition result through the adaptive model, and updating the unmodified sub-text in the text to be detected according to the adaptive recognition result comprises the following steps:
recalculating the initial word graph of the voice corresponding to the uncorrected sub-text through the adaptive language model, and determining a first recognition result; and updating the related information of the unmodified sub-text according to the first recognition result.
In one embodiment, the recalculating, by the adaptive language model, the initial word graph of the speech corresponding to the uncorrected sub-text, and the determining the first recognition result includes:
and recalculating the initial word graph of the voice corresponding to the uncorrected sub-text to obtain a first adaptive word graph, and taking the recognition result with the highest score in the first adaptive word graph as a first recognition result.
In one embodiment, when the modification type of the modified sub-text is a different-pronunciation modification, determining an adaptive model for model adjustment in the initial speech recognizer based on the modification type of the modified sub-text comprises:
determining an adaptive acoustic model through an acoustic model adaptive algorithm according to the corrected sub-text, the text corresponding to the corrected sub-text before correction and the initial acoustic model;
determining an adaptive recognition result through the adaptive model, and updating the unmodified sub-text in the text to be detected according to the adaptive recognition result comprises the following steps:
re-recognizing the speech corresponding to the unmodified sub-text through an adaptive speech recognizer containing the adaptive acoustic model, and determining a second recognition result;
and updating the related information of the unmodified sub-text according to the second recognition result.
In one embodiment, re-recognizing the speech corresponding to the unmodified sub-text by the adaptive speech recognizer including the adaptive acoustic model, and determining the second recognition result, includes:
re-recognizing the speech corresponding to the unmodified sub-text through the adaptive speech recognizer to obtain a second adaptive word graph, and taking the recognition result with the highest score in the second adaptive word graph as the second recognition result.
In one embodiment, when a first recognition result and a second recognition result appear at the same time and the first recognition result and the second recognition result are consistent, any one of the first recognition result and the second recognition result is used as a third recognition result;
and updating the related information of the unmodified sub-text according to the third identification result.
In one embodiment, when a first recognition result and a second recognition result occur simultaneously and the first recognition result and the second recognition result do not coincide,
taking the recognition result with the higher score of the first recognition result and the second recognition result as a third recognition result; alternatively,
simultaneously reserving the first recognition result and the second recognition result as a third recognition result;
and updating the related information of the unmodified sub-text according to the third recognition result.
In one embodiment, the information related to the uncorrected sub-text includes: the word graph of the voice corresponding to the unmodified sub-text and the unmodified sub-text.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
According to the voice labeling text correction method, the computer device and the storage medium, when part of text to be detected is corrected by a user, the corrected sub-text is obtained, the model in the initial voice recognizer can be adjusted based on the correction type of the corrected sub-text to obtain a new model, a new recognition result can be obtained through the new model, and the related information of the uncorrected sub-text is updated according to the new recognition result. The model in the speech recognizer is adjusted by learning the correction logic in the corrected sub-text to obtain a new model, and the part which is not corrected can obtain a new recognition result through the new model, so that certain workload can be reduced for a user to correct the subsequent labeling text, and the efficiency is improved.
Drawings
FIG. 1 is a flowchart illustrating a method for correcting voice annotation text in an embodiment;
FIG. 2 is a flowchart illustrating a method for correcting voice annotation text in another embodiment;
FIG. 3 is a flowchart illustrating a method for correcting voice annotation text in yet another embodiment;
FIG. 4 is a flowchart illustrating a method for correcting voice annotation text in an embodiment;
FIG. 5 is a diagram illustrating the internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a method for modifying a text with a voice annotation includes steps S110 to S130.
Step S110, acquiring the corrected sub-text in the text to be detected.
The text to be detected is the recognition result with the highest score in an initial word graph obtained by an initial speech recognizer recognizing the speech. In the embodiments of the application, the manually corrected portion of the text to be detected is recorded as the corrected sub-text, and the remainder is the uncorrected sub-text. The speech recognizer recognizes the speech to obtain a plurality of candidate results; the set of candidate results is called a word graph. A word graph is also called a recognition lattice: a directed graph representing the different results of the speech recognizer recognizing the same piece of speech, i.e. containing multiple candidates; it can also be regarded as storing the intermediate results generated by the speech recognizer. Generally, the final result of speech recognition is one path in the word graph, or a word graph of "width" 1, namely the best of the multiple candidates. Since the word graph contains multiple recognition results, even when the recognized text is incorrect the word graph still very likely contains the correct result for the erroneous portion of the speech, only not as the "best" candidate. Each candidate recognition result in the word graph could be the correct one, and each includes a time, a symbol and a score; in the embodiments of the application, the candidate with the highest score is taken as the final recognition result. For example, one candidate recognition result in the word graph is (0.0, 'today', 0.9), where "0.0" represents the time, "today" represents the symbol, and "0.9" represents the score of "today" at time "0.0".
Wherein the score is used to indicate the possibility that the candidate recognition result is the best candidate recognition result, and the higher the score is, the higher the possibility that the candidate recognition result is considered to be the best candidate recognition result is.
In this embodiment, the text to be detected is the best candidate recognition result in the word graph (with multiple candidate recognition results) obtained by the initial speech recognizer (including the initial language model and the initial acoustic model) for speech recognition.
The corrected sub text is the labeled text corresponding to the part which is corrected and modified manually. The method for correcting the voice labeling text modifies the text which is not manually corrected in the text to be detected based on the part of the text which is manually corrected and confirmed to be modified, so that the time required by manual correction is reduced as much as possible, and the manual correction efficiency is improved. Therefore, when the corrected sub-text is detected to exist in the text to be detected, the corrected sub-text is acquired, and then the subsequent operation is performed.
In one embodiment, the text to be detected is stored in a database, and the database also stores the voice and the word graph corresponding to the text to be detected. The text to be detected is a recognition result of the initial voice recognizer on the voice.
In one embodiment, when the user corrects the voice annotation text, the system displays the speech and the annotated text on the interface simultaneously. When the user modifies a given annotated text, the user selects the erroneous text and changes it into the correct text; the system then compares the two texts, including their vocabularies and corresponding pronunciations, to obtain the difference, from which the machine learns how to adjust the acoustic model and the language model.
In one embodiment, the output of the speech recognizer is a sequence, where each item in the sequence corresponds to a recognition result symbol (generally a word) at a certain moment; each item may contain N best candidates (hypotheses); each candidate includes at least (time, symbol (word), score), where a larger score indicates a higher likelihood. In one embodiment, the score is denoted by SC, and SCn(i) denotes the score of the n-th candidate for the i-th word.
For example, if a certain piece of speech corresponds to "the weather is good today", the first symbol of the sequence has three candidates: (0.0, 'today', 0.9), (0.0, 'today', 0.5) and (0.0, 'tomorrow', 0.01) (two distinct candidate words in the source language may share the English gloss "today"). Since the score of the 'today' candidate is the highest of the three, 'today' is the best candidate for the first symbol, and therefore the first word displayed in the final recognition result is 'today'. In practice, the number of candidates for each symbol may vary. For simplicity, some embodiments consider only the best candidate sequence, e.g. only 'today' for the first symbol.
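The best-candidate selection just described can be sketched as follows (a minimal illustration; the `(time, symbol, score)` triples follow the example above, and the function names are not from the patent):

```python
def best_candidate(candidates):
    """Return the (time, symbol, score) triple with the highest score."""
    return max(candidates, key=lambda c: c[2])

def best_sequence(word_graph):
    """Pick the best candidate symbol at each position of the word graph."""
    return [best_candidate(cands)[1] for cands in word_graph]

# Candidates for the first symbol, taken from the example in the text.
first_symbol = [(0.0, 'today', 0.9), (0.0, 'today', 0.5), (0.0, 'tomorrow', 0.01)]
```

With these candidates, `best_sequence([first_symbol])` yields `['today']`, matching the final recognition result described above.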
Step S120, an adaptive model is determined for model adjustment in the initial speech recognizer based on the corrected type of the corrected sub-text.
The correction in the corrected sub-text is the portion of the text to be detected that has been manually corrected. Corrections fall into different types; in one embodiment, the modification types are homophonic modification and different-pronunciation modification. A homophonic modification is one in which the text before and after the correction has the same pronunciation; for example, modifying 'towel' to 'capital' is a homophonic modification (the two words are homophones in the original Chinese). A modification in which the replacement word is pronounced differently from the original belongs to the different-pronunciation type.
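Distinguishing the two modification types amounts to comparing the pronunciations of the text before and after correction. A hedged sketch, using a toy pronunciation table in place of a real pronunciation dictionary (the pinyin entries are illustrative assumptions, not from the patent):

```python
# Toy pronunciation lookup; a real system would consult a pronunciation
# dictionary (for Chinese, e.g. pinyin with tones).  Entries are assumed
# for illustration: 'towel' and 'capital' stand in for the homophone pair
# in the original Chinese example.
PRONUNCIATION = {
    'towel':    'shou3 du1',
    'capital':  'shou3 du1',
    'today':    'jin1 tian1',
    'tomorrow': 'ming2 tian1',
}

def modification_type(before, after, pron=PRONUNCIATION):
    """Classify a correction as 'homophonic' when the pronunciations of the
    text before and after the change match, else 'different-pronunciation'."""
    if before in pron and pron.get(before) == pron.get(after):
        return 'homophonic'
    return 'different-pronunciation'
```

The classification result then selects which model is adapted: the language model for homophonic corrections, the acoustic model for different-pronunciation corrections.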
In the speech recognition task, an Acoustic Model (AM), a Language Model (LM) and a pronunciation dictionary are generally required. The model in the initial speech recognizer is that of a general speech recognizer; in the embodiments of the application, the result of speech recognition is an annotated text and a word graph. The initial model is adjusted according to how the annotated text was manually corrected, yielding an adaptive model that better fits the statistical characteristics of the current batch of speech in the database. How the model is adjusted is determined by the modification type in the corrected sub-text.
Step S130, determining an adaptive recognition result through the adaptive model, and updating the relevant information of the uncorrected sub-text in the text to be detected according to the adaptive recognition result.
After the model in the initial speech recognizer is adjusted, an adaptive model is generated, and a new recognition result (adaptive recognition result) can be obtained through the adaptive model. And then updating the related information of the uncorrected sub-text excluding the corrected sub-text in the text to be detected according to the adaptive identification result.
In the voice labeling text correction method, the computer device and the storage medium, when part of text to be detected is corrected by a user, a corrected sub-text is obtained, the model in the initial voice recognizer can be adjusted based on the correction type of the corrected sub-text to obtain a new model, a new recognition result can be obtained through the new model, and the related information of the non-corrected sub-text is updated according to the new recognition result. The model in the speech recognizer is adjusted by learning the correction logic in the corrected sub-text to obtain a new model, and the part which is not corrected can obtain a new recognition result through the new model, so that certain workload can be reduced for a user to correct the subsequent labeling text, and the efficiency is improved.
In an embodiment, when the modification type of the modified sub-text is a homophonic modification (FIG. 2 shows a flowchart of the method in this embodiment), determining an adaptive model for model adjustment in the initial speech recognizer based on the modification type of the modified sub-text comprises:
step S210, determining an adaptive language model by a language model generator according to the corrected sub-text and the initial language model in the initial speech recognizer.
The language model is an important model in the speech recognizer, is a knowledge representation formed by a group of word sequences, and is a probability model established for a certain language, so that the probability value of a correct word is greater than that of an error word. Statistical Language Model (N-gram) is a common Language Model in Large Vocabulary Continuous Speech Recognition (LVCSR), and for Chinese, we refer to it as Chinese Language Model (CLM). The Chinese language model describes statistical information of collocation between adjacent words in the context.
The initial language model is a general language model, and in the embodiment of the present application, any realizable language model is used for realization. In one embodiment, the language model employs a ternary statistical language model (3-gram) based on Chinese words, the vocabulary of which is about 20 million words, containing more than 3 million word combinations.
The language model generator is used for comparing the initial language model and the modified text to generate a new language model. In embodiments of the present application, the language model generator employs any one of a number of techniques that may be implemented.
In one embodiment, the language model generator recalculates the scores in the word graph as follows. If a word W(n) is modified during the correction, the language model probability of that word's N-gram combination, i.e. the N-tuple consisting of the preceding N-1 words and the word itself, (W(n-N+1), W(n-N+2), ... W(n)), is modified. In one case, the combination (W(n-N+1), W(n-N+2), ... W(n)) is present in the initial speech recognition result; its probability can then be increased by a certain proportion while the probabilities of all other combinations are decreased (ensuring that the overall probabilities still sum to a constant). For example, if the combination ('Olympic', 'capital') exists in the initial recognition result with probability 0.001, it can be raised by 50% to 0.0015, with the remaining N-gram combination probabilities changed accordingly so that the total probability is unchanged. In another case, the combination (W(n-N+1), W(n-N+2), ... W(n)) does not exist in the initial speech recognition result, e.g. 'capital' or 'Olympic capital' is absent; the combination is then added and assigned a corresponding probability, e.g. 0.001 for 'capital' and 0.002 for 'Olympic capital'. The probability of a combination may be the conditional probability corresponding to it, i.e. P(W(n) | W(n-N+1), W(n-N+2), ... W(n-1)), or the sum of the probabilities of the respective combinations may be held to a constant value.
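The boost-and-renormalize step described above can be sketched as follows (a simplification: a real N-gram model stores conditional probabilities with back-off, whereas this toy treats the table as a single distribution summing to 1):

```python
def boost_ngram(probs, combo, factor=1.5):
    """Raise the probability of one n-gram combination by `factor` and scale
    the remaining combinations down so the total probability is unchanged.
    `probs` maps n-gram tuples to probabilities summing to 1."""
    boosted = min(probs[combo] * factor, 1.0)
    rest = 1.0 - probs[combo]
    scale = (1.0 - boosted) / rest if rest > 0 else 0.0
    return {k: (boosted if k == combo else p * scale) for k, p in probs.items()}

# Example from the description: ('olympic', 'capital') rises from 0.001 to 0.0015.
initial = {('olympic', 'capital'): 0.001, ('other', 'bigram'): 0.999}
adapted = boost_ngram(initial, ('olympic', 'capital'))
```

The 50% boost (`factor=1.5`) mirrors the numeric example in the text; the scaling of the remaining combinations keeps the total probability mass constant.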
The adaptive language model is a new language model which is generated again after the initial language model is adjusted by the language model generator according to the corrected child texts. The main characteristic of the newly generated language model is that the newly generated language model contains all information after text modification, and the modified sub-texts are most likely to be words which are not contained in the initial language model or words with low scores in the initial word graph.
In the embodiment shown in fig. 2, determining an adaptive recognition result through the adaptive model, and updating the unmodified text in the text to be detected according to the adaptive recognition result includes:
step S220, recalculating the initial word graph of the voice corresponding to the uncorrected sub-text through the adaptive language model, and determining a first recognition result; and updating the related information of the unmodified sub-text according to the first recognition result.
In one embodiment, the recalculating, by the adaptive language model, the initial word graph of the speech corresponding to the uncorrected sub-text, and the determining the first recognition result includes:
and recalculating the initial word graph of the voice corresponding to the uncorrected sub-text to obtain a first adaptive word graph, and taking the recognition result with the highest score in the first adaptive word graph as a first recognition result.
In this embodiment, since the modification in the corrected sub-text is a homophonic modification, the acoustic model and the acoustic probabilities of the speech do not change; the acoustic model does not need to be updated, and only the language model is involved. Specifically, the paths of the word graph of the speech corresponding to the uncorrected sub-text are re-scored through the adaptive language model to obtain the first recognition result, and the related information of the uncorrected sub-text in the text to be detected is updated according to the first recognition result. In this way, the best candidate (i.e. the speech recognition result, the text transcription of the corresponding speech) may change.
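Re-scoring the word-graph paths with the adapted language model can be sketched as below. This is a deliberate simplification (it scores each position independently and treats `lm_score` as an additive per-symbol term, whereas real lattice rescoring combines acoustic and LM scores along whole paths):

```python
def rescore_word_graph(word_graph, lm_score, lm_weight=1.0):
    """Re-score every candidate by adding a weighted language-model score to
    its existing score, then keep the best candidate per position.  A
    stand-in for path rescoring; `lm_score(symbol)` would come from the
    adapted language model."""
    best = []
    for candidates in word_graph:
        rescored = [(t, sym, sc + lm_weight * lm_score(sym))
                    for (t, sym, sc) in candidates]
        best.append(max(rescored, key=lambda c: c[2]))
    return best
```

After adaptation the previously second-ranked candidate can overtake the old best one, which is exactly how the first recognition result may differ from the original text.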
In another embodiment, when the modification type of the modified sub-text is a different-pronunciation modification (FIG. 3 shows a flowchart of the steps of the voice annotation text correction method in this embodiment), determining an adaptive model for model adjustment in the initial speech recognizer based on the modification type of the modified sub-text comprises:
and S310, determining an adaptive acoustic model through an acoustic model adaptive algorithm according to the corrected sub-text, the text corresponding to the corrected sub-text before correction and the initial acoustic model.
Among them, the acoustic model is one of the most important parts in the speech recognition system, and is a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, and the like. At present, a Hidden Markov Model (HMM) is mostly used for modeling in a mainstream system.
In speech recognition, phones are the acoustic model units (which may be phonemes, initials/finals, or syllables). In unit selection, the influence of a phone's context (the preceding and following phones) is considered so that the model describes the speech more accurately: a model that considers only the influence of the preceding phone is called a Bi-Phone model, and one that considers both the preceding and the following phone is called a Tri-Phone model.
In the embodiment of the application, the initials and finals are basic pronunciation units of Chinese.
In the embodiment of the present application, the initial acoustic model may be any acoustic model that can be implemented, and in a specific embodiment, the initial acoustic model is a triphone model based on chinese initials and finals.
Acoustic model adaptation corrects the initial acoustic model. In the embodiments of the application, any existing realizable technique may be used; for example, for an HMM-based speech recognition system, MAP (maximum a posteriori) adaptation, MLLR (maximum likelihood linear regression), or improved variants thereof may be used. The adapted acoustic model is then used by the new speech recognition module.
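As a toy illustration of the MAP adaptation mentioned above, the update of a single one-dimensional Gaussian mean can be sketched as follows (a simplification: real acoustic models adapt many mixture components across many HMM states, and the prior strength `tau` here is an assumed hyperparameter):

```python
def map_adapt_mean(mu0, data, tau=10.0):
    """MAP update of a Gaussian mean: blend the prior mean mu0 with the
    adaptation data, weighted by the prior strength tau:
        mu_new = (tau * mu0 + sum(data)) / (tau + len(data))
    With little data the prior dominates; with much data the sample mean does."""
    n = len(data)
    return (tau * mu0 + sum(data)) / (tau + n)
```

For example, ten adaptation frames all equal to 1.0 move a prior mean of 0.0 halfway to the data when `tau=10.0`, which is the characteristic data-versus-prior trade-off of MAP adaptation.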
In the embodiment shown in fig. 3, determining an adaptive recognition result through the adaptive model, and updating the unmodified text in the text to be detected according to the adaptive recognition result includes:
step S320, re-identifying the voice corresponding to the unmodified document through an adaptive voice identifier containing an adaptive acoustic model, and determining a second identification result; and updating the related information of the unmodified sub-text according to the second identification result.
In one embodiment, re-recognizing the speech corresponding to the unmodified sub-text by the adaptive speech recognizer including the adaptive acoustic model, and determining the second recognition result, includes:
re-recognizing the speech corresponding to the unmodified sub-text through the adaptive speech recognizer to obtain a second adaptive word graph, and taking the recognition result with the highest score in the second adaptive word graph as the second recognition result.
In this embodiment, the modification in the corrected sub-text is of the different-pronunciation type. The method further includes obtaining adaptation data from the corrected sub-text and the corresponding text before correction, adjusting the initial acoustic model through an acoustic model adaptation algorithm to generate a new acoustic model, and re-recognizing the speech of the portion that has not been manually corrected using the new acoustic model and the current language model to obtain a new speech recognition result, recorded as the second recognition result. The speech data used for adaptation must be speech whose pronunciation differs between the text before and the text after modification, together with portions the user has judged to be correct (and therefore left unmodified) through the user interface. In this embodiment, a speech recognition engine based on WFST (weighted finite-state transducer) and LSTM (Long Short-Term Memory) neural networks is used, with updates incorporating the learned content. In one example, the updated language model vocabulary has 20 more words than the original language model, including 200 more combinations, and 10,000 combination probabilities are adjusted.
In one embodiment, the step of updating the related data of the uncorrected sub-text is performed continuously as the user corrects the annotated text, and the update frequency can be set according to actual conditions.
In a specific embodiment, take the case where the related data of the uncorrected sub-text in the database is updated each time the user finishes correcting the sub-text corresponding to 10 minutes of speech. Denote the initial language model LM0 and the initial acoustic model AM0, and the adapted new language model LM1 and the adapted new acoustic model AM1. When the user has corrected the text corresponding to the first 10 minutes of speech, LM0 and/or AM0 in the speech recognizer generate LM1 and/or AM1. When the user has corrected the text corresponding to another 10 minutes of speech (so that text corresponding to 20 minutes of speech has been corrected in total), the LM and AM are regenerated again. In this update, LM0 and AM0 are adapted using the information of the text corresponding to the full 20 minutes of speech (not only the latest 10 minutes); that is, the update replaces AM1 and LM1.
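The cumulative update cadence described above (always re-adapting from the original LM0/AM0 with all corrections accumulated so far, rather than chaining adaptations) could be organised as in this sketch; `adapt_fn` and the class name are hypothetical:

```python
class AdaptationScheduler:
    """Accumulate corrected sub-texts and, at each update, adapt from the
    ORIGINAL baseline models using all corrections seen so far (not just the
    newest batch), mirroring the 10-minute update example in the text.
    `adapt_fn(baseline, corrections)` is a hypothetical adaptation routine."""

    def __init__(self, baseline, adapt_fn):
        self.baseline = baseline
        self.adapt_fn = adapt_fn
        self.corrections = []

    def update(self, new_corrections):
        self.corrections.extend(new_corrections)
        # Re-adapt from the baseline with the full accumulated set.
        return self.adapt_fn(self.baseline, list(self.corrections))
```

Restarting from the baseline each time avoids compounding adaptation errors across successive batches, which is the point of using all 20 minutes rather than only the latest 10.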
In one embodiment, the information related to the uncorrected sub-text comprises: the uncorrected sub-text and the word graph of the speech corresponding to it. When the current correction is homophonic, the scores of all candidates in the word graph are likely to change, so after the word graph is recalculated to obtain the first recognition result, the best candidate may change; the related information of the unmodified sub-text is then updated according to the first recognition result. When the current correction is of the different-pronunciation type, the speech needs to be re-recognized, yielding a new word graph and a new recognition result.
In one embodiment, the uncorrected sub-text is determined from the interface on which the user is currently correcting the annotated text. In one embodiment, when the user corrects the annotated text, the interface displays the speech and the annotated text simultaneously; taking the position the user is correcting as a boundary, the text before it is recorded as the corrected sub-text and the text after it as the uncorrected sub-text. For example, the position of the current interface cursor may be obtained, the text before the cursor recorded as the corrected sub-text, and the text after the cursor as the uncorrected sub-text. It is to be understood that, in other embodiments, the corrected and uncorrected sub-texts in the text to be detected can be distinguished in any other realizable manner.
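The cursor-based boundary can be sketched as (illustrative only):

```python
def split_at_cursor(annotated_text, cursor_pos):
    """Treat text before the cursor as the corrected sub-text (user-verified)
    and the rest as the uncorrected sub-text, per the boundary rule above."""
    return annotated_text[:cursor_pos], annotated_text[cursor_pos:]

corrected, uncorrected = split_at_cursor("he got the Olympic first gold", 10)
# corrected   -> "he got the"
# uncorrected -> " Olympic first gold"
```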
In the above method, the corrected sub-text (i.e. the sub-text that the user has determined) is considered to be completely correct, and the uncorrected sub-text is adjusted only by the new model, which can help reduce the error of the uncorrected sub-text, thereby reducing the time required by the user to correct the uncorrected sub-text.
In an embodiment, after the acoustic model is updated, the speech corresponding to the uncorrected sub-text is re-recognized to obtain the second recognition result; if the language model is also updated at this time, the language model rescores the word graph of the speech corresponding to the uncorrected sub-text to obtain a first recognition result. That is, the first recognition result and the second recognition result may occur at the same place in the uncorrected sub-text.
In this case, in one embodiment, when the first recognition result and the second recognition result are consistent, either one of them is used as a third recognition result;
and the related information of the uncorrected sub-text is updated according to the third recognition result.
In this embodiment, if the first recognition result determined by rescoring the word graph and the second recognition result determined by re-recognizing the speech are consistent, either of them is used as the third recognition result, and the related information of the uncorrected sub-text is then updated based on the third recognition result.
In another embodiment, when the first recognition result and the second recognition result are inconsistent:
the recognition result with the higher score of the first recognition result and the second recognition result is taken as the third recognition result; or,
both the first recognition result and the second recognition result are kept as the third recognition result;
and the related information of the uncorrected sub-text is updated according to the third recognition result.
In this embodiment, when the first recognition result determined by rescoring the word graph and the second recognition result determined by re-recognizing the speech are inconsistent, the third recognition result may be determined according to the scores of the two results. In a specific embodiment, when the difference between the scores of the two symbols at the same time in the first and second recognition results is greater than or equal to a preset value, the recognition result with the higher score is used as the third recognition result; if the difference between the scores is less than the preset value and both results are inconsistent with the initial recognition result, both results are kept as the third recognition result. The related information of the uncorrected sub-text is then updated according to the third recognition result. The preset value can be set according to actual conditions.
When both the first recognition result and the second recognition result are kept as the third recognition result, the third recognition result is displayed to the user, who determines which recognition result is kept as the final one. The user thus only needs to make a selection, which saves time and improves efficiency compared with manually correcting a wrong annotated text.
In one embodiment, only the differences between the third recognition result and the initial recognition result may also be displayed to the user.
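A sketch of this merging rule follows. The tuple format (time, symbol, score) matches the examples later in the text; the fallback for the case where scores are close but only one result differs from the initial one is unspecified in the text, so keeping the initial result there is an assumption:

```python
def merge_results(first, second, initial, threshold=0.05):
    """Combine the lattice-rescored result (`first`) and the re-recognized
    result (`second`), both lists of (time, symbol, score) tuples aligned
    with the initial result, into a third result per the rules above."""
    merged = []
    for f, s, i in zip(first, second, initial):
        if f[1] == s[1]:                      # consistent: either one works
            merged.append(f)
        elif abs(f[2] - s[2]) >= threshold:   # clear winner by score
            merged.append(f if f[2] > s[2] else s)
        elif f[1] != i[1] and s[1] != i[1]:   # close scores, both disagree
            merged.append((f, s))             # keep both; the user decides
        else:                                 # unspecified case (assumption):
            merged.append(i)                  # fall back to the initial result
    return merged

first   = [(1.0, "Olympic", 0.8), (1.5, "first gold", 0.8)]
second  = [(1.0, "Olympic", 0.8), (1.5, "towel", 0.7)]
initial = [(1.0, "Olympic", 0.8), (1.5, "towel", 0.7)]
third = merge_results(first, second, initial)
# Position 0 agrees; at position 1 the 0.1 gap >= 0.05, so "first gold" wins.
```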
In one embodiment, the frequency of updating the related information of the uncorrected sub-text is a preset frequency. This places a limit on how often a new recognition result is regenerated when manually corrected text appears; the preset frequency can be set according to actual conditions. In one embodiment, it may be set to update every 10 minutes; other frequencies may be used in other embodiments.
In one embodiment, as shown in fig. 4, a schematic flow chart of the steps of the method in this embodiment is shown. The database stores the speech, the speech recognition annotated text 0, and the corresponding word graph (lattice0); the initial speech recognition result is the result recognized by the initial speech recognizer (comprising the initial language model LM0 and the initial acoustic model AM0). Suppose the frequency of updating the uncorrected sub-text is set to one update every 10 minutes. The recognition result of the speech is corrected manually, and the system checks every 10 minutes whether a new corrected sub-text has appeared; if so, it acquires the corrected sub-text and judges whether the correction is a homophonic correction.
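The homophone check can be sketched as follows. The pronunciation table here is a toy stand-in keyed by the translated surface forms used in the examples below; a real system would consult a full pronunciation lexicon (e.g. a pinyin library):

```python
# Toy pronunciation table; "towel"/"first gold" render Chinese homophones
# (both pronounced "shou jin"), while "Guangxia"/"Guangsha" differ.
PRONUNCIATION = {
    "towel": "shou jin",
    "first gold": "shou jin",
    "Guangxia": "guang xia",
    "Guangsha": "guang sha",
}

def correction_type(original, corrected):
    """Homophonic corrections only need LM adaptation plus lattice
    rescoring; different-pronunciation corrections also adapt the AM
    and re-recognize the remaining speech."""
    if (original in PRONUNCIATION and corrected in PRONUNCIATION
            and PRONUNCIATION[original] == PRONUNCIATION[corrected]):
        return "homophonic"
    return "different-pronunciation"
```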
If it is a homophonic correction, the initial language model is adapted with the corrected sub-text, and a new language model (LM1) that better matches the current transcription task (transcribing the speech into annotated text) is generated by the language model generator. The word graph is rescored with the new language model LM1 to obtain a new recognition result (recognition result 1): for each symbol, the candidate with the highest recalculated score. Since only the new word graph (lattice1) needs to be recalculated with LM1 and the acoustic probabilities need not be recomputed, this step is fast.
For example, in the corrected sub-text, the initial recognition result is "he obtained an Olympic towel", represented as (time, symbol, score) tuples:
(0.0, 'ta', 0.9)
(0.2, 'get', 0.8)
(0.8, 'has', 0.7)
(1.0, 'Olympic', 0.8)
(1.5, 'towel', 0.7)
During proofreading, the text 'towel' is selected and corrected to 'first gold'.
Comparing 'towel' with 'first gold', the pronunciations are found to be identical, so the subsequent word graph LAT0 is rescored with LM1 to obtain LAT1.
Suppose the uncorrected content contains another utterance whose actual content is "Xiaoming obtained the first Olympic gold", but recognition result 0 is "Xiaoming obtained the Olympic towel"; recognition result 1 here is then:
(0.0, 'Xiaoming', 0.9)
(0.2, 'get', 0.8)
(0.8, 'has', 0.7)
(1.0, 'Olympic', 0.8)
(1.5, 'first gold', 0.8)
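The rescoring step that produces recognition result 1 can be sketched as follows. This is a simplified, confusion-network-style view with one candidate list per time point (a real word graph carries full paths), and the LM scores, interpolation weight, and English word renderings are illustrative assumptions:

```python
def rescore(lattice, lm_score, lm_weight=0.5):
    """Recompute each candidate's total score by adding a new LM term to
    the unchanged acoustic score, then keep the best candidate per time
    point. Acoustic scores are reused, which is why rescoring is fast."""
    best = []
    for time, candidates in lattice:
        rescored = [(sym, ac + lm_weight * lm_score(sym)) for sym, ac in candidates]
        sym, score = max(rescored, key=lambda c: c[1])
        best.append((time, sym, round(score, 2)))
    return best

# LM1 was adapted after the user corrected "towel" -> "first gold".
lm1 = {"Olympic": 0.9, "towel": 0.2, "first gold": 0.8}
lat0 = [(1.0, [("Olympic", 0.8)]),
        (1.5, [("towel", 0.7), ("first gold", 0.6)])]
result1 = rescore(lat0, lambda s: lm1.get(s, 0.0))
# "first gold" now outscores "towel" (0.6 + 0.4 vs 0.7 + 0.1)
```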
After recognition result 1 is obtained, recognition result 3 is determined from the scores of the symbols in recognition result 1 and recognition result 0, and the database is updated with recognition result 3 as the new final recognition result; if the score of a symbol in recognition result 1 is higher than the score of the symbol at the same position in recognition result 0, recognition result 1 is taken as recognition result 3. This reduces the workload of the user's subsequent proofreading.
In another embodiment, the places where recognition result 1 and recognition result 0 differ can also be displayed to the user, who then only needs to click to determine the final recognition result while checking the text. Compared with having a single, wrong recognition result that must be deleted manually and retyped, this saves time and improves efficiency.
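A sketch of this per-symbol score comparison, with the tuple format (time, symbol, score) used in the examples above; treating equal scores as keeping result 0 is an assumption:

```python
def pick_final(result1, result0):
    """Per position, keep the symbol from result 1 only when its score
    strictly exceeds the score at the same position in result 0."""
    return [r1 if r1[2] > r0[2] else r0 for r1, r0 in zip(result1, result0)]

result0 = [(1.0, "Olympic", 0.8), (1.5, "towel", 0.7)]
result1 = [(1.0, "Olympic", 0.8), (1.5, "first gold", 0.8)]
result3 = pick_final(result1, result0)
# Only the second symbol changes: "first gold" (0.8) beats "towel" (0.7)
```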
If the correction of the corrected sub-text is a different-pronunciation correction, the contents of the corrected sub-text before and after correction need to be acquired; the acoustic model adaptation algorithm and the language model generator adjust acoustic model 0 and language model 0 to obtain a new acoustic model (AM1) and a new language model (LM1), and the speech of the uncorrected content is then re-recognized with acoustic model 1 and language model 1 to obtain a new recognition result (recognition result 2) and a new word graph.
For example, the actual content of the speech is "Shenzhen Guangsha Group", while recognition result 0 is "Shenzhen Guangxia Group":
(0.0, 'Shenzhen', 0.9)
(0.5, 'Guangxia', 0.7)
(1.0, 'Group', 0.9)
In the manual proofreading process, the text 'Guangxia' is selected and corrected to 'Guangsha'.
Comparing 'Guangxia' and 'Guangsha', the pronunciations are found to differ, so an adapted acoustic model AM1 is generated through acoustic model adaptation, and the word graph and recognition result of the subsequent uncorrected annotated text are processed together with the language model LM1.
In the subsequent uncorrected content, the actual content is "Shenzhen Guangsha Group", but recognition result 0 is "Shenzhen Guangxia Group", so recognition result 2 here is:
(0.0, 'Shenzhen', 0.9)
(0.5, 'Guangsha', 0.8)
(1.0, 'Group', 0.9)
After recognition result 2 is obtained, recognition result 3 is determined from the scores of the symbols in recognition result 2 and recognition result 0, and the database is updated with recognition result 3 as the new final recognition result; if the score of a symbol in recognition result 2 is higher than the score of the symbol at the same position in recognition result 0, recognition result 2 is taken as recognition result 3. This reduces the workload of the user's subsequent proofreading.
In another embodiment, the places where recognition result 2 and recognition result 0 differ can also be displayed to the user, who then only needs to click to determine the final recognition result while checking the text. Compared with having a single, wrong recognition result that must be deleted manually and retyped, this saves time and improves efficiency.
Further, after the acoustic model is regenerated, the speech corresponding to the uncorrected sub-text is re-recognized to obtain recognition result 2; if the language model has also been updated at this time, the new language model rescores the word graph to obtain recognition result 1. In some embodiments, according to a preset rule, one of recognition result 1 and recognition result 2 is compared with recognition result 0 and taken as the final recognition result. In another embodiment, if recognition result 1 and recognition result 2 both score well, their scores are close, and both differ from recognition result 0, then both are kept as recognition result 3, and recognition result 3 is displayed to the user for selection.
In one embodiment, a correction may also affect only a local piece of text. For example, a certain utterance in the speech is recognized whose actual content is: "Xiaoming obtained the first gold of the Olympic Games",
the recognition result 1 is:
(1.0, 'Olympic', 0.8)
(1.5, 'first gold', 0.8)
The recognition result 2 is:
(1.0, 'Olympic', 0.8)
(1.5, 'towel', 0.7)
That is, the recognition result 1 and the recognition result 2 are inconsistent, in which case the recognition result 3 may be determined from the scores of the recognition result 1 and the recognition result 2.
In another embodiment, a certain utterance in the speech is recognized whose actual content is: "Shenzhen Guangsha Group"; recognition result 1 is:
(0.5, 'Guangxia', 0.7)
and recognition result 2 is:
(0.5, 'Guangsha', 0.8)
In the present embodiment, it is also necessary to determine the recognition result 3 from the recognition result 1 and the recognition result 2.
The above speech annotation text correction method assists the manual correction of speech-annotated text. Specifically, using speech recognition technology, when the recognition result is corrected manually, the machine learns the correction (including textual and acoustic information) of the corrected part in real time, updates the acoustic model and language model of the original recognition system, and rescores or re-recognizes the uncorrected part. Repeated similar corrections are thus avoided, the amount of subsequent manual correction is reduced, and working efficiency is improved.
It should be understood that, although the steps in the flowcharts of figs. 1 to 3 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 1 to 3 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, the present application further provides a speech annotation text correction device, including: a corrected sub-text acquisition module, a model updating module, and a related information updating module, wherein:
the corrected sub-text acquisition module is used for acquiring the corrected sub-text in the text to be detected; the text to be detected is an initial recognition result, and the recognition result comprises times, symbols, and scores.
The model updating module is used for determining an adaptive language model through a language model generator according to the corrected sub-text and the initial language model when the correction in the corrected sub-text is a homophonic correction.
And the related information updating module is used for updating the related information of the uncorrected sub-text in the text to be detected according to the adaptive language model.
In one embodiment, when the correction in the corrected sub-text is a different-pronunciation correction, the model updating module is further configured, before the related information of the uncorrected sub-text in the text to be detected is updated, to:
and determining an adaptive acoustic model through an acoustic model adaptive algorithm according to the corrected sub-text, the text corresponding to the corrected sub-text before correction and the initial acoustic model.
And the related information updating module is used for updating the related information of the uncorrected sub-text according to the adaptive acoustic model and the current language model.
For the specific limitation of the voice labeled text correction device, reference may be made to the above limitation on the voice labeled text correction method, and details are not described herein again. The modules in the above-mentioned voice markup text modification apparatus can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as speech, annotated text, and word graphs. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a speech annotation text correction method.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; particular computing devices may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory in which a computer program is stored and a processor which, when executing the computer program, carries out the steps of the above-mentioned method.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of voice markup text modification, the method comprising:
acquiring a corrected sub-text in a text to be detected; the text to be detected is a recognition result with the highest score in an initial word graph obtained by recognizing voice by an initial voice recognizer, and the corrected sub-text is a text which is manually corrected in the text to be detected;
determining an adaptive model for model adaptation in an initial speech recognizer based on a type of modification of the modified sub-text;
and determining an adaptive recognition result through the adaptive model, and updating the related information of the uncorrected sub-text in the text to be detected according to the adaptive recognition result.
2. The method of claim 1, wherein, when the correction type of the corrected sub-text is a homophonic correction, determining an adaptive model for model adaptation in the initial speech recognizer based on the correction type of the corrected sub-text comprises:
determining an adaptive language model by a language model generator from the corrected sub-text and an initial language model in an initial speech recognizer;
determining an adaptive recognition result through the adaptive model, and updating the unmodified sub-text in the text to be detected according to the adaptive recognition result comprises the following steps:
recalculating the initial word graph of the voice corresponding to the uncorrected sub-text through the adaptive language model, and determining a first recognition result; and updating the related information of the unmodified sub-text according to the first recognition result.
3. The method of claim 2, wherein recalculating, by the adaptive language model, the initial word graph of speech corresponding to the uncorrected sub-text, and determining the first recognition result comprises:
and recalculating the initial word graph of the voice corresponding to the uncorrected sub-text to obtain a first adaptive word graph, and taking the recognition result with the highest score in the first adaptive word graph as a first recognition result.
4. The method of any of claims 1 to 3, wherein, when the correction type of the corrected sub-text is a different-pronunciation correction, determining an adaptive model for model adaptation in the initial speech recognizer based on the correction type of the corrected sub-text comprises:
determining an adaptive acoustic model through an acoustic model adaptive algorithm according to the corrected sub-text, the text corresponding to the corrected sub-text before correction and the initial acoustic model;
determining an adaptive recognition result through the adaptive model, and updating the unmodified sub-text in the text to be detected according to the adaptive recognition result comprises the following steps:
re-recognizing the speech corresponding to the uncorrected sub-text through an adaptive speech recognizer containing the adaptive acoustic model, and determining a second recognition result;
and updating the related information of the unmodified sub-text according to the second identification result.
5. The method of claim 4, wherein re-recognizing the speech corresponding to the uncorrected sub-text through an adaptive speech recognizer containing the adaptive acoustic model, and determining the second recognition result comprises:
and re-identifying the voice corresponding to the unmodified sub-text through the adaptive voice identifier to obtain a second adaptive word graph, and taking the identification result with the highest score in the second adaptive word graph as a second identification result.
6. The method according to claim 4, characterized in that when a first recognition result and a second recognition result appear simultaneously and the first recognition result and the second recognition result are consistent, any one of the first recognition result and the second recognition result is taken as a third recognition result;
and updating the related information of the unmodified sub-text according to the third identification result.
7. The method according to claim 4, wherein when a first recognition result and a second recognition result occur simultaneously and the first recognition result and the second recognition result do not coincide,
taking the recognition result with the higher score of the first recognition result and the second recognition result as a third recognition result; or,
simultaneously reserving the first recognition result and the second recognition result as a third recognition result;
and updating the associated information of the unmodified sub-text according to the third identification result.
8. The method according to any one of claims 1 to 3, wherein the information related to the unmodified sub-text comprises: the word graph of the voice corresponding to the unmodified sub-text and the unmodified sub-text.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN201810792037.0A 2018-07-18 2018-07-18 Voice annotation text correction method, computer device and storage medium Active CN110808049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810792037.0A CN110808049B (en) 2018-07-18 2018-07-18 Voice annotation text correction method, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810792037.0A CN110808049B (en) 2018-07-18 2018-07-18 Voice annotation text correction method, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN110808049A CN110808049A (en) 2020-02-18
CN110808049B true CN110808049B (en) 2022-04-26

Family

ID=69486562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810792037.0A Active CN110808049B (en) 2018-07-18 2018-07-18 Voice annotation text correction method, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN110808049B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475129A (en) * 2019-01-24 2020-07-31 北京京东尚科信息技术有限公司 Method and equipment for displaying candidate homophones through voice recognition
CN115249017B (en) * 2021-06-23 2023-12-19 马上消费金融股份有限公司 Text labeling method, training method of intention recognition model and related equipment
CN114023327B (en) * 2022-01-05 2022-04-15 深圳市北科瑞声科技股份有限公司 Text correction method, device, equipment and medium based on speech recognition
CN115240659B (en) * 2022-09-21 2023-01-06 深圳市北科瑞声科技股份有限公司 Classification model training method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645270A (en) * 2008-12-12 2010-02-10 中国科学院声学研究所 Bidirectional speech recognition processing system and method
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
CN102122506A (en) * 2011-03-08 2011-07-13 天脉聚源(北京)传媒科技有限公司 Method for recognizing voice
CN103474069A (en) * 2013-09-12 2013-12-25 中国科学院计算技术研究所 Method and system for fusing recognition results of a plurality of speech recognition systems
JP2014077882A (en) * 2012-10-10 2014-05-01 Nippon Hoso Kyokai <Nhk> Speech recognition device, error correction model learning method and program
CN105404632A (en) * 2014-09-15 2016-03-16 深港产学研基地 Deep neural network based biomedical text serialization labeling system and method
EP3220388A1 (en) * 2016-03-15 2017-09-20 Panasonic Intellectual Property Management Co., Ltd. Method for correcting false recognition contained in recognition result of speech of user


Also Published As

Publication number Publication date
CN110808049A (en) 2020-02-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant