CN112750434B - Method and device for optimizing voice recognition system and electronic equipment - Google Patents

Method and device for optimizing voice recognition system and electronic equipment

Info

Publication number
CN112750434B
CN112750434B
Authority
CN
China
Prior art keywords
text
sub-texts
semantic recognition
determining
Prior art date
Legal status
Active
Application number
CN202011485189.XA
Other languages
Chinese (zh)
Other versions
CN112750434A (en)
Inventor
关力
罗欢
王洪斌
蒋宁
吴海英
权圣
Current Assignee
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd filed Critical Mashang Xiaofei Finance Co Ltd
Priority to CN202011485189.XA priority Critical patent/CN112750434B/en
Publication of CN112750434A publication Critical patent/CN112750434A/en
Application granted granted Critical
Publication of CN112750434B publication Critical patent/CN112750434B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and device for optimizing a voice recognition system, and electronic equipment. The voice recognition system comprises a voice conversion model and a semantic recognition model, and the optimization method comprises the following steps: determining the word error rate of a first text and the actual accuracy rate of the semantic recognition model for performing semantic recognition on the first text, wherein the first text is obtained by converting an audio file with the voice conversion model; determining the target accuracy rate of the semantic recognition model according to the word error rate; and optimizing the semantic recognition model or the voice conversion model based on the target accuracy rate and the actual accuracy rate. The scheme provided by the embodiments of the invention can at least solve the prior-art problem of the high cost of optimizing the models.

Description

Method and device for optimizing voice recognition system and electronic equipment
Technical Field
The invention relates to the field of natural language processing, and in particular to a method and device for optimizing a voice recognition system, and electronic equipment.
Background
In the prior art, the speech recognition process in an intelligent electronic device is generally as follows: the user's voice is converted into text by a speech conversion model, and the text is then subjected to semantic recognition by a semantic recognition model to determine the user's intention. However, the speech conversion result may contain conversion errors, in which case the subsequent semantic recognition may be inaccurate; in addition, the semantic recognition process itself may introduce errors. When the accuracy of speech recognition needs to be improved, the prior art cannot determine which model in the pipeline needs to be optimized, so the models of all stages can only be optimized and tested one by one blindly, which leads to a high cost for optimizing the whole system.
Disclosure of Invention
The optimization method and device of the voice recognition system and the electronic equipment provided by the embodiments of the invention can solve the prior-art problem of the high cost of optimizing the system.
In order to solve the technical problems, the specific implementation scheme of the invention is as follows:
in a first aspect, an embodiment of the present invention provides an optimization method for a speech recognition system, where the speech recognition system includes a speech conversion model and a semantic recognition model, and includes:
determining the word error rate of a first text and the actual accuracy rate of the semantic recognition model for performing semantic recognition on the first text, wherein the first text is obtained by converting an audio file by the voice conversion model;
determining the target accuracy rate of the semantic recognition model according to the word error rate;
optimizing the semantic recognition model or the speech conversion model based on the target accuracy and the actual accuracy.
In a second aspect, an embodiment of the present invention further provides an optimization apparatus for a speech recognition system, where the speech recognition system includes a speech conversion model and a semantic recognition model, and the apparatus includes:
the first determining module is used for determining the word error rate of a first text and determining the actual accuracy rate of the semantic recognition model for performing semantic recognition on the first text, wherein the first text is obtained by converting an audio file by the voice conversion model;
the second determining module is used for determining the target accuracy of the semantic recognition model according to the word error rate;
and the optimization module is used for optimizing the semantic recognition model or the voice conversion model based on the target accuracy and the actual accuracy.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the optimization method for the speech recognition system.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the optimization method of the speech recognition system.
In the embodiment of the invention, the target accuracy and the current actual accuracy of the semantic recognition model are determined, so that by comparing the two it can be determined how much room remains for improving the accuracy of the semantic recognition model, and the model to be optimized is then selected from the semantic recognition model and the speech conversion model based on this remaining room for improvement. Therefore, when the speech recognition system needs to be optimized, the object to be optimized can be determined accurately, and the cost of system optimization can be effectively reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart of a method for optimizing a speech recognition system provided by an embodiment of the present invention;
FIG. 2 is a block diagram of an optimization apparatus of a speech recognition system according to an embodiment of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In an embodiment of the present invention, a speech recognition system includes a speech conversion model and a semantic recognition model, where the speech conversion model and the semantic recognition model are two modules connected in sequence in the system; the optimization method comprises the following steps:
step 101, determining a word error rate of a first text, and determining an actual accuracy rate of the semantic recognition model for performing semantic recognition on the first text, wherein the first text is obtained by converting an audio file by the voice conversion model.
Specifically, the audio file may be a voice instruction of a user, for example, an audio file generated in a dialog scenario between the user and an electronic device embedded with the speech recognition system. The audio file is input into the speech conversion model, which converts it into text form; it should be understood that the first text is the text obtained by the speech conversion model converting the audio file into text form.
The actual accuracy of the semantic recognition model for performing semantic recognition on the first text may refer to: the accuracy of the user intention recognized by the semantic recognition model from the first text relative to the real user intention that the audio file is intended to express.
In one embodiment, a test audio file is input into the speech conversion model to obtain a first text, the first text is then input into the semantic recognition model, and the semantic recognition result output by the model is compared with a manually annotated semantic label; the ratio of the number of test audio files whose semantic recognition result is the same as the semantic label to the total number of test audio files is the actual accuracy of the semantic recognition model for recognizing the first text.
The speech conversion model may be a common speech conversion model; for example, automatic speech recognition (ASR) technology may be used to convert the audio file into the first text. Furthermore, the semantic recognition model may be a common semantic recognition model; for example, natural language understanding (NLU) technology may be used to perform semantic recognition on the first text to determine the intention of the user.
Step 102: determining the target accuracy of the semantic recognition model according to the word error rate.
The target accuracy may be the highest accuracy that the semantic recognition model can achieve in recognizing the first text, by optimizing the semantic recognition model, when the current word error rate of the first text remains unchanged; that is, the target accuracy may serve as the optimization target of the semantic recognition model under the current word error rate of the first text.
Specifically, the audio file, the first text and the result of the semantic recognition model recognizing the first text may be compared and analyzed to determine which erroneous results can be avoided by training the semantic recognition model and which cannot. The sum of the correct results and the avoidable erroneous results identified by the semantic recognition model is then counted, and the ratio of this sum to the total number of results determines the target accuracy. It should be noted that the target accuracy can be determined by analyzing a large amount of data and averaging the analysis results.
Step 103: optimizing the semantic recognition model or the speech conversion model based on the target accuracy and the actual accuracy.
Specifically, the target accuracy may be compared with the actual accuracy, for example by calculating the difference between them, so as to determine how much room remains for improving the semantic recognition model under the current word error rate, that is, whether the semantic recognition model needs to be optimized. The semantic recognition model is optimized directly when it is determined that it needs to be optimized, and the speech conversion model is optimized when it is determined that the semantic recognition model does not need to be optimized.
In this embodiment, the target accuracy and the current actual accuracy of the semantic recognition model are determined and compared in order to establish how much room remains for improving the accuracy of the semantic recognition model, and the model to be optimized is then selected from the semantic recognition model and the speech conversion model based on this remaining room. Therefore, when the speech recognition system needs to be optimized, the object to be optimized can be determined accurately, and the cost of system optimization can be effectively reduced.
Optionally, the optimizing a semantic recognition model or the speech conversion model based on the target accuracy and the actual accuracy includes:
optimizing the semantic recognition model under the condition that the difference value between the target accuracy and the actual accuracy is larger than a first preset value;
and optimizing the voice conversion model under the condition that the difference value is less than or equal to the first preset value.
The first preset value may be selected from the range 0 to 1, for example any value from 8% to 15%.
The optimizing the semantic recognition model may be a process of constructing training samples based on the error data and training the semantic recognition model based on the constructed training samples. Accordingly, the optimization of the speech conversion model may be a process of constructing a training sample based on the error data and training the speech conversion model based on the constructed training sample.
In this embodiment, when the difference is greater than the first preset value, the accuracy of the semantic recognition model still has a large optimization space, so the semantic recognition model can be optimized preferentially; correspondingly, when the difference is less than or equal to the first preset value, the optimization space of the semantic recognition model is small, so the speech conversion model can be optimized preferentially. In this way, the object to be optimized can be determined accurately before the speech recognition system is optimized, which reduces the optimization cost of the speech recognition system and improves its accuracy.
Optionally, the determining the word error rate of the first text includes:
determining a second text corresponding to the audio file, wherein the second text is a collated text;
calculating an edit distance between the first text and the second text;
determining the word error rate based on the edit distance and a text length of the second text.
The second text may be a real text obtained by manually listening to the audio file and transcribing it into text form, or a text obtained by manually correcting the first text after the first text has been produced by the speech conversion model; that is, the second text is a proofread text, and in one embodiment the word error rate of the second text may be 0.
Calculating the edit distance between the first text and the second text amounts to calculating the number of wrong words in the first text relative to the second text. The text length of the second text may be the number of words in the second text. In this way, the word error rate of the first text can be calculated by dividing the number of wrong words by the text length of the second text.
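As an illustration of this step, the following Python sketch (not part of the patent text; the function names are chosen for illustration only) computes a character-level Levenshtein edit distance and derives the word error rate as the edit distance divided by the text length of the proofread second text.

```python
def edit_distance(first_text: str, second_text: str) -> int:
    """Character-level Levenshtein distance between the converted first text
    and the proofread second text."""
    m, n = len(first_text), len(second_text)
    prev = list(range(n + 1))  # distances for the empty prefix of first_text
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if first_text[i - 1] == second_text[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]


def word_error_rate(first_text: str, second_text: str) -> float:
    """Word error rate of the first text: edit distance divided by the
    text length of the proofread second text."""
    return edit_distance(first_text, second_text) / len(second_text)
```

For a first text split into several sub-texts, as in the worked example later in this description, an overall rate can be obtained by summing the per-sub-text edit distances and dividing by the summed text lengths of the second sub-texts.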
Optionally, the first text includes M first sub-texts and the second text includes N second sub-texts, and the determining the actual accuracy of the semantic recognition model for performing semantic recognition on the first text includes:
setting M corresponding first labels for the M first sub-texts and N corresponding second labels for the N second sub-texts by using the semantic recognition model;
determining the number of first target sub-texts, wherein the first target sub-texts are sub-texts, among the M first sub-texts, whose label is the same as that of the corresponding second sub-text;
and determining the actual accuracy according to the ratio of the number of the first target sub-texts to the N.
The first sub-text may be a short sentence in the first text, for example, a portion between two adjacent punctuations in the first text may be taken as a first sub-text. Accordingly, the second sub-text may be a short sentence in the second text, for example, a portion between two adjacent punctuations in the second text may be taken as one second sub-text.
It should be understood that, when the semantic recognition model recognizes the first text, the first text may be divided into M first sub-texts based on a set rule, where the set rule may be, for example, division according to punctuation marks. After the semantic recognition model divides the first text into the M first sub-texts, each first sub-text can be recognized separately and marked with a corresponding first label.
When N one-to-one corresponding second tags are set for N second sub-texts in the second text, the second text may be first divided into N second sub-texts in an artificial manner, where a division rule for the second text may be the same as a division rule for the semantic recognition model when the first text is divided, for example, the second text may be divided according to punctuation marks. After the second text is divided to obtain the N second sub-texts, the semantics of each second sub-text can be manually identified, and then a corresponding second label is set for each second sub-text, wherein the setting rule of the second label is the same as the rule of the semantic identification model for setting the label for the first text.
In addition, the second text can also be input into the semantic recognition model, so that the processes of dividing the second text and setting the labels are completed through the semantic recognition model. It should be understood that the process of dividing and labeling the second text may also be completed by inputting the second text into other models with semantic recognition function.
Specifically, the first label may refer to a semantic label of the corresponding first sub-text, and correspondingly the second label may refer to a semantic label of the corresponding second sub-text. For example, when the first sub-text is "no", the first label may correspondingly be "negative"; when the second sub-text is "yes", the second label may correspondingly be "affirmative".
Since the first text and the second text are texts obtained by converting the same audio file, by the speech recognition system and manually respectively, when the speech recognition system divides the first text correctly, M is equal to N and the M first sub-texts in the first text correspond one-to-one to the N second sub-texts in the second text. Moreover, when the word error rate of the first text is 0, each first sub-text and its corresponding second sub-text should be the same text, and correspondingly the M first labels and the N corresponding second labels should be the same labels.
Since the second text is the real text obtained by manual transcription, the actual accuracy of the first text relative to the second text can be determined by comparing whether the first label and the second label at corresponding positions are the same. For example, suppose the first text includes 4 first sub-texts and the second text includes 4 second sub-texts; if the first labels of the first two first sub-texts are the same as the second labels of the corresponding first two second sub-texts, while the first labels of the last two first sub-texts differ from the second labels of the corresponding last two second sub-texts, then the actual accuracy of the first text is 50%.
When it is necessary to determine whether a certain first candidate sub-text is a first target sub-text, the third target sub-text corresponding to the first candidate sub-text may be determined among the N second sub-texts based on the one-to-one correspondence between the first sub-texts and the second sub-texts, where the first candidate sub-text is any one of the M first sub-texts and the third target sub-text is the corresponding one of the N second sub-texts. When the first candidate sub-text and the third target sub-text have the same label, the first candidate sub-text is determined to be a first target sub-text.
It should be noted that, when M is not equal to N, the M first sub-texts may be preprocessed so that, after preprocessing, the numbers of first sub-texts and second sub-texts are both N. For example, when M < N, an empty text may be inserted at the position corresponding to the unmatched second sub-text; and when M > N, that is, when at least two first sub-texts correspond to the same second sub-text, those first sub-texts are merged. The actual accuracy of the semantic recognition model for recognizing the first text is then determined according to the method described above.
In this embodiment, the sub-texts among the M first sub-texts that have the same label as the corresponding second sub-text are determined as first target sub-texts, and the number of first target sub-texts is counted, which gives the number of corresponding positions at which the first sub-texts and the second sub-texts carry the same label. It should be noted that, in this embodiment, the first target sub-texts are the first sub-texts whose labels are set correctly, so the accuracy of the semantic recognition model for recognizing the first text is obtained by counting the number of correctly labelled first sub-texts in the first text and dividing it by the total number N of second sub-texts in the real text.
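A minimal sketch of this accuracy computation is given below, assuming the first and second sub-texts have already been aligned one-to-one (padding with empty texts when M < N and merging first sub-texts when M > N, as described above); the label strings used in the example are illustrative only.

```python
def actual_accuracy(first_labels, second_labels):
    """Ratio of aligned first sub-texts whose label equals the label of the
    corresponding second sub-text to the total number N of second sub-texts."""
    n = len(second_labels)
    matches = sum(1 for f, s in zip(first_labels, second_labels) if f == s)
    return matches / n


# Example analogous to the one in the description: four sub-texts, the first
# two labelled consistently with the annotation and the last two not,
# giving an actual accuracy of 50%.
print(actual_accuracy(["affirmative", "chatting", "negative", "chatting"],
                      ["affirmative", "chatting", "affirmative", "negative"]))  # 0.5
```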
Optionally, the determining the target accuracy of the semantic recognition model includes:
determining the number of second target sub-texts, wherein the second target sub-texts are first sub-texts, among the M first sub-texts, whose edit distance to the corresponding second sub-text is greater than a second preset value and whose label differs from that of the corresponding second sub-text;
and determining the target accuracy according to the number of the second target sub-texts and the N.
Specifically, when the semantic recognition model labels the first sub-texts, there may be three situations. Case one: the correct first label is applied; in this case, the first sub-text usually contains no wrong words.
Case two: no first label can be applied. In this case, the first sub-text usually contains a wrong word, and because of the wrong word the first sub-text has no definite semantics. For example, suppose the real text corresponding to the first sub-text is "I am not in today" and the first sub-text obtained by the speech conversion model is "I am not too today"; the semantic recognition model cannot determine the first label corresponding to "I am not too today", so the first sub-text cannot be labelled. For such a situation, in which the first sub-text cannot be labelled because its semantics cannot be determined, a training sample can be created based on the first sub-text and the correct first label, and the created training sample can be input into the semantic recognition model for training, so that the next time the semantic recognition model receives the same first sub-text (i.e. "I am not too today") it can label it correctly. For example, based on the real text "I am not in today" it may be determined that the corresponding first label is "chatting"; a training sample may therefore be created from "I am not too today" and "chatting" and input into the semantic recognition model for training, so that the semantic recognition model sets the label "chatting" for "I am not too today".
Case three: a wrong first label is applied, because a wrong word changes the semantics of the first sub-text. For example, suppose the real text corresponding to the first sub-text is "yes" and the first sub-text obtained by the speech conversion model is "no"; the label corresponding to the real text is "affirmative", while the first label set for the first sub-text is "negative". For this kind of error, if a training sample were created from "no" and "affirmative" as in case two and used to train the semantic recognition model, the model would learn that "no" means "affirmative" and would then mislabel other first sub-texts that genuinely express negation. Thus, the error of case three cannot be overcome by optimizing the semantic recognition model.
From the above discussion it can be seen that, with the word error rate unchanged, the semantic recognition model can, after optimization, correctly label the first sub-texts of case one and case two, whereas the first sub-texts of case three cannot be given the correct label by optimizing the semantic recognition model. Thus, the target accuracy is the sum of the numbers of first sub-texts of case one and case two among the M first sub-texts, divided by N.
Specifically, the number X of first sub-texts of case three among the M first sub-texts may be determined; the sum of the numbers of first sub-texts of case one and case two is then M - X, and the target accuracy is (M - X)/N.
The first sub-texts of case three are represented by the second target sub-texts. When it is necessary to determine whether a first sub-text is a second target sub-text, the second sub-text corresponding to that first sub-text may be determined among the N second sub-texts based on the one-to-one correspondence between the first sub-texts and the second sub-texts. When the edit distance between the first sub-text and the corresponding second sub-text is greater than 0 (more generally, greater than the second preset value), the first sub-text contains wrong words. In that case it may be further determined whether the first label of the first sub-text is the same as the second label of the corresponding second sub-text; if they are different, the first sub-text is determined to be a second target sub-text. According to this method, the number X of second target sub-texts among the M first sub-texts can be determined, and the target accuracy is then obtained from the formula (M - X)/N.
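The target accuracy (M - X)/N can be sketched in the same style; the inputs are assumed to be aligned pairs, with edit_distances[i] being the edit distance between the i-th first sub-text and its corresponding second sub-text (a hypothetical data layout chosen for illustration, not prescribed by the patent).

```python
def target_accuracy(edit_distances, first_labels, second_labels,
                    second_preset=0):
    """Target accuracy (M - X) / N for aligned sub-text pairs, where X counts
    the 'case three' pairs: wrong words (edit distance above the second preset
    value) combined with a label that differs from the second label."""
    m = n = len(edit_distances)  # pairs are assumed to be aligned, so M == N
    x = sum(1 for d, f, s in zip(edit_distances, first_labels, second_labels)
            if d > second_preset and f != s)
    return (m - x) / n
```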
Optionally, the optimizing the speech conversion model when the difference is smaller than or equal to the first preset value includes:
and optimizing the voice conversion model under the condition that the difference value is less than or equal to the first preset value and the word error rate is greater than a third preset value.
The third preset value may be selected from the range 0 to 1, for example any value from 8% to 15%.
Specifically, when the difference is less than or equal to the first preset value, the optimization space of the semantic recognition model is small. In this case the current word error rate of the speech conversion model may be further examined: if the word error rate is high, the speech conversion model still has a large optimization space; conversely, if the word error rate is low, the optimization space of the speech conversion model is also small, and since neither the speech conversion model nor the semantic recognition model has much room for improvement, neither needs to be optimized.
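Putting the two thresholds together, the selection logic described above can be sketched as follows; the default threshold values of 10% are only examples within the 8%-15% range mentioned earlier.

```python
def choose_model_to_optimize(target_acc, actual_acc, word_error_rate,
                             first_preset=0.10, third_preset=0.10):
    """Return which model, if any, should be optimized first."""
    if target_acc - actual_acc > first_preset:
        # Large gap: the semantic recognition model still has room to improve.
        return "semantic recognition model"
    if word_error_rate > third_preset:
        # Little semantic headroom, but the speech conversion output is still error-prone.
        return "speech conversion model"
    # Both models are already close to their respective upper limits.
    return "none"


# With the figures of the worked example below (target 0.8, actual 0.2,
# word error rate 0.375), the semantic recognition model is optimized first.
print(choose_model_to_optimize(0.8, 0.2, 0.375))
```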
The speech conversion model may be optimized as follows: when a target text obtained by converting a target audio with the speech conversion model contains wrong words, the target audio and a target real text are acquired, where the target real text is the real text corresponding to the target audio and may be obtained by manually listening to the target audio. A training sample is then created based on the target audio and the target real text and input into the speech conversion model for training, so that the speech conversion model establishes an association between the target audio and the target real text; in this way, the next time the speech conversion model receives the target audio, it can convert it into the target real text.
In the embodiment of the invention, the influence of the word error rate of the speech conversion model on the accuracy of the semantic recognition model is further analyzed. Specifically, suppose the first text comprises $\sum_{n,t} C_{n,t}$ first sub-texts, where the edit distance between a first sub-text and its corresponding second sub-text ranges from 0 up to a maximum value, the text length of each second sub-text ranges from 1 up to a maximum value, and $C_{n,t}$ denotes the number of first sub-texts whose edit distance is $n$ and whose corresponding second sub-text has text length $t$. By analyzing the recognition results of the semantic recognition model, the actual accuracy $P_{n,t}$ on the first sub-texts with edit distance $n$ and corresponding second-sub-text length $t$ is obtained. The actual accuracy of the first text may then be represented by the following formula:

$$P = \frac{\sum_{n,t} C_{n,t}\,P_{n,t}}{\sum_{n,t} C_{n,t}}$$

Assume the current word error rate of the speech conversion model is $P_a$. If the word error rate of the speech conversion model is reduced by a ratio $P_d$, the word error rate after the reduction is $P_a(1-P_d)$. Because the word error rate drops by the ratio $P_d$, the number of first sub-texts containing wrong words, i.e. the sub-texts with $n \ge 1$, is reduced in the same proportion $P_d$; and since the reduction of the word error rate affects neither the total number of first sub-texts nor the text lengths of the second sub-texts, the reduced portion is added to the cells with $n = 0$ accordingly. From the above discussion, the number of first sub-texts with $n \ge 1$ before the word error rate decreases is $\sum_{n \ge 1,\,t} C_{n,t}$; therefore, after the word error rate decreases, the number of first sub-texts with $n = 0$ increases by $P_d \sum_{n \ge 1,\,t} C_{n,t}$, and the remaining number of first sub-texts with $n \ge 1$ is $(1-P_d) \sum_{n \ge 1,\,t} C_{n,t}$. Accordingly, the actual accuracy after the reduction of the word error rate is:

$$P' = \frac{\sum_{t} C_{0,t}\,P_{0,t} \;+\; P_d \sum_{n \ge 1,\,t} C_{n,t}\,P_{0,t} \;+\; (1-P_d) \sum_{n \ge 1,\,t} C_{n,t}\,P_{n,t}}{\sum_{n,t} C_{n,t}}$$
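The effect of a word error rate reduction can be checked numerically with the short sketch below; the dictionaries C and P follow the definitions above, and the assumption (taken from the worked example that follows) is that sub-texts moved into an edit-distance-0 cell are recognized with the corresponding accuracy P(0, t), defaulting to 1.0 when that cell was previously empty.

```python
def accuracy_after_wer_reduction(C, P, p_d):
    """Predicted actual accuracy after the word error rate drops by a ratio p_d.

    C[(n, t)] is the number of first sub-texts with edit distance n whose
    corresponding second sub-text has text length t, and P[(n, t)] is the
    actual accuracy of the semantic recognition model on that cell."""
    total = sum(C.values())
    correct = 0.0
    for (n, t), count in C.items():
        if n == 0:
            correct += count * P[(n, t)]
        else:
            moved = p_d * count         # sub-texts that become error-free
            stayed = (1 - p_d) * count  # sub-texts that keep wrong words
            correct += moved * P.get((0, t), 1.0) + stayed * P[(n, t)]
    return correct / total


# Worked example from the description: C(0,2)=1, C(1,2)=2, C(2,5)=2, only the
# error-free sub-text is labelled correctly, and the word error rate is halved.
C = {(0, 2): 1, (1, 2): 2, (2, 5): 2}
P = {(0, 2): 1.0, (1, 2): 0.0, (2, 5): 0.0}
print(accuracy_after_wer_reduction(C, P, 0.5))  # accuracy rises from 0.2 to 0.6
```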
the following further explains the optimization method of the speech recognition system in an embodiment, and refers to the following table, which is a comparative representation of the results obtained after the audio file is processed by the speech conversion model (hereinafter represented by ASR) and the semantic recognition model (hereinafter represented by NLU) with the real text:
[Table: for each of the five sub-texts, the real text, the ASR output, the NLU label and the manually annotated label; only one of the five NLU labels matches its annotation.]
please refer to the following table for the results obtained after performing the edit distance calculation:
[Table: the edit distances between the five first sub-texts and their corresponding second sub-texts are 0, 1, 2, 2 and 1, and the text lengths of the second sub-texts are 2, 2, 5, 5 and 2.]
1) The word error rate of the speech conversion model is: (0+1+2+2+1)/(2+2+5+5+2) = 0.375;
2) The actual accuracy of the semantic recognition model is: 1/5 = 0.2;
3) The target accuracy of the semantic recognition model is: 1 - 1/5 = 0.8.
the word error rate distribution table is constructed as follows:
[Table: word error rate distribution, C(0,2) = 1, C(1,2) = 2, C(2,5) = 2.]
when the error rate of the voice conversion model is reduced by 50%, the error rate distribution table is changed to:
[Table: distribution after the word error rate is halved, three first sub-texts with edit distance 0, C(1,2) = 1, C(2,5) = 1.]
the accuracy after change was: (2 × 100% +1 × 0%)/5 ═ 0.6.
Therefore, the difference between the target accuracy and the actual accuracy of the semantic recognition model is 60 percentage points, that is, the semantic recognition model still has a large optimization space.
According to the evaluation of optimizing the speech conversion model, halving its word error rate improves the accuracy of the semantic recognition model by 40 percentage points. Generally speaking, the training cost and the labeling-data cost required by the speech conversion model are significantly larger than those of the semantic recognition model. Thus the cost of halving the word error rate of the speech conversion model to obtain this 40-point accuracy improvement is higher than the cost of optimizing the semantic recognition model itself from 20% to 60%.
In summary, in this example the semantic recognition model should be optimized first, for example by adding training corpora. When the accuracy of the semantic recognition model approaches its upper limit, optimization of the speech conversion model can then be considered. Therefore, the optimization method of the speech recognition system provided by the embodiment of the application can effectively guide the optimization process of the speech recognition system.
Referring to fig. 2, fig. 2 is an optimization apparatus 200 of a speech recognition system according to an embodiment of the present application, where the speech recognition system includes a speech conversion model and a semantic recognition model, and the apparatus includes:
the first determining module 201 is configured to determine a word error rate of a first text and an actual accuracy rate of performing semantic recognition on the first text by the semantic recognition model, where the first text is a text obtained by converting an audio file by the speech conversion model;
the second determining module 202 is configured to determine a target accuracy of the semantic recognition model according to the word error rate;
an optimizing module 203, configured to optimize the semantic recognition model or the speech conversion model based on the target accuracy and the actual accuracy.
Optionally, the optimizing module 203 is specifically configured to optimize the semantic recognition model when a difference between the target accuracy and the actual accuracy is greater than a first preset value;
the optimizing module 203 is further configured to optimize the speech conversion model when the difference is smaller than or equal to the first preset value.
Optionally, the first determining module 201 includes:
the first determining submodule is used for determining a second text corresponding to the audio file, wherein the second text is a proofread text;
the calculation submodule is used for calculating the editing distance between the first text and the second text;
and the second determining sub-module is used for determining the word error rate based on the edit distance and the text length of the second text.
Optionally, the first text includes M first sub-texts, the second text includes N second sub-texts, and the first determining module 201 further includes:
the marking submodule is used for setting M first labels for the M first sub-texts and setting N second labels for the N second sub-texts based on the semantic recognition model, wherein the M first sub-texts correspond to the M first labels one by one, and the N second sub-texts correspond to the N second labels one by one;
a third determining submodule, configured to determine the number of first target sub-texts, where the first target sub-texts are sub-texts, of the M first sub-texts, that have the same label as the corresponding second sub-text;
a fourth determination submodule configured to determine the actual accuracy based on a ratio of the number of the first target sub-texts to the N.
Optionally, the second determining module includes:
a fifth determining sub-module, configured to determine the number of second target sub-texts, where the second target sub-texts are first sub-texts, among the M first sub-texts, whose edit distance to the corresponding second sub-text is greater than a second preset value and whose label differs from that of the corresponding second sub-text;
a sixth determining sub-module to determine the target accuracy based on the number of the second target sub-texts and the N.
Optionally, the optimizing module is specifically configured to optimize the speech conversion model when the difference is smaller than or equal to the first preset value and the word error rate is greater than a third preset value.
The optimization device 200 of the speech recognition system according to the embodiment of the present invention can implement each process in the above method embodiments, and is not described here again to avoid repetition.
Referring to fig. 3, fig. 3 is a structural diagram of an electronic device according to another embodiment of the present invention. As shown in fig. 3, the electronic device 300 includes: a processor 301, a memory 302 and a computer program stored on the memory 302 and executable on the processor 301, the various components in the electronic device 300 being coupled together by a bus interface 303. The computer program, when executed by the processor 301, implements the following steps:
determining the word error rate of a first text and the actual accuracy rate of the semantic recognition model for performing semantic recognition on the first text, wherein the first text is obtained by converting an audio file by the voice conversion model;
determining the target accuracy rate of the semantic recognition model according to the word error rate;
and optimizing the semantic recognition model or the speech conversion model based on the difference between the target accuracy and the actual accuracy.
Optionally, the optimizing a semantic recognition model or the speech conversion model based on the target accuracy and the actual accuracy includes:
optimizing the semantic recognition model under the condition that the difference value between the target accuracy and the actual accuracy is larger than a first preset value;
and optimizing the voice conversion model under the condition that the difference value is less than or equal to the first preset value.
Optionally, the determining the word error rate of the first text includes:
determining a second text corresponding to the audio file, wherein the second text is a collated text;
calculating an edit distance between the first text and the second text;
determining the word error rate based on the edit distance and a text length of the second text.
Optionally, the determining the actual accuracy rate of the semantic recognition model for recognizing the first text includes:
respectively setting N first labels for the N first sub-texts and N second labels for the N second sub-texts based on the semantic recognition model, wherein the N first sub-texts correspond to the N first labels one by one, and the N second sub-texts correspond to the N second labels one by one;
determining the number of first target sub-texts, wherein the first target sub-texts are sub-texts which have the same labels as the corresponding second sub-texts in the N first sub-texts;
determining the actual accuracy rate based on a ratio of the number of the first target sub-texts to the N.
Optionally, the determining the target accuracy of the semantic recognition model includes:
determining the number of second target sub-texts, wherein the second target sub-texts are first sub-texts, among the N first sub-texts, whose edit distance to the corresponding second sub-text is greater than a second preset value and whose label differs from that of the corresponding second sub-text;
determining the target accuracy rate based on the number of second target sub-texts and the N.
Optionally, the optimizing the speech conversion model when the difference is smaller than or equal to the first preset value includes:
and optimizing the voice conversion model under the condition that the difference value is less than or equal to the first preset value and the word error rate is greater than a third preset value.
An embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, where the computer program, when executed by the processor, implements the processes of the foregoing method embodiments, and can achieve the same technical effects, and details are not repeated here to avoid repetition.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the processes of the method embodiments, and can achieve the same technical effects, and in order to avoid repetition, the details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling an electronic device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A method for optimizing a speech recognition system, the speech recognition system including a speech conversion model and a semantic recognition model, the method comprising:
determining the word error rate of a first text and the actual accuracy rate of the semantic recognition model for performing semantic recognition on the first text, wherein the first text is obtained by converting an audio file by the voice conversion model;
determining the target accuracy rate of the semantic recognition model according to the word error rate;
optimizing the semantic recognition model under the condition that the difference value between the target accuracy and the actual accuracy is larger than a first preset value;
and optimizing the voice conversion model under the condition that the difference value is less than or equal to the first preset value.
2. The method according to claim 1, wherein the optimizing the speech conversion model in the case that the difference is smaller than or equal to the first preset value comprises:
and optimizing the voice conversion model under the condition that the difference value is less than or equal to the first preset value and the word error rate is greater than a third preset value.
3. The method of claim 1, wherein determining the word error rate of the first text comprises:
determining a second text corresponding to the audio file, wherein the second text is a collated text;
calculating an edit distance between the first text and the second text;
determining the word error rate based on the edit distance and a text length of the second text.
4. The method of claim 1, further comprising determining a second text corresponding to the audio file, wherein the second text is a collated text;
wherein the first text comprises M first sub-texts and the second text comprises N second sub-texts, and the determining the actual accuracy rate of the semantic recognition model for performing semantic recognition on the first text comprises:
setting M corresponding first tags for the M first sub-texts and N corresponding second tags for the N second sub-texts by using the semantic recognition model;
determining the number of first target sub-texts, wherein the first target sub-texts are sub-texts, among the M first sub-texts, whose label is the same as that of the corresponding second sub-text;
and determining the actual accuracy according to the ratio of the number of the first target sub-texts to the N.
5. The method of claim 4, wherein determining the target accuracy of the semantic recognition model comprises:
determining the number of second target sub-texts, wherein the second target sub-texts are first sub-texts, among the M first sub-texts, whose edit distance to the corresponding second sub-text is greater than a second preset value and whose label differs from that of the corresponding second sub-text;
and determining the target accuracy according to the number of the second target sub-texts and the N.
6. An apparatus for optimizing a speech recognition system, the speech recognition system including a speech conversion model and a semantic recognition model, the apparatus comprising:
the first determining module is used for determining the word error rate of a first text and determining the actual accuracy rate of the semantic recognition model for performing semantic recognition on the first text, wherein the first text is obtained by converting an audio file by the voice conversion model;
the second determining module is used for determining the target accuracy of the semantic recognition model according to the word error rate;
the optimization module is specifically used for optimizing the semantic recognition model under the condition that the difference value between the target accuracy and the actual accuracy is greater than a first preset value;
the optimization module is further configured to optimize the speech conversion model when the difference is smaller than or equal to the first preset value.
7. An electronic device, comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method of optimization of a speech recognition system according to any one of claims 1 to 5.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method of optimization of a speech recognition system according to one of the claims 1 to 5.
CN202011485189.XA 2020-12-16 2020-12-16 Method and device for optimizing voice recognition system and electronic equipment Active CN112750434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011485189.XA CN112750434B (en) 2020-12-16 2020-12-16 Method and device for optimizing voice recognition system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011485189.XA CN112750434B (en) 2020-12-16 2020-12-16 Method and device for optimizing voice recognition system and electronic equipment

Publications (2)

Publication Number Publication Date
CN112750434A CN112750434A (en) 2021-05-04
CN112750434B true CN112750434B (en) 2021-10-15

Family

ID=75648522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011485189.XA Active CN112750434B (en) 2020-12-16 2020-12-16 Method and device for optimizing voice recognition system and electronic equipment

Country Status (1)

Country Link
CN (1) CN112750434B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8290989B2 (en) * 2008-11-12 2012-10-16 Sap Ag Data model optimization
CN106992001B (en) * 2017-03-29 2020-05-22 百度在线网络技术(北京)有限公司 Voice instruction processing method, device and system
CN110209791B (en) * 2019-06-12 2021-03-26 百融云创科技股份有限公司 Multi-round dialogue intelligent voice interaction system and device
CN110309267B (en) * 2019-07-08 2021-05-25 哈尔滨工业大学 Semantic retrieval method and system based on pre-training model
CN110473523A (en) * 2019-08-30 2019-11-19 北京大米科技有限公司 A kind of audio recognition method, device, storage medium and terminal
CN111754981A (en) * 2020-06-26 2020-10-09 清华大学 Command word recognition method and system using mutual prior constraint model
CN111883110B (en) * 2020-07-30 2024-02-06 上海携旅信息技术有限公司 Acoustic model training method, system, equipment and medium for speech recognition

Also Published As

Publication number Publication date
CN112750434A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN111309915B (en) Method, system, device and storage medium for training natural language of joint learning
CN110717039A (en) Text classification method and device, electronic equipment and computer-readable storage medium
CN109446885B (en) Text-based component identification method, system, device and storage medium
CN111177324B (en) Method and device for carrying out intention classification based on voice recognition result
EP3614378A1 (en) Method and apparatus for identifying key phrase in audio, device and medium
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
US8060365B2 (en) Dialog processing system, dialog processing method and computer program
CN108664471B (en) Character recognition error correction method, device, equipment and computer readable storage medium
CN112151014A (en) Method, device and equipment for evaluating voice recognition result and storage medium
US20240185840A1 (en) Method of training natural language processing model method of natural language processing, and electronic device
CN113673228A (en) Text error correction method, text error correction device, computer storage medium and computer program product
CN113657088A (en) Interface document analysis method and device, electronic equipment and storage medium
CN111079384B (en) Identification method and system for forbidden language of intelligent quality inspection service
CN112101003B (en) Sentence text segmentation method, device and equipment and computer readable storage medium
CN114492396A (en) Text error correction method for automobile proper nouns and readable storage medium
CN112750434B (en) Method and device for optimizing voice recognition system and electronic equipment
CN110929514B (en) Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
CN117292680A (en) Voice recognition method for power transmission operation detection based on small sample synthesis
CN111492364A (en) Data labeling method and device and storage medium
CN114970554B (en) Document checking method based on natural language processing
CN110705321A (en) Computer aided translation system
CN115527520A (en) Anomaly detection method, device, electronic equipment and computer readable storage medium
KR102562692B1 (en) System and method for providing sentence punctuation
CN112863493A (en) Voice data labeling method and device and electronic equipment
CN110858268A (en) Method and system for detecting unsmooth phenomenon in voice translation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared