CN113793604B - Speech recognition system optimization method and device - Google Patents


Info

Publication number
CN113793604B
CN113793604B (application CN202111076518.XA)
Authority
CN
China
Prior art keywords
result
labeling
voice recognition
screened
recognition results
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111076518.XA
Other languages
Chinese (zh)
Other versions
CN113793604A (en)
Inventor
薛少飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202111076518.XA priority Critical patent/CN113793604B/en
Publication of CN113793604A publication Critical patent/CN113793604A/en
Application granted granted Critical
Publication of CN113793604B publication Critical patent/CN113793604B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — omitted, see below: G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The invention discloses a speech recognition system optimization method and device. The method comprises the following steps: inputting the audio data to be screened into a target ASR system to be optimized and N available ASR systems, respectively, to perform speech recognition and obtain N+1 speech recognition results; measuring the N+1 speech recognition results, determining M speech recognition results, and sending the M speech recognition results to a labeling expert for labeling; and inputting the M speech recognition results labeled by the labeling expert into the target ASR system to optimize the target ASR system. In this scheme, the index to be optimized (recognition accuracy) is integrated into the system design, so that the designed active learning method can optimize that index directly, effectively improving the effect of active learning in speech recognition applications. Moreover, multi-system optimization can be exploited even when only the speech recognition text results are available, greatly lowering the threshold for applying active learning techniques.

Description

Speech recognition system optimization method and device
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a speech recognition system optimization method and device.
Background
In the related art, active learning has been a popular research topic in recent years, with a large volume of related papers in the academic community. It is mainly applied to fields such as document classification, information extraction, image retrieval, intrusion detection, natural language processing, and speech environment recognition.
In real data analysis scenarios, a large amount of data can be obtained, but it is unlabeled, so many classical classification algorithms cannot be used directly. A natural and simple idea is: if the data is unlabeled, label it! However, the cost of data labeling is substantial; even labeling just thousands or tens of thousands of training samples costs considerable time and money. Therefore, to reduce the training set size and labeling cost as much as possible, the field of machine learning has proposed the active learning method to optimize classification models. Active learning refers to the following kind of learning method: often, data with class labels is scarce while unlabeled data is abundant, yet manual labeling is expensive; in this situation, the learning algorithm can actively issue labeling requests and submit a screened subset of the data to an expert for labeling. This screening process is the main subject of study in active learning.
Referring to fig. 1, the model for active learning is as follows: A = (C, Q, S, L, U), where C is a classifier (or a set of classifiers), L is the set of labeled samples used for training, Q is a query function used to query high-information samples from the unlabeled sample pool U, and S is a supervisor who can assign correct labels to samples in U. The learner starts with a small initial set of labeled samples L, selects one or a batch of the most useful samples via the query function Q, asks the supervisor for their labels, then retrains the classifier with the newly obtained knowledge and makes the next round of queries. Active learning is thus a cyclical process that continues until some stopping criterion is reached.
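The A = (C, Q, S, L, U) cycle described above can be sketched as follows. This is a minimal pool-based illustration, not code from the patent; the names `query_fn` and `oracle` and the classifier interface are assumptions.

```python
def active_learning_loop(classifier, labeled, pool, query_fn, oracle,
                         rounds=10, batch=5):
    """Sketch of the A = (C, Q, S, L, U) active learning cycle.

    classifier -- C: exposes fit(X, y)
    labeled    -- L: list of (sample, label) pairs, the initial labeled set
    pool       -- U: list of unlabeled samples
    query_fn   -- Q: query_fn(classifier, sample) -> informativeness score
    oracle     -- S: oracle(sample) -> correct label (the supervisor)
    """
    for _ in range(rounds):
        if not pool:
            break  # stopping criterion: pool exhausted
        X, y = zip(*labeled)
        classifier.fit(list(X), list(y))  # train on the current labeled set L
        # Q: rank the pool and pick the `batch` most informative samples.
        ranked = sorted(pool, key=lambda x: query_fn(classifier, x), reverse=True)
        queried, pool = ranked[:batch], ranked[batch:]
        # S: the supervisor labels the queried samples; add them to L.
        labeled.extend((x, oracle(x)) for x in queried)
    return classifier
```

Each round retrains on the enlarged labeled set, matching the cyclical process described above.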
The query function Q is used to query the single most useful sample or a collection of the most useful samples. What, then, makes a sample useful? In other words, which samples should the query function select? Among the various active learning methods, the most common strategies for designing the query function are the uncertainty criterion and the difference (diversity) criterion.
Uncertainty can be understood via the concept of information entropy. Information entropy is a well-known measure of information content, and equally a measure of uncertainty: the larger the entropy, the greater the uncertainty and the more information is carried. Indeed, some uncertainty-based active learning query functions are designed using information entropy, such as entropy query-by-bagging. The uncertainty strategy therefore seeks out samples with high uncertainty, because the amount of information they contain is very useful for training the model.
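An entropy-based query score can be sketched as follows (a minimal illustration under the assumption that the classifier outputs a class-probability distribution; not code from the patent):

```python
import math

def prediction_entropy(probs):
    """Shannon entropy of a predicted class distribution.

    Larger entropy means the classifier is less certain about the sample,
    i.e. the sample carries more information for training.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# An uncertainty query function would rank unlabeled samples by this score
# and send the highest-entropy ones to the labeling expert.
```

For example, a near-uniform prediction like [0.34, 0.33, 0.33] scores far higher than a confident one like [0.97, 0.02, 0.01].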
The difference criterion can be understood as follows. As stated earlier, the query function queries one or a batch of samples in each iteration. Naturally, we want the information provided by the queried samples to be comprehensive: the information from individual samples should not repeat or be redundant, i.e., there should be some difference between the samples. When a single highest-information sample is extracted per iteration and added to the training set, the model is retrained in each iteration, and using the newly obtained knowledge in the evaluation of sample uncertainty effectively avoids data redundancy. If, however, a batch of samples is queried per iteration, a scheme must be devised to ensure sample diversity and avoid data redundancy.
Sample selection algorithms: active learning can be classified into two types according to how unlabeled examples are obtained: stream-based and pool-based. In stream-based active learning, unlabeled samples are submitted to a selection engine one by one in sequence; the selection engine decides whether to label the currently submitted sample and, if not, discards it. Pool-based active learning maintains a set of unlabeled examples, and the selection engine selects from this set the examples to be labeled.
Pool-based sample selection algorithm
a. Uncertainty reduction-based method
Such methods select for annotation those samples whose classification the current reference classifier is least able to determine. They take information entropy as the measure of a sample's information content; the sample with the maximum entropy is the one the current classifier is least certain about. From a geometric perspective, this approach prefers examples close to the classification boundary.
b. Version reduction-based method
Such methods select for labeling the samples that, once trained on, reduce the version space the most. In binary classification, the samples selected by this type of method always roughly bisect the version space. For example, the QBC (query-by-committee) algorithm randomly selects several hypotheses from the version space to form a committee, then selects for labeling the sample on which the committee's predictions diverge the most. To optimize the composition of the committee, classifier ensemble algorithms such as Bagging and AdaBoost can be used to generate the committee from the version space.
c. Generalized error reduction-based method
Such methods attempt to select the samples that minimize future generalization error. The general process is: first choose a loss function for estimating the future error rate; then, for each sample in the unlabeled set, estimate the error reduction it could bring to the reference classifier; finally, select the sample with the largest estimate for labeling. This method targets the classifier's final performance metric directly, but the computation is heavy, and performance is strongly affected by the precision of the loss function.
Stream-based sample selection algorithm
Most pool-based algorithms can be adapted to the stream-based setting. However, since a stream-based algorithm cannot compare unlabeled examples against one another, a threshold must be set for the examples' evaluation index: when the index of an example submitted to the selection engine exceeds the threshold, it is labeled. This approach must be tuned per task, making it hard to deploy as a mature method.
QBC has also been used to solve the stream-based active learning problem. Samples are continuously submitted to the selection engine as a stream, and the engine selects for labeling those samples on which the committee's member classifiers (here, a committee of only two members) give inconsistent predictions.
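The stream-based QBC rule above can be sketched in a few lines (a minimal illustration, not code from the patent; the function name is an assumption):

```python
def qbc_select(stream, committee):
    """Stream-based query-by-committee.

    A sample is routed to the labeling expert exactly when the member
    classifiers disagree on it; any committee size works, including the
    two-member committee mentioned above.
    """
    to_label = []
    for x in stream:
        predictions = {predict(x) for predict in committee}
        if len(predictions) > 1:  # members disagree -> worth labeling
            to_label.append(x)
    return to_label
```

For example, a two-member committee of threshold classifiers disagrees only on samples between the two thresholds, so only those samples are selected.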
In the process of implementing the present application, the inventor found that current active learning techniques, especially those applied in the field of speech recognition, commonly share the following points:
a. Based on a single speech recognition system, data considered more valuable is selected for labeling and subsequent training by defining an index for data screening, e.g. using confidence levels;
b. The data must be measured using an additional technical index of the speech recognition system, i.e. the text result of speech recognition cannot be used alone. This requires that the user be able to obtain the built-in parameters of the speech recognition system, which is usually only possible for the owner of the speech recognition system.
Disclosure of Invention
The embodiment of the invention provides a voice recognition system optimization method and device, which are used for at least solving one of the technical problems.
In a first aspect, an embodiment of the present invention provides a method for optimizing a speech recognition system, including: inputting the audio data to be screened into a target ASR (Automatic Speech Recognition) system to be optimized and N available ASR systems, respectively, for speech recognition to obtain N+1 speech recognition results; measuring the N+1 speech recognition results, determining M speech recognition results, and sending the M speech recognition results to a labeling expert for labeling; and inputting the M speech recognition results labeled by the labeling expert into the target ASR system to optimize the target ASR system.
In a second aspect, an embodiment of the present invention provides a speech recognition system optimizing apparatus, including: a recognition program module configured to input the audio data to be screened into the target ASR system to be optimized and the N available ASR systems, respectively, for speech recognition to obtain N+1 speech recognition results; a labeling program module configured to measure the N+1 speech recognition results, determine M speech recognition results, and send the M speech recognition results to a labeling expert for labeling; and an optimizing program module configured to input the M speech recognition results labeled by the labeling expert back into the target ASR system so as to optimize the target ASR system.
In a third aspect, there is provided a computer program product comprising a computer program stored on a non-volatile computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the method for optimizing a speech recognition system according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of the first aspect.
According to the method provided by the embodiments of the present application, the recognition results and/or intermediate results and/or model parameters of existing ASR systems are used to guide the optimization of the system to be optimized. For example, for recognition results, various metrics are applied to the recognition results of the different ASR systems to obtain the labeling value of each result; the results with higher labeling value are then labeled by experts, and the labeled results are used for optimization training of the ASR system to be optimized. In this way only the data with greater labeling value needs to be labeled, saving manpower and material resources. The index to be optimized (recognition accuracy) is integrated into the system design, so that the designed active learning method can optimize that index.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a model diagram of active learning in the related art;
FIG. 2 is a flowchart of a method for optimizing a speech recognition system according to an embodiment of the present invention;
FIG. 3 is a flowchart of another method for optimizing a speech recognition system according to an embodiment of the present invention;
FIG. 4 is a block diagram of an embodiment of a speech recognition system optimization scheme in accordance with an embodiment of the present invention;
FIG. 5 is a block diagram of a speech recognition system optimizing apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 2, a flowchart of an embodiment of a method for optimizing a speech recognition system according to the present invention is shown. The scheme of the embodiment is suitable for optimizing a voice recognition system.
In step 101, as shown in fig. 2, the audio data to be screened is input into a target ASR system to be optimized and N available ASR systems, respectively, for speech recognition to obtain N+1 speech recognition results.
In step 102, the N+1 speech recognition results are measured, M speech recognition results are determined, and the M speech recognition results are sent to a labeling expert for labeling.
In step 103, the M speech recognition results labeled by the labeling expert are fed back into the target ASR system to optimize the target ASR system.
In this embodiment, the speech recognition system optimizing apparatus uses the recognition results and/or intermediate results and/or model parameters of existing ASR systems to guide the optimization of the system to be optimized. For example, for recognition results, various metrics are applied to the recognition results of the different ASR systems to obtain the labeling value of each result; the results with higher labeling value are then labeled by experts, and the labeled results are used for optimization training of the ASR system to be optimized. In this way only the data with higher labeling value needs to be labeled, saving manpower and material resources. The index to be optimized (recognition accuracy) is integrated into the system design, so that the designed active learning method can optimize that index.
In some alternative embodiments, before the audio data to be screened is input into the target ASR system to be optimized and the N available ASR systems for speech recognition to obtain N+1 speech recognition results, the method further includes: performing data augmentation on the audio data to be screened to form a plurality of augmentation results. In this way, some "deceptive" data can be obtained through augmentation; if the ASR system is sufficiently robust, it should give consistent recognition results on these augmented copies.
In a further alternative embodiment, the data augmentation means includes pitch change, noise addition, reverberation addition, and/or audio compression.
Further optionally, measuring the N+1 speech recognition results includes: performing a first difference calculation on the recognition results produced for the audio data to be screened by any two of the N available ASR systems; performing a second difference calculation on the first recognition result produced by the target ASR system to be optimized for the audio data to be screened and the second recognition result produced by an available ASR system for the same audio data; performing a first confusion calculation on an available ASR system's recognition results for the first and second augmentation results of the audio data to be screened; and/or performing a second confusion calculation on the target ASR system's recognition results for the first and second augmentation results of the audio data to be screened. By computing this series of data, the accuracy of the target ASR system's recognition and the certainty of its recognition results can be better characterized, so that audio that the system to be optimized recognizes inaccurately and/or with low certainty can be selected for subsequent labeling and retraining, thereby optimizing the ASR system to be optimized.
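The four measurements above can be organized as in the following sketch. This is a minimal illustration; the function name, argument names, and dictionary keys are assumptions, not terms from the patent.

```python
from itertools import combinations

def pairwise_metrics(target_result, avail_results, target_aug, avail_aug, dist):
    """Compute the difference and confusion measurements described above.

    target_result -- the target ASR system's result on the audio X
    avail_results -- {system_name: result on X} for the N available systems
    target_aug    -- the target system's results on augmented copies of X
    avail_aug     -- {system_name: results on augmented copies of X}
    dist          -- difference measure E between two results (e.g. edit distance)
    """
    m = {}
    # 1) first difference: any two available systems on the same audio
    m["diff_avail"] = {(a, b): dist(avail_results[a], avail_results[b])
                       for a, b in combinations(sorted(avail_results), 2)}
    # 2) second difference: target system vs. each available system
    m["diff_target"] = {w: dist(target_result, r) for w, r in avail_results.items()}
    # 3) first confusion: how much one available system wavers across augmentations
    m["conf_avail"] = {w: max(dist(a, b) for a, b in combinations(rs, 2))
                       for w, rs in avail_aug.items()}
    # 4) second confusion: how much the target system wavers across augmentations
    m["conf_target"] = max(dist(a, b) for a, b in combinations(target_aug, 2))
    return m
```

Any text-level difference measure can be plugged in as `dist`; a length difference is used below purely to keep the example small.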
Referring to FIG. 3, a flow chart of an embodiment of another speech recognition system optimization method of the present invention is shown. This flow chart details the steps that further define step 102 of fig. 2 above.
As shown in fig. 3, in step 201, a labeling value judgment function is formed based on the result of the first difference calculation, the result of the second difference calculation, the result of the first confusion calculation, and/or the result of the second confusion calculation.
Then, in step 202, the labeling values of the N+1 recognition results are calculated using the labeling value judgment function.
Finally, in step 203, the M speech recognition results to be labeled among the N+1 recognition results are determined based on the labeling values, and the M speech recognition results are sent to a labeling expert for labeling.
In the embodiments of the present application, the labeling value judgment function characterizes the labeling value of each recognition result by jointly considering: the differences among the recognition results of the existing ASR systems; the differences between the recognition results of the existing ASR systems and that of the ASR system to be optimized; the confusion of each existing ASR system over its recognition results on different augmentations of the same audio data; and the confusion of the ASR system to be optimized over its recognition results on different augmentations of the same audio data. The recognition results with higher labeling value are then selected for expert labeling and subsequent retraining-based optimization.
In some optional embodiments, the larger the result of the second confusion calculation, the higher the labeling value given by the labeling value judgment function; the larger the result of the second difference calculation, the higher the labeling value; and/or, the smaller the result of the first difference calculation, the smaller the result of the first confusion calculation, and the larger the result of the second confusion calculation, the higher the labeling value. In this way, a corresponding labeling value judgment function can be established from the relations between the difference measures, the confusion measures, and the labeling value; a comprehensive calculation then yields the one or more recognition results with the highest labeling value for expert labeling, and the labeled recognition results are used to optimize the system.
Further optionally, the measures used in the metric include the edit distance and/or the confidence of each ASR system. Thus the final recognition result of an ASR system can be used to calculate the edit distance, intermediate results of the ASR system such as the confidence can be used, and model parameters of the ASR system such as the lattice (word graph) can also be used; these are not elaborated further here. It should be noted that although the embodiments of the present application only show measurement using the final result (the recognition result), those skilled in the art will understand that model parameters or intermediate results of a model may also be used for the calculation, which is not limited herein.
Although the above embodiments use clearly ordered numbers such as step 101 and step 102 to describe the sequence of steps, in an actual application scenario some steps may be executed in parallel; the order of some steps is not limited by these numbers, which is not restricted or repeated here.
To aid understanding of the present application, the following describes some problems the inventor encountered in implementing the invention and one specific embodiment of the finally determined scheme.
In the course of implementing the present invention, the inventor found the following drawbacks in these similar techniques:
1) The commonly defined technical indicators are not directly associated with the final target to be optimized; for example, confidence indicators are not directly tied to the final speech recognition accuracy;
2) Joint learning cannot be performed using multiple speech recognition systems;
3) The user must be able to access the parameters built into the speech recognition system, which is usually only possible for the owner of the speech recognition system.
The inventor also found that the above drawbacks arise for the following reason: current active learning techniques, especially as applied in the field of speech recognition, are based on a single speech recognition system and select data considered more valuable for labeling and subsequent training by defining a data-screening index; it is this technical architecture that determines the drawbacks above.
The inventor further found that the embodiments of the present application were not easily conceivable to those skilled in the art for the following reasons:
1) Within the existing active learning framework, practitioners usually improve the active learning effect by searching for better data-screening indexes rather than seeking a solution at the framework level;
2) Some practitioners do perform multi-system joint optimization within the existing active learning framework by introducing a speech synthesis system, e.g. using the speech recognition system to recognize synthesized speech data, then manually checking and screening the data with poor recognition results, and feeding them back to the speech recognition system.
The scheme of the embodiments of the present application addresses the above drawbacks as follows:
1) The commonly defined technical indicators are not directly associated with the final target to be optimized (e.g. confidence indicators are not directly tied to the final speech recognition accuracy):
the inventor integrates the final index that needs to be optimized, the word accuracy of speech recognition, into the system design, so that the designed active learning method can explicitly optimize this index.
2) Joint learning cannot be performed using multiple speech recognition systems:
the method designed by the inventor fuses multiple speech recognition systems for learning and can use the results of the other speech recognition systems to "guide" the optimization of the target speech recognition system;
3) The user must be able to obtain the parameters built into the speech recognition system, which is usually only possible for its owner:
the method designed by the inventor can optimize the speech recognition system when only the speech recognition results are available (and can perform even more effective optimization when the built-in parameters of the speech recognition system are also available).
Referring to fig. 4, a frame design of an embodiment of the present application is shown.
As shown in fig. 4, the overall system framework is composed of an active data augmentation module, a target ASR system S to be optimized, a plurality of existing available ASR systems O(1, 2, 3, ...), and an active data screening module.
a. Active data augmentation module: used for robustness-oriented augmentation of the audio data to be screened. The goal is to produce different audio without changing the speech content (i.e., a human annotator would produce the same transcript for each copy), including but not limited to pitch change, noise addition, reverberation addition, and audio compression. The module is introduced based on the assumption that a good speech recognition system should have good recognition robustness: given several modified "deceptive" versions of the same speech, it should produce the same recognition result, and the more consistent these results are, the more "confident" the recognition system is in its own judgment.
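A toy version of such a module can be sketched as follows. This is purely illustrative (additive noise on a raw sample list, with hypothetical function names); a real module would operate on actual audio and also apply pitch shift, reverberation, and codec compression as listed above.

```python
import random

def augment_waveform(samples, noise_scale=0.01, seed=0):
    """Produce one 'deceptive' copy of a waveform: the speech content is
    unchanged, only the signal is perturbed with small additive noise."""
    rng = random.Random(seed)
    return [s + rng.uniform(-noise_scale, noise_scale) for s in samples]

def augmented_copies(samples, n):
    """The n differently augmented versions X1..Xn that are then fed to the
    target and available ASR systems for the consistency measurements."""
    return [augment_waveform(samples, seed=k) for k in range(n)]
```

A robust recognizer should transcribe all n copies identically; divergence among its transcripts is the "confusion" signal used by the screening module.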
b. Target ASR system S to be optimized: the ASR system we wish to optimize. Its characteristic is that all of its model parameters, intermediate results, and final recognition results can be obtained; that is, it can be trained, and it can be used to recognize audio to obtain information including the recognition result, confidence, lattice, and so on.
c. Available ASR systems O(1, 2, 3, ...): other speech recognition systems used to measure the audio. These recognition systems are not iteratively optimized in our active learning architecture; for these models we can only guarantee access to the final recognition result, while other information, including model parameters and intermediate results, cannot be obtained.
d. Active data screening module: processes the results of the target ASR system S and the existing available ASR systems O(1, 2, 3, ...), and finally decides whether the output data is sent to a labeling expert for labeling and then enters model training. The measurement method used in this decision is described in detail in step 2) below.
2) Metric definition
Assume X represents a piece of speech data to be screened; X is augmented by N different methods in the active data augmentation module, and Xn denotes the data obtained by the n-th augmentation method. Sn = S(Xn) denotes the recognition result of the target ASR system S on Xn, and Ow,n = Ow(Xn) (w = 1, 2, 3, ...) denotes the recognition result of the w-th of the existing available ASR systems O(1, 2, 3, ...) on Xn.
E(Si, Sj) denotes the measure of the difference between the recognition results Si and Sj of the target ASR system S on the augmented data Xi and Xj; E(Ow,i, Ow,j) denotes the measure of the difference between the recognition results Ow,i and Ow,j of the same available recognition system Ow on the augmented data Xi and Xj; E(Si, Ow,i) denotes the measure of the difference between the recognition result Si of the target ASR system on the augmented data Xi and the recognition result Ow,i of the available recognition system Ow on the same augmented data Xi. The larger the value of E, the greater the difference between the two recognition results; the index may be, but is not limited to, measured by the edit distance.
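The edit distance mentioned as one possible choice for E can be computed with the standard dynamic program (a minimal sketch, not code from the patent):

```python
def edit_distance(a, b):
    """Levenshtein distance between two recognition results: the minimum
    number of character insertions, deletions, and substitutions needed
    to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from a[:0] to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to b[:0]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute ca -> cb
        prev = cur
    return prev[-1]
```

In practice the same recurrence is usually applied at the word level for ASR transcripts, which yields the familiar word error rate numerator.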
We define the judgment function for the labeling value of an audio X as:

F(X) = αs·Ps(X) + Σw αw·Hsw(X) − Σw βw·Pw(X) − Σw≠w' βww'·Hww'(X)

wherein,

Hsw(X) represents the difference between the recognition result of the recognition system s to be optimized on X and the recognition result of the existing recognition system w on X;

Hww'(X) represents the difference between the recognition result of the existing recognition system w on X and the recognition result of the existing recognition system w' on X;

Pw(X) represents the confusion of the existing recognition system w about its own recognition results (the larger E is, the worse the recognition robustness and the less certain the system is of its own recognition result);

Ps(X) represents the confusion of the recognition system s to be optimized about its own recognition results (the larger E is, the worse the recognition robustness and the less certain the system is of its own recognition result). Since intermediate results including confidence can be obtained for the recognition system s to be optimized, Ps(X) here can also be represented by the inverse of the confidence. In summary, the larger Ps(X) is, the more likely it is that the recognition performance of the recognition system s to be optimized on the data X is problematic.

αs, αw, βw, βww', etc. represent coefficients set in advance.
The core meaning of this judgment function is as follows. For a piece of audio data, on the one hand, if the recognition system s to be optimized is more confused about its own recognition result (Ps(X) is larger) and its result differs more from those of the other existing recognition systems (Hsw(X) is larger), then the recognition result of the system to be optimized is unreliable. On the other hand, if an existing recognition system is less confused about its own result (Pw(X) is smaller) and the difference between two existing recognition systems is smaller (Hww'(X) is smaller), then the other existing speech recognition systems tend to be confident in their recognition results. In this situation the recognition system to be optimized is uncertain about data that the other existing speech recognition systems are confident about, and such data should be labeled.
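Under the definitions above, a minimal sketch of the labeling-value function could look as follows. All names and the exact aggregation over augmentations are assumptions for illustration; only the sign pattern of the four terms follows the description above:

```python
from itertools import combinations

def labeling_value(s_results, o_results, E,
                   alpha_s=1.0, alpha_w=1.0, beta=1.0, gamma=1.0):
    """Labeling-value score for one utterance X.

    s_results: list of target-system hypotheses S_1..S_N over augmentations
    o_results: dict {w: [O_{w,1}..O_{w,N}]} for each available system
    E: pairwise difference measure, e.g. edit distance
    """
    def avg(pairs):
        pairs = list(pairs)
        return sum(E(a, b) for a, b in pairs) / max(len(pairs), 1)

    # Ps(X): confusion of the target system over its own augmented results.
    p_s = avg(combinations(s_results, 2))
    # Hsw(X): difference between the target system and each available system.
    h_sw = sum(avg(zip(s_results, rs)) for rs in o_results.values())
    # Pw(X): confusion of each available system over its own results.
    p_w = sum(avg(combinations(rs, 2)) for rs in o_results.values())
    # Hww'(X): difference between pairs of available systems.
    h_ww = sum(avg(zip(r1, r2))
               for r1, r2 in combinations(o_results.values(), 2))

    # Larger Ps and Hsw raise the value; larger Pw and Hww' lower it.
    return alpha_s * p_s + alpha_w * h_sw - beta * p_w - gamma * h_ww
```

Clips with the highest scores are the ones the target system is unsure about while the other systems agree, and those are sent for labeling.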
The inventors have found that the embodiments of the present application can achieve the following beneficial effects:
1) The effect of active learning in speech recognition applications is effectively improved. Multiple systems are utilized for optimization even when only the speech recognition text results are available, which greatly lowers the threshold for applying active learning techniques;
2) The proposal is a framework that can adapt to a variety of specific methods and conditions and is highly robust in practical applications; it can be directly combined with unsupervised learning methods to achieve a better optimization effect.
Referring to fig. 5, a block diagram of a voice recognition system optimizing apparatus according to an embodiment of the invention is shown.
As shown in FIG. 5, the apparatus includes: a recognition program module 510, a labeling program module 520, and an optimization program module 530.
The recognition program module 510 is configured to input the audio data to be screened into the target optimization ASR system and the N available ASR systems respectively for speech recognition to obtain N+1 speech recognition results; the labeling program module 520 is configured to measure the N+1 speech recognition results, determine M speech recognition results, and send them to a labeling expert for labeling; and the optimization program module 530 is configured to input the M speech recognition results labeled by the labeling expert into the target ASR system again to optimize the target ASR system.
It should be understood that the modules depicted in fig. 5 correspond to the various steps in the method described with reference to fig. 1 and 2. Thus, the operations and features described above for the method and the corresponding technical effects are equally applicable to the modules in fig. 5, and are not described here again.
It should be noted that the modules in the embodiments of the present application are not intended to limit the solutions of the present application; for example, the recognition program module may be described as a module that inputs audio data to be screened into the target optimized ASR system and N available ASR systems for speech recognition to obtain N+1 speech recognition results. In addition, the relevant functional modules may also be implemented by a hardware processor; for example, the recognition program module may be implemented by a processor, which is not described herein again.
In other embodiments, embodiments of the present invention further provide a non-volatile computer storage medium storing computer-executable instructions that are capable of performing the method for optimizing a speech recognition system in any of the method embodiments described above.
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
respectively inputting the audio data to be screened into a target optimization ASR system and N available ASR systems to perform voice recognition to obtain N+1 voice recognition results;
measuring the N+1 voice recognition results, determining M voice recognition results and sending the M voice recognition results to a labeling expert for labeling;
and inputting the M voice recognition results labeled by the labeling expert into the target ASR system to optimize the target ASR system.
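The three stored instructions above can be sketched as one optimization round (all interfaces here — `recognize`, `train`, the expert callable, and `score_fn` — are assumptions for illustration, not the patent's disclosed API):

```python
def optimize_round(audios, target_asr, available_asrs, score_fn, expert, M):
    # Step 1: obtain N+1 recognition results per audio clip.
    results = [(a, target_asr.recognize(a),
                [o.recognize(a) for o in available_asrs]) for a in audios]
    # Step 2: rank clips by labeling value and keep the top-M.
    results.sort(key=lambda r: score_fn(r[1], r[2]), reverse=True)
    to_label = results[:M]
    # Step 3: expert labels feed back into target-system training.
    labeled = [(a, expert(a)) for a, _, _ in to_label]
    target_asr.train(labeled)
    return labeled
```

In practice such a round would be repeated, with the retrained target system producing new recognition results for the next screening pass.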
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the voice recognition system optimizing apparatus, or the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory remotely located with respect to the processor, the remote memory being connectable to the speech recognition system optimizing apparatus through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-volatile computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above-described methods of optimizing a speech recognition system.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the device includes: one or more processors 610 and a memory 620; one processor 610 is taken as an example in fig. 6. The apparatus for the speech recognition system optimization method may further include: an input device 630 and an output device 640. The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or other means; a bus connection is taken as an example in fig. 6. The memory 620 is the non-volatile computer-readable storage medium described above. The processor 610 executes various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 620, i.e., implements the above-described method embodiments for the speech recognition system optimizing apparatus. The input device 630 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the speech recognition system optimizing device. The output device 640 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
As an embodiment, the electronic device is applied to a voice recognition system optimizing apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to:
respectively inputting the audio data to be screened into a target optimization ASR system and N available ASR systems to perform voice recognition to obtain N+1 voice recognition results;
measuring the N+1 voice recognition results, determining M voice recognition results and sending the M voice recognition results to a labeling expert for labeling;
and inputting the M voice recognition results labeled by the labeling expert into the target ASR system to optimize the target ASR system.
The electronic device of the embodiments of the present application exists in a variety of forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communications. Such terminals include smart phones, multimedia phones, feature phones, low-end phones, and the like.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access characteristics. Such terminals include PDA, MID, and UMPC devices, etc.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players, handheld game consoles, electronic books, smart toys, and portable vehicle navigation devices.
(4) Servers: since highly reliable services must be provided, a server, while similar to a general computer architecture in composition, has high requirements on processing capacity, stability, reliability, security, expandability, manageability, and the like.
(5) Other electronic devices with data interaction functions.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method of optimizing a speech recognition system, comprising:
performing data amplification on the audio data to be screened to form a plurality of amplification results;
the amplified audio data to be screened are respectively input into a target optimization ASR system and N available ASR systems for speech recognition to obtain N+1 speech recognition results;
measuring the N+1 voice recognition results, determining M voice recognition results and sending the M voice recognition results to a labeling expert for labeling;
inputting the M voice recognition results labeled by the labeling expert into the target optimization ASR system to optimize the target optimization ASR system;
wherein, the measuring the n+1 voice recognition results includes: performing first difference degree calculation on the identification result of the audio data to be screened by any two systems in the N available ASR systems; performing second difference degree calculation on a first recognition result of the target optimization ASR system on the audio data to be screened and a second recognition result of the available ASR system on the audio data to be screened; performing first confusion degree calculation on a third recognition result of the available ASR system on the first amplification result of the audio data to be screened and a fourth recognition result of the available ASR system on the second amplification result of the audio data to be screened; and/or performing second confusion calculation on a third recognition result of the target optimized ASR system on the first amplification result of the audio data to be screened and a fourth recognition result of the target optimized ASR system on the second amplification result of the audio data to be screened;
the determining M voice recognition results and sending the M voice recognition results to a labeling expert for labeling comprises the following steps: forming a labeling value judgment function based on the result of the first difference degree calculation, the result of the second difference degree calculation, the result of the first confusion degree calculation and/or the result of the second confusion degree calculation; calculating the labeling values of the N+1 recognition results by using the labeling value judgment function; and determining, based on the labeling values, M voice recognition results to be labeled among the N+1 recognition results, and sending the M voice recognition results to a labeling expert for labeling.
2. The method of claim 1, wherein the manner of data augmentation comprises pitch change, noise addition, reverberation addition, and/or audio compression.
3. The method of claim 1, wherein the greater the result of the second confusion calculation, the higher the labeling value of the labeling value decision function;
the larger the result of the second difference degree calculation is, the higher the labeling value of the labeling value judging function is; and/or
The smaller the result of the first confusion degree calculation is, and the smaller the result of the first difference degree calculation is, the higher the labeling value of the labeling value judgment function is.
4. A method according to any one of claims 1-3, wherein the measure of the metric comprises edit distance and/or confidence of each ASR system.
5. A speech recognition system optimizing apparatus comprising:
an amplification program module configured to perform data amplification on the audio data to be screened to form a plurality of amplification results;
the recognition program module is configured to input the amplified audio data to be screened into the target optimization ASR system and the N available ASR systems respectively for voice recognition to obtain N+1 voice recognition results;
the labeling program module is configured to measure the N+1 voice recognition results, determine M voice recognition results and send the M voice recognition results to a labeling expert for labeling;
the optimizing program module is configured to input the M voice recognition results marked by the marking expert into the target optimizing ASR system to optimize the target optimizing ASR system;
wherein the labeling program module is further configured to: performing first difference degree calculation on the identification result of the audio data to be screened by any two systems in the N available ASR systems; performing second difference degree calculation on a first recognition result of the target optimization ASR system on the audio data to be screened and a second recognition result of the available ASR system on the audio data to be screened; performing first confusion degree calculation on a third recognition result of the available ASR system on the first amplification result of the audio data to be screened and a fourth recognition result of the available ASR system on the second amplification result of the audio data to be screened; and/or performing second confusion calculation on a third recognition result of the target optimized ASR system on the first amplification result of the audio data to be screened and a fourth recognition result of the target optimized ASR system on the second amplification result of the audio data to be screened; forming a labeling value judging function based on the result of the first degree of variance calculation, the result of the second degree of variance calculation, the result of the first degree of confusion calculation and/or the result of the second degree of confusion calculation; calculating the labeling values of the N+1 recognition results by using the labeling value judging function; and determining M voice recognition results to be marked in the N+1 recognition results based on the marking value, and sending the M voice recognition results to a marking expert for marking.
6. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 4.
7. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1 to 4.
CN202111076518.XA 2021-09-14 2021-09-14 Speech recognition system optimization method and device Active CN113793604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111076518.XA CN113793604B (en) 2021-09-14 2021-09-14 Speech recognition system optimization method and device


Publications (2)

Publication Number Publication Date
CN113793604A CN113793604A (en) 2021-12-14
CN113793604B true CN113793604B (en) 2024-01-05

Family

ID=78880321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111076518.XA Active CN113793604B (en) 2021-09-14 2021-09-14 Speech recognition system optimization method and device

Country Status (1)

Country Link
CN (1) CN113793604B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Optimize method, system, equipment and the storage medium of voice recognition acoustic model
CN109685264A (en) * 2018-12-20 2019-04-26 华润电力技术研究院有限公司 Thermal power unit operation optimization method, device and computer equipment
CN110148416A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium
US10388272B1 (en) * 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
CN111916064A (en) * 2020-08-10 2020-11-10 北京睿科伦智能科技有限公司 End-to-end neural network speech recognition model training method
CN112151022A (en) * 2020-09-25 2020-12-29 北京百度网讯科技有限公司 Speech recognition optimization method, device, equipment and storage medium
CN112435656A (en) * 2020-12-11 2021-03-02 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium
CN112489676A (en) * 2020-12-15 2021-03-12 腾讯音乐娱乐科技(深圳)有限公司 Model training method, device, equipment and storage medium
CN112599116A (en) * 2020-12-25 2021-04-02 苏州思必驰信息科技有限公司 Speech recognition model training method and speech recognition federal training system
CN112700763A (en) * 2020-12-26 2021-04-23 科大讯飞股份有限公司 Voice annotation quality evaluation method, device, equipment and storage medium
CN113204614A (en) * 2021-04-29 2021-08-03 北京百度网讯科技有限公司 Model training method, method and device for optimizing training data set
CN113314124A (en) * 2021-06-15 2021-08-27 宿迁硅基智能科技有限公司 Text output method and system, storage medium and electronic device
WO2021169301A1 (en) * 2020-02-28 2021-09-02 平安科技(深圳)有限公司 Method and device for selecting sample image, storage medium and server

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0426347D0 (en) * 2004-12-01 2005-01-05 Ibm Methods, apparatus and computer programs for automatic speech recognition
KR102146524B1 (en) * 2018-09-19 2020-08-20 주식회사 포티투마루 Method, system and computer program for generating speech recognition learning data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Research on End-to-End Speech Recognition Technology"; Qin Chuxiong; China Masters' Theses Full-text Database, Information Science and Technology; pp. 57-76 *
"Research on Active Learning and Semi-supervised Learning Methods in Speech Emotion Recognition"; Zhang Xiaopeng; China Masters' Theses Full-text Database, Information Science and Technology; full text *

Also Published As

Publication number Publication date
CN113793604A (en) 2021-12-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant