CN112417850B - Audio annotation error detection method and device - Google Patents

Audio annotation error detection method and device

Info

Publication number
CN112417850B
CN112417850B (application CN202011263694.XA)
Authority
CN
China
Prior art keywords
text
error detection
labeling
confusion
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011263694.XA
Other languages
Chinese (zh)
Other versions
CN112417850A (en)
Inventor
张晴晴
朱冬
贾艳明
何淑琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingshu Intelligent Technology Co ltd
Original Assignee
Beijing Qingshu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingshu Intelligent Technology Co ltd filed Critical Beijing Qingshu Intelligent Technology Co ltd
Priority to CN202011263694.XA priority Critical patent/CN112417850B/en
Publication of CN112417850A publication Critical patent/CN112417850A/en
Application granted granted Critical
Publication of CN112417850B publication Critical patent/CN112417850B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses an error detection method for audio annotation, which comprises the following steps: acquiring audio data and segmenting the audio data into a plurality of audio clips; labeling each audio clip to obtain an initial labeling text; performing error detection processing on the initial labeling text by adopting a universal text error detection model to obtain a first labeling text; determining a confusion dictionary of the universal text error detection model; identifying the domain category of the first labeling text by adopting a text classification model; according to the domain category, performing error detection processing on the first labeling text by adopting a domain text error detection model corresponding to the domain category to obtain a second labeling text; using the confusion dictionary of the universal text error detection model and the second labeling text of the domain text error detection model as the database of a fine-tuning model; and performing fine-tuning processing on the second labeling text by adopting the fine-tuning model according to the semantics of the second labeling text to obtain a final third labeling text.

Description

Audio annotation error detection method and device
Technical Field
The application belongs to the field of voice recognition, and particularly relates to an error detection method and device for audio annotation.
Background
With the development of speech recognition technology, it is increasingly applied in various fields, such as smart homes for daily life, intelligent applications in education, and intelligent robots in medicine or finance. Current speech recognition techniques rely on speech recognition models trained by deep learning to transcribe speech into text for subsequent processing. Efficient, high-accuracy speech recognition models in turn rely on large amounts of high-quality speech data.
However, in the process of implementing the present application, the inventors found that the speech data required for training a speech recognition model is normally obtained by manual labeling.
At least the following problems exist at present: the labeling quality of each utterance is affected by the fatigue and knowledge level of the annotator, and wrongly written characters in the labeling text are unavoidable during labeling. Even under strict quality inspection, the final labeling data may still contain text errors, and using such data biases the trained speech recognition model and degrades its recognition performance. Moreover, quality inspection cost rises and the workload of quality inspectors increases.
Disclosure of Invention
The embodiments of the present application aim to provide an error detection method and device for audio annotation, which can solve the technical problem that the quality of speech annotation is easily affected by the fatigue and knowledge level of the annotator, leading to low accuracy and poor recognition performance of the speech recognition model.
In order to solve the technical problems, the application is realized as follows:
In a first aspect, an embodiment of the present application provides an error detection method for audio annotation, including:
Acquiring audio data and segmenting the audio data into a plurality of audio fragments;
labeling the audio fragment to obtain an initial labeling text;
Performing error detection processing on the initial labeling text by adopting a universal text error detection model to obtain a first labeling text;
Determining a confusion dictionary of the universal text error detection model;
identifying the domain category of the first marked text by adopting a text classification model;
according to the domain category, adopting a domain text error detection model corresponding to the domain category to carry out error detection processing on the first labeling text so as to obtain a second labeling text;
using the confusion dictionary of the universal text error detection model and the second labeling text of the domain text error detection model as a database of a fine-tuning model;
and carrying out fine adjustment processing on the second annotation text by adopting the fine adjustment model according to the semantics of the second annotation text so as to obtain a final third annotation text.
Further, the confusion dictionary includes a personal confusion dictionary and a shared confusion dictionary, and determining the confusion dictionary of the universal text error detection model specifically includes:
after modification and confirmation by a specific labeling person, recording the mislabeled text and the frequency with which the labeling error occurs;
when the frequency is higher than a threshold, adding the mislabeled text to the personal confusion dictionary of the specific labeling person;
and counting the personal confusion dictionaries of a plurality of labeling personnel, and adding the mislabeled text to a shared confusion dictionary when it appears more than a preset number of times.
Further, the error detection processing is performed on the initial labeling text by adopting a general text error detection model to obtain a first labeling text, which specifically comprises the following steps:
searching out the position of the annotation error by adopting a universal text error detection model;
obtaining a candidate item list for replacing the error label from the confusion dictionary;
Obtaining candidate items from the candidate item list to replace error labels;
Calculating the fluency and confusion of the replaced marked text by adopting an N-gram model;
And determining the best target candidate item according to the fluency and the confusion degree so as to obtain a first labeling text.
Further, after the error detection processing is performed on the first labeling text by using the domain text error detection model corresponding to the domain category to obtain the second labeling text, the method further includes:
generating error detection information under the condition that the first labeling text has errors;
wherein the error detection information comprises an audio fragment index, an error location index and a candidate word.
Further, the domain categories include: economic, educational, scientific, social, gaming, and recreational.
In a second aspect, an embodiment of the present application provides an error detection apparatus for audio annotation, which is characterized in that the apparatus includes:
the acquisition module is used for acquiring audio data and segmenting the audio data into a plurality of audio fragments;
the labeling module is used for labeling the audio clips to obtain an initial labeling text;
the first error detection module is used for carrying out error detection processing on the initial labeling text by adopting a universal text error detection model so as to obtain a first labeling text;
The determining module is used for determining a confusion dictionary of the universal text error detection model;
the identification module is used for identifying the domain category of the first marked text by adopting a text classification model;
The second error detection module is used for carrying out error detection processing on the first marked text by adopting a field text error detection model corresponding to the field category according to the field category so as to obtain a second marked text;
The warehousing module is used for taking the confusion dictionary of the universal text error detection model and the second labeling text of the field text error detection model as a database of a fine adjustment model;
And the fine tuning module is used for carrying out fine tuning processing on the second annotation text by adopting the fine tuning model according to the semantics of the second annotation text so as to obtain a final third annotation text.
Further, the confusion dictionary includes a personal confusion dictionary and a shared confusion dictionary, and the determining module specifically includes:
The recording sub-module is used for recording the text of the marking errors and the occurrence frequency of the marking errors after the marking errors are modified and confirmed by the specific marking personnel;
a personal dictionary sub-module for adding the mislabeled text to the personal confusion dictionary of the specific labeling person when the frequency is higher than a threshold;
and the shared dictionary sub-module is used for counting the personal confusion dictionaries of a plurality of labeling personnel, and adding the text with the labeling error into the shared confusion dictionary when the frequency of the text with the labeling error is higher than the preset frequency.
Further, the first error detection module specifically includes:
the searching sub-module is used for searching the position of the annotation error by adopting a universal text error detection model;
the obtaining sub-module is used for obtaining a candidate item list for replacing the error label from the confusion dictionary;
the replacing sub-module is used for acquiring candidate items from the candidate item list to replace error labels;
The computing sub-module is used for computing the fluency and confusion degree of the replaced marked text by adopting the N-gram model;
and the determining submodule is used for determining the best target candidate item according to the fluency and the confusion degree so as to obtain a first annotation text.
Further, the error detection apparatus further includes:
the generation module is used for generating error detection information under the condition that the first annotation text has errors;
wherein the error detection information comprises an audio fragment index, an error location index and a candidate word.
Further, the domain categories include: economic, educational, scientific, social, gaming, and recreational.
In the embodiments of the present application, automatic error detection of audio data is realized through the universal text error detection model, the domain text error detection model and the fine-tuning model. The speed and accuracy of the universal text error detection model are fully utilized, and the domain category and context semantics are further taken into account, so that the influence of the annotator's fatigue and knowledge level on labeling quality is avoided, labeling quality is improved, and the accuracy and recognition performance of the speech recognition model are improved accordingly.
Drawings
FIG. 1 is a flow chart of an error detection method for audio annotation according to an embodiment of the present application;
FIG. 2 is a flow chart of another method for detecting errors of audio annotation according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an audio labeling error detection device according to an embodiment of the present application.
Reference numerals illustrate:
30-error detection means, 301-acquisition module, 302-labeling module, 303-first error detection module, 3031-lookup sub-module, 3032-acquisition sub-module, 3033-replacement sub-module, 3034-computation sub-module, 3035-determination sub-module, 304-determination module, 3041-recording sub-module, 3042-personal dictionary sub-module, 3043-shared dictionary sub-module, 305-identification module, 306-second error detection module, 307-binning module, 308-trimming module, 309-generation module.
The objects, functional features, and advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first", "second", and the like in the description and the claims are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that such terms may be interchanged where appropriate so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more.
The audio annotation error detection method provided by the embodiments of the present application is described in detail below through specific embodiments and application scenarios with reference to the accompanying drawings.
Example 1
Referring to fig. 1, a flow chart of an audio annotation error detection method provided by an embodiment of the present application is shown, where the audio annotation error detection method includes:
S101: audio data is acquired and is sliced into a plurality of audio clips.
Optionally, a voice detection system performs voice activity dotting on the audio data, and the audio data is sliced at the dotting positions.
Alternatively, the audio data may be sliced into segments of a preset duration, for example 3 s, or according to phoneme length, for example 6 phoneme units.
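As an illustrative sketch (not part of the patent), fixed-duration slicing as in S101 could look like the following; the function name, 16 kHz sample rate, and list-based sample buffer are assumptions:

```python
def slice_audio(samples, sample_rate, clip_seconds=3.0):
    """Split a mono sample sequence into fixed-length clips.

    The last clip may be shorter than clip_seconds when the audio
    length is not an exact multiple of the clip length.
    """
    clip_len = int(sample_rate * clip_seconds)
    return [samples[i:i + clip_len] for i in range(0, len(samples), clip_len)]

# 10 s of audio at 16 kHz -> clips of 3 s, 3 s, 3 s, and 1 s
audio = [0.0] * (16000 * 10)
clips = slice_audio(audio, sample_rate=16000, clip_seconds=3.0)
```

A production system would slice at voice-activity dotting positions instead of fixed offsets, but the bookkeeping is the same.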
S102: and labeling the audio fragment to obtain an initial labeling text.
The existing audio labeling method can be adopted for labeling, and the description is omitted here.
S103: and carrying out error detection processing on the initial labeling text by adopting a universal text error detection model so as to obtain a first labeling text.
Specifically, the universal text error detection model includes at least an N-gram language model.
The database of the universal text error detection model includes a confusion dictionary, which contains a character-level pinyin confusion set, a glyph confusion set, and a word-level confusion set.
It should be appreciated that the location of the annotation error can be found by a generic error detection model and the candidate list is looked up from the confusion dictionary.
The first labeling text produced by the universal text error detection model resolves some basic confusion errors as a first pass.
S104: a confusion dictionary of the generic text error detection model is determined.
Specifically, the confusion dictionary of the universal text error detection model can be determined by recording wrong words and corresponding candidate words in the background.
Further, the confusion dictionary may include a personal confusion dictionary for a particular annotator and a shared confusion dictionary for all annotators.
S105: and identifying the domain category of the first marked text by adopting a text classification model.
In particular, the text classification model includes, but is not limited to, TextCNN, TextRNN, TextRCNN, and BERT.
The domain categories may include economy, education, science, society, games, and entertainment.
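The patent names TextCNN/TextRNN/TextRCNN/BERT as candidate classifiers but gives no implementation. As a hypothetical stand-in that only shows the interface (train on labeled texts, predict a domain category), here is a tiny Naive Bayes classifier; all names and the whitespace tokenization are assumptions:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesDomainClassifier:
    """Toy stand-in for the patent's text classification model."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-label word frequencies
        self.label_counts = Counter(labels)      # class priors
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        def score(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            s = math.log(self.label_counts[label])
            for w in text.split():
                # add-one smoothing over the shared vocabulary
                s += math.log((counts[w] + 1) / (total + len(self.vocab)))
            return s
        return max(self.label_counts, key=score)
```

A real system would swap in a fine-tuned BERT or TextCNN classifier over the same fit/predict shape.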
S106: and according to the domain category, adopting a domain text error detection model corresponding to the domain category to carry out error detection processing on the first annotation text so as to obtain a second annotation text.
It is understood that each domain category corresponds to a respective domain text error detection model.
Specifically, the domain text error detection model includes, but is not limited to, BERT and other Transformer-based models.
In this embodiment, BERT is taken as an example to further describe the domain text error detection model.
BERT (Bidirectional Encoder Representations from Transformers) is used to find erroneous words in the first labeling text and to filter candidate words: the masked language model (Masked LM) in BERT masks the first labeling text word by word, and BERT's decoder then obtains candidate words from the glyph confusion set of the confusion dictionary.
Compared with the first labeling text, the second labeling text processed by the domain text error detection model has higher accuracy because the domain of the text is taken into account.
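The word-by-word mask-and-rescore loop described above can be sketched in a model-agnostic way. This is an illustration, not the patent's implementation: `score_fn` stands in for BERT's masked-LM head, and all names and the `margin` parameter are assumptions:

```python
def detect_errors_with_masking(tokens, confusion_sets, score_fn, margin=0.0):
    """Mask each token in turn; if a confusion-set alternative scores higher
    than the original token (by more than `margin`), flag the position.

    score_fn(tokens, i, candidate) plays the role of a masked-LM scorer:
    how plausible `candidate` is at position i given the rest of `tokens`.
    Returns a list of (position, original_token, suggested_token).
    """
    findings = []
    for i, tok in enumerate(tokens):
        best, best_score = tok, score_fn(tokens, i, tok)
        for cand in confusion_sets.get(tok, ()):
            s = score_fn(tokens, i, cand)
            if s > best_score + margin:
                best, best_score = cand, s
        if best != tok:
            findings.append((i, tok, best))
    return findings
```

Plugging in a real BERT scorer would replace only `score_fn`; the confusion-set filtering of candidates is unchanged.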
S107: the confusion dictionary of the universal text error detection model and the second labeling text of the field text error detection model are used as a database of the fine tuning model;
S108: and performing fine adjustment processing on the second annotation text by adopting a fine adjustment model according to the semantics of the second annotation text so as to obtain a final third annotation text.
It should be noted that the fine-tuning model extracts semantic information of the second labeling text using the confusion dictionary of the universal text error detection model and the second labeling text of the domain text error detection model, and performs further error detection according to the context semantics to obtain the final third labeling text.
For example, suppose the second labeling text produced by the universal and domain text error detection models reads "in order to escape the enemy's encirclement, his desire to win was very strong", which sounds correct when judged against the audio clip alone. The context-semantic fine-tuning model, however, can detect that "desire to win" (求胜欲) should be "desire to survive" (求生欲), two near-homophonous Chinese phrases.
In the embodiments of the present application, automatic error detection of audio data is realized through the universal text error detection model, the domain text error detection model and the fine-tuning model. The speed and accuracy of the universal text error detection model are fully utilized, and the domain category and context semantics are further taken into account, so that the influence of the annotator's fatigue and knowledge level on labeling quality is avoided, labeling quality is improved, and the accuracy and recognition performance of the speech recognition model are improved accordingly.
Example two
Referring to fig. 2, a flow chart of another audio annotation error detection method provided by an embodiment of the present application is shown. The method includes:
s201: audio data is acquired and is sliced into a plurality of audio clips.
S202: and labeling the audio fragment to obtain an initial labeling text.
S203: and (5) adopting a universal text error detection model to find out the position of the labeling error.
S204: a candidate list for replacing the error label is obtained from the confusion dictionary.
S205: and obtaining candidates from the candidate list to replace the error labels.
Optionally, the candidate with the highest priority in the candidate list is selected for replacement.
S206: and calculating the fluency and confusion degree of the replaced marked text by adopting an N-gram model.
It should be appreciated that the higher the fluency, the lower the confusion, the higher the accuracy of labeling the text. Conversely, the lower the fluency, the higher the confusion, the lower the accuracy of labeling the text.
S207: and determining the best target candidate item according to the fluency and the confusion degree so as to obtain the first labeling text.
It should be appreciated that the best target candidate may have the highest fluency and/or lowest confusion after replacement.
By comparing the fluency and the confusion, the accuracy of selecting the best target candidate can be improved.
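The fluency/confusion-degree comparison of S206–S207 can be illustrated with a minimal add-one-smoothed bigram model, where "confusion degree" corresponds to perplexity (lower is better). This toy scorer is an assumption standing in for whatever N-gram model an implementation actually uses; token lists and function names are likewise illustrative:

```python
import math
from collections import Counter

def train_ngram(corpus):
    """corpus: list of token lists. Returns (unigram, bigram) counts."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def perplexity(sent, unigrams, bigrams):
    """Add-one smoothed bigram perplexity; lower = more fluent."""
    toks = ["<s>"] + sent + ["</s>"]
    vocab = len(unigrams)
    log_p = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                for a, b in zip(toks, toks[1:]))
    return math.exp(-log_p / (len(toks) - 1))

def best_candidate(prefix, candidates, suffix, unigrams, bigrams):
    """Substitute each candidate at the error position and keep the one
    producing the most fluent (lowest-perplexity) sentence."""
    return min(candidates,
               key=lambda c: perplexity(prefix + [c] + suffix, unigrams, bigrams))
```

The same substitute-and-rescore pattern applies whether the scorer is a bigram model, a higher-order N-gram, or a neural language model.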
S208: after the modification and confirmation by a specific labeling person, the text of the labeling error and the frequency of the labeling error are recorded.
S209: and when the frequency is higher than the threshold value, adding the text marked with errors into a personal confusion dictionary of a specific marking person.
S210: and counting personal confusion dictionaries of a plurality of labeling personnel, and adding the text with the labeling error into the shared confusion dictionary when the occurrence frequency of the text with the labeling error is higher than a preset frequency.
In this embodiment, the confusion dictionary generally includes a personal confusion dictionary and a shared confusion dictionary.
The personal confusion dictionary and the shared confusion dictionary can achieve the combined effect of personalized error correction and commonality error sharing.
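The personal/shared dictionary logic of S208–S210 can be sketched as follows. The concrete thresholds and the data shapes are assumptions for illustration only, since the patent specifies just "a threshold" and "a preset number of times":

```python
from collections import Counter

PERSONAL_THRESHOLD = 3   # assumed: min occurrences of one error per annotator
SHARED_THRESHOLD = 2     # assumed: min number of personal dictionaries sharing the error

def build_confusion_dictionaries(corrections_by_annotator):
    """corrections_by_annotator maps annotator -> list of (wrong, right)
    pairs confirmed during review. Returns (personal_dicts, shared_dict)."""
    personal = {}
    for annotator, corrections in corrections_by_annotator.items():
        freq = Counter(corrections)
        # S209: frequent errors enter the annotator's personal dictionary
        personal[annotator] = {pair for pair, n in freq.items()
                               if n >= PERSONAL_THRESHOLD}
    # S210: errors shared by enough personal dictionaries become shared
    tally = Counter(pair for entries in personal.values() for pair in entries)
    shared = {pair for pair, n in tally.items() if n >= SHARED_THRESHOLD}
    return personal, shared
```

This yields exactly the split the text describes: personalized error correction per annotator, plus sharing of errors common across annotators.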
S211: and identifying the domain category of the first marked text by adopting a text classification model.
S212: and according to the domain category, adopting a domain text error detection model corresponding to the domain category to carry out error detection processing on the first annotation text so as to obtain a second annotation text.
S213: and generating error detection information under the condition that the first labeling text has errors.
Wherein the error detection information includes an audio clip index, an error location index, and a candidate word.
The specific position where the error occurs can be rapidly positioned through the audio fragment index, the error position index and the candidate words, and the error detection efficiency is improved.
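One possible shape for the error detection information of S213 (audio clip index, error position index, candidate words) is sketched below; the class and field names are assumptions, since the patent describes only the three pieces of information:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ErrorDetectionInfo:
    clip_index: int        # which audio clip the error belongs to
    error_position: int    # offset of the error inside the clip's labeling text
    candidates: List[str] = field(default_factory=list)  # suggested replacements

    def locate(self) -> str:
        """Human-readable pointer for the quality inspector."""
        return f"clip {self.clip_index}, offset {self.error_position}"
```

Carrying all three fields together is what lets an inspector jump straight to the audio clip and the exact character position, rather than re-listening to the whole recording.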
S214: and using the confusion dictionary of the universal text error detection model and the second labeling text of the field text error detection model as a database of the fine tuning model.
S215: and performing fine adjustment processing on the second annotation text by adopting a fine adjustment model according to the semantics of the second annotation text so as to obtain a final third annotation text.
In the embodiments of the present application, comparing fluency and confusion degree improves the accuracy of selecting the best target candidate, and the personal and shared confusion dictionaries combine personalized error correction with the sharing of common errors, further reducing the influence of the annotator's fatigue and knowledge level on labeling quality and improving labeling quality.
Example III
Referring to fig. 3, a schematic structural diagram of an audio labeling error detection device according to an embodiment of the present application is shown, where the error detection device 30 includes:
an obtaining module 301, configured to obtain audio data, and segment the audio data into a plurality of audio segments;
The labeling module 302 is configured to label the audio clip to obtain an initial labeling text;
The first error detection module 303 is configured to perform error detection processing on the initial labeling text by using a general text error detection model, so as to obtain a first labeling text;
A determining module 304, configured to determine a confusion dictionary of the generic text error detection model;
an identifying module 305, configured to identify a domain category of the first labeled text using a text classification model;
The second error detection module 306 is configured to perform error detection processing on the first labeling text by using a domain text error detection model corresponding to the domain category according to the domain category, so as to obtain a second labeling text;
A warehousing module 307, configured to use the confusion dictionary of the universal text error detection model and the second labeling text of the domain text error detection model as a database of the fine tuning model;
And the fine tuning module 308 is configured to perform fine tuning processing on the second labeling text by using a fine tuning model according to the semantics of the second labeling text, so as to obtain a final third labeling text.
Further, the confusion dictionary includes a personal confusion dictionary and a shared confusion dictionary, and the determining module 304 specifically includes:
the recording submodule 3041 is used for recording the text of the labeling error and the occurrence frequency of the labeling error after the modification and confirmation of a specific labeling person;
the personal dictionary submodule 3042 is used for adding the text with the wrong labeling into the personal confusion dictionary of the specific labeling person when the frequency is higher than the threshold value;
The shared dictionary submodule 3043 is used for counting personal confusion dictionaries of a plurality of labeling personnel, and adding the text with the labeling error into the shared confusion dictionary when the frequency of the text with the labeling error is higher than the preset frequency.
Further, the first error detection module 303 specifically includes:
A searching sub-module 3031, configured to search a location of the labeling error by using a universal text error detection model;
An obtaining sub-module 3032, configured to obtain a candidate list for replacing the error label from the confusion dictionary;
a replacing sub-module 3033, configured to obtain a candidate item from the candidate item list to replace the error label;
a calculating submodule 3034, configured to calculate fluency and confusion of the replaced labeled text by using the N-gram model;
a determining sub-module 3035, configured to determine, according to the fluency and the confusion degree, the best target candidate, so as to obtain the first labeling text.
Further, the error detection device 30 further includes:
A generating module 309, configured to generate error detection information when the first markup text has an error;
wherein the error detection information includes an audio clip index, an error location index, and a candidate word.
Further, the domain categories include: economic, educational, scientific, social, gaming, and recreational.
The error detection device 30 provided in the embodiment of the present application can implement each process implemented in the above method embodiment, and in order to avoid repetition, a description is omitted here.
In the embodiments of the present application, automatic error detection of audio data is realized through the universal text error detection model, the domain text error detection model and the fine-tuning model. The speed and accuracy of the universal text error detection model are fully utilized, and the domain category and context semantics are further taken into account, so that the influence of the annotator's fatigue and knowledge level on labeling quality is avoided, labeling quality is improved, and the accuracy and recognition performance of the speech recognition model are improved accordingly.
The virtual device in the embodiment of the application can be a device, and also can be a component, an integrated circuit or a chip in a terminal.
The foregoing is merely exemplary of the present invention and is not intended to limit the present invention. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are to be included in the scope of the claims of the present invention.

Claims (10)

1. An error detection method for audio annotation, comprising:
acquiring audio data and segmenting the audio data into a plurality of audio fragments;
labeling the audio fragments to obtain an initial labeling text;
performing error detection processing on the initial labeling text by adopting a universal text error detection model to obtain a first labeling text;
determining a confusion dictionary of the universal text error detection model;
identifying the domain category of the first labeling text by adopting a text classification model;
performing error detection processing on the first labeling text by adopting a domain text error detection model corresponding to the domain category to obtain a second labeling text;
taking the confusion dictionary of the universal text error detection model and the second labeling text of the domain text error detection model as a database of a fine-tuning model; and
performing fine-tuning processing on the second labeling text by adopting the fine-tuning model according to the semantics of the second labeling text to obtain a final third labeling text.
2. The error detection method of claim 1, wherein the confusion dictionary comprises a personal confusion dictionary and a shared confusion dictionary, and wherein determining the confusion dictionary of the universal text error detection model specifically comprises:
after a specific annotator modifies and confirms a labeling error, recording the mislabeled text and the frequency of occurrence of the labeling error;
when the frequency is higher than a threshold, adding the mislabeled text to the personal confusion dictionary of the specific annotator; and
aggregating the personal confusion dictionaries of a plurality of annotators, and adding the mislabeled text to the shared confusion dictionary when the number of times the mislabeled text appears across those dictionaries is higher than a preset number of times.
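The two-level dictionary in claim 2 can be sketched as follows. The thresholds, class name, and method names are illustrative assumptions, not values from the patent:

```python
from collections import Counter

# Hypothetical sketch of claim 2's confusion dictionaries: an annotator's
# confirmed corrections feed a personal dictionary once their frequency passes
# a threshold, and pairs present in enough personal dictionaries are promoted
# to the shared dictionary.
class ConfusionDictionaries:
    def __init__(self, personal_threshold=3, shared_threshold=2):
        self.personal_threshold = personal_threshold  # per-annotator frequency
        self.shared_threshold = shared_threshold      # number of annotators
        self.error_counts = {}  # annotator -> Counter of (wrong, right) pairs
        self.personal = {}      # annotator -> set of (wrong, right) pairs
        self.shared = set()

    def record_correction(self, annotator, wrong, right):
        counts = self.error_counts.setdefault(annotator, Counter())
        counts[(wrong, right)] += 1
        if counts[(wrong, right)] >= self.personal_threshold:
            self.personal.setdefault(annotator, set()).add((wrong, right))
            self._maybe_promote((wrong, right))

    def _maybe_promote(self, pair):
        # Promote to the shared dictionary once enough annotators hold the pair.
        holders = sum(1 for entries in self.personal.values() if pair in entries)
        if holders >= self.shared_threshold:
            self.shared.add(pair)
```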
3. The error detection method of claim 1, wherein performing error detection processing on the initial labeling text by adopting the universal text error detection model to obtain the first labeling text specifically comprises:
searching for the position of a labeling error by adopting the universal text error detection model;
obtaining, from the confusion dictionary, a candidate list for replacing the labeling error;
obtaining candidates from the candidate list to replace the labeling error;
calculating the fluency and perplexity of the replaced labeling text by adopting an N-gram model; and
determining the best target candidate according to the fluency and the perplexity to obtain the first labeling text.
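The N-gram scoring step in claim 3 can be sketched with a toy bigram model: each candidate from the confusion dictionary is substituted at the error position, and the candidate whose sentence has the lowest perplexity wins. The add-one smoothing and all names here are illustrative assumptions, not the patent's actual model:

```python
import math
from collections import Counter

# Minimal bigram language model with add-one smoothing for candidate ranking.
class BigramScorer:
    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in corpus:
            tokens = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab = len(self.unigrams)

    def perplexity(self, sent):
        tokens = ["<s>"] + sent + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            # Add-one smoothing keeps unseen bigrams from zeroing the product.
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(tokens) - 1))

def best_candidate(scorer, tokens, error_pos, candidates):
    # Substitute each candidate at the error position; lowest perplexity wins.
    def score(cand):
        trial = tokens[:error_pos] + [cand] + tokens[error_pos + 1:]
        return scorer.perplexity(trial)
    return min(candidates, key=score)
```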
4. The error detection method of claim 1, wherein performing error detection processing on the first labeling text by adopting the domain text error detection model corresponding to the domain category to obtain the second labeling text further comprises:
generating error detection information when the first labeling text contains an error;
wherein the error detection information comprises an audio fragment index, an error position index, and candidate words.
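The error detection information in claim 4 maps naturally onto a small record type. This structure and its field names are a hypothetical illustration of the three listed elements:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical container for claim 4's error detection information.
@dataclass
class ErrorDetectionInfo:
    fragment_index: int   # which audio fragment the labeling text belongs to
    error_position: int   # offset of the detected error within the text
    candidate_words: List[str] = field(default_factory=list)  # suggested fixes
```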
5. The error detection method of claim 1, wherein the domain categories include: economics, education, science, society, gaming, and entertainment.
6. An error detection apparatus for audio annotation, comprising:
an acquisition module, configured to acquire audio data and segment the audio data into a plurality of audio fragments;
a labeling module, configured to label the audio fragments to obtain an initial labeling text;
a first error detection module, configured to perform error detection processing on the initial labeling text by adopting a universal text error detection model to obtain a first labeling text;
a determining module, configured to determine a confusion dictionary of the universal text error detection model;
an identification module, configured to identify the domain category of the first labeling text by adopting a text classification model;
a second error detection module, configured to perform error detection processing on the first labeling text by adopting a domain text error detection model corresponding to the domain category to obtain a second labeling text;
a warehousing module, configured to take the confusion dictionary of the universal text error detection model and the second labeling text of the domain text error detection model as a database of a fine-tuning model; and
a fine-tuning module, configured to perform fine-tuning processing on the second labeling text by adopting the fine-tuning model according to the semantics of the second labeling text to obtain a final third labeling text.
7. The error detection apparatus of claim 6, wherein the confusion dictionary comprises a personal confusion dictionary and a shared confusion dictionary, and wherein the determining module specifically comprises:
a recording sub-module, configured to record the mislabeled text and the frequency of occurrence of the labeling error after a specific annotator modifies and confirms the labeling error;
a personal dictionary sub-module, configured to add the mislabeled text to the personal confusion dictionary of the specific annotator when the frequency is higher than a threshold; and
a shared dictionary sub-module, configured to aggregate the personal confusion dictionaries of a plurality of annotators, and add the mislabeled text to the shared confusion dictionary when the number of times the mislabeled text appears across those dictionaries is higher than a preset number of times.
8. The error detection apparatus of claim 6, wherein the first error detection module specifically comprises:
a searching sub-module, configured to search for the position of a labeling error by adopting the universal text error detection model;
an obtaining sub-module, configured to obtain, from the confusion dictionary, a candidate list for replacing the labeling error;
a replacing sub-module, configured to obtain candidates from the candidate list to replace the labeling error;
a calculating sub-module, configured to calculate the fluency and perplexity of the replaced labeling text by adopting an N-gram model; and
a determining sub-module, configured to determine the best target candidate according to the fluency and the perplexity to obtain the first labeling text.
9. The error detection apparatus of claim 6, further comprising:
a generation module, configured to generate error detection information when the first labeling text contains an error;
wherein the error detection information comprises an audio fragment index, an error position index, and candidate words.
10. The error detection apparatus of claim 6, wherein the domain categories include: economics, education, science, society, gaming, and entertainment.
CN202011263694.XA 2020-11-12 2020-11-12 Audio annotation error detection method and device Active CN112417850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011263694.XA CN112417850B (en) 2020-11-12 2020-11-12 Audio annotation error detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011263694.XA CN112417850B (en) 2020-11-12 2020-11-12 Audio annotation error detection method and device

Publications (2)

Publication Number Publication Date
CN112417850A CN112417850A (en) 2021-02-26
CN112417850B true CN112417850B (en) 2024-07-02

Family

ID=74831047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011263694.XA Active CN112417850B (en) 2020-11-12 2020-11-12 Audio annotation error detection method and device

Country Status (1)

Country Link
CN (1) CN112417850B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053393B (en) * 2021-03-30 2024-04-30 闽江学院 Audio annotation processing device
CN113421553B (en) * 2021-06-15 2023-10-20 北京捷通数智科技有限公司 Audio selection method, device, electronic equipment and readable storage medium
CN114896965B (en) * 2022-05-17 2023-09-12 马上消费金融股份有限公司 Text correction model training method and device, text correction method and device
CN115146622B (en) * 2022-07-21 2023-05-05 平安科技(深圳)有限公司 Data annotation error correction method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390928A (en) * 2019-08-07 2019-10-29 广州多益网络股份有限公司 Speech synthesis model training method and system for automatically augmenting a corpus
CN110532522A (en) * 2019-08-22 2019-12-03 深圳追一科技有限公司 Error detection method and apparatus for audio annotation, computer equipment, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017151757A1 (en) * 2016-03-01 2017-09-08 The United States Of America, As Represented By The Secretary, Department Of Health And Human Services Recurrent neural feedback model for automated image annotation
CN110786847B (en) * 2018-08-02 2022-11-04 深圳市理邦精密仪器股份有限公司 Electrocardiogram signal library building method and analysis method
CN110968695A (en) * 2019-11-18 2020-04-07 罗彤 Intelligent labeling method, device and platform based on active learning of weak supervision technology
CN111476783B (en) * 2020-04-13 2022-11-15 腾讯科技(深圳)有限公司 Image processing method, device and equipment based on artificial intelligence and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390928A (en) * 2019-08-07 2019-10-29 广州多益网络股份有限公司 Speech synthesis model training method and system for automatically augmenting a corpus
CN110532522A (en) * 2019-08-22 2019-12-03 深圳追一科技有限公司 Error detection method and apparatus for audio annotation, computer equipment, and storage medium

Also Published As

Publication number Publication date
CN112417850A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112417850B (en) Audio annotation error detection method and device
CN107291783B (en) Semantic matching method and intelligent equipment
CN108959242B (en) Target entity identification method and device based on part-of-speech characteristics of Chinese characters
CN110046350B (en) Grammar error recognition method, device, computer equipment and storage medium
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN109800414B (en) Method and system for recommending language correction
CN111160031A (en) Social media named entity identification method based on affix perception
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN110119510B (en) Relationship extraction method and device based on transfer dependency relationship and structure auxiliary word
CN112883732A (en) Method and device for identifying Chinese fine-grained named entities based on associative memory network
US20230069935A1 (en) Dialog system answering method based on sentence paraphrase recognition
Sitaram et al. Speech synthesis of code-mixed text
CN113380223B (en) Method, device, system and storage medium for disambiguating polyphone
CN112818680B (en) Corpus processing method and device, electronic equipment and computer readable storage medium
CN113312914B (en) Security event entity identification method based on pre-training model
CN109614623B (en) Composition processing method and system based on syntactic analysis
CN112101032A (en) Named entity identification and error correction method based on self-distillation
Nguyen et al. Domain-shift conditioning using adaptable filtering via hierarchical embeddings for robust Chinese spell check
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN109086274A (en) English social media short text time expression recognition method based on restricted model
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
CN112183060B (en) Reference resolution method of multi-round dialogue system
CN111274354B (en) Referee document structuring method and referee document structuring device
US20120197894A1 (en) Apparatus and method for processing documents to extract expressions and descriptions
CN112071304B (en) Semantic analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 411, 4th floor, building 4, No.44, Middle North Third Ring Road, Haidian District, Beijing 100088

Applicant after: Beijing Qingshu Intelligent Technology Co.,Ltd.

Address before: 100044 1415, 14th floor, building 1, yard 59, gaoliangqiaoxie street, Haidian District, Beijing

Applicant before: BEIJING AISHU WISDOM TECHNOLOGY CO.,LTD.

GR01 Patent grant