CN112700763A

CN112700763A - Voice annotation quality evaluation method, device, equipment and storage medium

Info

Publication number: CN112700763A
Application number: CN202011570121.1A
Authority: CN
Inventors: 喻涛; 吴思远; 熊世富
Original assignee: iFlytek Co Ltd
Current assignee: University of Science and Technology of China USTC; iFlytek Co Ltd
Priority date: 2020-12-26
Filing date: 2020-12-26
Publication date: 2021-04-23
Anticipated expiration: 2040-12-26
Also published as: CN112700763B

Abstract

The application provides a method, a device, equipment and a storage medium for evaluating voice labeling quality. The method comprises the following steps: acquiring a voice recognition result to be labeled corresponding to target voice, wherein the voice recognition result to be labeled is obtained by replacing a text segment of the voice recognition result of the target voice, and the text segment after replacement is an erroneous text segment relative to the target voice; acquiring a labeling result obtained by a labeling object performing text labeling processing on the voice recognition result to be labeled, wherein the text labeling processing is the processing of labeling text with recognition errors; and determining the quality with which the labeling object labels the target voice according to the voice recognition result to be labeled and the labeling result. The process realizes automatic evaluation of the voice labeling quality of the labeling object, realizes supervision of the voice labeling work of the labeling object, and helps improve the voice labeling quality of the labeling object.

Description

Voice annotation quality evaluation method, device, equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method, a device, equipment, and a storage medium for evaluating voice annotation quality.

Background

Supervised training is a common training mode in speech recognition model training, and requires a large amount of speech data with text labels as training samples. The conventional supervised training sample acquisition method is to perform text labeling on voice data through a human or machine to obtain a training sample.

The labeling quality of the voice data directly determines the accuracy of its text labels and, in turn, the training effect of the model. Therefore, the labeling quality of the voice data needs to be evaluated, so as to supervise the labeling work of the labeling object.

Disclosure of Invention

Based on the above requirements, the present application provides a method, a device, equipment and a storage medium for evaluating voice annotation quality, which can be used for automatically evaluating the voice annotation quality of an annotation object.

The technical scheme provided by the application is as follows:

a voice labeling quality evaluation method comprises the following steps:

acquiring a voice recognition result to be labeled corresponding to target voice, wherein the voice recognition result to be labeled is obtained by replacing a text segment of the voice recognition result of the target voice, and the text segment after replacement is an erroneous text segment relative to the target voice;

acquiring a labeling result obtained by performing text labeling processing on the voice recognition result to be labeled by a labeling object, wherein the text labeling processing is the processing of labeling the text with recognition errors;

and determining the labeling quality of the target voice by the labeling object according to the voice recognition result to be labeled and the labeling result.

Optionally, the obtaining a speech recognition result to be labeled corresponding to the target speech includes:

determining a text to be replaced from a voice recognition result of the target voice;

selecting a text matched with the text to be replaced from a preset text library as a target text;

and replacing the text to be replaced in the voice recognition result with the target text to obtain the voice recognition result to be labeled.

Optionally, the determining a text to be replaced from the speech recognition result of the target speech includes:

performing word segmentation processing on the voice recognition result of the target voice, and determining each word segment contained in the voice recognition result;

and selecting a word segment from the word segments as the text to be replaced at least according to the recognition information of each word segment, wherein the recognition information comprises at least one of confidence information, position information and part-of-speech information.

Optionally, the selecting a word segment from the word segments as the text to be replaced at least according to the recognition information of each word segment includes:

respectively inputting the recognition information of each word segment into a classification model trained in advance, and determining a classification result of each word segment, wherein the classification result indicates whether the word segment can be replaced;

the classification model is obtained by training with at least the recognition information of words as training samples and with whether each word can be replaced as the sample label;

and selecting a word segment from the word segments as the text to be replaced based on the classification result of each word segment.

Optionally, the selecting a word segment from the word segments as the text to be replaced based on the classification result of each word segment includes:

if replaceable word segments exist among the word segments contained in the voice recognition result, selecting at least one word segment from the replaceable word segments as the text to be replaced;

and if no replaceable word segment exists among the word segments contained in the voice recognition result, determining the text at a set position in the voice recognition result as the text to be replaced.

Optionally, the selecting a text matched with the text to be replaced from a preset text library as a target text includes:

screening texts with the same types as the texts to be replaced from a preset text library to serve as candidate texts;

selecting a target candidate text from the candidate texts as a target text;

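wherein the difference between the speech corresponding to the target candidate text and the speech corresponding to the text to be replaced is greater than a set difference threshold, and/or the text obtained by replacing the text to be replaced in the voice recognition result with the target candidate text has the same grammatical structure as the voice recognition result.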

Optionally, the text labeling process is a process of marking a text with a recognition error, or a process of modifying a text with a recognition error.

Optionally, when the text labeling processing is processing for modifying a text with a recognition error, determining the labeling quality of the target speech by the labeling object according to the speech recognition result to be labeled and the labeling result, including:

at least determining the modification rate of the labeling object for the target text in the voice recognition result to be labeled by comparing the voice recognition result to be labeled with the labeling result; wherein the target text is the text segment substituted into the voice recognition result when the text segment of the voice recognition result of the target voice is replaced;

and determining the labeling quality of the target voice labeling by the labeling object at least according to the modification rate of the labeling object to the target text in the voice recognition result to be labeled.

Optionally, the determining, by comparing the speech recognition result to be labeled with the labeling result, a modification rate of the target text in the speech recognition result to be labeled by the labeling object at least includes:

determining the modification rate and the modification accuracy of the labeling object for the target text in the voice recognition result to be labeled by comparing the voice recognition result to be labeled with the labeling result; wherein, when a target text in the voice recognition result to be labeled is modified into the original text that the target text replaced, the target text is determined to be correctly modified;

the determining, at least according to the modification rate of the target text in the speech recognition result to be labeled by the labeling object, the labeling quality of the target speech by the labeling object for speech labeling includes:

and determining the labeling quality of the target voice labeling by the labeling object according to the modification rate and the modification accuracy of the labeling object to the target text in the voice recognition result to be labeled.

Optionally, the method further includes:

determining the target text that is not modified by the labeling object in the voice recognition result to be labeled by comparing the voice recognition result to be labeled with the labeling result;

determining the modification rate of an inspection object for the target text that is not modified by the labeling object in the voice recognition result to be labeled;

and determining the labeling quality of the inspection object for the target voice based on the modification rate of the inspection object for the target text that is not modified by the labeling object in the voice recognition result to be labeled.

A speech annotation quality evaluation device includes:

a text acquisition unit, configured to acquire a voice recognition result to be labeled corresponding to target voice, where the voice recognition result to be labeled is obtained by replacing a text segment of the voice recognition result of the target voice, and the text segment after replacement is an erroneous text segment relative to the target voice;

a labeling result acquiring unit, configured to acquire a labeling result obtained by performing text labeling processing on the speech recognition result to be labeled by a labeling object, where the text labeling processing is processing of labeling a text with a recognition error;

and the quality evaluation unit is used for determining the labeling quality of the target voice by the labeling object according to the voice recognition result to be labeled and the labeling result.

Speech annotation quality evaluation equipment, comprising:

a memory and a processor;

wherein the memory is connected with the processor and used for storing programs;

the processor is used for realizing the voice annotation quality evaluation method by operating the program in the memory.

A storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the above-mentioned voice annotation quality evaluation method.

The voice labeling quality evaluation method provided by the embodiment of the application can automatically acquire the voice recognition result to be labeled corresponding to the target voice and acquire the labeling result obtained by the labeling object performing text labeling processing on the voice recognition result to be labeled. Then, according to the voice recognition result to be labeled and the labeling result, the quality with which the labeling object labels the target voice is determined. The process realizes automatic evaluation of the voice labeling quality of the labeling object, realizes supervision of the voice labeling work of the labeling object, and helps improve the voice labeling quality of the labeling object.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flowchart of a method for evaluating speech annotation quality according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a speech annotation quality evaluation device according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of speech annotation quality evaluation equipment according to an embodiment of the present application.

Detailed Description

The technical scheme of the embodiment of the application is applied to scenarios in which the quality of the text labeling performed on speech data by a labeling object is evaluated. By adopting the technical scheme of the embodiment of the application, the voice labeling quality of the labeling object can be objectively evaluated, so that the labeling object can be supervised.

In order to improve the efficiency of voice labeling, an automatic voice labeling system usually first performs text labeling on the voice data to obtain a voice labeling result, and the labeling object then performs labeling processing such as modification and verification on that result. Specifically, the labeling object either modifies erroneous text in the voice labeling result into the correct text or marks the text that was recognized incorrectly, thereby obtaining the text labeling result corresponding to the voice data. Therefore, evaluating the voice labeling quality of the labeling object in the embodiment of the present application can also be understood as evaluating the quality with which the labeling object labels the text labeling result of the voice data.

The technical scheme provided by the embodiment of the application can be exemplarily applied to hardware processing devices such as a processor or software processing programs, so that the automatic evaluation of the voice labeling quality of the labeled object is realized.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, a method for evaluating speech annotation quality provided in the embodiment of the present application includes:

S101, obtaining a voice recognition result to be labeled corresponding to the target voice.

The target speech mentioned above refers to speech data used as a training sample of the speech recognition model, and may be speech data of any type, any language, and any duration.

The voice recognition result to be labeled is obtained by replacing a text segment of the voice recognition result of the target voice.

The voice recognition result of the target voice is a recognition result in text form obtained by performing voice recognition on the target voice; it may be obtained by performing voice recognition on the target voice directly, or may be read from pre-stored data.

Replacing a text segment of the voice recognition result of the target voice means replacing one or more text segments in the voice recognition result with other text segments; the text obtained after replacement serves as the voice recognition result to be labeled.

To ensure that the evaluation of the labeling quality of the labeling object is sound, when a text segment of the voice recognition result of the target voice is replaced, the text segment after replacement must be an erroneous text segment relative to the target voice, so that the voice recognition result to be labeled obtained by the replacement is an erroneous voice recognition text relative to the target voice.

Illustratively, a text segment to be replaced is selected from the voice recognition result of the target voice, and that text segment is then replaced by a text segment whose semantics differ from its own, so that the voice recognition result to be labeled is obtained.

S102, obtaining a labeling result obtained by performing text labeling processing on the voice recognition result to be labeled by the labeling object.

The above-mentioned labeling object refers to the object that carries out voice labeling, and may specifically be a human voice annotator, or a voice labeling machine, algorithm, program, and the like.

In general, the labeling object performs text labeling processing on the voice labeling result produced by an automatic voice labeling system to obtain a final voice labeling result. For example, erroneously labeled text in the voice labeling result is modified into the correct text, and the modified result is used as the labeling result for the voice data.

Therefore, after step S101 is executed to obtain the speech recognition result to be labeled corresponding to the target speech, the labeling object performs text labeling processing on it, and the labeling result produced by that processing is acquired.

When performing text labeling processing on the speech recognition result to be labeled, the labeling object identifies the text with recognition errors in that result and performs labeling processing on the identified text, for example, marking the text with a recognition error or modifying it into the correct text.

The erroneously labeled text and the text with recognition errors mentioned above both refer to text that does not correspond to the corresponding speech segment in the target speech.

The labeling object can check whether each text segment in the voice recognition result to be labeled matches the voice content of the corresponding voice segment, and can thereby identify the erroneously labeled text or the text with recognition errors in the voice recognition result to be labeled.

S103, determining the labeling quality of the target voice by the labeling object according to the voice recognition result to be labeled and the labeling result.

For the execution subject of the technical solution in the embodiment of the present application, the information about the replaced text in the speech recognition result to be labeled is known, for example, which position's text segment was replaced and what the actual text was before the replacement; for the labeling object, this information is unknown. Therefore, whether the labeling object listened to the target speech carefully and labeled it conscientiously can be judged by examining whether the labeling object accurately identifies the replaced text in the speech recognition result to be labeled and performs labeling processing on it. If the labeling object accurately identifies the replaced text and performs labeling processing on it, the labeling object can be considered to have listened to the target speech carefully and labeled it conscientiously, and the quality of its voice labeling of the target speech is considered high; otherwise, the quality is considered low.

Illustratively, by comparing the speech recognition result to be labeled with the labeling result of the labeling object, it can be determined whether the labeling object listened to the target speech carefully and labeled it conscientiously, and the labeling quality of the labeling object for the target speech can then be determined.
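The comparison in S103 can be illustrated with a minimal sketch for the case where the labeling processing is modification. It assumes that both results are word-segment lists of equal length and that the execution subject has recorded each replacement; the `Replacement` record, the function name, and the choice to compute accuracy over the modified target texts only are illustrative assumptions, not details given in the application.

```python
from dataclasses import dataclass


@dataclass
class Replacement:
    """Hypothetical record kept by the execution subject for one replacement."""
    index: int      # position of the replaced segment in the result
    original: str   # text before replacement (the text to be replaced)
    target: str     # erroneous text substituted in (the target text)


def modification_stats(to_label, labeled, replacements):
    """Return (modification_rate, modification_accuracy).

    Assumption: both results are aligned segment lists of equal length.
    """
    if len(to_label) != len(labeled):
        raise ValueError("this sketch assumes aligned segment lists")
    # A target text counts as modified if the labeling object changed it at all.
    modified = [r for r in replacements if labeled[r.index] != r.target]
    # It counts as correctly modified if it was restored to the original text.
    correct = [r for r in modified if labeled[r.index] == r.original]
    rate = len(modified) / len(replacements) if replacements else 0.0
    accuracy = len(correct) / len(modified) if modified else 0.0
    return rate, accuracy
```

A higher modification rate and accuracy would indicate more careful labeling; how the two values are combined into a single quality score is left open here.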

As can be seen from the above description, the voice annotation quality evaluation method provided in the embodiment of the present application can automatically obtain the voice recognition result to be labeled corresponding to the target voice, and obtain the labeling result obtained by the labeling object performing text labeling processing on the voice recognition result to be labeled. Then, according to the voice recognition result to be labeled and the labeling result, the quality with which the labeling object labels the target voice is determined. The process realizes automatic evaluation of the voice labeling quality of the labeling object, realizes supervision of the voice labeling work of the labeling object, and helps improve the voice labeling quality of the labeling object.

For example, the embodiment of the present application obtains the speech recognition result to be labeled corresponding to the target speech by performing the following steps S1 to S3:

S1, determining the text to be replaced from the voice recognition result of the target voice.

The text to be replaced refers to a text segment to be replaced when the text segment is replaced for the voice recognition result of the target voice, and the text to be replaced may be single or multiple characters, word segments, word groups, phrases and the like.

As a preferred implementation manner, in the embodiment of the present application, a single character, a word segmentation, a phrase, and the like are arbitrarily selected from a speech recognition result of a target speech to serve as a text to be replaced.

For example, the text of a specific sentence component may be selected from the speech recognition result of the target speech as the text to be replaced, for example, a subject, a predicate, an object, and the like are selected.

Alternatively, the embodiment of the present application determines the text to be replaced from the speech recognition result of the target speech according to the following processing in steps S11 to S12:

S11, performing word segmentation processing on the voice recognition result of the target voice, and determining each word segment contained in the voice recognition result.

Illustratively, through a text word segmentation tool, word segmentation processing on a speech recognition result of a target speech can be realized, and each word segmentation included in the speech recognition result is obtained.

S12, selecting a word segment from the word segments as the text to be replaced at least according to the recognition information of each word segment, wherein the recognition information comprises at least one of confidence information, position information and part-of-speech information.

After the speech recognition result of the target speech is segmented, multiple word segments are obtained. Each word segment is then analyzed in order to determine a word segment selection strategy.

For example, if the selected word segment happens to be one that was recognized incorrectly, using it as the text to be replaced works better, because this avoids the extra modification time that the labeling object would incur if a correctly recognized word segment were selected and replaced. In view of this, a word segment with lower confidence may be selected as the text to be replaced. Low confidence often indicates that the speech corresponding to the word segment is difficult to recognize; such speech is difficult for existing speech recognition systems as well, and the labeling object should pay particular attention to it when labeling.

In addition, other information, such as the position information and the part-of-speech information of a word segment, also helps in choosing which word segment to use as a replacement point. For example, for some modal particles and for pronouns such as "he", "it" and "she", which word should be written is inherently ambiguous. The subsequent evaluation of labeling quality depends on the modification rate of the target text, so the modification rate must accurately reflect the labeling quality of the labeling object: if the labeling object modifies the replaced words, its labeling quality is good, and if it does not, the system can conclude that its labeling quality is poor. If an ambiguous word is left unmodified, however, the system cannot draw that conclusion, so such words are not reasonable replacement points.

Moreover, words like the modal particles and pronouns mentioned above typically occur at the end or at the beginning of a sentence. Therefore, the position of a word segment in the speech recognition result can also be used as a basis for determining the text to be replaced from the word segments.

Based on the above analysis, in the embodiment of the present application, at least one of the confidence information, position information and part-of-speech information of a word segment is used as the recognition information of the word segment, and a word segment is selected from the word segments as the text to be replaced according to the recognition information of each word segment contained in the speech recognition result of the target speech.

For example, based on the recognition information of each word segment contained in the speech recognition result, a word segment whose confidence is lower than a set threshold, and/or that is located in the middle of the speech recognition result, and/or that is neither a modal particle nor a pronoun, is selected as the text to be replaced.
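The rule-based selection described above can be sketched as follows. The sketch applies all three criteria together, although the text allows any combination of them; the `Segment` type, the part-of-speech tag names and the threshold value 0.6 are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    confidence: float  # recognizer confidence, assumed to lie in [0, 1]
    pos: str           # part-of-speech tag (tag names here are hypothetical)


# Parts of speech treated as unsuitable replacement points.
EXCLUDED_POS = {"particle", "pronoun"}


def select_by_rules(segments, threshold=0.6):
    """Return indices of segments that satisfy all three criteria."""
    last = len(segments) - 1
    return [
        i
        for i, seg in enumerate(segments)
        if seg.confidence < threshold     # low recognition confidence
        and 0 < i < last                  # not sentence-initial or sentence-final
        and seg.pos not in EXCLUDED_POS   # neither modal particle nor pronoun
    ]
```

With this conjunction, a low-confidence pronoun at the start of the sentence is skipped, while a low-confidence verb in the middle is kept as a replacement point.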

As a preferred implementation manner, in the embodiment of the present application, a classification model is trained in advance, and is used for performing a binary classification process on a segmented word, and determining whether the segmented word can be replaced.

The training process of the classification model is as follows:

Firstly, a large batch of audio data with real text labels is acquired, and voice recognition is performed on the audio data by a voice recognition system to obtain recognition text.

Word segmentation processing is performed on the recognition text of the audio data, and the recognition text is compared with the real text labels to determine which word segments in the recognition text were recognized incorrectly and which were recognized correctly.

Then, based on the correctness of each word segment in the recognition text and on recognition information such as the position and part of speech of each word segment, training data is established in which the recognition information of a word segment is the training sample and the classification result of the word segment (whether it can be replaced) is the sample label.

If a word segment is not a modal particle or a pronoun, its classification result is that it can be replaced; otherwise, its classification result is that it cannot be replaced.

After the above processing, a neural network model is trained with the resulting training data to obtain the classification model, which can determine whether a word segment can be replaced based on the recognition information of the word segment.

Then, after the recognition information of each word segment contained in the speech recognition result of the target speech is obtained, the recognition information of each word segment is input into the classification model to determine the classification result of each word segment, that is, whether each word segment can be replaced.

Based on the classification result of each word segment, a word segment can be selected from the word segments as the text to be replaced. For example, if after the above classification processing there are replaceable word segments among the word segments contained in the speech recognition result, one or more of them are selected as the text to be replaced; for instance, the first of all the replaceable word segments is selected.

If none of the word segments contained in the speech recognition result can be replaced, the text at a set position in the speech recognition result is determined as the text to be replaced. Preferably, in the embodiment of the present application, the word segment at the middle position of the speech recognition result is determined as the text to be replaced; alternatively, a word segment at another position that is neither a modal particle nor a pronoun is determined as the text to be replaced.
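The selection logic with its fallback can be sketched as follows. The trained classification model is represented by a caller-supplied predicate `is_replaceable`, which is a hypothetical stand-in rather than an interface defined in the application; the sketch picks the first replaceable word segment and otherwise falls back to the middle position.

```python
def choose_text_to_replace(segments, is_replaceable):
    """Return the index of the word segment to use as the text to be replaced.

    `is_replaceable` stands in for the trained classification model: it takes
    one segment and returns True if the segment can be replaced.
    """
    if not segments:
        raise ValueError("the recognition result contains no word segments")
    replaceable = [i for i, seg in enumerate(segments) if is_replaceable(seg)]
    if replaceable:
        return replaceable[0]      # e.g. the first replaceable word segment
    return len(segments) // 2      # fallback: the middle position
```

Passing the model as a predicate keeps the sketch independent of any particular neural network library.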

S2, selecting the text matched with the text to be replaced from a preset text library as the target text.

The text library is a library whose content and style are close to those of the spoken text of the target voice, and it covers a wide range of text types. To make it easy to select text that matches the text to be replaced, the texts in the library are all stored in word-segmented form.

Then, in the embodiment of the present application, a text of the same type as the text to be replaced is screened from the preset text library to serve as a candidate text.

Illustratively, in the embodiment of the present application, a word clustering model is constructed in advance and used to cluster the words in the preset text library, so that the word segments contained in the library are grouped into different word segment categories.

Then, when a text matching the text to be replaced is selected from the text library, a text of the same type as the text to be replaced is selected from the text library as a candidate text. And then, selecting a text from the selected candidate texts as a target text for replacing the text to be replaced.
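Candidate screening by type can be sketched as follows, assuming the word clustering model has already produced a mapping from each word in the library to a category identifier. The mapping `cluster_of` and the function name are illustrative assumptions; the application does not specify how the clustering output is stored.

```python
def candidates_of_same_type(text_to_replace, cluster_of, library):
    """Return library words in the same cluster as the text to be replaced.

    `cluster_of` maps a word to its category id, as produced by some
    word clustering model (assumed to exist; not implemented here).
    """
    category = cluster_of.get(text_to_replace)
    if category is None:
        return []  # unknown word: no type information, hence no candidates
    return [
        word
        for word in library
        if word != text_to_replace and cluster_of.get(word) == category
    ]
```

The text to be replaced is excluded from its own candidates, since substituting it for itself would not introduce an error.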

For example, after the text to be replaced in the speech recognition result of the target speech is replaced with other, obviously erroneous text, a labeling object that listens to the target speech carefully will find that the speech recognition result is obviously incorrect and will correct the erroneous text in it.

At the same time, the replaced text in the speech recognition result should be difficult for the labeling object to notice without listening to the target speech. This prevents the labeling object from carefully labeling only the places in the speech recognition result that look abnormal while ignoring the rest. The target text must therefore be chosen so that the whole sentence still reads smoothly after the replacement, and so that its acoustic pronunciation differs greatly from the pronunciation in the audio corresponding to the text to be replaced. The labeling quality can then be evaluated by calculating the degree to which the labeling object modifies these texts.

Therefore, from the selected candidate texts, a target candidate text that is similar at the text level but differs greatly in acoustic pronunciation (by more than a set difference threshold) can be selected as the target text, such that after the text to be replaced in the speech recognition result is replaced by the target candidate text, the resulting text has the same grammatical structure as the original speech recognition result.

Exemplarily, after the candidate texts are determined, each candidate text is used to replace the text to be replaced in the speech recognition result, and a preset language model is used to calculate the perplexity (ppl) of the whole text after the replacement. At the same time, the acoustic similarity between the candidate text and the original text to be replaced is calculated at the character level or the tone level. This scheme uses character-level acoustic similarity: the character strings are compared, and the proportion of characters with the same pronunciation to the total number of characters of the original word is taken as the acoustic similarity.

An optimal candidate text is then determined as the replacement for the text to be replaced, namely the target text, by jointly considering the ppl values of all the candidate texts and their character-level acoustic similarity. More specifically, among the candidate texts whose acoustic similarity is less than 20%, the one with the smallest ppl may be determined as the target text.
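The selection rule described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the function names, the position-by-position character comparison, and the `pronounce` and `perplexity` callables are assumptions introduced here. The 20% similarity bound and the "smallest ppl" criterion come from the description.

```python
import math
from typing import Callable, Optional, Sequence


def acoustic_similarity(candidate: str, original: str,
                        pronounce: Optional[Callable[[str], str]] = None) -> float:
    """Proportion of characters of `original` whose pronunciation equals that
    of the character at the same position in `candidate`.

    `pronounce` maps one character to its pronunciation (hypothetical helper,
    e.g. a pinyin lookup); by default a character is its own pronunciation.
    Comparing position by position is an assumption of this sketch.
    """
    if not original:
        return 0.0
    if pronounce is None:
        pronounce = lambda ch: ch
    same = sum(1 for c, o in zip(candidate, original)
               if pronounce(c) == pronounce(o))
    return same / len(original)


def select_target_text(tokens: Sequence[str], index: int,
                       candidates: Sequence[str],
                       perplexity: Callable[[Sequence[str]], float],
                       pronounce: Optional[Callable[[str], str]] = None,
                       max_similarity: float = 0.2) -> Optional[str]:
    """Return the candidate with the smallest ppl among those whose acoustic
    similarity to tokens[index] is below `max_similarity`, or None."""
    original = tokens[index]
    best, best_ppl = None, math.inf
    for cand in candidates:
        if cand == original:
            continue  # the original word is not a usable replacement
        if acoustic_similarity(cand, original, pronounce) >= max_similarity:
            continue  # sounds too much like the original word
        replaced = list(tokens[:index]) + [cand] + list(tokens[index + 1:])
        ppl = perplexity(replaced)  # ppl of the whole text after replacement
        if ppl < best_ppl:
            best, best_ppl = cand, ppl
    return best
```

Here `perplexity` would be backed by the preset language model. Passing it in as a callable keeps the sketch independent of any particular model.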

The language model may use a modeling scheme such as n-gram or RNN. This scheme uses a traditional n-gram language model of order 3. A well-performing language model is trained on a collected text database: the corpus is cleaned and segmented into words, and n-gram statistics are then computed to obtain the final language model.

Optionally, when the target candidate text is selected from the candidate texts, only one of the two conditions may be applied: either only a text whose acoustic pronunciation differs substantially from that of the text to be replaced is selected as the target candidate text, or it is only ensured that the text obtained by replacing the text to be replaced in the speech recognition result of the target speech with the target candidate text has the same grammatical structure as the speech recognition result.
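As a rough illustration of an order-3 n-gram model and of the ppl it yields, consider the minimal sketch below. It assumes that the corpus has already been cleaned and segmented into word lists, and it uses add-one smoothing, which the description does not specify. The class name `TrigramLM` and the padding symbols are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


class TrigramLM:
    """Toy trigram language model with add-one smoothing (an assumption)."""

    def __init__(self, sentences: Iterable[Sequence[str]]):
        self.tri = Counter()   # counts of (w1, w2, w3)
        self.ctx = Counter()   # counts of the history (w1, w2)
        self.vocab = {"</s>"}
        for tokens in sentences:
            padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
            self.vocab.update(tokens)
            for i in range(2, len(padded)):
                self.ctx[tuple(padded[i - 2:i])] += 1
                self.tri[tuple(padded[i - 2:i + 1])] += 1

    def perplexity(self, tokens: Sequence[str]) -> float:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen words
        log_prob = 0.0
        for i in range(2, len(padded)):
            history = tuple(padded[i - 2:i])
            numerator = self.tri[history + (padded[i],)] + 1
            denominator = self.ctx[history] + v
            log_prob += math.log(numerator / denominator)
        # Average over the predicted tokens, including the end symbol.
        return math.exp(-log_prob / (len(padded) - 2))
```

A sentence that resembles the training text receives a lower ppl than a scrambled one, which is the property the candidate selection relies on.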

That is, when a target candidate text is selected from the candidate texts as a target text, it should be ensured that a difference between a speech corresponding to the target candidate text and a speech corresponding to the text to be replaced is greater than a set difference threshold, and/or a text obtained by replacing the text to be replaced in the speech recognition result with the target candidate text has the same grammatical structure as the speech recognition result.

S3, replacing the text to be replaced in the speech recognition result with the target text to obtain the speech recognition result to be labeled.

After the target text is selected in the above manner, the text to be replaced in the speech recognition result is replaced with the target text, and the resulting complete text is used as the speech recognition result to be labeled.

With this way of selecting the target text, the obtained speech recognition result to be labeled has the same grammatical structure as the original speech recognition result and is semantically smooth, but when the target speech is listened to carefully, the obvious text error in the speech recognition result to be labeled can be found.

For example, suppose the original speech recognition result is "help me turn on air conditioner" and "air conditioner" is selected as the text to be replaced. A word such as "water heater" is then selected from the word class to which "air conditioner" belongs (for example, a class containing "air conditioner/water heater/microwave oven/refrigerator/…") to replace the original word, forming the speech recognition result to be labeled, "help me turn on water heater". The speech recognition result to be labeled and the original speech recognition result are similar at the text level (the semantics are smooth, so the labeling object cannot tell that "water heater" is a replaced wrong text without listening carefully to the audio) but differ greatly in acoustic pronunciation (a labeling object that listens carefully to the audio will certainly find that the text at this position is wrong and correct it). Having the labeling object perform text labeling processing on the speech recognition result to be labeled therefore makes it possible to judge whether the labeling object labeled carefully, and thus to assess the speech labeling quality objectively.

The text labeling processing performed by the labeling object on the speech recognition result to be labeled may specifically be processing that marks the text with recognition errors in the speech recognition result to be labeled, or processing that modifies the text with recognition errors.

In the embodiment of the present application, the labeling object is set to modify, when performing text labeling processing on the text to be labeled, any text in it that has a recognition error, that is, any text in the speech recognition result to be labeled that does not match the corresponding speech content.

Meanwhile, the speech labeling quality of the labeling object cannot be reflected by its labeling of a single speech. The labeling object needs to label a large number of speeches, and its speech labeling quality is then reflected through statistics. Therefore, the target speech should be a large number of target speeches, and correspondingly, the speech recognition results to be labeled are a large number of speech recognition results to be labeled, one for each target speech. Each speech recognition result to be labeled is obtained in the manner described above.

Then, the determining, according to the speech recognition result to be labeled and the labeling result, the labeling quality of the target speech to be labeled by the labeling object includes:

firstly, the voice recognition result to be labeled is compared with the labeling result, and at least the modification rate of the target text in the voice recognition result to be labeled of the labeling object is determined.

The target text is the text segment that was used to replace a text segment of the speech recognition result of the target speech when the text segment replacement was performed.

Specifically, the execution subject of the technical solution of the embodiment of the present application (for example, a processor, a voice annotation quality evaluation system, and the like) records both the text to be replaced in the original speech recognition result and the target text used to replace it, but this information is invisible to the labeling object.

After the labeling object finishes labeling and the labeling result is obtained, an edit distance algorithm is applied to the labeling result and the speech recognition result to be labeled to determine how the target text in the speech recognition result to be labeled was modified, for example, whether the target text was modified at all.

When the voice recognition results to be labeled are a large number of texts, the modification condition of the labeling object to the target text in each voice recognition result to be labeled is counted, and the modification rate of the labeling object to the target text in the large number of voice recognition results to be labeled can be determined.

And then, determining the labeling quality of the target voice labeling by the labeling object at least according to the modification rate of the target text in the voice recognition result to be labeled by the labeling object.

Specifically, if the labeling object finds and modifies the target text in the speech recognition result to be labeled, this indicates that the labeling object listened carefully to the target speech and labeled it correctly rather than working perfunctorily, so the speech labeling quality is assured. Conversely, if the labeling object fails to find and modify the target text in the speech recognition result to be labeled, this indicates that the labeling object did not listen to the target speech carefully or did not label it carefully, and the speech labeling quality is then lower.

Therefore, the modification rate of the target text in the speech recognition result to be labeled by the labeling object is counted, so that the labeling quality of the target speech labeling by the labeling object can be reflected.

If the modification rate of the target text in the voice recognition result to be labeled by the labeling object is higher, the labeling quality of the target voice by the labeling object is higher; otherwise, it can be determined that the labeling quality of the target voice by the labeling object is low.
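A minimal sketch of the modification-rate statistic follows. The description calls for an edit distance algorithm. This sketch instead uses the alignment produced by `difflib.SequenceMatcher` from the Python standard library as a stand-in. The function names and the token-span representation of the target text are assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable, Sequence, Tuple


def target_was_modified(to_label: Sequence[str], labeled: Sequence[str],
                        start: int, end: int) -> bool:
    """True if the target text, occupying tokens [start, end) of `to_label`,
    is not carried over unchanged into `labeled`."""
    matcher = SequenceMatcher(a=list(to_label), b=list(labeled), autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # The target survives only if it lies wholly inside an "equal" block.
        if tag == "equal" and i1 <= start and end <= i2:
            return False
    return True


def modification_rate(
        items: Iterable[Tuple[Sequence[str], Sequence[str], int, int]]) -> float:
    """Share of speech recognition results whose target text was modified."""
    items = list(items)
    if not items:
        return 0.0
    modified = sum(1 for to_label, labeled, start, end in items
                   if target_was_modified(to_label, labeled, start, end))
    return modified / len(items)
```

Edits made elsewhere in the sentence do not count as a modification of the target text, because only the alignment of the recorded target span is inspected.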

Furthermore, when the speech recognition result to be labeled is compared with the labeling result of the labeling object, the modification rate and the modification accuracy rate of the target text in the speech recognition result to be labeled by the labeling object can be determined simultaneously.

When the labeling object modifies the target text in the speech recognition result to be labeled back into the original text that the target text replaced, the target text is determined to be correctly modified; otherwise, the target text is considered to be wrongly modified.

According to the above rules, based on the modification of the target text in the speech recognition result to be labeled by the labeling object, the modification accuracy of the target text in the speech recognition result to be labeled can be determined.

Then, when the voice labeling quality of the labeling object is determined, the labeling quality of the labeling object for performing voice labeling on the target voice can be determined according to the modification rate and the modification accuracy rate of the target text in the voice recognition result to be labeled by the labeling object.

For example, when the modification rate of the target text in the speech recognition result to be labeled by the labeling object is greater than a set modification rate threshold and the modification accuracy is greater than a set modification accuracy threshold, the speech labeling quality of the labeling object can be considered to pass; otherwise, the speech labeling quality of the labeling object is considered to fail.

In addition, the voice labeling service also includes the role of an inspection object, whose task is to inspect and correct the voice labeling work of the labeling object. The main work of the inspection object is to check whether the target text in the speech recognition result to be labeled has been modified by the labeling object, to modify any target text that the labeling object left unmodified, and to correct any target text that the labeling object modified wrongly.
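The threshold check can be illustrated with the sketch below. The record type `ItemOutcome`, the default threshold values, and the choice to compute the modification accuracy over the modified target texts only are assumptions; the description does not fix them.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ItemOutcome:
    original_text: str  # text that was replaced (hidden from the labeler)
    target_text: str    # planted wrong text (hidden from the labeler)
    labeled_text: str   # text found at that position in the labeling result


def evaluate_labeler(outcomes: Sequence[ItemOutcome],
                     rate_threshold: float = 0.9,
                     accuracy_threshold: float = 0.9) -> Tuple[float, float, bool]:
    """Return (modification rate, modification accuracy, passed)."""
    modified = [o for o in outcomes if o.labeled_text != o.target_text]
    correct = [o for o in modified if o.labeled_text == o.original_text]
    rate = len(modified) / len(outcomes) if outcomes else 0.0
    # Assumption: accuracy is measured over the modified target texts only.
    accuracy = len(correct) / len(modified) if modified else 0.0
    passed = rate > rate_threshold and accuracy > accuracy_threshold
    return rate, accuracy, passed
```

Both conditions must hold for a pass, so a labeling object that changes every planted text but restores few of them correctly still fails.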

The inspection object may be an inspector, or a voice labeling machine, algorithm, program, or the like.

Then, in the evaluation of the voice labeling quality, the voice labeling quality of the inspection object can also be evaluated.

Illustratively, the embodiment of the present application evaluates the voice annotation quality of an inspection object by:

firstly, the voice recognition result to be labeled is compared with the labeling result of the labeling object, and the target text which is not modified by the labeling object in the voice recognition result to be labeled is determined.

That is, the speech recognition result to be labeled is compared with the labeling result of the labeling object, and the target text in the speech recognition result to be labeled that was not modified by the labeling object is screened out.

And meanwhile, acquiring a modification result of the inspection object on the labeling result of the labeling object.

Then, based on the modification result of the inspection object to the labeling result of the labeling object, the modification rate of the inspection object for the target text in the speech recognition result to be labeled that was not modified by the labeling object is determined.

The voice labeling quality of the inspection object for the target speech is then determined based on this modification rate.

For example, assume that there are 100 speech recognition results to be labeled in total and that each contains a target text, that is, one text segment in each speech recognition result to be labeled has been replaced by a target text. After the labeling object performs the labeling processing, the target texts in 98 of the speech recognition results to be labeled have been modified, and the target texts in the remaining two have not been modified by the labeling object.

At this time, if the inspection object modifies the target texts in these two speech recognition results to be labeled, this proves that the inspection object carefully checked the labeling result of the labeling object, so the voice labeling quality of the inspection object can be determined to be high. Conversely, if the inspection object modifies neither or only one of these two target texts, this indicates that the inspection object did not carefully check the labeling result of the labeling object, so the voice labeling quality of the inspection object is considered to be low.

Furthermore, the voice labeling quality of the inspection object can also be evaluated in combination with the modification accuracy of the inspection object for the target text in the speech recognition result to be labeled that was not modified by the labeling object. For the specific processing, reference may be made to the above process of evaluating the voice labeling quality of the labeling object by using its modification rate and modification accuracy for the target text of the speech recognition result to be labeled.
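The 100-item example can be reproduced with a short sketch. The triple layout `(target_text, labeler_text, inspector_text)` and the function name are hypothetical. Returning `None` when the labeling object missed nothing is a design choice of this sketch.

```python
from typing import Optional, Sequence, Tuple


def inspector_modification_rate(
        items: Sequence[Tuple[str, str, str]]) -> Optional[float]:
    """Each item is (target_text, labeler_text, inspector_text): the planted
    text and what the labeler and the inspector left at that position.

    Returns the share of target texts missed by the labeler that the
    inspector went on to modify, or None if the labeler missed none.
    """
    missed = [(target, inspector) for target, labeler, inspector in items
              if labeler == target]
    if not missed:
        return None
    fixed = sum(1 for target, inspector in missed if inspector != target)
    return fixed / len(missed)
```

With 98 target texts already corrected by the labeling object and two left untouched, the rate is 1.0 if the inspection object fixes both and 0.5 if it fixes only one.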

The embodiment of the present application further provides a voice annotation quality evaluation device, as shown in fig. 2, the device includes:

a text obtaining unit 100, configured to obtain a to-be-labeled voice recognition result corresponding to a target voice, where the to-be-labeled voice recognition result is obtained by performing text segment replacement on a voice recognition result of the target voice, and the replacement text segment is an erroneous text segment relative to the target voice;

a labeling result obtaining unit 110, configured to obtain a labeling result obtained by performing text labeling processing on the to-be-labeled voice recognition result by using a labeling object, where the text labeling processing is processing of labeling a text with a recognition error;

and the quality evaluation unit 120 is configured to determine, according to the speech recognition result to be labeled and the labeling result, the labeling quality of the target speech by the labeling object.

The voice labeling quality evaluation device provided by the embodiment of the application can automatically acquire the voice recognition result to be labeled corresponding to the target voice and acquire the labeling result obtained by text labeling processing of the voice recognition result to be labeled by the labeling object. And then, according to the voice recognition result to be labeled and the labeling result, determining the labeling quality of the target voice by the labeling object. The process realizes the automatic evaluation of the voice labeling quality of the labeled object, realizes the supervision of the voice labeling work of the labeled object and is beneficial to improving the voice labeling quality of the labeled object.

selecting a target candidate text from the candidate texts as a target text;

Optionally, the quality evaluation unit is further configured to:

Specifically, please refer to the content of the method embodiment for the specific working content of each unit of the voice annotation quality evaluation apparatus, which is not described herein again.

Another embodiment of the present application further provides a speech annotation quality evaluation device, as shown in fig. 3, the device includes:

a memory 200 and a processor 210;

wherein, the memory 200 is connected to the processor 210 for storing programs;

the processor 210 is configured to implement the voice annotation quality evaluation method disclosed in any of the above embodiments by running the program stored in the memory 200.

Specifically, the voice labeling quality evaluation device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.

The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:

a bus may include a path that transfers information between components of a computer system.

The processor 210 may be a general-purpose processor, such as a general-purpose Central Processing Unit (CPU) or a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in accordance with the present invention. It may also be a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.

The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.

The memory 200 stores programs for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code including computer operating instructions. More specifically, the memory 200 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a Random Access Memory (RAM), other types of dynamic storage devices that may store information and instructions, a disk storage, a flash memory, and so forth.

The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.

Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.

Communication interface 220 may include any device that uses any transceiver or the like to communicate with other devices or communication networks, such as an ethernet network, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.

The processor 210 executes the programs stored in the memory 200 and invokes other devices, which can be used to implement the steps of the voice annotation quality evaluation method provided by the embodiment of the present application.

Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the voice annotation quality evaluation method provided in any of the above embodiments.

Specifically, the specific working contents of each part of the above-mentioned voice labeling quality evaluation device and the specific processing contents of the above-mentioned computer program on the storage medium when being executed by the processor can refer to the contents of each embodiment of the above-mentioned voice labeling quality evaluation method, and are not described herein again.

While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. Since the device embodiments are basically similar to the method embodiments, their description is relatively brief, and for the relevant points, reference may be made to the corresponding description of the method embodiments.

The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.

The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.

In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for evaluating the quality of a voice annotation is characterized by comprising the following steps:

2. The method according to claim 1, wherein the obtaining of the speech recognition result to be labeled corresponding to the target speech comprises:

3. The method according to claim 2, wherein the determining the text to be replaced from the speech recognition result of the target speech comprises:

4. The method according to claim 3, wherein the selecting a participle from the participles as a text to be replaced according to at least the identification information of the participle comprises:

5. The method according to claim 4, wherein the selecting a participle from the participles as the text to be replaced based on the classification result of each participle comprises:

6. The method according to claim 2, wherein the selecting the text matching the text to be replaced from a preset text library as the target text comprises:

selecting a target candidate text from the candidate texts as a target text;

7. The method according to claim 1, wherein the text labeling process is a process of marking a text which is recognized wrongly or a process of modifying a text which is recognized wrongly.

8. The method of claim 7, wherein when the text labeling process is a process of modifying a text with a recognition error, the determining, according to the speech recognition result to be labeled and the labeling result, a labeling quality of the target speech by the labeling object for speech labeling comprises:

9. The method of claim 8, wherein the determining at least a modification rate of the target text in the speech recognition result to be labeled by the labeling object by comparing the speech recognition result to be labeled with the labeling result comprises:

10. The method of claim 8, further comprising:

11. A speech annotation quality evaluation device is characterized by comprising:

12. A speech annotation quality evaluation apparatus, characterized by comprising:

a memory and a processor;

the processor is configured to implement the method for evaluating the quality of a voice annotation according to any one of claims 1 to 10 by running the program in the memory.

13. A storage medium, characterized in that the storage medium stores thereon a computer program, which when executed by a processor, implements the voice annotation quality evaluation method according to any one of claims 1 to 10.