CN111681642B - Speech recognition evaluation method, device, storage medium and equipment - Google Patents

Speech recognition evaluation method, device, storage medium and equipment

Info

Publication number
CN111681642B
Authority
CN
China
Prior art keywords: text, preset, voice recognition, processing, word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010495673.4A
Other languages
Chinese (zh)
Other versions
CN111681642A (en)
Inventor
赵立
徐文铭
杨晶生
韩晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202010495673.4A
Publication of CN111681642A
Application granted
Publication of CN111681642B

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/01 — Assessment or evaluation of speech recognition systems
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/10 — Text processing
    • G06F 40/103 — Formatting, i.e. changing of presentation of documents
    • G06F 40/117 — Tagging; Marking up; Designating a block; Setting of attributes
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/10 — Text processing
    • G06F 40/12 — Use of codes for handling textual entities
    • G06F 40/14 — Tree-structured documents
    • G06F 40/143 — Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]

Abstract

The embodiments of the disclosure disclose a speech recognition evaluation method, apparatus, storage medium, and device. The method comprises: processing an annotated text and a speech recognition text based on a preset preprocessing strategy to obtain a corresponding target annotated text and target speech recognition text, where the annotated text and the speech recognition text correspond to the same sample audio data, and the speech recognition text comprises the recognition result output after performing speech recognition on the sample audio data with a preset speech recognition scheme; determining a comparison result between the target annotated text and the target speech recognition text based on a preset comparison algorithm; and evaluating accuracy information of the preset speech recognition scheme according to the comparison result. With this technical solution, the annotated text and the speech recognition text undergo the same preprocessing before the speech recognition text is evaluated, so that inconsistencies between the two texts in certain respects (such as format) can be eliminated and prevented from distorting the comparison, making the evaluation result more accurate.

Description

Speech recognition evaluation method, device, storage medium and equipment
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a method, a device, a storage medium and equipment for speech recognition evaluation.
Background
Automatic Speech Recognition (ASR) is a technology that extracts text information from audio data; it is widely applied in application scenarios that require converting speech into text.
When applying ASR, targeted optimization is often required according to the specific characteristics of the actual scenario, and a technical means of accurately measuring the accuracy of the speech-to-text conversion is indispensable for performing that optimization effectively.
Disclosure of Invention
The embodiment of the disclosure provides a voice recognition evaluation method, a voice recognition evaluation device, a storage medium and equipment, which can optimize the existing voice recognition evaluation scheme.
In a first aspect, an embodiment of the present disclosure provides a speech recognition evaluation method, including:
processing a label text and a voice recognition text based on a preset preprocessing strategy to obtain a corresponding target label text and a corresponding target voice recognition text, wherein the label text and the voice recognition text correspond to the same sample audio data, and the voice recognition text comprises a recognition result output after voice recognition is carried out on the sample audio data by using a preset voice recognition scheme;
determining a comparison result of the target annotation text and the target voice recognition text based on a preset comparison algorithm;
and evaluating the accuracy information of the preset voice recognition scheme according to the comparison result.
In a second aspect, an embodiment of the present disclosure provides a speech recognition evaluation apparatus, including:
the system comprises a pre-processing module, a pre-processing module and a voice recognition module, wherein the pre-processing module is used for processing a label text and a voice recognition text based on a preset pre-processing strategy to obtain a corresponding target label text and a corresponding target voice recognition text, the label text and the voice recognition text correspond to the same sample audio data, and the voice recognition text comprises a recognition result output after voice recognition is carried out on the sample audio data by using a preset voice recognition scheme;
the comparison result determining module is used for determining a comparison result of the target annotation text and the target voice recognition text based on a preset comparison algorithm;
and the accuracy determining module is used for evaluating the accuracy information of the preset voice recognition scheme according to the comparison result.
In a third aspect, the disclosed embodiments provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a speech recognition evaluation method as provided by the disclosed embodiments.
In a fourth aspect, the present disclosure provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the speech recognition evaluation method provided by the embodiments of the present disclosure when executing the computer program.
The speech recognition evaluation scheme provided in the embodiments of the disclosure processes an annotated text and a speech recognition text based on a preset preprocessing strategy to obtain a corresponding target annotated text and target speech recognition text, where the annotated text and the speech recognition text correspond to the same sample audio data, and the speech recognition text comprises the recognition result output after performing speech recognition on the sample audio data with the preset speech recognition scheme; a comparison result between the target annotated text and the target speech recognition text is determined based on a preset comparison algorithm, and accuracy information of the preset speech recognition scheme is evaluated according to the comparison result. With this technical solution, the annotated text and the speech recognition text undergo the same preprocessing before the speech recognition text is evaluated, so that inconsistencies between the two texts in certain respects (such as format or character expression) can be eliminated and prevented from distorting the comparison, making the evaluation result more accurate.
Drawings
Fig. 1 is a schematic flow chart of a speech recognition evaluation method according to an embodiment of the present disclosure;
Fig. 2 is a schematic flow chart of another speech recognition evaluation method according to an embodiment of the present disclosure;
Fig. 3 is a schematic flow chart of another speech recognition evaluation method according to an embodiment of the present disclosure;
fig. 4 is a block diagram illustrating a structure of a speech recognition evaluation apparatus according to an embodiment of the present disclosure;
fig. 5 is a block diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
In the following, optional features and examples are provided in each embodiment; the features described in the embodiments may be combined to form multiple alternative solutions, and each numbered embodiment should not be regarded as defining only a single technical solution.
Fig. 1 is a flowchart of a speech recognition evaluation method according to an embodiment of the present disclosure. The method may be executed by a speech recognition evaluation apparatus, which may be implemented in software and/or hardware and may generally be integrated in a computer device. The method is applicable to evaluating speech recognition schemes used in various scenarios that require converting speech into text, such as multimedia conference scenarios, voice chat scenarios, and automatic film subtitle generation. As shown in fig. 1, the method includes:
step 101, processing the labeled text and the voice recognition text based on a preset preprocessing strategy to obtain a corresponding target labeled text and a target voice recognition text.
The annotation text and the voice recognition text correspond to the same sample audio data, and the voice recognition text comprises a recognition result output after voice recognition is carried out on the sample audio data by using a preset voice recognition scheme.
For example, the sample audio data may be randomly selected or recorded according to actual requirements. The preset speech recognition scheme may be any ASR scheme, or an ASR scheme set for a specific application scenario.
Taking a multimedia conference scenario as an example, Real-Time Communication (RTC) technology can be used to obtain the sample audio data. RTC enables end-to-end real-time communication over a network; as the core technology of multimedia conferencing, it provides the basic capability of transmitting conference audio and video data in real time and is the underlying dependency of real-time conference subtitles. RTC has various implementations; for the embodiments of the present disclosure, it only needs to provide a real-time audio stream, and its specific implementation is not limited. ASR, in turn, is the technical dependency of real-time conference subtitles: it converts the audio stream of the conference into real-time subtitles, and it too has various implementations.
For example, before this step, the method may further include obtaining the sample audio data; the specific manner of obtaining it is not limited. Optionally, the method may also include obtaining the annotated text and the speech recognition text corresponding to the sample audio data. The annotated text is generally produced by manual annotation: for example, the sample audio data is played and the annotator transcribes it by dictation; the specific annotation manner is not limited.
In the embodiment of the present disclosure, the tagged text and the speech recognition text are processed based on the preset preprocessing policy, that is, the tagged text and the speech recognition text are preprocessed in the same way. The preset preprocessing strategy can be set according to actual requirements, and specific contents are not limited.
And 102, determining a comparison result of the target annotation text and the target voice recognition text based on a preset comparison algorithm.
For example, the preset comparison algorithm may be selected according to actual requirements. It may be a feature-vector-based comparison algorithm: for example, extract the feature vector of the sequence corresponding to the target annotated text and that of the sequence corresponding to the target speech recognition text, and evaluate the similarity of the two sequences, i.e., of the two texts, by comparing the similarity of the two feature vectors. Alternatively, it may be an edit-distance-based comparison algorithm: for example, compute how the sequence corresponding to the target speech recognition text can be converted into the sequence corresponding to the target annotated text through edit operations, thereby obtaining the difference between the two sequences.
And 103, evaluating the accuracy information of the preset voice recognition scheme according to the comparison result.
For example, the accuracy information may include a Character Error Rate (CER), a Word Error Rate (WER), or a Sentence Error Rate (SER), chosen according to actual requirements. For Chinese and other languages in which each character carries meaning, CER is generally used to represent accuracy; for English and other languages in which several characters (letters) combine into a word with a specific meaning, WER is generally used. The accuracy information corresponding to the preset speech recognition scheme is then determined from the comparison result in the manner matching the chosen metric.
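As a rough sketch of how such a metric can be computed (a minimal illustration, not the disclosure's prescribed implementation; function names are assumed), the edit distance between the reference and the hypothesis divided by the reference length gives the error rate — over character strings for CER, over word lists for WER:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

def error_rate(ref, hyp):
    """CER if the inputs are character strings, WER if they are word lists."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

For example, `error_rate("abcd", "abed")` treats the inputs as characters (CER), while `error_rate(ref.split(), hyp.split())` treats them as words (WER).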
The speech recognition evaluation scheme provided in the embodiments of the disclosure processes an annotated text and a speech recognition text based on a preset preprocessing strategy to obtain a corresponding target annotated text and target speech recognition text, where both texts correspond to the same sample audio data and the speech recognition text comprises the recognition result output after performing speech recognition on the sample audio data with a preset speech recognition scheme; a comparison result between the two target texts is determined based on a preset comparison algorithm, and accuracy information of the preset speech recognition scheme is determined according to the comparison result. In the prior art, a common ASR accuracy evaluation method is to manually annotate the audio to be processed by ASR and then directly compute the accuracy metric of the ASR recognition result using the annotation as the reference. However, in manual annotation the handling of special cases such as times, numerical values, or amounts of money differs from that of the ASR scheme (service), and even the same ASR scheme may not handle such cases entirely consistently; directly evaluating with the raw annotation and the raw ASR result therefore often yields an insufficiently accurate measure of recognition accuracy. With the technical solution above, the same preprocessing is applied to the annotated text and the speech recognition text before evaluation, eliminating the influence of such interfering factors — that is, removing inconsistencies between the two texts in certain respects (such as format or character expression) — so that the evaluation result is more accurate.
In some embodiments, processing the annotated text and the speech recognition text based on the preset preprocessing strategy includes: performing the same processing on both texts for one or more items, the items including at least one of paragraph format, character width, word expression mode, and interfering characters. The advantage of this arrangement is that factors only weakly related to the substantive speech content can be adjusted uniformly, eliminating their influence on the comparison result.
In some embodiments, the paragraph format may include, for example, line spacing, first-line indentation, and line breaking. The paragraph format generally has no influence on the substantive content of a text, but when the comparison result of two texts is determined with the preset comparison algorithm, differences in paragraph format may affect the result, so the paragraph format can be preprocessed to eliminate such differences. Taking line breaking as an example, processing the paragraph format of the annotated text and the speech recognition text based on the preset preprocessing strategy includes: performing multi-line-to-single-line processing on both texts, i.e., converting the multi-line characters in the annotated text and the speech recognition text into a single line in the same way. The advantage of this is that applying the same multi-line-to-single-line processing to both texts prevents differences in the number or position of line breaks from affecting the comparison result of the target annotated text and the target speech recognition text, and hence from affecting the accuracy information.
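The multi-line-to-single-line step could be sketched as follows (the function name and the joining convention are assumptions, not prescribed by the disclosure):

```python
def to_single_line(text):
    """Collapse a multi-line text into a single line: strip per-line
    indentation and trailing whitespace, and drop blank lines.
    Lines are joined without a separator, which suits Chinese text;
    for English, joining with a space would be more appropriate."""
    lines = (ln.strip() for ln in text.splitlines())
    return "".join(ln for ln in lines if ln)
```

Applying the same function to both the annotated text and the speech recognition text removes line-break differences before comparison.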
In some embodiments, character width may include, for example, full-width (quanjiao) and half-width (banjiao) forms. A full-width character occupies the space of two standard characters (two half-width characters); Chinese characters, as well as the full-width English letters, graphic symbols, and special characters specified in the national standard GB2312-80, are all full-width, and in full-width form letters, digits, and the like occupy the same width as a Chinese character. A half-width character occupies one standard character position; half-width characters are generally ASCII characters, and letters, digits, and symbols typed without a Chinese input method active are half-width. Differences in character width generally do not affect the substantive content of a text, but when the comparison result of two texts is determined with the preset comparison algorithm, such differences may affect the result, so character width can be preprocessed to eliminate them. For example, processing the annotated text and the speech recognition text for character width based on the preset preprocessing strategy includes: performing full-width-to-half-width conversion on both texts, i.e., converting the characters in the annotated text and the speech recognition text from full-width to half-width form. The advantage of this is that applying the same conversion to both texts prevents differences in character width from affecting the comparison result of the target annotated text and the target speech recognition text, and hence from affecting the accuracy information.
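The full-width-to-half-width conversion can be sketched using the fixed offset between the Unicode full-width ASCII variants (U+FF01–U+FF5E) and their half-width counterparts (the function name is an illustrative assumption):

```python
def fullwidth_to_halfwidth(text):
    """Convert full-width (quanjiao) characters to half-width (banjiao)."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # full-width (ideographic) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:   # full-width ASCII variants
            code -= 0xFEE0               # fixed offset to half-width forms
        out.append(chr(code))
    return "".join(out)
```

Chinese characters fall outside these ranges and pass through unchanged, so the function is safe to apply to mixed Chinese/Latin text.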
In some embodiments, processing the annotated text and the speech recognition text for the word expression mode according to the preset preprocessing strategy comprises: performing at least one of uppercase-to-lowercase processing, special-number writing-style conversion, word-form conversion, and word segmentation on both texts. Differences in word expression generally do not affect the substantive content of a text, but when the comparison result of two texts is determined with the preset comparison algorithm, such differences may affect the result, so the expression mode can be preprocessed to eliminate them. For Chinese, uppercase-to-lowercase processing generally refers to converting numerals written with the formal (capital) Chinese numeral characters into the ordinary form in both texts; for English, it can include converting letters from upper case to lower case; other languages may have their own handling. Special numbers can include numbers with particular meanings such as dates, amounts, or percentages, and special-number conversion can include rewriting such numbers in both texts in a preset style. Word-form conversion mainly targets English or other languages with inflection rules: taking English as an example, verb forms can include the base form, participles, and so on, and noun forms can include singular and plural; word-form conversion can include mapping words in both texts to a corresponding preset canonical form, for example the base form for verbs and the singular for nouns. Word segmentation mainly applies to English or other languages written with letters or symbols; segmenting the annotated text and the speech recognition text can include splitting both texts with the same preset segmentation scheme, for example segmenting words in both texts based on the same dictionary.
In some embodiments, processing the annotated text and the speech recognition text for interfering characters based on the preset preprocessing strategy comprises: performing silent-character filtering and/or modal-word (tone-word) filtering on both texts, i.e., filtering out the silent characters and/or modal words contained in the annotated text and the speech recognition text. Interfering characters generally have no influence on the real content the speaker wants to express, but when the comparison result of the two texts is determined with the preset comparison algorithm, differences in interfering characters may affect the result, so they can be filtered out to eliminate this difference.
In some embodiments, the annotated text includes text labeled in a preset labeling manner, and the preset labeling manner follows the principle of preserving the original meaning of the speech. The advantage of this is that the labeling manner is standardized and faithful to the speech content, preventing substantive human corrections of the speech content from affecting the authenticity of the annotated text.
In some embodiments, the principle may be embodied in at least one of the following aspects: avoiding error correction of repeated (re-read) words; avoiding error correction of wrongly written characters in network slang; avoiding abbreviating a full name that is read out in full; avoiding correction of mispronunciations; labeling number-related words according to the audio reading; and labeling modal words according to the audio pronunciation. This has the advantage that the authenticity of the annotated text can be fully ensured.
In some embodiments, the preset labeling manner further includes at least one of the following: adding a first preset symbol mark for preset-type words, marking fuzzy words with a second preset symbol mark, and changing the written form of abbreviations based on a preset change rule. Preset-type words can include, for example, person names, place names, or other words with fixed special meanings; this helps distinguish them effectively from ordinary words and helps confirm whether the speech recognition text recognized them accurately. Fuzzy words can be words that the annotator cannot hear clearly; the second preset symbol can then act as a placeholder for words that cannot be annotated, ensuring the completeness of the annotated text. The first and second preset symbols can be set according to actual requirements; symbols that are not also used as punctuation marks are generally chosen. For languages such as English, the preset change rule can be, for example, to insert a third preset symbol, such as a space, between every two letters of an abbreviation.
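The abbreviation rewrite rule above can be sketched as follows (the function name and the choice of a space as the third preset symbol are illustrative assumptions):

```python
def space_abbreviation(abbr, sep=" "):
    """Insert the assumed third preset symbol (here: a space)
    between the consecutive letters of an abbreviation."""
    return sep.join(abbr)
```

Applying the same rewrite to abbreviations in both texts keeps, e.g., a spelled-out "A S R" in the recognition output from being counted as an error against "ASR" in the annotation.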
In some embodiments, determining the comparison result between the target annotated text and the target speech recognition text based on a preset comparison algorithm includes: determining the correspondence between the target annotated text and the target speech recognition text based on a minimum edit distance algorithm; and determining an edit path from the correspondence, taking the edit path as the comparison result. Correspondingly, evaluating the accuracy information of the preset speech recognition scheme according to the comparison result includes: determining the character error rate or word error rate of the preset speech recognition scheme from the edit path. The advantage of this arrangement is that the comparison result between the target annotated text and the target speech recognition text can be obtained quickly and accurately using the minimum edit distance.
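The minimum-edit-distance alignment with edit-path backtracking described above can be sketched like this (names and the returned structure are illustrative assumptions, not the disclosure's prescribed implementation): the dynamic-programming table is filled as for the plain edit distance and then backtracked to recover match/substitute/delete/insert operations, whose counts yield a character or word error rate:

```python
def align(ref, hyp):
    """Minimum-edit-distance alignment of two sequences; returns the
    edit path (the comparison result) and the resulting error rate."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    # backtrack from the bottom-right cell to recover the edit path
    path, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] and ref[i - 1] == hyp[j - 1]:
            path.append(("match", ref[i - 1])); i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            path.append(("sub", ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            path.append(("del", ref[i - 1])); i -= 1
        else:
            path.append(("ins", hyp[j - 1])); j -= 1
    path.reverse()
    errors = sum(1 for op in path if op[0] in ("sub", "del", "ins"))
    return path, errors / max(m, 1)
```

Passing character sequences gives a character error rate; passing word lists gives a word error rate. The path itself can also drive the display of insertions, deletions, and substitutions mentioned below.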
In some embodiments, after the comparison result between the target annotated text and the target speech recognition text is determined based on the preset comparison algorithm, the method may further include: displaying the comparison result. The displayed content can include insertions, deletions, and substitutions, so that the comparison result can be viewed more intuitively.
Fig. 2 is a schematic flow chart of another speech recognition evaluation method provided in an embodiment of the present disclosure, optimized on the basis of the various alternatives in the above embodiments. Specifically, taking Chinese as an example, the method includes the following steps:
step 201, processing the same item for the labeled text and the voice recognition text based on a preset preprocessing strategy to obtain a corresponding target labeled text and a target voice recognition text, wherein the item includes at least one of a paragraph format, a character occupation, a character expression mode and an interference character.
Optionally, before this step, a step of obtaining the annotation text and the speech recognition text corresponding to the sample audio data may also be included.
The label text and the voice recognition text correspond to the same sample audio data, and the voice recognition text comprises a recognition result output after voice recognition is carried out on the sample audio data by using a preset voice recognition scheme.
The annotated text comprises text labeled in a preset labeling manner, and the preset labeling manner follows the principle of preserving the original meaning of the speech. Illustratively, the principle is embodied in several aspects: avoiding error correction of repeated (re-read) words, avoiding error correction of wrongly written characters in network slang, avoiding correction of mispronunciations, labeling number-related words according to the audio reading, and labeling modal words according to the audio pronunciation. The preset labeling manner further comprises: adding a first preset symbol mark for preset-type words, and marking fuzzy words with a second preset symbol mark.
Illustratively, avoiding error correction of re-read words means that the annotation must faithfully transcribe the speech content, without adding or removing words on one's own authority; even if the audio contains obvious disfluencies, the text is written according to the audio content. For example, if the speaker says "I I am hungry", repeating the word "I", the transcription is still "I I am hungry" rather than "I am hungry" with one "I" removed.
Illustratively, avoiding error correction of wrongly written words in network expressions means that an internet slang term is labeled according to its actual pronunciation. For example, the slang term rendered here as "children's shoes" is transcribed as "children's shoes" rather than corrected to "classmates"; likewise, "child paper" is transcribed as "child paper" rather than "child" (these examples are literal renderings of Chinese homophone slang).
Illustratively, correcting mispronunciations means that when a pronunciation deviates due to accent or personal habit, the word is labeled as the originally intended one. For example, if the word for "volume" (Chinese yin1 liang4) is pronounced like "yin1 niang4" in context, it should still be transcribed as "volume".
For example, labeling number-related words according to the audio reading means that numbers, times or amounts are written out as they are read in the audio rather than as Arabic numerals. Proper terms containing Arabic numerals may be excluded. For example, "eleven" is transcribed as "eleven" instead of "11"; "one hundred" as "one hundred" instead of "100"; "five percent" as "five percent" instead of "5%"; but a term such as "5G" is still written as "5G".
Illustratively, labeling mood words according to the audio pronunciation means that, for each interjection uttered by the speaker, the corresponding mood word is selected from a mood-word list according to its true pronunciation. The list contains items rendered literally from Chinese interjections, such as "kahe", "o", "forehead", "hiccup", "no", "fur", "ao", "mid", "re", "my", "qua", "pah" and "har".
Illustratively, preset-type words may include names of people, which may be marked with "{ }" (the first preset symbol) and written with their common characters; names of well-known public figures are written as the figures actually write them. For example, the pronunciation "Lishan", when taken to be a personal name, may be labeled with its common writing; "Mayun", although it has a common writing, is the name of a well-known public figure and is therefore written as that figure's actual name.
Illustratively, fuzzy words may be labeled with "+" (the second preset symbol): each completely unrecognizable word is replaced by one "+". When the number of words cannot be made out at all, every 0.5 second of audio is counted as one word.
It should be noted that the above-mentioned "{ }" and "+" are used to mark names of people and fuzzy words, respectively, and are therefore not used as punctuation marks.
Illustratively, processing the paragraph format may include converting multiple lines to a single line; processing character occupation may include converting full-width characters to half-width; processing the word expression mode may include converting special numbers to their written-out form; and processing interfering characters may include filtering silent characters and spoken (mood) words.
For example, the special numbers may include dates, amounts, percentages and the like: "2008" may be processed as "two zero zero eight"; "$10" may be processed as "ten dollars"; "10.1%" may be processed as "ten point one percent".
The silent characters may include punctuation marks and symbols such as periods, commas, question marks, exclamation marks, semicolons, colons, brackets, slashes, quotation marks, and characters such as "+", "-", "&", "^", "#", "$" and "%". The mood words may include the interjections listed above, such as "kahe", "o", "forehead", "hiccup", "ao", "re" and "har".
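The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the filter lists are short, assumed stand-ins (the actual lists of silent characters and mood words are longer), Unicode NFKC normalization is used for the full-width-to-half-width conversion, and the special-number conversion is omitted for brevity.

```python
import re
import unicodedata

# Hypothetical, abbreviated filter lists -- the patent's actual lists are longer.
SILENT_CHARS = set(".,!?;:\"'()[]{}<>/\\+-&^#$%~`|")
FILLER_WORDS = {"嗯", "呃", "啊", "哦"}  # assumed Chinese mood words

def preprocess(text: str) -> str:
    """Apply the same 'item' preprocessing to an annotation or ASR text."""
    # Paragraph format: multiple lines -> single line.
    text = " ".join(text.split("\n"))
    # Character occupation: full-width -> half-width (NFKC folds full-width forms).
    text = unicodedata.normalize("NFKC", text)
    # Interfering characters: drop silent (punctuation) characters.
    text = "".join(ch for ch in text if ch not in SILENT_CHARS)
    # Interfering characters: drop mood (filler) words.
    for w in FILLER_WORDS:
        text = text.replace(w, "")
    # Collapse whitespace left behind by the steps above.
    return re.sub(r"\s+", " ", text).strip()
```

Because both texts pass through the same function, any mismatch caused purely by formatting or fillers disappears before comparison.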
Step 202, determining an editing path corresponding to the target labeling text and the target voice recognition text based on a minimum editing distance algorithm.
Illustratively, the minimum edit distance is solved by dynamic programming: the algorithm establishes a recursive relationship between each entry and the preceding entries. Based on the minimum edit distance algorithm, the correspondence between the characters of the target annotation text and the characters of the target speech recognition text can be obtained, and the path of the minimum edit distance is then traced back according to this correspondence to obtain the comparison result.
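The recurrence and backtracking can be sketched as below. This is an illustrative implementation of the standard minimum edit distance (Levenshtein) algorithm with path recovery, not necessarily the exact variant used in the patent; the operation labels and tuple layout are assumptions.

```python
def edit_path(ref: str, hyp: str):
    """Minimum edit distance by dynamic programming, with backtracking.

    Returns a list of (op, ref_char, hyp_char) tuples, where op is one of
    'match', 'sub', 'del', 'ins'; 'del' and 'ins' are relative to the
    reference (annotation) text.
    """
    m, n = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    # Trace one optimal path back from (m, n) to (0, 0).
    path, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)):
            path.append(('match' if ref[i - 1] == hyp[j - 1] else 'sub',
                         ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            path.append(('del', ref[i - 1], None))
            i -= 1
        else:
            path.append(('ins', None, hyp[j - 1]))
            j -= 1
    return path[::-1]
```

The returned path is exactly the character-level correspondence from which the comparison result is derived.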
Step 203, evaluating the character error rate (CER) of the preset speech recognition scheme according to the editing path.
The CER is generally calculated as (S + D + I)/N, where S (substitution) is the number of replaced characters, D (deletion) the number of deleted characters, I (insertion) the number of inserted characters, and N the total number of characters in the reference sequence (i.e., the total number of characters in the target annotation text). The values of these parameters can be determined from the edit path, and the CER then calculated.
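Given an edit path expressed as a sequence of operation labels, the CER formula can be computed directly; the label names ('match', 'sub', 'del', 'ins') are an illustrative convention.

```python
def cer(ops):
    """Character error rate (S + D + I) / N from an edit path.

    `ops` is a sequence of operation labels -- 'match', 'sub', 'del',
    'ins' -- describing how the recognition text differs from the
    annotation text. N is the annotation length: every operation except
    'ins' consumes one reference character.
    """
    s = sum(1 for op in ops if op == 'sub')   # substitutions
    d = sum(1 for op in ops if op == 'del')   # deletions
    i = sum(1 for op in ops if op == 'ins')   # insertions
    n = sum(1 for op in ops if op != 'ins')   # reference length
    return (s + d + i) / n
```

Note that the numerator counts insertions even though they do not consume reference characters, so the CER can exceed 1 for very noisy hypotheses.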
The time complexity of the minimum edit distance algorithm is O(mn); computing the CER of an ASR result against an annotation result of about ten thousand characters takes roughly 1 s, so the target annotation text and target speech recognition text generally do not exceed ten thousand characters (corresponding to about 1 hour of audio).
It is understood that after the word error rate is determined, the comparison result corresponding to the edit path in step 202 may also be displayed.
Illustratively, REF corresponds to the annotation result (i.e., the target annotation text), RES corresponds to the ASR result (i.e., the target speech recognition text), and the comparison result can be shown as follows:
REF: one __ two seven three _four_ five
RES: one six two __ three _eight_ five
"__" in the REF line indicates an insertion in RES, "__" in the RES line indicates a deletion in RES, and "_word_" indicates a replacement in RES; the result can be displayed following the sentence segmentation of REF or RES (i.e., the segmentation obtained after converting multiple lines to a single line).
According to the speech recognition evaluation method provided by this embodiment of the disclosure, the sample audio data is labeled following the principle of keeping the original meaning of the speech, so that substantive human corrections to the speech content do not compromise the authenticity of the annotation text. The annotation text and the speech recognition text are then uniformly adjusted with respect to factors only weakly related to the substantive speech content, eliminating the influence of those factors on the recognition result. Finally, the preprocessed annotation text and speech recognition text are compared based on the minimum edit distance algorithm and the character error rate of the current Chinese speech recognition scheme is calculated, which effectively improves the accuracy of evaluating Chinese speech recognition results.
Fig. 3 is a schematic flow chart of another speech recognition evaluation method provided in an embodiment of the present disclosure, optimized on the basis of the alternatives in the above embodiments. Taking English as an example, the method includes the following steps:
Step 301, processing the annotation text and the speech recognition text for the same item based on a preset preprocessing strategy to obtain a corresponding target annotation text and target speech recognition text, wherein the item includes at least one of a paragraph format, character occupation (full-width or half-width), a word expression mode and interfering characters.
Optionally, before this step, the method may further include obtaining an annotation text and a speech recognition text corresponding to the sample audio data.
The label text and the voice recognition text correspond to the same sample audio data, and the voice recognition text comprises a recognition result output after voice recognition is carried out on the sample audio data by using a preset voice recognition scheme.
The annotation text comprises text labeled in a preset labeling mode, and the preset labeling mode follows the principle of keeping the original meaning of the speech. Illustratively, the principle is embodied in several aspects: avoiding error correction of repeated (re-read) words, avoiding abbreviation of terms read out in full even though an abbreviation exists, avoiding error correction of wrongly written words in network expressions, correcting mispronunciations, labeling number-related words according to the audio reading, and labeling mood words according to the audio pronunciation. The preset labeling mode further comprises: marking fuzzy words with a second preset symbol, and changing the writing of abbreviations.
Illustratively, avoiding error correction of re-read words means that the annotation must faithfully transcribe the speech content, without adding or removing words on one's own authority; even if the audio contains obvious disfluencies, the text is written according to the audio content. For example, if the speaker says "I I miss you", repeating the word "I", the transcription is still "I I miss you" rather than "I miss you" with one "I" removed.
Illustratively, avoiding abbreviation of full-name readings means that when the audio uses the full name, it is written out in full even though an abbreviation exists. For example, "United States" read out in full is transcribed as "United States" rather than "U S".
For example, changing the writing of abbreviations applies in particular to abbreviations read letter by letter in the audio: each letter is separated by a space to distinguish it from an ordinary word. For example, "ASR" is transcribed as "A S R" and "U.S." as "U S". It should be noted that for abbreviations not read letter by letter (i.e., pronounced as a whole word), the original writing is retained: "DotA", pronounced as a single word, is written as "DotA" rather than "D o t A".
Illustratively, avoiding error correction of wrongly written words in network expressions means that such terms are labeled according to their actual pronunciation. For example, "RSVP" read letter by letter is transcribed as "R S V P" rather than "Reply if you please" or "R I Y P".
Illustratively, correcting mispronunciations means that when a pronunciation deviates due to accent or personal habit, the word is labeled as the originally intended one. For example, a speaker who says "very good" but pronounces it like "vely good" should be transcribed as "very good".
For example, labeling number-related words according to the audio reading means that numbers, times or amounts are written out as they are read in the audio rather than as Arabic numerals. Proper terms containing Arabic numerals may be excluded. For example, "eleven" is transcribed as "eleven" instead of "11"; "two dollars" as "two dollars" instead of "$2"; "five percent" as "five percent" instead of "5%"; but a term such as "5G" is still written as "5G".
For example, labeling mood words according to the audio pronunciation means selecting, for each interjection uttered by the speaker, the corresponding mood word from a mood-word list according to its actual pronunciation, e.g., hmm, mhm, yeah, uh-huh, oh, uh, huh, um, er, ahem, whoa, oops, rawr, awww, whoop, ughh, nah, etc.
Illustratively, fuzzy words may be labeled with "+" (the second preset symbol): each completely unrecognizable word is replaced by one "+". When the number of words cannot be made out at all, every 0.5 second of audio is counted as one word. It should be noted that "+" is used to mark fuzzy words and is therefore not used as a punctuation mark.
Illustratively, processing the paragraph format may include converting multiple lines to a single line; processing character occupation may include converting full-width characters to half-width; processing the word expression mode may include converting special numbers to their written-out form; and processing interfering characters may include filtering silent characters and spoken (mood) words.
For example, the special numbers may include dates, amounts, percentages and the like: "2008" may be processed as its spoken form (for example, "two thousand and eight"); "$10" may be processed as "ten dollars"; "10%" may be processed as "ten percent".
The silent characters may include punctuation marks and symbols such as periods, commas, question marks, exclamation marks, semicolons, colons, brackets, slashes, quotation marks, and characters such as "+", "-", "&", "^", "#", "$" and "%". The mood words may include hmm, mhm, yeah, uh-huh, oh, uh, huh, um, er, ahem, whoa, oops, rawr, awww, whoop, ughh, nah, and the like.
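The English-specific preprocessing (case folding, silent-character removal, filler filtering, and segmentation into words, since WER is computed over words rather than characters) might be sketched as follows; the filler list is an abbreviated, assumed stand-in.

```python
import re

# Hypothetical, abbreviated filler list -- the actual mood-word list is longer.
FILLERS = {"hmm", "mhm", "uh", "um", "er", "oh", "yeah"}

def normalize_english(text: str):
    """English preprocessing sketch: returns the word sequence over which
    WER is computed."""
    text = " ".join(text.split("\n"))        # multiple lines -> single line
    text = text.lower()                      # upper case -> lower case
    text = re.sub(r"[^\w\s]", " ", text)     # drop silent (punctuation) characters
    words = text.split()                     # word segmentation on whitespace
    return [w for w in words if w not in FILLERS]
```

The case folding step matters for English specifically: without it, "Hello" and "hello" would be counted as a substitution.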
Step 302, determining an editing path corresponding to the target annotation text and the target speech recognition text based on a minimum edit distance algorithm.
And step 303, evaluating the word error rate of the preset voice recognition scheme according to the editing path.
The WER is calculated similarly to the CER, generally as (S + D + I)/N, where S (substitution) is the number of replaced words, D (deletion) the number of deleted words, I (insertion) the number of inserted words, and N the total number of words in the reference sequence (i.e., the total number of words in the target annotation text). The values of these parameters can be determined from the edit path, and the WER then calculated.
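A word-level WER can be computed with the same minimum edit distance recurrence applied over word sequences instead of characters; this sketch relies on the fact that the minimum edit distance along an optimal alignment equals S + D + I.

```python
def wer(ref_words, hyp_words):
    """Word error rate (S + D + I) / N over word sequences.

    Uses the standard edit-distance recurrence; the distance itself equals
    S + D + I for an optimal alignment, and N is the reference length.
    """
    m, n = len(ref_words), len(hyp_words)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[m][n] / m

# e.g. wer("one two three".split(), "one too three four".split())
#      -> (1 substitution + 1 insertion) / 3 = 2/3
```

The same function yields a CER when called with character lists instead of word lists, which is why the Chinese and English flows share the comparison step.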
The time complexity of the minimum edit distance algorithm is O(mn); computing the WER of an ASR result against an annotation result of about ten thousand words takes roughly 1 s, so the target annotation text and target speech recognition text generally do not exceed ten thousand words (corresponding to about 1 hour of audio).
It can be understood that after the word error rate is determined, the comparison result corresponding to the edit path in step 302 may also be displayed.
Illustratively, REF corresponds to the annotation result (i.e., the target annotation text), and RES corresponds to the ASR result (i.e., the target speech recognition text), and the comparison result can be shown as follows:
REF: one __ two nine three _four_ five
RES: one six two __ three _eight_ five
"__" in the REF line indicates an insertion in RES, "__" in the RES line indicates a deletion in RES, and "_word_" indicates a replacement in RES; the result can be displayed following the sentence segmentation of REF or RES (i.e., the segmentation obtained after converting multiple lines to a single line).
According to the speech recognition evaluation method provided by this embodiment of the disclosure, the sample audio data is labeled following the principle of keeping the original meaning of the speech, so that substantive human corrections to the speech content do not compromise the authenticity of the annotation text. The annotation text and the speech recognition text are then uniformly adjusted with respect to factors only weakly related to the substantive speech content, with a word-form conversion preprocessing operation added in view of the particularities of English, eliminating the influence of those factors on the recognition result. Finally, the preprocessed annotation text and speech recognition text are compared based on the minimum edit distance algorithm and the word error rate of the current English speech recognition scheme is calculated, which effectively improves the accuracy of evaluating English speech recognition results.
It should be noted that, for convenience of illustration, Chinese and English are described separately in the above embodiments. Some audio may contain both Chinese and English; the Chinese or English portions can each be processed by referring to the corresponding steps, i.e., the scheme provided by the embodiments of the present disclosure is applicable to sample audio data containing multiple languages at the same time.
Fig. 4 is a block diagram of a speech recognition evaluation apparatus provided in an embodiment of the present disclosure, which may be implemented by software and/or hardware, and may be generally integrated in a computer device, and may perform speech recognition evaluation by executing a speech recognition evaluation method. As shown in fig. 4, the apparatus includes:
the pre-processing module 401 is configured to process a label text and a voice recognition text based on a preset pre-processing policy to obtain a corresponding target label text and a corresponding target voice recognition text, where the label text and the voice recognition text correspond to the same sample audio data, and the voice recognition text includes a recognition result output after performing voice recognition on the sample audio data by using a preset voice recognition scheme;
a comparison result determining module 402, configured to determine a comparison result between the target annotation text and the target speech recognition text based on a preset comparison algorithm;
and an accuracy determining module 403, configured to evaluate accuracy information of the preset speech recognition scheme according to the comparison result.
The speech recognition evaluation device provided in the embodiment of the disclosure performs the same preprocessing on the annotation text and the speech recognition text before evaluating the speech recognition text, so that inconsistency of the annotation text and the speech recognition text in some aspects can be eliminated, influence on a recognition result is avoided, and the evaluation result is more accurate.
Referring now to FIG. 5, shown is a schematic block diagram of a computer device 500 suitable for use in implementing embodiments of the present disclosure. The computer device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The computer device shown in fig. 5 is only an example and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 5, computer device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the computer apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the computer device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates a computer device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the computer device; or may exist separately and not be incorporated into the computer device.
The computer readable medium carries one or more programs which, when executed by the computing device, cause the computing device to: processing a label text and a voice recognition text based on a preset preprocessing strategy to obtain a corresponding target label text and a corresponding target voice recognition text, wherein the label text and the voice recognition text correspond to the same sample audio data, and the voice recognition text comprises a recognition result output after voice recognition is carried out on the sample audio data by using a preset voice recognition scheme; determining a comparison result of the target annotation text and the target voice recognition text based on a preset comparison algorithm; and evaluating the accuracy information of the preset voice recognition scheme according to the comparison result.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not constitute a limitation to the module itself in some cases, for example, the comparison result determining module may also be described as a "module that determines the comparison result between the target annotation text and the target speech recognition text based on a preset comparison algorithm".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a speech recognition evaluation method including:
processing a label text and a voice recognition text based on a preset preprocessing strategy to obtain a corresponding target label text and a corresponding target voice recognition text, wherein the label text and the voice recognition text correspond to the same sample audio data, and the voice recognition text comprises a recognition result output after voice recognition is carried out on the sample audio data by using a preset voice recognition scheme;
determining a comparison result of the target annotation text and the target voice recognition text based on a preset comparison algorithm;
and evaluating the accuracy information of the preset voice recognition scheme according to the comparison result.
Further, the processing the annotation text and the speech recognition text based on the preset preprocessing strategy includes: and processing the same item on the annotation text and the voice recognition text based on a preset preprocessing strategy, wherein the item comprises at least one of paragraph format, character occupation, word expression mode and interference characters.
Further, the processing of the paragraph format on the annotation text and the speech recognition text based on the preset preprocessing strategy comprises: and carrying out multi-line to single-line processing on the labeling text and the voice recognition text.
Further, the processing of character occupation of the annotation text and the speech recognition text based on the preset preprocessing strategy comprises: and performing full-angle to half-angle processing on the labeling text and the voice recognition text.
Further, the processing of the annotation text and the speech recognition text for the word expression mode based on the preset preprocessing strategy comprises: and performing at least one of upper case to lower case processing, special digital writing mode conversion processing, word form conversion processing and word segmentation processing on the annotation text and the voice recognition text.
Further, the processing of the annotation text and the speech recognition text for the interfering characters based on the preset preprocessing strategy comprises: and performing silent character filtering processing and/or language word filtering processing on the labeled text and the voice recognition text.
Further, the labeled text comprises a text labeled by adopting a preset labeling mode, and the preset labeling mode follows the principle of keeping the original meaning of the voice.
Further, the principle is embodied in at least one of the following aspects: avoiding error correction of re-read words, avoiding error correction of wrongly written words in network expressions, avoiding abbreviation of terms read out in full even though an abbreviation exists, correcting mispronunciations, labeling number-related words according to the audio reading, and labeling mood words according to the audio pronunciation.
Further, the preset labeling manner further includes at least one of the following items: adding a first preset symbol mark for a preset type word, marking a fuzzy word by adopting a second preset symbol mark, and changing a writing mode of an abbreviation based on a preset change rule.
Further, determining a comparison result between the target annotation text and the target speech recognition text based on a preset comparison algorithm comprises:
determining an edit path between the target annotation text and the target speech recognition text based on a minimum edit distance algorithm, and taking the edit path as the comparison result;
correspondingly, evaluating the accuracy information of the preset speech recognition scheme according to the comparison result comprises:
determining the character error rate or word error rate of the preset speech recognition scheme according to the edit path.
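The minimum-edit-distance comparison can be sketched as follows. This is a simplified illustration of the standard dynamic-programming formulation, returning only the distance-derived error rate; a full implementation of the patent's scheme would also backtrack through the table to recover the edit path itself.

```python
def edit_path_wer(reference: list, hypothesis: list) -> float:
    """Word error rate from the minimum edit distance between the target
    annotation text (reference) and target recognition text (hypothesis):
    WER = (substitutions + deletions + insertions) / len(reference)."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(n + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[m][n] / m if m else 0.0

# 1 substitution ("sat" vs "sit") out of 3 reference words → ≈0.333
print(edit_path_wer("the cat sat".split(), "the cat sit".split()))
```

Running the same routine over characters instead of words yields the character error rate, which is the more common metric for languages such as Chinese that are not whitespace-delimited.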
According to one or more embodiments of the present disclosure, there is provided a speech recognition evaluation apparatus, comprising:
a preprocessing module, configured to process an annotation text and a speech recognition text based on a preset preprocessing strategy to obtain a corresponding target annotation text and target speech recognition text, wherein the annotation text and the speech recognition text correspond to the same sample audio data, and the speech recognition text comprises a recognition result output after speech recognition is performed on the sample audio data using a preset speech recognition scheme;
a comparison result determining module, configured to determine a comparison result between the target annotation text and the target speech recognition text based on a preset comparison algorithm;
and an accuracy determining module, configured to evaluate the accuracy information of the preset speech recognition scheme according to the comparison result.
Further, processing the annotation text and the speech recognition text based on the preset preprocessing strategy comprises: processing the same item on the annotation text and the speech recognition text based on the preset preprocessing strategy, wherein the item comprises at least one of paragraph format, character width, word expression mode, and interfering characters.
Further, processing the paragraph format of the annotation text and the speech recognition text based on the preset preprocessing strategy comprises: performing multi-line to single-line processing on the annotation text and the speech recognition text.
Further, processing the character width of the annotation text and the speech recognition text based on the preset preprocessing strategy comprises: performing full-width to half-width conversion on the annotation text and the speech recognition text.
Further, processing the word expression mode of the annotation text and the speech recognition text based on the preset preprocessing strategy comprises: performing at least one of upper-case to lower-case conversion, special number writing conversion, word form conversion, and word segmentation on the annotation text and the speech recognition text.
Further, processing the interfering characters of the annotation text and the speech recognition text based on the preset preprocessing strategy comprises: performing silent character filtering and/or filler word filtering on the annotation text and the speech recognition text.
Further, the annotation text comprises text annotated in a preset annotation manner, and the preset annotation manner follows the principle of preserving the original meaning of the speech.
Further, the principle is embodied in at least one of the following aspects: repeated (stuttered) words are not corrected, deliberately miswritten characters in internet slang are not corrected, full readings of abbreviations are not shortened to the abbreviation, mispronunciations are not corrected, number-related words are annotated according to how they are read in the audio, and modal particles are annotated according to their pronunciation in the audio.
Further, the preset annotation manner further comprises at least one of the following: adding a first preset symbol mark to preset-type words, marking fuzzy (unclear) words with a second preset symbol mark, and changing the written form of abbreviations based on a preset change rule.
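Before comparison, such annotation marks would typically be stripped or resolved. The sketch below illustrates one way to do this; the concrete marker symbols are not specified in the patent, so `[...]` for preset-type words and `(...)` for fuzzy words are assumptions chosen purely for illustration.

```python
import re

# Assumed marker conventions (not from the patent):
PRESET_MARK = re.compile(r"\[([^\]]*)\]")  # e.g. "[ByteDance]" marks a preset-type word
FUZZY_MARK = re.compile(r"\(([^)]*)\)")    # e.g. "(maybe)" marks an unclear word

def strip_annotation_marks(labeled: str, drop_fuzzy: bool = True) -> str:
    """Remove evaluation-only markers before comparison: keep the word inside
    a preset-type mark, and optionally drop fuzzy words entirely so that
    unclear audio does not count against the recognizer."""
    labeled = PRESET_MARK.sub(r"\1", labeled)        # unwrap preset-type words
    labeled = FUZZY_MARK.sub("" if drop_fuzzy else r"\1", labeled)
    return " ".join(labeled.split())                 # collapse leftover spaces

print(strip_annotation_marks("we use [ByteDance] tools (maybe) daily"))
# → "we use ByteDance tools daily"
```

Whether fuzzy words are dropped or kept is a scoring policy decision: dropping them avoids penalizing the recognizer for audio a human annotator could not confidently transcribe.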
Further, determining a comparison result between the target annotation text and the target speech recognition text based on a preset comparison algorithm comprises:
determining an edit path between the target annotation text and the target speech recognition text based on a minimum edit distance algorithm, and taking the edit path as the comparison result;
correspondingly, evaluating the accuracy information of the preset speech recognition scheme according to the comparison result comprises:
determining the character error rate or word error rate of the preset speech recognition scheme according to the edit path.
The foregoing description is merely illustrative of the preferred embodiments of the present disclosure and of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions in which the above features are replaced with (but not limited to) features with similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A speech recognition evaluation method, comprising:
processing an annotation text and a speech recognition text based on a preset preprocessing strategy to obtain a corresponding target annotation text and target speech recognition text, wherein the annotation text and the speech recognition text correspond to the same sample audio data, the speech recognition text comprises a recognition result output after speech recognition is performed on the sample audio data using a preset speech recognition scheme, and the annotation text is obtained from the sample audio data by manual annotation; the annotation text comprises text annotated in a preset annotation manner, and the preset annotation manner follows the principle of preserving the original meaning of the speech; the preset annotation manner comprises: adding a first preset symbol mark to preset-type words, marking fuzzy (unclear) words with a second preset symbol mark, and changing the written form of abbreviations based on a preset change rule;
the principle is embodied in the following aspects:
repeated (stuttered) words are not corrected, deliberately miswritten characters in internet slang are not corrected, full readings of abbreviations are not shortened to the abbreviation, mispronunciations are not corrected, number-related words are annotated according to how they are read in the audio, and modal particles are annotated according to their pronunciation in the audio;
determining a comparison result between the target annotation text and the target speech recognition text based on a preset comparison algorithm;
and evaluating accuracy information of the preset speech recognition scheme according to the comparison result.
2. The method according to claim 1, wherein processing the annotation text and the speech recognition text based on the preset preprocessing strategy comprises:
processing the same item on the annotation text and the speech recognition text based on the preset preprocessing strategy, wherein the item comprises at least one of paragraph format, character width, word expression mode, and interfering characters.
3. The method according to claim 2, wherein processing the paragraph format of the annotation text and the speech recognition text based on the preset preprocessing strategy comprises:
performing multi-line to single-line processing on the annotation text and the speech recognition text.
4. The method according to claim 2, wherein processing the character width of the annotation text and the speech recognition text based on the preset preprocessing strategy comprises:
performing full-width to half-width conversion on the annotation text and the speech recognition text.
5. The method according to claim 2, wherein processing the word expression mode of the annotation text and the speech recognition text based on the preset preprocessing strategy comprises:
performing at least one of upper-case to lower-case conversion, special number writing conversion, word form conversion, and word segmentation on the annotation text and the speech recognition text.
6. The method according to claim 2, wherein processing the interfering characters of the annotation text and the speech recognition text based on the preset preprocessing strategy comprises:
performing silent character filtering and/or filler word filtering on the annotation text and the speech recognition text.
7. The method according to any one of claims 1 to 6, wherein determining the comparison result between the target annotation text and the target speech recognition text based on the preset comparison algorithm comprises:
determining an edit path between the target annotation text and the target speech recognition text based on a minimum edit distance algorithm, and taking the edit path as the comparison result;
correspondingly, evaluating the accuracy information of the preset speech recognition scheme according to the comparison result comprises:
determining the character error rate or word error rate of the preset speech recognition scheme according to the edit path.
8. A speech recognition evaluation device, comprising:
a preprocessing module, configured to process an annotation text and a speech recognition text based on a preset preprocessing strategy to obtain a corresponding target annotation text and target speech recognition text, wherein the annotation text and the speech recognition text correspond to the same sample audio data, the speech recognition text comprises a recognition result output after speech recognition is performed on the sample audio data using a preset speech recognition scheme, and the annotation text is obtained from the sample audio data by manual annotation; the annotation text comprises text annotated in a preset annotation manner, and the preset annotation manner follows the principle of preserving the original meaning of the speech; the preset annotation manner comprises: adding a first preset symbol mark to preset-type words, marking fuzzy (unclear) words with a second preset symbol mark, and changing the written form of abbreviations based on a preset change rule;
the principle is embodied in the following aspects:
repeated (stuttered) words are not corrected, deliberately miswritten characters in internet slang are not corrected, full readings of abbreviations are not shortened to the abbreviation, mispronunciations are not corrected, number-related words are annotated according to how they are read in the audio, and modal particles are annotated according to their pronunciation in the audio;
a comparison result determining module, configured to determine a comparison result between the target annotation text and the target speech recognition text based on a preset comparison algorithm;
and an accuracy determining module, configured to evaluate accuracy information of the preset speech recognition scheme according to the comparison result.
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
CN202010495673.4A 2020-06-03 2020-06-03 Speech recognition evaluation method, device, storage medium and equipment Active CN111681642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010495673.4A CN111681642B (en) 2020-06-03 2020-06-03 Speech recognition evaluation method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010495673.4A CN111681642B (en) 2020-06-03 2020-06-03 Speech recognition evaluation method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN111681642A CN111681642A (en) 2020-09-18
CN111681642B true CN111681642B (en) 2022-04-15

Family

ID=72434626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010495673.4A Active CN111681642B (en) 2020-06-03 2020-06-03 Speech recognition evaluation method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN111681642B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151014B (en) * 2020-11-04 2023-07-21 平安科技(深圳)有限公司 Speech recognition result evaluation method, device, equipment and storage medium
CN112509578A (en) * 2020-12-10 2021-03-16 北京有竹居网络技术有限公司 Voice information recognition method and device, electronic equipment and storage medium
CN112700763B (en) * 2020-12-26 2024-04-16 中国科学技术大学 Voice annotation quality evaluation method, device, equipment and storage medium
CN113593551B (en) * 2021-07-01 2023-07-25 中国人民解放军63892部队 Objective evaluation method for interference effect of voice communication based on command word recognition
CN113593529B (en) * 2021-07-09 2023-07-25 北京字跳网络技术有限公司 Speaker separation algorithm evaluation method, speaker separation algorithm evaluation device, electronic equipment and storage medium
CN113409826B (en) * 2021-08-04 2023-09-19 美的集团(上海)有限公司 TTS system performance test method, device, equipment and medium
CN114359271B (en) * 2022-03-10 2022-06-03 天津市北海通信技术有限公司 Method and device for detecting image playing quality of train display equipment
CN115293139B (en) * 2022-08-03 2023-06-09 北京中科智加科技有限公司 Training method of speech transcription text error correction model and computer equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102215579B1 (en) * 2014-01-22 2021-02-15 삼성전자주식회사 Interactive system, display apparatus and controlling method thereof
CN106847288B (en) * 2017-02-17 2020-12-25 上海创米科技有限公司 Error correction method and device for voice recognition text
CN108389577B (en) * 2018-02-12 2019-05-31 广州视源电子科技股份有限公司 Optimize method, system, equipment and the storage medium of voice recognition acoustic model
CN109448701A (en) * 2018-09-19 2019-03-08 易诚博睿(南京)科技有限公司 A kind of intelligent sound recognizes the result statistical system and method for semantic understanding
CN110718226B (en) * 2019-09-19 2023-05-05 厦门快商通科技股份有限公司 Speech recognition result processing method and device, electronic equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Unsupervised Adaptation of Categorical Prosody Models for Prosody Labeling and Speech Recognition"; S. Ananthakrishnan et al.; IEEE Transactions on Audio, Speech, and Language Processing (Volume 17, Issue 1, Jan. 2009); 2009-01-31; full text *
"Research on Automatic Stress Annotation in Continuous Chinese Speech" (《汉语连续语流的重音自动标注研究》); Chen Lijiang et al.; Audio Engineering (《电声技术》); 2017-12-31; full text *

Also Published As

Publication number Publication date
CN111681642A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN111681642B (en) Speech recognition evaluation method, device, storage medium and equipment
CN112115706B (en) Text processing method and device, electronic equipment and medium
KR101255402B1 (en) Redictation 0f misrecognized words using a list of alternatives
CN103714048B (en) Method and system for correcting text
US10431201B1 (en) Analyzing messages with typographic errors due to phonemic spellings using text-to-speech and speech-to-text algorithms
CN111951779B (en) Front-end processing method for speech synthesis and related equipment
CN107679032A (en) Voice changes error correction method and device
CN112365878B (en) Speech synthesis method, device, equipment and computer readable storage medium
CN111312209A (en) Text-to-speech conversion processing method and device and electronic equipment
US11763103B2 (en) Video translation method and apparatus, storage medium, and electronic device
CN111666776A (en) Document translation method and device, storage medium and electronic equipment
CN112818680B (en) Corpus processing method and device, electronic equipment and computer readable storage medium
CN113450774A (en) Training data acquisition method and device
CN112906380A (en) Method and device for identifying role in text, readable medium and electronic equipment
CN113378586A (en) Speech translation method, translation model training method, device, medium, and apparatus
US20240079002A1 (en) Minutes of meeting processing method and apparatus, device, and medium
CN114333838A (en) Method and system for correcting voice recognition text
CN111640452A (en) Data processing method and device and data processing device
CN113435198A (en) Automatic correction display method and device for caption dialect words
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN110728137B (en) Method and device for word segmentation
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN113761865A (en) Sound and text realignment and information presentation method and device, electronic equipment and storage medium
CN112951274A (en) Voice similarity determination method and device, and program product
CN112307748A (en) Method and device for processing text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant