WO2022095353A1

WO2022095353A1 - Speech recognition result evaluation method, apparatus and device, and storage medium

Info

Publication number: WO2022095353A1
Application number: PCT/CN2021/090436
Authority: WO
Inventors: 陈益
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-11-04
Filing date: 2021-04-28
Publication date: 2022-05-12
Also published as: CN112151014A; CN112151014B

Abstract

A speech recognition result evaluation method, apparatus, device and storage medium, which relate to the field of artificial intelligence and are used to improve the efficiency of evaluating how accurately initial speech is converted into initial text. The evaluation method comprises: converting initial speech in a video return visit item on the basis of a speech recognition function to obtain initial text (101); performing space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing on the initial text to obtain text to be detected (102); acquiring a word sequence to be detected in the text to be detected, and proofreading and marking the word sequence to be detected according to a preset standard word sequence to obtain proofread text (103); calculating a character recognition error rate of the proofread text by using a preset calculation formula (104); and selecting a preset comparison result by comparing the character recognition error rate with a standard error rate, and determining a conversion evaluation result of the speech-to-text conversion (105). The present application further relates to blockchain technology, and the initial text can be stored in a blockchain.

Description

Evaluation method, device, equipment and storage medium for speech recognition results

This application claims priority to Chinese patent application No. 202011215789.4, filed on November 4, 2020 and entitled "Method, Apparatus, Equipment and Storage Medium for Evaluation of Speech Recognition Results", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of artificial intelligence, and in particular to a method, apparatus, device and storage medium for evaluating a speech recognition result.

Background

Video return visits are one of the ways in which a company maintains its customer relationships. The company's operation and maintenance personnel conduct video return visits with customers so that the company can better understand customer needs. One of the technologies used in a video return visit is speech recognition technology, also called automatic speech recognition (ASR), whose goal is to convert the content of human speech into computer-readable input. That is to say, in a video return visit item, the speech with which the customer replies is recognized by the speech recognition technology and then converted into the corresponding text, which realizes speech recognition for the video return visit. After the speech has been converted into text, the accuracy of the speech-to-text conversion is usually determined by random inspection.

The inventor has realized that checking the speech-to-text conversion by random inspection not only involves complicated steps but also consumes a great deal of time, which in turn makes it inefficient to evaluate the accuracy with which the initial speech is converted into the initial text.

Summary of the invention

The present application provides a method, apparatus, device and storage medium for evaluating a speech recognition result, which are used to improve the efficiency of evaluating the accuracy with which initial speech is converted into initial text.

A first aspect of the present application provides a method for evaluating a speech recognition result, including: acquiring initial speech in a video return visit item, and converting the initial speech based on a speech recognition function to obtain converted initial text; performing space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing on the initial text to obtain text to be detected; obtaining a word sequence to be detected in the text to be detected based on a preset sequence function, proofreading the word sequence to be detected according to a preset standard word sequence, and making proofreading marks in the word sequence to be detected to obtain proofreading text; calculating a character recognition error rate of the proofreading text by using a preset calculation formula; and selecting a preset comparison result by comparing the character recognition error rate with a standard error rate, and determining a conversion evaluation result of the speech-to-text conversion according to the preset comparison result.

A second aspect of the present application provides a device for evaluating a speech recognition result, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions: acquiring initial speech in a video return visit item, and converting the initial speech based on a speech recognition function to obtain converted initial text; performing space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing on the initial text to obtain text to be detected; obtaining a word sequence to be detected in the text to be detected based on a preset sequence function, proofreading the word sequence to be detected according to a preset standard word sequence, and making proofreading marks in the word sequence to be detected to obtain proofreading text; calculating a character recognition error rate of the proofreading text by using a preset calculation formula; and selecting a preset comparison result by comparing the character recognition error rate with a standard error rate, and determining a conversion evaluation result of the speech-to-text conversion according to the preset comparison result.

A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the following steps: acquiring initial speech in a video return visit item, and converting the initial speech based on a speech recognition function to obtain converted initial text; performing space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing on the initial text to obtain text to be detected; obtaining a word sequence to be detected in the text to be detected based on a preset sequence function, proofreading the word sequence to be detected according to a preset standard word sequence, and making proofreading marks in the word sequence to be detected to obtain proofreading text; calculating a character recognition error rate of the proofreading text by using a preset calculation formula; and selecting a preset comparison result by comparing the character recognition error rate with a standard error rate, and determining a conversion evaluation result of the speech-to-text conversion according to the preset comparison result.

A fourth aspect of the present application provides an apparatus for evaluating a speech recognition result, comprising: a conversion module, configured to acquire initial speech in a video return visit item and convert the initial speech based on a speech recognition function to obtain converted initial text; a preprocessing module, configured to perform space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing on the initial text to obtain text to be detected; a proofreading module, configured to obtain a word sequence to be detected in the text to be detected based on a preset sequence function, proofread the word sequence to be detected according to a preset standard word sequence, and make proofreading marks in the word sequence to be detected to obtain proofreading text; a calculation module, configured to calculate a character recognition error rate of the proofreading text by using a preset calculation formula; and a determination module, configured to select a preset comparison result by comparing the character recognition error rate with a standard error rate, and determine a conversion evaluation result of the speech-to-text conversion according to the preset comparison result.

In the technical solution provided by the present application, the initial speech in the video return visit item is acquired and converted based on a speech recognition function to obtain converted initial text; space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing are performed on the initial text to obtain text to be detected; a word sequence to be detected in the text to be detected is obtained based on a preset sequence function, the word sequence to be detected is proofread according to a preset standard word sequence, and proofreading marks are made in the word sequence to be detected to obtain proofreading text; a character recognition error rate of the proofreading text is calculated by using a preset calculation formula; and a preset comparison result is selected by comparing the character recognition error rate with a standard error rate, and a conversion evaluation result of the speech-to-text conversion is determined according to the preset comparison result. In the embodiments of the present application, the initial speech in the video return visit item is converted by the speech recognition function to obtain the initial text; the initial text then undergoes preprocessing, word sequence proofreading and error rate calculation to obtain the character recognition error rate; finally, a preset comparison result is selected by comparing the character recognition error rate with the standard error rate to obtain the conversion evaluation result of the speech-to-text conversion. This improves the efficiency of evaluating the accuracy with which the initial speech is converted into the initial text.

Description of drawings

FIG. 1 is a schematic diagram of an embodiment of a method for evaluating a speech recognition result in an embodiment of the present application;

FIG. 2 is a schematic diagram of another embodiment of a method for evaluating a speech recognition result in an embodiment of the present application;

FIG. 3 is a schematic diagram of an embodiment of a device for evaluating speech recognition results in an embodiment of the present application;

FIG. 4 is a schematic diagram of another embodiment of the apparatus for evaluating the speech recognition result in the embodiment of the present application;

FIG. 5 is a schematic diagram of an embodiment of a device for evaluating a speech recognition result in an embodiment of the present application.

Detailed description

Embodiments of the present application provide a method, apparatus, device and storage medium for evaluating a speech recognition result, which are used to improve the efficiency of evaluating the accuracy with which initial speech is converted into initial text.

For ease of understanding, the specific process of the embodiment of the present application will be described below. Please refer to FIG. 1. An embodiment of the method for evaluating the speech recognition result in the embodiment of the present application includes:

101. Acquire the initial voice in the video return visit item, and convert the initial voice based on the voice recognition function to obtain the converted initial text;

It can be understood that the execution subject of the present application may be a device for evaluating a speech recognition result, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present application take the server as an execution subject as an example for description.

The server collects the initial voice in the video return visit through a voice collector. The initial voice refers to the voice of a call or dialogue in the video return visit item, and its content can cover different business contents. The format of the initial voice can be the CD audio track format (CDA), the WAVE format, the audio interchange file format (AIFF) or the Moving Picture Experts Group Audio Layer III format (MP3). The format of the initial voice is not limited here.

After the server collects the initial voice, it converts the initial voice into text form through the speech recognition function to obtain the initial text. Since the accuracy with which the speech recognition system converts speech into text is not 100%, the server needs to process the initial text and check the accuracy with which the initial voice was converted into the initial text.

It should be noted that the initial text of the initial speech conversion is saved in the project log file through the speech recognition function. It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned initial text, the above-mentioned initial text can also be stored in a node of a blockchain.

102. Perform preprocessing of deleting space characters, sorting preprocessing and deleting punctuation characters on the initial text to obtain the text to be detected;

Before checking the initial text, the server needs to preprocess it to obtain the text to be detected. The preprocessing reduces interference with the character recognition error rate that the server calculates in the subsequent steps for the conversion of the initial speech into the initial text.

103. Obtain the sequence of words to be detected in the text to be detected based on a preset sequence function, proofread the sequence of words to be detected according to the preset standard sequence of words, and perform proofreading marks in the sequence of words to be detected to obtain proofreading text;

After the server obtains the preprocessed text to be detected, it needs to obtain the word sequence to be detected in the text to be detected and proofread it against a preset standard word sequence. There are many preset standard word sequences. The server first calculates the basic similarity between the word sequence to be detected and each preset standard word sequence, takes the largest basic similarity as the target similarity, and takes the preset standard word sequence corresponding to the target similarity as the target standard word sequence. The server then determines the relationship between the number of characters of the word sequence to be detected and the number of characters of the target standard word sequence, and on that basis proofreads the word sequence to be detected in the text to be detected to obtain the final proofreading text.
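The selection of the target standard word sequence described above can be sketched as follows. The source does not say how the basic similarity is computed, so this sketch assumes the ratio from Python's `difflib.SequenceMatcher` as a stand-in; the function name `pick_target_standard` and the sample sequences are hypothetical.

```python
from difflib import SequenceMatcher


def pick_target_standard(detected, standards):
    """Return (target_standard, target_similarity) for a word sequence to be detected.

    Assumption: similarity is difflib's matching ratio; the source leaves the
    similarity measure unspecified.
    """
    if not standards:
        raise ValueError("at least one preset standard word sequence is required")
    # Score every preset standard word sequence against the detected one.
    scored = [(SequenceMatcher(None, detected, s).ratio(), s) for s in standards]
    # The largest basic similarity becomes the target similarity.
    target_similarity, target_standard = max(scored, key=lambda pair: pair[0])
    return target_standard, target_similarity


# Hypothetical word sequences, for illustration only.
standards = [
    ["your company", "good service"],
    ["I", "temporarily", "short of money"],
]
best, score = pick_target_standard(["your company", "good service"], standards)
```

Any other similarity measure could be substituted without changing the selection logic, which only needs a score per preset standard word sequence.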

104. Use a preset calculation formula to calculate the character recognition error rate of the proofreading text;

After the server obtains the proofreading text, it calculates the character recognition error rate of the proofreading text with the preset calculation formula. The character recognition error rate is the error rate of converting the initial speech into the initial text; that is, it indicates how many incorrectly converted characters arise when the initial speech is converted into the initial text. The incorrectly converted characters are one of the factors for judging the conversion efficiency.

105. Select a preset comparison result by comparing the character recognition error rate with the standard error rate, and determine a conversion evaluation result of the speech-to-text conversion according to the preset comparison result.

After the server obtains the character recognition error rate, it compares the value of the character recognition error rate with the value of the standard error rate to determine the conversion evaluation result of the speech-to-text conversion. The comparison results here include a first comparison result and a second comparison result, where the first comparison result is that the accuracy of the speech-to-text conversion is low, and the second comparison result is that the accuracy of the speech-to-text conversion is high. When the character recognition error rate is greater than the standard error rate, the first comparison result is selected and determined as the conversion evaluation result of the speech-to-text conversion; when the character recognition error rate is less than or equal to the standard error rate, the second comparison result is selected and determined as the conversion evaluation result of the speech-to-text conversion.
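The comparison in step 105 amounts to a single threshold test. The following minimal sketch illustrates it; the function name, the result strings and the numeric rates are hypothetical and are not taken from the source.

```python
# Hypothetical labels for the two preset comparison results.
FIRST_COMPARISON_RESULT = "accuracy of the speech-to-text conversion is low"
SECOND_COMPARISON_RESULT = "accuracy of the speech-to-text conversion is high"


def evaluate_conversion(char_error_rate, standard_error_rate):
    """Select the preset comparison result used as the conversion evaluation result."""
    if char_error_rate > standard_error_rate:
        # Error rate above the standard: first comparison result.
        return FIRST_COMPARISON_RESULT
    # Error rate less than or equal to the standard: second comparison result.
    return SECOND_COMPARISON_RESULT


# Illustrative values only; the source gives no concrete standard error rate.
result_high_error = evaluate_conversion(0.12, 0.05)
result_equal = evaluate_conversion(0.05, 0.05)
```

Note that an error rate exactly equal to the standard error rate falls on the "high accuracy" side, as the text specifies.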

In the embodiment of the present application, the initial speech in the video return visit item is converted by the speech recognition function to obtain the initial text; the initial text then undergoes preprocessing, word sequence proofreading and error rate calculation to obtain the character recognition error rate; finally, a preset comparison result is selected by comparing the character recognition error rate with the standard error rate to obtain the conversion evaluation result of the speech-to-text conversion. This improves the efficiency of evaluating the accuracy with which the initial speech is converted into the initial text.

Referring to FIG. 2, another embodiment of the method for evaluating the speech recognition result in the embodiment of the present application includes:

201. Obtain the initial voice in the video return visit project, and convert the initial voice based on the voice recognition function to obtain the converted initial text;

Specifically, the server first obtains the initial voice in the video return visit item, inputs the initial voice into the speech recognition function, and extracts the voice features of the initial voice through the speech recognition function. The server then converts the voice features into phoneme information through a preset translation model, where the phoneme information indicates the smallest phonetic units that make up a syllable. Finally, the server matches the phoneme information with a preset standard text to generate the initial text corresponding to the initial voice.

After the server obtains the initial voice in the video return visit item, it needs to use the speech recognition function to recognize and convert the initial voice. The main principle of the speech recognition function is as follows. The server first collects a large number of voice samples for training, analyzes and integrates the voice feature parameters of each voice in the samples, and builds voice feature templates of the voice feature parameters in a voice comparison library. The server then obtains the voice information to be recognized, applies the same processing to it to obtain the target voice parameters, uses a decision method to match the target voice parameters against the voice feature parameters, and thereby determines the speech recognition result. Throughout the speech recognition process, recognition frameworks such as the dynamic time warping method based on pattern matching and the hidden Markov model method based on statistical models are used, so that multiple initial sentences of multiple target voices can be converted conveniently and quickly.
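As an illustration of the pattern-matching idea mentioned above, the following sketch computes a classic dynamic time warping (DTW) distance and picks the closest template. It is a simplified textbook version, not the implementation of the application: it assumes one scalar feature per frame, and the function names and template data are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of scalar features."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def closest_template(features, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))


# Hypothetical voice feature templates, for illustration only.
templates = {"a": [1, 2, 3], "b": [5, 5, 5]}
label = closest_template([1, 2, 2, 3], templates)
```

DTW tolerates differences in speaking rate: the sequence `[1, 2, 2, 3]` aligns with the template `[1, 2, 3]` at zero cost even though the two have different lengths.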

It can be understood that a phoneme is the smallest phonetic unit divided according to the natural attributes of speech. Speech is analyzed according to the pronunciation actions within a syllable, and each pronunciation action constitutes one phoneme. By analyzing the phoneme units and matching the phoneme information with the preset standard text, the phoneme information can be combined into text information more accurately.

For example, take the recognition and conversion of the target voice "Your company's service is good". The server first obtains the target voice and extracts its voice features, for example: [1 2 8 4 7 6 0 9 3]. The server then converts the extracted voice features into phoneme information through the acoustic model, for example: g u i g o n g s i f u w u h a o. After obtaining the phoneme information, the server looks up the characters corresponding to the phoneme information in the preset dictionary, for example: cabinet: g u i; expensive: g u i; worker: g o n g; public: g o n g; four: s i; company: s i; service: f u; service: w u; good: h a o. The server then obtains the association probabilities between pieces of text information from the preset association probabilities, for example: expensive: 0.1786; public: 0.0546; company: 0.7898; service: 0.8967; good: 0.3982; good service: 0.6785. Finally, the server selects the text information with the highest association probability as the target text; the higher the association probability, the more likely the sentence is to occur. The server combines the target texts in order to obtain the target sentence, for example: Your company serves well.

It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned initial text, the above-mentioned initial text can also be stored in a node of a blockchain.

202. Perform preprocessing of deleting space characters, sorting preprocessing, and deleting punctuation characters on the initial text to obtain the text to be detected;

Specifically, the server first obtains the text characters of the initial text and determines whether there are space characters between the text characters. If there are, the server deletes the space characters and takes the text characters remaining after the deletion as the first preprocessed text characters. The server then obtains the positions of the punctuation characters in the first preprocessed text characters, takes the character following each punctuation character as the first character of the next line, and sorts the first preprocessed text characters into segments to obtain the second preprocessed text characters, where punctuation characters are symbols that assist the written recording of language. Finally, the server deletes the punctuation characters from the second preprocessed text characters and takes the remaining characters as the target text characters, thereby obtaining the text to be detected.

During preprocessing, the server first deletes the space characters between the text characters of the initial text to obtain the first preprocessed text characters, which prevents garbled characters and makes it easier for the server to sort the text characters. The server then sorts the text characters according to the positions of the punctuation characters in the first preprocessed text characters, ensuring that each line after sorting contains one punctuation character and at least one text character, and obtains the second preprocessed text characters; sorting in this way makes the text characters easier to proofread. Finally, the server deletes the punctuation characters from the second preprocessed text characters and takes the remaining characters as the target text characters, thereby obtaining the text to be detected. Punctuation characters only assist the written recording of language, and whether they are recognized correctly does not affect the accuracy of the text characters, so they need to be removed.
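The three preprocessing passes described above can be sketched as follows. This is a minimal illustration under stated assumptions: the punctuation set (ASCII punctuation plus a few common Chinese marks), the function name `preprocess` and the sample sentence are all hypothetical, and the source does not specify which characters count as punctuation.

```python
import string

# Assumption: ASCII punctuation plus a few common full-width Chinese marks.
PUNCTUATION = set(string.punctuation) | set("，。！？；：、")


def preprocess(initial_text):
    """Return the text to be detected as a list of punctuation-free lines."""
    # Pass 1: delete space characters (first preprocessed text characters).
    no_space = "".join(ch for ch in initial_text if not ch.isspace())

    # Pass 2: the character after each punctuation character starts a new line
    # (second preprocessed text characters).
    lines, current = [], []
    for ch in no_space:
        current.append(ch)
        if ch in PUNCTUATION:
            lines.append("".join(current))
            current = []
    if current:
        lines.append("".join(current))

    # Pass 3: delete punctuation characters and drop lines left empty.
    cleaned = ["".join(ch for ch in line if ch not in PUNCTUATION) for line in lines]
    return [line for line in cleaned if line]


# Hypothetical sample input, for illustration only.
text_to_detect = preprocess("贵公司 服务好。我暂时 不缺钱！")
```

Keeping the result as one string per line preserves the segmentation produced by the sorting pass, so each line can later be proofread against a standard word sequence on its own.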

203. Obtain the word sequence to be detected in the text to be detected based on the preset sequence function, compare the word sequence to be detected with the preset standard word sequence, and determine the number of characters of the word sequence to be detected and the preset standard word The relationship between the number of characters of the sequence;

Specifically, the server first obtains the basic text characters in the text to be detected and an initial observation sequence, where the initial observation sequence indicates the text character sequence of the basic text characters. Next, the server divides the basic text characters into predicted observation sequences according to the division rules in the preset sequence function, where a predicted observation sequence indicates a combination of the text character sequence. The server then uses a preset conditional probability formula to calculate the basic conditional probability that the basic text characters are arranged according to each predicted observation sequence, given the arrangement of the initial observation sequence. The preset conditional probability formula is $S^{*} = \arg\max_{S} P(S \mid O)$, where $S^{*}$ is the target observation sequence; $S = (s_1, s_2, \ldots, s_T)$ is the predicted observation sequence; $T$ is the length of the initial observation sequence; $s_1$ is the first word sequence obtained by dividing the basic text characters according to the predicted observation sequence; $O = (o_1, o_2, \ldots, o_T)$ is the initial observation sequence; and $o_1$ is the first word sequence obtained by dividing the basic text characters according to the initial observation sequence. The server takes the predicted observation sequence with the largest basic conditional probability as the target observation sequence, and divides the basic text characters according to the target observation sequence to obtain the word sequence to be detected. Finally, the server compares the word sequence to be detected with the preset standard word sequence and determines the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence.

For example, the text to be detected is "Your company serves well" and the basic text characters are "Your/Company/Company/Service/Service/Good", where each character is one text character. The initial observation sequence is "Your/..."; the initial observation sequence here indicates the text character sequence of the basic text characters. Next, the server divides the basic text characters into predicted observation sequences through the division rules in the preset sequence function; the resulting predicted observation sequences may be "your/company/good service", "your company/service/good" and "your company/good service". The server then uses the preset conditional probability formula to calculate the basic conditional probability that the basic text characters are arranged according to each predicted observation sequence, given the arrangement of the initial observation sequence: the basic conditional probability of "your/company/good service" is 0.682, that of "your company/service/good" is 0.798, and that of "your company/good service" is 0.865. The server selects the predicted observation sequence corresponding to the basic conditional probability of 0.865 as the target observation sequence, and divides the basic text characters "your/company/company/service/service/good" according to it to obtain the word sequence to be detected, "your company/good service". Finally, the server compares the word sequence to be detected with the preset standard word sequence to determine the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence.
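The selection step in this example is an argmax over the candidate segmentations. The following sketch uses the three probabilities given in the example; how those probabilities are computed is not shown here, and the function name and the tuple representation of a segmentation are hypothetical.

```python
def select_target_sequence(scored_segmentations):
    """Return the (segmentation, probability) pair with the largest probability."""
    return max(scored_segmentations, key=lambda item: item[1])


# Candidate predicted observation sequences with the basic conditional
# probabilities from the example above.
candidates = [
    (("your", "company", "good service"), 0.682),
    (("your company", "service", "good"), 0.798),
    (("your company", "good service"), 0.865),
]

target, probability = select_target_sequence(candidates)
# target is the word sequence to be detected: ("your company", "good service")
```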

204. If the number of characters of the word sequence to be detected is greater than the number of characters of the preset standard word sequence, mark the preset insertion character at the position of the word sequence to be detected;

When the number of characters of the word sequence to be detected is greater than the number of characters of the preset standard word sequence, the word sequence to be detected recognized by the server contains more characters than the preset standard word sequence; that is, the word sequence to be detected contains redundant inserted characters. The server therefore marks the preset insertion character at the position of the word sequence to be detected.

For example, the known standard text is "I am short of money temporarily", and the corresponding preset standard word sequence has 5 characters; the recognized text to be detected is "I am not short of money temporarily", and the corresponding word sequence to be detected has 6 characters. The server directly marks the preset insertion character at the position of the word sequence to be detected.

205. If the number of characters of the word sequence to be detected is less than the number of characters of the preset standard word sequence, mark the preset deletion character on the position of the word sequence to be detected;

When the number of characters of the word sequence to be detected is less than the number of characters of the preset standard word sequence, the word sequence to be detected recognized by the server contains fewer characters than the preset standard word sequence; that is, characters are missing from the word sequence to be detected. The server therefore marks the preset deletion character at the position of the word sequence to be detected.

For example, the known standard text is "I am not short of money temporarily", and the corresponding preset standard word sequence has 6 characters; the recognized text to be detected is "I am short of money temporarily", and the corresponding word sequence to be detected has 5 characters. The server directly marks the preset deletion character at the position of the word sequence to be detected.

206. If the number of characters of the word sequence to be detected is equal to the number of characters of the preset standard word sequence, determine whether the word sequence to be detected is the same as the preset standard word sequence;

When the number of characters of the word sequence to be detected is equal to the number of characters of the preset standard word sequence, the text to be detected recognized by the server may be the same as the standard text, and it is necessary to further determine whether the word sequence to be detected is the same as the preset standard word sequence. The standard text here is the text content corresponding to the preset standard word sequence.

207. If the word sequence to be detected is different from the preset standard word sequence, mark the preset replacement character on the position of the word sequence to be detected, and determine the text to be detected after the proofreading mark as proofreading text;

When the word sequence to be detected is not the same as the preset standard word sequence, the corresponding text to be detected is not the same as the standard text; that is, the text to be detected contains replaced characters. The server marks the preset replacement character at the position of the word sequence to be detected, and then determines the text to be detected, after proofreading marking, as the proofreading text.

For example, suppose the known standard text is "I am not short of money temporarily", so that the preset standard word sequence has 6 characters, and the recognized text to be detected is "I am short of money temporarily", with the word sequence to be detected having 6 characters. The server determines whether the word sequence to be detected is the same as the preset standard word sequence, and if it detects that the two differ, it marks the preset replacement character at the position of the word sequence to be detected. Finally, the server determines the text to be detected, marked with the preset insertion characters, the preset deletion characters and the preset replacement characters, as the proofreading text.
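The decision logic of steps 204 to 207 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `proofread_mark` and the marker symbols `+`, `-` and `~` are assumptions, since the text only refers to "preset" insertion, deletion and replacement characters.

```python
from typing import List, Optional

# Hypothetical marker symbols; the source only calls them "preset" characters.
INSERT_MARK, DELETE_MARK, REPLACE_MARK = "+", "-", "~"


def proofread_mark(detected: List[str], standard: List[str]) -> Optional[str]:
    """Pick the marker described in steps 204-207, or None if nothing differs."""
    if len(detected) > len(standard):
        # Step 204: more characters than the standard -> insertion mark.
        return INSERT_MARK
    if len(detected) < len(standard):
        # Step 205: fewer characters than the standard -> deletion mark.
        return DELETE_MARK
    if detected != standard:
        # Steps 206-207: same length but different content -> replacement mark.
        return REPLACE_MARK
    return None


# Placeholder sequences, used only to exercise each branch.
print(proofread_mark(list("abcdef"), list("abcde")))  # insertion branch
print(proofread_mark(list("abcde"), list("abcdef")))  # deletion branch
print(proofread_mark(list("abcxe"), list("abcde")))   # replacement branch
```

The sketch compares only total lengths and overall equality, as the steps are worded; it does not attempt to locate the exact position of a differing character.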

208. Use a preset calculation formula to calculate the character recognition error rate of the proofreading text;

Specifically, the server counts the number of inserted characters, the number of deleted characters, the number of replaced characters, and the number of characters in the proofreading text, and inputs these four counts into the preset calculation formula to obtain the character recognition error rate of the proofreading text, wherein the preset calculation formula is:

In the formula, WER is the character recognition error rate, i is the number of inserted characters, s is the number of replaced characters, d is the number of deleted characters, and t is the number of characters in the proofreading text.
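The formula itself is not reproduced in this text. Assuming it follows the conventional error-rate definition over the symbols just defined (an assumption, not a statement confirmed by the source), it would read:

```latex
\mathrm{WER} = \frac{i + s + d}{t}
```

Under this assumption the result is a ratio that can be expressed as a percentage when it is compared with the standard error rate.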

Before the server calculates the character recognition error rate of the proofreading text, it first needs to determine the number of inserted characters, the number of deleted characters, the number of replaced characters and the number of characters in the proofreading text; only with these variables and the preset calculation formula can the character recognition error rate be calculated. The number of inserted characters is obtained by counting the preset insertion characters, the number of deleted characters is obtained by counting the preset deletion characters, the number of replaced characters is obtained by counting the preset replacement characters, and the number of characters in the proofreading text is obtained by directly counting the characters in the proofreading text. Inputting these values into the preset calculation formula yields the character recognition error rate of the proofreading text.
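The counting step can be illustrated with a short sketch. Several details here are assumptions rather than facts from the source: the marker symbols `+`, `-` and `~`, the choice to exclude markers from the character count t, and the formula (i + s + d) / t, which is the conventional definition and is assumed because the formula is not reproduced in this text.

```python
from typing import Tuple

# Hypothetical single-character markers, assumed not to occur in the text itself.
INSERT_MARK, DELETE_MARK, REPLACE_MARK = "+", "-", "~"


def count_marks(proofread_text: str) -> Tuple[int, int, int, int]:
    """Return (i, s, d, t) counted from a marked proofreading text."""
    i = proofread_text.count(INSERT_MARK)
    d = proofread_text.count(DELETE_MARK)
    s = proofread_text.count(REPLACE_MARK)
    # Assumption: t counts the text characters only, not the markers.
    t = len(proofread_text) - (i + d + s)
    return i, s, d, t


def character_error_rate(i: int, s: int, d: int, t: int) -> float:
    """Assumed formula: (i + s + d) / t."""
    if t <= 0:
        raise ValueError("the proofreading text must contain at least one character")
    return (i + s + d) / t


i, s, d, t = count_marks("ab+cd~e-")
print(i, s, d, t, character_error_rate(i, s, d, t))
```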

209. Select a preset comparison result by comparing the character recognition error rate with the standard error rate, and determine a conversion evaluation result of the speech-to-text conversion according to the preset comparison result.

Specifically, the server compares the character recognition error rate with the standard error rate and determines whether the character recognition error rate is greater than the standard error rate. If the character recognition error rate is greater than the standard error rate, the server determines the preset first comparison result as the conversion evaluation result of the speech-to-text conversion, wherein the preset first comparison result is that the accuracy of the speech-to-text conversion is low. If the character recognition error rate is not greater than the standard error rate, the server determines the preset second comparison result as the conversion evaluation result of the speech-to-text conversion, wherein the preset second comparison result is that the accuracy of the speech-to-text conversion is high.

It can be understood that the standard error rate here is the threshold against which the conversion of the initial speech into the initial text is judged. The value of the standard error rate can be, for example, 60% or 88%. This application does not limit the value of the standard error rate, which can be set according to the actual situation.
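Step 209 reduces to a single threshold comparison, sketched below. The function name `evaluate_conversion` and the result strings are illustrative assumptions; the default threshold of 0.60 is simply one of the example values mentioned above.

```python
def evaluate_conversion(error_rate: float, standard_error_rate: float = 0.60) -> str:
    """Select the preset comparison result for a character recognition error rate."""
    if error_rate > standard_error_rate:
        # Preset first comparison result.
        return "low accuracy"
    # Preset second comparison result (error rate not greater than the standard).
    return "high accuracy"


print(evaluate_conversion(0.75))        # above the 0.60 threshold
print(evaluate_conversion(0.60))        # equal to the threshold
print(evaluate_conversion(0.75, 0.88))  # below a 0.88 threshold
```

An error rate exactly equal to the standard error rate falls into the second result, matching the "not greater than" wording of the step.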

The method for evaluating the speech recognition result in the embodiment of the present application has been described above; the apparatus for evaluating the speech recognition result in the embodiment of the present application is described below. Referring to FIG. 3, an embodiment of the apparatus for evaluating the speech recognition result in the embodiment of the present application includes: a conversion module 301, configured to obtain the initial voice in the video return visit project and convert the initial voice based on the speech recognition function to obtain the converted initial text; a preprocessing module 302, configured to perform space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing on the initial text to obtain the text to be detected; a proofreading module 303, configured to obtain the word sequence to be detected in the text to be detected based on a preset sequence function, proofread the word sequence to be detected according to the preset standard word sequence, and perform proofreading marking in the word sequence to be detected to obtain proofreading text; a calculation module 304, configured to calculate the character recognition error rate of the proofreading text by using a preset calculation formula; and a determining module 305, configured to select a preset comparison result by comparing the character recognition error rate with the standard error rate, and determine the conversion evaluation result of the speech-to-text conversion according to the preset comparison result.

Referring to FIG. 4 , another embodiment of the apparatus for evaluating speech recognition results in the embodiment of the present application includes:

a conversion module 301, configured to obtain the initial voice in the video return visit project and convert the initial voice based on the speech recognition function to obtain the converted initial text; a preprocessing module 302, configured to perform space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing on the initial text to obtain the text to be detected; a proofreading module 303, configured to obtain the word sequence to be detected in the text to be detected based on a preset sequence function, proofread the word sequence to be detected according to the preset standard word sequence, and perform proofreading marking in the word sequence to be detected to obtain proofreading text; a calculation module 304, configured to calculate the character recognition error rate of the proofreading text by using a preset calculation formula; and a determining module 305, configured to select a preset comparison result by comparing the character recognition error rate with the standard error rate, and determine the conversion evaluation result of the speech-to-text conversion according to the preset comparison result.

Optionally, the proofreading module 303 includes: a comparison unit 3031, configured to obtain the word sequence to be detected in the text to be detected based on a preset sequence function, compare the word sequence to be detected with the preset standard word sequence, and determine the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence; a first marking unit 3032, configured to mark the preset insertion character at the position of the word sequence to be detected if the number of characters of the word sequence to be detected is greater than the number of characters of the preset standard word sequence; a second marking unit 3033, configured to mark the preset deletion character at the position of the word sequence to be detected if the number of characters of the word sequence to be detected is less than the number of characters of the preset standard word sequence; a judgment unit 3034, configured to judge whether the word sequence to be detected is the same as the preset standard word sequence if the number of characters of the word sequence to be detected is equal to the number of characters of the preset standard word sequence; and a third marking unit 3035, configured to mark the preset replacement character at the position of the word sequence to be detected if the word sequence to be detected is different from the preset standard word sequence, and to determine the text to be detected after proofreading marking as the proofreading text.

Optionally, the comparison unit 3031 is specifically configured to: acquire the basic text characters in the text to be detected and an initial observation sequence, where the initial observation sequence indicates the text character sequence of the basic text characters; divide the basic text characters into predicted observation sequences according to the division rule in the preset sequence function, where the predicted observation sequences indicate combinations of the text character sequence; use a preset conditional probability formula to calculate the basic conditional probability that the basic text characters, given the arrangement of the initial observation sequence, are arranged according to each predicted observation sequence, wherein the preset conditional probability formula is $S^{*} = \arg\max P(S \mid O)$, where $S^{*}$ is the target observation sequence, $S$ is the predicted observation sequence with $S = (s_1, s_2, \ldots, s_T)$, $T$ is the length of the initial observation sequence, $s_1$ is the first word sequence obtained by dividing the basic text characters according to the predicted observation sequence, $O$ is the initial observation sequence with $O = (o_1, o_2, \ldots, o_T)$, and $o_1$ is the first word sequence obtained by dividing the basic text characters according to the initial observation sequence; take the predicted observation sequence corresponding to the target conditional probability, that is, the largest basic conditional probability, as the target observation sequence; divide the basic text characters according to the target observation sequence to obtain the word sequence to be detected; and compare the word sequence to be detected with the preset standard word sequence to determine the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence.
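The selection of the target observation sequence can be illustrated with a toy sketch. The source does not specify the division rule or the probability model, so everything below is an assumption: candidate divisions are enumerated exhaustively, and the product of hypothetical per-word probabilities (`word_prob`, with a small `floor` for unknown words) stands in for P(S | O).

```python
from itertools import combinations
from math import prod
from typing import Dict, Iterator, List


def segmentations(chars: str) -> Iterator[List[str]]:
    """Enumerate every way of cutting chars into contiguous word sequences."""
    n = len(chars)
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [chars[a:b] for a, b in zip(bounds, bounds[1:])]


def best_segmentation(chars: str, word_prob: Dict[str, float],
                      floor: float = 1e-6) -> List[str]:
    """Return the candidate division with the largest stand-in score."""
    if not chars:
        return []
    return max(segmentations(chars),
               key=lambda seg: prod(word_prob.get(w, floor) for w in seg))


# Hypothetical probabilities, used only to exercise the argmax.
toy_prob = {"ab": 0.5, "cd": 0.4, "a": 0.1, "b": 0.1, "c": 0.1, "d": 0.1}
print(best_segmentation("abcd", toy_prob))
```

Exhaustive enumeration grows exponentially with text length, so it serves only to make the argmax concrete; a practical system would use dynamic programming or a trained sequence model.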

Optionally, the conversion module 301 is specifically configured to: obtain the initial voice in the video return visit project, input the initial voice into the speech recognition function, and extract the speech features in the initial voice through the speech recognition function; convert the speech features into phoneme information through a preset translation model, wherein the phoneme information indicates the smallest phonetic unit that constitutes a phonetic syllable; and match the phoneme information with the preset standard text to generate the initial text corresponding to the initial voice.

Optionally, the preprocessing module 302 is specifically configured to: obtain the text characters of the initial text, and determine whether there are space characters between the text characters; if there are space characters between the text characters, delete the space characters and determine the text characters remaining after the deletion as first preprocessed text characters; obtain the positions of the punctuation characters in the first preprocessed text characters, take the character following each punctuation character as the first character of the next line, and segment and sort the first preprocessed text characters to obtain second preprocessed text characters, wherein punctuation characters are symbols that assist in recording written language; and delete the punctuation characters from the second preprocessed text characters, determine the second preprocessed text characters remaining after the deletion as target text characters, and obtain the text to be detected.
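The three preprocessing stages can be sketched as follows. This is an illustrative reading of the steps, not the applicant's code: the function name `preprocess` is hypothetical, and identifying punctuation through Unicode general categories beginning with "P" is an assumption about what counts as a punctuation character.

```python
import unicodedata
from typing import List


def is_punctuation(ch: str) -> bool:
    """Assumption: punctuation = any Unicode category starting with 'P'."""
    return unicodedata.category(ch).startswith("P")


def preprocess(initial_text: str) -> List[str]:
    """Return the lines of the text to be detected."""
    # Stage 1: delete space characters.
    first = "".join(ch for ch in initial_text if not ch.isspace())
    # Stage 2: the character after each punctuation character starts a new line.
    lines, current = [], []
    for ch in first:
        current.append(ch)
        if is_punctuation(ch):
            lines.append("".join(current))
            current = []
    if current:
        lines.append("".join(current))
    # Stage 3: delete the punctuation characters and drop lines left empty.
    cleaned = ["".join(ch for ch in line if not is_punctuation(ch)) for line in lines]
    return [line for line in cleaned if line]


print(preprocess("Hello, world. How are you?"))
```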

Optionally, the calculation module 304 is specifically configured to: count the number of inserted characters, the number of deleted characters, the number of replaced characters and the number of characters in the proofreading text; and input the number of inserted characters, the number of deleted characters, the number of replaced characters and the number of characters in the proofreading text into a preset calculation formula to obtain the character recognition error rate of the proofreading text, wherein the preset calculation formula is:

Optionally, the determining module 305 is specifically configured to: compare the character recognition error rate with the standard error rate, and determine whether the character recognition error rate is greater than the standard error rate; if the character recognition error rate is greater than the standard error rate, determine the preset first comparison result as the conversion evaluation result of the speech-to-text conversion, wherein the preset first comparison result is that the accuracy of the speech-to-text conversion is low; and if the character recognition error rate is not greater than the standard error rate, determine the preset second comparison result as the conversion evaluation result of the speech-to-text conversion, wherein the preset second comparison result is that the accuracy of the speech-to-text conversion is high.

Figures 3 and 4 above describe in detail the apparatus for evaluating speech recognition results in the embodiment of the present application from the perspective of modular functional entities, and the following describes the device for evaluating speech recognition results in the embodiment of the present application in detail from the perspective of hardware processing.

FIG. 5 is a schematic structural diagram of a device for evaluating a speech recognition result provided by an embodiment of the present application. The device 500 for evaluating a speech recognition result may vary greatly depending on configuration or performance, and may include one or more processors (central processing units, CPU) 510, a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) that store application programs 533 or data 532. The memory 520 and the storage medium 530 may be short-term storage or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the device 500 for evaluating the speech recognition result. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the device 500 for evaluating the speech recognition result, the series of instruction operations in the storage medium 530.

The device 500 for evaluating speech recognition results may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input and output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the device for evaluating speech recognition results, which may include more or fewer components than those shown in the figure, combine some components, or use a different arrangement of components.

The present application also provides a device for evaluating speech recognition results. The device includes a memory and a processor, and the memory stores computer-readable instructions. When the computer-readable instructions are executed by the processor, the processor performs the steps of the method for evaluating speech recognition results in the above embodiments.

The present application also provides a computer-readable storage medium, and the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer performs the following steps:

acquiring the initial voice in the video return visit project, and converting the initial voice based on the speech recognition function to obtain the converted initial text; performing space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing on the initial text to obtain the text to be detected; obtaining the word sequence to be detected in the text to be detected based on the preset sequence function, proofreading the word sequence to be detected according to the preset standard word sequence, and performing proofreading marking in the word sequence to be detected to obtain proofreading text; calculating the character recognition error rate of the proofreading text by using a preset calculation formula; and selecting a preset comparison result by comparing the character recognition error rate with the standard error rate, and determining the conversion evaluation result of the speech-to-text conversion according to the preset comparison result.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.

As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions in the embodiments of the present application.

Claims

A method for evaluating a speech recognition result, the method comprising:

Obtain the initial voice in the video return visit project, and convert the initial voice based on the voice recognition function to obtain the converted initial text;

Performing preprocessing to delete space characters, sorting preprocessing and deleting punctuation characters to the initial text to obtain text to be detected;

Obtain the word sequence to be detected in the text to be detected based on the preset sequence function, proofread the word sequence to be detected according to the preset standard word sequence, and carry out proofreading marking in the word sequence to be detected to obtain proofread text;

Using a preset calculation formula to calculate the character recognition error rate of the proofreading text;

A preset comparison result is selected by comparing the character recognition error rate with the standard error rate, and the conversion evaluation result of the speech-to-text conversion is determined according to the preset comparison result.
The method for evaluating speech recognition results according to claim 1, wherein obtaining the word sequence to be detected in the text to be detected based on the preset sequence function, proofreading the word sequence to be detected according to the preset standard word sequence, and performing proofreading marking in the word sequence to be detected to obtain proofread text comprises:

Obtain the word sequence to be detected in the text to be detected based on the preset sequence function, compare the word sequence to be detected with the preset standard word sequence, and determine the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence;

If the number of characters of the word sequence to be detected is greater than the number of characters of the preset standard word sequence, then mark the preset insertion character at the position of the word sequence to be detected;

If the number of characters of the word sequence to be detected is less than the number of characters of the preset standard word sequence, mark a preset deletion character at the position of the word sequence to be detected;

If the number of characters of the to-be-detected word sequence is equal to the number of characters of the preset standard word sequence, then determine whether the to-be-detected word sequence is the same as the preset standard word sequence;

If the word sequence to be detected is different from the preset standard word sequence, a preset replacement character is marked on the position of the word sequence to be detected, and the text to be detected after the proofreading mark is determined as proofreading text.
The method for evaluating speech recognition results according to claim 2, wherein obtaining the word sequence to be detected in the text to be detected based on the preset sequence function, comparing the word sequence to be detected with the preset standard word sequence, and determining the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence comprises:

acquiring basic text characters in the text to be detected and an initial observation sequence, where the initial observation sequence is used to indicate a text character sequence of the basic text characters;

The basic text characters are divided into predicted observation sequences according to the division rules in the preset sequence function, and the predicted observation sequences are used to indicate the combination of the text character sequences;

Use a preset conditional probability formula to calculate the basic conditional probability that the basic text characters are arranged according to the predicted observation sequence under the arrangement condition of the initial observation sequence, wherein the preset conditional probability formula is:

$S^{*} = \arg\max P(S \mid O)$, where $S^{*}$ is the target observation sequence, $S$ is the predicted observation sequence with $S = (s_1, s_2, \ldots, s_T)$, $T$ is the length of the initial observation sequence, $s_1$ is the first word sequence obtained by dividing the basic text characters according to the predicted observation sequence, $O$ is the initial observation sequence with $O = (o_1, o_2, \ldots, o_T)$, and $o_1$ is the first word sequence obtained by dividing the basic text characters according to the initial observation sequence;

Taking the predicted observation sequence corresponding to the target conditional probability with the largest value of the basic conditional probability as the target observation sequence;

Divide the basic text characters according to the target observation sequence to obtain a word sequence to be detected;

The word sequence to be detected is compared with the preset standard word sequence, and the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence is determined.
The method for evaluating a speech recognition result according to claim 1, wherein obtaining the initial voice in the video return visit project and converting the initial voice based on the voice recognition function to obtain the converted initial text comprises:

Obtain the initial voice in the video return visit project, and input the initial voice into the voice recognition function, and extract the voice features in the initial voice through the voice recognition function;

Convert the phonetic features into phoneme information through a preset translation model, wherein the phoneme information is used to indicate the smallest phonetic unit that constitutes a phonetic syllable;

The phoneme information is matched with a preset standard text to generate an initial text corresponding to the initial speech.
The method for evaluating a speech recognition result according to claim 1, wherein the performing preprocessing of deleting space characters, sorting preprocessing and deleting punctuation characters on the initial text, and obtaining the text to be detected comprises:

Obtain the text characters of the initial text, and determine whether there is a space character between the text characters;

If there is a space character between the text characters, then delete the space character, and determine the remaining text character after deleting the space character as the first preprocessed text character;

Obtain the position of the punctuation character in the first preprocessed text character, take the character following the punctuation character as the first character of the next line, and segment and sort the first preprocessed text character to obtain a second preprocessed text character, wherein the punctuation character is a symbol that assists in recording written language;

The punctuation characters are deleted from the second preprocessed text characters, and the remaining second preprocessed text characters after the punctuation characters are deleted are determined as target text characters, and the text to be detected is obtained.
The method for evaluating a speech recognition result according to claim 1, wherein calculating the character recognition error rate of the proofreading text by using the preset calculation formula comprises:

Count the number of inserted characters, the number of deleted characters, the number of replaced characters and the number of characters in the proofreading text;

Input the number of inserted characters, the number of deleted characters, the number of replaced characters and the number of characters in the proofreading text into a preset calculation formula to obtain the character recognition error rate of the proofreading text, wherein the preset calculation formula is:

In the formula, WER is the character recognition error rate, i is the number of inserted characters, s is the number of replaced characters, d is the number of deleted characters, and t is the number of characters in the proofreading text.
The method for evaluating a speech recognition result according to any one of claims 1-6, wherein selecting a preset comparison result by comparing the character recognition error rate with the standard error rate, and determining the conversion evaluation result of the speech-to-text conversion according to the preset comparison result comprises:

Compare the character recognition error rate with the standard error rate, and determine whether the character recognition error rate is greater than the standard error rate;

If the character recognition error rate is greater than the standard error rate, the preset first comparison result is determined as the conversion evaluation result of the speech-to-text conversion, wherein the preset first comparison result is that the accuracy of the speech-to-text conversion is low;

If the character recognition error rate is not greater than the standard error rate, the preset second comparison result is determined as the conversion evaluation result of the speech-to-text conversion, wherein the preset second comparison result is that the accuracy of the speech-to-text conversion is high.
A device for evaluating speech recognition results, comprising a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:

Obtain the initial voice in the video return visit project, and convert the initial voice based on the voice recognition function to obtain the converted initial text;

Performing preprocessing to delete space characters, sorting preprocessing and deleting punctuation characters to the initial text to obtain text to be detected;

Obtain the word sequence to be detected in the text to be detected based on the preset sequence function, proofread the word sequence to be detected according to the preset standard word sequence, and carry out proofreading marking in the word sequence to be detected to obtain proofread text;

Using a preset calculation formula to calculate the character recognition error rate of the proofreading text;

A preset comparison result is selected by comparing the character recognition error rate with the standard error rate, and the conversion evaluation result of the speech-to-text conversion is determined according to the preset comparison result.
The device for evaluating speech recognition results according to claim 8, wherein, when the processor executes the computer-readable instructions to obtain the word sequence to be detected in the text to be detected based on the preset sequence function, proofread the word sequence to be detected according to the preset standard word sequence, and perform proofreading marking in the word sequence to be detected to obtain proofreading text, the following steps are included:

Obtain the word sequence to be detected in the text to be detected based on the preset sequence function, compare the word sequence to be detected with the preset standard word sequence, and determine the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence;

If the number of characters of the word sequence to be detected is greater than the number of characters of the preset standard word sequence, then mark the preset insertion character at the position of the word sequence to be detected;

If the number of characters of the word sequence to be detected is less than the number of characters of the preset standard word sequence, mark a preset deletion character at the position of the word sequence to be detected;

If the number of characters of the to-be-detected word sequence is equal to the number of characters of the preset standard word sequence, then determine whether the to-be-detected word sequence is the same as the preset standard word sequence;

If the word sequence to be detected is different from the preset standard word sequence, a preset replacement character is marked on the position of the word sequence to be detected, and the text to be detected after the proofreading mark is determined as proofreading text.
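By way of a non-limiting illustration, the marking steps above may be sketched in Python as follows. The marker strings `<I>`, `<D>` and `<S>`, the function names, and the assumption that each word sequence to be detected is already paired one-to-one with its standard word sequence are all hypothetical; the claims state only that the marks are "preset".

```python
from typing import List

# Hypothetical marker symbols; the claims only call them "preset" characters.
INSERT_MARK = "<I>"
DELETE_MARK = "<D>"
REPLACE_MARK = "<S>"


def mark_word_sequence(detected: str, standard: str) -> str:
    """Append a proofreading mark to one word sequence to be detected."""
    if len(detected) > len(standard):
        # More characters than the standard: mark an insertion.
        return detected + INSERT_MARK
    if len(detected) < len(standard):
        # Fewer characters than the standard: mark a deletion.
        return detected + DELETE_MARK
    if detected != standard:
        # Same length but different content: mark a replacement.
        return detected + REPLACE_MARK
    # Identical to the standard: no mark is needed.
    return detected


def proofread(detected_seqs: List[str], standard_seqs: List[str]) -> List[str]:
    """Mark every word sequence; assumes the two lists are aligned pairwise."""
    return [mark_word_sequence(d, s) for d, s in zip(detected_seqs, standard_seqs)]
```

For example, `proofread(["abcd", "ab"], ["abc", "abc"])` marks the first sequence as an insertion and the second as a deletion.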
The device for evaluating speech recognition results according to claim 9, wherein, when the processor executes the computer-readable instructions to obtain the word sequence to be detected in the text to be detected based on the preset sequence function, compare the word sequence to be detected with the preset standard word sequence, and determine the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence, the following steps are included:

acquiring basic text characters in the text to be detected and an initial observation sequence, where the initial observation sequence is used to indicate a text character sequence of the basic text characters;

The basic text characters are divided into predicted observation sequences according to the division rules in the preset sequence function, and the predicted observation sequences are used to indicate the combination of the text character sequences;

Use a preset conditional probability formula to calculate the basic conditional probability that the basic text characters are arranged according to the predicted observation sequence under the arrangement condition of the initial observation sequence, wherein the preset conditional probability formula is:

$S^{*} = \arg\max P(S \mid O)$, where $S^{*}$ is the target observation sequence; $S = (s_1, s_2, \ldots, s_T)$ is the predicted observation sequence; $T$ is the length of the initial observation sequence; $s_1$ is the first word sequence obtained by dividing the basic text characters according to the predicted observation sequence; $O = (o_1, o_2, \ldots, o_T)$ is the initial observation sequence; and $o_1$ is the first word sequence obtained by dividing the basic text characters according to the initial observation sequence;

Take the largest of the basic conditional probabilities as the target conditional probability, and take the predicted observation sequence corresponding to the target conditional probability as the target observation sequence;

Divide the basic text characters according to the target observation sequence to obtain a word sequence to be detected;

The word sequence to be detected is compared with the preset standard word sequence, and the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence is determined.
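By way of a non-limiting illustration, the selection of the target observation sequence may be sketched in Python as follows. The claims do not specify how P(S|O) is modelled; this sketch assumes, purely for illustration, that a candidate division is scored by the product of hypothetical per-word probabilities supplied in a dictionary, with a small floor value for unknown words. The function names and the probability table are likewise hypothetical.

```python
import math
from typing import Dict, List


def candidate_divisions(chars: str) -> List[List[str]]:
    """Enumerate every way to divide chars into contiguous word sequences."""
    if not chars:
        return [[]]
    results = []
    for end in range(1, len(chars) + 1):
        head = chars[:end]
        for tail in candidate_divisions(chars[end:]):
            results.append([head] + tail)
    return results


def log_probability(division: List[str], word_prob: Dict[str, float],
                    floor: float = 1e-6) -> float:
    """Stand-in for log P(S|O): sum of log word probabilities (an assumption)."""
    return sum(math.log(word_prob.get(word, floor)) for word in division)


def target_division(chars: str, word_prob: Dict[str, float]) -> List[str]:
    """Return the candidate division with the largest probability (the arg max)."""
    return max(candidate_divisions(chars),
               key=lambda division: log_probability(division, word_prob))
```

Exhaustive enumeration grows exponentially with the text length, so it is suitable only for short examples; a practical system would more likely use dynamic programming.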
The device for evaluating speech recognition results according to claim 8, wherein, when the processor executes the computer-readable instructions to obtain the initial speech in the video return visit project and convert the initial speech based on the speech recognition function to obtain the converted initial text, the following steps are included:

Obtain the initial speech in the video return visit project, input the initial speech into the speech recognition function, and extract the speech features from the initial speech through the speech recognition function;

Convert the speech features into phoneme information through a preset translation model, wherein the phoneme information is used to indicate the smallest phonetic unit that constitutes a phonetic syllable;

The phoneme information is matched with a preset standard text to generate an initial text corresponding to the initial speech.
The device for evaluating speech recognition results according to claim 8, wherein, when the processor executes the computer-readable instructions to perform space character deletion preprocessing, sorting preprocessing and punctuation character deletion preprocessing on the initial text to obtain the text to be detected, the following steps are included:

Obtain the text characters of the initial text, and determine whether there is a space character between the text characters;

If there is a space character between the text characters, then delete the space character, and determine the remaining text character after deleting the space character as the first preprocessed text character;

Obtain the position of each punctuation character in the first preprocessed text characters, take the character following the punctuation character as the first character of the next line, and perform segment sorting on the first preprocessed text characters to obtain second preprocessed text characters, wherein a punctuation character is a symbol that assists written text in recording language;

The punctuation characters are deleted from the second preprocessed text characters, and the remaining second preprocessed text characters after the punctuation characters are deleted are determined as target text characters, and the text to be detected is obtained.
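By way of a non-limiting illustration, the three preprocessing steps above may be sketched in Python as follows. The set of punctuation characters and the function name are hypothetical, and the sketch treats any whitespace as a space character; the claims do not enumerate either.

```python
import re
from typing import List

# Hypothetical punctuation set; the claims do not list the characters.
PUNCTUATION = "，。！？；：,.!?;:"
_PUNCT_PATTERN = "([" + re.escape(PUNCTUATION) + "])"


def preprocess(initial_text: str) -> List[str]:
    """Return the text to be detected as a list of punctuation-free lines."""
    # Step 1: delete space characters (here, any whitespace).
    first = re.sub(r"\s+", "", initial_text)
    # Step 2: segment sorting, i.e. the character after each punctuation
    # character becomes the first character of the next line.
    second = re.sub(_PUNCT_PATTERN, r"\1\n", first)
    # Step 3: delete the punctuation characters and drop empty lines.
    lines = [re.sub(_PUNCT_PATTERN, "", line) for line in second.split("\n")]
    return [line for line in lines if line]
```

For example, `preprocess("你好， 世界。 再见")` yields three lines with neither spaces nor punctuation.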
The device for evaluating speech recognition results according to claim 8, wherein, when the processor executes the computer-readable instructions to calculate the character recognition error rate of the proofreading text by using a preset calculation formula, the following steps are included:

Count, in the proofreading text, the number of inserted characters, the number of deleted characters, the number of replaced characters, and the number of characters of the proofreading text;

Input the number of inserted characters, the number of deleted characters, the number of replaced characters and the number of characters of the proofreading text into a preset calculation formula to obtain the character recognition error rate of the proofreading text, wherein the preset calculation formula is:

In the formula, WER is the character recognition error rate, i is the number of inserted characters, s is the number of replaced characters, d is the number of deleted characters, and t is the number of characters in the proofreading text.
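By way of a non-limiting illustration, the calculation may be sketched in Python as follows. The formula itself is not reproduced in the text above, so this sketch assumes the conventional error-rate form WER = (i + s + d) / t, built from the variables just defined; that form and the function name are assumptions rather than a statement of the claimed formula.

```python
def character_recognition_error_rate(i: int, s: int, d: int, t: int) -> float:
    """Assumed conventional form: WER = (i + s + d) / t.

    i: number of inserted characters
    s: number of replaced characters
    d: number of deleted characters
    t: number of characters in the proofreading text
    """
    if t <= 0:
        raise ValueError("t must be a positive character count")
    if min(i, s, d) < 0:
        raise ValueError("error counts must be non-negative")
    return (i + s + d) / t
```

For example, one insertion, two replacements and one deletion in a 20-character proofreading text give an error rate of 4 / 20.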
The device for evaluating speech recognition results according to any one of claims 8-13, wherein, when the processor executes the computer-readable instructions to select a preset comparison result by comparing the character recognition error rate with the standard error rate and determine the conversion evaluation result of the speech-to-text conversion according to the preset comparison result, the following steps are included:

Compare the character recognition error rate with the standard error rate, and determine whether the character recognition error rate is greater than the standard error rate;

If the character recognition error rate is greater than the standard error rate, the preset first comparison result is determined as the conversion evaluation result of the speech-to-text conversion, wherein the preset first comparison result is that the accuracy of the speech-to-text conversion is low;

If the character recognition error rate is not greater than the standard error rate, the preset second comparison result is determined as the conversion evaluation result of the speech-to-text conversion, wherein the preset second comparison result is that the accuracy of the speech-to-text conversion is high.
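By way of a non-limiting illustration, the comparison above may be sketched in Python as follows. The result strings and the function name are hypothetical, and the standard error rate is passed in as a parameter because the claims do not state its value.

```python
# Hypothetical wording for the two preset comparison results.
FIRST_RESULT = "accuracy of the speech-to-text conversion is low"
SECOND_RESULT = "accuracy of the speech-to-text conversion is high"


def conversion_evaluation_result(error_rate: float, standard_error_rate: float) -> str:
    """Select the preset comparison result for a character recognition error rate."""
    if error_rate > standard_error_rate:
        # Strictly greater than the standard: low accuracy.
        return FIRST_RESULT
    # Not greater than the standard (including equal): high accuracy.
    return SECOND_RESULT
```

Note that an error rate exactly equal to the standard error rate falls under the second comparison result, as the claim wording "not greater than" requires.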
A computer-readable storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the following steps:

Obtain the initial speech in the video return visit project, and convert the initial speech based on the speech recognition function to obtain the converted initial text;

Performing preprocessing for deleting space characters, sorting preprocessing and deleting punctuation characters on the initial text to obtain text to be detected;

Obtain the word sequence to be detected in the text to be detected based on the preset sequence function, proofread the word sequence to be detected according to the preset standard word sequence, and carry out proofreading marking in the word sequence to be detected to obtain proofread text;

Using a preset calculation formula to calculate the character recognition error rate of the proofreading text;

A preset comparison result is selected by comparing the character recognition error rate with the standard error rate, and the conversion evaluation result of the speech-to-text conversion is determined according to the preset comparison result.
The computer-readable storage medium of claim 15, wherein the computer instructions, when executed on a computer, cause the computer to further perform the following steps:

Obtain the word sequence to be detected in the text to be detected based on the preset sequence function, compare the word sequence to be detected with the preset standard word sequence, and determine the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence;

If the number of characters of the word sequence to be detected is greater than the number of characters of the preset standard word sequence, then mark the preset insertion character at the position of the word sequence to be detected;

If the number of characters of the word sequence to be detected is less than the number of characters of the preset standard word sequence, mark a preset deletion character at the position of the word sequence to be detected;

If the number of characters of the to-be-detected word sequence is equal to the number of characters of the preset standard word sequence, then determine whether the to-be-detected word sequence is the same as the preset standard word sequence;

If the word sequence to be detected is different from the preset standard word sequence, a preset replacement character is marked on the position of the word sequence to be detected, and the text to be detected after the proofreading mark is determined as proofreading text.
The computer-readable storage medium of claim 16, wherein the computer instructions, when executed on a computer, cause the computer to further perform the following steps:

acquiring basic text characters in the text to be detected and an initial observation sequence, where the initial observation sequence is used to indicate a text character sequence of the basic text characters;

The basic text characters are divided into predicted observation sequences according to the division rules in the preset sequence function, and the predicted observation sequences are used to indicate the combination of the text character sequences;

Use a preset conditional probability formula to calculate the basic conditional probability that the basic text characters are arranged according to the predicted observation sequence under the arrangement condition of the initial observation sequence, wherein the preset conditional probability formula is:

$S^{*} = \arg\max P(S \mid O)$, where $S^{*}$ is the target observation sequence; $S = (s_1, s_2, \ldots, s_T)$ is the predicted observation sequence; $T$ is the length of the initial observation sequence; $s_1$ is the first word sequence obtained by dividing the basic text characters according to the predicted observation sequence; $O = (o_1, o_2, \ldots, o_T)$ is the initial observation sequence; and $o_1$ is the first word sequence obtained by dividing the basic text characters according to the initial observation sequence;

Take the largest of the basic conditional probabilities as the target conditional probability, and take the predicted observation sequence corresponding to the target conditional probability as the target observation sequence;

Divide the basic text characters according to the target observation sequence to obtain a word sequence to be detected;

The word sequence to be detected is compared with the preset standard word sequence, and the relationship between the number of characters of the word sequence to be detected and the number of characters of the preset standard word sequence is determined.
The computer-readable storage medium of claim 15, wherein the computer instructions, when executed on a computer, cause the computer to further perform the following steps:

Obtain the initial speech in the video return visit project, input the initial speech into the speech recognition function, and extract the speech features from the initial speech through the speech recognition function;

Convert the speech features into phoneme information through a preset translation model, wherein the phoneme information is used to indicate the smallest phonetic unit that constitutes a phonetic syllable;

The phoneme information is matched with a preset standard text to generate an initial text corresponding to the initial speech.
The computer-readable storage medium of claim 15, wherein the computer instructions, when executed on a computer, cause the computer to further perform the following steps:

Obtain the text characters of the initial text, and determine whether there is a space character between the text characters;

If there is a space character between the text characters, then delete the space character, and determine the remaining text character after deleting the space character as the first preprocessed text character;

Obtain the position of each punctuation character in the first preprocessed text characters, take the character following the punctuation character as the first character of the next line, and perform segment sorting on the first preprocessed text characters to obtain second preprocessed text characters, wherein a punctuation character is a symbol that assists written text in recording language;

The punctuation characters are deleted from the second preprocessed text characters, and the second preprocessed text characters remaining after the punctuation characters are deleted are determined as target text characters, and the text to be detected is obtained.
A device for evaluating a speech recognition result, the device for evaluating the speech recognition result comprising:

a conversion module, used for obtaining the initial speech in the video return visit project, and converting the initial speech based on the speech recognition function to obtain the converted initial text;

a preprocessing module, configured to perform preprocessing of deleting space characters, sorting preprocessing and deleting punctuation characters on the initial text to obtain the text to be detected;

a proofreading module, used for obtaining the word sequence to be detected in the text to be detected based on a preset sequence function, proofreading the word sequence to be detected according to the preset standard word sequence, and making proofreading marks in the word sequence to be detected to obtain the proofreading text;

a calculation module for calculating the character recognition error rate of the proofreading text by using a preset calculation formula;

A determination module, configured to select a preset comparison result by comparing the character recognition error rate with the standard error rate, and determine the conversion evaluation result of the speech-to-text conversion according to the preset comparison result.
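By way of a non-limiting illustration, the cooperation of the five modules may be sketched in Python as follows. The class name, the field names and the use of plain callables for each module are hypothetical; the sketch shows only the order in which the modules hand results to one another, not how any module is implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechRecognitionEvaluator:
    """Chains the five modules; each one is supplied as a callable."""
    convert: Callable[[bytes], str]     # conversion module: speech -> initial text
    preprocess: Callable[[str], str]    # preprocessing module: -> text to be detected
    proofread: Callable[[str], str]     # proofreading module: -> proofreading text
    calculate: Callable[[str], float]   # calculation module: -> error rate
    determine: Callable[[float], str]   # determination module: -> evaluation result

    def evaluate(self, initial_speech: bytes) -> str:
        initial_text = self.convert(initial_speech)
        text_to_detect = self.preprocess(initial_text)
        proofreading_text = self.proofread(text_to_detect)
        error_rate = self.calculate(proofreading_text)
        return self.determine(error_rate)
```

Passing the modules in as callables keeps the sketch independent of any particular speech recognition function or error-rate formula.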