WO2022095353A1 - Method, apparatus and device for evaluating a speech recognition result, and storage medium - Google Patents

Method, apparatus and device for evaluating a speech recognition result, and storage medium

Info

Publication number
WO2022095353A1
WO2022095353A1 (PCT/CN2021/090436)
Authority
WO
WIPO (PCT)
Prior art keywords
text
characters
detected
preset
sequence
Prior art date
Application number
PCT/CN2021/090436
Other languages
English (en)
Chinese (zh)
Inventor
陈益
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022095353A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/01 Assessment or evaluation of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the field of artificial intelligence, and in particular to a method, apparatus, device, and storage medium for evaluating a speech recognition result.
  • A video return visit is one of the ways a company maintains its customers.
  • The company's operation and maintenance personnel conduct video return visits to customers so that the company can further understand customer needs.
  • One of the technologies used in the video interview is speech recognition technology (automatic speech recognition, ASR).
  • Speech recognition technology is also called automatic speech recognition.
  • That is to say, in the video return visit project, the voice of the customer's reply is recognized by speech recognition technology and converted into the corresponding text, thereby realizing speech recognition for the video return visit. After the speech is converted into text using speech recognition technology, the accuracy of the speech-to-text conversion is usually determined by spot checks.
  • The inventor realized that detecting the speech-to-text conversion by means of spot checks not only involves complicated steps but also consumes a lot of time, which in turn leads to low efficiency in evaluating the accuracy of converting the initial speech into the initial text.
  • The present application provides a method for evaluating speech recognition results, which is used to improve the efficiency of evaluating the accuracy of converting initial speech into initial text.
  • A first aspect of the present application provides a method for evaluating a speech recognition result, including: acquiring initial speech in a video return visit project, and converting the initial speech based on a speech recognition function to obtain a converted initial text;
  • preprocessing the initial text by deleting space characters, performing segment sorting, and deleting punctuation characters to obtain a text to be detected; obtaining a word sequence to be detected in the text to be detected based on a preset sequence function;
  • proofreading the word sequence to be detected according to a preset standard word sequence, and placing proofreading marks in the word sequence to be detected to obtain a proofreading text; calculating a character recognition error rate of the proofreading text using a preset calculation formula; and selecting a preset comparison result by comparing the character recognition error rate with a standard error rate, and determining the conversion evaluation result of the speech-to-text conversion according to the preset comparison result.
  • A second aspect of the present application provides a device for evaluating speech recognition results, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions: obtaining the initial voice in the video return visit project, and converting the initial voice based on the voice recognition function to obtain the converted initial text; preprocessing the initial text by deleting space characters, performing segment sorting, and deleting punctuation characters to obtain the text to be detected; obtaining the word sequence to be detected in the text to be detected based on a preset sequence function, proofreading the word sequence to be detected according to the preset standard word sequence, and placing proofreading marks in the word sequence to be detected to obtain the proofreading text; calculating the character recognition error rate of the proofreading text using a preset calculation formula; and selecting a preset comparison result by comparing the character recognition error rate with the standard error rate, and determining the conversion evaluation result of the speech-to-text conversion according to the preset comparison result.
  • A third aspect of the present application provides a computer-readable storage medium in which computer instructions are stored; when the computer instructions are executed on a computer, the computer is caused to perform the following steps: obtaining the initial voice in the video return visit project;
  • converting the initial voice based on the voice recognition function to obtain the converted initial text;
  • preprocessing the initial text by deleting space characters, performing segment sorting, and deleting punctuation characters to obtain the text to be detected.
  • A fourth aspect of the present application provides an apparatus for evaluating speech recognition results, comprising: a conversion module for acquiring the initial speech in a video return visit project and converting the initial speech based on the speech recognition function to obtain the converted initial text; a preprocessing module for preprocessing the initial text by deleting space characters, performing segment sorting, and deleting punctuation characters to obtain the text to be detected; a proofreading module for obtaining the word sequence to be detected in the text to be detected based on a preset sequence function, proofreading the word sequence to be detected according to the preset standard word sequence, and placing proofreading marks in the word sequence to be detected to obtain the proofreading text;
  • a calculation module for calculating the character recognition error rate of the proofreading text using the preset calculation formula; and a determination module for selecting a preset comparison result by comparing the character recognition error rate with the standard error rate, and determining the conversion evaluation result of the speech-to-text conversion according to the preset comparison result.
  • The initial voice in the video return visit project is obtained, and the initial voice is converted based on a voice recognition function to obtain the converted initial text; the initial text is preprocessed by deleting space characters, performing segment sorting, and deleting punctuation characters to obtain the text to be detected; the word sequence to be detected in the text to be detected is obtained based on a preset sequence function, the word sequence to be detected is proofread according to the preset standard word sequence, and proofreading marks are placed in the word sequence to be detected to obtain the proofreading text; the character recognition error rate of the proofreading text is calculated using a preset calculation formula; a preset comparison result is selected by comparing the character recognition error rate with the standard error rate, and the conversion evaluation result of the speech-to-text conversion is determined according to the preset comparison result.
  • In the embodiments of the present application, the initial speech in the video return visit project is converted by the speech recognition function to obtain the initial text; the initial text then undergoes preprocessing, word sequence proofreading, and error rate calculation to obtain the character recognition error rate; finally, a preset comparison result is selected by comparing the character recognition error rate with the standard error rate, yielding the conversion evaluation result of the speech-to-text conversion. This improves the efficiency of evaluating the accuracy of converting the initial speech into the initial text.
  • FIG. 1 is a schematic diagram of an embodiment of a method for evaluating a speech recognition result in an embodiment of the present application
  • FIG. 2 is a schematic diagram of another embodiment of a method for evaluating a speech recognition result in an embodiment of the present application
  • FIG. 3 is a schematic diagram of an embodiment of a device for evaluating speech recognition results in an embodiment of the present application
  • FIG. 4 is a schematic diagram of another embodiment of the apparatus for evaluating the speech recognition result in the embodiment of the present application.
  • FIG. 5 is a schematic diagram of an embodiment of a device for evaluating a speech recognition result in an embodiment of the present application.
  • Embodiments of the present application provide a method, apparatus, device, and storage medium for evaluating a speech recognition result, which are used to improve the efficiency of evaluating the accuracy of converting an initial speech into an initial text.
  • An embodiment of the method for evaluating the speech recognition result in the embodiment of the present application includes:
  • the execution subject of the present application may be a device for evaluating a speech recognition result, or may be a terminal or a server, which is not specifically limited here.
  • the embodiments of the present application take the server as an execution subject as an example for description.
  • the server collects the initial voice in the video interview through the voice collector.
  • the initial voice refers to the voice of the call or dialogue in the video interview project, and its content can include different business contents.
  • The format of the initial voice can be the CDA track index format (CD audio format), the WAVE format, the audio interchange file format (AIFF), or the MPEG audio layer III (MP3) format.
  • The format of the voice is not specifically limited here.
  • After the server collects the initial voice, it converts the initial voice into text form through the voice recognition function to obtain the initial text. Since the accuracy of the speech recognition system's speech-to-text conversion is not 100%, the server needs to process the initial text and detect the accuracy of converting the initial speech into the initial text.
  • The initial text obtained from the initial speech conversion by the speech recognition function is saved in the project log file. It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned initial text, the initial text can also be stored in a node of a blockchain.
  • Before detecting the initial text, the server needs to preprocess the initial text to obtain the preprocessed text to be detected; this processing reduces the influence on the character recognition error rate that the server calculates in the subsequent steps for converting the initial speech into the initial text.
  • After the server obtains the preprocessed text to be detected, it needs to obtain the word sequence to be detected in the text to be detected, and use a preset standard word sequence to proofread the word sequence to be detected. There are many preset standard word sequences here.
  • The server calculates the basic similarity between the word sequence to be detected and each preset standard word sequence, determines the basic similarity with the largest value as the target similarity, and takes the preset standard word sequence corresponding to the target similarity as the target standard word sequence.
  • The server then judges the relationship between the number of characters of the word sequence to be detected and the number of characters of the target standard word sequence, so that the word sequence to be detected in the text to be detected is proofread, and the final proofreading text is obtained.
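The selection of the target standard word sequence by largest basic similarity can be sketched as follows. The patent does not specify the similarity measure, so difflib's ratio is used here as a stand-in, and the candidate sentences are illustrative:

```python
from difflib import SequenceMatcher

def pick_target_standard(detected: str, candidates: list[str]) -> str:
    """Compute a basic similarity between the word sequence to be detected
    and every preset standard word sequence, and return the candidate with
    the largest similarity (the target standard word sequence).
    difflib's ratio is a stand-in for the patent's unspecified measure."""
    return max(candidates, key=lambda c: SequenceMatcher(None, detected, c).ratio())

# Hypothetical preset standard word sequences.
standards = ["I am not short of money temporarily", "Your company's service is good"]
target = pick_target_standard("I am short of money temporarily", standards)
```

With these candidates, the first sentence is selected because it shares almost all its characters with the detected sequence.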
  • The character recognition error rate of the proofreading text is calculated by the preset calculation formula; the character recognition error rate is the error rate of converting the initial speech into the initial text. It reflects how many incorrectly converted characters exist in the process of converting the initial speech into the initial text, and the number of incorrectly converted characters is one of the factors for judging the conversion quality.
  • After the server obtains the character recognition error rate, it compares the value of the character recognition error rate with that of the standard error rate to determine the conversion evaluation result of the speech-to-text conversion.
  • The comparison result here includes a first comparison result and a second comparison result, wherein the first comparison result is that the accuracy of the speech-to-text conversion is low, and the second comparison result is that the accuracy of the speech-to-text conversion is high.
  • When the value of the character recognition error rate is greater than that of the standard error rate, the selected comparison result is the first comparison result, which is determined as the conversion evaluation result of the speech-to-text conversion; when the value of the character recognition error rate is less than or equal to that of the standard error rate, the selected comparison result is the second comparison result, which is determined as the conversion evaluation result of the speech-to-text conversion.
  • In the embodiments of the present application, the initial speech in the video return visit project is converted by the speech recognition function to obtain the initial text; the initial text then undergoes preprocessing, word sequence proofreading, and error rate calculation to obtain the character recognition error rate; finally, a preset comparison result is selected by comparing the character recognition error rate with the standard error rate, yielding the conversion evaluation result of the speech-to-text conversion. This improves the efficiency of evaluating the accuracy of converting the initial speech into the initial text.
  • another embodiment of the method for evaluating the speech recognition result in the embodiment of the present application includes:
  • the server first obtains the initial voice in the video return visit item, inputs the initial voice into the voice recognition function, and extracts the voice features in the initial voice through the voice recognition function; the server converts the voice features into phonemes through a preset translation model information, wherein the phoneme information is used to indicate the smallest phonetic unit that constitutes a phonetic syllable; finally, the server matches the phoneme information with a preset standard text to generate an initial text corresponding to the initial voice.
  • After the server obtains the initial voice in the video return visit project, it needs to use the voice recognition function to recognize and convert the initial voice.
  • The main principle of the voice recognition function is as follows: the server first collects a large number of voice samples for training, analyzes and integrates the voice feature parameters of each voice in the training samples, and establishes a voice feature template of the voice feature parameters in a voice comparison library. The server then obtains the voice information to be recognized, performs the same processing on it to obtain the target voice parameters, matches the voice feature parameter corresponding to the target voice parameters by a judgment method, and determines the speech recognition result.
  • Recognition frameworks such as the dynamic time warping method based on pattern matching and the hidden Markov model method based on statistical models can be used to convert multiple initial sentences of multiple target voices conveniently and quickly.
  • the phoneme information is the smallest phonetic unit divided according to the natural attributes of the voice, and the voice is analyzed according to the pronunciation action in the syllable, and an action is divided into a corresponding phoneme.
  • the phoneme information can be more accurately combined into text information.
  • the server obtains the target voice "Your company's service is good"
  • the server extracts the voice features in the target voice.
  • the obtained voice features are: [1 2 8 4 7 6 0 9 3]
  • the server converts the extracted speech features into phoneme information through the acoustic model.
  • The obtained phoneme information is: g u i g o n g s i f u w u h a o.
  • The server matches the characters corresponding to the phoneme information in a preset dictionary, such as the following characters: cabinet: gui; expensive: gui; worker: gong; public: gong; four: si; company: si; service: fu; service: wu; good: hao. The server then obtains the association probabilities between text information from the preset association probabilities, such as the following: expensive: 0.1786, public: 0.0546, company: 0.7898, service: 0.8967, good: 0.3982, good service: 0.6785. Finally, the server selects the text information with the highest association probability as the target text; the higher the probability of the sentence appearing, the more likely it is selected, and the server combines the target texts in order to obtain the target sentence.
  • In this example, the obtained target sentence is: "Your company's service is good".
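The dictionary lookup and probability selection described above can be sketched as follows. The lexicon and probabilities are the illustrative values from the example, not a real ASR dictionary:

```python
# Illustrative lexicon (phoneme -> candidate characters) and association
# probabilities taken from the worked example; unscored candidates count as 0.
lexicon = {
    "gui": ["cabinet", "expensive"],
    "gong": ["worker", "public"],
    "si": ["four", "company"],
}
assoc_prob = {"expensive": 0.1786, "public": 0.0546, "company": 0.7898}

def decode(phoneme: str) -> str:
    """Among the characters matching a phoneme in the lexicon, select the
    one with the highest association probability."""
    return max(lexicon[phoneme], key=lambda ch: assoc_prob.get(ch, 0.0))
```

Decoding each phoneme in order and concatenating the selected characters yields the target sentence.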
  • the above-mentioned initial text can also be stored in a node of a blockchain.
  • The server first obtains the text characters of the initial text and determines whether there are space characters between the text characters; if there are, the server deletes the space characters and determines the remaining text characters after deletion as the first preprocessed text characters.
  • Then the server obtains the positions of the punctuation characters in the first preprocessed text characters, takes the character following each punctuation character as the first character of the next line, and sorts the first preprocessed text characters into segments to obtain the second preprocessed text characters; punctuation characters are symbols that assist the written record of language. Finally, the server deletes the punctuation characters in the second preprocessed text characters and determines the remaining second preprocessed text characters as the target text characters, obtaining the text to be detected.
  • In this embodiment, the server first deletes the space characters between the text characters in the initial text to obtain the first preprocessed text characters, which prevents garbled characters and facilitates the sorting of text characters by the server. Then the server sorts the text characters by the positions of the punctuation characters in the first preprocessed text characters, ensuring that each row after sorting contains one punctuation character and at least one text character, and obtains the second preprocessed text characters; this sorting facilitates the proofreading of the text characters.
  • Finally, the server deletes the punctuation characters in the second preprocessed text characters and determines the remaining characters as the target text characters, obtaining the text to be detected. Because punctuation characters only assist the written record of language, whether a punctuation character is recognized correctly does not affect the accuracy of the text characters; therefore, punctuation characters need to be removed.
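The three preprocessing steps can be sketched as follows. The punctuation set is an assumption, since the embodiment does not enumerate which punctuation characters are handled:

```python
import re

# Assumed punctuation set; the embodiment does not enumerate the characters.
PUNCT = ",.!?;:，。！？；："

def preprocess(initial_text: str) -> list[str]:
    """Delete space characters, start a new segment after each punctuation
    character (sorting preprocessing), then drop the punctuation characters
    themselves, yielding the text to be detected as a list of segments."""
    no_spaces = initial_text.replace(" ", "")                       # delete spaces
    segments = re.split("[" + re.escape(PUNCT) + "]", no_spaces)    # split at punctuation
    return [seg for seg in segments if seg]                         # discard empty segments
```

For example, `preprocess("Hello, world!")` produces two segments with no spaces or punctuation left.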
  • The server first obtains the basic text characters in the text to be detected and an initial observation sequence, where the initial observation sequence indicates the text character order of the basic text characters. Secondly, the server divides the basic text characters into predicted observation sequences according to the division rules in the preset sequence function, where a predicted observation sequence indicates a combination of the text characters. Then the server uses a preset conditional probability formula to calculate the basic conditional probability that the basic text characters are arranged according to each predicted observation sequence under the arrangement condition of the initial observation sequence.
  • For example, the text to be detected is "Your company serves well", and the basic text characters are "Your/Company/Company/Service/Service/Good", each character being one text character; the initial observation sequence here indicates the text character order of the basic text characters.
  • The server divides the basic text characters into predicted observation sequences through the division rules in the preset sequence function; the obtained predicted observation sequences can be "your/company/good service", "your company/service/good", and "your company/service is good". Then the server uses the preset conditional probability formula to calculate the basic conditional probability that the basic text characters are arranged according to each predicted observation sequence under the arrangement condition of the initial observation sequence.
  • Through the calculation of the conditional probability formula, the basic conditional probability of "your/company/good service" is 0.682, and the basic conditional probability of "your company/service/good" is 0.798; the basic conditional probability of "your company/service is good" is
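Presumably the segmentation with the largest basic conditional probability is kept; that selection can be sketched as follows. The first two probabilities come from the worked example, and the third is an assumed placeholder because its value is cut off in the source:

```python
# Basic conditional probabilities from the worked example; the third value
# is elided in the source, so 0.5 is an assumed placeholder.
candidates = {
    "your/company/good service": 0.682,
    "your company/service/good": 0.798,
    "your company/service is good": 0.5,
}

# Keep the predicted observation sequence with the largest probability.
best_segmentation = max(candidates, key=candidates.get)
```

With these values, the segmentation "your company/service/good" (probability 0.798) is selected.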
  • If the number of characters of the word sequence to be detected is greater than that of the preset standard word sequence, the server marks the preset insertion character at the corresponding position of the word sequence to be detected.
  • For example, the known standard text is "I am short of money temporarily", and the number of characters of the corresponding preset standard word sequence is 5; the recognized text to be detected is "I am not short of money temporarily", and the number of characters of the corresponding word sequence to be detected is 6.
  • Since the word sequence to be detected contains an extra character, the server directly marks the preset insertion character at the corresponding position of the word sequence to be detected.
  • If the number of characters of the word sequence to be detected is less than that of the preset standard word sequence, the server marks the preset deletion character at the corresponding position of the word sequence to be detected.
  • For example, the known standard text is "I am not short of money temporarily", and the number of characters of the corresponding preset standard word sequence is 6; the recognized text to be detected is "I am short of money temporarily", and the number of characters of the corresponding word sequence to be detected is 5.
  • Since the word sequence to be detected is missing a character, the server directly marks the preset deletion character at the corresponding position of the word sequence to be detected.
  • When the numbers of characters are equal, the text to be detected recognized by the server may still differ from the standard text, so it is necessary to further judge whether the word sequence to be detected and the preset standard word sequence are the same; the standard text here is the text content corresponding to the preset standard word sequence.
  • If the word sequence to be detected is not the same as the preset standard word sequence, the corresponding text to be detected is not the same as the standard text, that is, there are replaced characters in the text to be detected; the server directly marks the preset replacement character at the corresponding position of the word sequence to be detected, and then determines the text to be detected after the proofreading marks are made as the proofreading text.
  • For example, the known standard text is "I am not short of money temporarily", and the number of characters of the corresponding preset standard word sequence is 6; the recognized text to be detected is "I am short of money temporarily", and the number of characters of the corresponding word sequence to be detected is also 6.
  • The server judges whether the word sequence to be detected is the same as the preset standard word sequence; if it detects that they are different, it marks the preset replacement character at the corresponding position of the word sequence to be detected. Finally, the server determines the text to be detected marked with the preset insertion characters, preset deletion characters, and preset replacement characters as the proofreading text.
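The three kinds of proofreading marks can be produced in one pass from a character-level alignment. A hedged sketch follows, using difflib for the alignment; the marker glyphs `^`, `~`, and `#` are assumptions, since the patent does not specify them:

```python
from difflib import SequenceMatcher

INS, DEL, SUB = "^", "~", "#"   # assumed insertion / deletion / replacement markers

def proofread(standard: str, detected: str) -> str:
    """Align the text to be detected against the standard text and place a
    preset marker at every divergence, yielding the proofreading text."""
    out = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, standard, detected).get_opcodes():
        if op == "equal":
            out.append(detected[j1:j2])
        elif op == "insert":    # extra characters in the detected text
            out.append(INS + detected[j1:j2])
        elif op == "delete":    # characters missing from the detected text
            out.append(DEL * (i2 - i1))
        else:                   # "replace": incorrectly converted characters
            out.append(SUB + detected[j1:j2])
    return "".join(out)
```

Counting the markers afterwards directly yields the insertion, deletion, and replacement counts used in the error rate formula.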
  • the server counts the number of inserted characters, the number of deleted characters, the number of replaced characters, and the number of characters in the proofreading text respectively;
  • By substituting these counts into the preset calculation formula, the character recognition error rate of the proofreading text is obtained, wherein the preset calculation formula is:

    WER = (i + s + d) / t × 100%

  • where WER is the character recognition error rate, i is the number of inserted characters, s is the number of replaced characters, d is the number of deleted characters, and t is the number of characters in the proofreading text.
  • Before the server calculates the character recognition error rate of the proofreading text, it first needs to determine the number of inserted characters, the number of deleted characters, the number of replaced characters, and the number of characters in the proofreading text; only with these variables and the preset calculation formula can the character recognition error rate of the proofreading text be calculated.
  • The number of inserted characters is obtained by counting the preset insertion characters, the number of deleted characters by counting the preset deletion characters, and the number of replaced characters by counting the preset replacement characters; the number of characters in the proofreading text is obtained by directly counting its characters. Substituting these values into the preset calculation formula yields the character recognition error rate of the proofreading text.
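The counting and the preset formula reduce to a few lines; the variable names follow the definitions given for the formula:

```python
def character_error_rate(i: int, s: int, d: int, t: int) -> float:
    """WER = (i + s + d) / t: inserted, replaced, and deleted character
    counts divided by the number of characters in the proofreading text."""
    return (i + s + d) / t

# Worked insertion example: one inserted character in a 6-character text.
wer = character_error_rate(i=1, s=0, d=0, t=6)
```

For the insertion example, the character recognition error rate is 1/6, or about 16.7%.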
  • The server compares the character recognition error rate with the standard error rate and determines whether the character recognition error rate is greater than the standard error rate. If it is, the server determines the preset first comparison result as the conversion evaluation result of the speech-to-text conversion, wherein the preset first comparison result is that the accuracy of the speech-to-text conversion is low; if it is not, the server determines the preset second comparison result as the conversion evaluation result, wherein the preset second comparison result is that the accuracy of the speech-to-text conversion is high.
  • After the server obtains the character recognition error rate, it compares the value of the character recognition error rate with that of the standard error rate to determine the conversion evaluation result of the speech-to-text conversion.
  • The comparison result here includes a first comparison result and a second comparison result, wherein the first comparison result is that the accuracy of the speech-to-text conversion is low, and the second comparison result is that the accuracy of the speech-to-text conversion is high.
  • When the value of the character recognition error rate is greater than that of the standard error rate, the selected comparison result is the first comparison result, which is determined as the conversion evaluation result of the speech-to-text conversion; when the value of the character recognition error rate is less than or equal to that of the standard error rate, the selected comparison result is the second comparison result, which is determined as the conversion evaluation result of the speech-to-text conversion.
  • the standard error rate here refers to the standard for judging the conversion of initial speech into initial text.
  • The value of the standard error rate can be, for example, 60% or 88%; this application does not limit the value of the standard error rate, which can be set according to the actual situation.
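The final threshold decision can be sketched as follows; the result strings are shorthand for the two preset comparison results:

```python
def conversion_evaluation(wer: float, standard_error_rate: float) -> str:
    """Select the preset comparison result: a character recognition error
    rate above the standard error rate means low speech-to-text accuracy
    (first comparison result), otherwise high accuracy (second result)."""
    if wer > standard_error_rate:
        return "low accuracy"    # preset first comparison result
    return "high accuracy"       # preset second comparison result
```

Note that an error rate exactly equal to the standard error rate yields the second comparison result, matching the "less than or equal to" condition in the description.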
  • Referring to FIG. 3, an embodiment of the apparatus for evaluating the speech recognition result in the embodiment of the present application includes:
  • a conversion module 301, used to obtain the initial voice in the video return visit project and convert the initial voice based on the voice recognition function to obtain the converted initial text;
  • a preprocessing module 302, used to perform space-character deletion preprocessing, sorting preprocessing, and punctuation-character deletion preprocessing on the initial text to obtain the text to be detected;
  • a proofreading module 303, configured to obtain the word sequence to be detected in the text to be detected based on a preset sequence function, proofread the word sequence to be detected according to the preset standard word sequence, and place proofreading marks in the word sequence to be detected to obtain the proofreading text;
  • a calculation module 304, used to calculate the character recognition error rate of the proofreading text by using a preset calculation formula;
  • a determination module 305, configured to select a preset comparison result by comparing the character recognition error rate with the standard error rate, and determine the conversion evaluation result of the speech-to-text conversion according to the preset comparison result.
  • Referring to FIG. 4, another embodiment of the apparatus for evaluating speech recognition results in the embodiment of the present application includes:
  • The conversion module 301 is used to obtain the initial voice in the video return visit project and convert the initial voice based on the speech recognition function to obtain the converted initial text; the preprocessing module 302 is used to perform space-character deletion preprocessing, sorting preprocessing, and punctuation-character deletion preprocessing on the initial text to obtain the text to be detected; the proofreading module 303 is used to obtain the word sequence to be detected in the text to be detected based on a preset sequence function, proofread the word sequence to be detected according to the preset standard word sequence, and place proofreading marks in the word sequence to be detected to obtain the proofreading text; the calculation module 304 is used to calculate the character recognition error rate of the proofreading text by using a preset calculation formula; the determination module 305 is used to select a preset comparison result by comparing the character recognition error rate with the standard error rate, and determine the conversion evaluation result of the speech-to-text conversion according to the preset comparison result.
  • the proofreading module 303 includes:
  • a comparison unit 3031, configured to obtain the word sequence to be detected in the text to be detected on the basis of a preset sequence function, and to compare the word sequence to be detected with the preset standard word sequence to determine the relationship between their character counts;
  • a first marking unit 3032, configured to mark a preset insertion character at the corresponding position of the word sequence to be detected if its number of characters is greater than that of the preset standard word sequence;
  • a second marking unit 3033, configured to mark a preset deletion character at the corresponding position of the word sequence to be detected if its number of characters is less than that of the preset standard word sequence;
  • a judgment unit 3034, configured, if the number of characters of the word sequence to be detected is equal to that of the preset standard word sequence, to judge whether the word sequence to be detected is identical to the preset standard word sequence, and to mark a preset replacement character at any position where the characters differ.
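The marking scheme above produces the insertion, deletion and replacement counts that the calculation module later feeds into its formula. A common way to obtain those counts, used by scoring tools such as HResults, is Levenshtein dynamic-programming alignment; the sketch below uses that standard technique rather than the patent's own unit-by-unit procedure, and the function name is illustrative.

```python
def edit_ops(standard: str, detected: str) -> tuple[int, int, int]:
    """Count (substitutions, deletions, insertions) needed to turn
    `standard` into `detected`, via Levenshtein dynamic programming."""
    m, n = len(standard), len(detected)
    # dp[i][j] = (total cost, subs, dels, ins) for standard[:i] vs detected[:j]
    dp = [[(0, 0, 0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2] + 1, c[3])        # delete from standard
    for j in range(1, n + 1):
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2], c[3] + 1)        # insert into standard
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if standard[i - 1] == detected[j - 1]:
                cand = [dp[i - 1][j - 1]]                  # match, no cost
            else:
                c = dp[i - 1][j - 1]
                cand = [(c[0] + 1, c[1] + 1, c[2], c[3])]  # substitution
            c = dp[i - 1][j]
            cand.append((c[0] + 1, c[1], c[2] + 1, c[3]))  # deletion
            c = dp[i][j - 1]
            cand.append((c[0] + 1, c[1], c[2], c[3] + 1))  # insertion
            dp[i][j] = min(cand)
    _, subs, dels, ins = dp[m][n]
    return subs, dels, ins
```

For example, aligning the standard sequence `"abcd"` against the detected sequence `"axcde"` yields one substitution and one insertion.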
  • the comparison unit 3031 is specifically configured to: acquire basic text characters in the text to be detected and an initial observation sequence, where the initial observation sequence is used to indicate the text character sequence of the basic text characters;
  • the conversion module 301 is specifically configured to: obtain the initial speech in the video return-visit item, input the initial speech into the speech recognition function, and extract the phonetic features of the initial speech through the speech recognition function;
  • a preset translation model converts the phonetic features into phoneme information, where a phoneme is the smallest phonetic unit constituting a syllable; the phoneme information is matched against preset standard text to generate the initial text corresponding to the initial speech.
  • the preprocessing module 302 is specifically configured to: obtain the text characters of the initial text and determine whether space characters exist between them; if space characters exist, delete them and take the remaining text characters as the first preprocessed text characters; obtain the positions of the punctuation characters in the first preprocessed text characters, treat the character following each punctuation character as the first character of the next line, and segment and sort accordingly to obtain the second preprocessed text characters, where a punctuation character is a symbol that assists the written recording of language; delete the punctuation characters from the second preprocessed text characters, and take the remaining characters as the target text characters, obtaining the text to be detected.
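The three preprocessing passes above (space deletion, punctuation-based segmentation, punctuation deletion) can be sketched as follows; the punctuation set is a hypothetical choice, since the patent does not enumerate the symbols:

```python
import re

PUNCTUATION = "，。！？；：、,.!?;:"  # illustrative set, not fixed by the patent

def preprocess(initial_text: str) -> list[str]:
    """Delete spaces, start a new line after every punctuation character,
    and drop the punctuation itself, yielding the text to be detected."""
    no_spaces = initial_text.replace(" ", "")                     # pass 1
    # Splitting at punctuation both segments the text (pass 2) and
    # removes the punctuation characters themselves (pass 3).
    lines = re.split("[" + re.escape(PUNCTUATION) + "]", no_spaces)
    return [line for line in lines if line]
```

Running `preprocess("hello world, ok.")` removes the space, splits at the comma and period, and drops both punctuation marks.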
  • the calculation module 304 is specifically configured to: separately count the number of inserted characters, the number of deleted characters, the number of replaced characters and the number of characters in the proofread text; and input these four counts into a preset calculation formula to obtain the character recognition error rate of the proofread text, where the preset calculation formula is WER = (i + s + d) / t × 100%, in which:
  • WER is the character recognition error rate
  • i is the number of inserted characters
  • s is the number of replaced characters
  • d is the number of deleted characters
  • t is the number of characters in the proofreading text.
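With the variable definitions above, the error-rate computation reduces to (i + s + d) / t; expressing the result as a percentage is an assumption:

```python
def char_error_rate(i: int, s: int, d: int, t: int) -> float:
    """Character recognition error rate: insertions + substitutions +
    deletions, divided by the number of characters in the proofread text."""
    if t == 0:
        raise ValueError("proofread text must contain at least one character")
    return (i + s + d) / t * 100  # percentage scaling is an assumption
```

For instance, 1 insertion, 2 substitutions and 1 deletion over a 20-character proofread text gives a 20% error rate.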
  • the determining module 305 is specifically configured to: compare the character recognition error rate with the standard error rate and determine whether the character recognition error rate is greater than the standard error rate; if it is greater, determine the preset first comparison result as the conversion evaluation result of the speech-to-text conversion, where the preset first comparison result is that the accuracy of the speech-to-text conversion is low; if it is not greater, determine the preset second comparison result as the conversion evaluation result, where the preset second comparison result is that the accuracy of the speech-to-text conversion is high.
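The determination step above reduces to a single threshold comparison; the returned labels are illustrative stand-ins for the preset first and second comparison results:

```python
def conversion_evaluation(error_rate: float, standard_error_rate: float) -> str:
    """Select the preset comparison result by comparing the character
    recognition error rate with the standard error rate."""
    if error_rate > standard_error_rate:
        return "low accuracy"   # preset first comparison result
    return "high accuracy"      # preset second comparison result
```

Note that an error rate exactly equal to the standard error rate is "not greater", so it falls into the high-accuracy branch, matching the text above.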
  • Figures 3 and 4 above describe in detail the apparatus for evaluating speech recognition results in the embodiment of the present application from the perspective of modular functional entities, and the following describes the device for evaluating speech recognition results in the embodiment of the present application in detail from the perspective of hardware processing.
  • the device 500 for evaluating a speech recognition result may vary greatly due to different configurations or performances, and may include one or more processors (central processing units, CPU) 510 (e.g., one or more processors) and memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) that store application programs 533 or data 532.
  • the memory 520 and the storage medium 530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the apparatus 500 for evaluating the speech recognition result.
  • the processor 510 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the device 500 for evaluating the speech recognition result.
  • the apparatus 500 for evaluating speech recognition results may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input and output interfaces 560, and/or, one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and more.
  • the present application also provides a computer device for evaluating speech recognition results. The computer device includes a memory and a processor; the memory stores computer-readable instructions, and when the processor executes those instructions, the steps of the method for evaluating speech recognition results in the above embodiments are performed.
  • the present application also provides a computer-readable storage medium, and the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions, and when the computer instructions are run on a computer, the computer performs the steps of the method for evaluating speech recognition results in the above embodiments.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • Blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Method, apparatus and device for evaluating a speech recognition result, and storage medium, relating to the field of artificial intelligence and used to improve the efficiency of evaluating the accuracy of the conversion of initial speech into initial text. The evaluation method comprises: converting initial speech in a video return-visit item on the basis of a speech recognition function so as to obtain initial text (101); performing space-character deletion preprocessing, sorting preprocessing and punctuation-character deletion preprocessing on the initial text so as to obtain text to be detected (102); acquiring a word sequence to be detected in the text to be detected, and proofreading and marking, according to a preset standard word sequence, the word sequence to be detected so as to obtain proofread text (103); calculating a character recognition error rate of the proofread text by means of a preset calculation formula (104); and selecting a preset comparison result by comparing the character recognition error rate with a standard error rate, and determining a conversion evaluation result of the speech-to-text conversion (105). The present invention further relates to blockchain technology, and the initial text can be stored in a blockchain.
PCT/CN2021/090436 2020-11-04 2021-04-28 Method, apparatus and device for evaluating speech recognition result, and storage medium WO2022095353A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011215789.4 2020-11-04
CN202011215789.4A CN112151014B (zh) 2020-11-04 2020-11-04 Method, apparatus, device and storage medium for evaluating speech recognition results

Publications (1)

Publication Number Publication Date
WO2022095353A1 true WO2022095353A1 (fr) 2022-05-12

Family

ID=73953912

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090436 WO2022095353A1 (fr) 2020-11-04 2021-04-28 Method, apparatus and device for evaluating speech recognition result, and storage medium

Country Status (2)

Country Link
CN (1) CN112151014B (fr)
WO (1) WO2022095353A1 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151014B (zh) * 2020-11-04 2023-07-21 平安科技(深圳)有限公司 Method, apparatus, device and storage medium for evaluating speech recognition results
CN112599129B (zh) * 2021-03-01 2021-05-28 北京世纪好未来教育科技有限公司 Speech recognition method, apparatus, device and storage medium
CN113129935B (zh) * 2021-06-16 2021-08-31 北京新唐思创教育科技有限公司 Method and apparatus for acquiring audio marking data, storage medium and electronic device
CN113312456A (zh) * 2021-06-28 2021-08-27 中国平安人寿保险股份有限公司 Method, apparatus, device and storage medium for generating text for short videos
CN115687334B (zh) * 2023-01-05 2023-05-16 粤港澳大湾区数字经济研究院(福田) Data quality inspection method, apparatus, device and storage medium
CN116403604B (zh) * 2023-06-07 2023-11-03 北京奇趣万物科技有限公司 Method and system for evaluating children's reading ability

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1571013A (zh) * 2003-02-13 2005-01-26 微软公司 Method and apparatus for predicting word error rates from text
CN109637536A (zh) * 2018-12-27 2019-04-16 苏州思必驰信息科技有限公司 Method and apparatus for automatically recognizing semantic accuracy
CN111179939A (zh) * 2020-04-13 2020-05-19 北京海天瑞声科技股份有限公司 Speech transcription method, speech transcription apparatus and computer storage medium
CN112151014A (zh) * 2020-11-04 2020-12-29 平安科技(深圳)有限公司 Method, apparatus, device and storage medium for evaluating speech recognition results

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653517A (zh) * 2015-11-05 2016-06-08 乐视致新电子科技(天津)有限公司 Recognition rate determination method and apparatus
CN108766437B (zh) * 2018-05-31 2020-06-23 平安科技(深圳)有限公司 Speech recognition method, apparatus, computer device and storage medium
CN110968730B (zh) * 2019-12-16 2023-06-09 Oppo(重庆)智能科技有限公司 Audio marking processing method, apparatus, computer device and storage medium
CN111223498A (zh) * 2020-01-10 2020-06-02 平安科技(深圳)有限公司 Method, apparatus and computer-readable storage medium for intelligent emotion recognition
CN111681642B (zh) * 2020-06-03 2022-04-15 北京字节跳动网络技术有限公司 Speech recognition evaluation method, apparatus, storage medium and device
CN111696557A (zh) * 2020-06-23 2020-09-22 深圳壹账通智能科技有限公司 Method, apparatus, device and storage medium for calibrating speech recognition results
CN111816165A (zh) * 2020-07-07 2020-10-23 北京声智科技有限公司 Speech recognition method, apparatus and electronic device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FINDYOU: "[HResults calculates word error rate (WER) and sentence error rate (SER)]", CN BLOGS, CNBLOGS, 3 April 2019 (2019-04-03), pages 1 - 11, XP055931090, Retrieved from the Internet <URL:https://www.cnblogs.com/FINDYOU/P/10646312.HTML> [retrieved on 20220614] *
MICHAELLIU_DEV: "[Detailed explanation of CTC algorithm]", BLOG CSDN, CSDN, 2 November 2018 (2018-11-02), pages 1 - 8, XP055931095, Retrieved from the Internet <URL:https://blog.csdn.net/michaelshare/article/details/83660557> [retrieved on 20220614] *

Also Published As

Publication number Publication date
CN112151014A (zh) 2020-12-29
CN112151014B (zh) 2023-07-21

Similar Documents

Publication Publication Date Title
WO2022095353A1 (fr) Method, apparatus and device for evaluating speech recognition result, and storage medium
CN108304372B (zh) 实体提取方法和装置、计算机设备和存储介质
CN109887497B (zh) 语音识别的建模方法、装置及设备
US8185376B2 (en) Identifying language origin of words
US7289950B2 (en) Extended finite state grammar for speech recognition systems
US7421387B2 (en) Dynamic N-best algorithm to reduce recognition errors
CN107229627B (zh) 一种文本处理方法、装置及计算设备
JP2004362584A (ja) テキストおよび音声の分類のための言語モデルの判別トレーニング
JP2016536652A (ja) モバイル機器におけるリアルタイム音声評価システム及び方法
CN112992125B (zh) 一种语音识别方法、装置、电子设备、可读存储介质
CN109858025B (zh) 一种地址标准化语料的分词方法及系统
CN113626573A (zh) 一种销售会话异议及应对提取方法及系统
CN113221542A (zh) 一种基于多粒度融合与Bert筛选的中文文本自动校对方法
CN115687621A (zh) 一种短文本标签标注方法及装置
CN115759119A (zh) 一种金融文本情感分析方法、系统、介质和设备
JP5897718B2 (ja) 音声検索装置、計算機読み取り可能な記憶媒体、及び音声検索方法
WO2010050414A1 (fr) Dispositif d&#39;adaptation de modèle, son procédé et son programme
CN112287657A (zh) 基于文本相似度的信息匹配系统
JP5590549B2 (ja) 音声検索装置および音声検索方法
JP5253317B2 (ja) 要約文作成装置、要約文作成方法、プログラム
CN107886233B (zh) 客服的服务质量评价方法和系统
JP2017191278A (ja) 音素誤り獲得装置、辞書追加装置、音声認識装置、音素誤り獲得方法、音声認識方法、およびプログラム
CN114254628A (zh) 一种语音转写中结合用户文本的快速热词提取方法、装置、电子设备及存储介质
JP2938865B1 (ja) 音声認識装置
CN114444491A (zh) 新词识别方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888057

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21888057

Country of ref document: EP

Kind code of ref document: A1