CN113744718A - Voice text output method and device, storage medium and electronic device - Google Patents

Publication number
CN113744718A
Authority
CN
China
Prior art keywords
phoneme
confusion
recognition result
voice
confusion matrix
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010464302.XA
Other languages
Chinese (zh)
Inventor
苏腾荣
马志芳
李想
赵培
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Haier Uplus Intelligent Technology Beijing Co Ltd
Application filed by Haier Uplus Intelligent Technology Beijing Co Ltd
Priority to CN202010464302.XA
Publication of CN113744718A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The invention provides a method and a device for outputting a voice text, a storage medium, and an electronic device. The method comprises: performing speech recognition on a target voice to obtain a phoneme-based speech recognition result; and correcting the speech recognition result according to a preset phoneme confusion matrix and outputting the corrected voice text. The phoneme confusion matrix indicates, for each phoneme, the probability of confusion between a phoneme sequence labeled with the correct result and N pronunciation-confused phoneme sequences, where N is a positive integer. In other words, the speech recognition result is corrected using a pre-generated phoneme confusion matrix to obtain the corrected voice text.

Description

Voice text output method and device, storage medium and electronic device
Technical Field
The invention relates to the field of communication, in particular to a method and a device for outputting a voice text, a storage medium and an electronic device.
Background
In the related art, the basic framework of a conventional voice dialog system is shown in fig. 1: voice is input through a recording device, passes through signal processing and speech recognition, and enters the dialog system; once appropriate feedback content is obtained, voice output is performed. Besides algorithmic errors in the speech recognition process itself, background noise, inaccurate spoken pronunciation, personalized habitual misreading, natural spoken pronunciation, and linking (continuous reading) can all introduce deviations into the text output by speech recognition. In an intelligent voice dialog system, speech recognition and the dialog system are cascaded, so errors easily propagate through the cascade and affect the whole system.
Existing retrieval techniques for intelligent dialog systems are generally optimized at the text level and include natural language processing techniques such as entity recognition, semantic understanding, and part-of-speech tagging. These techniques enable a dialog system to give relatively reasonable output. However, the input to the dialog system is the text output by speech recognition. Where that text deviates only slightly from the real input, the dialog system cannot correct the error, and the subsequent retrieval produces a larger deviation.
For the problem in the related art that the deviation between the voice output result and the real input in a conventional voice dialog system cannot be corrected, no effective technical solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for outputting a voice text, a storage medium and an electronic device, which are used for at least solving the problems that in the related art, the deviation between a voice output result and real input in the traditional voice dialogue system cannot be corrected and the like.
According to an embodiment of the present invention, there is provided a method for outputting a speech text, including: performing speech recognition on a target voice to obtain a phoneme-based speech recognition result; and correcting the speech recognition result according to a preset phoneme confusion matrix and outputting the corrected speech text; wherein the phoneme confusion matrix indicates, for each phoneme, the probability of confusion between a phoneme sequence labeled with the correct result and N pronunciation-confused phoneme sequences, N being a positive integer.
In an exemplary embodiment, before correcting the speech recognition result according to the preset phoneme confusion matrix and outputting the corrected speech text, the method further includes: acquiring a phoneme sequence labeled with the correct result and N pronunciation-confused phoneme sequences; and aligning the phoneme sequence labeled with the correct result with the N phoneme sequences to determine the phoneme confusion matrix, which indicates the confusion probability of each phoneme.
In an exemplary embodiment, correcting the speech recognition result according to the preset phoneme confusion matrix includes: performing an operation on the speech recognition result and the phoneme confusion matrix to obtain an operation result; and correcting the speech recognition result according to the operation result.
In an exemplary embodiment, performing the operation on the speech recognition result and the phoneme confusion matrix to obtain the operation result includes: applying a preset algorithm to the speech recognition result and the phoneme confusion matrix to obtain a plurality of confusion probability values, where the confusion probability values indicate the operation result.
In an exemplary embodiment, correcting the speech recognition result according to the operation result includes: selecting, from the plurality of confusion probability values, the correct-result phoneme sequence corresponding to the maximum confusion probability value; and correcting the speech recognition result according to that phoneme sequence.
In an exemplary embodiment, the method further includes: obtaining corpus data of a target object; determining a phoneme sequence of a correct result corresponding to the corpus data and M phoneme sequences with confused pronunciations according to the acquired corpus data, wherein M is a positive integer; and determining a phoneme confusion matrix of the target object according to the phoneme sequence of the correct result corresponding to the corpus data and the M phoneme sequences subjected to pronunciation confusion.
In an exemplary embodiment, after determining the phoneme confusion matrix of the target object according to the phoneme sequence of the correct result corresponding to the corpus data and the M pronunciation-confused phoneme sequences, the method further includes: under the condition that the voice data of the target object are received, recognizing the voice data of the target object to obtain a target recognition result based on phonemes; and correcting the target recognition result according to the phoneme confusion matrix of the target object.
According to another embodiment of the present invention, there is provided an apparatus for outputting a speech text, including: a determining module, configured to perform speech recognition on a target voice to obtain a phoneme-based speech recognition result; and a processing module, configured to correct the speech recognition result according to a preset phoneme confusion matrix and output the corrected speech text; wherein the phoneme confusion matrix indicates, for each phoneme, the probability of confusion between a phoneme sequence labeled with the correct result and N pronunciation-confused phoneme sequences, N being a positive integer.
In an exemplary embodiment, the processing module is further configured to obtain a phoneme sequence labeled with a correct result and pronunciation-confused N phoneme sequences; aligning the phoneme sequence labeled with correct results with the N phoneme sequences to determine a phoneme confusion matrix for indicating a confusion probability of each phoneme.
In an exemplary embodiment, the processing module is further configured to perform an operation on the speech recognition result and the phoneme confusion matrix to obtain an operation result; and correcting the voice recognition result according to the operation result.
In an exemplary embodiment, the processing module is further configured to calculate the voice recognition result and the phoneme confusion matrix according to a preset algorithm to obtain a plurality of confusion probability values, where the confusion probability values are used to indicate the calculation result.
In an exemplary embodiment, the processing module is further configured to select, from the plurality of confusion probability values, the correct-result phoneme sequence corresponding to the maximum confusion probability value, and to correct the speech recognition result according to that phoneme sequence.
In an exemplary embodiment, the apparatus further includes: the acquisition module is used for acquiring corpus data of the target object; the corresponding module is used for determining a phoneme sequence of a correct result corresponding to the corpus data and M phoneme sequences with confused pronunciations according to the acquired corpus data, wherein M is a positive integer; and determining a phoneme confusion matrix of the target object according to the phoneme sequence of the correct result corresponding to the corpus data and the M phoneme sequences subjected to pronunciation confusion.
In an exemplary embodiment, the corresponding module is further configured to, in a case that the speech data of the target object is received, recognize the speech data of the target object to obtain a target recognition result based on phonemes; and correcting the target recognition result according to the phoneme confusion matrix of the target object.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, speech recognition is performed on the target voice to obtain a phoneme-based speech recognition result; the speech recognition result is corrected according to a preset phoneme confusion matrix, and the corrected speech text is output. The phoneme confusion matrix indicates, for each phoneme, the probability of confusion between a phoneme sequence labeled with the correct result and N pronunciation-confused phoneme sequences, where N is a positive integer. In other words, the speech recognition result is corrected using a pre-generated phoneme confusion matrix to obtain the corrected speech text.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a basic framework diagram of a conventional voice dialogue system in the related art;
fig. 2 is a block diagram of a hardware configuration of a computer terminal of an output method of a speech text according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method of outputting a phonetic text according to an embodiment of the present invention;
FIG. 4 is a flow diagram of a primary process of generating a confusion matrix according to an alternative embodiment of the invention;
FIG. 5 is a flow diagram of adaptation of the confusion matrix to the dialog system, in accordance with an alternative embodiment of the present invention;
FIG. 6 is a flow diagram of adaptation of the confusion matrix to the dialog system according to an alternative embodiment of the invention;
fig. 7 is a block diagram of a configuration of an output apparatus of a phonetic text according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method provided by the embodiment of the application can be executed in a computer terminal or a similar operation device. Taking the example of being operated on a computer terminal, fig. 2 is a block diagram of a hardware structure of the computer terminal of the method for outputting a speech text according to the embodiment of the present invention. As shown in fig. 2, the computer terminal may include one or more (only one shown in fig. 2) processors 202 (the processors 202 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 204 for storing data, and in an exemplary embodiment, may also include a transmission device 206 for communication functions and an input-output device 208. It will be understood by those skilled in the art that the structure shown in fig. 2 is only an illustration, and is not intended to limit the structure of the computer terminal. For example, the computer terminal may also include more or fewer components than shown in FIG. 2, or have a different configuration with equivalent functionality to that shown in FIG. 2 or more functionality than that shown in FIG. 2.
The memory 204 can be used for storing computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the output method of the speech text in the embodiment of the present invention, and the processor 202 executes various functional applications and data processing by running the computer programs stored in the memory 204, so as to implement the method described above. Memory 204 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 204 may further include memory located remotely from the processor 202, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 206 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 206 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 206 can be a radio frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a method for outputting a speech text is provided, and fig. 3 is a flowchart of a method for outputting a speech text according to an embodiment of the present invention, where the flowchart includes the following steps:
step S302, carrying out voice recognition on the target voice to obtain a voice recognition result based on phonemes;
step S304, correcting the voice recognition result according to a preset phoneme confusion matrix, and outputting a corrected voice text; the phoneme confusion matrix indicates, for each phoneme, the probability of confusion between a phoneme sequence labeled with the correct result and N pronunciation-confused phoneme sequences, wherein N is a positive integer.
Through the above steps, speech recognition is performed on the target voice to obtain a phoneme-based speech recognition result, the speech recognition result is corrected according to a preset phoneme confusion matrix, and the corrected speech text is output. The phoneme confusion matrix indicates, for each phoneme, the probability of confusion between a phoneme sequence labeled with the correct result and N pronunciation-confused phoneme sequences, where N is a positive integer. In other words, the speech recognition result is corrected using a pre-generated phoneme confusion matrix. This addresses the problem in the related art that the deviation between the voice output result and the real input in a conventional voice dialog system cannot be corrected, reduces the impact of speech errors, and improves both the robustness of the system and its flexibility in adapting to various personalized accents.
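Steps S302 and S304 can be illustrated with a minimal sketch. It assumes that the recognizer has already produced a list of phonemes, that the confusion matrix is stored as a nested dictionary giving the probability of each intended phoneme for a recognized phoneme, and that a small lexicon maps phoneme strings to text. The names `correct_phonemes`, `output_speech_text`, and the matrix layout are illustrative assumptions and are not taken from the embodiment, which does not fix a data structure.

```python
from typing import Dict, List

# Assumed layout: matrix[recognized][intended] = P(intended | recognized).
ConfusionMatrix = Dict[str, Dict[str, float]]


def correct_phonemes(recognized: List[str], matrix: ConfusionMatrix) -> List[str]:
    """Greedy correction: map each phoneme to its most probable intended phoneme."""
    corrected = []
    for phoneme in recognized:
        row = matrix.get(phoneme)
        # Without statistics for this phoneme, keep it unchanged.
        corrected.append(max(row, key=row.get) if row else phoneme)
    return corrected


def output_speech_text(recognized: List[str],
                       matrix: ConfusionMatrix,
                       lexicon: Dict[str, str]) -> str:
    """Correct the phoneme sequence (S304) and look up the corrected text."""
    key = " ".join(correct_phonemes(recognized, matrix))
    # Fall back to the phoneme string when the lexicon has no entry.
    return lexicon.get(key, key)
```

Greedy per-phoneme substitution is the simplest possible reading of the correction step; it ignores context, so a sequence-level score over candidate sequences is likely to behave better in practice.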
In an exemplary embodiment, before modifying the speech recognition result according to a preset phoneme confusion matrix and outputting the modified speech text, the phoneme confusion matrix may be generated by: acquiring a phoneme sequence with a correct result and N phoneme sequences with confusion pronunciation; aligning the phoneme sequence labeled with correct results with the N phoneme sequences to determine a phoneme confusion matrix for indicating a confusion probability of each phoneme.
That is, before the speech recognition result is corrected using the preset phoneme confusion matrix, the phoneme sequence labeled with the correct result may be obtained in advance and aligned with the multiple pronunciation-confused phoneme sequences. The alignment may use any implementation known in the prior art; the embodiment of the present invention does not limit it.
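Since the alignment method is left open, the following sketch uses one common choice, a unit-cost edit-distance alignment with traceback. The function name `align` and the use of `None` to mark insertions and deletions are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align(ref: List[str], hyp: List[str]) -> List[Pair]:
    """Align a correct phoneme sequence (ref) with a confused one (hyp).

    Returns (ref, hyp) pairs; None marks a phoneme missing on that side.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # phoneme dropped in hyp
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # phoneme inserted in hyp
            j -= 1
    pairs.reverse()
    return pairs
```

For example, aligning `"uu uo x iang t ing".split()` with `"uu uo x ian t ing".split()` pairs `iang` with `ian` and leaves every other phoneme matched with itself.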
In an exemplary embodiment, modifying the speech recognition result according to a preset phoneme confusion matrix includes: calculating the voice recognition result and the phoneme confusion matrix to obtain a calculation result; and correcting the voice recognition result according to the operation result.
In this embodiment, a phoneme-based speech recognition result is obtained from the target voice, an operation such as weighted addition or multiplication is performed on the speech recognition result and the preset phoneme confusion matrix, and the speech recognition result is corrected according to the operation result.
In an exemplary embodiment, the performing an operation on the speech recognition result and the phoneme confusion matrix to obtain an operation result includes: and operating the voice recognition result and the phoneme confusion matrix through a preset algorithm to obtain a plurality of confusion probability values, wherein the confusion probability values are used for indicating the operation result.
A preset algorithm is applied to the speech recognition result and the phoneme confusion matrix to obtain confusion probability values that indicate the operation result. Note that the greater a confusion probability value is, the more similar the phonemes in the phoneme-based speech recognition result are to the phonemes at the position corresponding to that value.
In an exemplary embodiment, correcting the speech recognition result according to the operation result includes: selecting, from the plurality of confusion probability values, the correct-result phoneme sequence corresponding to the maximum confusion probability value; and correcting the speech recognition result according to that phoneme sequence. That is, after the preset algorithm has produced the confusion probability values, the correct-result phoneme sequence whose confusion probability value is the largest is selected and used to correct the speech recognition result.
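The selection of the maximum confusion probability value can be sketched as follows. The preset algorithm is not specified in the text, so this example assumes the multiplication variant: the score of a candidate is the product of per-phoneme probabilities `matrix[intended][recognized]`, candidates have the same length as the recognized sequence, and unseen pairs receive a small floor value. The names `confusion_score`, `best_candidate`, and `floor` are illustrative.

```python
from typing import Dict, List, Tuple

# Assumed layout: matrix[intended][recognized] = P(recognized | intended).
ConfusionMatrix = Dict[str, Dict[str, float]]


def pair_probability(intended: str, recognized: str,
                     matrix: ConfusionMatrix, floor: float) -> float:
    row = matrix.get(intended)
    if row is None:
        # No statistics for this phoneme: assume it is recognized as itself.
        return 1.0 if intended == recognized else floor
    return row.get(recognized, floor)


def confusion_score(recognized: List[str], candidate: List[str],
                    matrix: ConfusionMatrix, floor: float = 1e-6) -> float:
    """Product of per-position confusion probabilities (multiplication variant)."""
    if len(recognized) != len(candidate):
        return 0.0  # simplification: only equal-length candidates are scored
    score = 1.0
    for intended, heard in zip(candidate, recognized):
        score *= pair_probability(intended, heard, matrix, floor)
    return score


def best_candidate(recognized: List[str], candidates: List[List[str]],
                   matrix: ConfusionMatrix) -> Tuple[List[str], float]:
    """Return the candidate with the maximum confusion probability value."""
    scored = [(confusion_score(recognized, c, matrix), c) for c in candidates]
    score, winner = max(scored, key=lambda item: item[0])
    return winner, score
```

A weighted-addition variant could sum log-probabilities instead of multiplying, which avoids underflow on long sequences and selects the same candidate.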
To improve the accuracy with which the phoneme confusion matrix corrects the different accents of different target objects, in an exemplary embodiment the method further includes: obtaining corpus data of a target object; determining, according to the obtained corpus data, a phoneme sequence of the correct result corresponding to the corpus data and M pronunciation-confused phoneme sequences, where M is a positive integer; and determining a phoneme confusion matrix of the target object according to the phoneme sequence of the correct result corresponding to the corpus data and the M pronunciation-confused phoneme sequences. That is, a phoneme confusion matrix dedicated to correcting the voice information of the target object is generated for that target object.
To correct special pronunciations for different target objects, the corpus data of the target object is obtained, the corresponding correct-result phoneme sequence and a plurality of pronunciation-confused phoneme sequences are determined from the corpus data, and the phoneme confusion matrix of the target object is generated. The corpus data of the target object serves as reference data for individually adjusting the phoneme confusion matrix, so that a differentiated phoneme confusion matrix is generated for each of a plurality of target objects with different corpora. This reduces phoneme confusion caused by accent differences, habitual misreading, regional differences (elision, linking, retroflexion, confusion of certain consonants or vowels) and the like, and improves the information retrieval robustness of the system.
In an exemplary embodiment, after determining the phoneme confusion matrix of the target object according to the phoneme sequence of the correct result corresponding to the corpus data and the M pronunciation-confused phoneme sequences, the method further includes: when voice data of the target object is received, recognizing the voice data of the target object to obtain a phoneme-based target recognition result; and correcting the target recognition result according to the phoneme confusion matrix of the target object. For example, when voice data of a target object with the same corpus data is received again, the phoneme-based target recognition result of the target object is obtained and corrected according to the generated phoneme confusion matrix.
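One way to keep a separate matrix for each target object is sketched below. It assumes that aligned (intended, recognized) phoneme pairs are already available for a user, and that phonemes without user-specific observations fall back to a shared base matrix. The class name `PersonalConfusionStore` and the fallback rule are assumptions, not part of the described embodiment.

```python
from typing import Dict, Iterable, Tuple

# Assumed layout: matrix[intended][recognized] = P(recognized | intended).
Matrix = Dict[str, Dict[str, float]]


class PersonalConfusionStore:
    """Holds a shared base matrix plus per-user confusion counts."""

    def __init__(self, base: Matrix) -> None:
        self.base = base
        # counts[user][intended][recognized] = number of observations
        self.counts: Dict[str, Dict[str, Dict[str, int]]] = {}

    def update(self, user_id: str, pairs: Iterable[Tuple[str, str]]) -> None:
        """Add aligned (intended, recognized) pairs from a user's corpus data."""
        user = self.counts.setdefault(user_id, {})
        for intended, recognized in pairs:
            row = user.setdefault(intended, {})
            row[recognized] = row.get(recognized, 0) + 1

    def matrix_for(self, user_id: str) -> Matrix:
        """Base matrix with rows replaced by the user's own statistics."""
        matrix = {ph: dict(row) for ph, row in self.base.items()}
        for intended, row in self.counts.get(user_id, {}).items():
            total = sum(row.values())
            matrix[intended] = {rec: n / total for rec, n in row.items()}
        return matrix
```

A user who often pronounces `n` as `l` would, after a few updates, get a row for `n` that assigns most of its probability to `l`, while other users continue to receive the base matrix.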
Optionally, the phoneme sequence of the correct result is obtained by acquiring the phoneme sequence labeled with the correct result from script data.
In order to better understand the process of the method for outputting the voice text, the following describes a flow of the method for outputting the voice text with an alternative embodiment.
In an optional embodiment of the present invention, an information retrieval technique for intelligent dialog systems based on a personalized spoken-pronunciation confusion matrix is provided. The technique can also optimize for errors arising under the different accents of different users (corresponding to the corpus data in the embodiment of the present invention), thereby reducing the impact of speech errors, improving the robustness of the system, and adapting to various personalized accents. To address speech recognition errors, a general pronunciation confusion matrix (corresponding to the phoneme confusion matrix in the embodiment of the present invention) is generated first. Fig. 4 shows the main process of generating the confusion matrix in this optional embodiment: the confusion matrix is initially generated from text data labeled with correct results (corresponding to the phoneme sequence labeled with the correct result in the embodiment of the present invention); the labeled correct result and the recognized text, which may contain pronunciation confusion, are each converted into a phoneme sequence; the two sequences are aligned; and the confusion probability of each phoneme is then counted to form the initial base confusion matrix.
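The counting step at the end of this process can be sketched as follows, assuming the two sequences have already been aligned into (correct, recognized) phoneme pairs. Gaps marked with `None` are skipped here for simplicity, and each row is normalized so that it sums to 1; both choices, and the name `build_confusion_matrix`, are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def build_confusion_matrix(aligned_pairs: Iterable[Pair]) -> Dict[str, Dict[str, float]]:
    """Estimate P(recognized | correct) from aligned phoneme pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for correct, recognized in aligned_pairs:
        if correct is None or recognized is None:
            continue  # insertions and deletions are ignored in this sketch
        counts[correct][recognized] += 1
    matrix: Dict[str, Dict[str, float]] = {}
    for correct, row in counts.items():
        total = sum(row.values())
        matrix[correct] = {rec: n / total for rec, n in row.items()}
    return matrix
```

With three observations of `iang` recognized correctly and one recognized as `ian`, the row for `iang` becomes `{"iang": 0.75, "ian": 0.25}`.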
To correct special pronunciations for the personalized accents of different users, the system uses data whose matching-result confidence exceeds an adaptive threshold as phoneme data for the personalized confusion matrix, so that a differentiated, personalized accent confusion matrix is generated for each user. Figs. 5 and 6 are flowcharts of the adaptation of the confusion matrix to the dialog system.
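The confidence gate described here can be sketched as a simple filter. How the threshold adapts is not stated in the text, so the example takes the threshold as a plain parameter; the `MatchResult` record and the function `adaptation_pairs` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MatchResult:
    recognized: List[str]   # phonemes produced by speech recognition
    matched: List[str]      # phonemes of the knowledge-base entry that was matched
    confidence: float       # confidence of the matching result


def adaptation_pairs(results: List[MatchResult],
                     threshold: float) -> List[Tuple[List[str], List[str]]]:
    """Keep only confident matches as (matched, recognized) sequence pairs.

    The returned pairs can then be aligned and counted to update a
    user's personalized confusion matrix.
    """
    return [(r.matched, r.recognized) for r in results if r.confidence > threshold]
```

Gating on confidence keeps doubtful matches from being treated as ground truth, which would otherwise reinforce wrong corrections in the personalized matrix.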
Even when the error rate of the text generated by speech recognition is low, differences between the language model and the knowledge base can cause problems such as variant pronunciations and homophones in the text. Even a difference in how a word is written can have an unexpected effect on information retrieval in the conventional retrieval mode.
The embodiment of the invention introduces phonemes as the basis for text retrieval, which eliminates the influence of homophones in the text, and introduces a confusion matrix to reduce the influence of pronunciation recognition errors. While the system is running, the personalized confusion matrices self-corrected for different users reduce phoneme confusion caused by differences in accent, habitual misreading, regional differences (elision, linking, retroflexion, confusion of certain consonants or vowels) and the like, and improve the information retrieval robustness of the system. The practical application effect is as follows:
Standard statements in the knowledge base:
1: I want to listen to English
2: I want to listen to music
3: The fossa is strengthened for one year
4: Quickening the crazy time of Christmas
5: I want to listen to forgetful water
6: I want to listen to a clumsy child
Search results for user input 1:
initializing data to phonemes … …
Please input a Query:(q to exit):
I want to hear the water
uu uo x iang t ing uu uang q ing sh ui
The query result is:
uu uo x iang t ing uu uang q ing sh ui
5:[1.0,0.0]
uu uo x iang t ing ii ing vv v
1:[0.53333336,5.0]
uu uo x iang t ing ii in vv ve
2:[0.53333336,6.5]
j ia k uai sh eng d an j ie f eng k uang
4:[0.48214287,8.0]
uu uo x iang t ing b en x iao h ai
6:[0.44444445,6.5]
uu uo x ian t ing ii i n ian
3:[0.40833333,7.428571]
Search results for user input 2:
Please input a Query:(q to exit):
i want to listen to this xiaoha
uu uo x iang t ing b en x iao h ai
6:[0.8402778,0.0]
uu uo x iang t ing ii ing vv v
1:[0.40833333,5.142857]
uu uo x iang t ing ii in vv ve
2:[0.40833333,5.142857]
uu uo x iang t ing uu uang q ing sh ui
5:[0.3402778,5.142857]
uu uo x ian t ing ii i n ian
5:[0.3,6.0]
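The listings above rank the knowledge-base statements by phoneme sequence, but the text does not define the two numbers attached to each result. The following sketch therefore shows only one possible ranking scheme, assumed for illustration: an edit distance over phonemes in which the cost of substituting a recognized phoneme for an intended one is reduced by their confusion probability. The names `weighted_distance` and `search` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed layout: matrix[intended][recognized] = P(recognized | intended).
ConfusionMatrix = Dict[str, Dict[str, float]]


def weighted_distance(query: List[str], entry: List[str],
                      matrix: ConfusionMatrix) -> float:
    """Edit distance where likely confusions are cheap substitutions."""
    n, m = len(entry), len(query)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            intended, heard = entry[i - 1], query[j - 1]
            if intended == heard:
                sub = 0.0
            else:
                # Substitution cost shrinks as the confusion probability grows.
                sub = 1.0 - matrix.get(intended, {}).get(heard, 0.0)
            d[i][j] = min(d[i - 1][j - 1] + sub, d[i - 1][j] + 1.0, d[i][j - 1] + 1.0)
    return d[n][m]


def search(query: List[str], knowledge_base: Dict[str, List[str]],
           matrix: ConfusionMatrix) -> List[Tuple[float, str]]:
    """Return (distance, statement) pairs, closest statement first."""
    scored = [(weighted_distance(query, phonemes, matrix), text)
              for text, phonemes in knowledge_base.items()]
    return sorted(scored)
```

With a matrix entry stating that `ing` is often heard as `in`, a query ending in `ii in vv v` ranks the statement with `ii ing vv v` ahead of the one with `ii in vv ve`, even though both differ from the query by one phoneme.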
The results of an actual test in the music scenario are shown in Table 1 below:
Target speech | Speech text corrected by the phoneme confusion matrix
The broadcast is no for one hundred years | Borrow five hundred years again
Borrowing for 500 years again in the first direction | Borrow five hundred years again
Play the eyes of one sitting me | Let I do your eyes
Play a let me do your eye | Let I do your eyes
Put one head and let me sit in the Reed's eyes | Let I do your eyes
Roasted small apple | Small apple
Make I get the eye of rice without putting one | Let I do your eyes
I want to listen to a thousand pages | One thousand and one night
Small oil-releasing youth refining handbook | Youth repairing book
Beautiful curry I want to listen to | Curry curry
Play a big busy to call me to patrol the mountain | King calls me to patrol mountain
Seeing and hearing first our time | Our time
Time of listening to berry | Time of absence
I want to listen to Tang ginger juice song | Song of Yangtze river
City wanting to listen to sky | Sky city
I want to listen to the book of Junjie in Lin how you are in the world | Lose you won the world and how
TABLE 1
In summary, the embodiments of the present invention introduce phonemes and a phoneme confusion matrix to provide an information correction function for information retrieval in the dialog system. Applying the phoneme confusion matrix to the information retrieval problem of an intelligent dialog system addresses recognition errors caused by personalized spoken pronunciation.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a device for outputting a speech text is further provided. The device is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 7 is a block diagram of the structure of a device for outputting a speech text according to an embodiment of the present invention. As shown in Fig. 7, the device includes:
(1) a determining module 72, configured to perform speech recognition on the target speech to obtain a phoneme-based speech recognition result;
(2) a processing module 74, configured to modify the speech recognition result according to a preset phoneme confusion matrix and to output a modified speech text; the phoneme confusion matrix is used for indicating the confusion probability of each phoneme between the phoneme sequence with correct result and the N phoneme sequences with pronunciation confusion, wherein N is a positive integer.
With this device, voice recognition is performed on the target voice to obtain a phoneme-based voice recognition result; the result is then corrected according to a preset phoneme confusion matrix, and the corrected voice text is output. The phoneme confusion matrix indicates the confusion probability of each phoneme between the phoneme sequence of the correct result and the N pronunciation-confused phoneme sequences, where N is a positive integer. In other words, the technical scheme corrects the voice recognition result using the generated phoneme confusion matrix and thereby obtains the corrected voice text.
In an exemplary embodiment, the processing module is further configured to obtain a phoneme sequence labeled with a correct result and pronunciation-confused N phoneme sequences; aligning the phoneme sequence labeled with correct results with the N phoneme sequences to determine a phoneme confusion matrix for indicating a confusion probability of each phoneme.
That is to say, before the speech recognition result is corrected by the preset phoneme confusion matrix, the phoneme sequence labeled with the correct result may be obtained in advance to be aligned and matched with the multiple pronunciation confusion phoneme sequences, and the specific matching process may adopt any implementation manner in the prior art, which is not limited in the embodiment of the present invention.
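The alignment step can be sketched as follows. Since the text allows any prior-art matching process, this sketch assumes a plain Levenshtein (edit-distance) alignment and estimates each confusion probability as a relative frequency. The names `align`, `build_confusion_matrix` and the gap symbol `<eps>` are hypothetical and do not come from the source.

```python
from collections import defaultdict

GAP = "<eps>"  # hypothetical symbol for an inserted or deleted phoneme


def align(ref, hyp):
    """Levenshtein-align two phoneme lists; return (ref_phone, hyp_phone) pairs."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], GAP))  # phoneme dropped in the confused sequence
            i -= 1
        else:
            pairs.append((GAP, hyp[j - 1]))  # phoneme inserted in the confused sequence
            j -= 1
    return pairs[::-1]


def build_confusion_matrix(ref, confused_seqs):
    """Estimate P(recognized | correct) from one correct and N confused sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for hyp in confused_seqs:
        for r, h in align(ref, hyp):
            counts[r][h] += 1
    matrix = {}
    for r, row in counts.items():
        total = sum(row.values())
        matrix[r] = {h: c / total for h, c in row.items()}
    return matrix
```

As an illustration, a correct sequence such as `["x", "iao", "p", "ing", "g", "uo"]` aligned against confused variants that swap `x` for `s` or `ing` for `in` yields rows in which those substitutions receive non-zero probability. The phoneme inventory here is only an example.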
In an exemplary embodiment, the processing module is further configured to perform an operation on the speech recognition result and the phoneme confusion matrix to obtain an operation result, and to modify the speech recognition result according to the operation result. That is, after the phoneme-based speech recognition result is obtained from the target speech, it is combined with the preset phoneme confusion matrix by weighted addition or multiplication, and the speech recognition result is modified according to the outcome.
In an exemplary embodiment, the processing module is further configured to calculate the voice recognition result and the phoneme confusion matrix according to a preset algorithm to obtain a plurality of confusion probability values, where the confusion probability values are used to indicate the calculation result.
That is, the speech recognition result and the phoneme confusion matrix are combined according to a preset algorithm to obtain confusion probability values that indicate the operation result. It should be noted that the greater a confusion probability value is, the higher the similarity between a phoneme in the phoneme-based speech recognition result and the phoneme at the position corresponding to that value.
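The text leaves the preset algorithm open. As one hedged reading of the weighted-addition and multiplication options, let $r = (r_1, \dots, r_T)$ be the recognized phoneme sequence, $c^{(k)} = (c^{(k)}_1, \dots, c^{(k)}_T)$ the $k$-th candidate correct sequence, already aligned to the same length, and $C(a, b)$ the confusion-matrix entry for correct phoneme $a$ being recognized as $b$. These symbols, the position weights $w_t$ and the equal-length assumption are illustrative and are not taken from the source.

```latex
s_{\text{mul}}(k) = \prod_{t=1}^{T} C\bigl(c^{(k)}_t,\, r_t\bigr),
\qquad
s_{\text{add}}(k) = \sum_{t=1}^{T} w_t \, C\bigl(c^{(k)}_t,\, r_t\bigr),
\quad w_t \ge 0,\ \sum_{t=1}^{T} w_t = 1 .
```

Under either form, a larger score means the candidate is a more plausible source of the recognized phonemes.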
In an exemplary embodiment, the processing module is further configured to select, from the plurality of confusion probability values, the phoneme sequence of the correct result corresponding to the maximum confusion probability value, and to correct the voice recognition result according to that phoneme sequence. That is, the voice recognition result and the phoneme confusion matrix are combined according to a preset algorithm to obtain the confusion probability values; the phoneme sequence of the correct result with the largest confusion probability value is selected; and the corresponding voice recognition result is corrected with it.
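A minimal sketch of this selection step, assuming the multiplication variant and a nested-dictionary confusion matrix of the form `matrix[correct][recognized]`. The function names, the `floor` smoothing value and the equal-length shortcut are assumptions made for illustration; the source does not specify them.

```python
import math


def confusion_score(recognized, candidate, matrix, floor=1e-6):
    """Multiply per-position confusion probabilities P(recognized | correct).

    Returns 0.0 when the sequences differ in length; a fuller version would
    align them first. `floor` is a hypothetical value for unseen phoneme pairs.
    """
    if len(recognized) != len(candidate):
        return 0.0
    log_p = 0.0
    for correct, heard in zip(candidate, recognized):
        log_p += math.log(matrix.get(correct, {}).get(heard, floor))
    return math.exp(log_p)


def correct_result(recognized, candidates, matrix):
    """Return (text, score) for the candidate with the largest confusion score."""
    scored = [
        (text, confusion_score(recognized, phones, matrix))
        for text, phones in candidates.items()
    ]
    return max(scored, key=lambda item: item[1])


# Illustrative data only: a tiny matrix and two same-length candidate titles.
matrix = {
    "x": {"x": 0.9, "s": 0.1},
    "iao": {"iao": 1.0},
    "p": {"p": 0.8, "b": 0.2},
    "ing": {"ing": 0.7, "in": 0.3},
    "g": {"g": 1.0},
    "uo": {"uo": 1.0},
}
candidates = {
    "Small apple": ["x", "iao", "p", "ing", "g", "uo"],
    "Little frog": ["x", "iao", "q", "ing", "w", "a"],
}
best_text, best_score = correct_result(["s", "iao", "p", "in", "g", "uo"], candidates, matrix)
```

Summing logarithms rather than multiplying raw probabilities is a design choice that avoids underflow on long phoneme sequences; the ranking of candidates is unchanged.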
In an exemplary embodiment, the apparatus further includes: the acquisition module is used for acquiring corpus data of the target object; the determining module is used for determining a phoneme sequence of a correct result corresponding to the corpus data and M phoneme sequences with confused pronunciations according to the acquired corpus data, wherein M is a positive integer; and determining a phoneme confusion matrix of the target object according to the phoneme sequence of the correct result corresponding to the corpus data and the M phoneme sequences subjected to pronunciation confusion.
In order to correct the particular pronunciation found in the corpus data of different target objects, the corpus data of a target object is obtained, and the phoneme sequence of the corresponding correct result and a plurality of pronunciation-confused phoneme sequences are determined from that corpus data to generate a phoneme confusion matrix for the target object. The corpus data of the target object serves as reference data for individually adjusting the phoneme confusion matrix, so that a differentiated phoneme confusion matrix is generated for each of a plurality of target objects with different corpora. This reduces phoneme confusion caused by accent differences, habitual misreading, regional differences (for example swallowed sounds, liaison, retroflex suffixation, and confusion between certain initials or finals) and the like, and improves the information retrieval robustness of the system.
In an exemplary embodiment, the determining module is further configured to, when voice data of the target object is received, recognize the voice data of the target object to obtain a phoneme-based target recognition result, and to correct the target recognition result according to the phoneme confusion matrix of the target object. For example, when voice data of a target object with the same corpus data is received again, the phoneme-based target recognition result of that target object is obtained and corrected according to the previously generated phoneme confusion matrix.
It should be noted that the above modules may be implemented in software or hardware. In the latter case, this may be done in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.
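The per-target-object matrix could be kept as in the sketch below. It assumes that each corpus item is a pair of already aligned, equal-length phoneme lists (correct, recognized) and that rows learned for a target object override a shared default matrix. `matrix_from_corpus`, `ConfusionMatrixStore` and the target identifier are hypothetical names, not part of the source.

```python
from collections import Counter, defaultdict


def matrix_from_corpus(pairs):
    """Count position-wise confusions in (correct, recognized) phoneme pairs.

    Assumes each pair is already aligned to equal length.
    """
    counts = defaultdict(Counter)
    for correct, recognized in pairs:
        for c, r in zip(correct, recognized):
            counts[c][r] += 1
    return {
        c: {r: n / sum(row.values()) for r, n in row.items()}
        for c, row in counts.items()
    }


class ConfusionMatrixStore:
    """Keeps one phoneme confusion matrix per target object, plus a default."""

    def __init__(self, default_matrix):
        self.default_matrix = default_matrix
        self.per_target = {}

    def enroll(self, target_id, corpus_pairs):
        """Build and store the matrix for one target object from its corpus."""
        self.per_target[target_id] = matrix_from_corpus(corpus_pairs)

    def matrix_for(self, target_id):
        """Return the matrix to use when correcting this target object's speech."""
        personal = self.per_target.get(target_id, {})
        # Rows observed for this target override the shared default rows.
        return {**self.default_matrix, **personal}
```

For instance, a speaker who tends to pronounce `n` as `l` would, after enrollment, get an `n` row with non-zero probability for `l`, while other speakers keep the default row.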
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
In an exemplary embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1, carrying out voice recognition on the target voice to obtain a voice recognition result based on phonemes;
S2, correcting the voice recognition result according to a preset phoneme confusion matrix, and outputting a corrected voice text; the phoneme confusion matrix is used for indicating the confusion probability of each phoneme between the phoneme sequence with correct result and the N phoneme sequences with pronunciation confusion, wherein N is a positive integer.
In an exemplary embodiment, the storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
In an exemplary embodiment, the processor may be configured to execute the following steps by means of a computer program:
S1, carrying out voice recognition on the target voice to obtain a voice recognition result based on phonemes;
S2, correcting the voice recognition result according to a preset phoneme confusion matrix, and outputting a corrected voice text; the phoneme confusion matrix is used for indicating the confusion probability of each phoneme between the phoneme sequence with correct result and the N phoneme sequences with pronunciation confusion, wherein N is a positive integer.
For specific examples of this embodiment, reference may be made to the examples described in the above embodiments and optional implementations; details are not repeated here.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented using a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. In one exemplary embodiment, they may be implemented as program code executable by a computing device, so that they can be stored in a memory device and executed by the computing device; in some cases the steps shown or described may be executed in an order different from that given here. Alternatively, they may be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for outputting a speech text, comprising:
carrying out voice recognition on the target voice to obtain a voice recognition result based on phonemes;
correcting the voice recognition result according to a preset phoneme confusion matrix, and outputting a corrected voice text;
the phoneme confusion matrix is used for indicating the confusion probability of each phoneme between the phoneme sequence with correct result and the N phoneme sequences with pronunciation confusion, wherein N is a positive integer.
2. The method of claim 1, wherein before modifying the speech recognition result according to a preset phoneme confusion matrix and outputting the modified speech text, the method further comprises:
acquiring a phoneme sequence with a correct result and N phoneme sequences with confusion pronunciation;
aligning the phoneme sequence labeled with correct results with the N phoneme sequences to determine a phoneme confusion matrix for indicating a confusion probability of each phoneme.
3. The method of claim 1, wherein modifying the speech recognition result according to a preset phoneme confusion matrix comprises:
performing an operation on the voice recognition result and the phoneme confusion matrix to obtain an operation result;
and correcting the voice recognition result according to the operation result.
4. The method of claim 3, wherein performing an operation on the speech recognition result and the phoneme confusion matrix to obtain an operation result comprises:
operating on the voice recognition result and the phoneme confusion matrix according to a preset algorithm to obtain a plurality of confusion probability values, wherein the confusion probability values are used for indicating the operation result.
5. The method of claim 4, wherein modifying the speech recognition result according to the operation result comprises:
selecting, from the confusion probability values, the phoneme sequence of the correct result corresponding to the maximum confusion probability value;
and correcting the voice recognition result according to the phoneme sequence of the correct result corresponding to the maximum confusion probability value.
6. The method of claim 1, further comprising:
obtaining corpus data of a target object;
determining a phoneme sequence of a correct result corresponding to the corpus data and M phoneme sequences with confused pronunciations according to the acquired corpus data, wherein M is a positive integer;
and determining a phoneme confusion matrix of the target object according to the phoneme sequence of the correct result corresponding to the corpus data and the M phoneme sequences subjected to pronunciation confusion.
7. The method according to claim 6, wherein after determining the phoneme confusion matrix for the target object based on the phoneme sequences of correct results corresponding to the corpus data and the phonetically confused M phoneme sequences, the method further comprises:
under the condition that the voice data of the target object are received, recognizing the voice data of the target object to obtain a target recognition result based on phonemes;
and correcting the target recognition result according to the phoneme confusion matrix of the target object.
8. An output device for a speech text, comprising:
the determining module is used for carrying out voice recognition on the target voice to obtain a voice recognition result based on phonemes;
the processing module is used for correcting the voice recognition result according to a preset phoneme confusion matrix and outputting a corrected voice text;
the phoneme confusion matrix is used for indicating the confusion probability of each phoneme between the phoneme sequence with correct result and the N phoneme sequences with pronunciation confusion, wherein N is a positive integer.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 7 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.
CN202010464302.XA 2020-05-27 2020-05-27 Voice text output method and device, storage medium and electronic device Pending CN113744718A (en)


Publications (1)

Publication Number Publication Date
CN113744718A true CN113744718A (en) 2021-12-03

Family

ID=78723679





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination