CN115410557A - Voice processing method and device, electronic equipment and storage medium - Google Patents

Voice processing method and device, electronic equipment and storage medium

Info

Publication number
CN115410557A
CN115410557A (application CN202211028470.XA)
Authority
CN
China
Prior art keywords
pronunciation
text
audio
recognized
pronunciation dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211028470.XA
Other languages
Chinese (zh)
Inventor
陈昌儒
李超
孙宇航
刘文娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Opper Communication Co ltd
Original Assignee
Beijing Opper Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Opper Communication Co ltd filed Critical Beijing Opper Communication Co ltd
Priority to CN202211028470.XA
Publication of CN115410557A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L2015/0635: Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a voice processing method and device, electronic equipment and a storage medium. The method comprises the following steps: the electronic equipment compares the sample voice with a preset pronunciation dictionary to obtain text information which is not contained in the preset pronunciation dictionary; then inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information; updating the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary; and finally, acquiring the audio to be recognized, and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result. According to the method and the device, the pronunciation dictionary is updated, so that the audio to be recognized can be recognized accurately when the subsequent voice recognition model performs voice recognition.

Description

Voice processing method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of communication, and in particular to a voice processing method and device, an electronic device and a storage medium.
Background
With the development of voice technology, a voice control mode can be adopted for controlling various electronic devices such as mobile phones, computers, televisions and the like at present.
At present, some speech models can achieve simple speech recognition, but for some special speech, such as speech with unusual tones or polyphonic words, existing speech models have a low recognition rate and produce inaccurate recognition results.
Disclosure of Invention
The embodiment of the application provides a voice processing method and device, electronic equipment and a storage medium. The speech processing method can improve the accuracy of speech recognition.
In a first aspect, an embodiment of the present application provides a speech processing method, including:
comparing the sample voice with a preset pronunciation dictionary to obtain text information not contained in the preset pronunciation dictionary;
inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information;
updating the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary;
and acquiring the audio to be recognized, and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result.
In a second aspect, an embodiment of the present application provides a speech processing apparatus, including:
the comparison module is used for comparing the sample voice with a preset pronunciation dictionary to obtain text information which is not contained in the preset pronunciation dictionary;
the prediction module is used for inputting the text information into the pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information;
the updating module is used for updating the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary;
and the recognition module is used for acquiring the audio to be recognized and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory storing executable program code, and a processor coupled to the memory; the processor calls the executable program codes stored in the memory to execute the steps in the voice processing method provided by the embodiment of the application.
In a fourth aspect, an embodiment of the present application provides a storage medium, where the storage medium stores multiple instructions, and the instructions are suitable for being loaded by a processor to perform steps in a speech processing method provided in an embodiment of the present application.
In the embodiment of the application, the electronic equipment obtains text information which is not contained in a preset pronunciation dictionary by comparing the sample voice with the preset pronunciation dictionary; inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information; updating the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary; and finally, acquiring the audio to be recognized, and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result. According to the embodiment of the application, the pronunciation dictionary is updated, so that the audio to be recognized can be recognized accurately when the subsequent voice recognition model performs voice recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 is a first flowchart of a speech processing method according to an embodiment of the present application.
Fig. 2 is a schematic second flowchart of a speech processing method according to an embodiment of the present application.
Fig. 3 is a schematic flowchart of extracting an acoustic feature of an audio to be recognized according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a speech recognition model provided in an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
With the development of voice technology, various electronic devices such as mobile phones, computers, televisions and the like can be controlled by adopting a voice control mode at present.
At present, some speech models can achieve simple speech recognition, but for some special speech, such as speech with unusual tones or polyphonic words, existing speech models have a low recognition rate and produce inaccurate recognition results.
For example, for phrases, words, dialects, etc., which have never been recognized by some electronic devices, the electronic devices are prone to misrecognizing speech, resulting in low speech recognition accuracy.
In order to solve the foregoing technical problem, embodiments of the present application provide a voice processing method and apparatus, an electronic device, and a storage medium. The voice processing method can be applied to various electronic devices such as computers, mobile phones, tablet computers, wearable electronic devices and household appliances, and can improve the accuracy of speech recognition.
Referring to fig. 1, fig. 1 is a first flow chart of a speech processing method according to an embodiment of the present application. The voice processing method can comprise the following steps:
110. and comparing the sample voice with a preset pronunciation dictionary to obtain text information which is not contained in the preset pronunciation dictionary.
In some embodiments, the electronic device may obtain some sample voices. For example, the scenes corresponding to the sample voices may be common conversation scenes, professional scenes such as technical terminology, papers, and lectures, or popular scenes such as trending internet phrases and jokes.
The sample voices contain the characters, words and short sentences corresponding to different scenes, and are used by the electronic device to determine text information not contained in the preset pronunciation dictionary, so that the preset pronunciation dictionary can subsequently be updated to obtain the updated pronunciation dictionary.
In some embodiments, the electronic device may first obtain the text information in the sample voice, for example the marked text in the sample voice and the position of the marked text within the sample voice. For example, if the sample voice is "favorite modified cat", the "modified cat" may be determined in advance as the marked text, and the position of the "modified cat" in the sample voice is determined as the marked position.
The electronic device may directly compare the tagged text with a preset pronunciation dictionary, where the preset pronunciation dictionary includes the preset text and phonemes corresponding to the preset text, for example, if the preset text is a word, the phoneme of the word is mapped to the word, that is, there is an association relationship between the word and the phoneme.
The electronic equipment compares the marked text with each text in the preset pronunciation dictionary to determine whether the preset pronunciation dictionary has the same text, and finally determines the marked text which is not contained in the preset pronunciation dictionary to obtain the text information which is not contained in the preset pronunciation dictionary.
In some embodiments, a plurality of tagged texts are included in the sample speech, and the electronic device may determine the tagged texts in the sample speech and corresponding positions of the tagged texts, and then compare each tagged text with the preset pronunciation dictionary according to the corresponding positions of the tagged texts to obtain text information not included in the preset pronunciation dictionary.
For example, after the electronic device reads a mark position, the mark text corresponding to the mark position is compared with the preset pronunciation dictionary, and after the comparison is completed, the electronic device reads the next mark position again, so that the next mark text is determined, and the next mark text is compared with the preset pronunciation dictionary. Finally, text information which is not contained in the preset pronunciation dictionary is obtained.
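For illustration, a small Python sketch of this comparison follows; the representation of the dictionary as a mapping from text to phoneme sequences, and of the marks as (position, text) pairs, are assumptions made for the example rather than structures mandated by this application.

```python
def find_missing_texts(marks, pron_dict):
    """marks: list of (position, marked_text) pairs; pron_dict maps text -> phonemes."""
    missing = []
    for position, text in sorted(marks):          # read marks in position order
        if text not in pron_dict and text not in missing:
            missing.append(text)                  # marked text absent from the dictionary
    return missing

# Usage with toy data: the texts returned here are what gets sent on to the
# pronunciation prediction model in the next step.
pron_dict = {"favorite": ("f", "ey1", "v", "er0", "ih0", "t")}
print(find_missing_texts([(0, "favorite"), (9, "modified cat")], pron_dict))
# -> ['modified cat']
```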
120. And inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information.
In some embodiments, after the electronic device obtains the text information, it may input the text information into a pronunciation prediction model, which can predict the pronunciation prediction results corresponding to the different texts in the text information. For example, the pronunciation prediction model may predict the phoneme corresponding to each text and the phoneme sequences corresponding to different texts, and then determine the pronunciation prediction result corresponding to each text from its phonemes or phoneme sequence.
When the electronic equipment applies the pronunciation prediction model, pronunciation prediction results corresponding to characters, words and short sentences in the text can be obtained through the pronunciation prediction model.
In some embodiments, the pronunciation prediction model may adopt a deep learning model, for example a model with an attention mechanism, so that longer context information can be used when performing pronunciation prediction on the text in the text information, thereby improving the pronunciation prediction accuracy for the text in the text information.
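For illustration, the following is a minimal PyTorch sketch of an attention-based grapheme-to-phoneme predictor in the spirit of the model described above; the network sizes, the integer grapheme/phoneme vocabularies, and the greedy decoding loop are assumptions made for exposition, not the model actually used in this application.

```python
import torch
import torch.nn as nn

class AttentionG2P(nn.Module):
    def __init__(self, n_graphemes, n_phonemes, hidden=128):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(n_graphemes, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.dec_embed = nn.Embedding(n_phonemes, hidden)
        self.decoder = nn.GRUCell(hidden + 2 * hidden, hidden)
        self.attn = nn.Linear(hidden + 2 * hidden, 1)   # additive attention score
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, graphemes, max_len=16, sos_id=0):
        enc_out, _ = self.encoder(self.embed(graphemes))          # (B, T, 2H)
        B, T, _ = enc_out.shape
        h = enc_out.new_zeros(B, self.hidden)
        token = graphemes.new_full((B,), sos_id)
        logits_seq = []
        for _ in range(max_len):
            # Attention lets every predicted phoneme look at the full grapheme context.
            query = h.unsqueeze(1).expand(B, T, self.hidden)
            weights = torch.softmax(self.attn(torch.cat([query, enc_out], -1)), dim=1)
            context = (weights * enc_out).sum(dim=1)              # (B, 2H)
            h = self.decoder(torch.cat([self.dec_embed(token), context], -1), h)
            logits = self.out(h)
            token = logits.argmax(-1)                             # greedy decoding
            logits_seq.append(logits)
        return torch.stack(logits_seq, dim=1)                     # (B, max_len, n_phonemes)
```

The attention weights allow each output phoneme to condition on the whole grapheme sequence, which is what lets longer context information be used as described above.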
130. And updating the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary.
In some embodiments, after obtaining the phoneme corresponding to each text in the text information, the electronic device may establish a mapping relationship between the phoneme corresponding to each text and each text, and then update the phoneme corresponding to each text and each text into a preset pronunciation dictionary according to the mapping relationship, so as to obtain an updated pronunciation dictionary.
For example, for different texts such as words and phrases in the text information, a mapping relationship may be established between the phonemes corresponding to the text "memory" and the text "memory" itself. The text "memory" and its corresponding phonemes are then updated into the preset pronunciation dictionary to obtain the updated pronunciation dictionary.
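A minimal sketch of this update step, under the same illustrative dictionary representation as above (one predicted phoneme sequence per text; a real dictionary may store several pronunciations per entry, which this example ignores):

```python
def update_dictionary(pron_dict, predictions):
    """predictions maps each out-of-dictionary text to its predicted phoneme sequence."""
    updated = dict(pron_dict)                     # keep the preset dictionary intact
    for text, phonemes in predictions.items():
        updated[text] = tuple(phonemes)           # mapping relationship: text -> phonemes
    return updated

# Usage: merging the predicted entry for "memory" into the preset dictionary.
updated = update_dictionary({"cat": ("k", "ae1", "t")},
                            {"memory": ("m", "eh1", "m", "er0", "iy0")})
```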
In some embodiments, after the electronic device obtains the pronunciation prediction result, the electronic device may determine an incorrect pronunciation prediction result in the pronunciation prediction result and delete the incorrect pronunciation prediction result to obtain a modified pronunciation prediction result. And then determining a text corresponding to the wrong pronunciation prediction result in the text information, and deleting the text corresponding to the wrong pronunciation prediction result to obtain corrected text information. And finally, updating the corrected pronunciation prediction result and the corrected text information to a preset pronunciation dictionary to obtain an updated pronunciation dictionary.
For example, the electronic device may transmit the text information and the pronunciation prediction result to the cloud, determine through the cloud whether an incorrect pronunciation prediction result exists among the pronunciation prediction results corresponding to the text information, and delete the incorrect pronunciation prediction result together with its corresponding text, thereby obtaining the corrected pronunciation prediction result and the corrected text information.
The electronic equipment can also display the text information and the pronunciation prediction result corresponding to the text information to the user, and the user judges the pronunciation prediction result so as to mark a wrong pronunciation prediction result. The electronic device may then delete the flagged incorrect pronunciation prediction results.
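A small sketch of this correction step, assuming the cloud service or the user returns the set of texts whose predictions were flagged as wrong (the names here are illustrative):

```python
def prune_predictions(predictions, flagged_texts):
    """Drop the wrong pronunciation predictions and their texts before the update."""
    return {text: phonemes for text, phonemes in predictions.items()
            if text not in flagged_texts}
```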
140. And acquiring the audio to be recognized, and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result.
In some embodiments, after obtaining the updated pronunciation dictionary, the electronic device may obtain the audio to be recognized, where the audio to be recognized may be the voice corresponding to the user, the voice in the video, or the voice in the audio file. The audio to be recognized may be audio in any file format, such as wav format, mp3 format, flac format, and the like.
In some embodiments, the electronic device may convert the audio to be recognized into a preset audio format, such as converting the file format of the audio to be recognized from the mp3 format into the wav format, before processing the audio to be recognized through the speech recognition model and the updated pronunciation dictionary.
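As one possible way to perform this conversion, the sketch below shells out to the ffmpeg command-line tool; the availability of ffmpeg on the device, the file paths, and the 16 kHz mono target are assumptions of the example rather than requirements of this application.

```python
import subprocess

def to_wav(src_path: str, dst_path: str, sample_rate: int = 16000) -> None:
    # -ar sets the output sample rate; -ac 1 downmixes to a single channel.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-ar", str(sample_rate), "-ac", "1", dst_path],
        check=True,
    )

# Usage (illustrative paths): to_wav("speech.mp3", "speech.wav")
```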
In some implementations, the speech recognition model includes an acoustic model and a language model. The electronic equipment can input the audio to be recognized into the acoustic model, output the phoneme corresponding to the audio to be recognized, and then match the updated pronunciation dictionary with the phoneme corresponding to the audio to be recognized through the language model, so as to finally obtain a voice recognition result.
Specifically, the electronic device may first obtain the acoustic features of the audio to be recognized, such as Mel-Frequency Cepstral Coefficients (MFCC), filter-bank (Fbank) features, Perceptual Linear Prediction (PLP), Linear Prediction Cepstral Coefficients (LPCC), and so on.
The electronic device then inputs the acoustic features into the acoustic model, and the acoustic model outputs the phonemes corresponding to the audio to be recognized according to the acoustic features. The electronic device then determines, according to the language model, the matching phonemes in the updated pronunciation dictionary for the phonemes corresponding to the audio to be recognized, determines the matching texts corresponding to the matching phonemes, and finally combines the matching texts according to the phonemes corresponding to the audio to be recognized to obtain the speech recognition result.
For example, the language model may match a phoneme corresponding to the audio to be recognized with a phoneme corresponding to a different text in the updated pronunciation dictionary, so as to match a matching phoneme that matches the phoneme corresponding to the audio to be recognized, and determine a matching text corresponding to the matching phoneme in the updated pronunciation dictionary.
And finally, the language model determines the sequence of the matched texts according to the phoneme sequence of the phonemes corresponding to the audio to be recognized, and generates a final output text according to the sequence of the matched texts, wherein the output text is a voice recognition result.
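The following toy Python sketch illustrates the matching described above, assuming the acoustic model emits a flat phoneme sequence and that each dictionary entry has a distinct phoneme spelling; a practical language model scores and searches many hypotheses over a decoding graph rather than doing this greedy longest-match.

```python
def phonemes_to_text(phonemes, pron_dict):
    """pron_dict maps a text entry to its phoneme tuple; returns the combined text."""
    inverse = {tuple(v): k for k, v in pron_dict.items()}
    result, i = [], 0
    while i < len(phonemes):
        # Prefer the longest span first so multi-syllable entries win over short ones.
        for j in range(len(phonemes), i, -1):
            text = inverse.get(tuple(phonemes[i:j]))
            if text is not None:
                result.append(text)
                i = j
                break
        else:
            i += 1                                # no entry matches; skip this phoneme
    return "".join(result)

# Usage: with {"memory": ("m", "eh1", "m", "er0", "iy0")} in the updated dictionary,
# the phoneme output ["m", "eh1", "m", "er0", "iy0"] combines back into "memory".
```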
In some embodiments, the electronic device may further decode the sequence of audio frames to be recognized according to the acoustic features of each audio frame to be recognized in the audio to be recognized and the generated decoding map, and finally determine a speech recognition result corresponding to the sequence of audio frames to be recognized. Wherein the decoding graph can be generated at least according to the acoustic model, the updated pronunciation dictionary and the language model. The acoustic model is used to identify phonemes of the audio to be identified based on the acoustic features.
In the embodiment of the application, the electronic equipment obtains text information which is not contained in a preset pronunciation dictionary by comparing the sample voice with the preset pronunciation dictionary; inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information; updating the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary; and finally, acquiring the audio to be recognized, and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result. According to the embodiment of the application, the pronunciation dictionary is updated, so that the audio to be recognized can be recognized accurately when the subsequent voice recognition model carries out voice recognition.
Referring to fig. 2, fig. 2 is a second flow chart of the speech processing method according to the embodiment of the present application. The voice processing method can comprise the following steps:
201. and determining the marked texts in the sample voice and the corresponding positions of the marked texts.
In some embodiments, the electronic device may obtain some sample voices, for example, a scene corresponding to the sample voices may be some common conversation scenes, may also be professional terms, papers, and lecture scenes, and may also be some popular network conversations, paragraph scenes, and the like. The sample voice contains corresponding characters, words and short sentences under different scenes.
The electronic device can acquire the text information in the sample voice, for example the marked text in the sample voice and the position of the marked text in the sample voice. For example, if the sample voice is "favorite modified cat", the "modified cat" can be determined in advance as the marked text, and the position of the "modified cat" in the sample voice is determined as the marked position.
202. And comparing each marked text with the preset pronunciation dictionary according to the corresponding position of the marked text to obtain text information which is not contained in the preset pronunciation dictionary.
In some embodiments, a plurality of tagged texts are included in the sample speech, and the electronic device may determine the tagged texts in the sample speech and corresponding positions of the tagged texts, and then compare each tagged text with the preset pronunciation dictionary according to the corresponding positions of the tagged texts to obtain text information not included in the preset pronunciation dictionary.
For example, after the electronic device reads a mark position, the mark text corresponding to the mark position is compared with the preset pronunciation dictionary, and after the comparison is completed, the electronic device reads the next mark position again, so that the next mark text is determined, and the next mark text is compared with the preset pronunciation dictionary. And finally, text information which is not contained in the preset pronunciation dictionary is obtained.
203. And determining the phoneme corresponding to each text in the text information according to the pronunciation prediction model.
In an actual speech scene, there are often variations in the speech such as changes of speed and tone, for example the second tone, third tone, and neutral tone in Chinese. These tonal variations can make the converted phoneme sequence inaccurate. At present, the conversion of ideographic text into phoneme sequences is almost always based on conversion rules preset by linguists.
In order to solve the problem that the phoneme and phoneme sequence of the text are identified inaccurately, the phoneme and phoneme sequence of the text can be predicted through a neural network model.
In some embodiments, the pronunciation prediction model may adopt a deep learning model, for example a model with an attention mechanism, so that longer context information can be used when performing pronunciation prediction on the text in the text information, thereby improving the pronunciation prediction accuracy for the text in the text information.
For example, the electronic device may predict the phonemes corresponding to each text and the phoneme sequences corresponding to different texts through the pronunciation prediction model. For instance, the character "a" may correspond to the phoneme "a1", where "a1" denotes the vowel "a" together with its tone, and a phoneme sequence is made up of the phonemes corresponding to the characters in a word.
Of course, the electronic device may also use other pronunciation prediction models, for example models that perform pronunciation prediction in combination with the context information of the text, or that perform pronunciation prediction under varied contexts and tones of the text, thereby obtaining more accurate pronunciation prediction results.
204. And determining the phoneme corresponding to each text in the text information as a pronunciation prediction result.
The electronic device may determine a pronunciation prediction result corresponding to each text from the phonemes corresponding to each text, and determine a pronunciation prediction result corresponding to each text from the sequence of phonemes corresponding to each text.
205. And establishing a mapping relation between the phoneme corresponding to each text and each text.
When a text contains a plurality of words, for example when the text is a phrase, each word in the phrase may be associated with its corresponding phoneme, and all the words in the text may also be associated as a whole with the phonemes corresponding to them, thereby forming the mapping relationship between the text and the phonemes corresponding to the text.
206. And updating the phonemes corresponding to each text and each text into a preset pronunciation dictionary according to the mapping relation to obtain an updated pronunciation dictionary.
After the mapping relationship between each text and the corresponding phoneme in the text information is obtained, the electronic equipment may update the phoneme corresponding to each text and each text to a preset pronunciation dictionary according to the mapping relationship, so as to obtain an updated pronunciation dictionary.
For example, if there is no text of a certain phrase, word and phoneme corresponding to the text in the preset pronunciation dictionary, the text and the phoneme corresponding to the text may be updated to the preset pronunciation dictionary for subsequent speech recognition.
207. And acquiring acoustic features corresponding to the audio to be recognized.
In some embodiments, after obtaining the updated pronunciation dictionary, the electronic device may obtain the audio to be recognized, where the audio to be recognized may be the voice corresponding to the user, the voice in the video, or the voice in the audio file. The audio to be recognized may be audio in any file format, such as wav format, mp3 format, flac format, and the like.
In some embodiments, the electronic device may convert the audio to be recognized into a preset audio format, such as converting the file format of the audio to be recognized from the mp3 format into the wav format, before processing the audio to be recognized through the speech recognition model and the updated pronunciation dictionary.
The electronic device may first obtain the acoustic features of the audio to be identified, such as Mel-Frequency Cepstral Coefficients (MFCC), filter-bank (Fbank) features, Perceptual Linear Prediction (PLP), Linear Prediction Cepstral Coefficients (LPCC), and so on.
The electronic device can perform the speech feature extraction on the speech to be recognized through an existing speech recognition tool.
Referring to fig. 3, fig. 3 is a schematic flow chart of extracting acoustic features of an audio to be recognized according to an embodiment of the present disclosure. Taking extraction of mel cepstrum coefficients of the audio to be recognized as an example, the method for extracting the acoustic features may include the following steps:
301. and pre-emphasis processing is carried out on the voice to be recognized to obtain a first voice signal.
It can be understood that, during signal transmission, the high-frequency components of a signal are attenuated heavily while the low-frequency components are attenuated less. The high-frequency components of the signal may therefore be enhanced at the start of transmission to compensate for their excessive attenuation during transmission.
Therefore, by performing Pre-emphasis (Pre-emphasis) processing on the voice signal to be recognized, the features corresponding to the high-frequency signal can be retained to a greater extent, so as to obtain the first voice signal.
302. And performing frame division and windowing processing on the first voice signal to obtain a second voice signal.
Since the characteristics of a speech signal vary over time, it is a non-stationary random process. The purpose of framing the first speech signal is to divide the speech samples into short frames, within each of which the characteristics of the speech signal can be considered stationary.
Meanwhile, the speech signal within each frame is windowed, which suppresses discontinuities at the frame boundaries so that the extracted signal remains stable, thereby obtaining the second speech signal.
303. And performing fast Fourier transform on the second voice signal to obtain a power spectrum corresponding to the second voice signal.
After the second voice signal is subjected to the fast Fourier transform, it is converted from a time-domain signal into a frequency-domain signal, so that the power spectrum corresponding to the second voice signal is obtained.
304. And filtering the power spectrum through a Mel filter to obtain a filtered signal, and processing the filtered signal to obtain a Mel cepstrum coefficient.
After the power spectrum corresponding to the second voice signal is obtained, the power spectrum may be filtered through a mel filter to obtain a filtered signal.
For example, a bank of multiple mel filters may be used, with a minimum and a maximum frequency set for the filter bank; the minimum may be 300 Hz, and for speech sampled at 16 kHz the maximum is 8 kHz. Although the human hearing range is 20 Hz to 20 kHz, frequencies below 300 Hz usually carry little meaning in real speech; and according to the Nyquist sampling theorem, a signal sampled at 16 kHz can only represent frequencies up to 8 kHz, so the maximum value is chosen to be 8 kHz.
The minimum and maximum frequencies are then converted to the mel scale (m = 1125 ln(1 + f/700)), where 300 Hz corresponds to 401.25 mel and 8 kHz to 2834.99 mel.
A corresponding mel filter bank is then constructed to filter the power spectrum, yielding the filtered signal. After the filtered signal is obtained, it may be preprocessed, for example by taking the logarithm of the filter-bank energies and extracting dynamic difference parameters. A discrete cosine transform is then applied to the preprocessed signal, removing the correlation between the dimensions and mapping the preprocessed signal to a low-dimensional space, which gives the final Mel cepstrum coefficients. These Mel cepstrum coefficients are the acoustic features corresponding to the speech signal to be recognized.
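The whole of steps 301 to 304 can be illustrated with the following NumPy sketch, assuming a mono 16 kHz input at least one frame long; the 0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop, and 26 filters are conventional choices for the example, not values fixed by this application.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26, n_ceps=13):
    # Step 301: pre-emphasis compensates the stronger attenuation of high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Step 302: split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop   # assumes len(signal) >= frame_len
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Step 303: FFT, then the power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Step 304: triangular mel filter bank between 300 Hz and sr/2 (8 kHz here).
    mel_points = np.linspace(hz_to_mel(300.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0.0, 1.0, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1.0, 0.0, bins[i + 1] - bins[i], endpoint=False)
    # Log filter-bank energies, then a DCT to decorrelate and reduce dimensionality.
    log_energies = np.log(power @ fbank.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Note that hz_to_mel reproduces the figures quoted above: hz_to_mel(300) is approximately 401.25 and hz_to_mel(8000) is approximately 2834.99.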
Referring to fig. 2, in step 208, the acoustic features corresponding to the audio to be recognized are input into the acoustic model, and the phoneme corresponding to the audio to be recognized is output.
Referring to fig. 4 together, fig. 4 is a schematic structural diagram of a speech recognition model provided in an embodiment of the present application, where the speech recognition model includes an acoustic model and a language model. The acoustic model and the language model are already trained models.
For example, the base acoustic model corresponding to the acoustic model is generated by training on a variety of scenes and contexts, and the base language model corresponding to the language model is generated by training on the corresponding scenes and contexts.
The acoustic model and the language model are connected with each other, and the result output by the acoustic model can be input into the language model.
As shown in fig. 4, first, the electronic device inputs the acoustic features corresponding to the audio to be recognized into the acoustic model, and outputs the phonemes corresponding to the audio to be recognized.
209. And determining a corresponding matching phoneme of the phoneme corresponding to the audio to be recognized in the updated pronunciation dictionary according to the language model.
After obtaining the phoneme corresponding to the audio to be recognized, the electronic device inputs the phoneme corresponding to the audio to be recognized into the language model, and the language model may match the phoneme corresponding to the audio to be recognized with the phonemes corresponding to different texts in the updated pronunciation dictionary, so as to match the matched phoneme corresponding to the audio to be recognized.
210. And determining a matching text corresponding to the matching phoneme, and combining the matching text according to the phoneme corresponding to the audio to be recognized to obtain a voice recognition result.
The language model determines a matching text corresponding to the matching phoneme in the updated pronunciation dictionary.
And finally, the language model determines the sequence of the matched text according to the phoneme sequence of the phoneme corresponding to the audio to be recognized, and generates a final output text according to the sequence of the matched text, wherein the output text is a voice recognition result.
In the embodiment of the application, the electronic device determines the marked texts in the sample voice and the corresponding positions of the marked texts, and compares each marked text with the preset pronunciation dictionary according to the corresponding position of the marked text to obtain text information which is not contained in the preset pronunciation dictionary. And then determining the phoneme corresponding to each text in the text information according to the pronunciation prediction model, and determining the phoneme corresponding to each text in the text information as a pronunciation prediction result.
And establishing a mapping relation between the phoneme corresponding to each text and each text. And updating the phonemes corresponding to each text and each text into a preset pronunciation dictionary according to the mapping relation to obtain an updated pronunciation dictionary.
And finally, obtaining the acoustic characteristics corresponding to the audio to be recognized, inputting the acoustic characteristics corresponding to the audio to be recognized into the acoustic model, and outputting the phonemes corresponding to the audio to be recognized. And determining a matched phoneme corresponding to the audio to be recognized in the updated pronunciation dictionary according to the language model. And determining a matching text corresponding to the matching phoneme, and combining the matching text according to the phoneme corresponding to the audio to be recognized to obtain a voice recognition result.
In the process of recognizing the speech to be recognized, the preset pronunciation dictionary is updated, and the updated pronunciation dictionary contains richer texts and phonemes corresponding to those texts, so the accuracy with which the electronic device recognizes the speech to be recognized can be improved.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application. The speech processing apparatus 400 may include:
the comparing module 410 is configured to compare the sample voice with a preset pronunciation dictionary to obtain text information that is not included in the preset pronunciation dictionary.
The comparison module 410 is further configured to determine a marked text in the sample voice and a position corresponding to the marked text;
and comparing each marked text with the preset pronunciation dictionary according to the corresponding position of the marked text to obtain text information which is not contained in the preset pronunciation dictionary.
The prediction module 420 is configured to input the text information into the pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information.
The prediction module 420 is further configured to determine a phoneme corresponding to each text in the text information according to the pronunciation prediction model;
and determining the phoneme corresponding to each text in the text information as a pronunciation prediction result.
And the updating module 430 is configured to update the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary.
The updating module 430 is further configured to establish a mapping relationship between phonemes corresponding to each text and the corresponding text;
and updating the phonemes corresponding to each text and each text into a preset pronunciation dictionary according to the mapping relation to obtain an updated pronunciation dictionary.
The updating module 430 is further configured to determine an incorrect pronunciation prediction result from the pronunciation prediction results, and delete the incorrect pronunciation prediction result to obtain a corrected pronunciation prediction result;
determining a text corresponding to the wrong pronunciation prediction result in the text information, and deleting the text corresponding to the wrong pronunciation prediction result to obtain corrected text information;
and updating the corrected pronunciation prediction result and the corrected text information to a preset pronunciation dictionary to obtain an updated pronunciation dictionary.
And the recognition module 440 is configured to acquire the audio to be recognized, and process the audio to be recognized through the updated pronunciation dictionary and the speech recognition model to obtain a speech recognition result.
The recognition module 440 is further configured to input the audio to be recognized into the acoustic model, and output a phoneme corresponding to the audio to be recognized;
and matching the phoneme corresponding to the audio to be recognized with the updated pronunciation dictionary through the language model, and outputting a voice recognition result.
The recognition module 440 is further configured to obtain acoustic features corresponding to the audio to be recognized;
and inputting the acoustic features corresponding to the audio to be recognized into the acoustic model, and outputting the phonemes corresponding to the audio to be recognized.
The recognition module 440 is further configured to determine, according to the language model, a matching phoneme corresponding to the audio to be recognized in the updated pronunciation dictionary;
determining a matching text corresponding to the matching phoneme;
and combining the matched texts according to the phonemes corresponding to the audio to be recognized to obtain a voice recognition result.
In the embodiment of the application, the electronic equipment obtains text information which is not contained in a preset pronunciation dictionary by comparing the sample voice with the preset pronunciation dictionary; inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information; updating the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary; and finally, acquiring the audio to be recognized, and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result. According to the embodiment of the application, the pronunciation dictionary is updated, so that the audio to be recognized can be recognized accurately when the subsequent voice recognition model performs voice recognition.
Correspondingly, the embodiment of the present application further provides an electronic device, where the electronic device may be a terminal or a server, and the terminal may be a terminal device such as a smart phone, a tablet Computer, a notebook Computer, a touch screen, a game machine, a Personal Computer (PC), a Personal Digital Assistant (PDA), and the like. As shown in fig. 6, fig. 6 is a schematic structural diagram of an electronic device provided in the embodiment of the present application. The electronic device 500 includes a processor 501 with one or more processing cores, a memory 502 with one or more computer-readable storage media, and a computer program stored on the memory 502 and executable on the processor. The processor 501 is electrically connected to the memory 502. Those skilled in the art will appreciate that the electronic device structures shown in the figures do not constitute limitations on the electronic device, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The processor 501 is a control center of the electronic device 500, connects various parts of the entire electronic device 500 using various interfaces and lines, performs various functions of the electronic device 500 and processes data by running or loading software programs and/or modules stored in the memory 502, and calling data stored in the memory 502, thereby integrally monitoring the electronic device 500.
In this embodiment, the processor 501 in the electronic device 500 loads instructions corresponding to processes of one or more application programs into the memory 502, and the processor 501 runs the application programs stored in the memory 502 according to the following steps, so as to implement various functions:
comparing the sample voice with a preset pronunciation dictionary to obtain text information not contained in the preset pronunciation dictionary;
inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information;
updating the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary;
and acquiring the audio to be recognized, and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Optionally, as shown in fig. 6, the electronic device 500 further includes: touch-sensitive display screen 503, radio frequency circuit 504, audio circuit 505, input unit 506 and power 507. The processor 501 is electrically connected to the touch display screen 503, the radio frequency circuit 504, the audio circuit 505, the input unit 506, and the power supply 507, respectively. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 6 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The touch display screen 503 can be used for displaying a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface. The touch display screen 503 may include a display panel and a touch panel. The display panel may be used, among other things, to display information entered by or provided to a user and various graphical user interfaces of the electronic device, which may be made up of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. The touch panel may be used to collect touch operations of a user (for example, operations of the user on or near the touch panel by using a finger, a stylus pen, or any other suitable object or accessory) and generate corresponding operation instructions, and the operation instructions execute corresponding programs. Alternatively, the touch panel may include two parts, a touch detection device and a touch controller. The touch detection device detects the orientation of a user's touch, detects a signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 501, and can receive and execute commands sent by the processor 501. The touch panel may overlay the display panel, and when the touch panel detects a touch operation thereon or nearby, the touch panel transmits the touch operation to the processor 501 to determine the type of the touch event, and then the processor 501 provides a corresponding visual output on the display panel according to the type of the touch event. In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 503 to implement input and output functions. However, in some embodiments, the touch panel and the display panel can be implemented as two separate components to perform the input and output functions. That is, the touch display screen 503 can also be used as a part of the input unit 506 to implement an input function.
The rf circuit 504 may be used for transceiving rf signals to establish wireless communication with a network device or other electronic devices through wireless communication, and to receive and transmit signals with the network device or other electronic devices.
The audio circuit 505 may be used to provide an audio interface between a user and an electronic device through a speaker, microphone. The audio circuit 505 may transmit the electrical signal converted from the received audio data to a speaker, and convert the electrical signal into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 505 and converted into audio data, and the audio data is processed by the audio data output processor 501, and then sent to another electronic device through the radio frequency circuit 504, or output to the memory 502 for further processing. The audio circuitry 505 may also include an earbud jack to provide communication of a peripheral headset with the electronic device.
The input unit 506 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 507 is used to power the various components of the electronic device 500. Optionally, the power supply 507 may be logically connected to the processor 501 through a power management system, so as to implement functions of managing charging, discharging, power consumption management, and the like through the power management system. The power supply 507 may also include any component including one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown in fig. 6, the electronic device 500 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described in detail herein.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium, in which a plurality of computer programs are stored, and the computer programs can be loaded by a processor to execute the steps in any one of the speech processing methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
comparing the sample voice with a preset pronunciation dictionary to obtain text information not contained in the preset pronunciation dictionary;
inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information;
updating the text information and the pronunciation prediction result to a preset pronunciation dictionary to obtain an updated pronunciation dictionary;
and acquiring the audio to be recognized, and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result.
Wherein the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disk or optical disk, and the like.
Since the computer program stored in the storage medium can execute the steps of any of the speech processing methods provided in the embodiments of the present application, beneficial effects that can be achieved by any of the speech processing methods provided in the embodiments of the present application can be achieved, for details, see the foregoing embodiments, and are not described herein again.
The foregoing describes in detail a speech processing method, apparatus, electronic device, and storage medium provided in the embodiments of the present application, and specific examples are applied in the present application to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, the specific implementation manner and the application scope may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (11)

1. A method of speech processing, comprising:
comparing the sample voice with a preset pronunciation dictionary to obtain text information not contained in the preset pronunciation dictionary;
inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information;
updating the text information and the pronunciation prediction result to the preset pronunciation dictionary to obtain an updated pronunciation dictionary;
and acquiring the audio to be recognized, and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result.
2. The method according to claim 1, wherein comparing the sample speech with a preset pronunciation dictionary to obtain text information not included in the preset pronunciation dictionary comprises:
determining a marked text in the sample voice and a position corresponding to the marked text;
and comparing each marked text with the preset pronunciation dictionary according to the position corresponding to the marked text to obtain text information not contained in the preset pronunciation dictionary.
3. The speech processing method of claim 1, wherein the step of inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to a text in the text information comprises:
determining phonemes corresponding to each text in the text information according to the pronunciation prediction model;
and determining the phoneme corresponding to each text in the text information as the pronunciation prediction result.
4. The speech processing method according to claim 3, wherein the step of updating the text information and the pronunciation prediction result into the preset pronunciation dictionary to obtain an updated pronunciation dictionary comprises:
establishing a mapping relation between the phonemes corresponding to each text and the corresponding texts;
and updating the phonemes corresponding to each text and each text into the preset pronunciation dictionary according to the mapping relation to obtain the updated pronunciation dictionary.
5. The speech processing method according to claim 1, wherein the step of updating the text information and the pronunciation prediction result into the preset pronunciation dictionary to obtain an updated pronunciation dictionary comprises:
determining a wrong pronunciation prediction result in the pronunciation prediction results, and deleting the wrong pronunciation prediction result to obtain a corrected pronunciation prediction result;
determining a text corresponding to the wrong pronunciation prediction result in the text information, and deleting the text corresponding to the wrong pronunciation prediction result to obtain corrected text information;
and updating the corrected pronunciation prediction result and the corrected text information to the preset pronunciation dictionary to obtain the updated pronunciation dictionary.
6. The speech processing method according to any one of claims 1 to 5, wherein the speech recognition model includes an acoustic model and a language model; the step of processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result comprises the following steps:
inputting the audio to be recognized into the acoustic model, and outputting a phoneme corresponding to the audio to be recognized;
and matching the phoneme corresponding to the audio to be recognized with the updated pronunciation dictionary through the language model, and outputting the voice recognition result.
7. The speech processing method according to claim 6, wherein the step of inputting the audio to be recognized into the acoustic model and outputting the phoneme corresponding to the audio to be recognized comprises:
acquiring acoustic features corresponding to the audio to be identified;
and inputting the acoustic features corresponding to the audio to be recognized into the acoustic model, and outputting the phonemes corresponding to the audio to be recognized.
8. The speech processing method according to claim 6, wherein the step of matching the phonemes corresponding to the audio to be recognized and the updated pronunciation dictionary through the language model and outputting the speech recognition result comprises:
determining a matched phoneme corresponding to the audio to be recognized in the updated pronunciation dictionary according to the language model;
determining a matching text corresponding to the matching phoneme;
and combining the matched texts according to the phonemes corresponding to the audio to be recognized to obtain the voice recognition result.
9. A speech processing apparatus, comprising:
the comparison module is used for comparing the sample voice with a preset pronunciation dictionary to obtain text information which is not contained in the preset pronunciation dictionary;
the prediction module is used for inputting the text information into a pronunciation prediction model to obtain a pronunciation prediction result corresponding to the text in the text information;
the updating module is used for updating the text information and the pronunciation prediction result to the preset pronunciation dictionary to obtain an updated pronunciation dictionary;
and the recognition module is used for acquiring the audio to be recognized and processing the audio to be recognized through the updated pronunciation dictionary and the voice recognition model to obtain a voice recognition result.
10. An electronic device, comprising:
a memory storing executable program code, a processor coupled with the memory;
the processor calls the executable program code stored in the memory to perform the steps of the speech processing method according to any of claims 1 to 8.
11. A storage medium storing a plurality of instructions adapted to be loaded by a processor for performing the steps of the speech processing method according to any of claims 1 to 8.
CN202211028470.XA 2022-08-25 2022-08-25 Voice processing method and device, electronic equipment and storage medium Pending CN115410557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211028470.XA CN115410557A (en) 2022-08-25 2022-08-25 Voice processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211028470.XA CN115410557A (en) 2022-08-25 2022-08-25 Voice processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115410557A 2022-11-29

Family

ID=84161087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211028470.XA Pending CN115410557A (en) 2022-08-25 2022-08-25 Voice processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115410557A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination