CN112489646B - Speech recognition method and device thereof - Google Patents

Speech recognition method and device thereof

Info

Publication number
CN112489646B
Authority
CN
China
Prior art keywords
word
language model
voice recognition
sequence
intermediate result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011295150.1A
Other languages
Chinese (zh)
Other versions
CN112489646A (en)
Inventor
沈来信
朱相宇
王映新
孙明东
贾师惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Thunisoft Information Technology Co ltd
Original Assignee
Beijing Thunisoft Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Thunisoft Information Technology Co ltd
Priority to CN202011295150.1A
Publication of CN112489646A
Application granted
Publication of CN112489646B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 2015/088 - Word spotting
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/12 - Speech or voice analysis techniques in which the extracted parameters are prediction coefficients

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech recognition method and a speech recognition device. The method comprises the following steps: acquiring input voice data; decoding the voice data through a decoding model to generate a speech recognition intermediate result; matching the speech recognition intermediate result against the core-word pinyin and tone sequences in a core word database; and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result. Matching the speech recognition intermediate result against the core-word pinyin and tone sequences in the core word database mitigates the problem of the speech recognition result deviating from the normal context.

Description

Speech recognition method and device thereof
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method and apparatus.
Background
Speech recognition decoding is strongly tied to the application scenario, and users expect the recognition model to decode their scenario-specific corpus with a certain directivity. At present, speech recognition is adapted through user hotwords: when hotwords are uploaded, they are defined manually and each is assigned a weight value. If the weight values differ too widely, the recognition result deviates severely from the normal context; in addition, the number of hotwords that can be uploaded is limited, and selecting suitable hotwords is difficult for users.
Disclosure of Invention
The embodiment of the application provides a speech recognition method, which is used for solving the problem in the prior art that the speech recognition result deviates from the normal context. The method specifically comprises the following steps:
acquiring input voice data;
decoding the voice data through a decoding model to generate a speech recognition intermediate result;
matching the speech recognition intermediate result against the core-word pinyin and tone sequences in a core word database;
and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
Further, in a preferred embodiment provided in the present application, the decoding model is composed of an acoustic model, a dictionary, and a language model.
Further, in a preferred embodiment provided in the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model, both based on the preprocessed text corpus;
the foreground language model is the user's language model, its weight value is preset to 0.5-0.8, and it contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains corpora for all scenarios.
Further, in a preferred embodiment provided herein, smoothing and pruning operations are performed on the new language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the foreground-language branches; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
Further, in a preferred embodiment provided in the present application, the core word database is built by segmenting the preprocessed text corpus into words, counting word frequencies, and generating corresponding word weights from those frequencies;
each word's weight is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Further, in a preferred embodiment provided in the present application, the core word database can be matched against core word information uploaded by a user and automatically recommends a corresponding weight value, which the user can adjust according to actual requirements to increase the accuracy of speech recognition;
if the user's core word is not found by the search, the median weight of all words in the current core word database is taken as the recommended value.
Further, in a preferred embodiment provided in the present application, when the matching result indicates that the speech recognition intermediate result contains a pinyin-and-tone sequence with a corresponding entry in the database, core word replacement is performed on that sequence.
Further, in a preferred embodiment provided in the present application, during core word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold value, the replacement of the core word sequence is completed and a speech recognition intermediate result containing the replacement sequence is output;
wherein the reduction threshold can be adjusted according to the actual environment.
Further, in a preferred embodiment provided in the present application, before the step of outputting the sentence containing the replacement sequence as the speech recognition result, the method further includes performing sentence breaking and punctuation prediction on the sentence containing the replacement sequence.
The embodiment of the application provides a speech recognition device, which comprises:
a voice receiving module, used for receiving voice data;
a voice decoding module, used for decoding the voice data and generating a speech recognition intermediate result;
a speech recognition intermediate result matching module, used for matching the speech recognition intermediate result against the core-word pinyin and tone sequences in the database;
and a speech recognition result output module, used for outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
The embodiment provided by the application has at least the following beneficial effects:
the voice recognition method and the device can solve the problem that the voice recognition result deviates from the normal context.
Drawings
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
100. Speech recognition device
11. Voice receiving module
12. Speech decoding module
13. Speech recognition intermediate result matching module
14. Speech recognition result output module
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without inventive effort shall fall within the protection scope of the present application.
Referring to fig. 1, the present application discloses a speech recognition method, which includes:
s100: input voice data is acquired.
The voice data is either voice stream data input in real time or file stream data read from an audio file.
Voice stream data can be acquired by recording in real time through a microphone, a sound card, or other hardware with a real-time recording function. File stream data can be obtained by reading an audio file in which recorded audio data is stored; supported audio file suffix formats include: .WAV, .AIF, .AIFF, .AU, .MP1, .MP2, .MP3, .RA, .RM, and .RAM.
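For illustration only, the file-stream input path can be sketched with Python's standard wave module as follows; the chunked-reading scheme and the decoder call in the usage note are assumptions made for the sketch, not the disclosed implementation.
```python
import wave

def read_wav_file_stream(path, chunk_frames=4000):
    """Yield raw PCM chunks from a .wav file, emulating file-stream input."""
    with wave.open(path, "rb") as wav:
        while True:
            chunk = wav.readframes(chunk_frames)
            if not chunk:
                break
            yield chunk

# Hypothetical usage: feed chunks to a decoder as they arrive.
# for chunk in read_wav_file_stream("recording.wav"):
#     decoder.accept_waveform(chunk)  # decoder API assumed, not disclosed
```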
S200: decoding the voice data through a decoding model to generate a speech recognition intermediate result.
Further, in a preferred embodiment provided in the present application, the speech decoding model is composed of an acoustic model, a dictionary, and a language model.
The acoustic model establishes the mapping between acoustic features in the voice data and phonemes; the dictionary establishes the mapping between phonemes and words; the language model establishes the mapping between words and sentences. Using the mappings established by the acoustic model, the dictionary, and the language model, the computer completes the decoding of the voice data and generates the corresponding speech recognition intermediate result.
Specifically, the acoustic model is a knowledge representation of variability in acoustics, phonetics, environmental conditions, speaker gender, accent, and the like; the language model is a knowledge representation of sets of word sequences; the dictionary is the set of phoneme indexes corresponding to words.
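As a purely conceptual sketch of how these three mappings compose, consider the following; a real decoder searches all three knowledge sources jointly (for example, with a beam search over a decoding graph), so every object and method name here is hypothetical.
```python
# Conceptual pipeline only: acoustic model (features -> phonemes), dictionary
# (phonemes -> candidate word sequences), language model (scores sentences).
def decode(features, acoustic_model, dictionary, language_model):
    phonemes = acoustic_model.best_phoneme_sequence(features)
    candidate_sentences = dictionary.sentences_for(phonemes)
    return max(candidate_sentences, key=language_model.score)
```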
Further, in a preferred embodiment provided in the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model, both based on the preprocessed text corpus;
the foreground language model is the user's language model, its weight value is preset to 0.5-0.8, and it contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains corpora for all scenarios.
Specifically, interpolation fitting is used to combine the two language models and improve the modeling effect; when the foreground weight is set to 0.6, the corpus distribution of the newly generated language model is optimized and the processing effect is best.
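For illustration, the interpolation can be sketched as a per-word linear combination of the two models; the model objects and their prob() method are assumptions, not the disclosed engine API.
```python
FOREGROUND_WEIGHT = 0.6  # preset inside the disclosed 0.5-0.8 range

def interpolated_prob(word, history, fg_model, bg_model, lam=FOREGROUND_WEIGHT):
    """P(w | h) = lam * P_foreground(w | h) + (1 - lam) * P_background(w | h)."""
    return lam * fg_model.prob(word, history) + (1.0 - lam) * bg_model.prob(word, history)
```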
Specifically, the preprocessed text corpus is the user's full text corpus from which punctuation marks, meaningless filler words, and stop words have been removed, and in which numbers have been converted by a number conversion module into the written form of the corresponding corpus text.
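A minimal sketch of this preprocessing follows, assuming a simplified stop-word list and a character-level digit conversion in place of the disclosed number conversion module.
```python
import re

STOP_WORDS = {"的", "了", "啊", "嗯"}                 # illustrative subset only
DIGIT_TO_TEXT = dict(zip("0123456789", "零一二三四五六七八九"))

def preprocess(text):
    text = re.sub(r"[^\w\s]", "", text)                        # strip punctuation
    text = "".join(DIGIT_TO_TEXT.get(ch, ch) for ch in text)   # digits -> written form
    # Character-level filtering suffices here because the sample stop words are
    # single characters; a real module would filter segmented words instead.
    return "".join(ch for ch in text if ch not in STOP_WORDS)
```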
Further, in a preferred embodiment provided herein, smoothing and pruning operations are performed on the new language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the foreground-language branches; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
S300: performing a matching operation on the speech recognition intermediate result based on the core-word pinyin and tone sequences in the core word database.
Further, in a preferred embodiment provided in the present application, the core word database is built by segmenting the preprocessed text corpus into words, counting word frequencies, and generating corresponding word weights from those frequencies.
Each word's weight is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Specifically, word segmentation uses the dictionary of the decoding model together with a reverse (backward) maximum matching algorithm, which gives the best segmentation effect. Word frequency counting then counts the number of occurrences of each word in the segmentation results.
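The segmentation, frequency counting, and weight formula can be sketched together as follows; the reverse maximum matching routine is a textbook version assumed for illustration, not code taken from the disclosure.
```python
from collections import Counter
from statistics import median

def rmm_segment(sentence, dictionary, max_len=6):
    """Reverse maximum matching: scan from the sentence end, always taking the
    longest dictionary word; unmatched single characters pass through."""
    words, end = [], len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = sentence[end - size:end]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    return list(reversed(words))

def segmentation_weights(sentences, dictionary):
    """Disclosed formula: weight = freq / (max_freq + constant), where the
    constant is the median of all word frequencies."""
    freq = Counter(w for s in sentences for w in rmm_segment(s, dictionary))
    constant = median(freq.values())
    max_freq = max(freq.values())
    return {w: f / (max_freq + constant) for w, f in freq.items()}
```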
Furthermore, in a preferred embodiment provided in the present application, the core word database can be matched against the core word information uploaded by the user and automatically recommends a corresponding weight value, which the user can adjust according to actual requirements to increase the accuracy of speech recognition;
if the user's core word is not found by the search, the median weight of all words in the current core word database is taken as the recommended value.
Specifically, the core word input by the user is matched against the words in the core word database. If a corresponding core word is found in the database, its weight is recommended to the user as the recommended value. The user can then increase or decrease the recommended weight according to the actual scenario, thereby improving the accuracy of speech recognition in that scenario.
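A minimal sketch of this recommendation logic, assuming the core word database is a simple word-to-weight mapping:
```python
from statistics import median

def recommend_weight(core_word, core_word_db):
    """Return the stored weight when the uploaded core word is found;
    otherwise fall back to the median weight across the database."""
    if core_word in core_word_db:
        return core_word_db[core_word]
    return median(core_word_db.values())
```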
S400: outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
Further, in a preferred embodiment provided in the present application, when the matching result indicates that a sequence in the speech recognition intermediate result has a corresponding pinyin-and-tone entry in the database, core word replacement is performed on that sequence.
Specifically, if no sequence in the speech recognition intermediate result matches a pinyin-and-tone sequence in the database, the speech recognition intermediate result can be output directly as the speech recognition result.
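For illustration, the matching scan might look like the sketch below; the to_pinyin_tones() converter and the database entry layout are assumptions, not the disclosed structures.
```python
def find_core_word_matches(intermediate_result, core_word_db, to_pinyin_tones):
    """Return (start, end, core_word) spans whose pinyin+tone sequence matches a
    core-word entry; to_pinyin_tones() yields syllables such as ["fa3", "yuan4"]."""
    syllables = to_pinyin_tones(intermediate_result)
    matches = []
    for core_word, entry in core_word_db.items():
        target = entry["pinyin_tones"]  # stored pinyin+tone sequence for the word
        for i in range(len(syllables) - len(target) + 1):
            if syllables[i:i + len(target)] == target:
                matches.append((i, i + len(target), core_word))
    return matches
```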
Further, in a preferred embodiment provided in the present application, during core word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold value, the replacement of the core word sequence is completed and a speech recognition intermediate result containing the replacement sequence is output;
wherein the reduction threshold can be adjusted according to the actual environment.
Specifically, the smaller the perplexity value of the language model, the better the replacement sequence matches the sentence after core word replacement. The threshold is set to 0.1 by default and can be adjusted to control how strictly the replacement sequence must match the sentence.
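The replacement test then reduces to a perplexity comparison, sketched below; the perplexity() method stands in for a hypothetical language model interface.
```python
DEFAULT_THRESHOLD = 0.1  # default reduction threshold from the description

def accept_replacement(lm, original_sentence, replaced_sentence,
                       threshold=DEFAULT_THRESHOLD):
    """Keep the core word replacement only when language-model perplexity drops
    by at least the threshold relative to the original sentence."""
    drop = lm.perplexity(original_sentence) - lm.perplexity(replaced_sentence)
    return drop >= threshold
```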
Further, in a preferred embodiment provided in the present application, before the step of outputting the sentence containing the replacement sequence as the speech recognition result, the method further includes performing sentence breaking and punctuation prediction on the sentence containing the replacement sequence.
A speech recognition apparatus 100, comprising:
a voice receiving module 11, configured to receive voice data;
a voice decoding module 12, configured to decode the voice data and generate a speech recognition intermediate result;
a speech recognition intermediate result matching module 13, configured to match the speech recognition intermediate result against the core-word pinyin and tone sequences in the database;
and a speech recognition result output module 14, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
In a typical configuration, a computer may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (9)

1. A method of speech recognition, comprising:
acquiring input voice data;
decoding the voice data through a decoding model to generate a speech recognition intermediate result;
matching the speech recognition intermediate result against the core-word pinyin and tone sequences in a core word database;
outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result;
wherein the core word database is built by segmenting a preprocessed text corpus into words, counting word frequencies, and generating corresponding word weights from those frequencies;
and each word's weight is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, the constant being the median of all word frequencies.
2. The speech recognition method of claim 1, wherein the decoding model is composed of an acoustic model, a dictionary, and a language model.
3. The speech recognition method of claim 2, wherein the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model, both based on the preprocessed text corpus;
the foreground language model is the user's language model, its weight value is preset to 0.5-0.8, and it contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains corpora for all scenarios.
4. The speech recognition method of claim 3, wherein smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the foreground-language branches; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
5. The speech recognition method of claim 1, wherein the core word database can be matched against core word information uploaded by a user and automatically recommends a corresponding weight value, which the user can adjust according to actual requirements to increase the accuracy of speech recognition;
and if the user's core word is not found by the search, the median weight of all words in the current core word database is taken as the recommended value.
6. The method of claim 1, wherein when the matching result indicates that the speech recognition intermediate result contains a pinyin-and-tone sequence with a corresponding entry in the database, core word replacement is performed on that pinyin-and-tone sequence.
7. The method of claim 6, wherein during core word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold value, the replacement of the core word sequence is completed and the speech recognition intermediate result containing the replacement sequence is output;
wherein the reduction threshold is adjusted according to the actual environment.
8. The method of claim 7, further comprising, before the step of outputting the sentence containing the replacement sequence as a speech recognition result, performing sentence breaking and punctuation prediction on the sentence containing the replacement sequence.
9. A speech recognition apparatus, comprising:
a voice receiving module, configured to receive voice data;
a voice decoding module, configured to decode the voice data and generate a speech recognition intermediate result;
a speech recognition intermediate result matching module, configured to match the speech recognition intermediate result against the core-word pinyin and tone sequences in a database;
a speech recognition result output module, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result;
wherein the core word database is built by segmenting a preprocessed text corpus into words, counting word frequencies, and generating corresponding word weights from those frequencies;
and each word's weight is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, the constant being the median of all word frequencies.
CN202011295150.1A 2020-11-18 2020-11-18 Speech recognition method and device thereof Active CN112489646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011295150.1A CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011295150.1A CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Publications (2)

Publication Number Publication Date
CN112489646A (en) 2021-03-12
CN112489646B (en) 2024-04-02

Family

Family ID: 74931400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011295150.1A Active CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Country Status (1)

Country Link
CN (1) CN112489646B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470619B (en) * 2021-06-30 2023-08-18 北京有竹居网络技术有限公司 Speech recognition method, device, medium and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101296128A (en) * 2007-04-24 2008-10-29 北京大学 Method for monitoring abnormal state of internet information
CN105654945B (en) * 2015-10-29 2020-03-06 乐融致新电子科技(天津)有限公司 Language model training method, device and equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063900A (en) * 2010-11-26 2011-05-18 北京交通大学 Speech recognition method and system for overcoming confusing pronunciation
KR20160078703A (en) * 2014-12-24 2016-07-05 한국전자통신연구원 Method and Apparatus for converting text to scene
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A kind of text key word weighing computation method based on word frequency power-law distribution characteristic
CN110675855A (en) * 2019-10-09 2020-01-10 出门问问信息科技有限公司 Voice recognition method, electronic equipment and computer readable storage medium
CN110970026A (en) * 2019-12-17 2020-04-07 用友网络科技股份有限公司 Voice interaction matching method, computer device and computer-readable storage medium

Also Published As

Publication number Publication date
CN112489646A (en) 2021-03-12

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant