CN112489646A - Speech recognition method and device - Google Patents

Speech recognition method and device

Info

Publication number
CN112489646A
CN112489646A (application CN202011295150.1A)
Authority
CN
China
Prior art keywords
language model
speech recognition
word
voice
intermediate result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011295150.1A
Other languages
Chinese (zh)
Other versions
CN112489646B (en)
Inventor
沈来信
朱相宇
王映新
孙明东
贾师惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Thunisoft Information Technology Co ltd
Original Assignee
Beijing Thunisoft Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Thunisoft Information Technology Co ltd filed Critical Beijing Thunisoft Information Technology Co ltd
Priority to CN202011295150.1A priority Critical patent/CN112489646B/en
Publication of CN112489646A publication Critical patent/CN112489646A/en
Application granted granted Critical
Publication of CN112489646B publication Critical patent/CN112489646B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech recognition method and an apparatus therefor. The method comprises the following steps: acquiring input voice data; decoding the voice data through a decoding model to generate a speech recognition intermediate result; matching the speech recognition intermediate result against the core-word pinyin-and-tone sequences in a core word database; and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result. By matching the speech recognition intermediate result against the core-word pinyin-and-tone sequences in the core word database, the problem of the speech recognition result deviating from the normal context can be solved.

Description

Speech recognition method and device
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method and apparatus.
Background
Speech recognition decoding is highly dependent on the application scenario, and users expect a speech recognition model to decode with a certain bias toward their own scenario corpus. At present, speech recognition is adapted with user hotwords: when uploading, the user manually defines the hotwords and sets a weight value for each. If the weight values are set too large, the speech recognition result deviates severely from the normal context; furthermore, the number of hotwords that can be uploaded is limited, and users find it difficult to select suitable hotwords.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method for solving the prior-art problem that a speech recognition result deviates from the normal context. The method specifically comprises the following steps:
acquiring input voice data;
decoding the voice data through a decoding model to generate a speech recognition intermediate result;
matching the speech recognition intermediate result against the core-word pinyin-and-tone sequences in a core word database;
and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
Further, in a preferred embodiment provided by the present application, the decoding model is formed by an acoustic model, a dictionary, and a language model.
Further, in a preferred embodiment provided by the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model on the basis of the preprocessed text corpus;
the foreground language model is the user language model, with a weight value preset to 0.5-0.8, and contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains its scenario corpora.
Further, in a preferred embodiment provided by the present application, smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the branches of the foreground language model; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
Further, in a preferred embodiment provided by the present application, the core word database is constructed by performing word segmentation and word-frequency statistics on the preprocessed text corpus and generating a weight for each segmented word according to its frequency;
the weight of each word is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Further, in a preferred embodiment provided by the present application, the core word database can be matched against the core-word information uploaded by the user, and a corresponding weight value is automatically recommended; the user may adjust the weight value according to actual needs to increase the accuracy of speech recognition;
if the user's core word is not found by retrieval, the median weight of all words currently in the core word database is used as the recommended value.
Further, in a preferred embodiment provided by the present application, when the matching result indicates that the speech recognition intermediate result has a corresponding pinyin-and-tone sequence in the database, core-word replacement is performed on that sequence.
Further, in a preferred embodiment provided by the present application, during core-word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold, the core-word sequence replacement is committed and the speech recognition intermediate result containing the replacement sequence is output;
the required perplexity drop (the threshold) can be adjusted according to the actual environment.
Further, in a preferred embodiment provided by the present application, before the step of outputting the sentence containing the replacement sequence as a speech recognition result, the method further comprises performing sentence-break and punctuation prediction on that sentence.
An embodiment of the present application provides a speech recognition apparatus, including:
the voice receiving module, configured to receive voice data;
the voice decoding module, configured to decode the voice data and generate a speech recognition intermediate result;
the speech recognition intermediate result matching module, configured to match the speech recognition intermediate result against the core-word pinyin-and-tone sequences in the database;
and the speech recognition result output module, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
The embodiment provided by the application has at least the following beneficial effects:
the problem that the speech recognition result deviates from the normal context can be solved by the speech recognition method and the speech recognition device.
Drawings
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
100 speech recognition device
11 voice receiving module
12 voice decoding module
13 speech recognition intermediate result matching module
14 speech recognition result output module
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, the present application discloses a speech recognition method, including:
s100: input voice data is acquired.
The voice data is either voice-stream data input in real time or file-stream data from an audio file.
Voice-stream data can be acquired through hardware with a real-time recording function, such as a microphone or a sound card, recording the speech as it is produced. File-stream data can be obtained by reading an audio file that stores recorded audio data; common audio file suffixes include .WAV, .AIF, .AIFF, .AU, .MP1, .MP2, .MP3, .RA, .RM, and .RAM.
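By way of illustration only, a minimal Python sketch of obtaining file-stream data from a recorded .WAV file follows; the chunked loop mimics feeding a decoder the same way a real-time voice stream would. The file name and chunk size are placeholders introduced here, and other listed formats (e.g. .MP3) would require an additional decoding library.

```python
import wave

def read_file_stream(path, chunk_frames=4000):
    """Yield raw PCM chunks from an audio file storing recorded audio (.WAV here)."""
    with wave.open(path, "rb") as wf:
        while True:
            frames = wf.readframes(chunk_frames)
            if not frames:  # end of the file stream
                break
            yield frames

# Each chunk would be handed to the decoding model of step S200.
for chunk in read_file_stream("recording.wav"):
    pass
```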
S200: the voice data is decoded through a decoding model to generate a speech recognition intermediate result.
Further, in a preferred embodiment provided by the present application, the speech decoding model is formed by an acoustic model, a dictionary, and a language model.
Through the acoustic model, a mapping between the acoustic features of the voice data and phonemes can be established; through the dictionary, a mapping between phonemes and words; and through the language model, a mapping between words and sentences. Using the mappings established by the acoustic model, the dictionary, and the language model, the computer completes the decoding of the voice data and generates the corresponding speech recognition intermediate result.
Specifically, the acoustic model is a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, and the like; the language model is a knowledge representation of word sequences; the dictionary is the set of phoneme indices corresponding to words.
Further, in a preferred embodiment provided by the present application, the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model on the basis of the preprocessed text corpus;
the foreground language model is the user language model, with a weight value preset to 0.5-0.8, and contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains its scenario corpora.
Specifically, interpolation fitting merges the two language models to improve the resulting model; when the foreground language weight is set to 0.6, the corpus distribution of the newly generated language model is optimal and the processing effect is best.
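As an illustration of the interpolation described above, the Python sketch below linearly combines two n-gram probability tables with the foreground weight of 0.6. The dict-based tables are a simplification introduced here; production toolkits operate on full ARPA-format language models.

```python
def interpolate_lm(foreground, background, fg_weight=0.6):
    """Linear interpolation of two n-gram language models.

    foreground / background: dicts mapping an n-gram tuple to its
    conditional probability P(word | history).
    fg_weight: preset foreground weight (0.5-0.8; 0.6 per the text above).
    """
    merged = {}
    for ngram in set(foreground) | set(background):
        p_fg = foreground.get(ngram, 0.0)
        p_bg = background.get(ngram, 0.0)
        merged[ngram] = fg_weight * p_fg + (1.0 - fg_weight) * p_bg
    return merged
```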
Specifically, text preprocessing removes punctuation marks, meaningless expressions, and stop words from the user's full text corpus, and converts digits into the corresponding written form of the corpus text through a number conversion module.
Further, in a preferred embodiment provided by the present application, smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the branches of the foreground language model; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
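The pruning and smoothing operations might be sketched as follows. The pruning criterion used here (keeping only n-grams whose history occurs in the foreground model) is an assumption standing in for the "irrelevant scenario corpus deletion" above; the per-history renormalization realizes the stated requirement that conditional probabilities sum to 1.

```python
def prune_and_smooth(merged, foreground):
    """Prune background-only branches, then renormalize per history."""
    # Pruning: keep n-grams whose history also appears in the foreground
    # model, so foreground branches are retained (assumed criterion).
    fg_histories = {ngram[:-1] for ngram in foreground}
    kept = {ng: p for ng, p in merged.items() if ng[:-1] in fg_histories}

    # Smoothing: redistribute conditional probability within each history
    # so the probabilities of all words following it sum to 1.
    by_history = {}
    for ng in kept:
        by_history.setdefault(ng[:-1], []).append(ng)
    smoothed = {}
    for hist, ngrams in by_history.items():
        total = sum(kept[ng] for ng in ngrams) or 1.0
        for ng in ngrams:
            smoothed[ng] = kept[ng] / total
    return smoothed
```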
S300: the speech recognition intermediate result is matched against the core-word pinyin-and-tone sequences in the core word database.
Further, in a preferred embodiment provided by the present application, the core word database is constructed by performing word segmentation and word-frequency statistics on the preprocessed text corpus and generating a weight for each segmented word according to its frequency.
The weight of each word is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
Specifically, word segmentation uses the dictionary of the decoding model together with a reverse maximum matching algorithm, which yields the best segmentation effect. Word-frequency statistics then count the occurrences of each identical word in the segmentation results.
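The weight formula above can be written down directly; a minimal sketch under the stated definition follows, where the sample words and counts are hypothetical.

```python
from statistics import median

def core_word_weights(word_freqs):
    """word_freqs: {word: count} from segmenting the preprocessed corpus.

    weight(w) = freq(w) / (max_freq + constant),
    where constant is the median of all word frequencies.
    """
    max_freq = max(word_freqs.values())
    constant = median(word_freqs.values())
    return {w: f / (max_freq + constant) for w, f in word_freqs.items()}

weights = core_word_weights({"证据": 120, "质证": 45, "合议庭": 30})  # hypothetical counts
```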
Further, in a preferred embodiment provided by the present application, the core word database can be matched against the core-word information uploaded by the user, and a corresponding weight value is automatically recommended; the user may adjust the weight value according to actual needs to increase the accuracy of speech recognition;
if the user's core word is not found by retrieval, the median weight of all words currently in the core word database is used as the recommended value.
Specifically, the core words entered by the user are matched in the core word database. If a corresponding core word is matched in the database, its weight is recommended to the user as the recommended value. The user can raise or lower the recommended weight value according to the actual scenario, which improves the accuracy of speech recognition in that scenario.
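Continuing the sketch, weight recommendation with the median fallback described above might look like this:

```python
from statistics import median

def recommend_weight(core_word, weights):
    """Recommend the stored weight for a user core word; if the word is not
    found in the database, fall back to the median weight of all entries."""
    if core_word in weights:
        return weights[core_word]
    return median(weights.values())
```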
S400: a matching result is output according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
Further, in a preferred embodiment provided by the present application, when the matching result indicates that a sequence in the speech recognition intermediate result has a corresponding pinyin-and-tone sequence in the database, core-word replacement is performed on that sequence.
Specifically, if no corresponding pinyin-and-tone sequence in the database matches the speech recognition intermediate result sequence, the speech recognition intermediate result can be output directly as the speech recognition result.
Further, in a preferred embodiment provided by the present application, during core-word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold, the core-word sequence replacement is committed and the speech recognition intermediate result containing the replacement sequence is output;
the required perplexity drop (the threshold) can be adjusted according to the actual environment.
Specifically, the smaller the language-model perplexity after core-word replacement, the better the replacement sequence fits within the sentence. The threshold defaults to 0.1 and can be tuned to control how strictly the replacement sequence must fit within the sentence.
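A sketch of the matching and replacement decision follows. The pypinyin library is used here as one possible way to obtain tone-numbered pinyin (an assumption, not named in the application), lm_perplexity stands for a hypothetical function scoring a sentence with the language model, and the default threshold of 0.1 follows the text above.

```python
from pypinyin import lazy_pinyin, Style

def tone_key(text):
    """Tone-numbered pinyin sequence, e.g. '北京' -> ('bei3', 'jing1')."""
    return tuple(lazy_pinyin(text, style=Style.TONE3))

def try_core_word_replace(sentence, span, core_words, lm_perplexity, threshold=0.1):
    """Replace `span` with a core word sharing its pinyin-and-tone sequence,
    committing only if the sentence perplexity drops by at least `threshold`."""
    key = tone_key(span)
    for word in core_words:
        if tone_key(word) == key:
            candidate = sentence.replace(span, word)
            if lm_perplexity(candidate) <= lm_perplexity(sentence) - threshold:
                return candidate  # replacement committed
    return sentence  # no match, or insufficient perplexity drop
```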
Further, in a preferred embodiment provided by the present application, before the step of outputting the sentence containing the replacement sequence as a speech recognition result, the method further comprises performing sentence-break and punctuation prediction on that sentence.
A speech recognition apparatus 100 comprising:
a voice receiving module 11, configured to receive voice data;
a voice decoding module 12, configured to decode the voice data and generate a speech recognition intermediate result;
a speech recognition intermediate result matching module 13, configured to match the speech recognition intermediate result against the core-word pinyin-and-tone sequences in the database;
and a speech recognition result output module 14, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
In a typical configuration, a computer may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A speech recognition method, comprising:
acquiring input voice data;
decoding the voice data through a decoding model to generate a voice recognition intermediate result;
matching the speech recognition intermediate result against core-word pinyin-and-tone sequences in a core word database;
and outputting a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
2. The speech recognition method of claim 1, wherein the decoding model is collectively formed of an acoustic model, a dictionary, and a language model.
3. The speech recognition method of claim 2, wherein the language model is a new language model generated by interpolation fitting of a foreground language model and a background language model on the basis of a preprocessed text corpus;
the foreground language model is the user language model, with a weight value preset to 0.5-0.8, and contains the user-specified scenario corpus; the background language model is the language model of the original speech recognition engine and contains its scenario corpora.
4. The speech recognition method of claim 3, wherein smoothing and pruning operations are performed on the newly generated language model;
the pruning operation, guided by the foreground language model, deletes irrelevant scenario corpora from the background language model while retaining the branches of the foreground language model; the smoothing operation redistributes the conditional probabilities of all scenario corpora in the newly generated language model so that, after smoothing, they sum to 1.
5. The speech recognition method of claim 1, wherein the core word database is built by performing word segmentation and word-frequency statistics on a preprocessed text corpus and generating a weight for each segmented word according to its frequency;
the weight of each word is calculated by dividing its word frequency by the sum of the maximum word frequency and a constant, where the constant is the median of all word frequencies.
6. The speech recognition method of claim 5, wherein the core word database is adapted to match the core-word information uploaded by the user and automatically recommend a corresponding weight value, which the user can adjust according to actual needs to increase the accuracy of speech recognition;
and if the user's core word is not found by retrieval, the median weight of all words currently in the core word database is used as the recommended value.
7. The speech recognition method of claim 1, wherein, when the matching result indicates that the speech recognition intermediate result has a corresponding pinyin-and-tone sequence in the database, core-word replacement is performed on that sequence.
8. The speech recognition method of claim 7, wherein, during core-word replacement, if the language-model perplexity of the sentence containing the replacement sequence is lower than that of the original sentence by at least a threshold, the core-word sequence replacement is committed and the speech recognition intermediate result containing the replacement sequence is output;
wherein the required perplexity drop (the threshold) can be adjusted according to the actual environment.
9. The speech recognition method of claim 8, further comprising, before the step of outputting the sentence containing the replacement sequence as a speech recognition result, performing sentence-break and punctuation prediction on the sentence containing the replacement sequence.
10. A speech recognition apparatus, comprising:
the voice receiving module, configured to receive voice data;
the voice decoding module, configured to decode the voice data and generate a speech recognition intermediate result;
the speech recognition intermediate result matching module, configured to match the speech recognition intermediate result against the core-word pinyin-and-tone sequences in the database;
and the speech recognition result output module, configured to output a matching result according to the matching state between the pinyin-and-tone sequences and the speech recognition intermediate result.
CN202011295150.1A 2020-11-18 2020-11-18 Speech recognition method and device thereof Active CN112489646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011295150.1A CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011295150.1A CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Publications (2)

Publication Number Publication Date
CN112489646A true CN112489646A (en) 2021-03-12
CN112489646B CN112489646B (en) 2024-04-02

Family

ID=74931400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011295150.1A Active CN112489646B (en) 2020-11-18 2020-11-18 Speech recognition method and device thereof

Country Status (1)

Country Link
CN (1) CN112489646B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023273578A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and medium and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063900A (en) * 2010-11-26 2011-05-18 北京交通大学 Speech recognition method and system for overcoming confusing pronunciation
US20110191355A1 (en) * 2007-04-24 2011-08-04 Peking University Method for monitoring abnormal state of internet information
KR20160078703A (en) * 2014-12-24 2016-07-05 한국전자통신연구원 Method and Apparatus for converting text to scene
US20170125013A1 (en) * 2015-10-29 2017-05-04 Le Holdings (Beijing) Co., Ltd. Language model training method and device
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A kind of text key word weighing computation method based on word frequency power-law distribution characteristic
CN110675855A (en) * 2019-10-09 2020-01-10 出门问问信息科技有限公司 Voice recognition method, electronic equipment and computer readable storage medium
CN110970026A (en) * 2019-12-17 2020-04-07 用友网络科技股份有限公司 Voice interaction matching method, computer device and computer-readable storage medium

Also Published As

Publication number Publication date
CN112489646B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US11443733B2 (en) Contextual text-to-speech processing
JP5768093B2 (en) Speech processing system
US20140114663A1 (en) Guided speaker adaptive speech synthesis system and method and computer program product
US11727922B2 (en) Systems and methods for deriving expression of intent from recorded speech
CN106875936B (en) Voice recognition method and device
CN110599998B (en) Voice data generation method and device
Fendji et al. Automatic speech recognition using limited vocabulary: A survey
US11170763B2 (en) Voice interaction system, its processing method, and program therefor
JP6622681B2 (en) Phoneme Breakdown Detection Model Learning Device, Phoneme Breakdown Interval Detection Device, Phoneme Breakdown Detection Model Learning Method, Phoneme Breakdown Interval Detection Method, Program
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN112489688A (en) Neural network-based emotion recognition method, device and medium
EP4266306A1 (en) A speech processing system and a method of processing a speech signal
CN112885335B (en) Speech recognition method and related device
CN112489646B (en) Speech recognition method and device thereof
US20190088258A1 (en) Voice recognition device, voice recognition method, and computer program product
Rashmi et al. Hidden Markov Model for speech recognition system—a pilot study and a naive approach for speech-to-text model
Norouzian et al. An approach for efficient open vocabulary spoken term detection
CN112837688B (en) Voice transcription method, device, related system and equipment
US20150269927A1 (en) Text-to-speech device, text-to-speech method, and computer program product
Kirkedal Danish stød and automatic speech recognition
Seki et al. Diversity-based core-set selection for text-to-speech with linguistic and acoustic features
JP2020129015A (en) Voice recognizer, voice recognition method and program
CN112786004B (en) Speech synthesis method, electronic equipment and storage device
JP2020046551A (en) Learning device and program for learning statistical model used for voice synthesis
JP6220733B2 (en) Voice classification device, voice classification method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant