CN106875936B - Voice recognition method and device - Google Patents


Info

Publication number
CN106875936B
CN106875936B (application number CN201710254628.8A)
Authority
CN
China
Prior art keywords
pronunciation
voice signal
probability
pronunciations
classification result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710254628.8A
Other languages
Chinese (zh)
Other versions
CN106875936A (en)
Inventor
李忠杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201710254628.8A priority Critical patent/CN106875936B/en
Publication of CN106875936A publication Critical patent/CN106875936A/en
Priority to PCT/CN2017/104382 priority patent/WO2018192186A1/en
Application granted granted Critical
Publication of CN106875936B publication Critical patent/CN106875936B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention provides a voice recognition method and a voice recognition device, wherein the method comprises the following steps: acquiring a feature classification result of a voice signal to be recognized, the feature classification result comprising pronunciations used for describing pronunciation features of the voice signal frames and the probability of mapping each voice signal frame to the corresponding pronunciations; filtering the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result; and recognizing the voice signal based on the filtered feature classification result. By implementing the embodiment of the invention, recognition operations related to the filtered pronunciations need not be executed while recognizing the voice signal; for example, paths related to the filtered pronunciations need not be searched in the recognition network. The time consumed in the voice recognition process is thereby effectively reduced, and voice recognition efficiency is improved.

Description

Voice recognition method and device
Technical Field
The invention relates to the technical field of computers, in particular to a voice recognition method and a voice recognition device.
Background
With the development of computer technology, Automatic Speech Recognition (ASR) is increasingly applied in fields such as human-computer interaction. Current speech recognition technology mainly converts a speech signal to be recognized into text information through a signal processing module, a feature extraction module, an acoustic model, a Language Model (LM), a dictionary, and a decoder, thereby completing speech recognition.
In the speech recognition process, the signal processing module and the feature extraction module divide the speech signal to be recognized into a plurality of speech signal frames, enhance each frame by eliminating noise, channel distortion, and the like, convert each frame from the time domain to the frequency domain, and extract suitable acoustic features from the converted frames. An acoustic model, trained on the feature parameters of a training speech library, then takes the extracted acoustic features as input, maps them to pronunciations capable of describing the pronunciation characteristics of the speech signal frames, and calculates the probability that each frame maps to each pronunciation, thereby obtaining a feature classification result.
The language model contains the associations between different words (such as characters, words, and phrases) and their probabilities, and is used to estimate the likelihood of the various pieces of text information composed of those words. The decoder establishes a recognition network based on the trained acoustic model, language model, and dictionary; each path in the recognition network corresponds to a piece of text information and its pronunciation. An optimal path is then searched in the recognition network according to the pronunciations output by the acoustic model, and the text information corresponding to the speech signal is output with maximum probability based on that path, completing speech recognition.
However, a language model is generally trained on a large corpus and contains a large number of association relationships and probabilities between words, so a recognition network established on such a language model contains a large number of nodes, each with a very large number of branches. When a path search is performed in the recognition network, the number of nodes related to the pronunciations of each speech signal frame grows exponentially, resulting in a very large search volume and a long search time, which reduces speech recognition efficiency.
Disclosure of Invention
In view of this, the present invention provides a speech recognition method and apparatus to address the excessive time consumption and low efficiency of the speech recognition process.
According to a first aspect of the present invention, there is provided a speech recognition method comprising the steps of:
acquiring a feature classification result of a voice signal to be recognized; the feature classification result comprises pronunciations used for describing pronunciation features of the voice signal frames and the probability of mapping the voice signal frames to the corresponding pronunciations;
filtering the pronunciations contained in the feature classification result based on the probability contained in the feature classification result;
and recognizing the voice signal based on the filtered feature classification result.
In one embodiment, the filtering the pronunciations included in the feature classification result based on the probability included in the feature classification result includes:
judging whether the probability of any voice signal frame mapped to the corresponding pronunciation meets a preset filtering rule or not;
and if the corresponding pronunciation meets a preset filtering rule, filtering the corresponding pronunciation.
In one embodiment, if the probability difference between the probability that any speech signal frame is mapped to a corresponding utterance and the maximum mapping probability of that speech signal frame is within a predetermined difference range, determining that the corresponding utterance satisfies a predetermined filtering rule;
determining that a corresponding utterance satisfies a predetermined filtering rule if a probability that any one of the frames of speech signals maps to the corresponding utterance is less than a probability that the frame of speech signals maps to each of a predetermined number of utterances.
In one embodiment, the predetermined number is any one of:
the number of pronunciations reserved in the feature classification result in the pronunciations corresponding to the frame of the voice signal;
the predetermined scaling threshold is multiplied by the total number of pronunciations corresponding to the frame of speech signals.
In one embodiment, the filtering the pronunciations included in the feature classification result based on the probability included in the feature classification result includes:
acquiring histogram distribution of the probability of mapping any voice signal frame to each pronunciation;
acquiring a beam width corresponding to the histogram distribution;
determining pronunciations with probability distribution outside the beam width as pronunciations meeting the preset filtering rule;
and deleting the pronunciations meeting the preset filtering rule from the pronunciations contained in the feature classification result.
In one embodiment, the deleting the pronunciation satisfying the predetermined filtering rule from the pronunciations included in the feature classification result includes:
if the probability that any voice signal frame is mapped to the corresponding pronunciation meets a preset filtering rule, determining the pronunciation as a candidate pronunciation;
if the probability of mapping to the candidate pronunciation in any one frame of the adjacent voice signal frames of the preset number of frames of the voice signal frame meets the preset filtering rule, deleting the candidate pronunciation from the pronunciations contained in the characteristic classification result;
if the probabilities mapped to the candidate pronunciations of the adjacent voice signal frames with the preset number of frames of the voice signal frame do not meet the preset filtering rule, the candidate pronunciations are kept in the pronunciations contained in the characteristic classification result.
According to a second aspect of the present invention, there is provided a speech recognition apparatus comprising:
the classification result acquisition module is used for acquiring a characteristic classification result of the voice signal to be recognized; the feature classification result comprises pronunciations used for describing pronunciation features of the voice signal frames and the probability of mapping the voice signal frames to the corresponding pronunciations;
the pronunciation filtering module is used for filtering the pronunciations contained in the feature classification results based on the probabilities contained in the feature classification results;
and the voice recognition module is used for recognizing the voice signal based on the filtered feature classification result.
In one embodiment, the pronunciation filter module further comprises:
the first filtering module is used for filtering the corresponding pronunciation when the probability of any voice signal frame being mapped to the corresponding pronunciation and the probability difference between the maximum mapping probability of the voice signal frame are within a preset difference range;
and the second filtering module is used for filtering the corresponding pronunciations when the probability that any voice signal frame is mapped to the corresponding pronunciations is less than the probability that the voice signal frame is mapped to each pronunciations in the preset number of pronunciations.
In one embodiment, the pronunciation filter module comprises:
the probability distribution module is used for acquiring the histogram distribution of the probability that any voice signal frame is mapped to each pronunciation;
a beam width determining module for obtaining a beam width corresponding to the histogram distribution;
a pronunciation determination module for determining pronunciations with probability distribution outside the beam width as pronunciations meeting the predetermined filtering rule;
and the pronunciation deleting module is used for deleting the pronunciations meeting the preset filtering rule from the pronunciations contained in the feature classification result.
In one embodiment, the pronunciation filter module comprises:
the candidate pronunciation module is used for determining the pronunciation as a candidate pronunciation when the probability of mapping any voice signal frame to the corresponding pronunciation meets a preset filtering rule;
a candidate pronunciation deletion module for deleting the candidate pronunciation from the pronunciation contained in the feature classification result when the probability of mapping to the candidate pronunciation in any one of the adjacent voice signal frames of the predetermined number of frames of the voice signal frame satisfies the predetermined filtering rule;
and the candidate pronunciation retaining module is used for retaining the candidate pronunciation in the pronunciation contained in the characteristic classification result when the probabilities of mapping the adjacent voice signal frames with the preset number of frames of the voice signal frame to the candidate pronunciation do not meet the preset filtering rule.
By implementing the embodiment provided by the invention, when a voice signal is recognized, the feature classification result of the voice signal is first obtained, and the pronunciations contained in the feature classification result are then filtered based on the probabilities contained in the result. Consequently, in the process of recognizing the voice signal, recognition operations related to the filtered pronunciations need not be executed; for example, paths related to the filtered pronunciations need not be searched in the recognition network. The time consumed in the voice recognition process is thereby effectively reduced, and voice recognition efficiency is further improved.
Drawings
FIG. 1 is a flow diagram illustrating a method of speech recognition in accordance with an exemplary embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method of speech recognition in accordance with another exemplary embodiment of the present invention;
FIG. 3 is a logical block diagram of a speech recognition apparatus according to an exemplary embodiment of the present invention;
FIG. 4 is a logical block diagram of a speech recognition apparatus according to another exemplary embodiment of the present invention;
FIG. 5 is a hardware configuration diagram of a voice recognition apparatus according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
The speech recognition according to the embodiment of the present invention involves an acoustic model and a language model in the recognition process. The acoustic model is a knowledge representation of differences in acoustics, phonetics, environmental variables, and the speaker's gender, accent, and the like. It is constructed by training on the speech contained in a training speech library with an LSTM (Long Short-Term Memory) network, a CTC (Connectionist Temporal Classification) model, or a Hidden Markov Model (HMM), obtaining a mapping from the acoustic features of speech to pronunciations, where the pronunciation depends on the modeling unit: if the modeling unit is a syllable, the pronunciation is a syllable; if the modeling unit is a phoneme, the pronunciation is a phoneme; if the modeling unit is a state that constitutes a phoneme, the pronunciation is a state.
In training the acoustic model, considering that pronunciation varies with factors such as wording, speaking rate, intonation, accent, and dialect, the training speech library needs to cover a large amount of speech spanning these factors. In addition, for the sake of recognition accuracy, smaller pronunciation units such as syllables, phonemes, or states may be selected as the modeling unit. Model training based on the large amount of speech in the training speech library and the predetermined modeling unit therefore produces an acoustic model covering a large number of classes. In the speech recognition process, this acoustic model performs feature classification on the speech signal to be recognized, and the obtained feature classification result contains a large number of pronunciations (categories), for example 3000 to 10000 pronunciations.
In addition, in order to recognize the text information corresponding to a speech signal, current speech recognition technology needs to search all possible paths in the recognition network for each pronunciation, and the number of paths grows exponentially during the search. If paths related to 3000 to 10000 pronunciations are searched in the recognition network, the storage resources and computation required may exceed what the speech recognition system can bear. Current speech recognition technology therefore consumes a large amount of time and resources and suffers from low recognition efficiency.
The scheme of the invention aims to solve the problem of low speech recognition efficiency by improving the feature classification result obtained in the speech recognition process. A filtering rule is set in advance according to the device resources available for speech recognition and the required recognition efficiency. When a speech signal is recognized, its feature classification result is obtained first, and the pronunciations contained in the result are then filtered based on the probabilities contained in the result. In the subsequent recognition of the speech signal, paths related to the filtered pronunciations need not be searched in the recognition network, which effectively reduces the time consumed by the search and improves speech recognition efficiency. The speech recognition process of the present invention is described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating a speech recognition method according to an exemplary embodiment of the invention, which can be applied to various electronic devices with speech processing capability, and includes the following steps S101-S103:
s101, obtaining a feature classification result of a voice signal to be recognized; the feature classification result includes a pronunciation for describing pronunciation features of each speech signal frame and a probability that each speech signal frame maps to a corresponding pronunciation.
And S102, filtering pronunciations contained in the feature classification result based on the probability contained in the feature classification result.
And S103, recognizing the voice signal based on the filtered feature classification result.
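Steps S101-S103 can be sketched as a per-frame filter applied before decoding. This is a minimal illustrative sketch only: the dictionary layout (pronunciation label mapped to probability) and the function name are assumptions, not the patent's implementation.

```python
from typing import Callable, Dict, List

# One frame's feature classification result: pronunciation -> probability
# that the frame maps to that pronunciation (S101).
FrameResult = Dict[str, float]

def filter_classification(frames: List[FrameResult],
                          keep: Callable[[str, float, FrameResult], bool]
                          ) -> List[FrameResult]:
    """S102: per frame, keep only the pronunciations the rule accepts.
    The decoder (S103) then only searches paths over what remains."""
    return [{p: s for p, s in frame.items() if keep(p, s, frame)}
            for frame in frames]
```

The `keep` callback stands in for whichever predetermined filtering rule is chosen; the filtering modes described below can each be expressed in this form.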
In the embodiment of the invention, the voice signal may be speech uttered by a user and acquired in real time by a local voice acquisition device, or speech transmitted by a remote voice acquisition device. When obtaining the feature classification result, the voice signal may be preprocessed in real time by a conventional voice preprocessing module, after which the feature extraction module performs feature extraction on the preprocessed signal. The extracted features may include PLP (Perceptual Linear Prediction), LPCC (Linear Predictive Cepstral Coefficients), FBANK (Mel-Scale Filter Bank), MFCC (Mel-Frequency Cepstral Coefficients), and the like. The extracted features are then processed by the acoustic model to obtain the feature classification result, in which each probability represents the likelihood that a voice signal frame maps to the corresponding pronunciation. In other examples, a feature classification result transmitted by another terminal device may be received directly.
After the feature classification result is obtained, the scheme of the invention observes that some pronunciations contained in the result have low correlation with the speech signal frames of the signal to be recognized and little influence on recognition accuracy. To reduce the impact that the large number of pronunciations in the feature classification result has on recognition efficiency, these low-impact pronunciations can be filtered out of the result before recognition is performed on it, reducing the number of pronunciations contained in the feature classification result and thereby improving recognition efficiency.
In general, the lower the correlation between a pronunciation and a speech signal frame to be recognized, the lower the probability that the frame maps to that pronunciation when the acoustic model classifies the signal's acoustic features. The pronunciations in the feature classification result can therefore be filtered based on the probability that each speech signal frame maps to each pronunciation: after filtering, the probability that any frame maps to a filtered-out pronunciation is smaller than the probability that it maps to the retained pronunciations.
In addition, when filtering low-correlation pronunciations, the requirements that different application scenarios place on speech recognition accuracy must be considered, and the influence of the filtered pronunciations on accuracy measured. Various filtering rules that limit the degree to which filtering affects recognition accuracy can therefore be preset according to the accuracy requirements. For each preset filtering rule, when the pronunciations contained in the feature classification result are filtered, it is judged whether the probability that a voice signal frame maps to the corresponding pronunciation satisfies the rule; if it does, the corresponding pronunciation is filtered. A filtered pronunciation generally refers to a pronunciation deleted from the feature classification result.
Several ways of filtering the utterances contained in the feature classification result are listed below:
A first filtering mode: filter out low-probability pronunciations by a predetermined number, which may be the number of pronunciations retained in the feature classification result among the pronunciations corresponding to the voice signal frame, or the product of a predetermined proportion threshold and the total number of pronunciations corresponding to the frame. During filtering, if the probability that any speech signal frame maps to a corresponding pronunciation is less than the probability that the frame maps to each of a predetermined number of pronunciations, the corresponding pronunciation is determined to satisfy the predetermined filtering rule.
The predetermined proportion threshold may be set by the designer according to the required speech recognition accuracy, for example to 1/4, the ratio of the number of retained pronunciations to the number of all pronunciations.
In one example, during actual filtering, pronunciations may be deleted from the feature classification result in order of increasing probability; when the ratio of the number of remaining pronunciations to the number of all original pronunciations reaches the predetermined proportion threshold, the filtering of the feature classification result is complete.
In other examples, the predetermined proportion threshold may refer to the proportion between the number of retained pronunciations and the number of filtered-out pronunciations. In actual filtering, pronunciations can be selected from the feature classification result in order of decreasing probability; when the ratio of the number of selected pronunciations to the number of remaining pronunciations reaches the predetermined proportion threshold, the selected pronunciations are retained, the rest are deleted, and the filtering of the feature classification result is complete.
In practical applications, when the predetermined number refers to the number of pronunciations retained in the feature classification result among the pronunciations corresponding to the voice signal frame, it may be set by the designer according to the required speech recognition accuracy, for example to any value from 2000 to 9000. During filtering, the pronunciations mapped to each voice signal frame can be arranged in order of increasing probability, and pronunciations deleted from the low-probability end of the feature classification result until the number remaining equals the predetermined number, completing the filtering.
In other examples, the predetermined number may refer to the number of pronunciations that are not filtered out, for example 1000. In actual filtering, the pronunciations mapped to each speech signal frame can be arranged in order of decreasing probability; the pronunciations in the first predetermined number of positions are kept in the feature classification result and all others are deleted, completing the filtering. In other embodiments, other technical means may be adopted to implement the first filtering mode, which the present invention does not limit.
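The first filtering mode, in both its count and proportion variants, can be sketched as follows. This is an illustrative sketch under the assumption that a frame's classification result is a dictionary from pronunciation label to probability; the function names are not from the patent.

```python
import heapq

def filter_top_n(frame_probs: dict, keep_n: int) -> dict:
    """Keep only the keep_n most probable pronunciations of one frame,
    deleting the rest from the feature classification result."""
    if keep_n >= len(frame_probs):
        return dict(frame_probs)
    return dict(heapq.nlargest(keep_n, frame_probs.items(),
                               key=lambda kv: kv[1]))

def filter_by_ratio(frame_probs: dict, keep_ratio: float) -> dict:
    """Keep a predetermined proportion (e.g. 1/4) of the pronunciations:
    the predetermined number is the proportion threshold times the total
    number of pronunciations for the frame."""
    keep_n = max(1, int(len(frame_probs) * keep_ratio))
    return filter_top_n(frame_probs, keep_n)
```

In a real system the per-frame dictionary would hold thousands of entries (for example the 3000 to 10000 pronunciations mentioned above) rather than the handful shown here.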
A second filtering mode: filter out low-probability pronunciations by a predetermined difference threshold, which can be set by the designer according to the required speech recognition accuracy, for example to -0.5. The threshold refers to the difference between the probability of a filtered pronunciation and the highest probability among the pronunciations mapped to the same speech signal frame. During filtering, if the difference between the probability that a voice signal frame maps to a pronunciation and that frame's maximum mapping probability falls within the predetermined difference range, the pronunciation is determined to satisfy the predetermined filtering rule and can be filtered.
In one example, in the actual filtering, the pronunciations mapped to each speech signal frame may be arranged in order of probability from large to small, the probability of the speech signal frame being mapped to the pronunciations arranged at the first bit is determined as the maximum probability, then the difference between the probability of the speech signal frame being mapped to each pronunciation and the maximum probability is sequentially obtained from the pronunciation arranged at the last bit, and if the difference is less than-0.5, the pronunciation is deleted from the feature classification result. In other embodiments, other technical means may be adopted to filter the feature classification results in a filtering manner, which is not limited in the present invention.
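The second filtering mode can be sketched as below. The negative example threshold (-0.5) in the text suggests log-domain scores, so this sketch assumes log probabilities; the function name and data layout are illustrative assumptions.

```python
def filter_by_margin(frame_scores: dict, diff_threshold: float = -0.5) -> dict:
    """Keep a pronunciation only if its score minus the frame's maximum
    score is at least diff_threshold; pronunciations whose difference
    falls below the threshold (e.g. below -0.5) are deleted."""
    best = max(frame_scores.values())
    return {p: s for p, s in frame_scores.items()
            if s - best >= diff_threshold}
```

Note that the pronunciation carrying the maximum probability always survives, since its difference from the maximum is zero.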
And (3) a third filtering mode: filtering pronunciations distributed outside the beam width according to the histogram distribution of the probability, wherein the histogram distribution of the probability mapped to each pronunciation by any voice signal frame can be obtained firstly during actual filtering; acquiring a beam width corresponding to the histogram distribution; then determining pronunciations with probability distribution outside the beam width as pronunciations meeting the preset filtering rule; and finally deleting the pronunciations meeting the preset filtering rule from the pronunciations contained in the feature classification result. In practical application, the beamwidth can be determined by the designer according to the speech recognition accuracy and the histogram distribution, such as: 8000 pronunciations with low probability are preset to be filtered, 8000 pronunciations can be searched from one side with low probability in the histogram, and the position of the 8000-th pronunciation is determined as a beam width boundary. In other embodiments, other technical means may also be adopted to filter the feature classification result in a filtering manner three, which is not limited in the present invention.
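The third filtering mode can be sketched as below. A sort over the probabilities stands in for walking the probability histogram from the low-probability side; the function name and the tie-handling behavior are illustrative assumptions, not the patent's implementation.

```python
def filter_by_beam_width(frame_probs: dict, drop_n: int) -> dict:
    """Drop the drop_n lowest-probability pronunciations of one frame:
    the probability found at position drop_n, counted from the low side,
    acts as the beam-width boundary, and pronunciations whose probability
    lies below that boundary are deleted."""
    if drop_n >= len(frame_probs):
        return {}
    ordered = sorted(frame_probs.values())
    boundary = ordered[drop_n]  # score of the first retained pronunciation
    return {p: s for p, s in frame_probs.items() if s >= boundary}
```

With the figures from the text, `drop_n` would be 8000 and the boundary would be the probability of the 8000th pronunciation counted from the low side of the histogram.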
After the pronunciations contained in the feature classification result are filtered in any of the above modes, the predetermined recognition network can be invoked directly, paths related to the pronunciations remaining in the filtered feature classification result searched, the optimal path found, and the text information corresponding to the speech signal to be recognized output according to the maximum-probability path, completing speech recognition.
When searching for the optimal path, the probabilities (acoustic scores) contained in the feature classification result are converted into a numerical space close to that of the association probabilities (language scores) between words (such as characters, words, and phrases) contained in the language model, and the two are weighted and summed to form a comprehensive score for the path search. For each speech signal frame, a preset threshold limits the search: if the score difference between a path and the current optimal path is greater than the threshold, the path is discarded; otherwise it is retained. After each speech signal frame has been searched, all surviving paths are sorted and only a preset maximum number of the best paths are kept, until the last frame is processed and the final path graph is obtained.
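The per-frame pruning described above can be sketched as follows (the names are illustrative, and each path's score is assumed to already be the weighted sum of its acoustic and language scores):

```python
def prune_paths(path_scores, beam_threshold, max_paths):
    """Apply the two pruning steps described above to one frame's paths:
    discard paths whose comprehensive score trails the optimal path by
    more than `beam_threshold`, then keep at most `max_paths` of the
    best surviving paths."""
    best = max(path_scores.values())
    # Beam pruning against the preset threshold.
    survivors = {p: s for p, s in path_scores.items() if best - s <= beam_threshold}
    # Rank the survivors and retain only the preset maximum path number.
    ranked = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:max_paths])
```

For instance, `prune_paths({"p1": 0.0, "p2": -3.0, "p3": -12.0}, beam_threshold=10.0, max_paths=2)` keeps p1 and p2, since p3 trails the best path by 12, beyond the threshold.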
In some examples, the modeling unit of the acoustic model that outputs the feature classification result is small; for example, when the state is used as the modeling unit, a single phoneme may consist of three to five states, and the speech signal produced by pronouncing one phoneme may be divided into multiple speech signal frames. The acoustic features of several consecutive speech signal frames are therefore likely to be quite similar, and the pronunciations describing each of those frames in the feature classification result are likely to be similar as well. In such a situation, if the pronunciations mapped to each speech signal frame are filtered independently based on the probabilities contained in the feature classification result and the predetermined filtering rule, pronunciations that strongly influence the recognition accuracy may easily be filtered out by mistake. To avoid this, the filtering conditions of consecutive speech signal frames can be considered jointly when filtering the feature classification result. The specific implementation can refer to the method shown in fig. 2, comprising the following steps S201-S205:
Step S201: obtaining a feature classification result of a speech signal to be recognized; the feature classification result includes pronunciations describing the pronunciation features of each speech signal frame and the probability that each speech signal frame maps to the corresponding pronunciation.
Step S202: if the probability that any speech signal frame is mapped to a corresponding pronunciation satisfies the predetermined filtering rule, determining the pronunciation to be a candidate pronunciation.
Step S203: if the probability that any one of a predetermined number of speech signal frames adjacent to the speech signal frame is mapped to the candidate pronunciation also satisfies the predetermined filtering rule, deleting the candidate pronunciation from the pronunciations contained in the feature classification result.
Step S204: if none of the probabilities with which the predetermined number of adjacent speech signal frames are mapped to the candidate pronunciation satisfies the predetermined filtering rule, retaining the candidate pronunciation in the pronunciations contained in the feature classification result.
Step S205: recognizing the speech signal based on the filtered feature classification result.
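Steps S202-S204 might be sketched as follows (the function name and the `rule` signature are our assumptions; `rule` stands in for any of the predetermined filtering rules described above):

```python
def filter_with_context(frames, rule, num_adjacent=3):
    """Filter each frame's pronunciations while considering adjacent
    frames: a pronunciation satisfying the rule becomes a candidate,
    is deleted if the rule is also satisfied for it in any of the
    `num_adjacent` frames on either side, and is retained otherwise.

    frames: list of dicts, one per speech signal frame,
            mapping pronunciation -> log probability.
    rule:   function(pronunciation, frame_probs) -> bool, True when the
            pronunciation satisfies the predetermined filtering rule.
    """
    filtered = []
    for i, probs in enumerate(frames):
        kept = {}
        for pron, lp in probs.items():
            if not rule(pron, probs):
                kept[pron] = lp  # not a candidate pronunciation (S202)
                continue
            # Candidate: inspect the adjacent frames before deleting.
            neighbours = (frames[max(0, i - num_adjacent):i]
                          + frames[i + 1:i + 1 + num_adjacent])
            if any(pron in nb and rule(pron, nb) for nb in neighbours):
                continue  # rule also satisfied nearby: delete (S203)
            kept[pron] = lp  # no adjacent frame satisfies the rule: retain (S204)
        filtered.append(kept)
    return filtered
```

A pronunciation that scores poorly in one frame but well in its neighbours — the situation the consecutive-frame check is designed to protect — thus survives the filter.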
In the embodiments of the present invention, the predetermined filtering rule may be any one of the rules of filtering modes one to four, or any other filtering rule capable of limiting the influence of the filtered pronunciations on the recognition accuracy.
The predetermined number of adjacent speech signal frames can be set by the designer according to the speech recognition accuracy that needs to be achieved; for example, 6: the three preceding and the three following adjacent frames.
As can be seen from the above embodiment, when a speech signal is recognized, the feature classification result of the speech signal is first obtained, and the pronunciations contained in the feature classification result are then filtered based on the probabilities it contains. Recognition operations related to the filtered pronunciations therefore need not be executed while recognizing the speech signal; for example, paths related to the filtered pronunciations need not be searched in the recognition network, which effectively reduces the time consumed by speech recognition and improves its efficiency.
Furthermore, the speech recognition method of the embodiments of the present invention can be applied to the human-computer interaction software of various electronic devices. For example, when applied to voice search on a smartphone, if a user speaks within a preset range of the smartphone, the method can first obtain the feature classification result of the speech after receiving the user's speech collected by a speech collection device, then filter the pronunciations contained in the feature classification result based on the probabilities it contains, and then search only the paths related to the unfiltered pronunciations in the recognition network. The text information corresponding to the user's speech is thus recognized quickly through path search, enabling a voice assistant to respond to the user promptly based on the recognition result.
Corresponding to the embodiments of the method described above, the invention also provides embodiments of the apparatus.
Referring to fig. 3, fig. 3 is a logic block diagram illustrating a speech recognition apparatus according to an exemplary embodiment of the present invention, which may include: a classification result acquisition module 310, a pronunciation filtering module 320, and a speech recognition module 330.
The classification result obtaining module 310 is configured to obtain a feature classification result of the speech signal to be recognized; the feature classification result includes a pronunciation for describing pronunciation features of each speech signal frame and a probability that each speech signal frame maps to a corresponding pronunciation.
The pronunciation filtering module 320 is configured to filter the pronunciations included in the feature classification result based on the probabilities included in the feature classification result.
A speech recognition module 330, configured to recognize the speech signal based on the filtered feature classification result.
In some examples, the pronunciation filtering module 320 may include:
a first filtering module, configured to filter a corresponding pronunciation when the difference between the probability that any speech signal frame is mapped to the pronunciation and the maximum mapping probability of that frame is within a predetermined difference range; and
a second filtering module, configured to filter a corresponding pronunciation when the probability that any speech signal frame is mapped to the pronunciation is smaller than the probability that the frame is mapped to each of a predetermined number of pronunciations.
In other examples, the pronunciation filtering module 320 may further include:
a probability distribution module, configured to obtain the histogram distribution of the probabilities with which any speech signal frame is mapped to each pronunciation;
a beam width determining module, configured to acquire the beam width corresponding to the histogram distribution;
a pronunciation determining module, configured to determine pronunciations whose probabilities fall outside the beam width as pronunciations satisfying the predetermined filtering rule; and
a pronunciation deleting module, configured to delete the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result.
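The rule applied by the second filtering module above, retaining only a preset number of top-scoring pronunciations per frame, can be sketched as follows (names are illustrative):

```python
def filter_top_n(frame_log_probs, keep_n):
    """Keep the `keep_n` highest-probability pronunciations for one
    frame; any pronunciation whose probability is smaller than each of
    those retained pronunciations is filtered out."""
    ranked = sorted(frame_log_probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:keep_n])
```

Per the claims, `keep_n` may itself be derived from the frame, e.g. a predetermined scaling threshold multiplied by the total number of pronunciations for that frame.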
Referring to fig. 4, fig. 4 is a logic block diagram of a speech recognition apparatus according to another exemplary embodiment of the present invention, which may include: a classification result acquisition module 410, a pronunciation filtering module 420, and a speech recognition module 430. The pronunciation filtering module 420 may include a candidate pronunciation determination module 421, a candidate pronunciation deletion module 422, and a candidate pronunciation retention module 423.
The classification result acquisition module 410 is configured to obtain a feature classification result of the speech signal to be recognized; the feature classification result includes pronunciations describing the pronunciation features of each speech signal frame and the probability that each speech signal frame maps to the corresponding pronunciation.
The candidate pronunciation determination module 421 is configured to determine a pronunciation to be a candidate pronunciation when the probability that any speech signal frame is mapped to the corresponding pronunciation satisfies a predetermined filtering rule.
The candidate pronunciation deletion module 422 is configured to delete a candidate pronunciation from the pronunciations contained in the feature classification result when the probability that any one of a predetermined number of speech signal frames adjacent to the speech signal frame is mapped to the candidate pronunciation satisfies the predetermined filtering rule.
The candidate pronunciation retention module 423 is configured to retain the candidate pronunciation in the pronunciations contained in the feature classification result when none of the probabilities with which the predetermined number of adjacent speech signal frames are mapped to the candidate pronunciation satisfies the predetermined filtering rule.
The speech recognition module 430 is configured to recognize the speech signal based on the filtered feature classification result.
The implementation processes of the functions and effects of each unit (or module) in the above apparatus are described in detail in the implementation processes of the corresponding steps of the above method, and are not repeated here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units or modules described as separate parts may or may not be physically separate, and the parts displayed as the units or modules may or may not be physical units or modules, may be located in one place, or may be distributed on a plurality of network units or modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the voice recognition device can be applied to electronic equipment. In particular, it may be implemented by a computer chip or entity, or by an article of manufacture having some functionality. In a typical implementation, the electronic device is a computer, which may be embodied in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, internet television, smart car, unmanned vehicle, smart refrigerator, other smart home device, or a combination of any of these devices.
The device embodiments may be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, the apparatus is formed as a logical device when the processor of the electronic device in which it is located reads the corresponding computer program instructions from a readable medium, such as a non-volatile memory, into memory and runs them. In terms of hardware, fig. 5 shows a hardware structure diagram of the electronic device in which the speech recognition apparatus of the present invention is located. In addition to the processor, memory, network interface, and non-volatile memory shown in fig. 5, the electronic device in this embodiment may also include other hardware according to its actual function, which is not described again. The memory of the electronic device may store program instructions executable by the processor; the processor may be coupled to the memory to read the program instructions stored therein and, in response, perform the following operations: acquiring a feature classification result of a speech signal to be recognized, the feature classification result comprising pronunciations describing the pronunciation features of each speech signal frame and the probability that each speech signal frame maps to the corresponding pronunciation; filtering the pronunciations contained in the feature classification result based on the probabilities contained therein; and recognizing the speech signal based on the filtered feature classification result.
In other embodiments, the operations performed by the processor may refer to the description related to the above method embodiments, which is not repeated herein.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. A speech recognition method, comprising the steps of:
after the characteristics of a voice signal to be recognized are extracted, acquiring a characteristic classification result of the voice signal through an acoustic model; the feature classification result comprises pronunciations used for describing pronunciation features of the voice signal frames and probabilities of mapping the voice signal frames to the corresponding pronunciations, and the probabilities are used for representing the possibility of mapping the voice signal frames to the corresponding pronunciations;
based on the probability contained in the feature classification result, filtering the pronunciation contained in the feature classification result, and filtering the pronunciation with low correlation with the voice signal frame;
recognizing the voice signal based on the filtered feature classification result;
the filtering the pronunciation contained in the feature classification result based on the probability contained in the feature classification result comprises:
if the probability that any voice signal frame is mapped to the corresponding pronunciation meets a preset filtering rule, determining the pronunciation as a candidate pronunciation;
if the probability of mapping to the candidate pronunciation in any one frame of the adjacent voice signal frames of the preset number of frames of the voice signal frame meets the preset filtering rule, deleting the candidate pronunciation from the pronunciations contained in the characteristic classification result;
if the probabilities mapped to the candidate pronunciations of the adjacent voice signal frames with the preset number of frames of the voice signal frame do not meet the preset filtering rule, the candidate pronunciations are kept in the pronunciations contained in the characteristic classification result;
alternatively,
judging whether the probability of any voice signal frame mapped to the corresponding pronunciation meets a preset filtering rule or not;
if the corresponding pronunciation meets a preset filtering rule, filtering the corresponding pronunciation;
wherein whether the probability that any speech signal frame is mapped to the corresponding pronunciation satisfies the predetermined filtering rule is determined based on:
if the probability that any voice signal frame is mapped to the corresponding pronunciation and the probability difference between the maximum mapping probability of the voice signal frame are within a preset difference range, determining that the corresponding pronunciation meets a preset filtering rule;
if the probability that any voice signal frame is mapped to the corresponding pronunciation is smaller than the probability that the voice signal frame is mapped to each pronunciation in the preset number of pronunciations, determining that the corresponding pronunciation meets the preset filtering rule;
the predetermined number is any one of:
the number of pronunciations that are retained in the feature classification result in the pronunciation corresponding to the speech signal frame;
the predetermined scaling threshold is multiplied by the total number of utterances corresponding to the frame of the speech signal.
2. The method of claim 1, wherein whether the probability that any speech signal frame maps to a corresponding utterance satisfies the predetermined filtering rule is further determined based on:
acquiring histogram distribution of the probability of mapping any voice signal frame to each pronunciation;
acquiring a beam width corresponding to the histogram distribution;
and determining pronunciations with probability distribution outside the beam width as pronunciations meeting a preset filtering rule.
3. A speech recognition apparatus, comprising:
the classification result acquisition module is used for acquiring a characteristic classification result of the voice signal through an acoustic model after the characteristic of the voice signal to be recognized is extracted; the feature classification result comprises pronunciations used for describing pronunciation features of the voice signal frames and probabilities of mapping the voice signal frames to the corresponding pronunciations, and the probabilities are used for representing the possibility of mapping the voice signal frames to the corresponding pronunciations;
the pronunciation filtering module is used for filtering the pronunciations contained in the feature classification result based on the probability contained in the feature classification result and filtering the pronunciations with low correlation with the voice signal frame;
a voice recognition module for recognizing the voice signal based on the filtered feature classification result;
the pronunciation filtering module is specifically configured to:
if the probability that any voice signal frame is mapped to the corresponding pronunciation meets a preset filtering rule, determining the pronunciation as a candidate pronunciation;
if the probability of mapping to the candidate pronunciation in any one frame of the adjacent voice signal frames of the preset number of frames of the voice signal frame meets the preset filtering rule, deleting the candidate pronunciation from the pronunciations contained in the characteristic classification result;
if the probabilities mapped to the candidate pronunciations of the adjacent voice signal frames with the preset number of frames of the voice signal frame do not meet the preset filtering rule, the candidate pronunciations are kept in the pronunciations contained in the characteristic classification result;
alternatively,
judging whether the probability of any voice signal frame mapped to the corresponding pronunciation meets a preset filtering rule or not;
if the corresponding pronunciation meets a preset filtering rule, filtering the corresponding pronunciation;
wherein whether the probability that any speech signal frame is mapped to the corresponding pronunciation satisfies the predetermined filtering rule is determined based on:
if the probability that any voice signal frame is mapped to the corresponding pronunciation and the probability difference between the maximum mapping probability of the voice signal frame are within a preset difference range, determining that the corresponding pronunciation meets a preset filtering rule;
if the probability that any voice signal frame is mapped to the corresponding pronunciation is smaller than the probability that the voice signal frame is mapped to each pronunciation in the preset number of pronunciations, determining that the corresponding pronunciation meets the preset filtering rule;
the predetermined number is any one of:
the number of pronunciations that are retained in the feature classification result in the pronunciation corresponding to the speech signal frame;
the predetermined scaling threshold is multiplied by the total number of utterances corresponding to the frame of the speech signal.
4. The apparatus of claim 3, wherein whether the probability that any speech signal frame maps to a corresponding utterance satisfies the predetermined filtering rule is further determined based on:
acquiring histogram distribution of the probability of mapping any voice signal frame to each pronunciation;
acquiring a beam width corresponding to the histogram distribution;
and determining pronunciations with probability distribution outside the beam width as pronunciations meeting a preset filtering rule.
CN201710254628.8A 2017-04-18 2017-04-18 Voice recognition method and device Active CN106875936B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710254628.8A CN106875936B (en) 2017-04-18 2017-04-18 Voice recognition method and device
PCT/CN2017/104382 WO2018192186A1 (en) 2017-04-18 2017-09-29 Speech recognition method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710254628.8A CN106875936B (en) 2017-04-18 2017-04-18 Voice recognition method and device

Publications (2)

Publication Number Publication Date
CN106875936A CN106875936A (en) 2017-06-20
CN106875936B true CN106875936B (en) 2021-06-22

Family

ID=59162735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710254628.8A Active CN106875936B (en) 2017-04-18 2017-04-18 Voice recognition method and device

Country Status (2)

Country Link
CN (1) CN106875936B (en)
WO (1) WO2018192186A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875936B (en) * 2017-04-18 2021-06-22 广州视源电子科技股份有限公司 Voice recognition method and device
CN107481718B (en) * 2017-09-20 2019-07-05 Oppo广东移动通信有限公司 Audio recognition method, device, storage medium and electronic equipment
CN108694951B (en) * 2018-05-22 2020-05-22 华南理工大学 Speaker identification method based on multi-stream hierarchical fusion transformation characteristics and long-and-short time memory network
CN108899013B (en) * 2018-06-27 2023-04-18 广州视源电子科技股份有限公司 Voice search method and device and voice recognition system
CN108877782B (en) * 2018-07-04 2020-09-11 百度在线网络技术(北京)有限公司 Speech recognition method and device
CN108932943A (en) * 2018-07-12 2018-12-04 广州视源电子科技股份有限公司 Order word sound detection method, device, equipment and storage medium
CN109192211A (en) * 2018-10-29 2019-01-11 珠海格力电器股份有限公司 A kind of method, device and equipment of voice signal identification
CN109872715A (en) * 2019-03-01 2019-06-11 深圳市伟文无线通讯技术有限公司 A kind of voice interactive method and device
CN115798277A (en) * 2021-09-10 2023-03-14 广州视源电子科技股份有限公司 Online classroom interaction method and online classroom system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4959865A (en) * 1987-12-21 1990-09-25 The Dsp Group, Inc. A method for indicating the presence of speech in an audio signal
US6714909B1 (en) * 1998-08-13 2004-03-30 At&T Corp. System and method for automated multimedia content indexing and retrieval

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4241771B2 (en) * 2006-07-04 2009-03-18 株式会社東芝 Speech recognition apparatus and method
CN101030369B (en) * 2007-03-30 2011-06-29 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model
CN101894549A (en) * 2010-06-24 2010-11-24 中国科学院声学研究所 Method for fast calculating confidence level in speech recognition application field
CN101944359B (en) * 2010-07-23 2012-04-25 杭州网豆数字技术有限公司 Voice recognition method facing specific crowd
CN102426836B (en) * 2011-08-25 2013-03-20 哈尔滨工业大学 Rapid keyword detection method based on quantile self-adaption cutting
CN102436816A (en) * 2011-09-20 2012-05-02 安徽科大讯飞信息科技股份有限公司 Method and device for decoding voice data
CN102779510B (en) * 2012-07-19 2013-12-18 东南大学 Speech emotion recognition method based on feature space self-adaptive projection
KR20140147587A (en) * 2013-06-20 2014-12-30 한국전자통신연구원 A method and apparatus to detect speech endpoint using weighted finite state transducer
CN103730115B (en) * 2013-12-27 2016-09-07 北京捷成世纪科技股份有限公司 A kind of method and apparatus detecting keyword in voice
CN105243143B (en) * 2015-10-14 2018-07-24 湖南大学 Recommendation method and system based on real-time phonetic content detection
CN105845128B (en) * 2016-04-06 2020-01-03 中国科学技术大学 Voice recognition efficiency optimization method based on dynamic pruning beam width prediction
CN106875936B (en) * 2017-04-18 2021-06-22 广州视源电子科技股份有限公司 Voice recognition method and device


Also Published As

Publication number Publication date
CN106875936A (en) 2017-06-20
WO2018192186A1 (en) 2018-10-25

Similar Documents

Publication Publication Date Title
CN106875936B (en) Voice recognition method and device
CN108831439B (en) Voice recognition method, device, equipment and system
CN107195296B (en) Voice recognition method, device, terminal and system
CN109545243B (en) Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium
KR100755677B1 (en) Apparatus and method for dialogue speech recognition using topic detection
JP4568371B2 (en) Computerized method and computer program for distinguishing between at least two event classes
JP4195428B2 (en) Speech recognition using multiple speech features
CN111402895B (en) Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium
CN108899013B (en) Voice search method and device and voice recognition system
CN107093422B (en) Voice recognition method and voice recognition system
CN112420026A (en) Optimized keyword retrieval system
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
JP2018072697A (en) Phoneme collapse detection model learning apparatus, phoneme collapse section detection apparatus, phoneme collapse detection model learning method, phoneme collapse section detection method, program
CN110930975A (en) Method and apparatus for outputting information
CN115132196A (en) Voice instruction recognition method and device, electronic equipment and storage medium
JP3660512B2 (en) Voice recognition method, apparatus and program recording medium
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN114360514A (en) Speech recognition method, apparatus, device, medium, and product
CN113793599A (en) Training method of voice recognition model and voice recognition method and device
CN114530141A (en) Chinese and English mixed offline voice keyword recognition method under specific scene and system implementation thereof
CN114187921A (en) Voice quality evaluation method and device
KR100915638B1 (en) The method and system for high-speed voice recognition
CN111522937A (en) Method and device for recommending dialect and electronic equipment
Liu et al. Deriving disyllabic word variants from a Chinese conversational speech corpus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant