CN107871499B - Speech recognition method, system, computer device and computer-readable storage medium

Info

Publication number: CN107871499B
Application number: CN201711031665.9A
Authority: CN (China)
Prior art keywords: word, search network, decoding, network, score
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN107871499A
Inventors: 秦浩然, 肖全之
Current and original assignee: Zhuhai Jieli Technology Co Ltd
Application filed by Zhuhai Jieli Technology Co Ltd; priority to CN201711031665.9A
Published as application CN107871499A; granted and published as CN107871499B

Classifications

    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/05 Word boundary detection
    • G10L15/083 Recognition networks
    • G10L15/26 Speech to text systems
    • G10L25/69 Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a speech recognition method, system, computer device and storage medium. A speech signal feature sequence is input into both a monophone search network and an in-set word search network for synchronous decoding; the in-set word output state score produced by decoding in the in-set word search network is obtained; when the in-set word output state score satisfies a preset condition, the confidence of the synchronous decoding of the monophone search network and the in-set word search network is obtained; and a corresponding decoding path is selected according to the confidence and output to produce a speech recognition result. In the speech recognition method, system, computer device and computer-readable storage medium, feeding the speech signal feature sequence into the monophone search network and the in-set word search network simultaneously for decoding and propagation effectively achieves in-set word recognition and out-of-set word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence then further improves speech recognition accuracy.

Description

Speech recognition method, system, computer device and computer-readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, a speech recognition system, a computer device, and a computer-readable storage medium.
Background
With the rapid development of computer technology, enabling voice communication with machines has become an important direction for artificial intelligence and machine learning. Speech recognition is the technology by which a machine converts a speech signal into corresponding text or commands through a process of recognition and understanding. Current applications of speech recognition fall broadly into two directions: large-vocabulary continuous speech recognition systems, used in mobile phone assistants, speech dictation and the like; and small-vocabulary portable voice products, such as intelligent toys and household appliance remote controls.
Small-vocabulary speech recognition systems of the second kind are increasingly applied in handheld terminals, household appliances and similar fields. Because they are oriented to a small vocabulary, they must consider, in addition to noise interference, the interference of a large number of out-of-set words: out-of-set words must be rejected while in-set words are correctly recognized. The practical performance of traditional small-vocabulary speech recognition systems remains unsatisfactory; if in-set command words cannot be recognized and out-of-set words rejected effectively, speech recognition accuracy is low.
Disclosure of Invention
In view of the foregoing, there is a need to provide a speech recognition method, system, computer device and computer-readable storage medium that can effectively recognize in-set words, reject out-of-set words, and improve recognition accuracy.
A speech recognition method comprising:
inputting a speech signal feature sequence into a monophone search network and an in-set word search network respectively, and decoding synchronously;
obtaining an in-set word output state score produced by the synchronous decoding;
when the in-set word output state score satisfies a preset condition, obtaining a confidence of the synchronous decoding of the monophone search network and the in-set word search network;
and selecting a corresponding decoding path according to the confidence, and outputting a speech recognition result.
In one embodiment, the step of inputting the speech signal feature sequence into the monophone search network and the in-set word search network respectively and decoding synchronously comprises:
inputting the current frame of the speech signal feature sequence into the monophone search network to obtain a first output state score;
and when the first output state score is greater than a first preset threshold, inputting the next frame of the speech signal feature sequence into the monophone search network and the in-set word search network respectively for synchronous decoding.
In one embodiment, the step of inputting the current frame of the speech signal feature sequence into the monophone search network to obtain the first output state score comprises:
inputting the current frame of the speech signal feature sequence into the monophone search network;
obtaining the joint probabilities of the current frame of the speech signal feature sequence and the primitives of the monophone search network;
and taking the maximum of the joint probabilities as the first output state score.
In one embodiment, the step of obtaining the confidence of the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output state score satisfies a preset condition comprises:
when the in-set word output state score satisfies the preset condition, obtaining a first transfer score from the synchronous decoding of the monophone search network and a second transfer score from the synchronous decoding of the in-set word search network;
and obtaining the confidence from the first transfer score and the second transfer score.
In one embodiment, the step of obtaining the confidence of the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output state score satisfies a preset condition comprises:
when the in-set word output state score is greater than a second preset threshold, obtaining the first transfer score and the second transfer score respectively through a token passing algorithm;
and taking the ratio of the second transfer score to the first transfer score as the confidence.
In one embodiment, the step of selecting a corresponding decoding path according to the confidence and outputting a speech recognition result comprises:
obtaining, for each decoding path, the number of frames of the speech signal feature sequence whose confidence satisfies a confidence threshold condition;
and outputting the speech recognition result from the decoding path with the largest such frame count.
In one embodiment, before the step of inputting the speech signal feature sequence into the monophone search network and the in-set word search network respectively and decoding synchronously, the method further comprises:
acquiring a speech signal;
and performing endpoint detection on the acquired speech signal to obtain the speech signal feature sequence.
A speech recognition system comprising:
a synchronous decoding module, configured to input the speech signal feature sequence into a monophone search network and an in-set word search network respectively and decode synchronously;
a state score obtaining module, configured to obtain the in-set word output state score produced by the synchronous decoding;
a confidence obtaining module, configured to obtain the confidence of the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output state score satisfies a preset condition;
and a speech recognition output module, configured to select a corresponding decoding path according to the confidence and output a speech recognition result.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech recognition method described above when executing the computer program.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the speech recognition method described above.
In the speech recognition method, system, computer device and computer-readable storage medium above, the speech signal feature sequence is decoded and propagated synchronously through the monophone search network and the in-set word search network; when the in-set word output state score produced by decoding in the in-set word search network satisfies the preset condition, the confidence of the synchronous decoding of the two networks is obtained; finally, the speech recognition result is output along the decoding path corresponding to the confidence. Feeding the speech signal feature sequence into both the monophone search network and the in-set word search network for decoding and propagation effectively achieves in-set word recognition and out-of-set word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence then further improves speech recognition accuracy.
Drawings
FIG. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of the steps of inputting the speech signal feature sequence into the monophone search network and the in-set word search network respectively and decoding synchronously, according to an embodiment of the speech recognition method of the present application;
FIG. 3 is a flowchart of the steps of inputting the current frame of the speech signal feature sequence into the monophone search network to obtain the first output state score, according to an embodiment of the speech recognition method of the present application;
FIG. 4 is a schematic flowchart of the steps performed before inputting the speech signal feature sequence into the monophone search network and the in-set word search network for synchronous decoding, according to an embodiment of the speech recognition method of the present application;
FIG. 5 is a schematic flowchart of endpoint detection according to an embodiment of the speech recognition method of the present application;
FIG. 6 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present application. As shown in fig. 1, the speech recognition method of this embodiment includes:
Step S101: inputting the speech signal feature sequence into a monophone search network and an in-set word search network respectively, and decoding synchronously.
Speech is sound produced by the human vocal organs: an analog signal with grammar and meaning that carries specific information. Because the speech signal is analog, processing it first requires converting it into a digital signal by sampling and quantization, where the sampling frequency must satisfy the Nyquist sampling theorem, i.e. it must be higher than twice the highest frequency of the speech signal being sampled. A speech signal also carries much irrelevant information, such as background noise and emotion, so speech recognition relies on feature parameters. The basic idea of feature extraction is to transform the preprocessed signal so as to remove redundancy and retain the parameters that represent the essence of the speech; recognition is then performed on these feature parameters. Before feature extraction, the original speech signal sequence is preprocessed by an endpoint detection module through framing, windowing, pre-emphasis, Fourier transform and the like. Speech feature parameters include time-domain parameters, such as short-time average energy and pitch period, and frequency-domain parameters, such as the short-time spectrum and the first three formants. The most commonly used speech features are Mel-frequency cepstral coefficients (MFCCs), cepstral parameters extracted on the Mel frequency scale, which describes the nonlinear frequency response of the human ear; MFCCs are used here to extract the feature sequence of the speech signal.
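For illustration only, a minimal Python sketch of this standard MFCC pipeline is given below; it is not an implementation disclosed by this application, and the frame length, hop size, FFT size, filter count and cepstral order are all illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, rate=16000, frame_len=400, hop=160, n_fft=512,
         n_filt=26, n_ceps=13):
    """Sketch of MFCC extraction: pre-emphasis, framing, windowing,
    FFT power spectrum, triangular mel filterbank, log, DCT."""
    # Pre-emphasis boosts the high-frequency part of the spectrum.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing: speech is quasi-stationary over 10-30 ms windows.
    n = 1 + max(0, len(sig) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n)[:, None]
    frames = sig[idx] * np.hamming(frame_len)          # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale, which models
    # the nonlinear frequency response of the human ear.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = np.floor((n_fft + 1) * inv(np.linspace(0, mel(rate / 2),
                                                 n_filt + 2)) / rate).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        fbank[m - 1, pts[m - 1]:pts[m]] = np.linspace(0, 1, pts[m] - pts[m - 1],
                                                      endpoint=False)
        fbank[m - 1, pts[m]:pts[m + 1]] = np.linspace(1, 0, pts[m + 1] - pts[m],
                                                      endpoint=False)
    # Log filterbank energies, then DCT to decorrelate -> cepstral features.
    return dct(np.log(power @ fbank.T + 1e-10), axis=1, norm='ortho')[:, :n_ceps]
```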
The monophone search network is a dynamic search network whose primitives are all the monophones from which any word can be composed; it is used both to trigger the in-set word search network and to decode alongside it for recognition and rejection. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; one articulation action forms one phoneme. For example, "ba" contains two articulation actions, "b" and "a", i.e. two monophones. The acoustic realization (waveform) of a word, however, depends on many factors beyond the phoneme itself, such as phoneme context, speaker and speaking style. Taking context into account turns a monophone into a triphone or, more generally, a polyphone. The in-set word search network is a dynamic search network built from triphone primitives; since triphones already contain inter-phoneme context information, this network is used to search the speech signal feature sequence for the in-set command words. The feature sequences extracted from the speech signal are input into the monophone search network and the in-set word search network simultaneously for synchronous decoding, with a synchronization signal keeping search decoding and state propagation in the two networks in step. Because decoding proceeds over all monophones and all in-set words simultaneously, in-set words can be recognized and out-of-set words rejected effectively, ensuring recognition accuracy.
Step S103: obtaining the in-set word output state score produced by the synchronous decoding.
During synchronous decoding, the speech signal feature sequence propagates through the monophone search network and the in-set word search network simultaneously for search decoding. While decoding in the in-set word search network, after the state transition for each frame of the feature sequence is computed, the in-set word output state score of the whole word is computed. The in-set word output state score is the matching probability of the input speech signal feature sequence against the triphone primitives of the in-set word search network; it characterizes how well the feature sequence matches each primitive: the larger the value, the higher the degree of match, i.e. the more likely the feature sequence corresponds to that primitive. More specifically, the primitives of the in-set word search network may be hidden Markov models, and the in-set word output state score is then the joint probability of a hidden state sequence and the corresponding speech signal feature sequence, computed with the Viterbi algorithm. The Viterbi algorithm is a dynamic programming algorithm for finding the Viterbi path, the hidden state sequence most likely to have produced a sequence of observed events; it applies in particular to Markov sources and hidden Markov models. Computing the in-set word output state score thus gives the decoded matching situation of the speech signal feature sequence in the in-set word search network.
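As an illustrative sketch of this computation (the log-domain scores and array shapes are assumptions for clarity, not code from this application), the Viterbi joint probability of an HMM primitive and a feature sequence can be computed as follows:

```python
import numpy as np

def viterbi_score(log_emit, log_trans, log_init):
    """Best-path joint log-probability of a hidden state sequence and
    the observed frames.
    log_emit:  (T, S) array, log P(frame_t | state_s)
    log_trans: (S, S) array, log P(state_j | state_i)
    log_init:  (S,)   array, log P(initial state)"""
    delta = log_init + log_emit[0]        # best score ending in each state
    for t in range(1, len(log_emit)):
        # Dynamic programming: extend each state's best predecessor path.
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_emit[t]
    return delta.max()                    # output state score of this primitive
```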
Step S105: when the in-set word output state score satisfies a preset condition, obtaining the confidence of the synchronous decoding of the monophone search network and the in-set word search network.
When the in-set word output state score satisfies a preset condition, for example when it is greater than a preset threshold, the confidence of the synchronous decoding of the speech signal feature sequence in the monophone search network and the in-set word search network is obtained. The preset condition can be set according to user requirements such as recognition accuracy: when accuracy requirements are high, a higher threshold is set, and the confidence is obtained once the in-set word output state score exceeds it. The monophone search network is built from monophone primitives and the in-set word search network from triphone primitives. If the input speech features belong to an in-set word, they should match the in-set word search network closely, and because that network uses triphone modeling (which includes context information), its output state score will be higher than, or very close to, the transfer score obtained from the monophone search network. If the input feature sequence is an out-of-set word, the monophone search network, which contains all the monophones from which any word can be composed, still matches it well, but the in-set word search network, which contains only the triphones of the fixed in-set words, yields a poor output state score. The relation between these two scores is exactly the confidence to be measured. In this embodiment, the confidence of the synchronous decoding may specifically be the ratio of the transfer score of the in-set word search network to that of the monophone search network, which directly reflects the decoding results of the two networks and effectively rejects most out-of-set word interference.
Step S107: selecting a corresponding decoding path according to the confidence, and outputting a speech recognition result.
After the confidence of the monophone search network and the in-set word search network is computed, a corresponding decoding path is selected according to the confidence to decode the speech signal feature sequence, and the speech recognition result is output. Further, taking the continuity of the speech signal into account, the optimal decoding path, i.e. the path with the highest degree of match, may be selected according to the confidence and output as the recognition result. More specifically, after the confidence is obtained, the number of frames satisfying the confidence threshold is counted for each word, and the decoding path with the largest such frame count is taken as the optimal decoding path, whose decoded output is the recognition result.
In the speech recognition method above, the speech signal feature sequence is decoded and propagated synchronously through the monophone search network and the in-set word search network; when the in-set word output state score produced by the in-set word search network satisfies the preset condition, the confidence of the synchronous decoding of the two networks is obtained; finally, the speech recognition result is output along the decoding path corresponding to the confidence. Feeding the feature sequence into both networks for decoding and propagation effectively achieves in-set word recognition and out-of-set word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence then further improves speech recognition accuracy.
Further, fig. 2 is a schematic flowchart of the steps of inputting the speech signal feature sequence into the monophone search network and the in-set word search network respectively and decoding synchronously, according to an embodiment of the speech recognition method of the present application. As shown in fig. 2, step S101 specifically includes:
Step S111: inputting the current frame of the speech signal feature sequence into the monophone search network to obtain a first output state score.
The speech signal under test is processed by the endpoint detection module to obtain the per-frame speech signal feature sequence. When a segment of the feature sequence is input into the decoding model for decoding, the monophone search network is activated first by default, i.e. the current frame of the feature sequence is input into the monophone search network for propagation decoding, while the in-set word search network remains inactive by default. When the feature sequence enters the monophone search network, the first frame activates all the monophone models, i.e. the primitive models of the network, other than the silence phoneme. Each frame of the feature sequence then triggers a state transition of the monophone search network model, and a first output state score is computed. Like the in-set word output state score, the first output state score characterizes how well the speech signal feature sequence matches each monophone primitive in the monophone search network.
Step S113: when the first output state score is greater than a first preset threshold, inputting the next frame of the speech signal feature sequence into the monophone search network and the in-set word search network respectively for synchronous decoding.
While the speech signal feature sequence is decoded and propagated in the monophone search network, the maximum over the output state scores of all output states is computed and output at the same time. In practice, during the decoding and propagation of each frame in the monophone search network, the output state scores of the frame against all monophones are computed; the maximum of these scores is stored and output as the first output state score, which is compared with a first preset threshold. When the first output state score of the current frame exceeds the first preset threshold, the in-set word search network is activated, and the next frame of the feature sequence is input into both the monophone search network and the in-set word search network to keep decoding synchronously. The feature sequence is thus first decoded and recognized in the monophone search network, which can match any word, and only when the result satisfies the preset condition are subsequent frames decoded synchronously in both networks. Entering synchronous decoding at the right moment improves both the efficiency and the accuracy of speech recognition.
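The staged activation just described might be organized as in the sketch below; `monophone_net` and `inset_net` are hypothetical decoder objects with a per-frame `step` method returning the network's best output state score, and the threshold value is a placeholder, not a value taken from this application.

```python
def decode_utterance(frames, monophone_net, inset_net, first_threshold=-50.0):
    """Sketch: the monophone network always decodes; the in-set word
    network is activated once the first output state score exceeds the
    first preset threshold, after which both networks stay in step."""
    inset_active = False
    for frame in frames:
        first_score = monophone_net.step(frame)   # max over all output states
        if inset_active:
            inset_net.step(frame)                 # frame-synchronous decoding
        elif first_score > first_threshold:
            inset_active = True                   # next frame enters both networks
    return monophone_net, inset_net
```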
Further, fig. 3 is a flowchart of the steps of inputting the current frame of the speech signal feature sequence into the monophone search network to obtain the first output state score, according to an embodiment of the speech recognition method of the present application. As shown in fig. 3, step S111 in this embodiment includes:
Step S111a: inputting the current frame of the speech signal feature sequence into the monophone search network.
When a speech signal feature sequence is input into the decoding model for decoding, only the monophone search network is activated at first, while the in-set word search network remains inactive by default. At this stage the feature sequence undergoes propagation decoding in the monophone search network only.
Step S111b: obtaining the joint probabilities of the current frame of the speech signal feature sequence and the primitives of the monophone search network.
As the speech signal feature sequence undergoes propagation decoding in the monophone search network, the degree of match, i.e. the joint probability, of each frame with each primitive of the network is computed. The monophone search network is a dynamic search network composed of all monophones; it contains every phoneme from which any word can be composed, and since monophones carry no context information, the network can match any in-set or out-of-set word. Its primitives are the monophone models. In a specific implementation, each phoneme model, i.e. primitive, may be an HMM (hidden Markov model). An HMM is a statistical model built on the time-series structure of the speech signal, which is regarded as a mathematical double stochastic process: a Markov chain with a finite number of states models the hidden changes in the statistical properties of the speech signal (the internal states of the Markov model are not externally visible), while an externally visible observation sequence (usually the acoustic features computed from each frame) is associated with each state of the chain. The HMM of each phoneme model has states for the phoneme onset stage, the phoneme stable stage and the phoneme ending stage, and the speech recognition process is a process of transitions between the states of the phoneme models. Further, the joint probability of the HMM hidden state sequence and the corresponding current frame of the speech signal feature sequence is computed with the Viterbi algorithm; it characterizes the degree of match between the current frame and each primitive of the monophone search network, i.e. the likelihood that the current frame is a given monophone.
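For concreteness, a three-state left-to-right phoneme HMM of the kind described (onset, stable and ending stages) might be set up as below; the transition values and the diagonal-Gaussian emission model are illustrative assumptions, not parameters disclosed by this application.

```python
import numpy as np

class PhonemeHMM:
    """Three states: phoneme onset, stable stage, ending stage.
    Left-to-right topology: a state either loops or advances."""
    def __init__(self, means, variances):           # each of shape (3, D)
        self.means, self.variances = means, variances
        trans = np.array([[0.6, 0.4, 0.0],          # onset  -> onset/stable
                          [0.0, 0.6, 0.4],          # stable -> stable/end
                          [0.0, 0.0, 1.0]])         # end    -> end (exit via decoder)
        with np.errstate(divide="ignore"):          # log 0 -> -inf for forbidden jumps
            self.log_trans = np.log(trans)

    def log_emit(self, frame):
        """Diagonal-Gaussian log-likelihood of one feature frame per state."""
        d = (frame - self.means) ** 2 / self.variances
        return -0.5 * (d.sum(axis=1) + np.log(2 * np.pi * self.variances).sum(axis=1))
```

These per-state log-likelihoods and log transitions correspond to the `log_emit` and `log_trans` inputs assumed in the Viterbi sketch above.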
Step S111c: taking the maximum of the joint probabilities as the first output state score.
After the degree of match, i.e. the joint probability, of each frame with each primitive of the monophone search network is obtained, the maximum of the joint probabilities is taken and output as the first output state score. Each frame of the feature sequence is matched against all monophones in the network, yielding multiple joint probabilities, and the matching path with the maximum joint probability is selected for propagation decoding. The first output state score characterizes the joint probability of the entire decoding path from the start of recognition up to the current state. At each propagation step the feature sequence is matched against all primitives to compute joint probabilities, but only the state match with the maximum joint probability is retained; once all frames have propagated, the first output state score of the path with the maximum joint probability is obtained. This score is then compared with the first preset threshold to decide whether the next frame enters synchronous decoding; entering synchronous decoding at the right moment improves both the efficiency and the accuracy of speech recognition.
Further, step S105 may include:
Step one: when the in-set word output state score satisfies the preset condition, obtaining a first transfer score from the synchronous decoding of the monophone search network and a second transfer score from the synchronous decoding of the in-set word search network.
When the decoding result of the speech signal feature sequence in the monophone search network satisfies the preset activation condition of the in-set word search network, the in-set word search network is activated for synchronous decoding, and the next frame of the feature sequence is input into both networks simultaneously. While searching and decoding in the in-set word search network, after the state transition of each frame is computed, the output state score of the whole word is checked. Like the first output state score of the monophone search network, the output state score in the in-set word search network characterizes the match between the speech signal feature sequence and that network; it can be obtained by computing the joint probability of the feature sequence and the primitives of the in-set word search network. The in-set word search network is a dynamic search network composed of triphones, whose primitives contain inter-phoneme context information. Because the triphone primitives contain this context information, feature sequences belonging to in-set words score higher in the in-set word search network than in the monophone search network. Conversely, when a feature sequence belonging to an out-of-set word is input, its output state score in the in-set word search network remains low relative to the monophone search network even if that network has been activated, while whether or not the sequence is an in-set word has little effect on the monophone search network. Hence, for out-of-set input the monophone search network obtains a higher output state score and transfer score than the in-set word search network, and for in-set input it obtains lower ones. In theory, only when the input speech belongs to an in-set word can the transfer score and output state score of the in-set word search network approach or even exceed those of the monophone search network. When the output state score of the in-set word search network satisfies the preset condition, e.g. is greater than a preset threshold, the second transfer score, covering the interval from entering the in-set word search network to leaving it, is computed, followed by the first transfer score of the monophone search network over the same interval. Further, when the in-set word output state score exceeds the set threshold, the score and the historical propagation information of the output state are recorded; from the historical propagation information the activation information of the propagation path can be found, and thus the whole-word start frame and the transfer score can be obtained.
In addition, the first output state score of the monophone search network for the current frame is computed, and together with the information of the active in-set word search network, the transfer score of the monophone search network over the word-search interval is computed. The transfer score may be the score recorded during token passing: it records the output state score over a specific interval of the complete decoding path, namely from entering the decoding model to leaving it, and directly reflects the degree of match over that interval of the propagation decoding.
Step two: obtaining the confidence from the first transfer score and the second transfer score.
After the transfer scores of the two networks for synchronously decoding speech signal feature sequences of the same frame length are obtained, the first transfer score of the monophone search network is used as the reference score, and the second transfer score of the in-set word search network is considered against it to obtain the confidence. Further, the confidence may be defined as the ratio of the second transfer score to the first transfer score, which characterizes the credibility of the path match of the input speech signal feature sequence.
Further, step S105 may also include:
Step one: when the in-set word output state score is greater than a second preset threshold, obtaining the first transfer score and the second transfer score respectively through a token passing algorithm.
The obtained in-set word output state score is compared with a second preset threshold; when it is greater, the first transfer score of the monophone search network and the second transfer score of the in-set word search network are obtained through a token passing algorithm. Specifically, before the speech signal feature sequence enters a word network to be recognized (comprising the monophone search network and the in-set word search network), a token is generated that records backtracking information and the passing score. When an output state of the word network is reached, the output state score is computed, and the token is backtracked to the entry point; subtracting the entry-point score from the output state score yields the transfer score over the word network interval. The transfer score thus records the output state score over a specific interval of the complete path, namely from entering the word network to leaving it. Unlike the in-set word search network, the monophone search network only passes the initialization token and produces no new tokens, but the transfer score over the same interval can be obtained from the token backtracking information of the in-set word search network: since the monophone search network retains only one optimal path during search, its first transfer score over the same period is the current maximum output state score of the monophone search network minus the score at the point to which the in-set word search network's token backtracks. In this way the first transfer score of the monophone search network and the second transfer score of the in-set word search network for synchronously decoding the input feature sequence are obtained via token passing.
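The token bookkeeping described here might be sketched as follows; the `Token` fields and the function signature are hypothetical, chosen only to mirror the backtracking arithmetic in the text.

```python
from dataclasses import dataclass

@dataclass
class Token:
    entry_frame: int     # frame at which the token entered the word network
    entry_score: float   # cumulative path score recorded at the entry point
    score: float         # current cumulative path score (backtracking info)

def transfer_scores(token, inset_output_score, monophone_best_score):
    """Second transfer score: the in-set word network's output state score
    minus the score the token carried in at its entry point.  First transfer
    score over the same interval: since the monophone network keeps a single
    optimal path, its current maximum output state score minus the same
    backtracked entry-point score covers the identical frame span."""
    second = inset_output_score - token.entry_score
    first = monophone_best_score - token.entry_score
    confidence = second / first      # ratio used as the matching confidence
    return first, second, confidence
```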
Step two: taking the ratio of the second transfer score to the first transfer score as the confidence.
After the first transfer score and the second transfer score are obtained, the first transfer score of the monophone search network is taken as the reference score, and the ratio of the second transfer score to the first transfer score is defined as the confidence, characterizing the credibility of the path matches of the input feature sequence in the monophone search network and the in-set word search network. Under this definition a smaller confidence value indicates a more credible match; conversely, if the ratio of the first transfer score to the second transfer score were defined as the confidence, a larger value would indicate a more credible match. Defining the confidence as the ratio of the transfer scores of the two networks and selecting the recognition result by this confidence rejects most out-of-set word interference and ensures recognition accuracy.
Further, step S107 may include:
Step one: obtaining the number of frames of the speech signal feature sequence whose confidence satisfies the confidence threshold condition.
The traditional approach of determining the recognition result by directly comparing the best confidences of different words is not ideal. In fact the speech signal is continuous, so when a speech feature sequence matches a search network, not just one frame but several frames can obtain a high score. Taking this continuity into account, after the confidence is obtained, the number of frames of the feature sequence whose confidence satisfies the confidence threshold condition is counted. Specifically, the number of qualifying frames differs between decoding paths, and the count is taken per path. Each in-set word search network is independent, with its own phones and phone count, so a separate confidence threshold is set for each in-set word. With the confidence defined as the ratio of the in-set word search network transfer score to the monophone search network transfer score, a word whose confidence is above its threshold is treated as an out-of-set pronunciation and rejected, while a word whose confidence is below the threshold is retained for the comprehensive decision.
Step two: outputting the speech recognition result from the decoding path with the largest frame count.
After the number of frames satisfying the confidence threshold condition in each decoding path is obtained, recognition decoding proceeds along the path with the largest qualifying frame count, and the speech recognition result is output. In this embodiment the confidence computed for a word depends directly on the number of phonemes it contains, so the best confidences of different words differ, and directly comparing confidences between words does not determine the recognition result well. Given these differences and the continuity of the speech signal, the number of outputs of each word that satisfy its confidence threshold is obtained through decoding statistics, and the word with the largest count is taken as the recognition result. Comprehensively considering the continuity of the speech signal, counting per word the output frames that satisfy the confidence threshold, and selecting the maximum count as the recognition result effectively improves speech recognition accuracy.
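The comprehensive decision could then be sketched as below, assuming per-word thresholds and the convention of this embodiment that a smaller confidence ratio indicates a more credible in-set match; the data shapes are illustrative.

```python
from collections import defaultdict

def pick_result(frame_confidences, thresholds):
    """frame_confidences: iterable of (word, confidence) pairs, one per frame.
    Count, for each in-set word, the frames whose confidence passes that
    word's own threshold; output the word (decoding path) with the most
    qualifying frames, or None if every frame was rejected as out-of-set."""
    counts = defaultdict(int)
    for word, conf in frame_confidences:
        if conf <= thresholds[word]:     # below threshold: credible in-set frame
            counts[word] += 1
    return max(counts, key=counts.get) if counts else None
```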
Further, fig. 4 is a schematic flowchart of the steps preceding step S101 in an embodiment. As shown in fig. 4, before step S101 the method includes:
Step S101a: acquiring a speech signal.
The speech signal may be acquired by a voice acquisition system. Specifically, speech can be picked up by a sound pickup such as a microphone and then processed by an amplifier and a filter.
Step S101b: performing endpoint detection on the acquired speech signal to obtain the speech signal feature sequence.
The original speech signal acquired directly by the voice acquisition system contains much unimportant information and background noise, so the original speech signal sequence must first undergo a series of preprocessing in the endpoint detection module: endpoint detection (determining the beginning and end of the speech signal), pre-filtering (removing the influence of individual pronunciation differences, background noise and the like), framing (the speech signal can be regarded as short-time stationary within 10-30 ms, so it is divided into segments for analysis), windowing (so that stationary-process analysis methods can be applied), pre-emphasis (boosting the high-frequency part) and Fourier transform (conversion to the frequency domain for convenient processing). The preprocessed speech signal yields the speech signal feature sequence that is input into the decoding model for propagation decoding.
Further, fig. 5 is a schematic flowchart of step S101b in an embodiment. As shown in fig. 5, the speech signal is input and first processed by framing, windowing and pre-emphasis, then subjected to a fast Fourier transform (FFT), after which the output of a triangular-window filter undergoes noise power estimation and smoothed power estimation respectively. For the noise power estimate, the signal-to-noise ratio (SNR) is computed and compared with a threshold; if it does not exceed the threshold, the flow returns to the speech signal acquisition step. For the smoothed power estimate, the difference between the actual power and the estimate is computed; if the difference is smaller than its threshold, the flow likewise returns to the speech signal acquisition step. When the SNR from the noise power estimation exceeds its threshold and the difference between the actual power and the smoothed power estimate is not smaller than its threshold, the signal is given a discrete cosine transform (DCT) and the speech signal feature sequence is output.
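A per-frame sketch of the two gates in this flow is given below; the smoothing constant and both thresholds are illustrative placeholders, and the exact update rules of the application's estimators are not specified here.

```python
def endpoint_gate(frame_power, noise_est, smooth_est,
                  snr_threshold=3.0, power_margin=2.0, alpha=0.95):
    """One frame of the gating: compute the SNR against the running noise
    power estimate and the difference of the actual power against the
    smoothed power estimate; only frames passing both checks go on to
    the DCT that produces the feature vector."""
    snr = frame_power / max(noise_est, 1e-10)
    smooth_est = alpha * smooth_est + (1 - alpha) * frame_power
    is_speech = snr > snr_threshold and (frame_power - smooth_est) >= power_margin
    if not is_speech:
        # Track the noise floor on frames judged to be non-speech.
        noise_est = alpha * noise_est + (1 - alpha) * frame_power
    return is_speech, noise_est, smooth_est
```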
In addition, the present application also provides a speech recognition system. Fig. 6 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application. As shown in fig. 6, the speech recognition system includes:
a synchronous decoding module 100, configured to input the speech signal feature sequence into a monophone search network and an in-set word search network respectively and decode synchronously;
a state score obtaining module 300, configured to obtain the in-set word output state score produced by the synchronous decoding;
a confidence obtaining module 500, configured to obtain the confidence of the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output state score satisfies a preset condition;
and a speech recognition output module 700, configured to select a corresponding decoding path according to the confidence and output a speech recognition result.
In the speech recognition system above, the synchronous decoding module decodes and propagates the speech signal feature sequence synchronously through the monophone search network and the in-set word search network; when the in-set word output state score obtained by the state score obtaining module satisfies the preset condition, the confidence obtaining module obtains the confidence of the synchronous decoding of the two networks; finally, the speech recognition output module outputs the speech recognition result along the decoding path corresponding to the confidence. Feeding the feature sequence into both networks for decoding and propagation effectively achieves in-set word recognition and out-of-set word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence then further improves speech recognition accuracy.
Further, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any of the speech recognition methods of the above embodiments.
When the processor of the computer device executes the program, it carries out any of the speech recognition methods above: the speech signal feature sequence is decoded and propagated synchronously through the monophone search network and the in-set word search network; when the in-set word output state score produced by the in-set word search network satisfies the preset condition, the confidence of the synchronous decoding of the two networks is obtained; and the speech recognition result is output along the decoding path corresponding to the confidence. This effectively achieves in-set word recognition and out-of-set word rejection, ensures recognition accuracy, and further improves it by selecting the decoding path according to the confidence.
In addition, those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium; in the embodiments of the present application, it can be stored in the storage medium of the computer system and executed by at least one processor in the computer system to implement the processes of the embodiments of the speech recognition methods described above.
Further, a storage medium is also provided, on which a computer program is stored, wherein the program, when executed by a processor, implements any of the speech recognition methods of the above embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any combination that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and while their description is specific and detailed, they are not to be construed as limiting the scope of the invention. A person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within the protection scope of the present application. The protection scope of this patent is therefore subject to the appended claims.

Claims (10)

1. A speech recognition method, comprising:
inputting a speech signal feature sequence into a monophone search network and an in-set word search network respectively, and decoding synchronously;
acquiring an in-set word output state score obtained by the synchronous decoding;
when the in-set word output state score satisfies a preset condition, obtaining a confidence score for the synchronous decoding of the monophone search network and the in-set word search network; and
selecting a corresponding decoding path according to the confidence score, and outputting a speech recognition result;
wherein the step of inputting the speech signal feature sequence into the monophone search network and the in-set word search network respectively and decoding synchronously comprises:
inputting a current-frame speech signal feature sequence into the monophone search network to obtain a first output state score; and
when the first output state score is greater than a first preset threshold, inputting a next-frame speech signal feature sequence into the monophone search network and the in-set word search network respectively for synchronous decoding.
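For illustration, a minimal sketch of this gating step, assuming hypothetical mono_step and inset_step callables that each consume one feature frame and return that network's current output state score; the threshold value is likewise an assumption.

```python
# Hedged sketch of the frame gating in claim 1: the in-set word search network
# only joins decoding once the monophone network's first output state score
# clears the first preset threshold.

FIRST_THRESHOLD = -50.0          # assumed log-domain value, not from the patent

def gated_decode(frames, mono_step, inset_step):
    inset_active = False
    for frame in frames:
        first_score = mono_step(frame)   # monophone network sees every frame
        if inset_active:
            inset_step(frame)            # subsequent frames decoded synchronously
        # open the gate from the next frame once the score clears the threshold
        if not inset_active and first_score > FIRST_THRESHOLD:
            inset_active = True
```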
2. The speech recognition method of claim 1, wherein the step of inputting the current-frame speech signal feature sequence into the monophone search network to obtain the first output state score comprises:
inputting the current-frame speech signal feature sequence into the monophone search network;
acquiring joint probabilities of the current-frame speech signal feature sequence and the monophone search network primitives; and
taking the maximum of the joint probabilities as the first output state score.
3. The speech recognition method of claim 2, wherein the monophone search network primitive comprises a hidden Markov model, and the step of acquiring the joint probabilities of the current-frame speech signal feature sequence and the monophone search network primitives comprises:
calculating, by a Viterbi algorithm, the joint probability of each hidden Markov model and the corresponding current-frame speech signal feature sequence.
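For illustration, a minimal Viterbi sketch of the scoring in claims 2 and 3, working in the log domain and taking the maximum over primitives; the HMM parameters and shapes are assumptions, as the patent does not fix a topology here.

```python
import numpy as np

def viterbi_log_score(log_trans, log_emit):
    """Best-path joint log probability of one HMM primitive.
    log_trans: (S, S) log transition matrix; log_emit: (T, S) per-frame
    log emission scores for the current feature frames."""
    T, S = log_emit.shape
    delta = log_emit[0].copy()                       # initialize with first frame
    for t in range(1, T):
        # best predecessor for each state, then add this frame's emission
        delta = log_emit[t] + np.max(delta[:, None] + log_trans, axis=0)
    return float(delta.max())

def first_output_state_score(primitives, log_emits):
    """Maximum joint probability over all monophone primitives (claim 2)."""
    return max(viterbi_log_score(A, E) for A, E in zip(primitives, log_emits))
```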
4. The speech recognition method of claim 1, wherein the step of obtaining the confidence score for the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output state score satisfies the preset condition comprises:
when the in-set word output state score satisfies the preset condition, acquiring a first transfer score of the synchronous decoding in the monophone search network and a second transfer score of the synchronous decoding in the in-set word search network; and
obtaining the confidence score according to the first transfer score and the second transfer score.
5. The speech recognition method of claim 4, wherein the step of obtaining the confidence score for the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output state score satisfies the preset condition comprises:
when the in-set word output state score is greater than a second preset threshold, obtaining the first transfer score and the second transfer score respectively through a token-passing algorithm; and
taking the ratio of the second transfer score to the first transfer score as the confidence score.
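For illustration, a minimal sketch of the confidence computation in claims 4 and 5; the Token class and the threshold value are assumed stand-ins for a token-passing decoder's bookkeeping, not the patented data structures.

```python
from dataclasses import dataclass

@dataclass
class Token:
    score: float       # accumulated transfer (path) score carried by the token
    path: tuple = ()   # word/state sequence the token has passed through

SECOND_THRESHOLD = -30.0   # assumed value for the second preset threshold

def confidence(inset_output_score, best_mono: Token, best_inset: Token):
    """Second-to-first transfer score ratio, or None if the gate fails."""
    if inset_output_score <= SECOND_THRESHOLD:
        return None
    return best_inset.score / best_mono.score   # second score over first score
```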
6. The speech recognition method of claim 1, wherein the step of selecting the corresponding decoding path according to the confidence score and outputting the speech recognition result comprises:
acquiring the frame counts of the speech signal feature sequences corresponding to confidence scores that satisfy a confidence threshold condition; and
outputting the speech recognition result according to the decoding path corresponding to the largest frame count.
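For illustration, a minimal sketch of this selection rule, assuming each accepted confidence is recorded together with the frame count at which it was produced.

```python
# Hedged sketch of claim 6: among candidates whose confidence passes the
# threshold, output the decoding path recorded at the largest frame count.

def select_result(candidates, conf_threshold):
    """candidates: iterable of (frame_count, confidence, decoding_path)."""
    passed = [c for c in candidates if c[1] >= conf_threshold]
    if not passed:
        return None                               # nothing met the condition
    return max(passed, key=lambda c: c[0])[2]     # path at largest frame count
```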
7. The speech recognition method of claim 1, wherein, before the step of inputting the speech signal feature sequence into the monophone search network and the in-set word search network respectively and decoding synchronously, the method further comprises:
acquiring a speech signal; and
performing endpoint detection on the acquired speech signal to obtain the speech signal feature sequence.
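For illustration, a crude energy-based endpoint detector; the frame sizes and threshold are assumptions, and the patent does not fix a particular detector or feature front end here.

```python
import numpy as np

def endpoint_detect(signal, frame_len=400, hop=160, energy_thresh=1e-4):
    """Return (start, end) sample indices of the voiced region of a mono
    PCM signal (NumPy float array), or None if no speech is found."""
    energies = [float(np.mean(signal[i:i + frame_len] ** 2))
                for i in range(0, len(signal) - frame_len + 1, hop)]
    voiced = [i for i, e in enumerate(energies) if e > energy_thresh]
    if not voiced:
        return None
    return voiced[0] * hop, voiced[-1] * hop + frame_len
```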
8. A speech recognition system, comprising:
a synchronous decoding module, configured to input a speech signal feature sequence into a monophone search network and an in-set word search network respectively and decode synchronously;
a state score acquisition module, configured to acquire an in-set word output state score obtained by the synchronous decoding;
a confidence acquisition module, configured to obtain a confidence score for the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output state score satisfies a preset condition; and
a speech recognition output module, configured to select a corresponding decoding path according to the confidence score and output a speech recognition result;
wherein inputting the speech signal feature sequence into the monophone search network and the in-set word search network respectively and decoding synchronously comprises:
inputting a current-frame speech signal feature sequence into the monophone search network to obtain a first output state score; and
when the first output state score is greater than a first preset threshold, inputting a next-frame speech signal feature sequence into the monophone search network and the in-set word search network respectively for synchronous decoding.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the speech recognition method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 7.
CN201711031665.9A 2017-10-27 2017-10-27 Speech recognition method, system, computer device and computer-readable storage medium Active CN107871499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711031665.9A CN107871499B (en) 2017-10-27 2017-10-27 Speech recognition method, system, computer device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711031665.9A CN107871499B (en) 2017-10-27 2017-10-27 Speech recognition method, system, computer device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN107871499A (en) 2018-04-03
CN107871499B (en) 2020-06-16

Family

ID=61753362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711031665.9A Active CN107871499B (en) 2017-10-27 2017-10-27 Speech recognition method, system, computer device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN107871499B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615526B (en) 2018-05-08 2020-07-07 腾讯科技(深圳)有限公司 Method, device, terminal and storage medium for detecting keywords in voice signal
CN109273007B (en) * 2018-10-11 2022-05-17 西安讯飞超脑信息科技有限公司 Voice wake-up method and device
CN111862943B (en) * 2019-04-30 2023-07-25 北京地平线机器人技术研发有限公司 Speech recognition method and device, electronic equipment and storage medium
CN110111775B (en) * 2019-05-17 2021-06-22 腾讯科技(深圳)有限公司 Streaming voice recognition method, device, equipment and storage medium
CN112331219B (en) * 2020-11-05 2024-05-03 北京晴数智慧科技有限公司 Voice processing method and device
CN112652306B (en) * 2020-12-29 2023-10-03 珠海市杰理科技股份有限公司 Voice wakeup method, voice wakeup device, computer equipment and storage medium
CN114783438B (en) * 2022-06-17 2022-09-27 深圳市友杰智新科技有限公司 Adaptive decoding method, apparatus, computer device and storage medium
CN115831100B (en) * 2023-02-22 2023-05-05 深圳市友杰智新科技有限公司 Voice command word recognition method, device, equipment and storage medium


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8543399B2 (en) * 2005-12-14 2013-09-24 Samsung Electronics Co., Ltd. Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model
CN101604520A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 Spoken language voice recognition method based on statistical model and syntax rule
CN101763855A (en) * 2009-11-20 2010-06-30 安徽科大讯飞信息科技股份有限公司 Method and device for judging confidence of speech recognition
CN105321518A (en) * 2014-08-05 2016-02-10 中国科学院声学研究所 Rejection method for low-resource embedded speech recognition
CN105161096A (en) * 2015-09-22 2015-12-16 百度在线网络技术(北京)有限公司 Speech recognition processing method and device based on garbage models
CN106683677A (en) * 2015-11-06 2017-05-17 阿里巴巴集团控股有限公司 Method and device for recognizing voice
CN106782513A (en) * 2017-01-25 2017-05-31 上海交通大学 Speech recognition realization method and system based on confidence level

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fei Long. "Research on Keyword Detection Technology for Mongolian Speech." China Doctoral Dissertations Full-text Database, Information Science and Technology, 2013, No. 11. *

Also Published As

Publication number Publication date
CN107871499A (en) 2018-04-03

Similar Documents

Publication Publication Date Title
CN107871499B (en) Speech recognition method, system, computer device and computer-readable storage medium
KR102134201B1 (en) Method, apparatus, and storage medium for constructing speech decoding network in numeric speech recognition
US10388279B2 (en) Voice interaction apparatus and voice interaction method
US9536525B2 (en) Speaker indexing device and speaker indexing method
JP6303971B2 (en) Speaker change detection device, speaker change detection method, and computer program for speaker change detection
JP4195428B2 (en) Speech recognition using multiple speech features
US10573307B2 (en) Voice interaction apparatus and voice interaction method
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
Mitra et al. Articulatory features from deep neural networks and their role in speech recognition
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
CN105206271A (en) Intelligent equipment voice wake-up method and system for realizing method
WO2014153800A1 (en) Voice recognition system
CN109036381A (en) Method of speech processing and device, computer installation and readable storage medium storing program for executing
CN110570853A (en) Intention recognition method and device based on voice data
KR20010102549A (en) Speaker recognition
Müller et al. Contextual invariant-integration features for improved speaker-independent speech recognition
Tsenov et al. Speech recognition using neural networks
CN111145763A (en) GRU-based voice recognition method and system in audio
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
CN109065026B (en) Recording control method and device
US11769491B1 (en) Performing utterance detection using convolution
Naithani et al. English language speech recognition using mfcc and hmm
KR100776729B1 (en) Speaker-independent variable-word keyword spotting system including garbage modeling unit using decision tree-based state clustering and method thereof
CN111833869B (en) Voice interaction method and system applied to urban brain
CN113724697A (en) Model generation method, emotion recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 519000 No. 333, Kexing Road, Xiangzhou District, Zhuhai City, Guangdong Province

Patentee after: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

Address before: Floor 1-107, building 904, ShiJiHua Road, Zhuhai City, Guangdong Province

Patentee before: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.