CN112071310B - Speech recognition method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112071310B
CN112071310B
Authority
CN
China
Prior art keywords
decoding
voice
recognized
path
acoustic
Prior art date
Legal status
Active
Application number
CN201910502583.0A
Other languages
Chinese (zh)
Other versions
CN112071310A (en)
Inventor
王振兴
潘复平
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201910502583.0A
Publication of CN112071310A
Application granted
Publication of CN112071310B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/083 - Recognition networks
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/32 - Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the present disclosure disclose a speech recognition method and apparatus, an electronic device, and a storage medium. The speech recognition method comprises the following steps: decoding speech to be recognized to obtain first decoding paths of the speech to be recognized; decoding at least one of the first decoding paths while the speech to be recognized is being decoded; after decoding of the speech to be recognized is completed, decoding the first decoding paths that have not yet been decoded; and determining a speech recognition result of the speech to be recognized according to the decoding results of the first decoding paths. The embodiments of the present disclosure reduce the user's waiting time after inputting the last packet of data, shorten the time required by the whole speech recognition process, and improve the user experience.

Description

Speech recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to speech recognition technology, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
With the development of the mobile internet, speech recognition is becoming increasingly important and serves as the basis for many other applications. For example, applications such as voice dialing and voice navigation may be implemented through speech recognition techniques. The more accurate the speech recognition result, the better the effect of applications based on speech recognition.
In existing large-vocabulary real-time speech recognition systems, to meet real-time requirements the decoder generally performs first-pass decoding with a smaller, less accurate language model, and then performs second-pass decoding on the first-pass result with a larger, more accurate language model, that is, rescoring the candidate decoding network (lattice) of candidate paths generated by first-pass decoding, so as to improve recognition accuracy.
Disclosure of Invention
The present disclosure has been made in order to solve the above technical problems. The embodiment of the disclosure provides a voice recognition method and device, electronic equipment and a storage medium.
According to one aspect of the embodiments of the present disclosure, there is provided a speech recognition method, including:
decoding speech to be recognized to obtain first decoding paths of the speech to be recognized; decoding at least one of the first decoding paths while the speech to be recognized is being decoded;
after decoding of the speech to be recognized is completed, decoding the first decoding paths that have not yet been decoded;
and determining a speech recognition result of the speech to be recognized according to the decoding results of the first decoding paths.
According to another aspect of an embodiment of the present disclosure, there is provided a voice recognition apparatus including:
the first decoding module is used for decoding the voice to be recognized to obtain a first decoding path of the voice to be recognized;
The second decoding module is used for decoding at least one first decoding path obtained by the first decoding module in the process of decoding the voice to be recognized; after the voice to be recognized is decoded, decoding a first decoding path which is not decoded;
and the determining module is used for determining a voice recognition result of the voice to be recognized according to the decoding result of the first decoding path obtained by the second decoding module.
According to yet another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the speech recognition method according to any one of the above embodiments of the present disclosure.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic device including:
A processor;
A memory for storing the processor-executable instructions;
The processor is configured to read the executable instructions from the memory and execute the instructions to implement the voice recognition method according to any one of the foregoing embodiments.
Based on the speech recognition method and apparatus, electronic device, and storage medium provided by the above embodiments of the present disclosure, at least one of the first decoding paths obtained by first-pass decoding may be decoded (referred to as second-pass decoding) while the speech to be recognized is still being decoded (referred to as first-pass decoding); after first-pass decoding of the speech to be recognized is completed, second-pass decoding is performed on the first decoding paths that have not yet undergone second-pass decoding; the speech recognition result of the speech to be recognized is then determined according to the results of the second-pass decoding. Because the embodiments of the present disclosure begin second-pass decoding of at least one first decoding path while first-pass decoding is still in progress, rather than waiting until first-pass decoding is complete, part of the second-pass decoding is already finished by the time the user inputs the last packet of data. Compared with the prior art, this reduces the user's waiting time after the last packet of data is input, shortens the time required by the whole speech recognition process, and improves the user experience.
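For illustration, the following minimal Python sketch (not part of the patent; the first_pass and second_pass objects and the per-packet rescoring budget are assumed interfaces) shows how second-pass rescoring can be interleaved with first-pass decoding so that only a remainder is left after the last packet:

```python
from collections import deque

# Minimal sketch of interleaved two-pass decoding; `first_pass` and
# `second_pass` are assumed objects where first_pass.decode(packet)
# returns candidate word paths and second_pass.rescore(path) returns
# a score from the larger language model.
def recognize(packets, first_pass, second_pass, budget_per_packet=5):
    pending = deque()   # first-pass paths awaiting second-pass rescoring
    rescored = {}       # path (tuple of words) -> second-pass score

    for packet in packets:
        pending.extend(first_pass.decode(packet))  # first-pass decoding
        # Second pass starts early: while audio is still arriving,
        # rescore as many already-produced paths as time allows.
        for _ in range(min(budget_per_packet, len(pending))):
            path = pending.popleft()
            rescored[path] = second_pass.rescore(path)

    # After the last packet, only the leftover paths still need
    # rescoring, so the user's final wait is correspondingly shorter.
    while pending:
        path = pending.popleft()
        rescored[path] = second_pass.rescore(path)

    return max(rescored, key=rescored.get)  # best-scoring path
```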
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing embodiments thereof in more detail with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a diagram of a scenario to which the present disclosure applies.
Fig. 2 is a flow chart illustrating a speech recognition method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flow chart illustrating a voice recognition method according to another exemplary embodiment of the present disclosure.
Fig. 4 is a flow chart illustrating a voice recognition method according to still another exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a voice recognition method according to still another exemplary embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of a voice recognition apparatus according to an exemplary embodiment of the present disclosure.
Fig. 7 is a schematic structural view of a voice recognition apparatus provided in another exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in this disclosure merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may indicate: A exists alone, A and B both exist, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the present disclosure may be applicable to electronic devices such as terminal devices, computer systems, servers, etc., which may operate with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, server, or other electronic device include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments that include any of the foregoing, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Summary of the application
In carrying out the present disclosure, the inventors found through study that when a speech recognition system decodes, the time a user must wait after inputting the last packet of data comprises two parts: the time to perform first-pass decoding on the last packet of data, and the time to perform second-pass decoding on all of the first-pass decoding results. Because the language model used in second-pass decoding is usually large and time-consuming, the whole speech recognition process takes a long time and the user experience is poor.
In the embodiments of the present disclosure, second-pass decoding of at least one first decoding path generated by first-pass decoding begins while the speech to be recognized is still undergoing first-pass decoding; that is, second-pass decoding is started in advance. This reduces the user's waiting time after inputting the last packet of data, shortens the time required by the whole speech recognition process, and improves the user experience.
Exemplary System
Fig. 1 is a diagram of a scenario to which the present disclosure applies. As shown in fig. 1, an audio collection module (such as a microphone) collects an original audio signal, and the original audio signal, or the speech obtained from it by front-end signal processing, serves as the speech to be recognized; speech recognition is then performed according to the embodiments of the present disclosure to obtain a speech recognition result. Applications such as voice dialing and voice navigation can be implemented based on the speech recognition result. For example, when the speech recognition result is "please call XXX", the electronic device searches for the telephone number of the contact XXX through the voice call function module and initiates a call. With the embodiments of the present disclosure, the speech recognition process for the speech to be recognized is shortened, so the application flow based on the speech recognition result starts earlier; for example, the processing flow of applications such as voice dialing and voice navigation is advanced, and the application result (for example, a dialing result such as connected, busy, or hung up, or a navigation result such as the selected destination address and navigation route) is returned earlier, improving the user experience.
Exemplary method
Fig. 2 is a flow chart illustrating a speech recognition method according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device, as shown in fig. 2, and the voice recognition method includes the following steps:
In step 201, the speech to be recognized is decoded to obtain decoding paths of the speech to be recognized. In the embodiments of the present disclosure, a decoding path obtained by decoding the speech to be recognized is referred to as a first decoding path, and there may be one or more first decoding paths.
The speech to be recognized may be an original audio signal collected by an audio collection module (such as a microphone), or may be speech obtained from an original audio signal by front-end signal processing, which is not limited by the embodiments of the present disclosure. Front-end signal processing may include, for example, but is not limited to: voice activity detection (VAD), noise reduction, acoustic echo cancellation (AEC), dereverberation, sound source localization, beamforming (BF), etc.
Voice activity detection (VAD), also called voice endpoint detection or voice boundary detection, refers to detecting the presence of speech in an audio signal in a noisy environment and accurately locating the start of each speech segment. It is generally used in speech processing systems such as speech coding and speech enhancement to reduce the speech coding rate, save communication bandwidth, reduce the energy consumption of mobile devices, improve the recognition rate, and so on. The VAD start point is the transition from silence to speech, and the VAD end point is the transition from speech to silence; a stretch of silence is required to determine the end point. The speech obtained from the original audio signal by front-end signal processing covers the span from the VAD start point to the VAD end point, and therefore the speech to be recognized in the embodiments of the present disclosure may include a period of silence after the speech segment.
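As a rough illustration of endpoint detection, the following sketch uses a simple per-frame energy threshold; the threshold and silence-run values are arbitrary assumptions, and production detectors are far more robust:

```python
import numpy as np

# Minimal energy-based VAD sketch for illustration only.
def detect_endpoints(frames, energy_threshold=0.01, min_trailing_silence=30):
    """frames: iterable of 1-D numpy arrays (one audio frame each).
    Returns (start_frame, end_frame) of the detected speech segment."""
    start, end, silence_run = None, None, 0
    for i, frame in enumerate(frames):
        energy = float(np.mean(frame ** 2))
        if energy > energy_threshold:
            if start is None:
                start = i            # silence -> speech: VAD start point
            silence_run = 0
            end = i
        elif start is not None:
            silence_run += 1
            # speech -> silence: the end point is only confirmed after
            # a sufficiently long stretch of silence, as described above.
            if silence_run >= min_trailing_silence:
                break
    return start, end
```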
In one specific example, assume that the speech to be recognized is a greeting such as "hello Chinese people". The first decoding paths obtained in step 201 may then be, for example: hello-mid-country-person-name, hello-mid-country, hello-progenitor-country-person, hello-person, hello-progenitor-country-person-name, and the like, each hyphenated sequence being one candidate segmentation of the utterance into words.
Step 202: in the process of decoding the speech to be recognized, decode at least one of the first decoding paths obtained by the decoding.
Step 203: after decoding of the speech to be recognized is completed, decode the first decoding paths, among those obtained by decoding the speech to be recognized, that have not yet been decoded.
In the embodiment of the present disclosure, a decoding path obtained by decoding a first decoding path is referred to as a second decoding path.
Continuing the example from step 201: after the first decoding paths are decoded, the second decoding paths may be, for example, hello-China-people and hello-country-people.
Step 204, determining the voice recognition result of the voice to be recognized according to the decoding result of the first decoding path.
Continuing the examples in steps 201 and 203: according to the second decoding paths in step 203, the speech recognition result of the speech to be recognized is determined to be: hello-country-people.
Based on the speech recognition method provided by the embodiments of the present disclosure, at least one of the first decoding paths obtained by decoding can be decoded while the speech to be recognized is still being decoded; after first-pass decoding of the speech to be recognized is completed, the first decoding paths that have not yet undergone second-pass decoding are decoded, and the speech recognition result of the speech to be recognized is then determined from the results of decoding the first decoding paths. Because decoding of the obtained first decoding paths begins during decoding of the speech to be recognized, there is no need to wait until decoding of the speech to be recognized is finished; part of the first decoding paths have already been decoded by the time the user inputs the last packet of data. Compared with the prior art, this reduces the user's waiting time after the last packet of data is input, shortens the time required by the whole speech recognition process, and improves the user experience.
In some alternative embodiments, in step 201, when the speech to be recognized is decoded, the speech to be recognized may be decoded using the first language model.
In some alternative embodiments, in step 202 and step 203, the first decoding path may be decoded by using the second language model.
The first language model and/or the second language model may include, for example, but are not limited to: a rule-based language model, a statistical language model, or a neural network language model (NNLM), etc., which do not limit the embodiments of the present disclosure.
In some optional examples of the present disclosure, the first language model and/or the second language model is a statistical language model, also called an N-gram language model; the embodiments of the present disclosure do not limit the value of N, which may be any integer greater than 1. Optionally, a bigram or trigram statistical language model may be used to decode the speech to be recognized. Because such a statistical language model is small, decoding of the speech to be recognized is fast, large-vocabulary continuous speech recognition can be achieved while recognition accuracy is ensured, and the speech recognition effect is improved.
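A toy bigram model with add-one smoothing can illustrate the idea; this sketch is illustrative only and is far smaller than any production N-gram model:

```python
import math
from collections import defaultdict

# Minimal bigram (2-gram) language model sketch with add-one smoothing.
class BigramLM:
    def __init__(self, corpus):
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        for sentence in corpus:
            words = ["<s>"] + sentence + ["</s>"]
            for w in words:
                self.unigrams[w] += 1
            for a, b in zip(words, words[1:]):
                self.bigrams[(a, b)] += 1
        self.vocab_size = len(self.unigrams)

    def log_prob(self, prev, word):
        # P(word | prev) with add-one smoothing.
        num = self.bigrams[(prev, word)] + 1
        den = self.unigrams[prev] + self.vocab_size
        return math.log(num / den)

    def score(self, words):
        seq = ["<s>"] + words + ["</s>"]
        return sum(self.log_prob(a, b) for a, b in zip(seq, seq[1:]))

lm = BigramLM([["hello", "china", "people"], ["hello", "country", "people"]])
print(lm.score(["hello", "china", "people"]))
```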
In some alternative embodiments, the second language model is a language model whose speech recognition effect is better than that of the first language model. The better effect of the second language model can be realized, for example, by using a larger network model and/or training on a corpus that better matches human language habits.
Fig. 3 is a flow chart illustrating a voice recognition method according to another exemplary embodiment of the present disclosure. As shown in fig. 3, on the basis of the embodiment shown in fig. 2, before decoding the speech to be recognized by using the first language model, the method may further include the following steps:
step 301, sequentially reading a voice frame from the voice to be recognized, and extracting acoustic features of the read voice frame to obtain voice feature information of the read voice frame.
The speech feature information, also called acoustic feature information, may include, but is not limited to, any of the following: linear predictive coding (LPC) features, Mel-frequency cepstral coefficients (MFCC), Mel-scale filter bank (FBank) features, and the like. The speech feature information may be represented as a feature vector or a feature map, which is not limited by the embodiments of the present disclosure.
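As an illustration of per-frame MFCC extraction, the following sketch uses the librosa library (assumed to be installed; the file name and frame parameters are arbitrary assumptions):

```python
import librosa  # assumes the librosa audio library is installed

# Illustrative only: load a waveform and extract per-frame MFCC
# features, one 13-dimensional vector per ~25 ms analysis frame.
waveform, sample_rate = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=waveform,
    sr=sample_rate,
    n_mfcc=13,        # number of cepstral coefficients per frame
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms frame shift
)
print(mfcc.shape)  # (13, number_of_frames)
```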
Step 302: recognize the above speech feature information using an acoustic model to obtain an acoustic recognition result for the read speech frame, where the acoustic recognition result may include, but is not limited to: at least one word and an acoustic score for each word of the at least one word.
The acoustic model may include, for example, but is not limited to: a Gaussian mixture model-hidden Markov model (GMM-HMM), a recurrent neural network (RNN), a feedforward sequential memory network (FSMN), etc., which is not limited by the embodiments of the present disclosure.
This embodiment realizes acoustic recognition of the speech to be recognized, so that the speech can later be decoded more effectively in combination with the acoustic recognition result.
Based on the embodiment shown in fig. 3, in some alternative implementations, decoding the speech to be recognized using the first language model may be implemented as follows: score each word in the acoustic recognition result of the read speech frame with the first language model to obtain a first language score for each word in that acoustic recognition result. The first decoding paths are obtained based on the acoustic recognition result of the read speech frame and the acoustic recognition results of the historical speech frames, where the historical speech frames are the speech frames whose timing in the speech to be recognized precedes the read speech frame.
This embodiment provides an implementation of decoding the speech to be recognized using the first language model, in which the first decoding paths of the speech to be recognized are obtained in combination with the acoustic recognition results.
Optionally, in other alternative embodiments, after the first language score of each word in the acoustic recognition result of the read speech frame is obtained, the first decoding paths of the words in the acoustic recognition result of the read speech frame may further be sorted and deduplicated.
Sorting and deduplicating the first decoding paths after the first language scores are obtained effectively removes repeated paths from the first decoding paths, which reduces memory usage, avoids redundant rescoring computation, increases the speed and efficiency of rescoring, and thus improves overall speech recognition efficiency.
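A minimal sketch of the sort-and-deduplicate step might look as follows; the (word_sequence, score) representation is an assumption introduced for illustration:

```python
# Illustrative sketch of sorting and deduplicating partial decoding paths.
def sort_and_dedup(paths):
    """paths: list of (word_sequence, score) tuples, where word_sequence
    is a tuple of words and score is the combined acoustic + first-pass
    language score (higher is better)."""
    best = {}
    for words, score in paths:
        # Keep only the best-scoring instance of each word sequence.
        if words not in best or score > best[words]:
            best[words] = score
    # Sort the surviving paths by score, best first.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

paths = [(("hello", "china"), -4.2), (("hello", "china"), -5.0),
         (("hello", "country"), -4.6)]
print(sort_and_dedup(paths))
```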
In addition, based on the embodiment shown in fig. 3, in some alternative embodiments, decoding a first decoding path using the second language model may be implemented as follows: rescore, in sequence, each word in the acoustic recognition result of each speech frame in the first decoding path with the second language model to obtain a second language score for each word, and store the second language scores together with the second decoding path obtained by decoding the first decoding path.
Rescoring each word in the acoustic recognition result of each speech frame in the first decoding path with the second language model makes the language score of each word more accurate, improving the accuracy of the decoding result of the speech to be recognized.
In a further alternative embodiment, when the second language model is used to rescore the words in the acoustic recognition results of the speech frames in a first decoding path, rescoring of that first decoding path is continued specifically once all of its words except the last one have already been rescored, so that only the final word still requires rescoring.
For example, in one specific example, assume the first decoding paths obtained by decoding the speech to be recognized with the first language model are: ① I-want-to listen to a song, ② I-want-to eat, ③ I-go-to eat, ④ I-want-to listen to a song. After the first decoding paths of the words in the acoustic recognition result of the read speech frame are sorted and deduplicated, three non-duplicate first decoding paths remain. If the prefix of one of these three paths (i.e., I-go) has already been rescored, rescoring of the corresponding first decoding path (I-go-to eat) can be continued directly.
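The following sketch illustrates such prefix reuse during rescoring; the second_lm.log_prob(history, word) interface and the cache layout are assumptions, not the patent's implementation:

```python
# Illustrative prefix-cached rescoring: once a path's prefix has been
# rescored, only its last word needs a new language-model lookup.
def rescore(path, second_lm, prefix_cache):
    """path: tuple of words. prefix_cache maps a word-tuple prefix to
    its accumulated second-pass language score."""
    prefix, last = path[:-1], path[-1]
    if prefix in prefix_cache:
        # The prefix was already rescored; reuse its score directly.
        base = prefix_cache[prefix]
    else:
        base = 0.0
        for i, w in enumerate(prefix):
            base += second_lm.log_prob(prefix[:i], w)
        prefix_cache[prefix] = base
    return base + second_lm.log_prob(prefix, last)
```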
Fig. 4 is a flow chart illustrating a voice recognition method according to still another exemplary embodiment of the present disclosure. As shown in fig. 4, on the basis of the embodiment shown in fig. 2, step 203 may include the following steps:
Step 2031: generate a decoding network based on the first decoding paths obtained by decoding the speech to be recognized.
For example, a plurality of first decoding paths may be obtained based on the at least one word in the acoustic recognition result, the acoustic score of each word, and the first language score of each word; these first decoding paths form a word graph (lattice), i.e., the decoding network mentioned above, also referred to as a candidate decoding network.
When the speech to be recognized is decoded using the acoustic model and the first language model, the decoding network is generated dynamically; each first decoding path in the decoding network is associated with the acoustic scores and first language scores produced by the acoustic model and the first language model. Each arc along a first decoding path represents one word, together with the acoustic score and first language score for the occurrence of that word.
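For illustration, a word lattice can be sketched as states connected by word-labeled arcs carrying the two scores; this toy structure is an assumption, as real decoders use optimized WFST-style lattices:

```python
from dataclasses import dataclass, field

# Minimal word-lattice sketch for illustration only.
@dataclass
class Arc:
    word: str
    acoustic_score: float   # from the acoustic model
    lm_score: float         # first-pass language model score
    next_state: int

@dataclass
class Lattice:
    arcs: dict = field(default_factory=dict)  # state -> outgoing Arcs
    start: int = 0
    finals: set = field(default_factory=set)

    def add_arc(self, state, arc):
        self.arcs.setdefault(state, []).append(arc)

    def paths(self, state=None, prefix=()):
        """Enumerate all word sequences from start to a final state."""
        state = self.start if state is None else state
        if state in self.finals:
            yield prefix
        for arc in self.arcs.get(state, []):
            yield from self.paths(arc.next_state, prefix + (arc.word,))

lat = Lattice(finals={2})
lat.add_arc(0, Arc("hello", -1.0, -0.9, 1))
lat.add_arc(1, Arc("china", -1.5, -1.1, 2))
lat.add_arc(1, Arc("country", -1.2, -0.9, 2))
print(list(lat.paths()))  # [('hello', 'china'), ('hello', 'country')]
```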
Step 2032: decode the first decoding paths in the decoding network that have not yet been decoded, obtaining the second decoding paths corresponding to the decoding network.
Similar to the way the first decoding paths form a word graph, the second decoding paths may also form a word graph, which is not described in detail in the embodiments of the present disclosure.
In this embodiment, after first-pass decoding of the speech to be recognized is completed, only the first decoding paths in the decoding network that have not yet been decoded need to be decoded, which shortens the overall time required for first-pass and second-pass decoding and improves speech recognition efficiency.
Fig. 5 is a flowchart illustrating a voice recognition method according to still another exemplary embodiment of the present disclosure. As shown in fig. 5, step 204 may include the following steps, based on the embodiment shown in fig. 2, described above:
step 2041, determining a composite score of each second decoding path according to the second language score of each word in the acoustic recognition result of each voice frame in the voice to be recognized.
Step 2042, selecting the second decoding path with the highest comprehensive score from the second decoding paths corresponding to the decoding network as the voice recognition result of the voice to be recognized.
In some alternative embodiments, the sum or average of the second language scores of all the words in a second decoding path may be used as the composite score of that path; in other alternative embodiments, the sum or average of the acoustic scores and second language scores of all the words may be used; and in still other alternative embodiments, a weighted average of the acoustic scores and second language scores of all the words may be used as the composite score of the second decoding path. The embodiments of the present disclosure do not limit the specific way in which the composite score of a second decoding path is calculated.
In this embodiment, the composite score of each second decoding path is determined according to the second language score of each word in the acoustic recognition result, and the second decoding path with the highest composite score is selected as the speech recognition result of the speech to be recognized, which makes the speech recognition result more objective and accurate.
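A minimal sketch of one of these combinations (a weighted average of acoustic and second language scores; the weight and data layout are assumptions) might be:

```python
# Illustrative composite scoring; lm_weight and the
# (word, acoustic_score, second_lm_score) layout are assumptions.
def composite_score(path, lm_weight=0.7):
    """path: list of (word, acoustic_score, second_lm_score) triples.
    Returns a weighted average of acoustic and second-pass LM scores."""
    total = sum((1.0 - lm_weight) * am + lm_weight * lm
                for _, am, lm in path)
    return total / len(path)

def pick_result(second_paths, lm_weight=0.7):
    # Select the second decoding path with the highest composite score.
    best = max(second_paths, key=lambda p: composite_score(p, lm_weight))
    return [word for word, _, _ in best]

paths = [
    [("hello", -1.0, -0.8), ("china", -1.5, -1.1), ("people", -0.9, -0.7)],
    [("hello", -1.0, -0.8), ("country", -1.2, -0.9), ("people", -0.9, -0.7)],
]
print(pick_result(paths))
```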
Based on the embodiments of the present disclosure, after the speech recognition result of the speech to be recognized is obtained, semantic analysis can be performed on the speech recognition result, and the device can be controlled to execute the corresponding application operation according to the result of the semantic analysis, thereby implementing applications based on speech recognition. For example, if the result of the semantic analysis is "call XXX", the device may, according to that result, search for the telephone number of the contact XXX through the voice call function module and initiate a call; if the result of the semantic analysis is "what is the weather in Beijing tomorrow", the device may start the weather forecast module, look up tomorrow's weather in Beijing according to that result, and output it, specifically by voice or by text, which is not limited by the embodiments of the present disclosure.
The semantic analysis performed on the voice recognition result may be, for example, word-level semantic analysis, sentence-level semantic analysis, or chapter-level semantic analysis, which is not limited in the embodiments of the present disclosure. In an alternative example, semantic analysis may be performed on text information as a speech recognition result, and semantic representation of the speech recognition result may be obtained through the semantic analysis as a semantic analysis result; in another alternative example, semantic analysis may be performed on text information as a speech recognition result, and whether a preset word, phrase, or sentence is included in the speech recognition result is recognized as a semantic analysis result through the semantic analysis. The implementation manner of the semantic analysis of the voice recognition result according to the embodiment of the disclosure is not limited.
For example, in some alternative examples, semantic analysis of the speech recognition result may be performed using a topic model such as latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), or latent Dirichlet allocation (LDA), or using an artificial neural network such as an RNN or a long short-term memory (LSTM) network, which is not limited by the embodiments of the present disclosure.
The following application example further illustrates the present disclosure by way of example only; those skilled in the art will appreciate that the present disclosure and its applications are in no way limited by it:
assume that the speech to be recognized, acquired by the audio acquisition module or obtained through front-end signal processing, is: "How is the weather in Tianjin today?";
Based on the above embodiments of the present disclosure, the speech to be recognized is decoded using the acoustic model and the first language model, yielding the following first decoding paths: a. today-weather-how; b. Tianjin-weather-how; c. today-weather-how; d. Tianjin-weather-how; e. today-Tianjin-how; f. today-Tianjin-weather-how. Assume that, during decoding of the speech to be recognized, first decoding paths a, c, and e have already been decoded using the second language model;
generating a decoding network based on the first decoding paths a-f;
decoding the first decoding paths b, d, and f, which have not yet been decoded in the decoding network, using the second language model, thereby obtaining the second decoding paths corresponding to first decoding paths a-f;
and selecting, from the second decoding paths corresponding to the decoding network (i.e., the second decoding paths obtained by decoding first decoding paths a-f), the second decoding path with the highest composite score; in this application example it is the second decoding path obtained by decoding first decoding path f, which serves as the speech recognition result of the speech to be recognized.
In the embodiments of the present disclosure, the speech data to be recognized arrives in real time rather than being fed into the speech recognition system all at once, and the real-time factor of the decoder when decoding with the first language model is generally 0.5 or less. It is therefore feasible to decode the first decoding paths with the second language model while the speech to be recognized is still being decoded: the speech data already received is fully processed before the next packet arrives, so the overall real-time factor of decoding is not affected. By the time the last packet of speech data has been processed, the speech recognition system has already decoded roughly 50% of the first decoding paths, and only the remaining roughly 50% need to be decoded after the decoding network is generated.
For example, in the above application example, without the speech recognition method of the embodiments of the present disclosure, after the user finishes saying "How is the weather in Tianjin today", the user must wait a total of about 300 ms for the speech recognition result: 100 ms for first-pass decoding of the last packet of speech data, plus 200 ms for second-pass decoding of the entire decoding network. With the speech recognition method of the embodiments of the present disclosure, after the user finishes speaking, the wait is only about 200 ms in total: 100 ms for first-pass decoding of the last packet of speech data, plus 100 ms for second-pass decoding of the first decoding paths (about 50% of them) not yet decoded in the decoding network. The embodiments of the present disclosure thus reduce the waiting time by about 1/3 compared with the prior art, which can significantly improve the user experience.
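The arithmetic behind these figures can be checked directly (illustrative only, using the example's numbers):

```python
# Worked latency comparison from the numbers above.
first_pass_last_packet_ms = 100
second_pass_full_lattice_ms = 200
already_rescored_fraction = 0.5   # ~50% rescored before the last packet

baseline_wait = first_pass_last_packet_ms + second_pass_full_lattice_ms
overlapped_wait = (first_pass_last_packet_ms +
                   second_pass_full_lattice_ms * (1 - already_rescored_fraction))

print(baseline_wait, overlapped_wait)       # 300 200
print(1 - overlapped_wait / baseline_wait)  # ~0.33 -> about 1/3 saved
```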
Any of the speech recognition methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capability, including but not limited to: terminal devices, servers, etc. Alternatively, any of the speech recognition methods provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor may execute any of the speech recognition methods mentioned in the embodiments of the present disclosure by invoking corresponding instructions stored in a memory. This will not be repeated below.
Exemplary apparatus
Fig. 6 is a schematic structural diagram of a voice recognition apparatus according to an exemplary embodiment of the present disclosure. The voice recognition device can be arranged in electronic equipment such as terminal equipment and a server, and can execute the voice recognition method of any embodiment of the disclosure. As shown in fig. 6, the voice recognition apparatus includes: a first decoding module 601, a second decoding module 602 and a determining module 603. Wherein:
The first decoding module 601 is configured to decode the speech to be recognized to obtain the first decoding paths of the speech to be recognized.
In some embodiments, the first decoding module 601 is configured to decode the speech to be recognized using a first language model.
A second decoding module 602, configured to decode at least one first decoding path obtained by the first decoding module 601 in a process of decoding the speech to be recognized; and decoding the first decoding path which is not decoded after the voice to be recognized is decoded.
In some embodiments, the second decoding module 602 is configured to decode the first decoding path obtained by the first decoding module 601 using the second language model.
The determining module 603 is configured to determine a speech recognition result of the speech to be recognized according to the decoding result of the first decoding path obtained by the second decoding module 602.
The speech recognition apparatus provided by the embodiments of the present disclosure can decode at least one of the first decoding paths obtained by first-pass decoding (referred to as second-pass decoding) while the speech to be recognized is being decoded (referred to as first-pass decoding); after first-pass decoding of the speech to be recognized is completed, second-pass decoding is performed on the first decoding paths that have not yet undergone second-pass decoding; the speech recognition result of the speech to be recognized is then determined according to the results of the second-pass decoding. Because second-pass decoding of at least one first decoding path begins while first-pass decoding is still in progress, rather than only after first-pass decoding is complete, part of the second-pass decoding is already finished by the time the user inputs the last packet of data. Compared with the prior art, this reduces the user's waiting time after the last packet of data is input, shortens the time required by the whole speech recognition process, and improves the user experience.
Fig. 7 is a schematic structural view of a voice recognition apparatus provided in another exemplary embodiment of the present disclosure. As shown in fig. 7, on the basis of the embodiment of fig. 6 described above in the present disclosure, the voice recognition apparatus may further include: an acoustic feature extraction module 604 and an acoustic identification module 605. Wherein:
The acoustic feature extraction module 604 is configured to sequentially read a speech frame from the speech to be recognized, and perform acoustic feature extraction on the read speech frame to obtain speech feature information of the read speech frame.
The acoustic recognition module 605 is configured to recognize the above-mentioned voice feature information by using an acoustic model, and obtain an acoustic recognition result of the read voice frame, where the acoustic recognition result may include, for example, but not limited to: at least one word and an acoustic score for each word of the at least one word.
Based on the speech recognition apparatus provided by the embodiments of the present disclosure, at least one of the first decoding paths obtained by decoding can be decoded while the speech to be recognized is still being decoded; after first-pass decoding of the speech to be recognized is completed, the first decoding paths that have not yet undergone second-pass decoding are decoded, and the speech recognition result of the speech to be recognized is then determined from the results of decoding the first decoding paths. Because decoding of the obtained first decoding paths begins during decoding of the speech to be recognized, there is no need to wait until decoding of the speech to be recognized is finished; part of the first decoding paths have already been decoded when the user inputs the last packet of data, which reduces the user's waiting time, shortens the time required by the whole speech recognition process, and improves the user experience compared with the prior art.
In some of these embodiments, the first decoding module 601 may include: and the first language model is used for scoring each word in the acoustic recognition result of the read voice frame respectively to obtain a first language score of each word in the acoustic recognition result of the read voice frame. The first decoding path is obtained based on an acoustic recognition result of the read voice frame and an acoustic recognition result of the historical voice frame, wherein the historical voice frame comprises: the timing of the speech frame preceding the read speech frame in the speech to be recognized.
Optionally, in other embodiments, after obtaining the first language score of each word in the acoustic recognition result of the read speech frame, the first decoding module 601 may further be configured to sort and deduplicate the first decoding paths of the words in the acoustic recognition result of the read speech frame.
In some of these embodiments, the second decoding module 602 includes: and the second language model is used for sequentially re-scoring each word in the acoustic recognition result of each voice frame in the first decoding path, obtaining a second language score of each word in the acoustic recognition result of each voice frame, storing the second language score and decoding the first decoding path to obtain a second decoding path.
Referring again to fig. 7, based on the embodiment of the disclosure shown in fig. 6, the second decoding module 602 may include: the generation unit 6021 and the decoding unit 6022. Wherein:
A generating unit 6021 for generating a decoding network based on the first decoding path obtained by the first decoding module 601.
A decoding unit 6022, configured to decode at least one first decoding path obtained by the first decoding module 601 in a process of decoding the speech to be recognized; and decodes the first decoding path that has not been decoded yet in the decoding network generated by the generating unit 6021.
Referring again to fig. 7, on the basis of the embodiment of fig. 6 described above in the present disclosure, the determining module 603 may include: a determination unit 6031 and a selection unit 6032. Wherein:
The determining unit 6031 is configured to determine a composite score of each second decoding path according to the second language score of each word in the acoustic recognition result of each voice frame in the voice to be recognized obtained by the second decoding module 602.
A selecting unit 6032, configured to select the second decoding path with the highest composite score from the second decoding paths corresponding to the decoding network generated by the generating unit 6021 as the speech recognition result of the speech to be recognized.
Exemplary electronic device
Next, an electronic device according to an embodiment of the present disclosure is described with reference to fig. 8. The electronic device may be either or both of the first device and the second device, or a stand-alone device independent thereof, which may communicate with the first device and the second device to receive the acquired input signals therefrom.
Fig. 8 illustrates a block diagram of an electronic device according to an embodiment of the disclosure. As shown in fig. 8, the electronic device includes one or more processors 801 and memory 802.
The processor 801 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities and may control other components in the electronic device to perform desired functions.
Memory 802 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer readable storage medium that can be executed by the processor 801 to implement the speech recognition methods and/or other desired functions of the various embodiments of the present disclosure described above. Various contents such as an input signal, a signal component, a noise component, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device may further include: an input device 803 and an output device 804, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, when the electronic device is a first device or a second device, the input means 803 may be a microphone or a microphone array as described above for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input means 803 may be a communication network connector for receiving the acquired input signals from the first device and the second device.
In addition, the input device 803 may also include, for example, a keyboard, a mouse, and the like.
The output device 804 may output various information to the outside, including the determined distance information, direction information, and the like. The output devices 804 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.
Of course, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 8, components such as buses, input/output interfaces, and the like are omitted for simplicity. In addition, the electronic device may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in a speech recognition method according to the various embodiments of the present disclosure described in the "exemplary methods" section of this specification.
The computer program product may write program code for performing the operations of embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform steps in a speech recognition method according to various embodiments of the present disclosure described in the above "exemplary method" section of the present disclosure.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, so that the same or similar parts between the embodiments are mutually referred to. For system embodiments, the description is relatively simple as it essentially corresponds to method embodiments, and reference should be made to the description of method embodiments for relevant points.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", "having", and the like are open-ended, mean "including but not limited to", and may be used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or", unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the apparatus, devices and methods of the present disclosure, components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (14)

1. A speech recognition method, comprising:
decoding speech to be recognized to obtain first decoding paths of the speech to be recognized;
during the decoding of the speech to be recognized, decoding at least one of the first decoding paths of the speech to be recognized;
after the decoding of the speech to be recognized is completed, decoding the first decoding paths of the speech to be recognized that have not yet been decoded; and
determining a speech recognition result of the speech to be recognized according to decoding results of the first decoding paths of the speech to be recognized.
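As an aid to understanding claim 1, the following is a minimal Python sketch of the claimed interleaving. It is an illustrative reading, not the patented implementation; the callables "first_pass_decode", "second_pass_decode", and "pick_best" are assumed placeholders. The point it shows is that second-pass decoding of already-available first decoding paths proceeds while the first pass is still consuming speech frames, so only the leftover paths need decoding once the speech ends.

# Hypothetical sketch of claim 1; all callables are assumed placeholders.
# first_pass_decode is assumed to extend the list of paths, keeping
# already-finalized paths stable at their positions.
def recognize(frames, first_pass_decode, second_pass_decode, pick_best):
    first_paths = []   # first decoding paths produced so far
    second_paths = []  # results of decoding (rescoring) the first paths
    done = 0           # how many first paths have been rescored already

    for frame in frames:  # decoding the speech to be recognized
        first_paths = first_pass_decode(frame, first_paths)
        # Step 2: while first-pass decoding is in progress, decode at
        # least one of the first decoding paths obtained so far.
        if done < len(first_paths):
            second_paths.append(second_pass_decode(first_paths[done]))
            done += 1

    # Step 3: after the speech is fully decoded, decode the first
    # decoding paths that have not yet been decoded.
    for path in first_paths[done:]:
        second_paths.append(second_pass_decode(path))

    # Step 4: determine the recognition result from the decoding results.
    return pick_best(second_paths)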
2. The method of claim 1, wherein decoding the speech to be recognized comprises: decoding the speech to be recognized using a first language model; and
decoding the first decoding path comprises: decoding the first decoding path using a second language model.
3. The method of claim 2, further comprising, before decoding the speech to be recognized using the first language model:
sequentially reading a speech frame from the speech to be recognized, and performing acoustic feature extraction on the read speech frame to obtain speech feature information of the read speech frame; and
recognizing the speech feature information using an acoustic model to obtain an acoustic recognition result of the read speech frame, the acoustic recognition result comprising: at least one word and an acoustic score of each word in the at least one word.
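A sketch of the front end described in claim 3 follows, using only NumPy. The frame length, hop size, and the toy log-spectrum feature are assumptions chosen for illustration, and "acoustic_model" stands in for any model that maps features to per-word scores; none of these details come from the patent.

import numpy as np

FRAME_LEN, HOP = 400, 160  # 25 ms frames, 10 ms hop at 16 kHz (assumed)

def read_frames(waveform):
    # Sequentially read speech frames from the speech to be recognized.
    for start in range(0, len(waveform) - FRAME_LEN + 1, HOP):
        yield waveform[start:start + FRAME_LEN]

def extract_features(frame):
    # Toy acoustic feature extraction: windowed log power spectrum.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    return np.log(spectrum + 1e-8)

def acoustic_recognition(features, acoustic_model, top_k=5):
    # Acoustic recognition result: at least one word, each with an
    # acoustic score. acoustic_model is assumed to return {word: score}.
    scores = acoustic_model(features)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]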
4. The method of claim 3, wherein decoding the speech to be recognized using the first language model comprises: scoring each word in the acoustic recognition result of the read speech frame using the first language model to obtain a first language score of each word in the acoustic recognition result of the read speech frame, wherein the first decoding path is obtained based on the acoustic recognition result of the read speech frame and acoustic recognition results of historical speech frames, the historical speech frames comprising: speech frames of the speech to be recognized whose timing precedes the read speech frame; or
decoding the first decoding path using the second language model comprises: re-scoring, in sequence, each word in the acoustic recognition result of each speech frame in the first decoding path using the second language model to obtain a second language score of each word in the acoustic recognition result of each speech frame, and storing the second language scores and a second decoding path obtained by decoding the first decoding path.
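The two scoring roles in claim 4 can be sketched as follows. The bigram table standing in for the first language model, the callable second language model, and the triple format of the output are all assumptions kept deliberately small; the patent does not prescribe these representations.

import math

def first_language_score(history, word, first_lm):
    # First language model: a fast, low-order model (here an assumed
    # bigram probability table) scores each word as it is decoded.
    prev = history[-1] if history else "<s>"
    return math.log(first_lm.get((prev, word), 1e-9))

def second_pass_rescore(first_path, second_lm):
    # Second language model: re-score each word of a first decoding
    # path in sequence, storing the second language scores together
    # with the resulting second decoding path.
    second_path, history = [], ["<s>"]
    for word, acoustic_score in first_path:
        lm_score = second_lm(history, word)  # assumed stronger LM
        second_path.append((word, acoustic_score, lm_score))
        history.append(word)
    return second_path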
5. The method of claim 4, further comprising, after obtaining the first language score of each word in the acoustic recognition result of the read speech frame:
sorting and de-duplicating the first decoding paths of the words in the acoustic recognition result of the read speech frame.
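Claim 5's sorting and de-duplication admits a simple beam-pruning reading, sketched below; the "Path" container and the beam width are assumed details, not part of the claim.

from dataclasses import dataclass

@dataclass
class Path:
    words: list    # word sequence decoded so far
    score: float   # accumulated acoustic + first language score

def sort_and_deduplicate(paths, beam=10):
    # Sort partial first decoding paths by score, drop paths whose word
    # histories are identical (keeping the best-scoring copy), and prune
    # to the beam width.
    paths.sort(key=lambda p: p.score, reverse=True)
    seen, kept = set(), []
    for p in paths:
        key = tuple(p.words)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept[:beam]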
6. The method of any one of claims 1-5, wherein decoding the first decoding paths of the speech to be recognized that have not yet been decoded comprises:
generating a decoding network based on the first decoding paths of the speech to be recognized; and
performing second-pass decoding on the first decoding paths in the decoding network that have not yet been decoded, to obtain second decoding paths corresponding to the decoding network.
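One plausible shape for the decoding network of claim 6 is a word lattice that merges the first decoding paths so that arcs shared by several paths are rescored once rather than once per path; the sketch below assumes paths in the (word, score) format used above, and the lattice layout is an assumption.

from collections import defaultdict

def build_decoding_network(first_paths):
    # Merge first decoding paths into a lattice keyed by position and
    # predecessor word; each arc appears once even if many paths share it.
    arcs = defaultdict(set)  # (position, previous word) -> {next words}
    for path in first_paths:
        prev = "<s>"
        for pos, (word, _score) in enumerate(path):
            arcs[(pos, prev)].add(word)
            prev = word
    return arcs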
7. The method of claim 6, wherein determining the speech recognition result of the speech to be recognized according to the decoding results of the first decoding paths of the speech to be recognized comprises:
determining a comprehensive score of each second decoding path according to the second language score of each word in the acoustic recognition result of each speech frame of the speech to be recognized; and
selecting, from the second decoding paths corresponding to the decoding network, the second decoding path with the highest comprehensive score as the speech recognition result of the speech to be recognized.
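Claim 7's selection step can be read as combining the stored scores into one comprehensive score per second decoding path and taking the argmax. The additive acoustic-plus-weighted-language combination below is an assumption; the claim only requires some comprehensive score.

def comprehensive_score(second_path, lm_weight=0.5):
    # Combine the acoustic score and second language score of each word;
    # the weighting scheme is an assumed detail.
    return sum(a + lm_weight * lm for _word, a, lm in second_path)

def pick_best(second_paths):
    # The second decoding path with the highest comprehensive score is
    # the speech recognition result.
    best = max(second_paths, key=comprehensive_score)
    return [word for word, _a, _lm in best]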
8. A speech recognition apparatus, comprising:
a first decoding module configured to decode speech to be recognized to obtain first decoding paths of the speech to be recognized;
a second decoding module configured to decode, during the decoding of the speech to be recognized, at least one of the first decoding paths obtained by the first decoding module, and to decode, after the decoding of the speech to be recognized is completed, the first decoding paths of the speech to be recognized that have not yet been decoded; and
a determining module configured to determine a speech recognition result of the speech to be recognized according to the decoding results, obtained by the second decoding module, of the first decoding paths of the speech to be recognized.
9. The apparatus of claim 8, wherein the first decoding module is configured to decode the speech to be recognized using a first language model; and
the second decoding module is configured to decode the first decoding paths obtained by the first decoding module using a second language model.
10. The apparatus of claim 9, further comprising:
an acoustic feature extraction module configured to sequentially read a speech frame from the speech to be recognized and perform acoustic feature extraction on the read speech frame to obtain speech feature information of the read speech frame; and
an acoustic recognition module configured to recognize the speech feature information using an acoustic model to obtain an acoustic recognition result of the read speech frame, the acoustic recognition result comprising: at least one word and an acoustic score of each word in the at least one word.
11. The apparatus of claim 10, wherein the first decoding module comprises the first language model, configured to score each word in the acoustic recognition result of the read speech frame to obtain a first language score of each word in the acoustic recognition result of the read speech frame, wherein the first decoding path is obtained based on the acoustic recognition result of the read speech frame and acoustic recognition results of historical speech frames, the historical speech frames comprising: speech frames of the speech to be recognized whose timing precedes the read speech frame; or
the second decoding module comprises the second language model, configured to re-score, in sequence, each word in the acoustic recognition result of each speech frame in the first decoding path to obtain a second language score of each word in the acoustic recognition result of each speech frame, and to store the second language scores and a second decoding path obtained by decoding the first decoding path.
12. The apparatus of any one of claims 8-11, wherein the second decoding module comprises:
a generating unit configured to generate a decoding network based on the first decoding paths obtained by the first decoding module; and
a decoding unit configured to decode, during the decoding of the speech to be recognized, at least one first decoding path in the decoding network generated by the generating unit, and to decode the first decoding paths in the decoding network that have not yet been decoded.
13. A computer-readable storage medium storing a computer program for performing the speech recognition method of any one of claims 1-7.
14. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor,
wherein the processor is configured to read the executable instructions from the memory and to execute the instructions to implement the speech recognition method of any one of claims 1-7.
CN201910502583.0A 2019-06-11 2019-06-11 Speech recognition method and device, electronic equipment and storage medium Active CN112071310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910502583.0A CN112071310B (en) 2019-06-11 2019-06-11 Speech recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112071310A CN112071310A (en) 2020-12-11
CN112071310B (en) 2024-05-07

Family

ID=73658558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910502583.0A Active CN112071310B (en) 2019-06-11 2019-06-11 Speech recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112071310B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066489A (en) * 2021-03-16 2021-07-02 Shenzhen Horizon Robotics Technology Co., Ltd. Voice interaction method and device, computer-readable storage medium and electronic equipment
CN113113024A (en) * 2021-04-29 2021-07-13 iFlytek Co., Ltd. Voice recognition method and device, electronic equipment and storage medium
CN114220444B (en) * 2021-10-27 2022-09-06 Anhui Xunfei Huanyu Technology Co., Ltd. Voice decoding method and device, electronic equipment and storage medium
CN116153294B (en) * 2023-04-14 2023-08-08 Jingdong Technology Information Technology Co., Ltd. Speech recognition method, device, system, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102376305A (en) * 2011-11-29 2012-03-14 Anhui USTC iFlytek Information Technology Co., Ltd. Speech recognition method and system
KR20130011574A (en) * 2011-07-22 2013-01-30 Electronics and Telecommunications Research Institute Apparatus for rescoring a confusion network for continuous voice recognition of Korean, and method for generating and rescoring a confusion network using the same
CN108305634A (en) * 2018-01-09 2018-07-20 Shenzhen Tencent Computer Systems Co., Ltd. Decoding method, decoder and storage medium
CN108415898A (en) * 2018-01-19 2018-08-17 Suzhou AISpeech Information Technology Co., Ltd. Word lattice rescoring method and system for a deep-learning language model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4940057B2 (en) * 2007-08-17 2012-05-30 Toshiba Corporation Speech recognition apparatus and method
US8831944B2 (en) * 2009-12-15 2014-09-09 At&T Intellectual Property I, L.P. System and method for tightly coupling automatic speech recognition and search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant