CN111091819A - Voice recognition device and method, voice interaction system and method - Google Patents

Voice recognition device and method, voice interaction system and method

Info

Publication number: CN111091819A
Application number: CN201811166607.1A
Authority: CN (China)
Prior art keywords: voice, speech, speech recognition, recognition result, input
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李国庆, 孙珏
Current Assignee: NIO Holding Co Ltd
Original Assignee: NIO Nextev Ltd
Application filed by NIO Nextev Ltd; priority to CN201811166607.1A

Classifications

    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding (G: Physics; G10: Musical instruments; Acoustics)
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 2015/225: Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a voice recognition device and method, and to a voice interaction system and method. The voice recognition device can receive a user's voice input and can receive online, from a far-end voice recognition module, a first voice recognition result output after the voice input has been processed online. The voice recognition device further includes a local voice recognition module configured with a second acoustic model constructed based on a binarization neural network algorithm, wherein the local voice recognition module processes voice features extracted from the voice input through at least the second acoustic model to output a second voice recognition result. The resulting voice recognition is timely and accurate, is only slightly affected by network connection conditions, and provides a good user experience.

Description

Voice recognition device and method, voice interaction system and method
Technical Field
The invention belongs to the technical field of voice recognition, and relates to a voice recognition device and voice recognition method in which the local voice recognition module adopts an acoustic model constructed using a Binary Neural Network (BNN) algorithm, to a voice interaction system using the voice recognition device, and to a corresponding voice interaction method.
Background
Speech recognition devices such as vehicle-mounted speech recognition terminals need to recognize a user's speech input quickly (i.e., in real time) and accurately (i.e., with a high recognition rate) in order to provide a good user experience when applied to, for example, a speech interaction system.
The local side of the speech recognition device usually has a local recognition engine operating on the basis of an acoustic model and a language model. The acoustic model is usually very computationally expensive when processing the user's speech input, and therefore places high demands on the computing power of the speech recognition device (such as a vehicle-mounted speech recognition terminal). The acoustic model configured or used by the local recognition engine can be formed by modeling and training, for example, a GMM (Gaussian Mixture Model)-HMM (Hidden Markov Model), a Deep Neural Network (DNN)-HMM, or a Deep Neural Network (DNN); these have the disadvantages of high computational overhead, poor real-time performance, and poor recognition rates.
In order to improve user experience, a speech recognition device generally recognizes speech input using, for example, a cloud recognition engine with high computing power and accurate speech recognition (i.e., a high recognition rate) to obtain a corresponding recognition result. However, the cloud recognition engine is located remotely from the speech recognition device, and the wireless network communication conditions between them may limit how quickly recognition results can be transmitted, thereby degrading the real-time user experience of speech recognition.
At present, a voice recognition device generally works in a hybrid mode combining a local recognition engine and a cloud recognition engine: when the device's wireless network communication conditions are good, the recognition result of the cloud recognition engine is used and the recognition rate is high; when the wireless network communication conditions are poor (for example, the network is interrupted), the device, upon determining that the conditions are poor, switches to the local recognition engine to ensure real-time performance, but the recognition rate is then seriously sacrificed.
Disclosure of Invention
It is an object of the invention to improve the user experience of speech recognition.
Still another object of the present invention is to improve user experience when performing voice interaction based on voice recognition.
To achieve one or other of the above objects, the present invention provides the following technical solutions.
According to a first aspect of the present invention, a speech recognition apparatus is provided, which is capable of receiving a speech input of a user, and is capable of receiving online, from a remote speech recognition module, a first speech recognition result output after the speech input is processed online; the speech recognition apparatus includes:
a local speech recognition module configured with a second acoustic model constructed based on a binarization neural network algorithm;
wherein the local speech recognition module processes speech features extracted from the speech input through at least the second acoustic model to output a second speech recognition result.
The voice recognition device according to an embodiment of the present invention further includes:
a voice application module configured to select the second voice recognition result when the first voice recognition result is received later than the output time of the second voice recognition result by more than a preset time threshold, or when the first voice recognition result is not received, and otherwise to select the first voice recognition result.
The speech recognition apparatus according to another embodiment of the present invention or any one of the above embodiments, wherein the local speech recognition module is further configured with a second language model, and further includes a decoding output unit;
the decoding output unit performs matching and comparison processing on the voice features through the second acoustic model and performs language processing on the result processed through the second acoustic model through the second language model so as to decode and output the second voice recognition result.
The speech recognition apparatus according to another embodiment of the present invention or any one of the above embodiments, further comprising:
a voice activity detection unit configured with a third acoustic model constructed based on a binarization neural network algorithm and used for detecting endpoint information corresponding to the voice input through the third acoustic model.
The speech recognition apparatus according to another embodiment of the invention or any one of the above embodiments, wherein the endpoint information includes a speech start endpoint and/or a speech stop endpoint.
The speech recognition apparatus according to another embodiment of the present invention or any one of the above embodiments, wherein the speech activity detection unit is further configured to determine each of the speech inputs of the user based on the detected endpoint information of the speech input;
the far-end voice recognition module outputs a first voice recognition result corresponding to each voice input;
the local voice recognition module is further configured to output a second voice recognition result corresponding to each voice input.
The speech recognition apparatus according to another embodiment of the present invention or any one of the above embodiments, wherein the speech application module includes a speech interaction processing unit, configured to generate the speech interaction information fed back for each speech input according to the first speech recognition result or the second speech recognition result corresponding to the speech input.
The speech recognition apparatus according to another embodiment of the invention or any one of the above embodiments, wherein the speech activity detection unit is further configured to determine, through the third acoustic model, whether the speech features correspond to a speech state or a silent state, so as to determine a speech start endpoint and a speech end endpoint corresponding to one voice input according to the continuity characteristics of the speech features.
The voice recognition apparatus according to another embodiment of the present invention or any one of the above embodiments, wherein the voice activity detection unit is further configured to detect a wake-up feature in the voice input and output a wake-up signal to the local voice recognition module if the wake-up feature is detected.
The speech recognition apparatus according to another embodiment of the present invention or any one of the above embodiments, wherein the speech recognition apparatus is a vehicle-mounted speech recognition apparatus and is applied to a vehicle-mounted speech interaction system.
According to a second aspect of the present invention, there is provided a speech recognition method for performing speech recognition by using a local speech recognition module and a remote speech recognition module simultaneously, comprising the steps of:
sending voice input to the far-end voice recognition module, and receiving a first voice recognition result output by the far-end voice recognition module after the voice input is processed online;
processing the voice features extracted from the voice input at least through a second acoustic model of the local voice recognition module to output a second voice recognition result, wherein the second acoustic model is constructed based on a binarization neural network algorithm; and
when the first voice recognition result is received later than the output time of the second voice recognition result by more than a preset time threshold, or the first voice recognition result is not received, selecting the second voice recognition result as the recognition result of the voice input; otherwise, selecting the first voice recognition result as the recognition result of the voice input.
In the speech recognition method according to an embodiment of the present invention, in the step of outputting the second speech recognition result, matching and comparison processing is performed on the speech features by the second acoustic model, and language processing is performed by the second language model on the result processed by the second acoustic model, so as to decode and output the second speech recognition result.
The speech recognition method according to another embodiment of the present invention or any one of the above embodiments, further comprising: detecting endpoint information corresponding to the voice input through a third acoustic model; wherein the third acoustic model is constructed based on a binarization neural network algorithm.
The speech recognition method according to another embodiment of the invention or any of the above embodiments, wherein the endpoint information comprises a speech start endpoint and/or a speech end endpoint.
The speech recognition method according to another embodiment of the present invention or any one of the above embodiments, further comprising:
determining each of the voice inputs of a user based on the detected endpoint information of the voice inputs;
in the step of receiving the first voice recognition result, receiving a first voice recognition result corresponding to each voice input on line;
and in the step of outputting the second voice recognition result, outputting the second voice recognition result corresponding to each voice input.
The speech recognition method according to another embodiment of the present invention or any one of the above embodiments, further comprising:
generating the voice interaction information in reply to each voice input according to the first voice recognition result or the second voice recognition result corresponding to that voice input.
In the voice recognition method according to another embodiment of the invention or any one of the above embodiments, in the detecting of the endpoint information, it is determined whether the voice feature corresponds to a voice state or a silent state through the third acoustic model, so that a voice start endpoint and a voice stop endpoint corresponding to one voice input are determined according to a continuity characteristic of the voice feature.
The speech recognition method according to another embodiment of the present invention or any one of the above embodiments, further comprising: detecting a wake-up feature word in the voice input and outputting a wake-up signal to the local voice recognition module when the wake-up feature word is detected.
According to a third aspect of the present invention, there is provided a method for performing voice interaction by using the voice recognition apparatus, comprising the steps of:
s1: receiving a voice input of a user;
s2: generating voice interaction information fed back aiming at the voice input according to the first voice recognition result or the second voice recognition result corresponding to the voice input;
s3: outputting voice interaction information to a user; and
s4: receiving the next voice input of the user, and repeating the steps of S2 and S3 until the next voice input of the user is not continuously received.
The method according to an embodiment of the present invention, further comprising, before step S1, the steps of:
s11: receiving a voice input of a user including a wakeup feature word; and
s12: and awakening the local voice recognition module to work under the condition that the awakening feature words are detected.
According to a fourth aspect of the present invention, there is provided a voice interactive system, comprising:
a speech recognition device as described in any of the above; and
a voice interaction output terminal.
According to a fifth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the speech recognition method described above when executing the program.
According to a sixth aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program is executable by a processor to implement the steps of the speech recognition method described above.
The above features and operation of the present invention will become more apparent from the following description and the accompanying drawings.
Drawings
The above and other objects and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which like or similar elements are designated by like reference numerals.
Fig. 1 illustrates a block configuration diagram of a computer device for implementing a voice recognition apparatus or a voice interaction system according to one or more embodiments of the present disclosure.
Fig. 2 is a schematic block diagram of a voice interaction system according to an embodiment of the present disclosure, in which a voice recognition apparatus according to an embodiment of the present disclosure is used.
Fig. 3 is a schematic block diagram of a voice interaction system according to another embodiment of the present disclosure, in which a voice recognition apparatus according to another embodiment of the present disclosure is used.
FIG. 4 illustrates a flow diagram of a speech recognition method in accordance with one or more embodiments of the present disclosure.
FIG. 5 illustrates a flow diagram of a voice interaction method in accordance with one or more embodiments of the present disclosure.
Detailed Description
For the purposes of brevity and explanation, the principles of the present invention are described herein with reference primarily to exemplary embodiments thereof. However, those skilled in the art will readily recognize that the same principles are equally applicable to all types of speech recognition devices and/or speech interaction systems, and that these same or similar principles may be implemented therein, with any such variations not departing from the true spirit and scope of the present patent application. Moreover, in the following description, reference is made to the accompanying drawings that illustrate certain exemplary embodiments. Electrical, logical, and structural changes may be made to these embodiments without departing from the spirit and scope of the invention. In addition, while a feature of the invention may have been disclosed with respect to only one of several implementations/embodiments, such feature may be combined with one or more other features of the other implementations/embodiments as may be desired and/or advantageous for any given or identified function. The following description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.
Where used, the terms "first," "second," and the like do not necessarily denote any order or priority relationship, but rather may be used to more clearly distinguish objects from one another.
Fig. 1 illustrates a block configuration diagram of a computer device for implementing a voice recognition apparatus or a voice interaction system according to one or more embodiments of the present disclosure.
Referring to FIG. 1, an embodiment of a computer device for implementing a speech recognition apparatus or speech interaction system (as shown in FIG. 2 or FIG. 3) of the present disclosure is shown. In this embodiment, the computer device 10 has one or more central processing units (processors) 11a, 11b, 11c, etc. (collectively or generically referred to as processors 11), it being understood that the computing power of the computer device 10 will be primarily determined by the processors 11. In one or more embodiments, each processor 11 may be a microprocessor including a Reduced Instruction Set Computer (RISC); processor 11 is coupled to system memory 14 (RAM) and various other components by system bus 13; read Only Memory (ROM) 12 is coupled to system bus 13 and may include a basic input/output system (BIOS) that controls certain basic functions of computer device 10.
The RAM 14 may store corresponding program instructions, the program instructions may include the acoustic model and/or the language model of the present disclosure, and the processor 11 may execute the program instructions on the RAM 14 during a work process such as performing speech recognition, so that the speech recognition apparatus of the embodiment of the present disclosure may function, for example, to implement a local recognition engine and a speech activity detection unit.
It will be appreciated that the RAM 14 may also store other information used in performing speech recognition processes or training acoustic/language model processes, learning processes, etc., e.g., corpora, which may be implemented in the form of a database, as desired.
Continuing with FIG. 1, there is also shown an input/output (I/O) adapter 17 and a network adapter 16 coupled to system bus 13. The I/O adapter 17 may be connected to a voice input 171, such as a microphone, so that audio signals, including user voice input, may be received over the system bus 13. The network communications adapter 16 interconnects bus 13 with an external network 700, enabling the computer device 10 to communicate wirelessly with a remote, e.g., cloud, recognition engine. A screen (e.g., a display monitor) may also be connected to the system bus 13 via a display adapter.
Continuing with FIG. 1, it also shows a display 15, which may display, for example, the status of the computer device 10 (e.g., network connection status), speech recognition results, voice interaction status, and the like. In other embodiments, the display 15 may be omitted.
It will be understood that the computer device 10 may also include other components not shown in fig. 1 above, such as a speaker for outputting speech.
The computer device 10 described herein is merely exemplary and is not intended to limit applications, uses, and/or techniques. The computer device 10 may be implemented as an electronic device such as a mobile terminal, a vehicle-mounted terminal, or the like.
Fig. 2 is a schematic block diagram of a voice interaction system according to an embodiment of the present disclosure, in which the voice recognition apparatus 100 according to an embodiment of the present disclosure is used. The speech recognition apparatus 100 of the disclosed embodiment has network communication capability and performs speech recognition using both the remote speech recognition module 800 and the local speech recognition module 120.
As shown in fig. 2, the speech recognition apparatus 100 may establish a wireless connection with the remote speech recognition module 800 through the network 700, and the speech recognition apparatus 100 may move relative to the network 700, which may be, for example, a vehicle-mounted speech recognition apparatus, a personal mobile terminal, or the like.
The speech recognition device 100 may receive a user's speech input, for example in the form of an audio signal, from a speech input component such as a microphone. The speech input may comprise a substantially continuous speech signal or an intermittent speech signal. When the user finishes speaking once, that utterance can be treated as one corresponding voice input; it will be appreciated that the principle for dividing successive voice inputs may be predefined according to the typical speaking habits of people in general or of the particular user.
In an embodiment, the speech recognition apparatus 100 is provided with a front-end processing module 110, and the front-end processing module 110 may, for example, perform noise reduction processing on an audio signal including a speech input, and may, for example, extract corresponding speech features. On one hand, the voice input signal preprocessed by the front-end processing module 110 can be wirelessly transmitted to the far-end voice recognition module 800 through the network 700, so that the voice recognition module 800 performs voice recognition processing; and on the other hand, may be transmitted to the local speech recognition module 120 for speech recognition processing.
It will be understood that "remote" and "local" are defined with respect to the speech recognition device 100, "local" meaning in or near the speech recognition device 100 and capable of wired connection therewith, and "remote" meaning not in the speech recognition device 100 and capable of interacting with the speech recognition device 100 only by way of wireless communication. In one embodiment, the far-end speech recognition module 800 may be implemented in or through a cloud, and thus may also be referred to as a cloud recognition engine.
The far-end speech recognition module 800 generally has significantly superior computational power over the local speech recognition module 120, and thus, it can configure algorithms or programs with high computer resource overhead to greatly improve the recognition rate of speech recognition. The far-end speech recognition module 800 may be implemented by various speech recognition technologies existing at present or speech recognition technologies appearing in the future, for example, it may configure the first acoustic model 810, the first language model 820, and the far-end speech recognition module 800 may perform search calculation based on the first acoustic model 810 and the first language model 820 and the received speech input to output the corresponding first speech recognition result 189. It will be appreciated that the implementation of the far-end speech recognition module 800 and its first acoustic model 810, etc., is not limiting.
It should be noted that the far-end speech recognition module 800 needs an established wireless communication connection with the speech recognition apparatus 100 in order to receive the speech input online, perform the recognition processing online, and transmit the corresponding first speech recognition result 189 online. When the network connection of the speech recognition apparatus 100 is poor or is briefly interrupted (for example, because its location changes), this online reception and transmission introduces network delay; although the far-end speech recognition module 800 computes quickly, a user at the speech recognition apparatus 100 will clearly perceive poor real-time performance if only the first speech recognition result 189 is relied upon.
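For illustration only, the online round trip to such a far-end engine might look like the following sketch; the endpoint URL, payload shape, and use of the requests library are assumptions of this sketch, not anything specified by the patent:

```python
from typing import Optional
import requests

CLOUD_ASR_URL = "https://asr.example.com/v1/recognize"   # hypothetical cloud endpoint

def recognize_remotely(audio_bytes: bytes, timeout_s: float = 2.0) -> Optional[str]:
    """Send one voice input to a cloud recognition engine and return the
    first speech recognition result, or None if the network is too slow or down."""
    try:
        resp = requests.post(
            CLOUD_ASR_URL,
            data=audio_bytes,
            headers={"Content-Type": "application/octet-stream"},
            timeout=timeout_s,
        )
        resp.raise_for_status()
        return resp.json().get("transcript")
    except requests.RequestException:
        return None   # caller can fall back to the local recognition result
```

The timeout and fallback in this sketch correspond to the observation above: any network delay or interruption directly delays the first speech recognition result.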
The local speech recognition module 120 is not limited by network latency, but its computing power or computational resources are limited. In one embodiment, the local speech recognition module 120 is configured with a second acoustic model 121 constructed based on a Binary Neural Network (BNN) algorithm, and is further configured with a second language model 122. The construction of the second acoustic model can include learning and training based on existing data. The second language model 122, like the first language model 820, may be a grammar network consisting of recognized voice commands, or a statistical language model; language processing based on it may include grammatical and semantic analysis, and the like. The specific construction of the second language model 122 and the first language model 820 is not limiting.
It should be noted that the acoustic model is the underlying model of the local speech recognition module 120 and is a key part of it. The modeling unit size of the acoustic model (a word pronunciation model, a semi-syllable model, or a phoneme model) strongly influences the amount of speech training data required, the system recognition rate, flexibility, and the like, and can be chosen according to the characteristics of the language and the size of the vocabulary. The second language model 122 is particularly important for medium- and large-vocabulary speech recognition systems; for example, when classification goes wrong, judgment and correction can be performed according to the syntactic structure, semantics, and the like captured by the second language model 122, especially for homophones whose meaning can only be determined from the surrounding context.
In one embodiment, the second acoustic model 121 is constructed by training a model based on a BNN algorithm (i.e., training a BNN network). Training of the BNN network can be developed from a conventional neural network. The mapping matrices and bias parameters of a conventional neural network are generally represented by single- or double-precision floating-point numbers, and the Back Propagation (BP) algorithm used during training also operates at that precision. In the training of the BNN network, the matrices and bias parameters are instead quantized to 1 and -1: after being updated by the BP algorithm, a parameter is set directly to -1 if its value is less than 0, and to 1 if its value is greater than 0. Moreover, to prevent parameters from never flipping because the back-propagated error updates are too small, a buffer for accumulating the backward error is maintained for each parameter during model training; the backward errors are superimposed in this buffer, and whether the parameter flips is finally determined based on the parameter value plus the accumulated error in the buffer.
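By way of illustration only, the binarized update described above could be sketched as follows; the function names, learning rate, and buffer-reset rule are assumptions of this sketch, not features disclosed by the patent:

```python
import numpy as np

def binarize(values):
    """Quantize real-valued parameters to +1/-1 (values below 0 become -1, others +1)."""
    return np.where(values < 0, -1.0, 1.0)

def bnn_update(binary_weights, grad, error_buffer, lr=0.01):
    """One illustrative BNN parameter update.

    The back-propagated update is accumulated in an error buffer; a weight
    flips sign only when the binary value plus its accumulated error crosses
    zero, which keeps tiny updates from being lost.
    """
    error_buffer -= lr * grad                        # superimpose the backward error
    effective = binary_weights + error_buffer        # parameter value plus buffered error
    flipped = binarize(effective)
    error_buffer[flipped != binary_weights] = 0.0    # assumed: reset buffer where a flip occurred
    return flipped, error_buffer
```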
Continuing with fig. 2, the local speech recognition module 120 processes the speech features extracted from the speech input at least through the second acoustic model 121 to output the second speech recognition result 129 in real time. In particular, this function of the local speech recognition module 120 may be implemented, for example, by the decode output unit 123. The decode output unit 123 performs matching and comparison processing on the speech features through the second acoustic model 121 and performs language processing, through the second language model 122, on the result processed by the second acoustic model 121, so as to decode and output the second speech recognition result 129.
Specifically, for example, a per-frame acoustic score is first calculated for the extracted features via the BNN network of the second acoustic model, and the scores are then fed into a language model (e.g., the second language model 122) that stores statistical probabilities between words or phrases. For example, the word "you" may be followed by words such as "are", "go", or "good", and these candidates occur after "you" with different probabilities in everyday text or conversation; the language model is constructed from such statistics. Once the BNN acoustic scores and the language-model scores are combined, the speech is essentially converted into meaningful words, and the second speech recognition result 129 is obtained and output.
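A deliberately simplified, word-level sketch of this score combination is given below; it uses a greedy pass rather than a full Viterbi search, and the function names, back-off penalty, and weighting are assumptions of this sketch:

```python
import math

def decode(candidate_scores, bigram_logprob, lm_weight=0.8):
    """Greedily combine acoustic scores with bigram language-model probabilities.

    candidate_scores: list of dicts, one per recognized segment,
                      mapping candidate word -> acoustic log-score
                      (e.g., derived from the BNN acoustic model)
    bigram_logprob:   dict mapping (previous_word, word) -> log-probability
    """
    hypothesis, prev = [], "<s>"
    for candidates in candidate_scores:
        best_word, best_score = None, -math.inf
        for word, am_score in candidates.items():
            lm_score = bigram_logprob.get((prev, word), -10.0)  # back-off penalty
            total = am_score + lm_weight * lm_score
            if total > best_score:
                best_word, best_score = word, total
        hypothesis.append(best_word)
        prev = best_word
    return " ".join(hypothesis)
```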
It should be noted that, because the second acoustic model 121 is constructed using the BNN algorithm, the floating-point calculations used in the neural networks of commonly used acoustic models are quantized to (1, -1), so that the processor can process the speech features using XOR-style bit operations, greatly improving operating efficiency and reducing the amount of computation. For example, the local speech recognition module 120 of embodiments of the present invention may achieve roughly a 32-fold efficiency improvement compared with computation in a 32-bit floating-point neural network. Moreover, the recognition efficiency is improved without pruning the structure of the network model, so the integrity of the BNN model is preserved and the recognition rate remains relatively high. Therefore, compared with conventional local speech recognition modules using, for example, GMM-HMM or DNN models (under the same training conditions), the local speech recognition module 120 achieves a better balance between recognition rate and real-time performance, with less processor overhead.
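The bit-level trick alluded to here can be sketched generically as follows; this illustrates binarized dot products via XNOR and popcount, with the packing into Python integers an assumption made for brevity, not code from the patent:

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an integer bit mask (+1 -> bit 1, -1 -> bit 0)."""
    mask = 0
    for i, s in enumerate(signs):
        if s > 0:
            mask |= 1 << i
    return mask

def binary_dot(x_bits, w_bits, n):
    """Dot product of two +1/-1 vectors of length n using XNOR + popcount.

    Each matching bit contributes +1 and each mismatch -1, so the result is
    2 * popcount(XNOR) - n; no floating-point multiplications are needed.
    """
    xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)
    matches = bin(xnor).count("1")
    return 2 * matches - n

# Example: [-1, +1, +1, -1] . [+1, +1, -1, -1] = -1 + 1 - 1 + 1 = 0
x = pack_bits([-1, 1, 1, -1])
w = pack_bits([1, 1, -1, -1])
assert binary_dot(x, w, 4) == 0
```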
Because the speech recognition performed by the local speech recognition module 120 requires little computation and is efficient, in embodiments of the present disclosure the local speech recognition module 120 processes every speech input in real time and outputs the corresponding second speech recognition result 129, regardless of the network connection status of the speech recognition apparatus 100 with respect to the remote speech recognition module 800 (there is no need to first check the network connection before deciding whether to recognize locally); the processor of the speech recognition apparatus 100 can bear this load with good timeliness. Thus, in some cases (e.g., when the network connection between the speech recognition apparatus 100 and the far-end speech recognition module 800 is good), the speech recognition apparatus 100 may obtain both the first speech recognition result 189 and the second speech recognition result 129 for the same speech input.
Further, the speech recognition apparatus 100 may also have a speech application module 130 installed or configured in it, which may be implemented, for example and without limitation, as an installed APP. In one embodiment, the speech application module 130 may be provided with a recognition result selection unit 131: when the recognition result selection unit 131 determines that the first speech recognition result 189 from the far-end speech recognition module 800 arrives later than the output time of the local second speech recognition result 129 by more than a predetermined time threshold Tth, it selects the local second speech recognition result 129; otherwise it selects the relatively more accurate first speech recognition result 189. Thus, when the network connection state is poor or the network delay is long, the lateness threshold Tth is easily exceeded and the speech recognition apparatus 100 can directly use the already generated second speech recognition result 129, so real-time performance is ensured; meanwhile, the recognition rate of the second speech recognition result 129 readily meets commercial requirements, and the user experience is good.
By comparison, a conventional voice recognition device that mixes a local recognition engine and a cloud recognition engine generally must first determine whether the network connection to the cloud recognition engine meets a predetermined requirement and only then decide whether to use the local recognition engine. By the time it is determined that the network connection does not meet the requirement, a delay has already been incurred, and the local recognition engine then still has to perform its recognition, adding further delay; real-time performance over the whole process is therefore difficult to ensure, and the user experience is poor.
The predetermined time threshold Tth may be set according to the requirements on the network connection status, for example to between 100 ms and 1 second. Likewise, when the network is interrupted and the first speech recognition result 189 from the remote speech recognition module 800 is not received at all, this can be treated as arriving later than the output time of the local second speech recognition result 129 by more than the predetermined time threshold Tth, and the second speech recognition result 129 may also be used directly as the recognition result of the speech input.
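A minimal sketch of this arbitration is shown below, assuming an asynchronous setting in which the local result is already available and the remote result may or may not arrive in time; the asyncio structure and names are illustrative assumptions, not the patent's code:

```python
import asyncio

PREDETERMINED_THRESHOLD_S = 0.3   # Tth, e.g. somewhere in the 100 ms .. 1 s range

async def select_result(local_result: str, remote_task: asyncio.Task) -> str:
    """Prefer the (usually more accurate) remote result, but only if it arrives
    within Tth; otherwise fall back to the already available local result."""
    try:
        return await asyncio.wait_for(remote_task, timeout=PREDETERMINED_THRESHOLD_S)
    except (asyncio.TimeoutError, ConnectionError):
        return local_result
```

Here the timeout is counted from the moment the local result is available, which approximates the patent's comparison between the two results' reception and output times.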
Taking as an example a voice input in which the user says "open the sunroof", the remote voice recognition module 800 and the local voice recognition module 120 may process the voice input "open the sunroof" in parallel to obtain, respectively, a first voice recognition result 189 and a second voice recognition result 129 corresponding to that input. If there is no network delay, or the network connection condition is good, the speech application module 130 receives the first speech recognition result 189 in time alongside the second speech recognition result 129 and selects the first result, which has the higher recognition rate, as the speech recognition result. If there is network delay, or the network connection does not meet the predetermined condition, the first speech recognition result 189 is determined to be late relative to the second speech recognition result 129 (i.e., later by more than the predetermined time threshold Tth), and the second speech recognition result 129 is adopted directly as the speech recognition result. In this way, the real-time performance of speech recognition is guaranteed regardless of the network connection condition, and the user hardly perceives any network delay.
As shown in fig. 2, in an embodiment in which the speech recognition device 100 is applied to a speech interaction system, the speech application module 130 may further include a speech interaction processing unit 132, which may generate the speech interaction information fed back for each speech input (e.g., "the sunroof has been opened" in response to "open the sunroof") according to the first speech recognition result 189 or the second speech recognition result 129 corresponding to that input. It should be noted that the manner of obtaining the voice interaction information from the voice recognition result is not limited; various existing methods may be used. For example, a database may be associated with the voice interaction processing unit 132 and searched according to the voice recognition result to obtain the appropriate voice interaction information.
As further shown in fig. 2, the voice interaction system of an embodiment of the present disclosure is further provided with a voice interaction output 300, which may be implemented by a speaker, for example.
The voice interaction system illustrated in fig. 2 offers good real-time performance, and its recognition rate remains guaranteed even when the network connection is poor; the user experience during voice interaction can therefore be greatly improved, particularly when applied to a vehicle-mounted voice interaction scenario.
Fig. 3 is a schematic block diagram illustrating a voice interaction system according to another embodiment of the present disclosure, in which a voice recognition apparatus 200 according to another embodiment of the present disclosure is used.
Compared to the speech recognition apparatus 100 shown in fig. 2, the speech recognition apparatus 200 is further provided with a feature extraction unit 211 and a Voice Activity Detection (VAD) unit 212, which may be provided in the front-end processing module 210. The feature extraction unit 211 is configured to perform feature extraction on an audio signal containing a speech input, and the purpose of the feature extraction is to extract a speech feature sequence that varies with time from a speech waveform. As described previously, the extracted feature data may be sent to the local speech recognition module 120 and the far-end speech recognition module 800 for speech recognition processing; of course, it can also be sent to the VAD unit 212 for voice endpoint detection.
The voice activity detection unit 212 is configured with a third acoustic model 213 constructed based on the BNN algorithm, and uses it to detect endpoint information (also called boundary information) of the user's voice input, e.g., a voice start endpoint and/or a voice end endpoint. In an embodiment, the voice activity detection unit 212 is further configured to determine, through the third acoustic model 213, whether the voice features correspond to a voice state or a silence state (both of which may be predefined), so as to determine the voice start endpoint and voice end endpoint of one voice input according to the continuity characteristics of the voice features.
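A minimal sketch of such endpoint detection follows, assuming a frame-level speech/silence classifier is available; the classifier call and the continuity thresholds below are illustrative assumptions:

```python
def detect_endpoints(frames, is_speech, min_speech_frames=3, min_silence_frames=30):
    """Turn per-frame speech/silence decisions into (start, end) endpoint pairs.

    frames:    sequence of feature vectors, one per frame
    is_speech: callable (e.g., a BNN-based classifier), frame -> True/False
    A segment starts after a short run of speech frames and ends after a
    sufficiently long run of silence frames (the "continuity" criterion).
    """
    segments, start, speech_run, silence_run = [], None, 0, 0
    for i, frame in enumerate(frames):
        if is_speech(frame):
            speech_run, silence_run = speech_run + 1, 0
            if start is None and speech_run >= min_speech_frames:
                start = i - speech_run + 1                   # voice start endpoint
        else:
            silence_run, speech_run = silence_run + 1, 0
            if start is not None and silence_run >= min_silence_frames:
                segments.append((start, i - silence_run))    # voice end endpoint
                start = None
    if start is not None:
        segments.append((start, len(frames) - 1))
    return segments
```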
It should be noted that, since the third acoustic model 213 and the second acoustic model 121 are both constructed based on the BNN algorithm, their basic construction principles may be the same or similar, and models with the same or a similar main structure may be obtained; however, given the different applications, the specific construction may differ, for example in the data used for training.
In an embodiment, the third acoustic model 213 is constructed by training a model based on the BNN algorithm (i.e., training a BNN network), in the same way as described above for the second acoustic model 121: the matrices and bias parameters, normally represented by single- or double-precision floating-point numbers, are quantized to 1 and -1 during training; after being updated by the BP algorithm, a parameter is set directly to -1 if its value is less than 0 and to 1 if its value is greater than 0; and, to prevent parameters from never flipping because the back-propagated error updates are too small, a buffer for accumulating the backward error is maintained for each parameter, with the flip decision finally made based on the parameter value plus the accumulated error in the buffer.
Because the voice activity detection unit 212 of the disclosed embodiments detects the endpoints (or boundaries) of the voice input using the third acoustic model 213 constructed based on the BNN algorithm, two benefits follow. On the one hand, as explained above, the BNN algorithm has relatively low computational overhead and high detection efficiency compared with conventional algorithms, so the real-time performance of detection is ensured. On the other hand, this neural-network approach is completely different from the traditional approach of detecting endpoints through signal analysis and processing: endpoint detection accuracy is good, and the voice start endpoint and voice end endpoint of each of the user's voice inputs in the incoming audio signal are easy to identify.
Given that the voice activity detection unit 212 offers both accurate and efficient endpoint detection, it can further determine each individual voice input of the user based on the detected endpoint information; that is, every voice input contained in the audio signal can be delimited. The voice input can therefore be sent to the remote voice recognition module 800 and the local voice recognition module 120 on a per-input basis. On the one hand, this removes redundant information outside each voice input, reduces the speech recognition workload of the remote voice recognition module 800 and especially of the local voice recognition module 120, and improves the real-time performance of both. On the other hand, the remote voice recognition module 800 can output a first voice recognition result 189 corresponding to each voice input, and the local voice recognition module 120 can output a second voice recognition result 129 corresponding to each voice input.
It should be noted that, when delimiting each speech input, the algorithm may be defined according to the characteristics of typical utterances in conversation or speech, for example the minimum time interval between utterances.
Taking as an example a user issuing the voice command "close the sunroof and turn on the air conditioner", the voice activity detection unit 212 can detect the voice start endpoint and voice end endpoint in the audio signal containing this input, determine the audio between those endpoints to be one voice input of the user, and send that voice input to the far-end voice recognition module 800 and the local voice recognition module 120 for recognition processing; each module can then produce a recognition result corresponding to the single voice input "close the sunroof and turn on the air conditioner".
Thus, the remote speech recognition module 800 and the local speech recognition module 120 can output a recognition result for each of the user's successive voice inputs, which is very advantageous in voice interaction applications. In an embodiment, the voice interaction processing unit 132 in the voice application module 130 obtains the recognition result corresponding to each voice input, so the voice interaction information responding to each input can be generated according to the corresponding first voice recognition result 189 or second voice recognition result 129; this reduces the computation of the generation process, is more efficient, yields more accurate voice interaction information, and gives a better user experience.
In other embodiments, when the local speech recognition module 120 is in a sleep state, the speech interaction method includes a step of waking it up. Generally, a corresponding wake-up feature word (e.g., "Nomi") may be defined; the user includes the wake-up feature word in the first speech input, and the front-end processing module 210 or the VAD unit 212 detects the wake-up feature word and, upon detection, outputs a wake-up signal to the local speech recognition module 120.
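A minimal sketch of this wake-up step is shown below, assuming the recognizer exposes a transcription of the current segment and the wake-up word is configurable; all names are illustrative, not the patent's API:

```python
WAKE_WORD = "nomi"   # assumed wake-up feature word; "Nomi" is the example used in the text

def check_wake_word(transcript: str, wake_local_module) -> bool:
    """Scan a recognized segment for the wake-up word and, if present,
    signal the sleeping local speech recognition module to wake up."""
    if WAKE_WORD in transcript.lower():
        wake_local_module()          # e.g., send the wake-up signal
        return True
    return False
```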
It should be noted that other components of the speech interaction system/speech recognition apparatus 200 of the embodiment shown in fig. 3 are already described in the speech interaction system/speech recognition apparatus 100 of the embodiment shown in fig. 2, and are not described again here.
It will be appreciated that the specific application of the voice interaction system/voice recognition apparatus of the above example is not limiting; for example, the voice recognition apparatus may be applied as an in-vehicle voice recognition apparatus in an in-vehicle voice interaction system. In this way, the problems caused by the often unstable wireless communication connection under driving conditions can be addressed, and in-vehicle users in particular will notice the improved experience provided by the voice interaction system/voice recognition device of the above example.
FIG. 4 is a flow diagram illustrating a method of speech recognition in accordance with one or more embodiments of the present disclosure. The following description uses as an example the speech interaction system and speech recognition apparatus 200 of the embodiment shown in fig. 3.
In step S410, endpoint information corresponding to the voice input is detected through the third acoustic model 213. As described above, the third acoustic model 213 is constructed based on the BNN algorithm, so this step can efficiently and accurately detect the endpoint information corresponding to each speech input, including the speech start endpoint and/or the speech end endpoint. In an embodiment, the speech state and the silence state may be predefined, and the third acoustic model 213 determines whether each speech feature corresponds to the speech state or the silence state, so that the speech start endpoint and speech end endpoint of one voice input are determined according to the continuity characteristics of the speech features.
In step S420, each voice input of the user is determined based on the endpoint information of the detected voice input. Thus, in the subsequent process, the voice recognition processing can be carried out according to each voice input, and the voice recognition result of each voice input can be obtained.
In step S431, each voice input is sent to the far-end voice recognition module 800, for example, on-line via the network 700. It will be appreciated that this transmission process may cause network delays if the wireless network connection is in poor condition.
In step S432, the online receiving far-end speech recognition module 800 outputs the first speech recognition result 189 corresponding to the speech input after online processing of each speech input.
In step S441, the speech features extracted from each speech input are processed by at least the second acoustic model 121 of the local speech recognition module 120 to output the second speech recognition result 129 corresponding to the speech input.
In step S450, it is determined whether the first speech recognition result 189 is received later than the output time of the second speech recognition result 129 by more than the predetermined time threshold Tth. If yes, indicating that the network connection is currently interrupted or that the connection status does not meet the predetermined condition, the process proceeds to step S461 and the second speech recognition result 129 is used as the recognition result of the speech input; if no, indicating that the network connection status is good, step S462 uses the relatively more accurate first speech recognition result 189 as the recognition result of the speech input.
Optionally, a step S470 may further be included, in which the voice interaction information in reply to each voice input is generated according to the first voice recognition result 189 or the second voice recognition result 129 of that voice input; this step may be completed in the voice interaction processing unit 132 with the aid of a corresponding database.
Optionally, there is also a process of waking up the local speech recognition module 120 in the speech interaction method while the local speech recognition module 120 is sleeping. Generally, a corresponding wake-up feature (e.g., "Nomi" or the like) may be defined, and a user may input a voice input including the wake-up feature in a first voice input, and may detect the wake-up feature in the voice input through the VAD unit 212 and output a wake-up signal to the local voice recognition module 120 when the wake-up feature is detected, so as to wake up the local voice recognition module 120.
The voice recognition method of the above example has a high recognition rate and good timeliness; even when the network connection is poor or the network delay is severe, the user is essentially unaware of it, and the user experience is good. The method can also perform recognition per voice input and output a recognition result corresponding to each input, which facilitates and simplifies the voice interaction process.
FIG. 5 is a flow diagram illustrating a method of voice interaction in accordance with one or more embodiments of the present disclosure. The following description uses the voice interaction system of the embodiment shown in fig. 3, taking an in-vehicle voice interaction scene as the example.
In step S510, a voice input including a wake-up feature word is received from a user. The wake-up feature word may be customized, for example as "Nomi". Illustratively, user A enters the vehicle and says: "Nomi, hello"; the vehicle voice interaction system receives this voice input.
In step S520, when the wakeup feature word (e.g., "Nomi") is detected, the local speech recognition module 120 is awakened to operate.
In step S530, the voice interaction information fed back for the voice input is generated according to the first voice recognition result 189 or the second voice recognition result 129 corresponding to the voice input.
And step S540, outputting the voice interaction information to the user.
In step S550, it is determined whether a next voice input from the user is received. If yes, the method proceeds directly to step S530, and steps S530 and S540 are repeated; if no further voice input is received from the user after a certain time (e.g., the silence duration exceeds a predetermined value), indicating that the voice interaction is over, the local speech recognition module 120 may be put to sleep.
Therefore, the voice interaction method of the above example can realize the following interaction pattern: user speech input "wake-up word" or "wake-up word + utterance 1" → feedback of voice interaction information 1 → user speech input "utterance 2" → feedback of voice interaction information 2 → … → user speech input "utterance N" → feedback of voice interaction information N → prolonged silence detected → end.
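A minimal sketch of this interaction loop is given below, assuming hypothetical helpers for listening, recognizing, and replying; all names and the silence timeout value are illustrative, not the patent's API:

```python
SILENCE_TIMEOUT_S = 10.0   # assumed "prolonged silence" value that ends a conversation

def interaction_loop(listen, recognize, reply, wake_word="nomi"):
    """Wake on the wake-up word, then keep answering successive voice inputs
    until the user stays silent for longer than SILENCE_TIMEOUT_S."""
    # S510/S520: wait for a first input that contains the wake-up word
    while wake_word not in recognize(listen(timeout=None)).lower():
        pass
    # S530..S550: answer each further input; a timeout means the dialogue is over
    while True:
        utterance = listen(timeout=SILENCE_TIMEOUT_S)
        if utterance is None:          # prolonged silence detected: end, module may sleep
            break
        reply(recognize(utterance))    # generate and output the feedback information
```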
The voice interaction method of the above example may implement the following specific voice interaction process:
the user A: "Nomi, hello! "
Voice interactive system Nomi: "hello! Today the weather is good. "
The user A: "Ye, so I ride the bicycle out to the following. "
Nomi: "Tai Zhu! I want to roll round as well, and do not walk together. "
The user A: "good o! "
In the above voice interaction process, each voice input of user A can be accurately delimited, so it is not necessary for every voice input to contain the wake-up feature word (e.g., "Nomi") as in existing voice interaction schemes; the conversation between the user and the voice interaction system can continue uninterrupted, the interaction pattern more closely resembles natural chat between people, and users find it more comfortable.
It should be noted that some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The present application is described above with reference to block diagrams and/or flowchart illustrations of methods, systems, and apparatus according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may be stored in a computer-readable memory such as that shown in fig. 1, which may direct a computer or other programmable processor to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may be loaded onto a computer or other programmable data processor to cause a series of operational steps to be performed on the computer or other programmable processor to produce a computer implemented process such that the instructions which execute on the computer or other programmable processor provide steps for implementing the functions or acts specified in the flowchart and/or block diagram block or blocks. It should also be noted that, in some alternative implementations, the functions/acts noted in the blocks may occur out of the order noted in the flowcharts. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
The above examples mainly illustrate the speech recognition apparatus and the speech recognition method, the speech interaction system and the corresponding speech interaction method of the present disclosure. Although only a few embodiments of the present invention have been described, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (23)

1. A speech recognition apparatus capable of receiving a voice input of a user and of receiving online, from a remote speech recognition module, a first speech recognition result output after the voice input is processed online; characterized in that the speech recognition apparatus comprises:
a local speech recognition module configured with a second acoustic model constructed based on a binarization neural network algorithm;
wherein the local speech recognition module processes speech features extracted from the voice input through at least the second acoustic model to output a second speech recognition result.
2. The speech recognition apparatus of claim 1, further comprising:
and the voice application module is used for selecting to use the second voice recognition result when the receiving time of the first voice recognition result is later than the output time of the second voice recognition result by a preset time threshold or the first voice recognition result is not received, otherwise, selecting to use the first voice recognition result.
3. The speech recognition apparatus of claim 1, wherein the local speech recognition module is further configured with a second language model, and further comprising a decoding output unit;
the decoding output unit performs matching and comparison processing on the voice features through the second acoustic model and performs language processing on the result processed through the second acoustic model through the second language model so as to decode and output the second voice recognition result.
4. The speech recognition apparatus of claim 2, further comprising:
a voice activity detection unit configured with a third acoustic model constructed based on a binarization neural network algorithm and used for detecting endpoint information corresponding to the voice input through the third acoustic model.
5. The speech recognition apparatus of claim 4, wherein the endpoint information comprises a speech onset endpoint and/or a speech cutoff endpoint.
6. The speech recognition apparatus of claim 4, wherein the voice activity detection unit is further configured to determine each of the voice inputs of the user based on the detected endpoint information of the voice input;
the remote speech recognition module outputs a first speech recognition result corresponding to each voice input;
the local voice recognition module is further configured to output a second voice recognition result corresponding to each voice input.
7. The speech recognition apparatus of claim 6, wherein the speech application module comprises a speech interaction processing unit, and is configured to generate the speech interaction information fed back for each speech input according to the first speech recognition result or the second speech recognition result corresponding to the speech input.
8. The speech recognition apparatus of claim 4, wherein the voice activity detection unit is further configured to: determine, through the third acoustic model, whether the speech features correspond to a speech state or a silent state, so as to determine a voice start endpoint and a voice stop endpoint corresponding to one voice input according to the continuity characteristics of the speech features.
9. The speech recognition apparatus of claim 4, wherein the voice activity detection unit is further configured to detect a wake-up feature word in the voice input and to output a wake-up signal to the local speech recognition module when the wake-up feature word is detected.
10. The speech recognition apparatus of claim 1, wherein the speech recognition apparatus is an in-vehicle speech recognition apparatus and is applied to an in-vehicle speech interaction system.
11. A speech recognition method, which uses a local speech recognition module and a remote speech recognition module for speech recognition at the same time, comprising the steps of:
sending the voice input to the remote speech recognition module, and receiving a first speech recognition result output by the remote speech recognition module after the voice input is processed online;
processing the voice features extracted from the voice input at least through a second acoustic model of the local voice recognition module to output a second voice recognition result, wherein the second acoustic model is constructed based on a binarization neural network algorithm; and
when the receiving time of the first speech recognition result is later than the output time of the second speech recognition result by a preset time threshold or the first speech recognition result is not received, selecting the second speech recognition result as the recognition result of the voice input, and otherwise selecting the first speech recognition result as the recognition result of the voice input.
12. The speech recognition method according to claim 11, wherein in the step of outputting the second speech recognition result, the matching and comparing process is performed on the speech features by the second acoustic model and the result processed by the second acoustic model is subjected to the language processing by the second language model so as to decode and output the second speech recognition result.
13. The speech recognition method of claim 11, further comprising the steps of: detecting endpoint information corresponding to the voice input through a third acoustic model; wherein the third acoustic model is constructed based on a binarization neural network algorithm.
14. The speech recognition method of claim 13, wherein the endpoint information comprises a speech onset endpoint and/or a speech cutoff endpoint.
15. The speech recognition method of claim 13, further comprising the steps of:
determining each of the voice inputs of a user based on the detected endpoint information of the voice inputs;
in the step of receiving the first speech recognition result, receiving online a first speech recognition result corresponding to each voice input;
and in the step of outputting the second voice recognition result, outputting the second voice recognition result corresponding to each voice input.
16. The speech recognition method of claim 15, further comprising the steps of:
generating the voice interaction information fed back in response to each voice input according to the first speech recognition result or the second speech recognition result corresponding to that voice input.
17. The voice recognition method of claim 13, wherein in the detecting of the endpoint information, it is determined whether the voice feature corresponds to a voice state or a silent state through the third acoustic model, thereby determining a voice start endpoint and a voice stop endpoint corresponding to one voice input according to a continuity characteristic of the voice feature.
18. The speech recognition method of claim 13, further comprising the steps of: detecting a wake-up feature word in the voice input, and outputting a wake-up signal to the local speech recognition module when the wake-up feature word is detected.
19. A method for speech interaction using a speech recognition device according to claim 6, characterized in that it comprises the following steps:
S1: receiving a voice input of a user;
S2: generating voice interaction information fed back for the voice input according to the first speech recognition result or the second speech recognition result corresponding to the voice input;
S3: outputting the voice interaction information to the user; and
S4: receiving a next voice input of the user, and repeating steps S2 and S3 until no further voice input is received from the user.
20. The method of claim 19, further comprising, before the step S1, the step of:
S11: receiving a voice input of the user including a wake-up feature word; and
S12: waking up the local speech recognition module to operate when the wake-up feature word is detected.
21. A voice interaction system, comprising:
the speech recognition apparatus according to any one of claims 1 to 10; and
a voice interaction output terminal.
22. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor realizes the steps of the speech recognition method according to any of claims 11-18 when running the program.
23. A computer-readable storage medium, on which a computer program is stored, which program is executable by a processor for carrying out the steps of the speech recognition method as claimed in any one of claims 11 to 18.
CN201811166607.1A 2018-10-08 2018-10-08 Voice recognition device and method, voice interaction system and method Pending CN111091819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811166607.1A CN111091819A (en) 2018-10-08 2018-10-08 Voice recognition device and method, voice interaction system and method

Publications (1)

Publication Number Publication Date
CN111091819A true CN111091819A (en) 2020-05-01

Family

ID=70391174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811166607.1A Pending CN111091819A (en) 2018-10-08 2018-10-08 Voice recognition device and method, voice interaction system and method

Country Status (1)

Country Link
CN (1) CN111091819A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005301092A (en) * 2004-04-15 2005-10-27 Nippon Telegr & Teleph Corp <Ntt> Speech recognizing method and device implementing same method
CN103440867A (en) * 2013-08-02 2013-12-11 安徽科大讯飞信息科技股份有限公司 Method and system for recognizing voice
CN105551494A (en) * 2015-12-11 2016-05-04 奇瑞汽车股份有限公司 Mobile phone interconnection-based vehicle-mounted speech recognition system and recognition method
CN106601232A (en) * 2017-01-04 2017-04-26 江西沃可视发展有限公司 Vehicle mounted terminal oriented man-machine interaction system based on speech recognition
CN107170450A (en) * 2017-06-14 2017-09-15 上海木爷机器人技术有限公司 Audio recognition method and device
CN108010515A (en) * 2017-11-21 2018-05-08 清华大学 A kind of speech terminals detection and awakening method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111817936A (en) * 2020-08-12 2020-10-23 深圳市欧瑞博科技股份有限公司 Control method and device of intelligent household equipment, electronic equipment and storage medium
CN112331213A (en) * 2020-11-06 2021-02-05 深圳市欧瑞博科技股份有限公司 Intelligent household equipment control method and device, electronic equipment and storage medium
CN112331203A (en) * 2020-11-06 2021-02-05 深圳市欧瑞博科技股份有限公司 Intelligent household equipment control method and device, electronic equipment and storage medium
CN114944155A (en) * 2021-02-14 2022-08-26 成都启英泰伦科技有限公司 Offline voice recognition method combining terminal hardware and algorithm software processing
CN114944155B (en) * 2021-02-14 2024-06-04 成都启英泰伦科技有限公司 Off-line voice recognition method combining terminal hardware and algorithm software processing
CN114822510A (en) * 2022-06-28 2022-07-29 中科南京智能技术研究院 Voice awakening method and system based on binary convolutional neural network
CN114822510B (en) * 2022-06-28 2022-10-04 中科南京智能技术研究院 Voice awakening method and system based on binary convolutional neural network

Similar Documents

Publication Publication Date Title
CN106448663B (en) Voice awakening method and voice interaction device
CN111091819A (en) Voice recognition device and method, voice interaction system and method
US11848008B2 (en) Artificial intelligence-based wakeup word detection method and apparatus, device, and medium
US11817094B2 (en) Automatic speech recognition with filler model processing
CN107240398B (en) Intelligent voice interaction method and device
WO2017071182A1 (en) Voice wakeup method, apparatus and system
CN110047481B (en) Method and apparatus for speech recognition
CN114830228A (en) Account associated with a device
CN110890093A (en) Intelligent device awakening method and device based on artificial intelligence
US11258671B1 (en) Functionality management for devices
WO2015014122A1 (en) Voice interaction method and system and interaction terminal
WO2020024620A1 (en) Voice information processing method and device, apparatus, and storage medium
JP2021140134A (en) Method, device, electronic apparatus, computer readable storage medium, and computer program for recognizing speech
CN110349575A (en) Method, apparatus, electronic equipment and the storage medium of speech recognition
CN108595406B (en) User state reminding method and device, electronic equipment and storage medium
CN113674746B (en) Man-machine interaction method, device, equipment and storage medium
CN113674742B (en) Man-machine interaction method, device, equipment and storage medium
CN109697981B (en) Voice interaction method, device, equipment and storage medium
CN103680505A (en) Voice recognition method and voice recognition system
US20240013784A1 (en) Speaker recognition adaptation
CN114120979A (en) Optimization method, training method, device and medium of voice recognition model
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN109524010A (en) A kind of sound control method, device, equipment and storage medium
CN115831109A (en) Voice awakening method and device, storage medium and electronic equipment
KR20240090400A (en) Continuous conversation based on digital signal processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200810

Address after: Susong Road West and Shenzhen Road North, Hefei Economic and Technological Development Zone, Anhui Province

Applicant after: Weilai (Anhui) Holding Co.,Ltd.

Address before: 30 Floor of Yihe Building, No. 1 Kangle Plaza, Central, Hong Kong, China

Applicant before: NIO NEXTEV Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200501