CN115132196A - Voice instruction recognition method and device, electronic equipment and storage medium - Google Patents

Voice instruction recognition method and device, electronic equipment and storage medium

Info

Publication number
CN115132196A
CN115132196A CN202210551539.0A
Authority
CN
China
Prior art keywords
phoneme
timestamp
sequence
acoustic
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210551539.0A
Other languages
Chinese (zh)
Inventor
杨展恒
孙思宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210551539.0A priority Critical patent/CN115132196A/en
Publication of CN115132196A publication Critical patent/CN115132196A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice instruction recognition method, apparatus, device and storage medium, and relates to the field of artificial intelligence speech recognition. In the method, a speech signal is input into an acoustic model to obtain a first phoneme recognition result corresponding to each speech frame in the speech signal and a first acoustic hidden layer representation vector of each speech frame output by an encoder in the acoustic model; the first phoneme recognition result is then input into a decoding graph to obtain a phoneme sequence corresponding to the speech signal and a first timestamp of the speech frames corresponding to the phoneme sequence; finally, whether the phoneme sequence is used for triggering an instruction is determined according to the vector portion of the first acoustic hidden layer representation vector that corresponds to the first timestamp. The voice instruction recognition model in the embodiments of the application does not require a front-end voice wake-up model, such as a KWS system, so false wake-ups can be suppressed without significantly increasing the complexity of the system, and the user can interact with the device directly.

Description

Voice instruction recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technology, and more particularly, to a method, an apparatus, a device and a storage medium for speech instruction recognition.
Background
In recent years, with the rapid development of deep learning, End-to-End (E2E) Automatic Speech Recognition (ASR) technology has gained popularity for its simplified architecture and excellent performance. E2E ASR neural network structures have also been introduced into the Keyword Spotting (KWS) task to improve keyword recognition performance.
In contrast to the ASR task, the Speech Command Recognition (SCR) task has a fixed target instruction set and a relatively limited search space. To constrain the recognition result to the target instruction set, the acoustic model may be followed by a decoder that further optimizes the task objective. For example, an attention bias mechanism may be added after the acoustic model to steer the recognition result toward a given keyword. However, as the number of instructions in a task increases, the number of false wake-ups (FAs) grows multiplicatively, which is unacceptable to a user for a device running 24 hours a day.
A conventional scheme usually places a KWS system, running around the clock, in front of the SCR system; because the KWS system detects only a single keyword, false wake-ups can be well suppressed. However, a KWS system running 24 hours a day in front of the SCR system significantly increases the complexity of the system, and the user cannot interact with the device directly.
Disclosure of Invention
The embodiments of the application provide a method, an apparatus, a device and a storage medium for voice instruction recognition, which help to suppress false wake-ups without significantly increasing the complexity of the system, while allowing the user to interact with the device directly.
In a first aspect, a method for voice instruction recognition is provided, including:
acquiring a voice signal, wherein the voice signal comprises a plurality of voice frames;
inputting the voice signal into an acoustic model to obtain a first phoneme recognition result corresponding to each voice frame in the voice signal and a first acoustic hidden layer expression vector of each voice frame output by an encoder in the acoustic model, wherein the first phoneme recognition result comprises probability distribution of each voice frame in a phoneme space, and the acoustic model is obtained by training a voice signal sample and actual phonemes of each voice frame in the voice signal sample;
inputting the first phoneme recognition result into a decoding graph to obtain a phoneme sequence corresponding to the voice signal and a first time stamp corresponding to the phoneme sequence;
determining whether the phoneme sequence is used for triggering an instruction according to the vector portion of the first acoustic hidden layer representation vector that corresponds to the first timestamp.
In a second aspect, an apparatus for voice command recognition is provided, including:
an acquisition unit, configured to acquire a voice signal, the voice signal comprising a plurality of voice frames;
an acoustic model, configured to take the voice signal as input and obtain a first phoneme recognition result corresponding to each speech frame in the speech signal and a first acoustic hidden layer representation vector of each speech frame output by an encoder in the acoustic model, wherein the first phoneme recognition result comprises a probability distribution of each speech frame in a phoneme space, and the acoustic model is obtained by training on a speech signal sample and the actual phonemes of each speech frame in the speech signal sample;
a decoding graph, configured to take the first phoneme recognition result as input and obtain a phoneme sequence corresponding to the speech signal and a first timestamp corresponding to the phoneme sequence;
a verification module, configured to determine whether the phoneme sequence is used for triggering an instruction based on the vector portion of the first acoustic hidden layer representation vector that corresponds to the first timestamp.
In a third aspect, the present application provides an electronic device, comprising:
a processor adapted to implement computer instructions; and
a memory storing computer instructions adapted to be loaded by the processor and to perform the method of the first aspect described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions, which, when read and executed by a processor of a computer device, cause the computer device to perform the method of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the first aspect described above.
Based on the above technical solutions, in the embodiments of the present application, after a phoneme sequence and a first timestamp corresponding to the phoneme sequence are obtained through the voice instruction recognition model (i.e., the acoustic model and the decoding graph), whether the phoneme sequence is used for triggering an instruction can be further verified according to the vector portion corresponding to the first timestamp in the first acoustic hidden layer representation vector of each speech frame output by the encoder of the acoustic model. That is, on the basis of a traditional voice instruction recognition model, the embodiments further verify the phoneme sequence that triggers an instruction according to the acoustic hidden layer representation vector corresponding to that phoneme sequence, which can improve the reliability of instruction triggering. The voice instruction recognition model in the embodiments of the application does not require a front-end voice wake-up model, such as a KWS system, so false wake-ups can be suppressed without significantly increasing the complexity of the system, and the user can interact with the device directly. For example, the voice instruction recognition system of the embodiments of the application may occupy less storage and computing resources. For another example, the voice instruction recognition system of the embodiments of the application can reduce the number of false wake-ups, so that it can be deployed and run 24 hours a day.
Drawings
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a Transducer model;
FIG. 3 is a schematic flow chart diagram of a method of voice command recognition according to an embodiment of the present application;
fig. 4 is a schematic diagram of a network architecture for voice command recognition according to an embodiment of the present application;
FIG. 5 is a diagram illustrating a process of voice command recognition using the network architecture shown in FIG. 4;
FIG. 6 is a schematic flow chart of a manner of verifying a trigger instruction according to an embodiment of the present disclosure;
FIG. 7 is a diagram illustrating a process for forced alignment according to an embodiment of the present disclosure;
FIG. 8 is a schematic flow chart diagram illustrating another exemplary manner of validating a trigger instruction according to an embodiment of the present disclosure;
FIG. 9 is a schematic flow chart diagram illustrating another manner of validating a trigger instruction according to an embodiment of the present application;
FIG. 10 is a ROC curve for the clean test set;
FIG. 11 is a ROC curve corresponding to the noise test set;
FIG. 12 is a schematic block diagram of an apparatus for speech instruction recognition provided by an embodiment of the present application;
fig. 13 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be understood that in the embodiment of the present application, "B corresponding to a" means that B is associated with a. In one implementation, B may be determined from a. It should also be understood that determining B from a does not mean determining B from a alone, but may be determined from a and/or other information.
In the description of the present application, "at least one" means one or more, and "a plurality" means two or more, unless otherwise specified. In addition, "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of singular or plural items. For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b and c may each be single or multiple.
It should be further understood that the descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent a particular limitation to the number of devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application.
It should also be appreciated that a particular feature, structure, or characteristic described in connection with an embodiment is included in at least one embodiment of the application. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The scheme provided by the application can relate to artificial intelligence technology.
Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason and make decisions.
It should be understood that artificial intelligence is a comprehensive discipline covering a wide range of fields, involving both hardware and software technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Embodiments of the present application may relate to Speech Technology in artificial intelligence. The key technologies of speech technology include Automatic Speech Recognition (ASR), speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field therefore involves natural language, i.e., the language people use every day, and is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The embodiments of the application may also relate to Machine Learning (ML) in artificial intelligence. ML is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. It specifically studies how a computer can simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
With the research and development of artificial intelligence technology, the artificial intelligence technology is developed and researched in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical services, smart customer service and the like.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application.
As shown in fig. 1, the scenario includes an acquisition device 101 and a computing device 102. The acquisition device 101 is used to collect voice data, such as raw speech when a user speaks. The computing device 102 is used to process the voice data collected by the acquisition device 101, for example to perform ASR, SCR, and the like.
Illustratively, the acquisition device 101 may be a microphone, a microphone array, or another device with voice capture capability.
Illustratively, the computing device 102 may be a user device, such as a mobile phone, a computer, a smart voice interaction device, a smart appliance, a vehicle-mounted terminal, a wearable terminal, an aircraft, a Mobile Internet Device (MID), or another terminal device with voice processing capability.
Illustratively, the computing device 102 may be a server. The server may be one or more. When the number of the servers is multiple, at least two servers exist for providing different services, and/or at least two servers exist for providing the same service, for example, the same service is provided in a load balancing manner, which is not limited in the embodiment of the present application. The server can be provided with a neural network model, and the server provides support for the training and application process of the neural network model. The server can also be provided with a speech processing device for ASR or SCR of speech, and the server provides support for the application process of the speech processing device.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, CDN (Content Delivery Network), big data, an artificial intelligence platform, and the like. Servers may also become nodes of a blockchain.
In some embodiments, the acquisition device 101 and the computing device 102 may be implemented as the same hardware device. For example, the computing device 102 is a user device and the acquisition device 101 is a microphone built into the user device.
In some embodiments, the acquisition device 101 and the computing device 102 may be implemented as different hardware devices. For example, the acquisition device 101 is a microphone disposed on a steering wheel of a vehicle, and the computing device 102 may be an in-vehicle smart device; for another example, the acquisition device 101 is a microphone disposed on an intelligent household device (such as an intelligent television, a set-top box, an air conditioner, and the like), and the computing device 102 may be a home computing center, such as a mobile phone, a television, a router, and the like, or a cloud server, and the like; as another example, the capture device 101 may be a microphone on a personal wearable device (e.g., a smart bracelet, a smart watch, a smart headset, smart glasses, etc.), and the computing device 102 may be a wearable device, such as a cell phone.
When the capture device 101 and the computing device 102 are implemented as different hardware devices, the capture device 101 may be connected to the computing device 102 over a network. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a Global System for Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth (Bluetooth), Wi-Fi, or a communication network, without limitation.
The following describes a related art to which embodiments of the present application relate.
Transducer model: the Transducer model is a streaming E2E ASR framework that can directly convert the features of an input audio stream into a text result in real time, and it has certain advantages in resource consumption and accuracy over traditional speech recognition models. The Transducer framework has also been introduced into the KWS task as an acoustic model.
For a given input sequence x = (x_1, x_2, ..., x_t) ∈ X*, the Transducer model produces an output sequence y = (y_1, y_2, ..., y_u) ∈ Y*,
where X* denotes the set of all input sequences, Y* denotes the set of all output sequences, x_t ∈ X and y_u ∈ Y are real-valued vectors, and X and Y denote the input space and the output space, respectively.
For example, when the Transducer model is used for phoneme recognition, the input sequence x is a sequence of feature vectors, such as filter bank (FBank) features or Mel Frequency Cepstral Coefficient (MFCC) features, where x_t denotes the feature vector at time t; the output sequence y is a phoneme sequence, where y_u denotes the phoneme at step u.
The output space, which may also be referred to as a phoneme space, includes a plurality of phonemes. Optionally, the phoneme space further comprises a null output. Due to the introduction of the null output, the output sequence may be the same length as the input sequence.
Fig. 2 shows a schematic structural diagram of a Transducer model. As shown in fig. 2, the Transducer model includes an Encoder 21, a Predictor 22, and a Joint Network 23.
The encoder 21 may be a recurrent neural network, such as a Long Short-Term Memory (LSTM) network, or a Feedforward Sequential Memory Network (FSMN). The encoder 21 receives the audio feature input x_t at time t and outputs an acoustic hidden layer representation h_t^enc.
The predictor 22 may be a recurrent neural network, such as an LSTM, or a Convolutional Neural Network (CNN). The predictor 22 receives the model's previous non-null output label y_{u-1} and outputs a text hidden layer representation h_{u-1}^pred.
The joint network 23 may be a fully connected neural network, such as a linear layer plus an activation unit, which combines the acoustic hidden layer representation h_t^enc and the text hidden layer representation h_{u-1}^pred by linear transformation and summation, and outputs a hidden unit representation z.
Finally, the output of the joint network 23 may be converted into a probability distribution via a softmax function.
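For illustration only, the data flow through the encoder, predictor and joint network described above can be sketched as follows. This is a minimal sketch rather than the implementation of the embodiments: the use of PyTorch, the layer types (an LSTM encoder and an embedding predictor standing in for the FSMN/CNN modules), and all dimensions are assumptions.

```python
# Minimal sketch of a Transducer-style forward pass (assumed PyTorch layers and sizes).
import torch
import torch.nn as nn

class TransducerSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=256, num_outputs=213):  # 212 phonemes + null output (assumed)
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)   # stands in for the encoder 21
        self.predictor = nn.Embedding(num_outputs, hidden_dim)           # stands in for the predictor 22
        self.joint = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                   nn.Linear(hidden_dim, num_outputs))   # joint network 23

    def forward(self, features, prev_label):
        # features: (batch, T, feat_dim); prev_label: (batch,) previous non-null output label y_{u-1}
        h_enc, _ = self.encoder(features)            # acoustic hidden layer representation h_t^enc
        h_pred = self.predictor(prev_label)          # text hidden layer representation h_{u-1}^pred
        z = self.joint(h_enc + h_pred.unsqueeze(1))  # combine by summation, then linear layer + activation
        return z.log_softmax(dim=-1), h_enc          # frame-level distribution over the phoneme space, and h_enc

model = TransducerSketch()
logp, h_enc = model(torch.randn(1, 100, 80), torch.zeros(1, dtype=torch.long))
print(logp.shape, h_enc.shape)   # (1, 100, 213), (1, 100, 256)
```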
Tiny Transducer model: in SCR or KWS tasks, a small model size and high real-time performance are often required because the model must be deployed on actual devices. Compared with the traditional Transducer framework, the Tiny Transducer meets these requirements very well.
The Tiny Transducer reduces the model size and optimizes the inference speed on the basis of the Transducer, making it more suitable for streaming deployment on end-side devices. Compared with the traditional Transducer framework, the Tiny Transducer uses an FSMN on the encoder side, which improves the inference speed of the model. On the decoder side, a single-layer CNN with a small convolution kernel is used, which greatly reduces model complexity and, for tasks with little dependency on context information such as SCR, alleviates the problem of over-fitting to context information on the decoder side. Research shows that the Tiny Transducer, used as an acoustic model, can maintain comparable recognition performance while greatly reducing model size and inference time.
One SCR scheme uses a Transducer model in place of traditional stacked Recurrent Neural Network (RNN) or CNN modules as the acoustic model to improve the accuracy of instruction recognition. However, as the number of instructions in a task increases, the number of false wake-ups also grows multiplicatively, which is unacceptable to a user for a device running 24 hours a day.
Another scheme places a KWS system, running 24 hours a day, in front of the SCR system; because it detects only one keyword, false wake-ups can be well suppressed. When the KWS system detects the specific keyword, the SCR system runs within a limited time window so that the user does not perceive obvious false wake-ups. However, a KWS system running in front of the SCR system around the clock significantly increases the complexity of the system and does not match the user's habit of interacting with the device directly.
In view of this, embodiments of the present application provide a method, an apparatus, a device, and a storage medium for voice command recognition, which can help to suppress false wake-up without significantly increasing the complexity of a system, and a user can directly interact with the device.
Specifically, a speech signal is input into an acoustic model to obtain a first phoneme recognition result corresponding to each speech frame in the speech signal and a first acoustic hidden layer representation vector of each speech frame output by an encoder in the acoustic model; the first phoneme recognition result is then input into a decoding graph to obtain a phoneme sequence corresponding to the speech signal and a first timestamp of the speech frames corresponding to the phoneme sequence; finally, whether the phoneme sequence is used for triggering an instruction is determined according to the vector portion of the first acoustic hidden layer representation vector that corresponds to the first timestamp.
Wherein the first phoneme recognition result comprises a probability distribution of each speech frame in the phoneme space.
The acoustic model is obtained by training a speech signal sample and the actual phonemes of each speech frame in the speech signal sample. By way of example, the acoustic model may be a Transducer model, or a Tiny Transducer model, without limitation.
Therefore, after obtaining the phoneme sequence and the first timestamp corresponding to the phoneme sequence through the speech instruction recognition model (i.e. the acoustic model and the decoding graph), the embodiment of the present application further verifies whether the phoneme sequence is used for triggering an instruction according to a vector part corresponding to the first timestamp in a first acoustic hidden layer representation vector of each speech frame output by an encoder of the acoustic model.
On the basis of a traditional voice instruction recognition model, the embodiments of the application can further verify, according to the acoustic hidden layer representation vector corresponding to the phoneme sequence of the trigger instruction, whether the phoneme sequence triggers an instruction, which improves the reliability of instruction triggering. The voice instruction recognition model in the embodiments of the application does not require a front-end voice wake-up model, such as a KWS system, so false wake-ups can be suppressed without significantly increasing the complexity of the system, and the user can interact with the device directly.
For example, the voice command recognition system of the embodiments of the present application may occupy less storage and computing resources.
For another example, the voice instruction recognition system according to the embodiment of the present application can reduce the number of false wakeups, so that 24-hour deployment and operation can be performed.
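For illustration, the overall recognize-then-verify flow described above can be summarized in the following sketch. All function and module names here are hypothetical placeholders for the acoustic model, the decoding graph and the verification module; they do not represent an actual API of the embodiments.

```python
# High-level sketch of the recognition + verification flow (all names are hypothetical).
def recognize_voice_command(speech_frames, acoustic_model, decoding_graph, verifier, threshold):
    # 1. Acoustic model: frame-level phoneme posteriors + encoder hidden layer vectors.
    phoneme_posteriors, acoustic_hidden = acoustic_model(speech_frames)

    # 2. Decoding graph: phoneme sequence (candidate trigger text) + its first timestamp.
    phoneme_seq, first_timestamp = decoding_graph.decode(phoneme_posteriors)
    if not phoneme_seq:                       # nothing matched the instruction set
        return None

    # 3. Verification: re-score the hidden vectors that correspond to the first timestamp.
    start, end = first_timestamp
    confidence = verifier(acoustic_hidden[start:end + 1], phoneme_seq)
    return phoneme_seq if confidence >= threshold else None
```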
The embodiments of the application can be applied to voice instruction recognition systems in various voice interaction scenarios. For example, in a vehicle-mounted scenario, a driver cannot free his or her hands to operate devices while driving, and can instead control the vehicle-mounted playback device, navigation device and the like by voice, which improves safety and convenience of operation. The vehicle-mounted voice instruction recognition system can recognize the user's instruction directly, without a front-end voice wake-up model such as a KWS system, which can improve the user experience.
The embodiments of the present application will be described in detail with reference to the accompanying drawings.
Fig. 3 is a schematic flow chart of a method 300 for speech instruction recognition according to an embodiment of the present application. The method 300 may be performed by any electronic device having data processing capabilities. For example, the electronic device may be implemented as the computing device 102 of fig. 1, which is not limited in this application.
In some embodiments, a machine learning model may be included (e.g., deployed) in the electronic device and may be used to perform the method 300 of speech instruction recognition. In some embodiments, the machine learning model may be a deep learning model, a neural network model, or other model, without limitation. In some embodiments, the machine learning model may be, without limitation, a speech command recognition system, a command word recognition system, or other system.
Fig. 4 is a schematic diagram of a network architecture for voice instruction recognition according to an embodiment of the present application, which includes a Transducer decoder 41, a shared encoder 42, a joint network 43, a Transformer decoder 44, and a phoneme discriminator 45. The network architecture fuses a Transducer model and a Transformer model, where the Transducer decoder 41, the Transformer decoder 44 and the phoneme discriminator 45 share the shared encoder 42 to increase the coupling degree of the system.
In some embodiments, when training the network architecture, multi-task learning may be applied to jointly optimize the Transducer loss, the Transformer loss, and the Cross-Entropy (CE) loss of the phoneme discriminator. The Transducer loss can be obtained from the output of the joint network 43 and the real text label of the speech signal sample; the Transformer loss can be obtained from the output of the Transformer decoder 44 and the real text label of the speech signal sample; and the CE loss of the phoneme discriminator can be obtained from the output of the phoneme discriminator 45 and the phonemes of each speech frame of the speech signal sample.
For example, the target loss L of the multitask learning may be specifically expressed by the following formula (1):
L = α·L_Transducer + β·L_CE + γ·L_Transformer    (1)
where L_Transducer denotes the Transducer loss, which may be, for example, an RNN-T loss, L_CE denotes the CE loss of the phoneme discriminator, L_Transformer denotes the Transformer loss, and the weights α, β and γ can each be adjusted.
As a specific example, α, β, γ may be set to 1.0, 0.8, and 0.5, respectively, which is not limited in the present application.
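As an illustration of formula (1), the joint objective might be assembled as in the sketch below. The individual loss terms are assumed to be computed elsewhere; only the weighting follows the example values above, and the function signature itself is an assumption.

```python
# Sketch of the multi-task objective in formula (1); the loss terms are assumed precomputed tensors.
def total_loss(transducer_loss, phoneme_ce_loss, transformer_loss,
               alpha=1.0, beta=0.8, gamma=0.5):
    # L = alpha * L_Transducer + beta * L_CE + gamma * L_Transformer
    return alpha * transducer_loss + beta * phoneme_ce_loss + gamma * transformer_loss
```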
The steps in method 300 will be described below in conjunction with the network architecture in fig. 4.
It should be understood that fig. 4 illustrates an example of a network architecture for performing voice command recognition, which is merely to assist those skilled in the art in understanding and implementing embodiments of the present application, and does not limit the scope of the embodiments of the present application. Equivalent alterations and modifications may be made by those skilled in the art based on the examples given herein, and such alterations and modifications are intended to be within the scope of the embodiments of the present application.
As shown in fig. 3, the method 300 of voice command recognition may include steps 310 to 340.
310, a speech signal is acquired, the speech signal comprising a plurality of speech frames.
Specifically, the speech signal may include a plurality of speech frames obtained by segmenting the original speech. For example, the electronic device may include a preprocessing portion for segmenting the original speech to obtain the speech signal.
And 320, inputting the voice signal into an acoustic model to obtain a first phoneme recognition result corresponding to each voice frame in the voice signal and a first acoustic hidden layer expression vector of each voice frame output by an encoder in the acoustic model.
The acoustic model is obtained by training a speech signal sample and the actual phonemes of each speech frame in the speech signal sample.
As an example, the acoustic model may include the Transducer decoder 41, the shared encoder 42, and the joint network 43 in fig. 4. The encoder output in the acoustic model may be, for example, the output of the shared encoder 42.
The Transducer decoder 41 and the joint network 43 are similar to the predictor 22 and the joint network 23 in fig. 2, respectively, and reference may be made to the related description of fig. 2; the function of the shared encoder 42 in the acoustic model is similar to that of the encoder 21 in fig. 2, and reference may likewise be made to the related description of fig. 2.
As another example, the acoustic model may be a Transducer model or a Tiny Transducer model, without limitation.
Fig. 5 is a schematic diagram illustrating a process of performing voice command recognition by using the network architecture shown in fig. 4.
Referring to fig. 5, stage (a) is a Transducer detection stage, which obtains a preliminary trigger text and a trigger timestamp (with starting point t_0).
In stage (a), feature extraction may be performed on the speech frame at time t of the speech signal to obtain the audio feature x_t of the speech frame. The audio feature x_t is then input into the shared encoder 42, and the model's previous non-null output y_{u-1} is input into the Transducer decoder 41. The output of the shared encoder 42 is an acoustic hidden layer representation h_t^enc, and the output of the Transducer decoder 41 is a text hidden layer representation h_{u-1}^pred. The acoustic hidden layer representation h_t^enc output by the shared encoder 42 is an example of the first acoustic hidden layer representation vector described above.
The joint network 43 may take the acoustic hidden layer representation h_t^enc and the text hidden layer representation h_{u-1}^pred as input and output the first phoneme recognition result, which may be denoted, for example, as y_t. The first phoneme recognition result comprises a probability distribution of each speech frame over the phoneme space.
A phone is the smallest phonetic unit divided according to the natural attributes of speech; it is analyzed in terms of the articulatory actions within a syllable, and one action constitutes one phone. Illustratively, phonemes fall into two broad categories, vowels and consonants; for example, the Chinese syllable ā has one phoneme, ài has two phonemes, and dài has three phonemes.
A phoneme is the smallest unit, or smallest speech segment, constituting a syllable, and is the smallest linear speech unit divided from the viewpoint of sound quality. Phonemes are concrete physical phenomena. The phonetic symbols of the International Phonetic Alphabet (the letters designated by the International Phonetic Association to uniformly transcribe the speech sounds of all languages, also referred to as the "international phonetic alphabet" or "universal phonetic alphabet") correspond one-to-one to the phonemes of human speech.
In this embodiment of the present application, for each speech frame in the speech signal, the acoustic model may identify a phoneme corresponding to the speech frame to obtain a first phoneme identification result, where the first phoneme identification result includes a probability distribution of each speech frame in the speech signal in a phoneme space.
The phoneme space may include a plurality of phonemes and a null output (indicating that the corresponding speech frame contains no user utterance). In other words, the first phoneme recognition result includes, for each speech frame in the speech signal, the probabilities that the frame's phoneme belongs to each preset phoneme or to the null output.
For example, the phoneme space may contain 212 Chinese phonemes and a null output. That is, for an input speech frame, the acoustic model can output the probabilities of the 212 phonemes and of the null output for that speech frame.
330, inputting the first phoneme recognition result into a decoding graph to obtain a phoneme sequence corresponding to the speech signal and a first timestamp corresponding to the phoneme sequence.
Specifically, after the first phoneme recognition result is input into the decoding graph, the decoding graph may determine, according to the probabilities of each phoneme and of the null output in the first phoneme recognition result, whether the result corresponds to a certain phoneme or to the null output, and may further determine the corresponding text according to the determined phonemes.
In the embodiment of the application, the first timestamp corresponding to the text can be further acquired. As an example, the frame number of the speech frame corresponding to the text may be the first timestamp.
When the number of the voice frames is multiple, the first timestamp may include frame numbers corresponding to the multiple voice frames. At this time, the starting point of the first timestamp may be a frame number of a first speech frame of the plurality of speech frames, and the ending point of the first timestamp may be a frame number of a last speech frame of the plurality of speech frames.
When the phoneme recognition result corresponds to the null output, it is determined that the corresponding speech frame contains no user utterance, and in that case there is no corresponding text.
Illustratively, the text output from the decoding graph may be referred to as a phoneme sequence. The phoneme sequence is a keyword in the input audio features and can serve as a preliminary trigger instruction text, i.e., a candidate trigger text. In the following step 340, the phoneme sequence is further verified to determine whether it can trigger an instruction. Accordingly, the first timestamp may also be referred to as a preliminary trigger timestamp.
With continued reference to fig. 5, in stage (a), the first phoneme recognition result output by the Transducer model may be input into a Weighted Finite-State Transducer (WFST) decoder (an example of the decoding graph) 46, and the WFST decoder 46 outputs a phoneme sequence (i.e., the preliminary trigger instruction text) corresponding to the speech signal and a first timestamp (with starting point t_0) corresponding to the phoneme sequence.
In one example, a decoding graph may include two separate WFST units: a phoneme dictionary (L) and an instruction set model (G), wherein the instruction set model (G) may be a grammar composed of a predefined instruction set. Illustratively, the decoding graph LG may specifically be as follows:
LG=min(det(L·G)) (2)
where min and det represent the minimization and determinization of WFST, respectively.
By way of example, the decoding process of the decoding graph may be implemented by a token-passing algorithm, which is not limited in this application.
In some embodiments, only the frames of the first phoneme recognition result whose most probable output is not the null output may be input into the decoding graph, i.e., only frames whose maximum-probability output is non-null are decoded to obtain the phoneme sequence and the first timestamp, so as to save decoding time.
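A possible way to keep only the frames whose most probable output is not the null output before decoding is sketched below; the array layout and the position of the null (blank) symbol in the phoneme space are assumptions made for the example.

```python
import numpy as np

def select_non_null_frames(posteriors, null_id):
    """posteriors: (T, num_phonemes + 1) frame-level distribution; returns kept frame indices."""
    best = posteriors.argmax(axis=-1)          # most likely symbol per frame
    keep = np.where(best != null_id)[0]        # frames whose top output is not the null symbol
    return keep, posteriors[keep]              # the frame numbers double as the first timestamp

# Example: 6 frames over a toy 3-symbol space where index 2 is the null output (assumed layout).
post = np.array([[0.1, 0.1, 0.8], [0.7, 0.1, 0.2], [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8], [0.6, 0.2, 0.2], [0.1, 0.1, 0.8]])
frames, kept = select_non_null_frames(post, null_id=2)
print(frames)   # [1 2 4] -> only these frames are sent to the decoding graph
```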
340, determining whether the phoneme sequence is used for triggering an instruction according to a vector portion of the first acoustic hidden layer representation vector corresponding to the first time stamp.
Specifically, after obtaining the phoneme sequence and the first timestamp corresponding to the phoneme sequence through the speech instruction recognition model (i.e., the acoustic model and the decoding graph), the embodiment of the present application may further verify whether the phoneme sequence is used for the trigger instruction according to a vector portion corresponding to the first timestamp in the first acoustic hidden layer representation vector of each speech frame output by the encoder of the acoustic model.
Therefore, on the basis of a traditional voice instruction recognition model, the embodiments of the application can further verify, according to the acoustic hidden layer representation vector corresponding to the phoneme sequence of the trigger instruction, whether the phoneme sequence triggers an instruction, which improves the reliability of instruction triggering. The voice instruction recognition model in the embodiments of the application does not require a front-end voice wake-up model, such as a KWS system, so false wake-ups can be suppressed without significantly increasing the complexity of the system, and the user can interact with the device directly.
For example, the voice command recognition system of the embodiments of the present application may occupy less storage and computing resources.
For another example, the voice instruction recognition system according to the embodiment of the present application can reduce the number of false wakeups, so that 24-hour deployment and operation can be performed.
In some embodiments, whether the phoneme sequence is used for triggering an instruction may be further verified in the following four ways.
Mode 1: forced alignment is performed on the corresponding output of the phoneme discriminator using the phoneme sequence to obtain a first confidence, and whether the phoneme sequence is used for triggering an instruction is verified according to the first confidence.
Referring to fig. 6, the flow corresponding to mode 1 may include steps 601 to 604.
601, a second timestamp is obtained according to the first timestamp, the starting point of the second timestamp being before the starting point of the first timestamp.
The timestamp obtained by the Transducer model is usually inaccurate because the Transducer emits labels with a delay, so the starting point t_0 of the preliminary trigger timestamp (i.e., the first timestamp) needs to be advanced by t_d frames to obtain the second timestamp, whose starting point lies earlier than t_0. As a possible implementation, the starting point of the first timestamp may be manually advanced by 15 frames.
Optionally, the end point of the first timestamp is the same as the end point of the second timestamp.
602, at least part of the first acoustic hidden layer representation vector is input into a phoneme discriminator, and a second phoneme recognition result corresponding to the second timestamp, output by the phoneme discriminator, is acquired.
With continued reference to fig. 5, stage (b) is a forced-alignment stage, which uses the trigger text from stage (a) to perform forced alignment against the output of the phoneme discriminator over the trigger segment (e.g., the vector portion corresponding to the second timestamp) of the acoustic hidden layer representation vector, obtaining a first confidence S_1. Optionally, stage (b) also yields a more accurate trigger timestamp (with starting point t_r).
As one implementation, in stage (b), the acoustic hidden layer representation vector h_t^enc output by the shared encoder 42 may be input into the phoneme discriminator 45, and the output of the phoneme discriminator 45 is then truncated according to the second timestamp to obtain the second phoneme recognition result.
As another implementation (not shown in fig. 5), the acoustic hidden layer representation vector h_t^enc output by the shared encoder 42 may first be truncated according to the second timestamp, and the truncated vector portion is then input into the phoneme discriminator to obtain the second phoneme recognition result; this is not limited in the present application.
603, aligning the second phoneme recognition result with the phoneme sequence to obtain a first confidence level, wherein the first confidence level is used for indicating an alignment score of the second phoneme recognition result with the phoneme sequence.
Optionally, when the second phoneme recognition result is aligned with the phoneme sequence, a third timestamp may be obtained, where an initial point of the third timestamp is a speech frame corresponding to the first phoneme aligned with the phoneme sequence in the second phoneme recognition result. The third timestamp is a more accurate trigger timestamp relative to the first timestamp.
With continued reference to fig. 5, the second phoneme recognition result and the phoneme sequence may be aligned by the alignment module 47, which outputs a first confidence, e.g., the first confidence S_1 output in stage (b). Optionally, the alignment module 47 may also output a third timestamp, e.g., the more accurate trigger timestamp output in stage (b) (with starting point t_r).
As a possible implementation manner, a linear decoding graph may be obtained according to the phoneme sequence, wherein a first symbol is added before the phoneme sequence in the linear decoding graph, and the first symbol is used to absorb an output of the second phoneme recognition result that does not belong to the phoneme sequence. Then, the second phoneme recognition result is input into the linear decoding graph, and the second phoneme recognition result and the phoneme sequence are aligned to obtain a first confidence and a third timestamp.
Fig. 7 shows a schematic diagram of the forced-alignment flow. Taking the aligned text abc as an example, the frame-level posterior phoneme sequence 49 is the output of the phoneme discriminator 45 in fig. 5 (i.e., an example of the second phoneme recognition result), t_0 is the starting point of the trigger timestamp obtained in stage (a), and t_d is the number of frames taken in advance. Because the number of frames t_d taken in advance is an estimate, it may introduce additional noise into the posterior phoneme sequence 49. For this reason, the noise frames (i.e., outputs of the posterior phoneme sequence 49 that do not correspond to the phoneme sequence) can be absorbed by a symbol (g) added at the beginning of the linear decoding graph that is dynamically generated from the phoneme sequence output in stage (a), thereby filtering out some false triggers. Here, the symbol (g) is an example of the first symbol.
After the linear decoding graph is obtained, the posterior phoneme sequence 49 may be input into it and Viterbi decoding is performed to achieve the forced alignment, finally outputting the first confidence S_1 and a more accurate trigger starting point t_r.
As an example, the first confidence may be, without limitation, an alignment average score, a root mean square, or the like of each frame in the third timestamp.
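For illustration, a simplified Viterbi-style forced alignment over the linear decoding graph (the absorbing symbol (g) followed by the trigger phonemes) might look like the following sketch. The constant garbage emission score, the absence of phoneme skips, and the exact confidence formula are simplifying assumptions made to keep the example short; the embodiments are not limited to this form.

```python
import numpy as np

def forced_align(log_post, target, garbage_logp=np.log(0.5)):
    """
    log_post: (T, P) frame-level log posteriors from the phoneme discriminator.
    target:   list of phoneme ids (the candidate trigger text from stage (a)).
    Returns (confidence, refined_start): an average per-frame alignment score and
    the first frame matched to the first target phoneme (the refined start t_r).
    """
    states = [None] + list(target)               # state 0 is the absorbing "garbage" symbol (g)
    T, S = log_post.shape[0], len(states)
    NEG = -1e30
    dp = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=int)
    dp[0, 0] = garbage_logp                      # path may start in the garbage state
    dp[0, 1] = log_post[0, states[1]]            # or directly on the first trigger phoneme
    for t in range(1, T):
        for s in range(S):
            emit = garbage_logp if s == 0 else log_post[t, states[s]]
            stay = dp[t - 1, s]
            move = dp[t - 1, s - 1] if s > 0 else NEG
            dp[t, s] = emit + max(stay, move)
            back[t, s] = s if stay >= move else s - 1
    # Trace back from the final target state to find where the trigger really starts.
    s, path = S - 1, []
    for t in range(T - 1, -1, -1):
        path.append(s)
        s = back[t, s]
    path.reverse()
    refined_start = next((t for t, st in enumerate(path) if st == 1), 0)
    aligned_score = dp[T - 1, S - 1] - refined_start * garbage_logp   # drop garbage-frame contributions
    confidence = aligned_score / max(T - refined_start, 1)            # average score as the first confidence
    return confidence, refined_start
```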
604, it is determined whether the phoneme sequence is used for triggering the instruction based on the first confidence level.
For example, the first confidence level may be compared with a preset threshold to determine whether the phoneme sequence triggers an instruction. For example, when the first confidence is greater than or equal to a threshold, a trigger instruction may be determined; when the first confidence is less than the threshold, it may be determined that the instruction is not triggered.
Therefore, on the basis of a traditional instruction recognition model, the embodiments of the application further cascade a forced-alignment module based on the frame-level phoneme discriminator, and perform forced alignment on the corresponding output of the phoneme discriminator using the phoneme sequence to obtain a more accurate instruction-trigger timestamp and a confidence, so that whether the phoneme sequence triggers an instruction can be further verified according to the confidence, improving the reliability of instruction triggering.
Mode 2: an acoustic hidden layer representation vector of the shared acoustic encoder is truncated using the accurate trigger timestamp and used as input to the Transformer decoder; a decoded text containing the trigger text and its second confidence are obtained, and whether the phoneme sequence is used for triggering an instruction is verified according to the first confidence of Mode 1 and the second confidence.
In some embodiments, the voice command recognition system corresponding to mode 2 may be referred to as a three-stage system, a three-stage voice command recognition system, or a two-stage verification system, without limitation.
Referring to fig. 8, the flow corresponding to the mode 2 may further include steps 801 to 803 on the basis of the mode 1.
In the first acoustic hidden-layer representation vector, a second acoustic hidden-layer representation vector corresponding to the third timestamp is truncated 801.
With continued reference to fig. 5, stage (c) is a Transformer stage that uses the more accurate trigger timestamp of stage (b) (i.e., the third timestamp) to truncate the triggered segment of the acoustic hidden layer representation vector of the shared acoustic encoder (an example of the second acoustic hidden layer representation vector) as input to the Transformer decoder 44.
And 802, inputting the second acoustic hidden layer expression vector into a decoder to obtain a first decoding result matched with the phoneme sequence and a second confidence coefficient of the first decoding result.
Illustratively, the decoder comprises a transform decoder.
With continued reference to fig. 5, using the timestamp output in stage (b) whose starting point is t_r, the Transformer decoder 44 may obtain at least one beam of decoded text and a confidence for each decoded text through autoregressive beam search. Optionally, the beam search may be performed by a beam-search module 48.
After the search is finished, the candidate beam list may be searched for a candidate sequence whose decoded text contains the trigger text (an example of the first decoding result matching the phoneme sequence). If such a candidate exists, the confidence S_2 of that decoded text can be obtained and output (an example of the second confidence).
In some embodiments, the length of the decoded sequence of the decoder may be determined based on the length of the phoneme sequence.
For example, the start symbol <SOS> may be fed to the Transformer decoder 44, and the decoding results are then generated one by one autoregressively. Since the purpose of the embodiments is only to detect whether the predetermined trigger text is actually triggered, it is not necessary to decode until <EOS> appears as in a general beam search; based on this, the embodiments may limit the length of the decoded sequence according to the length of the preliminary trigger phoneme sequence, so as to save decoding time.
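A length-limited check of whether the trigger text appears among the decoded candidates might be sketched as below. The decoder interface (a step function returning next-symbol log-probabilities conditioned on the truncated hidden vectors), the greedy search standing in for the beam search, and the margin of extra decoding steps are all assumptions for illustration.

```python
# Sketch of a length-limited decode-and-check for the trigger text (decoder API is hypothetical).
def verify_with_decoder(decoder_step, trigger_ids, sos_id, extra_steps=2):
    """
    decoder_step(prefix) -> log-probabilities over the output vocabulary for the next symbol,
    assumed to be conditioned internally on the truncated acoustic hidden layer vectors.
    Decoding stops after len(trigger_ids) + extra_steps symbols instead of waiting for <EOS>.
    """
    prefix, score = [sos_id], 0.0
    max_len = len(trigger_ids) + extra_steps
    for _ in range(max_len):
        logp = decoder_step(prefix)
        best = max(range(len(logp)), key=lambda k: logp[k])   # greedy step; a beam works the same way
        prefix.append(best)
        score += logp[best]
    hyp = prefix[1:]                                          # drop <SOS>
    # Second confidence: average step score, returned only if the trigger text occurs in the hypothesis.
    contains = any(hyp[i:i + len(trigger_ids)] == list(trigger_ids)
                   for i in range(len(hyp) - len(trigger_ids) + 1))
    return (score / max_len) if contains else None
```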
And 803, determining whether the phoneme sequence is used for triggering the instruction according to at least one of the first confidence degree and the second confidence degree.
For example, the first confidence may be compared with a preset first threshold, or the second confidence may be compared with a preset second threshold, or the first confidence may be compared with the preset first threshold and the second confidence may be compared with the preset second threshold, to determine whether the phoneme sequence triggers the instruction.
For example, a trigger instruction may be determined when the first confidence level is greater than or equal to a first threshold and the second confidence level is greater than or equal to a second threshold.
For another example, when the first confidence is less than the first threshold or the second confidence is less than the second threshold, i.e., when either condition is satisfied, it may be determined that the instruction is not triggered. That is, in fig. 5, when the confidence of either stage (b) or stage (c) does not reach the corresponding trigger threshold, the phoneme sequence from stage (a) can be considered a false trigger.
For another example, when the first confidence is less than the first threshold, it may be determined not to trigger the instruction, and the process of calculating the second confidence does not need to be performed; when the first confidence is greater than or equal to the first threshold, the second confidence may be further calculated, and whether the phoneme sequence triggers the instruction is determined according to the second confidence. That is, in fig. 5, stages (b) and (c) may be combined into a cascaded determination: when it is determined according to the first confidence that the phoneme sequence cannot trigger the instruction, there is no need to further calculate the second confidence, so it is unnecessary to compute both confidences for every input sample, thereby saving verification time.
For another example, it may be determined whether the phoneme sequence is used for triggering an instruction only according to the second confidence level, i.e. in fig. 5 the alignment module 47 may output the third timestamp without outputting the first confidence level. At this time, if the second confidence is less than the second threshold, it may be determined not to trigger the instruction, and if the second confidence is greater than or equal to the second threshold, it may be determined to trigger the instruction.
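The decision variants enumerated above can be condensed into a small sketch. The threshold values, the callable used to defer the second confidence, and the flag name are assumptions made only for illustration.

def should_trigger(first_conf, second_conf_fn, t1=0.5, t2=0.5, cascade=True):
    # Hypothetical decision logic over the first and second confidences.
    # first_conf may be None for the variant in which the alignment module only outputs
    # the third timestamp; second_conf_fn is a callable so that, in the cascaded variant,
    # the second confidence is only computed when the first check has already passed.
    if cascade:
        if first_conf is not None and first_conf < t1:
            return False            # early rejection: Transformer decoding is skipped
        return second_conf_fn() >= t2
    # Joint variant: both confidences must reach their respective thresholds.
    return first_conf >= t1 and second_conf_fn() >= t2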
Therefore, in the embodiment of the present application, a forced alignment module based on a frame-level phoneme discriminator and a Transformer verification framework are further cascaded on the basis of a conventional instruction recognition model. The forced alignment module is used to obtain a more accurate trigger timestamp and the first confidence; the Transformer verification framework intercepts the hidden state of the shared acoustic encoder from the accurate trigger timestamp and decodes it, obtaining a decoded text containing the trigger text and its second confidence; whether the phoneme sequence triggers the instruction is then verified according to at least one of the first confidence and the second confidence, thereby improving the reliability of instruction triggering.
Mode 3: a timestamp whose starting point is several frames ahead of the initial trigger timestamp (for example, the second timestamp described above) is used to intercept the acoustic hidden-layer representation vector of the shared acoustic encoder as the input of the Transformer decoder; a decoded text containing the trigger text and its third confidence are acquired; and whether the phoneme sequence is used to trigger an instruction is verified according to the third confidence.
In some embodiments, the voice command recognition system corresponding to mode 3 may be referred to as a two-stage system, a two-stage voice command recognition system, or a one-stage verification system, without limitation.
Referring to fig. 9, the flow corresponding to mode 3 may include steps 901 to 904.
And 901, acquiring a second time stamp according to the first time stamp, wherein the starting point of the second time stamp is before the starting point of the first time stamp.
Specifically, 901 may refer to description of 601, which is not described herein again.
And 902, intercepting a third acoustic hidden-layer representation vector corresponding to the second time stamp from the first acoustic hidden-layer representation vector.
And 903, inputting the third acoustic hidden layer expression vector into a decoder to obtain a second decoding result matched with the phoneme sequence and a third confidence coefficient of the second decoding result.
For steps 902 and 903, reference may be made to the descriptions of 801 and 802 in fig. 8, which are not repeated here.
It should be noted that, unlike fig. 8, in 902 the vector portion corresponding to the second timestamp, rather than the vector portion corresponding to the third timestamp, is intercepted from the first acoustic hidden-layer representation vector. Accordingly, in 903 the vector portion corresponding to the second timestamp, rather than the vector portion corresponding to the third timestamp, is input into the decoder.
And 904, determining whether the phoneme sequence is used for triggering the instruction according to the third confidence coefficient.
For example, the third confidence may be compared with a preset threshold to determine whether the phoneme sequence triggers the instruction: when the third confidence is greater than or equal to the threshold, it may be determined to trigger the instruction; when the third confidence is less than the threshold, it may be determined not to trigger the instruction.
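For mode 3, the whole verification collapses to one slice at the second timestamp, one decoding pass and one threshold test. A minimal sketch under the same assumptions as the earlier snippets, where decode_and_score stands in for running the Transformer decoder over the segment and extracting the third confidence:

def verify_mode3(enc_hidden, second_timestamp, decode_and_score, threshold=0.5):
    start_frame, end_frame = second_timestamp
    # Intercept the shared encoder output at the second (earlier) timestamp, not the third one.
    segment = enc_hidden[start_frame:end_frame + 1]
    third_conf = decode_and_score(segment)  # confidence of a decoding matching the phoneme sequence
    return third_conf >= threshold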
Therefore, in the embodiment of the present application, a Transformer verification framework is further cascaded on the basis of a conventional instruction recognition model. The Transformer verification framework intercepts the hidden state of the shared acoustic encoder using the timestamp whose starting point is several frames ahead of the preliminary trigger timestamp, and decodes it to obtain a decoded text containing the trigger text and its third confidence; whether the phoneme sequence triggers the instruction is then verified according to the third confidence, thereby improving the reliability of instruction triggering.
Mode 4: forced alignment is performed on the corresponding output of the phoneme discriminator using the phoneme sequence to obtain the first confidence; the timestamp whose starting point is several frames ahead of the preliminary trigger timestamp is used to intercept the acoustic hidden-layer representation vector of the shared acoustic encoder as the input of the Transformer decoder; a decoded text containing the trigger text and its third confidence are acquired; and whether the phoneme sequence is used to trigger an instruction is verified according to the first confidence and the third confidence.
Specifically, for the process of obtaining the first confidence, reference may be made to mode 1, and for the process of obtaining the third confidence, reference may be made to mode 3; details are not repeated here.
Therefore, in the embodiment of the present application, a forced alignment module based on a frame-level phoneme discriminator and a Transformer verification framework are further cascaded on the basis of a conventional instruction recognition model. The forced alignment module is used to obtain the first confidence; the Transformer verification framework intercepts the hidden state of the shared acoustic encoder using the timestamp whose starting point is several frames ahead of the preliminary trigger timestamp and decodes it, obtaining a decoded text containing the trigger text and its third confidence; whether the phoneme sequence triggers the instruction is then verified jointly according to the first confidence and the third confidence, thereby improving the reliability of instruction triggering.
It should be noted that, in the embodiment of the present application, the voice instruction recognition system may adjust the different thresholds according to the specific application scenario, so that the instruction wake-up rate and the false wake-up rate can be flexibly traded off.
The scheme of the embodiment of the present application can be experimentally tested on a real corpus. The voice command recognition model under test was trained on a 23000-hour Mandarin ASR corpus derived from mobile-phone and vehicle-mounted voice assistant products. During model training, a development set may be randomly drawn from the training set.
Table 1 shows a comparison of the accuracy and the number of false wake-ups per hour (FA per hour) between different frameworks.
TABLE 1
(Table 1 is reproduced only as an image in the original publication; it compares the instruction recognition accuracy on the clean and noisy test sets and the number of false wake-ups per hour for the experiments discussed below.)
The clean test set (clean set) consists of instruction speech collected while a real vehicle was driving on an expressway; the noisy test set (noisy set) consists of the same instructions recorded while driving in a downtown area. The false wake-up test set is an 84-hour human-speech test set, and the instruction set contains 29 instructions.
As shown in Table 1, compared with a single Tiny Transducer recognition system with the same configuration (experiment S0), the two-stage verification method of the embodiment of the present application (experiment S3) greatly reduces the number of false wake-ups from 1.47 times/hour to 0.13 times/hour, a relative reduction of 91.15%. Meanwhile, the drop in instruction recognition accuracy is controlled within 2 percent, remaining in basically the same order of magnitude as the baseline system (experiment S0).
In addition, the ablation comparison between experiment S3 (the three-stage voice command recognition system) and experiment S1 (the two-stage voice command recognition system) shows that the timestamp obtained through forced alignment is more accurate than the original timestamp output by the Transducer model, and that a more accurate timestamp is more helpful for the verification of the third stage (i.e., stage (c)). That is, with comparable accuracy, the three-stage voice command recognition system suppresses false wake-ups better than the two-stage voice command recognition system, for example reducing the number of false wake-ups per hour by 43.47% relative to the two-stage system.
In addition, the embodiment of the present application is also compared with the Transducer + MLD scheme (experiment S4); the acoustic model used in that scheme is a Tiny Transducer model with the same configuration as the one used in the framework provided in the embodiment of the present application. The scheme corresponding to experiment S4 performs multi-stage verification mainly based on statistical methods, whereas the scheme provided in the embodiment of the present application focuses on neural-network-based verification. The experimental results show that the three-stage verification framework provided in the embodiment of the present application achieves a lower false wake-up rate with comparable recognition performance.
FIG. 10 is the Receiver Operating Characteristic (ROC) curve corresponding to the clean test set, and FIG. 11 is the ROC curve corresponding to the noisy test set. In fig. 10 and 11, the abscissa indicates the number of false wake-ups per hour (FA per hour), and the ordinate indicates the false rejection rate, i.e., the ratio of the number of falsely rejected instructions to the total number of instructions. The CaTT-KWS system in FIGS. 10 and 11 is the three-stage system provided in the embodiment of the present application (experiment S3). The ROC curve is obtained by taking a number of threshold values, plotting a point for the result at each threshold, and finally connecting the points into a curve, so that the performance of the instruction word recognition system can be reflected more comprehensively.
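An ROC curve of this kind can be reproduced by sweeping the decision threshold and, for each value, measuring FA per hour on the false wake-up test set and the false rejection rate on the instruction test set. The sketch below is illustrative only; the score lists and the duration argument are assumptions about how the test data would be organised.

def roc_points(instruction_scores, negative_scores, negative_hours, thresholds):
    points = []
    for th in thresholds:
        false_rejects = sum(1 for s in instruction_scores if s < th)  # true commands rejected
        false_accepts = sum(1 for s in negative_scores if s >= th)    # non-commands accepted
        frr = false_rejects / max(1, len(instruction_scores))         # false rejection rate
        fa_per_hour = false_accepts / negative_hours                  # false wake-ups per hour
        points.append((fa_per_hour, frr))
    return sorted(points)  # one point per threshold; connecting them gives the ROC curve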
As can be seen from FIGS. 10 and 11, at operating points where the number of false wake-ups per hour (FA per hour) is high, the effect of CaTT-KWS is comparable to that of the baseline Transducer + WFST scheme and the Transducer + MLD scheme. When the false wake-ups need to be further reduced (for example, when FA per hour is below 0.1), the recognition performance of the conventional Transducer + WFST and Transducer + MLD schemes drops sharply, while CaTT-KWS can still maintain a high recognition rate.
Further, the errors of the timestamps obtained in the forced alignment stage and the Transducer stage relative to the actual voice endpoints can be counted, as shown in Table 2 below. Table 2 shows the average error between the starting point of the timestamp obtained by the Transducer model, or by the forced alignment module, and the real starting point, where the real time points are manually labelled.
TABLE 2
Phase                   Clean set (s)   Noisy set (s)
Transducer detection    0.29            0.44
Forced alignment        0.11            0.23
As can be seen from Table 2, the starting point of the timestamp obtained by the forced alignment module is more accurate, and using this more accurate starting point as the input of the Transformer stage can improve the performance of the instruction word recognition system.
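On a plausible reading, the statistic reported in Table 2 is the mean absolute deviation between each predicted starting point and its manually labelled reference (whether the patent averages signed or absolute errors is not stated). A sketch with assumed names:

def mean_start_point_error(predicted_starts, labelled_starts):
    # Both lists hold starting points in seconds, one entry per test utterance.
    assert len(predicted_starts) == len(labelled_starts)
    return sum(abs(p - r) for p, r in zip(predicted_starts, labelled_starts)) / len(predicted_starts)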
The present application is not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within its technical concept; these simple modifications all fall within the protection scope of the present application. For example, the various specific features described in the foregoing detailed description may be combined in any suitable manner without contradiction; to avoid unnecessary repetition, the possible combinations are not separately described in this application. For another example, the various embodiments of the present application may also be combined with each other arbitrarily, and such combinations shall likewise be regarded as the disclosure of the present application as long as they do not depart from the concept of the present application.
It should also be understood that, in the various method embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an execution sequence, and the execution sequence of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. It is to be understood that the numerical designations are interchangeable under appropriate circumstances such that the embodiments of the application described are capable of operation in sequences other than those illustrated or described herein.
Method embodiments of the present application are described in detail above, and apparatus embodiments of the present application are described below in conjunction with fig. 12-13.
Fig. 12 is a schematic block diagram of an apparatus 600 for speech instruction recognition according to an embodiment of the present application. As shown in fig. 12, the apparatus 600 for speech instruction recognition may include an obtaining unit 610, an acoustic model 620, a decoding graph 630, and a verification module 640.
An obtaining unit 610, configured to obtain a voice signal, where the voice signal includes a plurality of voice frames;
the acoustic model 620 is configured to input the speech signal, obtain a first phoneme recognition result corresponding to each speech frame in the speech signal, and obtain a first acoustic hidden layer expression vector of each speech frame output by an encoder in the acoustic model, where the first phoneme recognition result includes a probability distribution of each speech frame in a phoneme space, and the acoustic model is obtained by training a speech signal sample and an actual phoneme of each speech frame in the speech signal sample;
a decoding graph 630, configured to input the first phoneme recognition result, so as to obtain a phoneme sequence corresponding to the speech signal and a first timestamp corresponding to the phoneme sequence;
a verification module 640, configured to determine whether the phoneme sequence is used for a triggering instruction according to a vector portion of the first acoustic hidden layer representation vector corresponding to the first timestamp.
Optionally, the verification module 640 is specifically configured to:
acquiring a second timestamp according to the first timestamp, wherein the starting point of the second timestamp is before the starting point of the first timestamp;
inputting at least part of the first acoustic hidden layer representation vector into a phoneme discriminator, and acquiring a second phoneme recognition result which is output by the phoneme discriminator and corresponds to the second timestamp;
aligning the second phoneme recognition result with the phoneme sequence to obtain a first confidence coefficient, wherein the first confidence coefficient is used for representing an alignment score of the second phoneme recognition result and the phoneme sequence;
and determining whether the phoneme sequence is used for triggering an instruction or not according to the first confidence.
Optionally, the verification module 640 is specifically configured to:
acquiring a linear decoding graph according to the phoneme sequence, wherein a first symbol is added before the phoneme sequence in the linear decoding graph, and the first symbol is used for absorbing output which does not belong to the phoneme sequence in the second phoneme recognition result;
and inputting the second phoneme recognition result into the linear decoding graph, and aligning the second phoneme recognition result and the phoneme sequence.
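The effect of the first symbol can be illustrated with a greatly simplified, greedy stand-in for forced alignment: frames whose best phoneme is not the currently expected one are treated as absorbed, while matching frames advance through the target phoneme sequence. The dict-based posterior layout, the greedy matching and the averaged score are all assumptions; the embodiment's actual alignment over a linear decoding graph is not reproduced here.

import math

def greedy_align(frame_posteriors, phoneme_sequence):
    # frame_posteriors: one dict per frame mapping phoneme -> probability.
    # Returns (aligned, start_frame, score): whether every target phoneme was reached,
    # the frame of the first matched phoneme (start of the third timestamp), and an
    # averaged probability used here as a stand-in for the first confidence.
    idx, start_frame, logps = 0, None, []
    for t, post in enumerate(frame_posteriors):
        if idx == len(phoneme_sequence):
            break
        target = phoneme_sequence[idx]
        if max(post, key=post.get) == target:
            if start_frame is None:
                start_frame = t
            logps.append(math.log(post[target]))
            idx += 1
        # Non-matching frames are skipped: before the first match they play the role of
        # output absorbed by the first symbol; afterwards they mimic staying on the phoneme.
    aligned = idx == len(phoneme_sequence)
    score = math.exp(sum(logps) / len(logps)) if logps else 0.0
    return aligned, start_frame, score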
Optionally, the verification module 640 is specifically configured to:
acquiring a third time stamp under the condition that the second phoneme recognition result is aligned with the phoneme sequence, wherein the starting point of the third time stamp is a speech frame corresponding to a first phoneme aligned with the phoneme sequence in the second phoneme recognition result;
intercepting a second acoustic hidden-layer representation vector corresponding to the third timestamp from the first acoustic hidden-layer representation vector;
inputting the second acoustic hidden layer representation vector into a decoder to obtain a first decoding result matched with the phoneme sequence and a second confidence coefficient of the first decoding result;
and determining whether the phoneme sequence is used for triggering an instruction according to at least one of the first confidence coefficient and the second confidence coefficient.
Optionally, the verification module 640 is specifically configured to:
and determining the length of a decoding sequence of the decoder according to the length of the phoneme sequence.
Optionally, the decoder comprises a Transformer decoder.
Optionally, the apparatus 600 further comprises a training unit, configured to:
and performing multi-task joint learning according to the first loss of the acoustic model, the second loss of the phoneme discriminator and the third loss of the decoder so as to jointly train the acoustic model, the phoneme discriminator and the decoder.
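Multi-task joint learning of this kind is commonly realised as a weighted sum of the individual losses; the weights below are assumptions, since the embodiment does not specify how the first, second and third losses are combined.

def joint_loss(acoustic_loss, phoneme_discriminator_loss, decoder_loss,
               w1=1.0, w2=1.0, w3=1.0):
    # Single scalar optimised to jointly train the acoustic model,
    # the phoneme discriminator and the decoder.
    return w1 * acoustic_loss + w2 * phoneme_discriminator_loss + w3 * decoder_loss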
Optionally, the verification module 640 is specifically configured to:
acquiring a second timestamp according to the first timestamp, wherein the starting point of the second timestamp is before the starting point of the first timestamp;
intercepting a third acoustic hidden representation vector corresponding to the second timestamp from the first acoustic hidden representation vector;
inputting the third acoustic hidden layer representation vector into a decoder to obtain a second decoding result matched with the phoneme sequence and a third confidence coefficient of the second decoding result;
and determining whether the phoneme sequence is used for triggering an instruction or not according to the third confidence.
Optionally, the phoneme space includes a plurality of phonemes and a null output.
The decoding graph 630 is specifically configured to:
and inputting the recognition result corresponding to the frame which is output in a non-empty mode after the first phoneme recognition result into the decoding graph to obtain the phoneme sequence and the first time stamp.
Optionally, the encoder comprises a feedforward sequential memory network (FSMN).
Optionally, the decoding graph includes a phoneme dictionary and an instruction set model.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another, and similar descriptions may refer to the method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus 600 for recognizing a voice instruction may correspond to the entity executing the method 300 in the embodiment of the present application, and the foregoing and other operations and/or functions of the modules in the apparatus 600 are respectively intended to implement the corresponding flows of the foregoing methods; for brevity, they are not described here again.
The apparatus and system of embodiments of the present application are described above in connection with the drawings from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, or other storage medium known in the art. The storage medium is located in a memory, and a processor reads information in the memory and combines hardware thereof to complete steps of the above method embodiments.
Fig. 13 is a schematic block diagram of an electronic device 800 provided in an embodiment of the present application.
As shown in fig. 13, the electronic device 800 may include:
a memory 810 and a processor 820, the memory 810 being configured to store a computer program and to transfer the program code to the processor 820. In other words, the processor 820 may call and execute a computer program from the memory 810 to implement the method in the embodiment of the present application.
For example, the processor 820 may be configured to perform the steps of the method 300 or 400 according to instructions in the computer program.
In some embodiments of the present application, the processor 820 may include, but is not limited to:
general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the memory 810 includes, but is not limited to:
volatile memory and/or non-volatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules, which are stored in the memory 810 and executed by the processor 820 to perform the methods provided in the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program in the electronic device 800.
Optionally, as shown in fig. 13, the electronic device 800 may further include:
a transceiver 830, the transceiver 830 being connectable to the processor 820 or the memory 810.
The processor 820 may control the transceiver 830 to communicate with other devices, and specifically, may transmit information or data to the other devices or receive information or data transmitted by the other devices. The transceiver 830 may include a transmitter and a receiver. The transceiver 830 may further include one or more antennas.
It should be understood that the various components in the electronic device 800 are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
According to an aspect of the present application, there is provided a communication device comprising a processor and a memory, the memory being configured to store a computer program, the processor being configured to call and execute the computer program stored in the memory, so that the communication device performs the method of the above-described method embodiments.
According to an aspect of the present application, there is provided a computer storage medium having a computer program stored thereon, which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above-described method embodiment.
In other words, the above embodiments may, when implemented in software, be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
It is understood that in the embodiments of the present application, data related to user information and the like may be involved. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of relevant data need to comply with relevant laws and regulations and standards of relevant countries and regions.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method of voice command recognition, comprising:
acquiring a voice signal, wherein the voice signal comprises a plurality of voice frames;
inputting the voice signal into an acoustic model to obtain a first phoneme recognition result corresponding to each voice frame in the voice signal and a first acoustic hidden layer expression vector of each voice frame output by an encoder in the acoustic model, wherein the first phoneme recognition result comprises probability distribution of each voice frame in a phoneme space, and the acoustic model is obtained by training a voice signal sample and actual phonemes of each voice frame in the voice signal sample;
inputting the first phoneme recognition result into a decoding graph to obtain a phoneme sequence corresponding to the voice signal and a first time stamp corresponding to the phoneme sequence;
determining whether the sequence of phonemes is used to trigger an instruction in accordance with a vector portion of the first acoustic hidden layer representation vector that corresponds to the first timestamp.
2. The method of claim 1, wherein determining whether the sequence of phonemes is used to trigger an instruction based on a vector portion of the first acoustic hidden representation vector corresponding to the first timestamp comprises:
acquiring a second timestamp according to the first timestamp, wherein the starting point of the second timestamp is before the starting point of the first timestamp;
inputting at least part of the first acoustic hidden layer representation vector into a phoneme discriminator, and acquiring a second phoneme recognition result which is output by the phoneme discriminator and corresponds to the second timestamp;
aligning the second phoneme recognition result with the phoneme sequence to obtain a first confidence coefficient, wherein the first confidence coefficient is used for representing an alignment score of the second phoneme recognition result and the phoneme sequence;
and determining whether the phoneme sequence is used for triggering an instruction or not according to the first confidence.
3. The method of claim 2, wherein aligning the second phone recognition result with the phone sequence comprises:
acquiring a linear decoding graph according to the phoneme sequence, wherein a first symbol is added before the phoneme sequence in the linear decoding graph, and the first symbol is used for absorbing output which does not belong to the phoneme sequence in the second phoneme recognition result;
and inputting the second phoneme recognition result into the linear decoding graph, and aligning the second phoneme recognition result and the phoneme sequence.
4. The method of claim 2, further comprising:
acquiring a third time stamp under the condition that the second phoneme recognition result is aligned with the phoneme sequence, wherein the starting point of the third time stamp is a speech frame corresponding to a first phoneme aligned with the phoneme sequence in the second phoneme recognition result;
intercepting a second acoustic hidden representation vector corresponding to the third timestamp from the first acoustic hidden representation vector;
inputting the second acoustic hidden layer representation vector into a decoder to obtain a first decoding result matched with the phoneme sequence and a second confidence coefficient of the first decoding result;
wherein the determining whether the phoneme sequence is used for triggering an instruction according to the first confidence comprises:
determining whether the phoneme sequence is used for triggering an instruction according to at least one of the first confidence level and the second confidence level.
5. The method of claim 4, further comprising:
and determining the length of a decoding sequence of the decoder according to the length of the phoneme sequence.
6. The method of claim 4, wherein the decoder comprises a Transformer decoder.
7. The method of claim 4, further comprising:
and performing multi-task joint learning according to the first loss of the acoustic model, the second loss of the phoneme discriminator and the third loss of the decoder so as to jointly train the acoustic model, the phoneme discriminator and the decoder.
8. The method of claim 1, wherein determining whether the sequence of phonemes is used to trigger an instruction based on a vector portion of the first acoustic hidden representation vector corresponding to the first timestamp comprises:
acquiring a second timestamp according to the first timestamp, wherein the starting point of the second timestamp is before the starting point of the first timestamp;
intercepting a third acoustic hidden-layer representation vector corresponding to the second timestamp from the first acoustic hidden-layer representation vector;
inputting the third acoustic hidden layer representation vector into a decoder to obtain a second decoding result matched with the phoneme sequence and a third confidence coefficient of the second decoding result;
and determining whether the phoneme sequence is used for triggering an instruction or not according to the third confidence coefficient.
9. The method of any of claims 1-7, wherein the phoneme space includes a plurality of phonemes and a null output;
wherein, the inputting the first phoneme recognition result into a decoding graph to obtain a phoneme sequence corresponding to the speech signal and a first timestamp corresponding to the phoneme sequence includes:
and inputting the recognition result corresponding to the frame which is output in a non-empty mode after the first phoneme recognition result into the decoding graph to obtain the phoneme sequence and the first time stamp.
10. A method according to any of claims 1-7, characterized in that the encoder comprises a feed forward sequential memory network FSMN.
11. The method of any of claims 1-7, wherein the decoding graph comprises a phone dictionary and an instruction set model.
12. An apparatus for voice command recognition, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a voice signal which comprises a plurality of voice frames;
the acoustic model is used for inputting the speech signal to obtain a first phoneme recognition result corresponding to each speech frame in the speech signal and a first acoustic hidden layer expression vector of each speech frame output by an encoder in the acoustic model, wherein the first phoneme recognition result comprises probability distribution of each speech frame in a phoneme space, and the acoustic model is obtained by training a speech signal sample and actual phonemes of each speech frame in the speech signal sample;
a decoding graph, configured to input the first phoneme recognition result, to obtain a phoneme sequence corresponding to the speech signal and a first timestamp corresponding to the phoneme sequence;
a verification module to determine whether the sequence of phonemes is used for a triggering instruction based on a vector portion of the first acoustic hidden layer representation vector that corresponds to the first timestamp.
13. An electronic device comprising a processor and a memory, the memory having stored therein instructions that, when executed by the processor, cause the processor to perform the method of any of claims 1-11.
14. A computer storage medium for storing a computer program comprising instructions for performing the method of any one of claims 1-11.
15. A computer program product, comprising computer program code which, when run by an electronic device, causes the electronic device to perform the method of any of claims 1-11.
CN202210551539.0A 2022-05-18 2022-05-18 Voice instruction recognition method and device, electronic equipment and storage medium Pending CN115132196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210551539.0A CN115132196A (en) 2022-05-18 2022-05-18 Voice instruction recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210551539.0A CN115132196A (en) 2022-05-18 2022-05-18 Voice instruction recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115132196A true CN115132196A (en) 2022-09-30

Family

ID=83376753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210551539.0A Pending CN115132196A (en) 2022-05-18 2022-05-18 Voice instruction recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115132196A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
KR20170081344A (en) * 2016-01-04 2017-07-12 한국전자통신연구원 Utterance Verification Method using Deep Neural Network
CN111771213A (en) * 2018-02-16 2020-10-13 杜比实验室特许公司 Speech style migration
CN110689876A (en) * 2019-10-14 2020-01-14 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium
US20220129790A1 (en) * 2020-10-28 2022-04-28 Verizon Media Inc. System and method for deep enriched neural networks for time series forecasting
CN112002308A (en) * 2020-10-30 2020-11-27 腾讯科技(深圳)有限公司 Voice recognition method and device
CN113539242A (en) * 2020-12-23 2021-10-22 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN112951211A (en) * 2021-04-22 2021-06-11 中国科学院声学研究所 Voice awakening method and device
CN113362812A (en) * 2021-06-30 2021-09-07 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN114333768A (en) * 2021-09-26 2022-04-12 腾讯科技(深圳)有限公司 Voice detection method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEI JIN: "Research on Deep Learning Model Optimization for Acoustic Signal Processing", Computer Science and Technology, 31 December 2018 (2018-12-31) *
HUANG XIAOHUI; LI JING: "Acoustic Model for Tibetan Speech Recognition Based on Recurrent Neural Networks", Journal of Chinese Information Processing, no. 05, 15 May 2018 (2018-05-15), pages 54-60 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665656A (en) * 2023-07-24 2023-08-29 美智纵横科技有限责任公司 Speech recognition model generation method, speech recognition method, device and chip
CN116825109A (en) * 2023-08-30 2023-09-29 深圳市友杰智新科技有限公司 Processing method, device, equipment and medium for voice command misrecognition
CN116825109B (en) * 2023-08-30 2023-12-08 深圳市友杰智新科技有限公司 Processing method, device, equipment and medium for voice command misrecognition

Similar Documents

Publication Publication Date Title
DE112017003563B4 METHOD AND SYSTEM OF AUTOMATIC SPEECH RECOGNITION USING POSTERIOR CONFIDENCE SCORES
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
JP6980119B2 (en) Speech recognition methods and their devices, devices, storage media and programs
Serdyuk et al. Towards end-to-end spoken language understanding
Prabhavalkar et al. A Comparison of sequence-to-sequence models for speech recognition.
EP4053835A1 (en) Speech recognition method and apparatus, and device and storage medium
EP3748629A1 (en) Identification method and device for voice keywords, computer-readable storage medium, and computer device
Pinto et al. Analysis of MLP-based hierarchical phoneme posterior probability estimator
Prabhavalkar et al. An Analysis of" Attention" in Sequence-to-Sequence Models.
Ninomiya et al. Integration of deep bottleneck features for audio-visual speech recognition.
CN106875936B (en) Voice recognition method and device
Myer et al. Efficient keyword spotting using time delay neural networks
CN112017645B (en) Voice recognition method and device
CN110648659B (en) Voice recognition and keyword detection device and method based on multitask model
CN115132196A (en) Voice instruction recognition method and device, electronic equipment and storage medium
CN112967713B (en) Audio-visual voice recognition method, device, equipment and storage medium based on multi-modal fusion
Akbacak et al. Environmental sniffing: noise knowledge estimation for robust speech systems
US20220122596A1 (en) Method and system of automatic context-bound domain-specific speech recognition
CN111261141A (en) Voice recognition method and voice recognition device
Tabibian A voice command detection system for aerospace applications
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN114944149A (en) Speech recognition method, speech recognition apparatus, and computer-readable storage medium
CN114495905A (en) Speech recognition method, apparatus and storage medium
CN114360514A (en) Speech recognition method, apparatus, device, medium, and product
CN114187921A (en) Voice quality evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination