CN115132196B - Voice instruction recognition method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN115132196B (application number CN202210551539.0A)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- phoneme sequence
- voice
- acoustic
- timestamp
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L15/22 — Speech recognition; Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/02 — Speech recognition; Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L2015/025 — Phonemes, fenemes or fenones being the recognition units
- G10L2015/223 — Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
The application provides a voice instruction recognition method, apparatus, device, and storage medium, relating to speech recognition in the field of artificial intelligence. In the method, a speech signal is input into an acoustic model to obtain a first phoneme recognition result for each speech frame in the speech signal and a first acoustic hidden layer representation vector of each speech frame output by an encoder in the acoustic model; the first phoneme recognition result is then input into a decoding graph to obtain a phoneme sequence corresponding to the speech signal and a first timestamp of the speech frames corresponding to the phoneme sequence; finally, whether the phoneme sequence is used to trigger an instruction is determined according to the vector portion of the first acoustic hidden layer representation vector corresponding to the first timestamp. The voice instruction recognition model provided by the embodiments of the application does not require a preceding voice wake-up model such as a KWS system, so false wake-ups can be suppressed without significantly increasing system complexity, and the user can interact with the device directly.
Description
Technical Field
The present application relates to the field of speech recognition technology, and more particularly, to a method, apparatus, device, and storage medium for speech instruction recognition.
Background
In recent years, with the rapid development of deep learning, end-to-end (E2E) automatic speech recognition (Automatic Speech Recognition, ASR) technology has been favored for its simplified architecture and excellent performance. E2E ASR neural network structures have also been introduced into keyword spotting (Keyword Spotting, KWS) tasks to improve keyword recognition performance.
In contrast to ASR tasks, speech instruction recognition (Speech Command Recognition, SCR) tasks have a fixed target instruction set, and the search space is relatively limited. To constrain recognition results to the target instruction set, the acoustic model may be followed by a decoder that further optimizes for the task goal. For example, an attention-biasing mechanism may be added after the acoustic model to bias recognition results towards given keywords. However, as the number of instructions in a task increases, so does the number of false wake-ups (False Alarms, FA), which is unacceptable to the user for a device running 24 hours a day.
In existing schemes, a KWS system running 24 hours a day is usually placed in front of the SCR system; because that system detects only one keyword, false wake-ups can be well suppressed. However, placing a 24-hour KWS system in front of the SCR system significantly increases system complexity, and the user cannot interact with the device directly.
Disclosure of Invention
The embodiments of the application provide a voice instruction recognition method, apparatus, device, and storage medium, which can help suppress false wake-ups without significantly increasing system complexity, while allowing the user to interact with the voice instruction recognition device directly.
In a first aspect, a method for speech instruction recognition is provided, comprising:
acquiring a voice signal, wherein the voice signal comprises a plurality of voice frames;
inputting the voice signal into an acoustic model to obtain a first phoneme recognition result corresponding to each voice frame in the voice signal and a first acoustic hidden layer representation vector of each voice frame output by an encoder in the acoustic model, wherein the first phoneme recognition result comprises a probability distribution of each voice frame over a phoneme space, and the acoustic model is trained on voice signal samples and the actual phonemes of each voice frame in the voice signal samples;
inputting the first phoneme recognition result into a decoding graph to obtain a phoneme sequence corresponding to the voice signal and a first timestamp corresponding to the phoneme sequence; and
determining whether the phoneme sequence is used to trigger an instruction according to a vector portion of the first acoustic hidden layer representation vector corresponding to the first timestamp.
In a second aspect, there is provided an apparatus for speech instruction recognition, comprising:
an acquisition unit configured to acquire a voice signal comprising a plurality of voice frames;
an acoustic model configured to receive the voice signal and output a first phoneme recognition result corresponding to each voice frame in the voice signal and a first acoustic hidden layer representation vector of each voice frame produced by an encoder in the acoustic model, wherein the first phoneme recognition result comprises a probability distribution of each voice frame over a phoneme space, and the acoustic model is trained on voice signal samples and the actual phonemes of each voice frame in the voice signal samples;
a decoding graph configured to receive the first phoneme recognition result and output a phoneme sequence corresponding to the voice signal and a first timestamp corresponding to the phoneme sequence; and
a verification module configured to determine whether the phoneme sequence is used to trigger an instruction according to a vector portion of the first acoustic hidden layer representation vector corresponding to the first timestamp.
In a third aspect, the present application provides an electronic device, comprising:
A processor adapted to implement computer instructions; and
A memory storing computer instructions adapted to be loaded by a processor and to perform the method of the first aspect described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing computer instructions that, when read and executed by a processor of a computer device, cause the computer device to perform the method of the first aspect described above.
In a fifth aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method of the first aspect described above.
Based on the above technical solution, after obtaining a phoneme sequence and its corresponding first timestamp through the voice instruction recognition model (i.e., the acoustic model and the decoding graph), the embodiments of the present application further verify whether the phoneme sequence is used to trigger an instruction according to the vector portion corresponding to the first timestamp in the first acoustic hidden layer representation vector of each voice frame output by the encoder of the acoustic model. On top of a conventional voice instruction recognition model, the triggering phoneme sequence is further checked against the acoustic hidden layer representation vector corresponding to it, which improves the reliability of instruction triggering. The voice instruction recognition model provided by the embodiments of the application does not require a preceding voice wake-up model such as a KWS system, so false wake-ups can be suppressed without significantly increasing system complexity, and the user can interact with the device directly. For example, the voice instruction recognition system of the embodiments of the application may occupy less memory and fewer computing resources. As another example, it can reduce the number of false wake-ups, so it can be deployed and run 24 hours a day.
Drawings
Fig. 1 is a schematic diagram of an application scenario according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a Transducer model;
FIG. 3 is a schematic flow chart of a method of speech instruction recognition according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a network architecture for voice command recognition according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a voice command recognition process using the network architecture shown in FIG. 4;
FIG. 6 is a schematic flow chart of a verification method of a trigger instruction according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a forced alignment process according to an embodiment of the present application;
FIG. 8 is a schematic flow chart diagram of another verification method of trigger instruction provided by an embodiment of the present application;
FIG. 9 is a schematic flow chart diagram of another verification method of trigger instruction provided by the embodiment of the application;
FIG. 10 is a ROC curve corresponding to a clean test set;
FIG. 11 is a ROC curve corresponding to a noise test set;
FIG. 12 is a schematic block diagram of a voice command recognition apparatus according to an embodiment of the present application;
fig. 13 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that in embodiments of the present application, "B corresponding to a" means that B is associated with a. In one implementation, B may be determined from a. It should also be understood that determining B from a does not mean determining B from a alone, but may also determine B from a and/or other information.
In the description of the present application, unless otherwise indicated, "at least one" means one or more, and "a plurality" means two or more. In addition, "and/or" describes an association relationship of the association object, and indicates that there may be three relationships, for example, a and/or B may indicate: a alone, a and B together, and B alone, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be further understood that the description of the first, second, etc. in the embodiments of the present application is for illustration and distinction of descriptive objects, and is not intended to represent any limitation on the number of devices in the embodiments of the present application, nor is it intended to constitute any limitation on the embodiments of the present application.
It should also be appreciated that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The scheme provided by the application can relate to artificial intelligence technology.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines can perceive, reason, and make decisions.
It should be appreciated that artificial intelligence techniques are a comprehensive discipline involving a wide range of fields, both hardware-level and software-level techniques. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Embodiments of the application may relate to speech technology (Speech Technology) in artificial intelligence. Key speech technologies include automatic speech recognition (Automatic Speech Recognition, ASR), speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field therefore involves natural language, i.e., the language people use daily, and is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Embodiments of the application may also relate to machine learning (Machine Learning, ML) in artificial intelligence. ML is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With research and advancement of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, and smart customer service. It is believed that, with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
Fig. 1 is a schematic diagram of an application scenario according to an embodiment of the present application.
As shown in Fig. 1, the scenario includes an acquisition device 101 and a computing device 102. The acquisition device 101 is used to collect speech data, such as raw speech when a user speaks. The computing device 102 is used to process the speech data collected by the acquisition device 101, e.g., for ASR or SCR.
By way of example, the acquisition device 101 may be a microphone, microphone array, pickup or other device having voice acquisition capabilities.
By way of example, the computing device 102 may be a user device such as a mobile phone, a computer, a smart voice interaction device, a smart home appliance, an in-vehicle terminal, a wearable terminal, an aircraft, a Mobile Internet Device (MID), or another terminal device with speech processing capability.
Illustratively, the computing device 102 may be a server. The server may be one or more. Where the servers are multiple, there are at least two servers for providing different services and/or there are at least two servers for providing the same service, such as in a load balancing manner, as embodiments of the application are not limited in this respect. A neural network model may be provided in a server that provides support for training and application of the neural network model. A speech processing device may also be provided in the server for ASR or SCR of speech, the server providing support for the application of the speech processing device.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like. Servers may also become nodes of the blockchain.
In some embodiments, acquisition device 101 and computing device 102 may be implemented as the same hardware device. For example, computing device 102 is a user device and acquisition device 101 is a microphone built into the user device.
In some embodiments, the acquisition device 101 and the computing device 102 may be implemented as different hardware devices. For example, the acquisition device 101 is a microphone disposed on a steering wheel of a vehicle, and the computing device 102 may be an onboard smart device; for another example, the collection device 101 is a microphone disposed on a smart home device (such as a smart tv, a set-top box, an air conditioner, etc.), and the computing device 102 may be a home computing hub, such as a mobile phone, a tv, a router, etc., or a cloud server, etc.; as another example, the acquisition device 101 may be a microphone on a personal wearable device (e.g., smart wristband, smart watch, smart headset, smart glasses, etc.), and the computing device 102 may be a personal device, such as a cell phone.
When the acquisition device 101 is implemented as a hardware device different from the computing device 102, the acquisition device 101 may be connected to the computing device 102 through a network. The network may be a wireless or wired network, such as an intranet (Intranet), the Internet, a Global System for Mobile Communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network, without limitation.
The related art to which the embodiments of the present application relate will be described below.
Transducer model: the Transducer model is a streaming E2E ASR framework that can convert input audio stream features into text results directly and in real time; compared with traditional speech recognition models it has certain advantages in resource consumption and accuracy. The Transducer framework has also been introduced into KWS tasks as an acoustic model.
For a given input sequence x = (x_1, x_2, …, x_T) ∈ X*, the Transducer model outputs a sequence y = (y_1, y_2, …, y_U) ∈ Y*, where X* denotes the set of all input sequences, Y* denotes the set of all output sequences, x_t ∈ X and y_u ∈ Y are real-valued vectors, and X and Y denote the input and output spaces, respectively.
For example, when the Transducer model is used for phoneme recognition, the input sequence x is a sequence of feature vectors, such as filter bank (FilterBank, FBank) features or Mel-frequency cepstral coefficient (Mel Frequency Cepstrum Coefficient, MFCC) features, and x_t denotes the feature vector at time t; the output sequence y is a phoneme sequence, and y_u denotes the phoneme of the u-th step.
The output space may also be referred to as a phoneme space, comprising a plurality of phonemes. Optionally, the phoneme space further comprises a null output. The output sequence may have the same length as the input sequence due to the introduction of the null output.
FIG. 2 shows a schematic block diagram of a Transducer model. As shown in Fig. 2, the Transducer model includes an encoder (Encoder) 21, a predictor (Predictor) 22, and a joint network (Joint Network) 23.
The encoder 21 may be a recurrent neural network, such as a long short-term memory (Long Short Term Memory, LSTM) network or a feed-forward sequential memory network (Feedforward Sequential Memory Networks, FSMN). The encoder 21 receives the audio feature input x_t at time t and outputs an acoustic hidden layer representation.
The predictor 22 may be a recurrent neural network such as an LSTM, or a convolutional neural network (Convolutional Neural Network, CNN). The predictor 22 receives the model's previous non-blank output label y_(u-1) and outputs a text hidden layer representation.
The joint network 23 may be a fully connected neural network, such as a linear layer followed by an activation unit; it applies linear transformations to the acoustic hidden layer representation and the text hidden layer representation and sums them to produce the output hidden representation z.
Finally, the output of the joint network 23 may be passed through a softmax function to convert it into a probability distribution.
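To make this structure concrete, the following is a simplified PyTorch sketch of a single Transducer step (encoder, predictor, joint network, softmax). The layer types, dimensions, and the output size of 213 (212 phonemes plus blank) are illustrative assumptions, not the configuration of the disclosed model.

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Joint network: project the acoustic and text hidden-layer
    representations, sum them, and map the result to the output space."""
    def __init__(self, enc_dim, pred_dim, joint_dim, num_outputs):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, num_outputs)   # phonemes + blank

    def forward(self, h_enc, h_pred):
        z = torch.tanh(self.enc_proj(h_enc) + self.pred_proj(h_pred))
        return self.out(z)                              # logits over the phoneme space

# One step at time t, given the previous non-blank label y_{u-1} (all sizes assumed):
enc_dim, pred_dim, joint_dim, num_outputs = 256, 128, 256, 213
encoder = nn.LSTM(input_size=80, hidden_size=enc_dim, batch_first=True)   # stand-in for encoder 21
predictor = nn.Embedding(num_outputs, pred_dim)                           # stand-in for predictor 22
joint = JointNetwork(enc_dim, pred_dim, joint_dim, num_outputs)           # joint network 23

x_t = torch.randn(1, 1, 80)                      # FBank features of one speech frame
h_enc, _ = encoder(x_t)                          # acoustic hidden-layer representation
h_pred = predictor(torch.tensor([[0]]))          # text hidden-layer representation for y_{u-1}
probs = torch.softmax(joint(h_enc, h_pred), dim=-1)   # probability distribution incl. blank
```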
Mini Transducer (Tiny Transducer) model: in SCR or KWS tasks, a small model size and high real-time performance are often required because the model must be deployed on actual devices. Compared with the traditional Transducer framework, the mini Transducer (Tiny Transducer) meets these requirements well.
The Tiny Transducer reduces the model size and optimizes inference speed on the basis of the Transducer, making it more suitable for streaming deployment on terminal devices. Compared with the traditional Transducer framework, the Tiny Transducer uses an FSMN on the encoder side, which improves the inference speed of the model. On the decoder side, a single-layer CNN with a small convolution kernel is used, greatly reducing model complexity; in addition, for tasks such as SCR that depend little on context information, this alleviates the problem of overfitting to context information on the decoder side. Research shows that, as an acoustic model, the Tiny Transducer can maintain comparable recognition performance while greatly reducing model size and inference time.
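The sketch below illustrates, under stated assumptions, what an FSMN-style encoder layer and a small-kernel single-layer CNN predictor could look like; it is a much-simplified illustration with assumed hyper-parameters (memory order, kernel size), not the Tiny Transducer implementation referenced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSMNBlock(nn.Module):
    """Greatly simplified FSMN-style layer: a linear projection plus a causal
    depthwise convolution over past frames playing the role of the memory block."""
    def __init__(self, dim, memory_order=8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.memory = nn.Conv1d(dim, dim, kernel_size=memory_order + 1, groups=dim)
        self.left_pad = memory_order            # pad only on the left -> causal, streaming-friendly

    def forward(self, x):                        # x: (batch, time, dim)
        h = torch.relu(self.proj(x))
        m = F.pad(h.transpose(1, 2), (self.left_pad, 0))
        return h + self.memory(m).transpose(1, 2)

class SmallCNNPredictor(nn.Module):
    """Single-layer CNN predictor with a small kernel over the label history."""
    def __init__(self, num_labels, dim, kernel_size=2):
        super().__init__()
        self.embed = nn.Embedding(num_labels, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=kernel_size, padding=kernel_size - 1)

    def forward(self, labels):                   # labels: (batch, label_steps)
        e = self.embed(labels).transpose(1, 2)
        return self.conv(e)[..., :labels.size(1)].transpose(1, 2)
```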
One SCR scheme uses a Transducer model instead of traditional stacked recurrent neural network (Recurrent Neural Network, RNN) or CNN modules as the acoustic model to improve the accuracy of instruction recognition. However, as the number of instructions in a task increases, the number of false wake-ups also increases, which is unacceptable to the user for a device running 24 hours a day.
One scheme for suppressing false wake-ups is to place a KWS system running 24 hours a day in front of the SCR system; because that system detects only one keyword, false wake-ups can be well suppressed. When the KWS system detects the specific keyword, the SCR system then operates within a limited time window to avoid an obvious feeling of false wake-up for the user. However, placing a 24-hour KWS system in front of the SCR system significantly increases system complexity and does not match the user's habit of interacting with the device directly.
In view of this, embodiments of the present application provide a method, apparatus, device, and storage medium for voice command recognition, which can help suppress false wake-ups without significantly increasing system complexity, while allowing the user to interact with the device directly.
Specifically, a voice signal is input into an acoustic model to obtain a first phoneme recognition result corresponding to each voice frame in the voice signal and a first acoustic hidden layer representation vector of each voice frame output by an encoder in the acoustic model; the first phoneme recognition result is then input into a decoding graph to obtain a phoneme sequence corresponding to the voice signal and a first timestamp of the voice frames corresponding to the phoneme sequence; finally, whether the phoneme sequence is used to trigger an instruction is determined according to the vector portion of the first acoustic hidden layer representation vector corresponding to the first timestamp.
Wherein the first phoneme recognition result comprises a probability distribution of each speech frame in a phoneme space.
The acoustic model is obtained by training a speech signal sample and the actual phonemes of each speech frame in the speech signal sample. The acoustic model may be, for example, a Transducer model, or a Tiny Transducer model, without limitation.
Therefore, after obtaining a phoneme sequence and a first timestamp corresponding to the phoneme sequence through the voice instruction recognition model (i.e., the acoustic model and the decoding graph), the embodiments of the application further verify whether the phoneme sequence is used to trigger an instruction according to the vector portion corresponding to the first timestamp in the first acoustic hidden layer representation vector of each voice frame output by the encoder of the acoustic model, as sketched below.
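The following minimal Python sketch summarizes this recognition-plus-verification flow. The callables acoustic_model, decode_graph, and verify_trigger, as well as the data shapes, are hypothetical placeholders for illustration only.

```python
from typing import Callable, Optional, Sequence

def recognize_command(speech_frames: Sequence,
                      acoustic_model: Callable,
                      decode_graph: Callable,
                      verify_trigger: Callable) -> Optional[list]:
    """Hypothetical end-to-end flow; the three callables are placeholders."""
    # Step 320: frame-level phoneme posteriors and encoder hidden-layer vectors.
    posteriors, enc_hidden = acoustic_model(speech_frames)     # e.g. shapes (T, P) and (T, D)

    # Step 330: the decoding graph yields the phoneme sequence and its frame range.
    phone_seq, first_timestamp = decode_graph(posteriors)      # e.g. (["n", "i"], (40, 57))
    if not phone_seq:
        return None                                            # nothing detected

    # Step 340: verify the candidate trigger using the hidden vectors of those frames.
    start, end = first_timestamp
    hidden_segment = enc_hidden[start:end + 1]
    return phone_seq if verify_trigger(phone_seq, hidden_segment) else None
```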
On the basis of a conventional voice instruction recognition model, the embodiments of the application further check the triggering phoneme sequence against the acoustic hidden layer representation vector corresponding to it, which helps improve the reliability of instruction triggering. The voice instruction recognition model provided by the embodiments of the application does not require a preceding voice wake-up model such as a KWS system, so false wake-ups can be suppressed without significantly increasing system complexity, and the user can interact with the device directly.
For example, the voice command recognition system of the embodiments of the application may occupy less memory and fewer computing resources.
As another example, the voice command recognition system of the embodiments of the application can reduce the number of false wake-ups, so it can be deployed and run 24 hours a day.
The embodiments of the application can be applied to voice instruction recognition systems in various voice interaction scenarios. For example, in an in-vehicle scenario, a driver cannot free their hands to operate devices while driving, and voice can be used to control in-vehicle playback devices, navigation devices, and the like, improving safety and convenience of operation. A vehicle-mounted voice instruction recognition system according to the embodiments of the application can recognize user instructions directly, without a preceding voice wake-up model such as a KWS system, which improves the user experience.
The following describes in detail the scheme provided by the embodiment of the present application with reference to the accompanying drawings.
Fig. 3 is a schematic flow chart of a method 300 for voice command recognition according to an embodiment of the present application. The method 300 may be performed by any electronic device having data processing capabilities. For example, the electronic device may be implemented as computing device 102 in FIG. 1, as the application is not limited in this regard.
In some embodiments, a machine learning model may be included (e.g., deployed) in an electronic device that may be used to perform the method 300 of speech instruction recognition. In some embodiments, the machine learning model may be a deep learning model, a neural network model, or other model, without limitation. In some embodiments, the machine learning model may be, without limitation, a voice instruction recognition system, an instruction word recognition system, or other system.
Fig. 4 is a schematic diagram of a network architecture for voice command recognition according to an embodiment of the present application, which includes a Transducer decoder (Transducer Decoder) 41, a shared encoder (Shared Encoder) 42, a joint network (Joint Network) 43, a Transformer decoder (Decoder) 44, and a phoneme discriminator (Phone Predictor) 45. The network architecture fuses the Transducer model and the Transformer model. The shared encoder 42 is shared by the Transducer decoder 41, the Transformer decoder 44, and the phoneme discriminator 45 to tighten the coupling of the system.
In some embodiments, when training this network architecture, multi-task learning may be applied to jointly optimize the Transducer loss, the Transformer loss, and the cross-entropy (Cross-Entropy, CE) loss of the phoneme discriminator. The Transducer loss may be derived from the output of the joint network 43 and the real text labels of the speech signal samples, the Transformer loss may be derived from the output of the Transformer decoder 44 and the real text labels of the speech signal samples, and the CE loss of the phoneme discriminator may be derived from the output of the phoneme discriminator 45 and the phonemes of each speech frame of the speech signal samples.
Illustratively, the target loss L of the multi-task learning may be expressed as the following formula (1):
L = α·L_Transducer + β·L_CE + γ·L_Transformer (1)
where L_Transducer denotes the Transducer loss, such as an RNN-T loss, L_CE denotes the CE loss of the phoneme discriminator, L_Transformer denotes the Transformer loss, and α, β, and γ are adjustable hyper-parameters.
As a specific example, α, β, γ may be set to 1.0, 0.8, and 0.5, respectively, which is not limited in the present application.
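A minimal sketch of combining the three task losses with the example weights above is shown below; the individual loss values are placeholder tensors, and in practice each would come from its own loss function in the training loop.

```python
import torch

# Hypothetical per-task losses computed elsewhere in the training loop.
loss_transducer = torch.tensor(2.31)   # e.g. an RNN-T loss on the joint-network output
loss_ce = torch.tensor(0.87)           # frame-level CE loss of the phoneme discriminator
loss_transformer = torch.tensor(1.42)  # CE loss on the Transformer decoder output

alpha, beta, gamma = 1.0, 0.8, 0.5     # the example weights given above
total_loss = alpha * loss_transducer + beta * loss_ce + gamma * loss_transformer
# total_loss.backward() would then update all shared parameters jointly.
```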
The steps in method 300 will be described below in connection with the network architecture of fig. 4.
It should be understood that fig. 4 illustrates an example of a network architecture for voice instruction recognition, which is merely intended to assist those skilled in the art in understanding and implementing embodiments of the present application and is not intended to limit the scope of embodiments of the present application. Equivalent changes and modifications can be made by those skilled in the art based on the examples given herein, and such changes and modifications should still fall within the scope of the embodiments of the present application.
As shown in fig. 3, the method 300 of speech instruction recognition may include steps 310 through 340.
310, acquiring a speech signal, the speech signal comprising a plurality of speech frames.
Specifically, the speech signal may include a plurality of speech frames obtained by framing the original speech. For example, the electronic device may include a preprocessing module that frames the original speech to obtain the speech signal.
And 320, inputting the voice signal into an acoustic model to obtain a first phoneme recognition result corresponding to each voice frame in the voice signal and a first acoustic hidden layer representation vector of each voice frame output by an encoder in the acoustic model.
The acoustic model is obtained through training of voice signal samples and actual phonemes of each voice frame in the voice signal samples.
As one example, the acoustic model may include the Transducer decoder 41, the shared encoder 42, and the joint network 43 modules in Fig. 4. The encoder output in the acoustic model may be, for example, the output of the shared encoder 42.
The Transducer decoder 41 and the joint network 43 are similar to the predictor 22 and the joint network 23 of Fig. 2, respectively, and the function of the shared encoder 42 in the acoustic model may be similar to that of the encoder 21 of Fig. 2; reference may be made to the related description of Fig. 2.
As another example, the acoustic model may be a Transducer model or a Tiny Transducer model, without limitation.
Fig. 5 is a schematic diagram of a flow of voice command recognition using the network architecture shown in fig. 4.
Referring to Fig. 5, stage (a) is the Transducer detection stage (Transducer detection stage), which performs streaming detection to obtain a preliminary trigger text and a trigger timestamp (with starting point t_0).
In stage (a), feature extraction may be performed on the speech frame at time t of the speech signal to obtain its audio feature x_t, which is then input into the shared encoder 42, while the model's previous non-blank output y_(u-1) is input into the Transducer decoder 41. The shared encoder 42 outputs an acoustic hidden layer representation, and the Transducer decoder 41 outputs a text hidden layer representation. The acoustic hidden layer representation vector output by the shared encoder 42 is an example of the first acoustic hidden layer representation vector described above.
The joint network 43 takes the acoustic hidden layer representation and the text hidden layer representation as inputs and outputs the first phoneme recognition result, which may be denoted, for example, as y_t. The first phoneme recognition result comprises the probability distribution of each speech frame over the phoneme space.
A phoneme (phone) is the smallest phonetic unit divided according to the natural attributes of a language; analyzed in terms of pronunciation actions within a syllable, one action constitutes one phoneme. Phonemes are divided into two major classes, vowels and consonants; for example, the Mandarin syllable ā has only one phoneme, ài (love) has two phonemes, and dài (generation) has three phonemes.
A phoneme is the smallest unit or smallest speech segment constituting a syllable, and is the smallest linear speech unit divided from the viewpoint of sound quality. Phonemes are concretely existing physical phenomena. The symbols of the International Phonetic Alphabet (formulated by the International Phonetic Association to uniformly transcribe the speech sounds of all languages, also called the "international phonetic letters" or the "universal phonetic alphabet") correspond one-to-one to the phonemes of human speech.
In the embodiments of the application, for each voice frame in the voice signal, the acoustic model can recognize the phoneme corresponding to that voice frame to obtain the first phoneme recognition result, where the first phoneme recognition result comprises the probability distribution of each voice frame in the voice signal over the phoneme space.
The phoneme space may comprise a plurality of phonemes and a blank output (indicating that the corresponding speech frame contains no user pronunciation). In other words, the first phoneme recognition result includes, for each speech frame in the speech signal, the probability that its phoneme is each preset phoneme and the probability of the blank output.
For example, the phoneme space may contain 212 phonemes plus a blank output. That is, for an input speech frame, the acoustic model can output the probability of each of the 212 phonemes and of the blank output for that frame.
And 330, inputting the first phoneme recognition result into a decoding graph to obtain a phoneme sequence corresponding to the voice signal and a first timestamp corresponding to the phoneme sequence.
Specifically, after the first phoneme recognition result is input into the decoding graph, the decoding graph may determine, according to the probabilities of each phoneme and of the blank output in the first phoneme recognition result, whether the result corresponds to a certain phoneme or to the blank output, and may further determine the corresponding text according to the determined phonemes.
In the embodiment of the application, the first timestamp corresponding to the text can be further acquired. As an example, the frame number of the speech frame corresponding to the text may be the first timestamp.
When the text corresponds to a plurality of voice frames, the first timestamp may include the frame numbers of those voice frames. In this case, the starting point of the first timestamp is the frame number of the first of these voice frames, and the end point is the frame number of the last.
When the phoneme recognition result corresponds to the blank output, it is determined that the corresponding voice frame contains no user pronunciation, and there is no corresponding text.
For example, the text output by the decoding graph may be referred to as a phoneme sequence (Transducer phone sequence). The phoneme sequence is the keyword contained in the input audio features and can serve as a preliminary trigger instruction text, i.e., a candidate trigger text. In step 340 below, the phoneme sequence is further verified to confirm whether it can trigger an instruction. Accordingly, the first timestamp may also be referred to as the preliminary trigger timestamp.
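A small sketch of how the preliminary trigger timestamp could be derived from per-phoneme frame indices follows; the decoder output format and the example phonemes/frame numbers are assumptions for illustration.

```python
# Hypothetical decoder output: each recognized (non-blank) phoneme together with
# the index of the speech frame on which it was emitted.
decoded = [("n", 41), ("i", 45), ("h", 52), ("ao", 57)]    # assumed example values

phoneme_sequence = [p for p, _ in decoded]                  # candidate trigger text
first_timestamp = (decoded[0][1], decoded[-1][1])           # (start frame t0, end frame)
```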
With continued reference to Fig. 5, in stage (a), the first phoneme recognition result output by the Transducer model may be input to a weighted finite-state transducer (Weighted Finite State Transducer, WFST) decoder 46 (an example of a decoding graph), and the WFST decoder 46 outputs the phoneme sequence (i.e., the preliminary trigger instruction text) corresponding to the speech signal and the first timestamp (with starting point t_0) corresponding to the phoneme sequence.
In one example, the decode graph may include two separate WFST units: a phoneme dictionary (L) and an instruction set model (G), wherein the instruction set model (G) may be a grammar composed of a predefined instruction set. Illustratively, the decoding graph LG may be specifically as follows:
LG=min(det(L·G)) (2)
Where min and det represent minimization and certainty of WFST, respectively.
By way of example, the decoding process over the decoding graph may be implemented by a token-passing algorithm, which is not limited in this application.
In some embodiments, only the recognition results of frames whose posterior is a non-blank output may be input into the decoding graph; for example, only the frames for which the Transducer model's maximum posterior is a non-blank output are decoded to obtain the phoneme sequence and the first timestamp, thereby saving decoding time.
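The frame-filtering step can be sketched as follows; the blank index and the posterior array layout are assumptions, not part of the disclosed system.

```python
import numpy as np

BLANK = 0   # assumed index of the blank output within the phoneme space

def frames_for_decoding(posteriors: np.ndarray):
    """Keep only frames whose highest-posterior symbol is not blank and return
    (frame_index, posterior_row) pairs to be fed to the decoding graph."""
    best = posteriors.argmax(axis=-1)                 # (T,)
    keep = np.flatnonzero(best != BLANK)              # frames with non-blank argmax
    return [(int(t), posteriors[t]) for t in keep]
```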
And 340, determining whether the phoneme sequence is used to trigger an instruction according to the vector portion of the first acoustic hidden layer representation vector corresponding to the first timestamp.
Specifically, after obtaining a phoneme sequence and the first timestamp corresponding to the phoneme sequence through the voice command recognition model (i.e., the acoustic model and the decoding graph), the embodiments of the application further verify whether the phoneme sequence is used to trigger an instruction according to the vector portion corresponding to the first timestamp in the first acoustic hidden layer representation vector of each voice frame output by the encoder of the acoustic model.
Therefore, on the basis of a conventional voice instruction recognition model, the embodiments of the application further check the triggering phoneme sequence against the acoustic hidden layer representation vector corresponding to it, improving the reliability of instruction triggering. The voice instruction recognition model provided by the embodiments of the application does not require a preceding voice wake-up model such as a KWS system, so false wake-ups can be suppressed without significantly increasing system complexity, and the user can interact with the device directly.
For example, the voice command recognition system of the embodiments of the application may occupy less memory and fewer computing resources.
As another example, the voice command recognition system of the embodiments of the application can reduce the number of false wake-ups, so it can be deployed and run 24 hours a day.
In some embodiments, it may be further verified whether the phoneme sequence is used for the trigger instruction in four ways.
Mode 1: and (3) forcedly aligning the corresponding output of the phoneme discriminator by utilizing the phoneme sequence to obtain a first confidence, and verifying whether the phoneme sequence is used for triggering the instruction according to the first confidence.
Referring to fig. 6, the flow corresponding to mode 1 may include steps 601 to 604.
601, A second time stamp is obtained from the first time stamp, a start point of the second time stamp being before a start point of the first time stamp.
Because the delayed emission of the Transducer means that the timestamps obtained by the Transducer model are often inaccurate, the starting point t_0 of the preliminary trigger timestamp (i.e., the first timestamp) needs to be advanced by t_d frames to obtain the second timestamp, whose starting point is therefore t_d frames earlier than t_0. As a possible implementation, the starting point of the first timestamp may be advanced by a fixed 15 frames.
Optionally, the second timestamp has the same end point as the first timestamp.
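A minimal sketch of this window adjustment is given below, assuming "advanced" means the window start moves to an earlier frame and is clamped at the first frame of the signal.

```python
def second_timestamp(first_timestamp, t_d=15, earliest_frame=0):
    """Move the start of the preliminary trigger window t_d frames earlier
    (clamped at the first frame); the end point is left unchanged."""
    start, end = first_timestamp
    return max(earliest_frame, start - t_d), end
```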
At 602, at least a portion of the first acoustic hidden layer representation vector is input to a phoneme discriminator, and a second phoneme recognition result corresponding to a second timestamp output by the phoneme discriminator is obtained.
With continued reference to Fig. 5, stage (b) is the forced alignment stage (Force Alignment stage), which uses the trigger text from stage (a) to force-align the output of the phoneme discriminator over the trigger segment of the acoustic hidden layer representation vector (e.g., the vector portion corresponding to the second timestamp), obtaining a first confidence S_1. Optionally, stage (b) may also produce a more accurate trigger timestamp (with starting point t_r).
As one implementation, in stage (b), the acoustic hidden layer representation vector output by the shared encoder 42 may be input to the phoneme discriminator 45, and the output of the phoneme discriminator 45 may be truncated according to the second timestamp to obtain the second phoneme recognition result.
As another implementation (not shown in Fig. 5), the acoustic hidden layer representation vector output by the shared encoder 42 may be truncated according to the second timestamp, and the truncated vector portion may then be input to the phoneme discriminator to obtain the second phoneme recognition result; the application is not limited in this respect.
603, Aligning the second phoneme recognition result with the phoneme sequence to obtain a first confidence, where the first confidence is used to represent an alignment score of the second phoneme recognition result with the phoneme sequence.
Optionally, when the second phoneme recognition result is aligned with the phoneme sequence, a third timestamp may be obtained, where a starting point of the third timestamp is a speech frame corresponding to a first phoneme aligned with the phoneme sequence in the second phoneme recognition result. The third timestamp is a trigger timestamp that is more accurate than the first timestamp.
With continued reference to Fig. 5, the second phoneme recognition result and the phoneme sequence may be aligned by the alignment module 47, which outputs a first confidence, such as the first confidence S_1 output in stage (b). Optionally, the alignment module 47 may also output a third timestamp, such as the more accurate trigger timestamp (with starting point t_r) output in stage (b).
As a possible implementation, a linear decoding graph may be obtained from the phoneme sequence, with a first symbol added before the phoneme sequence in the linear decoding graph; the first symbol is used to absorb outputs in the second phoneme recognition result that do not belong to the phoneme sequence. The second phoneme recognition result is then input into the linear decoding graph and aligned with the phoneme sequence to obtain the first confidence and the third timestamp.
Fig. 7 shows a schematic diagram of the forced alignment flow. Taking the aligned text abc as an example, the frame-level posterior phoneme sequence 49 is the output of the phoneme discriminator 45 in Fig. 5 (i.e., an example of the second phoneme recognition result), t_0 is the starting point of the trigger timestamp obtained in stage (a), and t_d is the number of frames taken in advance. Since the number of advance frames t_d is estimated, additional noise may be introduced into the posterior phoneme sequence 49. For this reason, the noise frames (i.e., outputs in the posterior phoneme sequence 49 that do not correspond to the phoneme sequence) can be absorbed by the symbol (g) added at the beginning of the linear decoding graph dynamically generated from the phoneme sequence output in stage (a), thereby filtering out some false triggers. The symbol (g) is an example of the first symbol described above.
After the linear decoding graph is obtained, the posterior phoneme sequence 49 may be input into it for Viterbi decoding to achieve forced alignment, finally outputting the first confidence S_1 and the more accurate trigger starting point t_r.
As an example, the first confidence may be, but is not limited to, the average of the frame-level posterior scores within the third timestamp, their root mean square, or the like.
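To make the garbage-absorbing alignment concrete, the following is a much-simplified NumPy sketch of Viterbi forced alignment over a linear graph with a leading garbage state. The garbage scoring heuristic and the confidence definition (mean aligned log-posterior) are illustrative assumptions, not the exact formulation of the embodiment.

```python
import numpy as np

def force_align(posteriors: np.ndarray, phone_ids):
    """Much-simplified Viterbi forced alignment against a linear graph
    g -> p1 -> p2 -> ... -> pK with self-loops, where the leading garbage
    state g absorbs frames that do not belong to the trigger phrase.
    Assumes T >= K. Returns (first confidence S1, refined start frame t_r)."""
    T, K = posteriors.shape[0], len(phone_ids)
    logp = np.log(posteriors + 1e-10)
    garbage = logp.max(axis=-1)            # crude garbage score: best posterior per frame
    # emission scores per frame for states 0 (garbage) and 1..K (trigger phonemes)
    emit = np.concatenate([garbage[:, None], logp[:, phone_ids]], axis=1)   # (T, K+1)

    NEG = -1e30
    score = np.full((T, K + 1), NEG)
    back = np.zeros((T, K + 1), dtype=int)
    score[0, 0], score[0, 1] = emit[0, 0], emit[0, 1]       # may start in g or p1
    for t in range(1, T):
        for s in range(K + 1):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else NEG
            best, prev = (move, s - 1) if move > stay else (stay, s)
            score[t, s], back[t, s] = best + emit[t, s], prev

    states = np.empty(T, dtype=int)
    states[-1] = K                                          # must end in the last phoneme
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    trigger_frames = np.flatnonzero(states > 0)             # frames aligned to the trigger
    t_r = int(trigger_frames[0])                            # refined trigger starting point
    s1 = float(emit[trigger_frames, states[trigger_frames]].mean())   # average aligned score
    return s1, t_r
```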
604, it is determined, based on the first confidence, whether the phoneme sequence is used to trigger an instruction.
For example, the first confidence may be compared with a preset threshold to determine whether the phoneme sequence triggers an instruction: when the first confidence is greater than or equal to the threshold, it may be determined that the instruction is triggered; when the first confidence is less than the threshold, it may be determined that the instruction is not triggered.
Therefore, on the basis of a conventional instruction recognition model, the embodiments of the application cascade a forced alignment module based on the frame-level phoneme discriminator, and use the phoneme sequence to force-align the corresponding output of the phoneme discriminator, obtaining a more accurate instruction trigger timestamp and a confidence. The triggering phoneme sequence can then be verified according to this confidence, improving the reliability of instruction triggering.
Mode 2: and intercepting the acoustic hidden layer representation vector of the shared acoustic encoder by using the accurate trigger time stamp, taking the acoustic hidden layer representation vector as input of a transducer decoder, acquiring a decoded text containing the trigger text and a second confidence coefficient thereof, and jointly verifying whether the phoneme sequence is used for a trigger instruction according to the first confidence coefficient and the second confidence coefficient in the mode 1.
In some embodiments, the voice command recognition system corresponding to the mode 2 may be referred to as a three-stage system, a three-stage voice command recognition system, or a two-stage verification system, without limitation.
Referring to fig. 8, the flow corresponding to mode 2 may further include steps 801 to 803 on the basis of mode 1 described above.
801, A second acoustic hidden layer representation vector corresponding to a third timestamp is truncated from the first acoustic hidden layer representation vector.
With continued reference to fig. 5, stage (c) is a transducer stage, in which the more accurate trigger timestamp (i.e., the third timestamp) output in stage (b) is used to intercept the trigger segment of the acoustic hidden layer representation vector of the shared acoustic encoder (one example of the second acoustic hidden layer representation vector) as input to the transducer decoder 44.
802, Inputting the second acoustic hidden layer representation vector into a decoder to obtain a first decoding result matching the phoneme sequence and a second confidence of the first decoding result.
Illustratively, the decoder includes a transducer decoder.
With continued reference to fig. 5, the trigger timestamp with starting point tr output in stage (b) may be used to intercept the hidden states of the shared acoustic encoder as input to the transducer decoder 44, and the transducer decoder 44 may obtain at least one decoded text, and a confidence for each decoded text, by autoregressive beam search. Optionally, the beam search may be performed by the beam search module 48.
After the search is completed, the candidate beam list may be searched for a candidate sequence whose decoded text contains the trigger text (one example of the first decoding result matching the phoneme sequence). When such a candidate exists, the confidence S2 of the decoded text (one example of the second confidence) may be obtained and output.
In some embodiments, the length of the decoding sequence of the decoder may be determined based on the length of the phoneme sequence.
For example, the start symbol <SOS> may be fed to the transducer decoder 44, and the decoding results are then generated autoregressively one by one. Because the purpose of the embodiment of the present application is to detect whether an instruction is actually triggered according to a predetermined trigger text, it is not necessary to decode until <EOS> appears as in a general beam search. Based on this, in the embodiment of the present application, the length of the decoding sequence can be limited according to the length of the phoneme sequence of the preliminary trigger, so as to save decoding time.
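As an illustration of this stage, the sketch below intercepts the encoder hidden states with the refined timestamp, runs a small autoregressive beam search whose number of steps is capped by the trigger length, and returns a second confidence when a hypothesis matches the trigger. The decoder interface (decode_step), the tensor shapes and the special token ids are assumptions for illustration, not the actual model API of the embodiment.

import torch

def verify_with_decoder(enc_hidden, t_start, t_end, decoder, trigger_ids,
                        sos_id=1, beam_size=4, extra_steps=2):
    """Return a length-normalized confidence S2 if some beam hypothesis matches
    trigger_ids, otherwise None (no matching candidate in the beam list)."""
    segment = enc_hidden[:, t_start:t_end, :]            # (1, T', D) intercepted span
    max_len = len(trigger_ids) + extra_steps             # length cap: no need to reach <EOS>
    beams = [([sos_id], 0.0)]                            # (token ids, accumulated log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            prev = torch.tensor([tokens])                # (1, L) previously decoded tokens
            log_probs = decoder.decode_step(segment, prev)   # (1, V), assumed interface
            top = torch.topk(log_probs[0], beam_size)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((tokens + [int(idx)], score + float(lp)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    for tokens, score in beams:
        hyp = tokens[1:1 + len(trigger_ids)]             # drop <SOS>
        if hyp == list(trigger_ids):                     # stand-in for "contains the trigger text"
            return score / len(trigger_ids)
    return None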
803, It is determined whether the phoneme sequence is used for the trigger instruction based on at least one of the first confidence and the second confidence.
For example, the first confidence may be compared with a preset first threshold, and/or the second confidence may be compared with a preset second threshold, to determine whether the phoneme sequence triggers an instruction.
For example, it may be determined to trigger the instruction when the first confidence is greater than or equal to the first threshold and the second confidence is greater than or equal to the second threshold.
For another example, it may be determined not to trigger the instruction when the first confidence is less than the first threshold or the second confidence is less than the second threshold. That is, in fig. 5, when the confidence of either stage (b) or stage (c) does not reach its trigger threshold, the voice sequence of stage (a) can be regarded as a false trigger.
For another example, when the first confidence is less than the first threshold, it may be determined not to trigger the instruction, and the second confidence need not be calculated; when the first confidence is greater than or equal to the first threshold, the second confidence may be further calculated, and whether the phoneme sequence triggers an instruction may be determined according to the second confidence. That is, in fig. 5, stage (b) and stage (c) can be judged in cascade: when it is determined from the first confidence that the phoneme sequence cannot trigger an instruction, the second confidence does not need to be calculated, so that it is unnecessary to compute both confidences for every input sample before deciding whether to trigger, thereby saving verification time.
For another example, it may be determined whether the phoneme sequence is used for the trigger instruction based on only the second confidence level, i.e., the alignment module 47 in fig. 5 may output the third timestamp without outputting the first confidence level. At this time, if the second confidence level is less than the second threshold value, it may be determined that the instruction is not triggered, and if the second confidence level is greater than or equal to the second threshold value, it may be determined that the instruction is triggered.
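A minimal sketch of the cascaded judgment described above, in which the stage (c) confidence is only computed when stage (b) passes; the threshold values and function names are illustrative assumptions.

def cascaded_verify(first_conf, second_conf_fn, thr1=-2.5, thr2=-1.5):
    """Stage (b) gates stage (c): the decoder is only run when the
    forced-alignment confidence passes its threshold."""
    if first_conf is None or first_conf < thr1:
        return False                       # rejected by stage (b); stage (c) is skipped
    second_conf = second_conf_fn()         # compute the second confidence lazily
    return second_conf is not None and second_conf >= thr2

For example, cascaded_verify(s1, lambda: verify_with_decoder(h, tr, te, dec, ids)) would only invoke the decoder for samples that survive the forced-alignment check, which is what saves verification time.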
Therefore, on the basis of the conventional instruction recognition model, the embodiment of the present application further cascades a forced alignment module based on the frame-level phoneme discriminator and a transducer verification framework. The forced alignment module is used to obtain a more accurate trigger timestamp and the first confidence; the transducer verification framework uses the accurate trigger timestamp to intercept the hidden states of the shared acoustic encoder for decoding, and obtains the decoded text containing the trigger text and its second confidence. Whether the phoneme sequence triggers an instruction is then verified according to at least one of the first confidence and the second confidence, thereby improving the reliability of the trigger instruction.
Mode 3: and utilizing the preliminary trigger time stamp to forward time stamps (such as the second time stamp) corresponding to a plurality of frames to intercept an acoustic hidden layer representation vector of the shared acoustic encoder, taking the acoustic hidden layer representation vector as input of a transducer decoder, acquiring a decoding text containing the trigger text and a third confidence coefficient thereof, and verifying whether a phoneme sequence is used for a trigger instruction according to the third confidence coefficient.
In some embodiments, the voice command recognition system corresponding to the mode 3 may be referred to as a two-stage system, a two-stage voice command recognition system, or a one-stage verification system, without limitation.
Referring to fig. 9, the flow corresponding to mode 3 may include steps 901 to 903.
901, According to the first timestamp, a second timestamp is obtained, and the starting point of the second timestamp is before the starting point of the first timestamp.
Specifically, for step 901, reference may be made to the description of step 601, which is not repeated here.
And 902, intercepting a third acoustic hidden layer representation vector corresponding to the second timestamp from the first acoustic hidden layer representation vector.
903, Inputting the third acoustic hidden layer representation vector into a decoder, obtaining a second decoding result matching the phoneme sequence, and a third confidence of the second decoding result.
For steps 902 and 903, reference may be made to the description of steps 801 and 802 in fig. 8, which is not repeated here.
Note that, unlike fig. 8, in step 902 the vector portion corresponding to the second timestamp is intercepted from the first acoustic hidden layer representation vector, instead of the vector portion corresponding to the third timestamp. Accordingly, the input to the decoder in step 903 is the vector portion corresponding to the second timestamp, not the vector portion corresponding to the third timestamp.
904, Based on the third confidence, it is determined whether the phoneme sequence is used for a trigger instruction.
For example, the third confidence may be compared with a preset threshold to determine whether the phoneme sequence triggers an instruction: when the third confidence is greater than or equal to the threshold, it may be determined to trigger the instruction; when the third confidence is less than the threshold, it may be determined not to trigger the instruction.
Therefore, on the basis of the conventional instruction recognition model, the embodiment of the present application further cascades a transducer verification framework. The preliminary trigger timestamp is pushed forward by a timestamp corresponding to several frames, the hidden states of the shared acoustic encoder are intercepted accordingly and decoded, and the decoded text containing the trigger text and its third confidence are obtained. Whether the phoneme sequence triggers an instruction is then verified according to the third confidence, thereby improving the reliability of the trigger instruction.
Mode 4: and forcedly aligning corresponding output of the phoneme discriminator by utilizing the phoneme sequence to obtain a first confidence coefficient, utilizing the preliminary trigger time stamp to push forward time stamps corresponding to a plurality of frames to intercept an acoustic hidden layer representation vector of the shared acoustic encoder, taking the acoustic hidden layer representation vector as input of a transducer decoder, acquiring a decoding text containing the trigger text and a third confidence coefficient thereof, and jointly verifying whether the phoneme sequence is used for a trigger instruction according to the first confidence coefficient and the third confidence coefficient.
Specifically, the process of obtaining the first confidence coefficient may refer to mode 1, and the process of obtaining the third confidence coefficient may refer to mode 3, which is not described herein.
Therefore, on the basis of the conventional instruction recognition model, the embodiment of the present application further cascades a forced alignment module based on the frame-level phoneme discriminator and a transducer verification framework. The forced alignment module is used to obtain the first confidence; the transducer verification framework pushes the preliminary trigger timestamp forward by a timestamp corresponding to several frames, intercepts the hidden states of the shared acoustic encoder accordingly for decoding, and obtains the decoded text containing the trigger text and its third confidence. Whether the phoneme sequence triggers an instruction is then jointly verified according to the first confidence and the third confidence, thereby improving the reliability of the trigger instruction.
It should be noted that, in the embodiment of the present application, the voice command recognition system may adjust the different thresholds according to a specific application scenario, so as to flexibly trade off the instruction wake-up rate against the false wake-up rate.
The scheme of the embodiment of the present application can be tested experimentally on real corpora. The voice command recognition model in the experiment is trained on a 23000-hour Mandarin ASR corpus collected from mobile phone and in-vehicle voice assistant products. During model training, a development set may be randomly drawn from the training set.
Table 1 shows a comparison of the recognition accuracy and the number of false wake-ups (FA) per hour between different frameworks.
TABLE 1
The clean test set (clean set) consists of instruction voice collected while a real vehicle is driving on an expressway; the noisy test set (noisy set) consists of instruction voice from the same source recorded while driving in downtown areas. The false wake-up test set is an 84-hour human voice test set, and the instruction set contains 29 instructions.
As shown in table 1, compared with the tiny Transducer recognition system of the same configuration alone (experiment S0), the two-stage verification method of the embodiment of the present application (experiment S3) can greatly reduce the number of false wake-ups from 1.47 times/hour to 0.13 times/hour, a relative reduction of 91.15%. Meanwhile, the loss of instruction recognition accuracy is controlled within 2%, so the accuracy remains basically at the same level as the basic system (experiment S0).
In addition, the ablation comparison between experiment S3 (the three-stage voice command recognition system) and experiment S1 (the two-stage voice command recognition system) shows that the timestamp obtained by forced alignment is more accurate than the original timestamp output by the Transducer model, and the more accurate timestamp is more helpful for the verification in the third stage (i.e., stage (c)). That is, compared with the two-stage voice command recognition system, the three-stage voice command recognition system can better suppress false wake-ups at comparable accuracy; for example, its number of false wake-ups per hour is 43.47% lower than that of the two-stage voice command recognition system.
In addition, the embodiment of the present application is also compared with the Transducer+MLD scheme (experiment S4), where the acoustic model used is a tiny Transducer model with the same configuration as that used in the system provided by the embodiment of the present application. The scheme corresponding to experiment S4 mainly performs multi-stage verification based on a statistical method, while the scheme provided by the embodiment of the present application focuses on verification with neural network methods. The experimental results show that the three-stage verification framework provided by the embodiment of the present application achieves lower false wake-up at comparable recognition performance.
Fig. 10 is a receiver operating characteristic (ROC) curve for the clean test set and fig. 11 is an ROC curve for the noisy test set. The abscissa in fig. 10 and 11 is the number of false wake-ups per hour (FA per hour), and the ordinate is the false rejection rate, which represents the ratio of the number of incorrectly recognized instructions to the total number of instructions. CaTT-KWS in fig. 10 and 11 is the three-stage system (experiment S3) provided by the embodiment of the present application. The ROC curve is obtained by taking a plurality of thresholds, plotting the result at each threshold as a point in the graph, and connecting these points into a curve, so that the performance of the instruction word recognition system can be reflected more comprehensively.
As can be seen from fig. 10 and 11, CaTT-KWS is comparable to the baseline Transducer+WFST scheme and the Transducer+MLD scheme at operating points where the number of false wake-ups per hour (FA per hour) is high. When the false wake-ups need to be further reduced (for example, FA per hour below 0.1), the recognition performance of the baseline Transducer+WFST scheme and the Transducer+MLD scheme drops drastically, while CaTT-KWS can still maintain a high recognition rate.
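For illustration, the ROC points described above could be computed from per-sample confidences as in the sketch below; the data layout (score lists) and the threshold grid are assumptions.

import numpy as np

def roc_points(trigger_scores, nontrigger_scores, nontrigger_hours, thresholds):
    """For each threshold, return (false wake-ups per hour, false rejection rate)."""
    trigger_scores = np.asarray(trigger_scores)         # confidences on true instructions
    nontrigger_scores = np.asarray(nontrigger_scores)   # confidences on non-instruction audio
    points = []
    for thr in thresholds:
        frr = float(np.mean(trigger_scores < thr))      # share of rejected true instructions
        fa_per_hour = float(np.sum(nontrigger_scores >= thr)) / nontrigger_hours
        points.append((fa_per_hour, frr))
    return points

# Example: sweep 50 thresholds over the observed score range,
# e.g. thresholds = np.linspace(all_scores.min(), all_scores.max(), 50).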
Further, the errors between the actual voice starting point and the timestamps obtained in the forced alignment stage and the Transducer stage may be measured, as shown in table 2 below. Table 2 shows the average error of the timestamp starting point obtained by the Transducer model and of the timestamp starting point obtained by the forced alignment module with respect to the real starting point, where the real time point is manually annotated.
TABLE 2
Stage | Clean set (s) | Noisy set (s) |
Transducer detection | 0.29 | 0.44 |
Forced alignment | 0.11 | 0.23 |
As can be seen from table 2, the starting point of the timestamp obtained by the forced alignment module is more accurate, and the more accurate starting point can be used as input of the transducer stage to improve the performance of the instruction word recognition system.
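For illustration, the average start-point errors reported in table 2 can be computed as follows; the list-of-floats data format is an assumption.

def mean_start_error(predicted_starts, annotated_starts):
    """Average absolute error (in seconds) between predicted trigger starting
    points and manually annotated starting points."""
    errors = [abs(p - a) for p, a in zip(predicted_starts, annotated_starts)]
    return sum(errors) / len(errors)

# Example: mean_start_error([1.32, 0.80], [1.20, 0.75]) returns 0.085.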
The specific embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described further. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be regarded as the disclosure of the present application.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application. It is to be understood that the numbers may be interchanged where appropriate such that the described embodiments of the application may be practiced otherwise than as shown or described.
The method embodiments of the present application are described above in detail, and the apparatus embodiments of the present application are described below in conjunction with fig. 12 to 13.
Fig. 12 is a schematic block diagram of an apparatus 600 for voice instruction recognition in accordance with an embodiment of the present application. As shown in fig. 12, the speech instruction recognition apparatus 600 may include an acquisition unit 610, an acoustic model 620, a decoding graph 630, and a verification module 640.
An acquisition unit 610 for acquiring a speech signal including a plurality of speech frames;
An acoustic model 620, configured to receive the speech signal as input and obtain a first phoneme recognition result corresponding to each speech frame in the speech signal and a first acoustic hidden layer representation vector of each speech frame output by an encoder in the acoustic model, where the first phoneme recognition result includes a probability distribution of each speech frame in a phoneme space, and the acoustic model is obtained by training with speech signal samples and the actual phoneme of each speech frame in the speech signal samples;
A decoding diagram 630, configured to input the first phoneme recognition result, obtain a phoneme sequence corresponding to the speech signal, and a first timestamp corresponding to the phoneme sequence;
a verification module 640, configured to determine whether the phoneme sequence is used for triggering an instruction according to a vector portion corresponding to the first timestamp in the first acoustic hidden layer representation vector.
Optionally, the verification module 640 is specifically configured to:
acquiring a second time stamp according to the first time stamp, wherein the starting point of the second time stamp is before the starting point of the first time stamp;
Inputting at least part of the first acoustic hidden layer representation vector into a phoneme discriminator, and obtaining a second phoneme recognition result corresponding to the second timestamp output by the phoneme discriminator;
aligning the second phoneme recognition result with the phoneme sequence to obtain a first confidence coefficient, wherein the first confidence coefficient is used for representing an alignment score of the second phoneme recognition result with the phoneme sequence;
And determining whether the phoneme sequence is used for triggering an instruction according to the first confidence.
Optionally, the verification module 640 is specifically configured to:
Acquiring a linear decoding diagram according to the phoneme sequence, wherein a first symbol is added before the phoneme sequence in the linear decoding diagram, and the first symbol is used for absorbing output which does not belong to the phoneme sequence in the second phoneme recognition result;
and inputting the second phoneme recognition result into the linear decoding diagram, and aligning the second phoneme recognition result and the phoneme sequence.
Optionally, the verification module 640 is specifically configured to:
Acquiring a third timestamp under the condition that the second phoneme recognition result is aligned with the phoneme sequence, wherein the starting point of the third timestamp is a voice frame corresponding to a first phoneme aligned with the phoneme sequence in the second phoneme recognition result;
Intercepting a second acoustic hidden layer representation vector corresponding to the third timestamp in the first acoustic hidden layer representation vector;
Inputting the second acoustic hidden layer representation vector into a decoder to obtain a first decoding result matched with the phoneme sequence and a second confidence of the first decoding result;
Determining whether the phoneme sequence is used for triggering an instruction according to at least one of the first confidence and the second confidence.
Optionally, the verification module 640 is specifically configured to:
And determining the length of a decoding sequence of the decoder according to the length of the phoneme sequence.
Optionally, the decoder comprises a transducer decoder.
Optionally, a training unit is further included, configured to:
A multi-tasking joint learning is performed based on the first loss of the acoustic model, the second loss of the phoneme discriminator, and the third loss of the decoder to jointly train the acoustic model, the phoneme discriminator, and the decoder.
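A minimal sketch of such multi-task joint learning, combining the three losses into a single objective; the loss weights are illustrative assumptions and are not specified by the embodiment.

def joint_loss(acoustic_loss, phoneme_ce_loss, decoder_ce_loss,
               lambda_phoneme=0.5, lambda_decoder=0.5):
    """Weighted sum of the acoustic-model loss, the phoneme-discriminator loss
    and the decoder loss for joint training of the shared encoder and all heads."""
    return acoustic_loss + lambda_phoneme * phoneme_ce_loss + lambda_decoder * decoder_ce_loss

# During a training step the three losses are computed on the same batch and
# joint_loss(...).backward() updates the shared acoustic encoder and all heads.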
Optionally, the verification module 640 is specifically configured to:
acquiring a second time stamp according to the first time stamp, wherein the starting point of the second time stamp is before the starting point of the first time stamp;
intercepting a third acoustic hidden layer representation vector corresponding to the second timestamp from the first acoustic hidden layer representation vector;
Inputting the third acoustic hidden layer representation vector into a decoder to obtain a second decoding result matched with the phoneme sequence and a third confidence of the second decoding result;
And determining whether the phoneme sequence is used for triggering an instruction according to the third confidence.
Optionally, the phoneme space comprises a plurality of phonemes and a null output.
The decoding diagram 630 is specifically used for:
and inputting the recognition result corresponding to the frame which is subjected to non-null output in the first phoneme recognition result into the decoding diagram to obtain the phoneme sequence and the first timestamp.
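As an illustration, only the frames whose most probable output is not the null (blank) symbol are passed to the decoding diagram, as in the sketch below; the argmax criterion and the blank id are assumptions made for illustration.

import numpy as np

def non_blank_frames(posteriors, blank_id=0):
    """Return (frame indices, posterior rows) of frames whose argmax is not blank."""
    best = np.argmax(posteriors, axis=-1)          # (T,) most probable output per frame
    keep = np.nonzero(best != blank_id)[0]
    return keep, posteriors[keep]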
Optionally, the encoder includes a feed forward sequential memory network FSMN.
Optionally, the decoding graph includes a phoneme dictionary and an instruction set model.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 600 for recognizing a voice instruction in this embodiment may correspond to a corresponding main body for executing the method 300 of the embodiment of the present application, and the foregoing and other operations and/or functions of each module in the apparatus 600 are respectively for implementing each method above, or corresponding flow in each method, which are not described herein for brevity.
The apparatus and system of embodiments of the present application are described above in terms of functional modules in connection with the accompanying drawings. It should be understood that the functional module may be implemented in hardware, or may be implemented by instructions in software, or may be implemented by a combination of hardware and software modules. Specifically, each step of the method embodiment in the embodiment of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in a software form, and the steps of the method disclosed in connection with the embodiment of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
Fig. 13 is a schematic block diagram of an electronic device 800 provided by an embodiment of the application.
As shown in fig. 13, the electronic device 800 may include:
A memory 810 and a processor 820, the memory 810 being for storing a computer program and transmitting the program code to the processor 820. In other words, the processor 820 may call and run a computer program from the memory 810 to implement the methods in embodiments of the present application.
For example, the processor 820 may be used to perform the steps of the methods 300 or 400 described above according to instructions in the computer program.
In some embodiments of the application, the processor 820 may include, but is not limited to:
A general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the application, the memory 810 includes, but is not limited to:
Volatile memory and/or nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synch link DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
In some embodiments of the application, the computer program may be partitioned into one or more modules that are stored in the memory 810 and executed by the processor 820 to perform the encoding methods provided by the present application. The one or more modules may be a series of computer program instruction segments capable of performing the specified functions, which are used to describe the execution of the computer program in the electronic device 800.
Optionally, as shown in fig. 13, the electronic device 800 may further include:
a transceiver 830, the transceiver 830 being connectable to the processor 820 or the memory 810.
Processor 820 may control transceiver 830 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. Transceiver 830 may include a transmitter and a receiver. Transceiver 830 may further include antennas, the number of which may be one or more.
It should be appreciated that the various components in the electronic device 800 are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
According to an aspect of the present application, there is provided a communication device comprising a processor and a memory, the memory being configured to store a computer program, and the processor being configured to invoke and run the computer program stored in the memory, so that the communication device performs the method of the above method embodiments.
According to an aspect of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the method of the above-described method embodiments.
In other words, when implemented in software, the above may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), a semiconductor medium (e.g., a solid state drive (SSD)), or the like.
It will be appreciated that in particular embodiments of the application, data relating to user information and the like may be involved. When the above embodiments of the present application are applied to specific products or technologies, user approval or consent is required, and the collection, use and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily appreciate variations or alternatives within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (15)
1. A method of voice command recognition, comprising:
acquiring a voice signal, wherein the voice signal comprises a plurality of voice frames;
Inputting the voice signal into an acoustic model to obtain a first phoneme recognition result corresponding to each voice frame in the voice signal and a first acoustic hidden layer representation vector of each voice frame output by an encoder in the acoustic model, wherein the first phoneme recognition result comprises a probability distribution of each voice frame in a phoneme space, and the acoustic model is obtained through training with voice signal samples and the actual phoneme of each voice frame in the voice signal samples;
Inputting the first phoneme recognition result into a decoding diagram to obtain a phoneme sequence corresponding to the voice signal and a first timestamp corresponding to the phoneme sequence;
And verifying the phoneme sequence according to a vector part corresponding to the first timestamp in the first acoustic hidden layer representation vector and a neural network model, and determining whether the phoneme sequence is used for triggering instructions.
2. The method of claim 1, wherein the validating the phoneme sequence according to the vector portion of the first acoustic hidden layer representation vector corresponding to the first timestamp and the neural network model to determine whether the phoneme sequence is used for a triggering instruction comprises:
acquiring a second time stamp according to the first time stamp, wherein the starting point of the second time stamp is before the starting point of the first time stamp;
Inputting at least part of the first acoustic hidden layer representation vector into a phoneme discriminator, and obtaining a second phoneme recognition result corresponding to the second timestamp output by the phoneme discriminator;
aligning the second phoneme recognition result with the phoneme sequence to obtain a first confidence coefficient, wherein the first confidence coefficient is used for representing an alignment score of the second phoneme recognition result with the phoneme sequence;
And determining whether the phoneme sequence is used for triggering an instruction according to the first confidence.
3. The method as recited in claim 2, wherein said aligning said second phoneme recognition result with said phoneme sequence comprises:
Acquiring a linear decoding diagram according to the phoneme sequence, wherein a first symbol is added before the phoneme sequence in the linear decoding diagram, and the first symbol is used for absorbing output which does not belong to the phoneme sequence in the second phoneme recognition result;
and inputting the second phoneme recognition result into the linear decoding diagram, and aligning the second phoneme recognition result and the phoneme sequence.
4. The method as recited in claim 2, further comprising:
Acquiring a third timestamp under the condition that the second phoneme recognition result is aligned with the phoneme sequence, wherein the starting point of the third timestamp is a voice frame corresponding to a first phoneme aligned with the phoneme sequence in the second phoneme recognition result;
Intercepting a second acoustic hidden layer representation vector corresponding to the third timestamp in the first acoustic hidden layer representation vector;
Inputting the second acoustic hidden layer representation vector into a decoder to obtain a first decoding result matched with the phoneme sequence and a second confidence of the first decoding result;
wherein the determining, according to the first confidence, whether the phoneme sequence is used for triggering an instruction includes:
Determining whether the phoneme sequence is used for triggering an instruction according to at least one of the first confidence and the second confidence.
5. The method as recited in claim 4, further comprising:
And determining the length of a decoding sequence of the decoder according to the length of the phoneme sequence.
6. The method of claim 4, wherein the decoder comprises a transducer decoder.
7. The method as recited in claim 4, further comprising:
A multi-tasking joint learning is performed based on the first loss of the acoustic model, the second loss of the phoneme discriminator, and the third loss of the decoder to jointly train the acoustic model, the phoneme discriminator, and the decoder.
8. The method of claim 1, wherein the validating the phoneme sequence according to the vector portion of the first acoustic hidden layer representation vector corresponding to the first timestamp and the neural network model to determine whether the phoneme sequence is used for a triggering instruction comprises:
acquiring a second time stamp according to the first time stamp, wherein the starting point of the second time stamp is before the starting point of the first time stamp;
intercepting a third acoustic hidden layer representation vector corresponding to the second timestamp from the first acoustic hidden layer representation vector;
Inputting the third acoustic hidden layer representation vector into a decoder to obtain a second decoding result matched with the phoneme sequence and a third confidence of the second decoding result;
And determining whether the phoneme sequence is used for triggering an instruction according to the third confidence.
9. The method of any of claims 1-7, wherein the phoneme space comprises a plurality of phonemes and a null output;
Inputting the first phoneme recognition result into a decoding diagram to obtain a phoneme sequence corresponding to the voice signal and a first timestamp corresponding to the phoneme sequence, wherein the method comprises the following steps:
and inputting the recognition result corresponding to the frame which is subjected to non-null output in the first phoneme recognition result into the decoding diagram to obtain the phoneme sequence and the first timestamp.
10. The method of any of claims 1-7, wherein the encoder comprises a feed forward sequential memory network FSMN.
11. The method of any of claims 1-7, wherein the decoding diagram comprises a phoneme dictionary and an instruction set model.
12. An apparatus for voice command recognition, comprising:
an acquisition unit configured to acquire a speech signal including a plurality of speech frames;
The acoustic model is used for inputting the voice signal to obtain a first phoneme recognition result corresponding to each voice frame in the voice signal and a first acoustic hidden layer representation vector of each voice frame output by an encoder in the acoustic model, wherein the first phoneme recognition result comprises a probability distribution of each voice frame in a phoneme space, and the acoustic model is obtained through training with voice signal samples and the actual phoneme of each voice frame in the voice signal samples;
the decoding diagram is used for inputting the first phoneme recognition result to obtain a phoneme sequence corresponding to the voice signal and a first timestamp corresponding to the phoneme sequence;
And the verification module is used for verifying the phoneme sequence according to the vector part corresponding to the first timestamp in the first acoustic hidden layer representation vector and a neural network model, and determining whether the phoneme sequence is used for triggering an instruction.
13. An electronic device comprising a processor and a memory, the memory having instructions stored therein that when executed by the processor cause the processor to perform the method of any of claims 1-11.
14. A computer storage medium for storing a computer program, the computer program comprising instructions for performing the method of any one of claims 1-11.
15. A computer program product comprising computer program code which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210551539.0A CN115132196B (en) | 2022-05-18 | 2022-05-18 | Voice instruction recognition method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210551539.0A CN115132196B (en) | 2022-05-18 | 2022-05-18 | Voice instruction recognition method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115132196A CN115132196A (en) | 2022-09-30 |
CN115132196B true CN115132196B (en) | 2024-09-10 |
Family
ID=83376753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210551539.0A Active CN115132196B (en) | 2022-05-18 | 2022-05-18 | Voice instruction recognition method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115132196B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116665656B (en) * | 2023-07-24 | 2023-10-10 | 美智纵横科技有限责任公司 | Speech recognition model generation method, speech recognition method, device and chip |
CN116825109B (en) * | 2023-08-30 | 2023-12-08 | 深圳市友杰智新科技有限公司 | Processing method, device, equipment and medium for voice command misrecognition |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106940998A (en) * | 2015-12-31 | 2017-07-11 | 阿里巴巴集团控股有限公司 | A kind of execution method and device of setting operation |
CN113539242A (en) * | 2020-12-23 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070033041A1 (en) * | 2004-07-12 | 2007-02-08 | Norton Jeffrey W | Method of identifying a person based upon voice analysis |
KR102011489B1 (en) * | 2016-01-04 | 2019-08-16 | 한국전자통신연구원 | Utterance Verification Method using Deep Neural Network |
US10431206B2 (en) * | 2016-08-22 | 2019-10-01 | Google Llc | Multi-accent speech recognition |
WO2019161011A1 (en) * | 2018-02-16 | 2019-08-22 | Dolby Laboratories Licensing Corporation | Speech style transfer |
GB2575423B (en) * | 2018-05-11 | 2022-05-04 | Speech Engineering Ltd | Computer implemented method and apparatus for recognition of speech patterns and feedback |
CN110689876B (en) * | 2019-10-14 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Voice recognition method and device, electronic equipment and storage medium |
US20220129790A1 (en) * | 2020-10-28 | 2022-04-28 | Verizon Media Inc. | System and method for deep enriched neural networks for time series forecasting |
CN112002308B (en) * | 2020-10-30 | 2024-01-09 | 腾讯科技(深圳)有限公司 | Voice recognition method and device |
CN112951211B (en) * | 2021-04-22 | 2022-10-18 | 中国科学院声学研究所 | Voice awakening method and device |
CN113362812B (en) * | 2021-06-30 | 2024-02-13 | 北京搜狗科技发展有限公司 | Voice recognition method and device and electronic equipment |
CN114333768A (en) * | 2021-09-26 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Voice detection method, device, equipment and storage medium |
- 2022-05-18 CN CN202210551539.0A patent/CN115132196B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106940998A (en) * | 2015-12-31 | 2017-07-11 | 阿里巴巴集团控股有限公司 | A kind of execution method and device of setting operation |
CN113539242A (en) * | 2020-12-23 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN115132196A (en) | 2022-09-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111933129B (en) | Audio processing method, language model training method and device and computer equipment | |
Serdyuk et al. | Towards end-to-end spoken language understanding | |
CN108711421B (en) | Speech recognition acoustic model establishing method and device and electronic equipment | |
CN108417210B (en) | Word embedding language model training method, word recognition method and system | |
EP3748629A1 (en) | Identification method and device for voice keywords, computer-readable storage medium, and computer device | |
Henderson et al. | Discriminative spoken language understanding using word confusion networks | |
CN115132196B (en) | Voice instruction recognition method and device, electronic equipment and storage medium | |
Prabhavalkar et al. | An Analysis of" Attention" in Sequence-to-Sequence Models. | |
DE112017003563T5 (en) | METHOD AND SYSTEM OF AUTOMATIC LANGUAGE RECOGNITION USING A-POSTERIORI TRUST POINTS | |
JP2021515905A (en) | Speech recognition methods and their devices, devices, storage media and programs | |
CN112017645B (en) | Voice recognition method and device | |
Myer et al. | Efficient keyword spotting using time delay neural networks | |
CN112967713B (en) | Audio-visual voice recognition method, device, equipment and storage medium based on multi-modal fusion | |
US11532301B1 (en) | Natural language processing | |
US20220122596A1 (en) | Method and system of automatic context-bound domain-specific speech recognition | |
WO2017184387A1 (en) | Hierarchical speech recognition decoder | |
Holone | N-best list re-ranking using syntactic score: A solution for improving speech recognition accuracy in air traffic control | |
CN114495905A (en) | Speech recognition method, apparatus and storage medium | |
KR102192678B1 (en) | Apparatus and method for normalizing input data of acoustic model, speech recognition apparatus | |
Tabibian | A voice command detection system for aerospace applications | |
CN114360510A (en) | Voice recognition method and related device | |
US11626107B1 (en) | Natural language processing | |
Loh et al. | Speech recognition interactive system for vehicle | |
US20150262575A1 (en) | Meta-data inputs to front end processing for automatic speech recognition | |
CN114360514A (en) | Speech recognition method, apparatus, device, medium, and product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |