WO2020147609A1 - Speech recognition method and apparatus - Google Patents


Info

Publication number
WO2020147609A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
vector
syntactic
clause
Prior art date
Application number
PCT/CN2020/070581
Other languages
French (fr)
Chinese (zh)
Inventor
张帆
郑梓豪
胡于响
姜飞俊
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2020147609A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 — Speech to text systems
    • G10L2015/223 — Execution procedure of a spoken command

Definitions

  • Embodiments of the present invention relate to the field of computer technology, and in particular to a speech recognition method and apparatus.
  • Intelligent equipment combines traditional electrical equipment with computer technology, data processing, control, sensor, network communication, power electronics, and related technologies.
  • Smart voice devices are an important branch of such equipment.
  • With smart voice devices, users can control various smart devices by voice alone, including the smart voice device itself and other smart devices it controls.
  • At present, each control of a smart voice device requires a wake-up word, followed by a voice command, to carry out the user's intention.
  • For example, the user needs to use the wake-up word "Tmall Genie" to wake the smart voice device every time before any operation or control can be carried out.
  • In the sentence "Why are you going home so late? Please turn on the bedroom lights", the first part, "Why are you going home so late?", is interaction between users, while "Please turn on the bedroom lights" is a control instruction for the smart voice device. Current smart voice devices cannot process such complex mixed instructions issued without a wake-up word.
  • In view of this, embodiments of the present invention provide a speech recognition solution to solve the above problem.
  • According to a first aspect, a speech recognition method is provided, including: obtaining text data corresponding to voice input data and a text vector corresponding to the text data; obtaining syntactic features of the text vector; obtaining, according to the syntactic features, at least one text clause contained in the text data, and obtaining domain information of each text clause; and recognizing, at least according to the domain information of each text clause, a voice command in the voice input data.
  • According to a second aspect, a speech recognition apparatus is provided, including: a first acquisition module for obtaining text data corresponding to voice input data and a text vector corresponding to the text data; a second acquisition module for obtaining syntactic features of the text vector; a third acquisition module for obtaining, according to the syntactic features, at least one text clause contained in the text data and the domain information of each text clause; and a recognition module for recognizing, at least according to the domain information of each text clause, a voice command in the voice input data.
  • According to a third aspect, an intelligent device is provided, including a processor, a memory, a communication interface, and a communication bus. The processor, the memory, and the communication interface communicate with each other through the communication bus. The memory stores at least one executable instruction that causes the processor to perform operations corresponding to the speech recognition method of the first aspect.
  • According to a fourth aspect, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the speech recognition method of the first aspect is implemented.
  • With the solution of the embodiments, the text data converted from the voice input data and the corresponding text vector are first obtained; the syntactic features are then obtained by feature extraction on the text vector; according to the syntactic features, the text data is divided into text clauses and the domain information of each clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. In this way, the smart voice device is better suited to real use environments: the user does not need a wake-up word, and whether the voice input data contains pure voice commands or a mixture of voice commands and other voice data, the device can effectively divide it into clauses, recognize the voice commands contained therein, and subsequently be operated and controlled by them.
  • FIG. 1 is a flowchart of the steps of a speech recognition method according to the first embodiment of the present invention;
  • FIG. 2 is a flowchart of the steps of a speech recognition method according to the second embodiment of the present invention;
  • FIG. 3 is a schematic structural diagram of the neural network model in the embodiment shown in FIG. 2;
  • FIG. 4 is a structural block diagram of a speech recognition apparatus according to the third embodiment of the present invention;
  • FIG. 5 is a structural block diagram of a speech recognition apparatus according to the fourth embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of a smart device according to the fifth embodiment of the present invention.
  • Referring to FIG. 1, a flowchart of the steps of a speech recognition method according to the first embodiment of the present invention is shown.
  • Step S102 Acquire text data corresponding to the voice input data and text vectors corresponding to the text data.
  • In this embodiment, the user can operate and control the smart voice device by voice; the smart voice device takes the user's voice as input, generates corresponding voice input data, converts the voice input data into corresponding text data, and then processes it accordingly.
  • a text vector corresponding to the text data is also obtained to characterize the text data in the form of a vector, and to facilitate subsequent processing.
  • The specific implementation of converting the voice input data into corresponding text data and obtaining the text vector corresponding to the text data can be realized by those skilled in the art in any appropriate manner according to actual needs; the embodiments of the present invention do not limit this.
  • For example, a convolutional neural network (CNN) model, a BP neural network model, a hidden Markov model (HMM), or multi-band spectral subtraction can be used to convert the voice input data into text data. The text vector corresponding to the text data can be obtained, for example, by a deep learning method (such as word2vec), a graph-based method (such as TextRank), a topic-model-based method (such as LDA), or a statistical method (such as bag of words).
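  • As an illustration of the text-vector step above, the following minimal sketch maps each word of "please turn on the bedroom light" to a fixed-dimension vector via an embedding lookup. The vocabulary, the dimension D=8, and the random embedding values are assumptions for illustration only; a real system would use vectors trained by word2vec or a similar method.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative choice)
vocab = {"please": 0, "turn": 1, "on": 2, "the": 3, "bedroom": 4, "light": 5}
# Random values stand in for vectors learned by word2vec or similar.
embeddings = rng.standard_normal((len(vocab), D))

def text_to_vectors(words):
    """Map each word to its D-dimensional vector; unknown words get zeros."""
    return np.stack([embeddings[vocab[w]] if w in vocab else np.zeros(D)
                     for w in words])

text_vector = text_to_vectors("please turn on the bedroom light".split())
print(text_vector.shape)  # (6, 8): one D-dimensional vector per word
```

The resulting N*D array of per-word vectors plays the role of the "text vector" that the later steps operate on.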
  • Step S104 Obtain the syntactic feature of the text vector corresponding to the text data.
  • The syntactic features of the text vector can represent the dependency relationships and semantic information between words in the corresponding text data, and can be expressed by a syntactic feature vector.
  • Optionally, feature extraction may be performed on the text vector through a convolutional neural network (CNN) model or a recurrent neural network (RNN) model to obtain the syntactic features of the text vector. However, this is not limiting; in practical applications, those skilled in the art may also use other appropriate methods to obtain the syntactic features, such as text classification.
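  • A minimal numpy sketch of the CNN-style option: a "same"-padded 1-D convolution slides over the word-vector sequence and emits one feature vector per word. The kernel width K=3, feature size F=16, and random weights are illustrative assumptions, not the trained model of the embodiments.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, F, K = 6, 8, 16, 3  # words, input dim, feature dim, kernel width (assumed)
x = rng.standard_normal((N, D))           # text vector: one row per word
W = rng.standard_normal((K, D, F)) * 0.1  # convolution kernel weights (stand-in)
b = np.zeros(F)

def conv1d_same(x, W, b):
    """'Same'-padded 1-D convolution plus ReLU: one feature row per word."""
    K = W.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([sum(xp[i + k] @ W[k] for k in range(K)) + b
                    for i in range(x.shape[0])])
    return np.maximum(out, 0)  # ReLU non-linearity

syntactic = conv1d_same(x, W, b)
print(syntactic.shape)  # (6, 16): a syntactic feature vector per word
```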
  • Step S106 Obtain at least one text clause contained in the text data according to the syntactic feature, and obtain domain information of each text clause.
  • the text data corresponding to the voice input data contains one or more text clauses.
  • the text clause may be the text clause corresponding to the voice command, or other text clauses.
  • The multiple text clauses may also be a mixture of clauses corresponding to voice commands and clauses corresponding to other voice data, as in a complex multi-user scene where user A and user B are talking while one of them sends a voice command to the smart voice device. For example, in "Why are you going home so late? Please turn on the light in the bedroom", the first half, "Why are you going home so late?", is recognized as a text clause corresponding to other voice data, and the second half, "Please turn on the light in the bedroom", as a text clause corresponding to a voice command.
  • one or more text clauses in the text data can be determined according to the syntactic feature.
  • The method of obtaining text clauses can be adapted to the method of obtaining syntactic features. For example, when a CNN or RNN model is used to obtain the syntactic features of the text vector, the text data can be sequence-labeled according to the syntactic features, and one or more text clauses obtained from the result of the sequence labeling.
  • In this embodiment, the domain information of each text clause is also obtained according to the syntactic features of the text vector, for example through a machine learning algorithm or a neural network model that derives the domain information of the corresponding text clause from the syntactic features, where the domain information includes the information of the domain corresponding to the voice command.
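  • One way to read this step: once a clause has a fixed-length feature vector, a simple classifier maps it to domain probabilities. The linear-plus-softmax form, the 16-dimensional feature, and the two-label domain set below are assumptions for illustration; the embodiments leave the concrete model open.

```python
import numpy as np

rng = np.random.default_rng(3)
domains = ["IOT", "chat"]  # assumed domain label set for illustration
W = rng.standard_normal((16, len(domains)))  # stand-in for trained weights

def classify(feature):
    """Linear layer + softmax over the assumed domain labels."""
    logits = feature @ W
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return domains[int(p.argmax())], p

label, probs = classify(rng.standard_normal(16))
print(label, probs.round(3))
```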
  • Step S108 Recognize the voice command in the voice input data at least according to the domain information of each text clause.
  • When the voice input data includes a voice command, among the one or more text clauses contained in the corresponding text data there should be a clause whose domain information indicates that the corresponding part of the voice input data is a voice command, and the voice command can thus be recognized from the voice input data. For example, the text clause "Please turn on the bedroom light" can be determined to be a voice command.
  • In this embodiment, the text data converted from the voice input data and the corresponding text vector are first obtained; the syntactic features are then obtained by feature extraction on the text vector; according to the syntactic features, the text data is divided into text clauses and the domain information of each clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. In this way, the smart voice device is better suited to real use environments: the user does not need a wake-up word, and whether the voice input data contains pure voice commands or a mixture of voice commands and other voice data, the device can effectively divide it into clauses, recognize the voice commands contained therein, and subsequently be operated and controlled by them.
  • the voice recognition method of this embodiment can be executed by any appropriate smart voice device with data processing capabilities, such as various smart home appliances with corresponding functions.
  • Referring to FIG. 2, a flowchart of the steps of a speech recognition method according to the second embodiment of the present invention is shown.
  • Step S202 Acquire text data corresponding to the voice input data and text vectors corresponding to the text data.
  • the text vector corresponding to the text data includes a word vector corresponding to each word in the text data.
  • For text data in different language systems, the specific meaning of a "word" may differ. For example, for Chinese text data, a word may be a single character or a multi-character word; for text data in language systems such as English and French, one or more letters together form a complete word.
  • Optionally, this step can be implemented as: acquiring voice input data and generating text data corresponding to the voice input data; generating a word vector corresponding to each word in the text data; and generating, from the word vectors of the words, a text vector corresponding to the text data.
  • The specific implementation of generating the text data from the voice input data and generating the word vector for each word can be realized by those skilled in the art in any appropriate manner according to actual needs; the embodiments of the present invention do not limit this. Characterizing the text vector by the word vectors of the individual words not only facilitates processing of the text data but also effectively avoids excessive information loss caused by vectorization.
  • Step S204 Obtain the syntactic feature of the text vector corresponding to the text data.
  • In this embodiment, feature extraction is performed on the text vector corresponding to the text data to obtain the syntactic features of the text vector.
  • Optionally, this step can be implemented as: performing feature extraction on the word vector corresponding to each word in the text vector to obtain the syntactic features of each word.
  • the syntactic features extracted by the feature extraction method can more effectively characterize the characteristics of the words corresponding to each word vector.
  • Step S206 Obtain at least one text clause contained in the text data according to the syntax feature of the text vector, and obtain domain information of each text clause.
  • Optionally, according to the syntactic features of each word, a label for each word may be obtained, where the labels include an end label; the sequence labeling of the text data is obtained from the labels of the words; and at least one text clause contained in the text data is obtained according to the end labels in the sequence labeling. That is, the problem of dividing text clauses is converted into a problem of sequence labeling of the text data.
  • The types of labels can be set appropriately by those skilled in the art according to actual needs, but include at least the end label. If a word carries the end label, it means that all words from the beginning of the text data up to that word form a text clause, or that all words from the first word after the previous end label up to this end label form a text clause.
  • For example, the labels may include B labels (start labels, indicating that the current word begins a sentence), I labels (middle labels, indicating that the current word lies between the beginning and end of a sentence), and E labels (end labels, indicating that the current word ends a sentence). If the current text data includes multiple E labels, it contains multiple text clauses, which can be divided according to the E labels; if it includes only one E label, it contains only one text clause, namely the text data itself.
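  • The B/I/E scheme above can be sketched directly: every E label closes the clause that began after the previous E label. The English word split and the hand-written labels below are illustrative stand-ins for model output.

```python
def split_by_labels(words, labels):
    """Split a labelled word sequence into clauses at each E (end) label."""
    clauses, current = [], []
    for word, tag in zip(words, labels):
        current.append(word)
        if tag == "E":       # end label closes the current text clause
            clauses.append(current)
            current = []
    if current:              # defensive: trailing words without an E label
        clauses.append(current)
    return clauses

words = "why are you going home so late please turn on the bedroom light".split()
labels = ["B", "I", "I", "I", "I", "I", "E", "B", "I", "I", "I", "I", "E"]
clauses = split_by_labels(words, labels)
print([" ".join(c) for c in clauses])
# ['why are you going home so late', 'please turn on the bedroom light']
```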
  • In this way, the division of text clauses is more accurate; in addition, compared with other clause-division methods, it simplifies the operation steps and reduces the implementation cost of the division.
  • Optionally, the domain features corresponding to each text clause may be obtained according to the syntactic features of the text vector; for the domain features of each text clause, the maximum feature value in each feature dimension is extracted to generate the domain feature vector of the clause; and the domain information of the clause is then determined according to its domain feature vector. In this way, the most effective feature expression of each text clause is obtained, and the feature expressions of all clauses have the same vector length, which facilitates subsequent processing.
  • Optionally, obtaining the domain features of each text clause according to the syntactic features of the text vector may include: obtaining the domain features of the text vector according to its syntactic features, and then, according to the information of the words contained in each text clause, obtaining the domain features of that clause from the domain features of the text vector. That is, the total domain features are first obtained from the text vector of the entire text data, and the domain features of each clause are then taken from the total domain features according to the words in the clause. This not only keeps the domain features of each clause consistent with the total domain features, but also simplifies their acquisition.
  • Step S208: Recognize the voice command in the voice input data at least according to the domain information of each text clause.
  • In this embodiment, a text clause belonging to a voice instruction corresponds to set domain information. When the domain information of a text clause is consistent with the set domain information, the clause can be determined to be the text clause corresponding to a voice command.
  • For text clauses that do not correspond to voice instructions, other domain information may be set. The other domain information may be unified domain information indicating that the clause is not a voice instruction, or it may be subdivided to indicate the specific domain of the clause, such as the interaction domain.
  • Step S210 According to the recognized voice command, perform the operation indicated by the voice command on the smart voice device.
  • The operation can be any appropriate operation, such as instructing the smart voice device to turn a function on or off (for example, turning on the air conditioner or turning off the light) or to perform a query (for example, querying and playing a certain song, or querying and reporting the weather of a certain place). The embodiments of the present invention do not limit the specific operations indicated by the voice instructions.
  • the speech recognition solution provided by the embodiments of the present invention can be implemented in a variety of suitable ways.
  • some or all of the speech recognition solutions can be implemented through a neural network model.
  • the convolutional neural network CNN model is taken as an example to describe the above process of this embodiment.
  • The structure of such a CNN model is shown in FIG. 3; it includes an input part A, a feature extraction part B, a sentence boundary detection part C, and a domain classification part D.
  • the input part A may be the input layer of the CNN, and is used to receive the input text vector, such as the text vector of the text data corresponding to the voice input data.
  • The feature extraction part B contains multiple convolutional layers; optionally, at least 12 convolutional layers are set to improve the accuracy of feature extraction. A batch normalization layer and an activation layer may also be set together with each convolutional layer in part B, and residual processing may be applied to the output of each convolutional layer.
  • the sentence boundary detection part C may include a batch normalization layer, a convolution layer, and an output layer in sequence, wherein the output layer adopts a Softmax function as a loss function.
  • Through the sentence boundary detection part C, the label of the word vector corresponding to each word in the text vector can be obtained, and then the sequence labeling of the entire text data.
  • The division of the text clauses can then be determined according to the end labels in the sequence labeling, such as the E labels.
  • The domain classification part D can optionally include, in sequence, a batch normalization layer, a convolutional layer, a pooling layer, and an output layer, where the pooling layer adopts one-dimensional RoI pooling (1-D RoI pooling) and the output layer adopts the Softmax function as its loss function. According to the division result of the text clauses and the domain information of each clause, the domain classification part D can identify the text clause corresponding to the voice command.
  • The sentence boundary detection part C and the domain classification part D in the CNN model of this embodiment share the syntactic features extracted by the feature extraction part B, which improves the data processing efficiency of the CNN model and reduces its implementation cost.
  • the corresponding voice recognition process includes:
  • This part covers the conversion and processing of the data before the text vector is input into the CNN model. Take the voice "Why are you going home so late? Please turn on the bedroom light" as an example: the voice data is converted into text data, and each word in it is converted into a D-dimensional vector, where the specific value of D can be set appropriately by those skilled in the art according to actual needs. This yields N D-dimensional vectors, where N is the number of words; in this example there are 17 words, so N is 17. The N D-dimensional vectors constitute the text vector corresponding to the text data.
  • the N D-dimensional vectors generated above are received through the input layer of the CNN model.
  • It includes: performing a batch normalization operation on the input vectors to generate normalized vectors; performing non-linear processing on the normalized vectors; performing feature extraction on the non-linearly processed vectors through a convolutional layer to obtain initial features; performing residual processing on the initial features, and obtaining and outputting the syntactic features of the vector according to the residual processing result; and returning to the batch normalization step until the syntactic features of the text vector are obtained.
  • Optionally, the batch normalization operation is performed on the input vector through the batch normalization layer to generate a normalized vector. The input to the batch normalization layer of the first convolutional layer part is the text vector corresponding to the text data; the input to the batch normalization layer of each subsequent convolutional layer part is the output vector of the previous convolutional layer part. The normalized vector is then non-linearized through the activation layer.
  • That is, the text vector is first input into the first batch normalization layer, and batch normalization, non-linearization, feature extraction, and residual processing are performed on it through the batch normalization layer, activation layer, and convolutional layer to obtain syntactic features. The obtained syntactic features are then input into the next batch normalization layer, activation layer, and convolutional layer to be processed in turn, and so on, until the final syntactic features of the text vector are obtained.
  • The vector input to the batch normalization layer can be all the vectors after the residual processing of the previous convolutional layer part, such as the whole text vector or all the syntactic features, or it can be the vector of a single word, i.e. the word vector of each word in the text vector or the syntactic features corresponding to each word. Either way, the syntactic features of the finally obtained text vector include the syntactic features of each word.
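  • The per-block processing described above (batch normalization, activation, convolution, residual addition) can be sketched as follows. The 1x1-convolution stand-in and the random weights are illustrative assumptions; the point is the order of operations and the residual addition of the block input.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 6, 8                            # words and feature dimension (assumed)
x = rng.standard_normal((N, D))        # input vectors for the block
W = rng.standard_normal((D, D)) * 0.1  # 1x1 "convolution" weight stand-in

def block(x, W, eps=1e-5):
    # batch normalization over the word axis
    norm = (x - x.mean(0)) / np.sqrt(x.var(0) + eps)
    h = np.maximum(norm, 0)  # activation layer (ReLU)
    h = h @ W                # convolutional feature extraction
    return x + h             # residual processing: add back the block input

out = x
for _ in range(4):           # stacked blocks, as in feature extraction part B
    out = block(out, W)
print(out.shape)  # (6, 8): one syntactic feature vector per word
```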
  • It includes: performing a batch normalization operation on the syntactic features (optionally through the batch normalization layer of the sentence boundary detection part) to generate normalized syntactic features; performing feature extraction on the normalized syntactic features through the convolutional layer; and, through the output layer, determining the label of each word in the text data according to the feature extraction result and obtaining at least one text clause contained in the text data according to the labels of the words.
  • In this way, the sequence labeling of the text data is realized. For example, B indicates that the corresponding word is at the beginning of a text segment (i.e. B is the start label), E indicates that the corresponding word is at the end of a text segment (i.e. E is the end label), and I indicates that the corresponding word is in the middle of a text segment (i.e. I is the middle label).
  • the BIE probability distribution on each word in the text data can be obtained through the sentence boundary detection part.
  • For each word, the label corresponding to the maximum value of its BIE probability distribution is selected. For example, in the "Why are you going home so late? Please turn on the bedroom light" example, the label of the word "ah" (the particle ending the first clause) should be the E label. From the label of each word, the sequence labeling of the entire text data is obtained; then, according to the end labels in the sequence labeling, the sentence boundary of each text clause, and hence the range of each clause, is obtained.
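  • Decoding the boundaries from the softmax output can be sketched as an argmax over each word's B/I/E distribution; the probability rows below are hand-written stand-ins for real model output.

```python
import numpy as np

labels = ["B", "I", "E"]
# Illustrative per-word B/I/E probability rows (stand-ins for softmax output).
probs = np.array([
    [0.8, 0.1, 0.1],   # word 0 -> B
    [0.1, 0.8, 0.1],   # word 1 -> I
    [0.1, 0.1, 0.8],   # word 2 -> E  (first clause ends here)
    [0.7, 0.2, 0.1],   # word 3 -> B
    [0.1, 0.2, 0.7],   # word 4 -> E  (second clause ends here)
])
tags = [labels[i] for i in probs.argmax(axis=1)]
ends = [i for i, t in enumerate(tags) if t == "E"]
print(tags, ends)  # ['B', 'I', 'E', 'B', 'E'] [2, 4]
```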
  • the domain information of each text clause is obtained according to the syntactic characteristics of the text vector and the information of each text clause.
  • It includes: performing a batch normalization operation on the syntactic features of the text vector (optionally through the batch normalization layer of the domain classification part) to generate normalized syntactic features; performing feature mapping on the normalized syntactic features through the convolutional layer to obtain the domain features of the text vector; pooling the domain features of the text vector according to the information of each text clause through the pooling layer; and obtaining the domain information of each text clause according to the pooling result through the output layer.
  • the syntactic feature can be mapped to the domain feature C through the batch normalization layer and the convolution layer of the domain classification part.
  • The obtained domain feature C can be an N*D two-dimensional matrix, where N is the number of words contained in the text data (17 in this example) and D is the dimension of the domain feature vector of each word. The two-dimensional domain feature matrix S is the set of the two-dimensional domain feature matrices of the text clauses, and is also an N*D matrix: S = (m1, m2), where each m is a W*D two-dimensional matrix, W is the number of words in the current text clause, and D is the feature vector dimension described above. For the text clause "Why are you going home so late?", the corresponding two-dimensional domain feature matrix m1 is a 10*D matrix; for the text clause "please turn on the bedroom light", the corresponding matrix m2 is a 7*D matrix.
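  • The pooling step can be sketched as slicing the N*D domain feature matrix C into m1 (10*D) and m2 (7*D) at the detected clause boundary, then max-pooling each clause over its words, which is the effect of the 1-D RoI pooling here. The random feature values and D=4 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 17, 4                         # 17 words in the example; D is assumed
C = rng.standard_normal((N, D))      # domain features: one row per word
spans = [(0, 10), (10, 17)]          # clause ranges from boundary detection

S = [C[a:b] for a, b in spans]       # S = (m1, m2): per-clause matrices
pooled = [m.max(axis=0) for m in S]  # max over words in each feature dimension
print([m.shape for m in S], [p.shape for p in pooled])
# [(10, 4), (7, 4)] [(4,), (4,)]
```

Each clause thus yields a fixed-length D-dimensional vector for the output layer, regardless of clause length.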
  • Based on the domain information of each text clause, the voice command can be determined. For example, when the domain information of a text clause is IOT (Internet of Things), the part of the voice input data corresponding to that clause can be considered a voice command.
  • In this embodiment, the text data converted from the voice input data and the corresponding text vector are first obtained; the syntactic features are then obtained by feature extraction on the text vector; according to the syntactic features, the text data is divided into text clauses and the domain information of each clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. In this way, the smart voice device is better suited to real use environments: the user does not need a wake-up word, and whether the voice input data contains pure voice commands or a mixture of voice commands and other voice data, the device can effectively divide it into clauses, recognize the voice commands contained therein, and subsequently be operated and controlled by them.
  • the voice recognition method of this embodiment can be executed by any appropriate smart voice device with data processing capabilities, such as various smart home appliances with corresponding functions.
  • Referring to FIG. 4, a structural block diagram of a speech recognition apparatus according to the third embodiment of the present invention is shown.
  • The speech recognition apparatus of this embodiment includes: a first acquisition module 302, configured to obtain text data corresponding to voice input data and a text vector corresponding to the text data; a second acquisition module 304, configured to obtain the syntactic features of the text vector; a third acquisition module 306, configured to obtain, according to the syntactic features, at least one text clause contained in the text data and the domain information of each text clause; and a recognition module 308, configured to recognize, at least according to the domain information of each text clause, the voice command in the voice input data.
  • The text data converted from the voice input data, and the text vector corresponding to that text data, are first obtained; the corresponding syntactic features are then obtained by feature extraction on the text vector; next, the text data corresponding to the voice input data is divided into text clauses according to the syntactic features, and the domain information of each text clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses.
  • This makes the smart voice device better suited to real usage environments: the user does not need to use a wake-up word to wake the device, and whether the voice input data contains pure voice commands or a mixture of voice commands and other speech, the device can effectively divide the voice input data into clauses and recognize the voice commands contained therein.
  • The smart voice device can then be operated and controlled through the recognized voice commands.
  • Referring to FIG. 5, there is shown a structural block diagram of a speech recognition device according to the fourth embodiment of the present invention.
  • The voice recognition device of this embodiment includes: a first obtaining module 402, configured to obtain text data corresponding to voice input data and a text vector corresponding to the text data; a second obtaining module 404, configured to obtain the syntactic features of the text vector; a third obtaining module 406, configured to obtain, according to the syntactic features, at least one text clause contained in the text data, and to obtain the domain information of each text clause; and a recognition module 408, configured to recognize the voice command in the voice input data at least according to the domain information of each text clause.
  • The first acquisition module 402 is configured to acquire the voice input data and generate the text data corresponding to the voice input data; to generate a word vector corresponding to each word in the text data; and to generate the text vector corresponding to the text data according to the word vectors of the individual words.
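The word-vector step above can be sketched as a simple embedding lookup. The embedding table, its dimensionality, and the zero-vector fallback for unseen words are illustrative assumptions; the patent does not prescribe a particular embedding method.

```python
# Minimal sketch of the first acquisition module's vectorization step.
# Hypothetical embedding table; real systems would use trained embeddings.
import random

EMBED_DIM = 8

def build_embeddings(vocab, dim=EMBED_DIM, seed=0):
    """Assign each word a fixed random vector (stand-in for a trained embedding)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}

def text_to_vectors(words, embeddings, dim=EMBED_DIM):
    """Map each word to its word vector; unseen words get a zero vector.
    The text vector is the sequence of word vectors (a len(words) x dim matrix)."""
    zero = [0.0] * dim
    return [embeddings.get(w, zero) for w in words]

words = ["please", "turn", "on", "the", "bedroom", "lights"]
emb = build_embeddings(set(words))
text_vector = text_to_vectors(words, emb)
```

The resulting `text_vector` is the per-word matrix that the feature extraction part consumes in the later modules.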
  • the second obtaining module 404 is configured to perform feature extraction on the text vector to obtain the syntactic feature of the text vector.
  • the second acquisition module 404 is configured to perform feature extraction on the word vector corresponding to each word in the text vector to acquire the syntactic feature of each word.
  • The third obtaining module 406 includes: a clause obtaining module 4062, configured to obtain a tag for each word according to the syntactic feature of that word, where the tags include an end tag; to obtain the sequence tags of the text data according to the tags of the individual words; and to obtain the at least one text clause contained in the text data according to the end tags in the sequence tags; and a domain obtaining module 4064, configured to obtain the domain information of each text clause according to the syntactic features.
  • The domain obtaining module 4064 includes: a domain feature module 40642, configured to obtain the domain feature corresponding to each text clause according to the syntactic features of the text vector; and a determining module 40644, configured to extract, from the domain features of each text clause, the maximum feature value in each feature dimension to generate the domain feature vector of that text clause, and to determine the domain information of the current text clause according to its domain feature vector.
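The max-over-dimensions step of the determining module 40644 can be illustrated as follows. The 3-dimensional features, the domain names, and the linear scoring weights are hypothetical stand-ins; the patent only specifies taking the maximum feature value in each dimension and deciding the domain from the resulting vector.

```python
def max_pool_domain_feature(clause_features):
    """Per feature dimension, keep the maximum value across the clause's words,
    yielding one fixed-length domain feature vector per clause (module 40644)."""
    dims = len(clause_features[0])
    return [max(word[d] for word in clause_features) for d in range(dims)]

def classify_domain(domain_vector, domain_weights):
    """Pick the domain whose (hypothetical) weight vector scores highest."""
    def score(w):
        return sum(a * b for a, b in zip(w, domain_vector))
    return max(domain_weights, key=lambda name: score(domain_weights[name]))

# Illustrative 3-dimensional features for a 2-word clause.
clause = [[0.1, 0.9, 0.2], [0.4, 0.3, 0.8]]
vec = max_pool_domain_feature(clause)   # max per dimension -> [0.4, 0.9, 0.8]
weights = {"smart_home": [1.0, 0.0, 1.0], "chitchat": [0.0, 1.0, 0.0]}
domain = classify_domain(vec, weights)  # "smart_home"
```

Max pooling makes the clause representation independent of clause length, which is why a single fixed-size classifier can handle clauses of any length.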
  • The domain feature module 40642 is configured to obtain the domain features of the text vector according to its syntactic features, and then, according to the information of the words contained in each text clause, to obtain the domain features corresponding to each text clause from the domain features of the text vector.
  • The second acquisition module 404 is configured to perform feature extraction on the text vector through the feature extraction part of a convolutional neural network model to obtain the syntactic features of the text vector; the third acquisition module 406 is configured to obtain, through the sentence boundary detection part of the convolutional neural network model, at least one text clause contained in the text data according to the syntactic features, and to obtain, through the domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic features and the information of each text clause; wherein the sentence boundary detection part and the domain classification part share the syntactic features extracted by the feature extraction part.
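A minimal sketch of how the two parts can share one feature-extraction pass. All three functions below are stand-ins invented for illustration, not the patent's trained networks; the point is only the dataflow: the encoder runs once, and both the sentence boundary part and the domain part read its output.

```python
def encode(text_vector):
    """Stand-in feature extraction part: a pass-through tagged as encoded."""
    return {"features": text_vector, "encoded": True}

def boundary_head(shared):
    """Stand-in sentence boundary detection part: tags the last word 'E'(nd)."""
    n = len(shared["features"])
    return ["O"] * (n - 1) + ["E"]

def domain_head(shared, clause_spans):
    """Stand-in domain classification part: one label per clause span."""
    return ["unknown" for _ in clause_spans]

text_vector = [[0.1], [0.2], [0.3]]
shared = encode(text_vector)             # extracted once...
tags = boundary_head(shared)             # ...consumed by the boundary part
domains = domain_head(shared, [(0, 2)])  # ...and by the domain part
```

Sharing the encoder means the most expensive computation is amortized across both tasks, which is the stated benefit of the shared syntactic features.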
  • The second acquisition module 404 is configured to perform a batch normalization operation on the input vector to generate a normalized vector; perform non-linear processing on the normalized vector; perform feature extraction on the non-linearly processed vector through a convolutional layer to obtain initial features; perform residual analysis processing on the initial features, and obtain and output the syntactic features of the vector according to the residual analysis result; and return to the step of performing the batch normalization operation on the input vector, continuing until the syntactic features of the text vector are obtained.
  • the feature extraction part includes at least 12 convolutional layers, and the normalized vector is non-linearized through a linear gate function.
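Under the stated assumptions, one layer of this feature extraction part can be sketched in plain Python as batch normalization, a gated linear unit (one reading of the patent's "linear gate function"), a kernel-size-3 convolution, and a residual connection. The weights are random placeholders, and only 2 layers are stacked for brevity versus the at least 12 convolutional layers described above.

```python
import math, random

def batch_norm(seq, eps=1e-5):
    """Normalize each feature dimension to zero mean / unit variance across positions."""
    dims = len(seq[0])
    out = [[0.0] * dims for _ in seq]
    for d in range(dims):
        col = [v[d] for v in seq]
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        for t, x in enumerate(col):
            out[t][d] = (x - mean) / math.sqrt(var + eps)
    return out

def glu(seq):
    """Gated linear unit: split each vector in half, gate one half with the other."""
    half = len(seq[0]) // 2
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [[v[i] * sig(v[half + i]) for i in range(half)] for v in seq]

def conv1d(seq, weights):
    """Kernel-size-3 'same' convolution; weights[j][k][i] maps in-dim i -> out-dim j."""
    out_dim, in_dim = len(weights), len(seq[0])
    padded = [[0.0] * in_dim] + seq + [[0.0] * in_dim]
    out = []
    for t in range(len(seq)):
        window = padded[t:t + 3]
        out.append([sum(weights[j][k][i] * window[k][i]
                        for k in range(3) for i in range(in_dim))
                    for j in range(out_dim)])
    return out

def feature_extraction(text_vector, n_layers=2, seed=0):
    """Stack of BN -> GLU -> conv -> residual layers (the patent uses >= 12)."""
    rng = random.Random(seed)
    dim = len(text_vector[0])
    x = text_vector
    for _ in range(n_layers):
        w = [[[rng.uniform(-0.1, 0.1) for _ in range(dim // 2)]
              for _ in range(3)] for _ in range(dim)]
        h = conv1d(glu(batch_norm(x)), w)
        x = [[a + b for a, b in zip(xi, hi)] for xi, hi in zip(x, h)]  # residual
    return x

feats = feature_extraction([[0.1, 0.2, 0.3, 0.4],
                            [0.5, 0.6, 0.7, 0.8],
                            [0.9, 1.0, 1.1, 1.2]])
```

The residual connection keeps the output dimensionality equal to the input's, which is what allows the layer to be repeated until the final syntactic features emerge.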
  • The third acquiring module 406 acquires at least one text clause contained in the text data according to the syntactic features through the sentence boundary detection part of the convolutional neural network model by: performing a batch normalization operation on the syntactic features to generate normalized syntactic features; performing feature extraction on the normalized syntactic features through a convolutional layer; and determining, through the output layer, the tag of each word in the text data according to the feature extraction result, acquiring at least one text clause contained in the text data according to the tags of the individual words.
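The end-tag-based clause splitting can be illustrated with a toy tag inventory. The "O"/"E" labels below are an assumption made for illustration; the patent only specifies that the per-word tags include an end tag marking clause boundaries.

```python
def split_clauses(words, tags):
    """Split the word sequence wherever the sequence label carries an end tag 'E'
    (the end tag in the patent marks a clause boundary)."""
    clauses, current = [], []
    for word, tag in zip(words, tags):
        current.append(word)
        if tag == "E":
            clauses.append(current)
            current = []
    if current:  # trailing words without an explicit end tag
        clauses.append(current)
    return clauses

words = ["why", "so", "late", "please", "turn", "on", "the", "lights"]
tags  = ["O",  "O",  "E",    "O",      "O",    "O",  "O",   "E"]
clauses = split_clauses(words, tags)
# clauses == [["why", "so", "late"], ["please", "turn", "on", "the", "lights"]]
```

Because the boundary model emits one tag per word, the splitter needs no punctuation or pauses in the input: the tags alone determine where one clause ends and the next begins.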
  • The third obtaining module 406 obtains the domain information of each text clause in the domain classification part of the convolutional neural network model, according to the syntactic features and the information of each text clause, by: performing a batch normalization operation on the syntactic features to generate normalized syntactic features; performing feature mapping on the normalized syntactic features through a convolutional layer to obtain the domain features of the text vector; pooling the domain features of the text vector through a pooling layer according to the information of each text clause; and obtaining the domain information of each text clause through the output layer according to the pooling result.
  • the voice recognition device in this embodiment is used to implement the corresponding voice recognition methods in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • the function implementation of each module in the speech recognition device of this embodiment can refer to the description of the corresponding part in the foregoing method embodiment, and it will not be repeated here.
  • Referring to FIG. 6, there is shown a schematic structural diagram of a smart device according to the fifth embodiment of the present invention.
  • the specific embodiment of the present invention does not limit the specific implementation of the smart device.
  • the smart device may include: a processor (processor) 502, a communication interface (Communications Interface) 504, a memory (memory) 506, and a communication bus 508.
  • the processor 502, the communication interface 504, and the memory 506 communicate with each other through the communication bus 508.
  • the communication interface 504 is used to communicate with other electronic devices such as other smart devices or servers.
  • the processor 502 is configured to execute the program 510, and specifically can execute relevant steps in the above-mentioned voice recognition method embodiment.
  • the program 510 may include program code, and the program code includes computer operation instructions.
  • The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
  • the one or more processors included in the smart device may be the same type of processor, such as one or more CPUs, or different types of processors, such as one or more CPUs and one or more ASICs.
  • the memory 506 is used to store the program 510.
  • The memory 506 may include high-speed RAM, and may also include non-volatile memory, for example, at least one disk memory.
  • The program 510 can specifically be used to cause the processor 502 to perform the following operations: obtain the text data corresponding to the voice input data and the text vector corresponding to the text data; obtain the syntactic features of the text vector; obtain, according to the syntactic features, at least one text clause contained in the text data, and obtain the domain information of each text clause; and recognize the voice command in the voice input data at least according to the domain information of each text clause.
  • The program 510 is further configured to enable the processor 502, when obtaining the text data corresponding to the voice input data and the text vector corresponding to the text data, to: obtain the voice input data and generate the text data corresponding to the voice input data; generate a word vector corresponding to each word in the text data; and generate the text vector corresponding to the text data according to the word vectors of the individual words.
  • the program 510 is further configured to enable the processor 502 to perform feature extraction on the text vector when obtaining the syntactic feature of the text vector to obtain the syntactic feature of the text vector.
  • The program 510 is further configured to enable the processor 502, when performing feature extraction on the text vector to obtain its syntactic features, to perform feature extraction on the word vector corresponding to each word in the text vector to obtain the syntactic feature of each word.
  • The program 510 is further configured to cause the processor 502, when obtaining at least one text clause contained in the text data according to the syntactic features, to: obtain the tag of each word according to the syntactic feature of that word, where the tags include an end tag; obtain the sequence tags of the text data according to the tags of the individual words; and obtain at least one text clause contained in the text data according to the end tags in the sequence tags.
  • The program 510 is further configured to enable the processor 502, when obtaining the domain information of each text clause, to: obtain the domain features corresponding to each text clause according to the syntactic features of the text vector; extract, from the domain features of each text clause, the maximum feature value in each feature dimension to generate the domain feature vector of that text clause; and determine the domain information of the current text clause according to its domain feature vector.
  • The program 510 is further configured to cause the processor 502, when obtaining the domain feature corresponding to each text clause according to the syntactic features of the text vector, to: obtain the domain features of the text vector according to its syntactic features; and obtain the domain features corresponding to each text clause from the domain features of the text vector according to the information of the words contained in each text clause.
  • The program 510 is further configured to enable the processor 502 to perform feature extraction on the text vector through the feature extraction part of a convolutional neural network model to obtain the syntactic features of the text vector; to obtain, through the sentence boundary detection part of the model, at least one text clause contained in the text data according to the syntactic features; and to obtain, through the domain classification part of the model, the domain information of each text clause according to the syntactic features and the information of each text clause; wherein the sentence boundary detection part and the domain classification part share the syntactic features extracted by the feature extraction part.
  • The program 510 is further configured to enable the processor 502, when performing feature extraction on the text vector through the feature extraction part of the convolutional neural network model to obtain its syntactic features, to: perform a batch normalization operation on the input vector to generate a normalized vector; perform non-linear processing on the normalized vector; perform feature extraction on the non-linearly processed vector through a convolutional layer to obtain initial features; perform residual analysis processing on the initial features, and obtain and output the syntactic features of the vector according to the residual analysis result; and return to the step of performing the batch normalization operation on the input vector until the syntactic features of the text vector are obtained.
  • the feature extraction part includes at least 12 convolutional layers; the normalized vector is non-linearized through a linear gate function.
  • The program 510 is further configured to enable the processor 502, when obtaining at least one text clause contained in the text data according to the syntactic features in the sentence boundary detection part of the convolutional neural network model, to: perform a batch normalization operation on the syntactic features to generate normalized syntactic features; perform feature extraction on the normalized syntactic features through a convolutional layer; and determine, through the output layer, the tag of each word in the text data according to the feature extraction result, obtaining at least one text clause contained in the text data according to the tags of the individual words.
  • The program 510 is further configured to cause the processor 502, when obtaining the domain information of each text clause according to the syntactic features and the information of each text clause in the domain classification part of the convolutional neural network model, to: perform a batch normalization operation on the syntactic features to generate normalized syntactic features; perform feature mapping on the normalized syntactic features through a convolutional layer to obtain the domain features of the text vector; pool the domain features of the text vector through the pooling layer according to the information of each text clause; and obtain the domain information of each text clause through the output layer according to the pooling result.
  • The smart device of this embodiment may further include a microphone to receive the analog voice signal input by the user and convert it into a digital voice signal, that is, the voice input data; the program 510 may also be used to cause the processor 502 to convert the voice input data into the corresponding text data. However, the embodiment is not limited to this.
  • the microphone can also be set independently of the smart device, and connected to the smart device through an appropriate connection mode, and send the voice input data to the processor.
  • The text data converted from the voice input data, and the text vector corresponding to that text data, are first obtained; the corresponding syntactic features are then obtained by feature extraction on the text vector; next, the text data corresponding to the voice input data is divided into text clauses according to the syntactic features, and the domain information of each text clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses.
  • This makes the smart voice device better suited to real usage environments: the user does not need to use a wake-up word to wake the device, and whether the voice input data contains pure voice commands or a mixture of voice commands and other speech, the device can effectively divide the voice input data into clauses and recognize the voice commands contained therein.
  • The smart voice device can then be operated and controlled through the recognized voice commands.
  • each component/step described in the embodiment of the present invention can be split into more components/steps, or two or more components/steps or partial operations of components/steps can be combined into New components/steps to achieve the purpose of the embodiments of the present invention.
  • The above methods according to the embodiments of the present invention can be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the methods described here can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA).
  • A computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the voice recognition method described here is implemented.
  • When a general-purpose computer accesses code for implementing the voice recognition method shown here, execution of that code converts the general-purpose computer into a dedicated computer for executing the voice recognition method shown here.

Abstract

A speech recognition method and apparatus. The method comprises: obtaining text data corresponding to speech input data and a text vector corresponding to the text data (S102); obtaining syntactic features of the text vector (S104); obtaining, according to the syntactic features, at least one text clause comprised in the text data, and obtaining domain information of each text clause (S106); and recognizing a speech instruction in the speech input data at least according to the domain information of each text clause (S108). According to the method, the user's operating burden is reduced, and the degree to which an intelligent speech device can intelligently process user speech instructions is improved.

Description

Speech recognition method and apparatus
This application claims priority to Chinese patent application No. 201910047340.2, entitled "Speech Recognition Method and Apparatus", filed on January 18, 2019, the entire content of which is incorporated into this application by reference.
Technical Field
The embodiments of the present invention relate to the field of computer technology, and in particular to a speech recognition method and apparatus.
Background
Intelligent devices are the product of combining traditional electrical equipment with computer technology, data processing technology, control technology, sensor technology, network communication technology, power electronics technology, and so on. Among the various intelligent devices, smart voice devices are an important branch.
Through smart voice devices, users can control various smart devices by voice alone, including the smart voice device itself and the other smart devices it controls. At present, in the interaction between the user and a smart voice device, every control of the device requires a wake-up word, immediately followed by a voice command to carry out the user's intention. For example: "Tmall Genie, turn on the light", "Tmall Genie, play music", and so on. In such interactions, the user must use the wake-up word "Tmall Genie" every time to wake the smart voice device before the corresponding operation and control can be performed. In the utterance "Why are you going home so late? Please turn on the bedroom lights", "Why are you going home so late?" is an interaction between users, while "Please turn on the bedroom lights" is a control instruction for the smart voice device. Current smart voice devices cannot process such complex mixed instructions without a wake-up word.
However, with this way of waking the smart voice device by a wake-up word, on the one hand, every user instruction must include the wake-up word, which increases the user's operational burden and lowers the degree to which the smart voice device can intelligently process the user's voice commands; on the other hand, the smart voice device has to process the wake-up word repeatedly, which also increases its processing burden.
Summary of the Invention
In view of this, the embodiments of the present invention provide a speech recognition solution to solve the above problems.
According to a first aspect of the embodiments of the present invention, there is provided a speech recognition method, including: obtaining text data corresponding to voice input data and a text vector corresponding to the text data; obtaining the syntactic features of the text vector; obtaining, according to the syntactic features, at least one text clause contained in the text data, and obtaining the domain information of each text clause; and recognizing the voice command in the voice input data at least according to the domain information of each text clause.
According to a second aspect of the embodiments of the present invention, there is provided a speech recognition apparatus, including: a first obtaining module, configured to obtain text data corresponding to voice input data and a text vector corresponding to the text data; a second obtaining module, configured to obtain the syntactic features of the text vector; a third obtaining module, configured to obtain, according to the syntactic features, at least one text clause contained in the text data, and to obtain the domain information of each text clause; and a recognition module, configured to recognize the voice command in the voice input data at least according to the domain information of each text clause.
According to a third aspect of the embodiments of the present invention, there is provided a smart device, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the speech recognition method described in the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer storage medium on which a computer program is stored; when the program is executed by a processor, the speech recognition method described in the first aspect is implemented.
According to the speech recognition solution provided by the embodiments of the present invention, the text data converted from the voice input data and the text vector corresponding to the text data are first obtained; the corresponding syntactic features are then obtained by feature extraction on the text vector; next, the text data corresponding to the voice input data is divided into text clauses according to the syntactic features, and the domain information of the text clauses is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. Through the solution of the embodiments of the present invention, the smart voice device is better suited to real usage environments: the user no longer needs a wake-up word to wake the device, and whether the voice input data contains pure voice commands or a mixture of voice commands and other speech, the device can effectively divide the voice input data into clauses and recognize the voice commands contained therein, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the degree to which the smart voice device can intelligently process the user's voice commands is improved; moreover, the smart voice device no longer needs to process wake-up words, which reduces its data processing burden.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description cover only some of the embodiments described herein; for those of ordinary skill in the art, other drawings can also be obtained from these drawings.
Fig. 1 is a flowchart of the steps of a speech recognition method according to the first embodiment of the present invention;
Fig. 2 is a flowchart of the steps of a speech recognition method according to the second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a neural network model in the embodiment shown in Fig. 2;
Fig. 4 is a structural block diagram of a speech recognition apparatus according to the third embodiment of the present invention;
Fig. 5 is a structural block diagram of a speech recognition apparatus according to the fourth embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a smart device according to the fifth embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art shall fall within the protection scope of the embodiments of the present invention.
The specific implementation of the embodiments of the present invention is further described below in conjunction with the accompanying drawings.
Embodiment 1
Referring to Fig. 1, there is shown a flowchart of the steps of a speech recognition method according to the first embodiment of the present invention.
The speech recognition method of this embodiment includes the following steps:
Step S102: Acquire text data corresponding to the voice input data and a text vector corresponding to the text data.
In the usage scenario of a smart voice device, the user can operate and control the device by voice; the smart voice device takes the user's speech as input to generate corresponding voice input data, converts the voice input data into corresponding text data, and then performs the corresponding processing. In this embodiment, in addition to converting the voice input data into text data, the text vector corresponding to the text data is also obtained, so as to represent the text data in vector form and facilitate subsequent processing. The specific implementations of converting the voice input data into the corresponding text data and of obtaining the text vector corresponding to the text data can both be realized by those skilled in the art in any appropriate manner according to actual needs, which is not limited by the embodiments of the present invention.
For example, a convolutional neural network model, a BP neural network model, a hidden Markov model (HMM), multi-band spectral subtraction, or the like can be used to convert voice input data into text data; and, for example, the text vector corresponding to the text data can be obtained in a deep-learning-based way (such as word2vec), a graph-based way (such as TextRank), a topic-model-based way (such as LDA), or a statistics-based way (such as bag of words).
Step S104: Acquire syntactic features of the text vector corresponding to the text data.
In the embodiments of the present invention, the syntactic features of the text vector can characterize the dependency relations and semantic information among the words of the text data to which the text vector corresponds, and the syntactic features may be expressed as syntactic feature vectors. In a specific implementation, feature extraction may be performed on the text vector by a convolutional neural network (CNN) model or a recurrent neural network (RNN) model to obtain the syntactic features of the text vector. The invention is not limited thereto; in practical applications, those skilled in the art may also obtain the syntactic features of the text vector in other appropriate ways, such as text classification.
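A 1-D convolution over the word-vector sequence is one way such per-word features can be produced, since each output position mixes information from neighbouring words. The following pure-Python sketch uses toy one-dimensional "word vectors" and an illustrative fixed kernel rather than the embodiment's trained weights:

```python
def conv1d_same(seq, kernel):
    """1-D convolution with zero padding so the output has one value per input position."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(seq))]

# Toy per-word values and a 3-tap kernel (illustrative numbers only)
word_values = [1.0, 2.0, 3.0, 4.0]
features = conv1d_same(word_values, [0.5, 1.0, 0.5])
# One feature value per word, each combining the word with its neighbours
```

A real CNN would apply many such kernels to D-dimensional word vectors and learn the kernel weights during training; this sketch only shows the sliding-window structure.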
Step S106: Acquire, according to the syntactic features, at least one text clause contained in the text data, and acquire domain information of each text clause.
In the embodiments of the present invention, the text data corresponding to the speech input data contains one or more text clauses. When a single text clause is contained, that clause may be a text clause corresponding to a voice command, or a text clause corresponding to other speech data. When multiple text clauses are contained, they may all be text clauses corresponding to voice commands; they may all be text clauses corresponding to other speech data, such as sentences spoken by the user that are unrelated to voice commands; or they may be a mixture of text clauses corresponding to voice commands and text clauses corresponding to other speech data. For example, in a complex multi-person scenario, user A, while talking with user B, may issue a voice command to the smart voice device, such as "Why are you coming home so late? Please turn on the bedroom light". Here the first half, "Why are you coming home so late?", would be recognized as a text clause corresponding to other speech data, while the second half, "Please turn on the bedroom light", would be recognized as the text clause corresponding to a voice command.
In practical applications, after the syntactic features corresponding to the text vector have been obtained, the one or more text clauses in the text data can be determined from those syntactic features. The manner of obtaining the text clauses may match the manner of obtaining the syntactic features; for example, when a CNN or RNN model is used to obtain the syntactic features of the text vector, sequence labeling may be performed on the text data according to the syntactic features, and the one or more text clauses obtained from the result of the sequence labeling.
In addition, in the embodiments of the present invention, the domain information of each text clause is also obtained according to the syntactic features of the text vector, for example by a machine learning algorithm or a neural network model that derives the domain information of the corresponding text clause from the syntactic features of the text vector, where the domain information includes information on the domain to which a voice command corresponds.
It should be noted that in the embodiments of the present invention, unless otherwise specified, quantities such as "multiple" and "a plurality of" mean two or more.
Step S108: Recognize a voice command in the speech input data at least according to the domain information of each text clause.
When the speech input data includes a voice command, among the one or more text clauses contained in the corresponding text data there should be a text clause whose domain information indicates that the portion of the speech input data corresponding to that clause is a voice command; on this basis, the voice command can be recognized from the speech input data.
For example, for the speech input data "Why are you coming home so late? Please turn on the bedroom light", the above processing of the corresponding text data and of the text vector corresponding to that text data determines that the text clause "Please turn on the bedroom light" is a voice command.
With this embodiment, the text data converted from the speech input data and the text vector corresponding to that text data are obtained first; the corresponding syntactic features are then obtained by feature extraction on the text vector; next, according to the syntactic features, the text data corresponding to the speech input data is divided into text clauses and the domain information of each text clause is determined; and the voice command in the speech input data is then recognized according to the domain information of the text clauses. It can be seen that this embodiment makes the smart voice device better suited to real usage environments: the user no longer needs a wake-up word to wake the device. Whether the user's speech input data consists purely of voice commands or mixes voice commands with other speech data, the speech input data can be effectively divided into clauses and the voice commands contained therein recognized, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the device handles the user's voice commands more intelligently; moreover, the smart voice device no longer needs to process a wake-up word, which lightens its data-processing burden.
The speech recognition method of this embodiment can be executed by any appropriate smart voice device with data-processing capability, such as various smart home appliances with corresponding functions.
Embodiment 2
Referring to Fig. 2, a flowchart of the steps of a speech recognition method according to Embodiment 2 of the present invention is shown.
The speech recognition method of this embodiment includes the following steps:
Step S202: Acquire text data corresponding to the speech input data, and a text vector corresponding to the text data.
In this embodiment, the text vector corresponding to the text data includes a word vector for each word in the text data. The specific meaning of "word" may differ with the language used: for text data in languages such as Chinese, Japanese, or Korean, a word unit may be a single character or a multi-character word, whereas for text data in languages such as English or French, a word unit is usually a complete word.
On this basis, in a feasible manner, this step may be implemented as: acquiring speech input data and generating the text data corresponding to the speech input data; generating a word vector for each word in the text data; and generating the text vector corresponding to the text data from the word vectors of the individual words. The specific implementations of generating the corresponding text data from the speech input data, and of generating the word vector for each word in the text data, may be chosen by those skilled in the art in any appropriate manner according to actual needs; the embodiments of the present invention impose no limitation in this regard. Representing the text vector of the text data by the word vectors of its individual words both facilitates processing of the text data and effectively avoids excessive information loss caused by the vectorization.
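One common realization of "a word vector for each word" is a lookup table from words to pre-trained vectors. The following sketch assumes a hypothetical 2-dimensional embedding table purely for illustration; the table contents, the handling of out-of-vocabulary words, and the dimension are all assumptions:

```python
def text_vector(words, embeddings, dim):
    """Build the text vector as a list of per-word vectors.
    Unknown words map to a zero vector of the same dimension."""
    zero = [0.0] * dim
    return [embeddings.get(w, zero) for w in words]

# Hypothetical 2-dimensional embedding table
emb = {"turn": [0.1, 0.9], "on": [0.4, 0.2], "light": [0.8, 0.3]}
tv = text_vector(["turn", "on", "the", "light"], emb, 2)
# N words -> N D-dimensional vectors; "the" is out of vocabulary here
```

Keeping one vector per word, rather than collapsing the whole sentence into a single vector, is what preserves the per-word information that the later per-word labeling step depends on.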
Step S204: Acquire syntactic features of the text vector corresponding to the text data.
As described in Embodiment 1, the syntactic features of the text vector can be obtained in many ways. This embodiment adopts feature extraction; that is, feature extraction is performed on the text vector corresponding to the text data to obtain the syntactic features of the text vector.
Where the text vector includes a word vector for each word, this step may be implemented as: performing feature extraction on the word vector of each word in the text vector to obtain the syntactic features of each word. Syntactic features extracted in this way can more effectively characterize the properties of the word to which each word vector corresponds.
Step S206: Acquire, according to the syntactic features of the text vector, at least one text clause contained in the text data, and acquire the domain information of each text clause.
When the at least one text clause contained in the text data is obtained according to the syntactic features of the text vector, based on the previously obtained syntactic features of each word, a label may be obtained for each word according to its syntactic features, the labels including an end label; the sequence labeling of the text data is obtained from the labels of the individual words; and the at least one text clause contained in the text data is obtained from the end labels in the sequence labeling. That is, the problem of dividing the text into clauses can be converted into a sequence-labeling problem over the text data. The label types may be set appropriately by those skilled in the art according to actual needs, but include at least an end label. If a word is marked with an end label, then all words from the beginning of the text data up to that word form one text clause, or all words from the first word after the previous end label up to the word bearing the current end label form one text clause.
Optionally, the labels may include a B label (a begin label, indicating that the current word is the start of a sentence), an I label (an inside label, indicating that the current word lies between the start and the end of a sentence), and an E label (an end label, indicating that the current word is the end of a sentence). If the current text data contains multiple E labels, it contains multiple clauses, and the text clauses can be divided according to the E labels; if the current text data contains only one E label, it contains only one text clause, namely the current text data itself.
Labeling the words to form a sequence labeling of the text data, and then obtaining the text clauses according to the end labels in the sequence labeling, makes the division into text clauses more accurate; compared with other ways of dividing text clauses, it also simplifies the operational steps of the division and reduces the cost of implementing it.
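The clause division by end labels described above can be sketched directly: walk the label sequence and close a clause at every E label. The tokens and tags below are illustrative placeholders following the optional B/I/E scheme:

```python
def split_by_end_tags(tokens, tags):
    """Split a token sequence into clauses, closing a clause at every E (end) label."""
    clauses, current = [], []
    for token, tag in zip(tokens, tags):
        current.append(token)
        if tag == "E":
            clauses.append(current)
            current = []
    if current:            # trailing tokens without a final E label
        clauses.append(current)
    return clauses

tokens = list("ABCDEFG")
tags = ["B", "I", "E", "B", "I", "I", "E"]
clauses = split_by_end_tags(tokens, tags)
# -> [["A", "B", "C"], ["D", "E", "F", "G"]]
```

Note that only the E labels actually drive the split here, which matches the statement that the label set must include at least an end label.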
When the domain information of each text clause is obtained according to the syntactic features of the text vector, the domain features corresponding to each text clause may be obtained from the syntactic features of the text vector; for the domain features of each text clause, the maximum feature value is extracted in each feature dimension to generate a domain feature vector for that clause; and the domain information of the current text clause is determined from its domain feature vector. In this way, the most effective feature expression of each text clause is obtained, and the feature expressions of all text clauses have the same vector length, which facilitates subsequent processing.
In a feasible manner, obtaining the domain features corresponding to each text clause according to the syntactic features of the text vector may include: obtaining the domain features of the text vector according to its syntactic features; and, according to the information on the words contained in each text clause, obtaining from the domain features of the text vector the domain features corresponding to each text clause. That is, the overall domain features are first obtained from the text vector corresponding to the entire text data, and the domain features of each text clause are then obtained from the overall domain features according to the information on the words in that clause. This both guarantees the consistency of each clause's domain features with the overall domain features and simplifies obtaining the clause-level domain features.
Step S208: Recognize a voice command in the speech input data at least according to the domain information of each text clause.
In the embodiments of the present invention, a text clause belonging to a voice command corresponds to set domain information; when the domain information of a text clause matches the set domain information, that text clause can be determined to be the text clause corresponding to a voice command. Optionally, other domain information may also be set; such other domain information may be unified domain information indicating that a text clause is not a voice command, or it may be further subdivided to indicate the specific domain of the text clause, such as an interaction domain, and so on.
Step S210: According to the recognized voice command, perform on the smart voice device the operation indicated by the voice command.
The operation may be any appropriate operation, e.g., instructing the smart voice device to turn a corresponding function on or off, such as turning on the air conditioner or turning off a light, or instructing the smart voice device to perform a query, such as finding and playing a certain song or querying and reporting the weather of a certain place; the embodiments of the present invention impose no limitation on the specific operation indicated by a voice command.
As mentioned above, the speech recognition solution provided by the embodiments of the present invention can be implemented in many appropriate ways; in one feasible way, part or all of the solution can be implemented by a neural network model. In the following, the above process of this embodiment is described taking a convolutional neural network (CNN) model as an example.
The structure of one such CNN model is shown in Fig. 3; it includes an input part A, a feature extraction part B, a sentence boundary detection part C, and a domain classification part D.
Specifically:
The input part A may be the input layer of the CNN and is used to receive the input text vector, such as the text vector of the text data corresponding to the speech input data.
Multiple convolutional layers are provided in the feature extraction part B; in this embodiment, at least 12 convolutional layers are set to improve the accuracy of feature extraction. Optionally, batch normalization layers, activation layers, and convolutional layers may be arranged in the feature extraction part B, and residual processing may also be applied to the convolutional layers. Setting batch normalization layers optimizes the data-processing speed of the CNN model. The activation layer may perform its nonlinear transformation with a gated linear function, which improves the nonlinear feature transformation of the text vector; of course, other activation functions are equally applicable. By applying residual processing to a convolutional layer, the original text vector corresponding to the text data is combined with the syntactic features output by the current convolutional layer before being output, which optimizes gradient back-propagation and improves the feature extraction.
Optionally, the sentence boundary detection part C may include, in order, a batch normalization layer, a convolutional layer, and an output layer, where the output layer uses the Softmax function as its loss function. Through the sentence boundary detection part C, the label of the word vector of each word in the text vector can be obtained, and thus the sequence labeling of the entire text data; the division into text clauses can then be determined from the end labels (e.g., the E labels) in the sequence labeling.
Optionally, the domain classification part D may include, in order, a batch normalization layer, a convolutional layer, a pooling layer, and an output layer, where the pooling layer uses one-dimensional region-of-interest pooling (1-D RoI pooling) and the output layer uses the Softmax function as its loss function. According to the result of the division into text clauses and the domain information of each text clause, the domain classification part D can identify the text clause corresponding to a voice command.
It should be noted that, as shown in Fig. 3, the sentence boundary detection part C and the domain classification part D of the CNN model in this embodiment share the syntactic features extracted by the feature extraction part B, which improves the data-processing efficiency of the CNN model and saves implementation cost.
Based on the CNN model shown in Fig. 3, taking a smart speaker as the smart voice device and "Why are you coming home so late? Please turn on the bedroom light" as the speech input data, the corresponding speech recognition process includes:
(1) Convert the speech input data into text data, and obtain the text vector corresponding to the text data.
This part covers the conversion and processing of the data before the text vector is input into the CNN model. Taking the user's utterance "Why are you coming home so late? Please turn on the bedroom light" as an example, in this part the utterance is converted into text data, and each character in it is converted into a D-dimensional vector, where the specific value of D can be set appropriately by those skilled in the art according to actual needs.
In this way, N D-dimensional vectors can be generated, where N is the number of word units; in this example the text contains 17 characters, so N is 17. These N D-dimensional vectors constitute the text vector corresponding to the text data.
(2) Receive the text vector corresponding to the text data through the input part of the CNN model.
For example, the N D-dimensional vectors generated above are received through the input layer of the CNN model.
(3) Perform feature extraction on the text vector through the feature extraction part of the CNN model to obtain the syntactic features of the text vector.
This includes: performing a batch normalization operation on the input vectors to generate normalized vectors; applying a nonlinear transformation to the normalized vectors; performing feature extraction on the nonlinearly transformed vectors through a convolutional layer to obtain initial features; performing residual processing on the initial features and, from the result of the residual processing, obtaining and outputting the syntactic features of the vectors; and returning to the batch normalization step and continuing until the syntactic features of the text vector are obtained. Optionally, when a batch normalization layer is provided, the batch normalization operation on the input vectors is performed by that layer to generate the normalized vectors. The vectors input to the batch normalization layer of the first convolutional-layer section are the text vector corresponding to the text data; the vectors input to the batch normalization layer of any subsequent convolutional-layer section are the vectors output by the preceding convolutional-layer section. Also optionally, when an activation layer is provided, the nonlinear transformation of the normalized vectors is performed by the activation layer.
That is, the text vector is first fed into the first batch normalization layer and passed in turn through the batch normalization layer, activation layer, and convolutional layer for batch normalization, nonlinear transformation, feature extraction, and residual processing, yielding syntactic features. These syntactic features are then fed into the next adjacent batch normalization layer, activation layer, convolutional layer, and so on, processed in turn to obtain new syntactic features, which are in turn fed into the next batch normalization, activation, and convolutional layers, and so forth, until the final syntactic features of the text vector are obtained.
It should be noted that the vectors input to a batch normalization layer may be all of the vectors output by the preceding convolutional layer after residual processing, such as the entire text vector or all syntactic features, or may be the per-word vectors after residual processing by the preceding convolutional layer, such as the word vector of each word in the text vector or the syntactic features corresponding to each word. Either way, the finally obtained syntactic features of the text vector include the syntactic features of each word.
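One pass through a feature-extraction section (batch normalization, activation, convolution, then adding the input back as the residual connection) can be sketched in pure Python on toy 1-D features. The ReLU activation, kernel values, and block count below are illustrative simplifications, not the embodiment's gated linear function or trained parameters:

```python
import math

def batch_norm(xs, eps=1e-5):
    """Normalize a feature sequence to zero mean and unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def relu(xs):
    return [max(0.0, x) for x in xs]

def conv1d_same(xs, kernel):
    pad = len(kernel) // 2
    padded = [0.0] * pad + xs + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(xs))]

def residual_block(xs, kernel):
    """BN -> activation -> convolution, then add the input back (residual connection)."""
    out = conv1d_same(relu(batch_norm(xs)), kernel)
    return [x + o for x, o in zip(xs, out)]

features = [1.0, 2.0, 3.0, 4.0]
for _ in range(3):        # stacking blocks, cf. the >= 12 convolutional layers above
    features = residual_block(features, [0.1, 0.2, 0.1])
```

The residual addition is what lets the original per-word information flow through the deep stack unchanged, matching the stated purpose of combining the original text vector with each layer's output.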
Specifically, for the example "Why are you coming home so late? Please turn on the bedroom light", this step yields the syntactic features corresponding to each character.
(4) Through the sentence boundary detection part of the CNN model, obtain, according to the syntactic features output by the feature extraction part, at least one text clause contained in the text data corresponding to the speech input data.
This includes: performing a batch normalization operation on the syntactic features (optionally via the batch normalization layer of the sentence boundary detection part) to generate normalized syntactic features; performing feature extraction on the normalized syntactic features through the convolutional layer; and, through the output layer, determining the label of each word in the text data from the feature extraction result and obtaining the at least one text clause contained in the text data from the labels of the words.
The sentence boundary detection part implements the sequence labeling of the text data. For example, B indicates that the corresponding character is at the start of a text segment (i.e., B is the begin label), E indicates that the corresponding character is at the end of a text segment (i.e., E is the end label), and I indicates that the corresponding character lies in the middle of a text segment (i.e., I is the inside label). From the syntactic features of each word, the sentence boundary detection part yields a B/I/E probability distribution for each word in the text data, and for each word the label corresponding to the maximum of its B/I/E probability distribution is taken. In the example "Why are you coming home so late? Please turn on the bedroom light", if for the character "啊" the B-label probability is 0.3, the I-label probability is 0.1, and the E-label probability is 0.8, then the label of "啊" is determined to be E. From the labels of all words, the sequence labeling of the entire text data is obtained; then, from the end labels in that sequence labeling, the sentence boundary of each text clause, and thus the extent of each text clause, is obtained.
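Picking the maximum of a per-word B/I/E probability distribution is a one-line argmax; the sketch below reuses the "啊" numbers from the example above:

```python
def pick_tag(prob_dist):
    """Choose the label with the highest probability from a {label: probability} map."""
    return max(prob_dist, key=prob_dist.get)

# The '啊' example from the text: B=0.3, I=0.1, E=0.8
tag = pick_tag({"B": 0.3, "I": 0.1, "E": 0.8})
# -> "E"
```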
Specifically, for the example "Why are you coming home so late? Please turn on the bedroom light", this step yields its sequence labeling, e.g., "BIIIIIIIIEBIIIIIE", from which two text clauses are obtained, namely "Why are you coming home so late?" and "Please turn on the bedroom light".
(5) Through the domain classification part of the CNN model, obtain the domain information of each text clause according to the syntactic features of the text vector and the information of each text clause.
This includes: performing a batch normalization operation on the syntactic features of the text vector (optionally via the batch normalization layer of the domain classification part) to generate normalized syntactic features; performing feature mapping on the normalized syntactic features through the convolutional layer to obtain the domain features of the text vector; pooling the domain features of the text vector through the pooling layer according to the information of each text clause; and, through the output layer, obtaining the domain information of each text clause from the result of the pooling.
First, for the syntactic features of the text vector, the batch normalization layer and convolutional layer of the domain classification part map those syntactic features to domain features C. In the example "Why are you coming home so late? Please turn on the bedroom light", the resulting domain features C may be an N×D two-dimensional matrix, where N is the number of word units in the text data (17 in this example) and D is the dimension of each word's domain feature vector.
Second, according to the range of each text clause obtained by the sentence boundary detection part, the domain feature C can be converted into S=(m1,m2,m3,...), where each m is the two-dimensional domain feature matrix corresponding to one text clause and S is the set of the two-dimensional domain feature matrices of the text clauses, which together also form an N*D two-dimensional matrix. Each m is a W*D two-dimensional matrix, where W is the number of words in the current text clause and D, as described above, is the dimension of the feature vector of each word.
For the example "Why are you coming home so late? Please turn on the bedroom light", which includes the text clause "Why are you coming home so late", the corresponding two-dimensional domain feature matrix m1 is a 10*D matrix; for the text clause "Please turn on the bedroom light", the corresponding two-dimensional domain feature matrix m2 is a 7*D matrix. Correspondingly, S=(m1, m2).
Next, for S, the maximum value is taken along the first dimension, i.e., the N dimension: a max operation is performed on the first dimension of the two-dimensional domain feature matrix corresponding to each text clause, yielding a one-dimensional feature (1*D) for each text clause, and thereby a fixed-length feature representation T=(u1,u2,u3,...) of all text clauses, where each u is a one-dimensional domain feature vector of length D corresponding to one text clause.
Then, T=(u1,u2,u3,...) is pooled through the pooling layer of the domain classification part, the domain probability distribution of each text clause is obtained through a Softmax function, and the domain information of each text clause is determined according to its domain probability distribution.
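The per-clause max reduction and Softmax described above can be sketched with NumPy as follows. Dimensions follow the running example (N=17 words, two clauses of 10 and 7 words); the tiny D, the number of domains, and the random weights are placeholders for illustration, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, num_domains = 4, 3              # toy dimensions for illustration
C = rng.standard_normal((17, D))   # domain feature C: N*D, N = 17 words

# Split C by the clause ranges from the sentence boundary detection part.
m1, m2 = C[0:10], C[10:17]         # 10*D and 7*D matrices
S = [m1, m2]

# Max over the first (word) dimension gives one 1*D vector per clause,
# i.e. the fixed-length representation T = (u1, u2).
T = np.stack([m.max(axis=0) for m in S])   # shape (2, D)

# Project each clause vector to domain logits and apply Softmax.
W = rng.standard_normal((D, num_domains))  # placeholder classifier weights
logits = T @ W
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
domains = probs.argmax(axis=1)     # domain information per clause
print(probs.shape, domains.shape)
```

Each row of `probs` is the domain probability distribution of one clause and sums to 1; `argmax` picks the domain information.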
(6) Recognizing the voice command in the voice input data according to the domain information of each text clause.
It can be seen that, through the above processes (2)-(5), the division of the text clauses and the determination of their domain information by the CNN model are realized. Through this CNN model, the two tasks of voice command extraction and recognition are unified within a single CNN model framework, effectively realizing the segmentation and recognition of user commands.
Further, based on the output of the CNN model, the voice command can be determined.
For example, if the domain information IOT (Internet of Things) is set for voice commands, then when the domain information of a text clause is classified into the IOT domain, the part of the voice input data corresponding to that text clause can be regarded as a voice command. For the example "Why are you coming home so late? Please turn on the bedroom light", "Please turn on the bedroom light" is classified into the IOT domain; therefore, "Please turn on the bedroom light" is determined to be a voice command for operating and controlling the smart voice device.
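Once each clause carries a domain label, selecting the command clauses reduces to a filter. A minimal sketch (the domain names, including the non-command label "CHAT", are illustrative assumptions):

```python
def extract_commands(clauses, domains, command_domain="IOT"):
    """Return the clauses whose domain information matches the command domain."""
    return [c for c, d in zip(clauses, domains) if d == command_domain]

clauses = ["你为什么这么晚回家啊", "请打开卧室的灯"]
domains = ["CHAT", "IOT"]   # per-clause domain info from the classifier
print(extract_commands(clauses, domains))
```

Only the second clause survives the filter, matching the example's conclusion.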
Through this embodiment, the text data converted from the voice input data and the text vector corresponding to the text data are first acquired; the corresponding syntactic feature is then obtained by feature extraction on the text vector; next, according to the syntactic feature, the text data corresponding to the voice input data is divided into text clauses and the domain information of each text clause is determined; and finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. It can be seen that this embodiment makes the smart voice device better suited to real usage environments: the user no longer needs to use a wake-up word to wake the smart voice device. Whether the user's voice input data consists purely of voice commands or mixes voice commands with other speech, the voice input data can be effectively divided into clauses and the voice commands contained therein can be recognized, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the degree of intelligent processing of user voice commands by the smart voice device is improved; moreover, the smart voice device no longer needs to process a wake-up word, which reduces its data processing burden.
The speech recognition method of this embodiment may be executed by any appropriate smart voice device with data processing capability, e.g., various smart home appliances with corresponding functions.
Embodiment Three
Referring to Fig. 4, a structural block diagram of a speech recognition apparatus according to Embodiment Three of the present invention is shown.
The speech recognition apparatus of this embodiment includes: a first acquisition module 302, configured to acquire text data corresponding to voice input data and a text vector corresponding to the text data; a second acquisition module 304, configured to acquire a syntactic feature of the text vector; a third acquisition module 306, configured to acquire, according to the syntactic feature, at least one text clause contained in the text data, and to acquire domain information of each text clause; and a recognition module 308, configured to recognize a voice command in the voice input data at least according to the domain information of each text clause.
Through this embodiment, the text data converted from the voice input data and the text vector corresponding to the text data are first acquired; the corresponding syntactic feature is then obtained by feature extraction on the text vector; next, according to the syntactic feature, the text data corresponding to the voice input data is divided into text clauses and the domain information of each text clause is determined; and finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. It can be seen that this embodiment makes the smart voice device better suited to real usage environments: the user no longer needs to use a wake-up word to wake the smart voice device. Whether the user's voice input data consists purely of voice commands or mixes voice commands with other speech, the voice input data can be effectively divided into clauses and the voice commands contained therein can be recognized, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the degree of intelligent processing of user voice commands by the smart voice device is improved; moreover, the smart voice device no longer needs to process a wake-up word, which reduces its data processing burden.
Embodiment Four
Referring to Fig. 5, a structural block diagram of a speech recognition apparatus according to Embodiment Four of the present invention is shown.
The speech recognition apparatus of this embodiment includes: a first acquisition module 402, configured to acquire text data corresponding to voice input data and a text vector corresponding to the text data; a second acquisition module 404, configured to acquire a syntactic feature of the text vector; a third acquisition module 406, configured to acquire, according to the syntactic feature, at least one text clause contained in the text data, and to acquire domain information of each text clause; and a recognition module 408, configured to recognize a voice command in the voice input data at least according to the domain information of each text clause.
Optionally, the first acquisition module 402 is configured to acquire voice input data and generate text data corresponding to the voice input data; generate a word vector corresponding to each word in the text data; and generate, according to the word vector corresponding to each word, the text vector corresponding to the text data.
Optionally, the second acquisition module 404 is configured to perform feature extraction on the text vector to acquire the syntactic feature of the text vector.
Optionally, the second acquisition module 404 is configured to perform feature extraction on the word vector corresponding to each word in the text vector to acquire the syntactic feature of each word.
Optionally, the third acquisition module 406 includes: a clause acquisition module 4062, configured to acquire a tag of each word according to the syntactic feature of each word, wherein the tags include an end tag, obtain a sequence labeling of the text data according to the tag of each word, and acquire, according to the end tags in the sequence labeling, at least one text clause contained in the text data; and a domain acquisition module 4064, configured to acquire the domain information of each text clause according to the syntactic feature.
Optionally, the domain acquisition module 4064 includes: a domain feature module 40642, configured to acquire a domain feature corresponding to each text clause according to the syntactic feature of the text vector; and a determination module 40644, configured to perform, for the domain feature of each text clause, maximum feature value extraction in each feature dimension to generate a domain feature vector of each text clause, and determine the domain information of the current text clause according to the domain feature vector of each text clause.
Optionally, the domain feature module 40642 is configured to acquire a domain feature of the text vector according to the syntactic feature of the text vector, and acquire, from the domain feature of the text vector, the domain feature corresponding to each text clause according to the information of the words contained in each text clause.
Optionally, the second acquisition module 404 is configured to perform feature extraction on the text vector through a feature extraction part of a convolutional neural network model to acquire the syntactic feature of the text vector; and the third acquisition module 406 is configured to acquire, through a sentence boundary detection part of the convolutional neural network model, at least one text clause contained in the text data according to the syntactic feature, and to acquire, through a domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information of each text clause; wherein the sentence boundary detection part and the domain classification part share the syntactic feature extracted by the feature extraction part.
Optionally, the second acquisition module 404 is configured to: perform a batch normalization operation on an input vector to generate a normalized vector; perform non-linearization processing on the normalized vector; perform feature extraction on the non-linearized vector through a convolutional layer to obtain an initial feature; perform residual analysis processing on the initial feature, and obtain and output the syntactic feature of the vector according to the result of the residual analysis processing; and return to the batch normalization operation on the input vector and continue execution until the syntactic feature of the text vector is obtained.
Optionally, the feature extraction part includes at least 12 convolutional layers, and the normalized vector is non-linearized through a linear gate function.
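The "linear gate function" reads as a gated linear unit (GLU); a minimal NumPy sketch of that non-linearization, under that assumption:

```python
import numpy as np

def glu(x, axis=-1):
    """Gated linear unit: split the channels into halves A and B along
    `axis` and return A * sigmoid(B), so one half gates the other."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

x = np.array([[1.0, -2.0, 0.0, 3.0]])  # 1 word, 4 channels
y = glu(x)
print(y.shape)  # the gating halves the channel dimension
```

Note that the output has half as many channels as the input, so a convolutional layer feeding a GLU would produce twice the target channel count.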
Optionally, when acquiring, through the sentence boundary detection part of the convolutional neural network model, the at least one text clause contained in the text data according to the syntactic feature, the third acquisition module 406: performs a batch normalization operation on the syntactic feature to generate a normalized syntactic feature; performs feature extraction on the normalized syntactic feature through a convolutional layer; and determines, through an output layer, the tag of each word in the text data according to the feature extraction result, and acquires at least one text clause contained in the text data according to the tag of each word.
Optionally, when acquiring, through the domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information of each text clause, the third acquisition module 406: performs a batch normalization operation on the syntactic feature to generate a normalized syntactic feature; performs feature mapping on the normalized syntactic feature through a convolutional layer to acquire the domain feature of the text vector; performs pooling processing on the domain feature of the text vector through a pooling layer according to the information of each text clause; and acquires, through an output layer, the domain information of each text clause according to the result of the pooling processing.
The speech recognition apparatus of this embodiment is configured to implement the corresponding speech recognition methods in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here. In addition, for the functional implementation of each module in the speech recognition apparatus of this embodiment, reference may be made to the description of the corresponding parts in the foregoing method embodiments, which will likewise not be repeated here.
Embodiment Five
Referring to Fig. 6, a schematic structural diagram of a smart device according to Embodiment Five of the present invention is shown. The specific embodiments of the present invention do not limit the specific implementation of the smart device.
As shown in Fig. 6, the smart device may include: a processor 502, a communications interface 504, a memory 506, and a communication bus 508.
Wherein:
the processor 502, the communications interface 504, and the memory 506 communicate with one another through the communication bus 508.
The communications interface 504 is configured to communicate with other electronic devices such as other smart devices or servers.
The processor 502 is configured to execute a program 510, and may specifically execute the relevant steps in the above speech recognition method embodiments.
Specifically, the program 510 may include program code, and the program code includes computer operation instructions.
The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 506 is configured to store the program 510. The memory 506 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory.
The program 510 may specifically be configured to cause the processor 502 to perform the following operations: acquiring text data corresponding to voice input data and a text vector corresponding to the text data; acquiring a syntactic feature of the text vector; acquiring, according to the syntactic feature, at least one text clause contained in the text data, and acquiring domain information of each text clause; and recognizing a voice command in the voice input data at least according to the domain information of each text clause.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the text data corresponding to the voice input data and the text vector corresponding to the text data, to: acquire the voice input data and generate the text data corresponding to the voice input data; generate a word vector corresponding to each word in the text data; and generate, according to the word vector corresponding to each word, the text vector corresponding to the text data.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the syntactic feature of the text vector, to perform feature extraction on the text vector to acquire the syntactic feature of the text vector.
In an optional implementation, the program 510 is further configured to cause the processor 502, when performing feature extraction on the text vector to acquire the syntactic feature of the text vector, to perform feature extraction on the word vector corresponding to each word in the text vector to acquire the syntactic feature of each word.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the at least one text clause contained in the text data according to the syntactic feature, to: acquire a tag of each word according to the syntactic feature of each word, wherein the tags include an end tag; obtain a sequence labeling of the text data according to the tag of each word; and acquire, according to the end tags in the sequence labeling, the at least one text clause contained in the text data.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the domain information of each text clause, to: acquire, according to the syntactic feature of the text vector, a domain feature corresponding to each text clause; perform, for the domain feature of each text clause, maximum feature value extraction in each feature dimension to generate a domain feature vector of each text clause; and determine the domain information of the current text clause according to the domain feature vector of each text clause.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the domain feature corresponding to each text clause according to the syntactic feature of the text vector, to: acquire a domain feature of the text vector according to the syntactic feature of the text vector; and acquire, from the domain feature of the text vector, the domain feature corresponding to each text clause according to the information of the words contained in each text clause.
In an optional implementation, the program 510 is further configured to cause the processor 502 to perform feature extraction on the text vector through a feature extraction part of a convolutional neural network model to acquire the syntactic feature of the text vector; the program 510 is further configured to cause the processor 502 to acquire, through a sentence boundary detection part of the convolutional neural network model, at least one text clause contained in the text data according to the syntactic feature, and to acquire, through a domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information of each text clause; wherein the sentence boundary detection part and the domain classification part share the syntactic feature extracted by the feature extraction part.
In an optional implementation, the program 510 is further configured to cause the processor 502, when performing feature extraction on the text vector through the feature extraction part of the convolutional neural network model to acquire the syntactic feature of the text vector, to: perform a batch normalization operation on an input vector to generate a normalized vector; perform non-linearization processing on the normalized vector; perform feature extraction on the non-linearized vector through a convolutional layer to obtain an initial feature; perform residual analysis processing on the initial feature, and obtain and output the syntactic feature of the vector according to the result of the residual analysis processing; and return to the step of performing the batch normalization operation on the input vector and continue execution until the syntactic feature of the text vector is obtained.
In an optional implementation, the feature extraction part includes at least 12 convolutional layers, and the normalized vector is non-linearized through a linear gate function.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring, through the sentence boundary detection part of the convolutional neural network model, the at least one text clause contained in the text data according to the syntactic feature, to: perform a batch normalization operation on the syntactic feature to generate a normalized syntactic feature; perform feature extraction on the normalized syntactic feature through a convolutional layer; and determine, through an output layer, the tag of each word in the text data according to the feature extraction result, and acquire at least one text clause contained in the text data according to the tag of each word.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring, through the domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information of each text clause, to: perform a batch normalization operation on the syntactic feature to generate a normalized syntactic feature; perform feature mapping on the normalized syntactic feature through a convolutional layer to acquire the domain feature of the text vector; perform pooling processing on the domain feature of the text vector through a pooling layer according to the information of each text clause; and acquire, through an output layer, the domain information of each text clause according to the result of the pooling processing.
In an optional implementation, the smart device of this embodiment may further include a microphone to receive an analog voice signal input by the user and convert it into a digital voice signal, i.e., the voice input data; the program 510 may further be configured to cause the processor 502 to convert the voice input data into corresponding text data. However, this is not limiting: the microphone may also be provided independently of the smart device, connected to the smart device through an appropriate connection, and send the voice input data to the processor.
For the specific implementation of each step in the program 510, reference may be made to the corresponding descriptions of the corresponding steps and units in the above speech recognition method embodiments, which will not be repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, reference may be made to the corresponding process descriptions in the foregoing method embodiments, which will not be repeated here.
Through the smart device of this embodiment, the text data converted from the voice input data and the text vector corresponding to the text data are first acquired; the corresponding syntactic feature is then obtained by feature extraction on the text vector; next, according to the syntactic feature, the text data corresponding to the voice input data is divided into text clauses and the domain information of each text clause is determined; and finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. It can be seen that this embodiment makes the smart voice device better suited to real usage environments: the user no longer needs to use a wake-up word to wake the smart voice device. Whether the user's voice input data consists purely of voice commands or mixes voice commands with other speech, the voice input data can be effectively divided into clauses and the voice commands contained therein can be recognized, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the degree of intelligent processing of user voice commands by the smart voice device is improved; moreover, the smart voice device no longer needs to process a wake-up word, which reduces its data processing burden.
It should be pointed out that, according to implementation needs, each component/step described in the embodiments of the present invention may be split into more components/steps, and two or more components/steps or partial operations of components/steps may be combined into new components/steps, so as to achieve the purpose of the embodiments of the present invention.
The above method according to the embodiments of the present invention may be implemented in hardware or firmware, or implemented as software or computer code storable in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is downloaded over a network, originally stored in a remote recording medium or a non-transitory machine-readable medium, and to be stored in a local recording medium, so that the method described herein can be processed by such software on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the speech recognition method described herein is implemented. In addition, when a general-purpose computer accesses code for implementing the speech recognition method shown herein, execution of the code converts the general-purpose computer into a dedicated computer for executing the speech recognition method shown herein.
A person of ordinary skill in the art may be aware that the units and method steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or in software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the embodiments of the present invention.
The above implementations are only intended to illustrate the embodiments of the present invention, not to limit them. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention; therefore, all equivalent technical solutions also fall within the scope of the embodiments of the present invention, and the patent protection scope of the embodiments of the present invention shall be defined by the claims.

Claims (13)

  1. A speech recognition method, comprising:
    acquiring text data corresponding to voice input data and a text vector corresponding to the text data;
    acquiring a syntactic feature of the text vector;
    acquiring, according to the syntactic feature, at least one text clause contained in the text data, and acquiring domain information of each text clause;
    recognizing a voice instruction in the voice input data at least according to the domain information of each text clause.
  2. The method according to claim 1, wherein the acquiring text data corresponding to voice input data and a text vector corresponding to the text data comprises:
    acquiring the voice input data, and generating the text data corresponding to the voice input data;
    generating a word vector corresponding to each word in the text data;
    generating, according to the word vector corresponding to each word, the text vector corresponding to the text data.
  3. The method according to claim 2, wherein the acquiring a syntactic feature of the text vector comprises:
    performing feature extraction on the text vector to acquire the syntactic feature of the text vector.
  4. The method according to claim 3, wherein the performing feature extraction on the text vector to acquire the syntactic feature of the text vector comprises:
    performing feature extraction on the word vector corresponding to each word in the text vector to acquire a syntactic feature of each word.
  5. The method according to claim 4, wherein the acquiring, according to the syntactic feature, at least one text clause contained in the text data comprises:
    acquiring a tag of each word according to the syntactic feature of the word, wherein the tags include an end tag;
    obtaining a sequence labeling of the text data according to the tag of each word;
    acquiring, according to the end tags in the sequence labeling, the at least one text clause contained in the text data.
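As a hedged, non-authoritative sketch of the clause-segmentation step described in this claim: given one tag per word, where an end tag marks the last word of a clause, the text can be split at each end tag. The tag names (`"E"` for the end tag, `"O"` elsewhere) and the example words are invented for illustration; the claim itself does not fix a tag set.

```python
def split_clauses(words, tags):
    """Split a tagged word sequence into clauses.

    `tags` holds one sequence label per word; the hypothetical
    label "E" plays the role of the end tag in the claim.
    """
    clauses, current = [], []
    for word, tag in zip(words, tags):
        current.append(word)
        if tag == "E":        # end tag closes the current clause
            clauses.append(current)
            current = []
    if current:               # trailing words with no end tag
        clauses.append(current)
    return clauses

words = ["turn", "on", "the", "light", "play", "some", "music"]
tags  = ["O", "O", "O", "E", "O", "O", "E"]
print(split_clauses(words, tags))
# → [['turn', 'on', 'the', 'light'], ['play', 'some', 'music']]
```

In this sketch, the two invented voice requests in one utterance are recovered as two clauses, which downstream steps can classify into separate domains.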
  6. The method according to claim 5, wherein the acquiring domain information of each text clause comprises:
    acquiring a domain feature corresponding to each text clause according to the syntactic feature of the text vector;
    extracting, for the domain feature of each text clause, a maximum feature value in each feature dimension to generate a domain feature vector of the text clause;
    determining the domain information of the current text clause according to the domain feature vector of each text clause.
  7. The method according to claim 6, wherein the acquiring a domain feature corresponding to each text clause according to the syntactic feature of the text vector comprises:
    acquiring a domain feature of the text vector according to the syntactic feature of the text vector;
    acquiring, from the domain feature of the text vector, the domain feature corresponding to each text clause according to information on the words contained in the text clause.
  8. The method according to any one of claims 1-7, wherein:
    feature extraction is performed on the text vector by a feature extraction part of a convolutional neural network model to acquire the syntactic feature of the text vector;
    the at least one text clause contained in the text data is acquired by a sentence boundary detection part of the convolutional neural network model according to the syntactic feature; and the domain information of each text clause is acquired by a domain classification part of the convolutional neural network model according to the syntactic feature and information on each text clause;
    wherein the sentence boundary detection part and the domain classification part share the syntactic feature extracted by the feature extraction part.
  9. The method according to claim 8, wherein the performing feature extraction on the text vector by the feature extraction part of the convolutional neural network model to acquire the syntactic feature of the text vector comprises:
    performing a batch normalization operation on an input vector to generate a normalized vector;
    performing non-linear processing on the normalized vector;
    performing feature extraction on the non-linearly processed vector through a convolutional layer to obtain an initial feature;
    performing residual analysis processing on the initial feature, and obtaining and outputting a syntactic feature of the vector according to a result of the residual analysis processing;
    returning to the step of performing a batch normalization operation on an input vector and continuing execution until the syntactic feature of the text vector is obtained.
  10. The method according to claim 9, wherein the feature extraction part comprises at least 12 convolutional layers, and the non-linear processing is performed on the normalized vector through a linear gate function.
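A minimal NumPy sketch of one such feature-extraction block may help fix the ideas in claims 9-10: batch normalization, a gated-linear non-linearity (one reading of the claimed "linear gate function"), a feature map standing in for the convolutional layer, and a residual connection. All dimensions are invented, and the convolution is simplified to a per-position linear map purely for illustration; this is not the patented implementation.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature dimension over the sequence axis.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def glu(x):
    # Gated linear unit: split channels in half and gate one half
    # by a sigmoid of the other half.
    a, b = np.split(x, 2, axis=-1)
    return a / (1.0 + np.exp(-b))

def residual_block(x, w):
    # w has shape (d, 2*d) so the gated output matches x and the
    # residual sum (claim 9's "residual analysis processing") works.
    h = batch_norm(x)
    h = glu(h @ w)          # stand-in for the convolutional layer
    return x + h            # residual connection

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 16))        # 7 "words", 16 features each
w = rng.normal(size=(16, 32))
y = residual_block(x, w)
print(y.shape)                      # → (7, 16): shape preserved
```

Because each block preserves the feature shape, twelve or more of them can be stacked, as claim 10 requires of the feature extraction part, by feeding each block's output into the next.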
  11. The method according to claim 8, wherein the acquiring, by the sentence boundary detection part of the convolutional neural network model, the at least one text clause contained in the text data according to the syntactic feature comprises:
    performing a batch normalization operation on the syntactic feature to generate a normalized syntactic feature;
    performing feature extraction on the normalized syntactic feature through a convolutional layer;
    determining, through an output layer, the tag of each word in the text data according to a result of the feature extraction, and acquiring the at least one text clause contained in the text data according to the tag of each word.
  12. The method according to claim 8, wherein the acquiring, by the domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information on each text clause comprises:
    performing a batch normalization operation on the syntactic feature to generate a normalized syntactic feature;
    performing feature mapping on the normalized syntactic feature through a convolutional layer to acquire the domain feature of the text vector;
    performing, through a pooling layer, pooling processing on the domain feature of the text vector according to the information on each text clause;
    acquiring, through an output layer, the domain information of each text clause according to a result of the pooling processing.
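The pooling and output steps of claim 12 (together with the per-dimension maximum of claim 6) can be sketched as follows. The domain label set, the linear output layer, and all feature values are invented for illustration; the claim does not specify them.

```python
import numpy as np

DOMAINS = ["music", "weather", "smart_home"]   # hypothetical label set

def clause_domains(domain_feats, spans, w_out):
    """domain_feats: (num_words, d) per-word domain features;
    spans: [(start, end), ...] word-index ranges of the clauses;
    w_out: (d, num_domains) linear stand-in for the output layer."""
    results = []
    for start, end in spans:
        # Pooling layer: maximum feature value in each feature
        # dimension, restricted to the words of this clause.
        pooled = domain_feats[start:end].max(axis=0)
        # Output layer: score each domain, keep the best one.
        results.append(DOMAINS[int(np.argmax(pooled @ w_out))])
    return results

rng = np.random.default_rng(1)
feats = rng.normal(size=(7, 8))                # 7 words, 8 domain features
out = clause_domains(feats, [(0, 4), (4, 7)], rng.normal(size=(8, 3)))
print(out)                                     # one domain label per clause
```

Restricting the max-pooling to each clause's span is what lets a single utterance containing several clauses receive several independent domain labels.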
  13. A speech recognition apparatus, comprising:
    a first acquisition module configured to acquire text data corresponding to voice input data and a text vector corresponding to the text data;
    a second acquisition module configured to acquire a syntactic feature of the text vector;
    a third acquisition module configured to acquire, according to the syntactic feature, at least one text clause contained in the text data, and acquire domain information of each text clause;
    a recognition module configured to recognize a voice instruction in the voice input data at least according to the domain information of each text clause.
PCT/CN2020/070581 2019-01-18 2020-01-07 Speech recognition method and apparatus WO2020147609A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910047340.2A CN111462738B (en) 2019-01-18 2019-01-18 Speech recognition method and device
CN201910047340.2 2019-01-18

Publications (1)

Publication Number Publication Date
WO2020147609A1 2020-07-23

Family

ID=71613709

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070581 WO2020147609A1 (en) 2019-01-18 2020-01-07 Speech recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN111462738B (en)
WO (1) WO2020147609A1 (en)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103839549A * 2012-11-22 2014-06-04 Tencent Technology (Shenzhen) Co., Ltd. Voice instruction control method and system
CN105469789A * 2014-08-15 2016-04-06 ZTE Corporation Voice information processing method and voice information processing terminal
US10121467B1 * 2016-06-30 2018-11-06 Amazon Technologies, Inc. Automatic speech recognition incorporating word usage information
CN107247702A * 2017-05-05 2017-10-13 Guilin University of Electronic Technology Text emotion analysis and processing method and system
CN107773982B * 2017-10-20 2021-08-13 iFlytek Co., Ltd. Game voice interaction method and device
CN108091327A * 2018-02-22 2018-05-29 Chengdu Qiying Tailun Technology Co., Ltd. Intelligent voice device control method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106471570A * 2014-05-30 2017-03-01 Apple Inc. Multi-command single utterance input method
US20180350353A1 * 2014-05-30 2018-12-06 Apple Inc. Multi-command single utterance input method
CN106528522A * 2016-08-26 2017-03-22 Nanjing Weikaer Software Co., Ltd. Scenario-based semantic understanding and dialogue generation method and system
CN107315737A * 2017-07-04 2017-11-03 Beijing QIYI Century Science & Technology Co., Ltd. Semantic logic processing method and system
CN107679042A * 2017-11-15 2018-02-09 Beijing Lingban Instant Intelligent Technology Co., Ltd. Multi-layer dialogue analysis method for intelligent voice dialogue systems
CN108563790A * 2018-04-28 2018-09-21 iFlytek Co., Ltd. Semantic understanding method and device, equipment, and computer-readable medium

Also Published As

Publication number Publication date
CN111462738A (en) 2020-07-28
CN111462738B (en) 2024-05-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20741402

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20741402

Country of ref document: EP

Kind code of ref document: A1