WO2020147609A1 - Speech recognition method and apparatus - Google Patents


Info

Publication number
WO2020147609A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
vector
syntactic
clause
Prior art date
Application number
PCT/CN2020/070581
Other languages
French (fr)
Chinese (zh)
Inventor
张帆
郑梓豪
胡于响
姜飞俊
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2020147609A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 — Speech to text systems
    • G10L2015/223 — Execution procedure of a spoken command

Definitions

  • Embodiments of the present invention relate to the field of computer technology, and in particular to a speech recognition method and apparatus.
  • Intelligent equipment combines traditional electrical equipment with computer technology, data processing, control, sensor, network communication, power electronics, and related technologies.
  • Smart voice devices are an important branch of such equipment.
  • With smart voice devices, users can control various smart devices by voice alone, including the smart voice device itself and other smart devices it controls.
  • At present, each control of a smart voice device requires a wake-up word, followed by a voice command, to carry out the user's intention.
  • For example, the user needs to use the wake-up word "Tmall Genie" to wake the smart voice device every time before any operation or control can be carried out.
  • In the sentence "Why are you going home so late? Please turn on the bedroom lights", the first part, "Why are you going home so late?", is interaction between users, while "Please turn on the bedroom lights" is a control instruction for the smart voice device. Current smart voice devices cannot process such complex mixed instructions issued without a wake-up word.
  • In view of this, embodiments of the present invention provide a speech recognition solution to solve the above problem.
  • According to a first aspect, a speech recognition method is provided, including: obtaining text data corresponding to voice input data and a text vector corresponding to the text data; obtaining syntactic features of the text vector; obtaining, according to the syntactic features, at least one text clause contained in the text data, and obtaining domain information of each text clause; and recognizing, at least according to the domain information of each text clause, a voice command in the voice input data.
  • According to a second aspect, a speech recognition apparatus is provided, including: a first acquisition module for obtaining text data corresponding to voice input data and a text vector corresponding to the text data; a second acquisition module for obtaining syntactic features of the text vector; a third acquisition module for obtaining, according to the syntactic features, at least one text clause contained in the text data and the domain information of each text clause; and a recognition module for recognizing, at least according to the domain information of each text clause, a voice command in the voice input data.
  • According to a third aspect, an intelligent device is provided, including a processor, a memory, a communication interface, and a communication bus. The processor, the memory, and the communication interface communicate with each other through the communication bus. The memory stores at least one executable instruction that causes the processor to perform operations corresponding to the speech recognition method of the first aspect.
  • According to a fourth aspect, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the speech recognition method of the first aspect is implemented.
  • With the solution of the embodiments, the text data converted from the voice input data and the corresponding text vector are first obtained; the syntactic features are then obtained by feature extraction on the text vector; according to the syntactic features, the text data is divided into text clauses and the domain information of each clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. In this way, the smart voice device is better suited to real use environments: the user does not need a wake-up word, and whether the voice input data contains pure voice commands or a mixture of voice commands and other voice data, the device can effectively divide it into clauses, recognize the voice commands contained therein, and subsequently be operated and controlled by them.
  • FIG. 1 is a flowchart of the steps of a speech recognition method according to the first embodiment of the present invention;
  • FIG. 2 is a flowchart of the steps of a speech recognition method according to the second embodiment of the present invention;
  • FIG. 3 is a schematic structural diagram of the neural network model in the embodiment shown in FIG. 2;
  • FIG. 4 is a structural block diagram of a speech recognition apparatus according to the third embodiment of the present invention;
  • FIG. 5 is a structural block diagram of a speech recognition apparatus according to the fourth embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of a smart device according to the fifth embodiment of the present invention.
  • Referring to FIG. 1, a flowchart of the steps of a speech recognition method according to the first embodiment of the present invention is shown.
  • Step S102 Acquire text data corresponding to the voice input data and text vectors corresponding to the text data.
  • In this embodiment, the user can operate and control the smart voice device by voice; the smart voice device takes the user's voice as input, generates corresponding voice input data, converts the voice input data into corresponding text data, and then processes it accordingly.
  • a text vector corresponding to the text data is also obtained to characterize the text data in the form of a vector, and to facilitate subsequent processing.
  • The specific implementation of converting the voice input data into corresponding text data and obtaining the text vector corresponding to the text data can be realized by those skilled in the art in any appropriate manner according to actual needs; the embodiments of the present invention do not limit this.
  • For example, a convolutional neural network (CNN) model, a BP neural network model, a hidden Markov model (HMM), or multi-band spectral subtraction can be used to convert the voice input data into text data. The text vector corresponding to the text data can be obtained, for example, by a deep learning method (such as word2vec), a graph-based method (such as TextRank), a topic-model-based method (such as LDA), or a statistical method (such as bag of words).
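  • As an illustration of the text-vector step above, the following minimal sketch maps each word of "please turn on the bedroom light" to a fixed-dimension vector via an embedding lookup. The vocabulary, the dimension D=8, and the random embedding values are assumptions for illustration only; a real system would use vectors trained by word2vec or a similar method.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative choice)
vocab = {"please": 0, "turn": 1, "on": 2, "the": 3, "bedroom": 4, "light": 5}
# Random values stand in for vectors learned by word2vec or similar.
embeddings = rng.standard_normal((len(vocab), D))

def text_to_vectors(words):
    """Map each word to its D-dimensional vector; unknown words get zeros."""
    return np.stack([embeddings[vocab[w]] if w in vocab else np.zeros(D)
                     for w in words])

text_vector = text_to_vectors("please turn on the bedroom light".split())
print(text_vector.shape)  # (6, 8): one D-dimensional vector per word
```

The resulting N*D array of per-word vectors plays the role of the "text vector" that the later steps operate on.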
  • Step S104 Obtain the syntactic feature of the text vector corresponding to the text data.
  • The syntactic features of the text vector can represent the dependency relationships and semantic information between words in the corresponding text data, and can be expressed by a syntactic feature vector.
  • Optionally, feature extraction may be performed on the text vector through a convolutional neural network (CNN) model or a recurrent neural network (RNN) model to obtain the syntactic features of the text vector. However, this is not limiting; in practical applications, those skilled in the art may also use other appropriate methods to obtain the syntactic features, such as text classification.
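  • A minimal numpy sketch of the CNN-style option: a "same"-padded 1-D convolution slides over the word-vector sequence and emits one feature vector per word. The kernel width K=3, feature size F=16, and random weights are illustrative assumptions, not the trained model of the embodiments.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, F, K = 6, 8, 16, 3  # words, input dim, feature dim, kernel width (assumed)
x = rng.standard_normal((N, D))           # text vector: one row per word
W = rng.standard_normal((K, D, F)) * 0.1  # convolution kernel weights (stand-in)
b = np.zeros(F)

def conv1d_same(x, W, b):
    """'Same'-padded 1-D convolution plus ReLU: one feature row per word."""
    K = W.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([sum(xp[i + k] @ W[k] for k in range(K)) + b
                    for i in range(x.shape[0])])
    return np.maximum(out, 0)  # ReLU non-linearity

syntactic = conv1d_same(x, W, b)
print(syntactic.shape)  # (6, 16): a syntactic feature vector per word
```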
  • Step S106 Obtain at least one text clause contained in the text data according to the syntactic feature, and obtain domain information of each text clause.
  • the text data corresponding to the voice input data contains one or more text clauses.
  • the text clause may be the text clause corresponding to the voice command, or other text clauses.
  • The multiple text clauses may also be a mixture of clauses corresponding to voice commands and clauses corresponding to other voice data, as in a complex multi-user scene where user A and user B are talking while one of them sends a voice command to the smart voice device. For example, in "Why are you going home so late? Please turn on the light in the bedroom", the first half, "Why are you going home so late?", is recognized as a text clause corresponding to other voice data, and the second half, "Please turn on the light in the bedroom", as a text clause corresponding to a voice command.
  • one or more text clauses in the text data can be determined according to the syntactic feature.
  • The method of obtaining text clauses can be adapted to the method of obtaining syntactic features. For example, when a CNN or RNN model is used to obtain the syntactic features of the text vector, the text data can be sequence-labeled according to the syntactic features, and one or more text clauses obtained from the result of the sequence labeling.
  • In this embodiment, the domain information of each text clause is also obtained according to the syntactic features of the text vector, for example through a machine learning algorithm or a neural network model that derives the domain information of the corresponding text clause from the syntactic features, where the domain information includes the information of the domain corresponding to the voice command.
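  • One way to read this step: once a clause has a fixed-length feature vector, a simple classifier maps it to domain probabilities. The linear-plus-softmax form, the 16-dimensional feature, and the two-label domain set below are assumptions for illustration; the embodiments leave the concrete model open.

```python
import numpy as np

rng = np.random.default_rng(3)
domains = ["IOT", "chat"]  # assumed domain label set for illustration
W = rng.standard_normal((16, len(domains)))  # stand-in for trained weights

def classify(feature):
    """Linear layer + softmax over the assumed domain labels."""
    logits = feature @ W
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return domains[int(p.argmax())], p

label, probs = classify(rng.standard_normal(16))
print(label, probs.round(3))
```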
  • Step S108 Recognize the voice command in the voice input data at least according to the domain information of each text clause.
  • When the voice input data includes a voice command, among the one or more text clauses contained in the corresponding text data there should be a clause whose domain information indicates that the corresponding part of the voice input data is a voice command, and the voice command can thus be recognized from the voice input data. For example, the text clause "Please turn on the bedroom light" can be determined to be a voice command.
  • In this embodiment, the text data converted from the voice input data and the corresponding text vector are first obtained; the syntactic features are then obtained by feature extraction on the text vector; according to the syntactic features, the text data is divided into text clauses and the domain information of each clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. In this way, the smart voice device is better suited to real use environments: the user does not need a wake-up word, and whether the voice input data contains pure voice commands or a mixture of voice commands and other voice data, the device can effectively divide it into clauses, recognize the voice commands contained therein, and subsequently be operated and controlled by them.
  • the voice recognition method of this embodiment can be executed by any appropriate smart voice device with data processing capabilities, such as various smart home appliances with corresponding functions.
  • Referring to FIG. 2, a flowchart of the steps of a speech recognition method according to the second embodiment of the present invention is shown.
  • Step S202 Acquire text data corresponding to the voice input data and text vectors corresponding to the text data.
  • the text vector corresponding to the text data includes a word vector corresponding to each word in the text data.
  • For text data in different language systems, the specific meaning of a "word" may differ. For example, for Chinese text data, a word may be a single character or a multi-character word; for text data in language systems such as English and French, one or more letters together form a complete word.
  • Optionally, this step can be implemented as: acquiring voice input data and generating text data corresponding to the voice input data; generating a word vector corresponding to each word in the text data; and generating, from the word vectors of the words, a text vector corresponding to the text data.
  • The specific implementation of generating the text data from the voice input data and generating the word vector for each word can be realized by those skilled in the art in any appropriate manner according to actual needs; the embodiments of the present invention do not limit this. Characterizing the text vector by the word vectors of the individual words not only facilitates processing of the text data but also effectively avoids excessive information loss caused by vectorization.
  • Step S204 Obtain the syntactic feature of the text vector corresponding to the text data.
  • In this embodiment, feature extraction is performed on the text vector corresponding to the text data to obtain the syntactic features of the text vector.
  • Optionally, this step can be implemented as: performing feature extraction on the word vector corresponding to each word in the text vector to obtain the syntactic features of each word.
  • the syntactic features extracted by the feature extraction method can more effectively characterize the characteristics of the words corresponding to each word vector.
  • Step S206 Obtain at least one text clause contained in the text data according to the syntax feature of the text vector, and obtain domain information of each text clause.
  • Optionally, according to the syntactic features of each word, a label for each word may be obtained, where the labels include an end label; the sequence labeling of the text data is obtained from the labels of the words; and at least one text clause contained in the text data is obtained according to the end labels in the sequence labeling. That is, the problem of dividing text clauses is converted into a problem of sequence labeling of the text data.
  • The types of labels can be set appropriately by those skilled in the art according to actual needs, but include at least the end label. If a word carries the end label, it means that all words from the beginning of the text data up to that word form a text clause, or that all words from the first word after the previous end label up to this end label form a text clause.
  • For example, the labels may include B labels (start labels, indicating that the current word begins a sentence), I labels (middle labels, indicating that the current word lies between the beginning and end of a sentence), and E labels (end labels, indicating that the current word ends a sentence). If the current text data includes multiple E labels, it contains multiple text clauses, which can be divided according to the E labels; if it includes only one E label, it contains only one text clause, namely the text data itself.
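  • The B/I/E scheme above can be sketched directly: every E label closes the clause that began after the previous E label. The English word split and the hand-written labels below are illustrative stand-ins for model output.

```python
def split_by_labels(words, labels):
    """Split a labelled word sequence into clauses at each E (end) label."""
    clauses, current = [], []
    for word, tag in zip(words, labels):
        current.append(word)
        if tag == "E":       # end label closes the current text clause
            clauses.append(current)
            current = []
    if current:              # defensive: trailing words without an E label
        clauses.append(current)
    return clauses

words = "why are you going home so late please turn on the bedroom light".split()
labels = ["B", "I", "I", "I", "I", "I", "E", "B", "I", "I", "I", "I", "E"]
clauses = split_by_labels(words, labels)
print([" ".join(c) for c in clauses])
# ['why are you going home so late', 'please turn on the bedroom light']
```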
  • In this way, the division of text clauses is more accurate; in addition, compared with other clause-division methods, it simplifies the operation steps and reduces the implementation cost of the division.
  • Optionally, the domain features corresponding to each text clause may be obtained according to the syntactic features of the text vector; for the domain features of each text clause, the maximum feature value in each feature dimension is extracted to generate the domain feature vector of the clause; and the domain information of the clause is then determined according to its domain feature vector. In this way, the most effective feature expression of each text clause is obtained, and the feature expressions of all clauses have the same vector length, which facilitates subsequent processing.
  • Optionally, obtaining the domain features of each text clause according to the syntactic features of the text vector may include: obtaining the domain features of the text vector according to its syntactic features, and then, according to the information of the words contained in each text clause, obtaining the domain features of that clause from the domain features of the text vector. That is, the total domain features are first obtained from the text vector of the entire text data, and the domain features of each clause are then taken from the total domain features according to the words in the clause. This not only keeps the domain features of each clause consistent with the total domain features, but also simplifies their acquisition.
  • Step S208: Recognize the voice command in the voice input data at least according to the domain information of each text clause.
  • In this embodiment, a text clause belonging to a voice instruction corresponds to set domain information. When the domain information of a text clause is consistent with the set domain information, the clause can be determined to be the text clause corresponding to a voice command.
  • For text clauses that do not correspond to voice instructions, other domain information may be set. The other domain information may be unified domain information indicating that the clause is not a voice instruction, or it may be subdivided to indicate the specific domain of the clause, such as the interaction domain.
  • Step S210 According to the recognized voice command, perform the operation indicated by the voice command on the smart voice device.
  • The operation can be any appropriate operation, such as instructing the smart voice device to turn a function on or off (for example, turning on the air conditioner or turning off the light) or to perform a query (for example, querying and playing a certain song, or querying and reporting the weather of a certain place). The embodiments of the present invention do not limit the specific operations indicated by the voice instructions.
  • the speech recognition solution provided by the embodiments of the present invention can be implemented in a variety of suitable ways.
  • some or all of the speech recognition solutions can be implemented through a neural network model.
  • the convolutional neural network CNN model is taken as an example to describe the above process of this embodiment.
  • The structure of such a CNN model is shown in FIG. 3; it includes an input part A, a feature extraction part B, a sentence boundary detection part C, and a domain classification part D.
  • the input part A may be the input layer of the CNN, and is used to receive the input text vector, such as the text vector of the text data corresponding to the voice input data.
  • The feature extraction part B contains multiple convolutional layers; optionally, at least 12 convolutional layers are set to improve the accuracy of feature extraction. A batch normalization layer and an activation layer may also be set together with each convolutional layer in part B, and residual processing may be applied to the output of each convolutional layer.
  • the sentence boundary detection part C may include a batch normalization layer, a convolution layer, and an output layer in sequence, wherein the output layer adopts a Softmax function as a loss function.
  • Through the sentence boundary detection part C, the label of the word vector corresponding to each word in the text vector can be obtained, and then the sequence labeling of the entire text data.
  • The division of the text clauses can then be determined according to the end labels in the sequence labeling, such as the E labels.
  • The domain classification part D can optionally include, in sequence, a batch normalization layer, a convolutional layer, a pooling layer, and an output layer, where the pooling layer adopts one-dimensional RoI pooling (1-D RoI pooling) and the output layer adopts the Softmax function as its loss function. According to the division result of the text clauses and the domain information of each clause, the domain classification part D can identify the text clause corresponding to the voice command.
  • The sentence boundary detection part C and the domain classification part D in the CNN model of this embodiment share the syntactic features extracted by the feature extraction part B, which improves the data processing efficiency of the CNN model and reduces its implementation cost.
  • the corresponding voice recognition process includes:
  • This part covers the conversion and processing of the data before the text vector is input into the CNN model. Take the voice "Why are you going home so late? Please turn on the bedroom light" as an example: the voice data is converted into text data, and each word in it is converted into a D-dimensional vector, where the specific value of D can be set appropriately by those skilled in the art according to actual needs. This yields N D-dimensional vectors, where N is the number of words; in this example there are 17 words, so N is 17. The N D-dimensional vectors constitute the text vector corresponding to the text data.
  • the N D-dimensional vectors generated above are received through the input layer of the CNN model.
  • It includes: performing a batch normalization operation on the input vectors to generate normalized vectors; performing non-linear processing on the normalized vectors; performing feature extraction on the non-linearly processed vectors through a convolutional layer to obtain initial features; performing residual processing on the initial features, and obtaining and outputting the syntactic features of the vector according to the residual processing result; and returning to the batch normalization step until the syntactic features of the text vector are obtained.
  • Optionally, the batch normalization operation is performed on the input vector through the batch normalization layer to generate a normalized vector. The input to the batch normalization layer of the first convolutional layer part is the text vector corresponding to the text data; the input to the batch normalization layer of each subsequent convolutional layer part is the output vector of the previous convolutional layer part. The normalized vector is then non-linearized through the activation layer.
  • That is, the text vector is first input into the first batch normalization layer, and batch normalization, non-linearization, feature extraction, and residual processing are performed on it through the batch normalization layer, activation layer, and convolutional layer to obtain syntactic features. The obtained syntactic features are then input into the next batch normalization layer, activation layer, and convolutional layer to be processed in turn, and so on, until the final syntactic features of the text vector are obtained.
  • The vector input to the batch normalization layer can be all the vectors after the residual processing of the previous convolutional layer part, such as the whole text vector or all the syntactic features, or it can be the vector of a single word, i.e. the word vector of each word in the text vector or the syntactic features corresponding to each word. Either way, the syntactic features of the finally obtained text vector include the syntactic features of each word.
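  • The per-block processing described above (batch normalization, activation, convolution, residual addition) can be sketched as follows. The 1x1-convolution stand-in and the random weights are illustrative assumptions; the point is the order of operations and the residual addition of the block input.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 6, 8                            # words and feature dimension (assumed)
x = rng.standard_normal((N, D))        # input vectors for the block
W = rng.standard_normal((D, D)) * 0.1  # 1x1 "convolution" weight stand-in

def block(x, W, eps=1e-5):
    # batch normalization over the word axis
    norm = (x - x.mean(0)) / np.sqrt(x.var(0) + eps)
    h = np.maximum(norm, 0)  # activation layer (ReLU)
    h = h @ W                # convolutional feature extraction
    return x + h             # residual processing: add back the block input

out = x
for _ in range(4):           # stacked blocks, as in feature extraction part B
    out = block(out, W)
print(out.shape)  # (6, 8): one syntactic feature vector per word
```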
  • It includes: performing a batch normalization operation on the syntactic features (optionally through the batch normalization layer of the sentence boundary detection part) to generate normalized syntactic features; performing feature extraction on the normalized syntactic features through the convolutional layer; and, through the output layer, determining the label of each word in the text data according to the feature extraction result and obtaining at least one text clause contained in the text data according to the labels of the words.
  • In this way, the sequence labeling of the text data is realized. For example, B indicates that the corresponding word is at the beginning of a text segment (i.e. B is the start label), E indicates that the corresponding word is at the end of a text segment (i.e. E is the end label), and I indicates that the corresponding word is in the middle of a text segment (i.e. I is the middle label).
  • the BIE probability distribution on each word in the text data can be obtained through the sentence boundary detection part.
  • For each word, the label corresponding to the maximum value of its BIE probability distribution is selected. For example, in the "Why are you going home so late? Please turn on the bedroom light" example, the label of the word "ah" (the particle ending the first clause) should be the E label. From the label of each word, the sequence labeling of the entire text data is obtained; then, according to the end labels in the sequence labeling, the sentence boundary of each text clause, and hence the range of each clause, is obtained.
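  • Decoding the boundaries from the softmax output can be sketched as an argmax over each word's B/I/E distribution; the probability rows below are hand-written stand-ins for real model output.

```python
import numpy as np

labels = ["B", "I", "E"]
# Illustrative per-word B/I/E probability rows (stand-ins for softmax output).
probs = np.array([
    [0.8, 0.1, 0.1],   # word 0 -> B
    [0.1, 0.8, 0.1],   # word 1 -> I
    [0.1, 0.1, 0.8],   # word 2 -> E  (first clause ends here)
    [0.7, 0.2, 0.1],   # word 3 -> B
    [0.1, 0.2, 0.7],   # word 4 -> E  (second clause ends here)
])
tags = [labels[i] for i in probs.argmax(axis=1)]
ends = [i for i, t in enumerate(tags) if t == "E"]
print(tags, ends)  # ['B', 'I', 'E', 'B', 'E'] [2, 4]
```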
  • the domain information of each text clause is obtained according to the syntactic characteristics of the text vector and the information of each text clause.
  • It includes: performing a batch normalization operation on the syntactic features of the text vector (optionally through the batch normalization layer of the domain classification part) to generate normalized syntactic features; performing feature mapping on the normalized syntactic features through the convolutional layer to obtain the domain features of the text vector; pooling the domain features of the text vector according to the information of each text clause through the pooling layer; and obtaining the domain information of each text clause according to the pooling result through the output layer.
  • the syntactic feature can be mapped to the domain feature C through the batch normalization layer and the convolution layer of the domain classification part.
  • The obtained domain feature C can be an N*D two-dimensional matrix, where N is the number of words contained in the text data (17 in this example) and D is the dimension of the domain feature vector of each word. The two-dimensional domain feature matrix S is the set of the two-dimensional domain feature matrices of the text clauses, and is also an N*D matrix: S = (m1, m2), where each m is a W*D two-dimensional matrix, W is the number of words in the current text clause, and D is the feature vector dimension described above. For the text clause "Why are you going home so late?", the corresponding two-dimensional domain feature matrix m1 is a 10*D matrix; for the text clause "please turn on the bedroom light", the corresponding matrix m2 is a 7*D matrix.
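  • The pooling step can be sketched as slicing the N*D domain feature matrix C into m1 (10*D) and m2 (7*D) at the detected clause boundary, then max-pooling each clause over its words, which is the effect of the 1-D RoI pooling here. The random feature values and D=4 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 17, 4                         # 17 words in the example; D is assumed
C = rng.standard_normal((N, D))      # domain features: one row per word
spans = [(0, 10), (10, 17)]          # clause ranges from boundary detection

S = [C[a:b] for a, b in spans]       # S = (m1, m2): per-clause matrices
pooled = [m.max(axis=0) for m in S]  # max over words in each feature dimension
print([m.shape for m in S], [p.shape for p in pooled])
# [(10, 4), (7, 4)] [(4,), (4,)]
```

Each clause thus yields a fixed-length D-dimensional vector for the output layer, regardless of clause length.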
  • Based on the domain information of each text clause, the voice command can be determined. For example, when the domain information of a text clause is IOT (Internet of Things), the part of the voice input data corresponding to that clause can be considered a voice command.
  • In this embodiment, the text data converted from the voice input data and the corresponding text vector are first obtained; the syntactic features are then obtained by feature extraction on the text vector; according to the syntactic features, the text data is divided into text clauses and the domain information of each clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. In this way, the smart voice device is better suited to real use environments: the user does not need a wake-up word, and whether the voice input data contains pure voice commands or a mixture of voice commands and other voice data, the device can effectively divide it into clauses, recognize the voice commands contained therein, and subsequently be operated and controlled by them.
  • the voice recognition method of this embodiment can be executed by any appropriate smart voice device with data processing capabilities, such as various smart home appliances with corresponding functions.
  • Referring to FIG. 4, a structural block diagram of a speech recognition apparatus according to the third embodiment of the present invention is shown.
  • The speech recognition apparatus of this embodiment includes: a first acquisition module 302, configured to obtain text data corresponding to voice input data and a text vector corresponding to the text data; a second acquisition module 304, configured to obtain the syntactic features of the text vector; a third acquisition module 306, configured to obtain, according to the syntactic features, at least one text clause contained in the text data and the domain information of each text clause; and a recognition module 308, configured to recognize, at least according to the domain information of each text clause, the voice command in the voice input data.
  • The text data converted from the voice input data, and the text vector corresponding to that text data, are first obtained; the corresponding syntactic features are then obtained by feature extraction on the text vector; next, the text data corresponding to the voice input data is divided into text clauses according to the syntactic features, and the domain information of each text clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses.
  • This makes the smart voice device better suited to real usage environments: the user does not need to use a wake-up word to wake the device, and whether the voice input data contains pure voice commands or a mixture of voice commands and other speech, the device can effectively divide the voice input data into clauses and recognize the voice commands contained therein.
  • The smart voice device can then be operated and controlled through the recognized voice commands.
  • Referring to FIG. 5, there is shown a structural block diagram of a speech recognition device according to the fourth embodiment of the present invention.
  • The voice recognition device of this embodiment includes: a first obtaining module 402, configured to obtain text data corresponding to voice input data and a text vector corresponding to the text data; a second obtaining module 404, configured to obtain the syntactic features of the text vector; a third obtaining module 406, configured to obtain, according to the syntactic features, at least one text clause contained in the text data, and to obtain the domain information of each text clause; and a recognition module 408, configured to recognize the voice command in the voice input data at least according to the domain information of each text clause.
  • The first acquisition module 402 is configured to acquire the voice input data and generate the text data corresponding to the voice input data; to generate a word vector corresponding to each word in the text data; and to generate the text vector corresponding to the text data according to the word vectors of the individual words.
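The word-vector step above can be sketched as a simple embedding lookup. The embedding table, its dimensionality, and the zero-vector fallback for unseen words are illustrative assumptions; the patent does not prescribe a particular embedding method.

```python
# Minimal sketch of the first acquisition module's vectorization step.
# Hypothetical embedding table; real systems would use trained embeddings.
import random

EMBED_DIM = 8

def build_embeddings(vocab, dim=EMBED_DIM, seed=0):
    """Assign each word a fixed random vector (stand-in for a trained embedding)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}

def text_to_vectors(words, embeddings, dim=EMBED_DIM):
    """Map each word to its word vector; unseen words get a zero vector.
    The text vector is the sequence of word vectors (a len(words) x dim matrix)."""
    zero = [0.0] * dim
    return [embeddings.get(w, zero) for w in words]

words = ["please", "turn", "on", "the", "bedroom", "lights"]
emb = build_embeddings(set(words))
text_vector = text_to_vectors(words, emb)
```

The resulting `text_vector` is the per-word matrix that the feature extraction part consumes in the later modules.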
  • the second obtaining module 404 is configured to perform feature extraction on the text vector to obtain the syntactic feature of the text vector.
  • the second acquisition module 404 is configured to perform feature extraction on the word vector corresponding to each word in the text vector to acquire the syntactic feature of each word.
  • The third obtaining module 406 includes: a clause obtaining module 4062, configured to obtain a tag for each word according to the syntactic feature of that word, where the tags include an end tag; to obtain the sequence tags of the text data according to the tags of the individual words; and to obtain the at least one text clause contained in the text data according to the end tags in the sequence tags; and a domain obtaining module 4064, configured to obtain the domain information of each text clause according to the syntactic features.
  • The domain obtaining module 4064 includes: a domain feature module 40642, configured to obtain the domain feature corresponding to each text clause according to the syntactic features of the text vector; and a determining module 40644, configured to extract, from the domain features of each text clause, the maximum feature value in each feature dimension to generate the domain feature vector of that text clause, and to determine the domain information of the current text clause according to its domain feature vector.
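The max-over-dimensions step of the determining module 40644 can be illustrated as follows. The 3-dimensional features, the domain names, and the linear scoring weights are hypothetical stand-ins; the patent only specifies taking the maximum feature value in each dimension and deciding the domain from the resulting vector.

```python
def max_pool_domain_feature(clause_features):
    """Per feature dimension, keep the maximum value across the clause's words,
    yielding one fixed-length domain feature vector per clause (module 40644)."""
    dims = len(clause_features[0])
    return [max(word[d] for word in clause_features) for d in range(dims)]

def classify_domain(domain_vector, domain_weights):
    """Pick the domain whose (hypothetical) weight vector scores highest."""
    def score(w):
        return sum(a * b for a, b in zip(w, domain_vector))
    return max(domain_weights, key=lambda name: score(domain_weights[name]))

# Illustrative 3-dimensional features for a 2-word clause.
clause = [[0.1, 0.9, 0.2], [0.4, 0.3, 0.8]]
vec = max_pool_domain_feature(clause)   # max per dimension -> [0.4, 0.9, 0.8]
weights = {"smart_home": [1.0, 0.0, 1.0], "chitchat": [0.0, 1.0, 0.0]}
domain = classify_domain(vec, weights)  # "smart_home"
```

Max pooling makes the clause representation independent of clause length, which is why a single fixed-size classifier can handle clauses of any length.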
  • The domain feature module 40642 is configured to obtain the domain features of the text vector according to its syntactic features, and then, according to the information of the words contained in each text clause, to obtain the domain features corresponding to each text clause from the domain features of the text vector.
  • The second acquisition module 404 is configured to perform feature extraction on the text vector through the feature extraction part of a convolutional neural network model to obtain the syntactic features of the text vector; the third acquisition module 406 is configured to obtain, through the sentence boundary detection part of the convolutional neural network model, at least one text clause contained in the text data according to the syntactic features, and to obtain, through the domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic features and the information of each text clause; wherein the sentence boundary detection part and the domain classification part share the syntactic features extracted by the feature extraction part.
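A minimal sketch of how the two parts can share one feature-extraction pass. All three functions below are stand-ins invented for illustration, not the patent's trained networks; the point is only the dataflow: the encoder runs once, and both the sentence boundary part and the domain part read its output.

```python
def encode(text_vector):
    """Stand-in feature extraction part: a pass-through tagged as encoded."""
    return {"features": text_vector, "encoded": True}

def boundary_head(shared):
    """Stand-in sentence boundary detection part: tags the last word 'E'(nd)."""
    n = len(shared["features"])
    return ["O"] * (n - 1) + ["E"]

def domain_head(shared, clause_spans):
    """Stand-in domain classification part: one label per clause span."""
    return ["unknown" for _ in clause_spans]

text_vector = [[0.1], [0.2], [0.3]]
shared = encode(text_vector)             # extracted once...
tags = boundary_head(shared)             # ...consumed by the boundary part
domains = domain_head(shared, [(0, 2)])  # ...and by the domain part
```

Sharing the encoder means the most expensive computation is amortized across both tasks, which is the stated benefit of the shared syntactic features.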
  • The second acquisition module 404 is configured to perform a batch normalization operation on the input vector to generate a normalized vector; perform non-linear processing on the normalized vector; perform feature extraction on the non-linearly processed vector through a convolutional layer to obtain initial features; perform residual analysis processing on the initial features, and obtain and output the syntactic features of the vector according to the residual analysis result; and return to the step of performing the batch normalization operation on the input vector, continuing until the syntactic features of the text vector are obtained.
  • the feature extraction part includes at least 12 convolutional layers, and the normalized vector is non-linearized through a linear gate function.
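Under the stated assumptions, one layer of this feature extraction part can be sketched in plain Python as batch normalization, a gated linear unit (one reading of the patent's "linear gate function"), a kernel-size-3 convolution, and a residual connection. The weights are random placeholders, and only 2 layers are stacked for brevity versus the at least 12 convolutional layers described above.

```python
import math, random

def batch_norm(seq, eps=1e-5):
    """Normalize each feature dimension to zero mean / unit variance across positions."""
    dims = len(seq[0])
    out = [[0.0] * dims for _ in seq]
    for d in range(dims):
        col = [v[d] for v in seq]
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        for t, x in enumerate(col):
            out[t][d] = (x - mean) / math.sqrt(var + eps)
    return out

def glu(seq):
    """Gated linear unit: split each vector in half, gate one half with the other."""
    half = len(seq[0]) // 2
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [[v[i] * sig(v[half + i]) for i in range(half)] for v in seq]

def conv1d(seq, weights):
    """Kernel-size-3 'same' convolution; weights[j][k][i] maps in-dim i -> out-dim j."""
    out_dim, in_dim = len(weights), len(seq[0])
    padded = [[0.0] * in_dim] + seq + [[0.0] * in_dim]
    out = []
    for t in range(len(seq)):
        window = padded[t:t + 3]
        out.append([sum(weights[j][k][i] * window[k][i]
                        for k in range(3) for i in range(in_dim))
                    for j in range(out_dim)])
    return out

def feature_extraction(text_vector, n_layers=2, seed=0):
    """Stack of BN -> GLU -> conv -> residual layers (the patent uses >= 12)."""
    rng = random.Random(seed)
    dim = len(text_vector[0])
    x = text_vector
    for _ in range(n_layers):
        w = [[[rng.uniform(-0.1, 0.1) for _ in range(dim // 2)]
              for _ in range(3)] for _ in range(dim)]
        h = conv1d(glu(batch_norm(x)), w)
        x = [[a + b for a, b in zip(xi, hi)] for xi, hi in zip(x, h)]  # residual
    return x

feats = feature_extraction([[0.1, 0.2, 0.3, 0.4],
                            [0.5, 0.6, 0.7, 0.8],
                            [0.9, 1.0, 1.1, 1.2]])
```

The residual connection keeps the output dimensionality equal to the input's, which is what allows the layer to be repeated until the final syntactic features emerge.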
  • The third acquiring module 406 acquires at least one text clause contained in the text data according to the syntactic features through the sentence boundary detection part of the convolutional neural network model by: performing a batch normalization operation on the syntactic features to generate normalized syntactic features; performing feature extraction on the normalized syntactic features through a convolutional layer; and determining, through the output layer, the tag of each word in the text data according to the feature extraction result, acquiring at least one text clause contained in the text data according to the tags of the individual words.
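The end-tag-based clause splitting can be illustrated with a toy tag inventory. The "O"/"E" labels below are an assumption made for illustration; the patent only specifies that the per-word tags include an end tag marking clause boundaries.

```python
def split_clauses(words, tags):
    """Split the word sequence wherever the sequence label carries an end tag 'E'
    (the end tag in the patent marks a clause boundary)."""
    clauses, current = [], []
    for word, tag in zip(words, tags):
        current.append(word)
        if tag == "E":
            clauses.append(current)
            current = []
    if current:  # trailing words without an explicit end tag
        clauses.append(current)
    return clauses

words = ["why", "so", "late", "please", "turn", "on", "the", "lights"]
tags  = ["O",  "O",  "E",    "O",      "O",    "O",  "O",   "E"]
clauses = split_clauses(words, tags)
# clauses == [["why", "so", "late"], ["please", "turn", "on", "the", "lights"]]
```

Because the boundary model emits one tag per word, the splitter needs no punctuation or pauses in the input: the tags alone determine where one clause ends and the next begins.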
  • The third obtaining module 406 obtains the domain information of each text clause in the domain classification part of the convolutional neural network model, according to the syntactic features and the information of each text clause, by: performing a batch normalization operation on the syntactic features to generate normalized syntactic features; performing feature mapping on the normalized syntactic features through a convolutional layer to obtain the domain features of the text vector; pooling the domain features of the text vector through a pooling layer according to the information of each text clause; and obtaining the domain information of each text clause through the output layer according to the pooling result.
  • the voice recognition device in this embodiment is used to implement the corresponding voice recognition methods in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • the function implementation of each module in the speech recognition device of this embodiment can refer to the description of the corresponding part in the foregoing method embodiment, and it will not be repeated here.
  • Referring to FIG. 6, there is shown a schematic structural diagram of a smart device according to the fifth embodiment of the present invention.
  • the specific embodiment of the present invention does not limit the specific implementation of the smart device.
  • the smart device may include: a processor (processor) 502, a communication interface (Communications Interface) 504, a memory (memory) 506, and a communication bus 508.
  • the processor 502, the communication interface 504, and the memory 506 communicate with each other through the communication bus 508.
  • the communication interface 504 is used to communicate with other electronic devices such as other smart devices or servers.
  • the processor 502 is configured to execute the program 510, and specifically can execute relevant steps in the above-mentioned voice recognition method embodiment.
  • the program 510 may include program code, and the program code includes computer operation instructions.
  • The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
  • the one or more processors included in the smart device may be the same type of processor, such as one or more CPUs, or different types of processors, such as one or more CPUs and one or more ASICs.
  • the memory 506 is used to store the program 510.
  • The memory 506 may include high-speed RAM, and may also include non-volatile memory, for example, at least one disk memory.
  • The program 510 can specifically be used to cause the processor 502 to perform the following operations: obtain the text data corresponding to the voice input data and the text vector corresponding to the text data; obtain the syntactic features of the text vector; obtain, according to the syntactic features, at least one text clause contained in the text data, and obtain the domain information of each text clause; and recognize the voice command in the voice input data at least according to the domain information of each text clause.
  • The program 510 is further configured to enable the processor 502, when obtaining the text data corresponding to the voice input data and the text vector corresponding to the text data, to: obtain the voice input data and generate the text data corresponding to the voice input data; generate a word vector corresponding to each word in the text data; and generate the text vector corresponding to the text data according to the word vectors of the individual words.
  • the program 510 is further configured to enable the processor 502 to perform feature extraction on the text vector when obtaining the syntactic feature of the text vector to obtain the syntactic feature of the text vector.
  • The program 510 is further configured to enable the processor 502, when performing feature extraction on the text vector to obtain its syntactic features, to perform feature extraction on the word vector corresponding to each word in the text vector to obtain the syntactic feature of each word.
  • The program 510 is further configured to cause the processor 502, when obtaining at least one text clause contained in the text data according to the syntactic features, to: obtain the tag of each word according to the syntactic feature of that word, where the tags include an end tag; obtain the sequence tags of the text data according to the tags of the individual words; and obtain at least one text clause contained in the text data according to the end tags in the sequence tags.
  • The program 510 is further configured to enable the processor 502, when obtaining the domain information of each text clause, to: obtain the domain features corresponding to each text clause according to the syntactic features of the text vector; extract, from the domain features of each text clause, the maximum feature value in each feature dimension to generate the domain feature vector of that text clause; and determine the domain information of the current text clause according to its domain feature vector.
  • The program 510 is further configured to cause the processor 502, when obtaining the domain feature corresponding to each text clause according to the syntactic features of the text vector, to: obtain the domain features of the text vector according to its syntactic features; and obtain the domain features corresponding to each text clause from the domain features of the text vector according to the information of the words contained in each text clause.
  • The program 510 is further configured to enable the processor 502 to perform feature extraction on the text vector through the feature extraction part of a convolutional neural network model to obtain the syntactic features of the text vector; to obtain, through the sentence boundary detection part of the model, at least one text clause contained in the text data according to the syntactic features; and to obtain, through the domain classification part of the model, the domain information of each text clause according to the syntactic features and the information of each text clause; wherein the sentence boundary detection part and the domain classification part share the syntactic features extracted by the feature extraction part.
  • The program 510 is further configured to enable the processor 502, when performing feature extraction on the text vector through the feature extraction part of the convolutional neural network model to obtain its syntactic features, to: perform a batch normalization operation on the input vector to generate a normalized vector; perform non-linear processing on the normalized vector; perform feature extraction on the non-linearly processed vector through a convolutional layer to obtain initial features; perform residual analysis processing on the initial features, and obtain and output the syntactic features of the vector according to the residual analysis result; and return to the step of performing the batch normalization operation on the input vector until the syntactic features of the text vector are obtained.
  • the feature extraction part includes at least 12 convolutional layers; the normalized vector is non-linearized through a linear gate function.
  • The program 510 is further configured to enable the processor 502, when obtaining at least one text clause contained in the text data according to the syntactic features in the sentence boundary detection part of the convolutional neural network model, to: perform a batch normalization operation on the syntactic features to generate normalized syntactic features; perform feature extraction on the normalized syntactic features through a convolutional layer; and determine, through the output layer, the tag of each word in the text data according to the feature extraction result, obtaining at least one text clause contained in the text data according to the tags of the individual words.
  • The program 510 is further configured to cause the processor 502, when obtaining the domain information of each text clause according to the syntactic features and the information of each text clause in the domain classification part of the convolutional neural network model, to: perform a batch normalization operation on the syntactic features to generate normalized syntactic features; perform feature mapping on the normalized syntactic features through a convolutional layer to obtain the domain features of the text vector; pool the domain features of the text vector through the pooling layer according to the information of each text clause; and obtain the domain information of each text clause through the output layer according to the pooling result.
  • The smart device of this embodiment may further include a microphone to receive the analog voice signal input by the user and convert it into a digital voice signal, that is, the voice input data; the program 510 may also be used to cause the processor 502 to convert the voice input data into the corresponding text data. However, the embodiment is not limited to this.
  • the microphone can also be set independently of the smart device, and connected to the smart device through an appropriate connection mode, and send the voice input data to the processor.
  • The text data converted from the voice input data, and the text vector corresponding to that text data, are first obtained; the corresponding syntactic features are then obtained by feature extraction on the text vector; next, the text data corresponding to the voice input data is divided into text clauses according to the syntactic features, and the domain information of each text clause is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses.
  • This makes the smart voice device better suited to real usage environments: the user does not need to use a wake-up word to wake the device, and whether the voice input data contains pure voice commands or a mixture of voice commands and other speech, the device can effectively divide the voice input data into clauses and recognize the voice commands contained therein.
  • The smart voice device can then be operated and controlled through the recognized voice commands.
  • each component/step described in the embodiment of the present invention can be split into more components/steps, or two or more components/steps or partial operations of components/steps can be combined into New components/steps to achieve the purpose of the embodiments of the present invention.
  • The above methods according to the embodiments of the present invention can be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the methods described here can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA).
  • A computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the voice recognition method described here is implemented.
  • When a general-purpose computer accesses code for implementing the voice recognition method shown here, execution of that code converts the general-purpose computer into a dedicated computer for executing the voice recognition method shown here.

Abstract

A speech recognition method and apparatus. The method comprises: obtaining text data corresponding to speech input data and a text vector corresponding to the text data (S102); obtaining syntactic features of the text vector (S104); obtaining, according to the syntactic features, at least one text clause comprised in the text data, and obtaining domain information of each text clause (S106); and recognizing a speech instruction in the speech input data at least according to the domain information of each text clause (S108). According to the method, the user's operating burden is reduced, and the degree to which an intelligent speech device can intelligently process user speech instructions is improved.

Description

Speech recognition method and apparatus
This application claims priority to Chinese patent application No. 201910047340.2, entitled "Speech Recognition Method and Apparatus", filed on January 18, 2019, the entire content of which is incorporated into this application by reference.
Technical Field
The embodiments of the present invention relate to the field of computer technology, and in particular to a speech recognition method and apparatus.
Background
Intelligent devices are the product of combining traditional electrical equipment with computer technology, data processing technology, control technology, sensor technology, network communication technology, power electronics technology, and so on. Among the various intelligent devices, smart voice devices are an important branch.
Through smart voice devices, users can control various smart devices by voice alone, including the smart voice device itself and the other smart devices it controls. At present, in the interaction between the user and a smart voice device, every control of the device requires a wake-up word, immediately followed by a voice command to carry out the user's intention. For example: "Tmall Genie, turn on the light", "Tmall Genie, play music", and so on. In such interactions, the user must use the wake-up word "Tmall Genie" every time to wake the smart voice device before the corresponding operation and control can be performed. In the utterance "Why are you going home so late? Please turn on the bedroom lights", "Why are you going home so late?" is an interaction between users, while "Please turn on the bedroom lights" is a control instruction for the smart voice device. Current smart voice devices cannot process such complex mixed instructions without a wake-up word.
However, with this way of waking the smart voice device by a wake-up word, on the one hand, every user instruction must include the wake-up word, which increases the user's operational burden and lowers the degree to which the smart voice device can intelligently process the user's voice commands; on the other hand, the smart voice device has to process the wake-up word repeatedly, which also increases its processing burden.
Summary of the Invention
In view of this, the embodiments of the present invention provide a speech recognition solution to solve the above problems.
According to a first aspect of the embodiments of the present invention, there is provided a speech recognition method, including: obtaining text data corresponding to voice input data and a text vector corresponding to the text data; obtaining the syntactic features of the text vector; obtaining, according to the syntactic features, at least one text clause contained in the text data, and obtaining the domain information of each text clause; and recognizing the voice command in the voice input data at least according to the domain information of each text clause.
According to a second aspect of the embodiments of the present invention, there is provided a speech recognition apparatus, including: a first obtaining module, configured to obtain text data corresponding to voice input data and a text vector corresponding to the text data; a second obtaining module, configured to obtain the syntactic features of the text vector; a third obtaining module, configured to obtain, according to the syntactic features, at least one text clause contained in the text data, and to obtain the domain information of each text clause; and a recognition module, configured to recognize the voice command in the voice input data at least according to the domain information of each text clause.
According to a third aspect of the embodiments of the present invention, there is provided a smart device, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the speech recognition method described in the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer storage medium on which a computer program is stored; when the program is executed by a processor, the speech recognition method described in the first aspect is implemented.
According to the speech recognition solution provided by the embodiments of the present invention, the text data converted from the voice input data and the text vector corresponding to the text data are first obtained; the corresponding syntactic features are then obtained by feature extraction on the text vector; next, the text data corresponding to the voice input data is divided into text clauses according to the syntactic features, and the domain information of the text clauses is determined; finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. Through the solution of the embodiments of the present invention, the smart voice device is better suited to real usage environments: the user no longer needs a wake-up word to wake the device, and whether the voice input data contains pure voice commands or a mixture of voice commands and other speech, the device can effectively divide the voice input data into clauses and recognize the voice commands contained therein, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the degree to which the smart voice device can intelligently process the user's voice commands is improved; moreover, the smart voice device no longer needs to process wake-up words, which reduces its data processing burden.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description cover only some of the embodiments described herein; for those of ordinary skill in the art, other drawings can also be obtained from these drawings.
Fig. 1 is a flowchart of the steps of a speech recognition method according to the first embodiment of the present invention;
Fig. 2 is a flowchart of the steps of a speech recognition method according to the second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a neural network model in the embodiment shown in Fig. 2;
Fig. 4 is a structural block diagram of a speech recognition apparatus according to the third embodiment of the present invention;
Fig. 5 is a structural block diagram of a speech recognition apparatus according to the fourth embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a smart device according to the fifth embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art shall fall within the protection scope of the embodiments of the present invention.
The specific implementation of the embodiments of the present invention is further described below in conjunction with the accompanying drawings.
Embodiment 1
Referring to Fig. 1, there is shown a flowchart of the steps of a speech recognition method according to the first embodiment of the present invention.
The speech recognition method of this embodiment includes the following steps:
Step S102: Acquire text data corresponding to the voice input data and a text vector corresponding to the text data.
In the usage scenario of a smart voice device, the user can operate and control the device by voice; the smart voice device takes the user's speech as input to generate corresponding voice input data, converts the voice input data into corresponding text data, and then performs the corresponding processing. In this embodiment, in addition to converting the voice input data into text data, the text vector corresponding to the text data is also obtained, so as to represent the text data in vector form and facilitate subsequent processing. The specific implementations of converting the voice input data into the corresponding text data and of obtaining the text vector corresponding to the text data can both be realized by those skilled in the art in any appropriate manner according to actual needs, which is not limited by the embodiments of the present invention.
For example, a convolutional neural network model, a BP neural network model, a hidden Markov model (HMM), multi-band spectral subtraction, or the like can be used to convert voice input data into text data; and, for example, the text vector corresponding to the text data can be obtained in a deep-learning-based way (such as word2vec), a graph-based way (such as TextRank), a topic-model-based way (such as LDA), or a statistics-based way (such as bag of words).
Step S104: Acquire syntactic features of the text vector corresponding to the text data.
In the embodiments of the present invention, the syntactic features of the text vector can characterize the dependency relations and semantic information among the words of the text data to which the text vector corresponds, and the syntactic features may be expressed as syntactic feature vectors. In a specific implementation, feature extraction may be performed on the text vector by a convolutional neural network (CNN) model or a recurrent neural network (RNN) model to obtain the syntactic features of the text vector. The invention is not limited thereto; in practical applications, those skilled in the art may also obtain the syntactic features of the text vector in other appropriate ways, such as text classification.
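A 1-D convolution over the word-vector sequence is one way such per-word features can be produced, since each output position mixes information from neighbouring words. The following pure-Python sketch uses toy one-dimensional "word vectors" and an illustrative fixed kernel rather than the embodiment's trained weights:

```python
def conv1d_same(seq, kernel):
    """1-D convolution with zero padding so the output has one value per input position."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(seq))]

# Toy per-word values and a 3-tap kernel (illustrative numbers only)
word_values = [1.0, 2.0, 3.0, 4.0]
features = conv1d_same(word_values, [0.5, 1.0, 0.5])
# One feature value per word, each combining the word with its neighbours
```

A real CNN would apply many such kernels to D-dimensional word vectors and learn the kernel weights during training; this sketch only shows the sliding-window structure.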
Step S106: Acquire, according to the syntactic features, at least one text clause contained in the text data, and acquire domain information of each text clause.
In the embodiments of the present invention, the text data corresponding to the speech input data contains one or more text clauses. When a single text clause is contained, that clause may be a text clause corresponding to a voice command, or a text clause corresponding to other speech data. When multiple text clauses are contained, they may all be text clauses corresponding to voice commands; they may all be text clauses corresponding to other speech data, such as sentences spoken by the user that are unrelated to voice commands; or they may be a mixture of text clauses corresponding to voice commands and text clauses corresponding to other speech data. For example, in a complex multi-person scenario, user A, while talking with user B, may issue a voice command to the smart voice device, such as "Why are you coming home so late? Please turn on the bedroom light". Here the first half, "Why are you coming home so late?", would be recognized as a text clause corresponding to other speech data, while the second half, "Please turn on the bedroom light", would be recognized as the text clause corresponding to a voice command.
In practical applications, after the syntactic features corresponding to the text vector have been obtained, the one or more text clauses in the text data can be determined from those syntactic features. The manner of obtaining the text clauses may match the manner of obtaining the syntactic features; for example, when a CNN or RNN model is used to obtain the syntactic features of the text vector, sequence labeling may be performed on the text data according to the syntactic features, and the one or more text clauses obtained from the result of the sequence labeling.
In addition, in the embodiments of the present invention, the domain information of each text clause is also obtained according to the syntactic features of the text vector, for example by a machine learning algorithm or a neural network model that derives the domain information of the corresponding text clause from the syntactic features of the text vector, where the domain information includes information on the domain to which a voice command corresponds.
It should be noted that in the embodiments of the present invention, unless otherwise specified, quantities such as "multiple" and "a plurality of" mean two or more.
Step S108: Recognize a voice command in the speech input data at least according to the domain information of each text clause.
When the speech input data includes a voice command, among the one or more text clauses contained in the corresponding text data there should be a text clause whose domain information indicates that the portion of the speech input data corresponding to that clause is a voice command; on this basis, the voice command can be recognized from the speech input data.
For example, for the speech input data "Why are you coming home so late? Please turn on the bedroom light", the above processing of the corresponding text data and of the text vector corresponding to that text data determines that the text clause "Please turn on the bedroom light" is a voice command.
With this embodiment, the text data converted from the speech input data and the text vector corresponding to that text data are obtained first; the corresponding syntactic features are then obtained by feature extraction on the text vector; next, according to the syntactic features, the text data corresponding to the speech input data is divided into text clauses and the domain information of each text clause is determined; and the voice command in the speech input data is then recognized according to the domain information of the text clauses. It can be seen that this embodiment makes the smart voice device better suited to real usage environments: the user no longer needs a wake-up word to wake the device. Whether the user's speech input data consists purely of voice commands or mixes voice commands with other speech data, the speech input data can be effectively divided into clauses and the voice commands contained therein recognized, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the device handles the user's voice commands more intelligently; moreover, the smart voice device no longer needs to process a wake-up word, which lightens its data-processing burden.
The speech recognition method of this embodiment can be executed by any appropriate smart voice device with data-processing capability, such as various smart home appliances with corresponding functions.
Embodiment 2
Referring to Fig. 2, a flowchart of the steps of a speech recognition method according to Embodiment 2 of the present invention is shown.
The speech recognition method of this embodiment includes the following steps:
Step S202: Acquire text data corresponding to the speech input data, and a text vector corresponding to the text data.
In this embodiment, the text vector corresponding to the text data includes a word vector for each word in the text data. The specific meaning of "word" may differ with the language used: for text data in languages such as Chinese, Japanese, or Korean, a word unit may be a single character or a multi-character word, whereas for text data in languages such as English or French, a word unit is usually a complete word.
On this basis, in a feasible manner, this step may be implemented as: acquiring speech input data and generating the text data corresponding to the speech input data; generating a word vector for each word in the text data; and generating the text vector corresponding to the text data from the word vectors of the individual words. The specific implementations of generating the corresponding text data from the speech input data, and of generating the word vector for each word in the text data, may be chosen by those skilled in the art in any appropriate manner according to actual needs; the embodiments of the present invention impose no limitation in this regard. Representing the text vector of the text data by the word vectors of its individual words both facilitates processing of the text data and effectively avoids excessive information loss caused by the vectorization.
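One common realization of "a word vector for each word" is a lookup table from words to pre-trained vectors. The following sketch assumes a hypothetical 2-dimensional embedding table purely for illustration; the table contents, the handling of out-of-vocabulary words, and the dimension are all assumptions:

```python
def text_vector(words, embeddings, dim):
    """Build the text vector as a list of per-word vectors.
    Unknown words map to a zero vector of the same dimension."""
    zero = [0.0] * dim
    return [embeddings.get(w, zero) for w in words]

# Hypothetical 2-dimensional embedding table
emb = {"turn": [0.1, 0.9], "on": [0.4, 0.2], "light": [0.8, 0.3]}
tv = text_vector(["turn", "on", "the", "light"], emb, 2)
# N words -> N D-dimensional vectors; "the" is out of vocabulary here
```

Keeping one vector per word, rather than collapsing the whole sentence into a single vector, is what preserves the per-word information that the later per-word labeling step depends on.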
Step S204: Acquire syntactic features of the text vector corresponding to the text data.
As described in Embodiment 1, the syntactic features of the text vector can be obtained in many ways. This embodiment adopts feature extraction; that is, feature extraction is performed on the text vector corresponding to the text data to obtain the syntactic features of the text vector.
Where the text vector includes a word vector for each word, this step may be implemented as: performing feature extraction on the word vector of each word in the text vector to obtain the syntactic features of each word. Syntactic features extracted in this way can more effectively characterize the properties of the word to which each word vector corresponds.
Step S206: Acquire, according to the syntactic features of the text vector, at least one text clause contained in the text data, and acquire the domain information of each text clause.
When the at least one text clause contained in the text data is obtained according to the syntactic features of the text vector, based on the previously obtained syntactic features of each word, a label may be obtained for each word according to its syntactic features, the labels including an end label; the sequence labeling of the text data is obtained from the labels of the individual words; and the at least one text clause contained in the text data is obtained from the end labels in the sequence labeling. That is, the problem of dividing the text into clauses can be converted into a sequence-labeling problem over the text data. The label types may be set appropriately by those skilled in the art according to actual needs, but include at least an end label. If a word is marked with an end label, then all words from the beginning of the text data up to that word form one text clause, or all words from the first word after the previous end label up to the word bearing the current end label form one text clause.
Optionally, the labels may include a B label (a begin label, indicating that the current word is the start of a sentence), an I label (an inside label, indicating that the current word lies between the start and the end of a sentence), and an E label (an end label, indicating that the current word is the end of a sentence). If the current text data contains multiple E labels, it contains multiple clauses, and the text clauses can be divided according to the E labels; if the current text data contains only one E label, it contains only one text clause, namely the current text data itself.
Labeling the words to form a sequence labeling of the text data, and then obtaining the text clauses according to the end labels in the sequence labeling, makes the division into text clauses more accurate; compared with other ways of dividing text clauses, it also simplifies the operational steps of the division and reduces the cost of implementing it.
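The clause division by end labels described above can be sketched directly: walk the label sequence and close a clause at every E label. The tokens and tags below are illustrative placeholders following the optional B/I/E scheme:

```python
def split_by_end_tags(tokens, tags):
    """Split a token sequence into clauses, closing a clause at every E (end) label."""
    clauses, current = [], []
    for token, tag in zip(tokens, tags):
        current.append(token)
        if tag == "E":
            clauses.append(current)
            current = []
    if current:            # trailing tokens without a final E label
        clauses.append(current)
    return clauses

tokens = list("ABCDEFG")
tags = ["B", "I", "E", "B", "I", "I", "E"]
clauses = split_by_end_tags(tokens, tags)
# -> [["A", "B", "C"], ["D", "E", "F", "G"]]
```

Note that only the E labels actually drive the split here, which matches the statement that the label set must include at least an end label.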
When the domain information of each text clause is obtained according to the syntactic features of the text vector, the domain features corresponding to each text clause may be obtained from the syntactic features of the text vector; for the domain features of each text clause, the maximum feature value is extracted in each feature dimension to generate a domain feature vector for that clause; and the domain information of the current text clause is determined from its domain feature vector. In this way, the most effective feature expression of each text clause is obtained, and the feature expressions of all text clauses have the same vector length, which facilitates subsequent processing.
In a feasible manner, obtaining the domain features corresponding to each text clause according to the syntactic features of the text vector may include: obtaining the domain features of the text vector according to its syntactic features; and, according to the information on the words contained in each text clause, obtaining from the domain features of the text vector the domain features corresponding to each text clause. That is, the overall domain features are first obtained from the text vector corresponding to the entire text data, and the domain features of each text clause are then obtained from the overall domain features according to the information on the words in that clause. This both guarantees the consistency of each clause's domain features with the overall domain features and simplifies obtaining the clause-level domain features.
Step S208: Recognize a voice command in the speech input data at least according to the domain information of each text clause.
In the embodiments of the present invention, a text clause belonging to a voice command corresponds to set domain information; when the domain information of a text clause matches the set domain information, that text clause can be determined to be the text clause corresponding to a voice command. Optionally, other domain information may also be set; such other domain information may be unified domain information indicating that a text clause is not a voice command, or it may be further subdivided to indicate the specific domain of the text clause, such as an interaction domain, and so on.
Step S210: According to the recognized voice command, perform on the smart voice device the operation indicated by the voice command.
The operation may be any appropriate operation, e.g., instructing the smart voice device to turn a corresponding function on or off, such as turning on the air conditioner or turning off a light, or instructing the smart voice device to perform a query, such as finding and playing a certain song or querying and reporting the weather of a certain place; the embodiments of the present invention impose no limitation on the specific operation indicated by a voice command.
As mentioned above, the speech recognition solution provided by the embodiments of the present invention can be implemented in many appropriate ways; in one feasible way, part or all of the solution can be implemented by a neural network model. In the following, the above process of this embodiment is described taking a convolutional neural network (CNN) model as an example.
The structure of one such CNN model is shown in Fig. 3; it includes an input part A, a feature extraction part B, a sentence boundary detection part C, and a domain classification part D.
Specifically:
The input part A may be the input layer of the CNN and is used to receive the input text vector, such as the text vector of the text data corresponding to the speech input data.
Multiple convolutional layers are provided in the feature extraction part B; in this embodiment, at least 12 convolutional layers are set to improve the accuracy of feature extraction. Optionally, batch normalization layers, activation layers, and convolutional layers may be arranged in the feature extraction part B, and residual processing may also be applied to the convolutional layers. Setting batch normalization layers optimizes the data-processing speed of the CNN model. The activation layer may perform its nonlinear transformation with a gated linear function, which improves the nonlinear feature transformation of the text vector; of course, other activation functions are equally applicable. By applying residual processing to a convolutional layer, the original text vector corresponding to the text data is combined with the syntactic features output by the current convolutional layer before being output, which optimizes gradient back-propagation and improves the feature extraction.
Optionally, the sentence boundary detection part C may include, in order, a batch normalization layer, a convolutional layer, and an output layer, where the output layer uses the Softmax function as its loss function. Through the sentence boundary detection part C, the label of the word vector of each word in the text vector can be obtained, and thus the sequence labeling of the entire text data; the division into text clauses can then be determined from the end labels (e.g., the E labels) in the sequence labeling.
Optionally, the domain classification part D may include, in order, a batch normalization layer, a convolutional layer, a pooling layer, and an output layer, where the pooling layer uses one-dimensional region-of-interest pooling (1-D RoI pooling) and the output layer uses the Softmax function as its loss function. According to the result of the division into text clauses and the domain information of each text clause, the domain classification part D can identify the text clause corresponding to a voice command.
It should be noted that, as shown in Fig. 3, the sentence boundary detection part C and the domain classification part D of the CNN model in this embodiment share the syntactic features extracted by the feature extraction part B, which improves the data-processing efficiency of the CNN model and saves implementation cost.
Based on the CNN model shown in Fig. 3, taking a smart speaker as the smart voice device and "Why are you coming home so late? Please turn on the bedroom light" as the speech input data, the corresponding speech recognition process includes:
(1) Convert the speech input data into text data, and obtain the text vector corresponding to the text data.
This part covers the conversion and processing of the data before the text vector is input into the CNN model. Taking the user's utterance "Why are you coming home so late? Please turn on the bedroom light" as an example, in this part the utterance is converted into text data, and each character in it is converted into a D-dimensional vector, where the specific value of D can be set appropriately by those skilled in the art according to actual needs.
In this way, N D-dimensional vectors can be generated, where N is the number of word units; in this example the text contains 17 characters, so N is 17. These N D-dimensional vectors constitute the text vector corresponding to the text data.
(2) Receive the text vector corresponding to the text data through the input part of the CNN model.
For example, the N D-dimensional vectors generated above are received through the input layer of the CNN model.
(3) Perform feature extraction on the text vector through the feature extraction part of the CNN model to obtain the syntactic features of the text vector.
This includes: performing a batch normalization operation on the input vectors to generate normalized vectors; applying a nonlinear transformation to the normalized vectors; performing feature extraction on the nonlinearly transformed vectors through a convolutional layer to obtain initial features; performing residual processing on the initial features and, from the result of the residual processing, obtaining and outputting the syntactic features of the vectors; and returning to the batch normalization step and continuing until the syntactic features of the text vector are obtained. Optionally, when a batch normalization layer is provided, the batch normalization operation on the input vectors is performed by that layer to generate the normalized vectors. The vectors input to the batch normalization layer of the first convolutional-layer section are the text vector corresponding to the text data; the vectors input to the batch normalization layer of any subsequent convolutional-layer section are the vectors output by the preceding convolutional-layer section. Also optionally, when an activation layer is provided, the nonlinear transformation of the normalized vectors is performed by the activation layer.
That is, the text vector is first fed into the first batch normalization layer and passed in turn through the batch normalization layer, activation layer, and convolutional layer for batch normalization, nonlinear transformation, feature extraction, and residual processing, yielding syntactic features. These syntactic features are then fed into the next adjacent batch normalization layer, activation layer, convolutional layer, and so on, processed in turn to obtain new syntactic features, which are in turn fed into the next batch normalization, activation, and convolutional layers, and so forth, until the final syntactic features of the text vector are obtained.
It should be noted that the vectors input to a batch normalization layer may be all of the vectors output by the preceding convolutional layer after residual processing, such as the entire text vector or all syntactic features, or may be the per-word vectors after residual processing by the preceding convolutional layer, such as the word vector of each word in the text vector or the syntactic features corresponding to each word. Either way, the finally obtained syntactic features of the text vector include the syntactic features of each word.
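One pass through a feature-extraction section (batch normalization, activation, convolution, then adding the input back as the residual connection) can be sketched in pure Python on toy 1-D features. The ReLU activation, kernel values, and block count below are illustrative simplifications, not the embodiment's gated linear function or trained parameters:

```python
import math

def batch_norm(xs, eps=1e-5):
    """Normalize a feature sequence to zero mean and unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def relu(xs):
    return [max(0.0, x) for x in xs]

def conv1d_same(xs, kernel):
    pad = len(kernel) // 2
    padded = [0.0] * pad + xs + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(xs))]

def residual_block(xs, kernel):
    """BN -> activation -> convolution, then add the input back (residual connection)."""
    out = conv1d_same(relu(batch_norm(xs)), kernel)
    return [x + o for x, o in zip(xs, out)]

features = [1.0, 2.0, 3.0, 4.0]
for _ in range(3):        # stacking blocks, cf. the >= 12 convolutional layers above
    features = residual_block(features, [0.1, 0.2, 0.1])
```

The residual addition is what lets the original per-word information flow through the deep stack unchanged, matching the stated purpose of combining the original text vector with each layer's output.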
Specifically, for the example "Why are you coming home so late? Please turn on the bedroom light", this step yields the syntactic features corresponding to each character.
(4) Through the sentence boundary detection part of the CNN model, obtain, according to the syntactic features output by the feature extraction part, at least one text clause contained in the text data corresponding to the speech input data.
This includes: performing a batch normalization operation on the syntactic features (optionally via the batch normalization layer of the sentence boundary detection part) to generate normalized syntactic features; performing feature extraction on the normalized syntactic features through the convolutional layer; and, through the output layer, determining the label of each word in the text data from the feature extraction result and obtaining the at least one text clause contained in the text data from the labels of the words.
The sentence boundary detection part implements the sequence labeling of the text data. For example, B indicates that the corresponding character is at the start of a text segment (i.e., B is the begin label), E indicates that the corresponding character is at the end of a text segment (i.e., E is the end label), and I indicates that the corresponding character lies in the middle of a text segment (i.e., I is the inside label). From the syntactic features of each word, the sentence boundary detection part yields a B/I/E probability distribution for each word in the text data, and for each word the label corresponding to the maximum of its B/I/E probability distribution is taken. In the example "Why are you coming home so late? Please turn on the bedroom light", if for the character "啊" the B-label probability is 0.3, the I-label probability is 0.1, and the E-label probability is 0.8, then the label of "啊" is determined to be E. From the labels of all words, the sequence labeling of the entire text data is obtained; then, from the end labels in that sequence labeling, the sentence boundary of each text clause, and thus the extent of each text clause, is obtained.
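Picking the maximum of a per-word B/I/E probability distribution is a one-line argmax; the sketch below reuses the "啊" numbers from the example above:

```python
def pick_tag(prob_dist):
    """Choose the label with the highest probability from a {label: probability} map."""
    return max(prob_dist, key=prob_dist.get)

# The '啊' example from the text: B=0.3, I=0.1, E=0.8
tag = pick_tag({"B": 0.3, "I": 0.1, "E": 0.8})
# -> "E"
```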
Specifically, for the example "Why are you coming home so late? Please turn on the bedroom light", this step yields its sequence labeling, e.g., "BIIIIIIIIEBIIIIIE", from which two text clauses are obtained, namely "Why are you coming home so late?" and "Please turn on the bedroom light".
(5) Through the domain classification part of the CNN model, obtain the domain information of each text clause according to the syntactic features of the text vector and the information of each text clause.
This includes: performing a batch normalization operation on the syntactic features of the text vector (optionally via the batch normalization layer of the domain classification part) to generate normalized syntactic features; performing feature mapping on the normalized syntactic features through the convolutional layer to obtain the domain features of the text vector; pooling the domain features of the text vector through the pooling layer according to the information of each text clause; and, through the output layer, obtaining the domain information of each text clause from the result of the pooling.
First, for the syntactic features of the text vector, the batch normalization layer and convolutional layer of the domain classification part map those syntactic features to domain features C. In the example "Why are you coming home so late? Please turn on the bedroom light", the resulting domain features C may be an N×D two-dimensional matrix, where N is the number of word units in the text data (17 in this example) and D is the dimension of each word's domain feature vector.
Second, according to the range of each text clause obtained by the sentence boundary detection part, the domain feature C can be converted into S=(m1,m2,m3,...), where each m is the two-dimensional domain feature matrix corresponding to one text clause and S is the set of the two-dimensional domain feature matrices of the text clauses, which together also form an N*D two-dimensional matrix. Each m is a W*D two-dimensional matrix, where W is the number of words in the current text clause and D, as described above, is the dimension of the feature vector of each word.
For the example "Why are you coming home so late? Please turn on the bedroom light", which includes the text clause "Why are you coming home so late", the corresponding two-dimensional domain feature matrix m1 is a 10*D matrix; for the text clause "Please turn on the bedroom light", the corresponding two-dimensional domain feature matrix m2 is a 7*D matrix. Correspondingly, S=(m1, m2).
Next, for S, the maximum value is taken along the first dimension, i.e., the N dimension: a max operation is performed on the first dimension of the two-dimensional domain feature matrix corresponding to each text clause, yielding a one-dimensional feature (1*D) for each text clause, and thereby a fixed-length feature representation T=(u1,u2,u3,...) of all text clauses, where each u is a one-dimensional domain feature vector of length D corresponding to one text clause.
Then, T=(u1,u2,u3,...) is pooled through the pooling layer of the domain classification part, the domain probability distribution of each text clause is obtained through a Softmax function, and the domain information of each text clause is determined according to its domain probability distribution.
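The per-clause max reduction and Softmax described above can be sketched with NumPy as follows. Dimensions follow the running example (N=17 words, two clauses of 10 and 7 words); the tiny D, the number of domains, and the random weights are placeholders for illustration, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, num_domains = 4, 3              # toy dimensions for illustration
C = rng.standard_normal((17, D))   # domain feature C: N*D, N = 17 words

# Split C by the clause ranges from the sentence boundary detection part.
m1, m2 = C[0:10], C[10:17]         # 10*D and 7*D matrices
S = [m1, m2]

# Max over the first (word) dimension gives one 1*D vector per clause,
# i.e. the fixed-length representation T = (u1, u2).
T = np.stack([m.max(axis=0) for m in S])   # shape (2, D)

# Project each clause vector to domain logits and apply Softmax.
W = rng.standard_normal((D, num_domains))  # placeholder classifier weights
logits = T @ W
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
domains = probs.argmax(axis=1)     # domain information per clause
print(probs.shape, domains.shape)
```

Each row of `probs` is the domain probability distribution of one clause and sums to 1; `argmax` picks the domain information.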
(6) Recognizing the voice command in the voice input data according to the domain information of each text clause.
It can be seen that, through the above processes (2)-(5), the division of the text clauses and the determination of their domain information by the CNN model are realized. Through this CNN model, the two tasks of voice command extraction and recognition are unified within a single CNN model framework, effectively realizing the segmentation and recognition of user commands.
Further, based on the output of the CNN model, the voice command can be determined.
For example, if the domain information IOT (Internet of Things) is set for voice commands, then when the domain information of a text clause is classified into the IOT domain, the part of the voice input data corresponding to that text clause can be regarded as a voice command. For the example "Why are you coming home so late? Please turn on the bedroom light", "Please turn on the bedroom light" is classified into the IOT domain; therefore, "Please turn on the bedroom light" is determined to be a voice command for operating and controlling the smart voice device.
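Once each clause carries a domain label, selecting the command clauses reduces to a filter. A minimal sketch (the domain names, including the non-command label "CHAT", are illustrative assumptions):

```python
def extract_commands(clauses, domains, command_domain="IOT"):
    """Return the clauses whose domain information matches the command domain."""
    return [c for c, d in zip(clauses, domains) if d == command_domain]

clauses = ["你为什么这么晚回家啊", "请打开卧室的灯"]
domains = ["CHAT", "IOT"]   # per-clause domain info from the classifier
print(extract_commands(clauses, domains))
```

Only the second clause survives the filter, matching the example's conclusion.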
Through this embodiment, the text data converted from the voice input data and the text vector corresponding to the text data are first acquired; the corresponding syntactic feature is then obtained by feature extraction on the text vector; next, according to the syntactic feature, the text data corresponding to the voice input data is divided into text clauses and the domain information of each text clause is determined; and finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. It can be seen that this embodiment makes the smart voice device better suited to real usage environments: the user no longer needs to use a wake-up word to wake the smart voice device. Whether the user's voice input data consists purely of voice commands or mixes voice commands with other speech, the voice input data can be effectively divided into clauses and the voice commands contained therein can be recognized, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the degree of intelligent processing of user voice commands by the smart voice device is improved; moreover, the smart voice device no longer needs to process a wake-up word, which reduces its data processing burden.
The speech recognition method of this embodiment may be executed by any appropriate smart voice device with data processing capability, e.g., various smart home appliances with corresponding functions.
Embodiment Three
Referring to Fig. 4, a structural block diagram of a speech recognition apparatus according to Embodiment Three of the present invention is shown.
The speech recognition apparatus of this embodiment includes: a first acquisition module 302, configured to acquire text data corresponding to voice input data and a text vector corresponding to the text data; a second acquisition module 304, configured to acquire a syntactic feature of the text vector; a third acquisition module 306, configured to acquire, according to the syntactic feature, at least one text clause contained in the text data, and to acquire domain information of each text clause; and a recognition module 308, configured to recognize a voice command in the voice input data at least according to the domain information of each text clause.
Through this embodiment, the text data converted from the voice input data and the text vector corresponding to the text data are first acquired; the corresponding syntactic feature is then obtained by feature extraction on the text vector; next, according to the syntactic feature, the text data corresponding to the voice input data is divided into text clauses and the domain information of each text clause is determined; and finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. It can be seen that this embodiment makes the smart voice device better suited to real usage environments: the user no longer needs to use a wake-up word to wake the smart voice device. Whether the user's voice input data consists purely of voice commands or mixes voice commands with other speech, the voice input data can be effectively divided into clauses and the voice commands contained therein can be recognized, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the degree of intelligent processing of user voice commands by the smart voice device is improved; moreover, the smart voice device no longer needs to process a wake-up word, which reduces its data processing burden.
Embodiment Four
Referring to Fig. 5, a structural block diagram of a speech recognition apparatus according to Embodiment Four of the present invention is shown.
The speech recognition apparatus of this embodiment includes: a first acquisition module 402, configured to acquire text data corresponding to voice input data and a text vector corresponding to the text data; a second acquisition module 404, configured to acquire a syntactic feature of the text vector; a third acquisition module 406, configured to acquire, according to the syntactic feature, at least one text clause contained in the text data, and to acquire domain information of each text clause; and a recognition module 408, configured to recognize a voice command in the voice input data at least according to the domain information of each text clause.
Optionally, the first acquisition module 402 is configured to acquire voice input data and generate text data corresponding to the voice input data; generate a word vector corresponding to each word in the text data; and generate, according to the word vector corresponding to each word, the text vector corresponding to the text data.
Optionally, the second acquisition module 404 is configured to perform feature extraction on the text vector to acquire the syntactic feature of the text vector.
Optionally, the second acquisition module 404 is configured to perform feature extraction on the word vector corresponding to each word in the text vector to acquire the syntactic feature of each word.
Optionally, the third acquisition module 406 includes: a clause acquisition module 4062, configured to acquire a tag of each word according to the syntactic feature of each word, wherein the tags include an end tag, obtain a sequence labeling of the text data according to the tag of each word, and acquire, according to the end tags in the sequence labeling, at least one text clause contained in the text data; and a domain acquisition module 4064, configured to acquire the domain information of each text clause according to the syntactic feature.
Optionally, the domain acquisition module 4064 includes: a domain feature module 40642, configured to acquire a domain feature corresponding to each text clause according to the syntactic feature of the text vector; and a determination module 40644, configured to perform, for the domain feature of each text clause, maximum feature value extraction in each feature dimension to generate a domain feature vector of each text clause, and determine the domain information of the current text clause according to the domain feature vector of each text clause.
Optionally, the domain feature module 40642 is configured to acquire a domain feature of the text vector according to the syntactic feature of the text vector, and acquire, from the domain feature of the text vector, the domain feature corresponding to each text clause according to the information of the words contained in each text clause.
Optionally, the second acquisition module 404 is configured to perform feature extraction on the text vector through a feature extraction part of a convolutional neural network model to acquire the syntactic feature of the text vector; and the third acquisition module 406 is configured to acquire, through a sentence boundary detection part of the convolutional neural network model, at least one text clause contained in the text data according to the syntactic feature, and to acquire, through a domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information of each text clause; wherein the sentence boundary detection part and the domain classification part share the syntactic feature extracted by the feature extraction part.
Optionally, the second acquisition module 404 is configured to: perform a batch normalization operation on an input vector to generate a normalized vector; perform non-linearization processing on the normalized vector; perform feature extraction on the non-linearized vector through a convolutional layer to obtain an initial feature; perform residual analysis processing on the initial feature, and obtain and output the syntactic feature of the vector according to the result of the residual analysis processing; and return to the batch normalization operation on the input vector and continue execution until the syntactic feature of the text vector is obtained.
Optionally, the feature extraction part includes at least 12 convolutional layers, and the normalized vector is non-linearized through a linear gate function.
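The "linear gate function" reads as a gated linear unit (GLU); a minimal NumPy sketch of that non-linearization, under that assumption:

```python
import numpy as np

def glu(x, axis=-1):
    """Gated linear unit: split the channels into halves A and B along
    `axis` and return A * sigmoid(B), so one half gates the other."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

x = np.array([[1.0, -2.0, 0.0, 3.0]])  # 1 word, 4 channels
y = glu(x)
print(y.shape)  # the gating halves the channel dimension
```

Note that the output has half as many channels as the input, so a convolutional layer feeding a GLU would produce twice the target channel count.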
Optionally, when acquiring, through the sentence boundary detection part of the convolutional neural network model, the at least one text clause contained in the text data according to the syntactic feature, the third acquisition module 406: performs a batch normalization operation on the syntactic feature to generate a normalized syntactic feature; performs feature extraction on the normalized syntactic feature through a convolutional layer; and determines, through an output layer, the tag of each word in the text data according to the feature extraction result, and acquires at least one text clause contained in the text data according to the tag of each word.
Optionally, when acquiring, through the domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information of each text clause, the third acquisition module 406: performs a batch normalization operation on the syntactic feature to generate a normalized syntactic feature; performs feature mapping on the normalized syntactic feature through a convolutional layer to acquire the domain feature of the text vector; performs pooling processing on the domain feature of the text vector through a pooling layer according to the information of each text clause; and acquires, through an output layer, the domain information of each text clause according to the result of the pooling processing.
The speech recognition apparatus of this embodiment is configured to implement the corresponding speech recognition methods in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here. In addition, for the functional implementation of each module in the speech recognition apparatus of this embodiment, reference may be made to the description of the corresponding parts in the foregoing method embodiments, which will likewise not be repeated here.
Embodiment Five
Referring to Fig. 6, a schematic structural diagram of a smart device according to Embodiment Five of the present invention is shown. The specific embodiments of the present invention do not limit the specific implementation of the smart device.
As shown in Fig. 6, the smart device may include: a processor 502, a communications interface 504, a memory 506, and a communication bus 508.
Wherein:
the processor 502, the communications interface 504, and the memory 506 communicate with one another through the communication bus 508.
The communications interface 504 is configured to communicate with other electronic devices such as other smart devices or servers.
The processor 502 is configured to execute a program 510, and may specifically execute the relevant steps in the above speech recognition method embodiments.
Specifically, the program 510 may include program code, and the program code includes computer operation instructions.
The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 506 is configured to store the program 510. The memory 506 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory.
The program 510 may specifically be configured to cause the processor 502 to perform the following operations: acquiring text data corresponding to voice input data and a text vector corresponding to the text data; acquiring a syntactic feature of the text vector; acquiring, according to the syntactic feature, at least one text clause contained in the text data, and acquiring domain information of each text clause; and recognizing a voice command in the voice input data at least according to the domain information of each text clause.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the text data corresponding to the voice input data and the text vector corresponding to the text data, to: acquire the voice input data and generate the text data corresponding to the voice input data; generate a word vector corresponding to each word in the text data; and generate, according to the word vector corresponding to each word, the text vector corresponding to the text data.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the syntactic feature of the text vector, to perform feature extraction on the text vector to acquire the syntactic feature of the text vector.
In an optional implementation, the program 510 is further configured to cause the processor 502, when performing feature extraction on the text vector to acquire the syntactic feature of the text vector, to perform feature extraction on the word vector corresponding to each word in the text vector to acquire the syntactic feature of each word.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the at least one text clause contained in the text data according to the syntactic feature, to: acquire a tag of each word according to the syntactic feature of each word, wherein the tags include an end tag; obtain a sequence labeling of the text data according to the tag of each word; and acquire, according to the end tags in the sequence labeling, the at least one text clause contained in the text data.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the domain information of each text clause, to: acquire, according to the syntactic feature of the text vector, a domain feature corresponding to each text clause; perform, for the domain feature of each text clause, maximum feature value extraction in each feature dimension to generate a domain feature vector of each text clause; and determine the domain information of the current text clause according to the domain feature vector of each text clause.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring the domain feature corresponding to each text clause according to the syntactic feature of the text vector, to: acquire a domain feature of the text vector according to the syntactic feature of the text vector; and acquire, from the domain feature of the text vector, the domain feature corresponding to each text clause according to the information of the words contained in each text clause.
In an optional implementation, the program 510 is further configured to cause the processor 502 to perform feature extraction on the text vector through a feature extraction part of a convolutional neural network model to acquire the syntactic feature of the text vector; the program 510 is further configured to cause the processor 502 to acquire, through a sentence boundary detection part of the convolutional neural network model, at least one text clause contained in the text data according to the syntactic feature, and to acquire, through a domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information of each text clause; wherein the sentence boundary detection part and the domain classification part share the syntactic feature extracted by the feature extraction part.
In an optional implementation, the program 510 is further configured to cause the processor 502, when performing feature extraction on the text vector through the feature extraction part of the convolutional neural network model to acquire the syntactic feature of the text vector, to: perform a batch normalization operation on an input vector to generate a normalized vector; perform non-linearization processing on the normalized vector; perform feature extraction on the non-linearized vector through a convolutional layer to obtain an initial feature; perform residual analysis processing on the initial feature, and obtain and output the syntactic feature of the vector according to the result of the residual analysis processing; and return to the step of performing the batch normalization operation on the input vector and continue execution until the syntactic feature of the text vector is obtained.
In an optional implementation, the feature extraction part includes at least 12 convolutional layers, and the normalized vector is non-linearized through a linear gate function.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring, through the sentence boundary detection part of the convolutional neural network model, the at least one text clause contained in the text data according to the syntactic feature, to: perform a batch normalization operation on the syntactic feature to generate a normalized syntactic feature; perform feature extraction on the normalized syntactic feature through a convolutional layer; and determine, through an output layer, the tag of each word in the text data according to the feature extraction result, and acquire at least one text clause contained in the text data according to the tag of each word.
In an optional implementation, the program 510 is further configured to cause the processor 502, when acquiring, through the domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information of each text clause, to: perform a batch normalization operation on the syntactic feature to generate a normalized syntactic feature; perform feature mapping on the normalized syntactic feature through a convolutional layer to acquire the domain feature of the text vector; perform pooling processing on the domain feature of the text vector through a pooling layer according to the information of each text clause; and acquire, through an output layer, the domain information of each text clause according to the result of the pooling processing.
In an optional implementation, the smart device of this embodiment may further include a microphone to receive an analog voice signal input by the user and convert it into a digital voice signal, i.e., the voice input data; the program 510 may further be configured to cause the processor 502 to convert the voice input data into corresponding text data. However, this is not limiting: the microphone may also be provided independently of the smart device, connected to the smart device through an appropriate connection, and send the voice input data to the processor.
For the specific implementation of each step in the program 510, reference may be made to the corresponding descriptions of the corresponding steps and units in the above speech recognition method embodiments, which will not be repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, reference may be made to the corresponding process descriptions in the foregoing method embodiments, which will not be repeated here.
Through the smart device of this embodiment, the text data converted from the voice input data and the text vector corresponding to the text data are first acquired; the corresponding syntactic feature is then obtained by feature extraction on the text vector; next, according to the syntactic feature, the text data corresponding to the voice input data is divided into text clauses and the domain information of each text clause is determined; and finally, the voice commands in the voice input data are recognized according to the domain information of the text clauses. It can be seen that this embodiment makes the smart voice device better suited to real usage environments: the user no longer needs to use a wake-up word to wake the smart voice device. Whether the user's voice input data consists purely of voice commands or mixes voice commands with other speech, the voice input data can be effectively divided into clauses and the voice commands contained therein can be recognized, so that the smart voice device can subsequently be operated and controlled through the recognized voice commands.
Since no wake-up word is needed to wake the smart voice device, the user's operational burden is reduced and the degree of intelligent processing of user voice commands by the smart voice device is improved; moreover, the smart voice device no longer needs to process a wake-up word, which reduces its data processing burden.
It should be pointed out that, according to implementation needs, each component/step described in the embodiments of the present invention may be split into more components/steps, and two or more components/steps or partial operations of components/steps may be combined into new components/steps, so as to achieve the purpose of the embodiments of the present invention.
The above method according to the embodiments of the present invention may be implemented in hardware or firmware, or implemented as software or computer code storable in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is downloaded over a network, originally stored in a remote recording medium or a non-transitory machine-readable medium, and to be stored in a local recording medium, so that the method described herein can be processed by such software on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the speech recognition method described herein is implemented. In addition, when a general-purpose computer accesses code for implementing the speech recognition method shown herein, execution of the code converts the general-purpose computer into a dedicated computer for executing the speech recognition method shown herein.
A person of ordinary skill in the art may be aware that the units and method steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or in software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the embodiments of the present invention.
The above implementations are only intended to illustrate the embodiments of the present invention, not to limit them. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention; therefore, all equivalent technical solutions also fall within the scope of the embodiments of the present invention, and the patent protection scope of the embodiments of the present invention shall be defined by the claims.

Claims (13)

  1. A speech recognition method, comprising:
    acquiring text data corresponding to voice input data and a text vector corresponding to the text data;
    acquiring a syntactic feature of the text vector;
    acquiring, according to the syntactic feature, at least one text clause contained in the text data, and acquiring domain information of each text clause;
    recognizing a voice instruction in the voice input data at least according to the domain information of each text clause.
  2. The method according to claim 1, wherein the acquiring text data corresponding to voice input data and a text vector corresponding to the text data comprises:
    acquiring the voice input data, and generating the text data corresponding to the voice input data;
    generating a word vector corresponding to each word in the text data;
    generating, according to the word vector corresponding to each word, the text vector corresponding to the text data.
  3. The method according to claim 2, wherein the acquiring a syntactic feature of the text vector comprises:
    performing feature extraction on the text vector to acquire the syntactic feature of the text vector.
  4. The method according to claim 3, wherein the performing feature extraction on the text vector to acquire the syntactic feature of the text vector comprises:
    performing feature extraction on the word vector corresponding to each word in the text vector to acquire a syntactic feature of each word.
  5. The method according to claim 4, wherein the acquiring, according to the syntactic feature, at least one text clause contained in the text data comprises:
    acquiring a tag of each word according to the syntactic feature of the word, wherein the tags include an end tag;
    obtaining a sequence labeling of the text data according to the tag of each word;
    acquiring, according to the end tags in the sequence labeling, the at least one text clause contained in the text data.
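As a hedged, non-authoritative sketch of the clause-segmentation step described in this claim: given one tag per word, where an end tag marks the last word of a clause, the text can be split at each end tag. The tag names (`"E"` for the end tag, `"O"` elsewhere) and the example words are invented for illustration; the claim itself does not fix a tag set.

```python
def split_clauses(words, tags):
    """Split a tagged word sequence into clauses.

    `tags` holds one sequence label per word; the hypothetical
    label "E" plays the role of the end tag in the claim.
    """
    clauses, current = [], []
    for word, tag in zip(words, tags):
        current.append(word)
        if tag == "E":        # end tag closes the current clause
            clauses.append(current)
            current = []
    if current:               # trailing words with no end tag
        clauses.append(current)
    return clauses

words = ["turn", "on", "the", "light", "play", "some", "music"]
tags  = ["O", "O", "O", "E", "O", "O", "E"]
print(split_clauses(words, tags))
# → [['turn', 'on', 'the', 'light'], ['play', 'some', 'music']]
```

In this sketch, the two invented voice requests in one utterance are recovered as two clauses, which downstream steps can classify into separate domains.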
  6. The method according to claim 5, wherein the acquiring domain information of each text clause comprises:
    acquiring a domain feature corresponding to each text clause according to the syntactic feature of the text vector;
    extracting, for the domain feature of each text clause, a maximum feature value in each feature dimension to generate a domain feature vector of the text clause;
    determining the domain information of the current text clause according to the domain feature vector of each text clause.
  7. The method according to claim 6, wherein the acquiring a domain feature corresponding to each text clause according to the syntactic feature of the text vector comprises:
    acquiring a domain feature of the text vector according to the syntactic feature of the text vector;
    acquiring, from the domain feature of the text vector, the domain feature corresponding to each text clause according to information on the words contained in the text clause.
  8. The method according to any one of claims 1-7, wherein:
    feature extraction is performed on the text vector by a feature extraction part of a convolutional neural network model to acquire the syntactic feature of the text vector;
    the at least one text clause contained in the text data is acquired by a sentence boundary detection part of the convolutional neural network model according to the syntactic feature; and the domain information of each text clause is acquired by a domain classification part of the convolutional neural network model according to the syntactic feature and information on each text clause;
    wherein the sentence boundary detection part and the domain classification part share the syntactic feature extracted by the feature extraction part.
  9. The method according to claim 8, wherein the performing feature extraction on the text vector by the feature extraction part of the convolutional neural network model to acquire the syntactic feature of the text vector comprises:
    performing a batch normalization operation on an input vector to generate a normalized vector;
    performing non-linear processing on the normalized vector;
    performing feature extraction on the non-linearly processed vector through a convolutional layer to obtain an initial feature;
    performing residual analysis processing on the initial feature, and obtaining and outputting a syntactic feature of the vector according to a result of the residual analysis processing;
    returning to the step of performing a batch normalization operation on an input vector and continuing execution until the syntactic feature of the text vector is obtained.
  10. The method according to claim 9, wherein the feature extraction part comprises at least 12 convolutional layers, and the non-linear processing is performed on the normalized vector through a linear gate function.
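A minimal NumPy sketch of one such feature-extraction block may help fix the ideas in claims 9-10: batch normalization, a gated-linear non-linearity (one reading of the claimed "linear gate function"), a feature map standing in for the convolutional layer, and a residual connection. All dimensions are invented, and the convolution is simplified to a per-position linear map purely for illustration; this is not the patented implementation.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature dimension over the sequence axis.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def glu(x):
    # Gated linear unit: split channels in half and gate one half
    # by a sigmoid of the other half.
    a, b = np.split(x, 2, axis=-1)
    return a / (1.0 + np.exp(-b))

def residual_block(x, w):
    # w has shape (d, 2*d) so the gated output matches x and the
    # residual sum (claim 9's "residual analysis processing") works.
    h = batch_norm(x)
    h = glu(h @ w)          # stand-in for the convolutional layer
    return x + h            # residual connection

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 16))        # 7 "words", 16 features each
w = rng.normal(size=(16, 32))
y = residual_block(x, w)
print(y.shape)                      # → (7, 16): shape preserved
```

Because each block preserves the feature shape, twelve or more of them can be stacked, as claim 10 requires of the feature extraction part, by feeding each block's output into the next.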
  11. The method according to claim 8, wherein the acquiring, by the sentence boundary detection part of the convolutional neural network model, the at least one text clause contained in the text data according to the syntactic feature comprises:
    performing a batch normalization operation on the syntactic feature to generate a normalized syntactic feature;
    performing feature extraction on the normalized syntactic feature through a convolutional layer;
    determining, through an output layer, the tag of each word in the text data according to a result of the feature extraction, and acquiring the at least one text clause contained in the text data according to the tag of each word.
  12. The method according to claim 8, wherein the acquiring, by the domain classification part of the convolutional neural network model, the domain information of each text clause according to the syntactic feature and the information on each text clause comprises:
    performing a batch normalization operation on the syntactic feature to generate a normalized syntactic feature;
    performing feature mapping on the normalized syntactic feature through a convolutional layer to acquire the domain feature of the text vector;
    performing, through a pooling layer, pooling processing on the domain feature of the text vector according to the information on each text clause;
    acquiring, through an output layer, the domain information of each text clause according to a result of the pooling processing.
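The pooling and output steps of claim 12 (together with the per-dimension maximum of claim 6) can be sketched as follows. The domain label set, the linear output layer, and all feature values are invented for illustration; the claim does not specify them.

```python
import numpy as np

DOMAINS = ["music", "weather", "smart_home"]   # hypothetical label set

def clause_domains(domain_feats, spans, w_out):
    """domain_feats: (num_words, d) per-word domain features;
    spans: [(start, end), ...] word-index ranges of the clauses;
    w_out: (d, num_domains) linear stand-in for the output layer."""
    results = []
    for start, end in spans:
        # Pooling layer: maximum feature value in each feature
        # dimension, restricted to the words of this clause.
        pooled = domain_feats[start:end].max(axis=0)
        # Output layer: score each domain, keep the best one.
        results.append(DOMAINS[int(np.argmax(pooled @ w_out))])
    return results

rng = np.random.default_rng(1)
feats = rng.normal(size=(7, 8))                # 7 words, 8 domain features
out = clause_domains(feats, [(0, 4), (4, 7)], rng.normal(size=(8, 3)))
print(out)                                     # one domain label per clause
```

Restricting the max-pooling to each clause's span is what lets a single utterance containing several clauses receive several independent domain labels.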
  13. A speech recognition apparatus, comprising:
    a first acquisition module configured to acquire text data corresponding to voice input data and a text vector corresponding to the text data;
    a second acquisition module configured to acquire a syntactic feature of the text vector;
    a third acquisition module configured to acquire, according to the syntactic feature, at least one text clause contained in the text data, and acquire domain information of each text clause;
    a recognition module configured to recognize a voice instruction in the voice input data at least according to the domain information of each text clause.
PCT/CN2020/070581 2019-01-18 2020-01-07 Speech recognition method and apparatus WO2020147609A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910047340.2A CN111462738B (en) 2019-01-18 2019-01-18 Speech recognition method and device
CN201910047340.2 2019-01-18

Publications (1)

Publication Number Publication Date
WO2020147609A1 2020-07-23

Family

ID=71613709

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070581 WO2020147609A1 (en) 2019-01-18 2020-01-07 Speech recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN111462738B (en)
WO (1) WO2020147609A1 (en)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103839549A * 2012-11-22 2014-06-04 Tencent Technology (Shenzhen) Co., Ltd. Voice instruction control method and system
CN105469789A * 2014-08-15 2016-04-06 ZTE Corporation Voice information processing method and voice information processing terminal
US10121467B1 * 2016-06-30 2018-11-06 Amazon Technologies, Inc. Automatic speech recognition incorporating word usage information
CN107247702A * 2017-05-05 2017-10-13 Guilin University of Electronic Technology Text emotion analysis and processing method and system
CN107773982B * 2017-10-20 2021-08-13 iFlytek Co., Ltd. Game voice interaction method and device
CN108091327A * 2018-02-22 2018-05-29 Chengdu Qiying Tailun Technology Co., Ltd. Intelligent voice device control method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106471570A * 2014-05-30 2017-03-01 Apple Inc. Multi-command single utterance input method
US20180350353A1 * 2014-05-30 2018-12-06 Apple Inc. Multi-command single utterance input method
CN106528522A * 2016-08-26 2017-03-22 Nanjing Weikaer Software Co., Ltd. Scenario-based semantic understanding and dialogue generation method and system
CN107315737A * 2017-07-04 2017-11-03 Beijing QIYI Century Science & Technology Co., Ltd. Semantic logic processing method and system
CN107679042A * 2017-11-15 2018-02-09 Beijing Lingban Instant Intelligent Technology Co., Ltd. Multi-layer dialogue analysis method for intelligent voice dialogue systems
CN108563790A * 2018-04-28 2018-09-21 iFlytek Co., Ltd. Semantic understanding method and device, equipment, and computer-readable medium

Also Published As

Publication number Publication date
CN111462738A (en) 2020-07-28
CN111462738B (en) 2024-05-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20741402

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20741402

Country of ref document: EP

Kind code of ref document: A1