WO2022028378A1 - Speech intent recognition method, apparatus, and device - Google Patents
Speech intent recognition method, apparatus, and device
- Publication number
- WO2022028378A1 (PCT/CN2021/110134; priority CN2021110134W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- pinyin
- sample
- recognized
- phoneme
- vector
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
- G10L2015/223—Execution procedure of a spoken command
Definitions
- the present disclosure relates to the field of voice interaction, and in particular, to a voice intent recognition method, apparatus, and device.
- voice interaction has become an important bridge for communication between humans and machines.
- the robotic system needs to talk to the user and complete specific tasks.
- One of the core technologies is the recognition of voice intent. That is, after the user inputs the voice to be recognized to the robotic system, the robotic system can determine the voice intent of the user through the voice to be recognized.
- the speech intent recognition method includes: a speech recognition stage and an intention recognition stage.
- in the speech recognition stage, the speech to be recognized is processed by Automatic Speech Recognition (ASR) technology and converted into text.
- in the intent recognition stage, the text is semantically understood by natural language processing (NLP) technology to obtain keyword information, and the user's voice intent is identified based on the keyword information.
- the accuracy of the above text-based intent recognition method depends heavily on the accuracy of speech-to-text conversion; since that accuracy is relatively low, the accuracy of speech intent recognition is very low, and the user's voice intent cannot be identified accurately.
- for example, the speech may contain "trees" (shumu), but when the speech is converted to text, the text content may be the homophone "number" (shumu), which leads to incorrect recognition of the speech intent.
- the present application provides a speech intent recognition method, including:
- the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
- the to-be-recognized phoneme set includes a plurality of to-be-recognized phonemes
- the acquiring a to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set includes:
- for each to-be-recognized phoneme, determining the phoneme feature value corresponding to the to-be-recognized phoneme; and based on the phoneme feature value corresponding to each to-be-recognized phoneme, obtaining the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set, where the to-be-recognized phoneme vector includes the phoneme feature value corresponding to each to-be-recognized phoneme.
- before the inputting of the to-be-recognized phoneme vector into a trained target network model so that the target network model outputs the speech intent corresponding to the to-be-recognized phoneme vector, the method further includes:
- the sample phoneme vector and the sample intent are input to an initial network model, and the initial network model is trained by the sample phoneme vector and the sample intent to obtain the target network model.
- the sample phoneme set includes a plurality of sample phonemes
- the obtaining a sample phoneme vector corresponding to the sample phoneme set includes:
- a sample phoneme vector corresponding to the sample phoneme set is obtained, and the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
- the present application provides a speech intent recognition method, including:
- the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
- the set of pinyin to be recognized includes a plurality of pinyin to be recognized
- the acquiring the pinyin vector to be recognized corresponding to the set of pinyin to be recognized includes:
- for each to-be-recognized pinyin, determining the pinyin feature value corresponding to the to-be-recognized pinyin; and based on the pinyin feature value corresponding to each to-be-recognized pinyin, obtaining the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set, where the to-be-recognized pinyin vector includes the pinyin feature value corresponding to each to-be-recognized pinyin.
- before the inputting of the to-be-recognized pinyin vector into a trained target network model so that the target network model outputs the speech intent corresponding to the to-be-recognized pinyin vector, the method further includes:
- the sample pinyin vector and the sample intent are input to an initial network model, and the initial network model is trained by the sample pinyin vector and the sample intent to obtain the target network model.
- the sample pinyin set includes a plurality of sample pinyin
- the acquiring a sample pinyin vector corresponding to the sample pinyin set includes:
- a sample pinyin vector corresponding to the sample pinyin set is obtained, and the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
- the present application provides a voice intent recognition device, including:
- a determining module configured to determine a phoneme set to be recognized according to the speech to be recognized
- an acquisition module configured to acquire a phoneme vector to be recognized corresponding to the phoneme set to be recognized
- a processing module configured to input the to-be-recognized phoneme vector to a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized phoneme vector;
- the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
- the present application provides a voice intent recognition device, including:
- a determination module used for determining a set of pinyin to be recognized according to the speech to be recognized
- an acquisition module used for acquiring the pinyin vector to be recognized corresponding to the set of pinyin to be recognized
- a processing module configured to input the to-be-recognized pinyin vector to a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized pinyin vector;
- the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
- the present application provides a speech intent recognition device, including: a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions that can be executed by the processor; the processor is configured to execute the machine-executable instructions to implement the following steps:
- the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
- the present application provides a speech intent recognition device, including: a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions that can be executed by the processor; the processor is configured to execute the machine-executable instructions to implement the following steps:
- the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
- the speech intent is recognized based on the phoneme to be recognized, not the speech intent based on the text, and does not need to rely on the accuracy of converting speech to text.
- phonemes are the smallest units of speech, divided according to the natural attributes of speech and analyzed based on pronunciation actions, where one action constitutes one phoneme. Therefore, the accuracy of determining the phonemes to be recognized from the speech to be recognized is very high, and the accuracy of speech intent recognition is correspondingly high: the user's voice intent can be identified accurately, the accuracy of voice intent recognition is effectively improved, and intent recognition becomes more reliable. The method does not require a large number of language model algorithm libraries for voice recognition, resulting in significant performance and memory optimization.
- FIG. 1 is a schematic flowchart of a speech intent recognition method in an embodiment of the present application.
- FIG. 2 is a schematic flowchart of a voice intent recognition method in an embodiment of the present application.
- FIG. 3 is a schematic flowchart of a voice intent recognition method in an embodiment of the present application.
- FIG. 4 is a schematic flowchart of a voice intent recognition method in an embodiment of the present application.
- FIG. 5A is a schematic structural diagram of an apparatus for recognizing speech intent in an embodiment of the present application.
- FIG. 5B is a schematic structural diagram of an apparatus for recognizing speech intent in an embodiment of the present application.
- FIG. 6 is a hardware structure diagram of a voice intent recognition device in an embodiment of the present application.
- FIG. 7 is a hardware structure diagram of a speech intent recognition device in an embodiment of the present application.
- first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other.
- the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information without departing from the scope of the present application.
- the use of the word "if" can be interpreted as "at the time of" or "when" or "in response to determining", depending on the context.
- Machine learning is a way to realize artificial intelligence, which is used to study how computers simulate or realize human learning behaviors to acquire new knowledge or skills, and to reorganize existing knowledge structures to continuously improve their performance.
- Deep learning is a subcategory of machine learning and is the process of using mathematical models to model specific problems in the real world in order to solve similar problems in the field.
- a neural network is an implementation of deep learning. For convenience of description, this document takes the neural network as an example to introduce its structure and function; for other subclasses of machine learning, the structure and function are similar.
- Neural networks include but are not limited to convolutional neural networks (CNN for short), recurrent neural networks (RNN for short), fully connected networks, etc.
- the structural units of neural networks may include, but are not limited to, convolutional layers (Conv), pooling layers (Pool), excitation layers, fully connected layers (FC), etc.; there is no restriction on this.
- one or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully connected layers can be combined to construct a neural network according to different requirements.
- the input data features are enhanced by performing convolution operations on the input data features using a convolution kernel.
- the convolution kernel can be a matrix of size m*n; the input data features of the convolution layer are convolved with the convolution kernel to obtain the output data features of the convolution layer, and the convolution operation is actually a filtering process.
- the input data features (such as the output of the convolution layer) are subjected to operations such as taking the maximum value, the minimum value, and the average value, so as to use the principle of local correlation to sub-sample the input data features.
- the pooling layer operation is actually a downsampling process.
- an activation function (such as a nonlinear function) can be used to map the input data features, thereby introducing nonlinear factors, so that the neural network can enhance the expressive ability through nonlinear combination.
- the activation function may include, but is not limited to, a ReLU (Rectified Linear Units, rectified linear unit) function, where the ReLU function is used to set the features smaller than 0 to 0, while the features larger than 0 remain unchanged.
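As a minimal sketch of the behavior described above (the function name is illustrative, not from the application), the ReLU mapping over a list of input features can be written as:

```python
def relu(features):
    """Set every feature smaller than 0 to 0; features larger than 0 remain unchanged."""
    return [max(0.0, x) for x in features]

# The negative feature is clipped to 0, the others pass through.
print(relu([-1.5, 0.0, 2.3]))
```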
- all data features input to the fully connected layer are fully connected to obtain a feature vector, and the feature vector may include multiple data features.
- a network model is a model built using a machine learning algorithm (such as a deep learning algorithm), for example a model built using a neural network; that is, a network model can consist of one or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully connected layers.
- the untrained network model is called the initial network model
- the trained network model is called the target network model.
- the sample data is used to train the various network parameters in the initial network model, such as convolution layer parameters (such as convolution kernel parameters), pooling layer parameters, excitation layer parameters, fully connected layer parameters, etc.; this is not limited.
- the initial network model can fit the mapping relationship between input and output.
- the initial network model that has been trained is the target network model, and the speech intent is recognized through the target network model.
- a phoneme is the smallest phonetic unit divided according to the natural properties of speech. It is analyzed according to the pronunciation action in the syllable, and an action constitutes a phoneme.
- the Chinese syllable ah (a) has only one phoneme (a), love (ai) has two phonemes (a and i), dai (dai) has three phonemes (d, a and i), etc.
- the Chinese syllable tree (shumu) has five phonemes (s, h, u, m, and u).
- pinyin is a compound sound formed by combining more than one phoneme; for example, the three phonemes of dai (d, a, and i) make up one pinyin (dai), and the five phonemes of shumu (s, h, u, m, and u) form two pinyin (shu and mu).
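To illustrate the phoneme-to-pinyin relationship described above, the following sketch hard-codes the syllable boundaries for shumu; the application does not specify a segmentation algorithm, so the boundary list here is purely illustrative:

```python
phonemes = ["s", "h", "u", "m", "u"]   # the five phonemes of shumu
syllable_lengths = [3, 2]              # "shu" = s+h+u, "mu" = m+u (hard-coded for this example)

# Join consecutive phonemes into pinyin according to the boundary list.
pinyin = []
start = 0
for length in syllable_lengths:
    pinyin.append("".join(phonemes[start:start + length]))
    start += length

print(pinyin)  # the two pinyin formed from the five phonemes
```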
- the speech intent recognition method includes: a speech recognition stage and an intention recognition stage.
- the speech recognition stage the speech to be recognized is recognized by the automatic speech recognition technology, and the speech to be recognized is converted into text.
- the intent recognition stage the text is semantically understood through natural language processing technology, keywords are obtained, and the user's voice intent is identified based on the keywords.
- the accuracy depends on the accuracy of voice-to-text conversion, and the voice-to-text accuracy is relatively low, resulting in a very low accuracy of voice intent recognition and inability to accurately identify the user's voice intent.
- the speech intent is recognized based on the phoneme to be recognized, rather than the speech intent based on the text, so that it is not necessary to rely on the accuracy rate of converting speech into text.
- An embodiment of the present application proposes a voice intent recognition method, which can be applied to a human-computer interaction application scenario, and is mainly used to control a device according to the voice intent.
- the method can be applied to any device that needs to be controlled according to voice intent, such as access control devices, screen projection devices, IPC (IP Camera, network cameras), servers, smart terminals, robotic systems, air conditioning devices, etc.; this is not restricted.
- the training process of the initial network model is involved, and the identification process based on the target network model is involved.
- the initial network model can be trained to obtain the trained target network model.
- the recognition process based on the target network model the speech intent can be recognized based on the target network model.
- the training process of the initial network model and the recognition process based on the target network model can be implemented in the same device or in different devices. For example, implement the training process of the initial network model on device A, obtain the target network model, and recognize the speech intent based on the target network model.
- the training process of the initial network model is implemented on the device A1 to obtain the target network model, and the target network model is deployed to the device A2, and the device A2 recognizes the speech intent based on the target network model.
- an embodiment of the present application proposes a speech intent recognition method, which can realize the training of the initial network model, and the method includes:
- Step 101 Obtain a sample speech and a sample intent corresponding to the sample speech.
- a large number of sample voices may be obtained from historical data, and/or a large number of sample voices input by a user may be received, and the acquisition method is not limited, and the sample voices represent sounds produced when speaking. For example, if the sound produced when speaking is "turn on the air conditioner", the sample speech is "turn on the air conditioner".
- the speech intent corresponding to the sample speech may be obtained.
- the speech intent corresponding to the sample speech may be called a sample intent (ie, a sample speech intent).
- the sample intent may be "turn on the air conditioner".
- Step 102 Determine a sample phoneme set according to the sample speech.
- a sample phoneme set may be determined according to the sample speech, and the sample phoneme set may include a plurality of sample phonemes; the process of determining the sample phonemes according to the sample speech is to identify each sample phoneme from the sample speech.
- each recognized phoneme is called a sample phoneme; therefore, a plurality of sample phonemes can be recognized according to the sample speech. This recognition process is not limited, as long as a plurality of sample phonemes can be recognized according to the sample speech.
- the sample phoneme set may include the following sample phonemes: "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i".
- Step 103 Obtain a sample phoneme vector corresponding to the sample phoneme set.
- a phoneme feature value corresponding to the sample phoneme is determined, and based on the phoneme feature value corresponding to each sample phoneme, a sample phoneme vector corresponding to the sample phoneme set is obtained, the The sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
- the mapping relationship between each phoneme and its phoneme feature value is maintained in advance. Assuming that there are 50 phonemes in total, the mapping relationship between phoneme 1 and phoneme feature value 1, the mapping relationship between phoneme 2 and phoneme feature value 2, ..., and the mapping relationship between phoneme 50 and phoneme feature value 50 can be maintained.
- in step 103, for each sample phoneme in the sample phoneme set, the phoneme feature value corresponding to the sample phoneme can be obtained by querying the above mapping relationship, and the phoneme feature values corresponding to each sample phoneme in the sample phoneme set are combined to obtain the sample phoneme vector.
- the sample phoneme vector is a 15-dimensional feature vector
- the feature vector sequentially includes the phoneme feature value corresponding to "b", the phoneme feature value corresponding to "a", the phoneme feature value corresponding to "k", the phoneme feature value corresponding to "o", the phoneme feature value corresponding to "n", the phoneme feature value corresponding to "g", the phoneme feature value corresponding to "t", the phoneme feature value corresponding to "i", the phoneme feature value corresponding to "a", the phoneme feature value corresponding to "o", the phoneme feature value corresponding to "d", and so on for the remaining sample phonemes.
- all phonemes can be sorted. Assuming that there are 50 phonemes in total, the serial numbers of the 50 phonemes are 1-50 respectively.
- the phoneme feature value corresponding to each phoneme can be a 50-bit value. Assuming that the serial number of the phoneme is M, then in the phoneme feature value corresponding to that phoneme, the value of the Mth bit is the first value, and the values of the bits other than the Mth bit are the second value.
- in the phoneme feature value corresponding to the phoneme with serial number 1, the value of the 1st bit is the first value and the values of bits 2-50 are the second value; in the phoneme feature value corresponding to the phoneme with serial number 2, the value of the 2nd bit is the first value and the values of bit 1 and bits 3-50 are the second value, and so on.
- the sample phoneme vector can thus be a 15*50-dimensional feature vector; the feature vector includes 15 rows and 50 columns, and each row represents the phoneme feature value corresponding to one phoneme, which will not be repeated here.
- the first value and the second value can be configured according to experience, which is not limited, for example, the first value is 1, the second value is 0, or the first value is 0, The second value is 1, or the first value is 255 and the second value is 0, or the first value is 0 and the second value is 255.
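Under the description above (50 phonemes in total, with first value 1 and second value 0 chosen as one of the allowed configurations), the phoneme feature value and the sample phoneme vector could be built as follows; the function names are illustrative:

```python
NUM_PHONEMES = 50          # the description assumes 50 phonemes in total
FIRST_VALUE = 1            # one allowed configuration: first value 1, second value 0
SECOND_VALUE = 0

def phoneme_feature_value(serial_number):
    """For the phoneme with serial number M, bit M is the first value; all other bits are the second value."""
    value = [SECOND_VALUE] * NUM_PHONEMES
    value[serial_number - 1] = FIRST_VALUE   # serial numbers are 1-based
    return value

def sample_phoneme_vector(serial_numbers):
    """One row per sample phoneme, i.e. a len(serial_numbers)*50 feature vector."""
    return [phoneme_feature_value(m) for m in serial_numbers]

vector = sample_phoneme_vector([2, 1, 11])   # a 3*50 feature vector for three sample phonemes
```

Each row is a one-hot encoding of one phoneme, so the full sample phoneme set maps to a rows-by-50 matrix exactly as in the 15*50 example above.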
- Step 104 Input the sample phoneme vector and the sample intent corresponding to the sample phoneme vector to the initial network model, so as to train the initial network model through the sample phoneme vector and the sample intent, and obtain a trained target network model.
- the trained target network model is thus obtained, so the target network model can be used to record the mapping relationship between the phoneme vector and the speech intent.
- a large number of sample voices can be obtained, and for each sample voice, the sample intent corresponding to the sample voice and the sample phoneme vector corresponding to the sample phoneme set of that sample voice are obtained; that is, the sample phoneme vector and the sample intent corresponding to the sample voice are obtained, with the sample intent participating in training as the label information of the sample phoneme vector. Based on this, a large number of sample phoneme vectors and the sample intent (i.e., label information) corresponding to each sample phoneme vector can be input into the initial network model, so as to use the sample phoneme vectors and sample intents to train each network parameter in the initial network model; this training process is not limited. After the training of the initial network model is completed, the trained initial network model is the target network model.
- a large number of sample phoneme vectors and sample intents can be input to the first network layer of the initial network model, and the first network layer processes these data to obtain the output data of the first network layer.
- the output data of the first network layer is input to the second network layer of the initial network model, and so on, until the data is input to the last network layer of the initial network model; the data is processed by the last network layer to obtain output data, and this output data is denoted as the target feature vector.
- after the target feature vector is obtained, it is determined whether the initial network model has converged based on the target feature vector. If the initial network model has converged, the converged initial network model is determined as the trained target network model, and the training process of the initial network model is completed. If the initial network model has not converged, the network parameters of the unconverged initial network model are adjusted to obtain an adjusted initial network model.
- a large number of sample phoneme vectors and sample intents can then be input into the adjusted initial network model so as to retrain the adjusted initial network model; for the specific training process, refer to the above-mentioned embodiment, which is not repeated here. And so on, until the initial network model has converged, and the converged initial network model is determined as the trained target network model.
- determining whether the initial network model has converged based on the target feature vector may include, but is not limited to: pre-constructing a loss function, which is not limited and can be set according to experience. After the target feature vector is obtained, the loss value of the loss function can be determined according to the target feature vector. For example, the target feature vector can be substituted into the loss function to obtain the loss value of the loss function. After the loss value of the loss function is obtained, it is determined whether the initial network model has converged according to the loss value of the loss function.
- whether the initial network model has converged may be determined according to a single loss value: for example, a loss value 1 is obtained based on the target feature vector; if loss value 1 is not greater than a threshold, it is determined that the initial network model has converged, and if loss value 1 is greater than the threshold, it is determined that the initial network model has not converged. Or,
- whether the initial network model has converged can be determined according to multiple loss values over multiple iterations. For example, in each iteration, the initial network model of the previous iteration is adjusted to obtain the adjusted initial network model, and each iteration yields a loss value. A change-amplitude curve of the multiple loss values is determined; if the curve shows that the change amplitude of the loss value has stabilized (the loss value has not changed over multiple consecutive iterations, or the change amplitude is small) and the loss value of the last iteration is not greater than the threshold, it is determined that the initial network model of the last iteration has converged. Otherwise, it is determined that the initial network model of the last iteration has not converged; the next iteration is continued to obtain the next loss value, and the change-amplitude curve of the multiple loss values is re-determined.
- other methods may also be used to determine whether the initial network model has converged, and this is not limited. For example, if the number of iterations reaches a preset count threshold, the initial network model is determined to have converged; likewise, if the iteration duration reaches a preset duration threshold, it is determined to have converged.
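The convergence criteria described above (a loss-value threshold, a stable loss over several consecutive iterations, and an iteration-count cap) can be sketched as follows. This is an illustrative Python sketch; the function name and the concrete threshold values are hypothetical and, as the text notes, would be set empirically.

```python
def has_converged(loss_history, loss_threshold=0.05,
                  stable_window=5, stable_delta=1e-3, max_iters=1000):
    # Iteration-count criterion: converged once enough iterations have run.
    if len(loss_history) >= max_iters:
        return True
    # Latest-loss criterion: the last loss value must not exceed the threshold.
    if not loss_history or loss_history[-1] > loss_threshold:
        return False
    # Stability criterion: the loss must have changed only slightly
    # over the last few consecutive iterations.
    if len(loss_history) < stable_window:
        return False
    window = loss_history[-stable_window:]
    return max(window) - min(window) <= stable_delta
```

The three `if` branches correspond one-to-one to the three alternatives listed above; a real implementation might use any one of them in isolation.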
- the initial network model can be trained through the sample phoneme vector and the sample intent corresponding to the sample phoneme vector, so as to obtain the trained target network model.
- an embodiment of the present application proposes a voice intent recognition method, which can realize voice intent recognition, and the method includes:
- Step 201 Determine a phoneme set to be recognized according to the speech to be recognized.
- a set of to-be-recognized phonemes may be determined according to the to-be-recognized voice, and the to-be-recognized phoneme set may include a plurality of to-be-recognized phonemes.
- the process of determining the to-be-recognized phonemes is the process of recognizing each phoneme from the to-be-recognized speech; for ease of distinction, each recognized phoneme is called a to-be-recognized phoneme. A plurality of to-be-recognized phonemes can therefore be recognized from the to-be-recognized speech, and the recognition process itself is not restricted, as long as the plurality of to-be-recognized phonemes can be obtained.
- the to-be-recognized phoneme set may include the following to-be-recognized phonemes "k, a, i, k, o, n, g, t, i, a, o".
- Step 202 Obtain a phoneme vector to be recognized corresponding to the phoneme set to be recognized.
- for each to-be-recognized phoneme in the to-be-recognized phoneme set, the phoneme feature value corresponding to that phoneme is determined, and the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set is obtained based on the phoneme feature value corresponding to each to-be-recognized phoneme; the to-be-recognized phoneme vector includes the phoneme feature value corresponding to each to-be-recognized phoneme.
- the mapping relationship between each phoneme and its phoneme feature value is maintained in advance. Assuming there are 50 phonemes in total, the mapping relationships between phoneme 1 and phoneme feature value 1, between phoneme 2 and phoneme feature value 2, ..., and between phoneme 50 and phoneme feature value 50 can be maintained.
- in step 202, for each to-be-recognized phoneme in the to-be-recognized phoneme set, the phoneme feature value corresponding to that phoneme can be obtained by querying the above mapping relationship, and the phoneme feature values corresponding to all to-be-recognized phonemes in the set are combined to obtain the to-be-recognized phoneme vector.
- all phonemes can be sorted. Assuming that there are 50 phonemes in total, the serial numbers of the 50 phonemes are 1-50 respectively.
- the phoneme feature value corresponding to each phoneme can be a 50-bit value. Assuming the serial number of a phoneme is M, then in the phoneme feature value corresponding to that phoneme, the value of the Mth bit is the first value, and the values of all bits other than the Mth bit are the second value.
- for example, in the phoneme feature value corresponding to the phoneme with serial number 1, the value of the 1st bit is the first value and the values of bits 2 to 50 are the second value; in the phoneme feature value corresponding to the phoneme with serial number 2, the value of the 2nd bit is the first value and the values of bit 1 and bits 3 to 50 are the second value, and so on.
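The one-hot scheme above can be illustrated with a short sketch. For brevity it assumes a hypothetical 10-phoneme inventory rather than the 50 phonemes assumed in the text, and takes the first value to be 1 and the second value to be 0; the names `PHONEMES`, `phoneme_feature`, and `phoneme_vector` are illustrative.

```python
# Hypothetical abridged inventory of 10 phonemes (the text assumes 50);
# list positions stand in for serial numbers 1..10.
PHONEMES = ["a", "o", "e", "i", "u", "k", "n", "g", "t", "s"]

def phoneme_feature(phoneme, inventory=PHONEMES):
    # One-hot feature value: the bit at the phoneme's position is the
    # first value (1 here); every other bit is the second value (0 here).
    vec = [0] * len(inventory)
    vec[inventory.index(phoneme)] = 1
    return vec

def phoneme_vector(phonemes, inventory=PHONEMES):
    # Combine the feature value of each to-be-recognized phoneme in order.
    return [phoneme_feature(p, inventory) for p in phonemes]

vec = phoneme_vector(["k", "a", "i"])  # the phonemes of "kai"
```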
- Step 203 Input the to-be-recognized phoneme vector to the trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized phoneme vector.
- the target network model is used to record the mapping relationship between the phoneme vector and the speech intent. After the to-be-recognized phoneme vector is input to the target network model, the target network model can output the speech intent corresponding to the to-be-recognized phoneme vector.
- the phoneme vector to be recognized can be input to the first network layer of the target network model, and the phoneme vector to be recognized can be processed by the first network layer to obtain the output data of the first network layer.
- the output data of the first network layer is input to the second network layer of the target network model, and so on, until the data reaches the last network layer of the target network model; the last network layer processes the data to obtain output data, which is denoted as the target feature vector.
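The layer-by-layer forward pass described above can be sketched as follows. Each layer is modeled as a plain function for illustration only; a trained model would apply weight matrices and non-linear activations, which the text does not specify.

```python
def forward(model_layers, input_vector):
    data = input_vector
    for layer in model_layers:
        # The output data of each network layer feeds the next layer.
        data = layer(data)
    # The output of the last layer is the target feature vector.
    return data

# Two hypothetical layers: a scaling layer and a normalizing layer.
layers = [
    lambda v: [2 * x for x in v],
    lambda v: [x / sum(v) for x in v],
]
target_feature_vector = forward(layers, [1, 0, 1])
```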
- since the target network model is used to record the mapping relationship between the phoneme vector and the speech intent, after the target feature vector is obtained, the mapping relationship can be queried based on the target feature vector to obtain the speech intent corresponding to the target feature vector, which is the speech intent corresponding to the to-be-recognized phoneme vector; the target network model can therefore output the speech intent corresponding to the to-be-recognized phoneme vector.
- after the speech intent corresponding to the to-be-recognized phoneme vector is obtained, the device can be controlled based on that speech intent; the control method is not limited. For example, when the speech intent is "turn on the air conditioner", the air conditioner is turned on.
- when the target network model outputs the speech intent corresponding to the to-be-recognized phoneme vector, it can also output a probability value corresponding to each speech intent (a value between 0 and 1, which may also be referred to as a confidence). For example, the target network model can output speech intent 1 with probability value 1 (e.g. 0.8), speech intent 2 with probability value 2 (e.g. 0.1), speech intent 3 with probability value 3 (e.g. 0.08), and so on.
- the speech intent with the largest probability value can be used as the speech intent corresponding to the phoneme vector to be recognized, for example, the speech intent 1 with the largest probability value is used as the speech intent corresponding to the phoneme vector to be recognized.
- alternatively, the speech intent with the largest probability value is first determined, and it is determined whether that probability value (that is, the maximum probability value) is greater than a preset probability threshold. If so, that speech intent is used as the speech intent corresponding to the to-be-recognized phoneme vector; otherwise, there is no speech intent corresponding to the to-be-recognized phoneme vector.
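The two selection rules above (take the speech intent with the largest probability value, optionally gated by a probability threshold) can be sketched as follows; the function name and the threshold value 0.5 are hypothetical.

```python
def select_intent(intent_probs, prob_threshold=0.5):
    # Pick the speech intent with the largest probability value.
    intent, prob = max(intent_probs.items(), key=lambda kv: kv[1])
    # Only accept it if the maximum probability exceeds the threshold;
    # otherwise there is no speech intent for this input.
    return intent if prob > prob_threshold else None

probs = {"speech intent 1": 0.8, "speech intent 2": 0.1, "speech intent 3": 0.08}
best = select_intent(probs)
```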
- the speech intent is recognized based on the to-be-recognized phonemes rather than based on text, so the method does not need to rely on the accuracy of converting speech into text.
- since the to-be-recognized phonemes are determined from the to-be-recognized speech with high accuracy, speech intent recognition also has high accuracy; the user's speech intent can be accurately recognized, and the accuracy of speech intent recognition is effectively improved.
- for example, the user utters the to-be-recognized speech "I want to see photos with trees", and the phonemes determined by the terminal device (such as an IPC or a smartphone) based on the to-be-recognized speech are "w, o, x, i, a, n, g, k, a, n, y, o, u, s, h, u, m, u, d, e, z, h, a, o, p, i, a, n"; that is, the phonemes corresponding to "trees" are "s, h, u, m, u". The speech intent is determined based on these phonemes, without parsing the homophonous text "number" or "trees" from the to-be-recognized speech, which avoids relying on the text "number" or "trees" to determine the speech intent. This makes intent recognition more reliable and removes the need for a large language-model algorithm library for speech recognition, yielding a significant gain in performance and memory usage.
- the speech intent may also be recognized based on the pinyin to be recognized rather than based on text, so that the accuracy of converting speech into text likewise does not need to be relied upon.
- An embodiment of the present application proposes a voice intent recognition method, which can be applied to a human-computer interaction application scenario, and is mainly used to control a device according to the voice intent.
- the method can be applied to any device that needs to be controlled according to a speech intent, such as access control devices, screen projection devices, IPCs (IP Cameras, i.e. network cameras), servers, smart terminals, robotic systems, and air conditioning devices; this is not restricted.
- the training process of the initial network model may be involved, and the identification process based on the target network model may be involved.
- the initial network model can be trained to obtain the trained target network model.
- in the recognition process based on the target network model, the speech intent can be recognized based on the target network model.
- the training process of the initial network model and the identification process based on the target network model may be implemented on the same device, or may be implemented on different devices.
- an embodiment of the present application proposes a speech intent recognition method, which can realize the training of the initial network model, and the method includes:
- Step 301 Obtain a sample speech and a sample intent corresponding to the sample speech.
- a large number of sample voices may be obtained from historical data, and/or a large number of sample voices input by a user may be received; the acquisition method is not limited. A sample voice represents the sound produced when speaking; for example, if the sound produced when speaking is "turn on the air conditioner", the sample speech is "turn on the air conditioner".
- the speech intent corresponding to the sample speech may be obtained.
- the speech intent corresponding to the sample speech may be called a sample intent (ie, a sample speech intent).
- the sample intent may be "turn on the air conditioner”.
- Step 302 Determine a sample pinyin set according to the sample speech.
- a sample pinyin set may be determined according to the sample speech, and the sample pinyin set may include a plurality of sample pinyin; determining the sample pinyin according to the sample speech is the process of recognizing each pinyin from the sample speech.
- for ease of distinction, each recognized pinyin is called a sample pinyin. A plurality of sample pinyin can therefore be recognized according to the sample speech, and the recognition process is not restricted, as long as the plurality of sample pinyin can be recognized from the sample speech.
- the sample pinyin set may include the following sample pinyin "ba”, “kong”, “tiao”, “da”, and "kai”.
- Step 303 Obtain a sample pinyin vector corresponding to the sample pinyin set.
- the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
- in step 303, for each sample pinyin in the sample pinyin set, the pinyin feature value corresponding to that sample pinyin can be obtained by querying the maintained mapping relationship between pinyin and pinyin feature values, and the pinyin feature values corresponding to all sample pinyin in the set are combined to obtain the sample pinyin vector.
- for example, the sample pinyin vector can be a 5-dimensional feature vector that sequentially includes the pinyin feature value corresponding to "ba", the pinyin feature value corresponding to "kong", the pinyin feature value corresponding to "tiao", the pinyin feature value corresponding to "da", and the pinyin feature value corresponding to "kai".
- all pinyin can be sorted. Assuming there are 400 pinyin in total, the serial numbers of the 400 pinyin are 1 to 400, respectively. The pinyin feature value corresponding to each pinyin can be a 400-bit value. Assuming the serial number of a pinyin is N, then in the pinyin feature value corresponding to that pinyin, the value of the Nth bit is the first value, and the values of all bits other than the Nth bit are the second value.
- for example, in the pinyin feature value corresponding to the pinyin with serial number 1, the value of the 1st bit is the first value and the values of bits 2 to 400 are the second value; in the pinyin feature value corresponding to the pinyin with serial number 2, the value of the 2nd bit is the first value and the values of bit 1 and bits 3 to 400 are the second value, and so on.
- in this case, the sample pinyin vector can be a 5*400-dimensional feature vector with 5 rows and 400 columns, where each row is the pinyin feature value corresponding to one pinyin; details are not repeated here.
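The 5*400 sample pinyin vector for "ba, kong, tiao, da, kai" can be sketched as follows. The inventory below pads the five real syllables with hypothetical placeholder names to reach the assumed total of 400, and takes the first value to be 1 and the second value to be 0.

```python
# Hypothetical inventory: the 5 sample pinyin plus 395 placeholder syllables
# to reach the assumed total of 400 (serial numbers 1..400).
INVENTORY = ["ba", "kong", "tiao", "da", "kai"] + [f"py{i}" for i in range(395)]

def pinyin_feature(pinyin):
    # 400-bit one-hot pinyin feature value: the Nth bit is the first value
    # (1 here) for serial number N; all other bits are the second value (0).
    vec = [0] * len(INVENTORY)
    vec[INVENTORY.index(pinyin)] = 1
    return vec

sample_pinyin_set = ["ba", "kong", "tiao", "da", "kai"]
# 5 rows and 400 columns; each row is the feature value of one sample pinyin.
sample_pinyin_vector = [pinyin_feature(p) for p in sample_pinyin_set]
```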
- Step 304 Input the sample pinyin vector and the sample intent corresponding to the sample pinyin vector to the initial network model, so as to train the initial network model through the sample pinyin vector and the sample intent, and obtain the trained target network model.
- the initial network model is trained using the sample pinyin vector and the sample intent (that is, the sample speech intent) to obtain the trained target network model; the target network model can therefore be used to record the mapping relationship between the pinyin vector and the speech intent.
- in practice, a large number of sample voices can be obtained; for each sample voice, the sample intent corresponding to that sample voice and the sample pinyin vector corresponding to the sample pinyin set of that sample voice are obtained, that is, the sample pinyin vector and sample intent corresponding to the sample voice (the sample intent participates in training as the label information of the sample pinyin vector). Based on this, a large number of sample pinyin vectors and the sample intent (i.e. label information) corresponding to each sample pinyin vector can be input into the initial network model, so as to train each network parameter of the initial network model using the sample pinyin vectors and sample intents; the training process is not restricted. After training of the initial network model is completed, the trained initial network model is the target network model.
- for example, the sample pinyin vectors and sample intents can be input to the first network layer of the initial network model, and the first network layer processes the data to obtain its output data. The output data of the first network layer is input to the second network layer of the initial network model, and so on, until the data reaches the last network layer of the initial network model; the last network layer processes the data to obtain output data, which is denoted as the target feature vector.
- after the target feature vector is obtained, it is determined whether the initial network model has converged based on the target feature vector. If the initial network model has converged, the converged initial network model is determined to be the trained target network model, and the training process is complete. If the initial network model has not converged, the network parameters of the unconverged initial network model are adjusted to obtain an adjusted initial network model.
- then, the large number of sample pinyin vectors and sample intents can be input into the adjusted initial network model so that the adjusted initial network model is retrained; for the specific training process, refer to the above embodiment, which is not repeated here. This continues until the initial network model has converged, and the converged initial network model is determined to be the trained target network model.
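The iterate-adjust-retrain loop above can be sketched with a deliberately tiny stand-in model. The single weight, the mean-squared loss, and the gradient-descent update are all illustrative assumptions; the application does not prescribe a particular loss function or optimization method.

```python
def train(samples, lr=0.1, loss_threshold=1e-4, max_iters=1000):
    weight = 0.0  # stand-in for the initial network model's parameters
    for _ in range(max_iters):
        # Forward pass over all samples -> loss value for this iteration.
        loss = sum((weight * x - y) ** 2 for x, y in samples) / len(samples)
        if loss <= loss_threshold:
            break  # converged: this is the trained target model
        # Not converged: adjust the network parameter and iterate again.
        grad = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= lr * grad
    return weight

target_weight = train([(1.0, 2.0), (2.0, 4.0)])  # true mapping is y = 2x
```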
- determining whether the initial network model has converged based on the target feature vector may include, but is not limited to, the following: a loss function is constructed in advance; the type of loss function is not limited and can be chosen empirically. After the target feature vector is obtained, the loss value of the loss function is determined according to the target feature vector, for example, by substituting the target feature vector into the loss function. It is then determined, according to this loss value, whether the initial network model has converged.
- other methods may also be used to determine whether the initial network model has converged, and this is not limited. For example, if the number of iterations reaches a preset count threshold, the initial network model is determined to have converged; likewise, if the iteration duration reaches a preset duration threshold, it is determined to have converged.
- the initial network model can be trained through the sample pinyin vector and the sample intent corresponding to the sample pinyin vector, thereby obtaining the trained target network model.
- an embodiment of the present application proposes a voice intent recognition method, which can realize voice intent recognition, and the method includes:
- Step 401 Determine a set of pinyin to be recognized according to the speech to be recognized.
- a set of pinyin to be recognized may be determined according to the to-be-recognized speech, and the set may include a plurality of pinyin to be recognized. Determining the pinyin to be recognized according to the to-be-recognized speech is the process of recognizing each pinyin from the to-be-recognized speech; for ease of distinction, each recognized pinyin is called a pinyin to be recognized. A plurality of pinyin to be recognized can therefore be recognized according to the to-be-recognized speech, and the recognition process is not restricted, as long as the plurality of pinyin to be recognized can be obtained.
- the to-be-recognized pinyin set may include the following to-be-recognized pinyin "kai”, “kong", and "tiao".
- Step 402 Obtain a pinyin vector to be recognized corresponding to the pinyin set to be recognized.
- for each pinyin to be recognized in the set of pinyin to be recognized, the pinyin feature value corresponding to that pinyin is determined, and the to-be-recognized pinyin vector corresponding to the set is obtained based on the pinyin feature value corresponding to each pinyin to be recognized; the to-be-recognized pinyin vector includes the pinyin feature value corresponding to each pinyin to be recognized.
- in step 402, for each pinyin to be recognized in the set, the pinyin feature value corresponding to that pinyin can be obtained by querying the above-mentioned mapping relationship, and the pinyin feature values corresponding to all pinyin to be recognized in the set are combined to obtain the pinyin vector to be recognized.
- all pinyin can be sorted. Assuming there are 400 pinyin in total, the serial numbers of the 400 pinyin are 1 to 400, respectively. The pinyin feature value corresponding to each pinyin can be a 400-bit value. Assuming the serial number of a pinyin is N, then in the pinyin feature value corresponding to that pinyin, the value of the Nth bit is the first value, and the values of all bits other than the Nth bit are the second value.
- for example, in the pinyin feature value corresponding to the pinyin with serial number 1, the value of the 1st bit is the first value and the values of bits 2 to 400 are the second value; in the pinyin feature value corresponding to the pinyin with serial number 2, the value of the 2nd bit is the first value and the values of bit 1 and bits 3 to 400 are the second value, and so on.
- Step 403 Input the to-be-recognized pinyin vector to the trained target network model, so that the target network model outputs the speech intent corresponding to the to-be-recognized pinyin vector.
- the target network model is used to record the mapping relationship between the pinyin vector and the speech intent. After the to-be-recognized pinyin vector is input to the target network model, the target network model can output the speech intent corresponding to the to-be-recognized pinyin vector.
- the pinyin vector to be recognized can be input to the first network layer of the target network model, and the pinyin vector to be recognized can be processed by the first network layer to obtain the output data of the first network layer.
- the output data of the first network layer is input to the second network layer of the target network model, and so on, until the data reaches the last network layer of the target network model; the last network layer processes the data to obtain output data, which is denoted as the target feature vector.
- since the target network model is used to record the mapping relationship between the pinyin vector and the speech intent, after the target feature vector is obtained, the mapping relationship can be queried based on the target feature vector to obtain the speech intent corresponding to the target feature vector, which is the speech intent corresponding to the to-be-recognized pinyin vector; the target network model can therefore output the speech intent corresponding to the to-be-recognized pinyin vector.
- after the speech intent corresponding to the to-be-recognized pinyin vector is obtained, the device can be controlled based on that speech intent; the control method is not limited. For example, when the speech intent is "turn on the air conditioner", the air conditioner is turned on.
- when the target network model outputs the speech intent corresponding to the to-be-recognized pinyin vector, it can also output a probability value corresponding to each speech intent (a value between 0 and 1, which may also be referred to as a confidence). For example, the target network model can output speech intent 1 with probability value 1 (e.g. 0.8), speech intent 2 with probability value 2 (e.g. 0.1), speech intent 3 with probability value 3 (e.g. 0.08), and so on.
- the speech intent with the largest probability value can be used as the speech intent corresponding to the to-be-recognized pinyin vector. Alternatively, the speech intent with the largest probability value is first determined, and it is determined whether that probability value (that is, the maximum probability value) is greater than a preset probability threshold. If so, that speech intent is used as the speech intent corresponding to the to-be-recognized pinyin vector; otherwise, there is no speech intent corresponding to the to-be-recognized pinyin vector.
- the speech intent is recognized based on the pinyin to be recognized rather than based on text, so the method does not need to rely on the accuracy of converting speech into text.
- the accuracy of determining the pinyin to be recognized from the to-be-recognized speech is high, so the accuracy of speech intent recognition is high; the user's speech intent can therefore be accurately recognized, and the accuracy of speech intent recognition is effectively improved.
- for example, the user utters the to-be-recognized speech "I want to see photos with trees", and the pinyin determined by the terminal device (such as an IPC or a smartphone) based on the to-be-recognized speech is "wo, xiang, kan, you, shu, mu, de, zhao, pian"; that is, the pinyin corresponding to "trees" is "shu, mu". The speech intent can thus be determined based on the pinyin, without parsing the homophonous text "number" or "trees" from the to-be-recognized speech, which avoids relying on the text "number" or "trees" to determine the speech intent. This makes intent recognition more reliable and removes the need for a large language-model algorithm library for speech recognition, yielding a significant gain in performance and memory usage.
- the device may include:
- a determination module 511, configured to determine a phoneme set to be recognized according to the speech to be recognized;
- an obtaining module 512, configured to obtain the phoneme vector to be recognized corresponding to the phoneme set to be recognized;
- a processing module 513, configured to input the phoneme vector to be recognized to the trained target network model, so that the target network model outputs the speech intent corresponding to the phoneme vector to be recognized;
- the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
- the to-be-recognized phoneme set includes a plurality of to-be-recognized phonemes, and when the acquiring module 512 acquires the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set, it is specifically used for:
- for each to-be-recognized phoneme, determine the phoneme feature value corresponding to the to-be-recognized phoneme; based on the phoneme feature value corresponding to each to-be-recognized phoneme, obtain the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set, where the to-be-recognized phoneme vector includes the phoneme feature value corresponding to each to-be-recognized phoneme.
- the determining module 511 is further configured to: obtain the sample speech and the sample intent corresponding to the sample speech; determine a sample phoneme set according to the sample speech; the obtaining module 512 is further configured to: obtain the The sample phoneme vector corresponding to the sample phoneme set; the processing module 513 is further configured to: input the sample phoneme vector and the sample intent to the initial network model, and train the initial network model through the sample phoneme vector and the sample intent , to obtain the target network model.
- the sample phoneme set includes a plurality of sample phonemes, and when the obtaining module 512 obtains the sample phoneme vector corresponding to the sample phoneme set, it is specifically used for:
- a sample phoneme vector corresponding to the sample phoneme set is obtained, where the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
- the device may include:
- a determination module 521, configured to determine a set of pinyin to be recognized according to the speech to be recognized;
- an obtaining module 522, configured to obtain the pinyin vector to be recognized corresponding to the set of pinyin to be recognized;
- a processing module 523, configured to input the to-be-recognized pinyin vector to the trained target network model, so that the target network model outputs the speech intent corresponding to the to-be-recognized pinyin vector;
- the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
- the to-be-recognized pinyin set includes a plurality of to-be-recognized pinyin
- when the acquiring module 522 acquires the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set, it is specifically used for: for each pinyin to be recognized, determining the pinyin feature value corresponding to that pinyin; and, based on the pinyin feature value corresponding to each pinyin to be recognized, obtaining the to-be-recognized pinyin vector corresponding to the set, where the to-be-recognized pinyin vector includes the pinyin feature value corresponding to each pinyin to be recognized.
- the determining module 521 is further configured to: obtain the sample speech and the sample intent corresponding to the sample speech; determine a sample pinyin set according to the sample speech; the obtaining module 522 is further configured to: obtain the The sample pinyin vector corresponding to the sample pinyin set; the processing module 523 is further configured to: input the sample pinyin vector and the sample intent to the initial network model, and train the initial network model through the sample pinyin vector and the sample intent , to obtain the target network model.
- the sample pinyin set includes a plurality of sample pinyin, and when the obtaining module 522 obtains the sample pinyin vector corresponding to the sample pinyin set, it is specifically used for:
- a sample pinyin vector corresponding to the sample pinyin set is obtained, and the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
- the voice intent recognition device includes a processor 61 and a machine-readable storage medium 62; the machine-readable storage medium 62 stores machine-executable instructions that can be executed by the processor 61, and the processor 61 is configured to execute the machine-executable instructions to implement the following steps:
- the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
- the to-be-recognized phoneme set includes a plurality of to-be-recognized phonemes, and when the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set is obtained, the processor 61 is prompted to:
- a to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set is obtained, and the to-be-recognized phoneme vector includes the phoneme feature value corresponding to each to-be-recognized phoneme.
- the processor 61 is also prompted to:
- the sample phoneme vector and the sample intent are input to an initial network model, and the initial network model is trained by the sample phoneme vector and the sample intent to obtain the target network model.
- the sample phoneme set includes a plurality of sample phonemes, and when the sample phoneme vector corresponding to the sample phoneme set is obtained, the processor 61 is prompted to:
- a sample phoneme vector corresponding to the sample phoneme set is obtained, and the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
- the voice intent recognition device includes a processor 71 and a machine-readable storage medium 72; the machine-readable storage medium 72 stores machine-executable instructions that can be executed by the processor 71, and the processor 71 is configured to execute the machine-executable instructions to implement the following steps:
- the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
- the set of pinyin to be recognized includes a plurality of pinyin to be recognized, and when obtaining the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set, the processor 71 is caused to:
- a to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set is obtained, and the to-be-recognized pinyin vector includes the pinyin characteristic value corresponding to each to-be-recognized pinyin.
- the processor 71 is also caused to:
- the sample pinyin vector and the sample intent are input to the initial network model, and the initial network model is trained by the sample pinyin vector and the sample intent to obtain the target network model.
- the sample pinyin set includes a plurality of sample pinyin, and when obtaining the sample pinyin vector corresponding to the sample pinyin set, the processor 71 is caused to:
- a sample pinyin vector corresponding to the sample pinyin set is obtained, and the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
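The train-then-recognize flow that both devices implement can be illustrated with a toy stand-in for the target network model. Here a nearest-neighbour lookup plays the role of the trained model that "records the mapping relationship" between vectors and speech intents; the class name, distance metric, and intent labels are assumptions, not the claimed network.

```python
# A purely illustrative sketch of the train-then-recognize flow: the "model"
# is a nearest-neighbour lookup standing in for the trained target network
# model, which the application does not fully specify.
class ToyIntentModel:
    def __init__(self):
        self.samples = []  # (vector, intent) pairs

    def train(self, sample_vectors, sample_intents):
        """Record the mapping between sample vectors and sample intents."""
        self.samples = list(zip(sample_vectors, sample_intents))

    def recognize(self, vector):
        """Return the intent whose recorded sample vector is closest to the input."""
        def dist(v):
            return sum((a - b) ** 2 for a, b in zip(v, vector))
        return min(self.samples, key=lambda s: dist(s[0]))[1]

model = ToyIntentModel()
model.train([[1, 2, 3], [4, 5, 6]], ["open_door", "close_door"])
intent = model.recognize([1, 2, 4])  # closest to the first recorded sample
```

The same flow holds whether the vectors are built from phonemes (processor 61) or pinyin (processor 71); only the vectorization step differs.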
- an embodiment of the present application further provides a machine-readable storage medium storing several computer instructions; when the computer instructions are executed by a processor, the speech intent recognition method disclosed in the above embodiments of the present application can be implemented.
- the above-mentioned machine-readable storage medium may be any electronic, magnetic, optical or other physical storage device, which may contain or store information, such as executable instructions, data, and the like.
- the machine-readable storage medium can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as a compact disc or DVD), or similar storage media, or a combination thereof.
- a typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet, wearable device, or a combination of any of these devices.
- the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
- these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, where the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Claims (18)
- A speech intent recognition method, comprising: determining a to-be-recognized phoneme set according to to-be-recognized speech; obtaining a to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set; and inputting the to-be-recognized phoneme vector into a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized phoneme vector; wherein the target network model is used to record a mapping relationship between phoneme vectors and speech intents.
- The method according to claim 1, wherein the to-be-recognized phoneme set includes a plurality of to-be-recognized phonemes, and obtaining the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set comprises: for each to-be-recognized phoneme, determining a phoneme feature value corresponding to that to-be-recognized phoneme; and obtaining, based on the phoneme feature value corresponding to each to-be-recognized phoneme, the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set, wherein the to-be-recognized phoneme vector includes the phoneme feature value corresponding to each to-be-recognized phoneme.
- The method according to claim 1, wherein before inputting the to-be-recognized phoneme vector into the trained target network model so that the target network model outputs the speech intent corresponding to the to-be-recognized phoneme vector, the method further comprises: obtaining sample speech and a sample intent corresponding to the sample speech; determining a sample phoneme set according to the sample speech; obtaining a sample phoneme vector corresponding to the sample phoneme set; and inputting the sample phoneme vector and the sample intent into an initial network model, and training the initial network model with the sample phoneme vector and the sample intent to obtain the target network model.
- The method according to claim 3, wherein the sample phoneme set includes a plurality of sample phonemes, and obtaining the sample phoneme vector corresponding to the sample phoneme set comprises: for each sample phoneme, determining a phoneme feature value corresponding to that sample phoneme; and obtaining, based on the phoneme feature value corresponding to each sample phoneme, the sample phoneme vector corresponding to the sample phoneme set, wherein the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
- A speech intent recognition method, comprising: determining a to-be-recognized pinyin set according to to-be-recognized speech; obtaining a to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set; and inputting the to-be-recognized pinyin vector into a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized pinyin vector; wherein the target network model is used to record a mapping relationship between pinyin vectors and speech intents.
- The method according to claim 5, wherein the to-be-recognized pinyin set includes a plurality of to-be-recognized pinyin, and obtaining the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set comprises: for each to-be-recognized pinyin, determining a pinyin feature value corresponding to that to-be-recognized pinyin; and obtaining, based on the pinyin feature value corresponding to each to-be-recognized pinyin, the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set, wherein the to-be-recognized pinyin vector includes the pinyin feature value corresponding to each to-be-recognized pinyin.
- The method according to claim 5, wherein before inputting the to-be-recognized pinyin vector into the trained target network model so that the target network model outputs the speech intent corresponding to the to-be-recognized pinyin vector, the method further comprises: obtaining sample speech and a sample intent corresponding to the sample speech; determining a sample pinyin set according to the sample speech; obtaining a sample pinyin vector corresponding to the sample pinyin set; and inputting the sample pinyin vector and the sample intent into an initial network model, and training the initial network model with the sample pinyin vector and the sample intent to obtain the target network model.
- The method according to claim 7, wherein the sample pinyin set includes a plurality of sample pinyin, and obtaining the sample pinyin vector corresponding to the sample pinyin set comprises: for each sample pinyin, determining a pinyin feature value corresponding to that sample pinyin; and obtaining, based on the pinyin feature value corresponding to each sample pinyin, the sample pinyin vector corresponding to the sample pinyin set, wherein the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
- A speech intent recognition apparatus, comprising: a determining module configured to determine a to-be-recognized phoneme set according to to-be-recognized speech; an obtaining module configured to obtain a to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set; and a processing module configured to input the to-be-recognized phoneme vector into a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized phoneme vector; wherein the target network model is used to record a mapping relationship between phoneme vectors and speech intents.
- A speech intent recognition apparatus, comprising: a determining module configured to determine a to-be-recognized pinyin set according to to-be-recognized speech; an obtaining module configured to obtain a to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set; and a processing module configured to input the to-be-recognized pinyin vector into a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized pinyin vector; wherein the target network model is used to record a mapping relationship between pinyin vectors and speech intents.
- A speech intent recognition device, comprising: a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute the machine-executable instructions to implement the following steps: determining a to-be-recognized phoneme set according to to-be-recognized speech; obtaining a to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set; and inputting the to-be-recognized phoneme vector into a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized phoneme vector; wherein the target network model is used to record a mapping relationship between phoneme vectors and speech intents.
- The device according to claim 11, wherein the to-be-recognized phoneme set includes a plurality of to-be-recognized phonemes, and when obtaining the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set, the processor is caused to: for each to-be-recognized phoneme, determine a phoneme feature value corresponding to that to-be-recognized phoneme; and based on the phoneme feature value corresponding to each to-be-recognized phoneme, obtain the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set, wherein the to-be-recognized phoneme vector includes the phoneme feature value corresponding to each to-be-recognized phoneme.
- The device according to claim 11, wherein before inputting the to-be-recognized phoneme vector into the trained target network model so that the target network model outputs the speech intent corresponding to the to-be-recognized phoneme vector, the processor is further caused to: obtain sample speech and a sample intent corresponding to the sample speech; determine a sample phoneme set according to the sample speech; obtain a sample phoneme vector corresponding to the sample phoneme set; and input the sample phoneme vector and the sample intent into an initial network model, and train the initial network model with the sample phoneme vector and the sample intent to obtain the target network model.
- The device according to claim 13, wherein the sample phoneme set includes a plurality of sample phonemes, and when obtaining the sample phoneme vector corresponding to the sample phoneme set, the processor is caused to: for each sample phoneme, determine a phoneme feature value corresponding to that sample phoneme; and based on the phoneme feature value corresponding to each sample phoneme, obtain the sample phoneme vector corresponding to the sample phoneme set, wherein the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
- A speech intent recognition device, comprising: a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute the machine-executable instructions to implement the following steps: determining a to-be-recognized pinyin set according to to-be-recognized speech; obtaining a to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set; and inputting the to-be-recognized pinyin vector into a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized pinyin vector; wherein the target network model is used to record a mapping relationship between pinyin vectors and speech intents.
- The device according to claim 15, wherein the to-be-recognized pinyin set includes a plurality of to-be-recognized pinyin, and when obtaining the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set, the processor is caused to: for each to-be-recognized pinyin, determine a pinyin feature value corresponding to that to-be-recognized pinyin; and based on the pinyin feature value corresponding to each to-be-recognized pinyin, obtain the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set, wherein the to-be-recognized pinyin vector includes the pinyin feature value corresponding to each to-be-recognized pinyin.
- The device according to claim 15, wherein before inputting the to-be-recognized pinyin vector into the trained target network model so that the target network model outputs the speech intent corresponding to the to-be-recognized pinyin vector, the processor is further caused to: obtain sample speech and a sample intent corresponding to the sample speech; determine a sample pinyin set according to the sample speech; obtain a sample pinyin vector corresponding to the sample pinyin set; and input the sample pinyin vector and the sample intent into an initial network model, and train the initial network model with the sample pinyin vector and the sample intent to obtain the target network model.
- The device according to claim 17, wherein the sample pinyin set includes a plurality of sample pinyin, and when obtaining the sample pinyin vector corresponding to the sample pinyin set, the processor is caused to: for each sample pinyin, determine a pinyin feature value corresponding to that sample pinyin; and based on the pinyin feature value corresponding to each sample pinyin, obtain the sample pinyin vector corresponding to the sample pinyin set, wherein the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
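The three steps of claim 1 can be sketched end to end as follows. The acoustic front end (speech to phoneme set) and the trained target network model are mocked with plain dictionaries, since the claims leave both unspecified; every name and value below is an illustrative assumption, not the claimed implementation.

```python
# Claim 1's three steps, mocked end to end: (1) determine the phoneme set from
# the speech, (2) build the phoneme vector, (3) let the "model" output the
# intent. The recognizer front end and the model are stand-in dictionaries.
FRONT_END = {"turn on the light": ["d", "a", "k", "ai", "d", "eng"]}  # mock recognizer
PHONEME_ID = {p: i + 1 for i, p in enumerate(["d", "a", "k", "ai", "deng", "eng"])}

def recognize_intent(speech, model):
    phonemes = FRONT_END[speech]                            # step 1: phoneme set
    vector = tuple(PHONEME_ID.get(p, 0) for p in phonemes)  # step 2: phoneme vector
    return model[vector]                                    # step 3: vector -> intent

MODEL = {(1, 2, 3, 4, 1, 6): "light_on"}  # recorded vector-to-intent mapping
```

Claim 5's pinyin variant follows the same shape, with a pinyin table in place of the phoneme table.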
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010785605.1A CN111986653A (zh) | 2020-08-06 | 2020-08-06 | 一种语音意图识别方法、装置及设备 |
CN202010785605.1 | 2020-08-06 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022028378A1 true WO2022028378A1 (zh) | 2022-02-10 |
Family
ID=73444526
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/110134 WO2022028378A1 (zh) | 2020-08-06 | 2021-08-02 | 语音意图识别方法、装置及设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111986653A (zh) |
WO (1) | WO2022028378A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111986653A (zh) * | 2020-08-06 | 2020-11-24 | 杭州海康威视数字技术股份有限公司 | 一种语音意图识别方法、装置及设备 |
CN113836945B (zh) * | 2021-09-23 | 2024-04-16 | 平安科技(深圳)有限公司 | 意图识别方法、装置、电子设备和存储介质 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6408271B1 (en) * | 1999-09-24 | 2002-06-18 | Nortel Networks Limited | Method and apparatus for generating phrasal transcriptions |
CN107357875A (zh) * | 2017-07-04 | 2017-11-17 | 北京奇艺世纪科技有限公司 | 一种语音搜索方法、装置及电子设备 |
CN108549637A (zh) * | 2018-04-19 | 2018-09-18 | 京东方科技集团股份有限公司 | 基于拼音的语义识别方法、装置以及人机对话系统 |
CN110674314A (zh) * | 2019-09-27 | 2020-01-10 | 北京百度网讯科技有限公司 | 语句识别方法及装置 |
CN111081219A (zh) * | 2020-01-19 | 2020-04-28 | 南京硅基智能科技有限公司 | 一种端到端的语音意图识别方法 |
CN111986653A (zh) * | 2020-08-06 | 2020-11-24 | 杭州海康威视数字技术股份有限公司 | 一种语音意图识别方法、装置及设备 |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08227410A (ja) * | 1994-12-22 | 1996-09-03 | Just Syst Corp | ニューラルネットワークの学習方法、ニューラルネットワークおよびニューラルネットワークを利用した音声認識装置 |
CN109754789B (zh) * | 2017-11-07 | 2021-06-08 | 北京国双科技有限公司 | 语音音素的识别方法及装置 |
CN110767214A (zh) * | 2018-07-27 | 2020-02-07 | 杭州海康威视数字技术股份有限公司 | 语音识别方法及其装置和语音识别系统 |
CN110808050B (zh) * | 2018-08-03 | 2024-04-30 | 蔚来(安徽)控股有限公司 | 语音识别方法及智能设备 |
CN110931000B (zh) * | 2018-09-20 | 2022-08-02 | 杭州海康威视数字技术股份有限公司 | 语音识别的方法和装置 |
CN109829153A (zh) * | 2019-01-04 | 2019-05-31 | 平安科技(深圳)有限公司 | 基于卷积神经网络的意图识别方法、装置、设备及介质 |
KR20200091738A (ko) * | 2019-01-23 | 2020-07-31 | 주식회사 케이티 | 핵심어 검출 장치, 이를 이용한 핵심어 검출 방법 및 컴퓨터 프로그램 |
CN110415687B (zh) * | 2019-05-21 | 2021-04-13 | 腾讯科技(深圳)有限公司 | 语音处理方法、装置、介质、电子设备 |
CN110349567B (zh) * | 2019-08-12 | 2022-09-13 | 腾讯科技(深圳)有限公司 | 语音信号的识别方法和装置、存储介质及电子装置 |
KR102321798B1 (ko) * | 2019-08-15 | 2021-11-05 | 엘지전자 주식회사 | 인공 신경망 기반의 음성 인식 모델을 학습시키는 방법 및 음성 인식 디바이스 |
CN110610707B (zh) * | 2019-09-20 | 2022-04-22 | 科大讯飞股份有限公司 | 语音关键词识别方法、装置、电子设备和存储介质 |
CN111243603B (zh) * | 2020-01-09 | 2022-12-06 | 厦门快商通科技股份有限公司 | 声纹识别方法、系统、移动终端及存储介质 |
CN111274797A (zh) * | 2020-01-13 | 2020-06-12 | 平安国际智慧城市科技股份有限公司 | 用于终端的意图识别方法、装置、设备及存储介质 |
-
2020
- 2020-08-06 CN CN202010785605.1A patent/CN111986653A/zh active Pending
-
2021
- 2021-08-02 WO PCT/CN2021/110134 patent/WO2022028378A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN111986653A (zh) | 2020-11-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21852323 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21852323 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.08.2023) |
|