WO2022028378A1 - Speech intent recognition method, apparatus, and device - Google Patents

Speech intent recognition method, apparatus, and device

Info

Publication number: WO2022028378A1
Authority: WO (WIPO (PCT))
Prior art keywords: pinyin, sample, recognized, phoneme, vector
Application number: PCT/CN2021/110134
Other languages: English (en), French (fr)
Inventor: 陈展 (Chen Zhan)
Original Assignee: 杭州海康威视数字技术股份有限公司 (Hangzhou Hikvision Digital Technology Co., Ltd.)
Application filed by 杭州海康威视数字技术股份有限公司 (Hangzhou Hikvision Digital Technology Co., Ltd.)
Publication of WO2022028378A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present disclosure relates to the field of voice interaction, and in particular, to a voice intent recognition method, apparatus, and device.
  • voice interaction has become an important bridge for communication between humans and machines.
  • the robotic system needs to talk to the user and complete specific tasks.
  • One of the core technologies is the recognition of voice intent. That is, after the user inputs the voice to be recognized to the robotic system, the robotic system can determine the voice intent of the user through the voice to be recognized.
  • the speech intent recognition method includes: a speech recognition stage and an intention recognition stage.
  • in the speech recognition stage, the speech to be recognized is recognized by Automatic Speech Recognition (ASR) technology, and the to-be-recognized speech is converted into text.
  • in the intent recognition stage, the text is semantically understood by natural language processing (NLP) technology to obtain keyword information, and based on the keyword information, the user's voice intent is identified.
  • the accuracy of the above text-based intent recognition method depends heavily on the accuracy of speech-to-text conversion. Because the accuracy of speech-to-text conversion is relatively low, the accuracy of speech intent recognition is very low, and the user's voice intent cannot be identified accurately.
  • for example, the speech may contain "trees" (in Chinese, 树木, shùmù), but when the speech is converted to text, the text content may be the homophone "number" (数目, shùmù), which leads to wrong recognition of the speech intent.
  • the present application provides a speech intent recognition method, including: determining a phoneme set to be recognized according to the speech to be recognized; acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized; and inputting the to-be-recognized phoneme vector into a trained target network model, so that the target network model outputs the speech intent corresponding to the to-be-recognized phoneme vector.
  • the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
  • the to-be-recognized phoneme set includes a plurality of to-be-recognized phonemes
  • the acquiring a to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set includes:
  • for each to-be-recognized phoneme, determining the phoneme feature value corresponding to the to-be-recognized phoneme; and based on the phoneme feature value corresponding to each to-be-recognized phoneme, obtaining the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set.
  • the to-be-recognized phoneme vector includes the phoneme feature value corresponding to each of the to-be-recognized phonemes.
  • before inputting the to-be-recognized phoneme vector into the trained target network model so that the target network model outputs the speech intent corresponding to the to-be-recognized phoneme vector, the method further includes:
  • the sample phoneme vector and the sample intent are input to an initial network model, and the initial network model is trained by the sample phoneme vector and the sample intent to obtain the target network model.
  • the sample phoneme set includes a plurality of sample phonemes
  • the obtaining a sample phoneme vector corresponding to the sample phoneme set includes:
  • a sample phoneme vector corresponding to the sample phoneme set is obtained, and the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
  • the present application provides a speech intent recognition method, including: determining a set of pinyin to be recognized according to the speech to be recognized; acquiring a pinyin vector to be recognized corresponding to the set of pinyin to be recognized; and inputting the to-be-recognized pinyin vector into a trained target network model, so that the target network model outputs the speech intent corresponding to the to-be-recognized pinyin vector.
  • the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
  • the set of pinyin to be recognized includes a plurality of pinyin to be recognized
  • the acquiring the pinyin vector to be recognized corresponding to the set of pinyin to be recognized includes:
  • for each to-be-recognized pinyin, determining the pinyin feature value corresponding to the to-be-recognized pinyin; and based on the pinyin feature value corresponding to each to-be-recognized pinyin, obtaining the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set.
  • the to-be-recognized pinyin vector includes the pinyin feature value corresponding to each to-be-recognized pinyin.
  • before inputting the to-be-recognized pinyin vector into the trained target network model so that the target network model outputs the speech intent corresponding to the to-be-recognized pinyin vector, the method further includes:
  • the sample pinyin vector and the sample intent are input to an initial network model, and the initial network model is trained by the sample pinyin vector and the sample intent to obtain the target network model.
  • the sample pinyin set includes a plurality of sample pinyin
  • the acquiring a sample pinyin vector corresponding to the sample pinyin set includes:
  • a sample pinyin vector corresponding to the sample pinyin set is obtained, and the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
  • the present application provides a voice intent recognition device, including:
  • a determining module configured to determine a phoneme set to be recognized according to the speech to be recognized
  • an acquisition module configured to acquire a phoneme vector to be recognized corresponding to the phoneme set to be recognized
  • a processing module configured to input the to-be-recognized phoneme vector to a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized phoneme vector;
  • the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
  • the present application provides a voice intent recognition device, including:
  • a determination module used for determining a set of pinyin to be recognized according to the speech to be recognized
  • an acquisition module used for acquiring the pinyin vector to be recognized corresponding to the set of pinyin to be recognized
  • a processing module configured to input the to-be-recognized pinyin vector to a trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized pinyin vector;
  • the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
  • the present application provides a speech intent recognition device, including: a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions that can be executed by the processor; the processor is configured to execute the machine-executable instructions to implement the following steps:
  • the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
  • the present application provides a speech intent recognition device, including: a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions that can be executed by the processor; the processor is configured to execute the machine-executable instructions to implement the following steps:
  • the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
  • the speech intent is recognized based on the phonemes to be recognized, not based on text, so recognition does not need to rely on the accuracy of converting speech into text.
  • phonemes are the smallest units of speech, divided according to the natural attributes of speech and analyzed based on pronunciation actions; one action constitutes one phoneme. Therefore, the accuracy of determining the phonemes to be recognized from the speech to be recognized is very high, so the accuracy of speech intent recognition is also very high: the user's voice intent can be identified accurately, the accuracy of voice intent recognition is effectively improved, and intent recognition becomes more reliable. Moreover, this approach does not require the large language model algorithm libraries used for speech recognition, resulting in significant performance and memory optimization.
  • FIG. 1 is a schematic flowchart of a speech intent recognition method in an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a voice intent recognition method in an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a voice intent recognition method in an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a voice intent recognition method in an embodiment of the present application.
  • FIG. 5A is a schematic structural diagram of an apparatus for recognizing speech intent in an embodiment of the present application.
  • FIG. 5B is a schematic structural diagram of an apparatus for recognizing speech intent in an embodiment of the present application.
  • FIG. 6 is a hardware structure diagram of a voice intent recognition device in an embodiment of the present application.
  • FIG. 7 is a hardware structure diagram of a speech intent recognition device in an embodiment of the present application.
  • first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other.
  • the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information without departing from the scope of the present application.
  • depending on the context, the word "if" can be interpreted as "at the time of", "when", or "in response to determining".
  • Machine learning is a way to realize artificial intelligence, which is used to study how computers simulate or realize human learning behaviors to acquire new knowledge or skills, and to reorganize existing knowledge structures to continuously improve their performance.
  • Deep learning is a subcategory of machine learning and is the process of using mathematical models to model specific problems in the real world in order to solve similar problems in the field.
  • Neural network is the implementation of deep learning. For the convenience of description, this paper takes neural network as an example to introduce the structure and function of neural network. For other subclasses of machine learning, the structure and function of neural network are similar.
  • Neural networks include but are not limited to convolutional neural networks (CNN for short), recurrent neural networks (RNN for short), fully connected networks, etc.
  • the structural units of neural networks may include, but are not limited to, convolutional layers (Conv), pooling layers (Pool), excitation layers, fully connected layers (FC), etc.; there is no restriction on this.
  • one or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully connected layers can be combined to construct a neural network according to different requirements.
  • the input data features are enhanced by performing convolution operations on the input data features using a convolution kernel.
  • the convolution kernel can be a matrix of size m*n; by convolving the input data features of the convolutional layer with the convolution kernel, the output data features of the convolutional layer can be obtained. The convolution operation is actually a filtering process.
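As a concrete illustration of this filtering view, the sketch below implements a plain 2-D convolution in NumPy (stride 1, no padding); the feature map and the 2*2 averaging kernel are made-up examples, not values from the patent:

```python
import numpy as np

def conv2d(features, kernel):
    """Slide an m*n kernel over a 2-D feature map (no padding, stride 1)."""
    m, n = kernel.shape
    rows = features.shape[0] - m + 1
    cols = features.shape[1] - n + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # Each output value is a weighted sum over a local patch: a filtering step.
            out[i, j] = np.sum(features[i:i + m, j:j + n] * kernel)
    return out

features = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2)) / 4.0   # a 2*2 averaging (smoothing) filter
print(conv2d(features, kernel).shape)  # (3, 3)
```

With a 4*4 input and a 2*2 kernel, the output is 3*3, and each output value is the mean of the corresponding 2*2 input patch.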
  • the input data features (such as the output of the convolution layer) are subjected to operations such as taking the maximum value, the minimum value, and the average value, so as to use the principle of local correlation to sub-sample the input data features.
  • the pooling layer operation is actually a downsampling process.
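A minimal sketch of this downsampling, assuming non-overlapping windows and max-value pooling (the feature map is illustrative; min or average pooling would replace `.max()` accordingly):

```python
import numpy as np

def max_pool(features, size=2):
    """Downsample a 2-D feature map by taking the max of each size*size window."""
    rows, cols = features.shape
    out = np.zeros((rows // size, cols // size))
    for i in range(0, rows - size + 1, size):
        for j in range(0, cols - size + 1, size):
            # Local correlation: one representative value per window.
            out[i // size, j // size] = features[i:i + size, j:j + size].max()
    return out

features = np.array([[1., 3., 2., 0.],
                     [4., 2., 1., 5.],
                     [0., 1., 8., 6.],
                     [2., 3., 7., 4.]])
print(max_pool(features))  # [[4. 5.] [3. 8.]]
```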
  • an activation function (such as a nonlinear function) can be used to map the input data features, thereby introducing nonlinear factors, so that the neural network can enhance the expressive ability through nonlinear combination.
  • the activation function may include, but is not limited to, a ReLU (Rectified Linear Unit) function, where the ReLU function sets features smaller than 0 to 0, while features larger than 0 remain unchanged.
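The ReLU mapping described above can be sketched in one line of NumPy (the input values are illustrative):

```python
import numpy as np

def relu(features):
    # Set features smaller than 0 to 0; features larger than 0 stay unchanged.
    return np.maximum(features, 0.0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```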
  • all data features input to the fully connected layer are fully connected to obtain a feature vector, and the feature vector may include multiple data features.
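A minimal sketch of the full connection, assuming the input features are flattened and combined into an output feature vector by a single weight matrix and bias (all values illustrative):

```python
import numpy as np

def fully_connected(features, weights, bias):
    """Flatten all input features and connect each one to every output unit."""
    x = features.reshape(-1)      # full connection starts from a flat vector
    return weights @ x + bias     # one weighted sum per output feature

features = np.ones((2, 2))        # e.g. a pooled 2*2 feature map
weights = np.full((3, 4), 0.5)    # 3 output units, 4 flattened inputs
bias = np.zeros(3)
print(fully_connected(features, weights, bias))  # [2. 2. 2.]
```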
  • Network model: a model built using a machine learning algorithm (such as a deep learning algorithm), for example, a model built using a neural network. That is, a network model can consist of one or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully connected layers.
  • the untrained network model is called the initial network model
  • the trained network model is called the target network model.
  • the sample data is used to train the various network parameters in the initial network model, such as convolutional layer parameters (e.g., convolution kernel parameters), pooling layer parameters, excitation layer parameters, and fully connected layer parameters; this is not limited here.
  • the initial network model can fit the mapping relationship between input and output.
  • the initial network model that has been trained is the target network model, and the speech intent is recognized through the target network model.
  • a phoneme is the smallest phonetic unit divided according to the natural properties of speech. It is analyzed according to the pronunciation action in the syllable, and an action constitutes a phoneme.
  • for example, the Chinese syllable "ah" (a) has only one phoneme (a), "love" (ai) has two phonemes (a and i), dai has three phonemes (d, a, and i), etc.
  • the Chinese word for "tree" (shumu) has five phonemes (s, h, u, m, and u).
  • Pinyin combines more than one phoneme into a compound sound. For example, the three phonemes of dai (d, a, and i) make up one pinyin (dai), and the five phonemes of "tree" (shumu) (s, h, u, m, and u) form two pinyin (shu and mu).
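The decompositions above can be expressed as a small lookup table; the sketch below covers only the syllables used in these examples and is in no way a complete pinyin inventory:

```python
# Illustrative phoneme decompositions from the examples above (not a full pinyin table).
PHONEMES = {
    "a":   ["a"],             # one phoneme
    "ai":  ["a", "i"],        # two phonemes
    "dai": ["d", "a", "i"],   # three phonemes forming one pinyin
    "shu": ["s", "h", "u"],
    "mu":  ["m", "u"],
}

def word_to_phonemes(pinyin_syllables):
    """Concatenate per-syllable phonemes, e.g. 'tree' (shumu) -> s, h, u, m, u."""
    phonemes = []
    for syllable in pinyin_syllables:
        phonemes.extend(PHONEMES[syllable])
    return phonemes

print(word_to_phonemes(["shu", "mu"]))  # ['s', 'h', 'u', 'm', 'u']
```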
  • the speech intent recognition method includes: a speech recognition stage and an intention recognition stage.
  • the speech recognition stage the speech to be recognized is recognized by the automatic speech recognition technology, and the speech to be recognized is converted into text.
  • the intent recognition stage the text is semantically understood through natural language processing technology, keywords are obtained, and the user's voice intent is identified based on the keywords.
  • the accuracy depends on the accuracy of voice-to-text conversion, and the voice-to-text accuracy is relatively low, resulting in a very low accuracy of voice intent recognition and inability to accurately identify the user's voice intent.
  • in the embodiments of the present application, the speech intent is recognized based on the phonemes to be recognized rather than based on text, so it is not necessary to rely on the accuracy of converting speech into text.
  • An embodiment of the present application proposes a voice intent recognition method, which can be applied to a human-computer interaction application scenario, and is mainly used to control a device according to the voice intent.
  • the method can be applied to any device that needs to be controlled according to voice intent, such as access control devices, screen projection devices, IPC (IP Camera, network cameras), servers, smart terminals, robotic systems, air conditioning devices, etc. No restrictions.
  • the embodiments of the present application involve the training process of the initial network model and the recognition process based on the target network model.
  • the initial network model can be trained to obtain the trained target network model.
  • the recognition process based on the target network model the speech intent can be recognized based on the target network model.
  • the training process of the initial network model and the recognition process based on the target network model can be implemented in the same device or in different devices. For example, implement the training process of the initial network model on device A, obtain the target network model, and recognize the speech intent based on the target network model.
  • the training process of the initial network model is implemented on the device A1 to obtain the target network model, and the target network model is deployed to the device A2, and the device A2 recognizes the speech intent based on the target network model.
  • an embodiment of the present application proposes a speech intent recognition method, which can realize the training of the initial network model, and the method includes:
  • Step 101 Obtain a sample speech and a sample intent corresponding to the sample speech.
  • a large number of sample voices may be obtained from historical data, and/or a large number of sample voices input by a user may be received, and the acquisition method is not limited, and the sample voices represent sounds produced when speaking. For example, if the sound produced when speaking is "turn on the air conditioner", the sample speech is "turn on the air conditioner".
  • the speech intent corresponding to the sample speech may be obtained.
  • the speech intent corresponding to the sample speech may be called a sample intent (ie, a sample speech intent).
  • a sample intent ie, a sample speech intent
  • the sample intent may be "turn on the air conditioner”.
  • Step 102 Determine a sample phoneme set according to the sample speech.
  • a sample phoneme set may be determined according to the sample speech, and the sample phoneme set may include a plurality of sample phonemes, and the process of determining the sample phonemes according to the sample speech is to identify each sample phoneme from the sample speech.
  • each recognized phoneme is called a sample phoneme; therefore, a plurality of sample phonemes can be recognized from the sample speech. This recognition process is not limited, as long as a plurality of sample phonemes can be recognized from the sample speech.
  • the sample phoneme set may include the following sample phonemes: "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i".
  • Step 103 Obtain a sample phoneme vector corresponding to the sample phoneme set.
  • for each sample phoneme in the sample phoneme set, the phoneme feature value corresponding to the sample phoneme is determined, and based on the phoneme feature value corresponding to each sample phoneme, the sample phoneme vector corresponding to the sample phoneme set is obtained; the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
  • the mapping relationship between each phoneme and its phoneme feature value is maintained in advance for all phonemes. Assuming there are 50 phonemes in total, the mapping relationship between phoneme 1 and phoneme feature value 1, the mapping relationship between phoneme 2 and phoneme feature value 2, ..., and the mapping relationship between phoneme 50 and phoneme feature value 50 can be maintained.
  • in step 103, for each sample phoneme in the sample phoneme set, the phoneme feature value corresponding to the sample phoneme can be obtained by querying the above mapping relationship, and the phoneme feature values corresponding to the sample phonemes in the sample phoneme set are combined to obtain the sample phoneme vector.
  • for example, the sample phoneme vector is a 15-dimensional feature vector that sequentially includes the phoneme feature values corresponding to "b", "a", "k", "o", "n", "g", "t", "i", "a", "o", "d", "a", "k", "a", and "i".
  • all phonemes can be sorted. Assuming that there are 50 phonemes in total, the serial numbers of the 50 phonemes are 1-50 respectively.
  • the phoneme feature value corresponding to each phoneme can be a 50-bit value. Assuming the serial number of a phoneme is M, then in the phoneme feature value corresponding to that phoneme, the value of the M-th bit is the first value, and the values of the bits other than the M-th bit are the second value.
  • for example, in the phoneme feature value corresponding to the phoneme with serial number 1, the value of bit 1 is the first value and the values of bits 2-50 are the second value; in the phoneme feature value corresponding to the phoneme with serial number 2, the value of bit 2 is the first value and the values of bit 1 and bits 3-50 are the second value; and so on.
  • in this case, the sample phoneme vector can be a 15*50-dimensional feature vector; the feature vector includes 15 rows and 50 columns, and each row represents the phoneme feature value corresponding to one phoneme, which will not be repeated here.
  • the first value and the second value can be configured according to experience, which is not limited, for example, the first value is 1, the second value is 0, or the first value is 0, The second value is 1, or the first value is 255 and the second value is 0, or the first value is 0 and the second value is 255.
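The feature-value scheme above amounts to one-hot encoding each phoneme by its serial number. The sketch below uses a hypothetical 9-phoneme inventory (so the feature values are 9-bit rather than 50-bit); the ordering of `PHONEME_LIST` and the first/second values of 1 and 0 are assumptions for illustration:

```python
import numpy as np

# Hypothetical ordered phoneme inventory; the patent assumes 50 phonemes in total,
# but a short list is enough to illustrate the encoding.
PHONEME_LIST = ["a", "b", "d", "g", "i", "k", "n", "o", "t"]
NUM_PHONEMES = len(PHONEME_LIST)
# Serial numbers 1..N map each phoneme to its feature value.
SERIAL = {p: i + 1 for i, p in enumerate(PHONEME_LIST)}

def phoneme_feature_value(phoneme, first=1, second=0):
    """N-bit value: bit M (the phoneme's serial number) holds the first value,
    all other bits hold the second value."""
    value = np.full(NUM_PHONEMES, second)
    value[SERIAL[phoneme] - 1] = first
    return value

def phoneme_vector(phoneme_set):
    """Stack one feature value per phoneme: a len(phoneme_set)*N feature matrix."""
    return np.stack([phoneme_feature_value(p) for p in phoneme_set])

sample = ["b", "a", "k", "o", "n", "g", "t", "i", "a", "o", "d", "a", "k", "a", "i"]
print(phoneme_vector(sample).shape)  # (15, 9)
```

With the full 50-phoneme inventory, the same code would yield the 15*50 feature matrix described above.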
  • Step 104 Input the sample phoneme vector and the sample intent corresponding to the sample phoneme vector to the initial network model, so as to train the initial network model through the sample phoneme vector and the sample intent, and obtain a trained target network model.
  • after the training of the initial network model is completed, the trained target network model is obtained; therefore, the target network model can be used to record the mapping relationship between the phoneme vector and the speech intent.
  • during training, a large number of sample voices can be obtained. For each sample voice, the sample intent corresponding to the sample voice is obtained, together with the sample phoneme vector corresponding to the sample phoneme set of that sample voice; that is, the sample phoneme vector and the sample intent corresponding to the sample voice are obtained (the sample intent participates in training as the label information of the sample phoneme vector). Based on this, a large number of sample phoneme vectors and the sample intent (i.e., label information) corresponding to each sample phoneme vector can be input into the initial network model, so as to train each network parameter of the initial network model with the sample phoneme vectors and sample intents; this training process is not limited. After the training of the initial network model is completed, the trained initial network model is the target network model.
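Since the patent does not fix the architecture of the initial network model, the sketch below stands in a minimal softmax classifier trained by gradient descent: sample phoneme vectors are reduced to fixed-length feature vectors, and sample intents are integer labels. The data, labels, learning rate, and iteration count are all illustrative, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: each row stands in for a (pooled) sample phoneme vector;
# each label stands in for a sample intent (e.g. 0 = "turn on the air conditioner").
NUM_FEATURES, NUM_INTENTS = 9, 2
X = rng.normal(size=(40, NUM_FEATURES))
y = (X[:, 0] > 0).astype(int)          # stand-in intent labels

W = np.zeros((NUM_FEATURES, NUM_INTENTS))  # network parameters to train
b = np.zeros(NUM_INTENTS)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for _ in range(200):                   # iterate a fixed number of training steps
    probs = softmax(X @ W + b)
    onehot = np.eye(NUM_INTENTS)[y]    # sample intents as label information
    grad = probs - onehot              # gradient of the cross-entropy loss
    W -= 0.1 * X.T @ grad / len(X)     # adjust the network parameters
    b -= 0.1 * grad.mean(axis=0)

accuracy = (softmax(X @ W + b).argmax(axis=1) == y).mean()
print(accuracy)
```

After training, the parameters W and b record the learned mapping from feature vector to intent, in the same spirit as the target network model recording the mapping between phoneme vectors and speech intents.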
  • for example, a large number of sample phoneme vectors and sample intents can be input to the first network layer of the initial network model, and the first network layer processes these data to obtain the output data of the first network layer. The output data of the first network layer is then input to the second network layer of the initial network model, and so on, until the data is input to the last network layer of the initial network model; the data is processed by the last network layer to obtain output data, which is denoted as the target feature vector.
  • the target feature vector After the target feature vector is obtained, it is determined whether the initial network model has converged based on the target feature vector. If the initial network model has converged, the converged initial network model is determined as the trained target network model, and the training process of the initial network model is completed. If the initial network model does not converge, the network parameters of the unconverged initial network model are adjusted to obtain the adjusted initial network model.
  • a large number of sample phoneme vectors and sample intentions can be input into the adjusted initial network model, so as to retrain the adjusted initial network model.
  • for the specific training process, refer to the above-mentioned embodiment, which is not repeated here. And so on, until the initial network model has converged, and the converged initial network model is determined as the trained target network model.
  • determining whether the initial network model has converged based on the target feature vector may include, but is not limited to: pre-constructing a loss function, which is not limited and can be set according to experience. After the target feature vector is obtained, the loss value of the loss function can be determined according to the target feature vector. For example, the target feature vector can be substituted into the loss function to obtain the loss value of the loss function. After the loss value of the loss function is obtained, it is determined whether the initial network model has converged according to the loss value of the loss function.
  • whether the initial network model has converged may be determined according to a loss value, for example, a loss value 1 is obtained based on the target feature vector, and if the loss value 1 is not greater than a threshold, it is determined that the initial network model has converged. If the loss value 1 is greater than the threshold, it is determined that the initial network model has not converged. or,
  • whether the initial network model has converged can also be determined according to multiple loss values from multiple iterations. For example, in each iteration, the initial network model from the previous iteration is adjusted to obtain the adjusted initial network model, and each iteration produces a loss value. A curve of the change in the loss values is determined. If the curve shows that the change in the loss value has stabilized (the loss value has not changed over several consecutive iterations, or the change is small), and the loss value of the last iteration is not greater than the threshold, it is determined that the initial network model of the last iteration has converged. Otherwise, it is determined that the initial network model of the last iteration has not converged; the next iteration continues to obtain its loss value, and the change curve of the loss values is re-determined.
  • other methods may also be used to determine whether the initial network model has converged, which is not limited. For example, if the number of iterations reaches a preset number of times threshold, it is determined that the initial network model has converged; for another example, if the iteration duration reaches a preset duration threshold, it is determined that the initial network model has converged.
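The stability-plus-threshold criterion described above can be sketched as a simple check over the recorded loss values; the window size, tolerance, and loss threshold here are assumed values, not from the patent:

```python
def has_converged(loss_history, threshold=0.05, window=5, tolerance=1e-3):
    """Converged if the loss has been stable over the last `window` iterations
    (total change within `tolerance`) and the final loss is not above `threshold`."""
    if len(loss_history) < window:
        return False
    recent = loss_history[-window:]
    stable = max(recent) - min(recent) <= tolerance
    return stable and loss_history[-1] <= threshold

# Loss stabilizes near 0.04 -> converged.
print(has_converged([0.9, 0.4, 0.2, 0.041, 0.0405, 0.0401, 0.0402, 0.0401]))  # True
# Loss still dropping steeply -> keep iterating.
print(has_converged([0.9, 0.6, 0.4, 0.3, 0.2]))                                # False
```

An iteration-count or wall-clock limit, as also mentioned above, would simply be an extra disjunct in this check.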
  • the initial network model can be trained through the sample phoneme vector and the sample intent corresponding to the sample phoneme vector, so as to obtain the trained target network model.
  • an embodiment of the present application proposes a voice intent recognition method, which can realize voice intent recognition, and the method includes:
  • Step 201 Determine a phoneme set to be recognized according to the speech to be recognized.
  • a set of to-be-recognized phonemes may be determined according to the to-be-recognized voice, and the to-be-recognized phoneme set may include a plurality of to-be-recognized phonemes.
  • each recognized phoneme is called a phoneme to be recognized; therefore, a plurality of phonemes to be recognized can be recognized from the speech to be recognized. This recognition process is not limited, as long as a plurality of to-be-recognized phonemes can be recognized from the to-be-recognized speech.
  • the to-be-recognized phoneme set may include the following to-be-recognized phonemes "k, a, i, k, o, n, g, t, i, a, o".
  • Step 202 Obtain a phoneme vector to be recognized corresponding to the phoneme set to be recognized.
• In step 202, for each to-be-recognized phoneme, the phoneme feature value corresponding to that phoneme is determined, and the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set is obtained based on the phoneme feature value corresponding to each to-be-recognized phoneme. The to-be-recognized phoneme vector includes the phoneme feature value corresponding to each to-be-recognized phoneme.
• The mapping relationship between each phoneme and its phoneme feature value is maintained in advance. Assuming there are 50 phonemes in total, the mapping relationship between phoneme 1 and phoneme feature value 1, between phoneme 2 and phoneme feature value 2, ..., and between phoneme 50 and phoneme feature value 50 can be maintained.
• In step 202, for each to-be-recognized phoneme in the to-be-recognized phoneme set, the phoneme feature value corresponding to that phoneme can be obtained by querying the above mapping relationship, and the phoneme feature values corresponding to all to-be-recognized phonemes in the set are combined to obtain the to-be-recognized phoneme vector.
  • all phonemes can be sorted. Assuming that there are 50 phonemes in total, the serial numbers of the 50 phonemes are 1-50 respectively.
• For the phoneme feature value corresponding to each phoneme, the phoneme feature value can be a 50-bit value. Assuming the serial number of a phoneme is M, then in the phoneme feature value corresponding to that phoneme, the value of the Mth bit is the first value, and the values of the bits other than the Mth bit are the second value.
• In the phoneme feature value corresponding to the phoneme with serial number 1, the value of bit 1 is the first value and the values of bits 2-50 are the second value; in the phoneme feature value corresponding to the phoneme with serial number 2, the value of bit 2 is the first value and the values of bit 1 and bits 3-50 are the second value, and so on.
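The per-phoneme feature value described above is a one-hot encoding. A minimal sketch, assuming the first value is 1 and the second value is 0 (the embodiment does not fix these values), and assuming an inventory of 50 phonemes identified only by serial number:

```python
NUM_PHONEMES = 50  # the text states only the inventory size, not the phonemes

def one_hot(serial_number, size=NUM_PHONEMES, first=1, second=0):
    """Phoneme feature value: for the phoneme with serial number M
    (1-based), bit M is the first value and every other bit is the
    second value. first=1 and second=0 are assumptions."""
    return [first if i == serial_number - 1 else second for i in range(size)]

def phoneme_vector(serial_numbers):
    """Stack the per-phoneme feature values to form the phoneme vector."""
    return [one_hot(m) for m in serial_numbers]
```

For a set of k to-be-recognized phonemes this yields a k*50 structure, one row per phoneme.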
  • Step 203 Input the to-be-recognized phoneme vector to the trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized phoneme vector.
  • the target network model is used to record the mapping relationship between the phoneme vector and the voice intent.
  • the target network model can output the voice intent corresponding to the to-be-recognized phoneme vector.
• The to-be-recognized phoneme vector can be input to the first network layer of the target network model, which processes it to obtain the output data of the first network layer.
• The output data of the first network layer is input to the second network layer of the target network model, and so on, until the data is input to the last network layer of the target network model, which processes it to produce output data. Denote this output data as the target feature vector.
• Since the target network model is used to record the mapping relationship between phoneme vectors and speech intents, after the target feature vector is obtained, the mapping relationship can be queried based on the target feature vector to obtain the speech intent corresponding to the target feature vector. This speech intent may be the speech intent corresponding to the to-be-recognized phoneme vector, and the target network model may output it.
• After the speech intent corresponding to the to-be-recognized phoneme vector is obtained, the device can be controlled based on that intent; the control method is not limited. For example, when the speech intent is "turn on the air conditioner", the air conditioner is turned on.
• When the target network model outputs the speech intent corresponding to the to-be-recognized phoneme vector, it can also output a probability value corresponding to each speech intent (a value between 0 and 1, which can also be called a confidence). For example, the target network model can output speech intent 1 with probability value 1 (e.g. 0.8), speech intent 2 with probability value 2 (e.g. 0.1), speech intent 3 with probability value 3 (e.g. 0.08), and so on.
  • the speech intent with the largest probability value can be used as the speech intent corresponding to the phoneme vector to be recognized, for example, the speech intent 1 with the largest probability value is used as the speech intent corresponding to the phoneme vector to be recognized.
• Alternatively, first determine the speech intent with the largest probability value and check whether that probability value (i.e., the maximum probability value) is greater than a preset probability threshold. If so, that speech intent is used as the speech intent corresponding to the to-be-recognized phoneme vector; otherwise, no speech intent corresponds to the to-be-recognized phoneme vector.
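The two selection strategies above (plain argmax, or argmax gated by a probability threshold) can be sketched as follows. The threshold value 0.5 and the use of `None` to signal "no corresponding intent" are illustrative assumptions.

```python
def select_intent(intent_probs, prob_threshold=0.5):
    """Pick the speech intent with the largest probability value; if that
    maximum is not greater than the preset probability threshold, report
    that no intent corresponds to the vector (returned here as None).
    intent_probs maps intent labels to probability/confidence values."""
    if not intent_probs:
        return None
    intent, prob = max(intent_probs.items(), key=lambda kv: kv[1])
    return intent if prob > prob_threshold else None
```

With the example outputs above, `{"intent 1": 0.8, "intent 2": 0.1, "intent 3": 0.08}` selects intent 1; if no probability exceeds the threshold, no intent is returned.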
• In the above manner, the speech intent is recognized based on the to-be-recognized phonemes rather than based on text, so there is no need to rely on the accuracy of converting speech to text.
• Because the to-be-recognized phonemes are determined from the to-be-recognized speech with high accuracy, speech intent recognition also has high accuracy: the user's speech intent can be recognized accurately, and the accuracy of speech intent recognition is effectively improved.
• For example, the user utters the to-be-recognized speech "I want to see photos with trees" (in Chinese), and the phonemes determined by the terminal device (such as an IPC or smartphone) based on the to-be-recognized speech are "w, o, x, i, a, n, g, k, a, n, y, o, u, s, h, u, m, u, d, e, z, h, a, o, p, i, a, n"; that is, the phonemes corresponding to "trees" are "s, h, u, m, u". The speech intent is determined based on these phonemes, without parsing the homophonous words "number" or "trees" out of the to-be-recognized speech. This avoids relying on "number" versus "trees" to determine the speech intent, makes intent recognition more reliable, and does not require a large library of language model algorithms for speech recognition, yielding significant performance gains and memory savings.
• In the above manner, the speech intent is recognized based on the pinyin to be recognized rather than based on text, so there is no need to rely on the accuracy of converting speech to text.
  • An embodiment of the present application proposes a voice intent recognition method, which can be applied to a human-computer interaction application scenario, and is mainly used to control a device according to the voice intent.
• The method can be applied to any device that needs to be controlled according to speech intent, such as access control devices, screen projection devices, IPCs (IP Cameras, i.e. network cameras), servers, smart terminals, robotic systems, air conditioning devices, etc.; this is not restricted.
  • the training process of the initial network model may be involved, and the identification process based on the target network model may be involved.
  • the initial network model can be trained to obtain the trained target network model.
• In the recognition process based on the target network model, the speech intent can be recognized based on the target network model.
  • the training process of the initial network model and the identification process based on the target network model may be implemented on the same device, or may be implemented on different devices.
  • an embodiment of the present application proposes a speech intent recognition method, which can realize the training of the initial network model, and the method includes:
  • Step 301 Obtain a sample speech and a sample intent corresponding to the sample speech.
  • a large number of sample voices may be obtained from historical data, and/or a large number of sample voices input by a user may be received, and the acquisition method is not limited, and the sample voices represent sounds produced when speaking. For example, if the sound produced when speaking is "turn on the air conditioner", the sample speech is "turn on the air conditioner".
  • the speech intent corresponding to the sample speech may be obtained.
  • the speech intent corresponding to the sample speech may be called a sample intent (ie, a sample speech intent).
  • the sample intent may be "turn on the air conditioner”.
  • Step 302 Determine a sample pinyin set according to the sample speech.
• A sample pinyin set may be determined from the sample speech, and the sample pinyin set may include a plurality of sample pinyin. Determining the sample pinyin from the sample speech is the process of identifying each pinyin from the sample speech.
• For ease of distinction, each identified pinyin is called a sample pinyin. A plurality of sample pinyin can thus be recognized from the sample speech; the recognition process is not restricted, as long as a plurality of sample pinyin can be recognized from the sample speech.
  • the sample pinyin set may include the following sample pinyin "ba”, “kong”, “tiao”, “da”, and "kai”.
  • Step 303 Obtain a sample pinyin vector corresponding to the sample pinyin set.
• In step 303, for each sample pinyin in the sample pinyin set, the pinyin feature value corresponding to that sample pinyin can be obtained by querying the pre-maintained mapping relationship between pinyin and pinyin feature values, and the pinyin feature values corresponding to all sample pinyin in the set are combined to obtain the sample pinyin vector. The sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
• For example, the sample pinyin vector can be a 5-dimensional feature vector that sequentially includes the pinyin feature value corresponding to "ba", the pinyin feature value corresponding to "kong", the pinyin feature value corresponding to "tiao", the pinyin feature value corresponding to "da", and the pinyin feature value corresponding to "kai".
• All pinyin can be sorted. Assuming there are 400 pinyin in total, the serial numbers of the 400 pinyin are 1-400. For the pinyin feature value corresponding to each pinyin, the pinyin feature value can be a 400-bit value. Assuming the serial number of a pinyin is N, then in the pinyin feature value corresponding to that pinyin, the value of the Nth bit is the first value, and the values of the bits other than the Nth bit are the second value.
• In the pinyin feature value corresponding to the pinyin with serial number 1, the value of bit 1 is the first value and the values of bits 2-400 are the second value; in the pinyin feature value corresponding to the pinyin with serial number 2, the value of bit 2 is the first value and the values of bit 1 and bits 3-400 are the second value, and so on.
• On this basis, the sample pinyin vector can be a 5*400-dimensional feature vector: the feature vector has 5 rows and 400 columns, and each row represents the pinyin feature value corresponding to one pinyin, which is not repeated here.
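The 5*400 sample pinyin vector above can be sketched as follows. The serial numbers assigned to "ba", "kong", "tiao", "da", and "kai" are invented for illustration (the embodiment states only that every pinyin has a serial number in 1-400), and the first/second values are assumed to be 1 and 0.

```python
NUM_PINYIN = 400  # the text assumes 400 pinyin in total

# Hypothetical serial numbers; the real table maps every pinyin to 1..400.
PINYIN_SERIAL = {"ba": 21, "kong": 175, "tiao": 301, "da": 60, "kai": 170}

def pinyin_vector(pinyins):
    """Sample pinyin vector: one row per sample pinyin, each row being
    that pinyin's 400-bit one-hot feature value (first value 1, second
    value 0 assumed), giving a len(pinyins)*400 structure."""
    rows = []
    for p in pinyins:
        n = PINYIN_SERIAL[p]  # serial number N, 1-based
        rows.append([1 if i == n - 1 else 0 for i in range(NUM_PINYIN)])
    return rows

vec = pinyin_vector(["ba", "kong", "tiao", "da", "kai"])  # 5 rows, 400 columns
```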
  • Step 304 Input the sample pinyin vector and the sample intent corresponding to the sample pinyin vector to the initial network model, so as to train the initial network model through the sample pinyin vector and the sample intent, and obtain the trained target network model.
• In step 304, the initial network model is trained by using the sample pinyin vector and the sample intent (that is, the sample speech intent), and the trained target network model is obtained. The target network model can therefore be used to record the mapping relationship between pinyin vectors and speech intents.
• In the training process, a large number of sample voices can be obtained. For each sample voice, the sample intent corresponding to the sample voice and the sample pinyin vector corresponding to the sample pinyin set of that sample voice are obtained; that is, the sample pinyin vector and sample intent corresponding to the sample voice are obtained (the sample intent participates in training as the label information of the sample pinyin vector). On this basis, a large number of sample pinyin vectors and the sample intent (i.e. label information) corresponding to each sample pinyin vector can be input into the initial network model, so as to use the sample pinyin vectors and sample intents to train each network parameter of the initial network model. The training process is not restricted. After training is completed, the trained initial network model is the target network model.
• For example, sample pinyin vectors and sample intents can be input to the first network layer of the initial network model, which processes them to obtain the output data of the first network layer.
• The output data of the first network layer is input to the second network layer of the initial network model, and so on, until the data is input to the last network layer of the initial network model, which processes it to produce output data. Denote this output data as the target feature vector.
• After the target feature vector is obtained, whether the initial network model has converged is determined based on the target feature vector. If the initial network model has converged, the converged initial network model is determined as the trained target network model, and the training process is complete. If the initial network model has not converged, the network parameters of the unconverged initial network model are adjusted to obtain an adjusted initial network model.
  • a large number of sample pinyin vectors and sample intentions can be input into the adjusted initial network model, so that the adjusted initial network model can be retrained.
• For the specific training process, refer to the above embodiment, which is not repeated here. This continues until the initial network model has converged, and the converged initial network model is determined as the trained target network model.
  • determining whether the initial network model has converged based on the target feature vector may include, but is not limited to: pre-constructing a loss function, which is not limited and can be set according to experience. After the target feature vector is obtained, the loss value of the loss function can be determined according to the target feature vector. For example, the target feature vector can be substituted into the loss function to obtain the loss value of the loss function. After the loss value of the loss function is obtained, it is determined whether the initial network model has converged according to the loss value of the loss function.
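The loop described in the preceding bullets (iterate, obtain a loss value from the loss function, adjust parameters, stop when the loss indicates convergence) can be sketched as follows. The callback `model_step`, the iteration cap, and the loss threshold are illustrative assumptions; the embodiment deliberately leaves the loss function and training procedure open.

```python
def train(model_step, max_iters=1000, loss_threshold=0.01):
    """Training loop sketch: each iteration runs model_step, a stand-in
    for one forward pass plus parameter adjustment, which returns that
    iteration's loss value; training stops once the loss is not greater
    than the threshold or the iteration cap is reached."""
    losses = []
    for _ in range(max_iters):
        loss = model_step()          # forward pass + parameter adjustment
        losses.append(loss)
        if loss <= loss_threshold:   # converged: loss at or below threshold
            break
    return losses

# Hypothetical demonstration: a fake "model step" whose loss halves each
# iteration, standing in for a real model being adjusted toward convergence.
_state = {"loss": 1.0}
def demo_step():
    _state["loss"] *= 0.5
    return _state["loss"]

demo_losses = train(demo_step)
```

A fuller version would also apply the stability test on the loss curve described earlier before declaring convergence.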
  • other methods may also be used to determine whether the initial network model has converged, which is not limited. For example, if the number of iterations reaches a preset number of times threshold, it is determined that the initial network model has converged; for another example, if the iteration duration reaches a preset duration threshold, it is determined that the initial network model has converged.
  • the initial network model can be trained through the sample pinyin vector and the sample intent corresponding to the sample pinyin vector, thereby obtaining the trained target network model.
  • an embodiment of the present application proposes a voice intent recognition method, which can realize voice intent recognition, and the method includes:
  • Step 401 Determine a set of pinyin to be recognized according to the speech to be recognized.
• A set of pinyin to be recognized may be determined from the speech to be recognized, and the set may include a plurality of pinyin to be recognized. Determining the pinyin to be recognized from the speech to be recognized is the process of recognizing each pinyin from that speech. For ease of distinction, each recognized pinyin is called a pinyin to be recognized. A plurality of pinyin to be recognized can thus be recognized from the speech to be recognized; the recognition process is not restricted, as long as a plurality of pinyin to be recognized can be recognized from the speech to be recognized.
  • the to-be-recognized pinyin set may include the following to-be-recognized pinyin "kai”, “kong", and "tiao".
  • Step 402 Obtain a pinyin vector to be recognized corresponding to the pinyin set to be recognized.
• In step 402, for each pinyin to be recognized, the pinyin feature value corresponding to that pinyin is determined, and the pinyin vector to be recognized corresponding to the set of pinyin to be recognized is obtained based on the pinyin feature value corresponding to each pinyin to be recognized. The pinyin vector to be recognized includes the pinyin feature value corresponding to each pinyin to be recognized.
• In step 402, for each pinyin to be recognized in the set, the pinyin feature value corresponding to that pinyin can be obtained by querying the above-mentioned mapping relationship, and the pinyin feature values corresponding to all pinyin to be recognized in the set are combined to obtain the pinyin vector to be recognized.
• All pinyin can be sorted. Assuming there are 400 pinyin in total, the serial numbers of the 400 pinyin are 1-400. For the pinyin feature value corresponding to each pinyin, the pinyin feature value can be a 400-bit value. Assuming the serial number of a pinyin is N, then in the pinyin feature value corresponding to that pinyin, the value of the Nth bit is the first value, and the values of the bits other than the Nth bit are the second value.
• In the pinyin feature value corresponding to the pinyin with serial number 1, the value of bit 1 is the first value and the values of bits 2-400 are the second value; in the pinyin feature value corresponding to the pinyin with serial number 2, the value of bit 2 is the first value and the values of bit 1 and bits 3-400 are the second value, and so on.
  • Step 403 input the to-be-recognized pinyin vector to the trained target network model, so that the target network model outputs the phonetic intent corresponding to the to-be-recognized pinyin vector.
  • the target network model is used to record the mapping relationship between the pinyin vector and the phonetic intent. After inputting the pinyin vector to be recognized to the target network model, the target network model can output the phonetic intent corresponding to the pinyin vector to be recognized.
• The pinyin vector to be recognized can be input to the first network layer of the target network model, which processes it to obtain the output data of the first network layer.
• The output data of the first network layer is input to the second network layer of the target network model, and so on, until the data is input to the last network layer of the target network model, which processes it to produce output data. Denote this output data as the target feature vector.
• Since the target network model is used to record the mapping relationship between pinyin vectors and speech intents, after the target feature vector is obtained, the mapping relationship can be queried based on the target feature vector to obtain the speech intent corresponding to the target feature vector. This speech intent may be the speech intent corresponding to the pinyin vector to be recognized, and the target network model may output it.
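The layer-by-layer propagation described above can be sketched generically. Modeling each network layer as a plain callable is an assumption made for illustration; a real target network model would use trained weight matrices and activation functions.

```python
def forward(layers, x):
    """Layer-by-layer propagation: the input vector is fed to the first
    network layer, each layer's output is the next layer's input, and
    the last layer's output is the target feature vector."""
    for layer in layers:
        x = layer(x)  # output of layer i becomes the input of layer i+1
    return x          # the target feature vector
```

The same forward pass applies to both embodiments, whether the input is a phoneme vector or a pinyin vector.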
• After the speech intent corresponding to the pinyin vector to be recognized is obtained, the device can be controlled based on that intent; the control method is not limited. For example, when the speech intent is "turn on the air conditioner", the air conditioner is turned on.
• When the target network model outputs the speech intent corresponding to the pinyin vector to be recognized, it can also output a probability value corresponding to each speech intent (a value between 0 and 1, which can also be called a confidence). For example, the target network model can output speech intent 1 with probability value 1 (e.g. 0.8), speech intent 2 with probability value 2 (e.g. 0.1), speech intent 3 with probability value 3 (e.g. 0.08), and so on.
• The speech intent with the largest probability value can be used as the speech intent corresponding to the pinyin vector to be recognized. Alternatively, first determine the speech intent with the largest probability value and check whether that probability value (i.e., the maximum probability value) is greater than a preset probability threshold. If so, that speech intent is used as the speech intent corresponding to the pinyin vector to be recognized; otherwise, no speech intent corresponds to the pinyin vector to be recognized.
• In the above manner, the speech intent is recognized based on the pinyin to be recognized rather than based on text, so there is no need to rely on the accuracy of converting speech to text.
• Because the pinyin to be recognized is determined from the speech to be recognized with high accuracy, speech intent recognition also has high accuracy: the user's speech intent can be recognized accurately, and the accuracy of speech intent recognition is effectively improved.
• For example, the user utters the to-be-recognized speech "I want to see photos with trees" (in Chinese), and the pinyin determined by the terminal device (such as an IPC or smartphone) based on the to-be-recognized speech is "wo, xiang, kan, you, shu, mu, de, zhao, pian"; that is, the pinyin corresponding to "trees" is "shu, mu". The speech intent can be determined based on this pinyin, without parsing the homophonous words "number" or "trees" out of the to-be-recognized speech. This avoids relying on "number" versus "trees" to determine the speech intent, makes intent recognition more reliable, and does not require a large library of language model algorithms for speech recognition, yielding significant performance gains and memory savings.
  • the device may include:
  • a determination module 511 configured to determine a phoneme set to be recognized according to the speech to be recognized
  • Obtaining module 512 for obtaining the phoneme vector to be recognized corresponding to the phoneme set to be recognized
  • the processing module 513 is used to input the phoneme vector to be recognized to the trained target network model, so that the target network model outputs the speech intent corresponding to the phoneme vector to be recognized;
  • the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
  • the to-be-recognized phoneme set includes a plurality of to-be-recognized phonemes, and when the acquiring module 512 acquires the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set, it is specifically used for:
  • each to-be-recognized phoneme determines a phoneme feature value corresponding to the to-be-recognized phoneme; based on the phoneme feature value corresponding to each to-be-recognized phoneme, obtain a to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set.
  • the phoneme vector includes the phoneme feature value corresponding to each to-be-recognized phoneme.
  • the determining module 511 is further configured to: obtain the sample speech and the sample intent corresponding to the sample speech; determine a sample phoneme set according to the sample speech; the obtaining module 512 is further configured to: obtain the The sample phoneme vector corresponding to the sample phoneme set; the processing module 513 is further configured to: input the sample phoneme vector and the sample intent to the initial network model, and train the initial network model through the sample phoneme vector and the sample intent , to obtain the target network model.
• the sample phoneme set includes a plurality of sample phonemes, and when the obtaining module 512 obtains the sample phoneme vector corresponding to the sample phoneme set, it is specifically used for:
  • a sample phoneme vector corresponding to the sample phoneme set is obtained, where the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
  • the device may include:
  • a determination module 521 configured to determine a set of pinyin to be recognized according to the speech to be recognized;
  • Obtaining module 522 for obtaining the pinyin vector to be recognized corresponding to the set of pinyin to be recognized;
  • the processing module 523 is used to input the to-be-recognized pinyin vector to the trained target network model, so that the target network model outputs a speech intent corresponding to the to-be-recognized pinyin vector;
  • the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
• The to-be-recognized pinyin set includes a plurality of to-be-recognized pinyin. When the acquiring module 522 acquires the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set, it is specifically used for: for each to-be-recognized pinyin, determining the pinyin feature value corresponding to that pinyin; and, based on the pinyin feature value corresponding to each to-be-recognized pinyin, obtaining the to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set, where the to-be-recognized pinyin vector includes the pinyin feature value corresponding to each to-be-recognized pinyin.
  • the determining module 521 is further configured to: obtain the sample speech and the sample intent corresponding to the sample speech; determine a sample pinyin set according to the sample speech; the obtaining module 522 is further configured to: obtain the The sample pinyin vector corresponding to the sample pinyin set; the processing module 523 is further configured to: input the sample pinyin vector and the sample intent to the initial network model, and train the initial network model through the sample pinyin vector and the sample intent , to obtain the target network model.
  • the sample pinyin set includes a plurality of sample pinyin, and when the obtaining module 522 obtains the sample pinyin vector corresponding to the sample pinyin set, it is specifically used for:
  • a sample pinyin vector corresponding to the sample pinyin set is obtained, and the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
• the voice intent recognition device includes: a processor 61 and a machine-readable storage medium 62, where the machine-readable storage medium 62 stores machine-executable instructions that can be executed by the processor 61; the processor 61 is configured to execute the machine-executable instructions to implement the following steps:
  • the target network model is used to record the mapping relationship between the phoneme vector and the speech intent.
  • the to-be-recognized phoneme set includes a plurality of to-be-recognized phonemes, and when the to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set is obtained, the processor 61 is prompted to:
  • a to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set is obtained, and the to-be-recognized phoneme vector includes the phoneme feature value corresponding to each to-be-recognized phoneme.
• the processor 61 is also prompted to:
  • the sample phoneme vector and the sample intent are input to an initial network model, and the initial network model is trained by the sample phoneme vector and the sample intent to obtain the target network model.
  • the sample phoneme set includes a plurality of sample phonemes, and when the sample phoneme vector corresponding to the sample phoneme set is obtained, the processor 61 is prompted to:
  • a sample phoneme vector corresponding to the sample phoneme set is obtained, and the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
• the voice intent recognition device includes: a processor 71 and a machine-readable storage medium 72, where the machine-readable storage medium 72 stores machine-executable instructions that can be executed by the processor 71; the processor 71 is configured to execute the machine-executable instructions to implement the following steps:
  • the target network model is used to record the mapping relationship between the pinyin vector and the speech intent.
  • the set of pinyin to be recognized includes a plurality of pinyin to be recognized, and when acquiring the pinyin vector to be recognized corresponding to the set of pinyin to be recognized, the processor 71 is prompted to:
  • a to-be-recognized pinyin vector corresponding to the to-be-recognized pinyin set is obtained, and the to-be-recognized pinyin vector includes the pinyin characteristic value corresponding to each to-be-recognized pinyin.
• the processor 71 is also prompted to:
  • the sample pinyin vector and the sample intent are input to the initial network model, and the initial network model is trained by the sample pinyin vector and the sample intent to obtain the target network model.
  • the sample pinyin set includes a plurality of sample pinyin, and when the sample pinyin vector corresponding to the sample pinyin set is obtained, the processor 71 is prompted to:
  • a sample pinyin vector corresponding to the sample pinyin set is obtained, and the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
  • an embodiment of the present application further provides a machine-readable storage medium, where several computer instructions are stored on the machine-readable storage medium, and when the computer instructions are executed by a processor, the speech intent recognition method disclosed in the above examples of the present application can be implemented.
  • the above-mentioned machine-readable storage medium may be any electronic, magnetic, optical or other physical storage device, which may contain or store information, such as executable instructions, data, and the like.
  • the machine-readable storage medium can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disk (such as a CD, DVD, etc.), or similar storage media, or a combination thereof.
  • a typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or a combination of any of these devices.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

一种语音意图识别方法包括:根据待识别语音确定待识别音素集合(201);获取与待识别音素集合对应的待识别音素向量(202);将待识别音素向量输入给已训练的目标网络模型,以使目标网络模型输出与待识别音素向量对应的语音意图(203)。其中,目标网络模型用于记录音素向量与语音意图的映射关系。方法有效提高了语音意图识别的准确率,能够准确识别出用户的语音意图。还提供了一种语音意图识别装置及语音意图识别设备。

Description

语音意图识别方法、装置及设备 技术领域
本公开涉及语音交互领域,尤其涉及一种语音意图识别方法、装置及设备。
背景技术
随着人工智能技术的快速发展以及人工智能技术在生活中的广泛使用,语音交互成为人与机器之间沟通交流的重要桥梁。机器人系统要与用户对话并完成特定任务,其中一个核心技术是语音意图的识别,即,用户向机器人系统输入待识别语音后,机器人系统能够通过待识别语音判定用户的语音意图。
在相关技术中,语音意图的识别方式包括:语音识别阶段和意图识别阶段。在语音识别阶段,通过自动语音识别(Automatic Speech Recognition,ASR)技术对待识别语音进行语音识别,将待识别语音转化为文本。然后,在意图识别阶段,通过自然语言处理(Natural Language Processing,NLP)技术对文本进行语义理解,得到关键词信息,并基于关键词信息识别出用户的语音意图。
上述基于文本的意图识别方式，准确率严重依赖于语音转化为文本的准确率，而语音转化为文本的准确率较低，导致语音意图识别的准确率很低，无法准确识别出用户的语音意图。比如说，语音中存在的是“树木”，但是，语音转化为文本时，文本内容可能是“数目”，从而导致语音意图的识别错误。
发明内容
本申请提供一种语音意图识别方法,包括:
根据待识别语音确定待识别音素集合;
获取与所述待识别音素集合对应的待识别音素向量;
将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图;
其中,所述目标网络模型用于记录音素向量与语音意图的映射关系。
在一种可能的实施方式中,所述待识别音素集合包括多个待识别音素,所述获取与所述待识别音素集合对应的待识别音素向量,包括:
针对每个待识别音素,确定该待识别音素对应的音素特征值;基于所述每个待识别音素对应的音素特征值,获取与所述待识别音素集合对应的待识别音素向量,所述待识别音素向量包括所述每个待识别音素对应的音素特征值。
在一种可能的实施方式中,所述将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图之前,所述方法还包括:
获取样本语音和所述样本语音对应的样本意图;
根据所述样本语音确定样本音素集合;
获取与所述样本音素集合对应的样本音素向量;
将所述样本音素向量和所述样本意图输入给初始网络模型,通过所述样本音素向量和所述样本意图对所述初始网络模型进行训练,得到所述目标网络模型。
在一种可能的实施方式中,所述样本音素集合包括多个样本音素,所述获取与所述样本音素集合对应的样本音素向量,包括:
针对每个样本音素,确定该样本音素对应的音素特征值;
基于所述每个样本音素对应的音素特征值,获取与所述样本音素集合对应的样本音素向量,所述样本音素向量包括所述每个样本音素对应的音素特征值。
本申请提供一种语音意图识别方法,包括:
根据待识别语音确定待识别拼音集合;
获取与所述待识别拼音集合对应的待识别拼音向量;
将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图;
其中,所述目标网络模型用于记录拼音向量与语音意图的映射关系。
在一种可能的实施方式中,所述待识别拼音集合包括多个待识别拼音,所述获取与所述待识别拼音集合对应的待识别拼音向量,包括:
针对每个待识别拼音,确定该待识别拼音对应的拼音特征值;基于所述每个待识别拼音对应的拼音特征值,获取与所述待识别拼音集合对应的待识别拼音向量,所述待识别拼音向量包括所述每个待识别拼音对应的拼音特征值。
在一种可能的实施方式中,所述将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图之前,所述方法还包括:
获取样本语音和所述样本语音对应的样本意图;
根据所述样本语音确定样本拼音集合;
获取与所述样本拼音集合对应的样本拼音向量;
将所述样本拼音向量和所述样本意图输入给初始网络模型,通过所述样本拼音向量和所述样本意图对所述初始网络模型进行训练,得到所述目标网络模型。
在一种可能的实施方式中,所述样本拼音集合包括多个样本拼音,所述获取与所述样本拼音集合对应的样本拼音向量,包括:
针对每个样本拼音,确定该样本拼音对应的拼音特征值;
基于所述每个样本拼音对应的拼音特征值,获取与所述样本拼音集合对应的样本拼音向量,所述样本拼音向量包括所述每个样本拼音对应的拼音特征值。
本申请提供一种语音意图识别装置,包括:
确定模块,用于根据待识别语音确定待识别音素集合;
获取模块,用于获取与所述待识别音素集合对应的待识别音素向量;
处理模块,用于将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图;
其中,所述目标网络模型用于记录音素向量与语音意图的映射关系。
本申请提供一种语音意图识别装置,包括:
确定模块,用于根据待识别语音确定待识别拼音集合;
获取模块,用于获取与所述待识别拼音集合对应的待识别拼音向量;
处理模块,用于将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图;
其中,所述目标网络模型用于记录拼音向量与语音意图的映射关系。
本申请提供一种语音意图识别设备,包括:处理器和机器可读存储介质,所述机器可读存储介质存储有能够被所述处理器执行的机器可执行指令;所述处理器用于执行机器可执行指令,以实现如下步骤:
根据待识别语音确定待识别音素集合;
获取与所述待识别音素集合对应的待识别音素向量;
将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图;
其中,所述目标网络模型用于记录音素向量与语音意图的映射关系。
本申请提供一种语音意图识别设备,包括:处理器和机器可读存储介质,所述机器可读存储介质存储有能够被所述处理器执行的机器可执行指令;所述处理器用于执行机器可执行指令,以实现如下步骤:
根据待识别语音确定待识别拼音集合；
获取与所述待识别拼音集合对应的待识别拼音向量;
将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图;
其中,所述目标网络模型用于记录拼音向量与语音意图的映射关系。
由以上技术方案可见,本申请实施例中,基于待识别音素识别出语音意图,不是基于文本识别语音意图,不需要依赖语音转化为文本的准确率。由于音素是根据语音的自然属性划分的最小语音单位,基于发音动作来分析,一个动作构成一个音素,因此,基于待识别语音确定待识别音素的准确率很高,语音意图识别的准确率很高,能够准确识别出用户的语音意图,有效提高语音意图识别的准确率,使得意图识别有了更强的可靠性,不需要语音识别的大量语言模型算法库,带来性能和内存的大幅度优化。
附图说明
为了更加清楚地说明本申请实施例或者现有技术中的技术方案,下面将对本申请实施例或者现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请中记载的一些实施例,对于本领域普通技术人员来讲,还可以根据本申请实施例的这些附图获得其他的附图。
图1是本申请一种实施方式中的语音意图识别方法的流程示意图。
图2是本申请一种实施方式中的语音意图识别方法的流程示意图。
图3是本申请一种实施方式中的语音意图识别方法的流程示意图。
图4是本申请一种实施方式中的语音意图识别方法的流程示意图。
图5A是本申请一种实施方式中的语音意图识别装置的结构示意图。
图5B是本申请一种实施方式中的语音意图识别装置的结构示意图。
图6是本申请一种实施方式中的语音意图识别设备的硬件结构图。
图7是本申请一种实施方式中的语音意图识别设备的硬件结构图。
具体实施方式
在本申请实施例使用的术语仅仅是出于描述特定实施例的目的,而非限制本申请。本申请和权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其它含义。还应当理解,本文中使用的术语“和/或”是指包含一个或多个相关联的列出项目的任何或所有可能组合。
应当理解，尽管在本申请实施例可能采用术语第一、第二、第三等来描述各种信息，但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如，在不脱离本申请范围的情况下，第一信息也可以被称为第二信息，类似地，第二信息也可以被称为第一信息。此外，取决于语境，所使用的词语“如果”可以被解释为“在……时”或“当……时”或“响应于确定”。
在介绍本申请的技术方案之前,先介绍与本申请实施例有关的概念。
机器学习:机器学习是实现人工智能的一种途径,用于研究计算机如何模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身性能。深度学习属于机器学习的子类,是一种使用数学模型对真实世界中的特定问题进行建模,以解决该领域内相似问题的过程。神经网络是深度学习的实现方式,为了方便描述,本文以神经网络为例,介绍神经网络的结构和功能,对于机器学习的其它子类,与神经网络的结构和功能类似。
神经网络:神经网络包括但不限于卷积神经网络(简称CNN)、循环神经网络(简称RNN)、全连接网络等,神经网络的结构单元可以包括但不限于卷积层(Conv)、池化层(Pool)、激励层、全连接层(FC)等,对此不做限制。
在实际应用中,可以根据不同需求,将一个或多个卷积层,一个或多个池化层,一个或多个激励层,以及一个或多个全连接层进行组合构建神经网络。
在卷积层中,通过使用卷积核对输入数据特征进行卷积运算,使输入数据特征增强,该卷积核可以是m*n大小的矩阵,卷积层的输入数据特征与卷积核进行卷积,可以得到卷积层的输出数据特征,卷积运算实际是一个滤波过程。
在池化层中,通过对输入数据特征(如卷积层的输出)进行取最大值、取最小值、取平均值等操作,从而利用局部相关性的原理,对输入数据特征进行子抽样,减少处理量,并保持特征不变性,池化层运算实际是一个降采样过程。
在激励层中,可以使用激活函数(如非线性函数)对输入数据特征进行映射,从而引入非线性因素,使得神经网络通过非线性的组合增强表达能力。
该激活函数可以包括但不限于ReLU(Rectified Linear Units,整流线性单元)函数,该ReLU函数用于将小于0的特征置0,而大于0的特征保持不变。
在全连接层中,将输入给本全连接层的所有数据特征进行全连接处理,从而得到一个特征向量,且该特征向量中可以包括多个数据特征。
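上述各层中的 ReLU 激励与全连接层的基本运算，可用如下纯 Python 最小草图示意（仅为示意，省略卷积与池化，其中的权重与偏置均为假设值，并非本申请的实际实现）：

```python
# 最小示意：ReLU 激励与全连接层的基本运算（纯 Python）

def relu(features):
    """ReLU：将小于 0 的特征置 0，大于 0 的特征保持不变。"""
    return [max(0.0, x) for x in features]

def fully_connected(features, weights, bias):
    """全连接层：对输入的所有数据特征加权求和，得到一个特征向量。"""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

x = [-1.0, 2.0, 0.5]
h = relu(x)                      # [0.0, 2.0, 0.5]
w = [[1.0, 0.0, 2.0],            # 假设的权重矩阵（2 个输出、3 个输入）
     [0.0, 1.0, 0.0]]
b = [0.1, -0.5]                  # 假设的偏置
y = fully_connected(h, w, b)     # [1.1, 1.5]
```

实际的网络模型通常由深度学习框架实现上述各层的组合。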
网络模型:采用机器学习算法(如深度学习算法)构建的模型,如采用神经网络构建的模型,即,网络模型可以由一个或多个卷积层,一个或多个池化层,一个或多个激励层,以及一个或多个全连接层组成。为了区分方便,将未训练的网络模型称为初始网络模型,将已训练的网络模型称为目标网络模型。
在初始网络模型的训练过程中，利用样本数据训练初始网络模型内各网络参数，如卷积层参数（如卷积核参数）、池化层参数、激励层参数、全连接层参数等，对此不做限制。通过训练初始网络模型内各网络参数，使得初始网络模型拟合出输入和输出的映射关系。在初始网络模型训练完成后，已经完成训练的初始网络模型就是目标网络模型，通过目标网络模型识别语音意图。
音素:音素是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作构成一个音素。例如,汉语音节啊(a)只有一个音素(a),爱(ai)有两个音素(a和i),代(dai)有三个音素(d、a和i)等。又例如,汉语音节树木(shumu)有五个音素(s、h、u、m和u)等。
拼音:拼音是将一个以上的音素结合起来成为一个复合的音,例如,代(dai)有三个音素(d、a和i),这些音素组成一个拼音(dai)。又例如,树木(shumu)有五个音素(s、h、u、m和u),这些音素组成两个拼音(shu和mu)。
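音素与拼音的上述关系可用如下草图示意（假设每个字母恰对应一个音素，实际的音素划分可能更复杂）：

```python
# 最小示意：把拼音序列拆分为音素序列（假设每个字母对应一个音素）

def pinyin_to_phonemes(pinyins):
    return [ch for py in pinyins for ch in py]

assert pinyin_to_phonemes(["dai"]) == ["d", "a", "i"]
assert pinyin_to_phonemes(["shu", "mu"]) == ["s", "h", "u", "m", "u"]
```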
在相关技术中,语音意图的识别方式包括:语音识别阶段和意图识别阶段。在语音识别阶段,通过自动语音识别技术对待识别语音进行语音识别,将待识别语音转化为文本。在意图识别阶段,通过自然语言处理技术对文本进行语义理解,得到关键词,并基于关键词识别用户的语音意图。上述基于文本的意图识别方式,准确率依赖于语音转化为文本的准确率,而语音转化为文本的准确率较低,导致语音意图识别的准确率很低,无法准确识别出用户的语音意图。
针对上述发现,本申请实施例中,基于待识别音素识别出语音意图,而不是基于文本识别语音意图,从而不需要依赖语音转化为文本的准确率。
以下结合具体实施例,对本申请实施例的技术方案进行说明。
本申请实施例中提出一种语音意图识别方法,可以应用于人机交互应用场景,主要用于根据语音意图对设备进行控制。示例性的,该方法可以应用于需要根据语音意图进行控制的任意设备,如门禁设备,投屏设备,IPC(IP Camera,网络摄像机),服务器,智能终端,机器人系统,空调设备等,对此不做限制。
本申请实施例中,涉及初始网络模型的训练过程,基于目标网络模型的识别过程。在初始网络模型的训练过程中,可以对初始网络模型进行训练,得到已训练的目标网络模型。在基于目标网络模型的识别过程中,可以基于目标网络模型识别语音意图。初始网络模型的训练过程与基于目标网络模型的识别过程,可以在同一个设备实现,也可以在不同设备实现。比如说,在设备A实现初始网络模型的训练过程,得到目标网络模型,并基于目标网络模型识别语音意图。又例如,在设备A1实现初始网络模型的训练过程,得到目标网络模型,将目标网络模型部署到设备A2,由设备A2基于目标网络模型识别语音意图。
参见图1所示,针对初始网络模型的训练过程,本申请实施例中提出一种语音意图识别方法,该方法可以实现初始网络模型的训练,该方法包括:
步骤101,获取样本语音和该样本语音对应的样本意图。
示例性的,可以从历史数据中获取大量样本语音,和/或,接收用户输入的大量样本语音,对此获取方式不做限制,样本语音表示说话时发出的声音。比如说,说话时发出的声音是“把空调打开”,则样本语音就是“把空调打开”。
针对每个样本语音来说,可以获取该样本语音对应的语音意图,为了区分方便,可以将样本语音对应的语音意图称为样本意图(即样本语音意图)。比如说,若样本语音是“把空调打开”,则样本意图可以是“开空调”。
步骤102,根据该样本语音确定样本音素集合。
示例性的，针对每个样本语音来说，可以根据该样本语音确定样本音素集合，该样本音素集合可以包括多个样本音素，根据样本语音确定样本音素的过程，是从样本语音中识别出每个音素的过程，为了区分方便，将识别出的每个音素称为样本音素，因此，可以根据该样本语音识别出多个样本音素，对此识别过程不做限制，只要能够根据该样本语音识别出多个样本音素即可。
比如说,针对样本语音“把空调打开”来说,则样本音素集合可以包括如下的样本音素“b、a、k、o、n、g、t、i、a、o、d、a、k、a、i”。
步骤103,获取与该样本音素集合对应的样本音素向量。
示例性的,针对该样本音素集合中的每个样本音素,确定该样本音素对应的音素特征值,基于每个样本音素对应的音素特征值,获取与该样本音素集合对应的样本音素向量,该样本音素向量包括每个样本音素对应的音素特征值。
比如说,预先维护所有音素中的每个音素与音素特征值的映射关系,假设一共存在50个音素,则可以维护音素1与音素特征值1的映射关系,音素2与音素特征值2的映射关系,…,音素50与音素特征值50的映射关系。
在此基础上,在步骤103中,针对样本音素集合中的每个样本音素,通过查询上述映射关系,可以得到与该样本音素对应的音素特征值,并将样本音素集合中的每个样本音素对应的音素特征值组合,得到该样本音素向量。
比如说,针对样本音素集合“b、a、k、o、n、g、t、i、a、o、d、a、k、a、i”,该样本音素向量是一个15维的特征向量,该特征向量依次包括“b”对应的音素特征值,“a”对应的音素特征值,“k”对应的音素特征值,“o”对应的音素特征值,“n”对应的音素特征值,“g”对应的音素特征值,“t”对应的音素特征值,“i”对应的音素特征值,“a”对应的音素特征值,“o”对应的音素特征值,“d”对应的音素特征值,“a”对应的音素特征值,“k”对应的音素特征值,“a”对应的音素特征值,“i”对应的音素特征值。
在一种可能的实施方式中,可以对所有音素进行排序,假设一共存在50个音素,则50个音素的序号分别为1-50,针对每个音素对应的音素特征值,该音素特征值可以是50位的数值。假设音素的序号为M,则该音素对应的音素特征值中,第M位的取值是第一取值,除第M位之外的其它位的取值是第二取值。比如说,序号为1的音素对应的音素特征值中,第1位的取值是第一取值,第2-50位的取值是第二取值;序号为2的音素对应的音素特征值中,第2位的取值是第一取值,第1位、第3-50位的取值是第二取值,以此类推。
综上所述,针对样本音素集合“b、a、k、o、n、g、t、i、a、o、d、a、k、a、i”,该样本音素向量可以是一个15*50维的特征向量,该特征向量包括15行50列,每一行表示一个音素对应的音素特征值,对此不再赘述。
在上述实施例中,第一取值和第二取值可以根据经验配置,对此不做限制,如第一取值为1,第二取值为0,或者,第一取值为0,第二取值为1,或者,第一取值为255,第二取值为0,或者,第一取值为0,第二取值为255。
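上述按音素序号构造音素特征值（即 one-hot 编码）的方式，可草拟如下（假设共 50 个音素、第一取值为 1、第二取值为 0，其中的音素序号表仅为示意，并非本申请限定的映射关系）：

```python
# 最小示意：按音素序号生成音素特征值（one-hot），并组合成音素向量
# 假设：共 50 个音素，第一取值为 1，第二取值为 0

NUM_PHONEMES = 50
# 示意用的音素序号表（实际应预先维护全部音素与音素特征值的映射关系）
PHONEME_INDEX = {"a": 1, "b": 2, "i": 3, "k": 4}

def phoneme_feature(phoneme):
    """序号为 M 的音素：第 M 位取第一取值 1，其余位取第二取值 0。"""
    m = PHONEME_INDEX[phoneme]
    return [1 if i == m else 0 for i in range(1, NUM_PHONEMES + 1)]

def phoneme_vector(phonemes):
    """音素向量：每一行是一个音素对应的音素特征值。"""
    return [phoneme_feature(p) for p in phonemes]

vec = phoneme_vector(["b", "a"])
# vec 为 2*50 维：第一行第 2 位为 1，第二行第 1 位为 1
```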
步骤104,将该样本音素向量和该样本音素向量对应的样本意图输入给初始网络模型,以通过该样本音素向量和该样本意图对初始网络模型进行训练,得到已训练的目标网络模型。示例性的,由于是采用该样本音素向量和该样本意图(即样本语音意图)对初始网络模型进行训练,得到已训练的目标网络模型,因此,该目标网络模型可以用于记录音素向量与语音意图的映射关系。
参见上述实施例，可以获取大量样本语音，针对每个样本语音，获取该样本语音对应的样本意图，以及该样本语音的样本音素集合对应的样本音素向量，即，得到该样本语音对应的样本音素向量和样本意图（作为样本音素向量的标签信息参与训练）。基于此，可以将大量样本音素向量及每个样本音素向量对应的样本意图（即标签信息）输入给初始网络模型，从而利用样本音素向量及样本意图对初始网络模型内各网络参数进行训练，对此训练过程不做限制。在初始网络模型训练完成后，已经完成训练的初始网络模型是目标网络模型。
比如说,可以将大量样本音素向量及样本意图输入给初始网络模型的第一个网络层,由第一个网络层对这些数据进行处理,得到第一个网络层的输出数据,将第一个网络层的输出数据输入给初始网络模型的第二个网络层,以此类推,一直到将数据输入给初始网络模型的最后一个网络层,由最后一个网络层对数据进行处理,得到输出数据,将这个输出数据记为目标特征向量。
在得到目标特征向量后,基于目标特征向量确定初始网络模型是否已收敛。若初始网络模型已收敛,则将已收敛的初始网络模型确定为已训练的目标网络模型,完成初始网络模型的训练过程。若初始网络模型未收敛,则对未收敛的初始网络模型的网络参数进行调整,得到调整后的初始网络模型。
基于调整后的初始网络模型,可以将大量样本音素向量及样本意图输入给调整后的初始网络模型,从而对调整后的初始网络模型重新进行训练,具体训练过程参见上述实施例,在此不再重复赘述。以此类推,一直到初始网络模型已收敛,并将已收敛的初始网络模型确定为已训练的目标网络模型。
在上述实施例中,基于目标特征向量确定初始网络模型是否已收敛,可以包括但不限于:预先构建损失函数,对此损失函数不做限制,可以根据经验设置。在得到目标特征向量后,可以根据该目标特征向量确定损失函数的损失值,比如说,可以将该目标特征向量代入损失函数,得到损失函数的损失值。在得到损失函数的损失值后,根据损失函数的损失值确定初始网络模型是否已收敛。
示例性的,可以根据一个损失值确定初始网络模型是否已收敛,例如,基于目标特征向量得到损失值1,若损失值1不大于阈值,则确定初始网络模型已收敛。若损失值1大于阈值,则确定初始网络模型未收敛。或者,
可以根据多次迭代过程的多个损失值确定初始网络模型是否已收敛,例如,在每次迭代过程中,对上次迭代过程的初始网络模型进行调整,得到调整后的初始网络模型,且每次迭代过程可以得到损失值。确定多个损失值的变化幅度曲线,若根据该变化幅度曲线确定损失值变化幅度已经平稳(连续多次迭代过程的损失值未发生变化,或者变化的幅度很小),且最后一次迭代过程的损失值不大于阈值,则确定最后一次迭代过程的初始网络模型已收敛。否则,确定最后一次迭代过程的初始网络模型未收敛,继续进行下一次迭代过程,得到下一次迭代过程的损失值,并重新确定多个损失值的变化幅度曲线。
在实际应用中,还可以采用其它方式确定初始网络模型是否已收敛,对此不做限制。例如,若迭代次数达到预设次数阈值,则确定初始网络模型已收敛;又例如,若迭代时长达到预设时长阈值,则确定初始网络模型已收敛。
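上述几种收敛判定方式可合并草拟如下（其中损失阈值、平稳窗口、迭代次数上限均为示意参数，并非本申请限定的实现）：

```python
# 最小示意：基于损失值判断初始网络模型是否已收敛
# 假设参数：损失阈值、平稳窗口大小、迭代次数上限均为示意值

def has_converged(losses, threshold=0.01, window=5, flat_eps=1e-4,
                  max_iters=1000):
    """losses 为历次迭代的损失值列表，满足任一收敛条件则返回 True。"""
    if len(losses) >= max_iters:      # 迭代次数达到预设次数阈值
        return True
    if len(losses) < window:
        return False
    recent = losses[-window:]
    is_flat = max(recent) - min(recent) < flat_eps   # 损失值变化幅度已平稳
    return is_flat and recent[-1] <= threshold       # 且最后一次损失不大于阈值
```

实际训练中还可以结合迭代时长阈值等其它判定方式。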
综上所述,可以通过样本音素向量和该样本音素向量对应的样本意图对初始网络模型进行训练,从而得到训练后的目标网络模型。
参见图2所示,针对基于目标网络模型的识别过程,本申请实施例中提出一种语音意图识别方法,该方法可以实现语音意图的识别,该方法包括:
步骤201,根据待识别语音确定待识别音素集合。
示例性的，在得到待识别语音后，可以根据该待识别语音确定待识别音素集合，该待识别音素集合可以包括多个待识别音素，根据待识别语音确定待识别音素的过程，是从待识别语音中识别出每个音素的过程，为了区分方便，将识别出的每个音素称为待识别音素，因此，可以根据该待识别语音识别出多个待识别音素，对此识别过程不做限制，只要能够根据待识别语音识别出多个待识别音素即可。比如说，针对待识别语音“开空调”来说，则待识别音素集合可以包括如下的待识别音素“k、a、i、k、o、n、g、t、i、a、o”。
步骤202,获取与该待识别音素集合对应的待识别音素向量。示例性的,针对待识别音素集合中的每个待识别音素,确定该待识别音素对应的音素特征值,基于每个待识别音素对应的音素特征值,获取与待识别音素集合对应的待识别音素向量,该待识别音素向量包括每个待识别音素对应的音素特征值。
比如说,预先维护所有音素中的每个音素与音素特征值的映射关系,假设一共存在50个音素,则可以维护音素1与音素特征值1的映射关系,音素2与音素特征值2的映射关系,…,音素50与音素特征值50的映射关系。
步骤202中,针对待识别音素集合中的每个待识别音素,通过查询上述映射关系,可以得到与该待识别音素对应的音素特征值,并将待识别音素集合中的每个待识别音素对应的音素特征值组合,得到该待识别音素向量。
在一种可能的实施方式中,可以对所有音素进行排序,假设一共存在50个音素,则50个音素的序号分别为1-50,针对每个音素对应的音素特征值,该音素特征值可以是50位的数值。假设音素的序号为M,则该音素对应的音素特征值中,第M位的取值是第一取值,除第M位之外的其它位的取值是第二取值。比如说,序号为1的音素对应的音素特征值中,第1位的取值是第一取值,第2-50位的取值是第二取值;序号为2的音素对应的音素特征值中,第2位的取值是第一取值,第1位、第3-50位的取值是第二取值,以此类推。
步骤203,将该待识别音素向量输入给已训练的目标网络模型,以使目标网络模型输出与该待识别音素向量对应的语音意图。示例性的,目标网络模型用于记录音素向量与语音意图的映射关系,在将待识别音素向量输入给目标网络模型后,目标网络模型可以输出与该待识别音素向量对应的语音意图。
比如说,可以将待识别音素向量输入给目标网络模型的第一个网络层,由第一个网络层对该待识别音素向量进行处理,得到第一个网络层的输出数据,将第一个网络层的输出数据输入给目标网络模型的第二个网络层,以此类推,一直到将数据输入给目标网络模型的最后一个网络层,由最后一个网络层对数据进行处理,得到输出数据,将这个输出数据记为目标特征向量。
由于目标网络模型用于记录音素向量与语音意图的映射关系,因此,在得到目标特征向量后,可以基于该目标特征向量查询该映射关系,得到与该目标特征向量对应的语音意图,这个语音意图可以是与该待识别音素向量对应的语音意图,目标网络模型可以输出与该待识别音素向量对应的语音意图。
在得到与待识别音素向量对应的语音意图后,可以基于该语音意图对设备进行控制,对此控制方式不做限制,如语音意图是“开空调”时,打开空调。
在一种可能的实施方式中,目标网络模型输出与待识别音素向量对应的语音意图时,还可以输出语音意图对应的概率值(如0-1之间的概率值,该概率值也可以称为置信度),例如,目标网络模型可以输出语音意图1以及语音意图1的概率值1(如0.8),语音意图2以及语音意图2的概率值2(如0.1),语音意图3以及语音意图3的概率值3(如0.08),以此类推。
基于上述输出数据,可以将概率值最大的语音意图作为与待识别音素向量对应的语音意图,例如,将概率值最大的语音意图1作为与待识别音素向量对应的语音意图。或者,先确定概率值最大的语音意图,确定该语音意图的概率值(即最大概率值)是否大 于预设概率阈值,若是,将该语音意图作为与待识别音素向量对应的语音意图,否则,没有与该待识别音素向量对应的语音意图。
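上述基于概率值选取语音意图的逻辑可草拟如下（概率阈值为示意值）：

```python
# 最小示意：从目标网络模型输出的（语音意图, 概率值）中选取识别结果
# 假设：预设概率阈值为 0.5（示意值）

def pick_intent(intent_probs, prob_threshold=0.5):
    """intent_probs: {语音意图: 概率值}；概率值最大且大于阈值时返回该意图，
    否则判定没有对应的语音意图，返回 None。"""
    if not intent_probs:
        return None
    intent = max(intent_probs, key=intent_probs.get)
    return intent if intent_probs[intent] > prob_threshold else None

assert pick_intent({"开空调": 0.8, "关空调": 0.1}) == "开空调"
assert pick_intent({"开空调": 0.3, "关空调": 0.2}) is None
```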
由以上技术方案可见,本申请实施例中,基于待识别音素识别出语音意图,不是基于文本识别语音意图,不需要依赖语音转化为文本的准确率。基于待识别语音确定待识别音素的准确率很高,语音意图识别的准确率很高,能够准确识别出用户的语音意图,有效提高语音意图识别的准确率。
比如说,用户发出待识别语音“我想看有树木的照片”,终端设备(如IPC,智能手机等)基于待识别语音确定的音素是“w、o、x、i、a、n、g、k、a、n、y、o、u、s、h、u、m、u、d、e、z、h、a、o、p、i、a、n”,即“树木”对应的音素是“s、h、u、m、u”,从而基于上述音素来确定语音意图,而不需要从待识别语音“我想看有树木的照片”中解析出“数目”或者“树木”,从而避免采用“数目”或“树木”确定语音意图,使得意图识别有了更强的可靠性,不需要语音识别的大量语言模型算法库,带来性能和内存的大幅度优化。
本申请实施例的另一种实现方式中,基于待识别拼音识别出语音意图,而不是基于文本识别语音意图,从而不需要依赖语音转化为文本的准确率。
以下结合具体实施例,对本申请实施例的技术方案进行说明。
本申请实施例中提出一种语音意图识别方法,可以应用于人机交互应用场景,主要用于根据语音意图对设备进行控制。示例性的,该方法可以应用于需要根据语音意图进行控制的任意设备,如门禁设备,投屏设备,IPC(IP Camera,网络摄像机),服务器,智能终端,机器人系统,空调设备等,对此不做限制。
本申请实施例中,可以涉及初始网络模型的训练过程,基于目标网络模型的识别过程。在初始网络模型的训练过程中,可以对初始网络模型进行训练,得到已训练的目标网络模型。在基于目标网络模型的识别过程中,可以基于目标网络模型识别语音意图。示例性的,初始网络模型的训练过程与基于目标网络模型的识别过程,可以在同一个设备实现,也可以在不同设备实现。
参见图3所示,针对初始网络模型的训练过程,本申请实施例中提出一种语音意图识别方法,该方法可以实现初始网络模型的训练,该方法包括:
步骤301,获取样本语音和该样本语音对应的样本意图。
示例性的,可以从历史数据中获取大量样本语音,和/或,接收用户输入的大量样本语音,对此获取方式不做限制,样本语音表示说话时发出的声音。比如说,说话时发出的声音是“把空调打开”,则样本语音就是“把空调打开”。
针对每个样本语音来说,可以获取该样本语音对应的语音意图,为了区分方便,可以将样本语音对应的语音意图称为样本意图(即样本语音意图)。比如说,若样本语音是“把空调打开”,则样本意图可以是“开空调”。
步骤302,根据该样本语音确定样本拼音集合。
示例性的,针对每个样本语音来说,可以根据该样本语音确定样本拼音集合,该样本拼音集合可以包括多个样本拼音,根据样本语音确定样本拼音的过程,是从样本语音中识别出每个拼音的过程,为了区分方便,将识别出的每个拼音称为样本拼音,因此,可以根据该样本语音识别出多个样本拼音,对此识别过程不做限制,只要能够根据该样本语音识别出多个样本拼音即可。
比如说,针对样本语音“把空调打开”来说,则样本拼音集合可以包括如下的样本拼音“ba”、“kong”、“tiao”、“da”、“kai”。
步骤303,获取与该样本拼音集合对应的样本拼音向量。
示例性的,针对该样本拼音集合中的每个样本拼音,确定该样本拼音对应的拼音特征值,基于每个样本拼音对应的拼音特征值,获取与该样本拼音集合对应的样本拼音向量,该样本拼音向量包括每个样本拼音对应的拼音特征值。
比如说,预先维护所有拼音中的每个拼音与拼音特征值的映射关系,假设一共存在400个拼音,则可以维护拼音1与拼音特征值1的映射关系,拼音2与拼音特征值2的映射关系,…,拼音400与拼音特征值400的映射关系。
在此基础上,在步骤303中,针对样本拼音集合中的每个样本拼音,通过查询上述映射关系,可以得到与该样本拼音对应的拼音特征值,并将样本拼音集合中的每个样本拼音对应的拼音特征值组合,得到该样本拼音向量。
比如说,针对上述的样本拼音集合“ba”、“kong”、“tiao”、“da”、“kai”,该样本拼音向量可以是一个5维的特征向量,该特征向量依次可以包括“ba”对应的拼音特征值,“kong”对应的拼音特征值,“tiao”对应的拼音特征值,“da”对应的拼音特征值,“kai”对应的拼音特征值。
在一种可能的实施方式中,可以对所有拼音进行排序,假设一共存在400个拼音,则400个拼音的序号分别为1-400,针对每个拼音对应的拼音特征值,该拼音特征值可以是400位的数值。假设拼音的序号为N,则该拼音对应的拼音特征值中,第N位的取值是第一取值,除第N位之外的其它位的取值是第二取值。比如说,序号为1的拼音对应的拼音特征值中,第1位的取值是第一取值,第2-400位的取值是第二取值;序号为2的拼音对应的拼音特征值中,第2位的取值是第一取值,第1位、第3-400位的取值是第二取值,以此类推。
综上所述,针对样本拼音集合“ba”、“kong”、“tiao”、“da”、“kai”来说,该样本拼音向量可以是一个5*400维的特征向量,该特征向量包括5行400列,每一行表示一个拼音对应的拼音特征值,对此不再赘述。
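拼音特征值的构造与前述音素特征值同理，可草拟如下（假设共 400 个拼音、第一取值为 1、第二取值为 0，其中的拼音序号表仅为示意）：

```python
# 最小示意：按拼音序号生成拼音特征值（one-hot），并组合成拼音向量
# 假设：共 400 个拼音，第一取值为 1，第二取值为 0

NUM_PINYINS = 400
# 示意用的拼音序号表（实际应预先维护全部拼音与拼音特征值的映射关系）
PINYIN_INDEX = {"ba": 1, "kong": 2, "tiao": 3, "da": 4, "kai": 5}

def pinyin_vector(pinyins):
    rows = []
    for py in pinyins:
        n = PINYIN_INDEX[py]             # 序号为 N 的拼音：第 N 位取第一取值
        rows.append([1 if i == n else 0 for i in range(1, NUM_PINYINS + 1)])
    return rows

vec = pinyin_vector(["ba", "kong", "tiao", "da", "kai"])
# vec 为 5*400 维的特征向量：5 行 400 列，每行表示一个拼音对应的拼音特征值
```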
步骤304,将该样本拼音向量和该样本拼音向量对应的样本意图输入给初始网络模型,以通过该样本拼音向量和该样本意图对初始网络模型进行训练,得到已训练的目标网络模型。示例性的,由于是采用该样本拼音向量和该样本意图(即样本语音意图)对初始网络模型进行训练,得到已训练的目标网络模型,因此,该目标网络模型可以用于记录拼音向量与语音意图的映射关系。
参见上述实施例，可以获取大量样本语音，针对每个样本语音，获取该样本语音对应的样本意图，以及该样本语音的样本拼音集合对应的样本拼音向量，即，得到该样本语音对应的样本拼音向量和样本意图（作为样本拼音向量的标签信息参与训练）。基于此，可以将大量样本拼音向量及每个样本拼音向量对应的样本意图（即标签信息）输入给初始网络模型，从而利用样本拼音向量及样本意图对初始网络模型内各网络参数进行训练，对此训练过程不做限制。在初始网络模型训练完成后，已经完成训练的初始网络模型是目标网络模型。
比如说,可以将大量样本拼音向量及样本意图输入给初始网络模型的第一个网络层,由第一个网络层对这些数据进行处理,得到第一个网络层的输出数据,将第一个网络层的输出数据输入给初始网络模型的第二个网络层,以此类推,一直到将数据输入给初始网络模型的最后一个网络层,由最后一个网络层对数据进行处理,得到输出数据,将这个输出数据记为目标特征向量。
在得到目标特征向量后，基于目标特征向量确定初始网络模型是否已收敛。若初始网络模型已收敛，则将已收敛的初始网络模型确定为已训练的目标网络模型，完成初始网络模型的训练过程。若初始网络模型未收敛，则对未收敛的初始网络模型的网络参数进行调整，得到调整后的初始网络模型。
基于调整后的初始网络模型,可以将大量样本拼音向量及样本意图输入给调整后的初始网络模型,从而对调整后的初始网络模型重新进行训练,具体训练过程参见上述实施例,在此不再重复赘述。以此类推,一直到初始网络模型已收敛,并将已收敛的初始网络模型确定为已训练的目标网络模型。
在上述实施例中,基于目标特征向量确定初始网络模型是否已收敛,可以包括但不限于:预先构建损失函数,对此损失函数不做限制,可以根据经验设置。在得到目标特征向量后,可以根据该目标特征向量确定损失函数的损失值,比如说,可以将该目标特征向量代入损失函数,得到损失函数的损失值。在得到损失函数的损失值后,根据损失函数的损失值确定初始网络模型是否已收敛。
在实际应用中,还可以采用其它方式确定初始网络模型是否已收敛,对此不做限制。例如,若迭代次数达到预设次数阈值,则确定初始网络模型已收敛;又例如,若迭代时长达到预设时长阈值,则确定初始网络模型已收敛。
综上所述,可以通过样本拼音向量和该样本拼音向量对应的样本意图对初始网络模型进行训练,从而得到训练后的目标网络模型。
参见图4所示,针对基于目标网络模型的识别过程,本申请实施例中提出一种语音意图识别方法,该方法可以实现语音意图的识别,该方法包括:
步骤401,根据待识别语音确定待识别拼音集合。
示例性的,在得到待识别语音后,可以根据该待识别语音确定待识别拼音集合,该待识别拼音集合可以包括多个待识别拼音,根据待识别语音确定待识别拼音的过程,是从待识别语音中识别出每个拼音的过程,为了区分方便,可以将识别出的每个拼音称为待识别拼音,因此,可以根据该待识别语音识别出多个待识别拼音,对此识别过程不做限制,只要能够根据待识别语音识别出多个待识别拼音即可。比如说,针对待识别语音“开空调”来说,则待识别拼音集合可以包括如下的待识别拼音“kai”、“kong”、“tiao”。
步骤402,获取与该待识别拼音集合对应的待识别拼音向量。示例性的,针对待识别拼音集合中的每个待识别拼音,确定该待识别拼音对应的拼音特征值,基于每个待识别拼音对应的拼音特征值,获取与待识别拼音集合对应的待识别拼音向量,该待识别拼音向量包括每个待识别拼音对应的拼音特征值。
比如说,预先维护所有拼音中的每个拼音与拼音特征值的映射关系,假设一共存在400个拼音,则可以维护拼音1与拼音特征值1的映射关系,拼音2与拼音特征值2的映射关系,…,拼音400与拼音特征值400的映射关系。
步骤402中,针对待识别拼音集合中的每个待识别拼音,通过查询上述映射关系,可以得到与该待识别拼音对应的拼音特征值,并将待识别拼音集合中的每个待识别拼音对应的拼音特征值组合,得到该待识别拼音向量。
在一种可能的实施方式中,可以对所有拼音进行排序,假设一共存在400个拼音,则400个拼音的序号分别为1-400,针对每个拼音对应的拼音特征值,该拼音特征值可以是400位的数值。假设拼音的序号为N,则该拼音对应的拼音特征值中,第N位的取值是第一取值,除第N位之外的其它位的取值是第二取值。比如说,序号为1的拼音对应的拼音特征值中,第1位的取值是第一取值,第2-400位的取值是第二取值;序号为2的拼音对应的拼音特征值中,第2位的取值是第一取值,第1位、第3-400位的取值是第二取值,以此类推。
步骤403，将该待识别拼音向量输入给已训练的目标网络模型，以使目标网络模型输出与该待识别拼音向量对应的语音意图。示例性的，目标网络模型用于记录拼音向量与语音意图的映射关系，在将待识别拼音向量输入给目标网络模型后，目标网络模型可以输出与该待识别拼音向量对应的语音意图。
比如说,可以将待识别拼音向量输入给目标网络模型的第一个网络层,由第一个网络层对该待识别拼音向量进行处理,得到第一个网络层的输出数据,将第一个网络层的输出数据输入给目标网络模型的第二个网络层,以此类推,一直到将数据输入给目标网络模型的最后一个网络层,由最后一个网络层对数据进行处理,得到输出数据,将这个输出数据记为目标特征向量。
由于目标网络模型用于记录拼音向量与语音意图的映射关系,因此,在得到目标特征向量后,可以基于该目标特征向量查询该映射关系,得到与该目标特征向量对应的语音意图,这个语音意图可以是与该待识别拼音向量对应的语音意图,目标网络模型可以输出与该待识别拼音向量对应的语音意图。
在得到与待识别拼音向量对应的语音意图后,可以基于该语音意图对设备进行控制,对此控制方式不做限制,如语音意图是“开空调”时,打开空调。
在一种可能的实施方式中,目标网络模型输出与待识别拼音向量对应的语音意图时,还可以输出语音意图对应的概率值(如0-1之间的概率值,该概率值也可以称为置信度),例如,目标网络模型可以输出语音意图1以及语音意图1的概率值1(如0.8),语音意图2以及语音意图2的概率值2(如0.1),语音意图3以及语音意图3的概率值3(如0.08),以此类推。
基于上述输出数据,可以将概率值最大的语音意图作为与待识别拼音向量对应的语音意图,例如,将概率值最大的语音意图1作为与待识别拼音向量对应的语音意图。或者,先确定概率值最大的语音意图,确定该语音意图的概率值(即最大概率值)是否大于预设概率阈值,若是,将该语音意图作为与待识别拼音向量对应的语音意图,否则,没有与该待识别拼音向量对应的语音意图。
由以上技术方案可见,本申请实施例中,基于待识别拼音识别出语音意图,不是基于文本识别语音意图,不需要依赖语音转化为文本的准确率。基于待识别语音确定待识别拼音的准确率很高,语音意图识别的准确率很高,因此,能够准确识别出用户的语音意图,有效提高语音意图识别的准确率。比如说,用户发出待识别语音“我想看有树木的照片”,终端设备(如IPC,智能手机等)基于待识别语音确定的拼音是“wo、xiang、kan、you、shu、mu、de、zhao、pian”,即“树木”对应的拼音是“shu、mu”,从而基于上述拼音来确定语音意图,而不需要从待识别语音“我想看有树木的照片”中解析出“数目”或者“树木”,从而避免采用“数目”或“树木”确定语音意图,使得意图识别有了更强的可靠性,不需要语音识别的大量语言模型算法库,带来性能和内存的大幅度优化。
基于与上述方法同样的申请构思,本申请实施例中提出一种语音意图识别装置,参见图5A所示,为所述装置的结构示意图,所述装置可以包括:
确定模块511,用于根据待识别语音确定待识别音素集合;
获取模块512,用于获取与所述待识别音素集合对应的待识别音素向量;
处理模块513,用于将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图;
其中,所述目标网络模型用于记录音素向量与语音意图的映射关系。
在一种可能的实施方式中,所述待识别音素集合包括多个待识别音素,所述获取模块512获取与所述待识别音素集合对应的待识别音素向量时具体用于:
针对每个待识别音素,确定所述待识别音素对应的音素特征值;基于每个待识别音素对应的音素特征值,获取与所述待识别音素集合对应的待识别音素向量,所述待识别音素向量包括每个待识别音素对应的音素特征值。
在一种可能的实施方式中,确定模块511还用于:获取样本语音和所述样本语音对应的样本意图;根据所述样本语音确定样本音素集合;获取模块512还用于:获取与所述样本音素集合对应的样本音素向量;处理模块513还用于:将所述样本音素向量和所述样本意图输入给初始网络模型,通过所述样本音素向量和所述样本意图对初始网络模型进行训练,得到所述目标网络模型。
在一种可能的实施方式中，所述样本音素集合包括多个样本音素，所述获取模块512获取与所述样本音素集合对应的样本音素向量时具体用于：
针对每个样本音素,确定所述样本音素对应的音素特征值;
基于每个样本音素对应的音素特征值,获取与所述样本音素集合对应的样本音素向量,所述样本音素向量包括每个样本音素对应的音素特征值。
基于与上述方法同样的申请构思,本申请实施例中提出一种语音意图识别装置,参见图5B所示,为所述装置的结构示意图,所述装置可以包括:
确定模块521,用于根据待识别语音确定待识别拼音集合;
获取模块522,用于获取与所述待识别拼音集合对应的待识别拼音向量;
处理模块523,用于将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图;
其中,所述目标网络模型用于记录拼音向量与语音意图的映射关系。
在一种可能的实施方式中,所述待识别拼音集合包括多个待识别拼音,所述获取模块522获取与所述待识别拼音集合对应的待识别拼音向量时具体用于:针对每个待识别拼音,确定所述待识别拼音对应的拼音特征值;基于每个待识别拼音对应的拼音特征值,获取与所述待识别拼音集合对应的待识别拼音向量,所述待识别拼音向量包括每个待识别拼音对应的拼音特征值。
在一种可能的实施方式中,确定模块521还用于:获取样本语音和所述样本语音对应的样本意图;根据所述样本语音确定样本拼音集合;获取模块522还用于:获取与所述样本拼音集合对应的样本拼音向量;处理模块523还用于:将所述样本拼音向量和所述样本意图输入给初始网络模型,通过所述样本拼音向量和所述样本意图对初始网络模型进行训练,得到所述目标网络模型。
在一种可能的实施方式中,所述样本拼音集合包括多个样本拼音,所述获取模块522获取与所述样本拼音集合对应的样本拼音向量时具体用于:
针对每个样本拼音,确定所述样本拼音对应的拼音特征值;
基于每个样本拼音对应的拼音特征值,获取与所述样本拼音集合对应的样本拼音向量,所述样本拼音向量包括每个样本拼音对应的拼音特征值。
基于与上述方法同样的申请构思,本申请实施例中提出一种语音意图识别设备,参见图6所示,所述语音意图识别设备包括:处理器61和机器可读存储介质62,所述机器可读存储介质62存储有能够被所述处理器61执行的机器可执行指令;所述处理器61用于执行机器可执行指令,以实现如下步骤:
根据待识别语音确定待识别音素集合;
获取与所述待识别音素集合对应的待识别音素向量;
将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图;
其中,所述目标网络模型用于记录音素向量与语音意图的映射关系。
在一种可能的实施方式中,所述待识别音素集合包括多个待识别音素,在所述获取与所述待识别音素集合对应的待识别音素向量时,所述处理器61被促使:
针对每个待识别音素,确定该待识别音素对应的音素特征值;
基于所述每个待识别音素对应的音素特征值,获取与所述待识别音素集合对应的待识别音素向量,所述待识别音素向量包括所述每个待识别音素对应的音素特征值。
在一种可能的实施方式中,在所述将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图之前,所述处理器61还被促使:
获取样本语音和所述样本语音对应的样本意图;
根据所述样本语音确定样本音素集合;
获取与所述样本音素集合对应的样本音素向量;
将所述样本音素向量和所述样本意图输入给初始网络模型,通过所述样本音素向量和所述样本意图对所述初始网络模型进行训练,得到所述目标网络模型。
在一种可能的实施方式中,所述样本音素集合包括多个样本音素,在所述获取与所述样本音素集合对应的样本音素向量时,所述处理器61被促使:
针对每个样本音素,确定该样本音素对应的音素特征值;
基于所述每个样本音素对应的音素特征值,获取与所述样本音素集合对应的样本音素向量,所述样本音素向量包括所述每个样本音素对应的音素特征值。
基于与上述方法同样的申请构思,本申请实施例中提出一种语音意图识别设备,参见图7所示,所述语音意图识别设备包括:处理器71和机器可读存储介质72,所述机器可读存储介质72存储有能够被所述处理器71执行的机器可执行指令;所述处理器71用于执行机器可执行指令,以实现如下步骤:
根据待识别语音确定待识别拼音集合;
获取与所述待识别拼音集合对应的待识别拼音向量;
将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图;
其中,所述目标网络模型用于记录拼音向量与语音意图的映射关系。
在一种可能的实施方式中,所述待识别拼音集合包括多个待识别拼音,在所述获取与所述待识别拼音集合对应的待识别拼音向量时,所述处理器71被促使:
针对每个待识别拼音,确定该待识别拼音对应的拼音特征值;
基于所述每个待识别拼音对应的拼音特征值,获取与所述待识别拼音集合对应的待识别拼音向量,所述待识别拼音向量包括所述每个待识别拼音对应的拼音特征值。
在一种可能的实施方式中,在所述将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图之前,所述处理器71还被促使:
获取样本语音和所述样本语音对应的样本意图;
根据所述样本语音确定样本拼音集合;
获取与所述样本拼音集合对应的样本拼音向量;
将所述样本拼音向量和所述样本意图输入给初始网络模型，通过所述样本拼音向量和所述样本意图对所述初始网络模型进行训练，得到所述目标网络模型。
在一种可能的实施方式中,所述样本拼音集合包括多个样本拼音,在所述获取与所述样本拼音集合对应的样本拼音向量时,所述处理器71被促使:
针对每个样本拼音,确定该样本拼音对应的拼音特征值;
基于所述每个样本拼音对应的拼音特征值,获取与所述样本拼音集合对应的样本拼音向量,所述样本拼音向量包括所述每个样本拼音对应的拼音特征值。
基于与上述方法同样的申请构思,本申请实施例还提供一种机器可读存储介质,所述机器可读存储介质上存储有若干计算机指令,所述计算机指令被处理器执行时,能够实现本申请上述示例公开的语音意图识别方法。
其中，上述机器可读存储介质可以是任何电子、磁性、光学或其它物理存储装置，可以包含或存储信息，如可执行指令、数据，等等。例如，机器可读存储介质可以是：RAM（Random Access Memory，随机存取存储器）、易失存储器、非易失性存储器、闪存、存储驱动器（如硬盘驱动器）、固态硬盘、任何类型的存储盘（如光盘、DVD等），或者类似的存储介质，或者它们的组合。
上述实施例阐明的系统、装置、模块或单元,具体可以由计算机芯片或实体实现,或者由具有某种功能的产品来实现。一种典型的实现设备为计算机,计算机的具体形式可以是个人计算机、膝上型计算机、蜂窝电话、相机电话、智能电话、个人数字助理、媒体播放器、导航设备、电子邮件收发设备、游戏控制台、平板计算机、可穿戴设备或者这些设备中的任意几种设备的组合。
为了描述的方便,描述以上装置时以功能分为各种单元分别描述。当然,在实施本申请时可以把各单元的功能在同一个或多个软件和/或硬件中实现。
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可以由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其它可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其它可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
而且,这些计算机程序指令也可以存储在能引导计算机或其它可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或者多个流程和/或方框图一个方框或者多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其它可编程数据处理设备上,使得在计算机或者其它可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其它可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
以上所述仅为本申请的实施例而已,并不用于限制本申请。对于本领域技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本申请的权利要求范围之内。

Claims (18)

  1. 一种语音意图识别方法,包括:
    根据待识别语音确定待识别音素集合;
    获取与所述待识别音素集合对应的待识别音素向量;
    将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图;
    其中,所述目标网络模型用于记录音素向量与语音意图的映射关系。
  2. 根据权利要求1所述的方法,其特征在于,所述待识别音素集合包括多个待识别音素,所述获取与所述待识别音素集合对应的待识别音素向量,包括:
    针对每个待识别音素,确定该待识别音素对应的音素特征值;
    基于所述每个待识别音素对应的音素特征值,获取与所述待识别音素集合对应的待识别音素向量,所述待识别音素向量包括所述每个待识别音素对应的音素特征值。
  3. 根据权利要求1所述的方法,其特征在于,
    所述将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图之前,所述方法还包括:
    获取样本语音和所述样本语音对应的样本意图;
    根据所述样本语音确定样本音素集合;
    获取与所述样本音素集合对应的样本音素向量;
    将所述样本音素向量和所述样本意图输入给初始网络模型,通过所述样本音素向量和所述样本意图对所述初始网络模型进行训练,得到所述目标网络模型。
  4. 根据权利要求3所述的方法,其特征在于,所述样本音素集合包括多个样本音素,所述获取与所述样本音素集合对应的样本音素向量,包括:
    针对每个样本音素,确定该样本音素对应的音素特征值;
    基于所述每个样本音素对应的音素特征值,获取与所述样本音素集合对应的样本音素向量,所述样本音素向量包括所述每个样本音素对应的音素特征值。
  5. 一种语音意图识别方法,包括:
    根据待识别语音确定待识别拼音集合;
    获取与所述待识别拼音集合对应的待识别拼音向量;
    将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图;
    其中,所述目标网络模型用于记录拼音向量与语音意图的映射关系。
  6. 根据权利要求5所述的方法,其特征在于,所述待识别拼音集合包括多个待识别拼音,所述获取与所述待识别拼音集合对应的待识别拼音向量,包括:
    针对每个待识别拼音,确定该待识别拼音对应的拼音特征值;
    基于所述每个待识别拼音对应的拼音特征值,获取与所述待识别拼音集合对应的待识别拼音向量,所述待识别拼音向量包括所述每个待识别拼音对应的拼音特征值。
  7. 根据权利要求5所述的方法,其特征在于,
    所述将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图之前,所述方法还包括:
    获取样本语音和所述样本语音对应的样本意图;
    根据所述样本语音确定样本拼音集合;
    获取与所述样本拼音集合对应的样本拼音向量;
    将所述样本拼音向量和所述样本意图输入给初始网络模型,通过所述样本拼音向量和所述样本意图对所述初始网络模型进行训练,得到所述目标网络模型。
  8. 根据权利要求7所述的方法,其特征在于,所述样本拼音集合包括多个样本拼音,所述获取与所述样本拼音集合对应的样本拼音向量,包括:
    针对每个样本拼音,确定该样本拼音对应的拼音特征值;
    基于所述每个样本拼音对应的拼音特征值,获取与所述样本拼音集合对应的样本拼音向量,所述样本拼音向量包括所述每个样本拼音对应的拼音特征值。
  9. 一种语音意图识别装置,包括:
    确定模块,用于根据待识别语音确定待识别音素集合;
    获取模块,用于获取与所述待识别音素集合对应的待识别音素向量;
    处理模块,用于将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图;
    其中,所述目标网络模型用于记录音素向量与语音意图的映射关系。
  10. 一种语音意图识别装置,包括:
    确定模块,用于根据待识别语音确定待识别拼音集合;
    获取模块,用于获取与所述待识别拼音集合对应的待识别拼音向量;
    处理模块,用于将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图;
    其中,所述目标网络模型用于记录拼音向量与语音意图的映射关系。
  11. 一种语音意图识别设备,包括:处理器和机器可读存储介质,所述机器可读存储介质存储有能够被所述处理器执行的机器可执行指令;所述处理器用于执行机器可执行指令,以实现如下步骤:
    根据待识别语音确定待识别音素集合;
    获取与所述待识别音素集合对应的待识别音素向量;
    将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图;
    其中,所述目标网络模型用于记录音素向量与语音意图的映射关系。
  12. 根据权利要求11所述的设备,其特征在于,所述待识别音素集合包括多个待识别音素,在所述获取与所述待识别音素集合对应的待识别音素向量时,所述处理器被促使:
    针对每个待识别音素,确定该待识别音素对应的音素特征值;
    基于所述每个待识别音素对应的音素特征值,获取与所述待识别音素集合对应的待识别音素向量,所述待识别音素向量包括所述每个待识别音素对应的音素特征值。
  13. 根据权利要求11所述的设备,其特征在于,
    在所述将所述待识别音素向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别音素向量对应的语音意图之前,所述处理器还被促使:
    获取样本语音和所述样本语音对应的样本意图;
    根据所述样本语音确定样本音素集合;
    获取与所述样本音素集合对应的样本音素向量;
    将所述样本音素向量和所述样本意图输入给初始网络模型,通过所述样本音素向量和所述样本意图对所述初始网络模型进行训练,得到所述目标网络模型。
  14. 根据权利要求13所述的设备,其特征在于,所述样本音素集合包括多个样本音素,在所述获取与所述样本音素集合对应的样本音素向量时,所述处理器被促使:
    针对每个样本音素,确定该样本音素对应的音素特征值;
    基于所述每个样本音素对应的音素特征值,获取与所述样本音素集合对应的样本音素向量,所述样本音素向量包括所述每个样本音素对应的音素特征值。
  15. 一种语音意图识别设备,包括:处理器和机器可读存储介质,所述机器可读存储介质存储有能够被所述处理器执行的机器可执行指令;所述处理器用于执行所述机器可执行指令,以实现如下步骤:
    根据待识别语音确定待识别拼音集合;
    获取与所述待识别拼音集合对应的待识别拼音向量;
    将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图;
    其中,所述目标网络模型用于记录拼音向量与语音意图的映射关系。
  16. 根据权利要求15所述的设备,其特征在于,所述待识别拼音集合包括多个待识别拼音,在所述获取与所述待识别拼音集合对应的待识别拼音向量时,所述处理器被促使:
    针对每个待识别拼音,确定该待识别拼音对应的拼音特征值;
    基于所述每个待识别拼音对应的拼音特征值,获取与所述待识别拼音集合对应的待识别拼音向量,所述待识别拼音向量包括所述每个待识别拼音对应的拼音特征值。
  17. 根据权利要求15所述的设备,其特征在于,
    在所述将所述待识别拼音向量输入给已训练的目标网络模型,以使所述目标网络模型输出与所述待识别拼音向量对应的语音意图之前,所述处理器还被促使:
    获取样本语音和所述样本语音对应的样本意图;
    根据所述样本语音确定样本拼音集合;
    获取与所述样本拼音集合对应的样本拼音向量;
    将所述样本拼音向量和所述样本意图输入给初始网络模型,通过所述样本拼音向量和所述样本意图对所述初始网络模型进行训练,得到所述目标网络模型。
  18. 根据权利要求17所述的设备,其特征在于,所述样本拼音集合包括多个样本拼音,在所述获取与所述样本拼音集合对应的样本拼音向量时,所述处理器被促使:
    针对每个样本拼音,确定该样本拼音对应的拼音特征值;
    基于所述每个样本拼音对应的拼音特征值,获取与所述样本拼音集合对应的样本拼音向量,所述样本拼音向量包括所述每个样本拼音对应的拼音特征值。
PCT/CN2021/110134 2020-08-06 2021-08-02 语音意图识别方法、装置及设备 WO2022028378A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010785605.1A CN111986653A (zh) 2020-08-06 2020-08-06 一种语音意图识别方法、装置及设备
CN202010785605.1 2020-08-06

Publications (1)

Publication Number Publication Date
WO2022028378A1 true WO2022028378A1 (zh) 2022-02-10

Family

ID=73444526

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/110134 WO2022028378A1 (zh) 2020-08-06 2021-08-02 语音意图识别方法、装置及设备

Country Status (2)

Country Link
CN (1) CN111986653A (zh)
WO (1) WO2022028378A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986653A (zh) * 2020-08-06 2020-11-24 杭州海康威视数字技术股份有限公司 一种语音意图识别方法、装置及设备
CN113836945B (zh) * 2021-09-23 2024-04-16 平安科技(深圳)有限公司 意图识别方法、装置、电子设备和存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6408271B1 (en) * 1999-09-24 2002-06-18 Nortel Networks Limited Method and apparatus for generating phrasal transcriptions
CN107357875A (zh) * 2017-07-04 2017-11-17 北京奇艺世纪科技有限公司 一种语音搜索方法、装置及电子设备
CN108549637A (zh) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 基于拼音的语义识别方法、装置以及人机对话系统
CN110674314A (zh) * 2019-09-27 2020-01-10 北京百度网讯科技有限公司 语句识别方法及装置
CN111081219A (zh) * 2020-01-19 2020-04-28 南京硅基智能科技有限公司 一种端到端的语音意图识别方法
CN111986653A (zh) * 2020-08-06 2020-11-24 杭州海康威视数字技术股份有限公司 一种语音意图识别方法、装置及设备

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08227410A (ja) * 1994-12-22 1996-09-03 Just Syst Corp ニューラルネットワークの学習方法、ニューラルネットワークおよびニューラルネットワークを利用した音声認識装置
CN109754789B (zh) * 2017-11-07 2021-06-08 北京国双科技有限公司 语音音素的识别方法及装置
CN110767214A (zh) * 2018-07-27 2020-02-07 杭州海康威视数字技术股份有限公司 语音识别方法及其装置和语音识别系统
CN110808050B (zh) * 2018-08-03 2024-04-30 蔚来(安徽)控股有限公司 语音识别方法及智能设备
CN110931000B (zh) * 2018-09-20 2022-08-02 杭州海康威视数字技术股份有限公司 语音识别的方法和装置
CN109829153A (zh) * 2019-01-04 2019-05-31 平安科技(深圳)有限公司 基于卷积神经网络的意图识别方法、装置、设备及介质
KR20200091738A (ko) * 2019-01-23 2020-07-31 주식회사 케이티 핵심어 검출 장치, 이를 이용한 핵심어 검출 방법 및 컴퓨터 프로그램
CN110415687B (zh) * 2019-05-21 2021-04-13 腾讯科技(深圳)有限公司 语音处理方法、装置、介质、电子设备
CN110349567B (zh) * 2019-08-12 2022-09-13 腾讯科技(深圳)有限公司 语音信号的识别方法和装置、存储介质及电子装置
KR102321798B1 (ko) * 2019-08-15 2021-11-05 엘지전자 주식회사 인공 신경망 기반의 음성 인식 모델을 학습시키는 방법 및 음성 인식 디바이스
CN110610707B (zh) * 2019-09-20 2022-04-22 科大讯飞股份有限公司 语音关键词识别方法、装置、电子设备和存储介质
CN111243603B (zh) * 2020-01-09 2022-12-06 厦门快商通科技股份有限公司 声纹识别方法、系统、移动终端及存储介质
CN111274797A (zh) * 2020-01-13 2020-06-12 平安国际智慧城市科技股份有限公司 用于终端的意图识别方法、装置、设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6408271B1 (en) * 1999-09-24 2002-06-18 Nortel Networks Limited Method and apparatus for generating phrasal transcriptions
CN107357875A (zh) * 2017-07-04 2017-11-17 北京奇艺世纪科技有限公司 一种语音搜索方法、装置及电子设备
CN108549637A (zh) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 基于拼音的语义识别方法、装置以及人机对话系统
CN110674314A (zh) * 2019-09-27 2020-01-10 北京百度网讯科技有限公司 语句识别方法及装置
CN111081219A (zh) * 2020-01-19 2020-04-28 南京硅基智能科技有限公司 一种端到端的语音意图识别方法
CN111986653A (zh) * 2020-08-06 2020-11-24 杭州海康威视数字技术股份有限公司 一种语音意图识别方法、装置及设备

Also Published As

Publication number Publication date
CN111986653A (zh) 2020-11-24

Similar Documents

Publication Publication Date Title
US10902845B2 (en) System and methods for adapting neural network acoustic models
JP6980119B2 (ja) 音声認識方法、並びにその装置、デバイス、記憶媒体及びプログラム
US10032463B1 (en) Speech processing with learned representation of user interaction history
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
CN105679317B (zh) 用于训练语言模型并识别语音的方法和设备
US20210407498A1 (en) On-device custom wake word detection
US10629185B2 (en) Statistical acoustic model adaptation method, acoustic model learning method suitable for statistical acoustic model adaptation, storage medium storing parameters for building deep neural network, and computer program for adapting statistical acoustic model
CN106683661B (zh) 基于语音的角色分离方法及装置
Anand et al. Few shot speaker recognition using deep neural networks
WO2016037350A1 (en) Learning student dnn via output distribution
CN108711421A (zh) 一种语音识别声学模型建立方法及装置和电子设备
WO2022028378A1 (zh) 语音意图识别方法、装置及设备
CN109754789B (zh) 语音音素的识别方法及装置
CN114830139A (zh) 使用模型提供的候选动作训练模型
WO2021208455A1 (zh) 一种面向家居口语环境的神经网络语音识别方法及系统
JP2023545988A (ja) トランスフォーマトランスデューサ:ストリーミング音声認識と非ストリーミング音声認識を統合する1つのモデル
US20230096805A1 (en) Contrastive Siamese Network for Semi-supervised Speech Recognition
Ault et al. On speech recognition algorithms
Jansson Single-word speech recognition with convolutional neural networks on raw waveforms
CN107452374B (zh) 基于单向自标注辅助信息的多视角语言识别方法
Chen et al. Sequence-to-sequence modelling for categorical speech emotion recognition using recurrent neural network
Soliman et al. Isolated word speech recognition using convolutional neural network
Chang et al. On the importance of modeling and robustness for deep neural network feature
US9892726B1 (en) Class-based discriminative training of speech models
Walter et al. An evaluation of unsupervised acoustic model training for a dysarthric speech interface

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21852323

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21852323

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 21852323

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.08.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21852323

Country of ref document: EP

Kind code of ref document: A1