CN111986653B - Voice intention recognition method, device and equipment - Google Patents


Info

Publication number
CN111986653B
Authority
CN
China
Prior art keywords
pinyin
phoneme
sample
vector
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010785605.1A
Other languages
Chinese (zh)
Other versions
CN111986653A (en)
Inventor
陈展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202010785605.1A priority Critical patent/CN111986653B/en
Publication of CN111986653A publication Critical patent/CN111986653A/en
Priority to PCT/CN2021/110134 priority patent/WO2022028378A1/en
Application granted granted Critical
Publication of CN111986653B publication Critical patent/CN111986653B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice intention recognition method, device, and equipment. The method comprises the following steps: determining a phoneme set to be recognized from the speech to be recognized; obtaining the phoneme vector to be recognized corresponding to that phoneme set; and inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs the voice intention corresponding to that phoneme vector. The target network model records the mapping relationship between phoneme vectors and voice intentions. Because recognition operates on phonemes rather than on converted text, this technical scheme accurately recognizes the user's voice intention and effectively improves the accuracy of voice intention recognition.

Description

Voice intention recognition method, device and equipment
Technical Field
The present application relates to the field of voice interaction, and in particular, to a method, apparatus, and device for recognizing voice intention.
Background
With the rapid development of artificial intelligence technology and its wide use in daily life, voice interaction has become an important bridge for communication between people and machines. For a robot system to converse with a user and complete a specific task, one core technology is recognition of voice intention: after the user inputs speech to be recognized, the robot system judges the user's voice intention from that speech.
In the related art, voice intention recognition comprises a speech recognition stage and an intention recognition stage. In the speech recognition stage, the speech to be recognized is converted into text by automatic speech recognition (ASR). In the intention recognition stage, the text is semantically understood by natural language processing (NLP) to obtain keyword information, and the user's voice intention is recognized from that keyword information.
In this text-based approach, the accuracy of intention recognition depends heavily on the accuracy of speech-to-text conversion; when that conversion is inaccurate, voice intention recognition becomes unreliable and the user's intention cannot be recognized correctly. For example, the speech may contain the word "tree" (shù), but the converted text may read "number" (shù, a homophone), leading to erroneous recognition of the voice intention.
Disclosure of Invention
The application provides a voice intention recognition method, which comprises the following steps:
Determining a phoneme set to be recognized according to the voice to be recognized;
Obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
In one possible implementation, the phoneme set to be recognized includes a plurality of phonemes to be recognized, and the obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized includes:
determining, for each phoneme to be recognized, a phoneme characteristic value corresponding to that phoneme; and acquiring, based on the phoneme characteristic value corresponding to each phoneme to be recognized, a phoneme vector to be recognized corresponding to the phoneme set to be recognized, wherein the phoneme vector to be recognized comprises the phoneme characteristic value corresponding to each phoneme to be recognized.
In one possible implementation manner, before the inputting the phoneme vector to be recognized into the trained target network model, so that the target network model outputs the voice intention corresponding to the phoneme vector to be recognized, the method further includes: acquiring sample voice and sample intention corresponding to the sample voice;
Determining a sample phoneme set according to the sample speech;
acquiring a sample phoneme vector corresponding to the sample phoneme set;
and inputting the sample phoneme vector and the sample intention to an initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain the target network model.
In one possible implementation, the sample phoneme set includes a plurality of sample phonemes, and the obtaining a sample phoneme vector corresponding to the sample phoneme set includes:
determining, for each sample phoneme, a phoneme characteristic value corresponding to that sample phoneme;
and acquiring, based on the phoneme characteristic value corresponding to each sample phoneme, a sample phoneme vector corresponding to the sample phoneme set, wherein the sample phoneme vector comprises the phoneme characteristic value corresponding to each sample phoneme.
The application provides a voice intention recognition method, which comprises the following steps:
Determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be recognized corresponding to the pinyin set to be recognized;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
In one possible implementation manner, the pinyin set to be recognized includes a plurality of pinyins to be recognized, and the obtaining the pinyin vector to be recognized corresponding to the pinyin set to be recognized includes:
determining, for each pinyin to be recognized, a pinyin characteristic value corresponding to that pinyin; and acquiring, based on the pinyin characteristic value corresponding to each pinyin to be recognized, a pinyin vector to be recognized corresponding to the pinyin set to be recognized, wherein the pinyin vector to be recognized comprises the pinyin characteristic value corresponding to each pinyin to be recognized.
In one possible implementation manner, before the inputting the pinyin vector to be recognized into the trained target network model, so that the target network model outputs the voice intent corresponding to the pinyin vector to be recognized, the method further includes: acquiring sample voice and sample intention corresponding to the sample voice;
determining a sample pinyin set according to the sample speech;
Acquiring a sample pinyin vector corresponding to the sample pinyin set;
And inputting the sample pinyin vector and the sample intention to an initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain the target network model.
In one possible implementation, the sample pinyin set includes a plurality of sample pinyin, and the obtaining a sample pinyin vector corresponding to the sample pinyin set includes:
determining a pinyin characteristic value corresponding to each sample pinyin;
And acquiring a sample pinyin vector corresponding to the sample pinyin set based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector comprises the pinyin characteristic value corresponding to each sample pinyin.
The present application provides a voice intention recognition apparatus, the apparatus comprising:
The determining module is used for determining a phoneme set to be recognized according to the voice to be recognized;
the acquisition module is used for acquiring a phoneme vector to be identified, which corresponds to the phoneme set to be identified;
The processing module is used for inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
The present application provides a voice intention recognition apparatus, the apparatus comprising:
the determining module is used for determining a pinyin set to be recognized according to the voice to be recognized;
the acquisition module is used for acquiring the pinyin vectors to be identified corresponding to the pinyin sets to be identified;
The processing module is used for inputting the pinyin vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
The present application provides a voice intention recognition apparatus including: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine-executable instructions to perform the steps of:
Determining a phoneme set to be recognized according to the voice to be recognized;
Obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention;
or determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be recognized corresponding to the pinyin set to be recognized;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
As can be seen from the above technical solutions, the embodiments of the present application recognize voice intention from the phonemes to be recognized rather than from converted text, so recognition accuracy no longer depends on the accuracy of speech-to-text conversion. Because phonemes are the minimum speech units divided according to the natural attributes of speech (they are analyzed from pronunciation actions, and one action forms one phoneme), the phonemes to be recognized can be determined from the speech with high accuracy. Voice intention recognition is therefore accurate and reliable, the user's voice intention can be correctly recognized, and no large language-model algorithm library for speech recognition is required, which greatly reduces the demands on performance and memory.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the following description briefly introduces the drawings used in the embodiments. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art may obtain other drawings from them.
FIG. 1 is a flow chart of a method of speech intent recognition in one embodiment of the present application;
FIG. 2 is a flow chart of a method of speech intent recognition in one embodiment of the present application;
FIG. 3 is a flow chart of a method of speech intent recognition in one embodiment of the present application;
FIG. 4 is a flow chart of a method of speech intent recognition in one embodiment of the present application;
FIG. 5A is a schematic diagram of a voice intent recognition device in accordance with one embodiment of the present application;
FIG. 5B is a schematic diagram of a voice intent recognition device in accordance with one embodiment of the present application;
fig. 6 is a hardware configuration diagram of a voice intention recognition apparatus in one embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the application is for describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present application to describe various information, the information should not be limited by these terms, which are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may be referred to as first information without departing from the scope of the application. Furthermore, depending on the context, the word "if" may be interpreted as "when," "upon," or "in response to determining."
Before describing the technical scheme of the application, concepts related to the embodiments of the application are described.
Machine learning: machine learning is a way to implement artificial intelligence; it studies how computers simulate or implement human learning behavior to obtain new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Deep learning is a subclass of machine learning that models specific real-world problems with mathematical models in order to solve similar problems in the field. Neural networks are an implementation of deep learning; for ease of description, the structure and function of network models are described herein by taking neural networks as an example, and other subclasses of machine learning are similar.
Neural network: the neural network includes, but is not limited to, a Convolutional Neural Network (CNN), a cyclic neural network (RNN), a fully connected network, etc., and the structural units of the neural network may include, but are not limited to, a convolutional layer (Conv), a pooling layer (Pool), an excitation layer, a fully connected layer (FC), etc.
In practical applications, the neural network may be constructed by combining one or more convolution layers, one or more pooling layers, one or more excitation layers, and one or more fully-connected layers according to different requirements.
In a convolutional layer, the input data features are enhanced by a convolution operation with a convolution kernel, which may be an m x n matrix. Convolving the layer's input data features with the kernel yields the layer's output data features; the convolution operation is essentially a filtering process.
In a pooling layer, operations such as taking the maximum, minimum, or average are performed on the input data features (for example, the output of a convolutional layer). The pooling layer subsamples the input features by exploiting local correlation, reducing the amount of processing while keeping the features invariant; the pooling operation is essentially a downsampling process.
In an excitation layer, the input data features are mapped by an activation function (for example, a nonlinear function) to introduce a nonlinear factor, so that the neural network gains expressive power through nonlinear combinations.
The activation function may include, but is not limited to, the ReLU (Rectified Linear Unit) function, which sets features less than 0 to 0 and leaves features greater than 0 unchanged.
In a fully-connected layer, all data features input to the layer are fully connected to obtain a feature vector, which may comprise a plurality of data features.
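The four layer types above can be sketched as follows. This is a minimal NumPy illustration of convolution, pooling, excitation, and full connection on a toy 1-D feature sequence; the shapes, kernel, and weights are assumptions for demonstration, not the patent's model:

```python
import numpy as np

def conv1d(x, kernel):
    """Convolution layer: slide the kernel over x (a filtering process)."""
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)])

def max_pool(x, size=2):
    """Pooling layer: subsample by taking the maximum of each window."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

def relu(x):
    """Excitation layer: ReLU zeroes features below 0 and keeps the rest."""
    return np.maximum(x, 0)

def fully_connected(x, weights, bias):
    """Fully-connected layer: combine all input features into one vector."""
    return weights @ x + bias

x = np.array([1.0, -2.0, 3.0, -4.0, 5.0, -6.0])
h = relu(max_pool(conv1d(x, np.array([1.0, -1.0]))))         # -> [3., 7.]
out = fully_connected(h, np.ones((2, len(h))), np.zeros(2))  # -> [10., 10.]
```

Stacking such layers in different orders and quantities is exactly the construction described in the preceding paragraphs.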
Network model: a model constructed using a machine learning algorithm (e.g., a deep learning algorithm), such as a model constructed using a neural network, i.e., the network model may be composed of one or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully-connected layers. For convenience of distinction, the untrained network model is referred to as the initial network model, and the trained network model is referred to as the target network model.
In the training process of the initial network model, the sample data is utilized to train all network parameters in the initial network model, such as convolution layer parameters (such as convolution kernel parameters), pooling layer parameters, excitation layer parameters, full connection layer parameters and the like, which are not limited. By training each network parameter in the initial network model, the initial network model is fitted with the mapping relation between the input and the output. After the initial network model training is completed, the initial network model which has completed training is the target network model, and the voice intention is recognized through the target network model.
Phonemes: phonemes are the minimum speech units divided according to the natural attributes of speech; they are analyzed from the pronunciation actions within a syllable, and one action constitutes one phoneme. For example, the Chinese syllable "a" has a single phoneme (a), "ai" has two phonemes (a and i), and "dai" has three phonemes (d, a, and i). Likewise, the Chinese word for tree, "shumu," has five phonemes (s, h, u, m, and u).
Pinyin: a pinyin combines one or more phonemes into a composite sound. For example, "dai" has three phonemes (d, a, and i) that make up a single pinyin (dai); "shumu" has five phonemes (s, h, u, m, and u) that make up two pinyins (shu and mu).
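The phoneme/pinyin relationship above can be sketched directly from the patent's own examples, where each pinyin syllable is decomposed letter by letter ("shumu" yields s, h, u, m, u). A minimal decomposition under that assumption:

```python
def pinyin_to_phonemes(syllables):
    """Flatten a list of pinyin syllables into the phoneme list,
    one phoneme per letter, as in the patent's examples."""
    return [ch for syllable in syllables for ch in syllable]

# The examples from the text:
assert pinyin_to_phonemes(["dai"]) == ["d", "a", "i"]
assert pinyin_to_phonemes(["shu", "mu"]) == ["s", "h", "u", "m", "u"]
```

Going the other direction (grouping phonemes back into pinyin syllables) requires a syllable inventory and is not specified here.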
In the related art, voice intention recognition comprises a speech recognition stage and an intention recognition stage. In the speech recognition stage, the speech to be recognized is converted into text by automatic speech recognition. In the intention recognition stage, the text is semantically understood by natural language processing to obtain keywords, and the user's voice intention is recognized from those keywords. In this text-based approach, accuracy depends on the accuracy of speech-to-text conversion; because that conversion is error-prone, the accuracy of voice intention recognition is low and the user's voice intention cannot be recognized reliably.
In view of the above findings, in the embodiment of the present application, the voice intent is recognized based on the phonemes to be recognized, instead of recognizing the voice intent based on the text, so that the accuracy of converting the voice into the text does not need to be relied on.
The technical scheme of the embodiment of the application is described below with reference to specific embodiments.
The embodiment of the application provides a voice intention recognition method that can be applied to human-machine interaction scenarios, mainly for controlling equipment according to voice intention. The method can be applied to any device that needs to be controlled by voice intention, such as an access control device, a screen-casting device, an IP camera (IPC), a server, an intelligent terminal, a robot system, or an air-conditioning device, without limitation.
The embodiment of the application involves a training process for an initial network model and a recognition process based on a target network model. In the training process, the initial network model is trained to obtain the trained target network model; in the recognition process, the voice intention is recognized based on the target network model. The two processes may run on the same device or on different devices. For example, device A may train the initial network model, obtain the target network model, and recognize voice intention with it. Alternatively, device A1 may train the initial network model to obtain the target network model, which is then deployed to device A2, and device A2 recognizes voice intention based on the target network model.
Referring to FIG. 1, for the training process of the initial network model, an embodiment of the present application provides a voice intention recognition method that implements the training of the initial network model. The method includes:
step 101, obtaining a sample voice and a sample intention corresponding to the sample voice.
For example, a large number of sample voices may be obtained from historical data, and/or sample voices input by users may be received; the acquisition manner is not limited. A sample voice is the sound made while speaking: if the user says "turn on the air conditioner," the sample voice is "turn on the air conditioner."
For each sample voice, the corresponding voice intention may be obtained; for ease of distinction, it is called the sample intention (i.e., the sample voice intention). For example, if the sample voice is "turn the air conditioner on," the sample intention may be "turn the air conditioner on."
Step 102, determining a sample phoneme set according to the sample speech.
For example, for each sample voice, a sample phoneme set may be determined from that voice, and the set may include a plurality of sample phonemes. Determining the sample phonemes is the process of identifying each phoneme from the sample voice; for ease of distinction, each identified phoneme is called a sample phoneme. The identification process is not limited, as long as the plurality of sample phonemes can be identified from the sample voice.
For example, for the sample voice "turn the air conditioner on" (ba kong tiao da kai), the sample phoneme set may include the sample phonemes "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i".
Step 103, obtaining a sample phoneme vector corresponding to the sample phoneme set.
For each sample phoneme in the sample phoneme set, a phoneme feature value corresponding to that sample phoneme is determined, and the sample phoneme vector corresponding to the set is obtained from these values; the sample phoneme vector comprises the phoneme feature value corresponding to each sample phoneme.
For example, a mapping relationship between each phoneme and its phoneme feature value is maintained in advance. Assuming there are 50 phonemes in total, mappings may be maintained between phoneme 1 and phoneme feature value 1, between phoneme 2 and phoneme feature value 2, and so on, up to phoneme 50 and phoneme feature value 50.
On this basis, in step 103, the phoneme feature value corresponding to each sample phoneme in the sample phoneme set can be obtained by querying this mapping relationship, and the phoneme feature values of all sample phonemes in the set are combined to obtain the sample phoneme vector.
For example, for the sample phoneme set "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i", the sample phoneme vector is a 15-dimensional feature vector that sequentially includes the phoneme feature values corresponding to "b", "a", "k", "o", "n", "g", "t", "i", "a", "o", "d", "a", "k", "a", and "i".
In one possible implementation, all phonemes may be ordered, and assuming that there are 50 total phonemes, the sequence numbers of the 50 phonemes are respectively 1-50, and for each phoneme, the phoneme feature value may be a 50-bit numerical value. Assuming that the sequence number of the phoneme is M, in the phoneme feature value corresponding to the phoneme, the value of the mth bit is a first value, and the values of the other bits except the mth bit are second values. For example, the phoneme feature value corresponding to the phoneme with the sequence number 1, the 1 st bit value is the first value, and the 2 nd to 50 th bit values are the second values; the phoneme characteristic value corresponding to the phoneme with the sequence number of 2, the value of the 2 nd bit is the first value, the values of the 1 st bit and the 3 rd to 50 th bits are the second values, and so on.
In summary, for the sample phoneme set "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i", the sample phoneme vector may be a 15×50-dimensional feature vector with 15 rows and 50 columns, each row being the phoneme feature value of one phoneme.
In the above embodiment, the first value and the second value may be empirically configured, and are not limited thereto, for example, the first value is 1, the second value is 0, or the first value is 0, the second value is 1, or the first value is 255, the second value is 0, or the first value is 0, and the second value is 255.
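The feature-value scheme above can be sketched as follows, assuming 50 ordered phonemes and choosing first value 1 and second value 0 (i.e., one-hot rows). The phoneme ordering below is hypothetical; the patent does not fix an inventory:

```python
import numpy as np

# Hypothetical inventory: 26 letters plus placeholders to reach 50 phonemes.
INVENTORY = [chr(ord("a") + i) for i in range(26)] + [f"ph{i}" for i in range(24)]
INDEX = {p: i for i, p in enumerate(INVENTORY)}  # phoneme -> sequence number

def phoneme_feature(phoneme, first=1, second=0):
    """50-bit feature value: bit M is `first`, all other bits `second`."""
    vec = np.full(len(INVENTORY), second)
    vec[INDEX[phoneme]] = first
    return vec

def phoneme_set_to_vector(phonemes):
    """Stack one feature value per phoneme: a len(phonemes) x 50 matrix."""
    return np.stack([phoneme_feature(p) for p in phonemes])

sample_set = list("bakongtiaodakai")  # phoneme set for "turn the air conditioner on"
sample_vector = phoneme_set_to_vector(sample_set)  # 15 rows, 50 columns
```

Swapping `first`/`second` to 0/1 or 255/0 reproduces the other configurations mentioned above without changing the structure.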
Step 104, inputting the sample phoneme vector and the sample intention corresponding to the sample phoneme vector to the initial network model, so as to train the initial network model through the sample phoneme vector and the sample intention, and obtain a trained target network model. For example, since the initial network model is trained using the sample phoneme vector and the sample intent (i.e., sample voice intent), a trained target network model is obtained, and thus the target network model may be used to record the mapping relationship of the phoneme vector and the voice intent.
Referring to the above embodiment, a large number of sample voices can be acquired. For each sample voice, the sample intention corresponding to the sample voice is acquired, and the sample phoneme vector corresponding to the sample phoneme set of the sample voice is obtained; that is, the sample phoneme vector corresponding to the sample voice and the sample intention (which participates in training as the label information of the sample phoneme vector) are obtained. Based on this, a large number of sample phoneme vectors and the sample intention (i.e., label information) corresponding to each sample phoneme vector can be input to the initial network model, so that each network parameter in the initial network model is trained using the sample phoneme vectors and the sample intentions; the training process is not limited. After the training is completed, the initial network model that has completed training is the target network model.
For example, a large number of sample phoneme vectors and sample intentions may be input to a first network layer of the initial network model, the first network layer processes the data to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the initial network model, and so on until the data is input to a last network layer of the initial network model, the last network layer processes the data to obtain output data, and the output data is recorded as a target feature vector.
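The layer-by-layer processing above can be sketched as follows. The patent does not fix a concrete architecture, so everything here is an illustrative assumption: two dense layers, a ReLU activation, a softmax output, a flattened 15×50 input, and 3 candidate intents.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, w, b):
    # One network layer: affine transform followed by a ReLU activation (an assumption).
    return np.maximum(x @ w + b, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy shapes: flattened 15x50 one-hot input -> 64 hidden units -> 3 intents.
w1, b1 = rng.normal(size=(750, 64)), np.zeros(64)  # first network layer
w2, b2 = rng.normal(size=(64, 3)), np.zeros(3)     # last network layer

x = np.zeros(750)                             # a flattened sample phoneme vector
h = dense_relu(x, w1, b1)                     # output data of the first network layer
target_feature_vector = softmax(h @ w2 + b2)  # output data of the last network layer
```

The `target_feature_vector` here plays the role of the output of the last network layer described above, against which the loss is later computed.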
After the target feature vector is obtained, it is determined whether the initial network model has converged based on the target feature vector. If the initial network model is converged, determining the converged initial network model as a trained target network model, and completing the training process of the initial network model. And if the initial network model is not converged, adjusting network parameters of the initial network model which is not converged to obtain an adjusted initial network model.
Based on the adjusted initial network model, a large number of sample phoneme vectors and sample intentions can be input to the adjusted initial network model, so that the adjusted initial network model is retrained, and detailed training processes are referred to the above embodiments and are not repeated here. And so on until the initial network model has converged and determining the converged initial network model as the trained target network model.
In the above embodiment, determining whether the initial network model has converged based on the target feature vector may include, but is not limited to, the following: a loss function is constructed in advance; the loss function is not limited herein and can be set empirically. After the target feature vector is obtained, a loss value of the loss function may be determined according to the target feature vector; for example, the target feature vector may be substituted into the loss function to obtain the loss value. After the loss value is obtained, whether the initial network model has converged is determined according to the loss value of the loss function.
For example, whether the initial network model has converged may be determined based on a single loss value: a loss value 1 is obtained based on the target feature vector; if loss value 1 is not greater than a threshold, it is determined that the initial network model has converged, and if loss value 1 is greater than the threshold, it is determined that the initial network model has not converged.
Whether the initial network model has converged can also be determined according to multiple loss values from multiple iterations. For example, in each iteration, the initial network model of the previous iteration is adjusted to obtain an adjusted initial network model, and each iteration yields a loss value. A variation amplitude curve of the loss values is then determined. If it is determined from the curve that the variation amplitude of the loss values has stabilized (i.e., the loss values of several consecutive iterations are unchanged or change only slightly) and the loss value of the last iteration is not greater than a threshold, it is determined that the initial network model of the last iteration has converged. Otherwise, it is determined that the initial network model of the last iteration has not converged, the next iteration is continued to obtain its loss value, and the variation amplitude curve of the loss values is re-determined.
In practical applications, other ways of determining whether the initial network model has converged may be used, without limitation. For example, if the iteration number reaches a preset number threshold, determining that the initial network model has converged; for another example, if the iteration duration reaches a preset duration threshold, it is determined that the initial network model has converged.
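The convergence criteria listed above (loss threshold, stable variation amplitude over recent iterations, and fallback limits on iteration count and training duration) can be combined in a sketch like the following; all constants are illustrative, not values from the patent.

```python
def has_converged(losses, *, threshold=0.01, window=5, eps=1e-4,
                  iteration=0, max_iterations=10000,
                  elapsed_s=0.0, max_duration_s=3600.0):
    """Decide convergence from the history of per-iteration loss values."""
    # Fallback criteria: preset iteration-count or duration threshold reached.
    if iteration >= max_iterations or elapsed_s >= max_duration_s:
        return True
    # Not enough iterations yet to judge the variation amplitude.
    if len(losses) < window:
        return False
    recent = losses[-window:]
    # Variation amplitude is stable: recent losses unchanged or nearly so.
    stable = max(recent) - min(recent) <= eps
    # Converged only if also the last loss is not greater than the threshold.
    return stable and recent[-1] <= threshold
```

In a training loop, this check would run once per iteration after the loss value is computed, and network parameters would be adjusted whenever it returns `False`.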
In summary, the initial network model may be trained by the sample phoneme vector and the sample intention corresponding to the sample phoneme vector, so as to obtain a trained target network model.
Referring to fig. 2, for a recognition process based on a target network model, a method for recognizing a voice intention is provided in an embodiment of the present application, where the method can implement recognition of a voice intention, and the method includes:
in step 201, a set of phonemes to be recognized is determined from the speech to be recognized.
For example, after the speech to be recognized is obtained, a set of phonemes to be recognized may be determined according to the speech to be recognized. The process of determining the phonemes to be recognized according to the speech to be recognized is a process of recognizing each phoneme from the speech to be recognized; for ease of distinction, each recognized phoneme is referred to as a phoneme to be recognized. Thus, a plurality of phonemes to be recognized may be recognized from the speech to be recognized, and the recognition process is not limited as long as a plurality of phonemes to be recognized can be recognized. For example, for the speech to be recognized "turn on the air conditioner", the set of phonemes to be recognized may include the following phonemes to be recognized: "k, a, i, k, o, n, g, t, i, a, o".
Step 202, obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized. For each phoneme to be identified in the set of phonemes to be identified, a phoneme feature value corresponding to the phoneme to be identified is determined, and a phoneme vector to be identified corresponding to the set of phonemes to be identified is obtained based on the phoneme feature value corresponding to each phoneme to be identified, wherein the phoneme vector to be identified comprises the phoneme feature value corresponding to each phoneme to be identified.
For example, a mapping relationship between each phoneme in all phonemes and the phoneme feature value is maintained in advance, and assuming that there are 50 phonemes in total, a mapping relationship between the phoneme 1 and the phoneme feature value 1, a mapping relationship between the phoneme 2 and the phoneme feature value 2, a mapping relationship between the phoneme 50 and the phoneme feature value 50 may be maintained.
In step 202, for each phoneme to be identified in the set of phonemes to be identified, a phoneme feature value corresponding to the phoneme to be identified can be obtained by querying the mapping relation, and the phoneme feature values corresponding to each phoneme to be identified in the set of phonemes to be identified are combined to obtain the phoneme vector to be identified.
In one possible implementation, all phonemes may be ordered. Assuming that there are 50 phonemes in total, the sequence numbers of the 50 phonemes are 1-50, and for each phoneme, the phoneme feature value may be a 50-bit value. Assuming that the sequence number of the phoneme is M, then in the phoneme feature value corresponding to the phoneme, the value of the M-th bit is a first value, and the values of the bits other than the M-th bit are second values. For example, in the phoneme feature value corresponding to the phoneme with sequence number 1, the value of the 1st bit is the first value, and the values of the 2nd to 50th bits are the second values; in the phoneme feature value corresponding to the phoneme with sequence number 2, the value of the 2nd bit is the first value, and the values of the 1st bit and the 3rd to 50th bits are the second values, and so on.
Step 203, inputting the phoneme vector to be recognized into the trained target network model, so that the target network model outputs the voice intention corresponding to the phoneme vector to be recognized. For example, the target network model is used to record a mapping relationship between a phoneme vector and a voice intention, and after the phoneme vector to be recognized is input to the target network model, the target network model may output the voice intention corresponding to the phoneme vector to be recognized.
For example, the phoneme vector to be identified may be input to a first network layer of the target network model, the first network layer processes the phoneme vector to be identified to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the target network model, and so on until the data is input to a last network layer of the target network model, the last network layer processes the data to obtain output data, and the output data is recorded as the target feature vector.
Because the target network model is used for recording the mapping relation between the phoneme vector and the voice intention, after the target feature vector is obtained, the mapping relation can be queried based on the target feature vector to obtain the voice intention corresponding to the target feature vector, the voice intention can be the voice intention corresponding to the phoneme vector to be recognized, and the target network model can output the voice intention corresponding to the phoneme vector to be recognized.
After the voice intention corresponding to the phoneme vector to be recognized is obtained, the device can be controlled based on the voice intention, and the control mode is not limited; for example, when the voice intention is "turn on the air conditioner", the air conditioner is turned on.
In one possible implementation, when the target network model outputs the voice intent corresponding to the phoneme vector to be recognized, a probability value (e.g., a probability value between 0 and 1, which may also be referred to as a confidence) corresponding to the voice intent may also be output, for example, the target network model may output the voice intent 1 and the probability value 1 of the voice intent 1 (e.g., 0.8), the voice intent 2 and the probability value 2 of the voice intent 2 (e.g., 0.1), the voice intent 3 and the probability value 3 of the voice intent 3 (e.g., 0.08), and so on.
Based on the above output data, the voice intention with the largest probability value may be taken as the voice intention corresponding to the phoneme vector to be recognized; for example, voice intention 1, which has the largest probability value, may be taken as the voice intention corresponding to the phoneme vector to be recognized. Alternatively, the voice intention with the largest probability value is determined first, and it is then determined whether the probability value of that voice intention (i.e., the largest probability value) is greater than a preset probability threshold: if so, that voice intention is taken as the voice intention corresponding to the phoneme vector to be recognized; otherwise, it is determined that no voice intention corresponds to the phoneme vector to be recognized.
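The selection rule above can be sketched as follows: take the intent with the largest probability value, and optionally reject it when that probability does not exceed a preset threshold. The intent names, scores, and the 0.5/0.9 thresholds are illustrative assumptions.

```python
def select_intent(scores, threshold=None):
    """scores: mapping from voice intention to probability value (confidence).

    Returns the voice intention with the largest probability value, or None
    when a threshold is given and the largest probability does not exceed it.
    """
    intent = max(scores, key=scores.get)
    if threshold is not None and scores[intent] <= threshold:
        return None  # no voice intention is taken for this input
    return intent

# Example output data of the model, mirroring the probabilities above.
scores = {"voice intent 1": 0.8, "voice intent 2": 0.1, "voice intent 3": 0.08}
```

For instance, `select_intent(scores)` picks "voice intent 1", while `select_intent(scores, threshold=0.9)` rejects it because 0.8 is not greater than the threshold.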
As can be seen from the above technical solutions, in the embodiments of the present application, the voice intention is recognized based on the phonemes to be recognized, rather than based on text, so there is no need to rely on the accuracy of converting voice into text. Since the accuracy of determining the phonemes to be recognized based on the voice to be recognized is high, the accuracy of recognizing the voice intention is high, the voice intention of the user can be accurately recognized, and the accuracy of voice intention recognition is effectively improved.
For example, the user utters the speech to be recognized "I want to see a photo with a tree", and the terminal device (such as an IPC, a smartphone, etc.) determines that the phonemes of the speech to be recognized are "w, o, x, i, a, n, g, k, a, n, y, o, u, s, h, u, m, u, d, e, z, h, a, o, p, i, a, n"; that is, the phonemes corresponding to "tree" are "s, h, u, m, u". The voice intention is determined based on these phonemes, so there is no need to resolve from the speech whether "shu mu" means "number" or "tree" (homophones in Chinese), which avoids determining the voice intention from the wrong homophone. Intention recognition is therefore more reliable, a large speech-recognition language-model algorithm library is not needed, and performance and memory usage are greatly optimized.
In another implementation manner of the embodiment of the application, the voice intention is recognized based on the pinyin to be recognized, instead of the text, so that the accuracy of converting the voice into the text is not required to be relied on.
The technical scheme of the embodiment of the application is described below with reference to specific embodiments.
The embodiment of the application provides a voice intention recognition method which can be applied to a man-machine interaction application scene and is mainly used for controlling equipment according to voice intention. The method can be applied to any device that needs to be controlled according to voice intention, such as an access control device, a screen throwing device, an IPC (internet protocol Camera), a server, an intelligent terminal, a robot system, an air conditioning device, and the like, without limitation.
The embodiment of the application can relate to a training process of an initial network model and an identification process based on a target network model. In the training process of the initial network model, the initial network model can be trained to obtain a trained target network model. In the recognition process based on the target network model, the voice intent may be recognized based on the target network model. The training process of the initial network model and the recognition process based on the target network model can be implemented in the same device or in different devices.
Referring to fig. 3, for the training process of the initial network model, a method for identifying a voice intention is provided in an embodiment of the present application, where the method may implement training of the initial network model, and the method includes:
Step 301, a sample voice and a sample intention corresponding to the sample voice are obtained.
For example, a large number of sample voices may be obtained from the history data, and/or a large number of sample voices input by the user may be received, and the obtaining manner is not limited, and the sample voices represent sounds made while speaking. For example, the sound made when speaking is "turn on air conditioner", and the sample speech is "turn on air conditioner".
For each sample voice, a voice intent corresponding to the sample voice may be obtained, and for convenience of distinction, the voice intent corresponding to the sample voice may be referred to as a sample intent (i.e., a sample voice intent). For example, if the sample voice is "turn air conditioner on", the sample intent may be "turn air conditioner on".
Step 302, determining a sample pinyin collection based on the sample speech.
For example, for each sample voice, a sample pinyin set may be determined according to the sample voice, where the sample pinyin set may include a plurality of sample pinyins. The process of determining the sample pinyins according to the sample voice is a process of identifying each pinyin from the sample voice; for ease of distinction, each identified pinyin is referred to as a sample pinyin. Thus, a plurality of sample pinyins may be identified from the sample voice, and the identification process is not limited as long as a plurality of sample pinyins can be identified.
For example, for sample speech "turn air conditioner on", the sample pinyin collection may include the following sample pinyins "ba", "kong", "tiao", "da", "kai".
Step 303, obtaining a sample pinyin vector corresponding to the sample pinyin set.
For each sample pinyin in the sample pinyin set, determining a pinyin feature value corresponding to the sample pinyin, and obtaining a sample pinyin vector corresponding to the sample pinyin set based on the pinyin feature value corresponding to each sample pinyin, where the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
For example, the mapping relationship between each pinyin in all pinyins and the pinyin feature value is maintained in advance, and if there are 400 pinyins in total, the mapping relationship between the pinyin 1 and the pinyin feature value 1, the mapping relationship between the pinyin 2 and the pinyin feature value 2, …, and the mapping relationship between the pinyin 400 and the pinyin feature value 400 can be maintained.
Based on this, in step 303, for each sample pinyin in the sample pinyin set, a pinyin feature value corresponding to the sample pinyin may be obtained by querying the mapping relationship, and the pinyin feature value corresponding to each sample pinyin in the sample pinyin set is combined to obtain the sample pinyin vector.
For example, for the sample pinyin set "ba", "kong", "tiao", "da", "kai", the sample pinyin vector may be a 5-dimensional feature vector, and the feature vector may sequentially include the pinyin feature value corresponding to "ba", the pinyin feature value corresponding to "kong", the pinyin feature value corresponding to "tiao", the pinyin feature value corresponding to "da", and the pinyin feature value corresponding to "kai".
In one possible implementation, all pinyins may be ordered. Assuming that there are 400 pinyins in total, the sequence numbers of the 400 pinyins are 1-400, and for each pinyin, the corresponding pinyin feature value may be a 400-bit value. Assuming that the sequence number of the pinyin is N, then in the pinyin feature value corresponding to the pinyin, the value of the N-th bit is a first value, and the values of the bits other than the N-th bit are second values. For example, in the pinyin feature value corresponding to the pinyin with sequence number 1, the value of the 1st bit is the first value, and the values of the 2nd to 400th bits are the second values; in the pinyin feature value corresponding to the pinyin with sequence number 2, the value of the 2nd bit is the first value, and the values of the 1st bit and the 3rd to 400th bits are the second values, and so on.
In summary, for the sample pinyin set "ba", "kong", "tiao", "da" and "kai", the sample pinyin vector may be a 5×400-dimensional feature vector, where the feature vector includes 5 rows and 400 columns, and each row represents the pinyin feature value corresponding to one pinyin, which is not described in detail herein.
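The pinyin-to-feature-value mapping maintained in advance and the assembly of the sample pinyin vector can be sketched as follows. The toy 5-pinyin table stands in for the full 400-pinyin table, with first value 1 and second value 0 assumed.

```python
# Toy pinyin table; the sequence number of a pinyin is its index + 1.
PINYINS = ["ba", "da", "kai", "kong", "tiao"]

# Mapping relationship between each pinyin and its (one-hot) pinyin feature value,
# maintained in advance as described above.
FEATURE_MAP = {
    p: [1 if j == i else 0 for j in range(len(PINYINS))]
    for i, p in enumerate(PINYINS)
}

def pinyin_vector(pinyins):
    """Query the mapping for each pinyin and combine the results row by row."""
    return [FEATURE_MAP[p] for p in pinyins]

matrix = pinyin_vector(["ba", "kong", "tiao", "da", "kai"])
# 5 rows x 5 columns here; 5 rows x 400 columns with the full pinyin table
```

The same lookup-and-combine step serves both the sample pinyin set during training and the pinyin set to be recognized during inference.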
Step 304, inputting the sample pinyin vector and the sample intent corresponding to the sample pinyin vector to the initial network model, so as to train the initial network model through the sample pinyin vector and the sample intent, and obtain a trained target network model. For example, the initial network model is trained by using the sample pinyin vector and the sample intent (i.e., the sample voice intent) to obtain a trained target network model, so that the target network model can be used to record the mapping relationship between the pinyin vector and the voice intent.
Referring to the above embodiment, a large number of sample voices can be obtained, and for each sample voice, a sample intent corresponding to the sample voice is obtained, and a sample pinyin vector corresponding to a sample pinyin set corresponding to the sample voice, that is, a sample pinyin vector corresponding to the sample voice and a sample intent (as label information of the sample pinyin vector participating in training) are obtained. Based on the above, a large number of sample pinyin vectors and sample intentions (i.e., label information) corresponding to each sample pinyin vector can be input into the initial network model, so that each network parameter in the initial network model is trained by using the sample pinyin vectors and the sample intentions, and the training process is not limited. After the initial network model training is completed, the initial network model that has completed training is the target network model.
For example, a large number of sample pinyin vectors and sample intent may be input to a first network layer of the initial network model, the first network layer processes the data to obtain output data for the first network layer, the output data for the first network layer is input to a second network layer of the initial network model, and so on until the data is input to a last network layer of the initial network model, the last network layer processes the data to obtain output data, and the output data is recorded as the target feature vector.
After the target feature vector is obtained, it is determined whether the initial network model has converged based on the target feature vector. If the initial network model is converged, determining the converged initial network model as a trained target network model, and completing the training process of the initial network model. And if the initial network model is not converged, adjusting network parameters of the initial network model which is not converged to obtain an adjusted initial network model.
Based on the adjusted initial network model, a large number of sample pinyin vectors and sample intentions can be input to the adjusted initial network model, so that the adjusted initial network model is retrained, and detailed training processes are referred to the above embodiments and are not repeated here. And so on until the initial network model has converged and determining the converged initial network model as the trained target network model.
In the above embodiment, determining whether the initial network model has converged based on the target feature vector may include, but is not limited to: the loss function is constructed in advance, and is not limited and can be empirically set. After the target feature vector is obtained, a loss value of the loss function may be determined according to the target feature vector, for example, the target feature vector may be substituted into the loss function to obtain the loss value of the loss function. After obtaining the loss value of the loss function, determining whether the initial network model is converged according to the loss value of the loss function.
In practical applications, other ways of determining whether the initial network model has converged may be used, without limitation. For example, if the iteration number reaches a preset number threshold, determining that the initial network model has converged; for another example, if the iteration duration reaches a preset duration threshold, it is determined that the initial network model has converged.
In summary, the initial network model may be trained by the sample pinyin vector and the sample intent corresponding to the sample pinyin vector, so as to obtain a trained target network model.
Referring to fig. 4, for a recognition process based on a target network model, a method for recognizing a voice intention is provided in an embodiment of the present application, where the method can implement recognition of a voice intention, and the method includes:
Step 401, determining a pinyin set to be recognized according to the voice to be recognized.
For example, after the voice to be recognized is obtained, a set of pinyin to be recognized may be determined according to the voice to be recognized, where the set of pinyin to be recognized may include a plurality of pinyin to be recognized, and the process of determining the pinyin to be recognized according to the voice to be recognized is a process of recognizing each pinyin from the voice to be recognized. For example, for the voice to be recognized to "turn on the air conditioner", the set of pinyin to be recognized may include the following pinyin to be recognized "kai", "kong", "tiao".
Step 402, obtaining a pinyin vector to be recognized corresponding to the pinyin set to be recognized. For each pinyin to be identified in the set of pinyin to be identified, a pinyin characteristic value corresponding to the pinyin to be identified is determined, and based on the pinyin characteristic value corresponding to each pinyin to be identified, a pinyin vector to be identified corresponding to the set of pinyin to be identified is obtained, wherein the pinyin vector to be identified comprises the pinyin characteristic value corresponding to each pinyin to be identified.
For example, the mapping relationship between each pinyin in all pinyins and the pinyin feature value is maintained in advance, and if there are 400 pinyins in total, the mapping relationship between the pinyin 1 and the pinyin feature value 1, the mapping relationship between the pinyin 2 and the pinyin feature value 2, …, and the mapping relationship between the pinyin 400 and the pinyin feature value 400 can be maintained.
In step 402, for each pinyin to be identified in the pinyin set to be identified, a pinyin feature value corresponding to the pinyin to be identified may be obtained by querying the mapping relationship, and the pinyin feature values corresponding to each pinyin to be identified in the pinyin set to be identified may be combined to obtain the pinyin vector to be identified.
In one possible implementation, all pinyins may be ordered. Assuming that there are 400 pinyins in total, the sequence numbers of the 400 pinyins are 1-400, and for each pinyin, the corresponding pinyin feature value may be a 400-bit value. Assuming that the sequence number of the pinyin is N, then in the pinyin feature value corresponding to the pinyin, the value of the N-th bit is a first value, and the values of the bits other than the N-th bit are second values. For example, in the pinyin feature value corresponding to the pinyin with sequence number 1, the value of the 1st bit is the first value, and the values of the 2nd to 400th bits are the second values; in the pinyin feature value corresponding to the pinyin with sequence number 2, the value of the 2nd bit is the first value, and the values of the 1st bit and the 3rd to 400th bits are the second values, and so on.
Step 403, inputting the pinyin vector to be recognized to the trained target network model, so that the target network model outputs the voice intention corresponding to the pinyin vector to be recognized. The target network model is used for recording mapping relation between pinyin vectors and voice intents, and after the pinyin vectors to be recognized are input into the target network model, the target network model can output the voice intents corresponding to the pinyin vectors to be recognized.
For example, the pinyin vector to be identified may be input to a first network layer of the target network model, the first network layer processes the pinyin vector to be identified to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the target network model, and so on until the data is input to a last network layer of the target network model, the last network layer processes the data to obtain output data, and the output data is recorded as the target feature vector.
Because the target network model is used for recording the mapping relation between the pinyin vector and the voice intention, after the target feature vector is obtained, the mapping relation can be queried based on the target feature vector to obtain the voice intention corresponding to the target feature vector, the voice intention can be the voice intention corresponding to the pinyin vector to be recognized, and the target network model can output the voice intention corresponding to the pinyin vector to be recognized.
After the voice intention corresponding to the pinyin vector to be recognized is obtained, the device can be controlled based on the voice intention, and the control mode is not limited; for example, when the voice intention is "turn on the air conditioner", the air conditioner is turned on.
In one possible implementation, when the target network model outputs the voice intent corresponding to the pinyin vector to be recognized, a probability value (e.g., a probability value between 0-1, which may also be referred to as a confidence) corresponding to the voice intent may also be output, for example, the target network model may output the voice intent 1 and the probability value 1 of the voice intent 1 (e.g., 0.8), the voice intent 2 and the probability value 2 of the voice intent 2 (e.g., 0.1), the voice intent 3 and the probability value 3 of the voice intent 3 (e.g., 0.08), and so on.
Based on the above output data, the voice intention with the largest probability value may be taken as the voice intention corresponding to the pinyin vector to be recognized; for example, voice intention 1, which has the largest probability value, may be taken as the voice intention corresponding to the pinyin vector to be recognized. Alternatively, the voice intention with the largest probability value is determined first, and it is then determined whether the probability value of that voice intention (i.e., the largest probability value) is greater than a preset probability threshold: if so, that voice intention is taken as the voice intention corresponding to the pinyin vector to be recognized; otherwise, it is determined that no voice intention corresponds to the pinyin vector to be recognized.
As can be seen from the above technical solutions, in the embodiments of the present application, the voice intention is recognized based on the pinyin to be recognized, rather than based on text, so there is no need to rely on the accuracy of converting voice into text. Since the accuracy of determining the pinyin to be recognized based on the voice to be recognized is high, the accuracy of recognizing the voice intention is high, so the voice intention of the user can be accurately recognized and the accuracy of voice intention recognition is effectively improved. For example, the user utters the speech to be recognized "I want to see a photo with a tree", and the terminal device (such as an IPC, a smartphone, etc.) determines that the pinyin of the speech to be recognized is "wo, xiang, kan, you, shu, mu, de, zhao, pian"; that is, the pinyin corresponding to "tree" is "shu, mu". The voice intention is determined based on the pinyin, so there is no need to resolve from the speech whether "shu, mu" means "number" or "tree" (homophones in Chinese), which avoids determining the voice intention from the wrong homophone. Intention recognition is therefore more reliable, a large speech-recognition language-model algorithm library is not needed, and performance and memory usage are greatly optimized.
Based on the same application concept as the above method, an embodiment of the present application provides a voice intention recognition device, as shown in fig. 5A, which is a schematic structural diagram of the device, where the device may include:
a determining module 511 for determining a set of phonemes to be recognized from the speech to be recognized;
an obtaining module 512, configured to obtain a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
a processing module 513, configured to input the phoneme vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
In one possible implementation, the phoneme set to be recognized includes a plurality of phonemes to be recognized, and the obtaining module 512 is specifically configured to, when obtaining the phoneme vector to be recognized corresponding to the phoneme set to be recognized:
Determining a phoneme characteristic value corresponding to each phoneme to be recognized; and acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized based on the phoneme characteristic value corresponding to each phoneme to be recognized, wherein the phoneme vector to be recognized comprises the phoneme characteristic value corresponding to each phoneme to be recognized.
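The mapping from phonemes to characteristic values and the assembly of the phoneme vector can be sketched as below. The phoneme inventory, the use of indices as characteristic values, the padding value, and the fixed vector length are all assumptions for illustration; the patent does not prescribe specific values:

```python
# Hypothetical phoneme inventory; a real system would cover the full
# phoneme set of the target language.
PHONEME_INVENTORY = ["sil", "w", "o", "x", "i", "a", "ng", "k", "sh", "u"]
PHONEME_TO_VALUE = {p: i for i, p in enumerate(PHONEME_INVENTORY)}

def phoneme_vector(phonemes, length=8, pad=0):
    """Map each phoneme to its characteristic value, then pad/truncate to a
    fixed length so the vector can be fed to the network model."""
    values = [PHONEME_TO_VALUE[p] for p in phonemes]
    return values[:length] + [pad] * max(0, length - len(values))

print(phoneme_vector(["w", "o", "x", "i", "a", "ng"]))
# [1, 2, 3, 4, 5, 6, 0, 0]
```

The same construction applies verbatim to the pinyin variant, with a pinyin inventory in place of the phoneme inventory.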
In one possible implementation, the determining module 511 is further configured to: acquiring sample voice and sample intention corresponding to the sample voice; determining a sample phoneme set according to the sample speech; the acquisition module 512 is further configured to: acquiring a sample phoneme vector corresponding to the sample phoneme set; the processing module 513 is further configured to: and inputting the sample phoneme vector and the sample intention to an initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain the target network model.
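The training flow above (sample phoneme vectors plus sample intents fed to an initial model, yielding the target network model) can be sketched with a deliberately simple stand-in. A real implementation would train a neural network; this toy "model" merely records the vector-to-intent mapping and predicts by nearest prototype, which is enough to show the data flow. All names and sample values are assumptions:

```python
def train(initial_model, samples):
    """samples: list of (sample_phoneme_vector, sample_intent) pairs.
    Returns the trained 'target network model' (here, a prototype store)."""
    for vector, intent in samples:
        initial_model.setdefault(intent, []).append(vector)
    return initial_model

def predict(model, vector):
    """Predict the voice intention whose stored vector is closest."""
    def dist(v, w):
        return sum((a - b) ** 2 for a, b in zip(v, w))
    candidates = ((intent, dist(vector, v))
                  for intent, vectors in model.items() for v in vectors)
    return min(candidates, key=lambda t: t[1])[0]

model = train({}, [([1, 2, 3], "show_photos"), ([7, 8, 9], "play_music")])
print(predict(model, [1, 2, 4]))  # show_photos
```

Swapping the prototype store for a trainable network (e.g. an embedding layer followed by a classifier) keeps the same interface: vectors and intents in, target model out.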
In one possible implementation, the sample phoneme set includes a plurality of sample phonemes, and the obtaining module 512 is specifically configured to, when obtaining a sample phoneme vector corresponding to the sample phoneme set:
Determining a phoneme characteristic value corresponding to each sample phoneme aiming at each sample phoneme;
And acquiring a sample phoneme vector corresponding to the sample phoneme set based on the phoneme characteristic value corresponding to each sample phoneme, wherein the sample phoneme vector comprises the phoneme characteristic value corresponding to each sample phoneme.
Based on the same application concept as the above method, an embodiment of the present application provides a voice intention recognition device, as shown in fig. 5B, which is a schematic structural diagram of the device, where the device may include:
a determining module 521, configured to determine a pinyin set to be recognized according to the speech to be recognized;
an obtaining module 522, configured to obtain a pinyin vector to be identified corresponding to the pinyin set to be identified;
A processing module 523, configured to input the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intent corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
In one possible implementation, the pinyin set to be identified includes a plurality of pinyins to be identified, and the obtaining module 522 is specifically configured to, when obtaining the pinyin vector to be identified corresponding to the pinyin set to be identified: determine a pinyin characteristic value corresponding to each pinyin to be identified; and obtain, based on the pinyin characteristic value corresponding to each pinyin to be identified, a pinyin vector to be identified corresponding to the pinyin set to be identified, wherein the pinyin vector to be identified includes the pinyin characteristic value corresponding to each pinyin to be identified.
In one possible implementation, the determining module 521 is further configured to: acquiring sample voice and sample intention corresponding to the sample voice; determining a sample pinyin set according to the sample speech; the acquisition module 522 is further configured to: acquiring a sample pinyin vector corresponding to the sample pinyin set; the processing module 523 is further configured to: and inputting the sample pinyin vector and the sample intention to an initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain the target network model.
In one possible implementation, the sample pinyin set includes a plurality of sample pinyins, and the acquiring module 522 is specifically configured to, when acquiring the sample pinyin vector corresponding to the sample pinyin set:
determining a pinyin characteristic value corresponding to each sample pinyin;
And acquiring a sample pinyin vector corresponding to the sample pinyin set based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector comprises the pinyin characteristic value corresponding to each sample pinyin.
Based on the same application concept as the above method, an embodiment of the present application provides a voice intention recognition apparatus, as shown in fig. 6, including: a processor 61 and a machine-readable storage medium 62, the machine-readable storage medium 62 storing machine-executable instructions executable by the processor 61; the processor 61 is configured to execute machine executable instructions to implement the following steps:
Determining a phoneme set to be recognized according to the voice to be recognized;
Obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention;
or determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be recognized corresponding to the pinyin set to be recognized;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
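The processor steps above form a three-stage pipeline: speech to pinyin (or phoneme) set, set to vector, vector to voice intention. A minimal sketch, assuming hypothetical helper callables for each stage (the patent does not fix their implementations):

```python
def recognize_intent(speech, to_pinyin_set, to_pinyin_vector, target_model):
    """Run the three steps executed by the processor, in order."""
    pinyin_set = to_pinyin_set(speech)            # step 1: speech -> pinyin set
    pinyin_vector = to_pinyin_vector(pinyin_set)  # step 2: set -> vector
    return target_model(pinyin_vector)            # step 3: vector -> intent

# Toy stand-ins, purely to exercise the pipeline shape:
intent = recognize_intent(
    "wo xiang kan zhao pian",
    to_pinyin_set=str.split,                      # pretend ASR output
    to_pinyin_vector=lambda s: [len(p) for p in s],
    target_model=lambda v: "show_photos" if sum(v) > 10 else "unknown",
)
print(intent)  # show_photos
```

The phoneme variant of the pipeline is identical in shape; only the stage implementations differ.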
Based on the same application concept as the above method, the embodiment of the present application further provides a machine-readable storage medium, where a number of computer instructions are stored, where the computer instructions can implement the voice intent recognition method disclosed in the above example of the present application when the computer instructions are executed by a processor.
Wherein the machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions, data, or the like. For example, a machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state disk, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Moreover, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (9)

1. A method of speech intent recognition, the method comprising:
Determining a phoneme set to be recognized according to the voice to be recognized; the set of phonemes to be identified includes a plurality of phonemes to be identified; the phonemes to be recognized are determined according to the voice to be recognized;
Determining a phoneme characteristic value corresponding to each phoneme to be recognized; based on the phoneme characteristic value corresponding to each phoneme to be identified, obtaining a phoneme vector to be identified corresponding to the phoneme set to be identified, wherein the phoneme vector to be identified comprises a phoneme characteristic value corresponding to each phoneme to be identified;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
2. The method of claim 1, wherein,
Before the phoneme vector to be recognized is input to the trained target network model so that the target network model outputs the voice intention corresponding to the phoneme vector to be recognized, the method further comprises:
Acquiring sample voice and sample intention corresponding to the sample voice;
Determining a sample phoneme set according to the sample speech;
acquiring a sample phoneme vector corresponding to the sample phoneme set;
and inputting the sample phoneme vector and the sample intention to an initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain the target network model.
3. The method of claim 2, wherein the sample phoneme set includes a plurality of sample phonemes, and the obtaining of a sample phoneme vector corresponding to the sample phoneme set comprises:
Determining a phoneme characteristic value corresponding to each sample phoneme aiming at each sample phoneme;
And acquiring a sample phoneme vector corresponding to the sample phoneme set based on the phoneme characteristic value corresponding to each sample phoneme, wherein the sample phoneme vector comprises the phoneme characteristic value corresponding to each sample phoneme.
4. A method of speech intent recognition, the method comprising:
Determining a pinyin set to be recognized according to the voice to be recognized; the pinyin set to be identified comprises a plurality of pinyin to be identified; the pinyin to be identified is determined according to the speech to be identified;
Determining a pinyin characteristic value corresponding to each pinyin to be identified; based on the pinyin characteristic value corresponding to each pinyin to be identified, acquiring a pinyin vector to be identified corresponding to the pinyin to be identified set, wherein the pinyin vector to be identified comprises the pinyin characteristic value corresponding to each pinyin to be identified;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
5. The method of claim 4, wherein,
Before the pinyin vector to be recognized is input to the trained target network model so that the target network model outputs the voice intention corresponding to the pinyin vector to be recognized, the method further comprises:
Acquiring sample voice and sample intention corresponding to the sample voice;
determining a sample pinyin set according to the sample speech;
Acquiring a sample pinyin vector corresponding to the sample pinyin set;
And inputting the sample pinyin vector and the sample intention to an initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain the target network model.
6. The method of claim 5, wherein the sample pinyin set includes a plurality of sample pinyins, and the obtaining of a sample pinyin vector corresponding to the sample pinyin set comprises:
determining a pinyin characteristic value corresponding to each sample pinyin;
And acquiring a sample pinyin vector corresponding to the sample pinyin set based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector comprises the pinyin characteristic value corresponding to each sample pinyin.
7. A voice intent recognition device, the device comprising:
The determining module is used for determining a phoneme set to be recognized according to the voice to be recognized; the phoneme set to be recognized comprises a plurality of phonemes to be recognized; the phonemes to be recognized are determined according to the voice to be recognized;
The acquisition module is used for determining a phoneme characteristic value corresponding to each phoneme to be identified; based on the phoneme characteristic value corresponding to each phoneme to be identified, obtaining a phoneme vector to be identified corresponding to the phoneme set to be identified, wherein the phoneme vector to be identified comprises a phoneme characteristic value corresponding to each phoneme to be identified;
The processing module is used for inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
8. A voice intent recognition device, the device comprising:
The determining module is used for determining a pinyin set to be recognized according to the voice to be recognized; the pinyin set to be identified comprises a plurality of pinyin to be identified; the pinyin to be identified is determined according to the speech to be identified;
The acquisition module is used for determining a pinyin characteristic value corresponding to each pinyin to be identified; based on the pinyin characteristic value corresponding to each pinyin to be identified, acquiring a pinyin vector to be identified corresponding to the pinyin to be identified set, wherein the pinyin vector to be identified comprises the pinyin characteristic value corresponding to each pinyin to be identified;
The processing module is used for inputting the pinyin vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
9. A voice intent recognition device, comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine-executable instructions to perform the steps of:
Determining a phoneme set to be recognized according to the voice to be recognized; the set of phonemes to be identified includes a plurality of phonemes to be identified; the phonemes to be recognized are determined according to the voice to be recognized;
Determining a phoneme characteristic value corresponding to each phoneme to be recognized; based on the phoneme characteristic value corresponding to each phoneme to be identified, obtaining a phoneme vector to be identified corresponding to the phoneme set to be identified, wherein the phoneme vector to be identified comprises a phoneme characteristic value corresponding to each phoneme to be identified;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention;
or determining a pinyin set to be recognized according to the voice to be recognized; the pinyin set to be identified comprises a plurality of pinyin to be identified; the pinyin to be identified is determined according to the speech to be identified;
Determining a pinyin characteristic value corresponding to each pinyin to be identified; based on the pinyin characteristic value corresponding to each pinyin to be identified, acquiring a pinyin vector to be identified corresponding to the pinyin to be identified set, wherein the pinyin vector to be identified comprises the pinyin characteristic value corresponding to each pinyin to be identified;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
CN202010785605.1A 2020-08-06 2020-08-06 Voice intention recognition method, device and equipment Active CN111986653B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010785605.1A CN111986653B (en) 2020-08-06 2020-08-06 Voice intention recognition method, device and equipment
PCT/CN2021/110134 WO2022028378A1 (en) 2020-08-06 2021-08-02 Voice intention recognition method, apparatus and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010785605.1A CN111986653B (en) 2020-08-06 2020-08-06 Voice intention recognition method, device and equipment

Publications (2)

Publication Number Publication Date
CN111986653A CN111986653A (en) 2020-11-24
CN111986653B true CN111986653B (en) 2024-06-25

Family

ID=73444526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010785605.1A Active CN111986653B (en) 2020-08-06 2020-08-06 Voice intention recognition method, device and equipment

Country Status (2)

Country Link
CN (1) CN111986653B (en)
WO (1) WO2022028378A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986653B (en) * 2020-08-06 2024-06-25 杭州海康威视数字技术股份有限公司 Voice intention recognition method, device and equipment
CN113836945B (en) * 2021-09-23 2024-04-16 平安科技(深圳)有限公司 Intention recognition method, device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081219A (en) * 2020-01-19 2020-04-28 南京硅基智能科技有限公司 End-to-end voice intention recognition method

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08227410A (en) * 1994-12-22 1996-09-03 Just Syst Corp Learning method of neural network, neural network, and speech recognition device utilizing neural network
US6408271B1 (en) * 1999-09-24 2002-06-18 Nortel Networks Limited Method and apparatus for generating phrasal transcriptions
CN107357875B (en) * 2017-07-04 2021-09-10 北京奇艺世纪科技有限公司 Voice search method and device and electronic equipment
CN109754789B (en) * 2017-11-07 2021-06-08 北京国双科技有限公司 Method and device for recognizing voice phonemes
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN110767214A (en) * 2018-07-27 2020-02-07 杭州海康威视数字技术股份有限公司 Speech recognition method and device and speech recognition system
CN110808050B (en) * 2018-08-03 2024-04-30 蔚来(安徽)控股有限公司 Speech recognition method and intelligent device
CN110931000B (en) * 2018-09-20 2022-08-02 杭州海康威视数字技术股份有限公司 Method and device for speech recognition
CN109829153A (en) * 2019-01-04 2019-05-31 平安科技(深圳)有限公司 Intension recognizing method, device, equipment and medium based on convolutional neural networks
KR20200091738A (en) * 2019-01-23 2020-07-31 주식회사 케이티 Server, method and computer program for detecting keyword
CN110223680B (en) * 2019-05-21 2021-06-29 腾讯科技(深圳)有限公司 Voice processing method, voice recognition device, voice recognition system and electronic equipment
CN110349567B (en) * 2019-08-12 2022-09-13 腾讯科技(深圳)有限公司 Speech signal recognition method and device, storage medium and electronic device
KR102321798B1 (en) * 2019-08-15 2021-11-05 엘지전자 주식회사 Deeplearing method for voice recognition model and voice recognition device based on artifical neural network
CN110610707B (en) * 2019-09-20 2022-04-22 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium
CN110674314B (en) * 2019-09-27 2022-06-28 北京百度网讯科技有限公司 Sentence recognition method and device
CN111243603B (en) * 2020-01-09 2022-12-06 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium
CN111274797A (en) * 2020-01-13 2020-06-12 平安国际智慧城市科技股份有限公司 Intention recognition method, device and equipment for terminal and storage medium
CN111986653B (en) * 2020-08-06 2024-06-25 杭州海康威视数字技术股份有限公司 Voice intention recognition method, device and equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081219A (en) * 2020-01-19 2020-04-28 南京硅基智能科技有限公司 End-to-end voice intention recognition method

Also Published As

Publication number Publication date
CN111986653A (en) 2020-11-24
WO2022028378A1 (en) 2022-02-10

Similar Documents

Publication Publication Date Title
US10032463B1 (en) Speech processing with learned representation of user interaction history
CN107134279B (en) Voice awakening method, device, terminal and storage medium
Zeng et al. Effective combination of DenseNet and BiLSTM for keyword spotting
Sak et al. Fast and accurate recurrent neural network acoustic models for speech recognition
US9842585B2 (en) Multilingual deep neural network
CN107358951A (en) A kind of voice awakening method, device and electronic equipment
US9653093B1 (en) Generative modeling of speech using neural networks
WO2017099936A1 (en) System and methods for adapting neural network acoustic models
EP3948852A1 (en) Contextual biasing for speech recognition
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
CN109754789B (en) Method and device for recognizing voice phonemes
CN109523014B (en) News comment automatic generation method and system based on generative confrontation network model
CN111986653B (en) Voice intention recognition method, device and equipment
CN111179917B (en) Speech recognition model training method, system, mobile terminal and storage medium
US20170228643A1 (en) Augmenting Neural Networks With Hierarchical External Memory
CN114830139A (en) Training models using model-provided candidate actions
CN110069611B (en) Topic-enhanced chat robot reply generation method and device
US20230096805A1 (en) Contrastive Siamese Network for Semi-supervised Speech Recognition
WO2021135457A1 (en) Recurrent neural network-based emotion recognition method, apparatus, and storage medium
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
CN109637527A (en) The semantic analytic method and system of conversation sentence
US20220399013A1 (en) Response method, terminal, and storage medium
CN113590078A (en) Virtual image synthesis method and device, computing equipment and storage medium
KR102120751B1 (en) Method and computer readable recording medium for providing answers based on hybrid hierarchical conversation flow model with conversation management model using machine learning
Chen et al. Sequence-to-sequence modelling for categorical speech emotion recognition using recurrent neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant