CN115273849B - Intention identification method and device for audio data - Google Patents


Info

Publication number
CN115273849B
Authority
CN
China
Prior art keywords
audio data
intention
vector
semantic
target voice
Prior art date
Legal status
Active
Application number
CN202211178066.0A
Other languages
Chinese (zh)
Other versions
CN115273849A (en)
Inventor
蒋宇
徐敏
李鑫豪
任纪良
Current Assignee
Beijing Baolande Software Co ltd
Original Assignee
Beijing Baolande Software Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baolande Software Co ltd
Priority claimed from CN202211178066.0A
Publication of CN115273849A
Application granted
Publication of CN115273849B
Legal status: Active

Classifications

    • G10L 15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F 40/30 — Handling natural language data; semantic analysis
    • G10L 15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 2015/0638 — Training; interactive procedures
    • G10L 2015/223 — Execution procedure of a spoken command


Abstract

The invention provides an intention recognition method and device for audio data. The method comprises: acquiring audio data containing a target voice; and inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice. The joint model is trained on sample audio data and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer. The semantic slot filling layer obtains semantic word vectors from the audio data containing the target voice; the intention prediction layer obtains a semantic prediction vector from the audio data containing the target voice; and the instruction intention acquisition layer obtains a joint objective function from the semantic word vectors and the semantic prediction vector, and obtains the instruction intention of the target voice based on the joint objective function. Through the joint model, the invention deeply understands the user intention, accurately and efficiently recognizes the multiple intentions of the target voice, and obtains the instruction intention of the target voice.

Description

Intention identification method and device for audio data
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an intention identification method and device for audio data.
Background
In recent years, with the development of related technologies such as natural language processing and knowledge graphs, question answering systems have expanded into many fields. Completing operation and maintenance actions in a question-and-answer manner, through man-machine interaction with an operation and maintenance robot, can greatly improve the working efficiency of operation and maintenance personnel, and intention recognition (Intent Detection) is the key to building a man-machine dialogue system.
Existing operation and maintenance robots are mostly question answering systems with a single independent function, while a user may have different intentions on different occasions, so a man-machine dialogue system for operation and maintenance can involve multiple domains, including task-oriented vertical domains, chit-chat and the like. Intention texts in task-oriented vertical domains have clear topics and are easy to retrieve, such as querying memory utilization or CPU utilization. Chit-chat intention texts generally have unclear topics, broad semantics and short sentences, and focus on open-domain communication with humans. In a dialogue system, only when the topic domain of the user is clear can the specific requirements of the user be correctly analyzed; otherwise, subsequent intentions may be recognized incorrectly.
The prior art recognizes single intentions based on rule templates. Rule-template-based intention recognition generally requires manually constructing rule templates and classifying user intention texts according to category information. For consumption intention recognition, the prior art obtains intention templates by rule- and graph-based methods and achieves a good classification effect in a single domain. However, different expression modes within the same domain cause the number of rule templates to grow, consuming a large amount of manpower and material resources. Therefore, although rule-template matching can ensure recognition accuracy without a large amount of training data, it cannot solve the high cost of reconstructing templates when the intention text changes. That is, the prior art has the following defect in intention recognition: the rule-template-based method suited to single-intent recognition is not suited to multi-intent recognition, and the existing intention recognition technology urgently needs a method suitable for multi-intent recognition.
Disclosure of Invention
The invention provides an intention recognition method and device for audio data, which solve the problem that intention recognition methods in the prior art are not suitable for multi-intent recognition: a joint model deeply understands the user intention and accurately and efficiently recognizes the multiple intentions of a target voice.
The invention provides an intention identification method for audio data, which comprises the following steps:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
According to the method for recognizing the intention of the audio data provided by the invention, the semantic word vector is obtained according to the audio data containing the target voice, and the method comprises the following steps:
converting the audio data containing the target voice into an initial vector;
and mapping the initial vector into a semantic word vector.
According to the intention recognition method for audio data provided by the invention, the mapping of the initial vector into a semantic word vector comprises the following steps:
solving a hidden layer vector and a slot context vector based on the initial vector;
and solving the semantic word vector through a softmax function based on the hidden layer vector and the slot context vector.
According to the intention recognition method for audio data provided by the present invention, the slot context vector includes an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in the context.
According to the intention recognition method for audio data provided by the invention, the obtaining of semantic prediction vectors according to the audio data containing target voice comprises the following steps:
acquiring an intention context vector according to the audio data containing the target voice;
based on the intention context vector, a semantic prediction vector is obtained.
According to the intention identifying method for audio data provided by the invention, the intention identifying method further comprises the following steps:
obtaining a weighted feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
According to the intention recognition method for audio data provided by the present invention, the method further comprises:
acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and training the prior model based on the audio data of the target user to obtain a combined model.
The present invention also provides an intention recognition apparatus for audio data, including:
the audio data acquisition module is used for acquiring audio data containing target voice;
the audio data processing module is used for inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
and the instruction intention acquisition layer is used for acquiring a joint objective function according to the semantic word vector and the semantic prediction vector and acquiring the instruction intention of the target voice based on the joint objective function.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the intention recognition method for audio data as described above.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the intention recognition method for audio data as described above.
The invention provides a method and a device for recognizing intentions in audio data: audio data containing a target voice is acquired and input into a pre-trained joint model to obtain the instruction intention of the target voice. The joint model is trained on sample audio data and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer. The semantic slot filling layer obtains semantic word vectors from the audio data containing the target voice; the intention prediction layer obtains a semantic prediction vector from the audio data containing the target voice; and the instruction intention acquisition layer obtains a joint objective function from the semantic word vectors and the semantic prediction vector, and obtains the instruction intention of the target voice based on the joint objective function. Through the joint model, the invention deeply understands the user intention, accurately and efficiently recognizes the multiple intentions of the target voice, obtains the instruction intention of the target voice, and achieves a remarkable advance.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart illustrating an intention recognition method for audio data according to the present invention;
FIG. 2 is a schematic diagram of an intention recognition apparatus for audio data according to the present invention;
fig. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The intention recognition method, apparatus, electronic device and storage medium for audio data according to the present invention are specifically described below with reference to fig. 1 to 3 by way of example.
Fig. 1 is a schematic flowchart of an intention identification method for audio data according to the present invention, and as shown in fig. 1, the intention identification method for audio data according to the present invention includes:
step S110, obtaining audio data containing target voice;
in this embodiment, the target voice is a spoken voice including instruction information uttered by the user. That is, all sounds within a certain collection distance range, including spoken sounds including instruction information issued by a user, are regarded as audio data to be collected. The invention adopts the strong hypothesis that the definition of the spoken voice containing the instruction information sent by the user far exceeds the background sound of the position of the user, and finally obtains the instruction intention corresponding to the target voice based on the strong hypothesis that the target voice in the collected audio data can be read only and clearly.
In the present embodiment, it is assumed that the user utters a spoken voice including instruction information within the collection range, and audio data including the target voice is acquired.
Step S120, inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
In this embodiment, the joint model is a network model with deep learning capability and adopts a BiLSTM (Bidirectional Long Short-Term Memory) structure, which better captures bidirectional semantic dependencies; the BiLSTM model is formed by combining a forward LSTM and a backward LSTM.
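As a hedged illustration of this structure, a BiLSTM forward pass can be sketched in NumPy; the gate layout, dimensions and parameter names below are illustrative assumptions, not the patent's concrete implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; the four gates are stacked as [i, f, o, g] in W, U, b."""
    d = h.shape[0]
    z = W @ x + U @ h + b                                  # (4d,) pre-activations
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c + i * g                                      # new cell state
    h = o * np.tanh(c)                                     # new hidden state
    return h, c

def bilstm(X, fwd, bwd):
    """Run a forward and a backward LSTM over X (T, d_in) and concatenate
    their hidden states position by position, giving (T, 2d)."""
    d = fwd[1].shape[1]                                    # hidden size from U (4d, d)
    hs_f, hs_b = [], []
    h, c = np.zeros(d), np.zeros(d)
    for x in X:                                            # forward LSTM
        h, c = lstm_step(x, h, c, *fwd)
        hs_f.append(h)
    h, c = np.zeros(d), np.zeros(d)
    for x in X[::-1]:                                      # backward LSTM
        h, c = lstm_step(x, h, c, *bwd)
        hs_b.append(h)
    return np.stack([np.concatenate([f, b])
                     for f, b in zip(hs_f, hs_b[::-1])])
```

The concatenated hidden states serve as the per-position vectors $h_i$ used by the layers described below.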
In this embodiment, an input vector is generated from the audio data containing the target voice. The semantic slot filling layer maps the input vector to semantic word vectors $y^S$, and the intention prediction layer maps the input vector to a semantic prediction vector $y^I$. Finally, the instruction intention acquisition layer obtains a joint objective function from the semantic word vectors $y^S$ and the semantic prediction vector $y^I$, and obtains the instruction intention of the target voice from the joint objective function.
The intention recognition method for audio data provided by the invention acquires audio data containing a target voice and inputs it into a pre-trained joint model to obtain the instruction intention of the target voice; the joint model deeply understands the user intention and accurately and efficiently recognizes the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the invention, obtaining the semantic word vector from the audio data containing the target voice comprises the following steps:
converting the audio data containing the target voice into an initial vector;
and mapping the initial vector into a semantic word vector.
In the present embodiment, the audio data containing the target voice is converted into an initial vector $x = (x_1, \dots, x_T)$. The initial vector $x$ is the input vector of the joint model and is essentially a token sequence corresponding one-to-one with the single words in the target voice. The semantic slot filling layer then maps the initial vector $x$ to the semantic word vectors $y^S$. Substituting the initial vector into the joint objective function formula gives:

$$p(y^S, y^I \mid x) = p(y^I \mid x)\prod_{i=1}^{T} p(y_i^S \mid x)$$
According to the intention recognition method for audio data provided by the invention, the audio data containing the target voice is converted into an initial vector, which is then mapped to the semantic word vector; this clarifies the generation path of the semantic word vector and strongly supports the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the invention, the mapping of the initial vector into a semantic word vector comprises the following steps:
solving a hidden layer vector and a slot context vector based on the initial vector;
and solving the semantic character vector through a softmax function based on the hidden layer vector and the slot context vector.
In this embodiment, the softmax function, also called the normalized exponential function, is a single-layer neural network. The hidden layer vector $h_i$ is the vector corresponding to the $i$-th single word in the target voice, matching the $i$-th element of the initial vector $x$, and represents the meaning of the single word itself. The slot context vector $c_i^S$ is the context vector corresponding to the $i$-th single word, also matching the $i$-th element of $x$; it evaluates, in combination with the context, which of the multiple specific meanings of the single word is its real meaning. Here $i$ ranges over $1, \dots, T$.

In this embodiment, a BiLSTM structure is used to obtain the hidden layer vector $h_i$ and the slot context vector $c_i^S$ from the input initial vector $x$. From $h_i$ and $c_i^S$, the slot filling label (slot label) corresponding to the $i$-th single word in the sequence, i.e., the semantic word vector $y_i^S$, is obtained through the softmax function, formulated as:

$$y_i^S = \mathrm{softmax}\big(W_{hy}^{S}(h_i + c_i^{S})\big)$$

where $W_{hy}^{S}$ is a weight matrix, $h_i$ is the hidden layer vector, and $c_i^S$ is the slot context vector.
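A minimal NumPy sketch of this per-position slot softmax (names and shapes are assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_label_distributions(H, C_slot, W_hy):
    """y_i^S = softmax(W_hy^S (h_i + c_i^S)), computed for all positions at once.

    H, C_slot: (T, d) hidden-layer and slot-context vectors
    W_hy:      (n_slot_labels, d) weight matrix
    Returns (T, n_slot_labels); each row is a distribution over slot labels.
    """
    return softmax((H + C_slot) @ W_hy.T)
```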
The intention recognition method for audio data provided by the invention obtains the hidden layer vector and the slot context vector from the initial vector, and then obtains the semantic word vector from them through the softmax function; this clarifies the specific generation path of the semantic word vector and strongly supports the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the present invention, the slot context vector includes an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in the context.
In this embodiment, the slot context vector $c_i^S$ includes an attention score parameter $\alpha_{i,j}^S$, which represents the probability that each of the multiple specific meanings corresponding to a single word in the audio data conforms to the actual meaning of that word in context, formulated as:

$$c_i^{S} = \sum_{j=1}^{T}\alpha_{i,j}^{S}\,h_j, \qquad \alpha_{i,j}^{S} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T}\exp(e_{i,k})}, \qquad e_{i,k} = \sigma\big(W_{he}^{S}\,h_k\big)$$

where $e_{i,k}$ represents the relationship between the hidden state $h_k$ and the current input vector $h_i$, $\sigma$ is the activation function, $W_{he}^{S}$ is a weight matrix, and $k$ and $j$ index the multiple specific meanings of the single word; $e_{i,k}$ uses a convolution implementation, and the attention context uses a linear mapping implementation.
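A hedged NumPy sketch of this attention mechanism; the additive scoring form $v \cdot \tanh(W_q h_i + W_k h_j)$ used below is an assumption standing in for the convolution / linear-mapping implementation described above:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_attention(H, W_q, W_k, v):
    """Attention scores alpha_{i,j} = exp(e_{i,j}) / sum_k exp(e_{i,k}) and
    slot context vectors c_i^S = sum_j alpha_{i,j} h_j.

    e_{i,j} scores hidden state h_j against the current position h_i with an
    additive form v . tanh(W_q h_i + W_k h_j)  (an illustrative assumption).
    H: (T, d); W_q, W_k: (d_a, d); v: (d_a,)
    """
    q = H @ W_q.T                                      # (T, d_a)
    k = H @ W_k.T                                      # (T, d_a)
    e = np.tanh(q[:, None, :] + k[None, :, :]) @ v     # (T, T) pairwise scores
    alpha = softmax(e)                                 # normalize over j for each i
    return alpha, alpha @ H                            # scores and context vectors
```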
The intention recognition method for audio data provided by the invention further clarifies the process of calculating the attention score parameter in the slot context vector, strongly supporting the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the invention, the obtaining of semantic prediction vectors according to the audio data containing target voice comprises the following steps:
acquiring an intention context vector according to the audio data containing the target voice;
based on the intention context vector, a semantic prediction vector is obtained.
In the present embodiment, the audio data containing the target voice is converted into the initial vector $x$; the initial vector $x$ is the input vector of the joint model and is essentially a token sequence corresponding one-to-one with the single words in the target voice. The intention prediction layer then maps the initial vector $x$ to an intention context vector $c^I$, and the semantic prediction vector $y^I$ is generated from $c^I$. The intention context vector $c^I$ is computed in the same way as the slot context vector $c_i^S$. The predicted hidden layer vector $h_T$ denotes the vector obtained using only the last hidden state of the BiLSTM when predicting the intention. The semantic prediction vector $y^I$ is formulated as:

$$y^{I} = \mathrm{softmax}\big(W_{hy}^{I}(h_T + c^{I})\big)$$

where $W_{hy}^{I}$ is a weight matrix, $c^I$ is the intention context vector, and $h_T$ is the predicted hidden layer vector.
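A minimal NumPy sketch of this utterance-level intent softmax (names and shapes are assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def intent_distribution(h_last, c_intent, W_hy_I):
    """y^I = softmax(W_hy^I (h_T + c^I)): one distribution per utterance.

    h_last:   (d,)  last BiLSTM hidden state
    c_intent: (d,)  intention context vector
    W_hy_I:   (n_intents, d) weight matrix
    """
    return softmax(W_hy_I @ (h_last + c_intent))

def predicted_intent(h_last, c_intent, W_hy_I):
    """Index of the most probable intent."""
    return int(np.argmax(intent_distribution(h_last, c_intent, W_hy_I)))
```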
According to the intention recognition method for audio data provided by the invention, the intention context vector is obtained from the audio data containing the target voice, and the semantic prediction vector is then obtained from the intention context vector; this clarifies the specific acquisition path of the semantic prediction vector and strongly supports the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention identifying method for audio data provided by the invention, the intention identifying method further comprises the following steps:
obtaining a weighted feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
In the present embodiment, the weighted feature parameter $g$ can be viewed as a weighted feature joining the slot context vector $c_i^S$ and the intention context vector $c^I$; its main purpose is to use the intention context vector $c^I$ to improve the performance of the semantic slot filling layer (slot filling). The weighted feature parameter $g$ is formulated as:

$$g = \sum v \cdot \tanh\big(c_i^{S} + W \cdot c^{I}\big)$$

where $v$ is a trainable vector, $W$ is a trainable matrix, $\tanh$ is the hyperbolic tangent function, $c_i^S$ is the slot context vector, and $c^I$ is the intention context vector.

Accordingly, the weighted feature parameter $g$ is added to the semantic word vector computation, formulated as:

$$y_i^{S} = \mathrm{softmax}\big(W_{hy}^{S}(h_i + c_i^{S}\cdot g)\big)$$
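A minimal NumPy sketch of this gating step (names and shapes are assumptions): the gate is computed once per position and scales the slot context vector before the slot softmax.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_gate(C_slot, c_intent, W, v):
    """g_i = sum(v * tanh(c_i^S + W c^I)): one scalar gate per position.

    C_slot: (T, d); c_intent: (d,); W: (d, d); v: (d,)
    """
    return np.tanh(C_slot + (W @ c_intent)[None, :]) @ v   # (T,)

def gated_slot_labels(H, C_slot, c_intent, W, v, W_hy):
    """y_i^S = softmax(W_hy^S (h_i + c_i^S * g_i)) with the gate applied."""
    g = slot_gate(C_slot, c_intent, W, v)
    return softmax((H + C_slot * g[:, None]) @ W_hy.T)
```

A larger gate lets the intention context contribute more strongly to each slot decision, which is the stated purpose of the weighted feature parameter.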
According to the intention recognition method for audio data provided by the invention, the specific calculation path of the weighted feature parameter used to improve the performance of the semantic slot filling layer is further disclosed, strongly supporting the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the present invention, the method further comprises:
acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and training the prior model based on the audio data of the target user to obtain a combined model.
In the present embodiment, the Gaussian mixture-universal background model GMM-UBM is an improved model of the Gaussian mixture model GMM. The universal background model (UBM) is a model proposed by D. A. Reynolds' group.
In this embodiment, a large amount of non-target user audio data is input into the Gaussian mixture-universal background model GMM-UBM, which is trained to obtain a prior model for the specific speaker model; a small amount of target user audio data is then input into the prior model, and the parameters of the prior model are fine-tuned to obtain the final joint model.
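The two-stage recipe above — train a UBM on plentiful non-target data, then adapt it with a little target-speaker data — can be sketched with scikit-learn's `GaussianMixture`, using warm-start initialization from the UBM parameters as a stand-in for full MAP adaptation; the synthetic data, dimensionality, and component count are illustrative only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for acoustic feature frames (e.g. MFCCs); real systems use far more data.
non_target = rng.normal(0.0, 1.0, size=(2000, 4))   # many non-target speakers
target = rng.normal(0.5, 1.0, size=(200, 4))        # a little target-speaker audio

# Step 1: train the universal background model (the prior) on non-target data.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(non_target)

# Step 2: fine-tune from the UBM parameters on the target speaker's data
# (warm initialization approximates the parameter fine-tuning described above).
adapted = GaussianMixture(
    n_components=8,
    covariance_type="diag",
    weights_init=ubm.weights_,
    means_init=ubm.means_,
    precisions_init=ubm.precisions_,
    random_state=0,
)
adapted.fit(target)
```

Because EM only ever increases the data likelihood from its initialization, the adapted model fits the target speaker at least as well as the unadapted UBM.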
According to the intention identification method for audio data provided by the invention, the Gaussian mixture-universal background model GMM-UBM is trained on non-target user audio data and target user audio data to obtain the joint model, which strongly supports the joint model's deep understanding of user intent and enables accurate and efficient recognition of the multiple intentions of the target voice.
Fig. 2 is a schematic structural diagram of an intention recognition apparatus for audio data according to the present invention, and as shown in fig. 2, the intention recognition apparatus for audio data according to the present invention includes:
an audio data acquiring module 210, configured to acquire audio data including a target voice;
the audio data processing module 220 is configured to input the audio data including the target speech into a pre-trained joint model to obtain an instruction intention of the target speech;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic character vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
and the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring the instruction intention of the target voice based on the combined objective function.
The intention recognition device for audio data provided by the invention, by providing the audio data acquisition module and the audio data processing module, acquires audio data containing target voice and inputs it into the pre-trained joint model to obtain the instruction intention of the target voice, so that the joint model deeply understands the user intention and accurately and efficiently recognizes the multiple intentions of the target voice.
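The combined objective described for the instruction intention acquisition layer is, in joint slot-filling/intent models generally, a sum of a per-token slot loss and an utterance-level intent loss. A minimal NumPy sketch of such a combined objective follows; it is an illustrative reading, not the patented implementation, and all names are hypothetical:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax along the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, label):
    """Negative log-probability of the correct label."""
    return -np.log(probs[label] + 1e-12)

def joint_objective(slot_logits, slot_labels, intent_logits, intent_label):
    """Combined objective: per-token slot losses plus one utterance-level intent loss."""
    slot_loss = sum(cross_entropy(softmax(l), y)
                    for l, y in zip(slot_logits, slot_labels))
    intent_loss = cross_entropy(softmax(intent_logits), intent_label)
    return slot_loss + intent_loss
```

Minimizing this single scalar trains the slot-filling layer and the intention prediction layer jointly, which is what lets the intent context inform slot decisions and vice versa.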
Based on any one of the above embodiments, in this embodiment, the intention recognition apparatus for audio data according to the present invention further includes:
a slot filling weighting parameter layer for obtaining a weighting feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
The intention recognition device for audio data, by providing the slot filling weighted parameter layer, further discloses the specific calculation path of the weighted feature parameter used to improve the performance of the semantic slot filling layer, which strongly supports the joint model's deep understanding of user intent and enables accurate and efficient recognition of the multiple intentions of the target voice.
Based on any one of the embodiments described above, in this embodiment, the intention identification apparatus for audio data provided by the invention further includes:
the sample audio data acquisition unit is used for acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
the prior model unit is used for training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and the joint model unit is used for training the prior model based on the target user audio data to obtain the joint model.
The intention recognition device for audio data, by providing the sample audio data acquisition unit, the prior model unit, and the joint model unit, further discloses training the Gaussian mixture-universal background model GMM-UBM on non-target user audio data and target user audio data to obtain the joint model, which strongly supports the joint model's deep understanding of user intent and enables accurate and efficient recognition of the multiple intentions of the target voice.
In another aspect, the present invention also provides an electronic device. Fig. 3 illustrates a schematic structural diagram of the electronic device. As shown in Fig. 3, the electronic device may include a processor 310, a communication bus 320, a memory 330, a communication interface 340, and a computer program stored in the memory 330 and operable on the processor 310. The processor 310, the communication interface 340, and the memory 330 communicate with each other through the communication bus 320, and the processor 310 may call logic instructions in the memory 330 to perform the intention identification method for audio data, the method comprising:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic character vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
Finally, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the intention recognition method for audio data, the method comprising:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic character vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. An intention recognition method with respect to audio data, characterized by comprising:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for converting the audio data containing the target voice into an initial vector; solving a hidden layer vector and a slot context vector based on the initial vector; solving a semantic character vector through a softmax function based on the hidden layer vector and the slot context vector; wherein the bin context vector comprises an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in context;
the intention prediction layer is used for acquiring an intention context vector according to the audio data containing the target voice; based on the intention context vector, obtaining a semantic prediction vector;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
2. The method of claim 1, further comprising:
obtaining a weighted feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
3. The method of claim 1, wherein the method further comprises:
acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and training the prior model based on the target user audio data to obtain the joint model.
4. An intention recognition apparatus with respect to audio data, characterized by comprising:
the audio data acquisition module is used for acquiring audio data containing target voice;
the audio data processing module is used for inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for converting the audio data containing the target voice into an initial vector; solving a hidden layer vector and a slot context vector based on the initial vector; solving a semantic character vector through a softmax function based on the hidden layer vector and the slot context vector; wherein the bin context vector comprises an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in context;
the intention prediction layer is used for acquiring an intention context vector according to the audio data containing the target voice; based on the intention context vector, obtaining a semantic prediction vector;
and the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring the instruction intention of the target voice based on the combined objective function.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method for intention recognition with respect to audio data according to any one of claims 1 to 3 are implemented when the program is executed by the processor.
6. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for intention recognition on audio data according to any one of claims 1 to 3.
CN202211178066.0A 2022-09-27 2022-09-27 Intention identification method and device for audio data Active CN115273849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211178066.0A CN115273849B (en) 2022-09-27 2022-09-27 Intention identification method and device for audio data

Publications (2)

Publication Number Publication Date
CN115273849A CN115273849A (en) 2022-11-01
CN115273849B true CN115273849B (en) 2022-12-27

Family

ID=83757223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211178066.0A Active CN115273849B (en) 2022-09-27 2022-09-27 Intention identification method and device for audio data

Country Status (1)

Country Link
CN (1) CN115273849B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11094317B2 (en) * 2018-07-31 2021-08-17 Samsung Electronics Co., Ltd. System and method for personalized natural language understanding
CN109785833A (en) * 2019-01-02 2019-05-21 苏宁易购集团股份有限公司 Human-computer interaction audio recognition method and system for smart machine
CN110516253B (en) * 2019-08-30 2023-08-25 思必驰科技股份有限公司 Chinese spoken language semantic understanding method and system
CN110853626B (en) * 2019-10-21 2021-04-20 成都信息工程大学 Bidirectional attention neural network-based dialogue understanding method, device and equipment
CN113505591A (en) * 2020-03-23 2021-10-15 华为技术有限公司 Slot position identification method and electronic equipment
CN112037773B (en) * 2020-11-05 2021-01-29 北京淇瑀信息科技有限公司 N-optimal spoken language semantic recognition method and device and electronic equipment
CN113204952B (en) * 2021-03-26 2023-09-15 南京邮电大学 Multi-intention and semantic slot joint identification method based on cluster pre-analysis


Similar Documents

Publication Publication Date Title
CN110111775B (en) Streaming voice recognition method, device, equipment and storage medium
CN110321418B (en) Deep learning-based field, intention recognition and groove filling method
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN110853626B (en) Bidirectional attention neural network-based dialogue understanding method, device and equipment
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN111161726B (en) Intelligent voice interaction method, device, medium and system
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN111445898A (en) Language identification method and device, electronic equipment and storage medium
US11322151B2 (en) Method, apparatus, and medium for processing speech signal
CN113505198A (en) Keyword-driven generating type dialogue reply method and device and electronic equipment
US20240331686A1 (en) Relevant context determination
CN111046674A (en) Semantic understanding method and device, electronic equipment and storage medium
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN116303966A (en) Dialogue behavior recognition system based on prompt learning
CN114254649A (en) Language model training method and device, storage medium and equipment
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
CN115273849B (en) Intention identification method and device for audio data
CN116186255A (en) Method for training unknown intention detection model, unknown intention detection method and device
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN111680514A (en) Information processing and model training method, device, equipment and storage medium
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
Biswas et al. Speech Emotion Recognition Using Deep CNNs Trained on Log-Frequency Spectrograms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant