CN115273849B - Intention identification method and device for audio data - Google Patents
- Publication number: CN115273849B
- Application number: CN202211178066.0A
- Authority
- CN
- China
- Prior art keywords
- audio data
- intention
- vector
- semantic
- target voice
- Prior art date
- Legal status: Active
Classifications
- G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
- G06F40/30 - Semantic analysis
- G10L15/063 - Training
- G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L2015/0638 - Interactive procedures
- G10L2015/223 - Execution procedure of a spoken command
Abstract
The invention provides an intention identification method and device for audio data. The method comprises: acquiring audio data containing a target voice; and inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice. The joint model is trained on sample audio data and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer. The semantic slot filling layer acquires semantic word vectors from the audio data containing the target voice; the intention prediction layer acquires a semantic prediction vector from the same audio data; and the instruction intention acquisition layer acquires a combined objective function from the semantic word vectors and the semantic prediction vector and obtains the instruction intention of the target voice based on the combined objective function. By deeply understanding the user's intention through the joint model, the invention recognizes the multiple intentions of the target voice accurately and efficiently and obtains the instruction intention of the target voice.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an intention identification method and device for audio data.
Background
In recent years, with the development of related technologies such as natural language processing and knowledge graphs, question answering systems have spread into many fields. Completing operation and maintenance (O&M) actions in question-and-answer form, through man-machine interaction with an O&M robot, can greatly improve the working efficiency of O&M staff, and intention recognition (intent detection) is the key to building such a man-machine dialogue system.
Existing O&M robots are mostly question answering systems with a single independent function, yet a user may have different intentions on different occasions, so a man-machine dialogue system can involve multiple domains, including task-oriented vertical domains, chit-chat and the like. Intention texts in task-oriented vertical domains have clear topics and are easy to retrieve, e.g. queries about memory utilization or CPU utilization. Chit-chat intention texts generally have unclear topics, broad semantics and short sentences, and focus on open-domain communication with humans. Only after the dialogue system has clarified the user's topic domain can it correctly parse the user's specific requirements; otherwise subsequent intentions will be misrecognized.
One existing technique is intention recognition based on rule templates, which generally requires manually constructing the templates and classifying user intention texts according to category information. For consumption-intention recognition, the prior art derives intention templates through rule- and graph-based methods and achieves good classification results in a single domain. However, the different ways of expressing the same thing within one domain cause the number of rule templates to grow, consuming a great deal of manpower and material resources. Therefore, although rule-template matching can guarantee recognition accuracy without a large amount of training data, it cannot avoid the high cost of rebuilding the templates whenever the intention texts change. In other words, the prior art has the following defect in intention recognition: rule-template methods suited to single-intention recognition are not suited to multi-intention recognition, and existing intention recognition technology urgently needs a method suited to multi-intention recognition.
Disclosure of Invention
The invention provides an intention recognition method and device for audio data to solve the problem that prior-art intention recognition methods are not suited to multi-intention recognition: by deeply understanding the user's intention through a joint model, the multiple intentions of the target voice are recognized accurately and efficiently.
The invention provides an intention identification method for audio data, which comprises the following steps:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
According to the method for recognizing the intention of the audio data provided by the invention, the semantic word vector is obtained according to the audio data containing the target voice, and the method comprises the following steps:
converting the audio data containing the target voice into an initial vector;
and mapping the initial vector into a semantic word vector.
According to the intention recognition method for audio data provided by the invention, the mapping of the initial vector into a semantic word vector comprises the following steps:
solving a hidden layer vector and a slot context vector based on the initial vector;
and solving the semantic word vector through a softmax function based on the hidden layer vector and the slot context vector.
According to the intention recognition method for audio data provided by the present invention, the slot context vector includes an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in the context.
According to the intention recognition method for audio data provided by the invention, the obtaining of semantic prediction vectors according to the audio data containing target voice comprises the following steps:
acquiring an intention context vector according to the audio data containing the target voice;
based on the intention context vector, a semantic prediction vector is obtained.
According to the intention identifying method for audio data provided by the invention, the intention identifying method further comprises the following steps:
obtaining a weighted feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
According to the intention recognition method for audio data provided by the present invention, the method further comprises:
acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and training the prior model based on the audio data of the target user to obtain a combined model.
The present invention also provides an intention recognition apparatus with respect to audio data, including:
the audio data acquisition module is used for acquiring audio data containing target voice;
the audio data processing module is used for inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
and the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring the instruction intention of the target voice based on the combined objective function.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the intention recognition method for audio data described above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the intention recognition method for audio data described above.
The invention provides a method and a device for identifying intentions in audio data: audio data containing a target voice is acquired and input into a pre-trained joint model to obtain the instruction intention of the target voice. The joint model is trained on sample audio data and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; the semantic slot filling layer acquires semantic word vectors from the audio data containing the target voice; the intention prediction layer acquires a semantic prediction vector from the same audio data; and the instruction intention acquisition layer acquires a combined objective function from the semantic word vectors and the semantic prediction vector and obtains the instruction intention of the target voice based on the combined objective function. By deeply understanding the user's intention through the joint model, the invention recognizes the multiple intentions of the target voice accurately and efficiently, obtains the instruction intention of the target voice, and makes notable progress.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart illustrating an intention recognition method for audio data according to the present invention;
FIG. 2 is a schematic diagram of an intention recognition apparatus for audio data according to the present invention;
fig. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The intention recognition method, apparatus, electronic device and storage medium for audio data according to the present invention are specifically described below with reference to fig. 1 to 3 by way of example.
Fig. 1 is a schematic flowchart of an intention identification method for audio data according to the present invention, and as shown in fig. 1, the intention identification method for audio data according to the present invention includes:
step S110, obtaining audio data containing target voice;
in this embodiment, the target voice is a spoken voice including instruction information uttered by the user. That is, all sounds within a certain collection distance range, including spoken sounds including instruction information issued by a user, are regarded as audio data to be collected. The invention adopts the strong hypothesis that the definition of the spoken voice containing the instruction information sent by the user far exceeds the background sound of the position of the user, and finally obtains the instruction intention corresponding to the target voice based on the strong hypothesis that the target voice in the collected audio data can be read only and clearly.
In the present embodiment, it is assumed that the user utters a spoken voice including instruction information within the collection range, and audio data including the target voice is acquired.
Step S120, inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
In this embodiment, the joint model is a network model with deep learning capability and adopts a BiLSTM (Bidirectional Long Short-Term Memory) structure, which better captures bidirectional semantic dependencies; a BiLSTM is formed by combining a forward LSTM and a backward LSTM.
In this embodiment, an input vector is generated from the audio data containing the target voice; the semantic slot filling layer maps the input vector to the semantic word vectors $y_i^S$, and the intention prediction layer maps the input vector to the semantic prediction vector $y^I$. Finally, the instruction intention acquisition layer acquires a combined objective function from the semantic word vectors $y_i^S$ and the semantic prediction vector $y^I$, and obtains the instruction intention of the target voice from the combined objective function.
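As a concrete illustration of this architecture, the following is a minimal sketch of the shared BiLSTM encoder in PyTorch. The patent publishes no code, so all class names, dimensions and hyperparameters here are assumptions for illustration only.

```python
# A minimal sketch (assumptions: PyTorch, hypothetical sizes) of the shared
# BiLSTM encoder described above; one hidden vector h_i is produced per word.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=128, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # word sequence -> initial vectors x_i
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, T, emb_dim)
        h, _ = self.bilstm(x)       # (batch, T, 2*hidden_dim): one h_i per word
        return h

# Usage: encode a toy "word sequence" of length T = 6
encoder = BiLSTMEncoder()
tokens = torch.randint(0, 5000, (1, 6))
hidden = encoder(tokens)
print(hidden.shape)  # torch.Size([1, 6, 128])
```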
The intention identification method for audio data provided by the invention acquires audio data containing a target voice and inputs it into a pre-trained joint model to obtain the instruction intention of the target voice; the joint model deeply understands the user's intention and accurately and efficiently recognizes the multiple intentions of the target voice.
According to the intention identification method about audio data provided by the invention, the semantic word vector is obtained according to the audio data containing the target voice, and the method comprises the following steps:
converting the audio data containing the target voice into an initial vector;
and mapping the initial vector into a semantic word vector.
In the present embodiment, the audio data containing the target voice is converted into an initial vector $\boldsymbol{x} = (x_1, \ldots, x_T)$. The initial vector is the input vector of the joint model and is in essence a word sequence whose elements correspond one-to-one to the single words in the target voice; the semantic slot filling layer then maps the initial vector to the semantic word vectors $y_i^S$.

Substituting the initial vector into the combined objective function gives:

$$p\left(y^S, y^I \mid \boldsymbol{x}\right) = p\left(y^I \mid \boldsymbol{x}\right) \prod_{i=1}^{T} p\left(y_i^S \mid \boldsymbol{x}\right)$$
according to the method for recognizing the intention of the audio data, provided by the invention, the audio data containing the target voice is converted into the initial vector, and then the initial vector is mapped into the semantic character vector, so that the generation path of the semantic character vector is further clarified, and the deep understanding of a joint model on the user intention and the accurate and efficient recognition of various intentions of the target voice are powerfully supported.
According to the intention recognition method for audio data provided by the invention, the mapping of the initial vector into a semantic word vector comprises the following steps:
solving a hidden layer vector and a slot context vector based on the initial vector;
and solving the semantic word vector through a softmax function based on the hidden layer vector and the slot context vector.
In this embodiment, the softmax function, also called the normalized exponential function, is a single-layer neural network. The hidden layer vector $h_i$ is the vector corresponding to the $i$-th single word in the target voice, i.e. to the $i$-th element of the initial vector's word sequence, and represents the meaning of that single word. The slot context vector $c_i^S$ is the context vector corresponding to the $i$-th single word; it evaluates, in combination with the context, which of the word's multiple specific meanings is its actual meaning, where $i$ ranges over $[1, T]$.

In this embodiment, the BiLSTM structure is used to obtain the hidden layer vectors $h_i$ and the slot context vectors $c_i^S$ from the input initial vector. From the hidden layer vector $h_i$ and the slot context vector $c_i^S$, the slot filling label $y_i^S$ of the $i$-th single word in the word sequence, i.e. the semantic word vector, is obtained through the softmax function, formulated as:

$$y_i^S = \operatorname{softmax}\left(W_{hy}^{S}\left(h_i + c_i^{S}\right)\right)$$

where $W_{hy}^{S}$ is a weight matrix, $h_i$ is the hidden layer vector, and $c_i^S$ is the slot context vector.
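A small sketch of this slot-label computation, assuming PyTorch and the hypothetical shapes of the encoder sketch above (the label count and dimensions are illustrative, not from the patent):

```python
# Sketch of y_i^S = softmax(W_hy^S (h_i + c_i^S)) for every word position,
# with assumed hidden vectors h_i and slot context vectors c_i^S.
import torch
import torch.nn as nn

num_slot_labels, d = 10, 128
W_hy_S = nn.Linear(d, num_slot_labels, bias=False)  # weight matrix W_hy^S

h = torch.randn(1, 6, d)    # hidden layer vectors h_i
c_S = torch.randn(1, 6, d)  # slot context vectors c_i^S
y_S = torch.softmax(W_hy_S(h + c_S), dim=-1)  # one slot-label distribution per word
print(y_S.shape)  # torch.Size([1, 6, 10])
```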
The invention provides an intention identification method for audio data that obtains the hidden layer vectors and slot context vectors from the initial vector, and then obtains the semantic word vectors from them through the softmax function, further clarifying the specific generation path of the semantic word vectors, strongly supporting the joint model's deep understanding of the user's intention, and accurately and efficiently recognizing the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the present invention, the slot context vector includes an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in the context.
In this embodiment, the slot context vector $c_i^S$ includes the attention score parameter $\alpha_{i,j}^S$, which represents the probability that each of the multiple specific meanings of a single word in the audio data matches the word's actual meaning in context, formulated as:

$$c_i^{S} = \sum_{j=1}^{T} \alpha_{i,j}^{S}\, h_j, \qquad \alpha_{i,j}^{S} = \frac{\exp\left(e_{i,j}\right)}{\sum_{k=1}^{T} \exp\left(e_{i,k}\right)}, \qquad e_{i,k} = \sigma\left(W_{he}^{S} h_k + W_{ie}\, x_i\right)$$

where $e_{i,k}$ represents the relationship between the hidden state $h_k$ and the current input vector $x_i$; $\sigma$ is the activation function; $W_{he}^{S}$ and $W_{ie}$ are weight matrices; $k$ ranges over the positions conveying the multiple specific meanings, and $j$ indexes the $j$-th specific meaning of the single word; $W_{he}^{S} h_k$ is computed with a convolution, and $W_{ie} x_i$ with a linear mapping.
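The following sketch computes attention scores and slot context vectors in PyTorch. For brevity it uses a simple dot-product score in place of the $\sigma(W_{he}^{S} h_k + W_{ie} x_i)$ scoring above; this substitution, and all shapes, are assumptions.

```python
# Sketch of the slot attention: scores e_{i,k} relate positions to each other,
# alpha is their softmax, and c_i^S is the alpha-weighted sum of hidden states.
# A dot-product score stands in for the learned scoring function of the text.
import torch

def slot_context(h):
    # h: (batch, T, d) hidden states from the BiLSTM
    e = torch.einsum("bid,bkd->bik", h, h)      # e_{i,k}: pairwise scores (simplified stand-in)
    alpha = torch.softmax(e, dim=-1)            # attention scores alpha_{i,j}
    c = torch.einsum("bik,bkd->bid", alpha, h)  # c_i^S = sum_j alpha_{i,j} h_j
    return c, alpha

c_S, alpha = slot_context(torch.randn(1, 6, 128))
print(c_S.shape, alpha.shape)  # torch.Size([1, 6, 128]) torch.Size([1, 6, 6])
```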
The intention recognition method for audio data provided by the invention strongly supports the joint model's deep understanding of the user's intention by further clarifying how the attention score parameter in the slot context vector is calculated, and accurately and efficiently recognizes the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the invention, the obtaining of semantic prediction vectors according to the audio data containing target voice comprises the following steps:
acquiring an intention context vector according to the audio data containing the target voice;
based on the intention context vector, a semantic prediction vector is obtained.
In the present embodiment, the audio data containing the target voice is converted into the initial vector $\boldsymbol{x}$, the input vector of the joint model, which is in essence a word sequence whose elements correspond one-to-one to the single words in the target voice. The intention prediction layer then maps the initial vector to the intention context vector $c^I$, from which the semantic prediction vector $y^I$ is generated. The intention context vector $c^I$ is computed in the same way as the slot context vector $c_i^S$. The predicted hidden layer vector $h_T$ denotes the vector obtained using only the last hidden state of the BiLSTM when predicting the intention. The semantic prediction vector $y^I$ is formulated as:

$$y^{I} = \operatorname{softmax}\left(W_{hy}^{I}\left(h_T + c^{I}\right)\right)$$

where $W_{hy}^{I}$ is a weight matrix, $c^I$ is the intention context vector, and $h_T$ is the predicted hidden layer vector.
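A matching sketch of the intent head, again in PyTorch with illustrative (assumed) dimensions:

```python
# Sketch of y^I = softmax(W_hy^I (h_T + c^I)): only the last hidden state h_T
# plus an intent context vector c^I feeds the intent classifier.
import torch
import torch.nn as nn

num_intents, d = 5, 128
W_hy_I = nn.Linear(d, num_intents, bias=False)  # weight matrix W_hy^I

h = torch.randn(1, 6, d)  # all hidden states from the BiLSTM
c_I = torch.randn(1, d)   # intent context vector c^I
h_T = h[:, -1, :]         # last hidden state h_T
y_I = torch.softmax(W_hy_I(h_T + c_I), dim=-1)  # one intent distribution per utterance
print(y_I.shape)  # torch.Size([1, 5])
```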
According to the intention recognition method for the audio data, provided by the invention, the intention context vector is obtained according to the audio data containing the target voice, and then the semantic prediction vector is obtained by the intention context vector, so that the specific obtaining path of the semantic prediction vector is further clarified, the deep understanding of a joint model to the user intention is powerfully supported, and the multiple intentions of the target voice are accurately and efficiently recognized.
According to the intention identifying method for audio data provided by the invention, the intention identifying method further comprises the following steps:
obtaining a weighted feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
In the present embodiment, the weighted feature parameter $g$ can be viewed as a weighted feature that jointly combines the slot context vector $c_i^S$ and the intention context vector $c^I$; its main purpose is to use the intention context vector to improve the performance of the semantic slot filling layer. The weighted feature parameter $g$ is formulated as:

$$g = \sum v \cdot \tanh\left(c_i^{S} + W \cdot c^{I}\right)$$

where $v$ is a trainable vector, $W$ is a trainable matrix, $\tanh$ is the hyperbolic tangent function, $c_i^S$ is the slot context vector, and $c^I$ is the intention context vector.

Accordingly, the weighted feature parameter $g$ is incorporated into the semantic word vector, formulated as:

$$y_i^{S} = \operatorname{softmax}\left(W_{hy}^{S}\left(h_i + c_i^{S} \cdot g\right)\right)$$
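A sketch of this gating step in PyTorch; v, W and all dimensions are hypothetical trainable parameters, not values from the patent:

```python
# Sketch of the gate g = sum(v * tanh(c_i^S + W c^I)) and its use to reweight
# the slot context before the slot softmax, following the description above.
import torch
import torch.nn as nn

d = 128
v = nn.Parameter(torch.randn(d))  # trainable vector v
W = nn.Linear(d, d, bias=False)   # trainable matrix W

h = torch.randn(1, 6, d)    # hidden layer vectors h_i
c_S = torch.randn(1, 6, d)  # slot context vectors c_i^S
c_I = torch.randn(1, d)     # intent context vector c^I

g = (v * torch.tanh(c_S + W(c_I).unsqueeze(1))).sum(dim=-1, keepdim=True)  # (1, 6, 1)
W_hy_S = nn.Linear(d, 10, bias=False)
y_S = torch.softmax(W_hy_S(h + c_S * g), dim=-1)  # gated slot-label distributions
print(y_S.shape)  # torch.Size([1, 6, 10])
```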
according to the intention identification method for the audio data, disclosed by the invention, the specific calculation path of the weighted characteristic parameters for improving the performance of the semantic slot filling layer is further disclosed, so that the deep understanding of a joint model on the intention of a user is powerfully supported, and the multiple intentions of the target voice are accurately and efficiently identified.
According to the intention recognition method for audio data provided by the present invention, the method further comprises:
acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and training the prior model based on the audio data of the target user to obtain a combined model.
In the present embodiment, the Gaussian mixture-universal background model (GMM-UBM) is an improvement of the Gaussian mixture model (GMM). The Universal Background Model (UBM) was proposed by D. A. Reynolds' group.
In the embodiment, a large amount of non-target user audio data is input into a Gaussian mixture-universal background model GMM-UBM, and a prior model of a specific speaker model is obtained through training; and inputting a small amount of target user audio data into the prior model, and finely adjusting parameters of the prior model to obtain a final combined model.
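To make the two-stage training concrete, here is a minimal sketch using scikit-learn's GaussianMixture as the UBM and a simplified mean-only MAP adaptation for the fine-tuning step; the feature dimensions, relevance factor and adaptation rule are assumptions, not the patent's exact procedure:

```python
# Sketch: fit a UBM on pooled non-target features, then adapt its means
# toward a small amount of target-user data (simplified MAP adaptation).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
non_target = rng.normal(size=(2000, 13))       # e.g. MFCC frames from many speakers
target = rng.normal(loc=0.3, size=(200, 13))   # a little target-user data

ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(non_target)  # prior model

# Mean-only MAP adaptation with relevance factor r (assumed value)
r = 16.0
resp = ubm.predict_proba(target)                         # frame-to-component responsibilities
n_k = resp.sum(axis=0)                                   # soft counts per component
E_k = resp.T @ target / np.maximum(n_k[:, None], 1e-8)   # per-component data means
alpha = (n_k / (n_k + r))[:, None]                       # adaptation weights
ubm.means_ = alpha * E_k + (1 - alpha) * ubm.means_      # adapted speaker model
```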
According to the intention identification method for the audio data, provided by the invention, the Gaussian mixture-general background model GMM-UBM is trained to obtain the joint model based on the non-target user audio data and the target user audio data, so that the deep understanding of the joint model on the user intention is powerfully supported, and the multiple intentions of the target voice are accurately and efficiently identified.
Fig. 2 is a schematic structural diagram of an intention recognition apparatus for audio data according to the present invention, and as shown in fig. 2, the intention recognition apparatus for audio data according to the present invention includes:
an audio data acquiring module 210, configured to acquire audio data including a target voice;
the audio data processing module 220 is configured to input the audio data including the target speech into a pre-trained joint model to obtain an instruction intention of the target speech;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
and the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring the instruction intention of the target voice based on the combined objective function.
The intention recognition device for the audio data provided by the invention acquires the audio data containing the target voice by arranging the audio data acquisition module and the audio data processing module; and inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice, so that the joint model can understand the user intention deeply and accurately and efficiently recognize various intentions of the target voice to obtain the instruction intention of the target voice.
Based on any one of the above embodiments, in this embodiment, the intention recognition apparatus for audio data according to the present invention further includes:
a slot filling weighting parameter layer for obtaining a weighting feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
The intention recognition device for the audio data further discloses a specific calculation path of the weighted characteristic parameters for improving the performance of the semantic slot filling layer by setting the slot filling weighted parameter layer, powerfully supports deep understanding of a joint model to the user intention, and accurately and efficiently recognizes multiple intentions of the target voice.
Based on any one of the embodiments described above, in this embodiment, the intention identification apparatus for audio data provided by the invention further includes:
the audio data acquisition unit is used for acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
the prior model unit is used for training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and the joint model unit is used for training the prior model based on the audio data of the target user to obtain a joint model.
The intention recognition device for the audio data further discloses a combined model obtained by training a Gaussian mixture-universal background model GMM-UBM based on non-target user audio data and target user audio data by arranging a sample audio data acquisition unit, a prior model unit and a combined model unit, powerfully supports deep understanding of the combined model to the user intention, and accurately and efficiently recognizes multiple intentions of target voice.
In another aspect, the present invention also provides an electronic device; fig. 3 illustrates its schematic structure. As shown in fig. 3, the electronic device may include a processor 310, a communication bus 320, a memory 330, a communication interface 340, and a computer program stored in the memory 330 and operable on the processor 310, wherein the processor 310, the communication interface 340, and the memory 330 communicate with each other through the communication bus 320. The processor 310 may call logic instructions in the memory 330 to perform an intention identification method for audio data, the method comprising:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
Finally, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, can implement an intent recognition method with respect to audio data, the method comprising:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (6)
1. An intention recognition method with respect to audio data, characterized by comprising:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for converting the audio data containing the target voice into an initial vector; solving a hidden layer vector and a slot context vector based on the initial vector; and solving a semantic word vector through a softmax function based on the hidden layer vector and the slot context vector; wherein the slot context vector comprises an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in context;
the intention prediction layer is used for acquiring an intention context vector according to the audio data containing the target voice; based on the intention context vector, obtaining a semantic prediction vector;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
2. The method of claim 1, further comprising:
obtaining a weighted feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
3. The method of claim 1, wherein the method further comprises:
acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and training the prior model based on the audio data of the target user to obtain a combined model.
4. An intention recognition apparatus with respect to audio data, characterized by comprising:
the audio data acquisition module is used for acquiring audio data containing target voice;
the audio data processing module is used for inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for converting the audio data containing the target voice into an initial vector; solving a hidden layer vector and a slot context vector based on the initial vector; and solving a semantic word vector through a softmax function based on the hidden layer vector and the slot context vector; wherein the slot context vector comprises an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in context;
the intention prediction layer is used for acquiring an intention context vector according to the audio data containing the target voice; based on the intention context vector, obtaining a semantic prediction vector;
and the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring the instruction intention of the target voice based on the combined objective function.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method for intention recognition with respect to audio data according to any one of claims 1 to 3 are implemented when the program is executed by the processor.
6. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for intention recognition on audio data according to any one of claims 1 to 3.
Priority Applications (1)
- CN202211178066.0A, filed 2022-09-27, priority date 2022-09-27: Intention identification method and device for audio data (granted as CN115273849B)
Publications (2)
- CN115273849A, published 2022-11-01
- CN115273849B, granted 2022-12-27

Family
- ID=83757223; family application CN202211178066.0A (CN115273849B, Active, China)
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant