CN115273849B - Intention identification method and device for audio data - Google Patents


Info

Publication number
CN115273849B
Authority
CN
China
Prior art keywords
audio data
intention
vector
semantic
target voice
Prior art date
Legal status
Active
Application number
CN202211178066.0A
Other languages
Chinese (zh)
Other versions
CN115273849A (en)
Inventor
蒋宇
徐敏
李鑫豪
任纪良
Current Assignee
Beijing Baolande Software Co ltd
Original Assignee
Beijing Baolande Software Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baolande Software Co ltd
Priority claimed from CN202211178066.0A
Publication of CN115273849A
Application granted
Publication of CN115273849B
Legal status: Active

Classifications

    • G10L 15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F 40/30 — Handling natural language data; semantic analysis
    • G10L 15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 2015/0638 — Training; interactive procedures
    • G10L 2015/223 — Execution procedure of a spoken command


Abstract

The invention provides an intention recognition method and device for audio data. The method comprises: acquiring audio data containing a target voice; and inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice. The joint model is trained on sample audio data and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer. The semantic slot filling layer obtains semantic word vectors from the audio data containing the target voice; the intention prediction layer obtains a semantic prediction vector from the audio data containing the target voice; and the instruction intention acquisition layer obtains a joint objective function from the semantic word vectors and the semantic prediction vector, and obtains the instruction intention of the target voice based on the joint objective function. Through the joint model, the invention deeply understands the user intention, accurately and efficiently recognizes the multiple intentions of the target voice, and obtains the instruction intention of the target voice.

Description

Intention identification method and device for audio data
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an intention identification method and device for audio data.
Background
In recent years, with the development of related technologies such as natural language processing and knowledge graphs, question answering systems have expanded into many fields. Completing operation and maintenance actions in a question-and-answer manner, through man-machine interaction with an operation and maintenance robot, can greatly improve the working efficiency of operation and maintenance personnel, and intention recognition (Intent Detection) is the key to building a man-machine dialogue system.
Existing operation and maintenance robots are mostly question answering systems with a single independent function, while a user may have different intentions on different occasions, so a man-machine dialogue system for operation and maintenance can involve multiple domains, including task-oriented vertical domains, chit-chat and the like. Intention texts in task-oriented vertical domains have clear topics and are easy to retrieve, such as querying memory utilization or CPU utilization. Chit-chat intention texts generally have unclear topics, broad semantics and short sentences, and focus on open-domain communication with humans. In a dialogue system, only when the topic domain of the user is clear can the specific requirements of the user be correctly analyzed; otherwise, subsequent intentions may be recognized incorrectly.
The prior art recognizes single intentions based on rule templates. Rule-template-based intention recognition generally requires manually constructing rule templates and classifying user intention texts according to category information. For consumption intention recognition, the prior art obtains intention templates by rule- and graph-based methods and achieves a good classification effect in a single domain. However, different expression modes within the same domain cause the number of rule templates to grow, consuming a large amount of manpower and material resources. Therefore, although rule-template matching can ensure recognition accuracy without a large amount of training data, it cannot solve the high cost of reconstructing templates when the intention text changes. That is, the prior art has the following defect in intention recognition: the rule-template-based method suited to single-intent recognition is not suited to multi-intent recognition, and the existing intention recognition technology urgently needs a method suitable for multi-intent recognition.
Disclosure of Invention
The invention provides an intention recognition method and device for audio data, which solve the problem that intention recognition methods in the prior art are not suitable for multi-intent recognition: a joint model deeply understands the user intention and accurately and efficiently recognizes the multiple intentions of a target voice.
The invention provides an intention identification method for audio data, which comprises the following steps:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
According to the method for recognizing the intention of the audio data provided by the invention, the semantic word vector is obtained according to the audio data containing the target voice, and the method comprises the following steps:
converting the audio data containing the target voice into an initial vector;
and mapping the initial vector into a semantic word vector.
According to the intention recognition method for audio data provided by the invention, the mapping of the initial vector into a semantic word vector comprises the following steps:
solving a hidden layer vector and a slot context vector based on the initial vector;
and solving the semantic word vector through a softmax function based on the hidden layer vector and the slot context vector.
According to the intention recognition method for audio data provided by the present invention, the slot context vector includes an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in the context.
According to the intention recognition method for audio data provided by the invention, the obtaining of semantic prediction vectors according to the audio data containing target voice comprises the following steps:
acquiring an intention context vector according to the audio data containing the target voice;
based on the intention context vector, a semantic prediction vector is obtained.
According to the intention identifying method for audio data provided by the invention, the intention identifying method further comprises the following steps:
obtaining a weighted feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
According to the intention recognition method for audio data provided by the present invention, the method further comprises:
acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and training the prior model based on the audio data of the target user to obtain a combined model.
The present invention also provides an intention recognition apparatus for audio data, including:
the audio data acquisition module is used for acquiring audio data containing target voice;
the audio data processing module is used for inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
and the instruction intention acquisition layer is used for acquiring a joint objective function according to the semantic word vector and the semantic prediction vector and acquiring the instruction intention of the target voice based on the joint objective function.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the intention recognition method for audio data as described above.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the intention recognition method for audio data as described above.
The invention provides a method and a device for recognizing intentions in audio data: audio data containing a target voice is acquired and input into a pre-trained joint model to obtain the instruction intention of the target voice. The joint model is trained on sample audio data and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer. The semantic slot filling layer obtains semantic word vectors from the audio data containing the target voice; the intention prediction layer obtains a semantic prediction vector from the audio data containing the target voice; and the instruction intention acquisition layer obtains a joint objective function from the semantic word vectors and the semantic prediction vector, and obtains the instruction intention of the target voice based on the joint objective function. Through the joint model, the invention deeply understands the user intention, accurately and efficiently recognizes the multiple intentions of the target voice, obtains the instruction intention of the target voice, and achieves a remarkable advance.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart illustrating an intention recognition method for audio data according to the present invention;
FIG. 2 is a schematic diagram of an intention recognition apparatus for audio data according to the present invention;
fig. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The intention recognition method, apparatus, electronic device and storage medium for audio data according to the present invention are specifically described below with reference to fig. 1 to 3 by way of example.
Fig. 1 is a schematic flowchart of an intention identification method for audio data according to the present invention, and as shown in fig. 1, the intention identification method for audio data according to the present invention includes:
step S110, obtaining audio data containing target voice;
in this embodiment, the target voice is a spoken voice including instruction information uttered by the user. That is, all sounds within a certain collection distance range, including spoken sounds including instruction information issued by a user, are regarded as audio data to be collected. The invention adopts the strong hypothesis that the definition of the spoken voice containing the instruction information sent by the user far exceeds the background sound of the position of the user, and finally obtains the instruction intention corresponding to the target voice based on the strong hypothesis that the target voice in the collected audio data can be read only and clearly.
In the present embodiment, it is assumed that the user utters a spoken voice including instruction information within the collection range, and audio data including the target voice is acquired.
Step S120, inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic word vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
In this embodiment, the joint model is a network model with deep learning capability and adopts a BiLSTM (Bidirectional Long Short-Term Memory) structure, which better captures bidirectional semantic dependencies; the BiLSTM model is formed by combining a forward LSTM and a backward LSTM.
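As a hedged illustration of this structure, a BiLSTM forward pass can be sketched in NumPy; the gate layout, dimensions and parameter names below are illustrative assumptions, not the patent's concrete implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; the four gates are stacked as [i, f, o, g] in W, U, b."""
    d = h.shape[0]
    z = W @ x + U @ h + b                                  # (4d,) pre-activations
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c + i * g                                      # new cell state
    h = o * np.tanh(c)                                     # new hidden state
    return h, c

def bilstm(X, fwd, bwd):
    """Run a forward and a backward LSTM over X (T, d_in) and concatenate
    their hidden states position by position, giving (T, 2d)."""
    d = fwd[1].shape[1]                                    # hidden size from U (4d, d)
    hs_f, hs_b = [], []
    h, c = np.zeros(d), np.zeros(d)
    for x in X:                                            # forward LSTM
        h, c = lstm_step(x, h, c, *fwd)
        hs_f.append(h)
    h, c = np.zeros(d), np.zeros(d)
    for x in X[::-1]:                                      # backward LSTM
        h, c = lstm_step(x, h, c, *bwd)
        hs_b.append(h)
    return np.stack([np.concatenate([f, b])
                     for f, b in zip(hs_f, hs_b[::-1])])
```

The concatenated hidden states serve as the per-position vectors $h_i$ used by the layers described below.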
In this embodiment, an input vector is generated from the audio data containing the target voice. The semantic slot filling layer maps the input vector to semantic word vectors $y^S$, and the intention prediction layer maps the input vector to a semantic prediction vector $y^I$. Finally, the instruction intention acquisition layer obtains a joint objective function from the semantic word vectors $y^S$ and the semantic prediction vector $y^I$, and obtains the instruction intention of the target voice from the joint objective function.
The intention recognition method for audio data provided by the invention acquires audio data containing a target voice and inputs it into a pre-trained joint model to obtain the instruction intention of the target voice; the joint model deeply understands the user intention and accurately and efficiently recognizes the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the invention, obtaining the semantic word vector from the audio data containing the target voice comprises the following steps:
converting the audio data containing the target voice into an initial vector;
and mapping the initial vector into a semantic word vector.
In the present embodiment, the audio data containing the target voice is converted into an initial vector $x = (x_1, \dots, x_T)$. The initial vector $x$ is the input vector of the joint model and is essentially a token sequence corresponding one-to-one with the single words in the target voice. The semantic slot filling layer then maps the initial vector $x$ to the semantic word vectors $y^S$. Substituting the initial vector into the joint objective function formula gives:

$$p(y^S, y^I \mid x) = p(y^I \mid x)\prod_{i=1}^{T} p(y_i^S \mid x)$$
According to the intention recognition method for audio data provided by the invention, the audio data containing the target voice is converted into an initial vector, which is then mapped to the semantic word vector; this clarifies the generation path of the semantic word vector and strongly supports the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the invention, the mapping of the initial vector into a semantic word vector comprises the following steps:
solving a hidden layer vector and a slot context vector based on the initial vector;
and solving the semantic character vector through a softmax function based on the hidden layer vector and the slot context vector.
In this embodiment, the softmax function, also called the normalized exponential function, is a single-layer neural network. The hidden layer vector $h_i$ is the vector corresponding to the $i$-th single word in the target voice, matching the $i$-th element of the initial vector $x$, and represents the meaning of the single word itself. The slot context vector $c_i^S$ is the context vector corresponding to the $i$-th single word, also matching the $i$-th element of $x$; it evaluates, in combination with the context, which of the multiple specific meanings of the single word is its real meaning. Here $i$ ranges over $1, \dots, T$.

In this embodiment, a BiLSTM structure is used to obtain the hidden layer vector $h_i$ and the slot context vector $c_i^S$ from the input initial vector $x$. From $h_i$ and $c_i^S$, the slot filling label (slot label) corresponding to the $i$-th single word in the sequence, i.e., the semantic word vector $y_i^S$, is obtained through the softmax function, formulated as:

$$y_i^S = \mathrm{softmax}\big(W_{hy}^{S}(h_i + c_i^{S})\big)$$

where $W_{hy}^{S}$ is a weight matrix, $h_i$ is the hidden layer vector, and $c_i^S$ is the slot context vector.
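A minimal NumPy sketch of this per-position slot softmax (names and shapes are assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_label_distributions(H, C_slot, W_hy):
    """y_i^S = softmax(W_hy^S (h_i + c_i^S)), computed for all positions at once.

    H, C_slot: (T, d) hidden-layer and slot-context vectors
    W_hy:      (n_slot_labels, d) weight matrix
    Returns (T, n_slot_labels); each row is a distribution over slot labels.
    """
    return softmax((H + C_slot) @ W_hy.T)
```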
The intention recognition method for audio data provided by the invention obtains the hidden layer vector and the slot context vector from the initial vector, and then obtains the semantic word vector from them through the softmax function; this clarifies the specific generation path of the semantic word vector and strongly supports the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the present invention, the slot context vector includes an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in the context.
In this embodiment, the slot context vector $c_i^S$ includes an attention score parameter $\alpha_{i,j}^S$, which represents the probability that each of the multiple specific meanings corresponding to a single word in the audio data conforms to the actual meaning of that word in context, formulated as:

$$c_i^{S} = \sum_{j=1}^{T}\alpha_{i,j}^{S}\,h_j, \qquad \alpha_{i,j}^{S} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T}\exp(e_{i,k})}, \qquad e_{i,k} = \sigma\big(W_{he}^{S}\,h_k\big)$$

where $e_{i,k}$ represents the relationship between the hidden state $h_k$ and the current input vector $h_i$, $\sigma$ is the activation function, $W_{he}^{S}$ is a weight matrix, and $k$ and $j$ index the multiple specific meanings of the single word; $e_{i,k}$ uses a convolution implementation, and the attention context uses a linear mapping implementation.
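A hedged NumPy sketch of this attention mechanism; the additive scoring form $v \cdot \tanh(W_q h_i + W_k h_j)$ used below is an assumption standing in for the convolution / linear-mapping implementation described above:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_attention(H, W_q, W_k, v):
    """Attention scores alpha_{i,j} = exp(e_{i,j}) / sum_k exp(e_{i,k}) and
    slot context vectors c_i^S = sum_j alpha_{i,j} h_j.

    e_{i,j} scores hidden state h_j against the current position h_i with an
    additive form v . tanh(W_q h_i + W_k h_j)  (an illustrative assumption).
    H: (T, d); W_q, W_k: (d_a, d); v: (d_a,)
    """
    q = H @ W_q.T                                      # (T, d_a)
    k = H @ W_k.T                                      # (T, d_a)
    e = np.tanh(q[:, None, :] + k[None, :, :]) @ v     # (T, T) pairwise scores
    alpha = softmax(e)                                 # normalize over j for each i
    return alpha, alpha @ H                            # scores and context vectors
```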
The intention recognition method for audio data provided by the invention further clarifies the process of calculating the attention score parameter in the slot context vector, strongly supporting the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the invention, the obtaining of semantic prediction vectors according to the audio data containing target voice comprises the following steps:
acquiring an intention context vector according to the audio data containing the target voice;
based on the intention context vector, a semantic prediction vector is obtained.
In the present embodiment, the audio data containing the target voice is converted into the initial vector $x$; the initial vector $x$ is the input vector of the joint model and is essentially a token sequence corresponding one-to-one with the single words in the target voice. The intention prediction layer then maps the initial vector $x$ to an intention context vector $c^I$, and the semantic prediction vector $y^I$ is generated from $c^I$. The intention context vector $c^I$ is computed in the same way as the slot context vector $c_i^S$. The predicted hidden layer vector $h_T$ denotes the vector obtained using only the last hidden state of the BiLSTM when predicting the intention. The semantic prediction vector $y^I$ is formulated as:

$$y^{I} = \mathrm{softmax}\big(W_{hy}^{I}(h_T + c^{I})\big)$$

where $W_{hy}^{I}$ is a weight matrix, $c^I$ is the intention context vector, and $h_T$ is the predicted hidden layer vector.
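A minimal NumPy sketch of this utterance-level intent softmax (names and shapes are assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def intent_distribution(h_last, c_intent, W_hy_I):
    """y^I = softmax(W_hy^I (h_T + c^I)): one distribution per utterance.

    h_last:   (d,)  last BiLSTM hidden state
    c_intent: (d,)  intention context vector
    W_hy_I:   (n_intents, d) weight matrix
    """
    return softmax(W_hy_I @ (h_last + c_intent))

def predicted_intent(h_last, c_intent, W_hy_I):
    """Index of the most probable intent."""
    return int(np.argmax(intent_distribution(h_last, c_intent, W_hy_I)))
```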
According to the intention recognition method for audio data provided by the invention, the intention context vector is obtained from the audio data containing the target voice, and the semantic prediction vector is then obtained from the intention context vector; this clarifies the specific acquisition path of the semantic prediction vector and strongly supports the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention identifying method for audio data provided by the invention, the intention identifying method further comprises the following steps:
obtaining a weighted feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
In the present embodiment, the weighted feature parameter $g$ can be viewed as a weighted feature joining the slot context vector $c_i^S$ and the intention context vector $c^I$; its main purpose is to use the intention context vector $c^I$ to improve the performance of the semantic slot filling layer (slot filling). The weighted feature parameter $g$ is formulated as:

$$g = \sum v \cdot \tanh\big(c_i^{S} + W \cdot c^{I}\big)$$

where $v$ is a trainable vector, $W$ is a trainable matrix, $\tanh$ is the hyperbolic tangent function, $c_i^S$ is the slot context vector, and $c^I$ is the intention context vector.

Accordingly, the weighted feature parameter $g$ is added to the semantic word vector computation, formulated as:

$$y_i^{S} = \mathrm{softmax}\big(W_{hy}^{S}(h_i + c_i^{S}\cdot g)\big)$$
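A minimal NumPy sketch of this gating step (names and shapes are assumptions): the gate is computed once per position and scales the slot context vector before the slot softmax.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_gate(C_slot, c_intent, W, v):
    """g_i = sum(v * tanh(c_i^S + W c^I)): one scalar gate per position.

    C_slot: (T, d); c_intent: (d,); W: (d, d); v: (d,)
    """
    return np.tanh(C_slot + (W @ c_intent)[None, :]) @ v   # (T,)

def gated_slot_labels(H, C_slot, c_intent, W, v, W_hy):
    """y_i^S = softmax(W_hy^S (h_i + c_i^S * g_i)) with the gate applied."""
    g = slot_gate(C_slot, c_intent, W, v)
    return softmax((H + C_slot * g[:, None]) @ W_hy.T)
```

A larger gate lets the intention context contribute more strongly to each slot decision, which is the stated purpose of the weighted feature parameter.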
According to the intention recognition method for audio data provided by the invention, the specific calculation path of the weighted feature parameter used to improve the performance of the semantic slot filling layer is further disclosed, strongly supporting the joint model's deep understanding of the user intention and its accurate and efficient recognition of the multiple intentions of the target voice.
According to the intention recognition method for audio data provided by the present invention, the method further comprises:
acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and training the prior model based on the audio data of the target user to obtain a combined model.
In the present embodiment, the Gaussian mixture-universal background model GMM-UBM is an improved model of the Gaussian mixture model GMM. The universal background model (UBM) is a model proposed by D. A. Reynolds' group.
In this embodiment, a large amount of non-target user audio data is input into the Gaussian mixture-universal background model GMM-UBM, which is trained to obtain a prior model for the specific speaker model; a small amount of target user audio data is then input into the prior model, and the parameters of the prior model are fine-tuned to obtain the final joint model.
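The two-stage recipe above — train a UBM on plentiful non-target data, then adapt it with a little target-speaker data — can be sketched with scikit-learn's `GaussianMixture`, using warm-start initialization from the UBM parameters as a stand-in for full MAP adaptation; the synthetic data, dimensionality, and component count are illustrative only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for acoustic feature frames (e.g. MFCCs); real systems use far more data.
non_target = rng.normal(0.0, 1.0, size=(2000, 4))   # many non-target speakers
target = rng.normal(0.5, 1.0, size=(200, 4))        # a little target-speaker audio

# Step 1: train the universal background model (the prior) on non-target data.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(non_target)

# Step 2: fine-tune from the UBM parameters on the target speaker's data
# (warm initialization approximates the parameter fine-tuning described above).
adapted = GaussianMixture(
    n_components=8,
    covariance_type="diag",
    weights_init=ubm.weights_,
    means_init=ubm.means_,
    precisions_init=ubm.precisions_,
    random_state=0,
)
adapted.fit(target)
```

Because EM only ever increases the data likelihood from its initialization, the adapted model fits the target speaker at least as well as the unadapted UBM.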
According to the intention identification method for audio data provided by the invention, the Gaussian mixture-universal background model GMM-UBM is trained on non-target user audio data and target user audio data to obtain the joint model, which strongly supports the joint model's deep understanding of user intent and enables accurate and efficient recognition of the multiple intentions of the target voice.
Fig. 2 is a schematic structural diagram of an intention recognition apparatus for audio data according to the present invention, and as shown in fig. 2, the intention recognition apparatus for audio data according to the present invention includes:
an audio data acquiring module 210, configured to acquire audio data including a target voice;
the audio data processing module 220 is configured to input the audio data including the target speech into a pre-trained joint model to obtain an instruction intention of the target speech;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic character vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
and the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring the instruction intention of the target voice based on the combined objective function.
The intention recognition device for audio data provided by the invention, by providing the audio data acquisition module and the audio data processing module, acquires audio data containing target voice and inputs it into the pre-trained joint model to obtain the instruction intention of the target voice, so that the joint model deeply understands the user intention and accurately and efficiently recognizes the multiple intentions of the target voice.
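The combined objective described for the instruction intention acquisition layer is, in joint slot-filling/intent models generally, a sum of a per-token slot loss and an utterance-level intent loss. A minimal NumPy sketch of such a combined objective follows; it is an illustrative reading, not the patented implementation, and all names are hypothetical:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax along the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, label):
    """Negative log-probability of the correct label."""
    return -np.log(probs[label] + 1e-12)

def joint_objective(slot_logits, slot_labels, intent_logits, intent_label):
    """Combined objective: per-token slot losses plus one utterance-level intent loss."""
    slot_loss = sum(cross_entropy(softmax(l), y)
                    for l, y in zip(slot_logits, slot_labels))
    intent_loss = cross_entropy(softmax(intent_logits), intent_label)
    return slot_loss + intent_loss
```

Minimizing this single scalar trains the slot-filling layer and the intention prediction layer jointly, which is what lets the intent context inform slot decisions and vice versa.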
Based on any one of the above embodiments, in this embodiment, the intention recognition apparatus for audio data according to the present invention further includes:
a slot filling weighting parameter layer for obtaining a weighting feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
The intention recognition device for audio data, by providing the slot filling weighted parameter layer, further discloses the specific calculation path of the weighted feature parameter used to improve the performance of the semantic slot filling layer, which strongly supports the joint model's deep understanding of user intent and enables accurate and efficient recognition of the multiple intentions of the target voice.
Based on any one of the embodiments described above, in this embodiment, the intention identification apparatus for audio data provided by the invention further includes:
the sample audio data acquisition unit is used for acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
the prior model unit is used for training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and the joint model unit is used for training the prior model based on the target user audio data to obtain the joint model.
The intention recognition device for audio data, by providing the sample audio data acquisition unit, the prior model unit, and the joint model unit, further discloses training the Gaussian mixture-universal background model GMM-UBM on non-target user audio data and target user audio data to obtain the joint model, which strongly supports the joint model's deep understanding of user intent and enables accurate and efficient recognition of the multiple intentions of the target voice.
In another aspect, the present invention also provides an electronic device. Fig. 3 illustrates a schematic structural diagram of the electronic device. As shown in Fig. 3, the electronic device may include a processor 310, a communication bus 320, a memory 330, a communication interface 340, and a computer program stored in the memory 330 and operable on the processor 310. The processor 310, the communication interface 340, and the memory 330 communicate with each other through the communication bus 320, and the processor 310 may call logic instructions in the memory 330 to perform the intention identification method for audio data, the method comprising:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic character vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
Finally, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the intention recognition method for audio data, the method comprising:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for acquiring semantic character vectors according to the audio data containing the target voice;
the intention prediction layer is used for acquiring a semantic prediction vector according to the audio data containing the target voice;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. An intention recognition method with respect to audio data, characterized by comprising:
acquiring audio data containing target voice;
inputting the audio data containing the target voice into a pre-trained joint model to obtain an instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for converting the audio data containing the target voice into an initial vector; solving a hidden layer vector and a slot context vector based on the initial vector; solving a semantic character vector through a softmax function based on the hidden layer vector and the slot context vector; wherein the bin context vector comprises an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in context;
the intention prediction layer is used for acquiring an intention context vector according to the audio data containing the target voice; based on the intention context vector, obtaining a semantic prediction vector;
the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring an instruction intention of a target voice based on the combined objective function.
2. The method of claim 1, further comprising:
obtaining a weighted feature parameter based on the slot context vector and the intention context vector; wherein the weighted feature parameters are used to improve the performance of the semantic slot filling layer.
3. The method of claim 1, wherein the method further comprises:
acquiring sample audio data, wherein the sample audio data comprises non-target user audio data and target user audio data;
training a Gaussian mixture-universal background model GMM-UBM based on the non-target user audio data to obtain a prior model;
and training the prior model based on the target user audio data to obtain the joint model.
4. An intention recognition apparatus with respect to audio data, characterized by comprising:
the audio data acquisition module is used for acquiring audio data containing target voice;
the audio data processing module is used for inputting the audio data containing the target voice into a pre-trained joint model to obtain the instruction intention of the target voice;
the joint model is obtained based on sample audio data training and comprises a semantic slot filling layer, an intention prediction layer and an instruction intention acquisition layer; wherein,
the semantic slot filling layer is used for converting the audio data containing the target voice into an initial vector; solving a hidden layer vector and a slot context vector based on the initial vector; solving a semantic character vector through a softmax function based on the hidden layer vector and the slot context vector; wherein the bin context vector comprises an attention score parameter for indicating a probability that each of a plurality of specific meanings corresponding to a single word itself in the audio data conforms to an actual meaning of the single word in context;
the intention prediction layer is used for acquiring an intention context vector according to the audio data containing the target voice; based on the intention context vector, obtaining a semantic prediction vector;
and the instruction intention acquisition layer is used for acquiring a combined objective function according to the semantic word vector and the semantic prediction vector and acquiring the instruction intention of the target voice based on the combined objective function.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method for intention recognition with respect to audio data according to any one of claims 1 to 3 are implemented when the program is executed by the processor.
6. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for intention recognition on audio data according to any one of claims 1 to 3.
CN202211178066.0A 2022-09-27 2022-09-27 Intention identification method and device for audio data Active CN115273849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211178066.0A CN115273849B (en) 2022-09-27 2022-09-27 Intention identification method and device for audio data

Publications (2)

Publication Number Publication Date
CN115273849A CN115273849A (en) 2022-11-01
CN115273849B true CN115273849B (en) 2022-12-27

Family

ID=83757223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211178066.0A Active CN115273849B (en) 2022-09-27 2022-09-27 Intention identification method and device for audio data

Country Status (1)

Country Link
CN (1) CN115273849B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11094317B2 (en) * 2018-07-31 2021-08-17 Samsung Electronics Co., Ltd. System and method for personalized natural language understanding
CN109785833A (en) * 2019-01-02 2019-05-21 苏宁易购集团股份有限公司 Human-computer interaction audio recognition method and system for smart machine
CN110516253B (en) * 2019-08-30 2023-08-25 思必驰科技股份有限公司 Chinese spoken language semantic understanding method and system
CN110853626B (en) * 2019-10-21 2021-04-20 成都信息工程大学 Bidirectional attention neural network-based dialogue understanding method, device and equipment
CN113505591A (en) * 2020-03-23 2021-10-15 华为技术有限公司 Slot position identification method and electronic equipment
CN112037773B (en) * 2020-11-05 2021-01-29 北京淇瑀信息科技有限公司 N-optimal spoken language semantic recognition method and device and electronic equipment
CN113204952B (en) * 2021-03-26 2023-09-15 南京邮电大学 Multi-intention and semantic slot joint identification method based on cluster pre-analysis


Similar Documents

Publication Publication Date Title
CN110111775B (en) Streaming voice recognition method, device, equipment and storage medium
CN110321418B (en) Deep learning-based field, intention recognition and groove filling method
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN110853626B (en) Bidirectional attention neural network-based dialogue understanding method, device and equipment
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN111161726B (en) Intelligent voice interaction method, device, medium and system
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN111445898A (en) Language identification method and device, electronic equipment and storage medium
US11322151B2 (en) Method, apparatus, and medium for processing speech signal
CN113505198A (en) Keyword-driven generating type dialogue reply method and device and electronic equipment
US20240331686A1 (en) Relevant context determination
CN111046674A (en) Semantic understanding method and device, electronic equipment and storage medium
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN116303966A (en) Dialogue behavior recognition system based on prompt learning
CN114254649A (en) Language model training method and device, storage medium and equipment
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
CN115273849B (en) Intention identification method and device for audio data
CN116186255A (en) Method for training unknown intention detection model, unknown intention detection method and device
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN111680514A (en) Information processing and model training method, device, equipment and storage medium
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
Biswas et al. Speech Emotion Recognition Using Deep CNNs Trained on Log-Frequency Spectrograms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant