CN115690552A - Multi-intention recognition method and device, computer equipment and storage medium


Info

Publication number
CN115690552A
CN115690552A (application CN202211717897.0A)
Authority
CN
China
Prior art keywords
feature
information
features
intention
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211717897.0A
Other languages
Chinese (zh)
Inventor
左勇
刘伟华
马金民
林超超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Athena Eyes Co Ltd
Original Assignee
Athena Eyes Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Athena Eyes Co Ltd filed Critical Athena Eyes Co Ltd
Priority to CN202211717897.0A priority Critical patent/CN115690552A/en
Publication of CN115690552A publication Critical patent/CN115690552A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-intent recognition method, apparatus, device, and medium. The method comprises the following steps: obtaining multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information; performing feature extraction and feature fusion on the multi-modal information to obtain a fused feature; classifying the fused feature with a multi-class model to obtain a classification result containing at least two intents; constructing a multi-dimensional relationship matrix for the classification result according to preset intent relations; and determining associated intents and non-associated intents based on the multi-dimensional relationship matrix.

Description

Multi-intention recognition method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing, and in particular, to a method and an apparatus for recognizing multiple intents, a computer device, and a storage medium.
Background
Task-oriented intelligent dialogue systems are being applied in more and more scenarios and have become a current research hotspot; online medical diagnosis systems in particular, with intelligent inquiry and pre-diagnosis assistants, are a popular research field. A task-oriented dialogue system generally consists of six components: ASR (automatic speech recognition), NLU (natural language understanding), DST (dialogue state tracking), DPL (dialogue policy learning), NLG (natural language generation), and TTS (text-to-speech synthesis). The underlying techniques include rule-based, machine learning, deep learning, reinforcement learning, and hybrid methods.
Existing intent recognition approaches mainly include rule-based dialogue techniques, machine-learning-based dialogue techniques, and single-intent dialogue techniques.
In the course of implementing the invention, the inventors found that the prior art has at least the following technical problems:
Rule-based dialogue techniques, such as methods and apparatuses for building intelligent automatic assistants or dialogue management written in a scripting language, have the following disadvantages: 1. complex rules must be written by hand by experts, so scalability is poor; 2. such systems cannot learn knowledge from limited data, so for unseen data sets intents are difficult to recognize or are recognized inaccurately.
Machine-learning-based dialogue techniques, such as KNN-based intent recognition systems, support-vector-machine-based task-oriented dialogue systems, dialogue techniques using deep learning, and systems that learn dialogue policies through reinforcement learning, have the following disadvantages: 1. the algorithms suffer from computational complexity and domain dependence; 2. supervised learning over large existing data sets places relatively high demands on computing resources, and when data is scarce the models overfit easily, so intent recognition accuracy cannot meet the requirements of practical applications.
Dialogue techniques based on single intents, such as single-intent recognition dialogue systems and simple multi-intent recognition dialogue systems, are only suitable for simple task-oriented dialogue flows; they cannot handle more complex flows, for which their intent recognition accuracy is low.
Therefore, an intent recognition method that can accurately recognize multiple intents is needed.
Disclosure of Invention
The embodiments of the invention provide a multi-intent recognition method, apparatus, computer device, and storage medium, aiming to improve the accuracy of multi-intent recognition.
In order to solve the above technical problem, an embodiment of the present application provides a multi-intent recognition method, including:
acquiring multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information;
performing feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
classifying the fused feature with a multi-class model to obtain a classification result, where the classification result contains at least two intents;
constructing a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and determining associated intents and non-associated intents based on the multi-dimensional relationship matrix.
Optionally, performing feature extraction and feature fusion on the multi-modal information to obtain the fused feature includes:
if the multi-modal information contains text information, performing feature extraction on the text information with a BERT model to obtain text features;
if the multi-modal information contains voice information, extracting Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information, and taking the extracted features as voice features;
if the multi-modal information contains picture information, performing feature extraction on the picture information with a deep residual network, and taking the extracted features as picture features;
and normalizing and fusing the text features, voice features, and picture features to obtain the fused feature.
Optionally, normalizing and fusing the text features, voice features, and picture features to obtain the fused feature includes:
normalizing the text features, voice features, and picture features;
and splicing the normalized text, voice, and picture features by matrix concatenation to obtain the fused feature.
Optionally, normalizing and fusing the text features, voice features, and picture features to obtain the fused feature further includes:
normalizing the text features, voice features, and picture features;
splicing the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
selecting, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
using an attention mechanism, mapping the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
using an attention mechanism, mapping the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and splicing the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
Optionally, the multi-class model is a neural network model, and its classifier is composed of fully connected layers.
Optionally, after determining the associated intents and non-associated intents based on the multi-dimensional relationship matrix, the multi-intent recognition method further includes:
for the associated intents, obtaining the time of each associated intent and putting them into a global queue in time order for recognition processing;
and for the non-associated intents, recognizing and processing each non-associated intent separately with a virtual dialog manager.
Optionally, for the associated intents, obtaining the time of each associated intent and putting them into the global queue in time order for recognition processing includes:
for the associated intents, taking the time and intent information of the associated intents as shared information and putting it into a shared slot;
and generating a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and putting the key-value pair into the global queue for recognition processing.
In order to solve the above technical problem, an embodiment of the present application further provides a multi-intent recognition apparatus, including:
a multi-modal information acquisition module, configured to acquire multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information;
a feature extraction and fusion module, configured to perform feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
a fused-information classification module, configured to classify the fused feature with a multi-class model to obtain a classification result, where the classification result contains at least two intents;
a relationship matrix construction module, configured to construct a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and an associated-intent recognition module, configured to determine associated intents and non-associated intents based on the multi-dimensional relationship matrix.
Optionally, the feature extraction and fusion module includes:
a first extraction unit, configured to perform feature extraction on the text information with a BERT model to obtain text features if the multi-modal information contains text information;
a second extraction unit, configured to extract Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information as voice features if the multi-modal information contains voice information;
a third extraction unit, configured to perform feature extraction on the picture information with a deep residual network and take the extracted features as picture features if the multi-modal information contains picture information;
and a feature fusion unit, configured to normalize and fuse the text features, voice features, and picture features to obtain the fused feature.
Optionally, the feature fusion unit includes:
a first normalization subunit, configured to normalize the text features, voice features, and picture features;
and a first splicing subunit, configured to splice the normalized text, voice, and picture features by matrix concatenation to obtain the fused feature.
Optionally, the feature fusion unit further includes:
a second normalization subunit, configured to normalize the text features, voice features, and picture features;
a second splicing subunit, configured to splice the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
a feature selection subunit, configured to select, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
a first attention calculation subunit, configured to map the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
a second attention calculation subunit, configured to map the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and a splicing and fusion subunit, configured to splice the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
Optionally, the multi-intent recognition apparatus further includes:
a first intent recognition module, configured to, for the associated intents, obtain the time of each associated intent and put them into a global queue in time order for recognition processing;
and a second intent recognition module, configured to, for the non-associated intents, recognize and process each non-associated intent separately with a virtual dialog manager.
Optionally, the first intent recognition module includes:
a shared-information determining unit, configured to, for the associated intents, take the time and intent information of the associated intents as shared information and put it into a shared slot;
and a key-value pair construction unit, configured to generate a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and put the key-value pair into the global queue for recognition processing.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above multi-intent recognition method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above-mentioned multi-intent recognition method.
According to the multi-intent recognition method, apparatus, computer device, and storage medium described above, multi-modal information is obtained, where the multi-modal information includes at least two of voice information, text information, and picture information; feature extraction and feature fusion are performed on the multi-modal information to obtain a fused feature; the fused feature is classified with a multi-class model to obtain a classification result containing at least two intents; a multi-dimensional relationship matrix is constructed for the classification result according to preset intent relations; and associated intents and non-associated intents are determined based on the multi-dimensional relationship matrix. Multi-intent classification and recognition is thereby achieved, and intent recognition accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a multi-intent recognition method of the present application;
FIG. 3 is a schematic block diagram of one embodiment of a multiple intent recognition arrangement according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof in the description and claims of this application and the description of the figures above, are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
The multiple intention recognition method provided by the embodiment of the application is executed by the server, and accordingly, the multiple intention recognition device is arranged in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.
Referring to fig. 2, fig. 2 shows a multi-intent recognition method according to an embodiment of the present invention. The method is described as applied to the server in fig. 1 and is detailed as follows:
S201: obtaining multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information.
S202: performing feature extraction and feature fusion on the multi-modal information to obtain a fused feature.
In a specific optional implementation, performing feature extraction and feature fusion on the multi-modal information to obtain the fused feature includes:
if the multi-modal information contains text information, performing feature extraction on the text information with a BERT model to obtain text features;
if the multi-modal information contains voice information, extracting Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information, and taking the extracted features as voice features;
if the multi-modal information contains picture information, performing feature extraction on the picture information with a deep residual network, and taking the extracted features as picture features;
and normalizing and fusing the text features, voice features, and picture features to obtain the fused feature.
Specifically, for text data, the text is first tokenized and fed into a word-embedding layer, and a BERT encoder layer then performs contextual semantic encoding. The BERT model can be understood as a deep neural network built from stacked self-attention modules.
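By way of illustration only, the text branch can be sketched as follows. The Hugging Face transformers library and the bert-base-chinese checkpoint are assumptions, since the embodiment does not name a specific BERT implementation; the [CLS] hidden state is taken as the sentence-level text feature.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the patent only specifies "a bert model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def extract_text_features(text: str) -> torch.Tensor:
    """Tokenize the text, run the BERT encoder, and use the [CLS]
    hidden state as the text feature vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # shape (1, 768)
```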
For voice data, the speech signal is first prepared, and high-frequency information is amplified with a pre-emphasis filter (equation 1-1). Pre-emphasis balances the spectrum, avoids numerical problems during the Fourier transform, and may also improve the signal-to-noise ratio (SNR).
y(t)=x(t)-ax(t-1) (1-1)
The signal then needs to be divided into short-time frames. By applying a Fourier transform to each short frame and concatenating adjacent frames, a good approximation of the signal's frequency contour is obtained. After the signal is sliced into frames, a window function, such as a Hamming window, is applied to each frame. The Hamming window has the form shown in equation 1-2, where 0 ≤ n ≤ N-1 and N is the window length:
w(n) = 0.54 - 0.46 cos(2πn/(N-1)) (1-2)
A Fourier transform (more precisely, a short-time Fourier transform) is performed on each frame and the power spectrum is computed; a mel filter bank is then applied. To obtain the MFCCs, a discrete cosine transform (DCT) is applied to the filter-bank outputs; a number of the leading coefficients are retained and the remaining coefficients discarded, finally forming the MFCC features.
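A condensed sketch of the MFCC pipeline just described (pre-emphasis per equation 1-1, framing, Hamming window per equation 1-2, power spectrum, mel filter bank, DCT), using only NumPy and SciPy. The frame sizes, filter counts, and pre-emphasis coefficient are illustrative assumptions, not values fixed by the patent; the signal is assumed longer than one frame.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=0.025, frame_step=0.01,
         nfft=512, n_filters=26, n_ceps=13, alpha=0.97):
    # Pre-emphasis (equation 1-1): y(t) = x(t) - a*x(t-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Slice into overlapping short-time frames, apply a Hamming window
    flen, fstep = int(frame_len * sr), int(frame_step * sr)
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)

    # Power spectrum of each frame via the short-time Fourier transform
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # Triangular mel filter bank
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = np.log(power @ fbank.T + 1e-10)

    # DCT; keep the leading coefficients, discard the rest (the MFCCs)
    return dct(feats, type=2, axis=1, norm="ortho")[:, :n_ceps]
```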
For image data, a ResNet model is first trained on an image data set, and intermediate feature-layer activations are then extracted directly as image features. ResNet is likewise a deep neural network model; it mainly uses residual connections to improve training. The principle, shown in equation 1-3, is to add the input x to the output of the encoding layer H:
y = H(x, W_h) + x (1-3)
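A minimal sketch of the image branch, assuming a torchvision ResNet-50 pretrained on ImageNet; the patent only specifies "a deep residual network", so the depth, weights, and preprocessing here are assumptions. The classifier layer is dropped so the pooled intermediate activations serve as the image feature.

```python
import torch
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop classifier
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def extract_image_features(path: str) -> torch.Tensor:
    """Return the pooled intermediate-layer activations as the image feature."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(x).flatten(1)  # shape (1, 2048)
```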
further, normalizing and fusing the text feature, the voice feature and the picture feature to obtain a fused feature comprises:
normalizing the text feature, the voice feature and the picture feature;
and splicing the normalized text features, the normalized voice features and the normalized picture features in a matrix splicing mode to obtain fusion features.
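A sketch of this simple fusion path: each modality's feature, pooled to a single vector, is normalized and then concatenated (matrix splicing) along the feature axis. L2 normalization is an assumption, since the patent does not specify the normalization scheme; the dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def fuse_by_concat(text_feat, speech_feat, image_feat):
    # Normalize each modality's vector, then splice along the feature axis
    feats = [F.normalize(f, dim=-1) for f in (text_feat, speech_feat, image_feat)]
    return torch.cat(feats, dim=-1)  # e.g. (1, 768 + 13 + 2048)
```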
In a specific optional implementation, normalizing and fusing the text features, voice features, and picture features to obtain the fused feature further includes:
normalizing the text features, voice features, and picture features;
splicing the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
selecting, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
using an attention mechanism, mapping the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
using an attention mechanism, mapping the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and splicing the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
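A sketch of this attention-based fusion. It assumes all features have already been projected to sequences with a common dimension d; plain scaled dot-product attention without learned K/Q/V projections is used for brevity, which is a simplification of a full attention layer. The two passes mirror the "first feature to K,V / second to Q" direction and its reverse.

```python
import torch
import torch.nn.functional as F

def cross_attention(q_feat, kv_feat, d=256):
    # q_feat: (1, Lq, d), kv_feat: (1, Lkv, d); scaled dot-product attention
    scores = q_feat @ kv_feat.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ kv_feat

def fuse_with_attention(first, second):
    # first: the chosen modality; second: the remaining two modalities,
    # both normalized and projected to shape (1, L, d)
    spliced = torch.cat([first, second], dim=1)                # plain splicing
    attn_1 = cross_attention(q_feat=second, kv_feat=first)     # first -> K, V
    attn_2 = cross_attention(q_feat=first, kv_feat=second)     # first -> Q
    return torch.cat([spliced, attn_1, attn_2], dim=1)         # fused feature
```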
S203: classifying the fused feature with a multi-class model to obtain a classification result, where the classification result contains at least two intents.
Optionally, the multi-class model is a neural network model, and its classifier is composed of fully connected layers.
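For illustration, a classifier head built from fully connected layers can be sketched as follows. Because the classification result may contain two or more intents, a multi-label head with per-intent sigmoid outputs is assumed; the patent only states that the classifier consists of fully connected layers, so the layer sizes and output activation are assumptions.

```python
import torch.nn as nn

class IntentClassifier(nn.Module):
    def __init__(self, in_dim: int, n_intents: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, n_intents),
        )

    def forward(self, fused):  # fused: (batch, in_dim)
        # Each intent whose probability exceeds a threshold is kept,
        # so the result can contain two or more intents.
        return self.net(fused).sigmoid()
```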
S204: constructing a multi-dimensional relationship matrix for the classification result according to preset intent relations.
S205: determining associated intents and non-associated intents based on the multi-dimensional relationship matrix.
Optionally, after step S205, that is, after determining the associated intents and non-associated intents based on the multi-dimensional relationship matrix, the multi-intent recognition method further includes:
for the associated intents, obtaining the time of each associated intent and putting them into a global queue in time order for recognition processing;
and for the non-associated intents, recognizing and processing each non-associated intent separately with a virtual dialog manager.
Optionally, for the associated intents, obtaining the time of each associated intent and putting them into the global queue in time order for recognition processing includes:
for the associated intents, taking the time and intent information of the associated intents as shared information and putting it into a shared slot;
and generating a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and putting the key-value pair into the global queue for recognition processing.
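A minimal sketch of the shared-slot and global-queue handling just described: the time stamps and intent information of the associated intents become shared information in a shared slot, and a (slot identifier, shared information) key-value pair is enqueued in time order. The slot-identifier format and the dictionary layout are illustrative assumptions.

```python
from collections import deque

global_queue: deque = deque()

def enqueue_associated(associated_intents):
    # associated_intents: list of dicts like {"intent": ..., "time": ...}
    ordered = sorted(associated_intents, key=lambda it: it["time"])
    shared_slot_id = "shared_slot_" + "_".join(it["intent"] for it in ordered)
    shared_info = [{"intent": it["intent"], "time": it["time"]} for it in ordered]
    global_queue.append({shared_slot_id: shared_info})  # key-value pair
```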
In this embodiment, multi-modal information is obtained, where the multi-modal information includes at least two of voice information, text information, and picture information; feature extraction and feature fusion are performed on the multi-modal information to obtain a fused feature; the fused feature is classified with a multi-class model to obtain a classification result containing at least two intents; a multi-dimensional relationship matrix is constructed for the classification result according to preset intent relations; and associated intents and non-associated intents are determined based on the multi-dimensional relationship matrix. Multi-intent classification and recognition is thereby achieved, and intent recognition accuracy is improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 3 shows a schematic block diagram of a multi-intent recognition apparatus corresponding one-to-one to the multi-intent recognition method of the above embodiment. As shown in fig. 3, the multi-intent recognition apparatus includes a multi-modal information acquisition module 31, a feature extraction and fusion module 32, a fused-information classification module 33, a relationship matrix construction module 34, and an associated-intent recognition module 35. The functional modules are explained in detail as follows:
a multi-modal information acquisition module 31, configured to acquire multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information;
a feature extraction and fusion module 32, configured to perform feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
a fused-information classification module 33, configured to classify the fused feature with a multi-class model to obtain a classification result, where the classification result contains at least two intents;
a relationship matrix construction module 34, configured to construct a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and an associated-intent recognition module 35, configured to determine associated intents and non-associated intents based on the multi-dimensional relationship matrix.
Optionally, the feature extraction and fusion module 32 includes:
a first extraction unit, configured to perform feature extraction on the text information with a BERT model to obtain text features if the multi-modal information contains text information;
a second extraction unit, configured to extract Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information as voice features if the multi-modal information contains voice information;
a third extraction unit, configured to perform feature extraction on the picture information with a deep residual network and take the extracted features as picture features if the multi-modal information contains picture information;
and a feature fusion unit, configured to normalize and fuse the text features, voice features, and picture features to obtain the fused feature.
Optionally, the feature fusion unit includes:
a first normalization subunit, configured to normalize the text features, voice features, and picture features;
and a first splicing subunit, configured to splice the normalized text, voice, and picture features by matrix concatenation to obtain the fused feature.
Optionally, the feature fusion unit further includes:
a second normalization subunit, configured to normalize the text features, voice features, and picture features;
a second splicing subunit, configured to splice the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
a feature selection subunit, configured to select, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
a first attention calculation subunit, configured to map the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
a second attention calculation subunit, configured to map the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and a splicing and fusion subunit, configured to splice the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
Optionally, the multi-intent recognition apparatus further includes:
a first intent recognition module, configured to, for the associated intents, obtain the time of each associated intent and put them into the global queue in time order for recognition processing;
and a second intent recognition module, configured to, for the non-associated intents, recognize and process each non-associated intent separately with a virtual dialog manager.
Optionally, the first intent recognition module includes:
a shared-information determining unit, configured to, for the associated intents, take the time and intent information of the associated intents as shared information and put it into a shared slot;
and a key-value pair construction unit, configured to generate a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and put the key-value pair into the global queue for recognition processing.
For the specific definition of the multi-intent recognition apparatus, reference may be made to the definition of the multi-intent recognition method above, which is not repeated here. The modules in the above multi-intent recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Referring specifically to fig. 4, fig. 4 is a block diagram of the basic structure of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43, which are communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the components memory 41, processor 42, and network interface 43 is shown, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. It can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice-control device.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. Of course, the memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device thereof. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as program code for controlling electronic files. In addition, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or to process data, for example the program code for controlling electronic files.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment: a computer-readable storage medium storing an interface display program executable by at least one processor, so that the at least one processor performs the steps of the multi-intent recognition method described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It should be understood that the above-described embodiments are merely exemplary of some, and not all, embodiments of the present application, and that the drawings illustrate preferred embodiments of the present application without limiting the scope of the claims appended hereto. This application is capable of embodiments in many different forms and is provided for the purpose of enabling a thorough understanding of the disclosure of the application. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the present application may be practiced without modification or with equivalents of some of the features described in the foregoing embodiments. All equivalent structures made by using the contents of the specification and the drawings of the present application are directly or indirectly applied to other related technical fields and are within the protection scope of the present application.

Claims (10)

1. A multi-intent recognition method, comprising:
acquiring multi-modal information, wherein the multi-modal information comprises at least two of voice information, text information, and picture information;
performing feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
classifying the fused feature with a multi-class model to obtain a classification result, wherein the classification result contains at least two intents;
constructing a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and determining associated intents and non-associated intents based on the multi-dimensional relationship matrix.
2. The multi-intent recognition method of claim 1, wherein performing feature extraction and feature fusion on the multi-modal information to obtain the fused feature comprises:
if the multi-modal information contains text information, performing feature extraction on the text information with a BERT model to obtain text features;
if the multi-modal information contains voice information, extracting Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information, and taking the extracted features as voice features;
if the multi-modal information contains picture information, performing feature extraction on the picture information with a deep residual network, and taking the extracted features as picture features;
and normalizing and fusing the text features, voice features, and picture features to obtain the fused feature.
3. The multi-intent recognition method of claim 2, wherein normalizing and fusing the text features, voice features, and picture features to obtain the fused feature comprises:
normalizing the text features, voice features, and picture features;
and splicing the normalized text, voice, and picture features by matrix concatenation to obtain the fused feature.
4. The multi-intent recognition method of claim 2, wherein normalizing and fusing the text features, voice features, and picture features to obtain the fused feature further comprises:
normalizing the text features, voice features, and picture features;
splicing the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
selecting, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
using an attention mechanism, mapping the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
using an attention mechanism, mapping the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and splicing the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
5. The multi-intent recognition method of claim 1, wherein the multi-class model is a neural network model and its classifier is composed of fully connected layers.
6. The multi-intent recognition method of any one of claims 1 to 5, wherein after determining the associated intents and non-associated intents based on the multi-dimensional relationship matrix, the method further comprises:
for the associated intents, obtaining the time of each associated intent and putting them into a global queue in time order for recognition processing;
and for the non-associated intents, recognizing and processing each non-associated intent separately with a virtual dialog manager.
7. The multi-intent recognition method of claim 6, wherein, for the associated intents, obtaining the time of each associated intent and putting them into the global queue in time order for recognition processing comprises:
for the associated intents, taking the time and intent information of the associated intents as shared information and putting it into a shared slot;
and generating a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and putting the key-value pair into the global queue for recognition processing.
8. A multi-intent recognition apparatus, comprising:
a multi-modal information acquisition module, configured to acquire multi-modal information, wherein the multi-modal information comprises at least two of voice information, text information, and picture information;
a feature extraction and fusion module, configured to perform feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
a fused-information classification module, configured to classify the fused feature with a multi-class model to obtain a classification result, wherein the classification result contains at least two intents;
a relationship matrix construction module, configured to construct a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and an associated-intent recognition module, configured to determine associated intents and non-associated intents based on the multi-dimensional relationship matrix.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the multiple intent recognition method of any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the multiple intention recognition method according to any one of claims 1 to 7.
CN202211717897.0A 2022-12-30 2022-12-30 Multi-intention recognition method and device, computer equipment and storage medium Pending CN115690552A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211717897.0A CN115690552A (en) 2022-12-30 2022-12-30 Multi-intention recognition method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211717897.0A CN115690552A (en) 2022-12-30 2022-12-30 Multi-intention recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115690552A true CN115690552A (en) 2023-02-03

Family

ID=85057206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211717897.0A Pending CN115690552A (en) 2022-12-30 2022-12-30 Multi-intention recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115690552A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007148118A (en) * 2005-11-29 2007-06-14 Infocom Corp Voice interactive system
US20160349941A1 (en) * 2015-05-29 2016-12-01 Flipboard, Inc. Queuing Actions Received While a Client Device is Offline for Execution When Connectivity is Restored Between the Client Device and a Digital Magazine Server
US20210011941A1 (en) * 2019-07-14 2021-01-14 Alibaba Group Holding Limited Multimedia file categorizing, information processing, and model training method, system, and device
CN111737458A (en) * 2020-05-21 2020-10-02 平安国际智慧城市科技股份有限公司 Intention identification method, device and equipment based on attention mechanism and storage medium
CN112035645A (en) * 2020-09-01 2020-12-04 平安科技(深圳)有限公司 Data query method and system
WO2022078346A1 (en) * 2020-10-13 2022-04-21 深圳壹账通智能科技有限公司 Text intent recognition method and apparatus, electronic device, and storage medium
CN113408385A (en) * 2021-06-10 2021-09-17 华南理工大学 Audio and video multi-mode emotion classification method and system
CN114398961A (en) * 2021-12-28 2022-04-26 西南交通大学 Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN114627868A (en) * 2022-03-03 2022-06-14 平安普惠企业管理有限公司 Intention recognition method and device, model and electronic equipment
CN115510224A (en) * 2022-07-14 2022-12-23 南京邮电大学 Cross-modal BERT emotion analysis method based on fusion of vision, audio and text
CN115292463A (en) * 2022-08-08 2022-11-04 云南大学 Information extraction-based method for joint multi-intention detection and overlapping slot filling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YU Kai; CHEN Lu; CHEN Bo; SUN Kai; ZHU Su: "Cognitive Technology in Task-Oriented Human-Machine Dialogue Systems: Concepts, Advances and Future" *
ZHOU Quan; CHEN Yongsheng; GUO Yuchen: "Research on Intent Recognition Algorithms Based on Multi-Feature Fusion" *

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
US10515627B2 (en) Method and apparatus of building acoustic feature extracting model, and acoustic feature extracting method and apparatus
CN112466314A (en) Emotion voice data conversion method and device, computer equipment and storage medium
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN112653798A (en) Intelligent customer service voice response method and device, computer equipment and storage medium
CN112084752A (en) Statement marking method, device, equipment and storage medium based on natural language
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN112084779B (en) Entity acquisition method, device, equipment and storage medium for semantic recognition
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN113887237A (en) Slot position prediction method and device for multi-intention text and computer equipment
CN112699213A (en) Speech intention recognition method and device, computer equipment and storage medium
CN115827872A (en) Training method of intention recognition model, and intention recognition method and device
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN114817478A (en) Text-based question and answer method and device, computer equipment and storage medium
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
CN112906368B (en) Industry text increment method, related device and computer program product
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN113918710A (en) Text data processing method and device, electronic equipment and readable storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN115730237B (en) Junk mail detection method, device, computer equipment and storage medium
CN112364649B (en) Named entity identification method and device, computer equipment and storage medium
CN114218356A (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN115690552A (en) Multi-intention recognition method and device, computer equipment and storage medium
CN111414468B (en) Speaking operation selection method and device and electronic equipment
CN113035230A (en) Authentication model training method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination