CN115690552A - Multi-intention recognition method and device, computer equipment and storage medium


Info

Publication number
CN115690552A
CN115690552A (application CN202211717897.0A)
Authority
CN
China
Prior art keywords
feature
information
features
intention
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211717897.0A
Other languages
Chinese (zh)
Inventor
左勇
刘伟华
马金民
林超超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Athena Eyes Co Ltd
Original Assignee
Athena Eyes Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Athena Eyes Co Ltd filed Critical Athena Eyes Co Ltd
Priority to CN202211717897.0A priority Critical patent/CN115690552A/en
Publication of CN115690552A publication Critical patent/CN115690552A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-intent recognition method, apparatus, device, and medium. The method comprises the following steps: obtaining multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information; performing feature extraction and feature fusion on the multi-modal information to obtain a fused feature; classifying the fused feature with a multi-class model to obtain a classification result containing at least two intents; constructing a multi-dimensional relationship matrix for the classification result according to preset intent relations; and determining associated intents and non-associated intents based on the multi-dimensional relationship matrix.

Description

Multi-intention recognition method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing, and in particular, to a method and an apparatus for recognizing multiple intents, a computer device, and a storage medium.
Background
Task-oriented intelligent dialogue systems are being applied in more and more scenarios and have become a current research hotspot; online medical diagnosis systems in particular, with intelligent inquiry and pre-diagnosis assistants, are a popular research field. A task-oriented dialogue system generally consists of six components: ASR (automatic speech recognition), NLU (natural language understanding), DST (dialogue state tracking), DPL (dialogue policy learning), NLG (natural language generation), and TTS (text-to-speech synthesis). The underlying techniques include rule-based, machine learning, deep learning, reinforcement learning, and hybrid methods.
Existing intent recognition approaches mainly include rule-based dialogue techniques, machine-learning-based dialogue techniques, and single-intent dialogue techniques.
In the course of implementing the invention, the inventors found that the prior art has at least the following technical problems:
Rule-based dialogue techniques, such as methods and apparatuses for building intelligent automatic assistants or dialogue management written in a scripting language, have the following disadvantages: 1. complex rules must be written by hand by experts, so scalability is poor; 2. such systems cannot learn knowledge from limited data, so for unseen data sets intents are difficult to recognize or are recognized inaccurately.
Machine-learning-based dialogue techniques, such as KNN-based intent recognition systems, support-vector-machine-based task-oriented dialogue systems, dialogue techniques using deep learning, and systems that learn dialogue policies through reinforcement learning, have the following disadvantages: 1. the algorithms suffer from computational complexity and domain dependence; 2. supervised learning over large existing data sets places relatively high demands on computing resources, and when data is scarce the models overfit easily, so intent recognition accuracy cannot meet the requirements of practical applications.
Dialogue techniques based on single intents, such as single-intent recognition dialogue systems and simple multi-intent recognition dialogue systems, are only suitable for simple task-oriented dialogue flows; they cannot handle more complex flows, for which their intent recognition accuracy is low.
Therefore, an intent recognition method that can accurately recognize multiple intents is needed.
Disclosure of Invention
The embodiments of the invention provide a multi-intent recognition method, apparatus, computer device, and storage medium, aiming to improve the accuracy of multi-intent recognition.
In order to solve the above technical problem, an embodiment of the present application provides a multi-intent recognition method, including:
acquiring multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information;
performing feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
classifying the fused feature with a multi-class model to obtain a classification result, where the classification result contains at least two intents;
constructing a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and determining associated intents and non-associated intents based on the multi-dimensional relationship matrix.
Optionally, performing feature extraction and feature fusion on the multi-modal information to obtain the fused feature includes:
if the multi-modal information contains text information, performing feature extraction on the text information with a BERT model to obtain text features;
if the multi-modal information contains voice information, extracting Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information, and taking the extracted features as voice features;
if the multi-modal information contains picture information, performing feature extraction on the picture information with a deep residual network, and taking the extracted features as picture features;
and normalizing and fusing the text features, voice features, and picture features to obtain the fused feature.
Optionally, normalizing and fusing the text features, voice features, and picture features to obtain the fused feature includes:
normalizing the text features, voice features, and picture features;
and splicing the normalized text, voice, and picture features by matrix concatenation to obtain the fused feature.
Optionally, normalizing and fusing the text features, voice features, and picture features to obtain the fused feature further includes:
normalizing the text features, voice features, and picture features;
splicing the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
selecting, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
using an attention mechanism, mapping the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
using an attention mechanism, mapping the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and splicing the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
Optionally, the multi-class model is a neural network model, and its classifier is composed of fully connected layers.
Optionally, after determining the associated intents and non-associated intents based on the multi-dimensional relationship matrix, the multi-intent recognition method further includes:
for the associated intents, obtaining the time of each associated intent and putting them into a global queue in time order for recognition processing;
and for the non-associated intents, recognizing and processing each non-associated intent separately with a virtual dialog manager.
Optionally, for the associated intents, obtaining the time of each associated intent and putting them into the global queue in time order for recognition processing includes:
for the associated intents, taking the time and intent information of the associated intents as shared information and putting it into a shared slot;
and generating a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and putting the key-value pair into the global queue for recognition processing.
In order to solve the above technical problem, an embodiment of the present application further provides a multi-intent recognition apparatus, including:
a multi-modal information acquisition module, configured to acquire multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information;
a feature extraction and fusion module, configured to perform feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
a fused-information classification module, configured to classify the fused feature with a multi-class model to obtain a classification result, where the classification result contains at least two intents;
a relationship matrix construction module, configured to construct a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and an associated-intent recognition module, configured to determine associated intents and non-associated intents based on the multi-dimensional relationship matrix.
Optionally, the feature extraction and fusion module includes:
a first extraction unit, configured to perform feature extraction on the text information with a BERT model to obtain text features if the multi-modal information contains text information;
a second extraction unit, configured to extract Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information as voice features if the multi-modal information contains voice information;
a third extraction unit, configured to perform feature extraction on the picture information with a deep residual network and take the extracted features as picture features if the multi-modal information contains picture information;
and a feature fusion unit, configured to normalize and fuse the text features, voice features, and picture features to obtain the fused feature.
Optionally, the feature fusion unit includes:
a first normalization subunit, configured to normalize the text features, voice features, and picture features;
and a first splicing subunit, configured to splice the normalized text, voice, and picture features by matrix concatenation to obtain the fused feature.
Optionally, the feature fusion unit further includes:
a second normalization subunit, configured to normalize the text features, voice features, and picture features;
a second splicing subunit, configured to splice the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
a feature selection subunit, configured to select, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
a first attention calculation subunit, configured to map the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
a second attention calculation subunit, configured to map the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and a splicing and fusion subunit, configured to splice the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
Optionally, the multi-intent recognition apparatus further includes:
a first intent recognition module, configured to, for the associated intents, obtain the time of each associated intent and put them into a global queue in time order for recognition processing;
and a second intent recognition module, configured to, for the non-associated intents, recognize and process each non-associated intent separately with a virtual dialog manager.
Optionally, the first intent recognition module includes:
a shared-information determining unit, configured to, for the associated intents, take the time and intent information of the associated intents as shared information and put it into a shared slot;
and a key-value pair construction unit, configured to generate a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and put the key-value pair into the global queue for recognition processing.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above multi-intent recognition method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above-mentioned multi-intent recognition method.
According to the multi-intent recognition method, apparatus, computer device, and storage medium described above, multi-modal information is obtained, where the multi-modal information includes at least two of voice information, text information, and picture information; feature extraction and feature fusion are performed on the multi-modal information to obtain a fused feature; the fused feature is classified with a multi-class model to obtain a classification result containing at least two intents; a multi-dimensional relationship matrix is constructed for the classification result according to preset intent relations; and associated intents and non-associated intents are determined based on the multi-dimensional relationship matrix. Multi-intent classification and recognition is thereby achieved, and intent recognition accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a multi-intent recognition method of the present application;
FIG. 3 is a schematic block diagram of one embodiment of a multiple intent recognition arrangement according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof in the description and claims of this application and the description of the figures above, are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
The multiple intention recognition method provided by the embodiment of the application is executed by the server, and accordingly, the multiple intention recognition device is arranged in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.
Referring to fig. 2, fig. 2 shows a multi-intent recognition method according to an embodiment of the present invention. The method is described as applied to the server in fig. 1 and is detailed as follows:
S201: obtaining multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information.
S202: performing feature extraction and feature fusion on the multi-modal information to obtain a fused feature.
In a specific optional implementation, performing feature extraction and feature fusion on the multi-modal information to obtain the fused feature includes:
if the multi-modal information contains text information, performing feature extraction on the text information with a BERT model to obtain text features;
if the multi-modal information contains voice information, extracting Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information, and taking the extracted features as voice features;
if the multi-modal information contains picture information, performing feature extraction on the picture information with a deep residual network, and taking the extracted features as picture features;
and normalizing and fusing the text features, voice features, and picture features to obtain the fused feature.
Specifically, for text data, the text is first tokenized and fed into a word-embedding layer, and a BERT encoder layer then performs contextual semantic encoding. The BERT model can be understood as a deep neural network built from stacked self-attention modules.
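By way of illustration only, the text branch can be sketched as follows. The Hugging Face transformers library and the bert-base-chinese checkpoint are assumptions, since the embodiment does not name a specific BERT implementation; the [CLS] hidden state is taken as the sentence-level text feature.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the patent only specifies "a bert model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def extract_text_features(text: str) -> torch.Tensor:
    """Tokenize the text, run the BERT encoder, and use the [CLS]
    hidden state as the text feature vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # shape (1, 768)
```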
For voice data, the speech signal is first prepared, and high-frequency information is amplified with a pre-emphasis filter (equation 1-1). Pre-emphasis balances the spectrum, avoids numerical problems during the Fourier transform, and may also improve the signal-to-noise ratio (SNR).
y(t)=x(t)-ax(t-1) (1-1)
The signal then needs to be divided into short-time frames. By applying a Fourier transform to each short frame and concatenating adjacent frames, a good approximation of the signal's frequency contour is obtained. After the signal is sliced into frames, a window function, such as a Hamming window, is applied to each frame. The Hamming window has the form shown in equation 1-2, where 0 ≤ n ≤ N-1 and N is the window length:
w(n) = 0.54 - 0.46 cos(2πn/(N-1)) (1-2)
A Fourier transform (more precisely, a short-time Fourier transform) is performed on each frame and the power spectrum is computed; a mel filter bank is then applied. To obtain the MFCCs, a discrete cosine transform (DCT) is applied to the filter-bank outputs; a number of the leading coefficients are retained and the remaining coefficients discarded, finally forming the MFCC features.
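A condensed sketch of the MFCC pipeline just described (pre-emphasis per equation 1-1, framing, Hamming window per equation 1-2, power spectrum, mel filter bank, DCT), using only NumPy and SciPy. The frame sizes, filter counts, and pre-emphasis coefficient are illustrative assumptions, not values fixed by the patent; the signal is assumed longer than one frame.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=0.025, frame_step=0.01,
         nfft=512, n_filters=26, n_ceps=13, alpha=0.97):
    # Pre-emphasis (equation 1-1): y(t) = x(t) - a*x(t-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Slice into overlapping short-time frames, apply a Hamming window
    flen, fstep = int(frame_len * sr), int(frame_step * sr)
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)

    # Power spectrum of each frame via the short-time Fourier transform
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # Triangular mel filter bank
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = np.log(power @ fbank.T + 1e-10)

    # DCT; keep the leading coefficients, discard the rest (the MFCCs)
    return dct(feats, type=2, axis=1, norm="ortho")[:, :n_ceps]
```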
For image data, a ResNet model is first trained on an image data set, and intermediate feature-layer activations are then extracted directly as image features. ResNet is likewise a deep neural network model; it mainly uses residual connections to improve training. The principle, shown in equation 1-3, is to add the input x to the output of the encoding layer H:
y = H(x, W_h) + x (1-3)
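A minimal sketch of the image branch, assuming a torchvision ResNet-50 pretrained on ImageNet; the patent only specifies "a deep residual network", so the depth, weights, and preprocessing here are assumptions. The classifier layer is dropped so the pooled intermediate activations serve as the image feature.

```python
import torch
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop classifier
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def extract_image_features(path: str) -> torch.Tensor:
    """Return the pooled intermediate-layer activations as the image feature."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(x).flatten(1)  # shape (1, 2048)
```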
further, normalizing and fusing the text feature, the voice feature and the picture feature to obtain a fused feature comprises:
normalizing the text feature, the voice feature and the picture feature;
and splicing the normalized text features, the normalized voice features and the normalized picture features in a matrix splicing mode to obtain fusion features.
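A sketch of this simple fusion path: each modality's feature, pooled to a single vector, is normalized and then concatenated (matrix splicing) along the feature axis. L2 normalization is an assumption, since the patent does not specify the normalization scheme; the dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def fuse_by_concat(text_feat, speech_feat, image_feat):
    # Normalize each modality's vector, then splice along the feature axis
    feats = [F.normalize(f, dim=-1) for f in (text_feat, speech_feat, image_feat)]
    return torch.cat(feats, dim=-1)  # e.g. (1, 768 + 13 + 2048)
```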
In a specific optional implementation, normalizing and fusing the text features, voice features, and picture features to obtain the fused feature further includes:
normalizing the text features, voice features, and picture features;
splicing the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
selecting, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
using an attention mechanism, mapping the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
using an attention mechanism, mapping the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and splicing the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
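A sketch of this attention-based fusion. It assumes all features have already been projected to sequences with a common dimension d; plain scaled dot-product attention without learned K/Q/V projections is used for brevity, which is a simplification of a full attention layer. The two passes mirror the "first feature to K,V / second to Q" direction and its reverse.

```python
import torch
import torch.nn.functional as F

def cross_attention(q_feat, kv_feat, d=256):
    # q_feat: (1, Lq, d), kv_feat: (1, Lkv, d); scaled dot-product attention
    scores = q_feat @ kv_feat.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ kv_feat

def fuse_with_attention(first, second):
    # first: the chosen modality; second: the remaining two modalities,
    # both normalized and projected to shape (1, L, d)
    spliced = torch.cat([first, second], dim=1)                # plain splicing
    attn_1 = cross_attention(q_feat=second, kv_feat=first)     # first -> K, V
    attn_2 = cross_attention(q_feat=first, kv_feat=second)     # first -> Q
    return torch.cat([spliced, attn_1, attn_2], dim=1)         # fused feature
```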
S203: classifying the fused feature with a multi-class model to obtain a classification result, where the classification result contains at least two intents.
Optionally, the multi-class model is a neural network model, and its classifier is composed of fully connected layers.
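For illustration, a classifier head built from fully connected layers can be sketched as follows. Because the classification result may contain two or more intents, a multi-label head with per-intent sigmoid outputs is assumed; the patent only states that the classifier consists of fully connected layers, so the layer sizes and output activation are assumptions.

```python
import torch.nn as nn

class IntentClassifier(nn.Module):
    def __init__(self, in_dim: int, n_intents: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, n_intents),
        )

    def forward(self, fused):  # fused: (batch, in_dim)
        # Each intent whose probability exceeds a threshold is kept,
        # so the result can contain two or more intents.
        return self.net(fused).sigmoid()
```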
S204: constructing a multi-dimensional relationship matrix for the classification result according to preset intent relations.
S205: determining associated intents and non-associated intents based on the multi-dimensional relationship matrix.
Optionally, after step S205, that is, after determining the associated intents and non-associated intents based on the multi-dimensional relationship matrix, the multi-intent recognition method further includes:
for the associated intents, obtaining the time of each associated intent and putting them into a global queue in time order for recognition processing;
and for the non-associated intents, recognizing and processing each non-associated intent separately with a virtual dialog manager.
Optionally, for the associated intents, obtaining the time of each associated intent and putting them into the global queue in time order for recognition processing includes:
for the associated intents, taking the time and intent information of the associated intents as shared information and putting it into a shared slot;
and generating a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and putting the key-value pair into the global queue for recognition processing.
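A minimal sketch of the shared-slot and global-queue handling just described: the time stamps and intent information of the associated intents become shared information in a shared slot, and a (slot identifier, shared information) key-value pair is enqueued in time order. The slot-identifier format and the dictionary layout are illustrative assumptions.

```python
from collections import deque

global_queue: deque = deque()

def enqueue_associated(associated_intents):
    # associated_intents: list of dicts like {"intent": ..., "time": ...}
    ordered = sorted(associated_intents, key=lambda it: it["time"])
    shared_slot_id = "shared_slot_" + "_".join(it["intent"] for it in ordered)
    shared_info = [{"intent": it["intent"], "time": it["time"]} for it in ordered]
    global_queue.append({shared_slot_id: shared_info})  # key-value pair
```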
In this embodiment, multi-modal information is obtained, where the multi-modal information includes at least two of voice information, text information, and picture information; feature extraction and feature fusion are performed on the multi-modal information to obtain a fused feature; the fused feature is classified with a multi-class model to obtain a classification result containing at least two intents; a multi-dimensional relationship matrix is constructed for the classification result according to preset intent relations; and associated intents and non-associated intents are determined based on the multi-dimensional relationship matrix. Multi-intent classification and recognition is thereby achieved, and intent recognition accuracy is improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 3 shows a schematic block diagram of a multi-intent recognition apparatus corresponding one-to-one to the multi-intent recognition method of the above embodiment. As shown in fig. 3, the multi-intent recognition apparatus includes a multi-modal information acquisition module 31, a feature extraction and fusion module 32, a fused-information classification module 33, a relationship matrix construction module 34, and an associated-intent recognition module 35. The functional modules are explained in detail as follows:
a multi-modal information acquisition module 31, configured to acquire multi-modal information, where the multi-modal information includes at least two of voice information, text information, and picture information;
a feature extraction and fusion module 32, configured to perform feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
a fused-information classification module 33, configured to classify the fused feature with a multi-class model to obtain a classification result, where the classification result contains at least two intents;
a relationship matrix construction module 34, configured to construct a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and an associated-intent recognition module 35, configured to determine associated intents and non-associated intents based on the multi-dimensional relationship matrix.
Optionally, the feature extraction and fusion module 32 includes:
a first extraction unit, configured to perform feature extraction on the text information with a BERT model to obtain text features if the multi-modal information contains text information;
a second extraction unit, configured to extract Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information as voice features if the multi-modal information contains voice information;
a third extraction unit, configured to perform feature extraction on the picture information with a deep residual network and take the extracted features as picture features if the multi-modal information contains picture information;
and a feature fusion unit, configured to normalize and fuse the text features, voice features, and picture features to obtain the fused feature.
Optionally, the feature fusion unit includes:
a first normalization subunit, configured to normalize the text features, voice features, and picture features;
and a first splicing subunit, configured to splice the normalized text, voice, and picture features by matrix concatenation to obtain the fused feature.
Optionally, the feature fusion unit further includes:
a second normalization subunit, configured to normalize the text features, voice features, and picture features;
a second splicing subunit, configured to splice the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
a feature selection subunit, configured to select, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
a first attention calculation subunit, configured to map the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
a second attention calculation subunit, configured to map the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and a splicing and fusion subunit, configured to splice the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
Optionally, the multi-intent recognition apparatus further includes:
a first intent recognition module, configured to, for the associated intents, obtain the time of each associated intent and put them into the global queue in time order for recognition processing;
and a second intent recognition module, configured to, for the non-associated intents, recognize and process each non-associated intent separately with a virtual dialog manager.
Optionally, the first intent recognition module includes:
a shared-information determining unit, configured to, for the associated intents, take the time and intent information of the associated intents as shared information and put it into a shared slot;
and a key-value pair construction unit, configured to generate a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and put the key-value pair into the global queue for recognition processing.
For the specific definition of the multi-intent recognition apparatus, reference may be made to the definition of the multi-intent recognition method above, which is not repeated here. The modules in the above multi-intent recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Referring specifically to fig. 4, fig. 4 is a block diagram of the basic structure of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43, which are communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the components memory 41, processor 42, and network interface 43 is shown, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. It can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice-control device.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. Of course, the memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device thereof. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as program code for controlling electronic files. In addition, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or to process data, for example the program code for controlling electronic files.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment: a computer-readable storage medium storing an interface display program executable by at least one processor, so that the at least one processor performs the steps of the multi-intent recognition method described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It should be understood that the above-described embodiments are merely exemplary of some, and not all, embodiments of the present application, and that the drawings illustrate preferred embodiments of the present application without limiting the scope of the claims appended hereto. This application is capable of embodiments in many different forms and is provided for the purpose of enabling a thorough understanding of the disclosure of the application. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the present application may be practiced without modification or with equivalents of some of the features described in the foregoing embodiments. All equivalent structures made by using the contents of the specification and the drawings of the present application are directly or indirectly applied to other related technical fields and are within the protection scope of the present application.

Claims (10)

1. A multi-intent recognition method, comprising:
acquiring multi-modal information, wherein the multi-modal information comprises at least two of voice information, text information, and picture information;
performing feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
classifying the fused feature with a multi-class model to obtain a classification result, wherein the classification result contains at least two intents;
constructing a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and determining associated intents and non-associated intents based on the multi-dimensional relationship matrix.
2. The multi-intent recognition method of claim 1, wherein performing feature extraction and feature fusion on the multi-modal information to obtain the fused feature comprises:
if the multi-modal information contains text information, performing feature extraction on the text information with a BERT model to obtain text features;
if the multi-modal information contains voice information, extracting Mel-frequency cepstral coefficient (MFCC) and Bark-spectrum features from the voice information, and taking the extracted features as voice features;
if the multi-modal information contains picture information, performing feature extraction on the picture information with a deep residual network, and taking the extracted features as picture features;
and normalizing and fusing the text features, voice features, and picture features to obtain the fused feature.
3. The multi-intent recognition method of claim 2, wherein normalizing and fusing the text features, voice features, and picture features to obtain the fused feature comprises:
normalizing the text features, voice features, and picture features;
and splicing the normalized text, voice, and picture features by matrix concatenation to obtain the fused feature.
4. The multi-intent recognition method of claim 2, wherein normalizing and fusing the text features, voice features, and picture features to obtain the fused feature further comprises:
normalizing the text features, voice features, and picture features;
splicing the normalized text, voice, and picture features by matrix concatenation to obtain a spliced feature;
selecting, in a preset manner, one of the normalized text, voice, and picture features as the first feature, with the remaining two as the second feature;
using an attention mechanism, mapping the first feature to the K and V vectors of the attention mechanism and the second feature to the Q vector, to obtain a first attention calculation result;
using an attention mechanism, mapping the first feature to the Q vector of the attention mechanism and the second feature to the K and V vectors, to obtain a second attention calculation result;
and splicing the spliced feature, the first attention calculation result, and the second attention calculation result to obtain the fused feature.
5. The multi-intent recognition method of claim 1, wherein the multi-class model is a neural network model and its classifier is composed of fully connected layers.
6. The multi-intent recognition method of any one of claims 1 to 5, wherein after determining the associated intents and non-associated intents based on the multi-dimensional relationship matrix, the method further comprises:
for the associated intents, obtaining the time of each associated intent and putting them into a global queue in time order for recognition processing;
and for the non-associated intents, recognizing and processing each non-associated intent separately with a virtual dialog manager.
7. The multi-intent recognition method of claim 6, wherein, for the associated intents, obtaining the time of each associated intent and putting them into the global queue in time order for recognition processing comprises:
for the associated intents, taking the time and intent information of the associated intents as shared information and putting it into a shared slot;
and generating a key-value pair with the slot identifier of the shared slot as the key and the shared information as the value, and putting the key-value pair into the global queue for recognition processing.
8. A multi-intent recognition apparatus, comprising:
a multi-modal information acquisition module, configured to acquire multi-modal information, wherein the multi-modal information comprises at least two of voice information, text information, and picture information;
a feature extraction and fusion module, configured to perform feature extraction and feature fusion on the multi-modal information to obtain a fused feature;
a fused-information classification module, configured to classify the fused feature with a multi-class model to obtain a classification result, wherein the classification result contains at least two intents;
a relationship matrix construction module, configured to construct a multi-dimensional relationship matrix for the classification result according to preset intent relations;
and an associated-intent recognition module, configured to determine associated intents and non-associated intents based on the multi-dimensional relationship matrix.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the multiple intent recognition method of any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the multiple intention recognition method according to any one of claims 1 to 7.
CN202211717897.0A 2022-12-30 2022-12-30 Multi-intention recognition method and device, computer equipment and storage medium Pending CN115690552A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211717897.0A CN115690552A (en) 2022-12-30 2022-12-30 Multi-intention recognition method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211717897.0A CN115690552A (en) 2022-12-30 2022-12-30 Multi-intention recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115690552A true CN115690552A (en) 2023-02-03

Family

ID=85057206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211717897.0A Pending CN115690552A (en) 2022-12-30 2022-12-30 Multi-intention recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115690552A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007148118A (en) * 2005-11-29 2007-06-14 Infocom Corp Voice interactive system
US20160349941A1 (en) * 2015-05-29 2016-12-01 Flipboard, Inc. Queuing Actions Received While a Client Device is Offline for Execution When Connectivity is Restored Between the Client Device and a Digital Magazine Server
US20210011941A1 (en) * 2019-07-14 2021-01-14 Alibaba Group Holding Limited Multimedia file categorizing, information processing, and model training method, system, and device
CN111737458A (en) * 2020-05-21 2020-10-02 平安国际智慧城市科技股份有限公司 Intention identification method, device and equipment based on attention mechanism and storage medium
CN112035645A (en) * 2020-09-01 2020-12-04 平安科技(深圳)有限公司 Data query method and system
WO2022078346A1 (en) * 2020-10-13 2022-04-21 深圳壹账通智能科技有限公司 Text intent recognition method and apparatus, electronic device, and storage medium
CN113408385A (en) * 2021-06-10 2021-09-17 华南理工大学 Audio and video multi-mode emotion classification method and system
CN114398961A (en) * 2021-12-28 2022-04-26 西南交通大学 Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN114627868A (en) * 2022-03-03 2022-06-14 平安普惠企业管理有限公司 Intention recognition method and device, model and electronic equipment
CN115510224A (en) * 2022-07-14 2022-12-23 南京邮电大学 Cross-modal BERT emotion analysis method based on fusion of vision, audio and text
CN115292463A (en) * 2022-08-08 2022-11-04 云南大学 Information extraction-based method for joint multi-intention detection and overlapping slot filling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YU Kai; CHEN Lu; CHEN Bo; SUN Kai; ZHU Su: "Cognitive Technology in Task-Oriented Human-Machine Dialogue Systems: Concepts, Advances and Future" *
ZHOU Quan; CHEN Yongsheng; GUO Yuchen: "Research on Intent Recognition Algorithms Based on Multi-Feature Fusion" *

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
US10515627B2 (en) Method and apparatus of building acoustic feature extracting model, and acoustic feature extracting method and apparatus
CN112466314A (en) Emotion voice data conversion method and device, computer equipment and storage medium
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN112653798A (en) Intelligent customer service voice response method and device, computer equipment and storage medium
CN112084752A (en) Statement marking method, device, equipment and storage medium based on natural language
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN112084779B (en) Entity acquisition method, device, equipment and storage medium for semantic recognition
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN113887237A (en) Slot position prediction method and device for multi-intention text and computer equipment
CN112699213A (en) Speech intention recognition method and device, computer equipment and storage medium
CN115827872A (en) Training method of intention recognition model, and intention recognition method and device
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN114817478A (en) Text-based question and answer method and device, computer equipment and storage medium
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
CN112906368B (en) Industry text increment method, related device and computer program product
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN113918710A (en) Text data processing method and device, electronic equipment and readable storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN115730237B (en) Junk mail detection method, device, computer equipment and storage medium
CN112364649B (en) Named entity identification method and device, computer equipment and storage medium
CN114218356A (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN115690552A (en) Multi-intention recognition method and device, computer equipment and storage medium
CN111414468B (en) Speaking operation selection method and device and electronic equipment
CN113035230A (en) Authentication model training method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination