CN114627868A - Intention recognition method and device, model and electronic equipment - Google Patents

Intention recognition method and device, model and electronic equipment

Info

Publication number
CN114627868A
CN114627868A
Authority
CN
China
Prior art keywords
layer
cnn
feature vector
feature extraction
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210208740.9A
Other languages
Chinese (zh)
Inventor
沈佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN202210208740.9A
Publication of CN114627868A
Legal status: Pending

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
                        • G06F 16/33 Querying
                            • G06F 16/332 Query formulation
                                • G06F 16/3329 Natural language query formulation or dialogue systems
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                        • G06N 3/08 Learning methods
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063 Training
                    • G10L 15/08 Speech classification or search
                        • G10L 15/18 Speech classification or search using natural language modelling
                            • G10L 15/1822 Parsing for meaning understanding
                    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
                    • G10L 15/26 Speech to text systems
                • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
                        • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses an intention recognition method, an intention recognition device, a model, an electronic device and a readable storage medium. The method comprises the following steps: acquiring multi-modal information of a question to be recognized, wherein the multi-modal information comprises voice information, text information and image information; determining a first feature vector of the voice information based on a voice feature extraction network; determining a second feature vector of the text information based on a text feature extraction network; determining a third feature vector of the image information based on an image feature extraction network; fusing the first feature vector, the second feature vector and the third feature vector to obtain a fusion vector; and determining an intention recognition result from the fusion vector based on a fully connected network. By drawing on more data sources, the method recognizes the customer's intention comprehensively, significantly improves the accuracy of intention recognition in intelligent customer service, improves user experience and satisfaction, increases service capacity, has a wide application range and requires little computation.

Description

Intention recognition method, device, model and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an intention identification method, an intention identification device, an intention identification model and electronic equipment.
Background
With the increasingly wide application of intelligent customer service in telephone scenarios in recent years, people interact with intelligent customer service more and more in daily life.
In the prior art, intelligent customer service generally recognizes the customer's intention by first converting the customer's speech into text through ASR (automatic speech recognition) and then recognizing the customer's real intention from the text.
However, the shortcoming of using only text to recognize the customer's intention is that the intention cannot be effectively recognized when the customer is teasing, being ironic, and so on. For example, when the customer says "fine", it cannot be accurately judged whether the customer does not want to keep listening or is genuinely affirmative.
Disclosure of Invention
In view of the above problems, embodiments of the present application provide an intention identification method, apparatus, model and electronic device to overcome or partially overcome the disadvantages of the prior art.
In a first aspect, an intention recognition method is provided in an embodiment of the present application, where the intention recognition method is implemented based on an intention recognition model, and the intention recognition model includes a voice feature extraction network, a text feature extraction network, an image feature extraction network, and a fully-connected network, where the voice feature extraction network, the text feature extraction network, and the image feature extraction network are respectively connected to the fully-connected network;
the method comprises the following steps:
acquiring multi-modal information of a question to be recognized, wherein the multi-modal information comprises voice information, text information and image information;
determining a first feature vector of the voice information based on the voice feature extraction network; determining a second feature vector of the text information based on the text feature extraction network; determining a third feature vector of the image information based on the image feature extraction network;
fusing the first feature vector, the second feature vector and the third feature vector to obtain a fused vector;
and determining an intention recognition result according to the fusion vector based on the full-connection network.
In a second aspect, an intention recognition apparatus is further provided, where the intention recognition apparatus deploys an intention recognition model, and the intention recognition model includes a voice feature extraction network, a text feature extraction network, an image feature extraction network, and a fully-connected network, where the voice feature extraction network, the text feature extraction network, and the image feature extraction network are respectively connected to the fully-connected network;
the device comprises:
an acquisition unit, configured to acquire multi-modal information of a question to be recognized, wherein the multi-modal information comprises voice information, text information and image information;
a feature extraction unit, configured to determine a first feature vector of the voice information based on the voice feature extraction network; determining a second feature vector of the text information based on the text feature extraction network; and determining a third feature vector of the image information based on the image feature extraction network;
the fusion unit is used for fusing the first feature vector, the second feature vector and the third feature vector to obtain a fusion vector;
and the identification unit is used for determining an intention identification result according to the fusion vector based on the full-connection network.
In a third aspect, an embodiment of the present application further provides an intention recognition model, which includes a voice feature extraction network, a text feature extraction network, an image feature extraction network, and a fully connected network, where the voice feature extraction network, the text feature extraction network, and the image feature extraction network are respectively connected to the fully connected network;
the voice feature extraction network includes: the first CNN layer, the second CNN layer and the third CNN layer are connected in sequence; the first CNN layer, the second CNN layer and the third CNN layer respectively comprise a plurality of CNN units arranged in parallel, wherein convolution kernels of convolution layers of the CNN units are different in size;
the text feature extraction network includes: the embedded layer, the first full-connection layer, the second full-connection layer and the third full-connection layer are connected in sequence; wherein the number of neurons of the first fully-connected layer, the second fully-connected layer, and the third fully-connected layer decreases in sequence;
the image feature extraction network includes: a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are connected in sequence; the fourth CNN layer, the fifth CNN layer and the sixth CNN layer respectively comprise a plurality of CNN units arranged in parallel, wherein convolution kernels of convolution layers of the CNN units are different in size;
the fully connected network comprises: a fourth fully-connected layer, a fifth fully-connected layer and a sixth fully-connected layer which are connected in sequence, wherein the number of neurons of the fourth fully-connected layer, the fifth fully-connected layer and the sixth fully-connected layer decreases in sequence.
In a fourth aspect, an embodiment of the present application further provides an electronic device, including: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform any of the methods described above.
In a fifth aspect, the present embodiments also provide a computer-readable storage medium storing one or more programs, which when executed by an electronic device including a plurality of application programs, cause the electronic device to perform any of the above methods.
The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:
aiming at the current situation that the real intention of a user is difficult to accurately identify by adopting single text data in the prior art, the method adopts multi-modal data information, directly fuses the features extracted from the text information, the voice information and the image information to obtain a fusion vector, and identifies the expression of the real intention of the user. Compared with the prior art, the method and the system have the advantages that more data sources are used, the intention of the client is comprehensively recognized, the accuracy of intention recognition in the intelligent customer service is remarkably improved, the experience of the user is improved, the satisfaction of the user is improved, the service capacity is increased, the application range is wide, and the calculated amount is small.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 shows a schematic flow diagram of an intent recognition method according to one embodiment of the present application;
FIG. 2 illustrates a structural schematic of an intent recognition model according to one embodiment of the present application;
FIG. 3 illustrates a structural schematic diagram of a speech feature extraction network 210 according to some embodiments of the present application;
FIG. 4 shows a structural schematic diagram of a textual feature extraction network 220, according to one embodiment of the present application;
FIG. 5 shows a schematic structural diagram of a fully connected network 240 according to one embodiment of the present application;
FIG. 6 illustrates a schematic structural diagram of an intent recognition apparatus, according to one embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, the technical solutions of the present application will be clearly and completely described below with reference to the specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
With the rapid development of electronic commerce, intelligent robots and intelligent customer service are more and more widely used. A man-machine conversation mainly takes the form of question and answer: usually the user asks a question and the robot answers it, so accurately recognizing the user's intention is the basis of good man-machine communication.
In the prior art, the user's intention is generally recognized from the user's text, including text directly entered by the user on an interactive interface and text obtained by receiving the user's voice and converting it into text.
However, text-based intention recognition has certain shortcomings: it cannot capture the user's tone and intonation, which leads to inaccurate prediction of the user's real intention, and it cannot effectively recognize situations such as teasing or irony. For example, when the customer says "fine", it is impossible to judge from the text alone whether the customer's intention is not to keep listening or is genuinely affirmative.
Aiming at the defects of the prior art, the present application provides an intention recognition method which, based on multi-modal data information, extracts feature vectors from the data of each modality, fuses the feature vectors, and comprehensively recognizes the user's real intention from the fused vector.
Fig. 1 shows a schematic flow chart of an intention identification method according to an embodiment of the present application, and as can be seen from fig. 1, the present application at least includes steps S110 to S140:
step S110: acquiring multiple analog state data information of a problem to be identified, wherein the multiple analog state data information comprises character information, voice information and image information.
The intention recognition method of the present application is implemented based on an intention recognition model, fig. 2 shows a schematic structural diagram of an intention recognition model according to the present application, and as can be seen from fig. 2, the intention recognition model 200 includes a voice feature extraction network 210, a text feature extraction network 220, an image feature extraction network 230, and a fully connected network 240, wherein the voice feature extraction network 210, the text feature extraction network 220, and the image feature extraction network 230 are respectively connected to the fully connected network 240. First, multi-modal data information is obtained for the same question targeted by the user, which may include, but is not limited to, text information, voice information, and image information. For example, a user opens a camera on a terminal interface to interact with the intelligent customer service in real time, and at this time, the voice of the customer, the text of the voice after ASR translation, and the image of the customer can be obtained at the same time.
In some embodiments of the present application, acquiring the multi-modal information of the question to be recognized, where the multi-modal information comprises voice information, text information and image information, includes: acquiring a target video stream corresponding to the question to be recognized; determining a plurality of consecutive key frames in the target video stream; taking the speech separated from the consecutive key frames as the voice information; converting the voice information into the text information; and taking at least one of the consecutive key frames as the image information.
Assume that a user has a conversation with the customer service robot. During the interaction, the customer service robot starts its audio and video recording device and records the video stream of the user corresponding to the question to be recognized; this is denoted as the target video stream. The target video stream usually comprises multiple frames, from which a plurality of consecutive key frames is determined as the feedback of the user to be recognized. How the consecutive key frames are determined is not limited by this application; for example, they may be determined by time, taking the frames within a preset duration after the customer service robot finishes speaking its last word as the consecutive key frames, sorting them in chronological order, and taking one or more frames in the middle as the image information. The voice information and the text information can also be obtained from the consecutive key frames: specifically, the speech separated from the consecutive key frames is used as the voice information, and the voice information is converted into the text information; the specific separation and conversion processes can refer to the prior art.
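As an illustration of this step, the following is a minimal sketch assuming ffmpeg and OpenCV are available; the function name split_modalities, the 16 kHz mono export, the fixed time window and the choice of the middle frame as the image information are assumptions of the sketch, and the ASR step that turns the separated speech into text is left to an external service.

```python
import subprocess
import cv2

def split_modalities(video_path, start_s, duration_s, wav_path="question.wav"):
    """Separate the speech track and the key frames of the target video stream."""
    # Speech information: extract a 16 kHz mono WAV with ffmpeg
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-ac", "1", "-ar", "16000", wav_path], check=True)
    # Consecutive key frames: frames inside the given time window
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    for idx in range(int(start_s * fps), int((start_s + duration_s) * fps)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    # Image information: one frame in the middle of the key-frame sequence
    image = frames[len(frames) // 2] if frames else None
    return wav_path, image
```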
Step S120: determining a first feature vector of the voice information based on the voice feature extraction network; determining a second feature vector of the text information based on the text feature extraction network; determining a third feature vector of the image information based on the image feature extraction network.
The user's tone, intonation, facial expression, body movements and the like often imply the user's real intention. For example, the user says "fine": taken literally this expresses approval of the current handling, but the user may merely be sarcastic, or the problem may not have been solved and the user gives a perfunctory reply while wearing a displeased expression that reveals the real intention. Relying on text alone for intention recognition, as in the prior art, therefore often leads to recognition errors.
Some of the prior art already adopts multi-modal information, but usually only the intonation features in the user's speech are used: a label, i.e. a recognition result, is attached to the user's voice information and then combined with the text recognition result. If a user says "fine", the prior art usually first recognizes the tone of the user's voice to obtain one sub-result; if this sub-result is "positive" and the sub-result recognized from the text is also "positive", the two are combined and the final intention recognition result is "positive". If the sub-result from speech recognition is "negative" while the sub-result from text recognition is "positive", the two sub-results contradict each other when combined, and the final result is determined by weights; if the weight of the text is greater than that of the speech, the final result is "positive", which is very likely to misrecognize the user's intention. The present application differs from the prior art in that feature extraction is performed on the text information, the voice information and the image information separately, the extracted features are fused into one large feature vector, and intention recognition is performed on the fused feature vector, which overcomes the defects of the prior art and greatly improves the accuracy of intention recognition.
After the data of the various modalities is obtained, a feature vector is determined from each kind of data. Put simply, for the text information, features including but not limited to semantic features and grammatical structure can be obtained by feature extraction based on the text feature extraction network 220; for the voice information, features such as tone, intonation, emotion, speech speed, amplitude and emphasis contained in the user's speech can be extracted based on the voice feature extraction network 210; and for the image information, the user's facial expression, body movements and the like can be recognized based on the image feature extraction network 230. The features obtained from each modality are recorded as a feature vector; for convenience of description, the feature vector obtained from the voice information is recorded as the first feature vector, the feature vector obtained from the text information as the second feature vector, and the feature vector obtained from the image information as the third feature vector. It should be noted that each feature vector may take the form of a one-dimensional vector or a multi-dimensional matrix.
Step S130: and fusing the first feature vector, the second feature vector and the third feature vector to obtain a fused vector.
After the first feature vector of the voice information, the second feature vector of the text information and the third feature vector of the image information are obtained, the three feature vectors are fused into one large vector; "large" here can be understood as an increase in dimension.
For example, if the first feature vector, the second feature vector and the third feature vector are all one-dimensional vectors, denote the first feature vector as a, the second feature vector as b, the third feature vector as c and the fusion vector as z. In one specific fusion manner, the elements of the second feature vector are appended in their original order after the first feature vector, and the elements of the third feature vector are appended in their original order after the second feature vector, giving the fusion vector z = (a, b, c).
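A minimal sketch of this splicing for one-dimensional vectors, assuming PyTorch tensors; the vector lengths are arbitrary assumptions:

```python
import torch

# The elements are simply appended in order, so the fusion vector grows in dimension.
a = torch.randn(128)              # first feature vector (voice)
b = torch.randn(64)               # second feature vector (text)
c = torch.randn(96)               # third feature vector (image)
z = torch.cat([a, b, c], dim=0)   # fusion vector z = (a, b, c), length 128 + 64 + 96
```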
Step S140: and determining an intention recognition result according to the fusion vector based on the full-connection network.
Finally, based on the fully connected network 240, an intention recognition result is determined from the fusion vector. The fully connected network 240 recognizes the fusion vector according to the knowledge learned in the training phase; since the fusion vector combines multiple features from the user's voice, text and image, the intention recognition result obtained from these features is more accurate than a result obtained from data of a single modality.
As can be seen from the method shown in fig. 1, to solve the problem that it is difficult to accurately recognize the user's real intention from single text data in the prior art, the present application uses multi-modal data information and directly fuses the features extracted from the text information, the voice information and the image information to obtain a fusion vector, from which the user's real intention is recognized. Compared with the prior art, more data sources are used and the customer's intention is recognized comprehensively, which significantly improves the accuracy of intention recognition in intelligent customer service, improves user experience and satisfaction, increases service capacity, has a wide application range and requires little computation.
In some embodiments of the present application, the speech feature extraction network comprises: the first CNN layer, the second CNN layer and the third CNN layer are connected in sequence; the first CNN layer, the second CNN layer and the third CNN layer respectively comprise a plurality of CNN units arranged in parallel, wherein convolution kernels of convolution layers of the CNN units are different in size; the determining a first feature vector of the voice information based on the voice feature extraction network comprises: preprocessing the voice information; enabling the preprocessed voice information to enter each CNN unit of the first CNN layer respectively so as to perform feature extraction on the voice information, and splicing the output of each CNN unit to obtain a first primary feature vector of the voice information; performing dimension reduction on the first primary feature vector, enabling the first primary feature vector subjected to dimension reduction to enter each CNN unit of the second CNN layer so as to perform feature extraction on the first primary feature vector, and splicing the outputs of the CNN units to obtain a first intermediate-level feature vector of the voice information; and performing dimension reduction on the first intermediate-level feature vector, enabling the first intermediate-level feature vector subjected to dimension reduction to enter each CNN unit of the third CNN layer, performing feature extraction on the first intermediate-level feature vector, and splicing the outputs of the CNN units to obtain a first high-level feature vector of the voice information as the first feature vector.
Fig. 3 shows a schematic structural diagram of a speech feature extraction network 210 according to some embodiments of the present application. As can be seen from fig. 3, the speech feature extraction network comprises: a first CNN layer 211, a second CNN layer 212 and a third CNN layer 213 connected in sequence; the first CNN layer 211, the second CNN layer 212 and the third CNN layer 213 each comprise a plurality of CNN units arranged in parallel, where each CNN unit comprises an input layer, a convolutional layer, an activation layer, a pooling layer and a fully connected layer, and the convolution kernels of the convolutional layers of the CNN units differ in size. In fig. 3, taking the first CNN layer 211 as an example, the first CNN layer 211 comprises three CNN units, namely CNN unit 211-1, CNN unit 211-2 and CNN unit 211-3, where the convolution kernel of the convolutional layer of CNN unit 211-1 is 1 × 1, that of CNN unit 211-2 is 3 × 3, and that of CNN unit 211-3 is 5 × 5; the second CNN layer 212 and the third CNN layer 213 have the same structure as the first CNN layer 211 and are not described again.
When extracting the voice information feature, the voice information may be preprocessed first, and then the preprocessed voice information may sequentially enter the first CNN layer 211, the second CNN layer 212, and the third CNN layer 213.
The voice signal is an audio signal, which may be, but is not limited to, the sound collected by the intelligent terminal while the user is speaking. For example, the audio signal from the microphone of the intelligent terminal is sampled at 16000 Hz, giving a time sequence of audio samples; taking 16-bit, single-channel audio sampled at 16000 Hz as an example, the collected sample values might be (2, 4, 100, 120, 140, 60, -60, -130, …), with an interval of 1/16000 second between samples.
When a multi-layer CNN network is used to recognize the voice signal, the signal obtained is a waveform and is preferably converted into a numerical matrix, which is then used as the input data of the multi-layer CNN network. Preprocessing the voice information includes, but is not limited to, framing, windowing and Mel-spectrum transformation in sequence: specifically, the voice signal is framed to obtain multiple frames of the voice signal; each frame is windowed; and the windowed voice signal is further subjected to a Mel-spectrum transformation to obtain the numerical matrix.
Framing means taking out groups of samples from the time sequence according to a certain rule; each group is called a frame, for example 512 samples at a time. The number of samples taken each time can be set according to the computational budget and is usually 512 or 1024. Within a frame, the data of the frequency bins of the audio signal is related to the frequency resolution: for example, with a 16000 Hz sampling rate and 512 samples per frame, the resolution is 16000/512 = 31.25 Hz, i.e. in the frequency range 0–8000 Hz only the bins at 31.25 × N Hz can be obtained, where N is an integer from 1 to 256.
The windowing process can be described briefly as follows. Consider, for example, a sinusoidal signal with amplitude A = 1 and frequency f = 1 Hz, whose spectrum contains components at plus and minus 1 Hz. If it is sampled at 10 Hz, the spectrum of the sampled signal is periodically extended with a period of 10 Hz. In this case the sampled signal extends infinitely in the time domain and its spectrum stays the same. Multiplying the sampled signal by a rectangular window in the time domain corresponds to convolving the signal spectrum with the spectrum of the rectangular window in the frequency domain, which yields a continuous periodic spectrum. After windowing, a sampled signal of N points is obtained; it is periodically extended so that it can be treated as a discrete periodic signal, and applying the Fourier transform to it likewise yields a discrete periodic spectrum. This transform is called the discrete Fourier transform.
Before the Mel-spectrum transformation, a Fourier transform is usually performed: specifically, the amplitudes corresponding to the frequency bins in a frame of the audio signal are Fourier transformed and combined in time order to form the power spectrum of that frame. That is, the power spectrum of each frame can be represented by a one-dimensional array (a1, a2, a3, …, a256), corresponding to the amplitudes at 31.25 Hz, 62.5 Hz, 93.75 Hz, …, 8000 Hz respectively.
The Fourier transform, or discrete Fourier transform, converts the original signal from the time domain to the frequency domain. This produces a power spectrum for each frame, and a periodogram of the power spectra with frequency on the X axis. A DFT (discrete Fourier transform) is applied to each frame; its parameters include N, the number of sampling points of the window, e.g. a Hanning window, and K, the length of the DFT.
A Mel filter bank is then applied to the power spectrum; it specifies a number of filters (typically 26–40). Each filter is a vector that is non-zero over some fraction of the frequency range and represents a particular energy band. The filter energy is obtained for each filter by multiplying the filter with the power spectrum and summing all coefficients; the resulting values indicate where the spectral energy is concentrated (in low or high frequencies). Mathematically, each filter is represented as a vector with K entries, where K is the length of the DFT (the range of input frequencies); it is non-zero over a specific part of the total frequency range, which represents its energy band. The main parameters include the number X of filters (typically 26–40) and the lower/upper frequencies, e.g. 300 Hz for the lower frequency and 8000 Hz for the upper frequency, the latter being limited by the audio sampling frequency.
The logarithm of each of the X filter energies is then taken, which yields X log filter energies. A DCT (discrete cosine transform) is applied to the X log filter energies, which yields X cepstral coefficients. The cepstral coefficients generated in this way are the MFCCs (Mel frequency cepstral coefficients), and with them the preprocessing of the speech signal is complete.
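The preprocessing chain described above (framing, windowing, power spectrum, Mel filtering, log, DCT) can be sketched as follows, assuming librosa, NumPy and SciPy; the frame length of 512 samples, the 26 Mel filters and the 13 retained coefficients are example values consistent with the description, not mandated by it.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(wav_path, sr=16000, frame_len=512, n_mels=26, n_mfcc=13):
    """Turn a speech recording into an MFCC matrix (one row per frame)."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)            # 16 kHz mono signal
    # Framing: non-overlapping frames of 512 samples (1/16000 s per sample)
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    frames = frames * np.hanning(frame_len)                    # window each frame
    spectrum = np.fft.rfft(frames, n=frame_len, axis=1)        # DFT of each frame
    power = (np.abs(spectrum) ** 2) / frame_len                # power spectrum (257 bins)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels,
                                 fmin=300, fmax=8000)          # 26 Mel filters, 300-8000 Hz
    filter_energies = power @ mel_fb.T                         # energy per filter
    log_energies = np.log(filter_energies + 1e-10)             # log filter energies
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_mfcc]  # MFCCs
```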
The preprocessed speech signal is input into the first CNN layer 211, and the primary characteristics of the speech signal are extracted. In the speech signal, each frame of speech corresponds to a feature vector, and the overall speech signal corresponds to a matrix.
After entering the first CNN layer 211, the preprocessed voice information enters each CNN unit, i.e. CNN unit 211-1, CNN unit 211-2 and CNN unit 211-3, at the same time. Each CNN (convolutional neural network) unit mainly consists of the following layers: an input layer, a convolutional layer, a ReLU layer, a pooling layer and a fully connected layer; stacking these layers builds a complete convolutional neural unit. In practice the convolutional layer and the ReLU layer are often jointly referred to as the convolutional layer, so the convolutional layer performs the convolution operation and then passes through the activation function. In particular, when the convolutional layer and the fully connected layer (CONV/FC) transform their input, they use not only the activation function but also parameters, namely the weights w and biases b of the neurons, whereas the ReLU layer and the pooling layer perform fixed function operations. The parameters in the convolutional layers and fully connected layers are trained with gradient descent so that the classification scores computed by the convolutional neural network match the label of each sample in the training set.
In the present application, feature extraction is performed on the voice information in each CNN unit; because the convolution kernels of the CNN units differ in size, the feature extraction effects differ. The first CNN layer 211 extracts rough features, which can also be understood as primary feature extraction. After the voice information is output from CNN unit 211-1, CNN unit 211-2 and CNN unit 211-3, the output vectors are spliced to obtain a high-dimensional matrix, for example 2056 × 2056; this matrix is recorded as the first primary feature matrix and contains the primarily extracted feature information of the voice information.
Before the first primary feature matrix enters the second CNN layer 212, dimension reduction is required to reduce it to a specified size, such as 1024 × 1024; the dimension reduction can use any method from the prior art, such as principal component analysis (PCA), multidimensional scaling (MDS) or linear discriminant analysis (LDA), and is not described further. The first primary feature matrix after dimension reduction enters the second CNN layer 212, whose structure is consistent with that of the first CNN layer 211 and which also comprises three CNN units whose convolutional layers have kernels of 1 × 1, 3 × 3 and 5 × 5 respectively. The feature extraction process of the first primary feature matrix in the second CNN layer 212 is the same as that of the voice information in the first CNN layer 211 and is not repeated here.
Similarly, the first primary feature matrix yields three vectors from the CNN units of the second CNN layer 212; after splicing, a first intermediate-level feature vector is obtained, which after dimension reduction enters the CNN units of the third CNN layer 213, whose outputs are spliced into a first high-level feature vector that is recorded as the first feature vector of the voice information. That is, the voice feature extraction network is built from multiple CNN layers that extract and concentrate the features of the voice information from coarse to fine, so that the various features of the voice information, including but not limited to tone, intonation, emotion, speech speed, amplitude and emphasis, are preserved to the greatest extent.
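A hedged sketch of this multi-kernel CNN structure in PyTorch is given below; the channel counts are assumptions, and a learned 1 × 1 convolution stands in for the dimension-reduction step, whose concrete method (e.g. PCA, MDS, LDA) the description leaves open. As noted further below, the image feature extraction network shares this structure, so the class takes the number of input channels as a parameter.

```python
import torch
import torch.nn as nn

class MultiKernelCNNLayer(nn.Module):
    """One CNN layer: several parallel CNN units whose convolution kernels
    differ in size (1x1, 3x3, 5x5); the unit outputs are spliced along the
    channel axis. A sketch, not the exact patented implementation."""
    def __init__(self, in_ch, out_ch, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.units = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=k // 2),  # convolutional layer
                nn.ReLU(),                                    # activation layer
                nn.MaxPool2d(2),                              # pooling layer
            )
            for k in kernel_sizes
        ])

    def forward(self, x):
        # Splice the outputs of the parallel CNN units
        return torch.cat([unit(x) for unit in self.units], dim=1)


class CNNFeatureExtractor(nn.Module):
    """Three multi-kernel CNN layers in sequence (coarse-to-fine extraction),
    with a learned 1x1 convolution standing in for the dimension reduction
    between layers. The same structure serves the image branch with in_ch=3."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.layer1 = MultiKernelCNNLayer(in_ch, 8)
        self.reduce1 = nn.Conv2d(3 * 8, 8, kernel_size=1)
        self.layer2 = MultiKernelCNNLayer(8, 16)
        self.reduce2 = nn.Conv2d(3 * 16, 16, kernel_size=1)
        self.layer3 = MultiKernelCNNLayer(16, 32)

    def forward(self, x):                  # x: (batch, in_ch, H, W), e.g. an MFCC "image"
        x = self.reduce1(self.layer1(x))
        x = self.reduce2(self.layer2(x))
        return self.layer3(x).flatten(start_dim=1)   # high-level feature vector
```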
In some embodiments of the present application, the image feature extraction network comprises: a fourth CNN layer, a fifth CNN layer and a sixth CNN layer connected in sequence; the fourth CNN layer, the fifth CNN layer and the sixth CNN layer each comprise a plurality of CNN units arranged in parallel, wherein the convolution kernels of the convolutional layers of the CNN units differ in size. Determining the third feature vector of the image information based on the image feature extraction network includes: letting the image information enter each CNN unit of the fourth CNN layer to perform feature extraction on it, and splicing the outputs of the CNN units to obtain a third primary feature vector of the image information; performing dimension reduction on the third primary feature vector, letting the reduced third primary feature vector enter each CNN unit of the fifth CNN layer to perform feature extraction on it, and splicing the outputs of the CNN units to obtain a third intermediate-level feature vector of the image information; and performing dimension reduction on the third intermediate-level feature vector, letting the reduced third intermediate-level feature vector enter each CNN unit of the sixth CNN layer to perform feature extraction on it, and splicing the outputs of the CNN units to obtain a third high-level feature vector of the image information as the third feature vector.
In the present application, the structure of the image feature extraction network and the process of extracting the features of the image information are completely the same as for the voice information and are not described again here. It should be noted that the image information does not need to be preprocessed in the way the voice information is.
In some embodiments of the present application, the text feature extraction network comprises: an Embedding layer, a first fully-connected layer, a second fully-connected layer and a third fully-connected layer connected in sequence, wherein the number of neurons of the first fully-connected layer, the second fully-connected layer and the third fully-connected layer decreases in sequence. Determining the second feature vector of the text information based on the text feature extraction network includes: performing word segmentation on the text information, and letting the processed text information enter the Embedding layer so as to convert it into a text input vector; letting the text input vector enter the first fully-connected layer to perform feature extraction on it and obtain a second primary feature vector of the text information; letting the second primary feature vector enter the second fully-connected layer to perform feature extraction on it and obtain a second intermediate-level feature vector of the text information; and letting the second intermediate-level feature vector enter the third fully-connected layer to perform feature extraction on it and obtain a second high-level feature vector of the text information as the second feature vector.
Fig. 4 shows a schematic structural diagram of a text feature extraction network 220 according to an embodiment of the present application. As can be seen from fig. 4, the text feature extraction network 220 comprises an Embedding layer 221, a first fully-connected layer 222, a second fully-connected layer 223 and a third fully-connected layer 224, which are connected in sequence. The numbers of neurons of the first fully-connected layer 222, the second fully-connected layer 223 and the third fully-connected layer 224 decrease in sequence; in some embodiments they are 768, 256 and 2 respectively. The purpose of setting up several fully connected layers is to extract and concentrate the text features repeatedly: more neurons give a better feature extraction effect but also a larger amount of computation. In this application, by stacking multiple fully connected layers whose neuron counts decrease in sequence, the text information is preserved to the greatest extent while a balance with the performance cost is maintained.
Because the text information is converted from the voice information, denoising, such as data cleaning, can be performed before feature extraction. The data cleaning can use a regular expression that matches the training corpus against a preset matching rule so as to remove irregular characters. More specifically, the matching rule can be based on the character class [\u4e00-\u9fa5], where \un matches the Unicode character represented by the four hexadecimal digits n, and the Unicode code points between 4e00 and 9fa5 cover about 20000 Chinese characters; thus [\u4e00-\u9fa5] matches Chinese characters and [^\u4e00-\u9fa5] matches all characters except Chinese characters. With such processing, meaningless special characters such as "-" and "…" can be removed.
Before feature extraction is performed on the text information, it needs to be segmented into words, for example using the full mode or the precise mode of the jieba library. Word segmentation is an important part of Chinese text analysis, and correct segmentation helps to build better models and analyses. In this application, the cut method of the jieba library can be used; it has two segmentation modes, a full mode and a precise mode. The only difference between them is the cut_all flag: the precise mode sets cut_all to False, and the full mode sets cut_all to True.
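A minimal sketch of the cleaning and segmentation described above, assuming the jieba library is installed; the function name clean_and_segment is an assumption:

```python
import re
import jieba

def clean_and_segment(text):
    """Keep only characters in the \u4e00-\u9fa5 range, then segment with jieba."""
    cleaned = re.sub(r'[^\u4e00-\u9fa5]', '', text)   # drop non-Chinese characters
    return list(jieba.cut(cleaned, cut_all=False))    # precise mode; cut_all=True is full mode
```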
The text information after word segmentation enters the text feature extraction network 220, and the text information sequentially passes through an Embedding layer 221, a first full-connection layer 222, a second full-connection layer 223 and a third full-connection layer 224 to obtain a second feature vector.
The text input vector processed by the Embedding layer 221 enters a first full-connection layer 222 to perform feature extraction on the text input vector to obtain a second primary feature vector of the text information; enabling the second primary feature vector to enter the second full-connection layer 223 so as to perform feature extraction on the second primary feature vector to obtain a second intermediate feature vector of the text message; and enabling the second intermediate feature vector to enter the third full-connection layer 224 so as to perform feature extraction on the second intermediate feature vector, and obtaining a second high-level feature vector of the text information as a second feature vector.
The Embedding layer 221 is used for converting the segmented text content into corresponding word vectors; the fully connected layers each apply a certain non-linear transformation to the word vectors so as to extract features, and at the same time convert the word vectors into vectors that are more compatible with the outputs of the voice feature extraction network and the image feature extraction network. This application does not use a conventional RNN (recurrent neural network) structure to extract the text features, but fully connected layers, which is computationally more efficient.
The first fully-connected layer 222, the second fully-connected layer 223 and the third fully-connected layer 224 are used for feature extraction; the multi-layer extraction preserves the most important information in the text features, and because the numbers of neurons of the three layers decrease in sequence, the subsequent amount of computation is also reduced.
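The text feature extraction network can be sketched as follows in PyTorch; the vocabulary size and embedding dimension are assumptions, and mean pooling over the word vectors (nn.EmbeddingBag) is one possible reading of how the word vectors become a single text input vector:

```python
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """Embedding layer followed by three fully connected layers whose neuron
    counts decrease (768, 256, 2), as described above."""
    def __init__(self, vocab_size=30000, embed_dim=300):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # word vectors, mean-pooled
        self.fc = nn.Sequential(
            nn.Linear(embed_dim, 768), nn.ReLU(),   # first fully-connected layer
            nn.Linear(768, 256), nn.ReLU(),         # second fully-connected layer
            nn.Linear(256, 2),                      # third fully-connected layer
        )

    def forward(self, token_ids):                   # token_ids: LongTensor (batch, seq_len)
        return self.fc(self.embedding(token_ids))   # second feature vector
```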
In some embodiments of the present application, fusing the first feature vector, the second feature vector and the third feature vector to obtain the fusion vector includes: according to a specified splicing form, placing the elements of the second feature vector, in their original order, at the corresponding positions after the first feature vector, and placing the elements of the third feature vector, in their original order, at the corresponding positions after the second feature vector; and if the first feature vector, the second feature vector and the third feature vector are heteromorphic (differently shaped) matrices, taking the row or column with the most elements as the reference and setting the missing elements of the other vectors to null.
For example, assume that the first feature vector, the second feature vector and the third feature vector are one-dimensional vectors: the first feature vector is (a, b, c), the second feature vector is (d, e) and the third feature vector is (f, g). During splicing, the elements of the second feature vector are placed after the last element of the first feature vector in their original order, and the elements of the third feature vector are then placed after the last element of the second feature vector in their original order; the fused vector is recorded as z, where z = (a, b, c, d, e, f, g).
If the first feature vector, the second feature vector and the third feature vector are in matrix form, take two of them as an example and assume the first feature vector and the second feature vector are isomorphic matrices, say both 2 × 2 two-dimensional matrices. With vertical splicing as the specified splicing form, the elements of the second feature vector are placed, in their original order, after the last row of the first feature vector during splicing; the fusion vector is recorded as z, a 4 × 2 matrix whose first two rows are the rows of the first feature vector and whose last two rows are the rows of the second feature vector. If the first feature vector and the second feature vector are heteromorphic matrices, the missing elements are set to null. When there are more vectors, the third feature vector is simply "placed" after the second feature vector in the same way.
In some embodiments of the present application, in the above method, the fully-connected network comprises a fourth fully-connected layer, a fifth fully-connected layer and a sixth fully-connected layer connected in sequence; wherein the number of neurons of the fourth fully-connected layer, the fifth fully-connected layer and the sixth fully-connected layer decreases in sequence; the determining an intention recognition result according to the fusion vector based on the fully connected network comprises: and enabling the fusion vector to sequentially enter the fourth full connection layer, the fifth full connection layer and the sixth full connection layer so as to compress the characteristics represented by the fusion vector and determine an intention recognition result.
In some embodiments of the present application, the fully connected layers are used to recognize the fusion vector and obtain the final intention recognition result. In order to recognize the intention more accurately, several fully connected layers are also arranged in the fully connected network 240 of the intention recognition model 200. Fig. 5 illustrates a schematic structural diagram of the fully connected network 240 according to an embodiment of the present application; as can be seen from fig. 5, the fully connected network 240 comprises a fourth fully-connected layer 241, a fifth fully-connected layer 242 and a sixth fully-connected layer 243 connected in sequence. The structure of each fully connected layer is the same as in the prior art, but the numbers of neurons of the fourth fully-connected layer 241, the fifth fully-connected layer 242 and the sixth fully-connected layer 243 decrease in sequence; in some embodiments of the present application they are 768, 256 and 2 respectively.
The fusion vector is a relatively large vector containing multiple features from the text, the voice and the image. The fully connected network extracts, recognizes and compresses the features of this large fusion vector and finally determines the label of the question to be recognized, thereby determining the intention recognition result.
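A sketch of this fully connected network in PyTorch; nn.LazyLinear is used only so that the length of the fusion vector does not have to be fixed in advance, which is an implementation assumption rather than part of the description:

```python
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Three fully connected layers with decreasing neuron counts (768, 256, 2)
    that compress the fusion vector into intention scores."""
    def __init__(self, n_intents=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.LazyLinear(768), nn.ReLU(),        # fourth fully-connected layer
            nn.Linear(768, 256), nn.ReLU(),       # fifth fully-connected layer
            nn.Linear(256, n_intents),            # sixth fully-connected layer
        )

    def forward(self, z):                         # z: (batch, fused_dim) fusion vector
        return self.net(z)                        # intention recognition scores
```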
In some embodiments of the present application, an intention recognition model (please refer to fig. 2 to 5 simultaneously) capable of implementing any one of the aforementioned intention recognition methods is provided, wherein the intention recognition model includes a voice feature extraction network, a text feature extraction network, an image feature extraction network, and a fully-connected network, and the voice feature extraction network, the text feature extraction network, and the image feature extraction network are respectively connected to the fully-connected network; the voice feature extraction network includes: the first CNN layer, the second CNN layer and the third CNN layer are connected in sequence; the first CNN layer, the second CNN layer and the third CNN layer respectively comprise a plurality of CNN units arranged in parallel, wherein convolution kernels of convolution layers of the CNN units are different in size; the text feature extraction network includes: the embedded layer, the first full-connection layer, the second full-connection layer and the third full-connection layer are connected in sequence; wherein the number of neurons of the first fully-connected layer, the second fully-connected layer, and the third fully-connected layer decreases in sequence; the image feature extraction network includes: a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are connected in sequence; the fourth CNN layer, the fifth CNN layer and the sixth CNN layer respectively comprise a plurality of CNN units arranged in parallel, wherein convolution kernels of convolution layers of the CNN units are different in size; the fully connected network comprises: the number of the neurons of the fourth full connection layer, the fifth full connection layer and the sixth full connection layer is reduced in sequence.
Based on this intention recognition model, the intention recognition method is implemented; the flow can be briefly described as follows:
The voice information enters the voice feature extraction network after framing, windowing and Mel-spectrum conversion, so that the processed voice information passes through the first CNN layer, the second CNN layer and the third CNN layer in sequence, and the first feature vector is output from the third CNN layer.
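For illustration, the framing, windowing and Mel-spectrum conversion could be performed as in the following sketch; the sampling rate, frame length, frame shift, number of Mel bins and the file name question.wav are illustrative assumptions (torchaudio's MelSpectrogram applies a Hann window by default).

```python
import torchaudio

# Framing, windowing and Mel-spectrum conversion in one transform;
# the parameter values below are illustrative choices, not values from this application.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,        # 25 ms frames at 16 kHz
    hop_length=160,   # 10 ms frame shift
    n_mels=80,
)

waveform, sr = torchaudio.load("question.wav")  # hypothetical input file
# If sr differs from 16 kHz, the waveform would need to be resampled first.
mel = mel_transform(waveform)  # shape: (channels, n_mels, frames)
# `mel` is what enters the first CNN layer of the voice feature extraction network.
```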
The text information enters the text feature extraction network after word segmentation, so that it passes through the Embedding layer, the first fully-connected layer, the second fully-connected layer and the third fully-connected layer in sequence; the output of the third fully-connected layer is taken as the second feature vector.
The image information enters the image feature extraction network, so that it passes through the fourth CNN layer, the fifth CNN layer and the sixth CNN layer in sequence, and the third feature vector is output from the sixth CNN layer.
The first feature vector, the second feature vector and the third feature vector are spliced to obtain a fusion vector.
The fusion vector enters the fully-connected network 240 and then passes through the fourth fully-connected layer, the fifth fully-connected layer and the sixth fully-connected layer in sequence, and the final intention of the client is inferred in the hidden space, which can be understood as a mathematically low-dimensional space.
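By way of illustration, the flow described above could be wired together as in the following sketch; the branch modules are placeholders for the feature extraction networks of the present application, and the assumptions that each branch outputs one flat feature vector per sample and that splicing is done with torch.cat are simplifications made for the example.

```python
import torch
import torch.nn as nn

class IntentionRecognitionModel(nn.Module):
    """Sketch of the overall flow: three feature extraction branches, splicing
    of their outputs into a fusion vector, then the fully-connected network."""

    def __init__(self, voice_net: nn.Module, text_net: nn.Module,
                 image_net: nn.Module, fc_net: nn.Module):
        super().__init__()
        self.voice_net = voice_net   # voice feature extraction network
        self.text_net = text_net     # text feature extraction network
        self.image_net = image_net   # image feature extraction network
        self.fc_net = fc_net         # fully-connected network 240

    def forward(self, mel, token_ids, image):
        v1 = self.voice_net(mel)        # first feature vector
        v2 = self.text_net(token_ids)   # second feature vector
        v3 = self.image_net(image)      # third feature vector
        fusion = torch.cat([v1, v2, v3], dim=-1)  # fusion vector (splicing)
        return self.fc_net(fusion)      # intention inferred in the hidden space
```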
Fig. 6 is a schematic diagram illustrating an intention recognition apparatus according to an embodiment of the present application. The apparatus deploys an intention recognition model including a voice feature extraction network, a text feature extraction network, an image feature extraction network and a fully-connected network, wherein the voice feature extraction network, the text feature extraction network and the image feature extraction network are each connected to the fully-connected network. The apparatus 600 comprises:
the acquiring unit 610, configured to acquire multi-modal information of a problem to be identified, where the multi-modal information includes voice information, text information, and image information;
a feature extraction unit 620, configured to determine a first feature vector of the voice information based on the voice feature extraction network; determining a second feature vector of the text information based on the text feature extraction network; and determining a third feature vector of the image information based on the image feature extraction network;
a fusion unit 630, configured to fuse the first feature vector, the second feature vector, and the third feature vector to obtain a fusion vector;
an identifying unit 640, configured to determine an intention identifying result according to the fusion vector based on the fully connected network.
In some embodiments of the present application, in the above apparatus, the voice feature extraction network includes: a first CNN layer, a second CNN layer and a third CNN layer which are connected in sequence; the first CNN layer, the second CNN layer and the third CNN layer each comprise a plurality of CNN units arranged in parallel, wherein the convolution kernels of the convolution layers of the CNN units differ in size. The feature extraction unit 620 is configured to: pre-process the voice information; feed the preprocessed voice information into each CNN unit of the first CNN layer to perform feature extraction on the voice information, and splice the outputs of the CNN units to obtain a first primary feature vector of the voice information; perform dimension reduction on the first primary feature vector, feed the dimension-reduced first primary feature vector into each CNN unit of the second CNN layer to perform feature extraction on it, and splice the outputs of the CNN units to obtain a first middle-level feature vector of the voice information; and perform dimension reduction on the first middle-level feature vector, feed the dimension-reduced first middle-level feature vector into each CNN unit of the third CNN layer to perform feature extraction on it, and splice the outputs of the CNN units to obtain a first high-level feature vector of the voice information as the first feature vector.
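As an illustration of one CNN layer with parallel CNN units whose convolution kernels differ in size, the following sketch may be considered; the choice of two-dimensional convolutions, the kernel sizes (3, 5, 7), the channel counts and the use of max pooling for the dimension reduction between layers are assumptions, since the present application does not fix these values.

```python
import torch
import torch.nn as nn

class MultiKernelCNNLayer(nn.Module):
    """Sketch of one CNN layer of the voice feature extraction network: several
    CNN units in parallel, each with a convolution kernel of a different size;
    the outputs of the units are spliced along the channel dimension."""

    def __init__(self, in_channels: int, out_channels_per_unit: int,
                 kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.units = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels_per_unit,
                          kernel_size=k, padding=k // 2),  # same spatial size for all units
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Feed the same input to every CNN unit and splice the outputs.
        return torch.cat([unit(x) for unit in self.units], dim=1)

# One possible dimension-reduction step between layers (the operator is not
# fixed by the present application; 2x2 max pooling is an assumption):
reduce_between_layers = nn.MaxPool2d(kernel_size=2)
```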
In some embodiments of the present application, in the above apparatus, the text feature extraction network includes: an Embedding layer, a first fully-connected layer, a second fully-connected layer and a third fully-connected layer which are connected in sequence; wherein the numbers of neurons of the first fully-connected layer, the second fully-connected layer and the third fully-connected layer decrease in sequence. The feature extraction unit 620 is configured to: perform word segmentation on the text information and feed the processed text information into the Embedding layer, so that the text information is converted into a text input vector; feed the text input vector into the first fully-connected layer to perform feature lifting on it, obtaining a second primary feature vector of the text information; feed the second primary feature vector into the second fully-connected layer to perform feature lifting on it, obtaining a second middle-level feature vector of the text information; and feed the second middle-level feature vector into the third fully-connected layer to perform feature extraction on it, obtaining a second high-level feature vector of the text information as the second feature vector.
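A minimal sketch of such a text feature extraction network is given below; the vocabulary size, embedding dimension, layer widths and the mean pooling used to obtain a single text input vector from the token embeddings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """Sketch of the text feature extraction network: an Embedding layer followed
    by three fully-connected layers with decreasing neuron counts."""

    def __init__(self, vocab_size: int = 30000, embed_dim: int = 256,
                 widths=(512, 256, 128)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(embed_dim, widths[0])   # first fully-connected layer
        self.fc2 = nn.Linear(widths[0], widths[1])   # second fully-connected layer
        self.fc3 = nn.Linear(widths[1], widths[2])   # third fully-connected layer
        self.act = nn.ReLU()

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) word indices produced by word segmentation.
        # Mean pooling over tokens is one simple way to obtain a single text
        # input vector; the present application does not specify this step.
        x = self.embedding(token_ids).mean(dim=1)    # text input vector
        x = self.act(self.fc1(x))                    # second primary feature vector
        x = self.act(self.fc2(x))                    # second middle-level feature vector
        return self.fc3(x)                           # second high-level feature vector
```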
In some embodiments of the present application, in the above apparatus, the image feature extraction network comprises: a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are connected in sequence; the fourth CNN layer, the fifth CNN layer and the sixth CNN layer each comprise a plurality of CNN units arranged in parallel, wherein the convolution kernels of the convolution layers of the CNN units differ in size. The feature extraction unit 620 is configured to: feed the image information into each CNN unit of the fourth CNN layer to perform feature extraction on the image information, and splice the outputs of the CNN units to obtain a third primary feature vector of the image information; perform dimension reduction on the third primary feature vector, feed the dimension-reduced third primary feature vector into each CNN unit of the fifth CNN layer to perform feature extraction on it, and splice the outputs of the CNN units to obtain a third middle-level feature vector of the image information; and perform dimension reduction on the third middle-level feature vector, feed the dimension-reduced third middle-level feature vector into each CNN unit of the sixth CNN layer to perform feature extraction on it, and splice the outputs of the CNN units to obtain a third high-level feature vector of the image information as the third feature vector.
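The image branch can be sketched in the same way; the following example assumes that the MultiKernelCNNLayer class from the voice-branch sketch above is in scope, and the channel counts, pooling operator and flattening step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    """Sketch of the image feature extraction network: three multi-kernel CNN
    layers with pooling-based dimension reduction between them, flattened into
    the third feature vector. Assumes MultiKernelCNNLayer (sketched above) is
    available in the same module."""

    def __init__(self):
        super().__init__()
        self.layer4 = MultiKernelCNNLayer(3, 16)    # fourth CNN layer (RGB input)
        self.layer5 = MultiKernelCNNLayer(48, 32)   # fifth CNN layer (3 units x 16 channels in)
        self.layer6 = MultiKernelCNNLayer(96, 64)   # sixth CNN layer (3 units x 32 channels in)
        self.pool = nn.MaxPool2d(kernel_size=2)     # dimension reduction (an assumption)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.layer4(image))   # third primary feature vector, reduced
        x = self.pool(self.layer5(x))       # third middle-level feature vector, reduced
        x = self.layer6(x)                  # third high-level feature vector
        return torch.flatten(x, start_dim=1)
```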
In some embodiments of the present application, in the above apparatus, the fusion unit 630 is configured to place, according to a specified splicing form, the elements of the second feature vector at the corresponding positions after the first feature vector in their original order, and the elements of the third feature vector at the corresponding positions after the second feature vector in their original order; and, if the first feature vector, the second feature vector and the third feature vector are matrices of different shapes, to set the missing elements in the other vectors to null, taking the row vector and the column vector with the most elements as the reference.
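By way of illustration, the splicing with padding for differently shaped inputs could look like the following sketch; the use of zero as the "null" placeholder and the restriction to inputs that are all one-dimensional or all two-dimensional are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def fuse(vec1: torch.Tensor, vec2: torch.Tensor, vec3: torch.Tensor) -> torch.Tensor:
    """Sketch of the fusion step: the elements of vec2 follow vec1 and the
    elements of vec3 follow vec2, in their original order. Inputs are assumed
    to be either all one-dimensional or all two-dimensional; zero is used as
    the 'null' placeholder for missing elements (an assumption)."""
    tensors = [vec1, vec2, vec3]
    if all(t.dim() == 1 for t in tensors):
        return torch.cat(tensors, dim=0)  # plain splicing for ordinary vectors
    # Differently shaped matrices: pad each input to the largest row and
    # column count before placing them one after another.
    rows = max(t.size(0) for t in tensors)
    cols = max(t.size(1) for t in tensors)
    padded = [F.pad(t, (0, cols - t.size(1), 0, rows - t.size(0))) for t in tensors]
    return torch.cat(padded, dim=0)

# Usage with deliberately mismatched shapes:
fusion = fuse(torch.randn(4, 8), torch.randn(3, 8), torch.randn(2, 5))
```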
In some embodiments of the present application, in the above apparatus, the fully-connected network comprises a fourth fully-connected layer, a fifth fully-connected layer and a sixth fully-connected layer connected in sequence; wherein the number of neurons of the fourth fully-connected layer, the fifth fully-connected layer and the sixth fully-connected layer decreases in sequence; the identifying unit 640 is configured to enable the fused vector to sequentially enter the fourth full connection layer, the fifth full connection layer, and the sixth full connection layer, so as to compress features represented by the fused vector and determine an intention identification result.
In some embodiments of the present application, in the above apparatus, the obtaining unit 610 is configured to: obtain a target video stream corresponding to the problem to be identified; determine a plurality of continuous key frames in the target video stream; take the voice separated from the continuous key frames as the voice information; convert the voice information into text information; and take at least one frame of the continuous key frames as the image information.
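As an illustration only, the acquisition step could be sketched as follows; separate_audio and speech_to_text are hypothetical placeholders (not real library calls), and taking the first few frames as the continuous key frames is a simplification, since the present application does not fix the key-frame selection criterion.

```python
import cv2  # OpenCV, used here to read the target video stream

def separate_audio(video_path, key_frames):
    """Hypothetical placeholder: separate the voice aligned with the key frames
    (for example via ffmpeg); not a real library call."""
    raise NotImplementedError

def speech_to_text(voice_info):
    """Hypothetical placeholder for a speech recognition engine that converts
    the voice information into text information."""
    raise NotImplementedError

def acquire_multimodal_info(video_path: str, num_key_frames: int = 5):
    """Sketch of the acquisition step: take several consecutive frames of the
    target video stream as key frames, keep at least one as image information,
    separate the voice and convert it into text information."""
    cap = cv2.VideoCapture(video_path)
    key_frames = []
    while len(key_frames) < num_key_frames:
        ok, frame = cap.read()
        if not ok:
            break
        key_frames.append(frame)
    cap.release()

    image_info = key_frames[0] if key_frames else None    # at least one key frame
    voice_info = separate_audio(video_path, key_frames)   # voice information
    text_info = speech_to_text(voice_info)                # text information
    return voice_info, text_info, image_info
```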
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 7, at the hardware level, the electronic device includes a processor and, optionally, an internal bus, a network interface and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk storage. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 7, but this does not indicate only one bus or one type of bus.
The memory is used for storing programs. In particular, a program may include program code comprising computer operating instructions. The memory may include both an internal memory and a non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the intention recognition apparatus at the logical level. The processor executes the program stored in the memory and is specifically configured to perform the following operations:
acquiring multi-modal information of a problem to be identified, wherein the multi-modal information comprises voice information, text information and image information;
determining a first feature vector of the voice information based on the voice feature extraction network; determining a second feature vector of the text information based on the text feature extraction network; determining a third feature vector of the image information based on the image feature extraction network;
fusing the first feature vector, the second feature vector and the third feature vector to obtain a fused vector;
and determining an intention recognition result according to the fusion vector based on the full-connection network.
The method performed by the intention recognition apparatus disclosed in the embodiment of fig. 6 of the present application can be applied to or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in hardware, or in a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
The electronic device may further execute the method executed by the intention recognition apparatus in fig. 6 and implement the functions of the intention recognition apparatus in the embodiment shown in fig. 6, which are not described herein again in this embodiment of the application.
An embodiment of the present application further provides a computer-readable storage medium storing one or more programs, where the one or more programs include instructions, which, when executed by an electronic device including a plurality of application programs, can cause the electronic device to perform the method performed by the intent recognition apparatus in the embodiment shown in fig. 6, and is specifically configured to perform:
acquiring multi-modal information of a problem to be identified, wherein the multi-modal information comprises voice information, text information and image information;
determining a first feature vector of the voice information based on the voice feature extraction network; determining a second feature vector of the text information based on the text feature extraction network; determining a third feature vector of the image information based on the image feature extraction network;
fusing the first feature vector, the second feature vector and the third feature vector to obtain a fused vector;
and determining an intention recognition result according to the fusion vector based on the full-connection network.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (10)

1. An intention recognition method, which is characterized in that the intention recognition method is realized based on an intention recognition model, wherein the intention recognition model comprises a voice feature extraction network, a text feature extraction network, an image feature extraction network and a fully connected network, wherein the voice feature extraction network, the text feature extraction network and the image feature extraction network are respectively connected with the fully connected network;
the method comprises the following steps:
acquiring multi-modal information of a problem to be identified, wherein the multi-modal information comprises voice information, text information and image information;
determining a first feature vector of the voice information based on the voice feature extraction network; determining a second feature vector of the text information based on the text feature extraction network; determining a third feature vector of the image information based on the image feature extraction network;
fusing the first feature vector, the second feature vector and the third feature vector to obtain a fused vector;
and determining an intention recognition result according to the fusion vector based on the full-connection network.
2. The method of claim 1, wherein the voice feature extraction network comprises: a first CNN layer, a second CNN layer and a third CNN layer which are connected in sequence;
the first CNN layer, the second CNN layer and the third CNN layer respectively comprise a plurality of CNN units arranged in parallel, wherein convolution kernels of convolution layers of the CNN units are different in size;
the determining a first feature vector of the voice information based on the voice feature extraction network comprises:
preprocessing the voice information;
respectively enabling the preprocessed voice information to enter each CNN unit of the first CNN layer so as to perform feature extraction on the voice information, and splicing the output of each CNN unit to obtain a first primary feature vector of the voice information;
performing dimensionality reduction on the first primary feature vector, enabling the dimensionality-reduced first primary feature vector to enter each CNN unit of the second CNN layer, so as to perform feature extraction on the first primary feature vector, and splicing the outputs of the CNN units to obtain a first middle-level feature vector of the voice information;
and performing dimensionality reduction on the first middle-level feature vector, enabling the dimensionality-reduced first middle-level feature vector to enter each CNN unit of the third CNN layer so as to perform feature extraction on the first middle-level feature vector, and splicing the outputs of the CNN units to obtain a first high-level feature vector of the voice information as the first feature vector.
3. The method of claim 1, wherein the text feature extraction network comprises: an Embedding layer, a first full-connection layer, a second full-connection layer and a third full-connection layer which are connected in sequence; wherein the number of neurons of the first full-connection layer, the second full-connection layer and the third full-connection layer decreases in sequence;
the determining a second feature vector of the text information based on the text feature extraction network comprises:
performing word segmentation processing on the text information, and enabling the processed text information to enter the Embedding layer so as to convert the text information into a text input vector;
enabling the text input vector to enter the first full-connection layer so as to perform feature lifting on the text input vector to obtain a second primary feature vector of the text information;
enabling the second primary feature vector to enter the second full-connection layer so as to carry out feature lifting on the second primary feature vector to obtain a second middle-level feature vector of the text information;
and enabling the second middle-level feature vector to enter the third full-connection layer so as to perform feature extraction on the second middle-level feature vector to obtain a second high-level feature vector of the text information as a second feature vector.
4. The method of claim 1, wherein the image feature extraction network comprises: a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are connected in sequence;
the fourth CNN layer, the fifth CNN layer and the sixth CNN layer respectively comprise a plurality of CNN units arranged in parallel, wherein convolution kernels of convolution layers of the CNN units are different in size;
the determining a third feature vector of the image information based on the image feature extraction network includes:
enabling image information to enter each CNN unit of the fourth CNN layer respectively to perform feature extraction on the image information, and splicing the output of each CNN unit to obtain a third primary feature vector of the image information;
performing dimensionality reduction on the third primary feature vector, enabling the dimensionality-reduced third primary feature vector to enter each CNN unit of the fifth CNN layer, so as to perform feature extraction on the third primary feature vector, and splicing the outputs of the CNN units to obtain a third middle-level feature vector of the image information;
and performing dimensionality reduction on the third middle-level feature vector, enabling the dimensionality-reduced third middle-level feature vector to enter each CNN unit of the sixth CNN layer to perform feature extraction on the third middle-level feature vector, and splicing the outputs of the CNN units to obtain a third high-level feature vector of the image information as the third feature vector.
5. The method of claim 1, wherein fusing the first feature vector, the second feature vector, and the third feature vector to obtain a fused vector comprises:
according to a specified splicing form, placing elements in the second feature vector to corresponding positions behind the first feature vector according to an original sequence, and placing elements in the third feature vector to corresponding positions behind the second feature vector according to the original sequence;
and if the first feature vector, the second feature vector and the third feature vector are matrices of different shapes, setting the missing elements in the other vectors to null, taking the row vector and the column vector with the most elements as the reference.
6. The method of claim 1, wherein the fully connected network comprises a fourth fully connected layer, a fifth fully connected layer, and a sixth fully connected layer connected in sequence; wherein the number of neurons of the fourth fully-connected layer, the fifth fully-connected layer and the sixth fully-connected layer decreases in sequence;
the determining an intention recognition result according to the fusion vector based on the fully-connected network comprises:
and enabling the fusion vector to sequentially enter the fourth full connection layer, the fifth full connection layer and the sixth full connection layer so as to compress the characteristics represented by the fusion vector and determine an intention recognition result.
7. The method according to any one of claims 1 to 6, wherein the acquiring multi-modal information of the problem to be identified comprises:
acquiring a target video stream corresponding to the problem to be identified;
determining a plurality of continuous key frames in the target video stream;
taking the voice separated from the continuous key frames as voice information;
converting the voice information into text information;
at least one of the consecutive key frames is used as image information.
8. An intent recognition apparatus, wherein the intent recognition apparatus deploys an intent recognition model, and the intent recognition model comprises a voice feature extraction network, a text feature extraction network, an image feature extraction network, and a fully connected network, wherein the voice feature extraction network, the text feature extraction network, and the image feature extraction network are respectively connected to the fully connected network;
the device comprises:
an acquiring unit, configured to acquire multi-modal information of a problem to be identified, wherein the multi-modal information comprises voice information, text information and image information;
a feature extraction unit, configured to determine a first feature vector of the voice information based on the voice feature extraction network; determining a second feature vector of the text information based on the text feature extraction network; and determining a third feature vector of the image information based on the image feature extraction network;
a fusion unit, configured to fuse the first feature vector, the second feature vector and the third feature vector to obtain a fusion vector;
and an identification unit, configured to determine an intention recognition result according to the fusion vector based on the fully connected network.
9. An intention recognition model, characterized in that the intention recognition model comprises a voice feature extraction network, a text feature extraction network, an image feature extraction network and a fully connected network, wherein the voice feature extraction network, the text feature extraction network and the image feature extraction network are respectively connected with the fully connected network;
the voice feature extraction network includes: a first CNN layer, a second CNN layer and a third CNN layer which are connected in sequence; the first CNN layer, the second CNN layer and the third CNN layer each comprise a plurality of CNN units arranged in parallel, wherein the convolution kernels of the convolution layers of the CNN units differ in size;
the text feature extraction network includes: an Embedding layer, a first fully-connected layer, a second fully-connected layer and a third fully-connected layer which are connected in sequence; wherein the numbers of neurons of the first fully-connected layer, the second fully-connected layer and the third fully-connected layer decrease in sequence;
the image feature extraction network includes: a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are connected in sequence; the fourth CNN layer, the fifth CNN layer and the sixth CNN layer each comprise a plurality of CNN units arranged in parallel, wherein the convolution kernels of the convolution layers of the CNN units differ in size;
the fully connected network comprises: a fourth fully-connected layer, a fifth fully-connected layer and a sixth fully-connected layer which are connected in sequence, wherein the numbers of neurons of the fourth fully-connected layer, the fifth fully-connected layer and the sixth fully-connected layer decrease in sequence.
10. An electronic device, comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the method of any one of claims 1 to 7.
CN202210208740.9A 2022-03-03 2022-03-03 Intention recognition method and device, model and electronic equipment Pending CN114627868A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210208740.9A CN114627868A (en) 2022-03-03 2022-03-03 Intention recognition method and device, model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210208740.9A CN114627868A (en) 2022-03-03 2022-03-03 Intention recognition method and device, model and electronic equipment

Publications (1)

Publication Number Publication Date
CN114627868A true CN114627868A (en) 2022-06-14

Family

ID=81899408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210208740.9A Pending CN114627868A (en) 2022-03-03 2022-03-03 Intention recognition method and device, model and electronic equipment

Country Status (1)

Country Link
CN (1) CN114627868A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115690552A (en) * 2022-12-30 2023-02-03 智慧眼科技股份有限公司 Multi-intention recognition method and device, computer equipment and storage medium
CN116932726A (en) * 2023-08-04 2023-10-24 重庆邮电大学 Open domain dialogue generation method based on controllable multi-space feature decoupling
CN116932726B (en) * 2023-08-04 2024-05-10 重庆邮电大学 Open domain dialogue generation method based on controllable multi-space feature decoupling

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
Zeng et al. Effective combination of DenseNet and BiLSTM for keyword spotting
CN111312245B (en) Voice response method, device and storage medium
Sarthak et al. Spoken language identification using convnets
CN114627868A (en) Intention recognition method and device, model and electronic equipment
CN110852215A (en) Multi-mode emotion recognition method and system and storage medium
CN114596845A (en) Training method of voice recognition model, voice recognition method and device
US20230031733A1 (en) Method for training a speech recognition model and method for speech recognition
Tripathi et al. Focal loss based residual convolutional neural network for speech emotion recognition
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN112908315B (en) Question and answer intention judging method based on sound characteristics and voice recognition
CN111739537B (en) Semantic recognition method and device, storage medium and processor
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN111914803A (en) Lip language keyword detection method, device, equipment and storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN111599363B (en) Voice recognition method and device
Sen Voice activity detector for device with small processor and memory
US20220277761A1 (en) Impression estimation apparatus, learning apparatus, methods and programs for the same
CN116959421B (en) Method and device for processing audio data, audio data processing equipment and medium
CN118072734A (en) Speech recognition method, device, processor, memory and electronic equipment
Parameswarappa et al. VOICE INTELLIGENCE BASED WAKE WORD DETECTION OF REGIONAL DIALECTS USING 1D CONVOLUTIONAL NEURAL NETWORK

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination