CN114611529B - Intention recognition method and device, electronic equipment and storage medium

Info

Publication number: CN114611529B
Application number: CN202210253425.8A
Authority: CN
Inventors: 赵仕豪; 马骏; 王少军
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-03-15
Filing date: 2022-03-15
Publication date: 2024-02-02
Anticipated expiration: 2042-03-15
Also published as: CN114611529A

Abstract

Embodiments of the application provide an intention recognition method and device, electronic equipment, and a storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring voice question-answer data to be recognized; performing voice recognition processing on the voice question-answer data to obtain candidate question-answer texts; performing corpus feature extraction on the candidate question-answer texts through a preset feature extraction model to obtain target question-answer texts; performing intention prediction on the target question-answer texts through a preset intention prediction model and intention category labels to obtain an intention prediction value corresponding to each intention category label; sorting the intention category labels according to the intention prediction values to obtain target intention category labels; and screening the target question-answer texts according to the target intention category labels to obtain target intention data. The method and device can improve the accuracy of intention recognition.

Description

Intention recognition method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to an intent recognition method and apparatus, an electronic device, and a storage medium.

Background

At present, when recognizing user intention, most intelligent semantic recognition systems need to convert the user's voice question-answer data into question-answer text data and then perform text matching on that data against a pre-constructed text matching database. However, a text matching database can only solve the matching problem at the vocabulary level, which limits it and reduces the accuracy of intention recognition. How to improve the accuracy of intention recognition has therefore become a technical problem to be solved urgently.

Disclosure of Invention

The embodiment of the application mainly aims to provide an intention recognition method and device, electronic equipment and storage medium, and aims to improve accuracy of intention recognition.

To achieve the above object, a first aspect of an embodiment of the present application proposes an intent recognition method, including:

acquiring voice question-answering data to be recognized;

performing voice recognition processing on the voice question-answering data to obtain candidate question-answering texts;

performing corpus feature extraction on the candidate question-answer text through a preset feature extraction model to obtain a target question-answer text;

carrying out intention prediction on the target question-answering text through a preset intention prediction model and intention type labels to obtain an intention prediction value corresponding to each intention type label;

sorting the intention category labels according to the intention predicted value to obtain target intention category labels;

and screening the target question-answering text according to the target intention type label to obtain target intention data.

In some embodiments, the step of performing voice recognition processing on the voice question-answer data to obtain candidate question-answer text includes:

performing voice recognition processing on the voice question-answering data through a preset voice recognition model to obtain an initial question-answering text;

and carrying out semantic completion on the initial question-answer text to obtain the candidate question-answer text.

In some embodiments, the feature extraction model includes a convolution layer and a first full-connection layer, and the step of extracting corpus features of the candidate question-answer text through a preset feature extraction model to obtain a target question-answer text includes:

carrying out convolution processing on the candidate question-answering text through the convolution layer to obtain question-answering text convolution characteristics;

performing translation score calculation on the question-answer text convolution characteristics through a preset algorithm of the first full-connection layer to obtain translation scores;

and screening the candidate question-answer text according to the translation score to obtain the target question-answer text.

In some embodiments, the intent prediction model includes an embedding layer, a pooling layer, a Bi-LSTM layer, and a second fully-connected layer, and the step of performing intent prediction on the target question-answer text through a preset intent prediction model and intent category labels to obtain intent prediction values corresponding to each of the intent category labels includes:

word embedding processing is carried out on the target question-answering text through the embedding layer, so that a target text embedding vector is obtained;

carrying out pooling treatment on the target text embedded vector through the pooling layer to obtain a target text pooling vector;

encoding the target text pooling vector through the Bi-LSTM layer to obtain a target text hidden variable;

and carrying out intention prediction on the target text hidden variable through the prediction function of the second full-connection layer and the intention type label to obtain an intention prediction value corresponding to each intention type label.

In some embodiments, the step of pooling the target text embedded vector by the pooling layer to obtain a target text pooled vector includes:

carrying out maximum pooling processing on the target text embedded vector through the pooling layer to obtain a text maximum pooling vector;

carrying out average pooling processing on the target text embedded vector through the pooling layer to obtain a text average pooling vector;

and performing splicing processing on the text maximum pooling vector and the text average pooling vector to obtain the target text pooling vector.
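The pooling-and-splicing steps above can be illustrated with a minimal sketch. It assumes that pooling is taken element-wise over the token axis of the embedded text, which the source does not specify; the function names and the sample values are hypothetical.

```python
from typing import List


def max_pool(embeddings: List[List[float]]) -> List[float]:
    """Element-wise maximum over the token axis (assumed pooling axis)."""
    return [max(column) for column in zip(*embeddings)]


def avg_pool(embeddings: List[List[float]]) -> List[float]:
    """Element-wise mean over the token axis (assumed pooling axis)."""
    return [sum(column) / len(column) for column in zip(*embeddings)]


def pooled_vector(embeddings: List[List[float]]) -> List[float]:
    """Splice (concatenate) the max-pooled and average-pooled vectors."""
    return max_pool(embeddings) + avg_pool(embeddings)


# Hypothetical embeddings: three tokens, two dimensions each.
token_embeddings = [
    [1.0, 4.0],
    [3.0, 2.0],
    [2.0, 0.0],
]
target_text_pooling_vector = pooled_vector(token_embeddings)
```

The spliced vector is twice as wide as a single embedding, because it keeps both the strongest activation and the average activation of every dimension.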

In some embodiments, the step of sorting the intent category labels according to the intent prediction value to obtain target intent category labels includes:

arranging the intention category labels in descending order of intention prediction value to obtain a predicted intention sequence;

and screening the predicted intention sequence according to a preset screening condition to obtain the target intention type label.
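A minimal sketch of the sorting and screening steps above follows. The source does not state what the preset screening condition is; the sketch assumes a top-k cut combined with a minimum prediction value, and the label names and values are hypothetical.

```python
from typing import Dict, List


def select_target_labels(
    predictions: Dict[str, float], top_k: int = 1, min_value: float = 0.0
) -> List[str]:
    """Sort labels by prediction value (descending), then apply the screening condition."""
    ranked = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    # Assumed screening condition: keep at most top_k labels whose value reaches min_value.
    return [label for label, value in ranked[:top_k] if value >= min_value]


# Hypothetical intention prediction values for three labels.
intent_predictions = {"ask_traffic": 0.21, "ask_weather": 0.72, "chitchat": 0.07}
target_labels = select_target_labels(intent_predictions, top_k=2, min_value=0.2)
```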

In some embodiments, before the step of performing intent prediction on the target question-answer text through a preset intent prediction model and intent class labels to obtain an intent prediction value corresponding to each of the intent class labels, the method further includes pre-training the intent prediction model, and specifically includes:

acquiring sample intention voice data;

performing format conversion on the sample intention voice data to obtain sample intention text data;

inputting the sample intention text data into an initial model, and carrying out word embedding processing, pooling processing and coding processing on the sample intention text data through the initial model to obtain a sample intention hidden variable;

calculating a sample intention predicted value corresponding to each preset intention category label through a loss function of the initial model and the sample intention hidden variable;

calculating a model loss value of the initial model according to a preset cross entropy algorithm and the sample intention predicted value, and optimizing the loss function according to the model loss value so as to update the initial model to obtain the intention predicted model.

To achieve the above object, a second aspect of the embodiments of the present application proposes an intention recognition apparatus, the apparatus comprising:

the data acquisition module is used for acquiring voice question-answer data to be identified;

the voice recognition module is used for performing voice recognition processing on the voice question-answer data to obtain candidate question-answer texts;

the feature extraction module is used for extracting corpus features of the candidate question-answering texts through a preset feature extraction model to obtain target question-answering texts;

the intention prediction module is used for carrying out intention prediction on the target question-answer text through a preset intention prediction model and intention type labels to obtain an intention prediction value corresponding to each intention type label;

the sorting module is used for sorting the intention category labels according to the intention predicted value to obtain target intention category labels;

and the screening module is used for screening the target question-answer text according to the target intention category label to obtain target intention data.

To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, the electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling a connection communication between the processor and the memory, the program, when executed by the processor, implementing the method according to the first aspect.

To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium, for computer-readable storage, the storage medium storing one or more programs executable by one or more processors to implement the method described in the first aspect.

According to the intention recognition method and device, electronic equipment, and storage medium of the present application, voice question-answer data to be recognized is acquired, and voice recognition processing is performed on it to obtain candidate question-answer texts. Corpus feature extraction is then performed on the candidate question-answer texts through a preset feature extraction model to obtain target question-answer texts, which reduces errors in converting voice data into text data, increases the richness and contrast of the target question-answer texts, and improves the accuracy of intention recognition. Next, intention prediction is performed on the target question-answer texts through a preset intention prediction model and intention category labels to obtain an intention prediction value corresponding to each intention category label; this better captures the sentence semantic features in the target question-answer texts when calculating the intention prediction value of each label. Finally, the intention category labels are sorted according to the intention prediction values to obtain target intention category labels, and the target question-answer texts are screened according to the target intention category labels to obtain target intention data, which improves the accuracy of intention recognition.

Drawings

FIG. 1 is a flowchart of an intention recognition method provided by an embodiment of the present application;

FIG. 2 is a flowchart of step S102 in FIG. 1;

FIG. 3 is a flowchart of step S103 in FIG. 1;

FIG. 4 is another flowchart of an intention recognition method provided by an embodiment of the present application;

FIG. 5 is a flowchart of step S104 in FIG. 1;

FIG. 6 is a flowchart of step S502 in FIG. 5;

FIG. 7 is a flowchart of step S105 in FIG. 1;

FIG. 8 is a schematic structural diagram of the intention recognition device provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.

First, several terms used in this application are explained:

Artificial intelligence (AI): a new technical science that researches and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.

Natural language processing (NLP): a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often referred to as computational linguistics, which processes, understands, and applies human languages (e.g., Chinese, English, etc.). Natural language processing includes syntactic parsing, semantic analysis, discourse understanding, and the like. It is commonly used in technical fields such as machine translation, handwritten and printed character recognition, voice recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining. It involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computation.

Information extraction (Information Extraction, IE): a text processing technique that extracts factual information of specified types, such as entities, relations, and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is made up of specific units, such as sentences, paragraphs, and chapters, and text information is made up of smaller specific units, such as words, phrases, sentences, and paragraphs, or combinations of these units. Extracting noun phrases, person names, place names, and the like from text data is text information extraction; the information extracted by text information extraction technology can of course be of various types.

Web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community): a program or script that automatically crawls web information according to certain rules.

Automatic speech recognition technology (Automatic Speech Recognition, ASR): a technology that converts human speech into text. The input to speech recognition is typically a time-domain speech signal, represented mathematically as a sequence of vectors with length T and dimension d. The output of automatic speech recognition is text, represented as a sequence of tokens with length N, each drawn from a set of distinct tokens.

Encoding (encoder): encoding converts the input sequence into a fixed-length vector.

Twin neural network (Siamese neural network): also called a gemini neural network, a coupled framework built from two artificial neural networks. A twin neural network takes two samples as input and outputs their representations embedded in a high-dimensional space, so that the degree of similarity between the two samples can be compared. In the narrow sense, a twin neural network is formed by joining two neural networks that have the same structure and share weights. In the broad sense, a twin neural network, or "pseudo-twin neural network", may be formed by joining any two neural networks. Twin neural networks typically have a deep structure and may consist of convolutional neural networks, recurrent neural networks, and the like.

The twin neural network comprises two sub-networks, each of which receives an input, maps it to a high-dimensional feature space, and outputs a corresponding representation. By calculating the distance between the two representations, e.g., the Euclidean distance, the user can compare the similarity of the two inputs. The sub-networks of the twin neural network may be convolutional neural networks or recurrent neural networks, and their weights may be optimized by an energy function or a classification loss.
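As a minimal sketch of the comparison described above, the following code passes two inputs through the same encoder and measures the Euclidean distance between the resulting representations. The encoder is a hypothetical fixed linear map standing in for a trained sub-network; its weights and the sample inputs are illustrative only.

```python
import math
from typing import List

# Hypothetical shared weights: both branches use the same map (narrow-sense twin network).
SHARED_WEIGHTS = [
    [0.5, -0.5],
    [1.0, 1.0],
]


def shared_encoder(features: List[float]) -> List[float]:
    """Map an input to its representation with the shared weights."""
    return [sum(w * x for w, x in zip(row, features)) for row in SHARED_WEIGHTS]


def siamese_distance(sample_a: List[float], sample_b: List[float]) -> float:
    """Euclidean distance between the two representations; smaller means more similar."""
    return math.dist(shared_encoder(sample_a), shared_encoder(sample_b))


distance = siamese_distance([1.0, 1.0], [3.0, 1.0])
```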

Bi-directional Long Short-Term Memory (Bi-LSTM): a combination of a forward LSTM and a backward LSTM, commonly used in natural language processing tasks to model context information. Building on LSTM, Bi-LSTM combines information from the input sequence in both the forward and backward directions. For the output at time t, the forward LSTM layer holds information from time t and earlier in the input sequence, and the backward LSTM layer holds information from time t and later. If the output of the forward LSTM layer at time t is denoted M and the output of the backward LSTM layer at time t is denoted N, the vectors output by the two LSTM layers can be combined by addition, averaging, or concatenation.
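The three ways of combining the forward output M and the backward output N can be sketched as follows. The function name, the mode strings, and the sample vectors are hypothetical; the LSTM layers themselves are not modelled here.

```python
from typing import List


def combine_directions(forward: List[float], backward: List[float], mode: str = "concat") -> List[float]:
    """Combine forward (M) and backward (N) outputs for one time step."""
    if mode == "add":
        return [m + n for m, n in zip(forward, backward)]
    if mode == "average":
        return [(m + n) / 2 for m, n in zip(forward, backward)]
    if mode == "concat":
        return list(forward) + list(backward)
    raise ValueError(f"unknown mode: {mode}")


M = [1.0, 2.0]  # hypothetical forward output at time t
N = [3.0, 6.0]  # hypothetical backward output at time t
combined = combine_directions(M, N, mode="concat")
```

Addition and averaging keep the hidden size unchanged, while concatenation doubles it, so the layer that follows must be sized accordingly.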

At present, when recognizing user intention, most intelligent semantic recognition systems need to convert the user's voice question-answer data into question-answer text data and then perform text matching on that data against a pre-constructed text matching database. That is, the user's answer is passed through a speech recognition (ASR) module to output a text translation result, and a traditional text matching model then matches this result against a traditional text matching database to recognize the user's intention. Because of factors such as noise and speaking accent, a single text translation result contains some error, and the text matching database can only solve the matching problem at the vocabulary level. This approach is therefore limited, which affects the accuracy of intention recognition. How to improve the accuracy of intention recognition has thus become a technical problem to be solved urgently.

Based on the above, the embodiment of the application provides an intention recognition method and device, electronic equipment and storage medium, and aims to improve accuracy of intention recognition.

The method and apparatus for identifying intent, electronic device and storage medium provided in the embodiments of the present application are specifically described by the following embodiments, and the method for identifying intent in the embodiments of the present application is first described.

The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.

Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.

The embodiment of the application provides an intention recognition method and relates to the technical field of artificial intelligence. The intention recognition method provided by the embodiment of the application can be applied to a terminal, to a server side, or to software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop computer, etc.; the server side may be configured as an independent physical server, as a server cluster or distributed system formed by a plurality of physical servers, or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application or the like that implements the intention recognition method, but is not limited to the above forms.

The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

Fig. 1 is an optional flowchart of an intent recognition method provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S106.

Step S101, acquiring voice question-answer data to be recognized;

Step S102, performing voice recognition processing on the voice question-answer data to obtain candidate question-answer texts;

Step S103, performing corpus feature extraction on the candidate question-answer text through a preset feature extraction model to obtain a target question-answer text;

Step S104, carrying out intention prediction on the target question-answer text through a preset intention prediction model and intention category labels to obtain an intention prediction value corresponding to each intention category label;

Step S105, sorting the intention category labels according to the intention prediction value to obtain target intention category labels;

Step S106, screening the target question-answer text according to the target intention category label to obtain target intention data.
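The data flow of steps S101 to S106 can be sketched as a chain of stages. This is only an outline under stated assumptions: every stage is passed in as a hypothetical callable, and the stub stages in the usage example are placeholders rather than real models.

```python
from typing import Callable, Dict, List


def recognize_intent(
    voice_data: bytes,                                          # S101: acquired voice data
    speech_to_text: Callable[[bytes], List[str]],               # S102
    extract_target_texts: Callable[[List[str]], List[str]],     # S103
    predict_intents: Callable[[List[str]], Dict[str, float]],   # S104
    select_labels: Callable[[Dict[str, float]], List[str]],     # S105
    screen_texts: Callable[[List[str], List[str]], List[str]],  # S106
) -> List[str]:
    """Run the six steps in order and return the target intention data."""
    candidate_texts = speech_to_text(voice_data)
    target_texts = extract_target_texts(candidate_texts)
    predictions = predict_intents(target_texts)
    target_labels = select_labels(predictions)
    return screen_texts(target_texts, target_labels)


# Usage with hypothetical stub stages.
result = recognize_intent(
    b"raw-audio",
    speech_to_text=lambda audio: ["how is the weather today", "how is the whether today"],
    extract_target_texts=lambda texts: texts[:1],
    predict_intents=lambda texts: {"ask_weather": 0.9, "ask_traffic": 0.1},
    select_labels=lambda scores: [max(scores, key=scores.get)],
    screen_texts=lambda texts, labels: [f"{labels[0]}: {text}" for text in texts],
)
```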

In steps S101 to S106 of the embodiment of the present application, corpus feature extraction is performed on the candidate question-answer texts through a preset feature extraction model to obtain target question-answer texts, which can reduce errors in converting voice data into text data, increase the richness and contrast of the target question-answer texts, and improve the accuracy of intention recognition. Intention prediction is performed on the target question-answer texts through the intention prediction model and the intention category labels, so that sentence semantic features in the target question-answer texts can be captured better and the intention prediction value of each intention category label can be calculated. The intention category labels are sorted according to the intention prediction values to obtain target intention category labels, and the target question-answer texts are screened according to the target intention category labels to obtain target intention data, so that the accuracy of intention recognition can be improved.

In step S101 of some embodiments, a web crawler may be written and a data source set, and data may then be crawled in a targeted manner to obtain the voice question-answer data to be recognized. It should be noted that the voice question-answer data to be recognized includes the sentences and modal particles that the user outputs within a certain period of time. The output sentences include various sentence types, such as declarative sentences and rhetorical questions. For example, at nine in the morning, the user inputs a sentence such as "How is the weather today?", "Is it sunny today?", or "What are the traffic conditions today?".

Referring to fig. 2, in some embodiments, step S102 may include, but is not limited to, steps S201 to S202:

Step S201, performing voice recognition processing on the voice question-answer data through a preset voice recognition model to obtain an initial question-answer text;

Step S202, carrying out semantic completion on the initial question-answer text to obtain the candidate question-answer text.

In step S201 of some embodiments, the preset voice recognition model may be constructed based on automatic speech recognition (ASR) technology. The voice question-answer data is translated into text by the ASR technology; specifically, it may be translated according to a pre-constructed phoneme mapping table. The voice recognition model takes phonemes, the basic elements of sound, as its units. Different words are composed of different phonemes, so the corresponding text can be obtained by recognizing which phonemes are present in the input speech and combining them into words. A phoneme mapping table, which reflects the correspondence between voice signals and phonemes, is therefore set in the voice recognition model. Voice recognition processing is performed on the voice signals of the voice question-answer data according to the phoneme mapping table: the phonemes in the voice signals are recognized and combined, and then decoded into characters through N-Best decoding to obtain the voice feature word segments corresponding to the voice question-answer data. Further, the context information of each voice feature word segment is combined through a splicing function (such as a concat function), and the voice feature word segments are spliced according to basic grammar to obtain the initial question-answer text.

Steps S201 to S202 meet the requirements of voice-based intention recognition, making the intention recognition process multi-modal and improving the universality of intention recognition.

In step S202 of some embodiments, semantic error correction and filtering are performed on the initial question-answer text by using ASR technology: initial question-answer texts with unclear meaning or blurred audio are removed, and incomplete initial question-answer texts are completed by means such as synonym substitution and part-of-speech expansion, so as to obtain the candidate question-answer texts.

Before step S103 of some embodiments, the intention recognition method of the embodiments of the present application further includes pre-training the feature extraction model. The feature extraction model includes a convolution layer and a first fully-connected layer. The convolution layer has a size of 3×3 and is mainly used to perform feature extraction on the candidate question-answer texts and capture text feature information. The first fully-connected layer is mainly used to calculate a translation score for the text feature information extracted by the convolution layer, identify the word error rate of the text feature information, and output the final translation result, i.e., the target question-answer text.

Referring to fig. 3, in some embodiments, the feature extraction model includes a convolution layer and a first full connection layer, and step S103 may include, but is not limited to, steps S301 to S303:

Step S301, carrying out convolution processing on the candidate question-answer texts through the convolution layer to obtain question-answer text convolution features;

Step S302, performing translation score calculation on the question-answer text convolution features through a preset algorithm of the first fully-connected layer to obtain a translation score;

Step S303, screening the candidate question-answer texts according to the translation score to obtain the target question-answer text.

In step S301 of some embodiments, feature extraction is performed on candidate question-answer texts through a convolution layer, and text feature information is captured, so as to obtain question-answer text convolution features.

In step S302 of some embodiments, the translation score represents the word error rate of the candidate question-answer text, and the preset algorithm can be understood as a preset calculation formula. According to the preset algorithm, the numbers of redundant, missing, and misrecognized words in the candidate question-answer text are summed to obtain the total number of word errors, which is divided by the total number of words to obtain the word error rate, i.e., the translation score, of the candidate question-answer text.
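A minimal sketch of the word error rate calculation follows. It assumes that a reference transcript is available and that redundant, missing, and misrecognized words are counted with a word-level edit distance; the source states neither, and it does not say whether the total word count refers to the reference or to the candidate, so the reference length is used here as in the conventional definition. The sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(insertions + deletions + substitutions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the processed reference prefix and hyp_words[:j]
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)  # misrecognized word
            deletion = previous[j] + 1                               # missing word
            insertion = current[j - 1] + 1                           # redundant word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


wer = word_error_rate("how is the weather today", "how is the whether today")
```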

In step S303 of some embodiments, a lower word error rate means fewer word errors and therefore a more accurate translation, so a lower word error rate corresponds to a higher translation score. When the candidate question-answer texts are screened according to the translation score, the n candidate question-answer texts with the highest translation scores may therefore be selected as the target question-answer texts. For example, the 3 candidate question-answer texts with the lowest word error rate (i.e., the 3 candidate question-answer texts with the highest translation scores) may be selected as the target question-answer texts.
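The screening step can be sketched as a sort on word error rate followed by a cut at n. The candidate texts and their word error rates below are hypothetical, and the function name is illustrative.

```python
from typing import List, Tuple


def select_lowest_wer(candidates: List[Tuple[str, float]], n: int = 3) -> List[str]:
    """Return the n candidate texts with the lowest word error rate."""
    ranked = sorted(candidates, key=lambda candidate: candidate[1])
    return [text for text, _ in ranked[:n]]


# Hypothetical (text, word error rate) pairs.
scored_candidates = [
    ("how is the whether today", 0.2),
    ("how is the weather today", 0.0),
    ("how his the whether to day", 0.8),
    ("how is weather today", 0.2),
]
target_texts = select_lowest_wer(scored_candidates, n=3)
```

Because Python's sort is stable, candidates with equal word error rates keep their original relative order.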

In some other embodiments, the preset algorithm may be a cosine similarity algorithm, and a similarity value is used to represent the translation score in steps S302 and S303. Specifically, the similarity value between the question-answer text convolution feature and a reference text feature is calculated through the cosine similarity algorithm or the like, and the similarity value is taken as the translation score. The three candidate question-answer texts with the highest translation scores are then taken as the target question-answer texts; alternatively, the candidate question-answer texts whose translation score is greater than or equal to a preset translation score threshold are taken as the target question-answer texts.
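A minimal sketch of the cosine-similarity variant is given below, assuming that each candidate text already has a convolution feature vector and that a reference feature vector is available. The names `cosine_similarity` and `screen_by_similarity` and the toy vectors are hypothetical; the sketch covers both selection rules mentioned above (top three, or a preset threshold).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def screen_by_similarity(features, reference, threshold=None, top_k=3):
    """Keep candidates by threshold if given, otherwise the top_k most similar.

    `features` is a list of (text, feature_vector) pairs.
    """
    scored = [(cosine_similarity(vec, reference), text) for text, vec in features]
    if threshold is not None:
        return [text for score, text in scored if score >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return [text for _, text in scored[:top_k]]


# Toy feature vectors, for illustration only.
features = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
reference = [1.0, 0.0]
top_two = screen_by_similarity(features, reference, top_k=2)
above_half = screen_by_similarity(features, reference, threshold=0.5)
```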

The steps S301 to S303 perform corpus feature extraction on the candidate question-answer text through the preset feature extraction model to obtain the target question-answer text, so that the error of converting voice data into text data can be reduced, the richness and contrast of the target question-answer text can be increased, and the accuracy of intention recognition can be improved.

Referring to fig. 4, in some embodiments, before step S104, the intent recognition method of the embodiments of the present application further includes pre-training an intent prediction model, specifically including steps S401 to S405:

step S401, obtaining sample intention voice data;

step S402, format conversion is carried out on the sample intention voice data to obtain sample intention text data;

Step S403, inputting the sample intention text data into an initial model, and carrying out word embedding processing, pooling processing and encoding processing on the sample intention text data through the initial model to obtain a sample intention hidden variable;

step S404, calculating a sample intention predicted value corresponding to each preset intention type label through a loss function and a sample intention hidden variable of the initial model;

and step S405, calculating a model loss value of the initial model according to a preset cross entropy algorithm and a sample intention prediction value, and optimizing a loss function according to the model loss value so as to update the initial model and obtain the intention prediction model.

In step S401 of some embodiments, the obtained sample intent voice data may be labeled intent voice data that carries different intent category labels. The sample intent voice data may also be obtained by writing a web crawler, setting a data source, and then crawling data in a targeted manner. For example, sample intent voice data carrying a climate category label is expressed as "weather/temperature how".

In step S402 of some embodiments, the sample intent voice data is format-converted by ASR technology or the like, so that the sample intent voice data in audio form is converted into text form, resulting in the sample intent text data.

In step S403 of some embodiments, the initial model is a twin neural network. The sample intention text data is input into the initial model, and word embedding processing, pooling processing, and encoding processing are performed on the sample intention text data through each branch of the initial model to obtain the sample intention hidden variable. For example, on a given branch of the initial model, the flow nodes of the intention recognition robot system are first traversed and the node information of each flow node is extracted; word embedding processing is then performed on the sample intention text data according to the node information of the flow nodes to obtain a sample text embedding vector. Maximum pooling processing and average pooling processing are performed on the sample text embedding vector to obtain a sample global pooling vector and a sample local pooling vector, which are then spliced to obtain a target sample pooling vector. Finally, the target sample pooling vector is encoded in left-to-right order and in right-to-left order by the Bi-LSTM algorithm to obtain the sample intention hidden variable.

In step S404 of some embodiments, a sample intent prediction value corresponding to each preset intent category label is calculated by using a loss function and a sample intent hidden variable of the initial model, where the sample intent prediction value may be represented by using a similarity value or by calculating a prediction probability value corresponding to each preset intent category label by using a softmax function.

In step S405 of some embodiments, when the model loss value of the initial model is calculated according to the preset cross entropy algorithm and the sample intention prediction value, the cross entropy algorithm may use a square mean square difference cross entropy function, and the model loss value is represented by the square mean square difference. The initial model is optimized and updated according to the model loss value and a preset iteration condition until the model loss value meets the preset iteration condition, thereby obtaining the intention prediction model. The intention prediction model comprises a plurality of embedded layers, a plurality of pooling layers, a plurality of Bi-LSTM layers and a plurality of second full-connection layers; specifically, each node branch of the intention prediction model comprises an embedded layer, a pooling layer, a Bi-LSTM layer and a second full-connection layer. In this way, intention prediction is carried out on each intention node, so that the intention prediction model can meet the requirement of multi-intention recognition. This effectively avoids the resource waste of modeling and training a separate model for each subtask, greatly saves training time, and reduces the amount of model maintenance.
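For orientation only, the following sketch computes a standard categorical cross-entropy loss from predicted probabilities. It is an assumption that this plain form is a fair stand-in; it is not the specific loss variant named in step S405, and the names `cross_entropy` and `mean_loss` are hypothetical.

```python
import math


def cross_entropy(predicted, true_index, eps=1e-12):
    """Negative log-probability assigned to the true intention category label."""
    p = min(max(predicted[true_index], eps), 1.0)  # clamp to avoid log(0)
    return -math.log(p)


def mean_loss(batch):
    """Average cross-entropy over (predicted_probabilities, true_index) pairs."""
    return sum(cross_entropy(p, y) for p, y in batch) / len(batch)


# Two hypothetical samples over two intention category labels.
batch = [([0.5, 0.5], 0), ([1.0, 0.0], 0)]
loss = mean_loss(batch)
```

In training, a value such as `loss` would be compared against the preset iteration condition to decide whether to keep updating the model.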

Referring to fig. 5, in some embodiments, the intent prediction model includes an embedded layer, a pooled layer, a Bi-LSTM layer, and a second fully connected layer, and step S104 may include, but is not limited to, steps S501 to S504:

Step S501, word embedding processing is carried out on the target question-answer text through an embedding layer, and a target text embedding vector is obtained;

step S502, carrying out pooling treatment on the target text embedded vector through a pooling layer to obtain a target text pooled vector;

step S503, coding the target text pooling vector through the Bi-LSTM layer to obtain a target text hidden variable;

and step S504, carrying out intention prediction on the target text hidden variable through a prediction function and intention type labels of the second full-connection layer to obtain an intention prediction value corresponding to each intention type label.

In step S501 of some embodiments, the flow nodes of the intention recognition robot system are first traversed and the node information of each flow node is extracted; word embedding processing is then performed on the target question-answer text according to the node information of the flow nodes to obtain the target text embedded vector.

Specifically, the RoBERTa model may be adopted at the embedding layer to produce the word embedding representation of the target question-answer text. The format of the target question-answer text may be expressed as [CLS] + SENTENCE_TOKEN + [SEP], where [CLS] is the text start identifier, SENTENCE_TOKEN is the text content of the target question-answer text, and [SEP] is the text end identifier. By extracting and adding the node information of the corresponding flow node, the text format of the target question-answer text input to the embedding layer may be expressed as [CLS] + NODE_TOKEN + [SEP] + SENTENCE_TOKEN + [SEP], where NODE_TOKEN is the node information. After the node information is added, word embedding processing of the target question-answer text through the embedding layer can identify features at different levels, such as the semantics and sentence structure of the target question-answer text, and can also distinguish the features of different nodes through the node information. This avoids confusion among different subtasks, enables the intention prediction model to realize multi-intention recognition across the whole business scenario, and effectively avoids the resource waste of modeling and training a separate model for each subtask.
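The two input layouts can be illustrated with plain string assembly. This sketch only shows where the node information sits relative to the sentence; the helper name `build_embedding_input` and the sample strings are hypothetical, and an actual tokenizer would insert its own special tokens rather than literal text.

```python
CLS = "[CLS]"  # text start identifier
SEP = "[SEP]"  # text end / separator identifier


def build_embedding_input(sentence_token, node_token=None):
    """Assemble the embedding-layer input, optionally prefixed with node information."""
    if node_token is None:
        return CLS + sentence_token + SEP
    return CLS + node_token + SEP + sentence_token + SEP


# Hypothetical sentence and node information.
plain = build_embedding_input("how is the weather")
with_node = build_embedding_input("how is the weather", node_token="weather_query_node")
```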

In step S502 of some embodiments, in order to improve the comprehensiveness of the obtained text feature, the maximum pooling process and the average pooling process may be performed on the target text embedded vector to obtain a text global pooling vector and a text local pooling vector, and then the text global pooling vector and the text local pooling vector are subjected to a stitching process to obtain the target text pooling vector. It should be noted that, the text global pooling vector and the text local pooling vector have the same vector length, and are both preset fixed lengths.

In step S503 of some embodiments, the Bi-LSTM algorithm encodes the target text pooled vector in a left-to-right order and a right-to-left order, respectively, to obtain the target text hidden variable. Compared with the LSTM algorithm, the Bi-LSTM algorithm can encode the text sentence from front to back and from back to front, can better capture the sentence meaning in two directions, and enhance the recognition effect of the intention prediction model, thereby improving the accuracy of the intention recognition.

In step S504 of some embodiments, the prediction function may be a softmax function, by which an intent probability value of the target text hidden variable on each intent category label is calculated, and a probability distribution is created on each intent category label according to the intent probability value, where the probability distribution is the intent prediction value corresponding to each intent category label.
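As a rough illustration of step S504, the sketch below applies a softmax function to raw scores to obtain one probability per intention category label. It assumes the second full-connection layer has already produced one raw score per label; the names `softmax` and `intent_prediction_values`, the labels, and the scores are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intent_prediction_values(logits, labels):
    """Map each intention category label to its softmax probability."""
    return dict(zip(labels, softmax(logits)))


# Hypothetical labels and raw scores.
labels = ["weather", "billing", "greeting"]
values = intent_prediction_values([2.0, 1.0, 0.0], labels)
```

The resulting probabilities sum to one, so they form the probability distribution over intention category labels described above.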

Referring to fig. 6, in some embodiments, step S502 may further include, but is not limited to, steps S601 to S603:

step S601, carrying out maximum pooling treatment on the target text embedded vector through a pooling layer to obtain a text maximum pooling vector;

step S602, carrying out average pooling treatment on the target text embedded vector through a pooling layer to obtain a text average pooling vector;

and step S603, performing splicing processing on the text maximum pooling vector and the text average pooling vector to obtain a target text pooling vector.

In step S601 of some embodiments, the pooling layer performs maximum pooling processing on the target text embedded vector, captures global features of the target question-answer text, and obtains overall text information of the target question-answer text, thereby obtaining a text maximum pooling vector.

In step S602 of some embodiments, the pooling layer performs average pooling processing on the target text embedded vector, focuses on more text details, captures local features of the target question-answer text, and obtains local text information of the target question-answer text to obtain a text average pooling vector.

In step S603 of some embodiments, the stitching processing performed on the text maximum pooling vector and the text average pooling vector may be vector addition performed on the text maximum pooling vector and the text average pooling vector, so as to obtain a target text pooling vector.

Through the steps S601 to S603, not only the whole text characteristics of the target question-answer text but also the local text characteristics of the target question-answer text can be captured, and the comprehensiveness of text characteristic acquisition is improved.
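Steps S601 to S603 can be sketched as follows, assuming the target text embedded vector is given as one embedding per token and pooling is taken over the token dimension. The function names and the toy embeddings are hypothetical. Both combination options mentioned in the text, splicing and vector addition, are shown.

```python
def max_pool(embeddings):
    """Element-wise maximum over all token embeddings."""
    return [max(column) for column in zip(*embeddings)]


def avg_pool(embeddings):
    """Element-wise mean over all token embeddings."""
    return [sum(column) / len(column) for column in zip(*embeddings)]


def combine(max_vec, avg_vec, mode="concat"):
    """Combine the pooled vectors by concatenation or by vector addition."""
    if mode == "concat":
        return max_vec + avg_vec
    if mode == "add":
        return [a + b for a, b in zip(max_vec, avg_vec)]
    raise ValueError("mode must be 'concat' or 'add'")


# Two tokens with two-dimensional embeddings, for illustration only.
embeddings = [[1.0, 4.0], [3.0, 2.0]]
text_max = max_pool(embeddings)
text_avg = avg_pool(embeddings)
pooled_concat = combine(text_max, text_avg)
pooled_add = combine(text_max, text_avg, mode="add")
```

Because both pooled vectors have the same fixed length, either combination yields a vector whose length does not depend on the number of tokens.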

Referring to fig. 7, in some embodiments, step S105 may further include, but is not limited to, steps S701 to S702:

step S701, performing descending order arrangement on intention category labels according to an intention predicted value to obtain a predicted intention sequence;

step S702, screening the predicted intention sequence according to a preset screening condition to obtain a target intention type label.

In step S701 of some embodiments, the intent prediction value of each intent category label is compared, and all the intent category labels are arranged in descending order according to the intent prediction value, so as to obtain a predicted intent sequence.

In step S702 of some embodiments, the preset screening condition may be set according to the actual situation; for example, the screening condition may be to select the intention category labels in the top m positions of the predicted intention sequence. For instance, the intention category label ranked first in the predicted intention sequence is extracted as the target intention category label.
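A minimal sketch of steps S701 and S702, assuming the intention prediction values are held in a mapping from label to value; the names `rank_intents` and `select_target_labels` and the sample values are hypothetical.

```python
def rank_intents(prediction_values):
    """Sort (label, value) pairs in descending order of intention prediction value."""
    return sorted(prediction_values.items(), key=lambda item: item[1], reverse=True)


def select_target_labels(prediction_values, m=1):
    """Return the labels in the top m positions of the predicted intention sequence."""
    return [label for label, _ in rank_intents(prediction_values)[:m]]


# Hypothetical intention prediction values.
values = {"weather": 0.7, "greeting": 0.1, "billing": 0.2}
first = select_target_labels(values, m=1)
top_two = select_target_labels(values, m=2)
```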

In step S106 of some embodiments, all the target process nodes corresponding to the target intention category labels are traversed, the target node information is extracted, and the target question-answer text is screened according to the association degree of the target node information and the target question-answer text, so as to obtain target intention data.

According to the intention recognition method of the embodiments of the present application, voice question-answer data to be recognized is acquired. Voice recognition processing is then performed on the voice question-answer data to obtain candidate question-answer texts, and corpus feature extraction is performed on the candidate question-answer texts through a preset feature extraction model to obtain target question-answer texts, which can reduce errors in converting voice data into text data, increase the richness and contrast of the target question-answer texts, and improve the accuracy of intention recognition. Furthermore, intention prediction is performed on the target question-answer text through a preset intention prediction model and intention type labels to obtain an intention prediction value corresponding to each intention type label, so that the sentence semantic features in the target question-answer text can be better captured when the intention prediction value of each intention type label is calculated. Finally, the intention type labels are sorted according to the intention prediction values to obtain target intention type labels, and the target question-answer text is screened according to the target intention type labels to obtain target intention data, which can improve the accuracy of intention recognition.

Referring to fig. 8, an embodiment of the present application further provides an intention recognition device, which may implement the above-mentioned intention recognition method, where the device includes:

A data acquisition module 801, configured to acquire voice question-answer data to be identified;

the voice recognition module 802 is configured to perform voice recognition processing on the voice question-answering data to obtain a candidate question-answering text;

the feature extraction module 803 is configured to perform corpus feature extraction on the candidate question-answer text through a preset feature extraction model, so as to obtain a target question-answer text;

the intention prediction module 804 is configured to predict the intention of the target question-answer text according to a preset intention prediction model and intention type labels, so as to obtain an intention prediction value corresponding to each intention type label;

the sorting module 805 is configured to sort the intent category labels according to the intent prediction value to obtain target intent category labels;

and a screening module 806, configured to screen the target question-answer text according to the target intention category label, so as to obtain target intention data.

In some embodiments, the speech recognition module 802 includes:

the voice recognition unit is used for carrying out voice recognition processing on the voice question-answering data through a preset voice recognition model to obtain an initial question-answering text;

and the semantic completion unit is used for carrying out semantic completion on the initial question-answering text to obtain candidate question-answering text.

In some embodiments, the feature extraction model includes a convolution layer and a first fully-connected layer, and the feature extraction module 803 includes:

the convolution processing unit is used for carrying out convolution processing on the candidate question-answer text through the convolution layer to obtain the question-answer text convolution characteristics;

the translation score calculating unit is used for calculating the translation score of the question-answer text convolution characteristic through a preset algorithm of the first full-connection layer to obtain the translation score;

and the text screening unit is used for screening the candidate question-answer text according to the translation scores to obtain the target question-answer text.

In some embodiments, the intent prediction model includes an embedded layer, a pooled layer, a Bi-LSTM layer, and a second fully connected layer, the intent prediction module 804 includes:

the word embedding processing unit is used for carrying out word embedding processing on the target question-answering text through the embedding layer to obtain a target text embedding vector;

the pooling processing unit is used for pooling the target text embedded vector through the pooling layer to obtain a target text pooling vector;

the coding processing unit is used for coding the target text pooling vector through the Bi-LSTM layer to obtain a target text hidden variable;

and the intention prediction unit is used for carrying out intention prediction on the target text hidden variable through the prediction function and the intention type label of the second full-connection layer to obtain an intention prediction value corresponding to each intention type label.

In some embodiments, the ordering module 805 includes:

the descending order arrangement unit is used for descending order arrangement of the intention category labels according to the intention predicted value to obtain a predicted intention sequence;

the tag screening unit is used for screening the predicted intention sequence according to preset screening conditions to obtain a target intention type tag.

In other embodiments, the intent recognition device further includes an intent prediction model training module, the intent prediction model training module including:

a sample data acquisition unit for acquiring sample intention voice data;

the format conversion unit is used for carrying out format conversion on the sample intention voice data to obtain sample intention text data;

the sample data processing unit is used for inputting sample intention text data into the initial model, and carrying out word embedding processing, pooling processing and encoding processing on the sample intention text data through the initial model to obtain a sample intention hidden variable;

the calculating unit is used for calculating a sample intention predicted value corresponding to each preset intention type label through a loss function and a sample intention hidden variable of the initial model;

the model optimizing unit is used for calculating a model loss value of the initial model according to a preset cross entropy algorithm and a sample intention prediction value, and optimizing a loss function according to the model loss value so as to update the initial model and obtain the intention prediction model.

The specific implementation of the intention recognition device is basically the same as the specific embodiment of the intention recognition method, and will not be repeated here.

The embodiment of the application also provides electronic equipment, which comprises: the device comprises a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for realizing connection communication between the processor and the memory, wherein the program realizes the intention recognition method when being executed by the processor. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.

Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:

the processor 901, which may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;

the memory 902, which may be implemented in the form of read-only memory (ROM), a static storage device, a dynamic storage device, or random access memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 902 and is invoked by the processor 901 to execute the intention recognition method of the embodiments of the present application;

An input/output interface 903 for inputting and outputting information;

the communication interface 904 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);

a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);

wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.

The embodiment of the application also provides a storage medium, which is a computer readable storage medium and is used for computer readable storage, the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the intention recognition method.

The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

According to the intention recognition method, the intention recognition device, the electronic equipment and the storage medium of the embodiments of the present application, voice question-answer data to be recognized is acquired. Voice recognition processing is then performed on the voice question-answer data to obtain candidate question-answer texts, and corpus feature extraction is performed on the candidate question-answer texts through a preset feature extraction model to obtain target question-answer texts, which can reduce errors in converting voice data into text data, increase the richness and contrast of the target question-answer texts, and improve the accuracy of intention recognition. Furthermore, intention prediction is performed on the target question-answer text through a preset intention prediction model and intention type labels to obtain an intention prediction value corresponding to each intention type label, so that the sentence semantic features in the target question-answer text can be better captured when the intention prediction value of each intention type label is calculated. Finally, the intention type labels are sorted according to the intention prediction values to obtain target intention type labels, and the target question-answer text is screened according to the target intention type labels to obtain target intention data, which can improve the accuracy of intention recognition.

The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.

It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 are not limiting to embodiments of the present application and may include more or fewer steps than shown, or certain steps may be combined, or different steps.

The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of a single item or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in whole or in part, may be embodied in the form of a software product that is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

Preferred embodiments of the present application are described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims

1. A method of intent recognition, the method comprising:

acquiring voice question-answering data to be recognized;

corpus feature extraction is carried out on the candidate question-answering text through a preset feature extraction model, and a target question-answering text is obtained;

screening the target question-answering text according to the target intention type label to obtain target intention data;

the intention prediction model comprises an embedding layer, a pooling layer, a Bi-LSTM layer and a second full-connection layer, and the step of carrying out intention prediction on the target question-answer text through a preset intention prediction model and intention type labels to obtain intention prediction values corresponding to each intention type label comprises the following steps:

carrying out intention prediction on the target text hidden variable through a prediction function of the second full-connection layer and the intention type labels to obtain an intention prediction value corresponding to each intention type label;

the step of performing pooling processing on the target text embedded vector through a pooling layer to obtain a target text pooled vector comprises the following steps:

performing splicing processing on the text maximum pooling vector and the text average pooling vector to obtain the target text pooling vector;

before the step of carrying out intention prediction on the target question-answer text through a preset intention prediction model and intention type labels to obtain an intention prediction value corresponding to each intention type label, the method further comprises the step of pre-training the intention prediction model, and specifically comprises the following steps:

acquiring sample intention voice data;

2. The method for recognizing intention as claimed in claim 1, wherein the step of performing a voice recognition process on the voice question-answer data to obtain a candidate question-answer text comprises:

3. The method for identifying an intention according to claim 1, wherein the feature extraction model includes a convolution layer and a first full-connection layer, and the step of extracting corpus features of the candidate question-answer text through a preset feature extraction model to obtain a target question-answer text includes:

4. The method for identifying an intention according to any one of claims 1 to 3, wherein the step of sorting the intention category labels according to the intention prediction value to obtain target intention category labels comprises:

5. An intent recognition device, the device comprising:

the voice recognition module is used for carrying out voice recognition processing on the voice question-answering data to obtain candidate question-answering texts;

the screening module is used for screening the target question-answer text according to the target intention category label to obtain target intention data;

the intention prediction model includes an embedding layer, a pooling layer, a Bi-LSTM layer, and a second full-connection layer, and the intention prediction module is further configured to:

the process of pooling the target text embedded vector by the pooling layer to obtain a target text pooled vector comprises the following steps:

the intention prediction model is trained by the following modes:

acquiring sample intention voice data;

6. An electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling a communication connection between the processor and the memory, the program, when executed by the processor, implementing the steps of the intention recognition method as claimed in any one of claims 1 to 4.

7. A storage medium, which is a computer-readable storage medium, for computer-readable storage, characterized in that the storage medium stores one or more programs executable by one or more processors to implement the steps of the intention recognition method of any one of claims 1 to 4.
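
As a non-limiting illustration only, two operations recited above can be sketched in Python: splicing the text maximum pooling vector and the text average pooling vector into the target text pooling vector, and sorting the intention category labels according to their intention prediction values. The sketch rests on several assumptions that the claims do not state: the function names `concat_max_avg_pool` and `rank_intent_labels` are hypothetical, pooling is taken over the token axis of the embedded text, and the labels are sorted in descending order of prediction value. It is not the trained model of the embodiments.

```python
from typing import Dict, List, Sequence


def concat_max_avg_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Max-pool and average-pool over the token axis, then concatenate.

    Assumption: each inner sequence is the embedding of one token, and all
    embeddings have the same dimension.
    """
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    # Transpose so that each tuple holds one embedding dimension across tokens.
    columns = list(zip(*token_vectors))
    max_pooled = [max(col) for col in columns]
    avg_pooled = [sum(col) / len(col) for col in columns]
    # Splice the two pooled vectors into one vector of twice the dimension.
    return max_pooled + avg_pooled


def rank_intent_labels(scores: Dict[str, float]) -> List[str]:
    """Sort intention category labels by prediction value.

    Assumption: higher prediction values rank first.
    """
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    pooled = concat_max_avg_pool([[1.0, 4.0], [3.0, 2.0]])
    print(pooled)
    ranked = rank_intent_labels({"inquiry": 0.1, "complaint": 0.7, "renewal": 0.2})
    print(ranked[0])  # candidate target intention category label
```

In this sketch the spliced vector has twice the embedding dimension, so any layer consuming it (for example the Bi-LSTM layer) would need an input size chosen accordingly. The label names in the usage example are placeholders.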