CN113326351A - User intention determining method and device - Google Patents


Info

Publication number
CN113326351A
CN113326351A (application CN202110671730.4A)
Authority
CN
China
Prior art keywords
intention, target, preset, vertical domain, sub-function
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110671730.4A
Other languages
Chinese (zh)
Inventor
Li Linfeng (李林峰)
Huang Hairong (黄海荣)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ecarx Hubei Tech Co Ltd
Original Assignee
Hubei Ecarx Technology Co Ltd
Application filed by Hubei Ecarx Technology Co Ltd
Priority to CN202110671730.4A
Publication of CN113326351A
Legal status: Pending

Classifications

    • G06F 16/3346: Query execution using a probabilistic model
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/3343: Query execution using phonetics
    • G06F 16/3344: Query execution using natural language analysis
    • G06F 16/35: Clustering; classification of unstructured textual data
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/08: Learning methods
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a user intention determining method and device. The method includes: performing character conversion on voice information to be recognized to obtain target characters corresponding to the voice information; inputting the target characters into a pre-trained user intention recognition model; extracting target features of the target characters based on the model, and acquiring a first probability that the target features map to each preset vertical domain intention and a second probability that the target features map to each preset sub-function intention, where each preset sub-function intention belongs to one preset vertical domain intention; and determining the user intention corresponding to the voice information based on the correspondence between the preset vertical domain intention with the highest first probability and the preset sub-function intention with the highest second probability. The method reduces the size and the amount of calculation of the model network used for recognizing vertical domain intentions and sub-function intentions.

Description

User intention determining method and device
Technical Field
The invention relates to the technical field of deep learning, and in particular to a user intention determining method and device.
Background
At present, in many human-computer voice interaction scenarios, user voice information needs to be recognized to obtain the user intention, so that a service matching that intention can be provided. For example, an in-vehicle human-computer voice interaction system may receive the voice information of people in the vehicle and recognize the user intention from it: on receiving the voice message "I want to listen to music", the system recognizes that the user intention is to listen to music, and plays music for the user accordingly.
In the prior art, a neural network classification model is generally used to recognize the user intention from user voice information, and the user intention is generally divided into a vertical domain intention and several sub-function intentions under that vertical domain. A sub-function intention is a refined subcategory of a vertical domain intention. For example, if the vertical domain intentions include navigation and music, the sub-function intentions include their subcategories: navigating to a destination, navigating to a destination via highway, navigating to a destination via non-highway roads, playing a certain singer's music, playing a certain singer's song, and the like.
At present, to recognize a user intention, a classification model for vertical domain intentions and a classification model for the sub-function intentions under each vertical domain are generally trained separately, and both models are applied to the user voice information, yielding two user intentions. Because two independent classification models are needed, the model network is large in scale and in amount of calculation.
Disclosure of Invention
The embodiment of the invention aims to provide a user intention determining method and device, so as to reduce the scale and the amount of calculation of the model network for recognizing vertical domain intentions and sub-function intentions.
In order to achieve the above object, an embodiment of the present invention provides a method for determining a user intention, including:
performing character conversion on voice information to be recognized to obtain target characters corresponding to the voice information to be recognized;
inputting the target characters into a pre-trained user intention recognition model;
extracting target features of the target characters based on the user intention recognition model, and acquiring a first probability of each preset vertical domain intention mapped to the target features and a second probability of each preset sub-function intention mapped to the target features; wherein each preset subfunction intention belongs to a preset vertical domain intention;
and determining the user intention corresponding to the voice information to be recognized based on the corresponding relation between the preset vertical domain intention corresponding to the highest first probability and the preset sub-function intention corresponding to the highest second probability.
Optionally, the determining, based on a correspondence between a preset vertical domain intention corresponding to the highest first probability and a preset sub-function intention corresponding to the highest second probability, a user intention corresponding to the voice information to be recognized includes:
acquiring a target vertical domain intention according to the first probability and acquiring a first target sub-function intention according to the second probability, wherein the first target sub-function intention is a preset sub-function intention corresponding to the highest second probability, and the target vertical domain intention is a preset vertical domain intention corresponding to the highest first probability;
judging whether a preset vertical domain intention to which the first target sub-function intention belongs is consistent with the target vertical domain intention;
and if so, determining the first target sub-function intention as a target user intention.
Optionally, after the judging whether the preset vertical domain intention to which the first target sub-function intention belongs is consistent with the target vertical domain intention, the method further includes:
and if not, acquiring a second target sub-function intention from the preset sub-function intentions belonging to the target vertical domain intention according to the second probabilities, and determining the second target sub-function intention as the target user intention, wherein the second target sub-function intention is the preset sub-function intention corresponding to the highest second probability among the preset sub-function intentions belonging to the target vertical domain intention.
Optionally, the pre-trained user intention recognition model includes: a character feature extraction network, a vertical domain intention recognition network and a sub-function intention recognition network;
the character feature extraction network is used for extracting target features of the input target characters and respectively inputting the target features into the vertical domain intention recognition network and the sub-function intention recognition network;
the vertical domain intention recognition network obtains a first probability that the target characteristics belong to each preset vertical domain intention based on the target characteristics;
and the sub-function intention recognition network obtains a second probability that the target feature belongs to each preset sub-function intention based on the target feature.
Optionally, the character feature extraction network includes:
the input layer is used for acquiring a character index of the target character based on a preset character index library;
the word embedding layer is used for acquiring a target word vector corresponding to the target word based on the word index of the target word and a preset word vector library, and the preset word vector library stores the corresponding relation between the word index and the word vector;
the convolution layer, which includes a first convolution kernel corresponding to a first preset word length, a second convolution kernel corresponding to a second preset word length and a third convolution kernel corresponding to a third preset word length, and which acquires first word vector features corresponding to the target characters based on the first convolution kernel, second word vector features based on the second convolution kernel, and third word vector features based on the third convolution kernel;
the activation layer corresponding to each convolution kernel, which respectively performs nonlinear processing on the first word vector features, the second word vector features and the third word vector features;
the pooling layer corresponding to each activation layer, which performs down-sampling processing on the nonlinearly processed first word vector features, second word vector features and third word vector features;
and the fusion layer is used for splicing and fusing the first word vector feature, the second word vector feature and the third word vector feature which are subjected to down-sampling processing to obtain a target feature.
Optionally, the vertical domain intention recognition network comprises a vertical domain full-connection layer and a vertical domain normalization layer,
the vertical domain full-connection layer calculates the probability of mapping the target features to each preset vertical domain intention;
and the vertical domain normalization layer is used for performing normalization processing on the probabilities of the target features mapped to the preset vertical domain intents to obtain first probabilities of the target features belonging to the preset vertical domain intents.
Optionally, the sub-function intent recognition network includes a sub-function full connection layer and a sub-function normalization layer;
the sub-function full-link layer calculates the probability of mapping the target feature to each preset sub-function intention;
and the sub-function normalization layer is used for performing normalization processing on the probability of mapping the target feature to each preset sub-function intention to obtain a second probability of the target feature belonging to each preset sub-function intention.
Optionally, the training mode of the user intention recognition model includes:
acquiring sample voice information and a sample vertical domain intention label and a sample subfunction intention label corresponding to the sample voice information;
inputting the sample voice information into a user intention recognition model to be trained to obtain a predicted vertical domain intention and a predicted subfunction intention;
calculating a first loss value based on the predicted vertical domain intention and the sample vertical domain intention label; and calculating a second loss value based on the predicted sub-function intention and the sample sub-function intention label;
calculating a loss value of a user intention recognition model to be trained based on the first loss value and the second loss value;
judging whether the loss value is smaller than a preset loss threshold value or not, and if so, determining that the training of the user intention recognition model is finished; if not, updating the parameters of the user intention recognition model to be trained, and returning to execute the steps of obtaining the sample voice information, and a sample vertical domain intention label and a sample sub-function intention label corresponding to the sample voice information;
the parameters of the user intention recognition model to be trained comprise: the character feature extraction network parameters, the vertical domain intention identification network parameters and the subfunction intention identification network parameters.
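As an illustration of this training procedure, a minimal sketch in Python follows. It is not part of the original disclosure: PyTorch is assumed, the model is assumed to return the two sets of unnormalized scores, and the equal weighting of the two loss values is an assumption, since the disclosure only states that the model loss is calculated from the first and second loss values.

```python
import torch.nn.functional as F

def train(model, optimizer, data_loader, loss_threshold):
    while True:
        for texts, domain_labels, subfunc_labels in data_loader:
            # Predicted vertical domain intention and predicted sub-function
            # intention from the same user intention recognition model.
            domain_logits, subfunc_logits = model(texts)

            # First loss value: predicted vertical domain intention vs. sample label.
            first_loss = F.cross_entropy(domain_logits, domain_labels)
            # Second loss value: predicted sub-function intention vs. sample label.
            second_loss = F.cross_entropy(subfunc_logits, subfunc_labels)

            # Loss of the model to be trained, calculated from both loss
            # values (the equal weighting here is an assumption).
            loss = first_loss + second_loss

            if loss.item() < loss_threshold:
                return model  # training is finished

            # Otherwise update all parameters: the character feature extraction
            # network, the vertical domain network and the sub-function network.
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Because the two heads share one feature extraction network, a single backward pass updates all three parameter groups at once, which is the source of the reduction in network scale described above.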
In order to achieve the above object, an embodiment of the present invention further provides a user intention determining apparatus, including:
the character conversion module is used for performing character conversion on the voice information to be recognized to obtain target characters corresponding to the voice information to be recognized;
the model input module is used for inputting the target characters into a pre-trained user intention recognition model;
a probability obtaining module, configured to extract a target feature of the target text based on the user intention recognition model, and obtain a first probability of each preset vertical domain intention mapped to the target feature and a second probability of each preset sub-function intention mapped to the target feature; wherein each preset subfunction intention belongs to a preset vertical domain intention;
and the intention determining module is used for determining the user intention corresponding to the voice information to be recognized based on the corresponding relation between the preset vertical domain intention corresponding to the highest first probability and the preset sub-function intention corresponding to the highest second probability.
In order to achieve the above object, an embodiment of the present invention further provides an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of any of the above user intention determining methods when executing a program stored on the memory.
In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of any of the above user intention determining methods.
To achieve the above object, an embodiment of the present invention further provides a computer program product containing instructions which, when run on a computer, cause the computer to execute any of the above user intention determining methods.
The embodiment of the invention has the following beneficial effects:
With the method provided by the embodiment of the invention, character conversion is performed on the voice information to be recognized to obtain the corresponding target characters; the target characters are input into a pre-trained user intention recognition model, and the target features of the target characters are extracted; based on the model, a first probability that the target features map to each preset vertical domain intention and a second probability that the target features map to each preset sub-function intention are acquired; and the user intention corresponding to the voice information is determined based on the correspondence between the preset vertical domain intention with the highest first probability and the preset sub-function intention with the highest second probability. The method can thus determine both the preset vertical domain intention and the preset sub-function intention with the same user intention recognition model, and from them the user intention corresponding to the voice information to be recognized.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a user intention determining method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a process for determining a user's intent corresponding to a speech message to be recognized;
FIG. 3 is a block diagram of a user intent recognition model provided in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of a text feature extraction network in a user intent recognition model;
FIG. 5 is a diagram illustrating convolution of a target word vector of a target word by a convolution layer;
FIG. 6 is a schematic diagram of the input and output of the relu function;
FIG. 7a is a schematic diagram of a vertical domain intention recognition network in the user intention recognition model;
FIG. 7b is a schematic diagram of a sub-functional intent recognition network in the user intent recognition model;
FIG. 8 is another block diagram of a user intent recognition model provided in accordance with an embodiment of the present invention;
FIG. 9 is a flowchart of a training method for a user intention recognition model according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a user intent determination apparatus according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments given herein without creative effort fall within the scope of the invention.
In order to reduce the scale and the calculation amount of a model network for identifying a vertical domain intention and a sub-function intention, the embodiment of the invention provides a user intention determining method and device.
Referring to fig. 1, fig. 1 is a flowchart of a user intention determining method provided by an embodiment of the present invention, which includes:
step 101, performing character conversion on the voice information to be recognized to obtain target characters corresponding to the voice information to be recognized.
Step 102, inputting the target characters into a pre-trained user intention recognition model.
Step 103, extracting target features of the target characters based on the user intention recognition model, and acquiring a first probability that the target features map to each preset vertical domain intention and a second probability that the target features map to each preset sub-function intention; wherein each preset sub-function intention belongs to one preset vertical domain intention.
And step 104, determining the user intention corresponding to the voice information to be recognized based on the corresponding relation between the preset vertical domain intention corresponding to the highest first probability and the preset sub-function intention corresponding to the highest second probability.
The user intention recognition model is obtained by training a user intention recognition model to be trained based on a plurality of pieces of sample voice information and the sample vertical domain intention label and sample sub-function intention label corresponding to each piece of sample voice information.
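For a concrete picture of steps 101 to 104, a minimal end-to-end sketch follows. It is illustrative Python only: speech_to_text, model and the probability dictionaries are hypothetical stand-ins, and the correspondence check of step 104 is detailed with reference to fig. 2 below.

```python
def determine_user_intention(speech, speech_to_text, model):
    text = speech_to_text(speech)              # step 101: character conversion
    domain_probs, subfunc_probs = model(text)  # steps 102-103: first and second probabilities

    target_domain = max(domain_probs, key=domain_probs.get)     # highest first probability
    target_subfunc = max(subfunc_probs, key=subfunc_probs.get)  # highest second probability

    # Step 104: the final user intention depends on the correspondence
    # between these two intentions (resolved as described for fig. 2).
    return target_domain, target_subfunc
```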
In various application scenarios of human-computer interaction by voice, the user intentions expressed by voice information can be classified according to the application scenario, specifically into vertical domain intention classes and sub-function intention classes, where each vertical domain intention can be further subdivided to obtain several sub-function intentions. For example, in a vehicle-mounted voice interaction system, the preset vertical domain intentions and preset sub-function intentions that users' voice information may correspond to can be set in advance; Table 1 shows such a classification:
table 1: corresponding table of preset vertical domain intention and preset sub-function intention
Voice information Preset vertical domain intention Presetting subfunction intentions
Playing classical piano music Music Classical piano music
Coming a song of Liu De Hua Music Singer Liudebhua song
Navigation to people square Navigation To people square
I want to go to Nanjing road Navigation To Nanjing road
Today will not rain Weather (weather) Conditions of rainfall
How much air quality is in Beijing today Weather (weather) Air quality
As can be seen from Table 1, the preset vertical domain intentions that a user's voice information may correspond to include "Music", "Navigation" and "Weather", and the preset sub-function intentions include "Classical piano music", "Songs by singer Liu Dehua", "To People's Square", "To Nanjing Road", "Rainfall conditions" and "Air quality". Also, each preset sub-function intention belongs to one preset vertical domain intention; for example, the preset sub-function intentions "Classical piano music" and "Songs by singer Liu Dehua" both belong to the preset vertical domain intention "Music".
For different application scenarios, the preset vertical domain intentions and preset sub-function intentions that voice information may correspond to can be stored in advance and used to recognize the user intention corresponding to the voice information during voice interaction.
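Such a stored correspondence can be pictured as a simple lookup table. The following Python sketch is illustrative only, with hypothetical intent names mirroring Table 1:

```python
# Hypothetical lookup table mirroring Table 1: each preset sub-function
# intention belongs to exactly one preset vertical domain intention.
SUBFUNCTION_TO_DOMAIN = {
    "classical piano music": "music",
    "songs by singer Liu Dehua": "music",
    "to People's Square": "navigation",
    "to Nanjing Road": "navigation",
    "rainfall conditions": "weather",
    "air quality": "weather",
}
```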
With the method provided by the embodiment of the invention, character conversion is performed on the voice information to be recognized to obtain the corresponding target characters; the target characters are input into a pre-trained user intention recognition model, and the target features of the target characters are extracted; based on the model, a first probability that the target features map to each preset vertical domain intention and a second probability that the target features map to each preset sub-function intention are acquired; and the user intention corresponding to the voice information is determined based on the correspondence between the preset vertical domain intention with the highest first probability and the preset sub-function intention with the highest second probability. The method can thus determine both the preset vertical domain intention and the preset sub-function intention with the same user intention recognition model, and from them the user intention corresponding to the voice information to be recognized.
Fig. 2 shows the flow of determining the user intention corresponding to the voice information to be recognized. Referring to fig. 2, step 104 may specifically include:
step 201, obtaining a target vertical domain intention and a first target subfunction intention.
The first target sub-function intention is a preset sub-function intention corresponding to the highest second probability, and the target vertical domain intention is a preset vertical domain intention corresponding to the highest first probability.
Step 202, determining whether the preset vertical domain intention to which the first target sub-function intention belongs is consistent with the target vertical domain intention, if so, executing step 203, and if not, executing step 204 or step 205.
Step 203, determining the first target sub-function intention as the target user intention.
And step 204, determining the target vertical domain intention as the target user intention.
And step 205, acquiring a second target sub-function intention, and determining the second target sub-function intention as a target user intention.
Wherein the second target sub-function intention is the preset sub-function intention corresponding to the highest second probability among the preset sub-function intentions belonging to the target vertical domain intention.
For example, referring to Table 1 above, suppose the determined target vertical domain intention is "Music" and the first target sub-function intention is "Songs by singer Liu Dehua". As can be seen from Table 1, "Songs by singer Liu Dehua" belongs to the preset vertical domain intention "Music", so the preset vertical domain intention to which the first target sub-function intention belongs is consistent with the target vertical domain intention, and "Songs by singer Liu Dehua" can be directly determined as the target user intention. The voice interaction system may then play Liu Dehua songs for the user according to the determined target user intention, such as Liu Dehua's "Forgetful Water".
Still referring to Table 1, suppose instead the determined target vertical domain intention is "Music" but the first target sub-function intention is "To People's Square". As can be seen from Table 1, "To People's Square" belongs to the preset vertical domain intention "Navigation", so the preset vertical domain intention to which the first target sub-function intention belongs is inconsistent with the target vertical domain intention "Music". In that case, the target vertical domain intention "Music" may be determined as the target user intention, and the voice interaction system plays music for the user; alternatively, among the preset sub-function intentions belonging to "Music" ("Classical piano music" and "Songs by singer Liu Dehua"), the one with the highest second probability may be selected as the second target sub-function intention and determined as the target user intention. For example, if "Classical piano music" has the higher second probability of the two, "Classical piano music" is determined as the target user intention, and the voice interaction system plays piano music for the user.
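The decision logic of steps 201 to 205 can be sketched as follows (illustrative Python only; the plain probability dictionaries and the SUBFUNCTION_TO_DOMAIN mapping sketched above are assumptions):

```python
def resolve_user_intention(domain_probs, subfunc_probs, subfunc_to_domain):
    # Step 201: target vertical domain intention and first target
    # sub-function intention (highest first and second probabilities).
    target_domain = max(domain_probs, key=domain_probs.get)
    first_subfunc = max(subfunc_probs, key=subfunc_probs.get)

    # Step 202: does the preset vertical domain intention to which the first
    # target sub-function intention belongs match the target vertical domain?
    if subfunc_to_domain[first_subfunc] == target_domain:
        return first_subfunc  # step 203

    # Step 205 branch: among the sub-function intentions belonging to the
    # target vertical domain intention, take the one with the highest second
    # probability. (Step 204 would instead simply return target_domain.)
    candidates = {s: p for s, p in subfunc_probs.items()
                  if subfunc_to_domain[s] == target_domain}
    return max(candidates, key=candidates.get)
```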
With the method provided by the embodiment of the invention, the user intention recognition model is trained in advance, and the vertical domain intention recognition network and the sub-function intention recognition network branch out of the same trained model, so the target vertical domain intention and the first target sub-function intention corresponding to the voice information to be recognized can be recognized at the same time, and the user intention is then obtained from them. Compared with existing voice information recognition methods, this reduces the scale and the amount of calculation of the model network for recognizing vertical domain intentions and sub-function intentions.
In step 103 above, the following steps A1-A2 may specifically be adopted to extract the target features of the target characters:
step A1, extracting a plurality of word vector characteristics of a target character according to a plurality of preset character lengths;
and step A2, fusing the multiple word vector characteristics to obtain the target characteristics of the target characters.
Referring to fig. 3, fig. 3 is a structural diagram of a user intention recognition model according to an embodiment of the present invention, where the user intention recognition model includes: a text feature extraction network 310, a vertical domain intent recognition network 320, and a sub-function intent recognition network 330.
The character feature extraction network 310 extracts target features of the input target characters, and inputs the target features to the vertical domain intention recognition network and the sub-function intention recognition network, respectively.
The vertical domain intention identifying network 320 obtains a first probability that the target feature belongs to each preset vertical domain intention based on the target feature.
The sub-function intent recognition network 330 obtains a second probability that the target feature belongs to each preset sub-function intent based on the target feature.
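A hypothetical skeleton of this three-network structure, assuming PyTorch, might look as follows; the feature extractor stands in for the character feature extraction network detailed below, and the dimensions are the example values used later in the description:

```python
import torch.nn as nn

class UserIntentionModel(nn.Module):
    """Illustrative skeleton of the three-network structure of fig. 3."""

    def __init__(self, feature_extractor: nn.Module,
                 feature_dim: int = 384, num_domains: int = 3, num_subfunctions: int = 6):
        super().__init__()
        self.feature_extractor = feature_extractor          # network 310
        self.domain_head = nn.Sequential(                   # network 320
            nn.Linear(feature_dim, num_domains), nn.Softmax(dim=1))
        self.subfunc_head = nn.Sequential(                  # network 330
            nn.Linear(feature_dim, num_subfunctions), nn.Softmax(dim=1))

    def forward(self, char_indices):
        features = self.feature_extractor(char_indices)     # shared target features
        # First probabilities (per vertical domain) and second
        # probabilities (per sub-function) from the same features.
        return self.domain_head(features), self.subfunc_head(features)
```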
FIG. 4 is a block diagram of a network for extracting text features in a user intent recognition model. Specifically, referring to fig. 4, the text feature extraction network may include: an input layer 410, a word embedding layer 420, a convolutional layer 430, a plurality of activation layers 440, a plurality of pooling layers 450, and a fusion layer 460.
The input layer 410 obtains a text index of the target text based on a preset text index library.
In the embodiment of the present invention, the target characters are all of the characters included in the voice information to be recognized. For example, if the voice information to be recognized is "I want to go to Nanjing Road", then each character of that sentence ("I", "want", "go", "Nan", "jing", "Road") is a target character.
In the embodiment of the present invention, a preset character index library may be provided, which stores the correspondence between characters and indexes. The character index corresponding to each target character can be determined from this library, giving the target index array for all the target characters. For example, if the character indexes corresponding to the target characters are determined to be 3, 5, 8, 11, 14, 17 and 25, the target index array corresponding to all the target characters is [3, 5, 8, 11, 14, 17, 25].
In general, the number of target characters corresponding to a piece of voice information to be recognized does not exceed 70, so the maximum length of the target index array can be set to 70 characters; any excess is discarded, and if the number of target characters is less than 70, the array is filled with 0 up to the maximum length. For example, since the example above has fewer than 70 target characters, its target index array is filled with 0 until its length is 70: [3, 5, 8, 11, 14, 17, 25, 0, 0, ..., 0]. Of course, the maximum length of the target index array may also be set to 40, 50, 60 or another value, which is not limited here.
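A minimal sketch of this indexing and padding step (illustrative Python; the character index library is a hypothetical dictionary, and mapping unknown characters to 0 is an assumption):

```python
MAX_LEN = 70  # assumed maximum number of target characters

def text_to_indices(text, char_index_library):
    """Look up each target character in a hypothetical {character: index} library."""
    indices = [char_index_library.get(ch, 0) for ch in text]
    indices = indices[:MAX_LEN]                 # discard the excess portion
    indices += [0] * (MAX_LEN - len(indices))   # fill with 0 up to the maximum length
    return indices

# With the document's example indexes, the result would be
# [3, 5, 8, 11, 14, 17, 25, 0, 0, ..., 0] (length 70).
```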
The word embedding layer 420 obtains a target word vector corresponding to the target word based on the word index of the target word and a preset word vector library, where the preset word vector library stores a corresponding relationship between the word index and the word vector.
In the embodiment of the present invention, after the target index arrays corresponding to all target characters are obtained, the target index arrays may be input into the character embedding layer, and the character embedding layer obtains the target character vectors corresponding to the target characters according to the character indexes of each target character in the target index arrays.
In a possible implementation, a preset character vector library may be provided, which stores the correspondence between character indexes and character vectors. Each character vector may be multidimensional floating-point data; for example, a character vector may be represented by a one-dimensional array of 128 elements. According to the target index array, the target character vector corresponding to each target character can be looked up in the preset character vector library, yielding a [step, Dim]-dimensional matrix for all the target character vectors, where step is the maximum length of the target index array (e.g., 70) and Dim is the number of elements in each target character vector (e.g., 128). For example, if the maximum length of the target index array is 70 and each target character vector has 128 elements, a [70,128]-dimensional matrix corresponding to all the target character vectors is obtained.
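Continuing the sketch, the character embedding lookup might be expressed as follows (PyTorch assumed; the vocabulary size is a hypothetical value):

```python
import torch
import torch.nn as nn

VOCAB_SIZE, DIM = 6000, 128       # assumed character library size and vector size

embedding = nn.Embedding(VOCAB_SIZE, DIM)   # plays the role of the preset character vector library

indices = torch.tensor([[3, 5, 8, 11, 14, 17, 25] + [0] * 63])  # [1, 70] target index array
vectors = embedding(indices)      # [1, 70, 128]: the [step, Dim] matrix for one utterance
```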
And the convolution layer 430 comprises a first convolution kernel corresponding to the first preset word length, a second convolution kernel corresponding to the second word length and a third convolution kernel corresponding to the third preset word length, and is used for acquiring the first word vector characteristic corresponding to the target word based on the first convolution kernel, acquiring the second word vector characteristic corresponding to the target word based on the second convolution kernel and acquiring the third word vector characteristic corresponding to the target word based on the third convolution kernel.
The activation layer 440 corresponding to each convolution kernel performs nonlinear processing on the first word vector feature, the second word vector feature and the third word vector feature respectively;
a pooling layer 450 corresponding to each active layer; performing down-sampling processing on the first word vector characteristic, the second word vector characteristic and the third word vector characteristic after the nonlinear processing;
and the fusion layer 460 is used for splicing and fusing the first word vector feature, the second word vector feature and the third word vector feature after the down-sampling processing to obtain the target feature.
Specifically, in the embodiment of the present invention, the method for obtaining the target feature may include steps B1-B3:
step B1: the convolution layer 430 may perform convolution feature extraction on a plurality of continuous target characters of the first preset word length in the target characters as a whole based on a first convolution kernel corresponding to the first preset word length to obtain a first word vector feature corresponding to the target characters; on the basis of a second convolution kernel corresponding to a second preset word length, taking a plurality of continuous target words with the second preset word length in the target words as a whole to perform convolution feature extraction to obtain a second word vector feature corresponding to the target words; and performing convolution feature extraction on a plurality of continuous target characters with the third preset word length in the target characters as a whole based on a third convolution kernel corresponding to the third preset word length to obtain a third word vector feature corresponding to the target characters.
In one possible implementation, after the target character vectors are obtained, they may be input to the convolution layer in the character feature extraction network, and the convolution layer convolves the target character vectors to extract the word vector features corresponding to the target characters. In the embodiment of the present invention, the convolution layer may be a network structure based on a CNN (Convolutional Neural Network), and different convolution kernels can be set according to word length. The first preset word length may be set to 3, with a first convolution kernel given by a [3,32] array; the second preset word length may be set to 4, with a second convolution kernel given by a [4,32] array; and the third preset word length may be set to 5, with a third convolution kernel given by a [5,32] array. Each preset word length corresponds to the length of its convolution kernel (if the first preset word length is 3, the length of the [3,32] array of the first convolution kernel is 3), and likewise for the second and third preset word lengths. Of course, the preset word lengths may also be set to other values, for example 5, 6 and 7 respectively, which is not limited here.
In the embodiment of the present invention, the convolution layer in the character feature extraction network amplifies and extracts certain features in the target character vectors. For example, in NLP (Natural Language Processing), feature extraction with word lengths of 3, 4 and 5 may be used, extracting the target character vectors of 3, 4 or 5 consecutive target characters as the features of interest for subsequent processing. In this way 3 to 5 target characters can be viewed as a whole: if the extracted characters form a word or phrase they are treated as a unit, and if they are single characters their context is taken into account.
For example, each group of 3 consecutive target characters may be convolved as a whole by a first convolution kernel (a [3,32] array), producing a [68,1]-dimensional matrix. A plurality of first convolution kernels may be set, each convolving every group of 3 consecutive target characters; the number of first convolution kernels is the same as the number of elements in each target character vector (if each target character vector has 128 elements, 128 first convolution kernels may be set), so 128 [68,1] matrices are output, and the first word vector features are the [68,128] matrix obtained by splicing them. Similarly, each group of 4 consecutive target characters is convolved as a whole by a second convolution kernel (a [4,32] array), producing a [67,1]-dimensional matrix; with 128 second convolution kernels, 128 [67,1] matrices are output, and the second word vector features are the [67,128] matrix obtained by splicing them. Finally, each group of 5 consecutive target characters is convolved as a whole by a third convolution kernel (a [5,32] array), producing a [66,1]-dimensional matrix; with 128 third convolution kernels, 128 [66,1] matrices are output, and the third word vector features are the [66,128] matrix obtained by splicing them.
In the embodiment of the invention, considering 3, 4 or 5 consecutive target characters as a whole during feature extraction strengthens the relations among the target characters.
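A sketch of the three-kernel convolution follows (PyTorch assumed, continuing the embedding sketch above). For simplicity each preset word length is modeled here as a one-dimensional convolution with 128 output channels over the full 128-dimensional character vectors, which reproduces the [68,128], [67,128] and [66,128] feature shapes described above; treating the disclosure's [k,32] kernel arrays this way is an assumption:

```python
import torch.nn as nn

convs = nn.ModuleList([
    nn.Conv1d(in_channels=128, out_channels=128, kernel_size=k)
    for k in (3, 4, 5)          # first, second and third preset word lengths
])

x = vectors.transpose(1, 2)     # [1, 128, 70]: Conv1d slides along the character axis
word_vector_features = [conv(x) for conv in convs]
# shapes: [1, 128, 68], [1, 128, 67], [1, 128, 66]
```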
Specifically, fig. 5 is a schematic diagram of the convolution layer convolving the target character vectors. As shown in fig. 5, the target character vectors may be convolved using the formula output = weight × input + bias, where output is the feature value matrix 530 obtained by convolving the matrix 510 in fig. 5, input is the matrix 510 corresponding to the target character vectors, and weight is the convolution kernel matrix 520. As shown in fig. 5, convolving the shaded portion of the matrix with the convolution kernel gives the feature value (-1)×1 + 0×0 + 1×2 + (-1)×5 + 0×4 + 1×2 + (-1)×3 + 0×4 + 1×5 = 0.
The input target character vectors are in NHWC format (N denotes number, H height, W width, C channel): [batch, in_height, in_width, in_channels], where batch is the number of target character vectors processed per batch, in_height and in_width are the height and width of the target character vectors, and in_channels is the number of input channels.
The convolution kernel is in HWCN format: [filter_height, filter_width, in_channels, out_channels], where filter_height and filter_width are the height and width of the convolution kernel, and out_channels is the number of output channels.
The output convolved matrix is in NHWC format: [batch, output_height, output_width, out_channels], where output_height and output_width are the height and width of the output matrix. In embodiments of the invention there may be only one input channel and one output channel.
The height and width of the output vary according to:

height_out = (height_in - height_kernel + 2 × padding) / stride + 1
width_out = (width_in - width_kernel + 2 × padding) / stride + 1

where height_out and width_out are the height and width of the matrix obtained after convolution; height_in and width_in are the height and width of the input matrix; height_kernel and width_kernel are the height and width of the convolution kernel; padding is the filling strategy applied to the part by which the sliding convolution kernel exceeds the edge of the input data, usually defaulting to VALID (no excess included), in which case the size of the convolution output is (input_size - kernel_size)/stride + 1; the padding mode may also be SAME (the output has the same size as the input), in which case the edges are padded with 0; and stride is the step size, i.e., the distance the convolution kernel moves at each step.
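The VALID-padding case of this size rule is easy to check numerically against the output shapes quoted above (illustrative Python):

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # VALID padding corresponds to padding=0, giving
    # (input_size - kernel_size) / stride + 1 as stated in the text.
    return (input_size - kernel_size + 2 * padding) // stride + 1

assert conv_output_size(70, 3) == 68   # first preset word length
assert conv_output_size(70, 4) == 67   # second preset word length
assert conv_output_size(70, 5) == 66   # third preset word length
```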
Step B2: and respectively carrying out nonlinear processing on the first word vector characteristic, the second word vector characteristic and the third word vector characteristic based on the activation function of the activation layer.
In a possible implementation manner, after obtaining the first word vector feature, the second word vector feature, and the third word vector feature corresponding to the target text, the first word vector feature, the second word vector feature, and the third word vector feature may be respectively input to the activation layer corresponding to each convolution kernel, and the activation layer corresponding to each convolution kernel respectively performs activation processing on the first word vector feature, the second word vector feature, and the third word vector feature. Specifically, the first word vector feature, the second word vector feature, and the third word vector feature may be activated by using an activation function. The activation function may be any activation function applied to a neural network, such as relu (Rectified Linear Unit).
In the embodiment of the invention, the activation function is used for bringing nonlinear characteristics to the network of the user intention recognition model, and the activation processing can also be regarded as the nonlinear processing. The first word vector feature, the second word vector feature and the third word vector feature can be subjected to nonlinearity through an activation function, and the first word vector feature, the second word vector feature and the third word vector feature after nonlinear processing are obtained. The first word vector feature, the second word vector feature and the third word vector feature after the nonlinear processing have nonlinear characteristics. But the activation function does not change the dimensions of the first word vector feature, the second word vector feature, and the third word vector feature. Fig. 6 is a schematic diagram of input and output of the relu function, and as shown in fig. 6, the horizontal axis is input of the relu function (the first word vector feature, the second word vector feature, and the third word vector feature), and the vertical axis is output of the relu function (the first word vector feature, the second word vector feature, and the third word vector feature after the non-linear processing).
Step B3: and performing down-sampling processing on the first word vector feature, the second word vector feature and the third word vector feature after the nonlinear processing to obtain the first word vector feature, the second word vector feature and the third word vector feature after the down-sampling processing.
In a possible implementation manner, after the first word vector feature, the second word vector feature, and the third word vector feature after the nonlinear processing are obtained, the first word vector feature, the second word vector feature, and the third word vector feature after the nonlinear processing may be input into a pooling layer corresponding to each activation layer, and the first word vector feature, the second word vector feature, and the third word vector feature after the nonlinear processing are down-sampled by the pooling layer. The specific down-sampling mode may include: maximum pooling and average pooling.
For example, the pooling layer corresponding to each activation layer may max-pool the nonlinearly processed first, second and third word vector features: if these are a [68,128] matrix, a [67,128] matrix and a [66,128] matrix respectively, the pooling layers reduce them to three [1,128] matrices. Specifically, for each column of the [68,128] matrix the pooling layer selects the maximum value in the column to represent that column, giving a [1,128] matrix; similarly, for each column of the [67,128] matrix and each column of the [66,128] matrix the maximum value is selected, each giving a [1,128] matrix. The three [1,128] matrices obtained after pooling are taken as the down-sampled first, second and third word vector features respectively.
For example, the pooling layer may also perform average pooling on the first word vector feature, the second word vector feature and the third word vector feature after the nonlinear processing: if the obtained first, second and third word vector features after the nonlinear processing are a matrix of [68,128], a matrix of [67,128] and a matrix of [66,128] respectively, the pooling layer can likewise reduce them to three matrices of [1,128]. Specifically, the pooling layer may average the values in each column of the matrix of [68,128] to represent the value of that column, resulting in a matrix of [1,128]; similarly, it may average the values in each column of the matrix of [67,128] to obtain a matrix of [1,128], and average the values in each column of the matrix of [66,128] to obtain a matrix of [1,128]. The three [1,128] matrices obtained after the pooling are respectively taken as the first word vector feature, the second word vector feature and the third word vector feature after the down-sampling processing.
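A minimal sketch of the two down-sampling modes described above, assuming numpy matrices of the shapes used in the examples (the function names are illustrative):

```python
import numpy as np

def max_pool_columns(feature: np.ndarray) -> np.ndarray:
    # For each of the 128 columns, keep only the maximum value,
    # reducing e.g. a [68, 128] matrix to a [1, 128] matrix.
    return feature.max(axis=0, keepdims=True)

def average_pool_columns(feature: np.ndarray) -> np.ndarray:
    # For each column, keep the mean of its values instead.
    return feature.mean(axis=0, keepdims=True)

features = [np.random.randn(68, 128),
            np.random.randn(67, 128),
            np.random.randn(66, 128)]
pooled = [max_pool_columns(f) for f in features]
assert all(p.shape == (1, 128) for p in pooled)
```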
After the first word vector feature, the second word vector feature and the third word vector feature after the down-sampling processing are obtained, they can be spliced through a fusion layer 460 of the character feature extraction network to obtain the target feature. Specifically, the down-sampled first, second and third word vector features are spliced into a matrix of [1, K] dimensions as the target feature, where K is the sum of the lengths of the matrix corresponding to the down-sampled first word vector feature, the matrix corresponding to the down-sampled second word vector feature and the matrix corresponding to the down-sampled third word vector feature. For example, if the matrices corresponding to the three down-sampled word vector features are all [1,128], a matrix of [1,384] dimensions can be obtained by concatenation as the target feature.
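A sketch of the splicing performed by the fusion layer, assuming three down-sampled [1,128] matrices as above (the variable names are illustrative):

```python
import numpy as np

pooled = [np.random.randn(1, 128) for _ in range(3)]

# Splice the three [1, 128] matrices along the feature axis into
# a single [1, K] target feature, where K = 128 + 128 + 128 = 384.
target_feature = np.concatenate(pooled, axis=1)
assert target_feature.shape == (1, 384)
```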
Fig. 7a is a schematic structural diagram of a vertical domain intention recognition network in a user intention recognition model, as shown in fig. 7a, the vertical domain intention recognition network includes: a vertical domain full-link layer 710 and a vertical domain normalization layer 720.
Specifically, the vertical domain full-link layer 710 is configured to calculate a probability that the target feature is mapped to each preset vertical domain intention; and the vertical domain normalization layer 720 is configured to perform normalization processing on the probabilities of the target features mapped to the preset vertical domain intents, so as to obtain first probabilities of the target features belonging to the preset vertical domain intents.
In the embodiment of the invention, after the target characteristics are obtained, the target characteristics can be input into the vertical domain full-link layer, and the probability that the target characteristics belong to different preset vertical domain intentions is calculated by the vertical domain full-link layer based on the target characteristics.
For example, for a matrix with [1, K ] dimension of the target feature input to the vertical domain full-link layer, the vertical domain full-link layer may calculate the target feature ([1, K ] dimension matrix) by using the following formula:
Y1=X*W1+B1
wherein X represents a matrix of [1, K ] dimensions of the target feature; w1 represents weight parameters of a vertical domain full-link layer of the trained user intention recognition model, and the weight parameters can be in a matrix form, and the dimensionality of the weight parameters is [ K, domainNum ], wherein the domainNum is the number of preset vertical domain intentions; b1 is a bias parameter of the vertical domain fully-connected layer, which can be represented by an array, specifically a one-dimensional array [ domainNum ]; y1 is an output of the vertical domain full link layer, and may be a matrix with a dimension [1, domainNum ], where each element in the matrix is a probability that the target feature is mapped to each preset vertical domain intention.
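A minimal sketch of the computation Y1 = X*W1 + B1, assuming K = 384 and domainNum = 3 for illustration (the weights and bias here are random placeholders, not trained parameters):

```python
import numpy as np

K, domainNum = 384, 3

X = np.random.randn(1, K)            # target feature, [1, K]
W1 = np.random.randn(K, domainNum)   # trained weights, [K, domainNum]
B1 = np.zeros(domainNum)             # bias, one value per intention

# Y1 = X * W1 + B1: one un-normalized score per preset
# vertical domain intention, giving a [1, domainNum] matrix.
Y1 = X @ W1 + B1
assert Y1.shape == (1, domainNum)
```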
When training the user intention recognition model, a plurality of preset vertical domain intentions, such as the vertical domain intentions "music", "navigation", and "weather" in table 1, may be preset, and an index may be established for each preset vertical domain intention, for example, an index number of "music" may be 1, an index number of "navigation" may be 2, and an index number of "weather" may be 3. The vertical domain full-connection layer can also output the probability of mapping the target features to the indexes of the preset vertical domain intents, and the corresponding preset vertical domain intents can be found according to the indexes of the vertical domain intents.
In a possible implementation, the vertical domain full-link layer outputs domainNum floating-point numbers, for example C0, C1, …, C(domainNum-1). Each output floating-point number represents the probability that the target feature is mapped to the index of one preset vertical domain intention. By comparing the values of the floating-point numbers, the index of the preset vertical domain intention corresponding to the maximum value can be selected as the index of the target vertical domain intention corresponding to the target feature, and the preset vertical domain intention found according to the index of the target vertical domain intention corresponding to the target feature is taken as: the target vertical domain intention corresponding to the voice information to be recognized. For example, if the maximum value is Cn, the (n+1)-th preset vertical domain intention is taken as the target vertical domain intention corresponding to the voice information to be recognized.
Further, after obtaining the matrix of [1, domainNum ] output by the vertical domain full-link layer, the matrix may be normalized by a Softmax layer, and the sum of elements of the normalized matrix is 1, so as to perform probability statistics more conveniently; and respectively taking the normalized probabilities as first probabilities that the target features belong to the preset vertical domain intentions. The preset vertical domain intention with the largest first probability can be selected as the target vertical domain intention corresponding to the voice information to be recognized. For example, as shown in table 2, the preset vertical domain intentions may include "navigation", "music", and "news", and according to the first probability corresponding to each preset vertical domain intention, it may be determined that the first probability corresponding to "navigation" is the largest, and then "navigation" may be used as the target vertical domain intention corresponding to the voice information to be recognized.
Table 2: correspondence table between preset vertical domain intentions and first probabilities

[Table 2 is presented as an image in the original publication; it lists each preset vertical domain intention ("navigation", "music", "news", …) together with its first probability, with "navigation" having the largest first probability.]
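A sketch of the Softmax normalization and selection of the target vertical domain intention described above, with illustrative score values:

```python
import numpy as np

def softmax(y: np.ndarray) -> np.ndarray:
    # Subtract the maximum for numerical stability, exponentiate,
    # and divide by the sum so the normalized elements add up to 1.
    e = np.exp(y - y.max())
    return e / e.sum()

Y1 = np.array([[2.0, 0.5, -1.0]])        # scores for 3 vertical domains
first_probabilities = softmax(Y1)
assert abs(first_probabilities.sum() - 1.0) < 1e-9

# The preset vertical domain intention with the largest first
# probability is selected as the target vertical domain intention.
target_index = int(first_probabilities.argmax())
```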
Fig. 7b is a schematic structural diagram of a sub-function intent recognition network in the user intent recognition model, as shown in fig. 7b, the sub-function intent recognition network includes: a sub-function full connection layer 730 and a sub-function normalization layer 740.
The sub-function full-link layer 730 is used for calculating the probability of mapping the target features to each preset sub-function intention; and the sub-function normalization layer 740 is configured to perform normalization processing on the probabilities that the target feature is mapped to the preset sub-function intents, so as to obtain second probabilities that the target feature belongs to the preset sub-function intents.
In the embodiment of the invention, after the target feature is obtained, the target feature can be input into the sub-function full connection layer, and the probability that the target feature belongs to different preset sub-function intents is calculated by the sub-function full connection layer based on the target feature.
For example, the target feature input to the sub-function full-link layer is a matrix of [1, K] dimensions, and the sub-function full-link layer can calculate the target feature ([1, K] dimension matrix) by using the following formula:
Y2=X*W2+B2
wherein X represents a matrix of [1, K] dimensions of the target feature; W2 represents the weight parameters of the sub-function fully connected layer of the trained user intention recognition model, which may be in matrix form with dimensions [K, intentNum], where intentNum is the number of preset sub-function intents; B2 is a bias parameter of the sub-function fully-connected layer, which may be represented by an array, specifically a one-dimensional array [intentNum]; Y2 is the output of the sub-function full link layer, and may be a matrix with dimensions [1, intentNum], where each element in the matrix is the probability that the target feature is mapped to each preset sub-function intention.
When the user intention recognition model is trained, a plurality of preset sub-function intentions may be preset, such as the sub-function intentions "classical piano song", "singer bang hua song", "to people's square", "to Nanjing road", "rainfall situation" and "air quality" in table 1 above, and an index may be established for each preset sub-function intention; for example, index numbers 4, 5, 6, 7, 8 and 9 may be respectively established for "classical piano song", "singer bang hua song", "to people's square", "to Nanjing road", "rainfall situation" and "air quality". The sub-function full link layer may output the probability that the target feature is mapped to the index of each preset sub-function intention, and the corresponding preset sub-function intention can be found according to the index of the preset sub-function intention.
In one possible embodiment, the above-mentioned sub-function fully-connected layer outputs intentNum floating-point numbers, for example C0, C1, …, C(intentNum-1). Each output floating-point number represents the probability that the target feature is mapped to the index of one preset sub-function intention. By comparing the values of the floating-point numbers, the index of the preset sub-function intention corresponding to the maximum value can be selected as the index of the first target sub-function intention corresponding to the target feature, and the preset sub-function intention found according to the index of the first target sub-function intention is taken as: the first target sub-function intention corresponding to the voice information to be recognized. For example, if the maximum value is C(n-1), the first target sub-function intention corresponding to the voice information to be recognized is the n-th preset sub-function intention.
Further, after the matrix of [1, intentNum] output by the sub-function fully-connected layer is obtained, the matrix may be normalized by the Softmax layer so that the sum of its elements is 1, which makes probability statistics more convenient. The normalized probabilities are respectively taken as the second probabilities that the target feature belongs to each preset sub-function intention. The preset sub-function intention with the largest second probability can be selected as the first target sub-function intention corresponding to the voice information to be recognized. For example, as shown in table 3, the preset sub-function intents may include "broadcasting a song name", "navigating to a destination" and "news"; according to the second probability corresponding to each preset sub-function intention, it can be determined that the second probability corresponding to "broadcasting a song name" is the largest, so "broadcasting a song name" can be used as the first target sub-function intention corresponding to the voice information to be recognized.
Table 3: correspondence table between preset sub-function intentions and second probabilities

Preset sub-function intention | Second probability
Broadcasting a song name      | 0.6
Navigating to a destination   | 0.3
News                          | 0.08
…                             | …
By adopting the method provided by the embodiment of the invention, the vertical domain intention recognition network and the sub-function intention recognition network are expanded from the pre-trained user intention recognition model, so that the target vertical domain intention and the first target sub-function intention corresponding to the voice information can be recognized at the same time; compared with the existing voice information recognition method, the network scale and the calculation amount of the model for recognizing the vertical domain intention and the sub-function intention are reduced. Moreover, recognizing the target vertical domain intention and the first target sub-function intention at the same time avoids an extremely fine-grained classification model: with fine-grained classification alone, the number of classes is large, the degree of discrimination between classes is low, and classifications easily conflict. Combining the coarse-grained vertical domain category with the fine-grained sub-function category resolves the classification of the sub-function categories and avoids the conflict problem of classifying only by the fine-grained sub-function category.
Fig. 8 is another schematic structural diagram of the user intention recognition model according to the embodiment of the present invention.
Fig. 9 is a flowchart of a training method of a user intention recognition model according to an embodiment of the present invention, and as shown in fig. 9, specific steps may include:
step 901, obtaining sample voice information and a sample vertical domain intention label and a sample subfunction intention label corresponding to the sample voice information.
In the embodiment of the invention, a plurality of pieces of sample voice information can be obtained, and the corresponding sample vertical domain intention label and sample sub-function intention label can be predetermined for each piece of sample voice information. For example, for the sample voice information "navigate to people's square", the corresponding sample vertical domain intention label can be an array representing the probability that the sample voice information corresponds to each preset vertical domain intention. Referring to table 1, where the preset vertical domain intentions include music, navigation and weather, the sample vertical domain intention label corresponding to "navigate to people's square" can be the array [0,1,0]: the two 0s in [0,1,0] indicate that the probability of the sample voice information mapping to music is 0 and the probability of mapping to weather is 0, and the 1 indicates that the probability of mapping to navigation is 1. The corresponding sample sub-function intention label can be predetermined for each piece of sample voice information in the same way.
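A sketch of constructing such a label array, assuming the three preset vertical domain intentions of table 1 (the function name is illustrative):

```python
import numpy as np

preset_vertical_domains = ["music", "navigation", "weather"]

def one_hot_label(intention: str) -> np.ndarray:
    # Probability 1 for the labeled intention, 0 for all others.
    label = np.zeros(len(preset_vertical_domains))
    label[preset_vertical_domains.index(intention)] = 1.0
    return label

# "navigate to people's square" maps to the vertical domain "navigation".
sample_label = one_hot_label("navigation")   # -> array([0., 1., 0.])
```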
Step 902, inputting sample voice information into a user intention recognition model to be trained, and obtaining a predicted vertical domain intention and a predicted subfunction intention.
After the sample voice information is input into the user intention recognition model to be trained, the user intention recognition model to be trained can output a predicted vertical domain intention and a predicted sub-function intention corresponding to the sample voice information, wherein the predicted vertical domain intention can be represented by an array, and each element in the array represents the prediction probability of mapping the sample voice information to each preset vertical domain intention. The predicted sub-function intent may also be represented by an array, each element in the array representing a prediction probability that the sample speech information is mapped to a respective preset sub-function intent.
Step 903, calculating a first loss value based on the predicted vertical domain intention and the sample vertical domain intention label; and calculating a second loss value based on the predictor sub-function intent and the sample sub-function intent tag.
In one possible implementation, the first loss value may be calculated using the following cross-entropy calculation formula:
H1(p1, q1) = -∑ p1(xi) · log(q1(xi)), summed over i = 1 to n1

wherein H1(p1, q1) represents the first loss value; p1(xi) represents the array corresponding to the sample vertical domain intention label; q1(xi) represents the array corresponding to the predicted vertical domain intention; n1 is the number of elements in the output array, and i is the element index.
In one possible implementation, the second loss value may be calculated using the following cross-entropy calculation formula:
H2(p2, q2) = -∑ p2(xi) · log(q2(xi)), summed over i = 1 to n2

wherein H2(p2, q2) represents the second loss value; p2(xi) represents the array corresponding to the sample sub-function intention label; q2(xi) represents the array corresponding to the predicted sub-function intention; n2 is the number of elements in the output array, and i is the element index.
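A minimal sketch of the two cross-entropy loss values under the formulas above; the label and prediction arrays are illustrative, and a small epsilon is added inside the logarithm to avoid log(0):

```python
import numpy as np

def cross_entropy(label: np.ndarray, prediction: np.ndarray) -> float:
    # H(p, q) = -sum_i p(x_i) * log(q(x_i))
    eps = 1e-12
    return float(-np.sum(label * np.log(prediction + eps)))

# Hypothetical label and prediction for 3 preset vertical domains.
p1 = np.array([0.0, 1.0, 0.0])          # sample vertical domain label
q1 = np.array([0.2, 0.7, 0.1])          # predicted vertical domain
loss1 = cross_entropy(p1, q1)           # first loss value

# Hypothetical label and prediction for the sub-function intents.
p2 = np.array([1.0, 0.0, 0.0])
q2 = np.array([0.6, 0.3, 0.1])
loss2 = cross_entropy(p2, q2)           # second loss value
```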
Step 904, calculating a loss value of the user intention recognition model to be trained based on the first loss value and the second loss value.
In this step, the loss value can be calculated by the following formula:
LOSS=Loss1+λLoss2
the LOSS value of the user intention recognition model to be trained is LOSS1, the first LOSS value is Loss2, the weight of the balance Loss1 and Loss2 is lambda, and the lambda can be set to any value of 0.1-0.9.
Step 905, determining whether the loss value is smaller than a preset loss threshold, if so, executing step 906, and if not, executing step 907.
The preset loss threshold may be set to 0.1 or 0.2 according to the actual application, and is not specifically limited herein.
Step 906, determining that the training of the user intention recognition model is finished.
Step 907, updating the parameters of the user intention recognition model to be trained, and returning to execute step 901.
The parameters of the to-be-trained user intention recognition model may include: the character feature extraction network parameters, the vertical domain intention identification network parameters and the subfunction intention identification network parameters.
In the embodiment of the invention, the magnitude of the obtained loss value LOSS can be judged; if LOSS is not less than the preset loss threshold, the current user intention recognition model to be trained has not converged, and the parameters in the model need to be adjusted further. Since there are many parameters in the user intention recognition model to be trained and the dimension of each parameter is large, it is necessary to consider which parameters need to be preferentially adjusted and by what amplitude.
In a possible implementation manner, the parameter with the largest change rate with respect to the loss value LOSS among all parameters of the user intention recognition model to be trained can be found, and that parameter is adjusted in the opposite direction according to the change rate. Finding the rate of change is equivalent to finding the first derivative. For example, let LOSS be the loss (i.e., the error) and let yi be the output of the data in the neural network through the parameter wi; the partial derivative of LOSS with respect to the parameter wi is then:

∂LOSS/∂wi = (∂LOSS/∂yi) · (∂yi/∂wi)
the following calculation methods may be used to calculate the following characteristics: and extracting partial derivatives of parameters of the network layer, the vertical domain intention identification network layer and the sub-function intention identification network layer by character features, and finding out the parameter with the maximum change rate as a parameter for preferential adjustment.
Because there are many parameters and the dimension of each parameter is large, for each parameter the dimension with the largest variation — that is, the direction in which the gradient decreases fastest — is the direction in which the parameter currently needs to be adjusted, and the gradient matrices E of all the current parameters W are obtained in turn.
Each parameter in the above user intention recognition model to be trained may be a matrix, and its dimension is the dimension of that matrix. For example, a convolution kernel of size [3,5] is a matrix containing 15 parameters.
The gradient matrix of a weight parameter W is:

E = ∂LOSS/∂W

The parameter may then be adjusted according to the gradient matrix, as follows:

W(s+1) = W(s) - η · E(s)

wherein W is the parameter to be adjusted; s is the iteration number; η is the learning rate, which represents the amount of change at each update, is preset manually, and can also be adjusted dynamically by an algorithm; and E(s) is the gradient matrix obtained by differentiating the loss with respect to the parameter.
The training process of the user intention recognition model to be trained thus consists of continuously using the data for network prediction to calculate the loss, back propagation to calculate the gradients, and weight updating; this cycle is repeated until the loss function falls into the ideal range, at which point the training is finished and the network parameters are the optimal parameters.
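A minimal sketch of one weight update under the formula above; the learning rate, parameter shape and gradient values are illustrative assumptions:

```python
import numpy as np

eta = 0.01                        # learning rate η, preset manually

def sgd_step(W: np.ndarray, E: np.ndarray) -> np.ndarray:
    # W(s+1) = W(s) - η · E(s): move each parameter against its
    # gradient, i.e. in the direction of fastest loss decrease.
    return W - eta * E

W = np.random.randn(3, 5)         # e.g. a [3, 5] convolution kernel
E = np.random.randn(3, 5)         # its gradient matrix ∂LOSS/∂W
W = sgd_step(W, E)                # one iteration of the update
```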
In one possible embodiment, one execution of step 901 to step 906/step 907 can be regarded as one iteration. In the embodiment of the invention, when the number of iterations reaches a preset iteration threshold, the user intention recognition model to be trained after the preset number of iterations can also be determined as the trained user intention recognition model. The preset iteration threshold may be set according to the actual application, for example to 10000 or 20000, and is not specifically limited herein.
By adopting the method provided by the embodiment of the invention, the user intention recognition model is trained in advance, and the vertical domain intention recognition network and the sub-function intention recognition network are expanded from the trained user intention recognition model, so that the target vertical domain intention and the first target sub-function intention corresponding to the voice information to be recognized can be recognized at the same time, and the user intention corresponding to the voice information to be recognized is further obtained. Compared with the existing voice information recognition method, the model network scale and the calculation amount for recognizing the vertical domain intention and the sub-function intention are reduced.
In a possible embodiment, after the trained user intention recognition model is obtained, it may be verified by using a verification set to measure whether the training was successful. The specific verification method may include the following steps C1-C5:
Step C1, inputting each piece of test voice information included in the verification set into the trained user intention recognition model to obtain the predicted vertical domain intention and the predicted sub-function intention corresponding to the test voice information.
Wherein the verification set may include a plurality of pieces of test voice information, and each piece of test voice information corresponds to a real user intention.
Step C2, calculating the accuracy rate corresponding to the verification set according to the real user intention corresponding to each piece of test voice information and the predicted vertical domain intention and predicted sub-function intention corresponding to each piece of test voice information.
Wherein the real user intent comprises: a true vertical domain intent and a true sub-functionality intent. In this step, for each piece of test speech information, whether the true vertical domain intention corresponding to the test speech information is consistent with the predicted vertical domain intention corresponding to the test speech information, and whether the true subfunction intention corresponding to the test speech information is consistent with the predicted subfunction intention corresponding to the test speech information may be compared.
The ratio of the test voice information with correct predicted vertical domain intention and predicted subfunction intention output by the trained user intention recognition model in the verification set can be calculated and used as the corresponding accuracy of the verification set. For example, if the predicted vertical domain intention and the predicted subfunction intention output by the user intention recognition model obtained by training 900 pieces of test speech information are correct in 1000 pieces of test speech information in the verification set, it can be determined that the correctness rate corresponding to the verification set is 90%.
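A sketch of this accuracy calculation, assuming a simple list of per-sample results (all field names are illustrative):

```python
def validation_accuracy(samples) -> float:
    # A sample counts as correct only when both the predicted
    # vertical domain intention and the predicted sub-function
    # intention match the real user intention.
    correct = sum(
        1 for s in samples
        if s["pred_domain"] == s["true_domain"]
        and s["pred_subfunction"] == s["true_subfunction"]
    )
    return correct / len(samples)

samples = [
    {"true_domain": "navigation", "pred_domain": "navigation",
     "true_subfunction": "navigating to a destination",
     "pred_subfunction": "navigating to a destination"},
    {"true_domain": "music", "pred_domain": "navigation",
     "true_subfunction": "broadcasting a song name",
     "pred_subfunction": "navigating to a destination"},
]
accuracy = validation_accuracy(samples)   # -> 0.5
```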
Step C3, judging whether the accuracy rate is not less than a preset accuracy threshold; if so, executing step C4, and if not, executing step C5.
The preset accuracy threshold may be set according to an actual application, and may be set to 90% or 95%, for example, which is not limited herein.
Step C4, determining that the trained user intention recognition model is successfully trained.
A successfully trained user intention recognition model indicates that: the accuracy of the user intention determined by the trained model for the voice information to be recognized is high, and the effect of determining the user intention is good.
Step C5, training a user intention recognition model to be trained by using a new sample set.
If the accuracy rate corresponding to the user intention recognition model obtained by the current training is smaller than the preset accuracy threshold, this indicates that: the accuracy of the user intention determined by the currently trained model for the voice information to be recognized is low, and the effect of determining the user intention is poor. Therefore, in this case, a new user intention recognition model can be selected for training using a new sample set. Wherein the new sample set comprises: new sample voice information and the sample vertical domain intention label and sample sub-function intention label corresponding to the new sample voice information.
By adopting the method provided by the embodiment of the invention, the user intention recognition model obtained by training is verified by using the verification set, so that the effect of determining the user intention by the user intention recognition model obtained by training is further ensured.
Based on the same inventive concept, according to the user intention determining method provided in the above embodiment of the present invention, correspondingly, another embodiment of the present invention further provides a user intention determining apparatus, a schematic structural diagram of which is shown in fig. 10, specifically including:
the text conversion module 1001 is configured to perform text conversion on the voice information to be recognized to obtain a target text corresponding to the voice information to be recognized;
a model input module 1002, configured to input the target text into a pre-trained user intention recognition model;
a probability obtaining module 1003, configured to extract a target feature of the target text based on the user intention recognition model, and obtain a first probability of each preset vertical domain intention to which the target feature is mapped and a second probability of each preset sub-function intention to which the target feature is mapped; wherein each preset subfunction intention belongs to a preset vertical domain intention;
an intention determining module 1004, configured to determine, based on a correspondence between a preset vertical domain intention corresponding to the highest first probability and a preset sub-function intention corresponding to the highest second probability, a user intention corresponding to the voice information to be recognized.
By adopting the device provided by the embodiment of the invention, the voice information to be recognized is subjected to character conversion to obtain the target characters corresponding to the voice information to be recognized; inputting the target characters into a pre-trained user intention recognition model, and extracting target characteristics of the target characters; based on the user intention recognition model, acquiring a first probability of each preset vertical domain intention mapped to the target feature and a second probability of each preset sub-function intention mapped to the target feature; and determining the user intention corresponding to the voice information to be recognized based on the corresponding relation between the preset vertical domain intention corresponding to the highest first probability and the preset sub-function intention corresponding to the highest second probability. The device provided by the embodiment of the invention can determine the preset vertical domain intention and the preset sub-function intention through the same user intention recognition model, and further determine the user intention corresponding to the voice information to be recognized.
An embodiment of the present invention further provides an electronic device, as shown in fig. 11, including a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, where the processor 1101, the communication interface 1102 and the memory 1103 complete mutual communication through the communication bus 1104,
a memory 1103 for storing a computer program;
the processor 1101 is configured to implement the following steps when executing the program stored in the memory 1103:
performing character conversion on voice information to be recognized to obtain target characters corresponding to the voice information to be recognized;
inputting the target characters into a pre-trained user intention recognition model;
extracting target features of the target characters based on the user intention recognition model, and acquiring a first probability of each preset vertical domain intention mapped to the target features and a second probability of each preset sub-function intention mapped to the target features; wherein each preset subfunction intention belongs to a preset vertical domain intention;
and determining the user intention corresponding to the voice information to be recognized based on the corresponding relation between the preset vertical domain intention corresponding to the highest first probability and the preset sub-function intention corresponding to the highest second probability.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In yet another embodiment provided by the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above-mentioned user intent determination methods.
In yet another embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the user intent determination methods of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus, the electronic device and the storage medium, since they are substantially similar to the method embodiments, the description is relatively simple, and the relevant points can be referred to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for determining user intent, comprising:
performing character conversion on voice information to be recognized to obtain target characters corresponding to the voice information to be recognized;
inputting the target characters into a pre-trained user intention recognition model;
extracting target features of the target characters based on the user intention recognition model, and acquiring a first probability of each preset vertical domain intention mapped to the target features and a second probability of each preset sub-function intention mapped to the target features; wherein each preset subfunction intention belongs to a preset vertical domain intention;
and determining the user intention corresponding to the voice information to be recognized based on the corresponding relation between the preset vertical domain intention corresponding to the highest first probability and the preset sub-function intention corresponding to the highest second probability.
2. The method according to claim 1, wherein the determining the user intention corresponding to the voice information to be recognized based on the correspondence between the preset vertical domain intention corresponding to the highest first probability and the preset sub-function intention corresponding to the highest second probability comprises:
acquiring a target vertical domain intention according to the first probability and acquiring a first target sub-function intention according to the second probability, wherein the first target sub-function intention is a preset sub-function intention corresponding to the highest second probability, and the target vertical domain intention is a preset vertical domain intention corresponding to the highest first probability;
judging whether a preset vertical domain intention to which the first target sub-function intention belongs is consistent with the target vertical domain intention;
and if so, determining the first target sub-function intention as a target user intention.
3. The method according to claim 2, wherein after said determining whether the preset vertical domain intention to which the first target sub-function intention belongs is consistent with the target vertical domain intention, further comprising:
if the judgment result is negative, determining the target vertical domain intention as the target user intention, or,
acquiring a second target sub-function intention from each preset sub-function intention belonging to the target vertical domain intention according to a second probability, and determining the second target sub-function intention as a target user intention, wherein the second target sub-function intention is as follows: and among the preset subfunction intents belonging to the target vertical domain intents, the preset subfunction intention corresponding to the highest second probability.
4. The method of claim 1, wherein the pre-trained user intent recognition model comprises: a character feature extraction network, a vertical domain intention recognition network and a sub-function intention recognition network;
the character feature extraction network is used for extracting target features of the input target characters and respectively inputting the target features into the vertical domain intention recognition network and the sub-function intention recognition network;
the vertical domain intention recognition network obtains a first probability that the target characteristics belong to each preset vertical domain intention based on the target characteristics;
and the sub-function intention recognition network obtains a second probability that the target feature belongs to each preset sub-function intention based on the target feature.
5. The method of claim 4, wherein the textual feature extraction network comprises:
the input layer is used for acquiring a character index of the target character based on a preset character index library;
the word embedding layer is used for acquiring a target word vector corresponding to the target word based on the word index of the target word and a preset word vector library, and the preset word vector library stores the corresponding relation between the word index and the word vector;
the convolution layer comprises a first convolution kernel corresponding to a first preset word length, a second convolution kernel corresponding to a second word length and a third convolution kernel corresponding to a third preset word length, first word vector characteristics corresponding to the target words are obtained based on the first convolution kernel, second word vector characteristics corresponding to the target words are obtained based on the second convolution kernel, and third word vector characteristics corresponding to the target words are obtained based on the third convolution kernel;
the activation layer corresponding to each convolution kernel respectively carries out nonlinear processing on the first word vector characteristic, the second word vector characteristic and the third word vector characteristic;
a pooling layer corresponding to each active layer; performing down-sampling processing on the first word vector characteristic, the second word vector characteristic and the third word vector characteristic after the nonlinear processing;
and the fusion layer is used for splicing and fusing the first word vector feature, the second word vector feature and the third word vector feature which are subjected to down-sampling processing to obtain a target feature.
6. The method of claim 4, wherein the vertical domain intent recognition network comprises a vertical domain full connectivity layer and a vertical domain normalization layer,
the vertical domain full-connection layer calculates the probability of mapping the target features to each preset vertical domain intention;
and the vertical domain normalization layer is used for performing normalization processing on the probabilities of the target features mapped to the preset vertical domain intents to obtain first probabilities of the target features belonging to the preset vertical domain intents.
7. The method of claim 4, wherein the sub-function intent recognition network comprises a sub-function fully connected layer, a sub-function normalization layer;
the sub-function full-link layer calculates the probability of mapping the target feature to each preset sub-function intention;
and the sub-function normalization layer is used for performing normalization processing on the probability of mapping the target feature to each preset sub-function intention to obtain a second probability of the target feature belonging to each preset sub-function intention.
8. The method according to any one of claims 1-7, wherein the training mode of the user intention recognition model comprises:
acquiring sample voice information and a sample vertical domain intention label and a sample subfunction intention label corresponding to the sample voice information;
inputting the sample voice information into a user intention recognition model to be trained to obtain a predicted vertical domain intention and a predicted subfunction intention;
calculating a first loss value based on the predicted vertical domain intent and the sample vertical domain intent tag; and calculating a second loss value based on the predictor sub-function intent and the sample sub-function intent tag;
calculating a loss value of a user intention recognition model to be trained based on the first loss value and the second loss value;
judging whether the loss value is smaller than a preset loss threshold value or not, and if so, determining that the training of the user intention recognition model is finished; if not, updating the parameters of the user intention recognition model to be trained, and returning to execute the steps of obtaining the sample voice information, and a sample vertical domain intention label and a sample sub-function intention label corresponding to the sample voice information;
the parameters of the user intention recognition model to be trained comprise: the character feature extraction network parameters, the vertical domain intention identification network parameters and the subfunction intention identification network parameters.
9. A user intent determination apparatus, comprising:
the character conversion module is used for performing character conversion on the voice information to be recognized to obtain target characters corresponding to the voice information to be recognized;
the model input module is used for inputting the target characters into a pre-trained user intention recognition model;
a probability obtaining module, configured to extract a target feature of the target text based on the user intention recognition model, and obtain a first probability of each preset vertical domain intention mapped to the target feature and a second probability of each preset sub-function intention mapped to the target feature; wherein each preset subfunction intention belongs to a preset vertical domain intention;
and the intention determining module is used for determining the user intention corresponding to the voice information to be recognized based on the corresponding relation between the preset vertical domain intention corresponding to the highest first probability and the preset sub-function intention corresponding to the highest second probability.
10. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 8 when executing a program stored in the memory.
CN202110671730.4A 2021-06-17 2021-06-17 User intention determining method and device Pending CN113326351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110671730.4A CN113326351A (en) 2021-06-17 2021-06-17 User intention determining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110671730.4A CN113326351A (en) 2021-06-17 2021-06-17 User intention determining method and device

Publications (1)

Publication Number Publication Date
CN113326351A true CN113326351A (en) 2021-08-31

Family

ID=77423570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110671730.4A Pending CN113326351A (en) 2021-06-17 2021-06-17 User intention determining method and device

Country Status (1)

Country Link
CN (1) CN113326351A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107146610A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 A kind of determination method and device of user view
CN109657229A (en) * 2018-10-31 2019-04-19 北京奇艺世纪科技有限公司 A kind of intention assessment model generating method, intension recognizing method and device
KR20200052807A (en) * 2018-11-07 2020-05-15 고려대학교 산학협력단 Brain-computer interface system and method for decoding user’s conversation intention using the same
CN111292752A (en) * 2018-12-06 2020-06-16 北京嘀嘀无限科技发展有限公司 User intention identification method and device, electronic equipment and storage medium
CN109979453A (en) * 2019-03-29 2019-07-05 客如云科技(成都)有限责任公司 A kind of intelligent intention assessment man-machine interaction method towards the robot that orders
CN112836025A (en) * 2019-11-22 2021-05-25 航天信息股份有限公司 Intention identification method and device
CN111931513A (en) * 2020-07-08 2020-11-13 泰康保险集团股份有限公司 Text intention identification method and device
CN111783767A (en) * 2020-07-27 2020-10-16 平安银行股份有限公司 Character recognition method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822020A (en) * 2021-11-22 2021-12-21 湖北亿咖通科技有限公司 Text processing method, text processing apparatus, storage medium, and program product
CN113822020B (en) * 2021-11-22 2022-07-08 亿咖通(湖北)技术有限公司 Text processing method, text processing device and storage medium
CN114860905A (en) * 2022-04-24 2022-08-05 支付宝(杭州)信息技术有限公司 Intention identification method, device and equipment
CN115423485A (en) * 2022-11-03 2022-12-02 支付宝(杭州)信息技术有限公司 Data processing method, device and equipment

Similar Documents

Publication Publication Date Title
CN113326351A (en) User intention determining method and device
CN110110062B (en) Machine intelligent question and answer method and device and electronic equipment
CN112528672A (en) Aspect-level emotion analysis method and device based on graph convolution neural network
CN108364650B (en) Device and method for adjusting voice recognition result
CN113220839B (en) Intention identification method, electronic equipment and computer readable storage medium
CN111916111A (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN110879938A (en) Text emotion classification method, device, equipment and storage medium
CN111079418B (en) Named entity recognition method, device, electronic equipment and storage medium
CN112784066A (en) Information feedback method, device, terminal and storage medium based on knowledge graph
CN111581388B (en) User intention recognition method and device and electronic equipment
CN114067786A (en) Voice recognition method and device, electronic equipment and storage medium
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN111291551B (en) Text processing method and device, electronic equipment and computer readable storage medium
CN116561320A (en) Method, device, equipment and medium for classifying automobile comments
CN115496734A (en) Quality evaluation method of video content, network training method and device
CN115019295A (en) Model training method, text line determination method and text line determination device
CN111831823B (en) Corpus generation and model training method
CN113870842B (en) Voice control method, device, equipment and medium based on weight adjustment
CN112712792A (en) Dialect recognition model training method, readable storage medium and terminal device
CN113779230B (en) Legal recommendation method, system and equipment based on legal understanding
CN111191034B (en) Human-computer interaction method, related device and readable storage medium
CN112201277B (en) Voice response method, device, equipment and computer readable storage medium
CN117725930A (en) Multi-BiLSTM and competition mechanism-based named entity identification method, medium and device
CN117594037A (en) Training method of voice recognition model, voice recognition method, device and medium
CN117972359A (en) Intelligent data analysis method based on multi-mode data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20220331

Address after: 430051 No. b1336, chuanggu startup area, taizihu cultural Digital Creative Industry Park, No. 18, Shenlong Avenue, Wuhan Economic and Technological Development Zone, Hubei Province

Applicant after: Yikatong (Hubei) Technology Co.,Ltd.

Address before: 430056 building B (qdxx-f7b), No.7 building, qiedixiexin science and Technology Innovation Park, South taizihu innovation Valley, Wuhan Economic and Technological Development Zone, Hubei Province

Applicant before: HUBEI ECARX TECHNOLOGY Co.,Ltd.