CN112597301A - Voice intention recognition method and device

Info

Publication number
CN112597301A
Authority
CN
China
Prior art keywords: user, data, branch network, determining, intention
Legal status
Pending
Application number
CN202011493429.0A
Other languages
Chinese (zh)
Inventor
李世杰
包梦蛟
陈欢
钱瑞峰
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202011493429.0A
Publication of CN112597301A


Classifications

    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F16/3329: Information retrieval of unstructured textual data; natural language query formulation or dialogue systems
    • G06F16/3343: Information retrieval of unstructured textual data; query execution using phonetics
    • G06F18/25: Pattern recognition; analysing; fusion techniques
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; combinations of networks


Abstract

The specification discloses a voice intention recognition method and device. The method acquires voice data of a user, determines the text data corresponding to the voice data, and determines a first feature vector and a second feature vector through a first branch network and a second branch network respectively. A third feature vector is determined through a third branch network according to at least one of user information data, user behavior data and historical interaction data. The three feature vectors are fused into a total feature vector, which is input into an intermediate branch network to obtain a user intention vector. The user intention vector is then input into each output branch network of the intention recognition model to obtain the classification result output by each output branch network, and the intention of the user is determined based on these classification results. Because the actual intention of the user can be accurately determined from the classification results output by the output branch networks of the intention recognition model, the business executed based on the recognized intention is more efficient.

Description

Voice intention recognition method and device
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to a method and an apparatus for recognizing a speech intention.
Background
With the development of information technology and artificial intelligence, speech recognition technology has been widely applied. For example, some enterprises use intelligent voice technology in place of customer service staff to help users handle business or to provide services; to improve service efficiency, intention recognition is performed on what the user says, and a corresponding response or operation is selected according to the recognized user intention. Likewise, some intelligent robots need to recognize the intention of a user from acquired voice information in order to react appropriately. In either case, making a correct or appropriate response to acquired voice information requires performing intention recognition on it, so as to determine the real intention of the user who produced it.
In the prior art, when performing intention recognition on voice information, the user's voice information and the text converted from it are typically input into a pre-trained model for recognizing user intentions, and the model's output is taken as the intention recognition result.
However, an intention recognition result determined only from the voice information and the text is often not accurate enough, so the efficiency of the service executed based on that result is difficult to improve.
Disclosure of Invention
The present specification provides a method and an apparatus for recognizing a speech intention, which partially solve the above problems in the prior art.
The technical scheme adopted by the specification is as follows:
the present specification provides a voice intention recognition method, specifically including:
acquiring voice data of a user, and determining text data corresponding to the voice data according to the voice data;
determining a first feature vector from the speech data via a first branch network of an intent recognition model and a second feature vector from the text data via a second branch network of the intent recognition model;
acquiring associated data of the user, and determining a third feature vector through a third branch network of the intention recognition model according to the associated data, wherein the associated data comprises at least one of user information data, user behavior data and historical interaction data, and the historical interaction data is voice sent to the user in the process of interacting with the user;
fusing the first feature vector, the second feature vector and the third feature vector to determine a total feature vector, inputting the total feature vector into an intermediate branch network of the intention recognition model, and determining a user intention vector output by the intermediate branch network;
and inputting the user intention vector into each output branch network of the intention recognition model, and determining the intention recognition result of the user according to the classification results output by the output branch networks, wherein the intention of the user is used to determine the voice information with which to reply to the user, and different output branch networks are used to output classification results of different intention types.
Optionally, determining a first feature vector through a first branch network of the intention recognition model according to the voice data specifically includes:
removing invalid data from the voice data to obtain data to be recognized;
and inputting the determined data to be recognized into a first branch network of the intention recognition model, and determining a first feature vector.
Optionally, determining a third feature vector through a third branch network of the intention recognition model according to the associated data, specifically including:
determining user information data of each preset type corresponding to the user according to the user information data of the user;
for each preset type, encoding the user information data of that type to determine a portrait encoding;
inputting each type of portrait encoding into the neural network layer corresponding to that preset type in the third branch network, to obtain the portrait vector corresponding to each portrait encoding;
fusing the portrait vectors corresponding to the portrait encodings to determine a comprehensive portrait vector;
and inputting the comprehensive portrait vector into a fusion network layer in the third branch network, and taking the output of the fusion network layer as the third feature vector.
Optionally, the user behavior data is the user behavior data recorded before the process of interacting with the user;
determining a third feature vector through a third branch network of the intention recognition model according to the associated data, specifically comprising:
encoding each behavior of the user according to the user behavior data;
determining each user behavior vector according to the encoding of each user behavior;
and sequentially inputting the user behavior vectors into the third branch network of the intention recognition model, and determining the third feature vector according to the hidden-layer features of the third branch network of the intention recognition model.
Optionally, determining a third feature vector through a third branch network of the intention recognition model according to the associated data, specifically including:
determining each voice sent during the interaction with the user according to the historical interaction data;
determining a target voice according to the sending order of the voices;
and inputting the text data corresponding to the target voice into the third branch network of the intention recognition model, and determining the third feature vector.
Optionally, the output branch networks include at least two of: an output branch network for outputting the user's emotion, an output branch network for outputting the user's attitude, and an output branch network for outputting whether the user is affirmative.
Optionally, the intention recognition model is trained by the following method:
acquiring voice data historically generated in processes of interacting with different users;
for each piece of voice data, determining the text data corresponding to that piece of voice data and the associated data of the corresponding user, and taking the voice data and the associated data as a training sample;
determining sample labels of the training samples according to the voice data and the interaction results of the interaction processes, wherein the sample labels comprise labels corresponding to the output branch networks;
inputting the training sample into an intention recognition model to be trained to obtain output results of each output branch network of the intention recognition model to be trained;
determining the loss corresponding to the output result of each output branch network according to the obtained output results and the labels corresponding to the output branch networks in the sample label;
and determining total loss according to the loss corresponding to the output result of each output branch network, and adjusting parameters in the to-be-trained intention recognition model by taking the minimum total loss as an optimization target.
The present specification provides a voice intention recognition apparatus, specifically including:
the text data determining module is used for acquiring voice data of a user and determining text data corresponding to the voice data according to the voice data;
a first feature vector determining module, configured to determine a first feature vector through a first branch network of the intention recognition model according to the voice data, and a second feature vector through a second branch network of the intention recognition model according to the text data;
a second feature vector determining module, configured to obtain the associated data of the user and determine a third feature vector through a third branch network of the intention recognition model according to the associated data, where the associated data includes at least one of user information data, user behavior data and historical interaction data, and the historical interaction data is the voice sent to the user in the process of interacting with the user;
a user intention vector determining module, configured to fuse the first feature vector, the second feature vector and the third feature vector to determine a total feature vector, input the total feature vector into an intermediate branch network of the intention recognition model, and determine the user intention vector output by the intermediate branch network;
and an intention recognition module, configured to input the user intention vector into each output branch network of the intention recognition model and determine the intention recognition result of the user according to the classification results output by the output branch networks, where the intention of the user is used to determine the voice information with which to reply to the user, and different output branch networks output classification results of different intention types.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above voice intention recognition method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-mentioned voice intention recognition method when executing the program.
The technical scheme adopted by the specification can achieve the following beneficial effects:
in the voice intention recognition method provided in this specification, voice data of a user is acquired, the text data corresponding to the voice data is determined, and first and second feature vectors are determined through first and second branch networks respectively. A third feature vector is determined through the third branch network according to at least one of user information data, user behavior data and historical interaction data. The three feature vectors are fused into a total feature vector, which is input into an intermediate branch network to obtain a user intention vector. The user intention vector is then input into each output branch network of the intention recognition model to obtain the classification result output by each output branch network, and the intention of the user is determined based on these classification results.
As can be seen from the above, the method is not limited to determining the user's intention from the voice data alone, and the intention recognition model outputs classification results of different intention types; the user's intention can therefore be determined more accurately based on classification results of multiple intention types, and the business executed based on the intention recognition result is more efficient.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of this specification, illustrate embodiments of the specification and together with the description serve to explain the specification, without limiting it. In the drawings:
Fig. 1 is a schematic flow chart of the voice intention recognition method provided in this specification;
Fig. 2 is a schematic diagram of the voice interaction provided in this specification;
Fig. 3 is a schematic diagram of the intention recognition model provided in this specification;
Fig. 4 is a schematic structural diagram of a third branch network provided in this specification;
Fig. 5 is a schematic structural diagram of a third branch network provided in this specification;
Fig. 6 is a schematic structural diagram of a third branch network provided in this specification;
Fig. 7 is a schematic structural diagram of a third branch network provided in this specification;
Fig. 8 is a schematic diagram of the voice intention recognition apparatus provided in this specification;
Fig. 9 is a schematic structural diagram of an electronic device corresponding to fig. 1 provided in this specification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure will be described clearly and completely below with reference to specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without creative effort fall within the protection scope of the present specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
At present, when the intention of a user is recognized based on the user's voice information, the voice information and the text converted from it are typically used as input to a model for recognizing user intentions, and the user's intention is determined from that model's output. However, such a single output result can only identify intentions for speech whose underlying intent is simple; if the user's intention is complicated, it is difficult to determine it accurately from the voice information and corresponding text alone. For example, during an interaction the user may speak ironically, so that an affirmative-sounding response is actually intended to convey a negative meaning. Or the user may use a word that is easily misrecognized as a homophone or near-homophone, so that after the speech is converted into text the text contains the wrong word and its meaning changes. In the prior art, the user's real intention in these situations is difficult to recognize from the voice information and corresponding text alone, so the intention recognition is inaccurate and the service executed based on the recognition result cannot proceed smoothly, or even goes wrong.
To address the inaccuracy of existing voice intention recognition, this application provides the following voice intention recognition method.
Fig. 1 is a schematic flow chart of a speech intention recognition method in this specification, which specifically includes the following steps:
s100: acquiring voice data of a user, and determining text data corresponding to the voice data according to the voice data.
In this specification, the voice intention recognition method may be used in intelligent customer service scenarios. Specifically, in a voice intelligent customer service application, when the intelligent customer service of a service platform interacts with a user, the user's real intention is determined by recognizing the user's voice information, so that the intelligent customer service can decide, from the intention recognition result obtained during the interaction, how to continue interacting with the user and execute the corresponding service smoothly and accurately. For convenience of description, this intelligent customer service is referred to below as the voice intelligent customer service.
In one or more embodiments of the present specification, the voice intention recognition method may be specifically executed by a server of a provider of a service corresponding to a voice intention recognition result, where the server may be a single device, or a system composed of multiple devices, such as a distributed system. The server can run an intelligent customer service program for interacting with the user, and the program is used for determining voice data sent to the user according to the service scene and the voice of the user, namely executing the customer service.
In one or more embodiments of the present specification, when the server runs the intelligent customer service to serve a user, it may recognize the user's intention from the user's voice data. First, the server acquires the voice data requiring intention recognition, which may be the voice data most recently sent to the server by the user's client.
For example, when a user uses the smart customer service, the dialog between the user and the smart customer service is as shown in fig. 2, fig. 2 is a schematic diagram of the voice interaction provided in this specification, wherein the bubble bar represents voice data sent by the user or the smart customer service, and the server may determine the voice data sent by the user most recently, i.e., the bubble bar with a dark color in fig. 2, as the voice data for performing the intention recognition.
Then, the acquired voice data is converted into text data so that it can be input into the intention recognition model in a subsequent step to recognize the user's intention. The conversion may use DeepSpeech technology, Automatic Speech Recognition (ASR) technology, or any other technology capable of converting a speech signal into the corresponding text; which technology is used for speech recognition is not limited by this specification and may be chosen as required.
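By way of illustration, the following minimal Python sketch shows the shape of this step. The `asr_transcribe` function is a hypothetical placeholder, not an API named by this specification, since the choice of speech-to-text engine is deliberately left open:

```python
def asr_transcribe(waveform, sample_rate):
    """Hypothetical stand-in for the deployed speech-to-text engine
    (e.g. a DeepSpeech-style model or an ASR service)."""
    raise NotImplementedError("plug in the ASR engine of your deployment")

def step_s100(waveform, sample_rate):
    # Step S100: convert the acquired voice data into the text data that
    # step S102 will feed into the second branch network.
    return asr_transcribe(waveform, sample_rate)
```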
S102: determining a first feature vector from the speech data via a first branch network of an intention recognition model, and determining a second feature vector from the text data via a second branch network of the intention recognition model.
Fig. 3 is a schematic diagram of an intention recognition model provided in the present specification, where the intention recognition model includes a first branch network, a second branch network, a third branch network, an intermediate branch network, and several output branch networks. As can be seen from the figure, after the voice data of the user, the text data corresponding to the voice data, and the associated data of the user are respectively input into the first branch network, the second branch network, and the third branch network of the intention recognition model, the first feature vector, the second feature vector, and the third feature vector are respectively obtained. And inputting a total feature vector obtained by fusing the first, second and third feature vectors into the intermediate branch network, determining a user intention vector fusing all the features of the user, and inputting the user intention vector into each output branch network respectively to obtain intention classification results corresponding to each output branch network. According to the intention classification result corresponding to each output branch network, the intention recognition result aiming at the voice data of the user can be finally determined.
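To make the data flow of fig. 3 concrete, the following PyTorch sketch reproduces the branch topology under stated assumptions: the three per-branch encoders are stubbed with single linear layers (later sketches flesh them out), and all layer sizes and per-head class counts are illustrative values, not values fixed by this specification.

```python
import torch
import torch.nn as nn

class IntentRecognitionModel(nn.Module):
    """Minimal sketch of the multi-branch architecture of fig. 3."""

    def __init__(self, speech_dim=128, text_dim=128, assoc_dim=128,
                 hidden_dim=256, head_classes=(3, 3, 2)):
        super().__init__()
        # Stand-ins for the first, second and third branch networks.
        self.speech_branch = nn.Sequential(nn.Linear(speech_dim, hidden_dim), nn.ReLU())
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        self.assoc_branch = nn.Sequential(nn.Linear(assoc_dim, hidden_dim), nn.ReLU())
        # Intermediate branch network: maps the fused total feature vector
        # to the user intention vector (an MLP, as the text allows).
        self.intermediate = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # One output branch network per intention type, e.g. emotion,
        # attitude, affirmative-or-not; the class counts are assumptions.
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, c) for c in head_classes)

    def forward(self, speech_feat, text_feat, assoc_feat):
        v1 = self.speech_branch(speech_feat)      # first feature vector
        v2 = self.text_branch(text_feat)          # second feature vector
        v3 = self.assoc_branch(assoc_feat)        # third feature vector
        total = torch.cat([v1, v2, v3], dim=-1)   # fusion by concatenation
        intent_vec = self.intermediate(total)     # user intention vector
        return [head(intent_vec) for head in self.heads]  # per-type logits
```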
In one or more embodiments of the present specification, after obtaining the voice data of the user, the server may input the voice data into the first branch network of the intention recognition model, so as to obtain a first feature vector corresponding to the voice data.
Specifically, after acquiring the user's voice data, the server may preprocess it: remove invalid data from the voice data to obtain the data to be recognized, frame and window the data to be recognized to obtain windowed data frames, determine the spectrogram corresponding to each data frame, and then input the spectrograms into the first branch network of the intention recognition model to determine the first feature vector. Invalid data refers to the silence in the voice data when the user is not speaking, or pure background noise; it may be removed using, for example, Voice Activity Detection (VAD, also called voice endpoint detection) techniques.
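A minimal numpy sketch of this preprocessing follows. The energy-threshold silence removal is a crude stand-in for a real VAD, and the 25 ms frame / 10 ms hop values are common defaults rather than values fixed by this specification:

```python
import numpy as np

def drop_silence(signal, threshold=1e-3, win=400):
    # Crude energy-based stand-in for VAD: keep only windows whose mean
    # energy exceeds the threshold. A real system would use a proper
    # voice-activity detector.
    kept = [signal[i:i + win] for i in range(0, len(signal) - win + 1, win)
            if np.mean(signal[i:i + win] ** 2) > threshold]
    return np.concatenate(kept) if kept else signal

def to_spectrogram(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    # Frame and window the trimmed waveform, then take the magnitude FFT
    # of each frame; assumes the signal is at least one frame long.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = [signal[i * hop_len: i * hop_len + frame_len] * window
              for i in range(n_frames)]
    return np.stack([np.abs(np.fft.rfft(f)) for f in frames])

# data_to_recognize = drop_silence(waveform)
# spectrogram = to_spectrogram(data_to_recognize)  # input to the first branch
```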
In one or more embodiments of the present specification, after the data frames are obtained, speech features may instead be extracted directly from each data frame, and the first feature vector determined from those features. The speech features may include Mel-Frequency Cepstral Coefficient (MFCC) features, i-vector features, sound intensity, fundamental frequency, and the like.
In one or more embodiments of the present specification, after obtaining text data corresponding to the voice data of the user, the server may input the text data to the second branch network of the intention recognition model, so as to obtain a second feature vector corresponding to the text data.
Specifically, the text data corresponding to the user's voice data may be split into a number of vocabulary items, and the vocabulary items obtained by the splitting are input into the second branch network of the intention recognition model to obtain the second feature vector it outputs. The second branch network may be a word2vec (word-to-vector) network, which converts each vocabulary item into a word vector and fuses the word vectors into a vector representing the text data, i.e., the second feature vector corresponding to the text data. Alternatively, the second branch network may be a Transformer network that performs feature extraction on the text data to determine the second feature vector. This may be set as required and is not limited in this specification.
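A word2vec-style sketch of the second branch, assuming the transcript has already been split into integer token ids; the vocabulary size, embedding width and mean-pooling fusion are assumptions (the text equally allows a Transformer here):

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """Sketch of the second branch: embed each token of the transcript and
    mean-pool the word vectors into one text vector."""
    def __init__(self, vocab_size=30000, embed_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_ids):            # (batch, seq_len) int64 ids
        vectors = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        return vectors.mean(dim=1)           # second feature vector
```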
S104: and acquiring the associated data of the user, and determining a third feature vector through a third branch network of the intention recognition model according to the associated data.
In this specification, to improve the accuracy of voice intention recognition, the server may further obtain the user's associated data and input it into the third branch network of the intention recognition model to obtain a third feature vector. The associated data includes at least one of user information data, user behavior data and historical interaction data, where the historical interaction data is the voice sent to the user in the process of interacting with the user. Different users' growing environments, regional cultural differences, education levels and ages have distinct, non-negligible influences on their speaking habits, attitudes and characters; moreover, during an interaction the user's current emotion and intention may be related to the user's behaviors before the interaction and influenced by the intelligent customer service's replies during it. Obtaining the associated data and passing it through the third branch network therefore allows a more accurate intention recognition result in the subsequent steps. For example, if before the interaction a user browsed content of interest on the service platform, the user's attitude toward the intelligent customer service during the interaction may be more positive, and the user's intention may be related to the browsed content. Conversely, if a user filed a complaint on the service platform before interacting, the user's attitude during the interaction may be more negative, and the user's intention may be related to the complaint.
Specifically, in one or more embodiments of this specification, the server may determine, from the user's user information data, the user information data of each preset type corresponding to the user, and then encode the user information data of each preset type to determine a portrait encoding. The preset types of user information data include the user's basic personal data and data associated with the service corresponding to the interaction. For example, if the interaction is a repayment-reminder process on the corresponding service platform, the preset types may include the user's financial characteristics, attribute characteristics and consumption characteristics: the financial characteristics may include the user's borrowed amount, debt duration, historical borrowing and repayment records, and so on; the attribute characteristics may include basic information such as the user's age, education background, native place and occupation; the consumption characteristics may include the user's number of consumptions, consumption amounts and similar information on the service platform. Fig. 4 is a schematic structural diagram of a third branch network provided in this specification: the third branch network includes a neural network layer and a fusion network layer, the user information data include the user's financial characteristics, attribute characteristics, consumption characteristics and so on, and each type of the user's information data corresponds to one portrait encoding. After each type of portrait encoding is determined, it can be input into the neural network layer of the third branch network, where an embedding layer compresses and reduces the dimension of each portrait encoding to obtain the portrait vector corresponding to each encoding; the portrait vectors are then fused to determine a comprehensive portrait vector. Finally, the comprehensive portrait vector is input into the fusion network layer of the third branch network, and the output of the fusion network layer is taken as the third feature vector.
In addition, in one or more embodiments of this specification, for each preset type of user information data, one-hot encoding may be used to determine the portrait encoding corresponding to that type.
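The following sketch mirrors fig. 4 under stated assumptions: each preset type's one-hot portrait encoding is represented by an integer code fed to an `nn.Embedding`, which is equivalent to a one-hot vector followed by a linear layer, and the per-type cardinalities and dimensions are invented for illustration:

```python
import torch
import torch.nn as nn

class PortraitBranch(nn.Module):
    """Sketch of the user-information sub-network of fig. 4."""
    def __init__(self, type_cardinalities=(50, 50, 50), embed_dim=32,
                 out_dim=128):
        super().__init__()
        # One embedding table per preset type (financial, attribute,
        # consumption, ...) plays the role of the compress-and-reduce
        # embedding layer applied to each portrait encoding.
        self.embeds = nn.ModuleList(
            nn.Embedding(card, embed_dim) for card in type_cardinalities)
        # Fusion network layer over the fused portrait vectors.
        self.fuse = nn.Sequential(
            nn.Linear(embed_dim * len(type_cardinalities), out_dim), nn.ReLU())

    def forward(self, codes):  # codes: (batch, n_types) integer portrait codes
        parts = [emb(codes[:, i]) for i, emb in enumerate(self.embeds)]
        combined = torch.cat(parts, dim=-1)   # comprehensive portrait vector
        return self.fuse(combined)            # third feature vector
```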
Further, in one or more embodiments of this specification, the server may encode each behavior of the user according to the user behavior data and determine each user behavior vector from those encodings. The user behavior vectors are then input in sequence into the third branch network of the intention recognition model, and the third feature vector is determined from the hidden-layer features of that network. In this case the structure of the third branch network is as shown in fig. 5. Fig. 5 is a schematic structural diagram of a third branch network provided in this specification: the third branch network includes a vectorization network layer and a fusion network layer. The obtained user behavior data are input into the vectorization network layer to obtain the corresponding user behavior vectors, which are then input in sequence into the fusion network layer. The fusion network layer is specifically a Recurrent Neural Network (RNN) or a variant thereof, such as a Long Short-Term Memory network (LSTM); the final hidden-layer feature of the fusion network layer can then be used as the third feature vector.
The user behavior data are the user behaviors recorded before the interaction with the user; for example, they may be identifiers of service platform pages the user clicked before the interaction, page browsing durations, and the like. When encoding each behavior, the behavior vectors may be obtained by one-hot encoding, or each behavior may be encoded by an embedding method; this may be set as required and is not limited in this specification.
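A sketch of the fig. 5 structure, assuming the behaviors (e.g. clicked-page identifiers) are already mapped to integer ids; the LSTM is one of the fusion-layer variants the text names, and all sizes are assumptions:

```python
import torch
import torch.nn as nn

class BehaviorBranch(nn.Module):
    """Sketch of fig. 5: vectorize each recorded behavior, then fuse the
    sequence with an LSTM whose final hidden state is the third feature
    vector."""
    def __init__(self, n_behaviors=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_behaviors, embed_dim)   # vectorization layer
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # fusion layer

    def forward(self, behavior_ids):          # (batch, seq_len) int64 ids
        seq = self.embed(behavior_ids)        # behavior vectors, in order
        _, (h_n, _) = self.rnn(seq)           # final hidden-layer feature
        return h_n[-1]                        # (batch, hidden_dim)
```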
In addition, in one or more embodiments of this specification, the server may determine, from the historical interaction data, each voice sent during the interaction with the user, determine a target voice according to the sending order of the voices, and then input the text data corresponding to the target voice into the third branch network of the intention recognition model to determine the third feature vector. Here the historical interaction data refer to the historical data generated before the current voice data in the interaction with the user. In this case the structure of the third branch network is as shown in fig. 6. Fig. 6 is a schematic structural diagram of a third branch network provided in this specification: the third branch network includes a textualization network layer and a vectorization network layer, and the process of determining the third feature vector from the historical interaction data may be similar to the process in step S102 of determining text data from the voice data and the second feature vector from the text data. The historical interaction data may be the most recent voice sent to the user by the intelligent customer service, or multiple voices from the interaction may be acquired as historical interaction data. When there are multiple pieces of historical interaction data, they can be converted into historical interaction texts through a third branch network with a structure similar to that of fig. 5, the historical interaction texts are input in sequence, according to the sending order of the corresponding voices, into a network such as an RNN or LSTM, and the hidden-layer features of that network are used as the third feature vector.
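A sketch of the fig. 6 path generalized to several past turns: each turn's text is embedded and mean-pooled as in the second-branch sketch, and an LSTM fuses the turn vectors in sending order. Padding every turn to one token length is an assumption made for brevity:

```python
import torch
import torch.nn as nn

class HistoryBranch(nn.Module):
    """Sketch of the historical-interaction path over one or more past
    turns; all sizes are assumptions."""
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # vectorization
        self.turn_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, turns):
        # turns: (batch, n_turns, seq_len) token ids of the voices sent to
        # the user, ordered by sending time; a single target voice is just
        # a length-1 turn sequence.
        b, t, s = turns.shape
        turn_vecs = self.embed(turns.reshape(b * t, s)).mean(dim=1)
        _, (h_n, _) = self.turn_rnn(turn_vecs.reshape(b, t, -1))
        return h_n[-1]  # third feature vector, shape (batch, hidden_dim)
```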
In one or more embodiments of the present specification, the three data may be arbitrarily combined to serve as the associated data of the user, that is, the associated data of the user may include user information data and user behavior data, may also include user information data and historical interaction data, and may also include user behavior data and historical interaction data. Of course, the association data of the user may also include three types, i.e., user information data, user behavior data, and historical interaction data, and in this case, the structure of the third branch network is shown in fig. 7.
Fig. 7 is a schematic structural diagram of a third branch network provided in this specification. As can be seen, after the three kinds of associated data, i.e., user information data, user behavior data and historical interaction data, are input respectively into the user information network layer, the user behavior network layer and the historical interaction data network layer of the third branch network, a comprehensive portrait vector, a behavior vector and a historical interaction vector are obtained respectively. These three vectors are then fused to obtain a total association vector, the total association vector is input into the fusion network layer of the third branch network, and the third feature vector is determined from the output of the fusion network layer. The structure of the user information network layer is consistent with fig. 4, the structure of the user behavior network layer with fig. 5, and the structure of the historical interaction data network layer with fig. 6.
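Composing the three sub-networks as in fig. 7 might look as follows; the concatenation-plus-linear fusion layer is an assumption, and the sub-branch classes come from the earlier sketches:

```python
import torch
import torch.nn as nn

class CombinedThirdBranch(nn.Module):
    """Sketch of fig. 7: run the three sub-networks side by side and fuse
    their outputs into one third feature vector."""
    def __init__(self, portrait, behavior, history, sub_dim=128, out_dim=128):
        super().__init__()
        self.portrait, self.behavior, self.history = portrait, behavior, history
        # Fusion network layer over the total association vector.
        self.fuse = nn.Sequential(nn.Linear(3 * sub_dim, out_dim), nn.ReLU())

    def forward(self, codes, behavior_ids, turns):
        total = torch.cat([self.portrait(codes),         # comprehensive portrait vector
                           self.behavior(behavior_ids),  # behavior vector
                           self.history(turns)],         # historical interaction vector
                          dim=-1)                        # total association vector
        return self.fuse(total)                          # third feature vector

# branch = CombinedThirdBranch(PortraitBranch(), BehaviorBranch(), HistoryBranch())
# third_feature_vector = branch(codes, behavior_ids, turns)
```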
S106: fusing the first feature vector, the second feature vector and the third feature vector to determine a total feature vector, inputting the total feature vector into an intermediate branch network of the intention recognition model, and determining a user intention vector output by the intermediate branch network.
In one or more embodiments of the present disclosure, after determining the first feature vector, the second feature vector, and the third feature vector, the server may fuse the feature vectors to obtain a total feature vector, input the total feature vector into an intermediate branch network of the intention recognition model, and determine a user intention vector output by the intermediate branch network.
S108: Inputting the user intention vector into each output branch network of the intention recognition model, and determining the intention recognition result of the user according to the classification results output by the output branch networks, wherein the intention of the user is used to determine the voice information with which to reply to the user, and different output branch networks output classification results of different intention types.
In one or more embodiments of this specification, after determining the user intention vector, the server may input it into each output branch network of the intention recognition model and determine the user's intention recognition result from the classification results the output branch networks produce. The output branch networks include at least two of: an output branch network for outputting the user's emotion, an output branch network for outputting the user's attitude, and an output branch network for outputting whether the user is affirmative; different output branch networks output classification results of different intention types. According to the determined user intention, the server can further determine the voice message with which to reply to the user.
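As a small illustration of how the per-head classification results might be collected into one intention record (the head order and class semantics are assumptions, and how the reply is chosen from this record is application logic the specification leaves open):

```python
import torch

def decode_intention(head_logits,
                     type_names=("emotion", "attitude", "affirmative")):
    # One argmax per output branch network; each head contributes the
    # classification result for one intention type (first sample in batch).
    return {name: int(torch.argmax(logits, dim=-1)[0])
            for name, logits in zip(type_names, head_logits)}

# logits = model(speech_feat, text_feat, assoc_feat)
# intention = decode_intention(logits)
# e.g. {"emotion": 2, "attitude": 0, "affirmative": 1}
```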
It should be noted that the structures of the intermediate branch Network and each output branch Network in this specification may specifically be a Multilayer Perceptron (MLP) or a Fully Connected Network (FCN), which may be specifically set according to needs, and this specification is not limited herein.
Based on the voice intention recognition method shown in fig. 1, voice data of a user is acquired, the corresponding text data is determined, and first and second feature vectors are determined through the first and second branch networks respectively. A third feature vector is determined through the third branch network according to at least one of user information data, user behavior data and historical interaction data. The three feature vectors are fused into a total feature vector, which is input into the intermediate branch network to obtain a user intention vector. The user intention vector is then input into each output branch network of the intention recognition model to obtain the classification result output by each output branch network, and the intention of the user is determined based on these classification results.
According to the method, the classification results of different intention types output by each output branch network are determined through the intention recognition model according to the voice data, the text data and the associated data of the user, so that the real intention of the user is accurately determined, and the efficiency of the business executed based on the intention recognition result is higher.
In addition, in one or more embodiments of this specification, the intention recognition model may be trained as follows. Voice data historically generated in interactions with different users is acquired; for each piece of voice data, the corresponding text data and the associated data of the corresponding user are determined, and the voice data and the associated data are taken as a training sample. Then, the sample label of each training sample is determined according to the voice data and the result of the corresponding interaction, where the sample label contains a label for each output branch network. The training sample is input into the intention recognition model to be trained to obtain the output result of each of its output branch networks, and the loss corresponding to each output result is determined from that result and the label for the corresponding output branch network in the sample label. Finally, the total loss is determined from the losses corresponding to the output results of the output branch networks, and the parameters of the intention recognition model to be trained are adjusted with minimization of the total loss as the optimization target.
The formula for determining the total loss is:

$$\mathrm{Loss}_{\mathrm{total}} = \sum_{i=1}^{N} \left( C_i \cdot \mathrm{Loss}_i + \lambda_i \right)$$

where $N$ is the number of output branch networks in the intention recognition model, $C_i$ is the weight of the intention type corresponding to the $i$-th output branch network, $\mathrm{Loss}_i$ is the loss corresponding to the $i$-th output branch network, and $\lambda_i$ is a constraint regularization term corresponding to the $i$-th output branch network, used to avoid overfitting of that output branch network.
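A training-step sketch of this objective follows. Reading the regularization term as the coefficient of an L2 penalty on the i-th head's parameters is an interpretation made here; the specification only says the term constrains overfitting of that output branch network. The cross-entropy loss and Adam optimizer are likewise assumptions, and `model.heads` refers to the skeleton sketched after fig. 3:

```python
import torch
import torch.nn as nn

def total_loss(model, head_logits, labels, weights, lambdas):
    # Weighted sum over the N output branch networks of a per-head loss
    # plus a per-head regularization term, matching the formula above.
    ce = nn.CrossEntropyLoss()
    loss = torch.zeros(())
    for i, (logits, y) in enumerate(zip(head_logits, labels)):
        head_reg = sum(p.pow(2).sum() for p in model.heads[i].parameters())
        loss = loss + weights[i] * ce(logits, y) + lambdas[i] * head_reg
    return loss

# One optimization step, taking minimization of the total loss as the target:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer.zero_grad()
# loss = total_loss(model, model(sp, tx, ad), labels, weights, lambdas)
# loss.backward()
# optimizer.step()
```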
Based on the same idea, corresponding to the voice intention recognition method provided above, one or more embodiments of this specification further provide a voice intention recognition apparatus, as shown in fig. 8.
Fig. 8 is a schematic diagram of the voice intention recognition apparatus provided in this specification. The apparatus includes a text data determining module, a first feature vector determining module, a second feature vector determining module, a user intention vector determining module and an intention recognition module, wherein:
the text data determining module is used to acquire the voice data of a user and determine the text data corresponding to the voice data;
the first feature vector determining module is used to determine a first feature vector through a first branch network of the intention recognition model according to the voice data, and a second feature vector through a second branch network of the intention recognition model according to the text data;
the second feature vector determining module is used to obtain the associated data of the user and determine a third feature vector through a third branch network of the intention recognition model according to the associated data, where the associated data includes at least one of user information data, user behavior data and historical interaction data, and the historical interaction data is the voice sent to the user in the process of interacting with the user;
the user intention vector determining module is used to fuse the first feature vector, the second feature vector and the third feature vector to determine a total feature vector, input the total feature vector into an intermediate branch network of the intention recognition model, and determine the user intention vector output by the intermediate branch network;
and the intention recognition module is used to input the user intention vector into each output branch network of the intention recognition model and determine the intention recognition result of the user according to the classification results output by the output branch networks, where the intention of the user is used to determine the voice information with which to reply to the user, and different output branch networks output classification results of different intention types.
Optionally, the first feature vector determining module 201 is used to remove invalid data from the voice data to obtain the data to be recognized, input the data to be recognized into the first branch network of the intention recognition model, and determine the first feature vector.
Optionally, the second feature vector determining module 202 determines, from the user's user information data, the user information data of each preset type corresponding to the user; encodes the user information data of each preset type to determine a portrait encoding; inputs each type of portrait encoding into the neural network layer corresponding to that preset type in the third branch network to obtain the portrait vector corresponding to each portrait encoding; fuses the portrait vectors to determine a comprehensive portrait vector; and inputs the comprehensive portrait vector into the fusion network layer of the third branch network, taking the output of the fusion network layer as the third feature vector.
Optionally, the second feature vector determining module 202 is used to encode each behavior of the user according to the user behavior data, determine each user behavior vector according to the encoding of each behavior, input the user behavior vectors in sequence into the third branch network of the intention recognition model, and determine the third feature vector according to the hidden-layer features of the third branch network of the intention recognition model.
Optionally, the second feature vector determining module 202 determines, from the historical interaction data, each voice sent during the interaction with the user, determines a target voice according to the sending order of the voices, inputs the text data corresponding to the target voice into the third branch network of the intention recognition model, and determines the third feature vector.
Optionally, the output branch networks include at least two of: an output branch network for outputting the user's emotion, an output branch network for outputting the user's attitude, and an output branch network for outputting whether the user is affirmative.
Optionally, for training, voice data historically generated in interactions with different users is obtained. For each piece of voice data, the corresponding text data and the associated data of the corresponding user are determined, and the voice data and the associated data are taken as a training sample. The sample label of each training sample is determined according to the voice data and the result of the corresponding interaction, the sample label containing a label for each output branch network. The training sample is input into the intention recognition model to be trained to obtain the output result of each of its output branch networks; the loss corresponding to each output result is determined from that result and the label for the corresponding output branch network in the sample label; the total loss is determined from the losses corresponding to the output results; and the parameters of the intention recognition model to be trained are adjusted with minimization of the total loss as the optimization target.
The present specification also provides a computer-readable storage medium storing a computer program, which can be used to execute the voice intention recognition method provided in fig. 1 above.
This specification also provides the schematic structural diagram of an electronic device shown in fig. 9. As shown in fig. 9, at the hardware level the electronic device includes a processor, an internal bus, a memory and a non-volatile storage, and may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and runs it to implement the voice intention recognition method provided in fig. 1 above.
Of course, besides the software implementation, the present specification does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., to circuit structures such as diodes, transistors or switches) or an improvement in software (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures: designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, such programming is now mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained merely by lightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller as pure computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing the various functions may also be regarded as structures within the hardware component; indeed, means for performing the functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, e.g., read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present specification may be provided as a method, a system, or a computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
The present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, reference may be made to the corresponding parts of the method embodiment.
The above description is merely an example of the present specification and is not intended to limit it. Various modifications and alterations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present specification shall fall within the scope of the claims of the present specification.

Claims (10)

1. A speech intent recognition method, characterized in that the method specifically comprises:
acquiring voice data of a user, and determining text data corresponding to the voice data according to the voice data;
determining a first feature vector through a first branch network of an intention recognition model according to the voice data, and a second feature vector through a second branch network of the intention recognition model according to the text data;
acquiring associated data of the user, and determining a third feature vector through a third branch network of the intention recognition model according to the associated data, wherein the associated data comprises at least one of user information data, user behavior data, and historical interaction data, and the historical interaction data is voice sent to the user during the process of interacting with the user;
fusing the first feature vector, the second feature vector and the third feature vector to determine a total feature vector, inputting the total feature vector into an intermediate branch network of the intention recognition model, and determining a user intention vector output by the intermediate branch network;
and inputting the user intention vector into each output branch network of the intention recognition model, and determining the intention recognition result of the user according to the classification results output by the output branch networks, wherein the intention of the user is used for determining voice information with which to reply to the user, and different output branch networks output classification results of different intention types.
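To make the claimed data flow concrete, the following is a minimal sketch of the multi-branch architecture in PyTorch. The module names, layer choices, dimensions, and fusion by concatenation are all illustrative assumptions; the claim does not prescribe a particular implementation.

import torch
import torch.nn as nn

class IntentRecognitionModel(nn.Module):
    """Hypothetical sketch of claim 1: three input branches, an intermediate
    branch over the fused total feature vector, and one output branch per
    intention type."""
    def __init__(self, speech_dim=128, text_dim=256, assoc_dim=64,
                 hidden_dim=128, intent_dim=64, num_classes=(3, 3, 2)):
        super().__init__()
        # First branch network: encodes acoustic features of the voice data.
        self.speech_branch = nn.Sequential(nn.Linear(speech_dim, hidden_dim), nn.ReLU())
        # Second branch network: encodes the transcribed text data.
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Third branch network: encodes the user's associated data.
        self.assoc_branch = nn.Sequential(nn.Linear(assoc_dim, hidden_dim), nn.ReLU())
        # Intermediate branch network: maps the total feature vector to a
        # single user intention vector.
        self.intermediate = nn.Sequential(nn.Linear(3 * hidden_dim, intent_dim), nn.ReLU())
        # One output branch network per intention type (e.g. emotion,
        # attitude, affirmative/negative as in claim 6).
        self.output_branches = nn.ModuleList(
            [nn.Linear(intent_dim, n) for n in num_classes])

    def forward(self, speech_feat, text_feat, assoc_feat):
        v1 = self.speech_branch(speech_feat)     # first feature vector
        v2 = self.text_branch(text_feat)         # second feature vector
        v3 = self.assoc_branch(assoc_feat)       # third feature vector
        total = torch.cat([v1, v2, v3], dim=-1)  # fused total feature vector
        intent = self.intermediate(total)        # user intention vector
        # Each head outputs the classification result for its intention type.
        return [branch(intent) for branch in self.output_branches]

Under these assumptions, each output branch is an independent classifier head over a shared user intention vector, which is what allows the per-branch losses in claim 7 to be summed into one training objective.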
2. The method of claim 1, wherein determining a first feature vector through a first branch network of the intention recognition model according to the voice data specifically comprises:
removing invalid data from the voice data to obtain data to be recognized;
and inputting the data to be recognized into the first branch network of the intention recognition model, and determining the first feature vector.
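One hedged reading of "removing invalid data" is dropping low-energy (for example, silent) frames before the first branch network; the claim does not define invalid data, so the frame representation and threshold below are assumptions.

import torch

def remove_invalid_frames(frames, energy_threshold=1e-3):
    """frames: (num_frames, frame_dim) acoustic feature frames."""
    energy = frames.pow(2).mean(dim=-1)       # mean energy per frame
    return frames[energy > energy_threshold]  # the data to be recognized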
3. The method of claim 1, wherein determining a third feature vector through a third branch network of the intention recognition model according to the associated data specifically comprises:
determining, according to the user information data of the user, the user information data of each preset type corresponding to the user;
for each preset type, encoding the user information data of that type to determine a portrait code;
inputting each portrait code into the neural network layer corresponding to its preset type in the third branch network to obtain the portrait vector corresponding to that portrait code;
fusing the portrait vectors corresponding to the portrait codes to determine a comprehensive portrait vector;
and inputting the comprehensive portrait vector into a fusion network layer in the third branch network, and determining the output of the fusion network layer as the third feature vector.
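The per-type encoding in claim 3 can be sketched as one embedding layer per preset type, followed by concatenation and a fusion layer; the preset type names, vocabulary sizes, and the use of embeddings as the per-type neural network layers are all hypothetical.

import torch
import torch.nn as nn

class PortraitBranch(nn.Module):
    """Illustrative third branch network over user information data."""
    def __init__(self, type_vocab_sizes, embed_dim=16, out_dim=128):
        super().__init__()
        # One neural network layer (here an embedding) per preset type.
        self.encoders = nn.ModuleDict(
            {t: nn.Embedding(n, embed_dim) for t, n in type_vocab_sizes.items()})
        # Fusion network layer over the concatenated per-type portrait vectors.
        self.fusion = nn.Linear(embed_dim * len(type_vocab_sizes), out_dim)

    def forward(self, portrait_codes):
        # portrait_codes: dict mapping each preset type to an integer-code tensor.
        vectors = [self.encoders[t](portrait_codes[t]) for t in self.encoders]
        combined = torch.cat(vectors, dim=-1)  # comprehensive portrait vector
        return self.fusion(combined)           # third feature vector

# Example usage with hypothetical preset types:
branch = PortraitBranch({"age_band": 8, "gender": 3, "region": 40})
codes = {"age_band": torch.tensor([2]), "gender": torch.tensor([1]),
         "region": torch.tensor([17])}
third_feature_vector = branch(codes)  # shape: (1, 128)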
4. The method of claim 1, wherein the user behavior data is user behavior data recorded before the process of interacting with the user;
determining a third feature vector through a third branch network of the intention recognition model according to the associated data, specifically comprising:
coding each behavior of the user according to the user behavior data;
determining each user behavior vector according to the code of each user behavior;
and inputting the user behavior vectors in sequence into the third branch network of the intention recognition model, and determining the third feature vector according to the hidden layer features of the third branch network of the intention recognition model.
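The behavior-sequence variant in claim 4 reads naturally as a recurrent encoder: behavior codes are embedded and fed in chronological order, and the final hidden state serves as the hidden layer feature. The choice of a GRU and all sizes below are assumptions.

import torch
import torch.nn as nn

class BehaviorBranch(nn.Module):
    """Illustrative third branch network over user behavior data."""
    def __init__(self, num_behaviors=1000, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_behaviors, embed_dim)  # behavior code -> vector
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, behavior_ids):
        # behavior_ids: (batch, seq_len) behavior codes in chronological order.
        vectors = self.embed(behavior_ids)  # user behavior vectors
        _, hidden = self.rnn(vectors)       # final hidden layer feature
        return hidden[-1]                   # third feature vector, (batch, hidden_dim)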
5. The method according to claim 3, wherein determining a third feature vector through a third branch network of the intention recognition model according to the associated data specifically comprises:
determining, according to the historical interaction data, each voice sent during the process of interacting with the user;
determining a target voice according to the order in which the voices were sent;
and inputting the text data corresponding to the target voice into the third branch network of the intention recognition model, and determining the third feature vector.
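Claim 5 only requires selecting a target voice by sending order; one plausible reading, sketched below, is taking the most recently sent voice and forwarding its text to the third branch network. Both the selection rule and the data shape are hypothetical.

def pick_target_voice(history):
    """history: list of (send_order, text) pairs for voices sent to the user."""
    _, text = max(history, key=lambda item: item[0])  # latest by sending order
    return text  # text data to feed into the third branch network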
6. The method of claim 1, wherein the output branch networks comprise at least two of: an output branch network for outputting user emotion, an output branch network for outputting user attitude, and an output branch network for outputting whether the user's response is affirmative.
7. The method of claim 1, wherein the intention recognition model is trained by a method that specifically comprises:
acquiring voice data generated in historical processes of interacting with different users;
for each segment of voice data, determining the text data corresponding to that segment and the associated data of the corresponding user, and taking the voice data and the associated data as a training sample;
determining sample labels of the training samples according to the voice data and the interaction results of the interaction processes, wherein the sample labels comprise a label corresponding to each output branch network;
inputting the training samples into an intention recognition model to be trained to obtain the output results of its output branch networks;
determining the loss corresponding to the output result of each output branch network according to the obtained output results and the labels corresponding to the output branch networks in the sample labels;
and determining a total loss according to the losses corresponding to the output results of the output branch networks, and adjusting the parameters of the intention recognition model to be trained with minimization of the total loss as the optimization objective.
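The training procedure in claim 7 is standard multi-task learning: one loss per output branch network, summed into a total loss that is minimized. Below is a minimal sketch reusing the hypothetical IntentRecognitionModel sketched after claim 1, with cross-entropy heads and equal loss weights assumed.

import torch
import torch.nn as nn

def train_step(model, optimizer, speech, text, assoc, labels):
    """labels: one 1-D class-index tensor per output branch network."""
    criterion = nn.CrossEntropyLoss()
    outputs = model(speech, text, assoc)  # one logit tensor per output branch
    # Loss corresponding to each output branch's result, against its label.
    losses = [criterion(out, lab) for out, lab in zip(outputs, labels)]
    total_loss = torch.stack(losses).sum()  # total loss over all branches
    optimizer.zero_grad()
    total_loss.backward()  # adjust parameters toward the minimum total loss
    optimizer.step()
    return total_loss.item()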
8. A speech intent recognition apparatus, characterized in that the apparatus specifically comprises:
a text data determination module, configured to acquire voice data of a user and determine text data corresponding to the voice data according to the voice data;
a first feature vector determination module, configured to determine a first feature vector through a first branch network of an intention recognition model according to the voice data, and a second feature vector through a second branch network of the intention recognition model according to the text data;
a second feature vector determination module, configured to acquire associated data of the user and determine a third feature vector through a third branch network of the intention recognition model according to the associated data, wherein the associated data comprises at least one of user information data, user behavior data, and historical interaction data, and the historical interaction data is voice sent to the user during the process of interacting with the user;
a user intention vector determining module, configured to fuse the first feature vector, the second feature vector, and the third feature vector, determine a total feature vector, input the total feature vector into an intermediate branch network of the intention recognition model, and determine a user intention vector output by the intermediate branch network;
and an intention recognition module, configured to input the user intention vector into each output branch network of the intention recognition model and determine the intention recognition result of the user according to the classification results output by the output branch networks, wherein the intention of the user is used for determining voice information with which to reply to the user, and different output branch networks output classification results of different intention types.
9. A computer-readable storage medium, wherein the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 7.
CN202011493429.0A 2020-12-16 2020-12-16 Voice intention recognition method and device Pending CN112597301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011493429.0A CN112597301A (en) 2020-12-16 2020-12-16 Voice intention recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011493429.0A CN112597301A (en) 2020-12-16 2020-12-16 Voice intention recognition method and device

Publications (1)

Publication Number Publication Date
CN112597301A true CN112597301A (en) 2021-04-02

Family

ID=75197022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011493429.0A Pending CN112597301A (en) 2020-12-16 2020-12-16 Voice intention recognition method and device

Country Status (1)

Country Link
CN (1) CN112597301A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113674732A (en) * 2021-08-16 2021-11-19 北京百度网讯科技有限公司 Voice confidence detection method and device, electronic equipment and storage medium
CN113674732B (en) * 2021-08-16 2022-05-17 北京百度网讯科技有限公司 Voice confidence detection method and device, electronic equipment and storage medium
CN113688221A (en) * 2021-09-08 2021-11-23 中国平安人寿保险股份有限公司 Model-based dialect recommendation method and device, computer equipment and storage medium
CN113688221B (en) * 2021-09-08 2023-07-25 中国平安人寿保险股份有限公司 Model-based conversation recommendation method, device, computer equipment and storage medium
CN114818738A (en) * 2022-03-01 2022-07-29 达而观信息科技(上海)有限公司 Method and system for identifying user intention track of customer service hotline
WO2023165111A1 (en) * 2022-03-01 2023-09-07 达而观信息科技(上海)有限公司 Method and system for identifying user intention trajectory in customer service hotline

Similar Documents

Publication Publication Date Title
CN111292728B (en) Speech recognition method and device
US10991366B2 (en) Method of processing dialogue query priority based on dialog act information dependent on number of empty slots of the query
CN111862977B (en) Voice conversation processing method and system
US20200160836A1 (en) Multi-dialect and multilingual speech recognition
CN112597301A (en) Voice intention recognition method and device
CN110263158B (en) Data processing method, device and equipment
CN115952272B (en) Method, device and equipment for generating dialogue information and readable storage medium
CN113505591A (en) Slot position identification method and electronic equipment
CN113221555B (en) Keyword recognition method, device and equipment based on multitasking model
CN111739520B (en) Speech recognition model training method, speech recognition method and device
CN112417093B (en) Model training method and device
CN115545002B (en) Model training and business processing method, device, storage medium and equipment
CN112735374A (en) Automatic voice interaction method and device
CN113887206B (en) Model training and keyword extraction method and device
CN113887227A (en) Model training and entity recognition method and device
CN110046231A (en) A kind of customer service information processing method, server and system
CN112771607A (en) Electronic device and control method thereof
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
CN114333838A (en) Method and system for correcting voice recognition text
CN113887235A (en) Information recommendation method and device
CN112908315A (en) Question-answer intention judgment method based on voice characteristics and voice recognition
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
Boonstra Introduction to conversational AI
CN113221533B (en) Label extraction method, device and equipment for experience sound
US20240078391A1 (en) Electronic device for training speech recognition model and control method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination