CN110827831A - Voice information processing method, device, equipment and medium based on man-machine interaction - Google Patents

Voice information processing method, device, equipment and medium based on man-machine interaction

Info

Publication number
CN110827831A
Authority
CN
China
Prior art keywords
human
voice information
computer interaction
information processing
input sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911118806.XA
Other languages
Chinese (zh)
Inventor
姚志强
周曦
李继伟
杜晓薇
郝东
赵云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Honghuang Intelligent Technology Co Ltd
Original Assignee
Guangzhou Honghuang Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Honghuang Intelligent Technology Co Ltd filed Critical Guangzhou Honghuang Intelligent Technology Co Ltd
Priority to CN201911118806.XA priority Critical patent/CN110827831A/en
Publication of CN110827831A publication Critical patent/CN110827831A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/22 Interactive procedures; Man-machine interfaces

Abstract

The invention provides a voice information processing method, device, equipment and medium based on human-computer interaction, wherein the method comprises the following steps: acquiring input voice information based on natural language; preprocessing the voice information and extracting associated features corresponding to the voice information, wherein the associated features comprise at least one of the following dimensions: named entities, domains and intents; performing feature fusion, according to slot position information, on one or more of the named entity, domain and intent dimensions extracted from the associated features; and determining a corresponding response strategy according to the fused features. By extracting one or more of the named entity, domain and intent dimensions from the associated features, fusing the features according to the slot position information and then responding, the method can, on the one hand, flexibly match the requirements of the user; on the other hand, the system can be quickly migrated to other scenarios after fine-tuning, which improves its applicability. In addition, the number of dialogue turns is significantly reduced, a more accurate response can be obtained without multiple rounds of dialogue, and the user's dialogue experience is improved.

Description

Voice information processing method, device, equipment and medium based on man-machine interaction
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a voice information processing method, a device, equipment and a medium based on human-computer interaction.
Background
Human-computer interaction is a sub-field of artificial intelligence; in plain terms, it lets people interact with computers, such as human-machine dialogue systems, through human language (i.e. natural language). Through the interaction between a person and the human-computer dialogue system, the system can understand the person's intention and requirements and thereby complete tasks such as searching for songs, placing orders while shopping, and controlling devices.
However, existing human-computer interaction systems are based on rule matching, whose implementation is complex and insufficiently flexible, so the machine cannot adequately understand the user's intention; meanwhile, end-to-end deep learning models are not conducive to controlling the interaction process.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a method, an apparatus, a device and a medium for processing voice information based on human-computer interaction, which are used to solve the problem that voice interaction cannot be simply and flexibly implemented in the existing human-computer voice interaction process.
In order to achieve the above objects and other related objects, the present invention provides a method for processing voice information based on human-computer interaction, including:
acquiring input voice information based on natural language;
preprocessing the voice information, and extracting associated features corresponding to the voice information, wherein the associated features comprise at least one of the following dimensions: named entities, domains and intents;
performing feature fusion, according to slot position information, on one or more of the named entity, domain and intent dimensions extracted from the associated features;
and determining a corresponding response strategy according to the fused features.
Another object of the present invention is to provide a voice information processing apparatus based on human-computer interaction, comprising:
the dialogue acquisition module is used for acquiring input voice information based on natural language;
a feature extraction module, configured to preprocess the voice information and extract associated features corresponding to the voice information, where the associated features include at least one of the following dimensions: named entities, domains and intents;
the feature fusion module is used for performing feature fusion, according to slot position information, on one or more of the named entity, domain and intent dimensions extracted from the associated features;
and the response module is used for determining a corresponding response strategy according to the fused features.
It is another object of the invention to provide an apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform the human-machine interaction based voice information processing method described above.
It is also an object of the invention to provide one or more machine readable media comprising:
stored thereon are instructions which, when executed by one or more processors, cause an apparatus to perform the human-computer interaction based speech information processing method described above.
As described above, the voice information processing method, device, equipment and medium based on human-computer interaction provided by the present invention have the following beneficial effects:
By extracting one or more of the named entity, domain and intent dimensions from the associated features, fusing the features according to the slot position information and then responding, the method can, on the one hand, flexibly match the requirements of the user; on the other hand, it can be quickly migrated to other scenarios with a small amount of adjustment, which improves its applicability. In addition, using the fused features for the policy response significantly reduces the number of dialogue turns, a more effective and accurate response can be obtained without multiple rounds of dialogue, and the user's dialogue experience is improved.
Drawings
Fig. 1 is a flowchart of a method for processing voice information based on human-computer interaction according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating named entity recognition in a human-computer interaction based voice information processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart of an embedded vector generation method for human-computer interaction based speech information processing according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating intent recognition in a method for processing human-computer interaction-based speech information according to an embodiment of the present invention;
FIG. 5 is a flowchart of encoding generation in intent recognition of a human-computer interaction-based speech information processing method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a method for processing voice information based on human-computer interaction according to an embodiment of the present invention;
FIG. 7 is a block diagram of a voice information processing apparatus based on human-computer interaction according to an embodiment of the present invention;
FIG. 8 is a block diagram of a complete structure of a speech information processing apparatus based on human-computer interaction according to an embodiment of the present invention;
fig. 9 is a block diagram of a named entity extraction sub-module in a human-computer interaction-based speech information processing apparatus according to an embodiment of the present invention;
FIG. 10 is a block diagram of an embedded vector generation unit in a human-computer interaction-based speech information processing apparatus according to an embodiment of the present invention;
fig. 11 is a block diagram of an intention extraction submodule in the human-computer interaction based speech information processing apparatus according to the embodiment of the present invention;
FIG. 12 is a block diagram of a code generating unit in the intent recognition of a human-computer interaction-based speech information processing apparatus according to an embodiment of the present invention;
fig. 13 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention;
fig. 14 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
Description of the element reference numerals
0 initialization module
1 dialogue acquisition module
2 feature extraction module
21 named entity extraction submodule
22 intent extraction submodule
23 domain extraction submodule
3 feature fusion module
4 response module
5 updating module
6 domain detection module
1100 input device
1101 first processor
1102 output device
1103 first memory
1104 communication bus
1200 processing component
1201 second processor
1202 second memory
1203 communication component
1204 power supply component
1205 multimedia component
1206 voice component
1207 input/output interface
1208 sensor component
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments only illustrate the basic idea of the present invention in a schematic way; they show only the components related to the present invention rather than the number, shape and size of the components in an actual implementation, and the type, quantity, proportion and layout of the components may be changed freely and may be more complicated in practice.
Referring to fig. 1, a flowchart of a method for processing voice information based on human-computer interaction according to an embodiment of the present invention includes:
step S1, acquiring input voice information based on natural language;
the user inputs voice information based on natural language (human language understood by human or machine) through the terminal, and the voice information can be in the form of voice, characters or pictures, but is not limited to the form.
Step S2, preprocessing the voice information, and extracting associated features corresponding to the voice information, where the associated features include at least one of the following dimensions: named entities, domains and intents;
The voice information is normalized, for example by simple scaling, per-sample mean subtraction or feature standardization, so as to improve its accuracy and facilitate subsequent feature extraction. Meanwhile, the voice information is recognized with a pre-trained named entity model to obtain the corresponding named entity; the voice information is recognized with a pre-trained intent recognition model to obtain the corresponding intent; and the domain feature among the associated features is extracted according to the extracted named entity and intent. The recognized named entity, domain and intent may be three features, two features or one feature, which is not limited here.
Wherein the associated features comprise global features and local features.
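As an illustrative sketch only, the normalization options mentioned above (simple scaling, per-sample mean subtraction and feature standardization) could be expressed as follows in Python/NumPy; the function names and the 1e-8 stabilizer are assumptions for the example and are not prescribed by this description.

import numpy as np

def simple_scaling(x: np.ndarray) -> np.ndarray:
    """Rescale each feature to roughly [-1, 1] by its maximum absolute value."""
    return x / (np.max(np.abs(x), axis=0, keepdims=True) + 1e-8)

def per_sample_mean_subtraction(x: np.ndarray) -> np.ndarray:
    """Subtract the mean of each sample (row) from that sample."""
    return x - x.mean(axis=1, keepdims=True)

def feature_standardization(x: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance normalization computed per feature (column)."""
    return (x - x.mean(axis=0, keepdims=True)) / (x.std(axis=0, keepdims=True) + 1e-8)

# Example: a batch of 3 feature vectors with 4 dimensions each.
features = np.array([[1.0, 2.0, 3.0, 4.0],
                     [2.0, 0.0, 1.0, 3.0],
                     [0.5, 1.5, 2.5, 3.5]])
print(feature_standardization(features).shape)  # (3, 4)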
Step S3, performing feature fusion, according to slot position information, on one or more of the named entity, domain and intent dimensions extracted from the associated features;
According to the extracted features of one or several dimensions among the named entity, domain and intent, the slot position information is filled into the corresponding named entity slots, domain slots and intent slots, and features with slot values of the same type are merged to realize feature fusion, which accurately reflects the user requirements.
Specifically, the slot position information is the value used to fill a slot related to a user behavior. Taking the user behavior "booking an air ticket" as an example, the slots related to "booking an air ticket" include the departure time, the departure place and the destination. For these slots, in the user sentence "book a flight ticket from Shenzhen to Beijing at ten o'clock tomorrow", the slot position information includes departure time = "ten o'clock tomorrow", departure place = "Shenzhen", destination = "Beijing", and so on. The voice interaction system needs to fill all the slots related to a user behavior before it can execute that behavior; for example, it needs to fill the departure time, departure place and destination slots related to "booking an air ticket" before executing the "booking an air ticket" behavior. Here the extracted domain and intent are "booking an air ticket", while the named entity would be information such as the identity information in the user's personal profile, which this sentence does not contain.
Specifically, the slots may be identified with a conditional random field model, a recurrent convolutional network or a dictionary lookup method, so as to fill the extracted features into the slots, which is not specifically limited here.
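The slot filling and fusion just described can be sketched in Python as below; the slot names for the "booking an air ticket" example, the dictionary-style slot store and the rule that later dimensions fill gaps left by earlier ones are illustrative assumptions, not a prescribed data structure.

from typing import Dict, List, Optional

# Slots assumed for the illustrative "book flight ticket" behaviour.
REQUIRED_SLOTS = ("departure_time", "departure_place", "destination")

def fuse_slots(entity_slots: Dict[str, str],
               domain_slots: Dict[str, str],
               intent_slots: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Merge slot values of the same type coming from the named-entity,
    domain and intent dimensions; later sources fill gaps left by earlier ones."""
    fused: Dict[str, Optional[str]] = {slot: None for slot in REQUIRED_SLOTS}
    for source in (entity_slots, domain_slots, intent_slots):
        for slot, value in source.items():
            if slot in fused and fused[slot] is None:
                fused[slot] = value
    return fused

def missing_slots(fused: Dict[str, Optional[str]]) -> List[str]:
    """Slots that still need to be collected before the behaviour can run."""
    return [slot for slot, value in fused.items() if value is None]

# "Book a flight ticket from Shenzhen to Beijing at ten o'clock tomorrow"
fused = fuse_slots({"departure_place": "Shenzhen", "destination": "Beijing"},
                   {}, {"departure_time": "ten o'clock tomorrow"})
print(missing_slots(fused))   # [] -> all slots filled, the behaviour can execute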
Step S4, determining a corresponding response strategy according to the fused features.
Further, according to the fused features, a corresponding response strategy is determined based on rules or models.
Specifically, for example, within a given human-computer interaction domain, different architectures may be divided according to the relevance of the intents, and policy groups may be provided correspondingly for the different architectures, so as to achieve an accurate policy response.
In this embodiment, one or more of the named entity, domain and intent dimensions are extracted from the associated features, the features are fused according to the slot position information, and a response is then made; on the one hand, the requirements of the user can be flexibly matched; on the other hand, the method can be quickly migrated to other scenarios with simple fine-tuning, which improves its applicability (it can be widely applied to voice assistants, intelligent customer service, smart speakers, chat robots and the like). In addition, using the fused features for the policy response significantly reduces the number of dialogue turns, a more effective and accurate response can be obtained without multiple rounds of dialogue, and the user's dialogue experience is improved.
Referring to fig. 2, a flowchart of named entity recognition in the human-computer interaction based voice information processing method according to the embodiment of the present invention is detailed as follows:
step S210, constructing an embedded vector of an input sequence;
Since the original input sequence needs to be converted into a vectorized representation, the steps, detailed in fig. 3, are as follows:
Step S2101, segmenting the original input sequence corresponding to the voice information into characters, words or multiple grammar units (N-grams), i.e., re-segmentation;
Step S2102, forming the input sequence in time order with a single granularity or a combination of several granularities; that is, the segmented units are rearranged, where the granularity of the arranged time sequence may be characters, words, multiple grammar units or a combination thereof, and the input at one time point may be referred to as one input unit of the input sequence.
Step S2103, extracting each unit of the input sequence based on any one or several dimensions among semantic embedding, glyph embedding and pronunciation embedding;
The embedding types include semantic embedding, glyph embedding and pronunciation embedding. For example, for the semantic embedding extracted for each unit of the input sequence, the embedding vectors may be trained from scratch or loaded directly from pre-trained embedding vectors. The glyph embedding of each unit of the input sequence is extracted with a deep convolutional neural network, pre-trained as follows: the glyph image of each character is input, a classification model is built to predict the ID of each image in the font library, and a fully connected layer performs the classification so as to minimize the cross-entropy loss and determine the output. The pronunciation embedding of each unit of the input sequence is extracted with any one of a recurrent neural network, a long short-term memory network or a recursive neural network, pre-trained as follows: the pronunciation of each character is input, a classification model is built to predict the corresponding ID in the pronunciation library, and a fully connected layer performs the classification so as to minimize the cross-entropy loss and determine the output.
Specifically, the embedding draws on three aspects, namely semantics, glyph and pronunciation; the three share and reinforce semantic information, so that semantic information is fully utilized, which on the one hand improves the accuracy and efficiency of the embedded-vector representation and on the other hand reduces the workload of preparing training data.
Word embeddings may be obtained by table lookup, for example by looking up a word or N-gram (multiple grammar units) in an embedding table of words or N-grams, or they may be generated from character embeddings.
Step S2104, fusing the multiple embedding types to generate an embedding vector of the input sequence.
Specifically, if multiple embedding types are selected, fusion of the multiple embedding types is performed to obtain the final embedding vector of the input sequence.
With this approach, introducing multiple embedding types and embedding modes for the voice information input by the user reduces the amount of training data required; expressing the embedded vector with multiple granularities and multiple scopes avoids the traditional situation in which only the keywords of a domain are captured, and accurately matches the embedded vector to the user's real intent and real domain.
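The fusion of multiple embedding types in steps S2101 to S2104 can be sketched as follows in PyTorch. This assumes that fusion is done by concatenating the per-unit semantic, glyph and pronunciation vectors and that the glyph and pronunciation vectors are loaded as pre-computed lookup tables; these are illustrative choices, not the only ones the description allows.

import torch
import torch.nn as nn

class MultiEmbedding(nn.Module):
    """Fuse semantic, glyph and pronunciation embeddings of each input unit."""
    def __init__(self, vocab_size: int, glyph_dim: int, pron_dim: int, sem_dim: int = 128):
        super().__init__()
        # Semantic embedding: trained from scratch or loaded from a pre-trained table.
        self.semantic = nn.Embedding(vocab_size, sem_dim)
        # Glyph embedding: assumed to be produced offline by a CNN over glyph images
        # and loaded here as a lookup table (illustrative assumption).
        self.glyph = nn.Embedding(vocab_size, glyph_dim)
        # Pronunciation embedding: assumed to be produced offline by an RNN/LSTM
        # over each character's pronunciation and loaded as a lookup table.
        self.pron = nn.Embedding(vocab_size, pron_dim)

    def forward(self, unit_ids: torch.Tensor) -> torch.Tensor:
        # unit_ids: (batch, seq_len) indices of characters/words/N-gram units.
        sem = self.semantic(unit_ids)
        gly = self.glyph(unit_ids)
        pro = self.pron(unit_ids)
        # Fusion by concatenation along the feature dimension.
        return torch.cat([sem, gly, pro], dim=-1)

emb = MultiEmbedding(vocab_size=5000, glyph_dim=64, pron_dim=32)
vectors = emb(torch.randint(0, 5000, (2, 10)))   # (2, 10, 128 + 64 + 32)
print(vectors.shape)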
Step S211, constructing a named entity feature generation model and generating features of an input sequence;
specifically, a pre-trained named entity feature generation model is adopted, training data are reduced, the named entity feature generation model is trained by adopting a two-way long and short memory network or a Transformer model, features with different granularities and dimensions required by a user in a fusion session can be generated, and meanwhile, the accuracy of feature extraction is improved subsequently.
Step S212, constructing a named entity distinguishing model, and generating a predicted named entity sequence as an output named entity;
specifically, the named entity discrimination model is trained by using a maximum likelihood estimation algorithm based on a conditional random field algorithm model, for example, the training criterion using the maximum likelihood estimation algorithm is as follows:
P(y | x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))
where x is the input sequence, y is the output sequence, Score (x, y) is the Score for the input sequence with x and the output sequence with y, and exp is an exponential function with a natural constant e as the base.
Score(x, y) = Σ_i [ Ψ_EMIT(y_i -> x_i) + Ψ_TRANS(y_{i-1} -> y_i) ]
In the above formula, Ψ_EMIT(y_i -> x_i) denotes the potential with which label y_i emits the corresponding input unit x_i, and comes from the output features of the feature generation model; Ψ_TRANS(y_{i-1} -> y_i) denotes the potential of transitioning from label y_{i-1} at time i-1 to label y_i at time i, and is a training parameter of the CRF (conditional random field); i sums over the length of the input sequence. The CRF is trained with the maximum likelihood estimation algorithm, and the loss function is computed to obtain the output sequence.
In addition, the named entity discrimination model can also adopt the minimum cross entropy loss as a training criterion, and the cross entropy loss is specifically as follows:
loss = -Σ_{i=1}^{L} Σ_{j=1}^{M} y_{i,j} log p_{i,j}
In the formula, y is the label, p is the prediction probability of each input unit, M is the number of task categories, and L is the sentence length; the loss function is computed for each output character or word until it converges to the minimum cross-entropy loss.
In this embodiment, the original input sequence is expressed as an embedded vector, the features of the input sequence are generated by the named entity feature generation model, and these features are fed into the named entity discrimination model, which outputs the named entity sequence.
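The score Score(x, y) used by the CRF discrimination model above can be computed from the emission features of the feature generation model and a learned transition matrix, as in the sketch below. The full maximum-likelihood objective additionally requires the normalization term over all label sequences (usually computed with the forward algorithm), which is omitted here for brevity; tensor shapes are illustrative.

import torch

def crf_score(emissions: torch.Tensor,
              transitions: torch.Tensor,
              labels: torch.Tensor) -> torch.Tensor:
    """Score(x, y) = sum_i [ Psi_EMIT(y_i -> x_i) + Psi_TRANS(y_{i-1} -> y_i) ].

    emissions:   (seq_len, num_labels) output features of the feature generator.
    transitions: (num_labels, num_labels) CRF training parameters.
    labels:      (seq_len,) the candidate label sequence y.
    """
    seq_len = labels.shape[0]
    # Emission potentials: the feature of label y_i for input unit x_i.
    emit = emissions[torch.arange(seq_len), labels].sum()
    # Transition potentials: label y_{i-1} followed by label y_i.
    trans = transitions[labels[:-1], labels[1:]].sum()
    return emit + trans

emissions = torch.randn(6, 5)          # 6 input units, 5 possible labels
transitions = torch.randn(5, 5)
labels = torch.tensor([0, 2, 2, 1, 4, 0])
print(crf_score(emissions, transitions, labels))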
Referring to fig. 4, which shows the intent recognition flow in the human-computer interaction based voice information processing method according to the embodiment of the present invention, a pre-trained intent recognition model is used to recognize the voice information and obtain the corresponding intent, detailed as follows:
step S220, pre-training by using unsupervised voice information to obtain an encoder in a language model;
The language model comprises an encoder and a decoder; a vectorized input sequence is obtained by embedding the original input sequence of the voice information, the encoder encodes the input sequence, and the decoder decodes the encoded input sequence to obtain the output sequence.
In an embodiment, the step of encoding the input sequence by the encoder, see fig. 5 in detail, includes:
step S2200, obtaining the code of each unit in the input sequence, wherein the granularity of the input sequence is characters, words, a plurality of grammar units or the combination thereof;
step S2201, when it is detected that the input sequence contains more than one granularity, fusing the codes of different granularities according to a time sequence to obtain the code of the input sequence.
For example, when there is only one granularity in the input sequence, the corresponding codes can be directly output without fusing the codes.
In another embodiment, on the basis of the above embodiment, the method further includes:
and step S2202, when detecting that the input sequence relates to the context, fusing the codes of the input sequence according to the context to obtain the codes of the input sequence containing the context.
For example, when it is detected that the input sequence corresponds to no context, no fusion is required either.
In the above embodiment, the encoder performs the fusion with one or more of a recurrent neural network, an attention mechanism, a long short-term memory network or a recursive neural network, and by extracting features of multiple granularities and multiple dimensions from the input sequence it can accurately reflect the real needs (intents) of the user.
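One possible reading of steps S2200 to S2202 is sketched below: per-unit codes from different granularities are merged in time order by their position in the utterance, and codes from earlier turns are prepended as context. Merging by time offset and prepending the history are illustrative assumptions; other fusion choices (e.g. attention or an LSTM over the merged sequence) are equally compatible with the text.

import torch

def fuse_granularities(coded_units):
    """coded_units: list of (time_offset, code_vector) pairs coming from the
    character-, word- and N-gram-level coders; merge them in time order."""
    ordered = sorted(coded_units, key=lambda item: item[0])
    return torch.stack([code for _, code in ordered])   # (seq_len, dim)

def fuse_context(history_codes, current_codes):
    """Prepend the codes of previous turns so the encoder sees the context."""
    if not history_codes:
        return current_codes
    return torch.cat(history_codes + [current_codes], dim=0)

dim = 16
char_codes = [(0, torch.randn(dim)), (1, torch.randn(dim)), (2, torch.randn(dim))]
word_codes = [(0, torch.randn(dim)), (2, torch.randn(dim))]
current = fuse_granularities(char_codes + word_codes)    # (5, 16)
with_context = fuse_context([torch.randn(4, dim)], current)
print(with_context.shape)                                # torch.Size([9, 16])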
Specifically, the language model adopts a cross entropy training criterion, and cross entropy loss is as follows:
loss = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{L} Σ_{m=1}^{M} y^{(i)}_{j,m} log p(x^{(i)}_j = m | x^{(i)}_{<j})
wherein L is the sentence length; N is the total number of samples in the training set; M is the number of categories of each unit of the output sequence; x^{(i)}_j denotes the unit of output sequence i at index j, and x^{(i)}_{<j} denotes the grammar units before index j; y is the label; and p is the prediction probability.
Step S221, obtaining the intent recognition model of a preset scenario using the supervised voice information of the preset scenario and combining the encoder with the intent recognition classifier;
Further, when the scenario changes, the intent recognition model of the preset scenario may be fine-tuned.
The intent recognition classifier is a fully connected layer and is trained with the minimum cross-entropy loss, specifically as follows:
loss = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} y_{i,j} log p_{i,j}
wherein p is the prediction probability; y is an intention label; m is the number of categories of intent; n is the total number of samples in the training set.
During fine-tuning, the encoder shares one set of parameters with the encoder of the pre-training stage, and the encoder may consider the current voice input alone or multiple rounds of voice input; if multiple rounds of voice information are considered, the encoder needs to fuse the context, and the fusion may be concatenation, LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit) or a combination thereof. The intent recognition classifier may be a fully connected layer. An intent recognition model built on a recurrent neural network can process input sequences of variable length, capture the correlations within the input sequence well, and combine this correlation information for natural language processing, so that the intent confirmed by the model is more accurate.
In some implementations, dropout (random deactivation) is applied to the output of the encoder. This method of optimizing deep artificial neural networks randomly zeroes part of the hidden-layer weights or outputs during learning, reducing the interdependence among nodes, thereby regularizing the neural network and lowering its structural risk.
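A minimal sketch of the fine-tuning stage just described: the pre-trained encoder is reused, its output passes through dropout and a fully connected intent classifier, and the whole model is trained with cross-entropy. Mean-pooling the encoder output into a single vector, the stand-in encoder and the dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, enc_dim: int, num_intents: int,
                 dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder              # shares parameters with pre-training
        self.dropout = nn.Dropout(dropout)  # random deactivation of encoder output
        self.classifier = nn.Linear(enc_dim, num_intents)  # fully connected layer

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        codes = self.encoder(inputs)            # (batch, seq_len, enc_dim)
        pooled = codes.mean(dim=1)              # pool over time (illustrative choice)
        return self.classifier(self.dropout(pooled))

# Stand-in encoder for the sketch only; the real one comes from LM pre-training.
encoder = nn.Sequential(nn.Linear(32, 64), nn.Tanh())
model = IntentClassifier(encoder, enc_dim=64, num_intents=8)
logits = model(torch.randn(4, 10, 32))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 8, (4,)))  # minimum cross-entropy criterion
loss.backward()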
Step S222, inputting the voice information into the intent recognition model to recognize and obtain the corresponding intent.
In this embodiment, the model can be quickly migrated to a new application scenario through pre-training followed by fine-tuning; multi-granularity input sequence information is introduced into the encoder of the fine-tuned intent recognition model, and context information is introduced into the classifier input, which improves the accuracy of intent recognition.
Referring to fig. 6, a complete flowchart of a method for processing voice information based on human-computer interaction according to an embodiment of the present invention includes:
The system is started and the global information is initialized, where the global information comprises the personal information in the global slot position information, the voice interaction history and the current voice interaction state;
When the voice interaction starts, the voice information of the user is acquired, the associated features corresponding to the voice information are extracted, and the global information is updated according to the extracted associated features, where the global information comprises the personal information in the global slot position information, the voice interaction history and the current voice interaction state;
It is detected whether a domain is obtained among the associated features, i.e., whether the domain is valid;
When no domain is obtained, the user is guided to perform the voice interaction again;
When a domain is obtained, it is judged whether the domain is entered for the first time; if so, the domain information is set up, and the domain information and the global information are updated according to the extracted domain features; if the domain is not entered for the first time, the domain information and the global information are directly updated according to the extracted domain features.
Specifically, in natural-language communication the preceding and following sentences are very likely to be correlated to some extent; for example, the following sentence may further explain or supplement the preceding one. Therefore, the previous user intent and slot value information are cached, and when the user inputs a new utterance (sentence or request), the current intent and slot value information are combined with the cached intent and slot value information (the dialogue history), which enables a better understanding of the user's intent, and likewise of the user's domain and named entities.
And carrying out strategy response according to the fused features.
This can better improve the accuracy and efficiency of the policy response and maximize its performance.
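The global-information bookkeeping described above can be sketched as a small state object: slots, history and current state are initialized at start-up, updated after every extraction, and the domain slots are set up on first entry into a domain. All names below are illustrative, not part of the original disclosure.

class DialogState:
    """Illustrative global information store for the interaction flow."""
    def __init__(self):
        self.global_slots = {"name": None, "gender": None, "work_number": None}
        self.history = []          # voice interaction history
        self.current_state = "idle"
        self.domain = None
        self.domain_slots = {}

    def update(self, features: dict):
        """Merge newly extracted associated features into the global information."""
        for slot, value in features.get("slots", {}).items():
            if value is not None:
                self.global_slots[slot] = value
        self.history.append(features)
        self.current_state = features.get("state", self.current_state)

    def enter_domain(self, domain: str, required_slots: list):
        """First entry into a domain: set up the in-domain slot structure."""
        if self.domain != domain:
            self.domain = domain
            self.domain_slots = {slot: None for slot in required_slots}

state = DialogState()
state.update({"slots": {"name": "Li Yunlong", "work_number": "YCKJ9999"},
              "state": "account_question"})
if state.current_state == "account_question":          # domain judged valid
    state.enter_domain("reset_password", ["target_system"])
print(state.domain_slots)                               # {'target_system': None}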
In this embodiment, a password-reset task is taken as an example to describe the voice information processing flow of human-computer interaction. The task is designed as follows: the voice interaction domains can be divided into chit-chat, account questions and meal-ordering questions, and the account questions can be further divided into two frames: the "account permission question" and the "reset password question".
The "account permission question" frame may include "permission confirmation" and "notification information" intended by the user, and the response policy of the frame may be "apply permission to the system" and "ask for information" and the like.
The "reset password question" frame may include "reset password confirmation" and "notification information" intended by the user, and the response policy of the frame may be "perform password reset" and "query information" or the like.
The voice interaction process is as follows:
1. After the system starts, the global information is initialized, specifically including initializing the global slot position information (name, gender and work number), initializing the voice interaction history list, and initializing the current voice interaction state;
2. The user inputs the voice interaction "Hello, I am Li Yunlong, I need to reset my password, my work number is YCKJ9999". Global feature extraction is performed on the associated features: the named entity module can extract the name "Li Yunlong" and the work number "YCKJ9999", the domain is classified as "account question", the global work number and name are updated, the voice interaction history list is updated, and the current voice interaction state is updated;
3. The domain is judged to be valid;
4. On entering the domain for the first time, the domain information is initialized, the in-domain slot "system whose password needs to be reset" is set up, and the in-domain voice interaction history list is initialized;
5. Domain feature extraction is performed on "Hello, I am Li Yunlong, I need to reset my password, my work number is YCKJ9999": the domain extracted by the domain feature extraction model is "reset password question"; the named entity extracted by the named entity module is the work number "YCKJ9999", and the global work number is updated to stay consistent; the intent recognized at this point is "tell-info"; the features identified for the named entity slots, the domain slots and the intent slots are then fused;
6. A policy response is made according to the fused features: because the domain slot "system whose password needs to be reset" is still empty, the robot's response is to continue collecting information and give the reply "May I ask which system you need to reset the password for?"
In this embodiment, domain judgment and detection shorten the voice interaction process and improve the user's voice interaction experience.
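The final policy step of this worked example can be sketched as a rule that checks the fused slots of the "reset password question" frame and either executes the reset or keeps collecting information; the slot names, prompts and rule form are illustrative assumptions.

PROMPTS = {
    "target_system": "May I ask which system you need to reset the password for?",
    "work_number": "May I ask your work number?",
}

def respond_reset_password(fused_slots: dict) -> str:
    """Rule-based policy response for the 'reset password question' frame."""
    for slot, value in fused_slots.items():
        if value is None:
            return PROMPTS[slot]          # keep collecting information
    return ("Resetting the password of " + fused_slots["target_system"]
            + " for work number " + fused_slots["work_number"] + ".")

fused = {"work_number": "YCKJ9999", "target_system": None}
print(respond_reset_password(fused))      # asks for the system, as in step 6 above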
Referring to fig. 7, a block diagram of a voice information processing apparatus based on human-computer interaction according to an embodiment of the present invention includes:
the dialogue acquisition module 1 is used for acquiring input voice information based on natural language;
a feature extraction module 2, configured to preprocess the voice information and extract associated features corresponding to the voice information, where the associated features include at least one of the following dimensions: named entities, domains and intents;
the linguistic data in the voice information is normalized;
the feature fusion module 3 is used for performing feature fusion, according to slot position information, on one or more of the named entity, domain and intent dimensions extracted from the associated features;
according to the extracted features of one or several dimensions among the named entity, domain and intent, the slot position information is filled into the corresponding named entity slots, domain slots and intent slots, and features with slot values of the same type are merged to realize feature fusion;
and the response module 4 is used for determining the corresponding response strategy according to the fused features.
In an embodiment, as shown in fig. 8, a block diagram of a complete structure of a speech information processing apparatus based on human-computer interaction according to an embodiment of the present invention, the obtaining module further includes:
and the initialization module 0 is used for initializing global information, wherein the global information comprises personal information of the global slot information, a conversation historical record and a current conversation state.
In another embodiment, as shown in fig. 8, a block diagram of a complete structure of a human-computer interaction based speech information processing apparatus provided in an embodiment of the present invention, the feature fusion module further includes: and the updating module 5 is used for updating the global information according to the extracted associated features, wherein the global information comprises the personal information of the global slot position information, the conversation historical record and the current conversation state.
In one embodiment, the feature extraction module comprises a named entity extraction submodule 21, configured to recognize the voice information with a pre-trained named entity model to obtain the corresponding named entity; see fig. 9 for details, which is a block diagram of the named entity extraction submodule in the voice information processing apparatus based on human-computer interaction according to the embodiment of the present invention. The details are as follows:
an embedded vector generating unit 210 for generating an original input sequence corresponding to the voice information as an input sequence expressed by an embedded vector;
a named entity feature generation unit 211 for constructing a named entity feature generation model that generates features of an input sequence;
a named entity discriminating unit 212 for constructing a named entity discrimination model that generates the predicted named entity sequence.
Specifically, see fig. 10 for details, which is a block diagram of a vector generation unit embedded in the human-computer interaction based speech information processing apparatus according to the embodiment of the present invention; the details are as follows:
a segmentation subunit 2101, configured to segment the original input sequence into characters, words or multiple grammar units;
a sequence combination subunit 2102, configured to form the input sequence in time order with a single granularity or a combination of several granularities;
an embedding extraction subunit 2103, configured to extract each unit of the input sequence based on any one or several dimensions among semantic embedding, glyph embedding and pronunciation embedding;
a vector output subunit 2104, configured to fuse the multiple embedding types and generate the embedded vector of the input sequence.
Specifically, a deep convolutional neural network is employed to extract glyph embedding for each unit of the input sequence.
Specifically, the pronunciation embedding of each unit of the input sequence is extracted with any one of a recurrent neural network, a long short-term memory network or a recursive neural network.
Specifically, the named entity feature generation model is trained with a bidirectional long short-term memory network or a Transformer model.
Specifically, the named entity discrimination model is trained with a maximum likelihood estimation algorithm on the basis of a conditional random field algorithm model, so that the named entity discrimination model that outputs the predicted named entity sequence is obtained.
In another embodiment, the feature extraction module further comprises an intent extraction submodule 22, configured to recognize the voice information with a pre-trained intent recognition model to obtain the corresponding intent; see fig. 11 for details, which is a block diagram of the intent extraction submodule in the voice information processing apparatus based on human-computer interaction according to the embodiment of the present invention. The details are as follows:
an encoding generation unit 220, configured to perform pre-training with unsupervised voice information to obtain the encoder of the language model;
an intent recognition model generation unit 221, configured to obtain the intent recognition model of a preset scenario through the cross-entropy training criterion, using the supervised voice information of the preset scenario and combining the encoder with the intent recognition classifier;
and an intent extraction unit 222, configured to input the voice information into the intent recognition model to recognize the corresponding intent.
Specifically, the language model comprises an encoder and a decoder, and an input sequence of vectorization expression is obtained by embedding an original input sequence of the voice information; the encoder encodes the input sequence; and the decoder decodes the coded input sequence to obtain an output sequence.
Specifically, see fig. 12 for details, which is a block diagram illustrating a structure of a code generation unit in the intent recognition of the human-computer interaction based speech information processing apparatus according to the embodiment of the present invention; the details are as follows:
a coding subunit 2200, configured to obtain the code of each unit in the input sequence, where the granularity of the input sequence is characters, words, multiple grammar units or a combination thereof;
a first code fusion subunit 2201, configured to, when it is detected that the input sequence includes more than one granularity, fuse codes of different granularities in a time sequence to obtain a code of the input sequence.
In the above embodiment, the method further includes: a second code fusion subunit 2202, configured to, when it is detected that the input sequence relates to a context, fuse codes of the input sequence by context to obtain a code of the input sequence including the context.
Specifically, the encoder performs the fusion with one or more of a recurrent neural network, an attention mechanism, a long short-term memory network or a recursive neural network.
Specifically, the language model employs a cross-entropy training criterion.
In particular, the intent recognition classifier is a fully connected layer.
And the domain extraction submodule 23 is configured to extract a domain feature in the global feature according to the extracted named entity and the intention.
In another embodiment, as shown in fig. 8, which is a block diagram of the complete structure of the human-computer interaction based speech information processing apparatus provided in an embodiment of the present invention, the feature fusion module further comprises a domain detection module 6, configured to detect whether a domain is obtained from the associated features; when no domain is obtained, the user is guided to perform the voice interaction again; when a domain is obtained, it is judged whether the domain is entered for the first time, and if so, the domain information is set; if the domain is not entered for the first time, the domain information and the global information are updated according to the extracted domain features.
In this embodiment, the human-computer interaction based speech information processing apparatus corresponds one-to-one with the human-computer interaction based speech information processing method; for the specific functions and technical effects, reference is made to the above embodiments, and they are not repeated here.
An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the device may be used as a terminal device, and may also be used as a server, where examples of the terminal device may include: the mobile terminal includes a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, an intelligent television, a wearable device, and the like.
The embodiment of the present application further provides a non-volatile readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a device, the device may be enabled to execute instructions (instructions) of steps included in the human-computer interaction based voice information processing method in fig. 1 according to the embodiment of the present application.
Fig. 13 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be implemented by, for example, a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor or other electronic elements.
The first processor 1101 is coupled to the input device 1100 and the output device 1102 described above by a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes specific functions and technical effects for executing the functions of the modules of the speech recognition apparatus in each device, which are referred to in the foregoing embodiments and will not be described herein again.
Fig. 14 is a schematic hardware structure diagram of a terminal device according to an embodiment of the present application. FIG. 14 is a specific embodiment of FIG. 13 in an implementation. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 4 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication component 1203, power component 1204, multimedia component 1205, speech component 1206, input/output interfaces 1207, and/or sensor component 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing component 1200 may include one or more second processors 1201 to execute instructions to complete all or part of the steps of the human-computer interaction based voice information processing method described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the speech component 1206 further comprises a speaker for outputting speech signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, relative positioning of the components, presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
From the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 involved in the embodiment of fig. 14 may be implemented as input devices in the embodiment of fig. 13.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (46)

1. A voice information processing method based on human-computer interaction is characterized by comprising the following steps:
acquiring input voice information based on natural language;
preprocessing the voice information, and extracting associated features corresponding to the voice information, wherein the associated features comprise at least one of the following dimensions: named entities, domains, and intents;
performing feature fusion on one or more dimensions of the named entities, the fields and the intentions extracted from the associated features according to slot position information;
and determining a corresponding response strategy according to the fused features.
2. The human-computer interaction based voice information processing method according to claim 1, wherein the step of obtaining the input natural language based voice information is preceded by the step of:
initializing global information including personal information of the global slot information, a session history, and a current session state.
3. The human-computer interaction based voice information processing method according to claim 1, wherein the step of preprocessing the voice information comprises: and normalizing the linguistic data in the voice information.
4. The human-computer interaction based voice information processing method according to claim 2, further comprising: and updating the global information according to the extracted associated features, wherein the global information comprises personal information of the global slot information, a conversation historical record and a current conversation state.
5. The human-computer interaction based voice information processing method as claimed in claim 1, wherein the voice information is recognized by using a pre-trained named entity model to obtain a corresponding named entity.
6. The human-computer interaction based voice information processing method according to claim 5, wherein the step of recognizing the voice information by using a pre-trained named entity model to obtain a corresponding named entity comprises:
generating an original input sequence corresponding to the voice information into an input sequence expressed by an embedded vector;
constructing a named entity feature generation model for generating features of an input sequence;
a named entity discrimination model is constructed that generates a predicted sequence of named entities.
7. The human-computer interaction based voice information processing method according to claim 6, wherein the step of generating an original input sequence corresponding to the voice information into an input sequence expressed by an embedded vector comprises:
dividing an original input sequence into characters, words or a plurality of grammar units;
inputting a sequence by adopting single or multiple granularity combinations according to the time sequence;
extracting each unit of the input sequence based on any one or more dimensions of semantic embedding, glyph embedding or pronunciation embedding;
and fusing a plurality of embedding types to generate an embedding vector of the input sequence.
8. The human-computer interaction based voice information processing method of claim 7, wherein the glyph embedding of the unit of each of the input sequences is extracted by using a deep convolutional neural network.
9. The human-computer interaction based voice information processing method of claim 7, wherein the pronunciation embedding of each unit of the input sequence is extracted by any one of a recurrent neural network, a long short-term memory network, or a recursive neural network.
10. The human-computer interaction based voice information processing method of claim 6, wherein the named entity feature generation model is trained by using a bidirectional long short-term memory network or a Transformer model.
11. The human-computer interaction based voice information processing method of claim 6, wherein the named entity discrimination model for generating the predicted named entity sequence is constructed based on a conditional random field model and is trained by using a maximum likelihood estimation algorithm.
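
Claims 10-11 pair a feature generation network with a conditional random field discrimination model trained by maximum likelihood. A compact BiLSTM-CRF sketch under those assumptions (negative log-likelihood training only; Viterbi decoding is omitted), with sizes chosen arbitrarily:

```python
import torch
import torch.nn as nn

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size: int, num_tags: int, emb_dim: int = 64, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(hidden, num_tags)                     # per-position emission scores
        self.trans = nn.Parameter(torch.zeros(num_tags, num_tags))  # CRF transition scores

    def _emissions(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.emit(h)                                     # (B, T, num_tags)

    def _gold_score(self, emissions, tags):
        score = emissions.gather(2, tags.unsqueeze(-1)).squeeze(-1).sum(1)
        return score + self.trans[tags[:, :-1], tags[:, 1:]].sum(1)

    def _log_partition(self, emissions):
        alpha = emissions[:, 0]                                  # (B, num_tags)
        for t in range(1, emissions.size(1)):                    # forward algorithm in log space
            alpha = torch.logsumexp(
                alpha.unsqueeze(2) + self.trans.unsqueeze(0) + emissions[:, t].unsqueeze(1), dim=1)
        return torch.logsumexp(alpha, dim=1)

    def neg_log_likelihood(self, x, tags):
        emissions = self._emissions(x)
        return (self._log_partition(emissions) - self._gold_score(emissions, tags)).mean()

model = BiLSTMCRF(vocab_size=100, num_tags=5)
x = torch.randint(0, 100, (2, 7))              # two token-id sequences of length 7
tags = torch.randint(0, 5, (2, 7))             # gold tag ids (e.g. BIO labels)
model.neg_log_likelihood(x, tags).backward()   # maximum-likelihood training step
```
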
12. The human-computer interaction based voice information processing method according to claim 1, further comprising: recognizing the voice information by using a pre-trained intention recognition model to obtain a corresponding intention.
13. The human-computer interaction based voice information processing method according to claim 12, wherein the step of recognizing the voice information by using a pre-trained intention recognition model to obtain a corresponding intention comprises:
pre-training by using unsupervised voice information to obtain an encoder in a language model;
obtaining an intention recognition model for a preset scene by using supervised voice information in the preset scene in combination with the encoder and an intention recognition classifier;
and inputting the voice information into the intention recognition model to recognize the corresponding intention.
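
A hedged sketch of the two-stage scheme in claim 13: an encoder is first pre-trained inside a language model on unlabeled data, then reused under a fully connected intent classifier for the preset scene. The recurrent encoder and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sequence encoder shared between the language model and the intent classifier."""
    def __init__(self, vocab: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x):                                # x: (B, T)
        h, _ = self.rnn(self.emb(x))
        return h                                         # (B, T, dim)

class LanguageModel(nn.Module):
    """Unsupervised pre-training target: predict the next token."""
    def __init__(self, encoder: Encoder, vocab: int, dim: int = 64):
        super().__init__()
        self.encoder, self.head = encoder, nn.Linear(dim, vocab)

    def forward(self, x):
        return self.head(self.encoder(x))                # (B, T, vocab)

class IntentModel(nn.Module):
    """Pre-trained encoder plus a fully connected intent classifier."""
    def __init__(self, encoder: Encoder, num_intents: int, dim: int = 64):
        super().__init__()
        self.encoder, self.classifier = encoder, nn.Linear(dim, num_intents)

    def forward(self, x):
        return self.classifier(self.encoder(x)[:, -1])   # classify from the last hidden state

vocab, num_intents = 1000, 5
enc = Encoder(vocab)
lm = LanguageModel(enc, vocab)                 # stage 1: train on unlabeled transcripts
intent_model = IntentModel(enc, num_intents)   # stage 2: fine-tune on labeled scene data
```
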
14. The human-computer interaction based voice information processing method of claim 13, wherein the language model comprises an encoder and a decoder; a vectorized input sequence is obtained by embedding an original input sequence of the voice information; the encoder encodes the input sequence; and the decoder decodes the encoded input sequence to obtain an output sequence.
15. The human-computer interaction based voice information processing method of claim 13, wherein the step of encoding the input sequence by the encoder comprises: acquiring the encoding of each unit in the input sequence, wherein the granularity of the input sequence is characters, words, a plurality of grammar units, or a combination thereof; and when the input sequence is detected to contain more than one granularity, fusing the encodings of different granularities in temporal order to obtain the encoding of the input sequence.
16. The human-computer interaction based voice information processing method of claim 15, wherein when the input sequence is detected to involve a context, the encodings of the input sequence are fused according to the context to obtain an encoding of the input sequence that incorporates the context.
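
Claims 15-16 describe fusing per-unit encodings of different granularities in temporal order and folding in the dialogue context. A toy sketch, under the assumption that each character position is aligned to the word containing it; the alignment indices and stand-in embedding tables are illustrative only.

```python
import torch
import torch.nn as nn

dim = 16
char_enc = nn.Embedding(100, dim)              # stand-in character-level encoder
word_enc = nn.Embedding(50, dim)               # stand-in word-level encoder

char_ids = torch.tensor([3, 7, 9, 2])          # four characters
word_ids = torch.tensor([5, 8])                # two words covering those characters
char_to_word = torch.tensor([0, 0, 1, 1])      # which word each character belongs to

chars = char_enc(char_ids)                     # (4, dim)
words = word_enc(word_ids)[char_to_word]       # word encodings broadcast to character positions
fused = torch.cat([chars, words], dim=-1)      # time-aligned multi-granularity encoding, (4, 2*dim)

context = torch.zeros(1, 2 * dim)              # encoding of earlier turns, if any (claim 16)
fused_with_context = torch.cat([context, fused], dim=0)   # (5, 2*dim)
print(fused_with_context.shape)
```
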
17. The human-computer interaction based voice information processing method according to claim 13 or 14, wherein the encoder is formed by integrating one or more of a recurrent neural network, an attention mechanism, a long short-term memory network, or a recursive neural network.
18. The human-computer interaction based speech information processing method of claim 13, wherein the language model employs a cross-entropy training criterion.
19. The human-computer interaction based voice information processing method of claim 13, wherein the intention recognition classifier is a fully connected layer.
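
Continuing the intent-recognition sketch above (reusing the names vocab, num_intents, lm and intent_model defined there), one training step under the cross-entropy criterion of claim 18, with the fully connected classifier of claim 19; the tensors are random stand-ins for real data.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (8, 12))                 # (batch, time)
lm_logits = lm(tokens[:, :-1])                            # language model predicts the next token
lm_loss = criterion(lm_logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))

labels = torch.randint(0, num_intents, (8,))              # scene-specific intent labels
intent_loss = criterion(intent_model(tokens), labels)     # fully connected classification head
(lm_loss + intent_loss).backward()
```
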
20. The human-computer interaction based voice information processing method according to claim 1, wherein before the step of performing feature fusion on one or more of named entities, domains and intents extracted from the associated features according to slot position information, the method further comprises:
detecting whether a domain is acquired in the associated features;
when the domain is not acquired, guiding the user to perform voice interaction again;
and when the domain is acquired, judging whether the domain is entered for the first time; if so, setting domain information; and if the domain is not entered for the first time, updating the domain information and the global information according to the extracted domain features.
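
One possible reading of the domain check in claim 20, sketched as plain control flow; the prompt wording and session keys are hypothetical.

```python
from typing import Dict, Optional

def handle_domain(features: Dict[str, Optional[str]], session: Dict[str, object]) -> str:
    domain = features.get("domain")
    if domain is None:
        # no domain acquired: guide the user to interact again
        return "Sorry, I did not catch that. Which service do you need?"
    if session.get("domain") != domain:
        # entering this domain for the first time: set up the domain information
        session["domain"] = domain
        session["domain_slots"] = {}
    else:
        # already in the domain: update domain information with the new features
        session["domain_slots"].update({k: v for k, v in features.items() if v is not None})
    return f"Continuing in the '{domain}' domain."

session: Dict[str, object] = {}
print(handle_domain({"domain": "travel", "destination": "beijing"}, session))
print(handle_domain({"domain": "travel", "date": "tomorrow"}, session))
```
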
21. The human-computer interaction based voice information processing method according to claim 1 or 20, wherein a domain feature in the associated features is obtained according to the extracted named entity and intention.
22. The human-computer interaction based voice information processing method according to claim 1, wherein the step of performing feature fusion on one or more dimensions of named entities, domains and intentions extracted from the associated features according to slot position information comprises:
filling the features extracted in one or more of the named entity, domain and intention dimensions into the corresponding named entity slot, domain slot and intention slot according to the slot position information, and merging features with similar slot values to realize feature fusion.
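
One way to picture the slot filling and merging of claim 22, with slot-value similarity reduced to exact matching purely for illustration:

```python
from typing import Dict, List

def fuse_features(extracted: List[Dict[str, str]],
                  slots: Dict[str, List[str]]) -> Dict[str, List[str]]:
    for feature in extracted:
        for slot_name, value in feature.items():
            bucket = slots.setdefault(slot_name, [])   # named-entity / domain / intent slot
            if value not in bucket:                    # merge features carrying similar slot values
                bucket.append(value)
    return slots

slots: Dict[str, List[str]] = {"entity.city": [], "domain": [], "intent": []}
extracted = [
    {"entity.city": "beijing", "domain": "travel"},
    {"intent": "book_flight", "entity.city": "beijing"},   # the duplicate city is merged away
]
print(fuse_features(extracted, slots))
```
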
23. A voice information processing device based on human-computer interaction is characterized by comprising:
the dialogue acquisition module is used for acquiring input voice information based on natural language;
a feature extraction module, used for preprocessing the voice information and extracting associated features corresponding to the voice information, wherein the associated features comprise at least one of the following dimensions: named entities, domains, and intents;
a feature fusion module, used for performing feature fusion on one or more of the named entity, domain and intention dimensions extracted from the associated features according to the slot position information;
and a response module, used for implementing a rule-based or model-based policy response according to the fused features.
24. The human-computer interaction based voice information processing device of claim 23, further comprising, before the dialogue acquisition module:
an initialization module, used for initializing global information, wherein the global information comprises personal information carried in global slot information, a dialogue history record and a current dialogue state.
25. The human-computer interaction based voice information processing device of claim 23, wherein the preprocessing in the feature extraction module comprises: normalizing the corpus contained in the voice information.
26. The human-computer interaction based voice information processing device according to claim 24, further comprising: an updating module, used for updating the global information according to the extracted associated features, wherein the global information comprises the personal information carried in the global slot information, the dialogue history record and the current dialogue state.
27. The human-computer interaction based voice information processing device of claim 23, wherein the feature extraction module further comprises: a named entity extraction submodule, used for recognizing the voice information by using a pre-trained named entity model to obtain a corresponding named entity.
28. The human-computer interaction based voice information processing device of claim 27, wherein the named entity extraction submodule comprises:
an embedding vector generation unit, used for converting an original input sequence corresponding to the voice information into an input sequence expressed by embedding vectors;
a named entity feature generation unit, used for constructing a named entity feature generation model for generating the features of the input sequence;
and a named entity discrimination unit, used for constructing a named entity discrimination model for generating the predicted named entity sequence.
29. The human-computer interaction based voice information processing device of claim 28, wherein the embedding vector generation unit comprises:
a segmentation subunit, used for segmenting the original input sequence into characters, words or a plurality of grammar units;
a sequence combination subunit, used for forming the input sequence in temporal order using a single granularity or a combination of multiple granularities;
an embedding extraction subunit, used for extracting, for each unit of the input sequence, any one or more of semantic embedding, glyph embedding and pronunciation embedding;
and a vector output subunit, used for fusing the plurality of embedding types to generate the embedding vector of the input sequence.
30. The human-computer interaction based voice information processing device of claim 29, wherein the glyph embedding of each unit of the input sequence is extracted by using a deep convolutional neural network.
31. The human-computer interaction based voice information processing device of claim 29, wherein the pronunciation embedding of each unit of the input sequence is extracted by any one of a recurrent neural network, a long short-term memory network, or a recursive neural network.
32. The human-computer interaction based voice information processing device of claim 28, wherein the named entity feature generation model is trained by using a bidirectional long short-term memory network or a Transformer model.
33. The human-computer interaction based voice information processing device of claim 28, wherein the named entity discrimination model for generating the predicted named entity sequence is constructed based on a conditional random field model and is trained by using a maximum likelihood estimation algorithm.
34. The human-computer interaction based voice information processing device of claim 23, wherein the feature extraction module further comprises: an intention extraction submodule, used for recognizing the voice information by using a pre-trained intention recognition model to obtain a corresponding intention.
35. The human-computer interaction based voice information processing device according to claim 34, wherein the intention extraction submodule comprises:
an encoding generation unit, used for pre-training by using unsupervised voice information to obtain an encoder in a language model;
a specific model generation unit, used for obtaining an intention recognition model for a preset scene by using supervised voice information in the preset scene in combination with the encoder and an intention recognition classifier;
and an intention extraction unit, used for inputting the voice information into the intention recognition model to recognize the corresponding intention.
36. The human-computer interaction based voice information processing device of claim 35, wherein the language model comprises an encoder and a decoder; a vectorized input sequence is obtained by embedding an original input sequence of the voice information; the encoder encodes the input sequence; and the decoder decodes the encoded input sequence to obtain an output sequence.
37. The human-computer interaction based voice information processing device of claim 35 or 36, wherein the encoder comprises:
an encoding subunit, used for acquiring the encoding of each unit in the input sequence, wherein the granularity of the input sequence is characters, words, a plurality of grammar units, or a combination thereof;
and a first encoding fusion subunit, used for fusing the encodings of different granularities in temporal order to obtain the encoding of the input sequence when it is detected that the input sequence contains more than one granularity.
38. The human-computer interaction based voice information processing device according to claim 37, further comprising: a second encoding fusion subunit, used for fusing the encodings of the input sequence according to the context when the input sequence is detected to involve a context, so as to obtain an encoding of the input sequence that incorporates the context.
39. The human-computer interaction based voice information processing device of claim 35 or 36, wherein the encoder is formed by integrating one or more of a recurrent neural network, an attention mechanism, a long short-term memory network, or a recursive neural network.
40. The human-computer interaction based voice information processing device of claim 35, wherein the language model employs a cross-entropy training criterion.
41. The human-computer interaction based voice information processing device of claim 35, wherein the intention recognition classifier is a fully connected layer.
42. The human-computer interaction based voice information processing device of claim 23, further comprising, before the feature fusion module: a domain detection module, used for detecting whether a domain is acquired in the associated features; when the domain is not acquired, guiding the user to perform voice interaction again; when the domain is acquired, judging whether the domain is entered for the first time, and if so, setting domain information; and if the domain is not entered for the first time, updating the domain information and the global information according to the extracted domain features.
43. The human-computer interaction based voice information processing device of claim 23 or 42, wherein a domain feature in the associated features is obtained according to the extracted named entity and intention.
44. The human-computer interaction based voice information processing device according to claim 23, wherein the feature fusion module is used for:
filling the features extracted in one or more of the named entity, domain and intention dimensions into the corresponding named entity slot, domain slot and intention slot according to the slot position information, and merging features with similar slot values to realize feature fusion.
45. An apparatus, comprising:
one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method recited in any one of claims 1-22.
46. One or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the method recited in any one of claims 1-22.
CN201911118806.XA 2019-11-15 2019-11-15 Voice information processing method, device, equipment and medium based on man-machine interaction Pending CN110827831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911118806.XA CN110827831A (en) 2019-11-15 2019-11-15 Voice information processing method, device, equipment and medium based on man-machine interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911118806.XA CN110827831A (en) 2019-11-15 2019-11-15 Voice information processing method, device, equipment and medium based on man-machine interaction

Publications (1)

Publication Number Publication Date
CN110827831A true CN110827831A (en) 2020-02-21

Family

ID=69555589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911118806.XA Pending CN110827831A (en) 2019-11-15 2019-11-15 Voice information processing method, device, equipment and medium based on man-machine interaction

Country Status (1)

Country Link
CN (1) CN110827831A (en)

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160225370A1 (en) * 2015-01-30 2016-08-04 Microsoft Technology Licensing, Llc Updating language understanding classifier models for a digital personal assistant based on crowd-sourcing
CN107369443A (en) * 2017-06-29 2017-11-21 北京百度网讯科技有限公司 Dialogue management method and device based on artificial intelligence
CN110069606A (en) * 2017-10-26 2019-07-30 北京京东尚科信息技术有限公司 Man-machine conversation's method, apparatus, electronic equipment and storage medium
US20190180743A1 (en) * 2017-12-13 2019-06-13 Kabushiki Kaisha Toshiba Dialog system
CN108681537A (en) * 2018-05-08 2018-10-19 中国人民解放军国防科技大学 Chinese entity linking method based on neural network and word vector
CN108829678A (en) * 2018-06-20 2018-11-16 广东外语外贸大学 Name entity recognition method in a kind of Chinese international education field
CN109241524A (en) * 2018-08-13 2019-01-18 腾讯科技(深圳)有限公司 Semantic analysis method and device, computer readable storage medium, electronic equipment
CN109165384A (en) * 2018-08-23 2019-01-08 成都四方伟业软件股份有限公司 A kind of name entity recognition method and device
CN109461039A (en) * 2018-08-28 2019-03-12 厦门快商通信息技术有限公司 A kind of text handling method and intelligent customer service method
CN109145303A (en) * 2018-09-06 2019-01-04 腾讯科技(深圳)有限公司 Name entity recognition method, device, medium and equipment
CN109446307A (en) * 2018-10-16 2019-03-08 浪潮软件股份有限公司 A kind of method for realizing dialogue management in Intelligent dialogue
CN109408622A (en) * 2018-10-31 2019-03-01 腾讯科技(深圳)有限公司 Sentence processing method and its device, equipment and storage medium
CN109063221A (en) * 2018-11-02 2018-12-21 北京百度网讯科技有限公司 Query intention recognition methods and device based on mixed strategy
CN109616108A (en) * 2018-11-29 2019-04-12 北京羽扇智信息科技有限公司 More wheel dialogue interaction processing methods, device, electronic equipment and storage medium
CN109635297A (en) * 2018-12-11 2019-04-16 湖南星汉数智科技有限公司 A kind of entity disambiguation method, device, computer installation and computer storage medium
CN109871535A (en) * 2019-01-16 2019-06-11 四川大学 A kind of French name entity recognition method based on deep neural network
CN110046227A (en) * 2019-04-17 2019-07-23 腾讯科技(深圳)有限公司 Configuration method, exchange method, device, equipment and the storage medium of conversational system
CN110059163A (en) * 2019-04-29 2019-07-26 百度在线网络技术(北京)有限公司 Generate method and apparatus, the electronic equipment, computer-readable medium of template
CN110288985A (en) * 2019-06-28 2019-09-27 北京猎户星空科技有限公司 Voice data processing method, device, electronic equipment and storage medium
CN110334357A (en) * 2019-07-18 2019-10-15 北京香侬慧语科技有限责任公司 A kind of method, apparatus, storage medium and electronic equipment for naming Entity recognition

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462752A (en) * 2020-04-01 2020-07-28 北京思特奇信息技术股份有限公司 Client intention identification method based on attention mechanism, feature embedding and BI-L STM
CN111462752B (en) * 2020-04-01 2023-10-13 北京思特奇信息技术股份有限公司 Attention mechanism, feature embedding and BI-LSTM (business-to-business) based customer intention recognition method
CN111562611A (en) * 2020-04-08 2020-08-21 山东大学 Semi-supervised depth learning seismic data inversion method based on wave equation drive
CN111562611B (en) * 2020-04-08 2021-07-06 山东大学 Semi-supervised depth learning seismic data inversion method based on wave equation drive
CN112417886A (en) * 2020-11-20 2021-02-26 平安普惠企业管理有限公司 Intention entity information extraction method and device, computer equipment and storage medium
CN112562658A (en) * 2020-12-04 2021-03-26 广州橙行智动汽车科技有限公司 Groove filling method and device
WO2022179206A1 (en) * 2021-02-27 2022-09-01 华为技术有限公司 Semantic understanding method and apparatus
CN113515616A (en) * 2021-07-12 2021-10-19 中国电子科技集团公司第二十八研究所 Task driving system based on natural language
CN115146066A (en) * 2022-09-05 2022-10-04 深圳市华付信息技术有限公司 Man-machine interaction method, device, equipment and storage medium
CN115424624A (en) * 2022-11-04 2022-12-02 深圳市人马互动科技有限公司 Man-machine interaction service processing method and device and related equipment
CN115424624B (en) * 2022-11-04 2023-01-24 深圳市人马互动科技有限公司 Man-machine interaction service processing method and device and related equipment
CN116414955A (en) * 2022-12-26 2023-07-11 深度好奇(北京)科技有限公司 Intelligent queuing method, device, equipment and medium based on client intention and intention
CN116414955B (en) * 2022-12-26 2023-11-07 杭州数令集科技有限公司 Intelligent queuing method, device, equipment and medium based on client intention and intention
CN116884648A (en) * 2023-05-23 2023-10-13 深圳汇医必达医疗科技有限公司 Voice interaction optimization method, device, equipment and medium based on traditional Chinese medicine inquiry

Similar Documents

Publication Publication Date Title
CN110827831A (en) Voice information processing method, device, equipment and medium based on man-machine interaction
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN112100349B (en) Multi-round dialogue method and device, electronic equipment and storage medium
US10515627B2 (en) Method and apparatus of building acoustic feature extracting model, and acoustic feature extracting method and apparatus
KR102462426B1 (en) Electronic device and method for analyzing meaning of speech
CN110909543A (en) Intention recognition method, device, equipment and medium
CN110209784B (en) Message interaction method, computer device and storage medium
CN111241237A (en) Intelligent question and answer data processing method and device based on operation and maintenance service
CN112699686B (en) Semantic understanding method, device, equipment and medium based on task type dialogue system
WO2023045605A1 (en) Data processing method and apparatus, computer device, and storage medium
CN111312233A (en) Voice data identification method, device and system
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN111797216B (en) Search term rewriting method, apparatus, device and storage medium
CN113705315A (en) Video processing method, device, equipment and storage medium
CN116861995A (en) Training of multi-mode pre-training model and multi-mode data processing method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN109902155B (en) Multi-modal dialog state processing method, device, medium and computing equipment
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
CN112001167B (en) Punctuation mark adding method, system, equipment and medium
CN116861363A (en) Multi-mode feature processing method and device, storage medium and electronic equipment
CN117056474A (en) Session response method and device, electronic equipment and storage medium
CN111222334A (en) Named entity identification method, device, equipment and medium
CN114817501A (en) Data processing method, data processing device, electronic equipment and storage medium
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN112598840A (en) Passing equipment control method and device based on face recognition and voice interaction, machine readable medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 511458 room 1011, No.26, Jinlong Road, Nansha District, Guangzhou City, Guangdong Province (only for office use)

Applicant after: Guangzhou yunconghonghuang Intelligent Technology Co., Ltd

Address before: 511458 room 1011, No.26, Jinlong Road, Nansha District, Guangzhou City, Guangdong Province (only for office use)

Applicant before: GUANGZHOU HONGHUANG INTELLIGENT TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200221