CN114238605A - Automatic conversation method and device for intelligent voice customer service robot - Google Patents


Info

Publication number
CN114238605A
CN114238605A
Authority
CN
China
Prior art keywords
voice
language
neural network
unit
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111554796.1A
Other languages
Chinese (zh)
Other versions
CN114238605B (en)
Inventor
王志光
杨羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Doumi Youpin Technology Development Co ltd
Original Assignee
Beijing Doumi Youpin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Doumi Youpin Technology Development Co ltd filed Critical Beijing Doumi Youpin Technology Development Co ltd
Priority to CN202111554796.1A
Publication of CN114238605A
Application granted
Publication of CN114238605B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/105Human resources
    • G06Q10/1053Employment or hiring

Abstract

The application provides an automatic dialogue method and device for an intelligent voice customer service robot, belonging to the technical field of data processing. The method comprises: converting acquired user voice data into a character sequence; determining, based on a trained intention neural network model, expected values corresponding to the character sequence and containing intention keywords and entity keywords; matching the next dialogue node branch of the current dialogue node in a given tree flow according to the expected values, and filling the keywords into the slots of the instruction information corresponding to that node branch; determining the reply text according to the user instruction information; and converting the text into speech and outputting it to the user. The method and device can improve the intelligence of automatic voice conversation.

Description

Automatic conversation method and device for intelligent voice customer service robot
Technical Field
The application belongs to the technical field of data processing, and particularly relates to an automatic conversation method and device for an intelligent voice customer service robot.
Background
In the traditional human-resource recruitment industry, manual outbound calls soliciting interviews are the main form of service delivery. However, faced with massive resume leads, repetitive work content, and high delivery pressure, manual outbound calling has many problems: high labor cost, long training time, lack of unified standards, and low working efficiency. An AI recruiting advisor that replaces manual work in confirming candidates' interview intent can greatly reduce labor cost.
However, intelligent voice customer service robots suffer from insufficient language understanding. The main reason is that current grammatical analysis is limited to isolated sentences; the constraints and influence of context and the conversation environment on a sentence lack systematic study. A person understands a sentence not through grammar alone but by applying a great deal of relevant knowledge, including both everyday knowledge and domain expertise. The telephone service industry needs more human speaking styles that account for emotion, personality, psychology, and so on, but current natural language understanding technology cannot meet these requirements.
In the prior art, the most widespread Chinese natural language understanding systems use traditional natural language processing (NLP) techniques: templates are preset for all problems to be handled, with variable keywords left as slots; keywords are extracted from the user's actual utterance and matched against the template set. The matching results are scored and the most likely target template is selected. The template's intent corresponds to the intent of the actual utterance, and the words in the slots are the keywords, so understanding of the whole sentence is determined by the intent and the keywords. However, because of the subtlety of Chinese, even the same sentence can carry multiple meanings in different usage scenarios or with different tones of speech, so this kind of understanding is often inaccurate.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present application provides an automatic dialogue method and apparatus for an intelligent voice customer service robot that improve semantic understanding.
A first aspect of the application provides an automatic dialogue method for an intelligent voice customer service robot, comprising the following steps: converting acquired user voice data into a character sequence; determining, based on a trained intention neural network model, expected values corresponding to the character sequence and containing intention keywords and entity keywords; matching the next dialogue node branch of the current dialogue node in a given tree flow according to the expected values, and filling the keywords into the slots of the instruction information corresponding to that node branch; determining the reply text according to the user instruction information; and converting the text into speech and outputting it to the user.
Preferably, converting the acquired user voice data into a character sequence comprises: converting the user voice data into a discrete digital signal; boosting high-frequency components by pre-emphasis; extracting MFCC feature information of the voice; and converting the MFCC feature information into a character sequence through an acoustic model and a language model.
Preferably, after converting the acquired user voice data into a character sequence, the method further comprises: determining the number of occurrences of each word in the character sequence; and filling a calculation input vector of the same dimension as the training input vector used when training the intention neural network model, according to each word and its occurrence count, wherein each dimension of the calculation input vector corresponds to a specific word in the labeling dictionary and its value is the number of times that word occurs in the character sequence.
Preferably, the labeling dictionary is obtained by the following steps: performing BIO labeling, in high-frequency mode, on a number of sentences actually input by users in the specified field, where high-frequency mode means that during BIO labeling the longest word hit in a given dictionary is labeled; labeling POI information in other, low-frequency modes according to the accurate dictionary extracted in high-frequency mode; and training a CRF model, identifying POI information, and adding it to the labeling dictionary.
Preferably, the intention neural network model is trained by: obtaining sample data, each sample comprising an input corpus and an expected output value; constructing an input vector from the words contained in the input corpus and their occurrence counts; constructing a neural network model comprising 4 layers; and determining the weight and bias of each dimension of the input vector by constructing a quadratic cost function and iteratively optimizing it, the quadratic cost function being:

C = (1/(2n)) Σ_x ‖y(x) − a‖²

where n is the number of training samples, y(x) is the expected output value, a is the actual output, x = w1·x1 + w2·x2 + … + wn·xn + b is the weighted input, x1 to xn are the dimension values of the input vector, w1 to wn are the weights corresponding to those dimensions, and b is the bias.
A second aspect of the present application provides an automatic dialogue device for an intelligent voice customer service robot, mainly comprising: a voice recognition module for converting acquired user voice data into a character sequence; a language understanding module for determining, based on a trained intention neural network model, expected values corresponding to the character sequence and containing intention keywords and entity keywords; a dialogue management module for matching the next dialogue node branch of the current dialogue node in a given tree flow according to the expected values and filling the keywords into the slots of the instruction information corresponding to that node branch; a language generation module for determining the reply text according to the user instruction information; and a voice synthesis module for converting the text into speech and outputting it to the user.
Preferably, the voice recognition module comprises: a data conversion unit for converting the user voice data into a discrete digital signal; a high-frequency processing unit for boosting high-frequency components by pre-emphasis; a voice extraction unit for extracting MFCC feature information of the voice; and a character sequence generating unit for converting the MFCC feature information into a character sequence through an acoustic model and a language model.
Preferably, the automatic dialogue device further comprises a preprocessing module, which comprises: a word counting unit for determining the number of occurrences of each word in the character sequence; and an input vector filling unit for filling a calculation input vector of the same dimension as the training input vector used when training the intention neural network model, according to each word and its occurrence count, wherein each dimension of the calculation input vector corresponds to a specific word in the labeling dictionary and its value is the number of times that word occurs in the character sequence.
Preferably, the labeling dictionary is generated by the following modules: a high-frequency mode labeling unit for performing BIO labeling, in high-frequency mode, on a number of sentences actually input by users in the specified field, where high-frequency mode means that during BIO labeling the longest word hit in a given dictionary is labeled; a low-frequency mode labeling unit for labeling POI information in other, low-frequency modes according to the accurate dictionary extracted in high-frequency mode; and a model training unit for training a CRF model, identifying POI information, and adding it to the labeling dictionary.
Preferably, the intention neural network model is trained by the following units: a sample acquisition unit for obtaining sample data, each sample comprising an input corpus and an expected output value; an input vector construction unit for constructing an input vector from the words contained in the input corpus and their occurrence counts; a neural network construction unit for constructing a neural network model comprising 4 layers; and a training unit for determining the weight and bias of each dimension of the input vector by constructing a quadratic cost function and iteratively optimizing it, the quadratic cost function being:

C = (1/(2n)) Σ_x ‖y(x) − a‖²

where n is the number of training samples, y(x) is the expected output value, a is the actual output, x = w1·x1 + w2·x2 + … + wn·xn + b is the weighted input, x1 to xn are the dimension values of the input vector, w1 to wn are the corresponding weights, and b is the bias.
Drawings
Fig. 1 is a flow chart of a preferred embodiment of the automatic dialogue method for an intelligent voice customer service robot of the present application.
Fig. 2 is a flow chart of speech recognition according to the embodiment of Fig. 1 of the present application.
Fig. 3 is a schematic diagram of a neural network neuron structure according to the embodiment shown in Fig. 1.
Fig. 4 is a system architecture diagram of a preferred embodiment of the automatic dialogue device for an intelligent voice customer service robot of the present application.
Fig. 5 is a flowchart of language generation in the present application.
Detailed Description
In order to make the implementation objects, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the accompanying drawings in the embodiments of the present application. In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are some, but not all embodiments of the present application. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application, and should not be construed as limiting the present application. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application. Embodiments of the present application will be described in detail below with reference to the drawings.
The application first provides an automatic dialogue method for an intelligent voice customer service robot, as shown in Fig. 1, which mainly comprises the following steps:
step S1, converting the acquired user voice data into a character sequence.
Fig. 2 shows an embodiment of step S1. The input voice data are first pre-filtered to remove interference, then converted into a discrete digital signal by an analog-to-digital converter and pulse-code modulated (PCM). After the digital voice signal is obtained, its high-frequency components are boosted by pre-emphasis; windowing and framing then yield voice frames, from which the effective voice segments are selected, further suppressing noise and improving recognition. Voice features are then extracted as MFCC feature information. Finally, the voice feature information is converted into a character sequence through an acoustic model and a language model.
In this embodiment, MFCCs are cepstral parameters extracted on the Mel-scale frequency domain; the Mel scale describes the nonlinear frequency response of the human ear. In the above steps, boosting high frequencies by pre-emphasis after obtaining the digital voice signal is the initial step of MFCC parameter extraction: pre-emphasis passes the voice signal through a high-pass filter, after which framing and windowing follow. Framing assembles a number of sampling points into one observation unit; the sampling rate used for voice recognition is usually 8 kHz or 16 kHz, and at 8 kHz a frame length of 256 sampling points corresponds to 256/8000 × 1000 = 32 ms. Windowing multiplies the framed data by a Hamming window to increase the continuity at both ends of the frame. Noise is then further suppressed by a band-pass filter, for example a triangular band-pass filter; the log energy is computed after filtering, and the MFCC coefficients are obtained by a discrete cosine transform.
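The pre-emphasis and framing/windowing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the application's implementation; the pre-emphasis coefficient 0.97 and the 256-sample frame with 128-sample hop are common defaults assumed here for the example.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1] boosts high-frequency components
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len=256, hop=128):
    # split into overlapping frames and multiply each by a Hamming window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, s in enumerate(frame)]
        frames.append(windowed)
    return frames

# at 8 kHz, a 256-sample frame spans 256/8000 * 1000 = 32 ms
samples = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1024)]
frames = frame_and_window(pre_emphasis(samples))
print(len(frames), len(frames[0]))  # 7 frames of 256 windowed samples
```

The remaining MFCC steps (triangular filter bank, log energy, discrete cosine transform) follow the same pattern but are omitted for brevity.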
And step S2, determining expected values corresponding to the character sequences and containing the intention keywords and the entity keywords based on the trained intention neural network model.
Before step S2 can be performed, the intention neural network model must first be built, and before that the labeling dictionary must be built. The specific steps are as follows.
Step S21, data preparation: to ensure accurate results, the training data are the dialogues between users and recruitment advisors in the recruitment service, and the sentences actually input by users during use are employed for model construction and use.
Step S22, data annotation: before model training there are no labeled data applicable to the existing service, so the data from step S21 must be annotated. This application mainly uses the BIO annotation scheme. Since the user data carry no annotation information at all, each user query must be annotated into the required format using longest string matching, i.e. the longest clause found in the existing dictionary is selected. For example, for the query "I want to go to Beijing Tiananmen", the existing dictionary matches "Beijing", "Tiananmen" and "Beijing Tiananmen" simultaneously, and the longest word, "Beijing Tiananmen", must be chosen. The data are finally annotated in the following form (one BIO tag per Chinese character):
navigate(O) to(O) Tian(B_POI) an(I_POI) men(I_POI)
I(O) want(O) go(O) Bei(B_POI) jing(I_POI) Tian(I_POI) an(I_POI) men(I_POI)
The labeling dictionary is formed in this way. After a batch of data is annotated, POI information in other, low-frequency patterns can be labeled using the accurate dictionary extracted from the high-frequency pattern; a CRF model is then trained on a single machine with the CRF++ toolkit and finally mapped to the corresponding labels. The preceding process is repeated (annotating data, training the model, recognizing POIs, and adding them to the dictionary) to form a dictionary of the recruitment field; on the basis of this labeling dictionary, all data are annotated for the subsequent neural network model training.
It is further noted that in the BIO notation of this application, "B" denotes the beginning of a word, "I" the middle or end of a word, and "O" a non-word portion; for example, the character "Tian" in "Tiananmen" is the beginning of the word and is therefore tagged "B". The suffix POI abbreviates point of interest, i.e. the location entity type.
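The longest-match BIO labeling described above can be sketched as a greedy scan: at each position, prefer the longest dictionary entry. A minimal illustration (the dictionary and query are the example from the text; the function name is ours):

```python
def bio_tag(sentence_chars, dictionary):
    """Greedy longest-match BIO tagging over a character sequence."""
    tags = ["O"] * len(sentence_chars)
    i = 0
    while i < len(sentence_chars):
        best = 0
        # scan candidate spans from longest to shortest; first hit wins,
        # so "Beijing Tiananmen" beats "Beijing" at the same position
        for j in range(len(sentence_chars), i, -1):
            if "".join(sentence_chars[i:j]) in dictionary:
                best = j - i
                break
        if best:
            tags[i] = "B_POI"
            for k in range(i + 1, i + best):
                tags[k] = "I_POI"
            i += best
        else:
            i += 1
    return tags

chars = list("我想去北京天安门")  # "I want to go to Beijing Tiananmen"
dic = {"北京", "天安门", "北京天安门"}
print(bio_tag(chars, dic))
# ['O', 'O', 'O', 'B_POI', 'I_POI', 'I_POI', 'I_POI', 'I_POI']
```

A production system would feed such tags into CRF++ training rather than use the greedy tagger directly.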
After the dictionary is labeled, the intention neural network model can be constructed. To facilitate subsequent computation, a bag-of-words model is adopted for Chinese text processing. Take the following three corpora as examples:
"help me query air tickets to Beijing for tomorrow", "whether it will rain in Beijing tomorrow", "help me order a roast duck delivered to the hotel".
The bag-of-words dictionary after word segmentation:
['Beijing', 'order', 'tomorrow', 'whether', 'rain', 'air ticket', 'query', 'roast duck', 'deliver to', 'hotel']
Dictionary index labels:
{'Beijing': 0, 'order': 1, 'tomorrow': 2, 'whether': 3, 'rain': 4, 'air ticket': 5, 'query': 6, 'roast duck': 7, 'deliver to': 8, 'hotel': 9}
This dictionary contains 10 different words in total. Using the dictionary's index numbers, each of the three corpora above can be represented by a 10-dimensional vector whose entries are the number of times each word occurs in the corpus:
[[1 0 1 0 0 1 1 0 0 0],[1 0 1 1 1 0 0 0 0 0],[0 1 0 0 0 0 0 1 1 1]]
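The bag-of-words vectorization above can be reproduced directly. The sketch below uses the English glosses of the segmented tokens and the index map from the dictionary labels; the tokenizations are assumed to match the word segmentation shown in the text.

```python
# index map from the dictionary labels (English glosses of the Chinese tokens)
vocab = {'Beijing': 0, 'order': 1, 'tomorrow': 2, 'whether': 3, 'rain': 4,
         'air ticket': 5, 'query': 6, 'roast duck': 7, 'deliver to': 8, 'hotel': 9}

def vectorize(tokens, vocab):
    vec = [0] * len(vocab)
    for w in tokens:
        vec[vocab[w]] += 1  # each dimension counts one word's occurrences
    return vec

docs = [
    ['query', 'tomorrow', 'Beijing', 'air ticket'],  # corpus 1
    ['Beijing', 'tomorrow', 'whether', 'rain'],      # corpus 2
    ['order', 'roast duck', 'deliver to', 'hotel'],  # corpus 3
]
print([vectorize(d, vocab) for d in docs])
# [[1,0,1,0,0,1,1,0,0,0], [1,0,1,1,1,0,0,0,0,0], [0,1,0,0,0,0,0,1,1,1]]
```

These 10-dimensional vectors are exactly the feature vectors passed to the classifier in the next step.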
After this processing, the provided training texts have been converted into feature vector representations; these feature vectors, together with the category label of each corpus, are fed into the EmbeddingIntentClassifier framework to train the model.
Based on these feature data, the application constructs a 4-layer neural network. The first layer is the input layer and receives the processed feature vector data; in the example above each corpus yields 10 vector values, so the input layer has 10 neurons. The last layer is the output layer, i.e. the result layer; assuming three intent classes in this application, there are 3 output neurons.
The main idea of a neural network can be viewed as automatically learning an approximating function from a large number of training samples and then predicting on unknown data. The neural network uses the sample data to automatically infer the characteristic rules of each class and then applies them to new data to achieve classification.
For the neural network structure described above, the structure of each neuron is as shown in Fig. 3. In Fig. 3, x1, x2, ..., x10 are the feature vector components after the above processing (10-dimensional in the example; in practice the dimension is usually larger), x = w1·x1 + w2·x2 + ... + w10·x10 + b is the weighted input, and the output is the activation function applied to the weighted input x, i.e. y = σ(x).
This application constructs a 4-layer neural network. With training data and the associated labels, the network can be fitted to our classification task by designing a suitable optimization function. For example, the corpus "help me query air tickets to Beijing for tomorrow" is represented as the 1×10 vector [1 0 1 0 0 1 1 0 0 0], and an approximating function y = y(x) is designed to produce the corresponding expected output, which by the example above is a 3-dimensional vector; for this corpus the desired output should be y(x) = (1, 0, 0)^T. Finding this approximating function (in practice, finding the relevant weights and biases) is the model construction process of this application, i.e. the optimization problem described above. For the network designed above, the application selects a quadratic cost function (also called the mean-square-error cost function) for optimization.
The quadratic cost function is:

C = (1/(2n)) Σ_x ‖y(x) − a‖²

where n is the number of training samples, y(x) is the expected output value, a is the actual output, x = w1·x1 + w2·x2 + … + wn·xn + b is the weighted input, x1 to xn are the dimension values of the input vector, w1 to wn are the corresponding weights, and b is the bias.
The model construction process is thus the process of optimizing this cost function on the training corpus, finally obtaining the weights and bias of the function y(x); when a new user data request arrives, the classification result is obtained by direct calculation.
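The weighted input, sigmoid activation, and quadratic cost above can be sketched as follows. This is an illustrative computation of the cost for two samples with one-hot expected outputs over the three intent classes; the example weights and outputs are invented for the sketch, and sigmoid is assumed as the activation σ.

```python
import math

def neuron_output(xs, ws, b):
    # weighted input x = w1*x1 + ... + wn*xn + b, sigmoid activation y = sigma(x)
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-z))

def quadratic_cost(expected, actual):
    # C = 1/(2n) * sum over the n samples of ||y(x) - a||^2
    n = len(expected)
    total = 0.0
    for y, a in zip(expected, actual):
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)

# one-hot expected outputs for 3 intent classes vs. hypothetical actual outputs
expected = [(1, 0, 0), (0, 1, 0)]
actual = [(0.8, 0.1, 0.1), (0.2, 0.7, 0.1)]
print(round(quadratic_cost(expected, actual), 4))  # 0.05
```

Training would adjust the weights and biases, e.g. by gradient descent, to drive this cost toward zero.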
The module takes a segment of human-language text as input and outputs structured data (JSON/XML etc.), mainly the extraction results for the intent and entities. For example, given the input text "I want to find waiter work in Beijing", the module must identify the user's intent as "find job" and extract the key information "place: Beijing" and "job: waiter", where place and job are the slots.
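A hypothetical sketch of what that structured output might look like as JSON; the field names, the intent label "find_job", and the confidence value are illustrative and not specified by the application.

```python
import json

# illustrative NLU result for "I want to find waiter work in Beijing"
nlu_result = {
    "intent": {"name": "find_job", "confidence": 0.92},
    "entities": [
        {"slot": "place", "value": "Beijing"},
        {"slot": "job", "value": "waiter"},
    ],
}
print(json.dumps(nlu_result, indent=2))
```

Downstream, the dialogue management module consumes exactly this intent/entity structure.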
Step S3, matching the next dialogue node branch of the current dialogue node in the given tree flow according to the expected values, and filling the keywords into the slots of the instruction information corresponding to that node branch.
In step S3, the user's reaction at the current moment is determined from the dialogue history; the multi-turn dialogue of this application is task-driven. Specifically, the user has concrete goals, such as asking about the work place, wages and so on, and the user's needs may be complex and heavily constrained, so they may have to be stated over multiple turns. On the one hand, the user can continually modify or refine their requirements during the conversation; on the other hand, when the stated need is not specific or clear enough, the machine can help the user reach a satisfactory result through queries, clarification, or confirmation. Step S3 further includes maintaining the dialogue state, generating system decisions, interfacing with the backend/task model, and expressing semantic expectations. This step is effectively a decision step: its state determines the optimal next action (e.g. provide results, request specific constraints, clarify or confirm a request) so as to most effectively assist the user in obtaining information or services. The inputs are the user input (the user behavior, i.e. the NLU output) and the semantic expression of the current dialogue state; the outputs are the next system behavior and the updated dialogue state. This loop continues until the task is completed.
The semantic input is the driving force of the loop, and the constraints of this step (the information each node still needs, i.e. the price to pay) are the resistance: the more semantic information the input carries, the stronger the driving force; the more information needed to complete the task, the stronger the resistance. This step must handle scenarios directly related to the service. Outbound calls in the customer service scenario include confirmation (return visits, notifications, information confirmation) and marketing (promotion, introduction); inbound calls include task dialogues (checking call charges, reservations, FAQ, etc.). The outbound scenario adopts a tree-flow mode: under each normal dialogue node, the intelligent customer service backend can be configured with an affirmative or negative model; the customer's intent is recognized and the flow switches to the corresponding next node. If the dialogue content hits no branch of the current node, it flows to the default branch, ensuring a smooth and complete dialogue process. Key "variables" are extracted during the dialogue, for example the desired job site, salary, and so on.
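The tree-flow routing and slot filling described above can be sketched with a small node structure. This is a minimal illustration under our own naming (Node, advance, the example intents and prompts are invented), not the application's data model.

```python
class Node:
    def __init__(self, name, prompt, branches=None, default=None):
        self.name = name
        self.prompt = prompt
        self.branches = branches or {}  # recognized intent -> next node
        self.default = default          # fallback when no branch is hit

def advance(node, intent, entities, slots):
    # fill extracted "variables" (desired job site, salary, ...) into slots
    slots.update(entities)
    # route on the recognized intent; unmatched input flows to the default
    # branch (here: stay on the current node) so the dialogue stays smooth
    return node.branches.get(intent, node.default or node)

confirm = Node("confirm", "Shall I book the interview?")
ask_city = Node("ask_city", "Which city do you prefer?")
root = Node("greet", "Are you looking for a job?",
            branches={"affirm": ask_city, "deny": confirm})

slots = {}
nxt = advance(root, "affirm", {"place": "Beijing"}, slots)
print(nxt.name, slots)  # ask_city {'place': 'Beijing'}
```

Each turn repeats this step with the latest NLU output until every required slot is filled and the flow reaches a terminal node.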
Step S4, determining the reply text according to the user instruction information.
Step S5, converting the text into speech and outputting it to the user.
A second aspect of the present application provides an automatic dialogue device for an intelligent voice customer service robot, as shown in Fig. 4, which mainly comprises a speech recognition (ASR) module, a language understanding (NLU) module, a dialogue management (DM) module, a language generation (NLG) module, and a speech synthesis (TTS) module. The speech recognition module receives the speech input by the user and recognizes it as text; the language understanding module processes the text to produce a complete sentence and recognize its intent; the dialogue management module controls the multi-turn dialogue according to the context; the language generation module analyzes domain-specific text data in the system's database to generate a corresponding answer and assembles it into a complete sentence, which is passed to the speech synthesis module and output as speech, realizing the dialogue with the user.
Specifically, the voice recognition module converts the acquired user voice data into a character sequence; the language understanding module determines, based on the trained intention neural network model, expected values corresponding to the character sequence and containing intention keywords and entity keywords; the dialogue management module matches the next dialogue node branch of the current dialogue node in the given tree flow according to the expected values and fills the keywords into the slots of the instruction information corresponding to that node branch; the language generation module determines the reply text according to the user instruction information; and the voice synthesis module converts the text into speech and outputs it to the user.
The language understanding (NLU) module of the application needs to handle scenes directly related to the service. Outbound calls in the customer service scene include information confirmation (return visits, notifications, information confirmation) and marketing (promotion, introduction, and the like); inbound calls include task dialogues such as telephone charge inquiry, reservation, and FAQ. Natural language processing (NLP) is a discipline that studies linguistic problems in human-to-human and human-to-computer interactions. The goal of NLP is to pass the Turing test; it covers phonetics, morphology, grammar, semantics, and pragmatics, and also addresses causal, logical, and reasoning problems in human language. NLU is a subset of NLP. One main function of NLU is intent extraction; in NLU, an intent can be expressed by slots, which are the parameter information of the intent. A slot is a specific concept extracted from a sentence, and slot filling is the process of completing this information so as to convert the user intention into an explicit instruction.
As shown in fig. 5, the language generation (NLG) module of the present application aims to bridge the communication gap between human and machine by converting data in non-language formats into language formats that humans understand, such as articles and reports. Natural language generation takes two forms: text-to-language generation and data-to-language generation. Language is generated much as a human would generate it: the module understands intent, adds intelligence, considers context, and presents the result as an insightful narrative that the user can easily read and understand. The module proceeds in steps. The first step, content determination: the system decides which information should be included in the text under construction and which should not; typically the data contains more information than the final message. The second step, text structuring: the system organizes the text in a reasonable order. The third step, sentence aggregation: not every piece of information needs a separate sentence; combining several pieces of information into one sentence can be more fluent and easier to read. The fourth step, grammatical transformation: once the content of each sentence is determined, the information is organized into natural language; this step adds conjunctions between the pieces of information so that the result reads more like a complete sentence. The fifth step, referring expression generation (REG): like the previous step, this selects words and phrases to compose a complete sentence, but it differs essentially in that "REG needs to identify the domain of the content and then use the vocabulary of that domain (but not other domains)".
And the sixth step, linguistic realization: finally, once all relevant words and phrases are determined, they are combined into a well-formed complete sentence.
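The six steps above can be walked through end to end in a toy example; the facts, field names, and phrase templates below are invented for illustration and are not from the patent.

```python
def generate_reply(facts):
    """Toy walk through the six NLG steps for a balance-inquiry reply."""
    # 1. Content determination: keep only the fields relevant to the reply.
    chosen = {k: facts[k] for k in ("name", "balance") if k in facts}
    # 2. Text structuring: fix a reasonable order for the messages.
    ordered = [("name", chosen["name"]), ("balance", chosen["balance"])]
    # 3. Sentence aggregation: plan both messages as a single sentence.
    # 4. Grammatical transformation: templates supply the connecting words.
    # 5. Referring expression generation: use this domain's vocabulary
    #    ("account balance"), not another domain's.
    templates = {"name": "Hello {},", "balance": "your account balance is {} yuan."}
    # 6. Linguistic realization: combine the phrases into one complete sentence.
    return " ".join(templates[key].format(value) for key, value in ordered)
```

Note how the irrelevant field is dropped in step 1 even though it is present in the input data.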
The speech synthesis (TTS) module of the present application comprises: a segmentation unit for segmenting the phoneme string corresponding to the target voice into a plurality of segments to generate a first segment sequence; a selection unit for generating a plurality of first speech unit strings corresponding to the first segment sequence by combining a plurality of speech units based on the first segment sequence, and selecting one speech unit string from the plurality of first speech unit strings; and a connection unit for connecting the speech units included in the selected speech unit string to generate synthesized speech. The selection unit includes a retrieval unit that repeatedly performs a first process and a second process: the first process generates, from at most W (W being a predetermined value) second speech unit strings corresponding to a second segment sequence, third speech unit strings corresponding to a plurality of third segment sequences, where the second segment sequence is a partial sequence of the first segment sequence and a third segment sequence is a partial sequence obtained by adding segments to the second segment sequence; the second process selects from the plurality of third speech unit strings.
The selection unit further comprises: a first calculation unit for calculating the total cost of each of the plurality of third speech unit strings; a second calculation unit for calculating, based on a limit obtained from the stored speech unit data, a penalty coefficient corresponding to the total cost of each third speech unit string, the penalty coefficient depending on the degree of proximity to the limit; and a third calculation unit for calculating an estimated value for each third speech unit string by correcting its total cost with the penalty coefficient, wherein the retrieval unit selects, based on the estimated values, at most W of the third speech unit strings. The module mainly simulates the process by which a human understands natural language: through text regularization, word segmentation, grammatical analysis, and semantic analysis, the computer fully understands the input text and produces the pronunciation cues required by the later stages; through prosody processing, segment features of the synthesized speech such as pitch, duration, and intensity are planned, so that the synthesized speech correctly expresses the semantics and sounds natural; finally, the speech, i.e., the synthesized speech, is output through acoustic processing.
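The retrieval unit's keep-at-most-W search over partial speech unit strings is essentially a beam search. A minimal sketch, with a made-up unit inventory and plain additive costs standing in for the penalty-corrected estimated values:

```python
def select_unit_string(segments, inventory, W=3):
    """Beam search: extend partial unit strings segment by segment,
    keeping at most W candidates ranked by accumulated cost."""
    beam = [([], 0.0)]  # (partial unit string, accumulated cost)
    for seg in segments:
        # first process: extend every kept candidate with every unit for seg
        extended = [(units + [unit], cost + unit_cost)
                    for units, cost in beam
                    for unit, unit_cost in inventory[seg]]
        # second process: keep only the W best-scoring partial strings
        beam = sorted(extended, key=lambda cand: cand[1])[:W]
    return beam[0][0]  # lowest-cost complete unit string
```

With W large enough the search is exhaustive; small W trades optimality for speed, which is the point of the pruning described above.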
In some optional embodiments, the speech recognition module comprises:
a data conversion unit for converting the user voice data into discrete digital signals;
a high-frequency processing unit for improving high-frequency characteristics by pre-emphasis;
a voice extraction unit for extracting MFCC feature information of a voice;
and the character sequence generating unit is used for converting the MFCC characteristic information into a character sequence through an acoustic model and a language model.
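The pre-emphasis performed by the high-frequency processing unit, and the framing that precedes MFCC extraction, can be sketched with NumPy; the coefficient 0.97 and the frame/hop sizes are conventional choices (25 ms / 10 ms at 16 kHz), not values from the patent.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=400, hop=160):
    """Cut the signal into overlapping frames, the usual first step
    before computing MFCC features per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
```

A full MFCC front end would follow with a windowed FFT, mel filterbank, log, and DCT per frame.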
In some optional embodiments, the automatic conversation apparatus further comprises a preprocessing module, the preprocessing module comprising:
the word counting unit is used for determining the occurrence frequency of each word according to the character sequence;
and the input vector filling unit is used for filling, according to each word and its occurrence count, a calculation input vector with the same dimensions as the training input vector used when training the intention neural network model, wherein each dimension of the calculation input vector corresponds to a specific word in the labeling dictionary, and the value of each dimension is the occurrence count of that specific word in the character sequence.
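The counting and filling units above amount to building a bag-of-words vector over the labeling dictionary; a minimal sketch, with invented words:

```python
from collections import Counter

def fill_input_vector(words, label_dictionary):
    """One dimension per dictionary word; each dimension's value is that
    word's occurrence count in the recognized character sequence."""
    counts = Counter(words)          # occurrence count of each word
    return [counts.get(word, 0) for word in label_dictionary]
```

Because the dictionary fixes both the dimension order and the dimension count, the vector always matches the shape of the training input vector.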
In some optional embodiments, the annotation dictionary is generated based on:
the high-frequency mode labeling unit is used for performing BIO labeling, in a high-frequency mode, on a plurality of sentences actually input by users in the specified field, wherein the high-frequency mode means that, during BIO labeling, the longest word hit in the given dictionary is labeled;
the low-frequency mode labeling unit is used for labeling POI information in the remaining low-frequency modes according to the accurate dictionary extracted in the high-frequency mode;
and the model training unit is used for performing CRF model training, identifying POI information, and adding the POI information to the labeling dictionary.
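The longest-hit labeling in the high-frequency mode can be sketched as a greedy longest-match BIO tagger; the tokens and dictionary entries below are invented examples, and a real system would feed such labels into CRF training.

```python
def bio_label(tokens, dictionary):
    """Greedy longest-match BIO labeling: at each position, label the
    longest dictionary hit as B I ... I, everything else as O."""
    labels, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):          # try longest span first
            if " ".join(tokens[i:j]) in dictionary:
                labels += ["B"] + ["I"] * (j - i - 1)
                i = j
                break
        else:                                         # no dictionary hit
            labels.append("O")
            i += 1
    return labels
```

The longest-match rule is what prevents "west lake" from being labeled when the longer entry "west lake scenic area" also matches.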
In some alternative embodiments, the intent neural network model is trained by:
the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring sample data, and each sample of the sample data comprises an input corpus and an output expected value;
the input vector construction unit is used for constructing an input vector according to the words and the occurrence times thereof contained in the input corpus;
the neural network building unit is used for building a neural network model, and the neural network model comprises 4 layers;
the training unit is used for determining the weight and the bias of each dimension of the input vector by constructing a quadratic cost function and performing iterative optimization on it, wherein the quadratic cost function is:
C = (1/2n) Σx (y(x) − a)²
where y(x) is the expected output value, a is the actual output, z = w1x1 + w2x2 + … + wnxn + b is the weighted input, x1 to xn are the dimension values of the input vector, w1 to wn are the weights corresponding to those dimension values, and b is the bias.
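A toy version of the quadratic-cost training described above, using gradient descent on a single linear unit; the data, learning rate, and iteration count are illustrative, not from the patent.

```python
import numpy as np

def quadratic_cost(y, a):
    """C = (1/2n) * sum over the n samples of (y(x) - a)^2."""
    return np.mean((y - a) ** 2) / 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # input vectors x1..xn
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3       # expected outputs y(x)

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(500):                            # iterative optimization
    a = X @ w + b                               # actual output a
    w -= lr * (a - y) @ X / len(y)              # gradient of C w.r.t. w
    b -= lr * np.mean(a - y)                    # gradient of C w.r.t. b
```

After training, the weights and bias recover the generating parameters and the cost is driven close to zero.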
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An automatic dialogue method of an intelligent voice customer service robot is characterized by comprising the following steps:
converting the acquired user voice data into a character sequence;
based on the trained intention neural network model, determining expected values corresponding to the character sequence and containing intention keywords and entity keywords;
matching the next dialogue node branch of the current dialogue node in the given tree flow according to the expected value, and filling the keywords into the slot position of the instruction information corresponding to the dialogue node branch;
determining the text to be spoken according to the user instruction information;
and converting the text into speech and outputting the speech to the user.
2. The intelligent voice customer service robot automatic conversation method of claim 1, wherein converting the acquired user voice data into a sequence of characters comprises:
converting the user voice data to discrete digital signals;
improving high frequency characteristics by pre-emphasis;
extracting MFCC characteristic information of the voice;
and converting the MFCC characteristic information into a character sequence through an acoustic model and a language model.
3. The intelligent voice customer service robot automatic conversation method according to claim 1, further comprising after converting the acquired user voice data into a character sequence:
determining the occurrence times of each word according to the character sequence;
and filling, according to each word and its occurrence count, a calculation input vector with the same dimensions as the training input vector used when training the intention neural network model, wherein each dimension of the calculation input vector corresponds to a specific word in the labeling dictionary, and the value of each dimension is the occurrence count of that specific word in the character sequence.
4. The intelligent voice customer service robot automatic conversation method according to claim 3, wherein the labeling dictionary is obtained based on the steps of:
performing BIO labeling, in a high-frequency mode, on a plurality of sentences actually input by users in a specified field, wherein the high-frequency mode means that, during BIO labeling, the longest word hit in the given dictionary is labeled;
marking POI information in other low-frequency modes according to the accurate dictionary extracted from the high-frequency mode;
and (5) performing CRF model training, identifying POI information and adding the POI information into a labeling dictionary.
5. The intelligent voice customer service robot automatic conversation method according to claim 1, wherein said intention neural network model is trained by the steps of:
obtaining sample data, wherein each sample of the sample data comprises an input corpus and an output expected value;
constructing an input vector according to the words and expressions contained in the input corpus and the occurrence times of the words and expressions;
constructing a neural network model, wherein the neural network model comprises 4 layers;
determining the weight and bias of each dimension of the input vector by constructing a quadratic cost function and performing iterative optimization on it, wherein the quadratic cost function is:
C = (1/2n) Σx (y(x) − a)²
where y(x) is the expected output value, a is the actual output, z = w1x1 + w2x2 + … + wnxn + b is the weighted input, x1 to xn are the dimension values of the input vector, w1 to wn are the weights corresponding to those dimension values, and b is the bias.
6. An intelligent voice customer service robot automatic dialogue device, comprising:
the voice recognition module is used for converting the acquired user voice data into a character sequence;
the language understanding module is used for determining expected values corresponding to the character sequences and containing the intention keywords and the entity keywords based on the trained intention neural network model;
the dialogue management module is used for matching the next dialogue node branch of the current dialogue node in the given tree flow according to the expected value and filling the keywords into the slot position of the instruction information corresponding to the dialogue node branch;
the language generation module is used for determining the text to be spoken according to the user instruction information;
and the voice synthesis module is used for converting the text into speech and outputting the speech to the user.
7. The intelligent voice customer service robot automatic conversation device according to claim 6, wherein said voice recognition module comprises:
a data conversion unit for converting the user voice data into discrete digital signals;
a high-frequency processing unit for improving high-frequency characteristics by pre-emphasis;
a voice extraction unit for extracting MFCC feature information of a voice;
and the character sequence generating unit is used for converting the MFCC characteristic information into a character sequence through an acoustic model and a language model.
8. The intelligent voice customer service robot automatic conversation device according to claim 6, wherein said automatic conversation device further comprises a preprocessing module, said preprocessing module comprising:
the word counting unit is used for determining the occurrence frequency of each word according to the character sequence;
and the input vector filling unit is used for filling, according to each word and its occurrence count, a calculation input vector with the same dimensions as the training input vector used when training the intention neural network model, wherein each dimension of the calculation input vector corresponds to a specific word in the labeling dictionary, and the value of each dimension is the occurrence count of that specific word in the character sequence.
9. The intelligent voice customer service robot automatic dialog device of claim 8, wherein the labeling dictionary is generated based on:
the high-frequency mode labeling unit is used for performing BIO labeling, in a high-frequency mode, on a plurality of sentences actually input by users in the specified field, wherein the high-frequency mode means that, during BIO labeling, the longest word hit in the given dictionary is labeled;
the low-frequency mode labeling unit is used for labeling POI information in the remaining low-frequency modes according to the accurate dictionary extracted in the high-frequency mode;
and the model training unit is used for performing CRF model training, identifying POI information, and adding the POI information to the labeling dictionary.
10. The intelligent voice customer service robot automatic conversation device according to claim 6, wherein the intention neural network model is trained by:
the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring sample data, and each sample of the sample data comprises an input corpus and an output expected value;
the input vector construction unit is used for constructing an input vector according to the words and the occurrence times thereof contained in the input corpus;
the neural network building unit is used for building a neural network model, and the neural network model comprises 4 layers;
the training unit is used for determining the weight and the bias of each dimension of the input vector by constructing a quadratic cost function and performing iterative optimization on it, wherein the quadratic cost function is:
C = (1/2n) Σx (y(x) − a)²
where y(x) is the expected output value, a is the actual output, z = w1x1 + w2x2 + … + wnxn + b is the weighted input, x1 to xn are the dimension values of the input vector, w1 to wn are the weights corresponding to those dimension values, and b is the bias.
CN202111554796.1A 2021-12-17 2021-12-17 Automatic conversation method and device for intelligent voice customer service robot Active CN114238605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111554796.1A CN114238605B (en) 2021-12-17 2021-12-17 Automatic conversation method and device for intelligent voice customer service robot


Publications (2)

Publication Number Publication Date
CN114238605A true CN114238605A (en) 2022-03-25
CN114238605B CN114238605B (en) 2022-11-22

Family

ID=80758451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111554796.1A Active CN114238605B (en) 2021-12-17 2021-12-17 Automatic conversation method and device for intelligent voice customer service robot

Country Status (1)

Country Link
CN (1) CN114238605B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8775341B1 (en) * 2010-10-26 2014-07-08 Michael Lamport Commons Intelligent control with hierarchical stacked neural networks
CN108804453A (en) * 2017-04-28 2018-11-13 上海荆虹电子科技有限公司 A kind of video and audio recognition methods and device
CN111651572A (en) * 2020-05-19 2020-09-11 金日泽 Multi-domain task type dialogue system, method and terminal
CN112115242A (en) * 2020-08-01 2020-12-22 国网河北省电力有限公司信息通信分公司 Intelligent customer service question-answering system based on naive Bayes classification algorithm
CN112612462A (en) * 2020-12-29 2021-04-06 平安科技(深圳)有限公司 Method and device for adjusting phone configuration, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FENG ZHIWEI: "Theory and Methods of Formal Computer Analysis of Natural Language", 31 January 2017, University of Science and Technology of China Press *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114691852A (en) * 2022-06-01 2022-07-01 阿里巴巴达摩院(杭州)科技有限公司 Man-machine conversation system and method
CN114691852B (en) * 2022-06-01 2022-08-12 阿里巴巴达摩院(杭州)科技有限公司 Man-machine conversation system and method

Also Published As

Publication number Publication date
CN114238605B (en) 2022-11-22

Similar Documents

Publication Publication Date Title
EP1366490B1 (en) Hierarchichal language models
CN110321418B (en) Deep learning-based field, intention recognition and groove filling method
CN114547329A (en) Method for establishing pre-training language model, semantic analysis method and device
AU2006317628A1 (en) Word recognition using ontologies
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
CN111680512B (en) Named entity recognition model, telephone exchange extension switching method and system
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN112037773A (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN112349289A (en) Voice recognition method, device, equipment and storage medium
CN111553157A (en) Entity replacement-based dialog intention identification method
CN115497465A (en) Voice interaction method and device, electronic equipment and storage medium
CN114238605B (en) Automatic conversation method and device for intelligent voice customer service robot
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
US20210327415A1 (en) Dialogue system and method of controlling the same
Viacheslav et al. System of methods of automated cognitive linguistic analysis of speech signals with noise
CN117149977A (en) Intelligent collecting robot based on robot flow automation
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN115132178B (en) Semantic endpoint detection system based on deep learning
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium
Oh et al. Question understanding based on sentence embedding on dialog systems for banking service
Ghadekar et al. ASR for Indian regional language using Nvidia’s NeMo toolkit
CN114265920B (en) Intelligent robot conversation method and system based on signals and scenes
WO2023188827A1 (en) Inference device, question answering device, dialogue device, and inference method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant