CN114333813A - Implementation method and device for configurable intelligent voice robot and storage medium - Google Patents

Implementation method and device for configurable intelligent voice robot and storage medium

Info

Publication number
CN114333813A
CN114333813A (application CN202110417581.9A)
Authority
CN
China
Prior art keywords
scene
conversation
feature
session
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110417581.9A
Other languages
Chinese (zh)
Inventor
王岗 (Wang Gang)
林健 (Lin Jian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Financial Technology Nanjing Co Ltd
Original Assignee
Suning Financial Technology Nanjing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Financial Technology Nanjing Co Ltd filed Critical Suning Financial Technology Nanjing Co Ltd
Priority to CN202110417581.9A priority Critical patent/CN114333813A/en
Publication of CN114333813A publication Critical patent/CN114333813A/en
Priority to CA3155717A priority patent/CA3155717A1/en
Pending legal-status Critical Current

Landscapes

  • Manipulator (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, an apparatus, and a storage medium for implementing a configurable intelligent voice robot, belonging to the field of artificial intelligence. The method comprises: obtaining a sample corpus for each of a plurality of conversation scenes; for each conversation scene, generating scene features from its sample corpus, the scene features comprising the feature words of the scene and a feature word sequence obtained by mapping and converting those words; and configuring the intelligent voice robot based on a preset word vector space model and the scene features of each conversation scene, where the model is used by the robot to compute word-vector similarity between a user utterance and the scene features of each conversation scene so as to identify the intent scene of the utterance. Within a single business domain, any business scenario can thus be handled by one set of general-purpose intelligent voice robot solutions, greatly reducing the development cost of marketing robots and the barrier to customized service.

Description

Implementation method and device for configurable intelligent voice robot and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, and a storage medium for implementing a configurable intelligent voice robot.
Background
At present, intelligent voice robots are widely used in telemarketing and customer-service systems, where the robot completes telemarketing calls to specific customers through an IVR (Interactive Voice Response) service. Telemarketing takes a simple form and a complex form. The simple form is one-sided marketing broadcast, such as reading out a single marketing message; the complex form additionally supports multiple rounds of voice question-and-answer interaction with the customer and feedback based on judgments about the customer's latent intent. The intelligent voice robot addresses problems that have long troubled traditional manual telemarketing, such as high recruitment cost, long training cycles, inconsistent service levels, and unstable service quality, by completing large-scale, repetitive work with a background robot built on a natural-language model, helping enterprises cut the labor cost of routine outbound calls by roughly 80%.
In the process of implementing the invention, the inventors found that complex-form intelligent voice robots currently capable of scene question-and-answer interaction generally suffer from the following technical problems:
During scene question-and-answer interaction, the intelligent voice robot gives an appropriate reply by identifying the intent scene of the user's utterance. Traditional intent-recognition algorithms based on a text-classification model are trained offline: the model learns a historical labeled corpus and then assigns each future sample to one of the learned label categories. An algorithm model based on Bayesian theory can only handle known classes; a sample from a class absent from the historical corpus is still forced into one of the known classes, which inevitably causes classification errors. In practice, developers can only address this by adding labeled corpora for the new class to the historical corpus and retraining the model. This is inefficient and costly, and forward convergence of the model is not guaranteed, since relearning with newly introduced corpora can reduce classification accuracy on the original classes. Moreover, because the underlying algorithm depends on conditional probabilities and the class discrimination probabilities are not mutually independent, a voice robot built for one set of scene categories must be custom-developed again for another and cannot be migrated or reused, making the scene development cost of the intelligent voice robot too high.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, the present application provides an implementation method, an implementation device and a storage medium for a configurable intelligent voice robot. The technical scheme of the application is as follows:
in a first aspect, a method for implementing a configurable intelligent voice robot is provided, where the method includes:
obtaining a sample corpus of each session scene in a plurality of session scenes;
generating scene features of the conversation scenes based on the sample corpora of the conversation scenes aiming at each conversation scene, wherein the scene features comprise feature words of the conversation scenes and feature word sequences obtained by mapping and converting the feature words;
the intelligent voice robot is configured based on a preset word vector space model and scene features of each conversation scene, and the word vector space model is used for enabling the intelligent voice robot to carry out word vector similarity calculation on the user conversation and the scene features of each conversation scene so as to identify an intention scene of the user conversation.
Further, the generating, for each of the conversation scenes, scene features of the conversation scene based on the sample corpus of the conversation scene includes:
aiming at each conversation scene, acquiring discrete representation of sample corpora of the conversation scene based on a preset domain dictionary;
extracting feature words of the conversation scene by adopting a feature selection algorithm based on the discrete representation of the sample corpus of the conversation scene;
mapping and converting the feature words of the conversation scene into corresponding dictionary indexes to generate a feature word sequence of the conversation scene;
preferably, the feature selection algorithm is a chi-square statistical feature selection algorithm.
Further, the method further comprises:
storing each conversation scene and the scene characteristics of each conversation scene into a scene characteristic relation table;
preferably, the method further comprises:
receiving configuration feature words input aiming at any one of the conversation scenes;
maintaining the scene characteristics of the session scene in the scene characteristic relation table based on the configuration characteristic words of the session scene and the configuration characteristic word sequence obtained by mapping and converting the configuration characteristic words;
preferably, the receiving of the configuration feature word input for any of the session scenarios includes:
and receiving configuration characteristic words input by users with characteristic configuration authority aiming at the conversation scene.
Further, the maintaining the scene characteristics of the session scene in the scene characteristic relationship table based on the configuration characteristic words of the session scene and the configuration characteristic word sequence obtained by mapping and converting the configuration characteristic words includes:
in the scene feature relationship table, the configuration feature words of the conversation scene are merged into the feature words of the conversation scene, and the configuration feature word sequence obtained by mapping and converting the configuration feature words is added to the feature word sequence of the conversation scene.
Further, the word vector space model is obtained by training in the following way:
and training the pre-trained BERT word vector space by using the domain linguistic data of the domain to which each conversation scene belongs to obtain the word vector space model.
Further, the method further comprises:
aiming at any one conversation scene, receiving a state transition diagram input by a first user to the conversation scene, and receiving supplementary information input by a second user to the state transition diagram to generate a state transition matrix of the conversation scene;
and generating a script file containing the state-transition logic based on the state transition matrix of the conversation scene, and generating a finite state machine based on the script file, for returning the corresponding talk script when the intent scene of a user session is identified.
Further, the method further comprises:
after receiving a user session, the configured intelligent voice robot preprocesses the user session to obtain its word segments (tokens), and maps and converts the word segments to obtain the feature word sequence of the user session;
constructing a feature vector of the user session and a scene feature vector of each session scene by using the word vector space model based on the feature word sequence of the user session and the feature word sequence of each session scene;
and performing similarity calculation between the feature vector of the user session and the scene feature vector of each conversation scene, and identifying the intent of the user session based on the similarity calculation result, so as to return the talk script corresponding to the intent.
In a second aspect, an implementation apparatus for a configurable intelligent voice robot is provided, the apparatus including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring sample corpora of each conversation scene in a plurality of conversation scenes;
a generating module, configured to generate, for each session scene, scene features of the session scene based on a sample corpus of the session scene, where the scene features include feature words of the session scene and a feature word sequence obtained by mapping and converting the feature words;
the configuration module is used for configuring the intelligent voice robot based on a preset word vector space model and scene features of each conversation scene, and the word vector space model is used for enabling the intelligent voice robot to carry out word vector similarity calculation on the scene features of a user conversation and each conversation scene so as to identify an intention scene of the user conversation.
Further, the generating module includes:
the presentation unit is used for acquiring discrete presentation of the sample corpus of each conversation scene based on a preset domain dictionary aiming at each conversation scene;
the screening unit is used for extracting the feature words of the conversation scene by adopting a feature selection algorithm based on the discrete representation of the sample corpus of the conversation scene;
the generating unit is used for mapping and converting the characteristic words of the conversation scene into corresponding dictionary indexes and generating a characteristic word sequence of the conversation scene;
preferably, the feature selection algorithm is a chi-square statistical feature selection algorithm.
Further, the apparatus further comprises:
the storage module is used for storing each session scene and the scene characteristics of each session scene into a scene characteristic relation table;
preferably, the apparatus further comprises:
the receiving module is used for receiving configuration characteristic words input aiming at any conversation scene;
the maintenance module is used for maintaining the scene characteristics of the conversation scene in the scene characteristic relation table based on the configuration characteristic words of the conversation scene and the configuration characteristic word sequence obtained by mapping and converting the configuration characteristic words;
preferably, the receiving module is configured to receive a configuration feature word input by a user with a feature configuration authority for the session scenario.
Further, the maintenance module is configured to merge the configuration feature words of the session scene into the feature words of the session scene in the scene feature relationship table, and to add the configuration feature word sequence obtained by mapping and converting the configuration feature words to the feature word sequence of the session scene.
Further, the device further comprises a training module, wherein the training module is used for training the pre-trained BERT word vector space by using the domain corpora in the field to which each session scene belongs to obtain the word vector space model.
Further, the apparatus further comprises a state machine configuration module, configured to:
aiming at any one conversation scene, receiving a state transition diagram input by a first user to the conversation scene, and receiving supplementary information input by a second user to the state transition diagram to generate a state transition matrix of the conversation scene;
and generating a script file containing the state-transition logic based on the state transition matrix of the conversation scene, and generating a finite state machine based on the script file, for returning the corresponding talk script when the intent scene of a user session is identified.
Further, the apparatus further comprises an intention scene recognition module comprising:
the acquisition unit is used for, after the configured intelligent voice robot receives a user session, preprocessing the user session to obtain its word segments (tokens), and mapping and converting the word segments to obtain the feature word sequence of the user session;
the construction unit is used for constructing the feature vector of the user session and the scene feature vector of each session scene by using the word vector space model based on the feature word sequence of the user session and the feature word sequence of each session scene;
and the matching unit is used for performing similarity calculation between the feature vector of the user session and the scene feature vector of each conversation scene, identifying the intent of the user session based on the similarity calculation result, and returning the talk script corresponding to the intent.
In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the following operation steps when executing the computer program:
obtaining a sample corpus of each session scene in a plurality of session scenes;
generating scene features of the conversation scenes based on the sample corpora of the conversation scenes aiming at each conversation scene, wherein the scene features comprise feature words of the conversation scenes and feature word sequences obtained by mapping and converting the feature words;
the intelligent voice robot is configured based on a preset word vector space model and scene features of each conversation scene, and the word vector space model is used for enabling the intelligent voice robot to carry out word vector similarity calculation on the user conversation and the scene features of each conversation scene so as to identify an intention scene of the user conversation.
In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, performs the following operation:
obtaining a sample corpus of each session scene in a plurality of session scenes;
generating scene features of the conversation scenes based on the sample corpora of the conversation scenes aiming at each conversation scene, wherein the scene features comprise feature words of the conversation scenes and feature word sequences obtained by mapping and converting the feature words;
the intelligent voice robot is configured based on a preset word vector space model and scene features of each conversation scene, and the word vector space model is used for enabling the intelligent voice robot to carry out word vector similarity calculation on the user conversation and the scene features of each conversation scene so as to identify an intention scene of the user conversation.
The application provides a method, an apparatus, and a storage medium for implementing a configurable intelligent voice robot. A sample corpus is obtained for each of a plurality of conversation scenes; for each conversation scene, scene features are generated from its sample corpus, comprising the feature words of the scene and the feature word sequence obtained by mapping and converting those words; and the intelligent voice robot is configured based on a preset word vector space model and the scene features of each conversation scene, where the model is used by the robot to compute word-vector similarity between a user utterance and the scene features of each conversation scene so as to identify the intent scene of the utterance. With this technical scheme, any business scenario within the same business domain can be handled by one set of general-purpose intelligent voice robot solutions, avoiding the awkward situation in which different voice robots must be developed separately for specific IVR (Interactive Voice Response) scene dialogs of different products or different customers of the same business line; cross-task reuse of common conversation scenes becomes possible, greatly reducing the development cost of marketing robots and the barrier to customized service.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is an overall framework diagram of an implementation of a configurable intelligent voice robot provided by the present application;
FIG. 2 is a schematic flow chart diagram illustrating a method for implementing a configurable intelligent voice robot, in one embodiment;
FIG. 3 is a schematic flow chart of step 202 of the method shown in FIG. 2;
FIG. 4 is a flow diagram that illustrates scene feature maintenance, under an embodiment;
FIG. 5 is a flow diagram that illustrates the configuration of logical entities in one embodiment;
FIG. 6 is a diagram of the foreground state-transition drawing-board effect in one embodiment;
FIG. 7 is a diagram of a json file format representing the logical relationships of the state transition matrix, in one embodiment;
FIG. 8 is a flow diagram that illustrates intent scene recognition, in one embodiment;
FIG. 9 is a schematic diagram of an implementation apparatus of a configurable intelligent voice robot in one embodiment;
fig. 10 is an internal structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be understood that, unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
Furthermore, in the description of the present application, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
As described in the background above, when developing intelligent voice robots for different scene categories, the prior art requires separate customized development for each and allows no migration or reuse between them, which makes the scene development cost of intelligent voice robots too high. In view of this, the present application innovatively provides an implementation method for a configurable intelligent voice robot that offers a unified interface service and configurable scenes, and, through a companion "intelligent voice platform", gives the business side a visual operation interface. This greatly improves the scene development efficiency of the intelligent voice robot and the business side's participation experience, solves to some extent the problem that a robot's logic entity cannot be reused, and provides a business-friendly open scene configuration interface so that business personnel can participate directly in the training and generation of the robot entity.
Fig. 1 is an overall framework diagram of an implementation of a configurable intelligent voice robot provided by the present application. Referring to fig. 1, the framework mainly involves internal basic configuration, external configuration, intent recognition, and an FSM (Finite-State Machine) robot logic entity. The internal basic configuration may be implemented by background algorithm developers through related algorithms and includes internal basic features (i.e., scene features), internal basic talk scripts, and internal basic logic; the external configuration may be set by foreground business personnel through the front end of the intelligent voice robot's background management system and includes externally configurable features, externally configurable talk scripts, and configurable dialog logic. After the intelligent voice robot is generated using the configuration scheme of the present application, it can recognize the intent scene of the input content, i.e., the text transcribed from the customer's speech by an ASR (Automatic Speech Recognition) module, return the corresponding talk-script content according to the finite-state-machine logic entity, and then output the speech converted from that text by TTS (Text-To-Speech) technology.
The technical solution of the present application will be described in detail by a plurality of examples.
In one embodiment, an implementation method of a configurable intelligent voice robot is provided, and the method can be applied to any computer device, such as a server, and the server can be implemented by a stand-alone server or a server cluster composed of a plurality of servers. As shown in fig. 2, the method may include the steps of:
201, a sample corpus of each of a plurality of session scenes is obtained.
Here, the plurality of session scenes are included in a preset session scene list for recording one or more session scenes of a specific service domain.
Specifically, the sample corpus of each conversation scene may be obtained by classifying and labeling a domain-specific corpus according to conversation scene category, where the domain-specific corpus is the corpus of a particular business domain, such as the customer-service corpus of consumer-credit telemarketing. It should be understood that the embodiments of the present application do not limit the specific acquisition process.
The above conversation scenes can be obtained by scene abstraction over the domain-specific corpus, where scene abstraction is a process going from data to information to knowledge. For example, in the field of consumer-credit telemarketing, the common conversation scenes of a telemarketing campaign, such as "credit-investigation-related questions", "amount-related questions", "interest-related questions", and "operational consultation", can be combed out by analyzing the telemarketing dialog logs of a consumer-credit product under the guidance of business personnel, and the statements in the customer-service dialog logs are then labeled according to the summarized scene categories. For example, for a consumer-credit telemarketing campaign, the conversation scenes shown in Table 1 below may be abstracted:
table 1: conversation scene of consumption credit electric sales
Serial number Conversation scene
1 Credit investigation correlation
2 Interest related
3 Relationship of quota
4 Operational consultation
5 Recall contact
6 Affirmation that
7 Repudiation of
8 Termination requirement
9 Is unknown
In practical applications, each conversation scene can be abstracted as a session State, and the dialog process between customer service and customer can be abstracted as transitions between session states. Taking each session state as a node, a directed edge between two session states represents the transition from one state to the other, so the whole dialog process can be abstracted into a Graph formed by nodes and directed edges.
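The state-and-edge abstraction above can be sketched as a minimal finite-state machine. This is an illustrative toy only: the state names, recognized-scene labels, and transition table below are hypothetical, not taken from the patent.

```python
class DialogFSM:
    """Conversation states as nodes; recognized intent scenes drive the edges."""

    def __init__(self, start, transitions):
        # transitions: {(current_state, recognized_scene): next_state}
        self.state = start
        self.transitions = transitions

    def step(self, recognized_scene):
        # Follow the directed edge for this (state, scene) pair;
        # stay in the current state if no edge is defined.
        self.state = self.transitions.get((self.state, recognized_scene), self.state)
        return self.state


# Hypothetical dialog graph for a telemarketing call.
fsm = DialogFSM("opening", {
    ("opening", "interest_related"): "explain_interest",
    ("explain_interest", "affirmation"): "offer_product",
    ("explain_interest", "termination"): "goodbye",
})
```

In the patent's scheme the transition table would come from the state transition matrix drawn by business personnel, serialized as a json script file.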
202, generating scene features of the conversation scene based on the sample corpus of the conversation scene for each conversation scene, wherein the scene features comprise feature words of the conversation scene and feature word sequences obtained by mapping and converting the feature words.
Specifically, the sample corpus of each conversation scene may first be converted into a discrete representation based on a Bag-of-Words (BoW) model; then, based on that discrete representation, the feature words of each conversation scene are extracted with a feature selection algorithm; finally, the feature words of each conversation scene are mapped and converted into the dictionary indexes of a preset domain dictionary to obtain the feature word sequence of each conversation scene.
The bag-of-words model splits a corpus text into words and treats them as if thrown into a bag, ignoring word order, grammar, and syntax: the text is viewed simply as a collection of words, and each word's occurrence in the text is independent of whether other words appear. The bag-of-words representation may use one-hot encoding, TF-IDF, or an N-gram model.
Illustratively, the 5 feature words of the "credit-investigation-related" conversation scene, ['credit investigation', 'People's Bank', 'report', 'personal credit', 'risk control'], are mapped and converted to their index numbers [12, 223, 166, 17, 62] in the domain dictionary, yielding the feature word sequence of the "credit-investigation-related" conversation scene.
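The pipeline of steps 201-202 (bag-of-words representation, chi-square feature selection, dictionary-index mapping) can be sketched in a few lines of pure Python. The tiny English corpora, scene names, and top-k value below are all hypothetical stand-ins for the patent's Chinese domain corpus and dictionary.

```python
from collections import defaultdict

# Toy labeled sample corpora for two conversation scenes (hypothetical).
scenes = {
    "credit_related": ["credit report check", "credit bureau report",
                       "personal credit record"],
    "interest_related": ["loan interest rate", "interest too high",
                         "what is the rate"],
}

# 1. Domain dictionary: word -> index, i.e. the discrete BoW space.
vocab = sorted({w for docs in scenes.values() for d in docs for w in d.split()})
domain_dict = {w: i for i, w in enumerate(vocab)}

# 2. Chi-square score of a word for a scene, from the 2x2 contingency table
#    (document contains word or not) x (document belongs to scene or not).
def chi2_score(word, scene):
    a = b = c = d = 0
    for s, docs in scenes.items():
        for doc in docs:
            has = word in doc.split()
            if s == scene:
                a += has; b += (not has)
            else:
                c += has; d += (not has)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# 3. Top-k feature words of a scene, mapped to dictionary indexes to form
#    the scene's feature word sequence.
def scene_features(scene, k=3):
    words = sorted(vocab, key=lambda w: chi2_score(w, scene), reverse=True)[:k]
    return words, [domain_dict[w] for w in words]

words, seq = scene_features("credit_related")
```

A production system would of course use real Chinese word segmentation and a curated domain dictionary; the sketch only shows the shape of the computation.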
Preferably, after the step of generating the scene features of the conversation scene based on the sample corpus of the conversation scene, for each conversation scene, the method may further include:
and storing each session scene and the scene characteristics of each session scene into a scene characteristic relation table.
Specifically, the name, the feature word and the feature word sequence of each session scene are correspondingly stored in a scene feature relationship table. The scene feature relationship table is used for storing a corresponding relationship between a session scene and scene features (including feature words and feature word sequences).
Illustratively, the scene feature relationship table of a consumer-credit telemarketing campaign is shown in Table 2.
Table 2: scene characteristic relation table of consumption credit electricity selling activity
Serial number Conversation scene Characteristic word Sequence of feature words
1 Credit investigation correlation ' Credit ', ' pedestrian ' of '] [12,223]
2 Interest related [ 'amount', 'loan amount', 'amount'] [2,12,13,9]
3 Relationship of quota [ 'interest', 'interest rate', 'amount'] [3,5,9]
4 Operational consultation [ 'operation', 'transaction', 'set', 'configuration'] [8,103,198,210]
In practical application, the scene feature relationship table is stored on the server and maintained offline by background algorithm engineers through regular text-data-mining work; it is isolated from foreground business personnel.
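The relation table, together with the maintenance step described earlier (merging business-configured feature words and their mapped index sequence into an existing scene entry), might look like the following minimal sketch. All words and index numbers here are illustrative placeholders, not the patent's actual data.

```python
# Scene feature relationship table: scene -> (feature words, feature word sequence).
scene_feature_table = {
    "credit_related": {"words": ["credit", "report"], "sequence": [12, 223]},
}

# Preset domain dictionary mapping words to indexes (hypothetical values).
domain_dict = {"credit": 12, "report": 223, "blacklist": 305}

def merge_config_words(scene, config_words):
    """Merge configured feature words into a scene entry, skipping duplicates
    and words absent from the domain dictionary."""
    entry = scene_feature_table[scene]
    for w in config_words:
        if w not in entry["words"] and w in domain_dict:
            entry["words"].append(w)
            entry["sequence"].append(domain_dict[w])

# A business user with feature-configuration authority adds two words;
# "report" is already present, so only "blacklist" is merged.
merge_config_words("credit_related", ["blacklist", "report"])
```

In the patent's design this merge would be triggered from the management front end by users holding the feature-configuration authority.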
And 203, configuring the intelligent voice robot based on a preset word vector space model and scene characteristics of each conversation scene, wherein the word vector space model is used for the intelligent voice robot to perform word vector similarity calculation on the user conversation and the scene characteristics of each conversation scene so as to identify an intention scene of the user conversation.
In this embodiment, the intelligent voice robot is configured with the word vector space model and the scene features of each conversation scene, so that in practical application the intent scene of a user utterance can be identified from them. Specifically, when the intelligent voice robot converses with a user, the user's speech is transcribed into text by an Automatic Speech Recognition (ASR) module; feature information is then extracted from the transcribed text, word-vector similarity between the utterance and the scene features of each conversation scene is computed with the word vector space model, and the intent scene of the utterance is identified from the word-vector similarity results.
In one example, the word vector space model described above may be trained in the following manner, including:
and training the pre-trained BERT word vector space by using the domain linguistic data of the domain to which each conversation scene belongs to obtain the word vector space model.
Here, the domain corpus refers to the corpus of the specific business domain to which each session scenario belongs, for example, the customer service corpus of a consumer credit product.
In this embodiment, by introducing the large-scale BERT (Bidirectional Encoder Representations from Transformers) word vector space (768 dimensions) pre-trained by Google and served through bert-serving, pre-trained embeddings that would otherwise demand large-scale corpora and computation costs beyond current hardware resources can be obtained. On this basis, the BERT word vector space is retrained with the company's own customer service corpus, calibrating the BERT word vectors so that they better fit the specific business scenario.
By adopting the technical scheme of this embodiment, any business scene within the same business field can be handled by one set of universal intelligent voice robot solutions, avoiding the awkward situation of having to develop different voice robots for specific IVR (Interactive Voice Response) scene conversations with different products or different customers of the same business line. Cross-task reuse of common conversation scenes becomes possible and business scenes can be extended, greatly reducing the development cost of the marketing robot and the threshold of customized service.
In an embodiment, as shown in fig. 3, the step 202 of generating, for each conversation scenario, a scenario feature of the conversation scenario based on the sample corpus of the conversation scenario may include:
301, for each conversation scene, obtaining a discrete representation of a sample corpus of the conversation scene based on a preset domain dictionary.
Specifically, a domain dictionary may be created from the fully segmented corpus with stop words removed; the domain dictionary contains all valid vocabulary appearing in the corpus. All sample corpora of the target session scene are then converted, based on the bag-of-words model over this domain dictionary, into a discrete representation; that is, the corpora are converted into the following expression form:
[ (word index in dictionary, word frequency in document) ]
For example, given a basic dictionary [ "i", "you", "dislike", "love", "Nanjing", "hometown" ] containing 6 words, the sentence "i / love / i / hometown / Nanjing" is transcribed by the bag-of-words model as [ (0,2), (3,1), (4,1), (5,1) ].
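The bag-of-words conversion above can be sketched in a few lines of Python. The function name `to_bow` and the plain-list dictionary are illustrative, not part of the patent; a real system would build the domain dictionary from the full segmented corpus with stop words removed:

```python
def to_bow(tokens, dictionary):
    """Convert a token list into [(word index in dictionary, word frequency in document)]."""
    counts = {}
    for tok in tokens:
        if tok in dictionary:              # out-of-dictionary tokens are dropped
            idx = dictionary.index(tok)
            counts[idx] = counts.get(idx, 0) + 1
    return sorted(counts.items())

# The toy example from the text: "i / love / i / hometown / Nanjing"
dictionary = ["i", "you", "dislike", "love", "nanjing", "hometown"]
tokens = ["i", "love", "i", "hometown", "nanjing"]
print(to_bow(tokens, dictionary))  # → [(0, 2), (3, 1), (4, 1), (5, 1)]
```

The same representation is reproduced for every sample corpus of the target session scene before feature selection.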
And 302, extracting feature words of the conversation scene by adopting a feature selection algorithm based on the discrete representation of the sample corpus of the conversation scene.
Specifically, after completing the bag-of-words conversion, obtaining a discrete representation of the sample corpus of the target conversation scene, and then extracting the feature words of the target conversation scene by using a feature selection algorithm.
Preferably, the feature selection algorithm is a chi-square statistical feature selection algorithm.
In this embodiment, feature words of a target conversation scene may be extracted based on the chi-square statistic (CHI) technique. In a specific implementation, the feature word extraction for the target conversation scene can be performed according to Table 3 below and the CHI calculation formula.
Table 3: category and document attribution

                                   Number of class c documents    Number of non-class c documents
Documents containing term t                     a                               b
Documents not containing term t                 c                               d
CHI calculation formula:

χ²(t, c) = N × (ad − bc)² / [(a + b) × (c + d) × (a + c) × (b + d)]
where c is a category (class), i.e. "conversation scene", t is a term (term), and N is the total number of texts in the corpus.
The χ² statistic above is commonly used for chi-square hypothesis testing in statistics, judging the consistency or goodness of fit between an actual distribution and a theoretical distribution; its null hypothesis H0 is that there is no significant difference between the observed frequency and the expected frequency. Accordingly, the smaller the chi-square statistic, the closer the observed frequency is to the expected frequency, and the higher the correlation between the two. χ² can therefore be regarded as a measure of the distance between the observed object and the expected object: the smaller the distance, the higher the correlation. In the present application, the "observed object" is a term and the "expected object" is a conversation scene; if a term is highly correlated with a conversation scene, their statistical distributions over the whole sample should be close. Through the χ² statistic, the correlation between every vocabulary item in the dictionary and each category can thus be computed quickly and accurately from a large amount of corpus data. According to the correlation ranking result, a preset number of words with the smallest χ² (the preset number can be set to 5) is selected as the feature set of the conversation scene, completing the feature mapping for each scene/class in the scene list.
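As a hedged sketch, the CHI computation over the counts of Table 3 might look as follows. The names `chi_square` and `select_features` are illustrative, and the selection keeps the smallest χ² values, following the criterion described above (`c_` is used for the count c to avoid shadowing the category symbol):

```python
def chi_square(a, b, c_, d):
    """CHI for one term/scene pair, per Table 3:
    a = docs in scene c containing t, b = docs outside c containing t,
    c_ = docs in c without t, d = docs outside c without t."""
    n = a + b + c_ + d
    denom = (a + b) * (c_ + d) * (a + c_) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c_) ** 2 / denom

def select_features(term_tables, k=5):
    """term_tables: {term: (a, b, c_, d)}. Returns the k terms with the
    smallest chi-square, the selection criterion described in the text."""
    scored = {t: chi_square(*tbl) for t, tbl in term_tables.items()}
    return sorted(scored, key=scored.get)[:k]
```

For example, `chi_square(10, 10, 10, 10)` is 0.0 (the term is distributed identically inside and outside the scene), while `chi_square(10, 0, 0, 10)` is 20.0 (maximally different distributions).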
303, mapping and converting the feature words of the conversation scene into corresponding dictionary indexes, and generating a feature word sequence of the conversation scene.
Specifically, the feature words of the target conversation scene are mapped and converted through dictionary indexes of the domain dictionary to obtain a feature word sequence of the target conversation scene.
For example, assuming the 5 feature words of the "credit investigation related" conversation scene extracted in step 302 are [ "credit investigation", "pedestrian", "report", "personal credit", "wind control" ], after dictionary index mapping conversion the corresponding index numbers in the dictionary are [12, 223, 166, 17, 62], which is the feature word sequence of the target conversation scene.
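The index mapping step can be sketched as below; the `dictionary_index` table is a toy stand-in matching the "credit investigation" example above, not the real domain dictionary:

```python
def to_index_sequence(feature_words, dictionary_index):
    """dictionary_index: {word: index in the domain dictionary}.
    Words absent from the domain dictionary are skipped."""
    return [dictionary_index[w] for w in feature_words if w in dictionary_index]

# Toy index table matching the example in the text
dictionary_index = {"credit investigation": 12, "pedestrian": 223,
                    "report": 166, "personal credit": 17, "wind control": 62}
feature_words = ["credit investigation", "pedestrian", "report",
                 "personal credit", "wind control"]
print(to_index_sequence(feature_words, dictionary_index))  # → [12, 223, 166, 17, 62]
```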
In the prior art, word vector models such as Word2vec, GloVe or ELMo are usually applied to the sample corpora to generate word vectors; the abstract dimensions of the words are hidden inside the generated word vectors, so users cannot maintain or extend them.
In one embodiment, as shown in the flowchart of scene feature maintenance shown in fig. 4, the method further includes:
401, a configuration feature word input for any of the conversational scenarios is received.
Specifically, configuration feature words input by a user for the target session scene through the system front end are received, where the user is a service person, such as a customer service agent. In a specific implementation, the front end can provide a feature relationship expansion function for service personnel to maintain the business domain: the service person selects a specific session scene option at the front end and types into an input box the scene's highly correlated words, summarized and extracted from business knowledge and experience, as the configuration feature words of the target session scene. After receiving them, the system can update them into the externally input feature set of the target session scene following background review.
Receiving a configuration feature word input for any conversation scene, wherein the process may include:
and receiving configuration characteristic words input by users with characteristic configuration authority for the conversation scene.
In this embodiment, since externally configured feature words may be of uneven quality due to differences in the experience and skill of service personnel, the external feature configuration authority may be opened only to a selected subset of experienced service personnel by constructing a feature configuration authority list.
402, maintaining the scene features of the session scene in the scene feature relationship table based on the configuration feature words of the session scene and the configuration feature word sequence obtained by mapping and converting the configuration feature words.
In one example, the implementation of step 402 may include:
and combining the configuration characteristic words of the conversation scene into the characteristic words of the conversation scene in the scene characteristic relation table, and adding the combined configuration characteristic word sequence of the configuration characteristic words into the characteristic word sequence of the conversation scene.
Specifically, after basic de-duplication and error correction of the configuration feature words input for the target conversation scene, they are added directly to the feature words of the target conversation scene in the scene feature relationship table, and the feature word sequence mapping is performed. The final merged result is shown in Table 4 below:
Table 4: scene feature relationship table after merging internal and external configurations

[Table 4 is rendered as an image in the original publication.]
It should be noted that if an externally input configuration feature word is not included in the domain dictionary, it is ignored; for example, if the word "people bank" in Table 4 is not in the domain dictionary, it is skipped when merging the feature word sequence. In this embodiment, by maintaining the scene features of the session scene in the scene feature relationship table based on the configuration feature words of the session scene and the configuration feature word sequence obtained by mapping them, the traditional "foreground/background" working mode of "business defines requirements, technology develops applications" can be overcome. Through "joint definition by internal and external configuration", business personnel and technical personnel cooperatively develop the marketing robot, greatly alleviating the problems of "development divorced from production, requirements mismatched with responses" in the traditional robot development mode.
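The merge rule just described can be sketched as follows; the function name and the toy data are illustrative assumptions, and the background review step is omitted:

```python
def merge_config_words(scene, config_words, dictionary_index):
    """scene: {'words': [...], 'sequence': [...]}. De-duplicates the externally
    configured words, appends them to the scene's feature words, and ignores
    any word missing from the domain dictionary when extending the sequence."""
    for w in config_words:
        if w in scene["words"]:
            continue                      # basic de-duplication
        scene["words"].append(w)
        if w in dictionary_index:         # e.g. "people bank" would be skipped here
            scene["sequence"].append(dictionary_index[w])
    return scene

scene = {"words": ["credit investigation", "pedestrian"], "sequence": [12, 223]}
merge_config_words(scene, ["people bank", "personal credit"],
                   {"credit investigation": 12, "pedestrian": 223, "personal credit": 17})
print(scene["words"])     # → ['credit investigation', 'pedestrian', 'people bank', 'personal credit']
print(scene["sequence"])  # → [12, 223, 17]  ('people bank' absent from the dictionary)
```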
In one embodiment, as shown in the flowchart of the logical entity configuration shown in fig. 5, the method further includes:
and 501, for any session scene, receiving a state transition diagram input by a first user for the session scene, and receiving supplementary information input by a second user for the state transition diagram to generate a state transition matrix of the session scene.
The first user refers to a service person, and the second user refers to an algorithm development technician.
And 502, generating a script file for containing the state transition logic relation based on the state transition matrix of the conversation scene, and generating a finite state machine based on the script file for returning corresponding conversation when the intention scene of the user conversation is identified.
A finite-state machine (FSM), also called a finite-state automaton, or state machine for short, is a mathematical model representing a finite number of states together with the transitions and actions between those states. Its main function is to describe the sequence of states an object passes through during its life cycle, and how it responds to external events by transitioning between states. A state machine has 4 elements: current state, condition, action, and next state. The current state is the state the machine is in now. A condition, also called an "event", triggers an action or a state transition when satisfied. An action is what is executed after the condition is met; after execution, the machine may move to a new state or remain in the original one. An action is optional: the machine may perform no action after the condition is met and transition directly to the new state. The next state is the new state to be entered once the condition is satisfied; "next state" is defined relative to "current state", and once activated, the "next state" becomes the new "current state". The logical relationships of a finite state machine can be represented as a state transition matrix, as shown in Table 5 below.
Table 5: FSM state transition matrix

[Table 5 is rendered as an image in the original publication.]
Here, "action" refers to the process by which the "current state" is converted into the "next state" after the "trigger condition" is satisfied.
In practical application, after a service person drags elements on the front-end platform drawing board to complete a basic state transition diagram, the background generates a possibly incomplete state transition matrix. The effect of the foreground state transition drawing board can be seen in fig. 6. Considering the limits of the business side's technical capability, it is difficult to guarantee the basic "logical completeness" requirement of the state transition matrix, so the newly generated matrices are usually repaired at fixed times each day by background technicians to conform to the form shown in Table 5. The finite state machine model may be abstracted as a combination of the following modes:
[The abstract combination pattern of the finite state machine model is rendered as an image in the original publication.]
When the intention scene of the user session is identified, the corresponding talk script content may be returned through the logical entity of the finite state machine, with the "conversation scene" serving as the "condition/event". For example, the current state is the "greeting" state; the condition/event is the "quota related" conversation scene; after the action is triggered and executed, the machine transitions to the next state, "consulting quota". In the next round of conversation, "consulting quota" is the current state, the condition/event is "negative", and after the action is triggered and executed, the machine transitions to the next state, "not bargained". A complete customer service conversation flow is thus completed. The state transition logic is as follows:
[The state transition logic for this example is rendered as an image in the original publication.]
The complete state transition matrix of the FSM is automatically translated by the program into a json-format script file.
In practical application, for a consumer credit campaign, the form of the json script file representing the logical relationships of its state transition matrix can be seen in fig. 7. When a finite state machine instance is generated, the program reads the json script file into the state machine object, making the logic effective. The generated state machine instance can be stored in Redis, indexed by the uuid passed from the front end, to facilitate program access when the subsequent IVR service starts. The user can also persist the state machine as needed, so that it can be called stably over the long term. If the user selects the task type "single task" at the front end, the finite state machine instance of that task stored in Redis is cleared as an invalid object within a preset time period (for example, 24 hours) after the IVR marketing service is triggered, so as to save storage resources.
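A minimal sketch of such a finite state machine driven by a json-format script is shown below. The state names, events, and script structure are illustrative assumptions, not the actual script format of fig. 7, and the Redis persistence step is omitted:

```python
import json

# Illustrative json script covering the greeting -> consulting quota -> not bargained example
SCRIPT = json.dumps({
    "initial": "greeting",
    "transitions": [
        {"state": "greeting", "event": "quota related", "next": "consulting quota"},
        {"state": "consulting quota", "event": "negative", "next": "not bargained"},
    ],
})

class FiniteStateMachine:
    def __init__(self, script_json):
        script = json.loads(script_json)
        self.state = script["initial"]
        # (current state, condition/event) -> next state
        self.table = {(t["state"], t["event"]): t["next"]
                      for t in script["transitions"]}

    def fire(self, event):
        """Transition on the recognized conversation scene (the condition/event);
        an unknown event leaves the current state unchanged."""
        self.state = self.table.get((self.state, event), self.state)
        return self.state

fsm = FiniteStateMachine(SCRIPT)
fsm.fire("quota related")   # greeting -> consulting quota
fsm.fire("negative")        # consulting quota -> not bargained
print(fsm.state)            # → not bargained
```

In a production setting, the machine instance could then be serialized and cached under the front-end uuid, as the text describes.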
The traditional development mode trains marketing robot instances offline according to business scene requirements and depends on solidified configuration files and state transition logic; once the business side's environment changes and the logic or talk scripts need modification, the modification cost and risk on the development side are huge. By contrast, the "front-end configuration, background translation" mode adopted by this embodiment ensures the robot responds promptly to configuration updates, and the configuration files are decoupled from the program application, making updates to the FSM, the robot's core component, simple and lightweight. The operational part likewise relies on foreground maintenance, achieving timely effectiveness.
In addition, in practical application, developers autonomously complete business logic abstraction and state transition diagram construction according to product requirements, but the whole robot development process lacks direct participation and supervision of business, and the final application effect of the robot greatly depends on understanding of business background and business environment, which causes mismatching of development and production separation, requirements and responses to a great extent. The technical scheme of the embodiment can overcome the traditional 'foreground and background' working mode of 'service definition requirement and technology development application', realizes cooperative development of the intelligent voice robot by service personnel and technical personnel through the 'internal and external configuration common definition' mode, and greatly relieves the 'separation of development and production, mismatch of requirement and response' problem existing in the traditional robot development mode.
In one embodiment, as shown in the flowchart of the intention scene recognition shown in fig. 8, the method further includes:
801, after receiving a user session, the configured intelligent voice robot preprocesses the user session to obtain a plurality of participles in the user session, and performs mapping conversion on the plurality of participles to obtain a feature word sequence of the user session.
Specifically, after the intelligent voice robot is configured and generated, it may converse with a user. The user session may be the text content transcribed from the user's speech through Automatic Speech Recognition (ASR) technology; this text is preprocessed to obtain multiple segmented words, where the preprocessing includes character cleaning, error correction, word segmentation, and stop-word removal. The segmented words are then converted, through the index mapping of the domain dictionary, into the same representation form as the "feature word sequence" of a conversation scene in the scene feature relationship table, yielding the feature word sequence of the user session.
And 802, constructing a feature vector of the user session and a scene feature vector of each session scene by using the word vector space model based on the feature word sequence of the user session and the feature word sequence of each session scene.
In one example, the implementation of step 802 may include:
and respectively mapping the characteristic word sequence of the user session and the characteristic word sequence of each session scene in the scene characteristic relation table by using a word vector space model to generate a characteristic vector of the user session and a scene characteristic vector of each session scene.
Specifically, each element in the feature word sequence of the user session is mapped into the BERT word vector space to obtain a 768-dimensional feature vector; all the element vectors are then summed and averaged (alternatively, the maximum or the median may be taken), yielding a 1 × 768 vector as the feature expression of the user session input. Correspondingly, the feature word sequences of the session scenes in the scene feature relationship table are processed in the same way and converted into 1 × 768 feature vectors respectively.
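The pooling step can be sketched with toy 3-dimensional vectors standing in for the 768-dimensional BERT space; the function name and the vector table are illustrative assumptions:

```python
def pool_sequence(sequence, vector_space):
    """Look up each dictionary index of the feature word sequence in the word
    vector space and average the vectors element-wise into one feature vector."""
    vectors = [vector_space[i] for i in sequence if i in vector_space]
    dim = len(next(iter(vector_space.values())))
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy 3-d stand-ins for the 768-d BERT vectors of indices 12 and 223
vector_space = {12: [1.0, 0.0, 2.0], 223: [3.0, 2.0, 0.0]}
print(pool_sequence([12, 223], vector_space))  # → [2.0, 1.0, 1.0]
```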
And 803, performing similarity calculation on the feature vector of the user session and the scene feature vector of each session scene, and identifying the intention of the user session based on the similarity calculation result, so as to return the talk script corresponding to the intention.
Specifically, for each conversation scene, the cosine similarity between the feature vector of the user session input and the scene feature vector of that conversation scene is calculated; the larger the value, the higher the similarity and the stronger the correlation between the user session and the conversation scene. All conversation scenes are sorted in descending order of cosine similarity, and the conversation scene with the highest similarity is returned as the judged intention scene of this user input. The corresponding answer talk script under that intention scene is then returned according to the current state of the finite state machine; the answer talk script can be converted into voice content through Text-To-Speech (TTS) technology and played to the user. The form of the answer talk scripts can be seen in Table 6.
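The matching step can be sketched as below; the vectors are toy 2-dimensional stand-ins for the pooled 768-dimensional features, and `match_intent` is an illustrative name:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_intent(user_vec, scene_vecs):
    """scene_vecs: {scene name: scene feature vector}. Sorts scenes in
    descending order of cosine similarity and returns the top scene."""
    ranked = sorted(scene_vecs, key=lambda s: cosine(user_vec, scene_vecs[s]),
                    reverse=True)
    return ranked[0]

scenes = {"quota related": [1.0, 0.0], "operational consultation": [0.0, 1.0]}
print(match_intent([0.9, 0.1], scenes))  # → quota related
```

The returned scene name would then be fed to the finite state machine as the condition/event to pick the answer talk script.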
Table 6: state dialect relation table
Figure BDA0003026563480000191
By adopting the technical scheme of the embodiment, the feature vector of the user session and the scene feature vector of each session scene are constructed by using the word vector space model based on the feature word sequence of the user session and the feature word sequence of each session scene, and the similarity matching is performed on the feature vector of the user session and the scene feature vector of each session scene so as to identify the intention scene of the user session, so that the accuracy of the recognition result of the intention of the user session can be improved.
It should be understood that, although the steps in each of the above flowcharts are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, an implementation apparatus of a configurable intelligent voice robot is provided, and the apparatus is configured to perform the implementation method of the configurable intelligent voice robot in the foregoing embodiments. As shown in fig. 9, the implementation apparatus may include:
an obtaining module 901, configured to obtain a sample corpus of each session scene in a plurality of session scenes;
a generating module 902, configured to generate, for each conversation scene, scene features of the conversation scene based on the sample corpus of the conversation scene, where the scene features include feature words of the conversation scene and feature word sequences obtained by mapping and converting the feature words;
the configuration module 903 is configured to configure the intelligent voice robot by using a preset word vector space model and scene features of each session scene, where the word vector space model is used for the intelligent voice robot to perform word vector similarity calculation on the user session and the scene features of each session scene so as to identify an intention scene of the user session.
In one embodiment, the generating module 902 may include:
the presentation unit is used for acquiring, for each conversation scene, a discrete representation of the sample corpus of the conversation scene based on a preset domain dictionary;
the screening unit is used for extracting the feature words of the conversation scene by adopting a feature selection algorithm based on the discrete representation of the sample corpus of the conversation scene;
the generating unit is used for mapping and converting the characteristic words of the conversation scene into corresponding dictionary indexes to generate a characteristic word sequence of the conversation scene;
preferably, the feature selection algorithm is a chi-square statistical feature selection algorithm.
In one embodiment, the apparatus may further comprise:
a saving module 904, configured to save each session scene and the scene characteristics of each session scene into a scene characteristic relationship table;
a receiving module 905, configured to receive configuration feature words input for any session scenario;
a maintenance module 906, configured to maintain the scene characteristics of the session scene in the scene characteristic relationship table based on the configuration characteristic words of the session scene and the configuration characteristic word sequence obtained by mapping and converting the configuration characteristic words;
preferably, the receiving module 905 is configured to receive a configuration feature word input by a user with a feature configuration authority for a session scenario.
In one embodiment, the maintaining module 906 is configured to merge the configuration feature words of the session scene into the feature words of the session scene in the scene feature relationship table, and add the sequence of the configuration feature words of the merged configuration feature words to the sequence of the feature words of the session scene.
In an embodiment, the apparatus may further include a training module, where the training module is configured to train the pre-trained BERT word vector space using the domain corpora of the domain to which each session scene belongs, to obtain a word vector space model.
In one embodiment, the apparatus further comprises a state machine configuration module 907, the state machine configuration module 907 to:
aiming at any conversation scene, receiving a state transition diagram input by a first user to the conversation scene, and receiving supplementary information input by a second user to the state transition diagram to generate a state transition matrix of the conversation scene;
and generating a script file containing the state transition logical relationship based on the state transition matrix of the session scene, and generating a finite state machine based on the script file, for returning the corresponding talk script when the intention scene of the user session is identified.
In one embodiment, the apparatus further comprises an intent scene recognition module 908, the intent scene recognition module 908 comprising:
the acquisition unit is used for preprocessing the user session to acquire a plurality of participles in the user session after the configured intelligent voice robot receives the user session, and mapping and converting the participles to acquire a feature word sequence of the user session;
the construction unit is used for constructing a feature vector of the user session and a scene feature vector of each session scene by using a word vector space model based on the feature word sequence of the user session and the feature word sequence of each session scene;
and the matching unit is used for performing similarity calculation on the feature vector of the user session and the scene feature vector of each session scene, identifying the intention of the user session based on the similarity calculation result, and returning the talk script corresponding to the intention.
For specific limitations of the implementation apparatus of the configurable intelligent voice robot, reference may be made to the above limitations of the implementation method of the configurable intelligent voice robot, which are not described herein again. The modules in the implementation device of the configurable intelligent voice robot can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The server comprises a processor, a memory and a network interface which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used to communicate with other devices via a network connection. The computer program is executed by a processor to implement a method of implementing a configurable intelligent voice robot.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to serve as a limitation on the computing devices to which the disclosed aspects may be applied, and that a particular computing device may include more or less components than those shown, or may have some components combined, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
generating scene characteristics of the conversation scene based on the sample corpus of the conversation scene aiming at each conversation scene, wherein the scene characteristics comprise characteristic words of the conversation scene and a characteristic word sequence obtained by mapping and converting the characteristic words;
the intelligent voice robot is configured based on a preset word vector space model and scene characteristics of each conversation scene, and the word vector space model is used for enabling the intelligent voice robot to carry out word vector similarity calculation on the user conversation and the scene characteristics of each conversation scene so as to identify an intention scene of the user conversation.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
generating scene characteristics of the conversation scene based on the sample corpus of the conversation scene aiming at each conversation scene, wherein the scene characteristics comprise characteristic words of the conversation scene and a characteristic word sequence obtained by mapping and converting the characteristic words;
the intelligent voice robot is configured based on a preset word vector space model and scene characteristics of each conversation scene, and the word vector space model is used for enabling the intelligent voice robot to carry out word vector similarity calculation on the user conversation and the scene characteristics of each conversation scene so as to identify an intention scene of the user conversation.
In summary, compared with the prior art, the technical solutions of the embodiments of the present application can achieve the following technical effects:
1. Adopting the configurable intelligent voice robot can save about 60% of overall research and development cost. In particular, where marketing scenes are concentrated in a few business fields, the cost saved compared with the traditional approach grows in proportion to the number of scenes;
2. The front-end configuration effectivity mode adopted in the present application allows the robot to be updated safely, stably, and conveniently in response to changes in the business environment, with updates taking effect the following day, which brings great convenience to the business side while improving operational stability on the technical side;
3. In combination with the front-end platform, business personnel can directly participate in developing the core components of the marketing robot, which greatly improves their satisfaction and sense of participation;
4. Together with the front-end voice service platform, the robot provides the business side with a complete closed-loop solution from 'raising a demand' to 'generating the intelligent voice robot' to 'launching IVR telemarketing' and on to 'adjusting the demand based on feedback', enabling a one-stop response to customers' intelligent-marketing IVR demands.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for implementing a configurable intelligent voice robot, the method comprising:
obtaining a sample corpus of each conversation scene in a plurality of conversation scenes;
generating scene features of the conversation scenes based on the sample corpora of the conversation scenes aiming at each conversation scene, wherein the scene features comprise feature words of the conversation scenes and feature word sequences obtained by mapping and converting the feature words;
the intelligent voice robot is configured based on a preset word vector space model and scene features of each conversation scene, and the word vector space model is used for enabling the intelligent voice robot to carry out word vector similarity calculation on the user conversation and the scene features of each conversation scene so as to identify an intention scene of the user conversation.
2. The method according to claim 1, wherein the generating, for each conversation scene, scene features of the conversation scene based on the sample corpus of the conversation scene comprises:
for each conversation scene, acquiring a discrete representation of the sample corpus of the conversation scene based on a preset domain dictionary;
extracting feature words of the conversation scene by adopting a feature selection algorithm on the discrete representation of the sample corpus of the conversation scene;
mapping and converting the feature words of the conversation scene into corresponding dictionary indexes to generate the feature word sequence of the conversation scene;
preferably, the feature selection algorithm is a chi-square statistical feature selection algorithm.
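The pipeline recited in claim 2 (discrete representation, chi-square feature selection, dictionary-index mapping) can be sketched as follows. This is a minimal illustration assuming a pre-segmented corpus; the scene labels, words, and domain dictionary are hypothetical stand-ins, not data from the application:

```python
def chi_square(n11, n10, n01, n00):
    """2x2 chi-square statistic for term presence vs. scene membership.

    n11: docs in the scene containing the word; n10: docs outside it
    containing the word; n01/n00: the corresponding absence counts.
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

def select_features(corpus, scene, top_k):
    """Pick the top_k words most associated with `scene` by chi-square score.

    corpus: list of (scene_label, set_of_segmented_words) pairs.
    """
    words = set().union(*(ws for _, ws in corpus))

    def score(w):
        n11 = sum(1 for s, ws in corpus if s == scene and w in ws)
        n10 = sum(1 for s, ws in corpus if s != scene and w in ws)
        n01 = sum(1 for s, ws in corpus if s == scene and w not in ws)
        n00 = sum(1 for s, ws in corpus if s != scene and w not in ws)
        return chi_square(n11, n10, n01, n00)

    return sorted(words, key=score, reverse=True)[:top_k]

def to_index_sequence(feature_words, domain_dict):
    """Map feature words to dictionary indexes, forming the feature word sequence."""
    return [domain_dict[w] for w in feature_words if w in domain_dict]
```

Words that occur mostly within one scene's sample corpus receive high chi-square scores for that scene and become its feature words; the index sequence is then a compact, dictionary-relative encoding of those words.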
3. The method of claim 1, further comprising:
storing each conversation scene and the scene features of each conversation scene into a scene feature relation table;
preferably, the method further comprises:
receiving configuration feature words input for any one of the conversation scenes;
and maintaining the scene features of the conversation scene in the scene feature relation table based on the configuration feature words of the conversation scene and the configuration feature word sequence obtained by mapping and converting the configuration feature words.
4. The method according to claim 3, wherein the maintaining the scene features of the conversation scene in the scene feature relation table based on the configuration feature words of the conversation scene and the configuration feature word sequence obtained by mapping and converting the configuration feature words comprises:
in the scene feature relation table, merging the configuration feature words of the conversation scene into the feature words of the conversation scene, and adding the configuration feature word sequence obtained by mapping and converting the configuration feature words into the feature word sequence of the conversation scene.
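The merge-and-append maintenance described in claims 3 and 4 can be sketched with an in-memory stand-in for the scene feature relation table; the table layout and all names here are illustrative assumptions, not the claimed storage format:

```python
def maintain_scene_features(table, scene, config_words, domain_dict):
    """Merge configuration feature words into a scene's entry of an in-memory
    stand-in for the scene feature relation table."""
    entry = table.setdefault(scene, {"words": set(), "sequence": []})
    for w in config_words:
        if w not in entry["words"]:      # merge without duplicating feature words
            entry["words"].add(w)
            if w in domain_dict:         # append the word's dictionary index
                entry["sequence"].append(domain_dict[w])
    return table
```

Because the merge is idempotent per word, business-side reconfiguration can be replayed safely against the same table.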
5. The method of claim 1, wherein the word vector space model is trained as follows:
training a pre-trained BERT word vector space with the domain corpus of the domain to which each conversation scene belongs, to obtain the word vector space model.
6. The method of claim 1, further comprising:
for any conversation scene, receiving a state transition diagram input by a first user for the conversation scene, and receiving supplementary information input by a second user for the state transition diagram, to generate a state transition matrix of the conversation scene;
and generating a script file containing the state transition logic based on the state transition matrix of the conversation scene, and generating a finite state machine based on the script file, for returning a corresponding speech script when the intended scene of a user session is identified.
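A finite state machine driven by such a state transition matrix can be sketched as below. The states, intents, and scripts are hypothetical examples; the claim does not specify this data layout:

```python
class SessionFSM:
    """Minimal finite state machine built from a state transition matrix,
    returning the speech script attached to the state it lands in."""

    def __init__(self, transitions, start, scripts):
        self.transitions = transitions  # {(state, intent): next_state}
        self.state = start
        self.scripts = scripts          # {state: speech script to return}

    def step(self, intent):
        # Unknown (state, intent) pairs keep the current state, so the
        # robot re-delivers the current script instead of failing on
        # unexpected input.
        self.state = self.transitions.get((self.state, intent), self.state)
        return self.scripts[self.state]
```

In this sketch the script file of the claim would simply serialize `transitions` and `scripts`, so business-side edits to the diagram regenerate the machine without code changes.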
7. The method of any of claims 1 to 6, further comprising:
after the configured intelligent voice robot receives a user session, preprocessing the user session to obtain a plurality of word segments in the user session, and mapping and converting the word segments to obtain a feature word sequence of the user session;
constructing a feature vector of the user session and a scene feature vector of each conversation scene by using the word vector space model, based on the feature word sequence of the user session and the feature word sequence of each conversation scene;
and performing similarity calculation between the feature vector of the user session and the scene feature vector of each conversation scene, and identifying the intention of the user session based on the similarity calculation result, so as to return the speech script corresponding to the intention.
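The similarity-based intent recognition of claim 7 can be sketched as follows. The toy 2-dimensional vectors stand in for a trained BERT word vector space, and averaging token vectors is one simple way to build a session feature vector; the application does not specify the aggregation:

```python
import math

def sentence_vector(tokens, word_vectors):
    """Average the vectors of tokens found in the word vector space."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity, returning 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recognize_intent(user_tokens, scene_features, word_vectors):
    """Return the conversation scene whose scene feature vector is most
    similar to the user session's feature vector."""
    uv = sentence_vector(user_tokens, word_vectors)
    return max(
        scene_features,
        key=lambda s: cosine(uv, sentence_vector(scene_features[s], word_vectors)),
    )
```

The recognized scene then selects the finite state machine whose speech script is returned to the user.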
8. An implementation device of a configurable intelligent voice robot, the device comprising:
an acquisition module, configured to acquire a sample corpus of each conversation scene in a plurality of conversation scenes;
a generation module, configured to generate, for each conversation scene, scene features of the conversation scene based on the sample corpus of the conversation scene, wherein the scene features comprise feature words of the conversation scene and a feature word sequence obtained by mapping and converting the feature words;
a configuration module, configured to configure the intelligent voice robot based on a preset word vector space model and the scene features of each conversation scene, wherein the word vector space model is used to enable the intelligent voice robot to perform word vector similarity calculation between a user session and the scene features of each conversation scene so as to identify the intended scene of the user session.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110417581.9A 2021-04-19 2021-04-19 Implementation method and device for configurable intelligent voice robot and storage medium Pending CN114333813A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110417581.9A CN114333813A (en) 2021-04-19 2021-04-19 Implementation method and device for configurable intelligent voice robot and storage medium
CA3155717A CA3155717A1 (en) 2021-04-19 2022-04-19 Method of realizing configurable intelligent voice robot, device and storage medium


Publications (1)

Publication Number Publication Date
CN114333813A 2022-04-12

Family

ID=81044444




Also Published As

Publication number Publication date
CA3155717A1 (en) 2022-10-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination