WO2019153612A1 - Question and answer data processing method, electronic device and storage medium

Info

Publication number: WO2019153612A1
Application number: PCT/CN2018/090642
Authority: WO
Inventors: 于凤英; 王健宗; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-02-09
Filing date: 2018-06-11
Publication date: 2019-08-15
Also published as: CN108415980A

Abstract

Provided is a question and answer (Q&A) data processing method. The method comprises: acquiring agent text data from a database; dividing the agent text data into a plurality of groups of Q&A records; arranging question sentences and answer sentences in each group of Q&A records according to the time order of the records to obtain several candidate Q&A pairs; then counting a plurality of specified features of each candidate Q&A pair, and, for each candidate Q&A pair, using feature weights to carry out weighting calculation on the plurality of specified features so as to obtain the degree of match for each candidate Q&A pair; and then, on the basis of the degree of match, screening all the candidate Q&A pairs for candidate Q&A pairs having a degree of match higher than a second preset threshold, so as to use same as target Q&A pairs, and supplying same to a Q&A knowledge base. Utilizing the present application, suitable target Q&A pairs can be selected from the agent text data, the Q&A knowledge base can be maintained and updated, human resources saved, and efficiency improved.

Description

Question and answer data processing method, electronic device and storage medium

The present application claims priority to Chinese Patent Application No. 201101135748.0, entitled "Q&A Data Processing Method, Electronic Device and Storage Medium", filed on February 9, 2018, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of computer technologies, and in particular to a question and answer data processing method, an electronic device, and a storage medium.

Background technique

With the development of technology, artificial intelligence (AI) is gradually changing the way we live, and intelligent question answering is one example. When a customer makes an inquiry online by text or voice, an online intelligent customer service robot or an online intelligent customer service response system can answer the customer automatically. Intelligent Q&A can effectively shorten the time customers wait for customer service and improve service quality, so it has broad prospects.

At present, implementing domain-specific intelligent Q&A requires a large Q&A knowledge base. The Q&A knowledge base contains a large number of preset question and answer pairs. For example, for the question “How long is the warranty period?”, the corresponding answer is “The warranty period is one year”. The intelligent customer service replies on the basis of the Q&A knowledge base. However, to keep the replies accurate, the Q&A knowledge base must be supplemented in a timely manner, and maintaining the Q&A knowledge base currently consumes considerable human resources and is inefficient.

Summary of the invention

In view of the above, it is necessary to provide a question and answer data processing method, an electronic device and a storage medium that can automatically select suitable target question and answer pairs from agent text data and maintain and update the Q&A knowledge base, thereby saving manpower and improving efficiency.

To achieve the above object, the present application provides a question and answer data processing method, the method comprising: an obtaining step: obtaining agent text data from a database, the agent text data including question and answer data recorded for all customers and customer service agents during the question and answer process; a grouping step: dividing the agent text data into several groups of question and answer records, where the records in one group belong to the same customer and the corresponding customer service agent, and each group includes several question statements of the customer and several answer statements of the customer service agent; a sorting step: arranging the question statements and answer statements in each group of question and answer records in chronological order of recording, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the remaining question statements and answer statements in each group in a one-question-one-answer form, thereby obtaining several candidate question and answer pairs, each of which includes one question statement and one corresponding answer statement; an extraction step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, counting the word frequency with which each term appears in the agent text data, calculating a term weight for each term according to its word frequency, and taking the terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement; a calculation step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain a matching degree of the candidate question and answer pair; and a screening step: selecting, from all the candidate question and answer pairs, those whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to a Q&A knowledge base, the Q&A knowledge base being used to answer customers' questions intelligently.

To achieve the above object, the present application further provides an electronic device including a memory and a processor, wherein the memory includes a question and answer data processing program which, when executed by the processor, implements the following steps: an obtaining step: obtaining agent text data from a database, the agent text data including question and answer data recorded for all customers and customer service agents during the question and answer process; a grouping step: dividing the agent text data into several groups of question and answer records, where the records in one group belong to the same customer and the corresponding customer service agent, and each group includes several question statements of the customer and several answer statements of the customer service agent; a sorting step: arranging the question statements and answer statements in each group of question and answer records in chronological order of recording, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the remaining question statements and answer statements in each group in a one-question-one-answer form, thereby obtaining several candidate question and answer pairs, each of which includes one question statement and one corresponding answer statement; an extraction step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, counting the word frequency with which each term appears in the agent text data, calculating a term weight for each term according to its word frequency, and taking the terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement; a calculation step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain a matching degree of the candidate question and answer pair; and a screening step: selecting, from all the candidate question and answer pairs, those whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to a Q&A knowledge base, the Q&A knowledge base being used to answer customers' questions intelligently.

In addition, to achieve the above object, the present application further provides a computer readable storage medium that stores a question and answer data processing program which, when executed by a processor, implements any step of the question and answer data processing method described above.

With the question and answer data processing method, electronic device and computer readable storage medium provided by the present application, agent text data is first obtained from a database and divided into several groups of question and answer records; the question statements and answer statements in each group are then arranged in chronological order of recording to obtain several candidate question and answer pairs; a plurality of specified features of each candidate question and answer pair are then counted, a corresponding preset feature weight is assigned to each specified feature, and a weighted calculation is performed on the specified features using the feature weights to obtain the matching degree of each candidate question and answer pair; finally, the candidate question and answer pairs whose matching degree is higher than the second preset threshold are selected from all the candidate question and answer pairs as target question and answer pairs and added to the Q&A knowledge base. Suitable target question and answer pairs can thus be selected automatically from the agent text data to maintain and update the Q&A knowledge base, saving manpower and improving efficiency.

DRAWINGS

FIG. 1 is a schematic diagram of an operating environment of a preferred embodiment of an electronic device of the present application;

FIG. 2 is a schematic diagram of interaction between an electronic device and a client according to a preferred embodiment of the present application;

FIG. 3 is a flow chart of a preferred embodiment of a question and answer data processing method of the present application;

FIG. 4 is a flow chart of a preferred embodiment of the data cleaning method of FIG. 3;

FIG. 5 is a program block diagram of the question and answer data processing program of FIG. 1.

The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings.

Detailed description

The principles and spirit of the present application are described below with reference to a number of specific embodiments. It is understood that the specific embodiments described herein are merely illustrative of the application and are not intended to be limiting.

Those skilled in the art will appreciate that embodiments of the present application can be implemented as a method, apparatus, device, system, or computer program product. Accordingly, the application can be embodied entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or in a combination of hardware and software.

According to an embodiment of the present application, a question and answer data processing method, an electronic device, and a storage medium are proposed.

FIG. 1 is a schematic diagram of an operating environment of a preferred embodiment of an electronic device of the present application.

The electronic device 1 may be a terminal device having a storage and computing function such as a server, a portable computer, or a desktop computer.

The electronic device 1 includes a memory 11, a processor 12, a network interface 13, and a communication bus 14. The network interface 13 can optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The communication bus 14 is used to implement connection communication between the above components.

The memory 11 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card type memory, or the like. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the readable storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.

In the present embodiment, the readable storage medium of the memory 11 is generally used to store the question and answer data processing program 10, the database 2, the question and answer knowledge base 3, and the like installed in the electronic device 1. The memory 11 can also be used to temporarily store data that has been output or is about to be output.

The processor 12, in some embodiments, may be a Central Processing Unit (CPU), a microprocessor or another data processing chip for running program code or processing data stored in the memory 11, for example executing the question and answer data processing program 10.

FIG. 1 shows only the electronic device 1 having the components 11-14 and the question and answer data processing program 10, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

Optionally, the electronic device 1 may further include a user interface, and the user interface may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a loudspeaker or headphones. Optionally, the user interface may also include a standard wired interface and a wireless interface.

Optionally, the electronic device 1 may further include a display, which may also be referred to as a display screen or a display unit. In some embodiments, it may be an LED display, a liquid crystal display, a touch liquid crystal display, or an Organic Light-Emitting Diode (OLED) display. The display is used to display information processed in the electronic device 1 and to display a visual user interface.

Optionally, the electronic device 1 further comprises a touch sensor. The area provided by the touch sensor for the user to perform a touch operation is referred to as a touch area. Further, the touch sensor described herein may be a resistive touch sensor, a capacitive touch sensor, or the like. Moreover, the touch sensor includes not only a contact type touch sensor but also a proximity type touch sensor or the like. Furthermore, the touch sensor may be a single sensor or a plurality of sensors arranged, for example, in an array. The user can activate the question and answer data processing program 10 by touching the touch area.

In addition, the area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor. Optionally, the display is stacked with the touch sensor to form a touch display, and the electronic device 1 detects user-triggered touch operations by means of the touch display.

The electronic device 1 may further include a radio frequency (RF) circuit, a sensor, an audio circuit, and the like, and details are not described herein.

Referring to FIG. 2, it is a schematic diagram of the interaction between the electronic device 1 and a client according to a preferred embodiment of the present application. The question and answer data processing program 10 runs in the electronic device 1, and the electronic device 1 may be a server that runs an intelligent question answering program. The electronic device 1 can therefore communicate with the client over a network, receive a question raised by a customer through the client, look up the corresponding answer in the Q&A knowledge base 3, and return the answer to the customer through the client, thereby achieving an intelligent reply. In addition, the electronic device 1 can record the agent text data of the daily exchanges between customers and customer service agents and save the agent text data in the database 2. The agent text data includes the question and answer data recorded for all customers and customer service agents during the question and answer process, such as each question and its input time, each answer and its output time, the customer identity, the customer service agent identity, and the like. The question and answer data processing program 10 processes the agent text data in the database 2 to extract target question and answer pairs that meet the requirements of the Q&A knowledge base 3, so as to update and maintain the Q&A knowledge base 3.

Referring to FIG. 3, it is a flowchart of a preferred embodiment of the question and answer data processing method of the present application. When the processor 12 of the electronic device 1 executes the question and answer data processing program 10 stored in the memory 11, the following steps of the question and answer data processing method are implemented:

Step S1, obtaining agent text data from the database 2, the agent text data including all the question and answer data recorded for customers and customer service agents during the question and answer process, such as each question and its input time, each answer and its output time, the customer identity, the customer service agent identity, and the like. The agent text data records the original question and answer data; that is, the question and answer data in the agent text data is recorded as it is, without any processing.

In the question and answer process between a customer and a customer service agent, besides the questions and answers specific to the field, there are usually also purely conversational statements that are of no value to the Q&A knowledge base 3. Examples include statements that consist only of pleasantries, such as "Hello", "Goodbye" and "I am very happy to serve you", statements too short to express any practical meaning, such as "why", and statements that contain only numbers and/or symbols. To remove such meaningless data, a data cleaning operation may be performed on the agent text data after step S1 obtains it from the database 2. As shown in FIG. 4, the data cleaning operation may include, for example, the following steps:

Step S11, counting the frequency of occurrence of each statement in the agent text data, and obtaining a statement whose appearance frequency is higher than the third preset threshold as the high frequency statement;

Step S12, performing word segmentation on each high-frequency statement, counting the frequency with which each term obtained by the word segmentation appears in all the high-frequency statements, and taking the terms whose frequency is higher than a fourth preset threshold as greeting words;

Step S13, calculating, for each statement in the agent text data, the proportion of greeting words among all the terms of the statement, and taking the statements whose proportion is higher than a fifth preset threshold as greeting statements;

Step S14, determining a short text sentence in the agent text data, the short text sentence including a statement whose number of characters is less than a sixth preset threshold, and a statement containing only numbers and/or symbols;

Step S15, filtering out the greeting statement and the short text sentence from the agent text data, thereby completing the data cleaning operation.
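The cleaning steps S11 to S15 can be sketched as follows. This is a minimal illustration, not the implementation of the present application: the function name `clean_statements`, the parameter names, the default threshold values and the whitespace tokenizer are all assumptions made for the example, and Chinese text would need a real word segmenter in place of `str.split`. Term frequencies in S12 are counted over the distinct high-frequency statements, which is one possible reading of the step.

```python
import re
from collections import Counter


def clean_statements(statements, tokenize=str.split,
                     stmt_freq_min=3, term_freq_min=3,
                     greeting_ratio_min=0.5, min_chars=4):
    """Return the statements that survive cleaning steps S11-S15.

    All threshold defaults are hypothetical placeholders for the third
    to sixth preset thresholds named in the text.
    """
    # S11: statements whose frequency exceeds the third threshold.
    stmt_counts = Counter(statements)
    high_freq = [s for s, c in stmt_counts.items() if c > stmt_freq_min]

    # S12: terms that are frequent across the high-frequency statements
    # (fourth threshold) are treated as greeting words.
    term_counts = Counter(t for s in high_freq for t in tokenize(s))
    greeting_words = {t for t, c in term_counts.items() if c > term_freq_min}

    def is_greeting(statement):
        # S13: share of greeting words among the statement's terms
        # compared with the fifth threshold.
        terms = tokenize(statement)
        if not terms:
            return False
        ratio = sum(t in greeting_words for t in terms) / len(terms)
        return ratio > greeting_ratio_min

    def is_short(statement):
        # S14: too few characters (sixth threshold), or nothing but
        # digits, symbols and whitespace.
        only_symbols = re.fullmatch(r"[\W\d_]+", statement) is not None
        return len(statement) < min_chars or only_symbols

    # S15: drop greeting statements and short text statements.
    return [s for s in statements if not is_greeting(s) and not is_short(s)]
```

In practice the four thresholds would be tuned on real agent text data; the values above only make the sketch runnable.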

Step S2, dividing the agent text data into several groups of question and answer records, where the records in one group belong to the same customer and the corresponding customer service agent, and each group includes several question statements of the customer and several answer statements of the customer service agent.

Specifically, step S2 may map the question and answer data of different customers to the corresponding customer IDs and divide the agent text data into several groups of question and answer records according to this mapping, so that the records in one group belong to the same customer and the corresponding customer service agent. Depending on how the exchange between the customer and the customer service agent went, each group may include several question statements of the customer and several answer statements of the customer service agent. Within a group, however, the customer's question statements and the customer service agent's answer statements may be in no particular order: for example, they may be arranged by assigned ID number, or all the customer's question statements may be placed together and all the customer service agent's answer statements placed together. To conveniently match each question statement of the customer with the corresponding answer statement of the customer service agent, the following step S3 is performed.
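The grouping in step S2 can be illustrated with a short sketch. The `Utterance` record, its field names and the choice of a (customer ID, agent ID) pair as the grouping key are assumptions for this example; the text only states that records are mapped to customer IDs and that each group belongs to one customer and the corresponding customer service agent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Utterance:
    """One recorded statement; the field names are hypothetical."""
    customer_id: str
    agent_id: str
    role: str         # "customer" or "agent"
    text: str
    timestamp: float


def group_by_customer(utterances):
    """Map each (customer, agent) key to its list of utterances.

    Order inside a group is whatever order the records arrive in,
    which is why a separate sorting step is still needed afterwards.
    """
    groups = defaultdict(list)
    for u in utterances:
        groups[(u.customer_id, u.agent_id)].append(u)
    return dict(groups)
```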

Step S3, arranging the question statements and answer statements in each group of question and answer records in chronological order of recording, removing the question statements that have no corresponding answer statement and the answer statements that have no corresponding question statement, and arranging the remaining question statements and answer statements in each group in a one-question-one-answer form, thereby obtaining several candidate question and answer pairs, each of which includes one question statement and one corresponding answer statement.

Since the question and answer process between the customer and the customer service agent proceeds in chronological order, step S3 arranges the question statements and answer statements in each group in the order in which they were recorded, so that each question statement of the customer and its corresponding answer statement can be extracted in sequence. During the exchange, a customer may ask a question that the customer service agent does not answer, or a customer may send a meaningless symbol such as an emoticon or a question mark that prompts the customer service agent to ask a question on its own initiative, so some question statements have no corresponding answer statement and some answer statements have no corresponding question statement. After arranging the statements in chronological order, step S3 may therefore first remove the question statements without a corresponding answer statement and the answer statements without a corresponding question statement. Step S3 can then arrange the remaining question statements and answer statements in each group in a one-question-one-answer form, thereby obtaining several candidate question and answer pairs, each of which includes one question statement and one corresponding answer statement.
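One way to realize step S3 is sketched below. The record layout `(timestamp, role, text)` and the function name `build_candidate_pairs` are assumptions for this example. The pairing rule is also an assumption: the text does not say how to decide which answer corresponds to which question, so the sketch simply pairs each answer with the most recent unanswered customer question and discards everything else.

```python
def build_candidate_pairs(records):
    """Build candidate Q&A pairs from one group of records.

    records: iterable of (timestamp, role, text) tuples, where role is
    "customer" or "agent" (a hypothetical layout for this sketch).
    """
    pairs = []
    pending_question = None
    # Arrange the statements in chronological order of recording.
    for _, role, text in sorted(records, key=lambda r: r[0]):
        if role == "customer":
            # A newer question replaces an older one that never got an
            # answer, which removes the unanswered question.
            pending_question = text
        elif pending_question is not None:
            # An agent statement that follows a pending question is
            # taken as its answer.
            pairs.append((pending_question, text))
            pending_question = None
        # An agent statement with no pending question is dropped.
    return pairs
```

A trailing customer question that is never answered is likewise left out of the result, because it is still pending when the loop ends.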

Step S4, performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, counting the word frequency with which each term appears in the agent text data, calculating a term weight for each term according to the counted word frequency, and taking the terms whose term weight is greater than the first preset threshold as keywords of the corresponding question statement or answer statement.

Specifically, step S4 calculates the weight of the term according to the word frequency of each term that is counted, and the following formula can be used:

w=log(n/df);

where n is the word frequency of the term in the agent text data, and df is the number of documents in a preset large-scale corpus that contain the term. The preset large-scale corpus may be a general-purpose document corpus that includes a large number of documents of many types and that can support sample training and deep learning on the wording, semantics, and the like of the sentences in those documents.
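The weight formula w=log(n/df) and the keyword selection of step S4 can be sketched as follows. The function names, the dictionary `doc_freq` holding document frequencies from the large-scale corpus, and the handling of terms the corpus has never seen (treated as df = 1) are assumptions for this example; the text does not specify the logarithm base, so the natural logarithm is used here.

```python
import math
from collections import Counter


def term_weight(n, df):
    """w = log(n / df): n is the term's frequency in the agent text
    data, df the number of corpus documents containing the term."""
    return math.log(n / df)


def extract_keywords(sentence_terms, agent_terms, doc_freq, first_threshold):
    """Return the terms of one statement whose weight exceeds the
    first preset threshold, in order of first appearance."""
    counts = Counter(agent_terms)            # word frequency n per term
    keywords = []
    for term in dict.fromkeys(sentence_terms):   # de-duplicate, keep order
        n = counts[term]
        # Assumption: a term absent from the corpus is treated as df = 1
        # so that the ratio n / df stays defined.
        df = max(doc_freq.get(term, 0), 1)
        if n > 0 and term_weight(n, df) > first_threshold:
            keywords.append(term)
    return keywords
```

A term that is frequent in the agent text but rare in the general corpus gets a high weight, which is what makes it a plausible domain keyword.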

Step S5, counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of the candidate question and answer pair.

The plurality of designated features may include, for example:

a1: the number of common keywords of the question statement and the answer statement;

a2: the length of the common keywords of the question statement and the answer statement;

a3: the number of common terms of the question statement and the answer statement;

a4: the length of the common terms of the question statement and the answer statement;

a5: the length of the question statement;

a6: the length of the answer statement;

a7: the topic similarity of the question statement and the answer statement;

a8: the syntactic similarity of the question statement and the answer statement.

The reasons for adopting the above specified features mainly include the following. If the answer statement in a candidate question and answer pair is an answer to the question statement, the answer statement usually contains keywords of the question statement. The more common keywords and common terms the question statement and the answer statement share, and the longer these are, the greater the probability that the answer statement is an answer to the question statement. Since the desired target question and answer pairs should contain complete questions and answers, a longer question statement can better describe a complete question, and a longer answer statement can better describe a complete answer. If the answer statement is an answer to the question statement, the topics of the two are usually consistent, and there is usually a certain syntactic relationship between them.
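The six count and length features a1 to a6 can be computed with a few set operations, as in the sketch below. The function name `pair_features` is hypothetical, and "length" is interpreted here as a total character count over the terms concerned, which is an assumption; the text does not define the unit. The similarity features a7 and a8 are left out because the text does not say how they are computed.

```python
def pair_features(q_terms, a_terms, q_keywords, a_keywords):
    """Compute features a1-a6 for one candidate Q&A pair.

    q_terms / a_terms: segmented terms of the question and answer.
    q_keywords / a_keywords: keywords selected for each statement.
    Lengths are character counts (an assumption of this sketch).
    """
    common_keywords = set(q_keywords) & set(a_keywords)
    common_terms = set(q_terms) & set(a_terms)
    return {
        "a1": len(common_keywords),                    # number of common keywords
        "a2": sum(len(k) for k in common_keywords),    # their total length
        "a3": len(common_terms),                       # number of common terms
        "a4": sum(len(t) for t in common_terms),       # their total length
        "a5": sum(len(t) for t in q_terms),            # question length
        "a6": sum(len(t) for t in a_terms),            # answer length
    }
```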

The weighted calculation on the plurality of specified features using the feature weights may be performed, for example, by a weighted regression fit with a multiple logistic regression model, which yields the matching degree g(z) of each candidate question and answer pair according to the following formula:

g(z)=1/(1+e^(-z)), where e is the natural constant;

where z=a1*x1+a2*x2+a3*x3+a4*x4+a5*x5+a6*x6+a7*x7+a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8, respectively.
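A minimal sketch of the matching degree calculation follows. It assumes the standard logistic (sigmoid) form g(z)=1/(1+e^(-z)), so that a larger weighted sum z gives a matching degree closer to 1; the function name `matching_degree` is hypothetical, and the feature weights are taken as given rather than fitted.

```python
import math


def matching_degree(features, weights):
    """Return g(z) = 1 / (1 + e^(-z)) with z = sum(a_i * x_i).

    features: the values a1..a8 of one candidate pair.
    weights:  the preset feature weights x1..x8.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(a * x for a, x in zip(features, weights))
    # Evaluate the sigmoid in a form that cannot overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

Because raw lengths such as a5 and a6 can be much larger than the counts a1 and a3, the features would normally be scaled before weighting so that no single feature saturates the sigmoid; this is a design consideration rather than something the text specifies.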

Step S6, selecting, from all the candidate question and answer pairs, those whose matching degree is higher than the second preset threshold as target question and answer pairs, and adding the target question and answer pairs to the Q&A knowledge base 3, where the Q&A knowledge base 3 is used to answer customers' questions intelligently. One or more candidate question and answer pairs may have a matching degree higher than the second preset threshold.

In addition, to make the target question and answer pairs more concise, before adding them to the Q&A knowledge base 3, step S6 may merge the answer statements that correspond to the same question statement across the target question and answer pairs, so that the question statements of the target question and answer pairs are all different from one another and each target question and answer pair includes one or more answer statements.
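The screening and merging in step S6 can be sketched together. The function name `select_target_pairs` and the input layout of `(question, answer, matching_degree)` tuples are assumptions for this example; merging relies on exact string equality of question statements, which is the simplest reading of "the same question statement".

```python
def select_target_pairs(scored_pairs, second_threshold):
    """Keep pairs above the threshold and merge answers per question.

    scored_pairs: iterable of (question, answer, matching_degree).
    Returns a dict mapping each distinct question statement to the
    list of its answer statements, without duplicates.
    """
    merged = {}
    for question, answer, degree in scored_pairs:
        if degree > second_threshold:
            answers = merged.setdefault(question, [])
            if answer not in answers:
                answers.append(answer)
    return merged
```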

A question statement should be an interrogative sentence about a specific field, and an answer statement should be a declarative sentence. To further improve the accuracy of the target question and answer pairs added to the Q&A knowledge base 3, before adding them, step S6 may also use a naive Bayes classifier to determine, for each target question and answer pair, whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence. Step S6 then filters out the target question and answer pairs whose question statement is not an interrogative sentence and those whose answer statement is not a declarative sentence, leaving the target question and answer pairs to be added to the Q&A knowledge base 3.
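The sentence-type check can be illustrated with a small hand-written classifier. The text only names a naive Bayes classifier; the multinomial model with add-one smoothing, the class and function names, the label strings "interrogative" and "declarative", and the whitespace tokenizer below are all assumptions for this sketch, and a real system would be trained on a labelled corpus of considerable size.

```python
import math
from collections import Counter


class NaiveBayesSentenceClassifier:
    """Multinomial naive Bayes over terms, with add-one smoothing."""

    def fit(self, sentences, labels, tokenize=str.split):
        self.tokenize = tokenize
        self.label_counts = Counter(labels)
        self.term_counts = {}
        for sentence, label in zip(sentences, labels):
            counts = self.term_counts.setdefault(label, Counter())
            counts.update(tokenize(sentence))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.term_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.label_counts[label] / total)
            for term in self.tokenize(sentence):
                score += math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def filter_pairs(pairs, classifier):
    """Keep pairs with an interrogative question and a declarative answer."""
    return [
        (q, a) for q, a in pairs
        if classifier.predict(q) == "interrogative"
        and classifier.predict(a) == "declarative"
    ]
```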

According to the question and answer data processing method provided by the embodiment, the agent text data is first divided into several groups of question and answer records by first obtaining the agent text data from the database, and the same group of question and answer records belong to the same customer and corresponding customer service, and secondly according to the recorded time. Arranging the question statement and the answer statement in each group of question and answer records in sequence, obtaining a plurality of candidate question and answer pairs, and then performing word segmentation on the question and answer sentences in each candidate question and answer pair to obtain individual terms, and the statistical terms are as described above. The word frequency appearing in the text data of the agent, the word weight is calculated for the corresponding term according to the word frequency of each term, and the term with the term weight greater than the first preset threshold is used as the keyword of the corresponding question statement or the answer sentence. Counting a plurality of specified features of each candidate question and answer pair, respectively assigning respective predetermined feature weights to each of the specified features, and weighting the plurality of specified features by using the feature weights for each candidate question and answer pair , get the matching degree of each candidate question and answer pair, and then from all the candidate question and answer pairs according to the matching degree The candidate question and answer pairs whose matching degree is higher than the second preset threshold are filtered out as the target question and answer pair is added to the question and answer knowledge base 3, and the appropriate target question and answer pair can be automatically selected from the agent text data, and the question and answer knowledge base 3 is maintained and Updates to save manpower and increase efficiency.

Referring to FIG. 5, it is a program module diagram of the question and answer data processing program 10 in FIG. In the present embodiment, the question and answer data processing program 10 is divided into a plurality of modules, which are stored in the memory 11 and executed by the processor 12 to complete the present application. A module as referred to in this application refers to a series of computer program instructions that are capable of performing a particular function.

The question and answer data processing program 10 can be divided into: an obtaining module 110, a grouping module 120, a sorting module 130, an extracting module 140, a calculating module 150, and a screening module 160.

The obtaining module 110 is configured to obtain agent text data from a database, where the agent text data includes question and answer data recorded by all customers and customer service during the question and answer process.

After the obtaining module 110 obtains the agent text data from the database, the data cleaning operation may be performed on the agent text data, where the data cleaning operation includes:

Counting the frequency of occurrence of each statement in the agent text data, and taking the statements whose frequency of occurrence is higher than a third preset threshold as high-frequency sentences;

Performing word segmentation on each of the high-frequency sentences, counting the frequency with which each resulting term occurs across all high-frequency sentences, and taking the terms whose frequency of occurrence is higher than a fourth preset threshold as greeting words;

Calculating, for each statement in the agent text data, the proportion of greeting words among all terms of that statement, and taking the statements whose proportion is higher than a fifth preset threshold as greeting statements;

Determining the short text statements in the agent text data, the short text statements including statements whose number of characters is smaller than a sixth preset threshold and statements containing only numbers and/or symbols;

Removing the greeting statements and the short text statements from the agent text data, thereby completing the data cleaning operation.
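The data cleaning operation above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `clean_statements`, the `tokenize` parameter (any word segmenter supplied by the caller) and the threshold arguments `t3` to `t6` are hypothetical names standing in for the third to sixth preset thresholds, whose values the text does not give. The sketch also assumes that term frequencies are counted once per distinct high-frequency sentence, and that the cleaning step drops greeting statements and short text statements from the data.

```python
import re
from collections import Counter


def clean_statements(statements, tokenize, t3, t4, t5, t6):
    """Return the statements left after dropping greeting and short-text statements.

    t3..t6 are placeholders for the third to sixth preset thresholds.
    """
    # Statements occurring more often than t3 are high-frequency sentences.
    sentence_freq = Counter(statements)
    high_freq = [s for s, count in sentence_freq.items() if count > t3]

    # Terms occurring more often than t4 across the high-frequency
    # sentences are treated as greeting words (assumption: each distinct
    # high-frequency sentence is counted once).
    term_freq = Counter(t for s in high_freq for t in tokenize(s))
    greeting_words = {t for t, count in term_freq.items() if count > t4}

    def is_greeting(statement):
        terms = tokenize(statement)
        if not terms:
            return False
        ratio = sum(t in greeting_words for t in terms) / len(terms)
        return ratio > t5

    def is_short_text(statement):
        # Fewer than t6 characters, or only digits and/or symbols.
        only_symbols = re.fullmatch(r"[\W\d_]+", statement) is not None
        return len(statement) < t6 or only_symbols

    return [s for s in statements
            if not is_greeting(s) and not is_short_text(s)]
```

For whitespace-delimited text, `str.split` can serve as `tokenize`; for Chinese agent text a dedicated word segmenter would be needed instead.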

The grouping module 120 is configured to divide the agent text data into a plurality of groups of question and answer records, where the records in the same group belong to the same customer and the corresponding customer service, and each group of question and answer records includes a plurality of question statements of the customer and a plurality of answer statements of the customer service.

The sorting module 130 is configured to arrange the question statements and the answer statements in each group of question and answer records in the chronological order of the records, remove the question statements without a corresponding answer statement and the answer statements without a corresponding question statement, and arrange the question statements and the answer statements in each group of question and answer records in a question-and-answer form, thereby obtaining a plurality of candidate question and answer pairs, each candidate question and answer pair including a question statement and a corresponding answer statement.
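The grouping and sorting described above can be illustrated with a short Python sketch. The record layout `(customer_id, agent_id, timestamp, role, text)` and the function name `build_candidate_pairs` are hypothetical. The text does not state how a question is matched to its answer, so the sketch assumes the simplest rule: a question is paired with the next answer in time, an unanswered question is dropped when a newer question arrives, and an answer with no pending question is dropped.

```python
from collections import defaultdict


def build_candidate_pairs(records):
    """records: iterable of (customer_id, agent_id, timestamp, role, text),
    where role is "Q" for a customer question and "A" for an agent answer."""
    # Grouping: one group per customer and corresponding customer service.
    groups = defaultdict(list)
    for customer_id, agent_id, timestamp, role, text in records:
        groups[(customer_id, agent_id)].append((timestamp, role, text))

    pairs = []
    for key in groups:
        # Sorting: chronological order of the records within the group.
        turns = sorted(groups[key], key=lambda turn: turn[0])
        pending_question = None
        for _, role, text in turns:
            if role == "Q":
                # A newer question replaces an unanswered older one.
                pending_question = text
            elif pending_question is not None:
                pairs.append((pending_question, text))
                pending_question = None
            # An answer with no pending question is dropped.
    return pairs
```

Other pairing rules (for example, concatenating consecutive questions) would fit the same structure; the rule used here is an assumption, not something the text prescribes.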

The extracting module 140 is configured to perform word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, count the word frequency with which each term appears in the agent text data, calculate a term weight for each term according to its counted word frequency, and take the terms whose term weight is greater than the first preset threshold as keywords of the corresponding question statement or answer statement.

Specifically, the extracting module 140 may calculate the term weight w for each term according to its counted word frequency by using the following formula:

w=log(n/df);

Where n is the word frequency of the corresponding entry in the agent text data, and df is the number of documents containing the corresponding entry in the preset large-scale corpus.
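As a rough illustration of the formula w=log(n/df), the following Python sketch computes term weights and selects keywords. The function names, the `corpus_df` dictionary (document counts taken from some large-scale corpus) and the threshold argument are hypothetical. The text does not say how a term absent from the corpus is handled, so the sketch assumes df is floored at 1 to avoid division by zero.

```python
import math
from collections import Counter


def term_weights(agent_terms, corpus_df):
    """agent_terms: all terms segmented from the agent text data.
    corpus_df: dict mapping a term to the number of corpus documents
    that contain it."""
    freq = Counter(agent_terms)  # n for each term
    weights = {}
    for term, n in freq.items():
        # Assumption: floor df at 1 so unseen terms do not divide by zero.
        df = max(corpus_df.get(term, 0), 1)
        weights[term] = math.log(n / df)
    return weights


def keywords(statement_terms, weights, first_threshold):
    """Keep the terms of one statement whose weight exceeds the threshold."""
    return [t for t in statement_terms
            if weights.get(t, float("-inf")) > first_threshold]
```

With this weighting, a term that is frequent in the agent text but rare in the general corpus receives a high weight, while a common function word receives a low or negative one.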

The calculating module 150 is configured to count a plurality of specified features of each candidate question and answer pair, assign a corresponding preset feature weight to each of the specified features, and, for each candidate question and answer pair, perform a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair.

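The plurality of specified features may include, for example:

The number a1 of common keywords of the question statement and the answer statement;

The length a2 of the common keywords of the question statement and the answer statement;

The number a3 of common terms of the question statement and the answer statement;

The length a4 of the common terms of the question statement and the answer statement;

The length a5 of the question statement;

The length a6 of the answer statement;

The topic similarity a7 of the question statement and the answer statement;

The syntactic similarity a8 of the question statement and the answer statement.

The weighted calculation on the plurality of specified features using the feature weights may be performed, for example, by the calculating module 150 using a multiple logistic regression model to carry out a weighted regression fitting calculation on the plurality of specified features, thereby obtaining the matching degree g(z) of each candidate question and answer pair according to the following formula:

g(z)=1/(1+e^(-z)), where e is the natural constant;

Where z=a1*x1+a2*x2+a3*x3+a4*x4+a5*x5+a6*x6+a7*x7+a8*x8, and x1, x2...x8 are the feature weights of a1, a2...a8, respectively.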
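The matching degree calculation can be sketched in Python as below. The sketch assumes that g is the standard logistic function 1/(1+e^(-z)) and that z is the sum of each feature multiplied by its feature weight, as the claims define it. The function names are hypothetical, lengths are measured in characters (an assumption), and the topic similarity a7 and syntactic similarity a8 are taken as values supplied by the caller, since the text does not say how they are computed.

```python
import math


def shared_features(q_terms, a_terms, q_keywords, a_keywords):
    """Compute a1..a6 from the segmented terms and extracted keywords."""
    common_kw = set(q_keywords) & set(a_keywords)
    common_terms = set(q_terms) & set(a_terms)
    return [
        len(common_kw),                      # a1: number of common keywords
        sum(len(k) for k in common_kw),      # a2: length of common keywords
        len(common_terms),                   # a3: number of common terms
        sum(len(t) for t in common_terms),   # a4: length of common terms
        sum(len(t) for t in q_terms),        # a5: length of the question
        sum(len(t) for t in a_terms),        # a6: length of the answer
    ]


def matching_degree(features, feature_weights):
    """Logistic function of the weighted feature sum z."""
    z = sum(a * x for a, x in zip(features, feature_weights))
    # Numerically stable evaluation of 1 / (1 + e^(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A caller would append a7 and a8 to the list returned by `shared_features`, pass the eight values with eight preset feature weights to `matching_degree`, and compare the result with the second preset threshold.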

The screening module 160 is configured to select, from all candidate question and answer pairs, the candidate question and answer pairs whose matching degree is higher than a second preset threshold as target question and answer pairs, and to add the target question and answer pairs to the question and answer knowledge base, the question and answer knowledge base being used to respond intelligently to customers based on the question and answer pairs it contains.

In order to make the target question and answer pairs more concise, the screening module 160 may further merge, before adding the target question and answer pairs to the question and answer knowledge base 3, the answer statements that correspond to the same question statement in the respective target question and answer pairs, so that the question statements contained in the target question and answer pairs differ from each other and each target question and answer pair includes one or more answer statements.
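The merging step can be illustrated with a few lines of Python. The function name `merge_answers` is hypothetical, and the sketch assumes that "the same question statement" means an exact string match and that exact duplicate answers are kept only once.

```python
def merge_answers(target_pairs):
    """Merge (question, answer) pairs so that each question appears once
    and maps to the list of its answers, in first-seen order."""
    merged = {}
    for question, answer in target_pairs:
        answers = merged.setdefault(question, [])
        if answer not in answers:  # assumption: skip exact duplicates
            answers.append(answer)
    return merged
```

The resulting mapping has pairwise distinct questions, each with one or more answers, which matches the form described above.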

In addition, in order to further improve the accuracy of the target question and answer pairs added to the question and answer knowledge base 3, the screening module 160 may further, before adding the target question and answer pairs to the question and answer knowledge base 3, apply a naive Bayesian classifier to the question statement and the answer statement in each target question and answer pair to determine whether the corresponding question statement is a query sentence and whether the corresponding answer statement is a declarative sentence. The screening module 160 then filters out the target question and answer pairs whose question statement is not a query sentence and the target question and answer pairs whose answer statement is not a declarative sentence, thereby obtaining the target question and answer pairs to be added to the question and answer knowledge base 3.
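The sentence-type filtering can be sketched with a small multinomial naive Bayes classifier written in plain Python. This is an illustrative sketch only: the class and function names, the label strings "query" and "declarative", the use of Laplace smoothing, and the choice to ignore tokens never seen in training are all assumptions, and the text does not describe the training data or the features the classifier uses.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, samples):
        """samples: list of (tokens, label) training examples."""
        self.label_counts = Counter(label for _, label in samples)
        self.token_counts = defaultdict(Counter)
        for tokens, label in samples:
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for t in tokens:
                if t in self.vocab:  # ignore tokens unseen in training
                    score += math.log((self.token_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def filter_pairs(pairs, classifier, tokenize):
    """Keep pairs whose question is a query sentence and whose answer
    is a declarative sentence, according to the classifier."""
    return [(q, a) for q, a in pairs
            if classifier.predict(tokenize(q)) == "query"
            and classifier.predict(tokenize(a)) == "declarative"]
```

In practice the classifier would be trained on a labelled set of query and declarative sentences from the relevant field, and `tokenize` would be the same word segmenter used in the extracting step.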

In the operating environment diagram of the preferred embodiment of the electronic device 1 shown in FIG. 1, the memory 11 including the readable storage medium may include a question and answer data processing program 10, a database 2, and a question and answer knowledge base 3. When the processor 12 executes the question and answer data processing program 10 stored in the memory 11, the following steps are implemented:

Obtaining step: obtaining agent text data from a database, the agent text data including question and answer data recorded by all customers and customer service during the question and answer process;

Grouping step: dividing the agent text data into a plurality of groups of question and answer records, the question and answer records in the same group belonging to the same customer and the corresponding customer service, each group of question and answer records including a plurality of question statements of the customer and a plurality of answer statements of the customer service;

Sorting step: arranging the question statements and the answer statements in each group of question and answer records in the chronological order of the records, removing the question statements without a corresponding answer statement and the answer statements without a corresponding question statement, and arranging the question statements and the answer statements in each group of question and answer records in a question-and-answer format to obtain a plurality of candidate question and answer pairs, each candidate question and answer pair including a question statement and a corresponding answer statement;

Extracting step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, counting the word frequency with which each term appears in the agent text data, calculating a term weight for each term according to its counted word frequency, and taking the terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement;

Calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each of the specified features, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain a matching degree of each candidate question and answer pair;

Screening step: selecting, from all the candidate question and answer pairs, the candidate question and answer pairs whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to the question and answer knowledge base, wherein the question and answer knowledge base is used to respond intelligently to customers based on the question and answer pairs in the question and answer knowledge base.

After the obtaining the agent text data from the database, the method further includes performing a data cleaning operation on the agent text data, where the data cleaning operation includes:

Calculating the term weight for each term according to its counted word frequency includes:

The term weight w is calculated using the following formula:

w=log(n/df);

The plurality of specified features includes:

The length a5 of the question statement;

The length a6 of the answer statement;

The topic similarity a7 of the question statement and the answer statement;

Syntactic similarity a8 of the question statement and the answer statement;

Performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair includes:

Using a multivariate logistic regression model to perform a weighted regression fitting calculation on the plurality of specified features to obtain the matching degree g(z) of each candidate question and answer pair, according to the following formula:

g(z)=1/(1+e^(-z)), where e is the natural constant;

Before adding the target question and answer pair to the Q&A knowledge base, the method further includes:

Merging, in the respective target question and answer pairs, the answer statements that correspond to the same question statement, so that the question statements included in the target question and answer pairs differ from each other and each target question and answer pair includes one or more answer statements.

Applying a naive Bayesian classifier to the question statement and the answer statement in each target question and answer pair to determine whether the corresponding question statement is a query sentence and whether the corresponding answer statement is a declarative sentence;

Filtering out the target question and answer pairs whose question statement is not a query sentence and the target question and answer pairs whose answer statement is not a declarative sentence, to obtain the target question and answer pairs to be added to the question and answer knowledge base.

For the specific principle, please refer to the program module diagram of the question and answer data processing program 10 in FIG. 5 and the flowchart of the preferred embodiment of the question and answer data processing method in FIG.

In addition, the embodiment of the present application further provides a computer readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer readable storage medium includes a database 2, a question and answer knowledge base 3, a question and answer data processing program 10, and the like. When the question and answer data processing program 10 is executed by the processor 12, the following operations are implemented:

The term weight w is calculated using the following formula:

w=log(n/df);

The plurality of specified features includes:

The length a5 of the question statement;

The length a6 of the answer statement;

The topic similarity a7 of the question statement and the answer statement;

Syntactic similarity a8 of the question statement and the answer statement;

g(z)=1/(1+e^(-z)), where e is the natural constant;

In the respective target question and answer pairs, the answer statements that correspond to the same question statement are merged, so that the question statements included in the target question and answer pairs differ from each other and each target question and answer pair includes one or more answer statements.

The specific implementation of the computer readable storage medium of the present application is substantially the same as the above-described question and answer data processing method and the specific embodiment of the electronic device 1, and details are not described herein again.

It is to be understood that the terms "comprises", "comprising", or any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article, or method. An element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that comprises the element.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on such understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art may be embodied in the form of a software product stored in a storage medium as described above and including a number of instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present application.

The above is only a preferred embodiment of the present application and is not intended to limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims

A question and answer data processing method, characterized in that the method comprises:

Obtaining step: obtaining agent text data from a database, the agent text data including question and answer data recorded by all customers and customer service during the question and answer process;

Grouping step: dividing the agent text data into a plurality of groups of question and answer records, the question and answer records in the same group belonging to the same customer and the corresponding customer service, each group of question and answer records including a plurality of question statements of the customer and a plurality of answer statements of the customer service;

Sorting step: arranging the question statements and the answer statements in each group of question and answer records in the chronological order of the records, removing the question statements without a corresponding answer statement and the answer statements without a corresponding question statement, and arranging the question statements and the answer statements in each group of question and answer records in a question-and-answer format to obtain a plurality of candidate question and answer pairs, each candidate question and answer pair including a question statement and a corresponding answer statement;

Extracting step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, counting the word frequency with which each term appears in the agent text data, calculating a term weight for each term according to its counted word frequency, and taking the terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement;

Calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each of the specified features, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain a matching degree of each candidate question and answer pair;

Screening step: selecting, from all the candidate question and answer pairs, the candidate question and answer pairs whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to a question and answer knowledge base, wherein the question and answer knowledge base is used to respond intelligently to customers based on the question and answer pairs in the question and answer knowledge base.
The question-and-answer data processing method according to claim 1, wherein after the obtaining the agent text data from the database, the method further comprises: performing a data cleaning operation on the agent text data, the data cleaning operation comprising:

Counting the frequency of occurrence of each statement in the agent text data, and taking the statements whose frequency of occurrence is higher than a third preset threshold as high-frequency sentences;

Performing word segmentation on each of the high-frequency sentences, counting the frequency with which each resulting term occurs across all high-frequency sentences, and taking the terms whose frequency of occurrence is higher than a fourth preset threshold as greeting words;

Calculating, for each statement in the agent text data, the proportion of greeting words among all terms of that statement, and taking the statements whose proportion is higher than a fifth preset threshold as greeting statements;

Determining the short text statements in the agent text data, the short text statements including statements whose number of characters is smaller than a sixth preset threshold and statements containing only numbers and/or symbols;

Removing the greeting statements and the short text statements from the agent text data, thereby completing the data cleaning operation.
The question and answer data processing method according to claim 1, wherein calculating the term weight for each term according to its counted word frequency comprises:

The term weight w is calculated using the following formula:

w=log(n/df);

Where n is the word frequency of the corresponding entry in the agent text data, and df is the number of documents containing the corresponding entry in the preset large-scale corpus.
The question and answer data processing method according to claim 1, wherein the plurality of specified features comprise:

The number of common keywords a1 of the question statement and the answer statement;

The common keyword length a2 of the question statement and the answer statement;

The number of common terms a3 of the question statement and the answer statement;

The length a4 of the common terms of the question statement and the answer statement;

The length a5 of the question statement;

The length a6 of the answer statement;

The topic similarity a7 of the question statement and the answer statement;

Syntactic similarity a8 of the question statement and the answer statement;

And performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair comprises:

Using a multivariate logistic regression model to perform a weighted regression fitting calculation on the plurality of specified features to obtain the matching degree g(z) of each candidate question and answer pair, according to the following formula:

g(z)=1/(1+e^(-z)), where e is the natural constant;

Wherein z=a1*x1+a2*x2+a3*x3+a4*x4+a5*x5+a6*x6+a7*x7+a8*x8, and x1, x2...x8 are the feature weights of a1, a2...a8, respectively.
The question and answer data processing method according to claim 1, wherein before the adding the target question and answer pair to the question and answer knowledge base, the method further comprises:

Merging, in the respective target question and answer pairs, the answer statements that correspond to the same question statement, so that the question statements included in the target question and answer pairs differ from each other and each target question and answer pair includes one or more answer statements.
The question and answer data processing method according to claim 1, wherein before the adding the target question and answer pair to the question and answer knowledge base, the method further comprises:

Applying a naive Bayesian classifier to the question statement and the answer statement in each target question and answer pair to determine whether the corresponding question statement is a query sentence and whether the corresponding answer statement is a declarative sentence;

Filtering out the target question and answer pairs whose question statement is not a query sentence and the target question and answer pairs whose answer statement is not a declarative sentence, to obtain the target question and answer pairs to be added to the question and answer knowledge base.
The question and answer data processing method according to any one of claims 1 to 5, wherein before the adding the target question and answer pair to the question and answer knowledge base, the method further comprises:

Applying a naive Bayesian classifier to the question statement and the answer statement in each target question and answer pair to determine whether the corresponding question statement is a query sentence and whether the corresponding answer statement is a declarative sentence;

Filtering out the target question and answer pairs whose question statement is not a query sentence and the target question and answer pairs whose answer statement is not a declarative sentence, to obtain the target question and answer pairs to be added to the question and answer knowledge base.
An electronic device comprising a memory and a processor, wherein the memory includes a question and answer data processing program, and the question and answer data processing program is executed by the processor to implement the following steps:

Obtaining step: obtaining agent text data from a database, the agent text data including question and answer data recorded by all customers and customer service during the question and answer process;

Grouping step: dividing the agent text data into a plurality of groups of question and answer records, the question and answer records in the same group belonging to the same customer and the corresponding customer service, each group of question and answer records including a plurality of question statements of the customer and a plurality of answer statements of the customer service;

Sorting step: arranging the question statements and the answer statements in each group of question and answer records in the chronological order of the records, removing the question statements without a corresponding answer statement and the answer statements without a corresponding question statement, and arranging the question statements and the answer statements in each group of question and answer records in a question-and-answer format to obtain a plurality of candidate question and answer pairs, each candidate question and answer pair including a question statement and a corresponding answer statement;

Extracting step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, counting the word frequency with which each term appears in the agent text data, calculating a term weight for each term according to its counted word frequency, and taking the terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement;

Calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each of the specified features, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain a matching degree of each candidate question and answer pair;

Screening step: selecting, from all the candidate question and answer pairs, the candidate question and answer pairs whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to a question and answer knowledge base, wherein the question and answer knowledge base is used to respond intelligently to customers based on the question and answer pairs in the question and answer knowledge base.
The electronic device of claim 8, wherein the obtaining the agent text data from the database further comprises performing a data cleaning operation on the agent text data, the data cleaning operation comprising:

Counting the frequency of occurrence of each statement in the agent text data, and taking the statements whose frequency of occurrence is higher than a third preset threshold as high-frequency sentences;

Performing word segmentation on each of the high-frequency sentences, counting the frequency with which each resulting term occurs across all high-frequency sentences, and taking the terms whose frequency of occurrence is higher than a fourth preset threshold as greeting words;

Calculating, for each statement in the agent text data, the proportion of greeting words among all terms of that statement, and taking the statements whose proportion is higher than a fifth preset threshold as greeting statements;

Determining the short text statements in the agent text data, the short text statements including statements whose number of characters is smaller than a sixth preset threshold and statements containing only numbers and/or symbols;

Removing the greeting statements and the short text statements from the agent text data, thereby completing the data cleaning operation.
The electronic device according to claim 8, wherein calculating the term weight for each term according to its counted word frequency comprises:

The term weight w is calculated using the following formula:

w=log(n/df);

Where n is the word frequency of the corresponding entry in the agent text data, and df is the number of documents containing the corresponding entry in the preset large-scale corpus.
The electronic device of claim 8 wherein said plurality of specified features comprises:

The number of common keywords a1 of the question statement and the answer statement;

The common keyword length a2 of the question statement and the answer statement;

The number of common terms a3 of the question statement and the answer statement;

The length a4 of the common terms of the question statement and the answer statement;

The length a5 of the question statement;

The length a6 of the answer statement;

The topic similarity a7 of the question statement and the answer statement;

Syntactic similarity a8 of the question statement and the answer statement;

And performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair comprises:

Using a multivariate logistic regression model to perform a weighted regression fitting calculation on the plurality of specified features to obtain the matching degree g(z) of each candidate question and answer pair, according to the following formula:

g(z)=1/(1+e^(-z)), where e is the natural constant;

Wherein z=a1*x1+a2*x2+a3*x3+a4*x4+a5*x5+a6*x6+a7*x7+a8*x8, and x1, x2...x8 are the feature weights of a1, a2...a8, respectively.
The electronic device according to claim 8, wherein the adding the target question and answer pair to the question and answer knowledge base further comprises:

Merging, in the respective target question and answer pairs, the answer statements that correspond to the same question statement, so that the question statements included in the target question and answer pairs differ from each other and each target question and answer pair includes one or more answer statements.
The electronic device according to claim 8, wherein the adding the target question and answer pair to the question and answer knowledge base further comprises:

Applying a naive Bayesian classifier to the question statement and the answer statement in each target question and answer pair to determine whether the corresponding question statement is a query sentence and whether the corresponding answer statement is a declarative sentence;

Filtering out the target question and answer pairs whose question statement is not a query sentence and the target question and answer pairs whose answer statement is not a declarative sentence, to obtain the target question and answer pairs to be added to the question and answer knowledge base.
The electronic device according to any one of claims 8 to 12, wherein before the adding the target question and answer pair to the question and answer knowledge base, the method further comprises:

Applying a naive Bayesian classifier to the question statement and the answer statement in each target question and answer pair to determine whether the corresponding question statement is a query sentence and whether the corresponding answer statement is a declarative sentence;

Filtering out the target question and answer pairs whose question statement is not a query sentence and the target question and answer pairs whose answer statement is not a declarative sentence, to obtain the target question and answer pairs to be added to the question and answer knowledge base.
A computer readable storage medium, comprising a question and answer data processing program, wherein when the question and answer data processing program is executed by a processor, the following steps are implemented:

Obtaining step: obtaining agent text data from a database, the agent text data including question and answer data recorded by all customers and customer service during the question and answer process;

Grouping step: dividing the agent text data into several groups of question and answer records, each group of question and answer records belonging to the same customer and the corresponding customer service, and each group of question and answer records including several question statements of the customer and several answer statements of the customer service;

Sorting step: arranging the question statements and the answer statements in each group of question and answer records according to the chronological order of the records, removing the question statements without a corresponding answer statement and the answer statements without a corresponding question statement, and arranging the question statements and the answer statements in each group of question and answer records in a question-and-answer format to obtain a plurality of candidate question and answer pairs, each candidate question and answer pair including a question statement and a corresponding answer statement;

Extracting step: segmenting the question statement and the answer statement in each candidate question and answer pair to obtain terms, counting the word frequency of each term in the agent text data, calculating a term weight for each term according to the counted word frequency, and using the terms whose term weight is greater than a first preset threshold as the keywords of the corresponding question statement or answer statement;

Calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a respective predetermined feature weight to each of the specified features, and, for each candidate question and answer pair, performing a weighting calculation on the plurality of specified features using the feature weights to obtain a matching degree of each candidate question and answer pair;

Screening step: selecting, from all the candidate question and answer pairs, the candidate question and answer pairs whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to the question and answer knowledge base, wherein the question and answer knowledge base is used to respond intelligently to customers based on the question and answer pairs in the question and answer knowledge base.
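As a rough illustration of the sorting step above, the following Python sketch pairs statements in chronological order. The record layout `(timestamp, role, text)`, the role labels `"customer"` and `"agent"`, and the rule that a question is paired only with the next customer service statement are assumptions made for this example; the claim does not specify them.

```python
def build_candidate_pairs(records):
    """Pair each customer question with the next agent answer.

    `records` is a list of (timestamp, role, text) tuples for ONE group,
    i.e. one customer and the corresponding customer service (assumed layout).
    """
    pairs = []
    pending_question = None
    for _, role, text in sorted(records, key=lambda r: r[0]):
        if role == "customer":
            # An earlier question that never received an answer is dropped here.
            pending_question = text
        elif role == "agent" and pending_question is not None:
            pairs.append((pending_question, text))
            pending_question = None
        # An agent statement with no pending question is dropped.
    return pairs


records = [
    (3, "agent", "You can reset it from the login page."),
    (1, "agent", "Hello, how can I help?"),
    (2, "customer", "How do I reset my password?"),
    (4, "agent", "Anything else?"),
    (5, "customer", "No thanks."),
]
candidate_pairs = build_candidate_pairs(records)
```

In this sketch the opening greeting, the follow-up "Anything else?" and the unanswered "No thanks." all fall away, leaving a single candidate pair.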
The computer readable storage medium according to claim 15, wherein after the obtaining the agent text data from the database, the method further comprises: performing a data cleaning operation on the agent text data, the data cleaning operation comprising:

Counting the frequency of occurrence of each statement in the agent text data, and taking the statements whose frequency of occurrence is higher than a third preset threshold as high-frequency statements;

Performing word segmentation on each of the high-frequency statements, counting the frequency of occurrence, in all the high-frequency statements, of each term obtained by the word segmentation, and taking the terms whose frequency of occurrence is higher than a fourth preset threshold as greeting words;

Calculating, for each statement in the agent text data, the proportion of greeting words among all the terms of the statement, and taking the statements whose proportion is higher than a fifth preset threshold as greeting statements;

Determining short text statements in the agent text data, the short text statements including statements having a number of characters smaller than a sixth preset threshold and statements containing only numbers and/or symbols;

Removing the greeting statements and the short text statements from the agent text data, thereby completing the data cleaning operation.
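The data cleaning operation can be sketched in Python as follows. This is only an illustration: the whitespace tokenizer stands in for a real word segmenter, the threshold values `t3` to `t6` are arbitrary, and term frequencies are counted over the distinct high-frequency statements, which is one possible reading of the claim.

```python
import re
from collections import Counter


def tokenize(statement):
    # Assumption: whitespace tokenization; Chinese text would need a segmenter.
    return statement.lower().split()


def clean_statements(statements, t3=1, t4=1, t5=0.4, t6=4):
    # Statements occurring more than t3 times are high-frequency statements.
    sentence_freq = Counter(statements)
    high_freq = [s for s, count in sentence_freq.items() if count > t3]

    # Terms occurring more than t4 times across them are greeting words.
    term_freq = Counter(t for s in high_freq for t in tokenize(s))
    greeting_words = {t for t, count in term_freq.items() if count > t4}

    def is_greeting(statement):
        terms = tokenize(statement)
        if not terms:
            return False
        ratio = sum(t in greeting_words for t in terms) / len(terms)
        return ratio > t5

    def is_short(statement):
        # Too few characters, or made up only of digits, symbols and spaces.
        only_symbols = re.fullmatch(r"[\W\d_]+", statement) is not None
        return len(statement) < t6 or only_symbols

    return [s for s in statements if not is_greeting(s) and not is_short(s)]


statements = [
    "hello there", "hello there", "hello friend", "hello friend",
    "how do I reset my password", "12345 !!", "ok",
]
cleaned = clean_statements(statements)
```

With these toy inputs, "hello" becomes the only greeting word, so both greeting variants are removed along with the two short text statements.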
The computer readable storage medium according to claim 15, wherein the calculating of the term weight for each term according to the counted word frequency comprises:

The term weight w is calculated using the following formula:

w=log(n/df);

Where n is the word frequency of the corresponding term in the agent text data, and df is the number of documents in a preset large-scale corpus that contain the corresponding term.
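A minimal Python sketch of this weight and of the keyword selection it feeds is given below. The natural logarithm is an assumption, since the claim does not state the base, and the function names, counts and threshold are hypothetical.

```python
import math


def term_weight(n, df):
    """w = log(n / df); natural log assumed."""
    if n <= 0 or df <= 0:
        raise ValueError("n and df must be positive")
    return math.log(n / df)


def extract_keywords(term_counts, doc_freq, threshold):
    """Keep terms whose weight exceeds the first preset threshold.

    term_counts: term -> word frequency n in the agent text data
    doc_freq:    term -> number of corpus documents containing the term
    Terms missing from doc_freq are skipped (an assumption of this sketch).
    """
    return [
        term
        for term, n in term_counts.items()
        if term in doc_freq and term_weight(n, doc_freq[term]) > threshold
    ]


term_counts = {"password": 50, "the": 400}   # hypothetical counts
doc_freq = {"password": 5, "the": 4000}      # hypothetical corpus statistics
keywords = extract_keywords(term_counts, doc_freq, threshold=0.0)
```

Here "password" is frequent in the agent text but rare in the corpus, so its weight is positive, while "the" is common everywhere and receives a negative weight.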
The computer readable storage medium of claim 15 wherein the plurality of specified features comprises:

The number of common keywords a1 of the question statement and the answer statement;

The common keyword length a2 of the question statement and the answer statement;

The number of common terms a3 of the question statement and the answer statement;

The common term length a4 of the question statement and the answer statement;

The length a5 of the question statement;

The length a6 of the answer statement;

The topic similarity a7 of the question statement and the answer statement;

Syntactic similarity a8 of the question statement and the answer statement;

And the performing of the weighting calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair comprises:

A multivariate logistic regression model is used to perform a weighted regression fitting calculation on the plurality of specified features to obtain the matching degree g(z) of each candidate question and answer pair, according to the following formula:

g(z) = 1/(1 + e^(-z)), where e is the natural constant;

Where z = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8, respectively.
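The matching degree calculation can be illustrated with the short Python sketch below. It assumes the standard logistic function g(z) = 1/(1 + e^(-z)); the feature values, the feature weights and the helper `select_target_pairs` are invented for the example and are not taken from the application.

```python
import math


def matching_degree(features, weights):
    """Logistic function of the weighted sum z = sum(a_i * x_i)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(a * x for a, x in zip(features, weights))
    # Numerically stable evaluation of 1 / (1 + e^(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def select_target_pairs(scored_pairs, threshold):
    """Keep pairs whose matching degree exceeds the second preset threshold."""
    return [pair for pair, score in scored_pairs if score > threshold]


# Hypothetical values for a1..a8 and x1..x8.
features = [2, 6, 4, 11, 30, 45, 0.7, 0.6]
weights = [0.8, 0.1, 0.4, 0.05, -0.01, -0.01, 1.5, 1.0]
score = matching_degree(features, weights)

targets = select_target_pairs(
    [(("question A", "answer A"), score), (("question B", "answer B"), 0.2)],
    threshold=0.5,
)
```

In practice the weights would come from fitting the logistic regression model on labeled pairs; fixed numbers are used here only to keep the sketch self-contained.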
The computer readable storage medium of claim 15, wherein adding the target question and answer pair to the question and answer knowledge base further comprises:

In the target question and answer pairs, the answer statements corresponding to the same question statement are merged, so that the question statements included in the target question and answer pairs are different from each other, and each target question and answer pair includes one or more answer statements.
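The merging of answers that share a question can be sketched in Python as follows. Matching questions by exact string equality and dropping repeated identical answers are assumptions of this example; the claim does not say how "the same question statement" is detected.

```python
from collections import OrderedDict


def merge_answers(target_pairs):
    """Group answers by question so that each question appears only once."""
    merged = OrderedDict()
    for question, answer in target_pairs:
        answers = merged.setdefault(question, [])
        if answer not in answers:  # assumption: skip exact duplicate answers
            answers.append(answer)
    return list(merged.items())


pairs = [
    ("How do I reset my password?", "Use the 'Forgot password' link."),
    ("What are your hours?", "9am to 6pm on weekdays."),
    ("How do I reset my password?", "Call support to verify your identity."),
]
merged = merge_answers(pairs)
```

After merging, the password question appears once with two answers and the hours question keeps its single answer.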
The computer readable storage medium of claim 15, wherein the adding the target question and answer pair to the question and answer knowledge base further comprises:

A naive Bayes classifier is used to determine, for each target question and answer pair, whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;

Filtering out the target question and answer pairs whose question statement is not an interrogative sentence and the target question and answer pairs whose answer statement is not a declarative sentence, to obtain the target question and answer pairs to be added to the question and answer knowledge base.
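The sentence-type filtering can be sketched with a small multinomial naive Bayes classifier written from scratch in Python. The tokenizer, the Laplace smoothing, the label names `"query"` and `"declarative"`, and the tiny training set are all assumptions made for illustration; the application does not describe the classifier's features or training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    # Assumption: words plus sentence-final punctuation are the features.
    return re.findall(r"\w+|[?.!]", sentence.lower())


class NaiveBayes:
    def fit(self, sentences, labels):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for sentence, label in zip(sentences, labels):
            self.token_counts[label].update(tokenize(sentence))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, sentence):
        best_label, best_score = None, -math.inf
        total_docs = sum(self.label_counts.values())
        for label, doc_count in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for token in tokenize(sentence):
                # Laplace-smoothed log likelihood of each token.
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def filter_pairs(pairs, clf):
    """Keep pairs with an interrogative question and a declarative answer."""
    return [
        (q, a) for q, a in pairs
        if clf.predict(q) == "query" and clf.predict(a) == "declarative"
    ]


train = [
    ("how do i reset my password ?", "query"),
    ("what are your opening hours ?", "query"),
    ("can i change my plan ?", "query"),
    ("you can reset it from the login page .", "declarative"),
    ("we are open from nine to six .", "declarative"),
    ("your plan has been changed .", "declarative"),
]
clf = NaiveBayes().fit([s for s, _ in train], [label for _, label in train])
kept = filter_pairs(
    [
        ("how can i pay ?", "we can reset it ."),
        ("your plan has been changed .", "we are open from nine to six ."),
    ],
    clf,
)
```

A production system would train on a much larger labeled set and would likely use an established library implementation rather than this hand-written one.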