WO2019153612A1 - Question and answer data processing method, electronic device and storage medium - Google Patents

Publication number
WO2019153612A1
Authority
WO
WIPO (PCT)
Prior art keywords
answer
question
statement
sentence
pair
Prior art date
Application number
PCT/CN2018/090642
Other languages
French (fr)
Chinese (zh)
Inventor
于凤英
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019153612A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • The present application relates to the field of computer technologies, and in particular to a question and answer data processing method, an electronic device, and a storage medium.
  • Among the applications of artificial intelligence (AI), intelligent question answering (intelligent Q&A) is one.
  • When a customer consults online via text or voice, an online intelligent customer service robot or an online intelligent customer service response system can answer the customer intelligently.
  • Intelligent Q&A can effectively shorten the time customers wait for customer service and improve service quality, so it has very broad prospects.
  • The Q&A knowledge base contains a large number of preset questions and answers. For example, for the question “How long is the warranty period?”, the corresponding answer is “The warranty period is one year”. In intelligent Q&A, the customer service replies on the basis of this Q&A knowledge base.
  • The present application provides a question and answer data processing method, the method comprising:
  • an acquiring step: obtaining agent text data from a database, the agent text data including the question and answer data recorded for all customers and customer service during the question and answer process;
  • a grouping step: dividing the agent text data into several groups of question and answer records, where each group belongs to one customer and the corresponding customer service and includes several question statements of the customer and several answer statements of the customer service;
  • a sorting step: arranging the question statements and answer statements in each group of question and answer records in the chronological order of the records, removing the question statements that have no corresponding answer statement and the answer statements that have no corresponding question statement, and arranging the remaining question statements and answer statements of each group in a one-question-one-answer form, thereby obtaining a plurality of candidate question and answer pairs, each including a question statement and the corresponding answer statement;
  • an extraction step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, counting the word frequency with which each term appears in the agent text data, calculating a term weight for each term from its word frequency, and taking the terms whose term weight is greater than a first preset threshold as the keywords of the corresponding question statement or answer statement;
  • a calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, weighting the specified features with the feature weights to obtain the matching degree of the pair;
  • a screening step: selecting, from all the candidate question and answer pairs, those whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to the question and answer knowledge base.
  • The present application further provides an electronic device including a memory and a processor, wherein the memory includes a question and answer data processing program, and the question and answer data processing program, when executed by the processor, implements the following steps:
  • an acquiring step: obtaining agent text data from a database, the agent text data including all the question and answer data recorded for customers and customer service during the question and answer process;
  • a grouping step: dividing the agent text data into several groups of question and answer records, where each group belongs to one customer and the corresponding customer service and includes several question statements of the customer and several answer statements of the customer service;
  • a sorting step: arranging the question statements and answer statements in each group of question and answer records in the chronological order of the records, removing the question statements that have no corresponding answer statement and the answer statements that have no corresponding question statement, and arranging the remaining statements of each group in a one-question-one-answer form, thereby obtaining a plurality of candidate question and answer pairs, each including a question statement and the corresponding answer statement;
  • an extraction step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, counting the word frequency with which each term appears in the agent text data, calculating a term weight for each term from its word frequency, and taking the terms whose term weight is greater than a first preset threshold as the keywords of the corresponding question statement or answer statement;
  • a calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, weighting the specified features with the feature weights to obtain the matching degree of the pair;
  • a screening step: selecting, from all the candidate question and answer pairs, those whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to the question and answer knowledge base, which is used to answer customers' questions intelligently on the basis of the question and answer pairs it contains.
  • The present application further provides a computer readable storage medium including a question and answer data processing program which, when executed by a processor, implements any step of the question and answer data processing method described above.
  • The question and answer data processing method, electronic device and computer readable storage medium provided by the present application first obtain the agent text data from the database and divide it into several groups of question and answer records; next, they arrange the question statements and answer statements in each group in the chronological order of the records to obtain a plurality of candidate question and answer pairs; then they count a plurality of specified features of each candidate question and answer pair, assign a corresponding preset feature weight to each specified feature, and weight the specified features with the feature weights to obtain the matching degree of each candidate question and answer pair; finally, they select the candidate question and answer pairs whose matching degree is higher than the second preset threshold as target question and answer pairs and add them to the question and answer knowledge base.
  • Suitable target question and answer pairs can thus be selected automatically from the agent text data to maintain and update the question and answer knowledge base, which saves manpower and improves efficiency.
  • FIG. 1 is a schematic diagram of an operating environment of a preferred embodiment of an electronic device of the present application;
  • FIG. 2 is a schematic diagram of interaction between an electronic device and a client according to a preferred embodiment of the present application;
  • FIG. 3 is a flow chart of a preferred embodiment of a question and answer data processing method of the present application;
  • FIG. 4 is a flow chart of a preferred embodiment of the data cleaning method of FIG. 3;
  • FIG. 5 is a program block diagram of the question and answer data processing program of FIG. 1.
  • Embodiments of the present application can be implemented as a method, apparatus, device, system, or computer program product. Accordingly, the application can be embodied entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or in a combination of hardware and software.
  • A question and answer data processing method, an electronic device, and a storage medium are proposed.
  • FIG. 1 is a schematic diagram of an operating environment of a preferred embodiment of an electronic device of the present application.
  • the electronic device 1 may be a terminal device having a storage and computing function such as a server, a portable computer, or a desktop computer.
  • the electronic device 1 includes a memory 11, a processor 12, a network interface 13, and a communication bus 14.
  • the network interface 13 can optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the communication bus 14 is used to implement connection communication between the above components.
  • the memory 11 includes at least one type of readable storage medium.
  • the at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card type memory, or the like.
  • the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1.
  • The readable storage medium may also be an external memory 11 of the electronic device 1, such as a plug-in hard disk, a smart memory card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
  • the readable storage medium of the memory 11 is generally used to store the question and answer data processing program 10, the database 2, the question and answer knowledge base 3, and the like installed in the electronic device 1.
  • the memory 11 can also be used to temporarily store data that has been output or is about to be output.
  • The processor 12 may be a Central Processing Unit (CPU), a microprocessor, or another data processing chip for running program code or processing data stored in the memory 11, for example executing the question and answer data processing program 10.
  • FIG. 1 shows only the electronic device 1 having the components 11-14 and the question and answer data processing program 10, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead.
  • The electronic device 1 may further include a user interface.
  • The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a speaker or headphones.
  • the user interface may also include a standard wired interface and a wireless interface.
  • The electronic device 1 may further include a display, which may also be referred to as a display screen or a display unit.
  • The display may be an LED display, a liquid crystal display, a touch liquid crystal display, or an Organic Light-Emitting Diode (OLED) display.
  • The display is used to display the information processed in the electronic device 1 and to display a visual user interface.
  • the electronic device 1 further comprises a touch sensor.
  • the area provided by the touch sensor for the user to perform a touch operation is referred to as a touch area.
  • the touch sensor described herein may be a resistive touch sensor, a capacitive touch sensor, or the like.
  • the touch sensor includes not only a contact type touch sensor but also a proximity type touch sensor or the like.
  • the touch sensor may be a single sensor or a plurality of sensors arranged, for example, in an array. The user can activate the question and answer data processing program 10 by touching the touch area.
  • the area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor.
  • a display is stacked with the touch sensor to form a touch display. The device detects a user-triggered touch operation based on a touch screen display.
  • the electronic device 1 may further include a radio frequency (RF) circuit, a sensor, an audio circuit, and the like, and details are not described herein.
  • Referring to FIG. 2, which is a schematic diagram of the interaction between the electronic device 1 and a client in a preferred embodiment of the present application.
  • The question and answer data processing program 10 runs in the electronic device 1, and the electronic device 1 may be a server that runs an intelligent question answering program. The electronic device 1 can therefore communicate with the client through the network, receive the question raised by the customer through the client, search the question and answer knowledge base 3 for the corresponding answer, and return the answer to the customer through the client, thereby achieving an intelligent reply.
  • The electronic device 1 can also record the agent text data of the daily exchanges between customers and customer service, and save the agent text data in the database 2.
  • The agent text data includes the question and answer data recorded for all customers and customer service during the question and answer process, such as each question and its input time, each answer and its output time, the customer identity, the customer service identity, and the like.
  • The question and answer data processing program 10 processes the agent text data in the database 2 to extract the target question and answer pairs that meet the requirements of the question and answer knowledge base 3, so as to update and maintain the question and answer knowledge base 3.
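  • As a rough illustration of what one record of the agent text data might hold, the sketch below models a single recorded statement. The class name `Utterance` and all field names are hypothetical; the application only lists the kinds of information recorded (each question or answer, its time, the customer identity, and the customer service identity).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Utterance:
    """One recorded statement of the agent text data (hypothetical schema)."""
    customer_id: str     # identity of the customer in the conversation
    agent_id: str        # identity of the customer service
    speaker: str         # "customer" (question side) or "agent" (answer side)
    text: str            # statement stored as recorded, without processing
    timestamp: datetime  # input time of a question or output time of an answer


# Example record; the identifiers and the date are made up for illustration.
sample = Utterance(
    customer_id="C001",
    agent_id="A17",
    speaker="customer",
    text="How long is the warranty period?",
    timestamp=datetime(2018, 6, 1, 9, 30, 0),
)
```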
  • Referring to FIG. 3, which is a flowchart of a preferred embodiment of the question and answer data processing method of the present application.
  • When the processor 12 of the electronic device 1 executes the question and answer data processing program 10 stored in the memory 11, the following steps are performed to implement the question and answer data processing method:
  • Step S1: obtain agent text data from the database 2, the agent text data including all the question and answer data recorded for customers and customer service during the question and answer process, such as each question and its input time, each answer and its output time, the customer identity, the customer service identity, and the like.
  • The agent text data records the original question and answer data; that is, the question and answer data in the agent text data is not processed but recorded as it is.
  • After step S1 obtains the agent text data from the database 2, a data cleaning operation may first be performed on the agent text data. As shown in FIG. 4, the data cleaning operation may include, for example, the following steps:
  • Step S11: count the frequency of occurrence of each statement in the agent text data, and take the statements whose frequency of occurrence is higher than a third preset threshold as high-frequency statements;
  • Step S12: perform word segmentation on each high-frequency statement, count the frequency of occurrence of each resulting term across all the high-frequency statements, and take the terms whose frequency is higher than a fourth preset threshold as greeting words;
  • Step S13: for each statement in the agent text data, calculate the proportion of greeting words among all the terms of that statement, and take the statements whose proportion is higher than a fifth preset threshold as greeting statements;
  • Step S14: determine the short text statements in the agent text data, the short text statements including statements whose number of characters is less than a sixth preset threshold and statements containing only numbers and/or symbols;
  • Step S15: filter the greeting statements and the short text statements out of the agent text data, thereby completing the data cleaning operation.
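  • The cleaning steps S11 to S15 can be sketched as follows. This is a minimal sketch under stated assumptions, not the application's implementation: `tokenize` is a placeholder regular-expression splitter standing in for a real word segmenter, term frequencies in S12 are counted once per distinct high-frequency statement, and the function name, parameter names and default threshold values are all hypothetical.

```python
import re
from collections import Counter


def tokenize(statement):
    # Placeholder segmentation (assumption): lowercase runs of word characters.
    return re.findall(r"\w+", statement.lower())


def clean(statements, stmt_freq_min=2, term_freq_min=2,
          greeting_ratio_min=0.5, min_chars=4):
    # S11: statements occurring more often than the third threshold.
    stmt_counts = Counter(statements)
    high_freq = [s for s, c in stmt_counts.items() if c > stmt_freq_min]

    # S12: terms occurring more often than the fourth threshold across the
    # distinct high-frequency statements are treated as greeting words.
    term_counts = Counter(t for s in high_freq for t in tokenize(s))
    greeting_words = {t for t, c in term_counts.items() if c > term_freq_min}

    kept = []
    for s in statements:
        terms = tokenize(s)
        # S13: share of greeting words among all terms of the statement.
        ratio = sum(t in greeting_words for t in terms) / len(terms) if terms else 0.0
        is_greeting = ratio > greeting_ratio_min
        # S14: too few characters, or only digits and/or symbols.
        stripped = s.strip()
        is_short = (len(stripped) < min_chars
                    or re.fullmatch(r"[\W\d_]+", stripped) is not None)
        # S15: drop greeting statements and short text statements.
        if not (is_greeting or is_short):
            kept.append(s)
    return kept
```

  • In practice the four thresholds would be tuned on real agent text data; the defaults above are placeholders only.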
  • Step S2: divide the agent text data into several groups of question and answer records, where each group belongs to one customer and the corresponding customer service.
  • Each group of question and answer records includes several question statements of the customer and several answer statements of the customer service.
  • Step S2 may map the question and answer data of each customer to the corresponding customer ID and divide the agent text data into several groups of question and answer records according to this mapping, so that each group belongs to one customer and the corresponding customer service.
  • Depending on how the question and answer process between the customer and the customer service went, each group may include several question statements of the customer and several answer statements of the customer service, but within a group these statements may be in no particular order. For example, they may be arranged by assigned ID number, or all the question statements of the customer may be placed together and all the answer statements of the customer service placed together. To make it easy to sort out each customer's question statements and the corresponding answer statements of the customer service from the agent text data, the following step S3 is performed.
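  • The grouping in step S2 can be pictured as a mapping from customer ID to records. The sketch below assumes each record is a dictionary with hypothetical `customer_id` and `agent_id` keys, and groups on the pair of both identifiers so that each group belongs to one customer and the corresponding customer service.

```python
from collections import defaultdict


def group_by_customer(records):
    """Split records into groups keyed by (customer_id, agent_id)."""
    groups = defaultdict(list)
    for record in records:
        key = (record["customer_id"], record["agent_id"])
        groups[key].append(record)  # order inside a group is not yet meaningful
    return dict(groups)
```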
  • Step S3: arrange the question statements and answer statements in each group of question and answer records in the chronological order of the records, remove the question statements that have no corresponding answer statement and the answer statements that have no corresponding question statement, and arrange the remaining question statements and answer statements of each group in a one-question-one-answer form, thereby obtaining a number of candidate question and answer pairs, each including a question statement and the corresponding answer statement.
  • By arranging the statements of each group in the chronological order of the records, step S3 makes it easy to extract each question statement of the customer and the corresponding answer statement in order.
  • During the question and answer process, a customer may ask a question that the customer service does not answer, or may send a meaningless symbol such as an emoticon or a question mark that prompts the customer service to ask a question on its own initiative. As a result, some question statements have no corresponding answer statement, and some answer statements have no corresponding question statement.
  • Step S3 may therefore first arrange the question statements and answer statements in each group of question and answer records in the chronological order of the records, and then remove the question statements without a corresponding answer statement and the answer statements without a corresponding question statement.
  • Step S3 can then arrange the remaining question statements and answer statements in each group in a one-question-one-answer form, thereby obtaining a plurality of candidate question and answer pairs, each including a question statement and the corresponding answer statement.
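  • One possible reading of step S3 is sketched below. The pairing policy is an assumption: a customer statement is paired with the next customer service statement that follows it in time, a question superseded by a newer question before any answer is treated as unanswered, and an answer with no pending question is dropped. The record keys `speaker`, `text` and `time` are hypothetical.

```python
def build_candidate_pairs(group):
    """Return (question, answer) pairs from one group of Q&A records."""
    ordered = sorted(group, key=lambda record: record["time"])
    pairs = []
    pending_question = None
    for record in ordered:
        if record["speaker"] == "customer":
            # A newer question replaces an earlier unanswered one.
            pending_question = record["text"]
        elif pending_question is not None:
            pairs.append((pending_question, record["text"]))
            pending_question = None
        # Otherwise: an answer without a preceding question is discarded.
    return pairs
```

  • A question still pending at the end of the group is never emitted, which matches the removal of question statements that have no corresponding answer statement.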
  • Step S4: perform word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, count the word frequency with which each term appears in the agent text data, calculate a term weight for each term from its counted word frequency, and take the terms whose term weight is greater than a first preset threshold as the keywords of the corresponding question statement or answer statement.
  • To calculate the term weight of each term from its counted word frequency, step S4 may use a formula in which:
  • n is the word frequency of the corresponding term in the agent text data;
  • df is the number of documents in a preset large-scale corpus that contain the corresponding term.
  • The preset large-scale corpus may be a general document corpus containing a large number of documents of many types; such a corpus can be used for sample training and deep learning on the sentence patterns, semantics, and the like of the statements in its documents.
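  • The weighting formula itself does not appear in the text above, only its symbols n and df. The sketch below therefore substitutes a common TF-IDF form, weight = n * log(N / (1 + df)), where N (the number of documents in the corpus) is an assumed extra parameter. It should be read as one plausible instance of a frequency-based term weight, not as the application's formula; all function and parameter names are hypothetical.

```python
import math
from collections import Counter


def term_weights(agent_text_terms, doc_freq, corpus_size):
    """Assumed TF-IDF style weight for every term seen in the agent text data.

    agent_text_terms: all terms of the agent text data, after segmentation
    doc_freq:         term -> number of corpus documents containing it (df)
    corpus_size:      total number of documents in the corpus (N, assumed)
    """
    n = Counter(agent_text_terms)  # word frequency of each term
    return {
        term: count * math.log(corpus_size / (1 + doc_freq.get(term, 0)))
        for term, count in n.items()
    }


def keywords(statement_terms, weights, threshold):
    """Terms of one statement whose weight exceeds the first preset threshold."""
    return [t for t in statement_terms if weights.get(t, 0.0) > threshold]
```

  • With this form, a term that occurs in nearly every corpus document gets a weight near zero however often it appears in the agent text data, so function words are unlikely to be selected as keywords.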
  • Step S5: count a plurality of specified features of each candidate question and answer pair, assign a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, weight the specified features with the feature weights to obtain the matching degree of the pair.
  • The plurality of specified features may include, for example:
  • the length of the terms common to the question statement and the answer statement, a4;
  • the length of the question statement, a5;
  • the length of the answer statement, a6;
  • The above specified features are adopted mainly for the following reasons. If the answer statement in a candidate question and answer pair really answers the question statement, the answer statement usually contains keywords of the question statement; the more keywords the two statements share, and the longer their common terms, the greater the probability that the answer statement is the answer to the question statement. Since the target question and answer pairs to be acquired should contain a complete question and a complete answer, a longer question statement is more likely to describe a complete question, and a longer answer statement is more likely to describe a complete answer. Finally, if the answer statement answers the question statement, the topics of the two statements are usually consistent, and there is usually a certain correlation between them.
  • The specified features may be weighted with the feature weights by, for example, performing a weighted regression fit on them with a multiple logistic regression model, thereby obtaining the matching degree g(z) of each candidate question and answer pair, where:
  • z = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8, respectively.
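  • A small sketch of the matching degree follows. The text gives z explicitly but not the expression for g; the sketch assumes g is the standard logistic (sigmoid) function, which is what a logistic regression model normally uses. The function name is hypothetical, and in practice the feature values would likely be normalized before weighting.

```python
import math


def matching_degree(features, weights):
    """g(z) for z = a1*x1 + ... + a8*x8, with g assumed to be the sigmoid."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(a * x for a, x in zip(features, weights))
    # Numerically stable sigmoid: avoid exp() overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

  • Because the sigmoid maps any z into the interval from 0 to 1, the second preset threshold used for screening can be chosen on that scale.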
  • Step S6: select, from all the candidate question and answer pairs, those whose matching degree is higher than the second preset threshold as target question and answer pairs, and add the target question and answer pairs to the question and answer knowledge base 3, which is used to answer customers' questions intelligently on the basis of the question and answer pairs it contains.
  • There may be one or more candidate question and answer pairs whose matching degree is higher than the second preset threshold.
  • Before adding the target question and answer pairs to the question and answer knowledge base 3, step S6 may further merge the answer statements that correspond to the same question statement across the target question and answer pairs, so that the question statements of the target question and answer pairs are all different from each other and each target question and answer pair includes one or more answer statements.
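  • Screening and merging can be sketched together. The input format, a list of `(question, answer, matching_degree)` tuples, and the function name are assumptions made for illustration; identical answers to the same question are kept only once.

```python
def screen_and_merge(scored_pairs, threshold):
    """Keep pairs above the threshold and merge answers per question."""
    merged = {}
    for question, answer, degree in scored_pairs:
        if degree > threshold:
            answers = merged.setdefault(question, [])
            if answer not in answers:  # avoid storing duplicate answers
                answers.append(answer)
    return merged
```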
  • Before adding the target question and answer pairs to the question and answer knowledge base 3, step S6 may also apply a naive Bayes classifier to the question statement and the answer statement of each target question and answer pair to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence. Step S6 then filters out the target question and answer pairs whose question statement is not an interrogative sentence and those whose answer statement is not a declarative sentence, and so obtains the target question and answer pairs to be added to the question and answer knowledge base 3.
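  • The sentence-type check can be illustrated with a tiny multinomial naive Bayes classifier using add-one smoothing. Everything here is a sketch under assumptions: the tokenizer, the class and function names, the label strings, and above all the six training sentences, which are made-up examples; a real system would train on a labelled corpus with a proper word segmenter.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(sentence):
    # Words plus question marks, so punctuation can serve as a feature.
    return re.findall(r"\w+|\?", sentence.lower())


class NaiveBayes:
    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.term_counts[label].update(tokens(sentence))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, count in self.class_counts.items():
            logp = math.log(count / total)  # class prior
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for t in tokens(sentence):
                if t in self.vocab:  # unseen terms are ignored
                    logp += math.log((self.term_counts[label][t] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Made-up training data for illustration only.
TRAIN = [
    ("How long is the warranty period?", "interrogative"),
    ("Can I return the product?", "interrogative"),
    ("Where is my order?", "interrogative"),
    ("The warranty period is one year.", "declarative"),
    ("You can return the product within seven days.", "declarative"),
    ("Your order has shipped.", "declarative"),
]
classifier = NaiveBayes().fit([s for s, _ in TRAIN], [y for _, y in TRAIN])


def filter_pairs(pairs, clf):
    """Keep pairs whose question is interrogative and answer is declarative."""
    return [(q, a) for q, a in pairs
            if clf.predict(q) == "interrogative" and clf.predict(a) == "declarative"]
```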
  • In the method described above, the agent text data is first obtained from the database and divided into several groups of question and answer records, each group belonging to one customer and the corresponding customer service. Next, the question statements and answer statements in each group are arranged in the chronological order of the records to obtain a plurality of candidate question and answer pairs.
  • Word segmentation is then performed on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, the word frequency with which each term appears in the agent text data is counted, a term weight is calculated for each term from its word frequency, and the terms whose term weight is greater than the first preset threshold are taken as the keywords of the corresponding question statement or answer statement.
  • A plurality of specified features of each candidate question and answer pair are then counted, a corresponding preset feature weight is assigned to each specified feature, and the specified features of each candidate question and answer pair are weighted with the feature weights to obtain the matching degree of the pair.
  • Finally, the candidate question and answer pairs whose matching degree is higher than the second preset threshold are selected from all the candidate question and answer pairs as target question and answer pairs and added to the question and answer knowledge base 3. Suitable target question and answer pairs can thus be selected automatically from the agent text data to maintain and update the question and answer knowledge base 3, which saves manpower and improves efficiency.
  • Referring to FIG. 5, which is a program module diagram of the question and answer data processing program 10 in FIG. 1.
  • The question and answer data processing program 10 is divided into a plurality of modules, which are stored in the memory 11 and executed by the processor 12 to implement the present application.
  • A module, as referred to in this application, is a series of computer program instructions capable of performing a particular function.
  • The question and answer data processing program 10 can be divided into an acquisition module 110, a grouping module 120, a sorting module 130, an extraction module 140, a calculation module 150, and a screening module 160.
  • The acquisition module 110 is configured to obtain agent text data from a database, where the agent text data includes the question and answer data recorded for all customers and customer service during the question and answer process.
  • After the agent text data is obtained, a data cleaning operation may be performed on it. As in steps S11 to S15, the operation identifies the greeting statements and the short text statements in the agent text data, the short text statements including statements whose number of characters is less than a sixth preset threshold and statements containing only numbers and/or symbols, and filters both kinds of statement out of the agent text data.
  • The grouping module 120 is configured to divide the agent text data into a plurality of groups of question and answer records, where each group belongs to one customer and the corresponding customer service and includes a plurality of question statements of the customer and a plurality of answer statements of the customer service.
  • The sorting module 130 is configured to arrange the question statements and answer statements in each group of question and answer records in the chronological order of the records, remove the question statements that have no corresponding answer statement and the answer statements that have no corresponding question statement, and arrange the remaining question statements and answer statements of each group in a one-question-one-answer form, thereby obtaining a plurality of candidate question and answer pairs, each including a question statement and the corresponding answer statement.
  • The extraction module 140 is configured to perform word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, count the word frequency with which each term appears in the agent text data, calculate a term weight for each term from its counted word frequency, and take the terms whose term weight is greater than the first preset threshold as the keywords of the corresponding question statement or answer statement.
  • To calculate the term weight of each term from its counted word frequency, the extraction module 140 may use a formula in which:
  • n is the word frequency of the corresponding term in the agent text data;
  • df is the number of documents in the preset large-scale corpus that contain the corresponding term.
  • the calculating module 150 is configured to count a plurality of specified features of each candidate question and answer pair, assign a corresponding preset feature weight to each of the specified features, and, for each candidate question and answer pair, perform a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that candidate question and answer pair.
  • the plurality of designated features may include, for example:
  • the length of the common entry of the question statement and the answer statement is a4;
  • the length of the question statement is a5;
  • the length of the answer statement is a6;
  • the weighted calculation on the plurality of specified features may be performed, for example, by the calculating module 150 using a multiple logistic regression model to perform a weighted regression fit on the plurality of specified features, thereby obtaining the matching degree g(z) of each candidate question and answer pair, as follows:
  • z = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + a8*x8, where x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8, respectively.
  • the screening module 160 is configured to select, from all candidate question and answer pairs, the candidate question and answer pairs whose matching degree is higher than a second preset threshold as target question and answer pairs, and add the target question and answer pairs to the question and answer knowledge base, where the question and answer knowledge base is used to answer customers intelligently based on the question and answer pairs it contains.
  • before adding the target question and answer pairs to the question and answer knowledge base 3, the screening module 160 may further merge the answer statements that correspond to the same question statement across the target question and answer pairs.
  • each target question and answer pair includes one or more answer statements.
  • before adding the target question and answer pairs to the question and answer knowledge base 3, the screening module 160 may further apply a naive Bayes classifier to the question statement and the answer statement in each target question and answer pair to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence; the screening module 160 then filters out the target question and answer pairs whose question statement is not an interrogative sentence and those whose answer statement is not a declarative sentence, and the remaining target question and answer pairs are the ones to be added to the question and answer knowledge base 3.
  • the memory 11 including the readable storage medium may include a question and answer data processing program 10, a database 2, and a question and answer knowledge base 3.
  • when the processor 12 executes the question and answer data processing program 10 stored in the memory 11, the following steps are implemented:
  • Obtaining step: obtaining agent text data from a database, the agent text data including the question and answer data recorded for all customers and customer service agents during their question and answer sessions;
  • Grouping step: dividing the agent text data into a plurality of groups of question and answer records, where each group belongs to one customer and the corresponding customer service agent and includes a plurality of question statements of the customer and a plurality of answer statements of the customer service agent;
  • Sorting step: arranging the question statements and the answer statements in each group of question and answer records in the chronological order of the records, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the remaining question statements and answer statements in each group in a one-question-one-answer form, thereby obtaining a plurality of candidate question and answer pairs, each including one question statement and one corresponding answer statement;
  • Extracting step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain terms, counting the word frequency of each term in the agent text data, calculating a term weight for each term according to its counted word frequency, and using each term whose term weight is greater than a first preset threshold as a keyword of the corresponding question statement or answer statement;
  • Calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each of the specified features, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that candidate question and answer pair;
  • Screening step: selecting, from all candidate question and answer pairs, the candidate question and answer pairs whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to the question and answer knowledge base, where the question and answer knowledge base is used to answer customers intelligently based on the question and answer pairs it contains.
  • the method further includes performing a data cleaning operation on the agent text data, where the data cleaning operation includes:
  • the short text sentence including a statement having a number of characters smaller than a sixth preset threshold, and a statement containing only numbers and/or symbols;
  • calculating the term weight for each term according to its counted word frequency includes:
  • weight w is calculated using the following formula:
  • n is the word frequency of the corresponding entry in the agent text data
  • df is the number of documents containing the corresponding entry in the preset large-scale corpus.
  • the plurality of specified features includes:
  • the length of the common entry of the question statement and the answer statement is a4;
  • the length of the question statement is a5;
  • the length of the answer statement is a6;
  • performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair includes:
  • using a multiple logistic regression model to perform a weighted regression fit on the plurality of specified features to obtain the matching degree g(z) of each candidate question and answer pair, as follows:
  • z = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + a8*x8, where x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8, respectively.
  • the method further includes:
  • each target question and answer pair includes one or more answer sentences.
  • the method further includes:
  • a naive Bayes classifier is used on the question statement and the answer statement in each target question and answer pair to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;
  • the embodiment of the present application further provides a computer readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like.
  • the computer readable storage medium includes a database 2, a question and answer knowledge base 3, a question and answer data processing program 10, and the like. When the question and answer data processing program 10 is executed by the processor 12, the following operations are implemented:
  • Obtaining step: obtaining agent text data from a database, the agent text data including the question and answer data recorded for all customers and customer service agents during their question and answer sessions;
  • Grouping step: dividing the agent text data into a plurality of groups of question and answer records, where each group belongs to one customer and the corresponding customer service agent and includes a plurality of question statements of the customer and a plurality of answer statements of the customer service agent;
  • Sorting step: arranging the question statements and the answer statements in each group of question and answer records in the chronological order of the records, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the remaining question statements and answer statements in each group in a one-question-one-answer form, thereby obtaining a plurality of candidate question and answer pairs, each including one question statement and one corresponding answer statement;
  • Extracting step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain terms, counting the word frequency of each term in the agent text data, calculating a term weight for each term according to its counted word frequency, and using each term whose term weight is greater than a first preset threshold as a keyword of the corresponding question statement or answer statement;
  • Calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each of the specified features, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that candidate question and answer pair;
  • Screening step: selecting, from all candidate question and answer pairs, the candidate question and answer pairs whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to the question and answer knowledge base, where the question and answer knowledge base is used to answer customers intelligently based on the question and answer pairs it contains.
  • the method further includes performing a data cleaning operation on the agent text data, where the data cleaning operation includes:
  • the short text sentence including a statement having a number of characters smaller than a sixth preset threshold, and a statement containing only numbers and/or symbols;
  • calculating the term weight for each term according to its counted word frequency includes:
  • weight w is calculated using the following formula:
  • n is the word frequency of the corresponding entry in the agent text data
  • df is the number of documents containing the corresponding entry in the preset large-scale corpus.
  • the plurality of specified features includes:
  • the length of the common entry of the question statement and the answer statement is a4;
  • the length of the question statement is a5;
  • the length of the answer statement is a6;
  • performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair includes:
  • using a multiple logistic regression model to perform a weighted regression fit on the plurality of specified features to obtain the matching degree g(z) of each candidate question and answer pair, as follows:
  • z = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + a8*x8, where x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8, respectively.
  • the method further includes:
  • among the target question and answer pairs, the answer statements corresponding to the same question statement are merged, so that the question statements of the target question and answer pairs are all different from one another and each target question and answer pair includes one or more answer statements.
  • the method further includes:
  • a naive Bayes classifier is used on the question statement and the answer statement in each target question and answer pair to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;
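
The merging of answer statements that share a question statement, described in the list above, can be illustrated with a minimal Python sketch. It assumes that two question statements are "the same" when their text is identical and that exact duplicate answers are kept only once; neither rule is stated in the text, and the function name `merge_answers` and the sample sentences are hypothetical.

```python
def merge_answers(target_pairs):
    """Merge (question, answer) pairs that share the same question text.

    Returns a dict mapping each distinct question to its list of answers,
    in first-seen order (dicts preserve insertion order in Python 3.7+).
    """
    merged = {}
    for question, answer in target_pairs:
        answers = merged.setdefault(question, [])
        if answer not in answers:  # assumption: drop exact duplicate answers
            answers.append(answer)
    return merged


# Hypothetical target question and answer pairs.
pairs = [
    ("How long is the warranty period?", "The warranty period is one year."),
    ("Where is the service center?", "It is listed on the invoice."),
    ("How long is the warranty period?", "One year from the purchase date."),
]
merged = merge_answers(pairs)
for question, answers in merged.items():
    print(question, "->", answers)
```

After merging, every question appears once and carries one or more answers, which matches the stated outcome that the question statements differ from one another.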

Abstract

Provided is a question and answer (Q&A) data processing method. The method comprises: acquiring agent text data from a database; dividing the agent text data into a plurality of groups of Q&A records; arranging the question sentences and answer sentences in each group of Q&A records in the chronological order of the records to obtain a number of candidate Q&A pairs; counting a plurality of specified features of each candidate Q&A pair and, for each candidate Q&A pair, performing a weighted calculation on the specified features using feature weights to obtain the degree of match of that pair; and then selecting, from all candidate Q&A pairs, those whose degree of match is higher than a second preset threshold as target Q&A pairs and adding them to a Q&A knowledge base. With the present application, suitable target Q&A pairs can be selected from the agent text data and the Q&A knowledge base can be maintained and updated, which saves human resources and improves efficiency.

Description

Question and answer data processing method, electronic device and storage medium
The present application claims priority to Chinese Patent Application No. 201810135748.0, entitled "Question and answer data processing method, electronic device and storage medium", filed with the Chinese Patent Office on February 9, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technologies, and in particular to a question and answer data processing method, an electronic device, and a storage medium.
Background
With the development of technology, artificial intelligence (AI) is gradually changing our way of life, and intelligent question answering is one example. When a customer consults online by text or voice, an online intelligent customer service robot or an online intelligent customer service response system can answer the customer intelligently. Intelligent question answering can effectively shorten the time customers spend waiting for service and improve service quality, so it has very broad prospects.
At present, implementing intelligent question answering for a specific domain requires a large question and answer (Q&A) knowledge base that contains a large number of preset question-answer pairs. For example, if the question is "How long is the warranty period?", the corresponding answer is "The warranty period is one year." The intelligent customer service answers on the basis of this Q&A knowledge base. However, to keep the answers accurate, the Q&A knowledge base must be supplemented and improved in a timely manner, and its maintenance currently consumes a large amount of human resources and is inefficient.
Summary of the invention
In view of the above, it is necessary to provide a question and answer data processing method, an electronic device, and a storage medium that can automatically select suitable target question and answer pairs from agent text data and maintain and update the Q&A knowledge base, thereby saving manpower and improving efficiency.
To achieve the above object, the present application provides a question and answer data processing method, the method comprising: an obtaining step: obtaining agent text data from a database, the agent text data including the question and answer data recorded for all customers and customer service agents during their question and answer sessions; a grouping step: dividing the agent text data into a plurality of groups of question and answer records, where each group belongs to one customer and the corresponding customer service agent and includes a plurality of question statements of the customer and a plurality of answer statements of the customer service agent; a sorting step: arranging the question statements and answer statements in each group in the chronological order of the records, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the remaining question statements and answer statements in each group in a one-question-one-answer form, thereby obtaining a plurality of candidate question and answer pairs, each including one question statement and one corresponding answer statement; an extracting step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain terms, counting the word frequency of each term in the agent text data, calculating a term weight for each term according to its counted word frequency, and using each term whose term weight is greater than a first preset threshold as a keyword of the corresponding question statement or answer statement; a calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that pair; and a screening step: selecting, from all candidate question and answer pairs, those whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to a Q&A knowledge base, which is used to answer customers intelligently based on the question and answer pairs it contains.
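
The sorting step can be sketched in Python as follows. The text does not say how a question statement is matched to its "corresponding" answer statement, so this sketch assumes the simplest rule: after sorting by time, a customer statement is paired with the first agent statement that follows it, an unanswered question is replaced by a later question, and an answer with no pending question is dropped. The `Statement` type, the role labels, and the sample sentences are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Statement:
    role: str        # "customer" (question) or "agent" (answer); hypothetical labels
    text: str
    time: datetime   # time at which the statement was recorded


def build_candidate_pairs(group):
    """Turn one group of records into one-question-one-answer candidate pairs."""
    ordered = sorted(group, key=lambda s: s.time)  # chronological order
    pairs = []
    pending_question = None
    for statement in ordered:
        if statement.role == "customer":
            # Assumption: a newer question replaces an unanswered older one.
            pending_question = statement
        elif pending_question is not None:
            pairs.append((pending_question.text, statement.text))
            pending_question = None  # later answers with no new question are dropped
    return pairs


t0 = datetime(2018, 2, 9, 10, 0, 0)
group = [
    Statement("agent", "Yes, until next March.", t0 + timedelta(minutes=2)),
    Statement("customer", "Hello?", t0),
    Statement("customer", "Is it still under warranty?", t0 + timedelta(minutes=1)),
    Statement("agent", "Anything else I can help with?", t0 + timedelta(minutes=3)),
]
candidate_pairs = build_candidate_pairs(group)
print(candidate_pairs)
```

Under this rule, "Hello?" is removed as a question without an answer and the agent's second statement is removed as an answer without a question, leaving a single candidate pair.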
To achieve the above object, the present application further provides an electronic device including a memory and a processor, the memory including a question and answer data processing program which, when executed by the processor, implements the following steps: an obtaining step: obtaining agent text data from a database, the agent text data including the question and answer data recorded for all customers and customer service agents during their question and answer sessions; a grouping step: dividing the agent text data into a plurality of groups of question and answer records, where each group belongs to one customer and the corresponding customer service agent and includes a plurality of question statements of the customer and a plurality of answer statements of the customer service agent; a sorting step: arranging the question statements and answer statements in each group in the chronological order of the records, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the remaining question statements and answer statements in each group in a one-question-one-answer form, thereby obtaining a plurality of candidate question and answer pairs, each including one question statement and one corresponding answer statement; an extracting step: performing word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain terms, counting the word frequency of each term in the agent text data, calculating a term weight for each term according to its counted word frequency, and using each term whose term weight is greater than a first preset threshold as a keyword of the corresponding question statement or answer statement; a calculating step: counting a plurality of specified features of each candidate question and answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that pair; and a screening step: selecting, from all candidate question and answer pairs, those whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to a Q&A knowledge base, which is used to answer customers intelligently based on the question and answer pairs it contains.
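
The extracting step can be illustrated with a short Python sketch. The exact term-weight formula is not reproduced in this text, so the sketch assumes a TF-IDF-style weight, w = n * log(N / (df + 1)), where n is the term's frequency in the agent text data, df is the number of documents in a reference corpus that contain the term, and N is the size of that corpus. The formula, the parameter N (`corpus_size`), the function names, and all numbers are illustrative assumptions. Word segmentation is replaced here by pre-split English tokens.

```python
import math
from collections import Counter


def term_weights(tokenized_statements, doc_freq, corpus_size):
    """Assumed TF-IDF-style weight: w = n * log(N / (df + 1))."""
    n = Counter(term for tokens in tokenized_statements for term in tokens)
    return {
        term: count * math.log(corpus_size / (doc_freq.get(term, 0) + 1))
        for term, count in n.items()
    }


def keywords(tokens, weights, threshold):
    """Keep the distinct terms of one statement whose weight exceeds the threshold."""
    return [t for t in dict.fromkeys(tokens) if weights[t] > threshold]


# Hypothetical pre-segmented statements and reference-corpus document frequencies.
statements = [
    ["warranty", "period", "how", "long"],
    ["the", "warranty", "period", "is", "one", "year"],
]
doc_freq = {"the": 900, "is": 850, "how": 400, "long": 300,
            "warranty": 9, "period": 49, "one": 500, "year": 200}
weights = term_weights(statements, doc_freq, corpus_size=1000)
question_keywords = keywords(statements[0], weights, threshold=2.0)
print(question_keywords)
```

With these made-up numbers, frequent but domain-specific terms such as "warranty" and "period" score well above common words, which is the behavior a first preset threshold is meant to exploit.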
In addition, to achieve the above object, the present application further provides a computer readable storage medium that includes a question and answer data processing program; when the question and answer data processing program is executed by a processor, any step of the question and answer data processing method described above is implemented.
The question and answer data processing method, electronic device, and computer readable storage medium proposed by the present application first obtain agent text data from a database and divide it into a plurality of groups of question and answer records; next, they arrange the question statements and answer statements in each group in the chronological order of the records to obtain a plurality of candidate question and answer pairs; they then count a plurality of specified features of each candidate question and answer pair, assign a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, perform a weighted calculation on the specified features using the feature weights to obtain the matching degree of that pair; finally, according to the matching degree, they select from all candidate question and answer pairs those whose matching degree is higher than a second preset threshold as target question and answer pairs and add them to the Q&A knowledge base. In this way, suitable target question and answer pairs can be selected automatically from the agent text data and the Q&A knowledge base can be maintained and updated, thereby saving manpower and improving efficiency.
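
The weighted calculation and the threshold screening can be sketched in Python as follows. The sketch forms z as the weighted sum of the feature values and assumes that the matching degree g(z) is the standard logistic function 1 / (1 + e^(-z)), which is the usual choice for a logistic regression model but is not spelled out in this text. Only three features are used for brevity, and the feature values, feature weights, and threshold are made-up numbers.

```python
import math


def matching_degree(features, feature_weights):
    """g(z) for z = sum(a_i * x_i); g is assumed to be the logistic function."""
    z = sum(a * x for a, x in zip(features, feature_weights))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # equivalent form that avoids overflow for very negative z
    return e / (1.0 + e)


def screen(candidates, feature_weights, threshold):
    """Keep the candidate pairs whose matching degree is higher than the threshold."""
    return [pair for pair, features in candidates
            if matching_degree(features, feature_weights) > threshold]


# Hypothetical candidates: ((question, answer), feature values a1..a3).
candidates = [
    (("How long is the warranty period?", "The warranty period is one year."),
     [1.0, 2.0, 0.5]),
    (("Hello?", "Please hold on."), [0.0, 0.1, 3.0]),
]
feature_weights = [0.8, 0.5, -0.4]   # x1..x3, illustrative values only
targets = screen(candidates, feature_weights, threshold=0.5)
print(targets)
```

Because the logistic function maps any weighted sum into the interval (0, 1), the second preset threshold can be chosen as a value in that range.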
Brief description of the drawings
FIG. 1 is a schematic diagram of the operating environment of a preferred embodiment of the electronic device of the present application;
FIG. 2 is a schematic diagram of the interaction between the electronic device and a client in a preferred embodiment of the present application;
FIG. 3 is a flow chart of a preferred embodiment of the question and answer data processing method of the present application;
FIG. 4 is a flow chart of a preferred embodiment of the data cleaning method in FIG. 3;
FIG. 5 is a program module diagram of the question and answer data processing program in FIG. 1.
The implementation, functional features, and advantages of the present application will be further described in connection with the embodiments and with reference to the accompanying drawings.
Detailed description
The principles and spirit of the present application are described below with reference to several specific embodiments. It should be understood that the specific embodiments described herein merely explain the present application and are not intended to limit it.
Those skilled in the art will appreciate that embodiments of the present application may be implemented as a method, an apparatus, a device, a system, or a computer program product. Accordingly, the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, microcode, and the like), or in a combination of hardware and software.
According to the embodiments of the present application, a question and answer data processing method, an electronic device, and a storage medium are proposed.
FIG. 1 is a schematic diagram of the operating environment of a preferred embodiment of the electronic device of the present application.
The electronic device 1 may be a terminal device with storage and computing functions, such as a server, a portable computer, or a desktop computer.
The electronic device 1 includes a memory 11, a processor 12, a network interface 13, and a communication bus 14. The network interface 13 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The communication bus 14 is used to implement connection and communication between the above components.
The memory 11 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the readable storage medium may also be an external memory 11 of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
In this embodiment, the readable storage medium of the memory 11 is generally used to store the question and answer data processing program 10 installed on the electronic device 1, the database 2, the Q&A knowledge base 3, and the like. The memory 11 may also be used to temporarily store data that has been output or is about to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code or process the data stored in the memory 11, for example to execute the question and answer data processing program 10.
FIG. 1 shows only the electronic device 1 with the components 11-14 and the question and answer data processing program 10, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further include a user interface. The user interface may include an input unit such as a keyboard, a voice input device with a voice recognition function such as a microphone, and a voice output device such as a speaker or headphones. Optionally, the user interface may also include a standard wired interface and a wireless interface.
Optionally, the electronic device 1 may further include a display, which may also be referred to as a display screen or a display unit. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) display, or the like. The display is used to show the information processed in the electronic device 1 and to present a visual user interface.
Optionally, the electronic device 1 further includes a touch sensor. The area that the touch sensor provides for the user's touch operations is called the touch area. The touch sensor described here may be a resistive touch sensor, a capacitive touch sensor, or the like. The touch sensor may be a contact-type touch sensor or a proximity-type touch sensor, and it may be a single sensor or a plurality of sensors arranged, for example, in an array. The user can start the question and answer data processing program 10 by touching the touch area.
In addition, the area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen, on the basis of which the device detects touch operations triggered by the user.
The electronic device 1 may further include a radio frequency (RF) circuit, sensors, an audio circuit, and the like, which are not described in detail here.
FIG. 2 is an interaction diagram of a preferred embodiment of the electronic device 1 of the present application. The question and answer data processing program 10 runs in the electronic device 1, and the electronic device 1 may be a server that runs an intelligent question answering program. Accordingly, the electronic device 1 can communicate with a client over a network, receive the questions that a customer raises through the client, look up the corresponding answers in the question and answer knowledge base 3, and return the answers to the customer through the client, thereby providing intelligent responses. In addition, the electronic device 1 can record the agent text data of day-to-day exchanges between customers and customer service agents and save the agent text data in the database 2. The agent text data includes the question and answer data recorded during the question and answer process between all customers and customer service agents, for example each question and its input time, each answer and its output time, the customer identifier, and the customer service identifier. The question and answer data processing program 10 processes the agent text data in the database 2 to extract target question and answer pairs that meet the requirements of the question and answer knowledge base 3, so as to update and maintain the question and answer knowledge base 3.
FIG. 3 is a flowchart of a preferred embodiment of the question and answer data processing method of the present application. When the processor 12 of the electronic device 1 executes the question and answer data processing program 10 stored in the memory 11, the following steps of the question and answer data processing method are implemented:
Step S1: obtain agent text data from the database 2. The agent text data includes the question and answer data recorded during the question and answer process between all customers and customer service agents, for example each question and its input time, each answer and its output time, the customer identifier, and the customer service identifier. The agent text data records the original question and answer data; that is, the question and answer data in the agent text data has not undergone any processing and is recorded as is.
In the question and answer process between a customer and a customer service agent, besides the question statements and answer statements related to consultation in a specific field, there are usually other statements that are pure conversation and are of little value to the question and answer knowledge base 3. Such meaningless data in the agent text data includes statements that are merely exchanges of greetings, such as "Hello", "Goodbye", and "Happy to be of service"; statements that are too short to express any practical meaning, such as "Why?"; and statements that contain only numbers and/or symbols. To remove such data, after obtaining the agent text data from the database 2, step S1 may first perform a data cleaning operation on the agent text data. Referring to FIG. 4, the data cleaning operation may include, for example, the following steps:
Step S11: count the occurrence frequency of each statement in the agent text data, and take the statements whose occurrence frequency is higher than a third preset threshold as high-frequency statements;
Step S12: perform word segmentation on each high-frequency statement, count the occurrence frequency of each resulting term across all high-frequency statements, and take the terms whose occurrence frequency is higher than a fourth preset threshold as greeting words;
Step S13: for each statement in the agent text data, calculate the proportion of greeting words among all terms of that statement, and take the statements whose calculated proportion is higher than a fifth preset threshold as greeting statements;
Step S14: determine the short text statements in the agent text data, where the short text statements include statements whose number of characters is less than a sixth preset threshold and statements that contain only numbers and/or symbols;
Step S15: filter the greeting statements and the short text statements out of the agent text data, thereby completing the data cleaning operation.
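By way of illustration only, steps S11 to S15 can be sketched in Python as follows. The function name `clean_statements`, the parameters `t3` to `t6` (standing for the third to sixth preset thresholds), and the `tokenize` callable are hypothetical names that do not appear in the text. The sketch assumes that "occurrence frequency" means a raw count and that a statement with no alphabetic character counts as containing only numbers and/or symbols; the text does not fix either point.

```python
from collections import Counter


def clean_statements(statements, tokenize, t3, t4, t5, t6):
    """Drop greeting statements and short text statements (steps S11-S15).

    statements: list of raw statement strings.
    tokenize:   hypothetical word-segmentation callable, str -> list of terms.
    t3..t6:     stand-ins for the third to sixth preset thresholds.
    """
    # S11: statements occurring more than t3 times are high-frequency statements.
    statement_counts = Counter(statements)
    high_freq = [s for s, c in statement_counts.items() if c > t3]

    # S12: terms occurring more than t4 times across the high-frequency
    # statements are treated as greeting words.
    term_counts = Counter(t for s in high_freq for t in tokenize(s))
    greeting_words = {t for t, c in term_counts.items() if c > t4}

    def is_greeting(statement):
        # S13: proportion of greeting words among the statement's terms.
        terms = tokenize(statement)
        if not terms:
            return False
        ratio = sum(t in greeting_words for t in terms) / len(terms)
        return ratio > t5

    def is_short(statement):
        # S14: fewer than t6 characters, or no alphabetic character at all
        # (assumed here to mean "only numbers and/or symbols").
        return len(statement) < t6 or not any(ch.isalpha() for ch in statement)

    # S15: keep only statements that are neither greetings nor short text.
    return [s for s in statements if not is_greeting(s) and not is_short(s)]
```

For Chinese text, `tokenize` would be a word segmenter; the sketch leaves the choice of segmenter open.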
Step S2: divide the agent text data into several groups of question and answer records, where each group of question and answer records belongs to one customer and the corresponding customer service agent, and each group includes several question statements from the customer and several answer statements from the customer service agent.
Specifically, step S2 may map the question and answer data of different customers to the corresponding customer IDs, and then divide the agent text data into several groups of question and answer records according to this mapping, with each group belonging to one customer and the corresponding customer service agent. Depending on how the exchange between the customer and the customer service agent proceeded, each group may include several question statements from the customer and several answer statements from the customer service agent. Within a group, however, the question statements and answer statements may be in no particular order: for example, they may be arranged by assigned ID number, or all of the customer's question statements may be placed together and all of the customer service agent's answer statements placed together, and so on. To make it easy to sort out each customer's question statements and the corresponding answer statements from the agent text data, the following step S3 is performed.
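As a minimal sketch of the grouping in step S2, the following Python function maps each record to its customer ID. The record layout (a dictionary with the keys `customer_id`, `role`, `text`, and `time`) and the function name `group_by_customer` are assumptions made for illustration; the text does not specify how the records are stored.

```python
from collections import defaultdict


def group_by_customer(records):
    """Group question and answer records by customer ID (step S2).

    Each record is assumed to be a dict with at least a 'customer_id' key;
    this layout is hypothetical.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["customer_id"]].append(record)
    # Return a plain dict so that missing keys raise instead of creating groups.
    return dict(groups)
```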
Step S3: arrange the question statements and answer statements in each group of question and answer records in the chronological order in which they were recorded, remove the question statements that have no corresponding answer statement and the answer statements that have no corresponding question statement, and arrange the question statements and answer statements in each group in a one-question, one-answer form, thereby obtaining several candidate question and answer pairs, each of which includes one question statement and one corresponding answer statement.
Because the exchange between a customer and a customer service agent proceeds chronologically, one question and one answer at a time, arranging the question statements and answer statements of each group in recorded time order in step S3 makes it easy to extract each of the customer's question statements and the corresponding answer statement in sequence. During the exchange, a customer may ask a question that the customer service agent does not answer, or a customer may send a meaningless symbol such as an emoticon or a question mark that prompts the customer service agent to ask a question first. As a result, there may be question statements with no corresponding answer statement and answer statements with no corresponding question statement. After arranging the question statements and answer statements of each group in recorded time order, step S3 may therefore first remove those question statements and answer statements. In this way, step S3 arranges the question statements and answer statements of each group in a one-question, one-answer form and obtains several candidate question and answer pairs, each of which includes one question statement and one corresponding answer statement.
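One possible reading of step S3 is sketched below in Python. It assumes the same hypothetical record layout (`role` equal to `"customer"` or `"agent"`, plus `text` and `time`), and it assumes that a question is "answered" only when the next statement in time order comes from the customer service agent; the text does not spell out the pairing rule at this level of detail.

```python
def pair_questions_answers(group):
    """Turn one group of records into candidate (question, answer) pairs.

    Assumed rule: after sorting by time, a customer statement followed
    directly by an agent statement forms a pair. A question followed by
    another question, and an answer with no pending question, are dropped.
    """
    ordered = sorted(group, key=lambda record: record["time"])
    pairs = []
    pending_question = None
    for record in ordered:
        if record["role"] == "customer":
            # Any earlier question still pending had no answer and is dropped.
            pending_question = record["text"]
        elif pending_question is not None:
            pairs.append((pending_question, record["text"]))
            pending_question = None
        # Otherwise: an answer without a preceding question is dropped.
    return pairs
```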
Step S4: perform word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, count the term frequency with which each term appears in the agent text data, calculate a term weight for each term from its counted term frequency, and take the terms whose term weight is greater than a first preset threshold as the keywords of the corresponding question statement or answer statement.
Specifically, step S4 may calculate the term weight of each term from its counted term frequency using the following formula:
w = log(n / df);
where n is the term frequency of the term in the agent text data, and df is the number of documents in a preset large-scale corpus that contain the term. The preset large-scale corpus may be a public document corpus that includes a large number of documents of various types; such a corpus supports sample training and deep learning on the syntax, semantics, and other aspects of the statements and terms in the documents.
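The term weight formula and the keyword selection of step S4 can be illustrated with the following Python sketch. The function names `term_weights` and `keywords`, and the `doc_freq` mapping that stands in for the preset large-scale corpus, are hypothetical. The sketch assumes the natural logarithm, since the text does not state the base, and it skips terms that do not occur in the corpus, which is also an assumption.

```python
import math
from collections import Counter


def term_weights(tokenized_statements, doc_freq):
    """Compute w = log(n / df) for every term.

    tokenized_statements: list of term lists taken from the agent text data.
    doc_freq: hypothetical mapping term -> number of corpus documents
              containing the term (df in the formula).
    """
    term_freq = Counter(t for terms in tokenized_statements for t in terms)
    return {
        term: math.log(n / doc_freq[term])
        for term, n in term_freq.items()
        if doc_freq.get(term, 0) > 0  # assumption: skip terms unseen in the corpus
    }


def keywords(statement_terms, weights, first_threshold):
    """Keep the terms of one statement whose weight exceeds the first threshold."""
    return [
        t for t in statement_terms
        if weights.get(t, float("-inf")) > first_threshold
    ]
```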
Step S5: compute a plurality of specified features for each candidate question and answer pair, assign a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, perform a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair.
The plurality of specified features may include, for example:
the number of keywords shared by the question statement and the answer statement, a1;
the length of the keywords shared by the question statement and the answer statement, a2;
the number of terms shared by the question statement and the answer statement, a3;
the length of the terms shared by the question statement and the answer statement, a4;
the length of the question statement, a5;
the length of the answer statement, a6;
the topic similarity between the question statement and the answer statement, a7;
the syntactic similarity between the question statement and the answer statement, a8.
The main considerations behind these features are as follows. If the answer statement in a candidate question and answer pair is an answer to the question statement, the answer statement usually contains keywords from the question statement. The more numerous and the longer the shared keywords and shared terms are, the more likely it is that the answer statement is an answer to the question statement. Because the target question and answer pairs are expected to be complete, meaningful questions and answers, a longer question statement describes a complete question better and a longer answer statement describes a complete answer better. If the answer statement is an answer to the question statement, the two statements usually share the same topic, and they usually also have some syntactic relationship.
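As a rough illustration, the features a1 to a6 can be computed from segmented statements as in the Python sketch below. The function name `pair_features` is hypothetical, and "length" is assumed to mean the total number of characters of the items concerned. The text does not say how the topic similarity a7 and the syntactic similarity a8 are obtained, so the sketch simply takes them as inputs.

```python
def pair_features(q_terms, a_terms, q_keywords, a_keywords, topic_sim, syntax_sim):
    """Return the feature list [a1, ..., a8] for one candidate pair.

    q_terms / a_terms:       terms of the question / answer statement.
    q_keywords / a_keywords: keywords of each statement (from step S4).
    topic_sim / syntax_sim:  a7 and a8, supplied by the caller because the
                             text does not define how they are computed.
    """
    common_keywords = set(q_keywords) & set(a_keywords)
    common_terms = set(q_terms) & set(a_terms)
    return [
        len(common_keywords),                  # a1: number of shared keywords
        sum(len(k) for k in common_keywords),  # a2: length of shared keywords
        len(common_terms),                     # a3: number of shared terms
        sum(len(t) for t in common_terms),     # a4: length of shared terms
        sum(len(t) for t in q_terms),          # a5: length of the question
        sum(len(t) for t in a_terms),          # a6: length of the answer
        topic_sim,                             # a7: topic similarity
        syntax_sim,                            # a8: syntactic similarity
    ]
```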
The weighted calculation on the plurality of specified features using the feature weights may, for example, apply a multivariate logistic regression model to the plurality of specified features to perform a weighted regression fit, thereby obtaining the matching degree g(z) of each candidate question and answer pair. The formula is as follows:
g(z) = 1/(1 + e^(-z)), where e is the natural constant;
where z = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8 respectively.
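The weighted calculation can be sketched in Python as follows. The function name `matching_degree` is hypothetical. The sketch assumes the conventional logistic function, under which a larger weighted sum z gives a matching degree closer to 1, in line with the rationale given for the features; the feature weights themselves are preset values that the text does not list.

```python
import math


def matching_degree(features, weights):
    """Weighted sum of the features passed through the logistic function.

    features: [a1, ..., a8] for one candidate pair.
    weights:  [x1, ..., x8], the preset feature weights (values not given
              in the text; any numbers used with this sketch are illustrative).
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(a * x for a, x in zip(features, weights))
    # Numerically stable evaluation of 1 / (1 + e^(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)
```

Evaluating the two branches separately avoids overflow in `math.exp` when z is a large negative number.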
Step S6: from all candidate question and answer pairs, select those whose matching degree is higher than a second preset threshold as target question and answer pairs, and add the target question and answer pairs to the question and answer knowledge base 3, which is used to provide customers with intelligent responses based on the question and answer pairs it contains. One or more candidate question and answer pairs may have a matching degree higher than the second preset threshold.
In addition, to make the target question and answer pairs more concise, before adding the target question and answer pairs to the question and answer knowledge base 3, step S6 may first merge the answer statements that correspond to the same question statement across the target question and answer pairs, so that no two target question and answer pairs contain the same question statement and each target question and answer pair includes one or more answer statements.
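The selection and merging in step S6 can be illustrated by the short Python sketch below. The function name `select_and_merge` and the triple layout `(question, answer, score)` are assumptions for illustration, and dropping an answer that repeats verbatim under the same question is a choice of this sketch rather than something the text requires.

```python
def select_and_merge(scored_pairs, second_threshold):
    """Select target pairs and merge answers that share a question.

    scored_pairs: iterable of (question, answer, matching_degree) triples.
    Returns a dict mapping each question to its list of answers.
    """
    merged = {}
    for question, answer, score in scored_pairs:
        if score > second_threshold:
            answers = merged.setdefault(question, [])
            if answer not in answers:  # sketch choice: skip verbatim repeats
                answers.append(answer)
    return merged
```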
For a specific field, a question statement should generally be constructed as an interrogative sentence, and an answer statement should be a declarative sentence. To further improve the accuracy of the target question and answer pairs added to the question and answer knowledge base 3, before adding the target question and answer pairs to the question and answer knowledge base 3, step S6 may also use a naive Bayes classifier on the question statement and the answer statement of each target question and answer pair to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence. Step S6 then filters out the target question and answer pairs whose question statement is not an interrogative sentence and the target question and answer pairs whose answer statement is not a declarative sentence, and obtains the target question and answer pairs to be added to the question and answer knowledge base 3.
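The sentence-type filter can be sketched with a small multinomial naive Bayes classifier written from scratch in Python, as below. The text names only "a naive Bayes classifier"; the multinomial variant, the Laplace smoothing, the label names `"question"` and `"statement"`, and the names `NaiveBayes` and `filter_by_sentence_type` are all assumptions of this sketch, and the training data would have to be supplied separately.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial naive Bayes over term lists (Laplace smoothing)."""

    def fit(self, samples, labels):
        # samples: list of term lists; labels: one class label per sample.
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for terms, label in zip(samples, labels):
            self.term_counts[label].update(terms)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, terms):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior of the class
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for term in terms:
                if term in self.vocab:  # terms never seen in training are ignored
                    score += math.log((self.term_counts[label][term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def filter_by_sentence_type(pairs, classifier, tokenize):
    """Keep pairs whose question is interrogative and whose answer is declarative."""
    return [
        (q, a) for q, a in pairs
        if classifier.predict(tokenize(q)) == "question"
        and classifier.predict(tokenize(a)) == "statement"
    ]
```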
According to the question and answer data processing method provided in this embodiment, agent text data is first obtained from the database and divided into several groups of question and answer records, with each group belonging to one customer and the corresponding customer service agent. Next, the question statements and answer statements in each group are arranged in recorded time order to obtain several candidate question and answer pairs. Word segmentation is then performed on the question statement and answer statement of each candidate question and answer pair to obtain individual terms, the term frequency with which each term appears in the agent text data is counted, a term weight is calculated for each term from its counted term frequency, and the terms whose term weight is greater than the first preset threshold are taken as the keywords of the corresponding question statement or answer statement. A plurality of specified features are computed for each candidate question and answer pair, each specified feature is assigned a corresponding preset feature weight, and, for each candidate question and answer pair, a weighted calculation is performed on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair. Finally, the candidate question and answer pairs whose matching degree is higher than the second preset threshold are selected from all candidate question and answer pairs as target question and answer pairs and added to the question and answer knowledge base 3. In this way, suitable target question and answer pairs can be selected automatically from the agent text data to maintain and update the question and answer knowledge base 3, which saves labor and improves efficiency.
FIG. 5 is a program module diagram of the question and answer data processing program 10 in FIG. 1. In this embodiment, the question and answer data processing program 10 is divided into a plurality of modules, which are stored in the memory 11 and executed by the processor 12 to carry out the present application. A module as referred to in the present application is a series of computer program instruction segments capable of performing a specific function.
The question and answer data processing program 10 may be divided into an acquisition module 110, a grouping module 120, a sorting module 130, an extraction module 140, a calculation module 150, and a screening module 160.
The acquisition module 110 is configured to obtain agent text data from the database, where the agent text data includes the question and answer data recorded during the question and answer process between all customers and customer service agents.
After obtaining the agent text data from the database, the acquisition module 110 may first perform a data cleaning operation on the agent text data. The data cleaning operation includes:
counting the occurrence frequency of each statement in the agent text data, and taking the statements whose occurrence frequency is higher than the third preset threshold as high-frequency statements;
performing word segmentation on each high-frequency statement, counting the occurrence frequency of each resulting term across all high-frequency statements, and taking the terms whose occurrence frequency is higher than the fourth preset threshold as greeting words;
for each statement in the agent text data, calculating the proportion of greeting words among all terms of that statement, and taking the statements whose calculated proportion is higher than the fifth preset threshold as greeting statements;
determining the short text statements in the agent text data, where the short text statements include statements whose number of characters is less than the sixth preset threshold and statements that contain only numbers and/or symbols;
filtering the greeting statements and the short text statements out of the agent text data, thereby completing the data cleaning operation.
The grouping module 120 is configured to divide the agent text data into several groups of question and answer records, where each group of question and answer records belongs to one customer and the corresponding customer service agent, and each group includes several question statements from the customer and several answer statements from the customer service agent.
The sorting module 130 is configured to arrange the question statements and answer statements in each group of question and answer records in the chronological order in which they were recorded, remove the question statements that have no corresponding answer statement and the answer statements that have no corresponding question statement, and arrange the question statements and answer statements in each group in a one-question, one-answer form, thereby obtaining several candidate question and answer pairs, each of which includes one question statement and one corresponding answer statement.
The extraction module 140 is configured to perform word segmentation on the question statement and the answer statement in each candidate question and answer pair to obtain individual terms, count the term frequency with which each term appears in the agent text data, calculate a term weight for each term from its counted term frequency, and take the terms whose term weight is greater than the first preset threshold as the keywords of the corresponding question statement or answer statement.
Specifically, the extraction module 140 may calculate the term weight of each term from its counted term frequency using the following formula:
w = log(n / df);
where n is the term frequency of the term in the agent text data, and df is the number of documents in a preset large-scale corpus that contain the term.
The calculation module 150 is configured to compute a plurality of specified features for each candidate question and answer pair, assign a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, perform a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair.
The plurality of specified features may include, for example:
the number of keywords shared by the question statement and the answer statement, a1;
the length of the keywords shared by the question statement and the answer statement, a2;
the number of terms shared by the question statement and the answer statement, a3;
the length of the terms shared by the question statement and the answer statement, a4;
the length of the question statement, a5;
the length of the answer statement, a6;
the topic similarity between the question statement and the answer statement, a7;
the syntactic similarity between the question statement and the answer statement, a8.
For the weighted calculation on the plurality of specified features using the feature weights, the calculation module 150 may, for example, apply a multivariate logistic regression model to the plurality of specified features to perform a weighted regression fit, thereby obtaining the matching degree g(z) of each candidate question and answer pair. The formula is as follows:
g(z) = 1/(1 + e^(-z)), where e is the natural constant;
where z = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8 respectively.
The screening module 160 is configured to select, from all candidate question and answer pairs, those whose matching degree is higher than the second preset threshold as target question and answer pairs, and to add the target question and answer pairs to the question and answer knowledge base, which is used to provide customers with intelligent responses based on the question and answer pairs it contains.
To make the target question and answer pairs more concise, before adding the target question and answer pairs to the question and answer knowledge base 3, the screening module 160 may first merge the answer statements that correspond to the same question statement across the target question and answer pairs, so that no two target question and answer pairs contain the same question statement and each target question and answer pair includes one or more answer statements.
In addition, to further improve the accuracy of the target question and answer pairs added to the question and answer knowledge base 3, before adding the target question and answer pairs to the question and answer knowledge base 3, the screening module 160 may also use a naive Bayes classifier on the question statement and the answer statement of each target question and answer pair to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence. The screening module 160 then filters out the target question and answer pairs whose question statement is not an interrogative sentence and the target question and answer pairs whose answer statement is not a declarative sentence, and obtains the target question and answer pairs to be added to the question and answer knowledge base 3.
In the schematic diagram of the operating environment of the preferred embodiment of the electronic device 1 shown in FIG. 1, the memory 11, which includes a readable storage medium, may contain the question-and-answer data processing program 10, the database 2, and the Q&A knowledge base 3. When the processor 12 executes the question-and-answer data processing program 10 stored in the memory 11, the following steps are implemented:
Acquisition step: obtaining agent text data from a database, the agent text data including the question-and-answer data recorded during question-and-answer sessions between all customers and customer service agents;
Grouping step: dividing the agent text data into several groups of question-and-answer records, each group belonging to one customer and the corresponding customer service agent, and each group including several question statements from the customer and several answer statements from the customer service agent;
Sorting step: arranging the question statements and answer statements in each group of question-and-answer records in the chronological order in which they were recorded, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the question statements and answer statements of each group in a one-question-one-answer form, thereby obtaining several candidate question-answer pairs, each candidate question-answer pair including one question statement and one corresponding answer statement;
Extraction step: performing word segmentation on the question statement and the answer statement in each candidate question-answer pair to obtain individual terms, counting the frequency with which each term appears in the agent text data, calculating a term weight for each term from its counted frequency, and taking terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement;
Calculation step: computing a plurality of specified features for each candidate question-answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question-answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that candidate question-answer pair;
Screening step: selecting, from all candidate question-answer pairs, those whose matching degree is higher than a second preset threshold as target question-answer pairs, and adding the target question-answer pairs to a Q&A knowledge base, the Q&A knowledge base being used to respond intelligently to customers based on the question-answer pairs it contains.
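The grouping and sorting steps can be sketched as follows. This is a minimal illustration, not the implementation of the application: the record layout (`customer`, `agent`, `time`, `role`, `text`) and the pairing rule (a customer question is paired with the first agent answer that follows it, and a question superseded by a newer question is treated as unanswered) are assumptions made for the example.

```python
from collections import defaultdict


def build_candidate_pairs(records):
    """Group records per (customer, agent), sort by time, and pair Q with A.

    `records` is an iterable of dicts with the hypothetical keys
    customer, agent, time, role ("customer" or "agent") and text.
    """
    groups = defaultdict(list)
    for record in records:
        groups[(record["customer"], record["agent"])].append(record)

    pairs = []
    for session in groups.values():
        # Sorting step: chronological order within one group.
        session.sort(key=lambda r: r["time"])
        pending_question = None
        for record in session:
            if record["role"] == "customer":
                # A newer question replaces an unanswered one (assumption).
                pending_question = record["text"]
            elif pending_question is not None:
                pairs.append((pending_question, record["text"]))
                pending_question = None
            # An agent statement with no pending question is dropped.
    return pairs
```

Other pairing rules are equally compatible with the text, for example concatenating consecutive statements from the same speaker before pairing.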
After the agent text data is obtained from the database, a data cleaning operation is further performed on the agent text data, the data cleaning operation including:
Counting the frequency of occurrence of each statement in the agent text data, and taking statements whose frequency is higher than a third preset threshold as high-frequency statements;
Performing word segmentation on each high-frequency statement, counting the frequency with which each resulting term occurs across all high-frequency statements, and taking terms whose frequency is higher than a fourth preset threshold as greeting words;
Calculating, for each statement in the agent text data, the proportion of greeting words among all terms of that statement, and taking statements whose proportion is higher than a fifth preset threshold as greeting-type statements;
Determining short-text statements in the agent text data, the short-text statements including statements whose number of characters is less than a sixth preset threshold and statements containing only digits and/or symbols;
Filtering the greeting-type statements and the short-text statements out of the agent text data, thereby completing the data cleaning operation.
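The data cleaning operation can be sketched as below. The function name, the `tokenize` callable, and the parameters `t3` to `t6` (standing for the third to sixth preset thresholds) are hypothetical. Frequencies are treated as raw counts, and term frequency is counted over the distinct high-frequency statements; the text does not fix either choice, so both are assumptions.

```python
import re
from collections import Counter


def clean_statements(statements, tokenize, t3, t4, t5, t6):
    """Drop greeting-type and short-text statements from a list of strings."""
    # Statements occurring more often than t3 are high-frequency statements.
    sentence_freq = Counter(statements)
    high_freq = [s for s, count in sentence_freq.items() if count > t3]

    # Terms occurring more often than t4 in those statements are greeting words.
    term_freq = Counter(t for s in high_freq for t in tokenize(s))
    greeting_terms = {t for t, count in term_freq.items() if count > t4}

    def is_greeting(statement):
        terms = tokenize(statement)
        if not terms:
            return False
        ratio = sum(t in greeting_terms for t in terms) / len(terms)
        return ratio > t5

    def is_short(statement):
        # Fewer than t6 characters, or only digits, symbols and whitespace.
        only_symbols = re.fullmatch(r"[\W\d_]+", statement) is not None
        return len(statement) < t6 or only_symbols

    return [s for s in statements if not is_greeting(s) and not is_short(s)]
```

For Chinese text, `tokenize` would be a word segmenter; whitespace splitting is enough to try the sketch on English input.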
Calculating the term weight for each term from its counted frequency includes:
Calculating the term weight w using the following formula:
w=log(n/df);
where n is the frequency of the term in the agent text data, and df is the number of documents in a preset large-scale corpus that contain the term.
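A short sketch of the weight formula w=log(n/df) and the keyword threshold follows. The natural logarithm is assumed because the text does not state the base, and skipping terms that are absent from the reference corpus is also an assumption; the function names are hypothetical.

```python
import math


def term_weight(n, df):
    """w = log(n / df); n is the term frequency, df the document frequency."""
    return math.log(n / df)


def select_keywords(term_counts, doc_freq, threshold):
    """Return the terms whose weight exceeds the first preset threshold."""
    keywords = set()
    for term, n in term_counts.items():
        df = doc_freq.get(term, 0)
        if df <= 0:
            continue  # term unseen in the reference corpus: skipped (assumption)
        if term_weight(n, df) > threshold:
            keywords.add(term)
    return keywords
```

A term that is frequent in the agent text data but rare in the general corpus gets a large weight, while a term that is common everywhere gets a small or negative one.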
The plurality of specified features includes:
The number a1 of keywords common to the question statement and the answer statement;
The length a2 of the keywords common to the question statement and the answer statement;
The number a3 of terms common to the question statement and the answer statement;
The length a4 of the terms common to the question statement and the answer statement;
The length a5 of the question statement;
The length a6 of the answer statement;
The topic similarity a7 between the question statement and the answer statement;
The syntactic similarity a8 between the question statement and the answer statement;
Performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question-answer pair includes:
Applying a multivariate logistic regression model to the plurality of specified features to perform a weighted regression fit, obtaining the matching degree g(z) of each candidate question-answer pair by the following formula:
g(z)=1/(1+e^z), where e is the natural constant;
where z=a1*x1+a2*x2+a3*x3+a4*x4+a5*x5+a6*x6+a7*x7+a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8 respectively.
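The matching degree can be computed as in the sketch below, which reads the extracted formula as g(z)=1/(1+e^z) with z the weighted sum of the eight features. The more common way to write the logistic function is 1/(1+e^(-z)); the two forms differ only in the sign convention of the weights, so negative weights here play the role of positive weights there. The function name and the overflow guard are additions for the example.

```python
import math


def matching_degree(features, weights):
    """g(z) = 1 / (1 + e^z), with z = a1*x1 + ... + a8*x8."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(a * x for a, x in zip(features, weights))
    if z > 700:
        # math.exp(z) would overflow; g(z) tends to 0 as z grows.
        return 0.0
    return 1.0 / (1.0 + math.exp(z))
```

With all features at zero the result is 0.5, and a candidate pair is kept when its result exceeds the second preset threshold.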
Before the target question-answer pairs are added to the Q&A knowledge base, the steps further include:
Merging, among the target question-answer pairs, the answer statements that correspond to the same question statement, so that no two target question-answer pairs contain the same question statement and each target question-answer pair includes one or more answer statements.
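The merge can be sketched with a dictionary keyed by the question statement. Exact string equality as the test for "the same question statement" and the removal of duplicate answers are assumptions of this example.

```python
def merge_by_question(pairs):
    """Map each distinct question to the list of its answers, in input order."""
    merged = {}
    for question, answer in pairs:
        answers = merged.setdefault(question, [])
        if answer not in answers:  # identical answers kept once (assumption)
            answers.append(answer)
    return merged
```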
Before the target question-answer pairs are added to the Q&A knowledge base, the steps further include:
Applying a naive Bayes classifier to the question statement and the answer statement of each target question-answer pair, to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;
Filtering out target question-answer pairs whose question statement is not an interrogative sentence, as well as those whose answer statement is not a declarative sentence, to obtain the target question-answer pairs to be added to the Q&A knowledge base.
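The sentence-type filter can be illustrated with a small multinomial naive Bayes classifier over terms. The text does not specify the classifier's features or training data, so the class below, its add-one smoothing, and the labels "question" and "statement" are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over term lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for terms, label in zip(docs, labels):
            self.term_counts[label].update(terms)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, terms):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each term.
            score = math.log(count / total)
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for term in terms:
                score += math.log((self.term_counts[label][term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def filter_pairs(pairs, clf, tokenize):
    """Keep pairs whose question is interrogative and answer declarative."""
    return [
        (q, a) for q, a in pairs
        if clf.predict(tokenize(q)) == "question"
        and clf.predict(tokenize(a)) == "statement"
    ]
```

In practice the classifier would be trained on labeled interrogative and declarative sentences from the same domain as the agent text data.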
For the specific principles, refer to the description above of FIG. 5, the program module diagram of the question-and-answer data processing program 10, and of FIG. 3, the flowchart of the preferred embodiment of the question-and-answer data processing method.
In addition, an embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium may be any one of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like, or any combination of several of them. The computer-readable storage medium stores the database 2, the Q&A knowledge base 3, the question-and-answer data processing program 10, and so on. When the question-and-answer data processing program 10 is executed by the processor 12, the following operations are implemented:
Acquisition step: obtaining agent text data from a database, the agent text data including the question-and-answer data recorded during question-and-answer sessions between all customers and customer service agents;
Grouping step: dividing the agent text data into several groups of question-and-answer records, each group belonging to one customer and the corresponding customer service agent, and each group including several question statements from the customer and several answer statements from the customer service agent;
Sorting step: arranging the question statements and answer statements in each group of question-and-answer records in the chronological order in which they were recorded, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the question statements and answer statements of each group in a one-question-one-answer form, thereby obtaining several candidate question-answer pairs, each candidate question-answer pair including one question statement and one corresponding answer statement;
Extraction step: performing word segmentation on the question statement and the answer statement in each candidate question-answer pair to obtain individual terms, counting the frequency with which each term appears in the agent text data, calculating a term weight for each term from its counted frequency, and taking terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement;
Calculation step: computing a plurality of specified features for each candidate question-answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question-answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that candidate question-answer pair;
Screening step: selecting, from all candidate question-answer pairs, those whose matching degree is higher than a second preset threshold as target question-answer pairs, and adding the target question-answer pairs to a Q&A knowledge base, the Q&A knowledge base being used to respond intelligently to customers based on the question-answer pairs it contains.
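The screening step reduces to a threshold filter followed by an update of the knowledge base. In the sketch below the knowledge base is modeled as a plain dictionary from question to list of answers, which is an assumption; the function and parameter names are hypothetical.

```python
def screen_pairs(scored_pairs, threshold, knowledge_base):
    """Add pairs whose matching degree exceeds the threshold to the knowledge base.

    `scored_pairs` is an iterable of (question, answer, matching_degree).
    Returns the selected target question-answer pairs.
    """
    targets = [(q, a) for q, a, score in scored_pairs if score > threshold]
    for question, answer in targets:
        knowledge_base.setdefault(question, []).append(answer)
    return targets
```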
After the agent text data is obtained from the database, a data cleaning operation is further performed on the agent text data, the data cleaning operation including:
Counting the frequency of occurrence of each statement in the agent text data, and taking statements whose frequency is higher than a third preset threshold as high-frequency statements;
Performing word segmentation on each high-frequency statement, counting the frequency with which each resulting term occurs across all high-frequency statements, and taking terms whose frequency is higher than a fourth preset threshold as greeting words;
Calculating, for each statement in the agent text data, the proportion of greeting words among all terms of that statement, and taking statements whose proportion is higher than a fifth preset threshold as greeting-type statements;
Determining short-text statements in the agent text data, the short-text statements including statements whose number of characters is less than a sixth preset threshold and statements containing only digits and/or symbols;
Filtering the greeting-type statements and the short-text statements out of the agent text data, thereby completing the data cleaning operation.
Calculating the term weight for each term from its counted frequency includes:
Calculating the term weight w using the following formula:
w=log(n/df);
where n is the frequency of the term in the agent text data, and df is the number of documents in a preset large-scale corpus that contain the term.
The plurality of specified features includes:
The number a1 of keywords common to the question statement and the answer statement;
The length a2 of the keywords common to the question statement and the answer statement;
The number a3 of terms common to the question statement and the answer statement;
The length a4 of the terms common to the question statement and the answer statement;
The length a5 of the question statement;
The length a6 of the answer statement;
The topic similarity a7 between the question statement and the answer statement;
The syntactic similarity a8 between the question statement and the answer statement;
Performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question-answer pair includes:
Applying a multivariate logistic regression model to the plurality of specified features to perform a weighted regression fit, obtaining the matching degree g(z) of each candidate question-answer pair by the following formula:
g(z)=1/(1+e^z), where e is the natural constant;
where z=a1*x1+a2*x2+a3*x3+a4*x4+a5*x5+a6*x6+a7*x7+a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8 respectively.
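The feature vector a1 to a8 for one candidate pair could be assembled as sketched below. The text does not define how "length" is measured; here it is taken as a character count (summed over the common keywords or terms for a2 and a4), which is an assumption. The topic and syntactic similarities a7 and a8 require separate models, so they are passed in as precomputed numbers.

```python
def pair_features(q_terms, a_terms, q_keywords, a_keywords,
                  topic_sim, syntax_sim):
    """Return [a1, ..., a8] for one candidate question-answer pair."""
    common_keywords = set(q_keywords) & set(a_keywords)
    common_terms = set(q_terms) & set(a_terms)
    return [
        len(common_keywords),                  # a1: number of common keywords
        sum(len(k) for k in common_keywords),  # a2: their total length
        len(common_terms),                     # a3: number of common terms
        sum(len(t) for t in common_terms),     # a4: their total length
        sum(len(t) for t in q_terms),          # a5: question length
        sum(len(t) for t in a_terms),          # a6: answer length
        topic_sim,                             # a7: topic similarity
        syntax_sim,                            # a8: syntactic similarity
    ]
```

Because a5 and a6 are on a much larger scale than a7 and a8, the feature weights (or a normalization step, not described in the text) have to compensate for the difference.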
Before the target question-answer pairs are added to the Q&A knowledge base, the operations further include:
Merging, among the target question-answer pairs, the answer statements that correspond to the same question statement, so that no two target question-answer pairs contain the same question statement and each target question-answer pair includes one or more answer statements.
Before the target question-answer pairs are added to the Q&A knowledge base, the operations further include:
Applying a naive Bayes classifier to the question statement and the answer statement of each target question-answer pair, to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;
Filtering out target question-answer pairs whose question statement is not an interrogative sentence, as well as those whose answer statement is not a declarative sentence, to obtain the target question-answer pairs to be added to the Q&A knowledge base.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as that of the question-and-answer data processing method and the electronic device 1 described above, and is not repeated here.
It should be noted that, in this document, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, device, article, or method that includes it.
From the description of the embodiments above, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the better implementation. On that understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium as described above and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A question-and-answer data processing method, characterized in that the method comprises:
    Acquisition step: obtaining agent text data from a database, the agent text data including the question-and-answer data recorded during question-and-answer sessions between all customers and customer service agents;
    Grouping step: dividing the agent text data into several groups of question-and-answer records, each group belonging to one customer and the corresponding customer service agent, and each group including several question statements from the customer and several answer statements from the customer service agent;
    Sorting step: arranging the question statements and answer statements in each group of question-and-answer records in the chronological order in which they were recorded, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the question statements and answer statements of each group in a one-question-one-answer form, thereby obtaining several candidate question-answer pairs, each candidate question-answer pair including one question statement and one corresponding answer statement;
    Extraction step: performing word segmentation on the question statement and the answer statement in each candidate question-answer pair to obtain individual terms, counting the frequency with which each term appears in the agent text data, calculating a term weight for each term from its counted frequency, and taking terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement;
    Calculation step: computing a plurality of specified features for each candidate question-answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question-answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that candidate question-answer pair;
    Screening step: selecting, from all candidate question-answer pairs, those whose matching degree is higher than a second preset threshold as target question-answer pairs, and adding the target question-answer pairs to a Q&A knowledge base, the Q&A knowledge base being used to respond intelligently to customers based on the question-answer pairs it contains.
  2. The question-and-answer data processing method according to claim 1, characterized in that, after the agent text data is obtained from the database, the method further comprises performing a data cleaning operation on the agent text data, the data cleaning operation comprising:
    Counting the frequency of occurrence of each statement in the agent text data, and taking statements whose frequency is higher than a third preset threshold as high-frequency statements;
    Performing word segmentation on each high-frequency statement, counting the frequency with which each resulting term occurs across all high-frequency statements, and taking terms whose frequency is higher than a fourth preset threshold as greeting words;
    Calculating, for each statement in the agent text data, the proportion of greeting words among all terms of that statement, and taking statements whose proportion is higher than a fifth preset threshold as greeting-type statements;
    Determining short-text statements in the agent text data, the short-text statements including statements whose number of characters is less than a sixth preset threshold and statements containing only digits and/or symbols;
    Filtering the greeting-type statements and the short-text statements out of the agent text data, thereby completing the data cleaning operation.
  3. The question-and-answer data processing method according to claim 1, characterized in that calculating the term weight for each term from its counted frequency comprises:
    Calculating the term weight w using the following formula:
    w=log(n/df);
    where n is the frequency of the term in the agent text data, and df is the number of documents in a preset large-scale corpus that contain the term.
  4. The question-and-answer data processing method according to claim 1, characterized in that the plurality of specified features comprises:
    The number a1 of keywords common to the question statement and the answer statement;
    The length a2 of the keywords common to the question statement and the answer statement;
    The number a3 of terms common to the question statement and the answer statement;
    The length a4 of the terms common to the question statement and the answer statement;
    The length a5 of the question statement;
    The length a6 of the answer statement;
    The topic similarity a7 between the question statement and the answer statement;
    The syntactic similarity a8 between the question statement and the answer statement;
    Performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question-answer pair comprises:
    Applying a multivariate logistic regression model to the plurality of specified features to perform a weighted regression fit, obtaining the matching degree g(z) of each candidate question-answer pair by the following formula:
    g(z)=1/(1+e^z), where e is the natural constant;
    where z=a1*x1+a2*x2+a3*x3+a4*x4+a5*x5+a6*x6+a7*x7+a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8 respectively.
  5. The question-and-answer data processing method according to claim 1, characterized in that, before the target question-answer pairs are added to the Q&A knowledge base, the method further comprises:
    Merging, among the target question-answer pairs, the answer statements that correspond to the same question statement, so that no two target question-answer pairs contain the same question statement and each target question-answer pair includes one or more answer statements.
  6. The question-and-answer data processing method according to claim 1, characterized in that, before the target question-answer pairs are added to the Q&A knowledge base, the method further comprises:
    Applying a naive Bayes classifier to the question statement and the answer statement of each target question-answer pair, to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;
    Filtering out target question-answer pairs whose question statement is not an interrogative sentence, as well as those whose answer statement is not a declarative sentence, to obtain the target question-answer pairs to be added to the Q&A knowledge base.
  7. The question-and-answer data processing method according to any one of claims 1-5, characterized in that, before the target question-answer pairs are added to the Q&A knowledge base, the method further comprises:
    Applying a naive Bayes classifier to the question statement and the answer statement of each target question-answer pair, to determine whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;
    Filtering out target question-answer pairs whose question statement is not an interrogative sentence, as well as those whose answer statement is not a declarative sentence, to obtain the target question-answer pairs to be added to the Q&A knowledge base.
  8. 一种电子装置,包括存储器和处理器,其特征在于,所述存储器中包括问答数据处理程序,该问答数据处理程序被所述处理器执行时实现如下步骤:An electronic device comprising a memory and a processor, wherein the memory includes a question and answer data processing program, and the question and answer data processing program is executed by the processor to implement the following steps:
    an obtaining step: obtaining agent text data from a database, the agent text data including the question and answer data recorded during question-and-answer sessions between all customers and customer service agents;
    a grouping step: dividing the agent text data into several groups of question and answer records, each group belonging to one customer and the corresponding customer service agent, and each group including several question statements from the customer and several answer statements from the agent;
    a sorting step: arranging the question statements and answer statements in each group in the chronological order in which they were recorded, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the remaining statements in each group in a one-question-one-answer form to obtain several candidate question and answer pairs, each candidate question and answer pair including one question statement and one corresponding answer statement;
    an extracting step: performing word segmentation on the question statement and the answer statement of each candidate question and answer pair to obtain terms, counting the word frequency of each term in the agent text data, calculating a term weight for each term from its counted word frequency, and taking the terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement;
    a calculating step: computing a plurality of specified features for each candidate question and answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that candidate question and answer pair;
    a screening step: selecting, from all candidate question and answer pairs, those whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to a question and answer knowledge base, the question and answer knowledge base being used to answer customers intelligently based on the question and answer pairs it contains.
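The grouping and sorting steps can be illustrated with a minimal Python sketch. It is not the claimed implementation: the record layout `(customer_id, timestamp, role, text)`, the role labels `"customer"` and `"agent"`, and the function name `build_candidate_pairs` are all hypothetical, and the pairing rule (a question is paired with the next agent statement; a question superseded by a newer question is dropped) is only one assumed reading of "corresponding".

```python
from collections import defaultdict


def build_candidate_pairs(records):
    """Group records by customer, sort by time, and pair questions with answers.

    records: iterable of (customer_id, timestamp, role, text) tuples, where
    role is "customer" (question statement) or "agent" (answer statement).
    This layout is an assumption made for illustration.
    """
    # Grouping step: one group of statements per customer.
    groups = defaultdict(list)
    for customer_id, timestamp, role, text in records:
        groups[customer_id].append((timestamp, role, text))

    pairs = []
    for statements in groups.values():
        # Sorting step: chronological order within the group.
        statements.sort(key=lambda s: s[0])
        pending_question = None
        for _, role, text in statements:
            if role == "customer":
                # A newer question replaces an unanswered one, which is dropped.
                pending_question = text
            elif pending_question is not None:
                pairs.append((pending_question, text))
                pending_question = None
            # An agent statement with no pending question is dropped.
    return pairs
```

Under this assumed rule, every returned pair holds exactly one question statement and one answer statement, and unmatched statements on either side are discarded.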
  9. The electronic device according to claim 8, wherein, after the agent text data is obtained from the database, the steps further comprise performing a data cleaning operation on the agent text data, the data cleaning operation comprising:
    counting the frequency of occurrence of each statement in the agent text data, and taking the statements whose frequency is higher than a third preset threshold as high-frequency statements;
    performing word segmentation on each high-frequency statement, counting the frequency with which each resulting term occurs across all high-frequency statements, and taking the terms whose frequency is higher than a fourth preset threshold as greeting words;
    for each statement in the agent text data, calculating the proportion of greeting words among all terms of that statement, and taking the statements whose proportion is higher than a fifth preset threshold as greeting statements;
    determining short text statements in the agent text data, the short text statements including statements whose character count is less than a sixth preset threshold and statements containing only digits and/or symbols;
    filtering the greeting statements and the short text statements out of the agent text data, thereby completing the data cleaning operation.
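The data cleaning operation can be sketched in Python as follows. This is an illustrative sketch only: the function name, the parameter names and their default values are hypothetical, whitespace splitting stands in for a real word segmenter, and term frequency is counted once per distinct high-frequency statement, which is an assumption the claim does not settle.

```python
import re
from collections import Counter


def clean_statements(statements, tokenize=str.split,
                     statement_freq_threshold=2,   # "third preset threshold"
                     term_freq_threshold=1,        # "fourth preset threshold"
                     greeting_ratio_threshold=0.4, # "fifth preset threshold"
                     min_chars=4):                 # "sixth preset threshold"
    # High-frequency statements: those occurring more often than the threshold.
    statement_counts = Counter(statements)
    high_freq = [s for s, c in statement_counts.items()
                 if c > statement_freq_threshold]

    # Greeting words: terms frequent across the high-frequency statements.
    term_counts = Counter(t for s in high_freq for t in tokenize(s))
    greeting_words = {t for t, c in term_counts.items()
                      if c > term_freq_threshold}

    def is_greeting(statement):
        terms = tokenize(statement)
        if not terms:
            return False
        ratio = sum(t in greeting_words for t in terms) / len(terms)
        return ratio > greeting_ratio_threshold

    def is_short(statement):
        # Too few characters, or nothing but digits, symbols and whitespace.
        only_digits_or_symbols = re.fullmatch(r"[\W\d_]+", statement) is not None
        return len(statement) < min_chars or only_digits_or_symbols

    return [s for s in statements if not is_greeting(s) and not is_short(s)]
```

For Chinese text, `tokenize` would need to be replaced by an actual word segmentation function; the rest of the sketch does not depend on the language.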
  10. The electronic device according to claim 8, wherein calculating the term weight for each term from its counted word frequency comprises:
    calculating the term weight w using the following formula:
    w = log(n/df);
    where n is the word frequency of the term in the agent text data, and df is the number of documents in a preset large-scale corpus that contain the term.
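The formula w = log(n/df) and the keyword selection it feeds can be sketched in Python. The claim does not state the base of the logarithm, so the natural logarithm is assumed here; the function names and the dictionary inputs are hypothetical, and terms missing from the reference corpus are simply skipped, which is also an assumption.

```python
import math


def term_weight(n, df):
    """w = log(n / df): n is the term's frequency in the agent text data,
    df the number of reference-corpus documents containing the term."""
    if n <= 0 or df <= 0:
        raise ValueError("n and df must be positive")
    return math.log(n / df)  # natural log assumed; the claim gives no base


def select_keywords(term_freq, doc_freq, threshold):
    """Keep terms whose weight exceeds the first preset threshold."""
    return {term for term, n in term_freq.items()
            if term in doc_freq
            and term_weight(n, doc_freq[term]) > threshold}
```

A term that is frequent in the agent data but rare in the general corpus receives a high weight, while a term common everywhere receives a low or negative weight and is not selected as a keyword.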
  11. The electronic device according to claim 8, wherein the plurality of specified features comprise:
    the number of keywords common to the question statement and the answer statement, a1;
    the length of the keywords common to the question statement and the answer statement, a2;
    the number of terms common to the question statement and the answer statement, a3;
    the length of the terms common to the question statement and the answer statement, a4;
    the length of the question statement, a5;
    the length of the answer statement, a6;
    the topic similarity between the question statement and the answer statement, a7;
    the syntactic similarity between the question statement and the answer statement, a8;
    and wherein performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair comprises:
    performing a weighted regression fitting calculation on the plurality of specified features using a multivariate logistic regression model to obtain the matching degree g(z) of each candidate question and answer pair, according to the following formula:
    g(z) = 1/(1 + e^(-z)), where e is the natural constant;
    where z = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8 respectively.
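The weighted calculation can be illustrated with a short Python sketch, assuming the standard logistic (sigmoid) form g(z) = 1/(1 + e^(-z)) applied to the weighted sum z of the eight features. The function name is hypothetical, and the two-branch evaluation is an implementation choice added here to avoid floating-point overflow for large |z|; it is not described in the claim.

```python
import math


def matching_degree(features, weights):
    """Return g(z) = 1 / (1 + e^(-z)) with z = sum(a_i * x_i).

    features: the values a1..a8; weights: the feature weights x1..x8.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(a * x for a, x in zip(features, weights))
    # Evaluate the sigmoid in a numerically stable way for either sign of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

The result always lies between 0 and 1, so it can be compared directly against the second preset threshold of the screening step.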
  12. The electronic device according to claim 8, wherein, before the target question and answer pairs are added to the question and answer knowledge base, the steps further comprise:
    merging, among the target question and answer pairs, the answer statements that correspond to the same question statement, so that the question statements of the target question and answer pairs are all different from one another and each target question and answer pair includes one or more answer statements.
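The merging step can be sketched in Python as below. The sketch treats two question statements as "the same" only when their strings are identical and drops repeated identical answers; both choices, like the function name, are assumptions made for illustration.

```python
def merge_answers(pairs):
    """Merge (question, answer) pairs so each question appears once.

    Returns a dict mapping each distinct question statement to the list of
    its answer statements, in first-seen order.
    """
    merged = {}
    for question, answer in pairs:
        answers = merged.setdefault(question, [])
        if answer not in answers:  # skip exact duplicate answers
            answers.append(answer)
    return merged
```

After merging, every key is unique and every value holds one or more answer statements, matching the structure the claim describes.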
  13. The electronic device according to claim 8, wherein, before the target question and answer pairs are added to the question and answer knowledge base, the steps further comprise:
    using a naive Bayes classifier to determine, for each target question and answer pair, whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;
    filtering out the target question and answer pairs whose question statement is not an interrogative sentence and the target question and answer pairs whose answer statement is not a declarative sentence, to obtain the target question and answer pairs to be added to the question and answer knowledge base.
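The sentence-type filter can be illustrated with a small multinomial naive Bayes classifier written from scratch in Python. This is a sketch under stated assumptions, not the claimed classifier: the class name, the labels `"interrogative"` and `"declarative"`, the add-one (Laplace) smoothing, and the whitespace tokenizer are all hypothetical choices, and a labelled training set is assumed to be available.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentenceClassifier:
    """Multinomial naive Bayes over bag-of-words terms."""

    def __init__(self, tokenize=str.split):
        self.tokenize = tokenize
        self.class_counts = Counter()            # training sentences per label
        self.term_counts = defaultdict(Counter)  # term counts per label
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for term in self.tokenize(text):
                self.term_counts[label][term] += 1
                self.vocabulary.add(term)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.term_counts[label].values()) + len(self.vocabulary)
            for term in self.tokenize(text):
                # Add-one smoothing keeps unseen terms from zeroing the score.
                score += math.log((self.term_counts[label][term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def filter_pairs(pairs, classifier):
    """Keep pairs whose question is interrogative and answer is declarative."""
    return [(q, a) for q, a in pairs
            if classifier.predict(q) == "interrogative"
            and classifier.predict(a) == "declarative"]
```

A pair is discarded as soon as either of its two statements falls into the wrong class, so only well-formed question-then-statement pairs remain for the knowledge base.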
  14. The electronic device according to any one of claims 8 to 12, wherein, before the target question and answer pairs are added to the question and answer knowledge base, the steps further comprise:
    using a naive Bayes classifier to determine, for each target question and answer pair, whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;
    filtering out the target question and answer pairs whose question statement is not an interrogative sentence and the target question and answer pairs whose answer statement is not a declarative sentence, to obtain the target question and answer pairs to be added to the question and answer knowledge base.
  15. A computer-readable storage medium, wherein the computer-readable storage medium includes a question and answer data processing program which, when executed by a processor, implements the following steps:
    an obtaining step: obtaining agent text data from a database, the agent text data including the question and answer data recorded during question-and-answer sessions between all customers and customer service agents;
    a grouping step: dividing the agent text data into several groups of question and answer records, each group belonging to one customer and the corresponding customer service agent, and each group including several question statements from the customer and several answer statements from the agent;
    a sorting step: arranging the question statements and answer statements in each group in the chronological order in which they were recorded, removing question statements that have no corresponding answer statement and answer statements that have no corresponding question statement, and arranging the remaining statements in each group in a one-question-one-answer form to obtain several candidate question and answer pairs, each candidate question and answer pair including one question statement and one corresponding answer statement;
    an extracting step: performing word segmentation on the question statement and the answer statement of each candidate question and answer pair to obtain terms, counting the word frequency of each term in the agent text data, calculating a term weight for each term from its counted word frequency, and taking the terms whose term weight is greater than a first preset threshold as keywords of the corresponding question statement or answer statement;
    a calculating step: computing a plurality of specified features for each candidate question and answer pair, assigning a corresponding preset feature weight to each specified feature, and, for each candidate question and answer pair, performing a weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of that candidate question and answer pair;
    a screening step: selecting, from all candidate question and answer pairs, those whose matching degree is higher than a second preset threshold as target question and answer pairs, and adding the target question and answer pairs to a question and answer knowledge base, the question and answer knowledge base being used to answer customers intelligently based on the question and answer pairs it contains.
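The screening step amounts to a threshold filter followed by an update of the knowledge base. A minimal Python sketch follows; the function name, the `(question, answer, score)` tuple layout, and the use of a plain dictionary as the knowledge base are hypothetical, and the matching degrees are assumed to have been computed beforehand.

```python
def screen_pairs(scored_pairs, threshold, knowledge_base):
    """Add pairs whose matching degree exceeds the threshold to the knowledge base.

    scored_pairs: iterable of (question, answer, matching_degree) tuples.
    knowledge_base: dict mapping a question to a list of answers; it is
    updated in place. Returns the selected target pairs.
    """
    # "Higher than" the second preset threshold: strict comparison.
    targets = [(q, a) for q, a, score in scored_pairs if score > threshold]
    for question, answer in targets:
        knowledge_base.setdefault(question, []).append(answer)
    return targets
```

A pair whose matching degree equals the threshold exactly is not selected, following the "higher than" wording of the step.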
  16. The computer-readable storage medium according to claim 15, wherein, after the agent text data is obtained from the database, the steps further comprise performing a data cleaning operation on the agent text data, the data cleaning operation comprising:
    counting the frequency of occurrence of each statement in the agent text data, and taking the statements whose frequency is higher than a third preset threshold as high-frequency statements;
    performing word segmentation on each high-frequency statement, counting the frequency with which each resulting term occurs across all high-frequency statements, and taking the terms whose frequency is higher than a fourth preset threshold as greeting words;
    for each statement in the agent text data, calculating the proportion of greeting words among all terms of that statement, and taking the statements whose proportion is higher than a fifth preset threshold as greeting statements;
    determining short text statements in the agent text data, the short text statements including statements whose character count is less than a sixth preset threshold and statements containing only digits and/or symbols;
    filtering the greeting statements and the short text statements out of the agent text data, thereby completing the data cleaning operation.
  17. The computer-readable storage medium according to claim 15, wherein calculating the term weight for each term from its counted word frequency comprises:
    calculating the term weight w using the following formula:
    w = log(n/df);
    where n is the word frequency of the term in the agent text data, and df is the number of documents in a preset large-scale corpus that contain the term.
  18. The computer-readable storage medium according to claim 15, wherein the plurality of specified features comprise:
    the number of keywords common to the question statement and the answer statement, a1;
    the length of the keywords common to the question statement and the answer statement, a2;
    the number of terms common to the question statement and the answer statement, a3;
    the length of the terms common to the question statement and the answer statement, a4;
    the length of the question statement, a5;
    the length of the answer statement, a6;
    the topic similarity between the question statement and the answer statement, a7;
    the syntactic similarity between the question statement and the answer statement, a8;
    and wherein performing the weighted calculation on the plurality of specified features using the feature weights to obtain the matching degree of each candidate question and answer pair comprises:
    performing a weighted regression fitting calculation on the plurality of specified features using a multivariate logistic regression model to obtain the matching degree g(z) of each candidate question and answer pair, according to the following formula:
    g(z) = 1/(1 + e^(-z)), where e is the natural constant;
    where z = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + a8*x8, and x1, x2, ..., x8 are the feature weights of a1, a2, ..., a8 respectively.
  19. The computer-readable storage medium according to claim 15, wherein, before the target question and answer pairs are added to the question and answer knowledge base, the steps further comprise:
    merging, among the target question and answer pairs, the answer statements that correspond to the same question statement, so that the question statements of the target question and answer pairs are all different from one another and each target question and answer pair includes one or more answer statements.
  20. The computer-readable storage medium according to claim 15, wherein, before the target question and answer pairs are added to the question and answer knowledge base, the steps further comprise:
    using a naive Bayes classifier to determine, for each target question and answer pair, whether the question statement is an interrogative sentence and whether the answer statement is a declarative sentence;
    filtering out the target question and answer pairs whose question statement is not an interrogative sentence and the target question and answer pairs whose answer statement is not a declarative sentence, to obtain the target question and answer pairs to be added to the question and answer knowledge base.
PCT/CN2018/090642 2018-02-09 2018-06-11 Question and answer data processing method, electronic device and storage medium WO2019153612A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810135748.0A CN108415980A (en) 2018-02-09 2018-02-09 Question and answer data processing method, electronic device and storage medium
CN201810135748.0 2018-02-09

Publications (1)

Publication Number Publication Date
WO2019153612A1 (en) 2019-08-15

Family

ID=63128208

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/090642 WO2019153612A1 (en) 2018-02-09 2018-06-11 Question and answer data processing method, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN108415980A (en)
WO (1) WO2019153612A1 (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109331470B (en) * 2018-08-21 2023-09-26 平安科技(深圳)有限公司 Method, device, equipment and medium for processing answering game based on voice recognition
CN111373395A (en) * 2018-08-31 2020-07-03 北京嘀嘀无限科技发展有限公司 Artificial intelligence system and method based on hierarchical clustering
CN109189803A (en) * 2018-09-03 2019-01-11 深圳前海微众银行股份有限公司 Question and answer are to construction method, device and computer readable storage medium
CN109272129B (en) * 2018-09-20 2022-03-18 重庆先特服务外包产业有限公司 Call center service management system
CN109388697A (en) * 2018-10-10 2019-02-26 成都小多科技有限公司 Handle the method, apparatus and electric terminal of question and answer corpus
CN109657038B (en) * 2018-10-10 2023-04-18 创新先进技术有限公司 Question and answer pair data mining method and device and electronic equipment
CN109543020B (en) * 2018-11-27 2022-11-04 科大讯飞股份有限公司 Query processing method and system
CN110162603B (en) * 2018-11-30 2023-11-14 腾讯科技(深圳)有限公司 Intelligent dialogue method, dynamic storage method and device
CN109800292A (en) * 2018-12-17 2019-05-24 北京百度网讯科技有限公司 The determination method, device and equipment of question and answer matching degree
CN109785099B (en) * 2018-12-27 2021-07-06 大象慧云信息技术有限公司 Method and system for automatically processing service data information
CN109710818B (en) * 2018-12-28 2021-08-06 咪咕文化科技有限公司 Answer weight determination method, answer determination device and storage medium
CN110008320B (en) * 2019-03-05 2023-10-27 创新先进技术有限公司 Method and device for answering user questions
CN110069772B (en) * 2019-03-12 2023-10-20 平安科技(深圳)有限公司 Device, method and storage medium for predicting scoring of question-answer content
CN110188205A (en) * 2019-05-08 2019-08-30 三角兽(北京)科技有限公司 A kind of update method and device of intelligent customer service system knowledge base
CN111984768A (en) * 2019-05-24 2020-11-24 北京京东尚科信息技术有限公司 Corpus processing and question-answer interaction method and device, computer equipment and storage medium
CN111737543A (en) * 2019-05-27 2020-10-02 北京京东尚科信息技术有限公司 Question and answer pair extraction method, device, equipment and storage medium
CN110162614B (en) * 2019-05-29 2021-08-27 腾讯科技(深圳)有限公司 Question information extraction method and device, electronic equipment and storage medium
CN110502623A (en) * 2019-07-08 2019-11-26 平安科技(深圳)有限公司 Intelligent answer method, electronic device, computer equipment and readable storage medium storing program for executing
CN110825860B (en) * 2019-10-24 2022-08-23 厦门快商通科技股份有限公司 Knowledge base question and answer extraction method and system, mobile terminal and storage medium
CN110825863B (en) * 2019-11-11 2021-05-07 腾讯科技(深圳)有限公司 Text pair fusion method and device
CN113051390B (en) * 2019-12-26 2023-09-26 百度在线网络技术(北京)有限公司 Knowledge base construction method, knowledge base construction device, electronic equipment and medium
CN111651567B (en) * 2020-04-16 2023-09-22 北京奇艺世纪科技有限公司 Service question-answer data processing method and device
CN111459917B (en) * 2020-04-17 2023-02-28 支付宝(杭州)信息技术有限公司 Knowledge base management method, device and processing equipment
CN111797214A (en) * 2020-06-24 2020-10-20 深圳壹账通智能科技有限公司 FAQ database-based problem screening method and device, computer equipment and medium
CN111753553B (en) * 2020-07-06 2022-07-05 北京世纪好未来教育科技有限公司 Statement type identification method and device, electronic equipment and storage medium
CN111597321B (en) * 2020-07-08 2024-06-11 腾讯科技(深圳)有限公司 Prediction method and device of answers to questions, storage medium and electronic equipment
CN112148743A (en) * 2020-09-18 2020-12-29 北京达佳互联信息技术有限公司 Method, device, equipment and storage medium for updating intelligent customer service knowledge base
CN112052310A (en) * 2020-09-28 2020-12-08 平安普惠企业管理有限公司 Information acquisition method, device, equipment and storage medium based on big data
CN113535980A (en) * 2021-07-20 2021-10-22 南京市栖霞区民政事务服务中心 Rapid establishing method of intelligent community knowledge base system based on artificial intelligence
CN113990288B (en) * 2021-12-23 2022-05-10 江苏微皓智能科技有限公司 Method for automatically generating and deploying voice synthesis model by voice customer service
CN117056497B (en) * 2023-10-13 2024-01-23 北京睿企信息科技有限公司 LLM-based question and answer method, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786851A (en) * 2014-12-23 2016-07-20 北京奇虎科技有限公司 Question and answer knowledge base construction method as well as search provision method and apparatus
CN106909572A (en) * 2015-12-23 2017-06-30 北京奇虎科技有限公司 A kind of construction method and device of question and answer knowledge base
US20170228372A1 (en) * 2016-02-08 2017-08-10 Taiger Spain Sl System and method for querying questions and answers
CN107220296A (en) * 2017-04-28 2017-09-29 北京拓尔思信息技术股份有限公司 The generation method of question and answer knowledge base, the training method of neutral net and equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411583B (en) * 2010-09-20 2013-09-18 阿里巴巴集团控股有限公司 Method and device for matching texts
CN104679815B (en) * 2014-12-08 2018-02-23 北京云知声信息技术有限公司 It is a kind of to screen question and answer pair and the method and system in real-time update question and answer storehouse
CN107153639A (en) * 2016-03-04 2017-09-12 北大方正集团有限公司 Intelligent answer method and system
CN106649868B (en) * 2016-12-30 2019-03-26 首都师范大学 Question and answer matching process and device
CN107066541A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The processing method and system of customer service question and answer data

Also Published As

Publication number Publication date
CN108415980A (en) 2018-08-17

Similar Documents

Publication Publication Date Title
WO2019153612A1 (en) Question and answer data processing method, electronic device and storage medium
CN108491433B (en) Chat response method, electronic device and storage medium
WO2019153607A1 (en) Intelligent response method, electronic device and storage medium
US11734329B2 (en) System and method for text categorization and sentiment analysis
US11093854B2 (en) Emoji recommendation method and device thereof
US10860654B2 (en) System and method for generating an answer based on clustering and sentence similarity
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
CN106649818B (en) Application search intention identification method and device, application search method and server
WO2020237856A1 (en) Smart question and answer method and apparatus based on knowledge graph, and computer storage medium
EP3540612A1 (en) Cluster processing method and device for questions in automatic question and answering system
CN106776574B (en) User comment text mining method and device
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN107784092A (en) A kind of method, server and computer-readable medium for recommending hot word
CN111125354A (en) Text classification method and device
US20140379719A1 (en) System and method for tagging and searching documents
US9772991B2 (en) Text extraction
CN107885717B (en) Keyword extraction method and device
CN109791570B (en) Efficient and accurate named entity recognition method and device
EP3762876A1 (en) Intelligent knowledge-learning and question-answering
CN112069326A (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN109783612B (en) Report data positioning method and device, storage medium and terminal
JP2017219899A (en) Knowledge search device, knowledge search method and knowledge search program
CN115982346A (en) Question-answer library construction method, terminal device and storage medium
CN110347934B (en) Text data filtering method, device and medium
CN112632264A (en) Intelligent question and answer method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
32PN EP: public notification in the EP bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04.11.2020))
122 EP: PCT application non-entry in European phase (Ref document number: 18905414; Country of ref document: EP; Kind code of ref document: A1)