CN111984768A

CN111984768A - Corpus processing and question-answer interaction method and device, computer equipment and storage medium

Info

Publication number: CN111984768A
Application number: CN201910442283.8A
Authority: CN
Inventors: 王逸凡
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Huijun Technology Co.,Ltd.
Priority date: 2019-05-24
Filing date: 2019-05-24
Publication date: 2020-11-24

Abstract

The embodiment of the invention provides a corpus processing and question-answer interaction method, a corpus processing and question-answer interaction device, computer equipment and a storage medium, wherein the method comprises the following steps: obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence; determining a corresponding screening strategy based on a set window value, and screening the question-answer pair data sequence through the corresponding screening strategy to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value; and determining the relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting relevant question-answer pairs from the question-answer pair sequences according to the relevance parameter.

Description

Corpus processing and question-answer interaction method and device, computer equipment and storage medium

Technical Field

The invention relates to the technical field of computer processing, in particular to a corpus processing and question-answer interaction method, a corpus processing and question-answer interaction device, computer equipment and a storage medium.

Background

In consumer, service and other industries, customer service personnel answer the various inquiries raised by users. Enterprises with more users generally need more customer service staff, so intelligent question-answering systems have been developed to free up manpower and reduce operating costs; how such a dialogue system is built differs from one service scenario to another. An Information Retrieval (IR) based dialogue system searches a large number of question-answer pairs (QA-pairs) for the known question (query, Q) most similar to the user's question and outputs the corresponding answer (answer, A) to the user. Acquiring high-quality QA-pairs from the corpus is therefore a basic condition for implementing a high-quality dialogue system.

At present, QA-pairs are mined in two main ways. On one hand, adjacent Q and A are assumed by default to form a QA-pair, that is, an adjacent Q and A are considered to form a correct question-answer pair. On the other hand, question-answer pairs are screened by a similarity measure based on keyword co-occurrence, on the assumption that the question and answer of a reasonable QA-pair share a certain number of keywords. However, the QA-pairs mined in these ways generalize poorly and cannot answer questions accurately enough.

Disclosure of Invention

In view of the above, the main objective of the present invention is to provide a corpus processing and question-answer interaction method, device, computer device and storage medium, which can accurately extract high-quality question-answer pairs from question-answer interaction data.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

In a first aspect of the embodiments of the present invention, a corpus processing method is provided, including:

obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;

determining a corresponding screening strategy based on a set window value, and screening the question-answer pair data sequence through the corresponding screening strategy to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;

and determining the relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting relevant question-answer pairs from the question-answer pair sequences according to the relevance parameter.

In a second aspect of the embodiments of the present invention, there is provided a corpus processing apparatus, including:

the first acquisition module is used for acquiring question-answer interaction data and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;

the processing module is used for determining a corresponding screening strategy based on a set window value, screening the question-answer pair data sequence through the corresponding screening strategy and obtaining a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;

and the first determining module is used for determining the association degree parameter of each question-answer pair sequence in the question-answer pair set and selecting an associated question-answer pair from the question-answer pair sequence according to the association degree parameter.

In a third aspect of the embodiments of the present invention, a question-answer interaction method is provided, including:

acquiring question data, and determining a corresponding associated question-answer pair according to the question data; wherein, the associated question-answer pair is determined according to the corpus processing method provided by any embodiment of the invention;

and determining matched answer data to return based on the associated question-answer pairs.

In a fourth aspect of the embodiments of the present invention, there is provided a question-answer interaction apparatus, including:

the second acquisition module is used for acquiring question data and determining a corresponding associated question-answer pair according to the question data; wherein, the associated question-answer pair is determined according to the corpus processing method provided by any embodiment of the invention;

and the second determination module is used for determining matched answer data to return based on the associated question-answer pair.

In a fifth aspect of the embodiments of the present invention, there is provided a computer device, including: a processor and a memory for storing a computer program capable of running on the processor;

when the processor is used for running the computer program, the corpus processing method provided by any embodiment of the invention or the question-answer interaction method provided by any embodiment of the invention is realized.

A sixth aspect of the embodiments of the present invention provides a computer storage medium, where a computer program is stored in the computer storage medium, and when the computer program is executed by a processor, the corpus processing method according to any embodiment of the present invention or the question-answer interaction method according to any embodiment of the present invention is implemented.

According to the corpus processing and question-answer interaction method, device, computer equipment and storage medium provided by the embodiments of the invention, question-answer interaction data is acquired and preprocessed to obtain a question-answer pair data sequence; a corresponding screening strategy is determined based on a set window value, and the question-answer pair data sequence is screened through that strategy to obtain a question-answer pair set formed by question-answer pair sequences whose pattern length matches the window value. By setting a variable window in this way, question-answer pair sequences with higher relevance can be matched from multiple rounds of conversation, which addresses interruptions or discontinuities in the conversation. A relevance parameter is then determined for each question-answer pair sequence in the question-answer pair set, and associated question-answer pairs are selected from the question-answer pair sequences according to the relevance parameter, so that high-quality question-answer pairs are accurately mined, associated question-answer pairs spanning multiple rounds of conversation can be obtained, and the problem of answers that do not address the question is avoided.

Drawings

FIG. 1 is a schematic flow chart of a corpus processing method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a question-answer interaction method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a corpus processing apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a question-answer interaction device according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention;

FIG. 6 is a schematic flow chart of a corpus processing method according to another embodiment of the present invention.

Detailed Description

The present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

Before further detailed description of the present invention, terms and expressions referred to in the embodiments of the present invention are described, and the terms and expressions referred to in the embodiments of the present invention are applicable to the following explanations.

As shown in fig. 1, an embodiment of the present invention provides a corpus processing method, including the following steps:

step 101: obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;

The question-answer interaction data refers to the session content between customer service and a user, and may specifically be a chat record between customer service and the user in a user service system, for example in the service industry or e-commerce.

Preprocessing the question-answer interaction data to obtain a question-answer pair data sequence means distinguishing the utterances of the user from those of customer service to obtain the corresponding questions Q and answers A, where an utterance of the user is a Q and an utterance of customer service is an A. The session between the user and customer service is generally arranged in chronological order to form the question-answer pair data sequence, for example QA…, QQA… or QAA…; generally, a Q is taken as the beginning of the sequence.
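The role-labelling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `to_role_sequence`, the `(timestamp, speaker, text)` message layout and the speaker labels `"user"` and `"agent"` are hypothetical, and dropping any customer-service utterances that precede the first question is one possible way to make the sequence begin with Q.

```python
from typing import List, Tuple

# Hypothetical message layout: (timestamp, speaker, text),
# where speaker is "user" or "agent" (customer service).
Message = Tuple[float, str, str]


def to_role_sequence(messages: List[Message]) -> str:
    """Return the chat as a string of Q (user) and A (customer service) labels."""
    # Arrange the session in chronological order.
    ordered = sorted(messages, key=lambda m: m[0])
    roles = "".join("Q" if speaker == "user" else "A" for _, speaker, _ in ordered)
    # Assumption: start the sequence at the first question.
    first_q = roles.find("Q")
    return roles[first_q:] if first_q != -1 else ""


if __name__ == "__main__":
    chat = [
        (3.0, "user", "second question"),
        (1.0, "agent", "greeting"),
        (2.0, "user", "first question"),
        (4.0, "agent", "answer"),
    ]
    print(to_role_sequence(chat))
```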

Step 102: determining a corresponding screening strategy based on a set window value, and screening the question-answer pair data sequence through the corresponding screening strategy to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;

The window value is the total number of Q and A included in a question-answer pair sequence. For example, when the window value is set to 2, the corresponding screening strategy screens the question-answer pair data sequence for one question followed by one answer, yielding question-answer pairs of the QA pattern; when the window value is set to 3, question-answer pairs of patterns such as QAA or QQA can be screened out; when the window value is 4, question-answer pairs of patterns such as QAAA, QQQA or QQAA can be screened out. By analogy, the larger the window, the longer the multi-round question-answer pairs that can be screened out. By setting different window values, a question-answer pair set formed by the question-answer pair sequences matching each window value is obtained.

Step 103: and determining the relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting relevant question-answer pairs from the question-answer pair sequences according to the relevance parameter.

The relevance parameter is a parameter that characterizes the association between the questions and the answers in a question-answer pair sequence and is used to calculate how well they match. Selecting associated question-answer pairs according to the relevance parameter means obtaining the matching degree of the questions and answers in a question-answer pair sequence based on the relevance parameter and comparing it with a corresponding threshold: when the matching degree meets the set threshold, the question-answer pair sequence is selected as an associated question-answer pair.

In the above embodiment of the present application, question-answer interaction data is acquired and preprocessed to obtain a question-answer pair data sequence; a corresponding screening strategy is determined based on a set window value, and the question-answer pair data sequence is screened through that strategy to obtain a question-answer pair set formed by question-answer pair sequences whose pattern length matches the window value. By setting a variable window in this way, question-answer pair sequences generated over multiple rounds of conversation are obtained, which addresses interruptions or discontinuities in the conversation. A relevance parameter is then determined for each question-answer pair sequence in the question-answer pair set, and associated question-answer pairs are selected from the question-answer pair sequences according to the relevance parameter, so that high-quality question-answer pairs are accurately mined, associated question-answer pairs spanning multiple rounds of conversation can be obtained, and the problem of answers that do not address the question is avoided.

In an embodiment, the preprocessing the question-answer interaction data to obtain a question-answer pair data sequence includes:

processing the question-answer interaction data based on a preset normalization processing mode; the normalization processing mode comprises at least one of the following: word segmentation processing, stop word processing and bag-of-words model processing;

and encoding the normalized question-answer interaction data to obtain a question-answer pair data sequence.

The question-answer interactive data are processed based on a preset normalization processing mode, namely the question-answer interactive data are preprocessed in a normalization processing mode, so that unnecessary noise is removed, the screening result of the question-answer interactive data is optimized, and the screening quality of the question-answer pairs is improved.

Specifically, if word segmentation processing is adopted, existing word segmentation methods can be classified into three main categories: word segmentation based on character string matching, word segmentation based on understanding, and word segmentation based on statistics. For example, for the question "Dear, hello, my home is on the 3rd floor; does installation need an installation fee?", the text obtained by word segmentation is "dear/hello/home/3/building/installation/need/installation fee/do".

Stop word processing is then applied to the segmented text, which means removing noise such as certain special characters and words, for example "dear", "hello" and "do"; after stop word processing, the segmented text above becomes "home/3/building/installation/need/installation fee".

Bag-of-words model processing means that the order in which words appear is ignored and only whether each word appears is considered. For example, if user A asks "3rd floor, does installation need an installation fee?", the text after bag-of-words processing is "3/building/installation/need/installation fee"; if user B asks "Installation on the 3rd floor, is an installation fee needed?", the text after bag-of-words processing is likewise "3/building/installation/need/installation fee".

Encoding the normalized question-answer interaction data to obtain a question-answer pair data sequence means that the normalized question-answer interaction data is encoded with serial numbers, so that each normalized text can be represented by a short code. For example, "3/building/installation/need/installation fee" is denoted as Q₁, where "Q" indicates a question asked by the user and "1" is the serial number of the normalized text; all user questions whose normalized text is "3/building/installation/need/installation fee" can then be represented simply as Q₁.

In the above embodiment, the question-answer interaction data is processed based on a preset normalization processing mode comprising at least one of word segmentation processing, stop word processing and bag-of-words model processing, so that two similar questions or answers can be given the same representation, which increases the number of question-answer pairs mined. The normalized question-answer interaction data is then encoded to obtain a question-answer pair data sequence, so that the normalized questions or answers can be expressed simply, which avoids the influence of noise and improves the quality of screening on the question-answer pair data sequence.
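The normalization and encoding steps of this embodiment might look roughly like the sketch below. It assumes the text has already been segmented into tokens (a real system would call a word segmenter first), uses a small illustrative stop word set, and models the bag of words as a sorted set of tokens. The names `STOP_WORDS`, `normalize` and `Encoder` are hypothetical and not taken from the patent.

```python
from typing import Dict, Iterable, Tuple

# Illustrative stop words only; a real list would be much larger.
STOP_WORDS = {"dear", "hello", "do"}


def normalize(tokens: Iterable[str]) -> Tuple[str, ...]:
    """Drop stop words, then ignore word order and repetition (bag of words)."""
    kept = {t for t in tokens if t not in STOP_WORDS}
    return tuple(sorted(kept))


class Encoder:
    """Assigns short codes such as Q1 or A1 to normalized utterances."""

    def __init__(self) -> None:
        self._codes: Dict[Tuple[str, Tuple[str, ...]], str] = {}
        self._next = {"Q": 0, "A": 0}

    def encode(self, role: str, tokens: Iterable[str]) -> str:
        key = (role, normalize(tokens))
        if key not in self._codes:
            # First time this normalized text is seen: issue a new serial number.
            self._next[role] += 1
            self._codes[key] = f"{role}{self._next[role]}"
        return self._codes[key]


if __name__ == "__main__":
    enc = Encoder()
    user_a = ["3", "building", "installation", "need", "installation fee"]
    user_b = ["dear", "installation", "3", "building", "need", "installation fee", "do"]
    # Both questions normalize to the same bag of words, so they share a code.
    print(enc.encode("Q", user_a), enc.encode("Q", user_b))
```

Because two differently worded questions collapse to one code, later frequency-based statistics are computed over more occurrences of each code.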

In an embodiment, after selecting an associated question-answer pair from the question-answer pair sequence according to the relevancy parameter, the method includes:

and determining an inverse normalization processing mode corresponding to the normalization mode, and processing the association question-answer pair based on the inverse normalization processing mode to obtain a target association question-answer pair set.

Determining the inverse normalization processing mode corresponding to the normalization mode means performing a restoration operation on the associated question-answer pair, based on the correspondence between the question-answer interaction data and the question-answer pair sequences obtained from it, to recover the complete question and answer corresponding to the associated question-answer pair. For example, with Q₁ being "3/building/installation/need/installation fee" and A₁ being "we/all/package/installation", inverse normalization recovers Q as "Dear, hello, my home is on the 3rd floor; does installation need an installation fee?" and A as "Dear, installation is included for all of our orders", that is, a high-quality target associated question-answer pair is obtained.

In the above embodiment, an inverse normalization processing manner corresponding to the normalization manner is determined, and the associated question-answer pairs are processed based on the inverse normalization processing manner to obtain a target associated question-answer pair set, so that the question-answer interaction data is normalized, filtered, and finally subjected to inverse normalization processing to obtain a high-quality target associated question-answer pair set.
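One simple way to make the normalization reversible is to remember, for every code, the raw utterances that produced it. The sketch below illustrates that idea only; the class name `ReversibleCodebook`, its methods, and the choice of the first recorded utterance as the representative text are assumptions, since the text does not specify how a representative is chosen.

```python
from typing import Dict, List, Tuple


class ReversibleCodebook:
    """Remembers which raw utterances were mapped to each code (e.g. Q1, A1)."""

    def __init__(self) -> None:
        self._originals: Dict[str, List[str]] = {}

    def record(self, code: str, raw_text: str) -> None:
        texts = self._originals.setdefault(code, [])
        if raw_text not in texts:
            texts.append(raw_text)

    def restore(self, pair: Tuple[str, ...]) -> Tuple[str, ...]:
        # Assumption: the first raw utterance seen for a code represents it.
        # Raises KeyError for a code that was never recorded.
        return tuple(self._originals[code][0] for code in pair)


if __name__ == "__main__":
    book = ReversibleCodebook()
    book.record("Q1", "Dear, hello, my home is on the 3rd floor; does installation need an installation fee?")
    book.record("Q1", "3rd floor, does installation need an installation fee?")
    book.record("A1", "Dear, installation is included for all of our orders")
    print(book.restore(("Q1", "A1")))
```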

In one embodiment, the obtaining a question-answer pair set formed by a question-answer pair sequence with a pattern length matching the window value includes:

sequentially selecting question-answer interaction data segments with the length equal to the window value based on the window value, and respectively forming question-answer pair sequences with the style lengths matched with the window value according to the question-answer interaction data segments; each question-answer pair sequence comprises at least one question word to be analyzed and at least one answer word to be analyzed;

and forming a question-answer pair set according to the question-answer pair sequence.

Sequentially selecting question-answer interaction data segments with a length equal to the window value means determining the selection length from the window value and then taking question-answer interaction data segments of that length from the question-answer pair data sequence to form the corresponding question-answer pair sequences. For example, when the window value is set to 2, the corresponding screening strategy screens the question-answer pair data sequence for one question followed by one answer, yielding question-answer interaction data segments of the QA pattern; when the window value is set to 3, question-answer interaction data segments with a pattern length of 3, such as QAA or QQA, can be screened out; when the window value is 4, question-answer interaction data segments with a pattern length of 4, such as QAAA, QQQA or QQAA, can be screened out. By analogy, the larger the window value, the longer the multi-round question-answer pairs that can be screened out. By setting different window values, a question-answer pair set formed by the question-answer pair sequences matching each window value is obtained.

Each question-answer pair sequence comprises at least one question word to be analyzed and at least one answer word to be analyzed, and the question-answer pair sequence comprising at least one Q and at least one A is obtained by starting from a question and through a set window value.

In the above embodiment, question-answer interactive data segments with the length equal to the window value are sequentially selected based on the window value, and question-answer pair sequences with pattern lengths matched with the window value are respectively formed according to the question-answer interactive data segments; forming a question-answer pair set according to the question-answer pair sequence; therefore, by setting the variable window, the question-answer pair sequence matched with the window value is obtained, the question-answer pair sequence generated by multiple sessions is obtained, and the problem of interruption or discontinuity in the sessions is solved.
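The window-based screening can be sketched as a sliding window over the encoded sequence. In this illustration the function name `window_segments` is hypothetical, each code is assumed to begin with its role letter (Q or A), and only segments of the form "one or more Q followed by one or more A" are kept. That restriction matches the example patterns given (QA, QAA, QQA, QQAA and so on) but is an assumption; the text itself only requires that a segment start from a question and contain at least one Q and one A.

```python
import re
from typing import List, Sequence, Tuple

# Assumed pattern shape: questions first, then answers.
_PATTERN = re.compile(r"Q+A+")


def window_segments(codes: Sequence[str], window: int) -> List[Tuple[str, ...]]:
    """Return every contiguous segment of length `window` matching Q+A+."""
    if window < 2:
        raise ValueError("window must cover at least one Q and one A")
    segments = []
    for start in range(len(codes) - window + 1):
        segment = tuple(codes[start:start + window])
        roles = "".join(code[0] for code in segment)  # e.g. "QQA"
        if _PATTERN.fullmatch(roles):
            segments.append(segment)
    return segments


if __name__ == "__main__":
    sequence = ["Q1", "A1", "Q2", "Q3", "A2", "A3"]
    for w in (2, 3, 4):
        print(w, window_segments(sequence, w))
```

Running the function once per window value and pooling the results gives a question-answer pair set that covers both single-round and multi-round exchanges.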

In an embodiment, the determining a relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting a relevant question-answer pair from the question-answer pair sequence according to the relevance parameter includes:

determining a relevancy parameter of each question-answer pair sequence in the question-answer pair set and a threshold corresponding to the relevancy parameter; the relevancy parameter comprises at least one of the following: a degree of freedom parameter, a closeness parameter, a repetition parameter;

and selecting the question-answer pair sequences whose relevance parameter meets the threshold as associated question-answer pairs.

The relevance parameter comprises at least one of the following: a degree of freedom parameter, a closeness parameter, a repetition parameter. For example, when the relevance parameter is the repetition parameter with a threshold of 5, a question-answer pair sequence such as "QA" is determined to be an associated question-answer pair when it appears more than 5 times in the question-answer pair set.

In the above embodiment, a relevance parameter of each question-answer pair sequence in the question-answer pair set and a threshold corresponding to the relevance parameter are determined, and the question-answer pair sequences whose relevance parameter meets the threshold are selected as associated question-answer pairs. In this way, high-quality associated question-answer pairs are accurately mined from the question-answer pair sequences in the question-answer pair set through the relevance parameters and their thresholds, and the problem of answers that do not address the question is avoided.
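Since one or several relevance parameters may be used, the selection step can be viewed as applying a list of (scoring function, threshold) pairs to each candidate. The following sketch assumes that "meeting the threshold" means strictly exceeding it and that all configured parameters must pass; both are interpretations rather than statements from the text, and the name `select_associated` is hypothetical.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Candidate = Tuple[str, ...]
Scorer = Callable[[Candidate], float]


def select_associated(
    candidates: Iterable[Candidate],
    criteria: Sequence[Tuple[Scorer, float]],
) -> List[Candidate]:
    """Keep candidates whose score exceeds the threshold for every criterion."""
    return [
        seq
        for seq in candidates
        if all(score(seq) > threshold for score, threshold in criteria)
    ]


if __name__ == "__main__":
    # Toy repetition counts standing in for a real statistic.
    counts = {("Q1", "A1"): 7, ("Q1", "A2"): 3}
    print(select_associated(counts, [(lambda s: counts[s], 5)]))
```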

In an embodiment, when the relevancy parameter is a degree of freedom parameter, the determining the relevancy parameter of each question-answer pair sequence in the question-answer pair set, and selecting a relevant question-answer pair from the question-answer pair sequence according to the relevancy parameter includes:

acquiring left and right adjacent question-answer pairs adjacent to each question-answer pair sequence to obtain a left question-answer pair and a right question-answer pair;

determining a left entropy value and a right entropy value of the question-answer pair sequence based on the left question-answer pair and the right question-answer pair respectively, and determining the degree of freedom parameter according to the left entropy value and the right entropy value based on a set condition;

and when the degree of freedom parameter exceeds a set first threshold value, determining that the question-answer pair sequence is an associated question-answer pair.

Here, the relevance parameter is a degree of freedom parameter. Specifically, the degree of freedom parameter may be an entropy value of the question-answer pair sequence. The entropy value indicates how random the sets of left-adjacent and right-adjacent sentences of the question-answer pair sequence are: the larger the left and right entropy values, the more varied the sequences that appear to the left or right of the question-answer pair sequence, and the more likely it is to be a reasonable question-answer pair sequence.

Take the question-answer pair sequence pattern "Q₁A₁" with a window value of 2 as an example. All Q₁A₁ sequences in the question-answer pair data sequence are screened out, and the sequences adjacent to each occurrence on the left and right are recorded. For example, suppose that in the question-answer pair data sequence the two sequences {Q₋₁, Q₋₂} appear on the left of Q₁A₁ and the two sequences {Q₂, A₂} appear on its right. The entropy values of the left side and the right side are then calculated according to formulas (1) and (2), respectively:

$$-E_L(Q_1A_1)_{\text{left}} = P(Q_{-1}Q_1A_1 \mid Q_1A_1)\log_2 P(Q_{-1}Q_1A_1 \mid Q_1A_1) + P(Q_{-2}Q_1A_1 \mid Q_1A_1)\log_2 P(Q_{-2}Q_1A_1 \mid Q_1A_1) \tag{1}$$

$$-E_L(Q_1A_1)_{\text{right}} = P(Q_1A_1Q_2 \mid Q_1A_1)\log_2 P(Q_1A_1Q_2 \mid Q_1A_1) + P(Q_1A_1A_2 \mid Q_1A_1)\log_2 P(Q_1A_1A_2 \mid Q_1A_1) \tag{2}$$

where $E_L(Q_1A_1)_{\text{left}}$ is the left entropy value of Q₁A₁, $E_L(Q_1A_1)_{\text{right}}$ is the right entropy value of Q₁A₁, and each conditional probability is the proportion of occurrences of Q₁A₁ that have the given neighbor. The lower of the two values is taken as the degree of freedom of Q₁A₁, see formula (3):

$$E_L(Q_1A_1) = \min\left(E_L(Q_1A_1)_{\text{left}},\ E_L(Q_1A_1)_{\text{right}}\right) \tag{3}$$

See Table 1 below: multiple sequences appear on both the left and the right of Q₁A₁, so the entropy values calculated by the formulas are large.

TABLE 1

See Table 2 below: multiple question-answer pair sequences appear on the left of Q₁A₁, but only one sequence, namely A₂, appears on its right. The entropy value obtained according to formulas (1) to (3) is therefore small, and the degree of freedom parameter is low. If the entropy value is lower than a preset entropy threshold, namely the first threshold, Q₁A₁ can be considered not to be a reasonable question-answer pair.

TABLE 2

In the above embodiment, by obtaining the degree of freedom parameter, when the degree of freedom parameter exceeds the set first threshold, it is determined that the question-answer pair sequence is an associated question-answer pair, so that the degree of freedom of the question-answer pair sequence can be measured, if a certain question-answer sequence can appear in multiple context questions and answers, the degree of freedom is considered to be higher, and a higher degree of freedom makes it more likely to be a reasonable question-answer sequence pair, i.e., an associated question-answer pair, so that the associated question-answer pair is more accurately selected.
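The degree of freedom computation, taking the smaller of the left and right neighbor entropies as in formulas (1) to (3), can be sketched as follows. This is a minimal illustration under assumptions: the data is a flat list of codes, a neighbor is the single utterance immediately before or after an occurrence, and the names `neighbor_entropy` and `degree_of_freedom` are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def neighbor_entropy(neighbors: Counter) -> float:
    """Shannon entropy (base 2) of the observed neighbor distribution."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in neighbors.values())


def degree_of_freedom(data: Sequence[str], target: Tuple[str, ...]) -> float:
    """Minimum of the left and right neighbor entropies of `target` in `data`."""
    width = len(target)
    left, right = Counter(), Counter()
    for i in range(len(data) - width + 1):
        if tuple(data[i:i + width]) == target:
            if i > 0:
                left[data[i - 1]] += 1           # utterance just before
            if i + width < len(data):
                right[data[i + width]] += 1      # utterance just after
    return min(neighbor_entropy(left), neighbor_entropy(right))


if __name__ == "__main__":
    varied = ["Qa", "Q1", "A1", "Q2", "Qb", "Q1", "A1", "A2"]
    fixed_right = ["Qa", "Q1", "A1", "A2", "Qb", "Q1", "A1", "A2"]
    print(degree_of_freedom(varied, ("Q1", "A1")))
    print(degree_of_freedom(fixed_right, ("Q1", "A1")))
```

In the second list the sequence Q1 A1 is always followed by A2, so its right entropy and therefore its degree of freedom is zero, which mirrors the situation described for Table 2.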

In an embodiment, when the relevancy parameter is a closeness parameter, the determining the relevancy parameter of each question-answer pair sequence in the question-answer pair set, and selecting a relevant question-answer pair from the question-answer pair sequence according to the relevancy parameter includes:

dividing each question-answer pair sequence into a first part and a second part, wherein the first part at least comprises a question word to be analyzed;

respectively determining the probabilities with which the first part, the second part and the question-answer pair sequence occur in the question-answer pair set, to obtain the occurrence probability of the first part, the occurrence probability of the second part and the occurrence probability of the question-answer pair sequence;

determining the closeness parameter based on the first portion occurrence probability, the second portion occurrence probability, and the occurrence probability of the sequence of question-answer pairs;

and when the closeness parameter exceeds a set second threshold value, determining that the question-answer pair sequence is an associated question-answer pair.

Here, the relevance parameter is a closeness parameter. Specifically, the closeness parameter may be pointwise mutual information (PMI, written pmi below), which is widely used in the machine learning literature to measure the correlation between two variables, such as two words or two sentences. Specifically, see formula (4):

$$\mathrm{pmi}(x, y) = \log\frac{p(x, y)}{p(x)\,p(y)} \tag{4}$$

where x and y respectively represent the two parts of the sequence. For example, for a sequence of the pattern "QA", x is Q and y is A; for the pattern "QAA", either x is Q and y is AA, or x is QA and y is A. p(x, y) denotes the frequency with which the sequence appears in the total sequence set E, and p(x) and p(y) denote the frequencies with which x and y appear; see, for example, Table 3.

TABLE 3

Here, a user question Q and a customer-service answer A are, after normalization, denoted as the sequence Q₁A₁; the same normalized question followed by a different answer A (for example, "please place the order as soon as possible") is denoted as the sequence Q₁A₂. Table 3 shows the associated statistics and calculated values; the calculated pmi(Q₁, A₁) is greater than pmi(Q₁, A₂). It can be seen that although A₂ occurs on its own more frequently than A₁, the co-occurrence probability of Q₁A₁ is much larger than that of Q₁A₂; that is, the former is more closely bound and is therefore more likely to be a reasonable associated question-answer pair. When the pmi value of a question-answer pair sequence is lower than the second threshold, the sequence can be considered unable to form an associated question-answer pair. In addition, when there are multiple possible choices of x and y, such as x being Q and y being AA, or x being QA and y being A, for the pattern "QAA", then pmi = max(pmi(Q, AA), pmi(QA, A)), that is, the maximum pmi value is taken.

In the above embodiment, the closeness parameter is obtained, and when it exceeds the set second threshold, the question-answer pair sequence is determined to be an associated question-answer pair. The closeness of the question-answer pair sequence can thus be measured: the higher the closeness, the more likely the sequence is a reasonable question-answer pair, i.e., an associated question-answer pair, so that associated question-answer pairs are selected more accurately.
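The closeness computation described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the natural logarithm, the function names `pmi` and `closeness`, and all probability values in `prob` are assumptions chosen only to mirror the Q₁A₁ / Q₁A₂ comparison and the "QAA" maximum rule.

```python
import math

def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise mutual information of two parts x and y (natural log assumed)."""
    return math.log(p_xy / (p_x * p_y))

def closeness(seq: tuple, prob: dict) -> float:
    """Largest pmi over every split of seq into a first part x and a second part y."""
    return max(
        pmi(prob[seq], prob[seq[:cut]], prob[seq[cut:]])
        for cut in range(1, len(seq))
    )

# Hypothetical occurrence probabilities in the total sequence set E.
prob = {
    ("Q1",): 0.02,
    ("A1",): 0.01,
    ("A2",): 0.05,            # A2 alone is more frequent than A1
    ("Q1", "A1"): 0.008,      # but Q1A1 co-occurs far more often than Q1A2
    ("Q1", "A2"): 0.002,
    ("A1", "A2"): 0.007,
    ("Q1", "A1", "A2"): 0.006,
}

second_threshold = 1.0        # hypothetical value
tight = closeness(("Q1", "A1"), prob)
loose = closeness(("Q1", "A2"), prob)
qaa = closeness(("Q1", "A1", "A2"), prob)   # max(pmi(Q, AA), pmi(QA, A))
is_associated = tight > second_threshold
```

With these illustrative numbers Q₁A₁ scores higher than Q₁A₂ even though A₂ is the more frequent answer on its own, which is the behaviour the paragraph above describes.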

It should be noted that judging the quality of a question-answer pair by pmi alone is sometimes not rigorous, because in some cases a sequence is not necessarily a good QA pair even though its pmi value is large, as shown in Table 4.

Q₁	1 p/1.5 p/Difference/what
A₁	1 p/Utility/area/10/15/sq
A₂	1.5 p/fit/area/15/25/square meter

TABLE 4

Here, Q₁A₁ appears many times in the question-answer pair data sequence E and its pmi value is large, but it is not a reasonable question-answer pair because the reply is incomplete: A₂ must be added to complete the question-answer pair. The screening mechanism therefore needs to be refined by combining the degree of freedom parameter, so that associated question-answer pairs are selected more accurately.
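The role of the degree of freedom parameter in the Table 4 situation can be illustrated with a small sketch. It rests on several assumptions that this passage does not confirm: neighbours are taken as the single element immediately to the left and right of each occurrence, entropy is Shannon entropy with the natural logarithm, and the degree of freedom is taken as the smaller of the left and right entropy values. The data sequence is invented.

```python
import math
from collections import Counter

def entropy(neighbors: list) -> float:
    """Shannon entropy (natural log) of the neighbour distribution."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return sum((n / total) * math.log(total / n) for n in counts.values())

def degree_of_freedom(data: list, seq: tuple) -> float:
    """Assumed definition: min(left entropy, right entropy) of seq within data."""
    k = len(seq)
    left, right = [], []
    for i in range(len(data) - k + 1):
        if tuple(data[i:i + k]) == seq:
            if i > 0:
                left.append(data[i - 1])      # element just before the occurrence
            if i + k < len(data):
                right.append(data[i + k])     # element just after the occurrence
    return min(entropy(left), entropy(right))

# Invented data sequence in which A2 always follows Q1A1.
data = ["Q2", "Q1", "A1", "A2", "Q3", "Q1", "A1", "A2", "Q4", "Q1", "A1", "A2"]

incomplete = degree_of_freedom(data, ("Q1", "A1"))        # right neighbour is always A2
complete = degree_of_freedom(data, ("Q1", "A1", "A2"))    # neighbours vary on both sides
```

Because Q₁A₁ is always followed by A₂ in this data, its right entropy is zero and it would fail any positive first threshold, while Q₁A₁A₂ has varied neighbours on both sides and would pass.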

In an embodiment, when the relevancy parameter is a repetition parameter, the determining the relevancy parameter of each question-answer pair sequence in the question-answer pair set, and selecting an associated question-answer pair from the question-answer pair sequence according to the relevancy parameter includes:

acquiring the number of occurrences of the question-answer pair sequence in the question-answer pair set, and determining that the question-answer pair sequence is an associated question-answer pair when the number of occurrences exceeds a set third threshold.

Here, taking the question-answer pair sequence Q₁A₁ and a third threshold of 10 as an example, when the number of occurrences of Q₁A₁ in the question-answer pair set exceeds 10, Q₁A₁ is determined to be an associated question-answer pair.

In the above embodiment, the number of occurrences of the question-answer pair sequence in the question-answer pair set is calculated, and when the number of occurrences exceeds a third threshold, it is determined that the question-answer pair sequence is an associated question-answer pair.
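A minimal sketch of the repetition check follows, using the threshold of 10 from the example above. The function name `frequent_sequences` and the contents of `qa_set` are illustrative assumptions.

```python
from collections import Counter

def frequent_sequences(qa_set: list, third_threshold: int) -> set:
    """Keep sequences whose occurrence count strictly exceeds the third threshold."""
    counts = Counter(qa_set)
    return {seq for seq, n in counts.items() if n > third_threshold}

# Invented question-answer pair set: Q1A1 occurs 11 times, Q1A2 exactly 10 times.
qa_set = [("Q1", "A1")] * 11 + [("Q1", "A2")] * 10

associated = frequent_sequences(qa_set, third_threshold=10)
```

"Exceeds" is read here as a strict comparison, so a sequence occurring exactly 10 times is not selected.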

In another embodiment, as shown in fig. 2, there is also provided a question-answer interaction method, including the steps of:

step 201: acquiring question data, and determining a corresponding associated question-answer pair according to the question data; wherein, the associated question-answer pair is determined by the corpus processing method according to any embodiment of the present invention;

step 202: and determining matched answer data to return based on the associated question-answer pairs.

In the above embodiment, the question data is obtained, and then the matched answer data is determined and returned according to the associated question-answer pair determined by the corpus processing method, so that the accuracy of the answer is improved and the problem of answers that do not address the question is avoided.
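Steps 201 and 202 can be sketched as a simple retrieval over the associated question-answer pairs. The matching strategy is not specified in this passage, so the whitespace tokenization and Jaccard token overlap used here are assumptions, as are the function names and the example pairs.

```python
from typing import List, Optional, Tuple

def jaccard(a: set, b: set) -> float:
    """Token overlap between two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def answer(question: str, pairs: List[Tuple[str, str]]) -> Optional[str]:
    """Step 201: find the closest known question; step 202: return its answer."""
    tokens = set(question.lower().split())
    best_score, best_answer = 0.0, None
    for known_question, known_answer in pairs:
        score = jaccard(tokens, set(known_question.lower().split()))
        if score > best_score:
            best_score, best_answer = score, known_answer
    return best_answer   # None when no known question shares any token

# Invented associated question-answer pairs.
pairs = [
    ("what is the difference between 1 p and 1.5 p",
     "1 p suits 10 to 15 square meters and 1.5 p suits 15 to 25 square meters"),
    ("does installation on floor 3 need an installation fee",
     "please check the installation policy on the product page"),
]

reply = answer("difference between 1 p and 1.5 p", pairs)
no_reply = answer("hello", pairs)
```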

In another embodiment, as shown in fig. 3, there is also provided a corpus processing apparatus, including:

a first obtaining module 11, configured to obtain question-answer interaction data, and preprocess the question-answer interaction data to obtain a question-answer pair data sequence;

a processing module 12, configured to determine a corresponding screening policy based on a set window value, and screen the question-answer pair data sequence through the corresponding screening policy to obtain a question-answer pair set formed by question-answer pair sequences with a pattern length matched with the window value;

a first determining module 13, configured to determine a relevance parameter of each question-answer pair sequence in the question-answer pair set, and select a relevant question-answer pair from the question-answer pair sequence according to the relevance parameter.

Optionally, the first obtaining module 11 is further configured to process the question-answer interaction data based on a preset normalization processing manner; the normalization processing mode comprises at least one of the following modes: word segmentation processing, stop word processing and word bag model processing;

Optionally, the first determining module 13 is further configured to determine an inverse normalization processing manner corresponding to the normalization manner, and process the associated question-answer pair based on the inverse normalization processing manner to obtain a target associated question-answer pair set.

Optionally, the processing module 12 is further configured to sequentially select question-answer interaction data segments with lengths equal to the window values based on the window values, and form question-answer pair sequences with style lengths matched with the window values according to the question-answer interaction data segments; each question-answer pair sequence comprises at least one question word to be analyzed and at least one answer word to be analyzed;

Optionally, the first determining module 13 is further configured to determine a relevancy parameter of each question-answer pair sequence in the question-answer pair set and a threshold corresponding to the relevancy parameter; the relevancy parameter comprises at least one of the following: a degree of freedom parameter, a closeness parameter, a repetition parameter;

Optionally, the first determining module 13 is further configured to obtain the question-answer pairs adjacent to each question-answer pair sequence on its left and on its right, so as to obtain left question-answer pairs and right question-answer pairs;

determining a left entropy value and a right entropy value of the question-answer pair sequence based on the left question-answer pairs and the right question-answer pairs respectively, and determining the degree of freedom parameter according to the left entropy value and the right entropy value;

Optionally, the first determining module 13 is further configured to divide each question-answer pair sequence into a first part and a second part, where the first part at least contains one question word to be analyzed;

Optionally, the first determining module 13 is further configured to obtain the number of occurrences of the question-answer pair sequence in the question-answer pair set, and determine that the question-answer pair sequence is an associated question-answer pair when the number of occurrences exceeds a third threshold.

In another embodiment, as shown in fig. 4, there is also provided a question-answer interaction device, including:

a second obtaining module 21, configured to acquire question data and determine, according to the question data, a corresponding associated question-answer pair; wherein, the associated question-answer pair is determined according to the corpus processing method provided by any embodiment of the invention;

and a second determining module 22, configured to determine matched answer data to return based on the associated question-answer pair.

In the above embodiment, the question data is obtained, and then the matched answer data is determined and returned according to the associated question-answer pair determined by the corpus processing method, so that the accuracy of the answer is improved and the problem of answers that do not address the question is avoided.

In another embodiment, as shown in fig. 5, there is also provided a computer apparatus including: at least one processor 210 and a memory 211 for storing computer programs capable of running on the processor 210. The processor 210 illustrated in fig. 5 does not indicate that the number of processors is one; it only indicates the position of the processor relative to other devices, and in practical applications the number of processors may be one or more. Similarly, the memory 211 illustrated in fig. 5 only indicates the position of the memory relative to other devices, and in practical applications the number of memories may be one or more.

Wherein, when the processor 210 is used for running the computer program, the following steps are executed:

In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:

acquiring question data, and determining a corresponding associated question-answer pair according to the question data; wherein, the associated question-answer pair is determined by the corpus processing method according to any embodiment of the present invention;

The computer device further includes: at least one network interface 212. The various components on the transmitting side are coupled together by a bus system 213. It will be appreciated that the bus system 213 is used to enable connection and communication among these components. The bus system 213 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are all labeled as bus system 213 in fig. 5.

The memory 211 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 211 described in the embodiments of the invention is intended to comprise, without being limited to, these and any other suitable types of memory.

The memory 211 in the embodiment of the present invention is used to store various types of data to support the operation of the transmitting end. Examples of such data include: any computer program for operating on the sender side, such as an operating system and application programs. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs may include various application programs for implementing various application services. Here, the program that implements the method of the embodiment of the present invention may be included in an application program.

The embodiment further provides a computer storage medium, for example, including a memory 211 storing a computer program, which can be executed by a processor 210 in the transmitting end to perform the steps of the foregoing method. The computer storage medium can be FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM; or various devices including one or any combination of the above memories, such as a smart phone, a tablet computer, a notebook computer, and the like. A computer storage medium having a computer program stored therein, the computer program, when executed by a processor, performing the steps of:

In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:

Referring to fig. 6, the working process of the corpus processing method of the present application is described below in more detail by way of a specific example. The corpus processing method comprises the following steps:

step S11: inputting a question-answer pair data sequence, a window value, a first threshold value, a second threshold value and a third threshold value;

before inputting the question-answer pair data sequence, obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain the question-answer pair data sequence. The window value is the total number of Q and A included in the question-answer pair, for example, if the window value is set to 2, the question-answer pair in the QA mode is obtained; with the window value set to 3, a question-and-answer pair such as QAA or QQA can be obtained.

Here, the first threshold is the threshold set for the degree of freedom parameter, the second threshold is the threshold set for the closeness parameter, and the third threshold is the threshold set for the repetition parameter.
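The effect of the window value on screening can be sketched as a sliding window over the encoded data sequence. The allowed patterns for window values 2 and 3 are taken from the examples above; the table name `ALLOWED_PATTERNS`, the function `screen`, the convention that the first character of each code gives its role, and the sample data are assumptions made for illustration.

```python
# Patterns named in the description: window 2 gives "QA"; window 3 gives "QAA" or "QQA".
ALLOWED_PATTERNS = {2: {"QA"}, 3: {"QAA", "QQA"}}

def screen(data: list, window: int) -> list:
    """Return every segment of length `window` whose Q/A pattern is allowed."""
    allowed = ALLOWED_PATTERNS[window]
    selected = []
    for i in range(len(data) - window + 1):
        segment = tuple(data[i:i + window])
        pattern = "".join(code[0] for code in segment)   # "Q1" -> "Q", "A2" -> "A"
        if pattern in allowed:
            selected.append(segment)
    return selected

# Invented encoded question-answer pair data sequence.
data = ["Q1", "A1", "A2", "Q2", "Q3", "A3"]

pairs_w2 = screen(data, 2)
pairs_w3 = screen(data, 3)
```

Segments such as ("A1", "A2") or ("A2", "Q2") are dropped because their patterns contain no question followed by an answer in the required form.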

Step S12: normalization processing;

here, the normalization processing method includes performing word segmentation processing, stop word processing, and bag-of-word model processing on the question-answer interaction data, respectively.

Encoding the normalized question-answer interaction data to obtain the question-answer pair data sequence means that each piece of normalized question-answer interaction data is encoded with a serial number, so that it can be represented by a shorter code. For example, "3/building/install/need/install fee" is denoted as Q1, where "Q" indicates a question asked by the user and "1" is the number of the normalized text, so that all user questions whose normalized text is "3/building/install/need/install fee" can be represented compactly as Q1.
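The normalization and serial-number encoding described above might look like the following sketch. Whitespace splitting stands in for real word segmentation, the stop word list is invented, and sorting the remaining tokens is one assumed way to obtain an order-insensitive bag-of-words key; the class name `Encoder` is likewise hypothetical.

```python
from collections import Counter

STOP_WORDS = {"is", "for", "the", "a", "please"}   # invented stop word list

def normalize(text: str) -> tuple:
    """Segment, drop stop words, and build an order-insensitive bag-of-words key."""
    tokens = text.lower().replace("?", " ").split()   # stand-in for word segmentation
    return tuple(sorted(t for t in tokens if t not in STOP_WORDS))

class Encoder:
    """Assigns codes such as Q1, Q2, A1 to distinct normalized texts."""

    def __init__(self):
        self.codes = {}        # (role, normalized text) -> code
        self.originals = {}    # code -> first original text, for inverse normalization
        self.counter = Counter()

    def encode(self, role: str, text: str) -> str:
        key = (role, normalize(text))
        if key not in self.codes:
            self.counter[role] += 1
            code = f"{role}{self.counter[role]}"
            self.codes[key] = code
            self.originals[code] = text
        return self.codes[key]

enc = Encoder()
c1 = enc.encode("Q", "Is there an installation fee for floor 3?")
c2 = enc.encode("Q", "for floor 3 is there an installation fee")
c3 = enc.encode("A", "No fee")
c4 = enc.encode("Q", "What is the difference?")
```

Two differently worded questions that normalize to the same token set share one code, which is what allows their occurrences to be counted together in the later steps.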

Step S13: screening based on a set window value to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;

Step S14: determining a relevancy parameter of each question-answer pair sequence in the question-answer pair set;

Here, the relevance parameter is a parameter set for the relevance between the questions and the answers in a question-answer pair sequence, and is used to calculate the degree of matching between the questions and the answers in the sequence. Selecting an associated question-answer pair from the question-answer pair sequences according to the relevance parameter means obtaining the degree of matching between the questions and the answers based on the relevance parameter. Specifically, the degree of freedom parameter, the closeness parameter and the repetition parameter of each question-answer pair sequence are calculated respectively.

Step S15: determining whether the degree of freedom parameter exceeds a set first threshold;

here, when the degree of freedom parameter exceeds the set first threshold, step S16 is executed; otherwise, determining that the sequence of question-answer pairs is not an associated question-answer pair.

Step S16: determining whether the closeness parameter exceeds a set second threshold;

here, when the closeness parameter exceeds the set second threshold, step S17 is performed; otherwise, determining that the sequence of question-answer pairs is not an associated question-answer pair.

Step S17: determining whether the repetition parameter exceeds a set third threshold;

here, when the repetition parameter exceeds the set third threshold, step S18 is performed; otherwise, determining that the sequence of question-answer pairs is not an associated question-answer pair.

Step S18: obtaining an associated question-answer pair;

here, when the degree of freedom parameter exceeds a set first threshold, the closeness parameter exceeds a set second threshold, and the repetition parameter exceeds a set third threshold, it is determined that the question-answer pair sequence is an associated question-answer pair.

Step S19: performing inverse normalization processing to obtain a target associated question-answer pair.

Here, the associated question-answer pair is restored to a target associated question-answer pair based on the inverse normalization processing corresponding to the normalization processing, and finally a target associated question-answer pair set is obtained.
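Steps S15 to S19 amount to a cascade of three threshold checks followed by decoding. The sketch below assumes the three parameters have already been computed for each sequence; the `Thresholds` container, the `select_associated` function, and every number and text in the example are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    freedom: float      # first threshold (degree of freedom parameter)
    closeness: float    # second threshold (closeness parameter)
    repetition: int     # third threshold (repetition parameter)

def select_associated(stats: dict, th: Thresholds, originals: dict) -> list:
    """stats maps a sequence to (degree of freedom, closeness, occurrence count)."""
    selected = []
    for seq, (freedom, closeness, count) in stats.items():
        if freedom <= th.freedom:        # S15 fails: not an associated pair
            continue
        if closeness <= th.closeness:    # S16 fails
            continue
        if count <= th.repetition:       # S17 fails
            continue
        # S18: associated pair found; S19: map codes back to original text.
        selected.append(tuple(originals[code] for code in seq))
    return selected

# Invented parameter values and original texts.
stats = {
    ("Q1", "A1"): (0.0, 3.7, 30),          # tight and frequent, but incomplete reply
    ("Q1", "A1", "A2"): (0.69, 3.2, 25),
    ("Q1", "A2"): (0.9, 0.5, 12),          # too loose
    ("Q2", "A3"): (1.1, 2.5, 4),           # too rare
}
originals = {
    "Q1": "What is the difference between 1 p and 1.5 p?",
    "A1": "1 p suits 10 to 15 square meters.",
    "A2": "1.5 p suits 15 to 25 square meters.",
    "Q2": "Does floor 3 need an installation fee?",
    "A3": "Please check the product page.",
}

targets = select_associated(stats, Thresholds(0.5, 1.0, 10), originals)
```

Only the sequence that clears all three thresholds survives, and it is returned in its original wording rather than as codes.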

The above-described embodiments solve at least the following problems:

(1) by setting a variable window value, question-answer pair sequences matched with the window value are obtained, including question-answer pair sequences generated over multiple rounds of conversation, which solves the problem of interruption or discontinuity in conversations;

(2) by screening the question-answer pair sequences with a combination of the degree of freedom parameter, the closeness parameter and the repetition parameter to obtain associated question-answer pairs, the content and number of the screened question-answer pairs can be greatly enriched, and the problem in conventional methods that key information does not co-occur between questions and answers can be solved.

The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.

Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.

The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A corpus processing method, comprising:

2. The corpus processing method according to claim 1, wherein said preprocessing said question-answer interaction data to obtain a question-answer pair data sequence comprises:

3. The corpus processing method according to claim 2, wherein said selecting a relevant question-answer pair from said question-answer pair sequence according to said relevance parameter comprises:

4. The corpus processing method according to claim 1, wherein said obtaining a set of question-answer pairs formed by a sequence of question-answer pairs whose pattern lengths match said window values comprises:

5. The corpus processing method according to claim 1, wherein said determining a relevance parameter of each question-answer pair sequence in said question-answer pair set, and selecting a relevant question-answer pair from said question-answer pair sequence according to said relevance parameter comprises:

6. The corpus processing method according to claim 1, wherein when the relevancy parameter is a degree of freedom parameter, the determining the relevancy parameter of each question-answer pair sequence in the question-answer pair set, and selecting a relevant question-answer pair from the question-answer pair sequence according to the relevancy parameter comprises:

7. The corpus processing method according to claim 1, wherein when the relevancy parameter is a closeness parameter, the determining of the relevancy parameter of each question-answer pair sequence in the question-answer pair set and the selecting of a relevant question-answer pair from the question-answer pair sequence according to the relevancy parameter comprises:

8. The corpus processing method according to claim 1, wherein when the relevancy parameter is a repetition parameter, the determining of the relevancy parameter of each question-answer pair sequence in the question-answer pair set and the selecting of a relevant question-answer pair from the question-answer pair sequence according to the relevancy parameter comprises:

9. A corpus processing apparatus, characterized in that the apparatus comprises:

10. A question-answer interaction method is characterized by comprising the following steps:

acquiring question data, and determining a corresponding associated question-answer pair according to the question data; wherein the associated question-answer pair is determined by the corpus processing method according to any one of claims 1 to 8;

11. A question-answer interaction device, comprising:

the second acquisition module is used for determining a corresponding associated question-answer pair according to the question data; wherein the associated question-answer pair is determined by the corpus processing method according to any one of claims 1 to 8;

12. A computer device, comprising: a processor and a memory for storing a computer program capable of running on the processor;

when the processor is used to run the computer program, the corpus processing method according to any one of claims 1 to 8 is implemented, or the question-answer interaction method according to claim 10 is implemented.

13. A computer storage medium, wherein a computer program is stored in the computer storage medium, and wherein the computer program, when executed by a processor, implements the corpus processing method according to any one of claims 1 to 8 or implements the question-answer interaction method according to claim 10.