CN111984768A - Corpus processing and question-answer interaction method and device, computer equipment and storage medium - Google Patents

Corpus processing and question-answer interaction method and device, computer equipment and storage medium

Info

Publication number
CN111984768A
Authority
CN
China
Prior art keywords
question
answer
answer pair
sequence
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910442283.8A
Other languages
Chinese (zh)
Inventor
王逸凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huijun Technology Co.,Ltd.
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201910442283.8A
Publication of CN111984768A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/01Customer relationship services

Abstract

The embodiment of the invention provides a corpus processing and question-answer interaction method, a corpus processing and question-answer interaction device, computer equipment and a storage medium, wherein the method comprises the following steps: obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence; determining a corresponding screening strategy based on a set window value, and screening the question-answer pair data sequence through the corresponding screening strategy to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value; and determining the relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting relevant question-answer pairs from the question-answer pair sequences according to the relevance parameter.

Description

Corpus processing and question-answer interaction method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of computer processing, in particular to a corpus processing and question-answer interaction method, a corpus processing and question-answer interaction device, computer equipment and a storage medium.
Background
It is known that customer service staff answer the various inquiries raised by users in the consumer, service and other industries. Enterprises with more users generally need more customer service staff, and intelligent question-answering systems have been developed to free up manpower and reduce operating costs; the way a dialog system is built differs according to the service scenario. An Information Retrieval (IR) based dialog system can search a large number of question-answer pairs (QA-pairs) for the known question (query, Q) most similar to the user's question and output the corresponding answer (answer, A) to the user. Acquiring high-quality QA-pairs from the corpus is therefore a basic condition for implementing a high-quality dialog system.
At present, QA-pairs are mined in two ways. On the one hand, adjacent Q and A are assumed by default to form a QA-pair, that is, an adjacent Q and A are considered to form a correct question-answer pair. On the other hand, question-answer pairs are screened with a similarity measure based on keyword co-occurrence, on the assumption that a reasonable QA-pair shares a certain number of keywords between the question and the answer. However, QA-pairs mined in these ways generalize poorly, and questions cannot be answered accurately enough.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a corpus processing and question-answer interaction method, device, computer device and storage medium, which can accurately extract high-quality question-answer pairs from question-answer interaction data.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
In a first aspect of the embodiments of the present invention, a corpus processing method is provided, including:
obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;
determining a corresponding screening strategy based on a set window value, and screening the question-answer pair data sequence through the corresponding screening strategy to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;
and determining the relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting associated question-answer pairs from the question-answer pair sequences according to the relevance parameter.
In a second aspect of the embodiments of the present invention, there is provided a corpus processing apparatus, including:
the first acquisition module is used for acquiring question-answer interaction data and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;
the processing module is used for determining a corresponding screening strategy based on a set window value, screening the question-answer pair data sequence through the corresponding screening strategy and obtaining a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;
and the first determining module is used for determining the association degree parameter of each question-answer pair sequence in the question-answer pair set and selecting an associated question-answer pair from the question-answer pair sequence according to the association degree parameter.
In a third aspect of the embodiments of the present invention, a question-answer interaction method is provided, including:
acquiring question data, and determining a corresponding associated question-answer pair according to the question data; wherein, the associated question-answer pair is determined according to the corpus processing method provided by any embodiment of the invention;
and determining, based on the associated question-answer pair, the matched answer data to return.
In a fourth aspect of the embodiments of the present invention, there is provided a question-answer interaction apparatus, including:
the second acquisition module is used for acquiring question data and determining a corresponding associated question-answer pair according to the question data; wherein, the associated question-answer pair is determined according to the corpus processing method provided by any embodiment of the invention;
and the second determination module is used for determining, based on the associated question-answer pair, the matched answer data to return.
In a fifth aspect of the embodiments of the present invention, there is provided a computer device, including: a processor and a memory for storing a computer program capable of running on the processor;
when the processor is used for running the computer program, the corpus processing method provided by any embodiment of the invention or the question-answer interaction method provided by any embodiment of the invention is realized.
A sixth aspect of the embodiments of the present invention provides a computer storage medium, where a computer program is stored in the computer storage medium, and when the computer program is executed by a processor, the corpus processing method according to any embodiment of the present invention or the question-answer interaction method according to any embodiment of the present invention is implemented.
With the corpus processing and question-answer interaction method, device, computer equipment and storage medium provided by the embodiments of the invention, question-answer interaction data is acquired and preprocessed to obtain a question-answer pair data sequence. A corresponding screening strategy is determined based on a set window value, and the question-answer pair data sequence is screened through that strategy to obtain a question-answer pair set formed by question-answer pair sequences whose pattern length matches the window value. By setting a variable window in this way, question-answer pair sequences matching the window value are obtained, sequences with higher relevance can be matched from multiple rounds of conversation, and the problem of interruption or discontinuity in the conversation is addressed. The relevance parameter of each question-answer pair sequence in the set is then determined, and associated question-answer pairs are selected from the sequences according to that parameter. High-quality question-answer pairs are thus accurately mined from the set, associated question-answer pairs produced across multiple rounds of conversation can be obtained, and the problem of answers that do not address the question is avoided.
Drawings
FIG. 1 is a schematic flow chart of a corpus processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a question-answer interaction method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a corpus processing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a question-answer interaction device according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of a corpus processing method according to another embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Before further detailed description of the present invention, terms and expressions referred to in the embodiments of the present invention are described, and the terms and expressions referred to in the embodiments of the present invention are applicable to the following explanations.
As shown in fig. 1, an embodiment of the present invention provides a corpus processing method, including the following steps:
step 101: obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;
The question-answer interaction data refers to the session content between customer service and the user, and specifically may be a chat record between customer service and the user in a user service system such as that of the service industry or e-commerce.
Preprocessing the question-answer interaction data to obtain a question-answer pair data sequence means distinguishing the utterances of the user from those of customer service to obtain the corresponding questions Q and answers A, where an utterance of the user is a Q and an utterance of customer service is an A. The utterances are generally ordered by time, so that the session between the user and customer service forms a question-answer pair data sequence such as QA…, QQA… or QAA…; generally, a Q is taken as the beginning of the sequence.
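The labeling step described above can be sketched in Python as follows. This is only an illustrative sketch: it assumes each utterance arrives as a (timestamp, speaker, text) tuple, and the speaker labels "user" and "agent" as well as the function names are hypothetical and do not come from the embodiment.

```python
from typing import List, Tuple

Turn = Tuple[float, str, str]  # (timestamp, speaker, text); layout is an assumption


def to_qa_sequence(turns: List[Turn]) -> List[Tuple[str, str]]:
    """Order utterances by time and label user turns 'Q' and agent turns 'A'."""
    ordered = sorted(turns, key=lambda turn: turn[0])
    labeled = [("Q" if speaker == "user" else "A", text)
               for _, speaker, text in ordered]
    # The description takes a Q as the beginning of the sequence,
    # so leading agent utterances are dropped here.
    while labeled and labeled[0][0] != "Q":
        labeled.pop(0)
    return labeled


def pattern(sequence: List[Tuple[str, str]]) -> str:
    """Collapse a labeled sequence into its pattern string, e.g. 'QQA'."""
    return "".join(label for label, _ in sequence)
```

For a session in which the agent greets first, the greeting is dropped and the remaining turns keep their time order.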
Step 102: determining a corresponding screening strategy based on a set window value, and screening the question-answer pair data sequence through the corresponding screening strategy to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;
The window value is the total number of Q and A items included in a question-answer pair sequence. For example, if the window value is set to 2, the corresponding screening strategy screens the question-answer pair data sequence for one question followed by one answer, yielding question-answer pairs of the QA pattern. If the window value is set to 3, question-answer pairs of patterns such as QAA or QQA can be screened out. If the window value is 4, question-answer pairs of patterns such as QAAA, QQQA or QQAA can be screened out. By analogy, a larger window screens out longer multi-turn question-answer pairs. For each set window value, a question-answer pair set formed by the question-answer pair sequences matching that window value is obtained; different sets are obtained by setting different window values.
Step 103: and determining the relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting relevant question-answer pairs from the question-answer pair sequences according to the relevance parameter.
The relevance parameter is a parameter set for the relevance between the questions and answers in a question-answer pair sequence and is used to calculate how well the questions and answers in the sequence match. Selecting associated question-answer pairs from the question-answer pair sequences according to the relevance parameter means obtaining the matching degree of the questions and answers in each sequence based on the relevance parameter and comparing it with a corresponding threshold: when the matching degree meets the set threshold, the question-answer pair sequence is selected as an associated question-answer pair.
In the above embodiment of the present application, question-answer interaction data is acquired and preprocessed to obtain a question-answer pair data sequence. A corresponding screening strategy is determined based on a set window value, and the question-answer pair data sequence is screened through that strategy to obtain a question-answer pair set formed by question-answer pair sequences whose pattern length matches the window value. By setting a variable window in this way, question-answer pair sequences matching the window value, including sequences produced across multiple rounds of conversation, are obtained, and the problem of interruption or discontinuity in the conversation is addressed. The relevance parameter of each question-answer pair sequence in the set is then determined, and associated question-answer pairs are selected from the sequences according to that parameter. High-quality question-answer pairs are thus accurately mined from the set, associated question-answer pairs produced across multiple rounds of conversation can be obtained, and the problem of answers that do not address the question is avoided.
In an embodiment, the preprocessing the question-answer interaction data to obtain a question-answer pair data sequence includes:
processing the question-answer interaction data based on a preset normalization processing mode; the normalization processing mode comprises at least one of the following: word segmentation processing, stop word processing and bag-of-words model processing;
and encoding the normalized question-answer interaction data to obtain a question-answer pair data sequence.
Processing the question-answer interaction data based on a preset normalization processing mode removes unnecessary noise, which optimizes the screening result of the question-answer interaction data and improves the screening quality of the question-answer pairs.
Specifically, for word segmentation processing, existing word segmentation methods fall into three main categories: word segmentation based on character string matching, word segmentation based on understanding, and word segmentation based on statistics. For example, for the question "Dear, hello, my home is on the 3rd floor, does installation need an installation fee?", the text obtained by word segmentation is "dear/hello/home/3/floor/installation/need/installation fee/does".
Stop word processing is then applied to the segmented text, which means removing the noise of certain special words such as "dear", "hello" and "does". After stop word processing, the segmented text above becomes "home/3/floor/installation/need/installation fee".
Bag-of-words processing means that the order in which words appear is ignored and only whether each word appears is considered. For example, if user A asks "3rd floor, does installation need an installation fee?", the text after bag-of-words processing is "3/floor/installation/need/installation fee"; if user B asks "Installing on the 3rd floor, is an installation fee needed or not?", the text after bag-of-words processing is likewise "3/floor/installation/need/installation fee".
Encoding the normalized question-answer interaction data to obtain a question-answer pair data sequence means that the normalized data is encoded with serial numbers, so that each normalized text is represented by a short code. For example, "3/floor/installation/need/installation fee" is denoted as Q1, where "Q" indicates a question asked by the user and "1" is the number of the normalized text, so that every user question whose normalized text is "3/floor/installation/need/installation fee" can be represented compactly by Q1.
In the above embodiment, the question-answer interaction data is processed based on a preset normalization processing mode, which comprises at least one of the following: word segmentation processing, stop word processing and bag-of-words model processing. In this way, two similar questions or answers can be given the same representation, which increases the number of question-answer pairs mined. The normalized question-answer interaction data is then encoded to obtain a question-answer pair data sequence, so that the normalized questions or answers can be expressed compactly, which avoids the influence of noise and improves the screening quality of the question-answer pair data sequence.
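The normalization and encoding described in this embodiment can be sketched as follows. The sketch assumes that word segmentation has already produced a list of tokens; the stop word list, the use of a frozenset as the bag-of-words representation, and the class and function names are all illustrative assumptions rather than details given in the embodiment.

```python
from typing import Dict, FrozenSet, List

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"dear", "hello", "does"}


def normalize(tokens: List[str]) -> FrozenSet[str]:
    """Drop stop words and ignore word order (bag of words as a frozenset)."""
    return frozenset(t.lower() for t in tokens if t.lower() not in STOP_WORDS)


class Encoder:
    """Assigns a short serial code such as 'Q1' to each distinct normalized text."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix  # 'Q' for user questions, 'A' for customer service answers
        self.codes: Dict[FrozenSet[str], str] = {}

    def encode(self, tokens: List[str]) -> str:
        key = normalize(tokens)
        if key not in self.codes:
            # The next serial number is the count of codes issued so far plus one.
            self.codes[key] = f"{self.prefix}{len(self.codes) + 1}"
        return self.codes[key]
```

Two questions that differ only in word order or in stop words receive the same code, which is what allows similar utterances to be counted together during screening.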
In an embodiment, after selecting an associated question-answer pair from the question-answer pair sequence according to the relevancy parameter, the method includes:
and determining an inverse normalization processing mode corresponding to the normalization mode, and processing the association question-answer pair based on the inverse normalization processing mode to obtain a target association question-answer pair set.
Determining the inverse normalization processing mode corresponding to the normalization mode means performing a restoration operation on the associated question-answer pair, based on the correspondence between the question-answer interaction data and the question-answer pair sequences obtained from it, to recover the complete question and answer corresponding to the associated question-answer pair. For example, if Q1 is "3/floor/installation/need/installation fee" and A1 is "we/all/include/installation", inverse normalization traces these back to Q: "Dear, hello, my home is on the 3rd floor, does installation need an installation fee?" and A: "Dear, we always include installation", which gives a high-quality target associated question-answer pair.
In the above embodiment, an inverse normalization processing manner corresponding to the normalization manner is determined, and the associated question-answer pairs are processed based on the inverse normalization processing manner to obtain a target associated question-answer pair set, so that the question-answer interaction data is normalized, filtered, and finally subjected to inverse normalization processing to obtain a high-quality target associated question-answer pair set.
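One simple way to make the encoding reversible is to remember, for every code, the original utterances that were mapped to it. The sketch below assumes this bookkeeping approach; the class name `Codebook` and its methods are hypothetical, and the embodiment does not state how the inverse mapping is stored.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class Codebook:
    """Remembers which original utterances were encoded as each code."""

    def __init__(self) -> None:
        self.originals: Dict[str, List[str]] = defaultdict(list)

    def remember(self, code: str, original_text: str) -> None:
        # Store each distinct original utterance once per code.
        if original_text not in self.originals[code]:
            self.originals[code].append(original_text)

    def restore(self, pair: Tuple[str, ...]) -> List[List[str]]:
        """Return the original utterances for each code in an associated pair."""
        return [list(self.originals.get(code, [])) for code in pair]
```

Calling `restore(("Q1", "A1"))` then yields the full question and answer texts behind the selected codes.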
In one embodiment, the obtaining a question-answer pair set formed by a question-answer pair sequence with a pattern length matching the window value includes:
sequentially selecting question-answer interaction data segments with the length equal to the window value based on the window value, and respectively forming question-answer pair sequences with the style lengths matched with the window value according to the question-answer interaction data segments; each question-answer pair sequence comprises at least one question word to be analyzed and at least one answer word to be analyzed;
and forming a question-answer pair set according to the question-answer pair sequence.
Sequentially selecting question-answer interaction data segments with the length equal to the window value based on the window value means that the selection length is determined from the window value and question-answer interaction data segments of that length are taken from the question-answer pair data sequence to form the corresponding question-answer pair sequences. For example, if the window value is set to 2, the corresponding screening strategy screens the question-answer pair data sequence for one question followed by one answer, yielding question-answer interaction data segments of the QA pattern. If the window value is set to 3, segments with a pattern length of 3, such as QAA or QQA, can be screened out. If the window value is 4, segments with a pattern length of 4, such as QAAA, QQQA or QQAA, can be screened out. By analogy, a larger window value screens out longer multi-turn question-answer pairs. For each set window value, a question-answer pair set formed by the question-answer pair sequences matching that window value is obtained; different sets are obtained by setting different window values.
Each question-answer pair sequence comprises at least one question word to be analyzed and at least one answer word to be analyzed; that is, starting from a question and applying the set window value, a question-answer pair sequence comprising at least one Q and at least one A is obtained.
In the above embodiment, question-answer interactive data segments with the length equal to the window value are sequentially selected based on the window value, and question-answer pair sequences with pattern lengths matched with the window value are respectively formed according to the question-answer interactive data segments; forming a question-answer pair set according to the question-answer pair sequence; therefore, by setting the variable window, the question-answer pair sequence matched with the window value is obtained, the question-answer pair sequence generated by multiple sessions is obtained, and the problem of interruption or discontinuity in the sessions is solved.
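The window-based selection can be sketched as a sliding window over the Q/A label string. The sketch assumes the data sequence has been reduced to a string of "Q" and "A" labels; the function name is hypothetical. In line with the description, a segment is kept only if it starts with a question and contains at least one answer.

```python
from typing import List, Tuple


def window_segments(labels: str, window: int) -> List[Tuple[int, str]]:
    """Return (start index, pattern) for every window-length segment that
    starts with 'Q' and contains at least one 'A'."""
    segments = []
    for start in range(len(labels) - window + 1):
        segment = labels[start:start + window]
        if segment[0] == "Q" and "A" in segment:
            segments.append((start, segment))
    return segments
```

With the label string "QQAAQA", a window of 2 yields the QA segments at positions 1 and 4, while a window of 3 yields QQA at position 0 and QAA at position 1.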
In an embodiment, the determining a relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting a relevant question-answer pair from the question-answer pair sequence according to the relevance parameter includes:
determining a relevancy parameter of each question-answer pair sequence in the question-answer pair set and a threshold corresponding to the relevancy parameter; the relevancy parameter comprises at least one of the following: a degree of freedom parameter, a closeness parameter, a repetition parameter;
and selecting the question-answer pair sequences whose relevance parameter meets the threshold as associated question-answer pairs.
The relevance parameter is a parameter set for the relevance between the questions and answers in a question-answer pair sequence and is used to calculate how well the questions and answers in the sequence match. Selecting associated question-answer pairs from the question-answer pair sequences according to the relevance parameter means obtaining the matching degree of the questions and answers in each sequence based on the relevance parameter and comparing it with a corresponding threshold: when the matching degree meets the set threshold, the question-answer pair sequence is selected as an associated question-answer pair.
The relevance parameter comprises at least one of the following: a degree of freedom parameter, a closeness parameter, a repetition parameter. For example, suppose the relevance parameter is the repetition parameter and the threshold is 5. For a given question-answer pair sequence, such as "QA", when the number of times it appears in the question-answer pair set exceeds 5, the sequence is determined to be an associated question-answer pair.
In the above embodiment, the relevance parameter of each question-answer pair sequence in the question-answer pair set and the threshold corresponding to the relevance parameter are determined, and the question-answer pair sequences whose relevance parameter meets the threshold are selected as associated question-answer pairs. High-quality associated question-answer pairs are thus accurately mined from the question-answer pair set through the relevance parameter and its threshold, and the problem of answers that do not address the question is avoided.
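The repetition parameter example above amounts to counting occurrences and applying a threshold. A minimal sketch, assuming each question-answer pair sequence is represented as a tuple of codes such as ("Q1", "A1"); the function name is hypothetical:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Sequence = Tuple[str, ...]


def select_by_repetition(sequences: Iterable[Sequence],
                         threshold: int) -> Dict[Sequence, int]:
    """Keep the sequences whose occurrence count exceeds the threshold."""
    counts = Counter(sequences)
    return {seq: n for seq, n in counts.items() if n > threshold}
```

The comparison is strict because the description selects a sequence when its count exceeds the threshold.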
In an embodiment, when the relevancy parameter is a degree of freedom parameter, the determining the relevancy parameter of each question-answer pair sequence in the question-answer pair set, and selecting a relevant question-answer pair from the question-answer pair sequence according to the relevancy parameter includes:
acquiring the question-answer pairs adjacent to each question-answer pair sequence on the left and on the right to obtain a left question-answer pair and a right question-answer pair;
determining a left entropy value and a right entropy value of the question-answer pair sequence based on the left question-answer pair and the right question-answer pair respectively, and determining the degree of freedom parameter according to the left entropy value and the right entropy value based on a set condition;
and when the degree of freedom parameter exceeds a set first threshold value, determining that the question-answer pair sequence is an associated question-answer pair.
Here, the relevance parameter is a degree of freedom parameter. Specifically, the degree of freedom parameter may be an entropy value of the question-answer pair sequence, which indicates how random the set of left-adjacent sentences and the set of right-adjacent sentences of the sequence are. The larger the left and right entropy values, the more varied the question-answer pair sequences that can appear to the left or right of the sequence, and the more likely it is to be a reasonable question-answer pair sequence.
Taking the question-answer pair sequence pattern "Q1A1" with a window value of 2 as an example, all Q1A1 question-answer pair sequences in the question-answer pair data sequence are screened out, and the sequences adjacent to each occurrence are recorded. For instance, if {Q-1, Q-2} appear on the left of Q1A1 and {Q2, A2} appear on its right in the question-answer pair data sequence, the entropy values of the left and right sides are calculated according to formulas (1) and (2):
$$-E_L(Q_1A_1)_{\text{left}} = P(Q_{-1}Q_1A_1)\log_2 P(Q_{-1}Q_1A_1) + P(Q_{-2}Q_1A_1)\log_2 P(Q_{-2}Q_1A_1) \tag{1}$$
$$-E_L(Q_1A_1)_{\text{right}} = P(Q_1A_1Q_2)\log_2 P(Q_1A_1Q_2) + P(Q_1A_1A_2)\log_2 P(Q_1A_1A_2) \tag{2}$$
where $E_L(Q_1A_1)_{\text{left}}$ is the left entropy value of Q1A1 and $E_L(Q_1A_1)_{\text{right}}$ is the right entropy value of Q1A1. The lower of the two is taken as the entropy value of Q1A1, see formula (3):
$$E_L(Q_1A_1) = \min\left(E_L(Q_1A_1)_{\text{left}},\ E_L(Q_1A_1)_{\text{right}}\right) \tag{3}$$
See Table 1 below: multiple sequences appear on both the left and the right of Q1A1, so the entropy value calculated by the formulas is larger.
TABLE 1 (image BDA0002072384560000101 in the original publication)
See Table 2 below: multiple question-answer pair sequences appear on the left of Q1A1, but only one, namely A2, appears on the right. The right entropy value is therefore small, and since formula (3) takes the lower of the left and right entropy values, the degree of freedom parameter obtained from formulas (1) to (3) is low. If the entropy value is lower than the preset entropy threshold, namely the first threshold, Q1A1 can be considered not to be a reasonable question-answer pair.
Figure BDA0002072384560000102
TABLE 2
In the above embodiment, the degree of freedom parameter is obtained, and when it exceeds the set first threshold, the question-answer pair sequence is determined to be an associated question-answer pair. The degree of freedom of the question-answer pair sequence is thereby measured: if a question-answer sequence can appear in many different question-and-answer contexts, its degree of freedom is considered high, and a higher degree of freedom makes it more likely to be a reasonable question-answer pair sequence, i.e., an associated question-answer pair, so that associated question-answer pairs are selected more accurately.
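The entropy-based degree of freedom check described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the flat list representation of the question-answer pair data sequence, and the choice to estimate each probability as the relative frequency of a neighbor among all neighbors observed on that side are assumptions made for the example.

```python
import math
from collections import Counter


def side_entropy(neighbours):
    """Base-2 Shannon entropy of the neighbours seen on one side.

    Assumption: each probability is the relative frequency of a neighbour
    among all neighbours observed on that side.
    """
    counts = Counter(neighbours)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) is the same as -p * log2(p) and avoids a signed zero.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def degree_of_freedom(sequence, target):
    """Return min(left entropy, right entropy) of `target` in `sequence`.

    `sequence` is a list of codes such as ["Q8", "Q1", "A1", "Q2"];
    `target` is a tuple of codes such as ("Q1", "A1").
    """
    n = len(target)
    left, right = [], []
    for i in range(len(sequence) - n + 1):
        if tuple(sequence[i:i + n]) == tuple(target):
            if i > 0:
                left.append(sequence[i - 1])      # left neighbour
            if i + n < len(sequence):
                right.append(sequence[i + n])     # right neighbour
    return min(side_entropy(left), side_entropy(right))


def passes_first_threshold(sequence, target, first_threshold):
    """Keep the target only if its degree of freedom exceeds the threshold."""
    return degree_of_freedom(sequence, target) > first_threshold
```

With the hypothetical sequence `["Q8", "Q1", "A1", "Q2", "Q9", "Q1", "A1", "A2"]`, Q1A1 has two distinct neighbors on each side and a degree of freedom of 1.0; if the right neighbor is always A2, the right entropy and hence the degree of freedom drop to 0.0.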
In an embodiment, when the relevancy parameter is a closeness parameter, the determining the relevancy parameter of each question-answer pair sequence in the question-answer pair set, and selecting a relevant question-answer pair from the question-answer pair sequence according to the relevancy parameter includes:
dividing each question-answer pair sequence into a first part and a second part, wherein the first part at least comprises a question word to be analyzed;
respectively determining how often the first part, the second part and the question-answer pair sequence occur in the question-answer pair set, to obtain the occurrence probability of the first part, the occurrence probability of the second part and the occurrence probability of the question-answer pair sequence;
determining the closeness parameter based on the first portion occurrence probability, the second portion occurrence probability, and the occurrence probability of the sequence of question-answer pairs;
and when the closeness parameter exceeds a set second threshold value, determining that the question-answer pair sequence is an associated question-answer pair.
Here, the relevancy parameter is a closeness parameter. Specifically, the closeness parameter may be pointwise mutual information (pmi), which is commonly used in machine learning literature to measure the correlation between two variables, such as two words or two sentences. See formula (4):
Figure BDA0002072384560000111
Here, x and y respectively represent the two parts of the sequence. For example, for a sequence of the pattern "QA", x is Q and y is A; for the pattern "QAA", x is Q and y is AA, or x is QA and y is A. p(x, y) denotes the frequency with which a certain sequence appears in the total sequence set E; see, for example, Table 3.
Figure BDA0002072384560000112
TABLE 3
Here, two example exchanges that share the same normalized question Q1 are considered: the first is denoted as the sequence Q1A1 after normalization, and the second, whose answer asks the customer to place the order as soon as possible, is denoted as the sequence Q1A2 after normalization. Table 3 shows the related statistics and calculations; the pmi results show that pmi(Q1, A1) is greater than pmi(Q1, A2). It can be seen that although A2 occurs alone more frequently than A1, the co-occurrence probability of Q1A1 is much larger than that of Q1A2, i.e., the former is more tightly bound and therefore more likely to be a reasonable associated question-answer pair. When the pmi value of a question-answer pair sequence is lower than the second threshold, the sequence may be considered unable to form an associated question-answer pair. In addition, when x and y can be split in more than one way, such as in the pattern "QAA" where x is Q and y is AA, or x is QA and y is A, pmi = max(pmi(Q, AA), pmi(QA, A)), i.e., the maximum pmi value is taken.
In the above embodiment, the closeness parameter is obtained, and when it exceeds the set second threshold, the question-answer pair sequence is determined to be an associated question-answer pair. The closeness of the question-answer pair sequence is thereby measured: the higher the closeness, the more likely the sequence is to be a reasonable question-answer pair sequence, i.e., an associated question-answer pair, so that associated question-answer pairs are selected more accurately.
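As an illustration of the closeness calculation, the sketch below uses the standard definition of pointwise mutual information, pmi(x, y) = log(p(x, y) / (p(x) p(y))), which formula (4) presumably corresponds to. The base-2 logarithm, the function names, and the probability table are assumptions for the example; the probabilities are made-up values chosen so that A2 is more frequent on its own while Q1A1 co-occurs more often, mirroring the situation described for Table 3.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of two parts x and y (base 2 assumed)."""
    return math.log2(p_xy / (p_x * p_y))


def closeness(seq, prob):
    """Maximum pmi over every split of `seq` into a first and a second part.

    `seq` is a tuple of codes such as ("Q1", "A1", "A2"); `prob` maps a
    tuple of codes to its occurrence probability (hypothetical table).
    Every split point keeps at least one leading question in the first part,
    because each pattern starts with a question.
    """
    seq = tuple(seq)
    return max(
        pmi(prob[seq], prob[seq[:k]], prob[seq[k:]])
        for k in range(1, len(seq))
    )


# Hypothetical occurrence probabilities, for illustration only.
prob = {
    ("Q1",): 0.25,
    ("A1",): 0.125,
    ("A2",): 0.5,
    ("Q1", "A1"): 0.125,
    ("Q1", "A2"): 0.0625,
}

second_threshold = 1.0  # hypothetical threshold value
kept = [s for s in [("Q1", "A1"), ("Q1", "A2")]
        if closeness(s, prob) > second_threshold]
```

In this example only Q1A1 is kept: its pmi is 2.0, whereas the pmi of Q1A2 is -1.0 even though A2 is the more frequent answer.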
It should be noted that judging the quality of a question-answer pair by pmi alone is sometimes not rigorous, because in some cases a sequence is not necessarily a good QA pair even though its pmi value is large, as shown in Table 4.
Q1: 1 p / 1.5 p / difference / what
A1: 1 p / utility / area / 10 / 15 / sq
A2: 1.5 p / fit / area / 15 / 25 / square meter
TABLE 4
Here, Q1A1 appears many times in the question-answer pair data sequence E and its pmi value is large, but it is not a reasonable question-answer pair because the reply is incomplete: A2 must be added to complete the question-answer pair. The degree of freedom parameter therefore needs to be combined with pmi to refine the filtering mechanism, so that associated question-answer pairs are selected more accurately.
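The combination of the two parameters can be illustrated with a small sketch. It assumes that a sequence must pass both checks to be kept, which is one plausible reading of the text; the class name, field names, scores, and threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Pre-computed parameters of one question-answer pair sequence."""
    degree_of_freedom: float  # min(left entropy, right entropy)
    pmi: float                # closeness parameter


def passes_filters(scores, first_threshold, second_threshold):
    """Keep a sequence only if both parameters exceed their thresholds."""
    return (scores.degree_of_freedom > first_threshold
            and scores.pmi > second_threshold)


# A sequence like Q1A1 in Table 4: high pmi, but always followed by A2,
# so its right entropy (and hence its degree of freedom) is zero.
incomplete = Scores(degree_of_freedom=0.0, pmi=3.0)
complete = Scores(degree_of_freedom=1.5, pmi=3.0)
```

With hypothetical thresholds of 0.5 and 1.0, `incomplete` is rejected despite its high pmi, while `complete` is accepted.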
In an embodiment, when the relevancy parameter is a repetition parameter, the determining the relevancy parameter of each question-answer pair sequence in the question-answer pair set, and selecting an associated question-answer pair from the question-answer pair sequence according to the relevancy parameter includes:
acquiring the number of occurrences of the question-answer pair sequence in the question-answer pair set, and determining that the question-answer pair sequence is an associated question-answer pair when the number of occurrences exceeds a set third threshold.
For example, when the question-answer pair sequence is Q1A1 and the third threshold is 10, Q1A1 is determined to be an associated question-answer pair if its number of occurrences in the question-answer pair set exceeds 10.
In the above embodiment, the number of occurrences of the question-answer pair sequence in the question-answer pair set is calculated, and when the number of occurrences exceeds a third threshold, it is determined that the question-answer pair sequence is an associated question-answer pair.
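The repetition check amounts to counting occurrences and comparing against the third threshold. A minimal sketch, assuming the question-answer pair set is available as a list of code tuples (the function name and data are hypothetical):

```python
from collections import Counter


def frequent_sequences(qa_set, third_threshold):
    """Return the sequences whose occurrence count exceeds the threshold."""
    counts = Counter(tuple(seq) for seq in qa_set)
    return {seq for seq, n in counts.items() if n > third_threshold}


# Hypothetical data: Q1A1 occurs 11 times, Q1A2 exactly 10 times.
qa_set = [("Q1", "A1")] * 11 + [("Q1", "A2")] * 10
selected = frequent_sequences(qa_set, third_threshold=10)
```

Because the comparison is strict ("exceeds"), only Q1A1 is selected here; Q1A2, with exactly 10 occurrences, is not.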
In another embodiment, as shown in fig. 2, there is also provided a question-answer interaction method, including the steps of:
step 201: acquiring question data, and determining a corresponding associated question-answer pair according to the question data; wherein, the associated question-answer pair is determined by the corpus processing method according to any embodiment of the present invention;
step 202: and determining matched answer data to return based on the associated question-answer pairs.
In the above embodiment, the question data is obtained, and the matched answer data is then determined and returned according to the associated question-answer pair determined by the corpus processing method, so that the accuracy of the answer is improved and the problem of answers that are irrelevant to the question is avoided.
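Steps 201 and 202 can be pictured as a lookup against the mined associated question-answer pairs. The sketch below is only an illustration under simple assumptions: questions are whitespace-separated text, normalization is reduced to lower-casing, stop word removal and ignoring word order, and matching is exact on the normalized form. The function names, stop words, and sample pair are hypothetical.

```python
def normalize(question, stop_words=frozenset({"hello", "please"})):
    """Hypothetical normalizer: lower-case, drop stop words, ignore order."""
    return frozenset(w for w in question.lower().split()
                     if w not in stop_words)


def build_index(associated_pairs):
    """Map each normalized question to the answer of its associated pair."""
    return {normalize(q): a for q, a in associated_pairs}


def reply(question, index, fallback=None):
    """Step 201: find the associated pair; step 202: return its answer."""
    return index.get(normalize(question), fallback)


# Hypothetical associated question-answer pair, for illustration only.
index = build_index([("is installation fee needed", "No fee is charged")])
```

A question such as "hello needed is installation fee" normalizes to the same key and receives the stored answer, while an unseen question returns the fallback value.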
In another embodiment, as shown in fig. 3, there is also provided a corpus processing apparatus, including:
The first obtaining module 11 is configured to obtain question-answer interaction data, and preprocess the question-answer interaction data to obtain a question-answer pair data sequence;
the processing module 12 is configured to determine a corresponding screening policy based on a set window value, and screen the question-answer pair data sequence through the corresponding screening policy to obtain a question-answer pair set formed by a question-answer pair sequence with a pattern length matched with the window value;
a first determining module 13, configured to determine a relevance parameter of each question-answer pair sequence in the question-answer pair set, and select a relevant question-answer pair from the question-answer pair sequence according to the relevance parameter.
In the above embodiment of the present application, question-answer interaction data is obtained and preprocessed to obtain a question-answer pair data sequence. A corresponding screening strategy is determined based on a set window value, and the question-answer pair data sequence is screened through that strategy to obtain a question-answer pair set formed by question-answer pair sequences whose pattern length matches the window value. By setting a variable window in this way, question-answer pair sequences matching the window value, including those generated over multiple rounds of conversation, are obtained, which addresses interruptions or discontinuities in the conversation. The relevancy parameter of each question-answer pair sequence in the question-answer pair set is then determined, and associated question-answer pairs are selected from the question-answer pair sequences according to the relevancy parameter. High-quality question-answer pairs are thus accurately mined from the question-answer pair set, associated question-answer pairs generated in multiple rounds of conversation can be obtained, and the problem of answers that are irrelevant to the question is avoided.
Optionally, the first obtaining module 11 is further configured to process the question-answer interaction data based on a preset normalization processing manner; the normalization processing manner comprises at least one of the following: word segmentation processing, stop word processing and bag-of-words model processing;
and encoding the normalized question-answer interaction data to obtain a question-answer pair data sequence.
Optionally, the first determining module 13 is further configured to determine an inverse normalization processing manner corresponding to the normalization manner, and process the associated question-answer pair based on the inverse normalization processing manner to obtain a target associated question-answer pair set.
Optionally, the processing module 12 is further configured to sequentially select question-answer interaction data segments with lengths equal to the window values based on the window values, and form question-answer pair sequences with style lengths matched with the window values according to the question-answer interaction data segments; each question-answer pair sequence comprises at least one question word to be analyzed and at least one answer word to be analyzed;
and forming a question-answer pair set according to the question-answer pair sequence.
Optionally, the first determining module 13 is further configured to determine a relevancy parameter of each question-answer pair sequence in the question-answer pair set and a threshold corresponding to the relevancy parameter; the relevancy parameter comprises at least one of the following: a degree of freedom parameter, a closeness parameter, a repetition parameter;
And selecting the question-answer pair sequence with the correlation degree parameter meeting the threshold value as a correlation question-answer pair.
Optionally, the first determining module 13 is further configured to obtain a left adjacent question-answer pair and a right adjacent question-answer pair adjacent to each other in each question-answer pair sequence, so as to obtain a left question-answer pair and a right question-answer pair;
determining a left entropy value and a right entropy value of the question-answer pair sequence respectively based on the left question-answer pair sequence and the question-answer pair sequence, and determining the degree of freedom parameter according to the left entropy value and the right entropy value;
and when the degree of freedom parameter exceeds a set first threshold value, determining that the question-answer pair sequence is an associated question-answer pair.
Optionally, the first determining module 13 is further configured to divide each question-answer pair sequence into a first part and a second part, where the first part at least contains one question word to be analyzed;
respectively determining how often the first part, the second part and the question-answer pair sequence occur in the question-answer pair set, to obtain the occurrence probability of the first part, the occurrence probability of the second part and the occurrence probability of the question-answer pair sequence;
determining the closeness parameter based on the first portion occurrence probability, the second portion occurrence probability, and the occurrence probability of the sequence of question-answer pairs;
And when the closeness parameter exceeds a set second threshold value, determining that the question-answer pair sequence is an associated question-answer pair.
Optionally, the first determining module 13 is further configured to obtain the number of occurrences of the question-answer pair sequence in the question-answer pair set, and determine that the question-answer pair sequence is an associated question-answer pair when the number of occurrences exceeds a third threshold.
In another embodiment, as shown in fig. 4, there is also provided a question-answer interaction device, including:
a second obtaining module 21, configured to determine, according to the question data, a corresponding associated question-answer pair; wherein, the associated question-answer pair is determined according to the corpus processing method provided by any embodiment of the invention;
and a second determining module 22, configured to determine that matched answer data is returned based on the associated question-answer pair.
In the above embodiment, the question data is obtained, and the matched answer data is then determined and returned according to the associated question-answer pair determined by the corpus processing method, so that the accuracy of the answer is improved and the problem of answers that are irrelevant to the question is avoided.
In another embodiment, as shown in fig. 5, there is also provided a computer apparatus including: at least one processor 210 and a memory 211 for storing computer programs capable of running on the processor 210; the processor 210 illustrated in fig. 5 is not used to refer to the number of processors as one, but is only used to refer to the position relationship of the processor with respect to other devices, and in practical applications, the number of processors may be one or more; similarly, the memory 211 illustrated in fig. 5 is also used in the same sense, i.e., it is only used to refer to the position relationship of the memory with respect to other devices, and in practical applications, the number of the memory may be one or more.
Wherein, when the processor 210 is used for running the computer program, the following steps are executed:
obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;
determining a corresponding screening strategy based on a set window value, and screening the question-answer pair data sequence through the corresponding screening strategy to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;
and determining the relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting relevant question-answer pairs from the question-answer pair sequences according to the relevance parameter.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
processing the question-answer interaction data based on a preset normalization processing manner; the normalization processing manner comprises at least one of the following: word segmentation processing, stop word processing and bag-of-words model processing;
and encoding the normalized question-answer interaction data to obtain a question-answer pair data sequence.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
And determining an inverse normalization processing mode corresponding to the normalization mode, and processing the association question-answer pair based on the inverse normalization processing mode to obtain a target association question-answer pair set.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
sequentially selecting question-answer interaction data segments with the length equal to the window value based on the window value, and respectively forming question-answer pair sequences with the style lengths matched with the window value according to the question-answer interaction data segments; each question-answer pair sequence comprises at least one question word to be analyzed and at least one answer word to be analyzed;
and forming a question-answer pair set according to the question-answer pair sequence.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
determining a relevancy parameter of each question-answer pair sequence in the question-answer pair set and a threshold corresponding to the relevancy parameter; the relevancy parameter comprises at least one of the following: a degree of freedom parameter, a closeness parameter, a repetition parameter;
and selecting the question-answer pair sequence with the correlation degree parameter meeting the threshold value as a correlation question-answer pair.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
acquiring left and right adjacent question-answer pairs adjacent to each question-answer pair sequence to obtain a left question-answer pair and a right question-answer pair;
determining a left entropy value and a right entropy value of the question-answer pair sequence respectively based on the left question-answer pair sequence and the question-answer pair sequence, and determining the degree of freedom parameter according to the left entropy value and the right entropy value;
and when the degree of freedom parameter exceeds a set first threshold value, determining that the question-answer pair sequence is an associated question-answer pair.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
dividing each question-answer pair sequence into a first part and a second part, wherein the first part at least comprises a question word to be analyzed;
respectively determining how often the first part, the second part and the question-answer pair sequence occur in the question-answer pair set, to obtain the occurrence probability of the first part, the occurrence probability of the second part and the occurrence probability of the question-answer pair sequence;
determining the closeness parameter based on the first portion occurrence probability, the second portion occurrence probability, and the occurrence probability of the sequence of question-answer pairs;
And when the closeness parameter exceeds a set second threshold value, determining that the question-answer pair sequence is an associated question-answer pair.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
and acquiring the occurrence times of the question-answer pair sequence in the question-answer pair set, and determining that the question-answer pair sequence is a related question-answer pair when the occurrence times exceed a set third threshold value.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
acquiring question data, and determining a corresponding associated question-answer pair according to the question data; wherein, the associated question-answer pair is determined by the corpus processing method according to any embodiment of the present invention;
and determining matched answer data to return based on the associated question-answer pairs.
The computer device further includes at least one network interface 212. The various components on the transmit side are coupled together by a bus system 213. It will be appreciated that the bus system 213 is used to enable connection and communication among these components. In addition to a data bus, the bus system 213 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 213 in fig. 5.
The memory 211 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 211 described in the embodiments of the invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 211 in the embodiment of the present invention is used to store various types of data to support the operation of the transmitting end. Examples of such data include: any computer program for operating on the sender side, such as an operating system and application programs. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs may include various application programs for implementing various application services. Here, the program that implements the method of the embodiment of the present invention may be included in an application program.
The embodiment further provides a computer storage medium, for example, including a memory 211 storing a computer program, which can be executed by a processor 210 in the transmitting end to perform the steps of the foregoing method. The computer storage medium can be FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM; or various devices including one or any combination of the above memories, such as a smart phone, a tablet computer, a notebook computer, and the like. A computer storage medium having a computer program stored therein, the computer program, when executed by a processor, performing the steps of:
Obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;
determining a corresponding screening strategy based on a set window value, and screening the question-answer pair data sequence through the corresponding screening strategy to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;
and determining the relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting relevant question-answer pairs from the question-answer pair sequences according to the relevance parameter.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
processing the question-answer interaction data based on a preset normalization processing manner; the normalization processing manner comprises at least one of the following: word segmentation processing, stop word processing and bag-of-words model processing;
and encoding the normalized question-answer interaction data to obtain a question-answer pair data sequence.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
and determining an inverse normalization processing mode corresponding to the normalization mode, and processing the association question-answer pair based on the inverse normalization processing mode to obtain a target association question-answer pair set.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
sequentially selecting question-answer interaction data segments with the length equal to the window value based on the window value, and respectively forming question-answer pair sequences with the style lengths matched with the window value according to the question-answer interaction data segments; each question-answer pair sequence comprises at least one question word to be analyzed and at least one answer word to be analyzed;
and forming a question-answer pair set according to the question-answer pair sequence.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
determining a relevancy parameter of each question-answer pair sequence in the question-answer pair set and a threshold corresponding to the relevancy parameter; the relevancy parameter comprises at least one of the following: a degree of freedom parameter, a closeness parameter, a repetition parameter;
and selecting the question-answer pair sequence with the correlation degree parameter meeting the threshold value as a correlation question-answer pair.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
acquiring left and right adjacent question-answer pairs adjacent to each question-answer pair sequence to obtain a left question-answer pair and a right question-answer pair;
Determining a left entropy value and a right entropy value of the question-answer pair sequence respectively based on the left question-answer pair sequence and the question-answer pair sequence, and determining the degree of freedom parameter according to the left entropy value and the right entropy value;
and when the degree of freedom parameter exceeds a set first threshold value, determining that the question-answer pair sequence is an associated question-answer pair.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
dividing each question-answer pair sequence into a first part and a second part, wherein the first part at least comprises a question word to be analyzed;
respectively determining how often the first part, the second part and the question-answer pair sequence occur in the question-answer pair set, to obtain the occurrence probability of the first part, the occurrence probability of the second part and the occurrence probability of the question-answer pair sequence;
determining the closeness parameter based on the first portion occurrence probability, the second portion occurrence probability, and the occurrence probability of the sequence of question-answer pairs;
and when the closeness parameter exceeds a set second threshold value, determining that the question-answer pair sequence is an associated question-answer pair.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
and acquiring the occurrence times of the question-answer pair sequence in the question-answer pair set, and determining that the question-answer pair sequence is a related question-answer pair when the occurrence times exceed a set third threshold value.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
acquiring question data, and determining a corresponding associated question-answer pair according to the question data; wherein, the associated question-answer pair is determined by the corpus processing method according to any embodiment of the present invention;
and determining matched answer data to return based on the associated question-answer pairs.
Referring to fig. 6, the working process of the corpus processing method of the present application is described below by way of a more detailed example. The corpus processing method comprises the following steps:
step S11: inputting a question-answer pair data sequence, a window value, a first threshold value, a second threshold value and a third threshold value;
before inputting the question-answer pair data sequence, obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain the question-answer pair data sequence. The window value is the total number of Q and A included in the question-answer pair, for example, if the window value is set to 2, the question-answer pair in the QA mode is obtained; with the window value set to 3, a question-and-answer pair such as QAA or QQA can be obtained.
Here, the first threshold is the threshold set for the degree of freedom parameter, the second threshold is the threshold set for the closeness parameter, and the third threshold is the threshold set for the repetition parameter.
Step S12: normalization processing;
here, the normalization processing method includes performing word segmentation processing, stop word processing, and bag-of-words model processing on the question-answer interaction data, respectively.
Specifically, if word segmentation processing is adopted, existing word segmentation methods can be classified into three main categories: word segmentation based on character string matching, word segmentation based on understanding, and word segmentation based on statistics. For example, for a user question asking whether an installation fee is needed for installation in a home on floor 3, word segmentation yields a slash-separated text containing the words "home/3/building/installation/need/installation fee" together with greeting and particle words.
Stop word processing is then applied to the segmented text, which removes noise words and special characters such as "parent", "hello" and "do"; after stop word processing, the segmented text above becomes "home/3/building/installation/need/installation fee".
Bag-of-words model processing means that only whether a word appears is considered, and the order in which the words appear is ignored. For example, for user A's question "floor 3, installation requires installation fee", the text after bag-of-words processing is "3/floor/install/need/installation fee"; for user B's question "install on floor 3, is an installation fee needed?", the text after bag-of-words processing is likewise "3/floor/install/need/installation fee".
Encoding the normalized question-answer interaction data to obtain a question-answer pair data sequence means that each piece of normalized question-answer interaction data is encoded with a serial number, i.e., represented by a shorter code. For example, "3/floor/install/need/installation fee" is denoted as Q1, where "Q" indicates a question asked by the user and "1" is the number of the normalized text, so that all user questions whose normalized text is "3/floor/install/need/installation fee" can be represented compactly by Q1.
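The normalization and encoding of step S12 might look roughly like the sketch below. It assumes word segmentation has already been done by some external segmenter, so the input is a list of words; the stop word list, function names, and the `Encoder` class are hypothetical and only illustrate the idea of mapping identical normalized texts to the same short code.

```python
def normalize(words, stop_words):
    """Drop stop words, then ignore order and repetition (bag of words)."""
    return tuple(sorted({w for w in words if w not in stop_words}))


class Encoder:
    """Assigns serial codes such as Q1, Q2, ... to normalized texts."""

    def __init__(self, prefix):
        self.prefix = prefix   # "Q" for questions, "A" for answers
        self.codes = {}

    def encode(self, normalized):
        # A text seen before keeps its code; a new text gets the next number.
        if normalized not in self.codes:
            self.codes[normalized] = f"{self.prefix}{len(self.codes) + 1}"
        return self.codes[normalized]


stop_words = {"dear", "hello"}  # hypothetical stop word list
user_a = ["hello", "3", "floor", "install", "need", "installation fee"]
user_b = ["install", "3", "floor", "dear", "need", "installation fee"]

questions = Encoder("Q")
code_a = questions.encode(normalize(user_a, stop_words))
code_b = questions.encode(normalize(user_b, stop_words))
```

Both hypothetical questions normalize to the same bag of words, so they receive the same code `Q1`, while a different question would receive `Q2`.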
Step S13: screening based on a set window value to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;
Here, the window value is the total number of Q and A items included in a question-answer pair. For example, with the window value set to 2, the corresponding screening strategy screens the data sequence for one question followed by one answer, yielding question-answer pairs of the QA pattern; with the window value set to 3, question-answer pairs of patterns such as QAA or QQA can be screened out; with the window value set to 4, question-answer pairs of patterns such as QAAA, QQQA or QQAA can be screened out. By analogy, the larger the window, the longer the multi-turn question-answer pairs that can be screened out. By setting different window values, a question-answer pair set formed by the question-answer pair sequences matching each set window value is obtained.
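The window-based screening can be sketched as a sliding window over the encoded data sequence. The sketch below assumes, based on the patterns listed above (QA, QAA, QQA, QAAA and so on), that a valid pattern consists of one or more questions followed by one or more answers; the function name `screen_by_window` and the sample dialogue are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed pattern shape: one or more questions followed by one or more answers.
PATTERN = re.compile(r"Q+A+")


def screen_by_window(codes: List[str], window: int) -> List[Tuple[str, ...]]:
    """Return every contiguous segment of length `window` whose Q/A pattern is valid."""
    if window < 2:
        raise ValueError("window must cover at least one question and one answer")
    result = []
    for start in range(len(codes) - window + 1):
        segment = tuple(codes[start:start + window])
        # The first character of each code ("Q" or "A") gives the pattern string.
        style = "".join(code[0] for code in segment)
        if PATTERN.fullmatch(style):
            result.append(segment)
    return result


dialogue = ["Q1", "A1", "Q2", "Q3", "A2", "A3"]
pairs_w2 = screen_by_window(dialogue, 2)
pairs_w3 = screen_by_window(dialogue, 3)
pairs_w4 = screen_by_window(dialogue, 4)
```

With this sample dialogue, a window of 2 keeps only QA segments, a window of 3 keeps the QQA and QAA segments, and a window of 4 keeps the single QQAA segment.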
Step S14: determining a relevance parameter of each question-answer pair sequence in the question-answer pair set;
Here, the relevance parameter is a parameter that characterizes the association between the questions and the answers in a question-answer pair sequence and is used to calculate how well the questions and answers in the sequence match. Selecting associated question-answer pairs from the question-answer pair sequences according to the relevance parameter means obtaining the matching degree of the questions and answers in each question-answer pair sequence based on the relevance parameter; specifically, a degree-of-freedom parameter, a closeness parameter and a repetition parameter are calculated for each question-answer pair sequence.
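One possible way to compute the three parameters is sketched below in Python. This passage does not give exact formulas, so the sketch makes labeled assumptions: the degree of freedom is taken as the minimum of the left and right neighbor entropies (a common choice in new-word discovery), the closeness is a ratio of the sequence probability to the product of the probabilities of its two parts, and the repetition is a plain occurrence count. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def occurrences(data: Sequence[str], seq: Sequence[str]) -> List[int]:
    """Start indices at which `seq` appears contiguously in `data`."""
    n = len(seq)
    return [i for i in range(len(data) - n + 1) if tuple(data[i:i + n]) == tuple(seq)]


def entropy(counter: Counter) -> float:
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log(total / c) for c in counter.values())


def degree_of_freedom(data: Sequence[str], seq: Sequence[str]) -> float:
    """Assumption: min of left and right neighbor entropy."""
    left, right = Counter(), Counter()
    for i in occurrences(data, seq):
        if i > 0:
            left[data[i - 1]] += 1
        j = i + len(seq)
        if j < len(data):
            right[data[j]] += 1
    return min(entropy(left), entropy(right))


def probability(data: Sequence[str], seq: Sequence[str]) -> float:
    positions = len(data) - len(seq) + 1
    return len(occurrences(data, seq)) / positions if positions > 0 else 0.0


def closeness(data: Sequence[str], seq: Sequence[str], split: int = 1) -> float:
    """Assumption: p(seq) / (p(first part) * p(second part))."""
    p_first = probability(data, seq[:split])
    p_second = probability(data, seq[split:])
    if p_first == 0 or p_second == 0:
        return 0.0
    return probability(data, seq) / (p_first * p_second)


def repetition(data: Sequence[str], seq: Sequence[str]) -> int:
    return len(occurrences(data, seq))


data = ["Q1", "A1", "Q2", "A2", "Q1", "A1", "Q3", "A3", "Q1", "A1"]
pair = ("Q1", "A1")
dof = degree_of_freedom(data, pair)
close = closeness(data, pair)
reps = repetition(data, pair)
```

In the sample data the pair (Q1, A1) occurs three times with varied neighbors on both sides, so all three values are comparatively high; a pair that always appears in the same context would get a degree of freedom of zero under this assumption.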
Step S15: determining whether the degree-of-freedom parameter exceeds a set first threshold;
Here, when the degree-of-freedom parameter exceeds the set first threshold, step S16 is executed; otherwise, it is determined that the question-answer pair sequence is not an associated question-answer pair.
Step S16: determining whether the closeness parameter exceeds a set second threshold;
Here, when the closeness parameter exceeds the set second threshold, step S17 is executed; otherwise, it is determined that the question-answer pair sequence is not an associated question-answer pair.
Step S17: determining whether the repetition parameter exceeds a set third threshold;
Here, when the repetition parameter exceeds the set third threshold, step S18 is executed; otherwise, it is determined that the question-answer pair sequence is not an associated question-answer pair.
Step S18: obtaining an associated question-answer pair;
Here, when the degree-of-freedom parameter exceeds the set first threshold, the closeness parameter exceeds the set second threshold, and the repetition parameter exceeds the set third threshold, it is determined that the question-answer pair sequence is an associated question-answer pair.
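Steps S15 to S18 amount to a cascade of three threshold checks, which can be sketched as follows. The `Relevance` tuple, the function `is_associated` and the threshold values used in the example are hypothetical; "exceeds" is interpreted as strictly greater than.

```python
from typing import NamedTuple


class Relevance(NamedTuple):
    freedom: float      # degree-of-freedom parameter
    closeness: float    # closeness parameter
    repetition: int     # repetition parameter


def is_associated(r: Relevance, first: float, second: float, third: float) -> bool:
    """Return True only if all three parameters exceed their thresholds."""
    if r.freedom <= first:        # step S15 fails
        return False
    if r.closeness <= second:     # step S16 fails
        return False
    if r.repetition <= third:     # step S17 fails
        return False
    return True                   # step S18: associated question-answer pair


# Illustrative threshold values only; the source does not specify any.
THRESHOLDS = (0.5, 2.0, 2)

accepted = is_associated(Relevance(0.69, 3.7, 3), *THRESHOLDS)
rejected_low_repetition = is_associated(Relevance(0.69, 3.7, 2), *THRESHOLDS)
rejected_low_freedom = is_associated(Relevance(0.0, 3.7, 3), *THRESHOLDS)
```

Checking the parameters in sequence lets a sequence be discarded as soon as one check fails, without evaluating the remaining ones.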
Step S19: performing inverse normalization processing to obtain a target associated question-answer pair.
Here, the associated question-answer pair is restored to a target associated question-answer pair through the inverse normalization processing corresponding to the normalization processing, and an associated question-answer pair set is finally obtained.
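The inverse normalization can be sketched as a lookup from each code back to the raw text it was derived from. Because several raw texts may share one code, the sketch assumes that the most frequently recorded raw text is chosen as the representative; this choice, the class `OriginalTextIndex` and the sample texts are assumptions made for illustration, not details given in the source.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Sequence


class OriginalTextIndex:
    """Remembers which raw texts were mapped to each code during encoding."""

    def __init__(self) -> None:
        self._texts: DefaultDict[str, Counter] = defaultdict(Counter)

    def record(self, code: str, raw_text: str) -> None:
        self._texts[code][raw_text] += 1

    def restore(self, pair: Sequence[str]) -> List[str]:
        """Map each code of an associated pair back to a representative raw text."""
        restored = []
        for code in pair:
            if code not in self._texts:
                raise KeyError(f"no raw text recorded for code {code!r}")
            # Assumption: the most frequent raw text represents the code.
            restored.append(self._texts[code].most_common(1)[0][0])
        return restored


index = OriginalTextIndex()
index.record("Q1", "Floor 3, does installation require an installation fee?")
index.record("Q1", "Floor 3, does installation require an installation fee?")
index.record("Q1", "Installing on floor 3, is an installation fee needed?")
index.record("A1", "No installation fee is charged.")

target_pair = index.restore(("Q1", "A1"))
```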
The above-described embodiments solve at least the following problems:
(1) by setting a variable window value, question-answer pair sequences matching the window value are obtained, including question-answer pair sequences generated by multi-turn conversations, which addresses the problem of interruption or discontinuity in conversations;
(2) by screening the question-answer pair sequences using the degree-of-freedom parameter, the closeness parameter and the repetition parameter in combination to obtain associated question-answer pairs, the content and number of the screened question-answer pairs can be enriched, and the problem in conventional methods that key information does not co-occur between questions and answers can be addressed.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (13)

1. A corpus processing method, comprising:
obtaining question-answer interaction data, and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;
determining a corresponding screening strategy based on a set window value, and screening the question-answer pair data sequence through the corresponding screening strategy to obtain a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;
and determining the relevance parameter of each question-answer pair sequence in the question-answer pair set, and selecting relevant question-answer pairs from the question-answer pair sequences according to the relevance parameter.
2. The corpus processing method according to claim 1, wherein said preprocessing said question-answer interaction data to obtain a question-answer pair data sequence comprises:
processing the question-answer interaction data based on a preset normalization processing mode; the normalization processing mode comprises at least one of the following: word segmentation processing, stop-word processing and bag-of-words model processing;
and encoding the normalized question-answer interaction data to obtain a question-answer pair data sequence.
3. The corpus processing method according to claim 2, wherein said selecting a relevant question-answer pair from said question-answer pair sequence according to said relevance parameter comprises:
determining an inverse normalization processing mode corresponding to the normalization processing mode, and processing the associated question-answer pairs based on the inverse normalization processing mode to obtain a target associated question-answer pair set.
4. The corpus processing method according to claim 1, wherein said obtaining a set of question-answer pairs formed by a sequence of question-answer pairs whose pattern lengths match said window values comprises:
sequentially selecting question-answer interaction data segments with the length equal to the window value based on the window value, and respectively forming question-answer pair sequences with the style lengths matched with the window value according to the question-answer interaction data segments; each question-answer pair sequence comprises at least one question word to be analyzed and at least one answer word to be analyzed;
and forming a question-answer pair set according to the question-answer pair sequences.
5. The corpus processing method according to claim 1, wherein said determining a relevance parameter of each question-answer pair sequence in said question-answer pair set, and selecting a relevant question-answer pair from said question-answer pair sequence according to said relevance parameter comprises:
determining a relevance parameter of each question-answer pair sequence in the question-answer pair set and a threshold corresponding to the relevance parameter; the relevance parameter comprises at least one of the following: a degree-of-freedom parameter, a closeness parameter, a repetition parameter;
and selecting the question-answer pair sequences whose relevance parameter meets the threshold as associated question-answer pairs.
6. The corpus processing method according to claim 1, wherein when the relevancy parameter is a degree of freedom parameter, the determining the relevancy parameter of each question-answer pair sequence in the question-answer pair set, and selecting a relevant question-answer pair from the question-answer pair sequence according to the relevancy parameter comprises:
acquiring left and right adjacent question-answer pairs adjacent to each question-answer pair sequence to obtain a left question-answer pair and a right question-answer pair;
determining a left entropy value and a right entropy value of the question-answer pair sequence based on the left question-answer pair and the right question-answer pair respectively, and determining the degree-of-freedom parameter according to the left entropy value and the right entropy value;
and when the degree-of-freedom parameter exceeds a set first threshold, determining that the question-answer pair sequence is an associated question-answer pair.
7. The corpus processing method according to claim 1, wherein when the relevance parameter is a closeness parameter, the determining of the relevance parameter of each question-answer pair sequence in the question-answer pair set and the selecting of an associated question-answer pair from the question-answer pair sequence according to the relevance parameter comprises:
dividing each question-answer pair sequence into a first part and a second part, wherein the first part at least comprises a question word to be analyzed;
respectively acquiring the occurrence probabilities of the first part, the second part and the question-answer pair sequence in the question-answer pair set, to obtain the first part occurrence probability, the second part occurrence probability and the occurrence probability of the question-answer pair sequence;
determining the closeness parameter based on the first portion occurrence probability, the second portion occurrence probability, and the occurrence probability of the sequence of question-answer pairs;
and when the closeness parameter exceeds a set second threshold value, determining that the question-answer pair sequence is an associated question-answer pair.
8. The corpus processing method according to claim 1, wherein when the relevance parameter is a repetition parameter, the determining of the relevance parameter of each question-answer pair sequence in the question-answer pair set and the selecting of an associated question-answer pair from the question-answer pair sequence according to the relevance parameter comprises:
acquiring the number of occurrences of the question-answer pair sequence in the question-answer pair set, and determining that the question-answer pair sequence is an associated question-answer pair when the number of occurrences exceeds a set third threshold.
9. A corpus processing apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring question-answer interaction data and preprocessing the question-answer interaction data to obtain a question-answer pair data sequence;
the processing module is used for determining a corresponding screening strategy based on a set window value, screening the question-answer pair data sequence through the corresponding screening strategy and obtaining a question-answer pair set formed by a question-answer pair sequence with the pattern length matched with the window value;
and the first determining module is used for determining the association degree parameter of each question-answer pair sequence in the question-answer pair set and selecting an associated question-answer pair from the question-answer pair sequence according to the association degree parameter.
10. A question-answer interaction method is characterized by comprising the following steps:
acquiring question data, and determining a corresponding associated question-answer pair according to the question data; wherein the associated question-answer pair is determined by the corpus processing method according to any one of claims 1 to 8;
and determining matched answer data to return based on the associated question-answer pair.
11. A question-answer interaction device, comprising:
the second acquisition module is used for determining a corresponding associated question-answer pair according to the question data; wherein the associated question-answer pair is determined by the corpus processing method according to any one of claims 1 to 8;
and the second determination module is used for determining matched answer data to return based on the associated question-answer pair.
12. A computer device, comprising: a processor and a memory for storing a computer program capable of running on the processor;
when the processor is used to run the computer program, the corpus processing method according to any one of claims 1 to 8 is implemented, or the question-answer interaction method according to claim 10 is implemented.
13. A computer storage medium, wherein a computer program is stored in the computer storage medium, and wherein the computer program, when executed by a processor, implements the corpus processing method according to any one of claims 1 to 8 or implements the question-answer interaction method according to claim 10.
CN201910442283.8A 2019-05-24 2019-05-24 Corpus processing and question-answer interaction method and device, computer equipment and storage medium Pending CN111984768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910442283.8A CN111984768A (en) 2019-05-24 2019-05-24 Corpus processing and question-answer interaction method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111984768A (en) 2020-11-24

Family

ID=73436783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910442283.8A Pending CN111984768A (en) 2019-05-24 2019-05-24 Corpus processing and question-answer interaction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111984768A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015058604A1 (en) * 2013-10-21 2015-04-30 北京奇虎科技有限公司 Apparatus and method for obtaining degree of association of question and answer pair and for search ranking optimization
CN106484801A (en) * 2016-09-23 2017-03-08 厦门快商通科技股份有限公司 A kind of dialogue method of intelligent customer service robot and its knowledge base management system
CN106502984A (en) * 2016-10-19 2017-03-15 上海智臻智能网络科技股份有限公司 A kind of method and device of field new word discovery
CN106844741A (en) * 2017-02-13 2017-06-13 哈尔滨工业大学 A kind of answer method towards specific area
CN108415980A (en) * 2018-02-09 2018-08-17 平安科技(深圳)有限公司 Question and answer data processing method, electronic device and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
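乔霈;王素格;陈鑫;谭红叶;陈千;王元龙;: "Method for obtaining answers to prose reading comprehension questions based on word association", 中文信息学报 (Journal of Chinese Information Processing), no. 03 *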


Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20210525

Address after: 100176 room 1004, 10th floor, building 1, 18 Kechuang 11th Street, economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Beijing Huijun Technology Co.,Ltd.

Address before: 100086 8th Floor, 76 Zhichun Road, Haidian District, Beijing

Applicant before: BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: BEIJING JINGDONG CENTURY TRADING Co.,Ltd.

SE01 Entry into force of request for substantive examination