CN108681574A - A non-factoid question-answering answer selection method and system based on text summarization - Google Patents

A non-factoid question-answering answer selection method and system based on text summarization

Info

Publication number
CN108681574A
CN108681574A CN201810428163.8A CN201810428163A
Authority
CN
China
Prior art keywords
sentence
text
answer
snippet
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810428163.8A
Other languages
Chinese (zh)
Other versions
CN108681574B (en)
Inventor
马荣强
张健
李淼
陈雷
高会议
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Institutes of Physical Science of CAS
Original Assignee
Hefei Institutes of Physical Science of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Institutes of Physical Science of CAS filed Critical Hefei Institutes of Physical Science of CAS
Priority to CN201810428163.8A priority Critical patent/CN108681574B/en
Publication of CN108681574A publication Critical patent/CN108681574A/en
Application granted granted Critical
Publication of CN108681574B publication Critical patent/CN108681574B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a non-factoid question-answering answer selection method and system based on text summarization, belonging to the field of intelligent search technology. The method comprises: extracting the first sentence and the last sentence of a candidate answer text; performing extractive summarization, with the text summarization model TextRank, on the remaining text of the candidate answer other than the first and last sentences, to obtain a preliminary summary; combining the first sentence, the preliminary summary, and the last sentence in order to obtain the candidate answer summary; taking the question and each candidate answer summary as input to a neural semantic representation model, to obtain the semantic relatedness of the question and the candidate answer summary; and returning as the answer the answer summary with the highest semantic relatedness to the question. By extracting the first and last sentences of the answer text as components of the summary, the invention preserves the topical integrity of the extracted summary and thereby improves the accuracy of answer selection.

Description

A non-factoid question-answering answer selection method and system based on text summarization
Technical field
The present invention relates to the field of intelligent search technology, and in particular to a non-factoid question-answering answer selection method and system based on text summarization.
Background technology
At present, question answering has become an important research topic in natural language processing and is used in many information-acquisition fields, such as information retrieval, expert systems, automatic question answering, and human-machine natural-language interaction. A question answering system differs from information retrieval in that it does not require the user to find the answer; it returns the answer directly.
According to their data sources, question answering systems fall into three classes: systems based on structured data, systems based on free text, and systems based on question-answer pairs. In a system based on question-answer pairs, after the user poses a question, the workflow analyzes semantic features and returns the answer that best matches the question semantically; the data mainly come from Web community question answering.
Early research on answer selection was generally based on traditional semantic feature extraction: text features were chosen manually, and a high-performance classifier was then trained. Compared with methods of semantic representation, manually defined features are more interpretable, and the chosen features cover the entire data set. The selected features mainly reflect the linguistic quality of the answer text and the correlation between the question and the answer content. Manually chosen features typically include word N-gram language models, syntactic structure, and grammatical dependencies. When studying answer selection, early researchers most commonly used existing natural-language-processing tools to segment, part-of-speech tag, or syntactically parse the text, and then trained an answer selection model based on manually defined features.
However, answer texts in non-factoid question answering are diverse in form and contain noise, so general language rules are rarely able to match the correct answer. For the answer-selection task of non-factoid question answering systems, the current mainstream approach is therefore to mine the semantic information of the text with supervised machine-learning methods over the received text, for example:
Training SVM models with word-level matching features, such as keyword-matching features, phrase-level non-semantic features, and features based on named entities. Researchers have also used natural-language-processing tools to extract text features and develop a series of lexical features related to answer quality, including the presence of punctuation and hyperlinks, the number of special words, part-of-speech and named-entity features, N-gram language-model frequencies, and so on. Syntax trees capture the partial structure of a sentence well, and answer-selection methods based on syntax trees can effectively reduce the workload of feature selection. Methods combining syntactic and semantic features perform answer selection by computing, on the syntactic side, the tree edit distance between the dependency trees of the question and the answer, and using, on the semantic side, shallow semantic features such as entity types and synonyms.
Here, tree edit distance is the total cost of the operations (insertion, deletion, and substitution) required to transform one tree into another; its computation is analogous to string edit distance. Conditional Random Fields (CRF) have been used to label the sequences in question-answer pairs, with features including tree edit distance and string edit distance; this was the first time the community question-answering answer-selection problem was cast as a sequence-labeling problem. Besides syntax trees, some researchers compare the question and the answer text from the angle of language models and word vectors, for example using translation-based models that treat the question and the candidate answer as two different languages in order to compare their relatedness.
Answer-selection methods based on traditional semantic feature extraction are usually well interpretable: the manually chosen features have a traceable rationale and are easy to understand. But this approach also has defects. First, it depends on toolkits from basic research in natural language processing, so the effect of the selected features depends on the state of that basic research; the idea behind a feature may be well founded, yet fail to obtain the desired result on complex text. Second, the features extracted in the answer-selection model ultimately depend on human choice; the model has no self-learning ability, which limits its applicability.
Summary of the invention
The purpose of the present invention is to provide a non-factoid question-answering answer selection method and system based on text summarization, so as to improve the accuracy of answer selection in question answering systems.
To achieve the above purpose, the present invention adopts a non-factoid question-answering answer selection method based on text summarization, comprising the following steps:
extracting the first sentence and the last sentence of a candidate answer text;
performing extractive summarization, with the text summarization model TextRank, on the remaining text of the candidate answer other than the first and last sentences, to obtain a preliminary summary;
combining the first sentence, the preliminary summary, and the last sentence in order to obtain the candidate answer summary;
taking the question and the candidate answer summary as input to a neural semantic representation model, to obtain the semantic relatedness of the question and the candidate answer summary;
returning as the answer the answer summary with the highest semantic relatedness to the question.
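The steps above can be sketched as a small pipeline. This is an illustrative sketch only: the patent does not specify an implementation language, sentences are naively split on '.', and `extract_middle` and `relatedness` are hypothetical placeholders standing in for the TextRank extractor and the neural semantic representation model.

```python
def split_sentences(text):
    """Split an answer text into sentences on full stops (simplifying assumption)."""
    return [s.strip() for s in text.split('.') if s.strip()]

def summarize_answer(text, extract_middle):
    """Keep the first and last sentences; summarize only the middle part."""
    sents = split_sentences(text)
    if len(sents) <= 2:                      # too short to summarize further
        return text
    first, last = sents[0], sents[-1]
    middle = extract_middle(sents[1:-1])     # placeholder for a TextRank extractor
    return '. '.join([first] + middle + [last]) + '.'

def select_answer(question, candidates, extract_middle, relatedness):
    """Return the candidate answer summary most related to the question."""
    summaries = [summarize_answer(c, extract_middle) for c in candidates]
    scores = [relatedness(question, s) for s in summaries]
    return summaries[scores.index(max(scores))]
```

Plugging in a trivial `extract_middle` (keep the first middle sentence) and a word-overlap `relatedness` already reproduces the first-sentence / summary / last-sentence structure the invention describes.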
Preferably, extracting the first and last sentences of the candidate answer text comprises:
extracting the first and last sentences of the candidate answer text according to their positions in that text.
Preferably, performing extractive summarization with the text summarization model TextRank on the remaining text of the candidate answer other than the first and last sentences, to obtain the preliminary summary, comprises:
splitting the candidate answer text into sentences, and segmenting each sentence into words;
tagging each word with its part of speech, and filtering the tagged word information to obtain the terms of specific words;
taking the specific terms or sentences as text units, building nodes from the text units and edges from the similarity between text units, to obtain a weighted graph model;
computing the similarity of every two nodes, and taking the similarity values as parameters of the node-weight formula;
iterating the node-weight formula until convergence, to obtain the score of each node;
ranking the nodes by their scores at convergence, to obtain the sorted nodes;
extracting text units from the sorted nodes according to a set extraction ratio, to form the preliminary summary.
Preferably, the methods for computing the similarity of two nodes include: the word-overlap method, the string method, the cosine-similarity method, and the longest-common-subsequence method.
On the other hand, a non-factoid question-answering answer selection system based on text summarization is adopted, comprising a first extraction module, a second extraction module, a combination module, a matching module, and a determination module connected in sequence;
the first extraction module is used for extracting the first sentence and the last sentence of the candidate answer text;
the second extraction module is used for performing extractive summarization, with the text summarization model TextRank, on the remaining text of the candidate answer other than the first and last sentences, to obtain a preliminary summary;
the combination module is used for combining the first sentence, the preliminary summary, and the last sentence in order, to obtain the candidate answer summary;
the matching module is used for taking the question and the candidate answer summary as input to a neural semantic representation model, to obtain the semantic relatedness of the question and the candidate answer summary;
the determination module is used for returning as the answer the answer summary with the highest semantic relatedness to the question.
Preferably, the first extraction module is specifically used for:
extracting the first and last sentences of the candidate answer text according to their positions in that text.
Preferably, the second extraction module comprises a segmentation unit, a filtering unit, a weighted-graph construction unit, a similarity computation unit, an iteration unit, a sorting unit, and a combination unit connected in sequence;
the segmentation unit is used for splitting the candidate answer text into sentences, and segmenting each sentence into words;
the filtering unit is used for tagging each word with its part of speech, and filtering the tagged word information to obtain the terms of specific words;
the weighted-graph construction unit is used for taking the specific terms or sentences as text units, building nodes from the text units and edges from the similarity between text units, to obtain the weighted graph model;
the similarity computation unit is used for computing the similarity of every two nodes, and taking the similarity values as parameters of the node-weight formula;
the iteration unit is used for iterating the node-weight formula until convergence, to obtain the score of each node;
the sorting unit is used for ranking the nodes by their scores at convergence, to obtain the sorted nodes;
the combination unit is used for extracting text units from the sorted nodes according to a set extraction ratio, to form the preliminary summary.
Preferably, the similarity computation methods used by the similarity computation unit include: the word-overlap method, the string method, the cosine-similarity method, and the longest-common-subsequence method.
Compared with the prior art, the present invention has the following technical effects. In practice, in the question-answer pairs of a non-factoid question answering system, the answer text is much longer than the question. A single text-summarization method considers only the global information of the text and lacks features of the text units themselves, such as sentence position and term position; when the extraction ratio is set very low, topic drift easily occurs. When summarizing the answer text, this scheme retains the first and last sentences of the answer, applies the extractive summarization method to the rest, and combines the first sentence, the summary, and the last sentence in order as the final extracted summary. Since the first sentence of an answer in question answering is usually a brief restatement of the question, and the last sentence is usually a short summary of the answer content, extracting the first and last sentences of the answer text as components of the summary preserves the topical integrity of the extracted summary and thus improves the accuracy of answer selection.
Description of the drawings
The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings:
Fig. 1 is a flow diagram of a non-factoid question-answering answer selection method based on text summarization;
Fig. 2 is a schematic diagram of answer text summarization;
Fig. 3 is a TextRank weighted graph;
Fig. 4 is a framework diagram of the neural semantic representation model;
Fig. 5 is a structural diagram of a non-factoid question-answering answer selection system based on text summarization.
Specific embodiments
To further illustrate the features of the present invention, please refer to the following detailed description and the accompanying drawings. The drawings are for reference and discussion only and are not used to limit the protection scope of the present invention.
The embodiments of the present application solve the problem of low answer-selection accuracy in existing question answering systems by providing a non-factoid question-answering answer selection method based on text summarization.
To solve the above problem, the main idea of this embodiment is: when summarizing a candidate answer text, retain its first and last sentences; extract a summary from the remaining content after removing the first and last sentences; then combine the first sentence, the summary, and the last sentence in order into the final text summary; and match the final summary against the question to obtain the answer to return.
As shown in Fig. 1 and Fig. 2, the non-factoid question-answering answer selection method based on text summarization provided by this embodiment is described in detail. It comprises the following steps S1 to S5:
S1, extracting the first sentence and the last sentence of the candidate answer text;
S2, performing extractive summarization, with the text summarization model TextRank, on the remaining text of the candidate answer other than the first and last sentences, to obtain a preliminary summary;
S3, combining the first sentence, the preliminary summary, and the last sentence in order, to obtain the candidate answer summary;
S4, taking the question and the candidate answer summary as input to a neural semantic representation model, to obtain the semantic relatedness of the question and the candidate answer summary;
S5, returning as the answer the answer summary with the highest semantic relatedness to the question.
It should be noted that the question and the answer text summaries are input to the neural answer-selection model: the neural network encodes the question and each answer summary, mines the text semantics to obtain their vector representations, and finally obtains their semantic relatedness by computing the similarity of the semantic vectors of the question and the answer text.
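The final comparison of the two semantic vectors is typically a cosine similarity. A minimal sketch, assuming the encoder outputs are already available as plain lists of numbers (the neural encoder itself is not shown and is outside the scope of this sketch):

```python
import math

def cosine_relatedness(u, v):
    """Cosine similarity of two semantic vectors (e.g. encoder outputs).

    Returns a value in [-1, 1]; 0.0 if either vector has zero norm.
    """
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

The candidate whose summary vector has the highest cosine with the question vector would then be returned as the answer.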
As a further preferred scheme, the extraction process of the above step S1 (extracting the first and last sentences of the candidate answer text) is: first identify the positions of the first sentence and the last sentence of the candidate answer text, then extract the first and last sentences according to those positions. For example, identify the position of the first full stop in the answer text and extract the sentence before it as the first sentence; identify the positions of the last two full stops in the answer text and extract the sentence between them as the last sentence.
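The full-stop rule described above can be sketched in a few lines. This is an assumption-laden illustration (plain '.' as the only sentence delimiter), not the patented implementation:

```python
def extract_first_and_tail(text):
    """Extract the first sentence (up to the first full stop) and the
    tail sentence (between the last two full stops) of an answer text."""
    stops = [i for i, ch in enumerate(text) if ch == '.']
    if len(stops) < 2:                       # no full sentence pair to extract
        return text.strip(), ''
    first = text[:stops[0] + 1].strip()
    tail = text[stops[-2] + 1:stops[-1] + 1].strip()
    return first, tail
```

A production version would also handle question marks, exclamation marks, and abbreviations, which this sketch deliberately ignores.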
As a further preferred scheme, the above step S2 (performing extractive summarization with the text summarization model TextRank on the candidate answer text other than the first and last sentences, to obtain the preliminary summary) is explained in detail as follows:
When the TextRank algorithm extracts key sentences, sentences are first marked as nodes, and a graph model is then built from the edges between them. For sentence-similarity calculation, TextRank generally uses a method based on word overlap: the more repeated words two sentences share, the higher their similarity. Besides word overlap, sentence-similarity methods such as string matching, cosine similarity, longest common subsequence, and part-of-speech methods can also be used, all based on statistical information. After the graph model is built, the PageRank algorithm performs recursive calculation and finally yields each node's score; the higher a node's score, the more important its sentence. After ranking the sentences by importance, key sentences are extracted at the required ratio to form the text summary.
The main steps are as follows:
(1) Preprocessing: the text is divided into several text units (terms or sentences); after word segmentation, part-of-speech tagging is performed. The tagged word information is filtered, the filtered content including stop words and certain parts of speech, so that finally only terms of specific parts of speech are retained.
(2) Building the weighted graph model: nodes are built from the text units and edges from the similarity between text units, forming the weighted graph model.
(3) Sentence-similarity calculation: the similarity of two sentences S_i and S_j is calculated with the word-overlap method, using the following formula:

    Similarity(S_i, S_j) = |{w_k : w_k ∈ S_i and w_k ∈ S_j}| / (log(|S_i|) + log(|S_j|))

where S_i and S_j are the two sentences, sentence S_i is represented by its N_i terms, w_k denotes a word contained in both sentences, and the edge weight is w_ji = Similarity(S_i, S_j).
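The word-overlap similarity can be sketched directly from the formula above, assuming whitespace tokenization (the original operates on filtered terms, which this sketch does not reproduce):

```python
import math

def sentence_similarity(si, sj):
    """TextRank edge weight: shared-word count over log|S_i| + log|S_j|."""
    wi, wj = si.split(), sj.split()
    overlap = len(set(wi) & set(wj))
    denom = math.log(len(wi)) + math.log(len(wj))
    return overlap / denom if denom > 0 else 0.0   # guard one-word sentences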
(4) The node-score formula is iterated to convergence to obtain each node's score. The TextRank algorithm model can be written G = (V, E): the algorithm denotes the set of all nodes in the graph by V and the set of all edges by E, V and E constitute everything in the graph, and E is a subset of V × V. The score of node V_i is:

    WS(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] * WS(V_j)

where w_ji is the weight of the edge connecting node V_j and node V_i, generally represented by the similarity of V_j and V_i; In(V_i) is the set of all nodes pointing to node V_i; Out(V_j) is the set of all nodes that V_j points to; and d is the damping coefficient (0 ≤ d ≤ 1), representing the probability that a node in Fig. 3 jumps to any other node, generally taken as 0.85.
In addition, two points should be noted when using the TextRank algorithm. First, the initial values: all node scores are generally initialized to 1. Second, the convergence test: the convergence threshold is generally 0.0001, i.e. iteration stops, and convergence is reached, when the error of every node in the graph is below 0.0001.
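The iteration with initial scores of 1 and the 0.0001 convergence threshold can be sketched as follows, assuming the graph is given as a symmetric similarity matrix with a zero diagonal (self-edges carry no weight):

```python
def textrank_scores(w, d=0.85, tol=1e-4, max_iter=200):
    """Iterate WS(V_i) = (1-d) + d * sum_j [w_ji / sum_k w_jk] * WS(V_j)
    until every node's score changes by less than tol."""
    n = len(w)
    scores = [1.0] * n                          # all nodes start at 1
    out_sum = [sum(row) for row in w]           # total out-edge weight per node
    for _ in range(max_iter):
        new = [(1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                                 for j in range(n)
                                 if j != i and out_sum[j] > 0)
               for i in range(n)]
        if max(abs(new[i] - scores[i]) for i in range(n)) < tol:
            return new                          # converged
        scores = new
    return scores
```

On a uniform graph every node keeps score 1, the fixed point of the formula; on a chain, the middle node ends up highest, as expected.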
(5) All nodes are ranked by score and, according to the chosen extraction ratio, key sentences are extracted from the nodes to form the preliminary summary text.
It should be noted that setting different extraction ratios according to actual needs removes colloquial expressions and redundant information from the answer text and ensures the accuracy of the extraction.
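Extraction at a configurable ratio can be sketched as follows; re-emitting the kept sentences in their original order is an assumption the text does not state explicitly, but it preserves readability:

```python
def extract_by_ratio(sentences, scores, ratio):
    """Keep the top `ratio` fraction of sentences by TextRank score,
    returned in their original document order."""
    k = max(1, int(len(sentences) * ratio))            # always keep at least one
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]
```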
It should be noted that TextRank is the classic method for keyword extraction and summary-sentence extraction from text; in principle it is an unsupervised graph-based algorithm. In this embodiment, the TextRank algorithm ranks the keywords and key sentences of the text using the PageRank algorithm.
For example, to calculate the similarity of sentence S_i and sentence S_j, first build the weighted graph shown in Fig. 3, where node V_i represents sentence S_i and node V_j represents sentence S_j. The similarity of node V_j and node V_k is written w_jk, and the similarity w_j(k+1) of node V_j and node V_(k+1) is obtained by the similarity formula. The TextRank score of node V_i can then be calculated by the node-score formula, in which w_jk + w_j(k+1) is the sum of the weights of the edges leaving node V_j.
It should be noted that TextRank is an unsupervised method for extracting keywords and key sentences. Its advantages are that it needs no training corpus, performs well on text from different fields, requires neither linguistic nor domain knowledge, and takes the overall structure of the text into account. Its disadvantage is that, because it considers only the global information of the text, it lacks the features of the text units themselves, such as sentence position and term position.
In practical application, in the question-answer pairs of a non-factoid question answering system, the answer text is much longer than the question, and with a single text-summarization method, topic drift easily occurs when the extraction ratio is set very low. As shown in Fig. 4, when summarizing the answer text, this scheme retains the first and last sentences of the answer and then applies the extractive summarization method. From the characteristics of answer texts in question answering, the first sentence of an answer is usually a brief restatement of the question, followed by the way to solve it, and the last sentence is usually a short summary of the answer content. So when summarizing the answer, the first and last sentences of the answer text ensure the topical integrity of the summary and thereby improve the accuracy of answer selection.
Meanwhile, relative to the original answer text, the extracted summary suppresses meaningless colloquial expressions and redundant information, giving an efficient answer representation, from which the neural semantic representation model then obtains a semantic vector containing more key information.
As shown in Fig. 5, this embodiment discloses a non-factoid question-answering answer selection system based on text summarization, comprising a first extraction module 10, a second extraction module 20, a combination module 30, a matching module 40, and a determination module 50 connected in sequence;
the first extraction module 10 is used for extracting the first sentence and the last sentence of the candidate answer text;
the second extraction module 20 is used for performing extractive summarization, with the text summarization model TextRank, on the remaining text of the candidate answer other than the first and last sentences, to obtain a preliminary summary;
the combination module 30 is used for combining the first sentence, the preliminary summary, and the last sentence in order, to obtain the candidate answer summary;
the matching module 40 is used for taking the question and the candidate answer summary as input to a neural semantic representation model, to obtain the semantic relatedness of the question and the candidate answer summary;
the determination module 50 is used for returning as the answer the answer summary with the highest semantic relatedness to the question.
As a further preferred scheme, the first extraction module 10 is specifically used for:
extracting the first and last sentences of the candidate answer text according to their positions in that text.
As a further preferred scheme, the second extraction module 20 comprises a segmentation unit, a filtering unit, a weighted-graph construction unit, a similarity computation unit, an iteration unit, a sorting unit, and a combination unit connected in sequence;
the segmentation unit is used for splitting the candidate answer text into sentences, and segmenting each sentence into words;
the filtering unit is used for tagging each word with its part of speech, and filtering the tagged word information to obtain the terms of specific words;
the weighted-graph construction unit is used for taking the specific terms or sentences as text units, building nodes from the text units and edges from the similarity between text units, to obtain the weighted graph model;
the similarity computation unit is used for computing the similarity of every two nodes, and taking the similarity values as parameters of the node-weight formula;
the iteration unit is used for iterating the node-weight formula until convergence, to obtain the score of each node;
the sorting unit is used for ranking the nodes by their scores at convergence, to obtain the sorted nodes;
the combination unit is used for extracting text units from the sorted nodes according to a set extraction ratio, to form the preliminary summary.
As a further preferred scheme, the similarity computation methods used by the similarity computation unit include: the word-overlap method, the string method, the cosine-similarity method, and the longest-common-subsequence method.
It should be understood that the non-factoid question-answering answer selection system based on text summarization disclosed in this embodiment implements each flow in Fig. 1 and has the same technical features and the same effects as the method disclosed in the above embodiment, so it is not described in detail here.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (8)

1. A non-factoid question-answering answer selection method based on text summarization, characterized by comprising:
extracting the first sentence and the last sentence of a candidate answer text;
performing extractive summarization, with the text summarization model TextRank, on the remaining text of the candidate answer other than the first and last sentences, to obtain a preliminary summary;
combining the first sentence, the preliminary summary, and the last sentence in order to obtain the candidate answer summary;
taking the question and the candidate answer summary as input to a neural semantic representation model, to obtain the semantic relatedness of the question and the candidate answer summary;
returning as the answer the answer summary with the highest semantic relatedness to the question.
2. The non-factoid question-answering answer selection method based on text summarization of claim 1, characterized in that extracting the first and last sentences of the candidate answer text comprises:
extracting the first and last sentences of the candidate answer text according to their positions in that text.
3. The non-factoid question-answer selection method based on text summarization according to claim 1, characterized in that performing summary extraction on the remaining text of the candidate answer text, excluding the first and last sentences, using the TextRank text summarization model to obtain a preliminary text summary comprises:
segmenting the candidate answer text into sentences, and tokenizing each sentence;
tagging the part of speech of each word, and filtering the tagged word information, to obtain the terms of specific words;
taking the specific terms or sentences as text units, constructing nodes from the text units and edges between nodes from the similarity between text units, to obtain a weighted graph model;
calculating the similarity of any two nodes, and taking the similarity value as a parameter of the node weight calculation formula;
iterating the node weight calculation formula until convergence, to obtain the score of each node;
ranking the nodes according to their scores at convergence, to obtain the ranked nodes;
extracting text units from the ranked nodes according to a set extraction ratio, to form the preliminary text summary.
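Claim 3 refers to a "node weight calculation formula" without stating it; the published TextRank algorithm uses the recurrence WS(Vi) = (1 - d) + d * Σ_{Vj} [w_ji / Σ_{Vk} w_jk] * WS(Vj) with damping factor d, and assuming that formula is intended, the iteration-until-convergence step can be sketched as follows (all names here are illustrative assumptions):

```python
def textrank_scores(sim, d=0.85, tol=1e-6, max_iter=100):
    """Iterate the TextRank node-weight recurrence on a similarity matrix until convergence.

    `sim[i][j]` is the symmetric similarity between text units i and j and defines
    the weighted edges of the graph. Returns one score per node.
    """
    n = len(sim)
    scores = [1.0] * n
    out_sum = [sum(row) for row in sim]  # denominator: total edge weight leaving each node j
    for _ in range(max_iter):
        new = []
        for i in range(n):
            rank = sum(sim[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if j != i and out_sum[j] > 0)
            new.append((1 - d) + d * rank)
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores
```

Sorting the returned scores and keeping the top fraction of text units, per the set extraction ratio, yields the preliminary summary the claim describes.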
4. The non-factoid question-answer selection method based on text summarization according to claim 3, characterized in that the methods for calculating the similarity of any two nodes include: the word-overlap method, the string method, the cosine similarity method, and the longest common subsequence method.
5. A non-factoid question-answer selection system based on text summarization, characterized by comprising a first extraction module, a second extraction module, a combination module, a matching module, and a determination module connected in sequence;
the first extraction module is used for extracting the first sentence and the last sentence of a candidate answer text;
the second extraction module is used for performing summary extraction on the remaining text of the candidate answer text, excluding the first and last sentences, using the TextRank text summarization model, to obtain a preliminary text summary;
the combination module is used for combining the first sentence, the preliminary text summary, and the last sentence in order, to obtain a candidate answer summary;
the matching module is used for taking the question sentence and the candidate answer summary as the input of a neural network semantic representation model, to obtain the semantic relevance between the question sentence and the candidate answer summary;
the determination module is used for returning the candidate answer summary with the highest semantic relevance to the question as the answer.
6. The non-factoid question-answer selection system based on text summarization according to claim 5, characterized in that the first extraction module is specifically used for:
extracting the first sentence and the last sentence of the candidate answer text according to their positions within the candidate answer text.
7. The non-factoid question-answer selection system based on text summarization according to claim 5, characterized in that the second extraction module comprises a segmentation unit, a filtering unit, a weighted-graph-model construction unit, a similarity calculation unit, an iteration unit, a ranking unit, and a composition unit connected in sequence;
the segmentation unit is used for segmenting the candidate answer text into sentences and tokenizing each sentence;
the filtering unit is used for tagging the part of speech of each word and filtering the tagged word information, to obtain the terms of specific words;
the weighted-graph-model construction unit is used for taking the specific terms or sentences as text units, constructing nodes from the text units and edges between nodes from the similarity between text units, to obtain a weighted graph model;
the similarity calculation unit is used for calculating the similarity of any two nodes and taking the similarity value as a parameter of the node weight calculation formula;
the iteration unit is used for iterating the node weight calculation formula until convergence, to obtain the score of each node;
the ranking unit is used for ranking the nodes according to their scores at convergence, to obtain the ranked nodes;
the composition unit is used for extracting text units from the ranked nodes according to a set extraction ratio, to form the preliminary text summary.
8. The non-factoid question-answer selection system based on text summarization according to claim 7, characterized in that the similarity calculation methods used by the similarity calculation unit include: the word-overlap method, the string method, the cosine similarity method, and the longest common subsequence method.
CN201810428163.8A 2018-05-07 2018-05-07 Text abstract-based non-fact question-answer selection method and system Active CN108681574B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810428163.8A CN108681574B (en) 2018-05-07 2018-05-07 Text abstract-based non-fact question-answer selection method and system


Publications (2)

Publication Number Publication Date
CN108681574A true CN108681574A (en) 2018-10-19
CN108681574B CN108681574B (en) 2021-11-05

Family

ID=63801897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810428163.8A Active CN108681574B (en) 2018-05-07 2018-05-07 Text abstract-based non-fact question-answer selection method and system

Country Status (1)

Country Link
CN (1) CN108681574B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543089A (en) * 2018-11-30 2019-03-29 南方电网科学研究院有限责任公司 A kind of classification method, system and the relevant apparatus of network security information data
CN109766418A (en) * 2018-12-13 2019-05-17 北京百度网讯科技有限公司 Method and apparatus for output information
CN109829052A (en) * 2019-02-19 2019-05-31 田中瑶 A kind of open dialogue method and system based on human-computer interaction
CN109902284A (en) * 2018-12-30 2019-06-18 中国科学院软件研究所 A kind of unsupervised argument extracting method excavated based on debate
CN110674286A (en) * 2019-09-29 2020-01-10 出门问问信息科技有限公司 Text abstract extraction method and device and storage equipment
CN111241288A (en) * 2020-01-17 2020-06-05 烟台海颐软件股份有限公司 Emergency sensing system of large centralized power customer service center and construction method
CN111401033A (en) * 2020-03-19 2020-07-10 北京百度网讯科技有限公司 Event extraction method, event extraction device and electronic equipment
CN113282711A (en) * 2021-06-03 2021-08-20 中国软件评测中心(工业和信息化部软件与集成电路促进中心) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN113688231A (en) * 2021-08-02 2021-11-23 北京小米移动软件有限公司 Abstract extraction method and device of answer text, electronic equipment and medium
CN113806500A (en) * 2021-02-09 2021-12-17 京东科技控股股份有限公司 Information processing method and device and computer equipment
CN113918702A (en) * 2021-10-25 2022-01-11 北京航空航天大学 Semantic matching-based online legal automatic question-answering method and system
CN114997175A (en) * 2022-05-16 2022-09-02 电子科技大学 Emotion analysis method based on field confrontation training

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060282306A1 (en) * 2005-06-10 2006-12-14 Unicru, Inc. Employee selection via adaptive assessment
CN104679728A (en) * 2015-02-06 2015-06-03 中国农业大学 Text similarity detection device
CN104699763A (en) * 2015-02-11 2015-06-10 中国科学院新疆理化技术研究所 Text similarity measuring system based on multi-feature fusion
CN106126492A (en) * 2016-06-07 2016-11-16 北京高地信息技术有限公司 Statement recognition methods based on two-way LSTM neutral net and device
CN106202042A (en) * 2016-07-06 2016-12-07 中央民族大学 A kind of keyword abstraction method based on figure
CN106844368A (en) * 2015-12-03 2017-06-13 华为技术有限公司 For interactive method, nerve network system and user equipment
US20170316775A1 (en) * 2016-04-27 2017-11-02 Conduent Business Services, Llc Dialog device with dialog support generated using a mixture of language models combined using a recurrent neural network
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning
CN107590163A (en) * 2016-07-06 2018-01-16 北京京东尚科信息技术有限公司 The methods, devices and systems of text feature selection
CN107832457A (en) * 2017-11-24 2018-03-23 国网山东省电力公司电力科学研究院 Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 It is automatic to answer method, apparatus, storage medium and electronic equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FEIFEI DAI et al.: "Intent Identification for Knowledge Base Question Answering", 2017 Conference on Technologies and Applications of Artificial Intelligence (TAAI) *
JIN Lijiao et al.: "Automatic Question Answering Based on Convolutional Neural Networks", Journal of East China Normal University (Natural Science Edition) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543089A (en) * 2018-11-30 2019-03-29 南方电网科学研究院有限责任公司 A kind of classification method, system and the relevant apparatus of network security information data
CN109766418A (en) * 2018-12-13 2019-05-17 北京百度网讯科技有限公司 Method and apparatus for output information
CN109902284A (en) * 2018-12-30 2019-06-18 中国科学院软件研究所 A kind of unsupervised argument extracting method excavated based on debate
CN109829052A (en) * 2019-02-19 2019-05-31 田中瑶 A kind of open dialogue method and system based on human-computer interaction
CN110674286A (en) * 2019-09-29 2020-01-10 出门问问信息科技有限公司 Text abstract extraction method and device and storage equipment
CN111241288A (en) * 2020-01-17 2020-06-05 烟台海颐软件股份有限公司 Emergency sensing system of large centralized power customer service center and construction method
CN111401033A (en) * 2020-03-19 2020-07-10 北京百度网讯科技有限公司 Event extraction method, event extraction device and electronic equipment
US11928435B2 (en) 2020-03-19 2024-03-12 Beijing Baidu Netcom Science Technology Co., Ltd. Event extraction method, event extraction device, and electronic device
CN113806500A (en) * 2021-02-09 2021-12-17 京东科技控股股份有限公司 Information processing method and device and computer equipment
CN113806500B (en) * 2021-02-09 2024-05-28 京东科技控股股份有限公司 Information processing method, device and computer equipment
CN113282711A (en) * 2021-06-03 2021-08-20 中国软件评测中心(工业和信息化部软件与集成电路促进中心) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN113282711B (en) * 2021-06-03 2023-09-22 中国软件评测中心(工业和信息化部软件与集成电路促进中心) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN113688231A (en) * 2021-08-02 2021-11-23 北京小米移动软件有限公司 Abstract extraction method and device of answer text, electronic equipment and medium
CN113918702A (en) * 2021-10-25 2022-01-11 北京航空航天大学 Semantic matching-based online legal automatic question-answering method and system
CN114997175A (en) * 2022-05-16 2022-09-02 电子科技大学 Emotion analysis method based on field confrontation training

Also Published As

Publication number Publication date
CN108681574B (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN109189942B (en) Construction method and device of patent data knowledge graph
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
JP2017511922A (en) Method, system, and storage medium for realizing smart question answer
CN107729468A (en) Answer extracting method and system based on deep learning
CN109271524B (en) Entity linking method in knowledge base question-answering system
Al-Taani et al. An extractive graph-based Arabic text summarization approach
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN110188174B (en) Professional field FAQ intelligent question and answer method based on professional vocabulary mining
CN114912449B (en) Technical feature keyword extraction method and system based on code description text
CN114428850B (en) Text retrieval matching method and system
CN112036178A (en) Distribution network entity related semantic search method
CN106777080A (en) Short abstraction generating method, database building method and interactive method
Nityasya et al. Hypernym-hyponym relation extraction from indonesian wikipedia text
CN112417170B (en) Relationship linking method for incomplete knowledge graph
Wu et al. Domain Event Extraction and Representation with Domain Ontology.
CN109002540B (en) Method for automatically generating Chinese announcement document question answer pairs
CN113919339A (en) Artificial intelligence auxiliary writing method
CN109215797B (en) Method and system for extracting non-classification relation of traditional Chinese medicine medical case based on extended association rule
Tohalino et al. Using virtual edges to extract keywords from texts modeled as complex networks
Jebbor et al. Overview of knowledge extraction techniques in five question-answering systems

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant