CN108959269A

CN108959269A - Method and device for automatic sentence ordering

Info

Publication number: CN108959269A
Application number: CN201810839470.5A
Authority: CN
Inventors: 刘杰; 骆力明; 周建设; 史金生; 袁克柔
Original assignee: Capital Normal University
Current assignee: North China University of Technology
Priority date: 2018-07-27
Filing date: 2018-07-27
Publication date: 2018-12-07
Anticipated expiration: 2038-07-27
Also published as: CN108959269B

Abstract

The present invention provides a method and device for automatic sentence ordering. The method comprises: performing sentence-segmentation preprocessing on a document set to obtain a sentence set; training on the sentence set to obtain a word-vector dictionary, and clustering the word vectors in combination with a preset Chinese synonym thesaurus; computing the proximity between sentences in the sentence set based on a conditional entropy algorithm combined with the word-vector clustering result; and ordering the sentences in the sentence set using a Markov random walk model. The semantic analysis of the invention enables automatic evaluation of the logical coherence of the sentences in a text, improves evaluation efficiency, reduces evaluation error, mitigates the effect of data sparsity, and improves the efficiency with which sentence-ordering results are generated.

Description

Method and device for automatic sentence ordering

Technical field

The present invention relates to the field of computer technology, and in particular to a method and device for automatic sentence ordering.

Background art

With the rapid development of Internet technology, research on the automatic scoring of Chinese compositions has gradually emerged. Such scoring is of great significance for improving scoring efficiency, fundamentally eliminating inconsistency in composition evaluation, and controlling scoring error. Because the logic of the Chinese language is highly complex, existing research mostly evaluates compositions in terms of vocabulary use, grammar, expression, composition length, use of connectives, use of rhetorical devices, and topical consistency, and does not evaluate the soundness of a composition's internal logic. In composition evaluation, however, logical soundness is also an important indicator of language expression ability. Sound logic between the sentences of a text means that the sentences are organized in a reasonable order, and such a text is highly readable.

In the prior art, research on sentence ordering appears mainly in the field of automatic text summarization. The sentence-ordering task in that field is mainly to organize a set of manually written summary sentences whose order has been shuffled, or a set of machine-selected candidate summary sentences, into a reasonable and readable summary. Existing research falls roughly into the following classes. First, determining sentence order from temporal information in the sentences: sentences are ordered according to the times that appear in the corpus; in a news corpus, for example, the temporal information inside each sentence is extracted and then used to assist a sorting algorithm. Second, determining sentence order from implicit relations between sentences in the document set: this approach mines the logical relations implied between sentences from the transfer of entities between sentences, the continuation state of event labels, topic transfer, and so on. Third, mining the natural order of sentences from a large-scale corpus: this approach computes the proximity between adjacent sentences at the vocabulary level and estimates the conditional probability that two sentences form a preceding/following pair, thereby obtaining the ordering result.

The above research nevertheless has problems. For the first and second classes, methods that use temporal information, inheritance relations between sentences, or sentence topics are quite limited and cannot order the sentences of a text that lacks this specific information; moreover, because machines understand natural language poorly, relying on the identification of topic words, time words, and implicit times, and on the mining of implicit connectives, is itself a major difficulty. For the third class, the main shortcoming is that word collocations between sentence pairs are computed from a large-scale corpus, so the parameter space is large and data sparsity arises easily, which hinders the subsequent proximity computation.

Summary of the invention

In view of the problems in the prior art, the present invention provides a method for automatic sentence ordering, comprising:

(1) performing sentence-segmentation preprocessing on a document set to obtain a sentence set;

(2) training on the sentence set to obtain a word-vector dictionary, and clustering the word vectors in combination with a preset Chinese synonym thesaurus;

(3) based on a conditional entropy algorithm and in combination with the word-vector clustering result, computing the logical collocation information of words between sentence pairs in the sentence set, thereby obtaining the proximity between sentences in the sentence set.

Further, the conditional entropy is calculated as follows:

$$H(S_m \mid S_{m-1}) = -\sum_{w_i \in S_{m-1}} \sum_{w_j \in S_m} p(w_i w_j)\,\log p(w_j \mid w_i)$$

where $H(S_m \mid S_{m-1})$ is the conditional entropy between two adjacent sentences in the sentence set; $S_m$ and $S_{m-1}$ are two adjacent sentences; $m$ is the index of a sentence in the sentence set, a positive integer with $2 \le m \le n$; $n$ is the total number of sentences in the sentence set; $w_i$ is a word occurring in $S_{m-1}$ and $w_j$ is a word occurring in $S_m$, where $i$ and $j$ are positive integers; $p(w_i w_j)$ is the probability that $w_i$ and $w_j$ occur together; and $p(w_j \mid w_i)$ is the conditional probability.

Further, the sentences in the sentence set are ordered using a neural-network-based algorithm that recursively derives global information from the whole and determines the importance of any node.

Further, the neural network algorithm is based on a Markov random walk model.

Further, the word vectors are clustered into 500 to 1500 classes.

Further, the preset Chinese synonym thesaurus contains more than 7000 classes of synonyms.

Further, the automatic sentence ordering method also comprises an evaluation step for the sentence-ordering result, in which the ordering result is scored using ROUGE-L.

Further, the threshold of the ROUGE-L score is set to 0.6; that is, the true sentence order of the document is compared with the sentence order produced by the automatic sentence ordering method, and if the ROUGE-L score is greater than or equal to the threshold, the two orderings are considered similar.

Further, the sentence set is divided into a set of sentence blocks, each containing 2 to 3 sentences;

first, based on the conditional entropy algorithm and in combination with the word-vector clustering result, the logical collocation information of words between adjacent sentence blocks in the sentence-block set is computed, thereby obtaining the proximity between sentence blocks in the sentence-block set;

then, based on the conditional entropy algorithm and in combination with the word-vector clustering result, the logical collocation information of words between sentence pairs within each sentence block is computed, thereby obtaining the proximity between the sentences within each sentence block.

The present invention also provides a generating device for automatic sentence ordering, comprising:

a document preprocessing module, configured to perform sentence segmentation on a document set to obtain the sentence set corresponding to the document set;

a word-vector clustering module, configured to train on the sentence set to obtain a word-vector dictionary, and to cluster the word vectors in combination with a preset Chinese synonym thesaurus;

a proximity computing module, configured to compute, based on a conditional entropy algorithm and in combination with the clustering result of the word vectors, the logical collocation information of words between sentence pairs in the sentence set, thereby obtaining the proximity between sentences in the sentence set;

an ordering-result generation module, configured to order the sentences using a Markov random walk model according to the computed proximity of the sentences, and to obtain the ordering result.

The beneficial effects of the invention are as follows:

(1) The semantic analysis of the invention enables automatic evaluation of the logical coherence of the sentences in a text, improving evaluation efficiency and reducing evaluation error.

(2) The invention uses an unsupervised method and generalizes well both to large corpora and to smaller ones.

(3) The invention orders sentences using a Markov random walk model; the algorithm is efficient and the ordering results are more reliable.

(4) The invention groups words semantically by clustering word vectors, which mitigates the effect of data sparsity and improves computational efficiency.

(5) Combining the Chinese synonym thesaurus reduces the inaccuracy of automatic clustering and improves the sentence-ordering results.

(6) Splitting a paragraph into sentence blocks yields a more reasonable automatic sentence-ordering result.

Brief description of the drawings

In order to illustrate more clearly the specific embodiments of the invention or the technical solutions in the prior art, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below evidently show some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of the automatic sentence ordering method of the present invention;

Fig. 2 is a structural schematic diagram of the automatic sentence ordering device of the present invention.

Detailed description of the embodiments

The technical solutions of the invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

In the description of the invention, it should be noted that the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout; a sentence pair refers to two adjacent sentences in the sentence set.

Fig. 1 is a flow chart of the automatic sentence ordering method according to one embodiment of the invention. The method comprises the following steps:

(1) 16,329 primary and secondary school compositions about people are obtained from composition websites on the network, together with 109,404 compositions of other categories, giving a document set of 125,733 documents in total; sentence-segmentation preprocessing is performed on the document set to obtain the sentence set;
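The sentence-segmentation preprocessing of step (1) could be sketched as follows. This is a minimal illustration, not the segmentation rule used in the embodiment: the set of sentence-ending marks and the function names `split_sentences` and `build_sentence_set` are assumptions made for the example.

```python
import re
from typing import List

# Assumed sentence-ending marks: Chinese full stop, exclamation and question
# marks, plus ASCII '!' and '?'. ASCII '.' is left out so decimals stay intact.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(document: str) -> List[str]:
    """Split one document into sentences, keeping the closing punctuation."""
    pieces = (match.strip() for match in _SENTENCE.findall(document))
    return [piece for piece in pieces if piece]


def build_sentence_set(documents: List[str]) -> List[List[str]]:
    """Apply the splitter to every document of a document set."""
    return [split_sentences(doc) for doc in documents]
```

Each document yields its own list of sentences, so adjacent sentence pairs can later be read off document by document.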

(2) The sentence set is used for training to obtain a word-vector dictionary, and the word vectors are clustered in combination with a preset Chinese synonym thesaurus. The word vectors are preferably trained with word2vec, yielding a word-vector dictionary of 79,770 words in total. The preset thesaurus contains more than 7000 classes of synonyms and is more preferably the Extended Edition of the Chinese thesaurus (Tongyici Cilin) of the Harbin Institute of Technology Information Retrieval Laboratory, which covers 11,769 classes of synonyms in total. The word vectors are preferably clustered into 500 to 1500 classes, and more preferably into 1500 classes.

(3) Based on the conditional entropy algorithm and in combination with the clustering result of the word vectors, the logical collocation information of words between sentence pairs in the sentence set is computed, thereby obtaining the proximity between sentences in the sentence set. The conditional entropy is calculated as follows:

$$H(S_m \mid S_{m-1}) = -\sum_{w_i \in S_{m-1}} \sum_{w_j \in S_m} p(w_i w_j)\,\log p(w_j \mid w_i)$$
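A minimal Python sketch of the conditional-entropy computation in step (3) is given below. It assumes the standard conditional-entropy form built from the joint probability p(w_i w_j) and the conditional probability p(w_j | w_i), and it assumes that both probabilities are estimated from co-occurrence counts of word clusters over adjacent sentences of the training corpus. The mapping `word_to_cluster`, the function names, and the choice to skip unseen pairs are all hypothetical; how the entropy value is converted into an edge weight is not specified in the text and is left out.

```python
import math
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence

Sentence = Sequence[str]  # one sentence as a sequence of segmented words


def count_cluster_pairs(documents: List[List[Sentence]],
                        word_to_cluster: Dict[str, int]) -> Counter:
    """Count (cluster of w_i, cluster of w_j) over all adjacent sentence pairs."""
    counts: Counter = Counter()
    for doc in documents:
        for prev, curr in zip(doc, doc[1:]):
            prev_ids = [word_to_cluster[w] for w in prev if w in word_to_cluster]
            curr_ids = [word_to_cluster[w] for w in curr if w in word_to_cluster]
            counts.update(product(prev_ids, curr_ids))
    return counts


def conditional_entropy(prev: Sentence, curr: Sentence, counts: Counter,
                        word_to_cluster: Dict[str, int]) -> float:
    """H(curr | prev) = -sum over word pairs of p(ci, cj) * log p(cj | ci)."""
    total = sum(counts.values())
    row_totals: Counter = Counter()
    for (ci, _), n in counts.items():
        row_totals[ci] += n
    h = 0.0
    for wi, wj in product(prev, curr):
        ci, cj = word_to_cluster.get(wi), word_to_cluster.get(wj)
        n = counts.get((ci, cj), 0)
        if n == 0:
            continue  # assumption: unseen cluster pairs contribute nothing
        h -= (n / total) * math.log(n / row_totals[ci])
    return h
```

Counting cluster pairs rather than word pairs is what keeps the parameter space small: with 1500 clusters there are at most 1500 × 1500 pair counts, regardless of vocabulary size.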

(4) After the proximity between sentences in the sentence set has been determined, the sentences are ordered using a neural-network-based algorithm that recursively derives global information from the whole and determines the importance of any node. Preferably, a Markov random walk model is selected to order the sentences, computed as follows:

The random walk matrix corresponds to an ergodic Markov chain, in which any two states can reach each other through successive transitions. The Markov random walk model defines a graph $G = (V, E)$, where $V$ is the vertex set, namely the set of sentences to be ordered, and $E$ is the edge set, namely the proximity between any two sentences in the set to be ordered, whose value is the probability of $v_i \to v_j$ computed by the conditional entropy formula, where $i$ and $j$ are positive integers denoting the indices of sentences in the sentence set. For $m$ sentences to be ordered, the random walk matrix $M = |M_{i,j}|_{m \times m}$ is obtained.

Based on the matrix $M$, the score $\mathrm{SentScore}(v_i)$ of a sentence in the sentence set is obtained from the other sentences, calculated as follows:

The graph $G = (V, E)$ is iterated according to the above until convergence, and the sentence with the highest score is placed first in the order; the remaining sentences then form a new graph $G'$ on which the operation is repeated, until the set $V$ of sentences to be ordered is empty. The sequence in which the sentences were selected is the final ordering result.
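The iterate-select-rebuild procedure described above can be sketched in Python as follows. The score update used here is a PageRank-style rule with a damping factor of 0.85, which is an assumption made for illustration; the exact SentScore formula and any damping value are not reproduced in the text. The function names and the `weights` matrix layout (`weights[i][j]` is the proximity of the transition from sentence i to sentence j) are likewise hypothetical.

```python
from typing import List


def random_walk_scores(weights: List[List[float]], damping: float = 0.85,
                       tol: float = 1e-8, max_iter: int = 1000) -> List[float]:
    """Iterate sentence scores over a row-normalised proximity matrix."""
    n = len(weights)
    rows = []
    for i, row in enumerate(weights):
        # Ignore self-transitions and turn each row into a distribution.
        total = sum(w for j, w in enumerate(row) if j != i)
        rows.append([w / total if j != i and total > 0 else 0.0
                     for j, w in enumerate(row)])
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1 - damping) / n
               + damping * sum(scores[j] * rows[j][i] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, scores)) < tol:
            return new
        scores = new
    return scores


def order_sentences(weights: List[List[float]]) -> List[int]:
    """Pick the top-scoring sentence, rebuild the graph without it, repeat."""
    remaining = list(range(len(weights)))
    order: List[int] = []
    while remaining:
        sub = [[weights[i][j] for j in remaining] for i in remaining]
        scores = random_walk_scores(sub)
        best = max(range(len(remaining)), key=scores.__getitem__)
        order.append(remaining.pop(best))
    return order
```

Rebuilding the graph after every selection means the scores are recomputed n times for n sentences, which is affordable for the short documents considered here.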

(5) After the ordering result for the sentence set is obtained, it is evaluated by scoring it with ROUGE-L, which scores similarity in terms of the longest common subsequence, calculated as follows:

LSC = lsc(stand_order, sorted_order)

where LSC is the length of the longest common subsequence of the ordering result of the automatic sentence ordering method (sorted_order) and the true sentence order of the document (stand_order); len(sorted_order) is the length of the ordering result of the method and len(stand_order) is the length of the true sentence order of the document, the two lengths being equal; R denotes recall, P denotes precision, and score(ROUGE-L) is the ROUGE-L score. After simplification, the ROUGE-L score is determined by the ratio of the length of the common subsequence to the length of the ordering result.
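The evaluation can be illustrated with a short Python sketch. It assumes the standard ROUGE-L definition, which is based on the longest common subsequence, and it assumes the balanced F-measure 2RP / (R + P) for combining recall and precision. Because the two orderings have the same length, R equals P and the score reduces to LSC / len(stand_order) whichever weighting is used. The function names are illustrative only.

```python
from typing import Hashable, Sequence


def lcs_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programme)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(stand_order: Sequence[Hashable], sorted_order: Sequence[Hashable]) -> float:
    """ROUGE-L score of a predicted order against the true order."""
    if not stand_order or not sorted_order:
        raise ValueError("both orderings must be non-empty")
    lsc = lcs_length(stand_order, sorted_order)
    if lsc == 0:
        return 0.0
    recall = lsc / len(stand_order)
    precision = lsc / len(sorted_order)
    return 2 * recall * precision / (recall + precision)


def is_acceptable(stand_order: Sequence[Hashable], sorted_order: Sequence[Hashable],
                  threshold: float = 0.6) -> bool:
    """Apply the 0.6 acceptance threshold to the ROUGE-L score."""
    return rouge_l(stand_order, sorted_order) >= threshold
```

For example, swapping two neighbouring sentences in a five-sentence document leaves a common subsequence of length 4, giving a score of 0.8, which passes the 0.6 threshold.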

Preferably, the threshold of the ROUGE-L score is set to 0.6; that is, the true sentence order of the document is compared with the order produced by the automatic sentence ordering method, and if the ROUGE-L score is greater than or equal to the threshold, the two orderings are similar and the order produced by the method is considered consistent and acceptable. The proportion of acceptable orderings obtained by the semantic-analysis ordering method is then counted.

Further, analysis of the experimental results shows that when the documents in the document set contain few sentences, the automatic sentence ordering method achieves more acceptable ordering results, but as the number of sentences per document increases, the acceptable-ordering ratio gradually declines. An optimization step for the automatic sentence ordering method is therefore proposed, specifically:

First, the sentence set is divided into a set of sentence blocks, each containing 2 to 3 sentences.

Second, based on the conditional entropy algorithm and in combination with the word-vector clustering result, the logical collocation information of words between adjacent sentence blocks in the sentence-block set is computed, thereby obtaining the proximity between sentence blocks in the sentence-block set; the ordering result of the sentence blocks is then obtained using the Markov random walk model.

Third, based on the conditional entropy algorithm and in combination with the word-vector clustering result, the logical collocation information of words between sentence pairs within each sentence block is computed, thereby obtaining the proximity between the sentences within each sentence block; the ordering result of the sentences within each sentence block is then obtained using the Markov random walk model.

Finally, the ordering result of the sentence blocks is combined with the ordering results of the sentences within the sentence blocks to obtain the final order of the sentences of the document.
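The final combination step can be sketched as a two-level ordering. The sketch below is an illustration under stated assumptions: `order_items` is a hypothetical callable standing in for the conditional-entropy and random-walk ordering, applied once to the blocks and once inside each block, and the blocks are taken as already formed, since the text does not say how the 2 to 3 sentences of a block are chosen.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given a list of items (sentences or whole blocks),
# return the indices of those items in their predicted order.
Orderer = Callable[[list], List[int]]


def order_with_blocks(blocks: Sequence[Sequence[str]],
                      order_items: Orderer) -> List[str]:
    """Order the blocks, order the sentences inside each block, then flatten."""
    block_order = order_items([list(block) for block in blocks])
    result: List[str] = []
    for b in block_order:
        sentences = list(blocks[b])
        result.extend(sentences[i] for i in order_items(sentences))
    return result
```

Each ordering call now works on a small graph, either a handful of blocks or 2 to 3 sentences, which matches the observation that ordering quality is higher when fewer items have to be ordered at once.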

Experiments verify that after the optimization step is adopted, the decline in the acceptable-ordering ratio as the number of sentences per document increases is slowed, which demonstrates that the optimization strategy adopted for the automatic sentence ordering method is feasible.

In addition, referring to Fig. 2, the present invention also provides a generating device for automatic sentence ordering, comprising:

a document preprocessing module 100, configured to perform sentence segmentation on a document set to obtain the sentence set corresponding to the document set;

a word-vector clustering module 200, configured to train on the sentence set to obtain a word-vector dictionary, and to cluster the word vectors in combination with a preset Chinese synonym thesaurus;

a proximity computing module 300, configured to compute, based on the conditional entropy algorithm and in combination with the clustering result of the word vectors, the logical collocation information of words between sentence pairs in the sentence set, thereby obtaining the proximity between sentences in the sentence set;

an ordering-result generation module 400, configured to order the sentences using the Markov random walk model according to the computed proximity of the sentences, and to obtain the ordering result.

Furthermore, although several embodiments and preferred embodiments of the general inventive concept have been shown and described, those skilled in the art will appreciate that changes may be made to these embodiments without departing from the principles and spirit of the general inventive concept, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for automatic sentence ordering, characterized by comprising:

(1) performing sentence-segmentation preprocessing on a document set to obtain a sentence set;

(2) training on the sentence set to obtain a word-vector dictionary, and clustering the word vectors in combination with a preset Chinese synonym thesaurus;

(3) based on a conditional entropy algorithm and in combination with the word-vector clustering result, computing the logical collocation information of words between sentence pairs in the sentence set, thereby obtaining the proximity between sentences in the sentence set.

2. The automatic sentence ordering method according to claim 1, characterized in that the conditional entropy is calculated as follows:

$$H(S_m \mid S_{m-1}) = -\sum_{w_i \in S_{m-1}} \sum_{w_j \in S_m} p(w_i w_j)\,\log p(w_j \mid w_i)$$

where $H(S_m \mid S_{m-1})$ is the conditional entropy between two adjacent sentences in the sentence set; $S_m$ and $S_{m-1}$ are two adjacent sentences; $m$ is the index of a sentence in the sentence set, a positive integer with $2 \le m \le n$; $n$ is the total number of sentences in the sentence set; $w_i$ is a word occurring in $S_{m-1}$ and $w_j$ is a word occurring in $S_m$, where $i$ and $j$ are positive integers; $p(w_i w_j)$ is the probability that $w_i$ and $w_j$ occur together; and $p(w_j \mid w_i)$ is the conditional probability.

3. The automatic sentence ordering method according to claim 1, characterized in that the sentences in the sentence set are ordered using a neural-network-based algorithm that recursively derives global information from the whole and determines the importance of any node.

4. The automatic sentence ordering method according to claim 3, characterized in that the neural network algorithm is based on a Markov random walk model.

5. The automatic sentence ordering method according to claim 1, characterized in that the word vectors are clustered into 500 to 1500 classes.

6. The automatic sentence ordering method according to claim 1, characterized in that the preset Chinese synonym thesaurus contains more than 7000 classes of synonyms.

7. The automatic sentence ordering method according to claim 3, characterized in that the method further comprises an evaluation step for the sentence-ordering result, in which the ordering result is scored using ROUGE-L.

8. The automatic sentence ordering method according to claim 7, characterized in that the threshold of the ROUGE-L score is set to 0.6; that is, the true sentence order of the document is compared with the sentence order produced by the automatic sentence ordering method, and if the ROUGE-L score is greater than or equal to the threshold, the two orderings are similar.

9. The automatic sentence ordering method according to any one of claims 1 to 4, characterized in that the sentence set is divided into a set of sentence blocks, each containing 2 to 3 sentences;

first, based on the conditional entropy algorithm and in combination with the word-vector clustering result, the logical collocation information of words between adjacent sentence blocks in the sentence-block set is computed, thereby obtaining the proximity between sentence blocks in the sentence-block set;

then, based on the conditional entropy algorithm and in combination with the word-vector clustering result, the logical collocation information of words between sentence pairs within each sentence block is computed, thereby obtaining the proximity between the sentences within each sentence block.

10. A generating device for automatic sentence ordering, characterized by comprising:

a document preprocessing module, configured to perform sentence segmentation on a document set to obtain the sentence set corresponding to the document set;

a word-vector clustering module, configured to train on the sentence set to obtain a word-vector dictionary, and to cluster the word vectors in combination with a preset Chinese synonym thesaurus;

a proximity computing module, configured to compute, based on a conditional entropy algorithm and in combination with the clustering result of the word vectors, the logical collocation information of words between sentence pairs in the sentence set, thereby obtaining the proximity between sentences in the sentence set;

an ordering-result generation module, configured to order the sentences using a Markov random walk model according to the computed proximity of the sentences, and to obtain the ordering result.