CN103246640A

CN103246640A - Duplicated text detection method and device

Info

Publication number: CN103246640A
Application number: CN2013101443394A
Authority: CN
Inventors: 李鹏; 孙熙; 陆承恩
Original assignee: TENFEN Inc
Current assignee: Kuyun Interactive Technology Ltd
Priority date: 2013-04-23
Filing date: 2013-04-23
Publication date: 2013-08-14
Anticipated expiration: 2033-04-23
Also published as: CN103246640B

Abstract

The invention discloses a duplicated text detection method that detects with higher accuracy whether a text is duplicated. The method includes: acquiring the feature words and feature word sequences of an existing text and of a text to be detected; matching each feature word in the text to be detected against each feature word in the existing text; when a match succeeds, acquiring the absolute position of the matched feature word in the feature word sequence of the text to be detected and its absolute position in the feature word sequence of the existing text; judging whether there is a group of matched feature words whose absolute positions in the two feature word sequences are linearly related; and, if such a group exists, determining the duplicated regions of the text to be detected and the existing text from the absolute positions of that group in the two feature word sequences. The invention further discloses a device for implementing the method.

Description

Method and device for detecting duplicated text

Technical field

The present invention relates to the field of Internet technology, and in particular to a method for detecting duplicated text.

Background technology

Media outlets continually produce and publish massive amounts of news, a considerable part of which is duplicated. Duplication takes the form of reprinting (with few or no wording changes), merging (two articles combined into one), and excerpting (fragments cut from an article and published on their own). For news portals and the various news applications now emerging, detecting and filtering such duplicate content is a key task for improving the user experience.

One prior-art scheme is based on word frequency. The words occurring in a text, the number of times each word occurs in the article, the number of articles in the whole collection in which the word occurs, and similar frequency information are used to compute text features, and a measure such as cosine similarity is used to compare the features of two texts. This method uses how often words occur but ignores where key words occur in the text, so different articles with similar key words may be misjudged as duplicates (for example, two different news items about the same celebrity). In addition, it cannot reliably detect the merging and excerpting forms of duplication.

Another scheme is based on word fragments. The text is treated as the set of its contiguous word subsequences; for example, the length-2 word fragments of "I love Beijing Tiananmen" are "I love", "love Beijing", and "Beijing Tiananmen". On this basis, a measure such as the overlap of the fragment sets can be used to gauge the similarity between texts. This method takes word order into account, but a small rewrite of the news text (for example, a reprinting outlet changing "reported by this publication" to "reported by XX media") affects several word fragments at once, so the scheme is sensitive to word insertions and deletions in the document. It also cannot reliably detect merging and excerpting.
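The sensitivity of the word-fragment scheme can be illustrated with a minimal sketch. The function names `word_fragments` and `fragment_overlap`, the choice of Jaccard overlap as the overlap measure, and the rewritten sentence are assumptions made for illustration; the prior art described above does not fix any of them.

```python
def word_fragments(words, size=2):
    """Return the contiguous word subsequences of the given length, in order."""
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def fragment_overlap(words_a, words_b, size=2):
    """Jaccard overlap of the two fragment sets (one possible overlap measure)."""
    set_a = set(word_fragments(words_a, size))
    set_b = set(word_fragments(words_b, size))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


original = ["I", "love", "Beijing", "Tiananmen"]
rewritten = ["I", "visit", "Beijing", "Tiananmen"]  # hypothetical one-word rewrite

fragments = word_fragments(original)
overlap = fragment_overlap(original, rewritten)
print(fragments)
print(overlap)
```

Replacing a single word changes two of the three length-2 fragments, so the overlap drops sharply even though the two sentences are nearly identical.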

Summary of the invention

An embodiment of the invention provides a method and a device for detecting duplicated text, which determine whether a text is duplicated and improve the accuracy of the detection.

A method for detecting duplicated text comprises the following steps: obtaining the feature words and feature word sequences of a text to be detected and of an existing text; matching each feature word in the text to be detected against each feature word in the existing text; when a feature word is matched, obtaining the absolute position of the matched feature word in the feature word sequence of the text to be detected and its absolute position in the feature word sequence of the existing text; judging whether there is a group of matched feature words whose absolute positions in the feature word sequences of the text to be detected and the existing text have a linear relationship; and, if such a group exists, determining the repeat region of the text to be detected and the existing text from the absolute positions of that group in the two feature word sequences. By using the absolute position of the feature words as a parameter, the embodiment can effectively detect reprinting, merging, and excerpting of text.

Preferably, the step of obtaining the feature words and feature word sequences comprises: performing word segmentation on the text to be detected and the existing text; filtering the text to be detected and the existing text to obtain their feature words; and obtaining the feature word sequence of each text from the order in which the feature words occur in that text. Extracting feature words allows duplicated text to be detected accurately and efficiently and prevents interference from meaningless words.

Preferably, the step of judging whether there is a group of matched feature words comprises: calculating the position difference between the absolute position of each matched feature word in the text to be detected and its absolute position in the existing text; counting the number of occurrences of each position difference; calculating a first ratio of the occurrence count of each position difference to the sum of the occurrence counts of all position differences; judging whether there is a first ratio greater than a first preset threshold; and, if there is, classifying the feature words corresponding to that first ratio as a group of matched feature words. Reprinting, merging, and excerpting can all be detected from the occurrence counts of these position differences. The computation consists mainly of additions, so when implemented in a program it is cheap and efficient.

Preferably, the step of judging whether there is a group of matched feature words further comprises: treating the position differences of several neighbouring matched feature words as one group, and calculating a second ratio of the sum of the occurrence counts of the position differences in the group to the sum of the occurrence counts of all position differences; judging whether there is a second ratio greater than a second preset threshold; and, if there is, classifying the feature words corresponding to that second ratio as a group of matched feature words. This embodiment can detect texts that have been altered only by inserting or deleting words.

Preferably, the step of judging whether there is a group of matched feature words comprises: determining the coordinates of each matched feature word in a rectangular coordinate system from its absolute positions in the text to be detected and in the existing text, the coordinate system being constructed from the absolute position of a feature word in the text to be detected and its absolute position in the existing text; performing line fitting on the coordinates of the matched feature words; judging whether there is a group of coordinates that can be fitted to a straight line with a slope of approximately 1; and, if such a group exists, classifying the feature words corresponding to those coordinates as a group of matched feature words. Through line fitting, this embodiment can locate the repeated parts of two texts more intuitively.

A device for detecting duplicated text comprises:

an extraction module, configured to obtain the feature words and feature word sequences of the text to be detected and of the existing text;

a matching module, configured to match each feature word in the text to be detected against each feature word in the existing text;

a position acquisition module, configured to obtain, when the matching module matches a feature word, the absolute position of the matched feature word in the feature word sequence of the text to be detected and its absolute position in the feature word sequence of the existing text;

a judge module, configured to judge whether there is a group of matched feature words whose absolute positions in the feature word sequence of the text to be detected and in the feature word sequence of the existing text have a linear relationship;

a repeated text determination module, configured to determine, when the judge module finds that a group of matched feature words exists, the repeat region of the text to be detected and the existing text from the absolute positions of that group in the two feature word sequences.

Preferably, the extraction module comprises:

a word segmentation unit, configured to perform word segmentation on the text to be detected and the existing text;

a filter unit, configured to filter the text to be detected and the existing text to obtain their feature words;

a sequencing unit, configured to obtain the feature word sequence of each text from the order in which the feature words occur in the text to be detected and in the existing text.

Preferably, the device further comprises a data computation module, a statistical module, and a data processing module.

The data computation module is configured to calculate the position difference between the absolute position of each matched feature word in the text to be detected and its absolute position in the existing text.

The statistical module is configured to count the number of occurrences of each position difference.

The data processing module is configured to calculate the first ratio of the occurrence count of each position difference to the sum of the occurrence counts of all position differences.

The judge module is configured to judge whether there is a first ratio greater than the first preset threshold and, when there is, to classify the feature words corresponding to that first ratio as a group of matched feature words; the judge module then determines that a group of matched feature words exists.

Preferably, the data processing module is further configured to treat the position differences of several neighbouring matched feature words as one group and to calculate the second ratio of the sum of the occurrence counts of the position differences in the group to the sum of the occurrence counts of all position differences.

The judge module is further configured to judge whether there is a second ratio greater than the second preset threshold and, when there is, to classify the feature words corresponding to that second ratio as a group of matched feature words; the judge module then determines that a group of matched feature words exists.

Preferably, the device further comprises a coordinate extraction module and a line fitting module.

The coordinate extraction module is configured to determine the coordinates of each matched feature word in a rectangular coordinate system from its absolute positions in the text to be detected and in the existing text, the coordinate system being constructed from the absolute position of a feature word in the text to be detected and its absolute position in the existing text.

The line fitting module is configured to perform line fitting on the coordinates of the matched feature words.

The judge module is configured to judge whether there is a group of coordinates that can be fitted to a straight line with a slope of approximately 1 and, if such a group exists, to classify the feature words corresponding to those coordinates as a group of matched feature words; the judge module then determines that a group of matched feature words exists.

Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the specification or be understood by practising the invention. The objects and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the written specification, the claims, and the accompanying drawings.

The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.

Description of drawings

The accompanying drawings provide a further understanding of the present invention and form part of the specification. Together with the embodiments of the invention they serve to explain the invention and do not limit it. In the drawings:

Fig. 1 is a flowchart of the main method for detecting duplicated text in an embodiment of the invention;

Fig. 2 is a detailed flowchart of the first method for detecting duplicated text in an embodiment of the invention;

Fig. 3 is a flowchart of the main method for optimizing duplicated text detection after the occurrence counts of the position differences have been tallied in an embodiment of the invention;

Fig. 4 is a detailed flowchart of the second method for detecting duplicated text in an embodiment of the invention;

Fig. 5 is a coordinate plot of the matched feature words in an embodiment of the invention;

Fig. 6 is a main structural diagram of a device for detecting duplicated text in an embodiment of the invention;

Fig. 7 is a detailed structural diagram of the extraction module in an embodiment of the invention;

Fig. 8 is a first detailed structural diagram of a device for detecting duplicated text in an embodiment of the invention;

Fig. 9 is a second detailed structural diagram of a device for detecting duplicated text in an embodiment of the invention.

Embodiment

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here only illustrate and explain the invention and do not limit it.

In the embodiment of the invention, meaningless words are first removed from the text to be detected, and the remaining words serve as the feature words for detecting duplication. The feature words of a text, together with their absolute positions in the text, are then taken as features of the text, and the position differences between the feature words of the two texts serve as the key parameter for measuring text similarity.

Referring to Fig. 1, the main method flow for detecting duplicated text in this embodiment is as follows:

Step 101: obtain the feature words and feature word sequences of the text to be detected and of the existing text.

In the embodiment of the invention, the feature words of the text to be detected and of the existing text are extracted first. Words that carry real meaning in the text can serve as feature words, such as nouns, verbs, technical terms, place names, and personal names, for example "automobile", "blood cell", and "Beijing". Prepositions, conjunctions, and the like, for example "because", cannot serve as feature words. Extracting the feature words takes two steps: performing word segmentation on the text to be detected and the existing text, and then filtering both texts to obtain their feature words. There are many word segmentation methods; a commonly used one is the maximum matching algorithm (Maximum Matching), which comes in two variants: forward maximum matching and reverse maximum matching. Matching against an existing vocabulary with the maximum matching algorithm can effectively reduce the number of feature words and thereby improve detection efficiency. After segmentation, the stop words in each text are filtered out according to a preset stop word list, which saves storage space and improves detection efficiency. Once the feature words have been extracted, an ordered feature word sequence is obtained for each text from the order in which the feature words occur in that text.
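The extraction step can be sketched as follows. This is a minimal illustration that assumes a tiny hand-written vocabulary and stop word list; the names `forward_max_match` and `feature_word_sequence`, the `max_len` parameter, and the sample sentence are all hypothetical and are not taken from the patent.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Segment text by taking, at each cursor position, the longest vocabulary entry.

    Characters that start no vocabulary entry are emitted as single-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


def feature_word_sequence(text, vocabulary, stop_words):
    """Segment the text, then drop stop words while keeping the original word order."""
    return [w for w in forward_max_match(text, vocabulary) if w not in stop_words]


# Hypothetical vocabulary and stop word list, for illustration only.
VOCABULARY = {"我", "爱", "北京", "天安门", "的"}
STOP_WORDS = {"的"}

sequence = feature_word_sequence("我爱北京的天安门", VOCABULARY, STOP_WORDS)
print(sequence)
```

A production system would use a full segmentation dictionary and a curated stop word list; the sketch only shows how segmentation and filtering combine into an ordered feature word sequence.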

Step 102: match each feature word in the text to be detected against each feature word in the existing text; if a match succeeds, continue with step 103.

Step 103: obtain the absolute position of each matched feature word in the feature word sequence of the text to be detected and its absolute position in the feature word sequence of the existing text.

In the embodiment of the invention, the feature word sequence is regarded as a one-dimensional sequence. Because the same feature word may occur several times, each feature word corresponds to one or more position labels in the feature word sequence. For example, if the feature word "news" occurs at positions 5 and 41 of the feature word sequence of the text to be detected, its absolute positions are 5 and 41. A feature word together with its absolute positions can then be taken as one feature of the text to be detected; in this example, {"news", 5, 41} is one such feature. Only feature words that occur in both the text to be detected and the existing text can be used to detect duplication. Each feature word in the text to be detected is therefore matched against each feature word in the existing text, and for each matched feature word its absolute positions in both feature word sequences are recorded. For example, suppose the feature word "news" occurs at positions 5 and 41 of the feature word sequence of the text to be detected and at positions 9, 37, and 45 of the feature word sequence of the existing text. Then "news" is a matched feature word whose absolute positions are 5 and 41 in the text to be detected and 9, 37, and 45 in the existing text.
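One way to record these positions is a dictionary from each feature word to the list of its positions, followed by an intersection of the two dictionaries. The sketch below assumes 1-based positions; the function names `position_index` and `matched_positions`, and the extra words "car" and "Beijing", are hypothetical.

```python
from collections import defaultdict


def position_index(sequence):
    """Map each feature word to the ascending list of its 1-based positions."""
    index = defaultdict(list)
    for position, word in enumerate(sequence, start=1):
        index[word].append(position)
    return dict(index)


def matched_positions(test_index, existing_index):
    """Keep only words present in both texts, with their positions in each text."""
    return {
        word: (test_index[word], existing_index[word])
        for word in test_index
        if word in existing_index
    }


# The "news" example from the description, written directly as position indexes.
test_index = {"news": [5, 41], "car": [7]}
existing_index = {"news": [9, 37, 45], "Beijing": [3]}

matches = matched_positions(test_index, existing_index)
print(matches)
```

Only "news" survives the intersection, carrying positions 5 and 41 in the text to be detected and 9, 37, and 45 in the existing text.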

Step 104: judge whether there is a group of matched feature words whose absolute positions in the feature word sequences of the text to be detected and the existing text have a linear relationship; if so, continue with step 105.

In the embodiment of the invention, if a passage in the text to be detected is identical to a passage in the existing text, then for the feature words in that passage the relationship between the absolute position in the feature word sequence of the text to be detected and the absolute position in the feature word sequence of the existing text can be represented by a straight line with slope 1. Equivalently, the position difference between the two absolute positions is the same for every feature word in the passage. Whether the text is duplicated can therefore be detected by checking whether the absolute positions of the feature words in the two feature word sequences are linearly correlated.
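The equivalence between the two criteria can be written out explicitly. Here $y_k$ denotes the absolute position of the $k$-th matched feature word in the sequence of the text to be detected and $x_k$ its absolute position in the sequence of the existing text; the symbols $x_k$, $y_k$, and $b$ are notation introduced here for illustration only.

```latex
y_k = x_k + b
\quad \Longleftrightarrow \quad
y_k - x_k = b
\qquad \text{for every matched feature word } k \text{ in the group}
```

A slope-1 line with intercept $b$ is thus the same condition as a constant position difference $b$, which is why the difference-counting method and the line-fitting method described later look for the same structure.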

Step 105: determine the repeat region of the text to be detected and the existing text from the absolute positions of the group of matched feature words in the two feature word sequences.

If a group of matched feature words exists, the text covered by that group occurs in both texts and is duplicated. The two texts are then found to share a repeated part, and the repeat region in the text to be detected and in the existing text is output.

If the judgment is negative, no group of matched feature words exists, and a prompt is given that the text to be detected and the existing text contain no repeated passage.

In this embodiment, the judgment is negative when no group of matched feature words has absolute positions with a linear relationship in the two feature word sequences, or when there is no matched feature word at all. In that case the two texts contain no repeated passage; either no operation is performed, or a prompt is given that the text to be detected and the existing text contain no repeated passage.

There are several ways to judge whether a group of matched feature words exists. The first method for detecting duplicated text is described in detail below through an embodiment in which the text to be detected is segmented with the forward maximum matching algorithm and its stop words are filtered out according to a preset stop word list.

Referring to Fig. 2, the detailed flow of the first method for detecting duplicated text in this embodiment is as follows:

Step 201: perform forward maximum matching word segmentation on the text to be detected and the existing text.

Step 202: filter out the stop words in the text to be detected and the existing text according to a preset stop word list, obtaining the feature words of each text.

Step 203: obtain the ordered feature word sequence of each text from the order in which the feature words occur in the text to be detected and in the existing text.

Step 204: take each feature word together with its absolute positions in the feature word sequence as one feature of the text to be detected or of the existing text.

Step 205: match each feature word of the text to be detected against each feature word of the existing text; if a match succeeds, continue with step 206, otherwise continue with step 213.

Step 206: determine the absolute position of each matched feature word in the feature word sequence of the text to be detected and its absolute position in the feature word sequence of the existing text.

Step 207: calculate the position difference between the absolute position of each matched feature word in the text to be detected and its absolute position in the existing text.

For example, suppose the feature word "news" occurs at positions 5 and 41 of the feature word sequence of the text to be detected and at positions 9, 37, and 45 of the feature word sequence of the existing text. Its absolute positions are then 5 and 41 in the text to be detected and 9, 37, and 45 in the existing text, and its position differences are -4, -32, -40, 32, 4, and -4, that is, 5 distinct position differences.

Step 208: count the number of occurrences of each position difference.

In the example above, the position differences sorted by value are -40, -32, -4, 4, and 32, and their occurrence counts are 1, 1, 2, 1, and 1 respectively.
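Steps 207 and 208 can be reproduced for the "news" example with a short sketch. The function name `position_differences` is hypothetical, and taking every pairing of a position in the text to be detected with a position in the existing text is inferred from the six differences listed above.

```python
from collections import Counter
from itertools import product


def position_differences(test_positions, existing_positions):
    """Difference (position in text to be detected - position in existing text)
    for every pairing of occurrences of one matched feature word."""
    return [t - e for t, e in product(test_positions, existing_positions)]


diffs = position_differences([5, 41], [9, 37, 45])
counts = Counter(diffs)

print(diffs)
print(sorted(counts.items()))
```

In a full run the counter would be accumulated over all matched feature words, not just one word.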

Step 209: calculate the first ratio of the occurrence count of each position difference to the sum of the occurrence counts of all position differences.

In the embodiment of the invention, whether the text is duplicated is detected from the position differences between the absolute positions of the feature words in the two feature word sequences and from the occurrence counts of those position differences. For example, suppose there are $n$ distinct position differences, sorted in ascending order as $a_1, a_2, a_3, \dots, a_n$, with occurrence counts $b_1, b_2, b_3, \dots, b_n$ respectively. Then $P_i = b_i / (b_1 + b_2 + b_3 + \dots + b_n)$, where $i = 1, 2, 3, \dots, n$, is used as the measure of text similarity.

Step 210: judge whether there is a first ratio greater than the first preset threshold; if there is, continue with step 211, otherwise continue with step 213.

For example, let the first preset threshold be 0.2 and compare each $P_i$ with 0.2. If some $P_i > 0.2$, the two texts contain a repeated passage. Suppose $P_{12} > 0.2$ and $P_{29} > 0.2$. Then the occurrence count $b_{12}$ of position difference $a_{12}$ and the occurrence count $b_{29}$ of position difference $a_{29}$ each account for a large share of the total, which indicates that the text to be detected and the existing text share two identical parts whose positions differ by $a_{12}$ and $a_{29}$ respectively. In other words, the text region covered by the feature words with position difference $a_{12}$, and likewise the text region covered by the feature words with position difference $a_{29}$, is very likely identical in the two texts. In this embodiment the first preset threshold can be chosen according to the lengths of the text to be detected and the existing text; it is not a fixed value and depends on the situation.
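The first ratio and the threshold test of steps 209 and 210 can be sketched as below, reusing the occurrence counts of the "news" example. The function names `first_ratios` and `dominant_differences` are hypothetical; the threshold 0.2 is the example value given above.

```python
def first_ratios(counts):
    """P_i = b_i / (b_1 + ... + b_n) for every distinct position difference."""
    total = sum(counts.values())
    return {diff: n / total for diff, n in counts.items()}


def dominant_differences(counts, threshold):
    """Position differences whose first ratio exceeds the threshold, ascending."""
    return sorted(d for d, p in first_ratios(counts).items() if p > threshold)


# Occurrence counts from the "news" example: difference -> count.
counts = {-40: 1, -32: 1, -4: 2, 4: 1, 32: 1}

ratios = first_ratios(counts)
dominant = dominant_differences(counts, threshold=0.2)
print(ratios)
print(dominant)
```

With a single word the counts are too small to be meaningful; the example only shows the arithmetic. Here the difference -4 has ratio 2/6 and is the only one above 0.2.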

Step 211: classify the feature words corresponding to each first ratio greater than the first preset threshold as a group of matched feature words, and mark those feature words.

Step 212: the text regions covered by the absolute positions of the marked feature words in the feature word sequences of the text to be detected and the existing text are the repeat regions of the two texts.

Take the position difference $a_{12}$ alone as an example. Suppose its occurrence count $b_{12}$ is 100 and its first ratio exceeds the first preset threshold, and suppose the feature words corresponding to $a_{12}$ have absolute positions 1, 2, 3, 4, 8, 9, 10, ..., 150 in the text to be detected (in ascending order) and 101, 102, 103, 104, 108, 109, 110, ..., 250 in the existing text. Then the text region from position 1 to position 150 in the text to be detected and the text region from position 101 to position 250 in the existing text are repeat regions.
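Step 212 can be sketched as taking the smallest and largest marked position in each text. The function name `repeat_region` and the representation of the marked feature words as (position in text to be detected, position in existing text) pairs are assumptions for illustration; the sample pairs are a shortened version of the example above.

```python
def repeat_region(pairs):
    """Return ((start, end) in the text to be detected, (start, end) in the existing text)
    spanned by the marked position pairs."""
    if not pairs:
        raise ValueError("no marked feature words")
    test_positions = [t for t, _ in pairs]
    existing_positions = [e for _, e in pairs]
    return (
        (min(test_positions), max(test_positions)),
        (min(existing_positions), max(existing_positions)),
    )


# Shortened version of the example: every pair has position difference -100.
marked = [(t, t + 100) for t in (1, 2, 3, 4, 8, 9, 10, 150)]

test_region, existing_region = repeat_region(marked)
print(test_region, existing_region)
```

The regions are expressed in feature word positions; mapping them back to character offsets in the original texts would need the offsets recorded during segmentation, which this sketch does not cover.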

Step 213: give a prompt that the text to be detected and the existing text contain no repeated passage.

Preferably, after the above detection, the position differences of several neighbouring feature words can further be treated as one group, and the ratio of the occurrence counts of all position differences in the group to the total occurrence count can be used as the measure of text similarity. Word insertions and deletions can be detected in this way. The optimized flow is described below through an embodiment in which the position differences of three neighbouring feature words form one group and the ratio of the occurrence counts of all position differences in the group to the total occurrence count serves as the measure of text similarity.

Referring to Fig. 3, the main method flow for optimizing duplicated text detection after the occurrence counts of the position differences have been tallied is as follows in this embodiment:

Step 301: Count the occurrences of each feature-word position difference.

Step 302: Compute the first ratio, that is, the ratio of the occurrence count of each position difference to the sum of the occurrence counts of all position differences.

Step 303: Determine whether any first ratio is greater than the first preset threshold. If such a first ratio exists, proceed to step 304; otherwise proceed to step 305.

Step 304: Classify the feature words corresponding to the first ratio that is greater than the first preset threshold as one group of matched feature words, and mark those feature words.

Step 305: Treat the position differences of three adjacent valid feature words as one group, and compute the second ratio, that is, the ratio of the sum of the occurrence counts of the position differences in each group to the sum of the occurrence counts of all position differences. For example, suppose there are n distinct position differences which, sorted in ascending order, are $a_1, a_2, a_3, \ldots, a_n$, with occurrence counts $b_1, b_2, b_3, \ldots, b_n$ respectively. Then $P_i = \dfrac{b_i + b_{i+1} + b_{i+2}}{b_1 + b_2 + b_3 + \cdots + b_n}$ is used as the measure of text similarity, and each $P_i$ is computed for $i = 1, 2, 3, \ldots, n-2$.

Step 306: Determine whether any second ratio is greater than the second preset threshold. If so, proceed to step 307; otherwise proceed to step 309.

Step 307: Classify the feature words corresponding to the second ratio that is greater than the second preset threshold as one group of matched feature words, and mark those feature words.

Step 308: The text region covered by the absolute positions of the marked feature words in the feature word sequences of the text to be detected and the existing text is the duplicated region of the text to be detected and the existing text.

Step 309: Report that the text to be detected and the existing text contain no duplicated passage.
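The flow of steps 301 to 309 can be sketched in Python as follows. This is only an illustrative sketch and not the patent's implementation: the function name `detect_by_offsets`, the threshold defaults `t1` and `t2`, and the input format (a list of `(position in text to be detected, position in existing text)` pairs for matched feature words) are all assumptions made for the example.

```python
from collections import Counter


def detect_by_offsets(pairs, t1=0.5, t2=0.5, window=3):
    """Return ((start, end) in text to be detected, (start, end) in existing text),
    or None when no duplicated region is found.

    pairs: list of (pos_in_detected, pos_in_existing) for matched feature words.
    t1, t2: hypothetical first and second preset thresholds.
    """
    if not pairs:
        return None

    # Step 301: count how often each position difference occurs.
    counts = Counter(t - e for t, e in pairs)
    total = sum(counts.values())

    # Steps 302-304: first ratio of the most frequent position difference.
    best_diff, best_count = max(counts.items(), key=lambda kv: kv[1])
    if best_count / total > t1:
        chosen = {best_diff}
    else:
        # Steps 305-307: second ratio over groups of `window` adjacent
        # position differences, taken in ascending order (P_i in the text).
        # With fewer than `window` distinct differences no group exists.
        diffs = sorted(counts)
        chosen, best_ratio = set(), 0.0
        for i in range(len(diffs) - window + 1):
            group = diffs[i:i + window]
            ratio = sum(counts[d] for d in group) / total
            if ratio > t2 and ratio > best_ratio:
                chosen, best_ratio = set(group), ratio
        if not chosen:
            return None  # Step 309: no duplicated passage.

    # Step 308: the region covered by the marked feature words.
    marked = [(t, e) for t, e in pairs if t - e in chosen]
    ts = [t for t, _ in marked]
    es = [e for _, e in marked]
    return (min(ts), max(ts)), (min(es), max(es))
```

Note that with very few matched pairs almost any group of three differences exceeds a moderate threshold, so in practice the thresholds would have to be tuned to the data; the values here are placeholders.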

The following embodiment describes how line fitting is used to determine whether a group of successfully matched feature words exists.

Referring to Fig. 4, the detailed flow of the second method for detecting duplicated text in this embodiment is as follows:

Step 401: Perform forward maximum matching word segmentation on the text to be detected and the existing text.

Step 402: Using a preset stop-word list, filter the stop words out of the text to be detected and the existing text to obtain the feature words of the two texts.

Step 403: Arrange the feature words in the order in which they appear in the text to be detected and in the existing text to obtain ordered feature word sequences.

Step 404: Take each feature word together with its absolute position in the feature word sequence as one feature of the text to be detected or of the existing text.

Step 405: Match each feature word of the text to be detected against each feature word of the existing text. If a match succeeds, proceed to step 406; otherwise proceed to step 411.

Step 406: From the absolute positions of the matched feature words in the text to be detected and in the existing text, determine the coordinates of the matched feature words in a rectangular coordinate system, where the rectangular coordinate system is constructed from the absolute position of a feature word in the text to be detected and its absolute position in the existing text.

In this embodiment, the absolute position of a matched feature word in the feature word sequence of the text to be detected is the ordinate, its absolute position in the feature word sequence of the existing text is the abscissa, and each pair of matched feature words in the two texts corresponds to a point in the coordinate system. For example, if the feature word "news" occurs at positions 5 and 41 in the feature word sequence of the text to be detected, and at positions 9, 37 and 45 in the feature word sequence of the existing text, then the points (9, 5), (9, 41), (37, 5), (37, 41), (45, 5) and (45, 41) in the coordinate system each correspond to one pair of valid features. If the two texts share a long duplicated passage, it appears in the coordinate system as a straight line y = x + b with slope 1. See Fig. 5, in which each circle represents one pair of matched feature words.
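The "news" example can be reproduced with a few lines of Python. The sketch below is illustrative only; the function name `candidate_points` and the dictionary input format (feature word mapped to its list of absolute positions) are assumptions and do not come from the patent.

```python
from itertools import product


def candidate_points(detected_positions, existing_positions):
    """Build (x, y) points for every pair of matched feature words.

    x is the absolute position in the existing text's feature word sequence,
    y is the absolute position in the sequence of the text to be detected.
    Both arguments map a feature word to the list of positions where it occurs.
    """
    points = []
    for word, ys in detected_positions.items():
        xs = existing_positions.get(word, [])
        # Every occurrence in one text pairs with every occurrence in the other.
        points.extend(product(xs, ys))
    return points


points = candidate_points({"news": [5, 41]}, {"news": [9, 37, 45]})
print(points)
# [(9, 5), (9, 41), (37, 5), (37, 41), (45, 5), (45, 41)]
```

Only the points that belong to a shared passage line up along y = x + b; the remaining cross-pairings are scattered and act as noise for the later fitting step.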

Step 407: Perform line fitting on the coordinates of the matched feature words.

Step 408: Determine whether there exists a group of coordinates that can be fitted to a straight line whose slope is approximately 1. If such a group of coordinates exists, proceed to step 409; otherwise proceed to step 411.

In this embodiment, line fitting is prior art and is not described in detail here.

Step 409: Classify the feature words corresponding to the group of coordinates fitted to a straight line with a slope of approximately 1 as one group of matched feature words, and mark that group of feature words.

Step 410: The text region covered by the absolute positions of the marked feature words in the feature word sequences of the text to be detected and the existing text is the duplicated region of the text to be detected and the existing text.

Step 411: Report that the text to be detected and the existing text contain no duplicated passage.
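Since the text treats line fitting as prior art and does not name a specific technique, the following Python sketch shows just one possible way to carry out steps 407 to 409. Everything in it is an assumption made for illustration: the function names, the tolerance parameters `tol`, `slope_tol` and `min_points`, and the choice to seed candidate lines through each point before an ordinary least-squares fit.

```python
def fit_line(points):
    """Ordinary least-squares fit y = k*x + b. Returns (k, b), or None
    when all x values coincide and the slope is undefined."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    k = sxy / sxx
    return k, mean_y - k * mean_x


def find_slope_one_group(points, tol=1.5, slope_tol=0.1, min_points=5):
    """Return the largest group of points lying near a line of slope ~1,
    or None if no such group exists (hypothetical parameters)."""
    best = []
    for x0, y0 in points:
        b = y0 - x0  # candidate line y = x + b through this point
        inliers = [(x, y) for x, y in points if abs(y - x - b) <= tol]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < min_points:
        return None
    fit = fit_line(best)  # step 407 on the candidate group
    if fit is None or abs(fit[0] - 1.0) > slope_tol:
        return None       # step 408 fails: slope is not approximately 1
    return best           # step 409: the group of matched feature words
```

The candidate search is quadratic in the number of points, which is acceptable for a sketch; a production system would more likely use a robust estimator suited to its data volume.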

The above describes the process for detecting duplicated text. This process can be implemented by a device, whose internal structure and functions are described below.

Referring to Fig. 6, in this embodiment a device for detecting duplicated text comprises: an extraction module 601, a matching module 602, a position acquisition module 603, a judging module 604 and a duplicated text determination module 605.

The extraction module 601 is configured to obtain the feature words and feature word sequences of the text to be detected and the existing text;

The matching module 602 is configured to match each feature word in the text to be detected against each feature word in the existing text;

The position acquisition module 603 is configured to obtain, when the matching module successfully matches feature words, the absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text;

The judging module 604 is configured to determine whether there exists a group of matched feature words for which the absolute positions of all feature words in the group in the feature word sequence of the text to be detected are linearly related to their absolute positions in the feature word sequence of the existing text;

The duplicated text determination module 605 is configured to determine, when the judging module determines that a group of matched feature words exists, the duplicated region of the text to be detected and the existing text from the absolute positions of that group of matched feature words in the feature word sequences of the two texts.

Preferably, the extraction module 601 comprises a word segmentation unit 701, a filtering unit 702 and a sequencing unit 703, as shown in Fig. 7.

The word segmentation unit 701 is configured to perform word segmentation on the text to be detected and the existing text;

The filtering unit 702 is configured to filter the text to be detected and the existing text to obtain the feature words of the two texts;

The sequencing unit 703 is configured to obtain, according to the order in which the feature words appear in the text to be detected and in the existing text, the feature word sequence of each text.
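The three units of the extraction module can be illustrated with a short Python sketch. It uses forward maximum matching, which the method embodiment names for segmentation; the function names, the toy dictionary, the stop-word list and the unspaced ASCII sample text are all hypothetical stand-ins chosen for readability.

```python
def forward_max_match(text, dictionary):
    """Segment text greedily from left to right, always taking the longest
    dictionary word that starts at the current position. A single character
    is emitted when no dictionary word matches. Assumes a non-empty dictionary."""
    max_len = max(len(word) for word in dictionary)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def feature_sequence(text, dictionary, stop_words):
    """Segment, drop stop words, and return (absolute position, feature word)
    pairs, where the position is the index in the filtered sequence."""
    kept = [w for w in forward_max_match(text, dictionary) if w not in stop_words]
    return list(enumerate(kept))


# Toy data: an unspaced string stands in for text without word delimiters.
dictionary = {"the", "news", "is", "today", "to"}
stop_words = {"the", "is"}
print(feature_sequence("thenewsistoday", dictionary, stop_words))
# [(0, 'news'), (1, 'today')]
```

Positions are counted after stop-word removal, so they index the feature word sequence rather than the raw text.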

Preferably, each feature word together with its absolute position in the feature word sequence of the text to be detected is taken as one feature of the text to be detected; the absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text are then obtained.

Preferably, the device further comprises a data computation module 606, a statistics module 607 and a data processing module 608, as shown in Fig. 8.

The data computation module 606 is configured to compute the position difference between the absolute position of a matched feature word in the text to be detected and its absolute position in the existing text;

The statistics module 607 is configured to count the occurrences of each position difference;

The data processing module 608 is configured to compute the first ratio, that is, the ratio of the occurrence count of each position difference to the sum of the occurrence counts of all position differences;

The judging module 604 is configured to determine whether a first ratio greater than the first preset threshold exists. When it determines that such a first ratio exists, the feature words corresponding to the first ratio greater than the first preset threshold are classified as one group of matched feature words, and the judging module 604 determines that a group of matched feature words exists.

Preferably, the data processing module 608 is further configured to treat the position differences of a plurality of adjacent matched feature words as one group, and to compute the second ratio, that is, the ratio of the sum of the occurrence counts of the position differences in a group to the sum of the occurrence counts of all position differences;

The judging module 604 is further configured to determine whether a second ratio greater than the second preset threshold exists. When it determines that such a second ratio exists, the feature words corresponding to the second ratio greater than the second preset threshold are classified as one group of matched feature words, and the judging module determines that a group of matched feature words exists.

Preferably, the device further comprises a coordinate extraction module 609 and a line fitting module 610, as shown in Fig. 9.

The coordinate extraction module 609 is configured to determine, from the absolute positions of the matched feature words in the text to be detected and in the existing text, the coordinates of the matched feature words in a rectangular coordinate system, where the rectangular coordinate system is constructed from the absolute position of a feature word in the text to be detected and its absolute position in the existing text;

The line fitting module 610 is configured to perform line fitting on the coordinates of the matched feature words;

The judging module 604 is configured to determine whether there exists a group of coordinates that can be fitted to a straight line whose slope is approximately 1. If it determines that such a group of coordinates exists, the feature words corresponding to the group of coordinates fitted to a straight line with a slope of approximately 1 are classified as one group of matched feature words, and the judging module 604 determines that a group of matched feature words exists.

In the embodiments of the invention, each feature word together with its absolute position in the feature word sequence of the text to be detected is taken as one feature of the text to be detected. The absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text are obtained, and whether the two texts contain duplicated text is determined either by computing the position differences between the absolute positions or by line fitting. This method can effectively detect reprinted, merged and excerpted text, and can also detect cases in which words have been inserted or deleted.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system) and the computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A method for detecting duplicated text, characterized by comprising the following steps:

obtaining the feature words and feature word sequences of a text to be detected and an existing text;

matching each feature word in the text to be detected against each feature word in the existing text;

when feature words are successfully matched, obtaining the absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text;

determining whether there exists a group of matched feature words for which the absolute positions of all feature words in the group in the feature word sequences of the text to be detected and the existing text are linearly related;

if a group of matched feature words exists, determining the duplicated region of the text to be detected and the existing text from the absolute positions of that group of matched feature words in the feature word sequences of the text to be detected and the existing text.

2. The method of claim 1, characterized in that the step of obtaining the feature words of the text to be detected and the feature word sequence of the text to be detected comprises:

performing word segmentation on the text to be detected and the existing text;

filtering the text to be detected and the existing text to obtain the feature words of the text to be detected and the existing text;

obtaining, according to the order in which the feature words appear in the text to be detected and in the existing text, the feature word sequence of each text.

3. The method of claim 1, characterized in that the step of determining whether a group of matched feature words exists comprises:

computing the position difference between the absolute position of each matched feature word in the text to be detected and its absolute position in the existing text;

counting the occurrences of each position difference;

computing a first ratio of the occurrence count of each position difference to the sum of the occurrence counts of all position differences;

determining whether a first ratio greater than a first preset threshold exists;

if a first ratio greater than the first preset threshold exists, classifying the feature words corresponding to the first ratio greater than the first preset threshold as one group of matched feature words.

4. The method of claim 3, characterized in that the step of determining whether a group of successfully matched feature words exists further comprises:

treating the position differences of a plurality of adjacent matched feature words as one group, and computing a second ratio of the sum of the occurrence counts of the position differences in a group to the sum of the occurrence counts of all position differences;

determining whether a second ratio greater than a second preset threshold exists;

if a second ratio greater than the second preset threshold exists, classifying the feature words corresponding to the second ratio greater than the second preset threshold as one group of matched feature words.

5. The method of claim 1, characterized in that the step of determining whether a group of successfully matched feature words exists comprises:

determining, from the absolute positions of the matched feature words in the text to be detected and in the existing text, the coordinates of the matched feature words in a rectangular coordinate system, wherein the rectangular coordinate system is constructed from the absolute position of a feature word in the text to be detected and its absolute position in the existing text;

performing line fitting on the coordinates of the matched feature words;

determining whether there exists a group of coordinates that can be fitted to a straight line whose slope is approximately 1;

if such a group of coordinates exists, classifying the feature words corresponding to the group of coordinates fitted to a straight line with a slope of approximately 1 as one group of matched feature words.

6. A device for detecting duplicated text, characterized by comprising:

a duplicated text determination module, configured to determine, when it is determined that a group of matched feature words exists, the duplicated region of the text to be detected and the existing text from the absolute positions of that group of matched feature words in the feature word sequences of the text to be detected and the existing text.

7. The device of claim 6, characterized in that the extraction module comprises:

a sequencing unit, configured to obtain, according to the order in which the feature words appear in the text to be detected and in the existing text, the feature word sequence of each text.

8. The device of claim 6, characterized by further comprising: a data computation module, a statistics module and a data processing module;

9. The device of claim 8, characterized in that the data processing module is further configured to treat the position differences of a plurality of adjacent matched feature words as one group, and to compute a second ratio of the sum of the occurrence counts of the position differences in a group to the sum of the occurrence counts of all position differences;

10. The device of claim 6, characterized by further comprising: a coordinate extraction module and a line fitting module;

the judging module is configured to determine whether there exists a group of coordinates that can be fitted to a straight line whose slope is approximately 1; if such a group of coordinates exists, the feature words corresponding to the group of coordinates fitted to a straight line with a slope of approximately 1 are classified as one group of matched feature words, and the judging module determines that a group of matched feature words exists.