CN103246640A - Duplicated text detection method and device - Google Patents

Duplicated text detection method and device

Info

Publication number
CN103246640A
Authority
CN
China
Prior art keywords
text
feature word
measured
group
coupling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101443394A
Other languages
Chinese (zh)
Other versions
CN103246640B (en)
Inventor
李鹏
孙熙
陆承恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kuyun Interactive Technology Ltd
Original Assignee
TENFEN Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TENFEN Inc filed Critical TENFEN Inc
Priority to CN201310144339.4A priority Critical patent/CN103246640B/en
Publication of CN103246640A publication Critical patent/CN103246640A/en
Application granted granted Critical
Publication of CN103246640B publication Critical patent/CN103246640B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a duplicated text detection method for detecting, with improved accuracy, whether a text is duplicated. The method includes: acquiring the feature words and feature word sequences of an existing text and of the text to be detected; matching each feature word in the text to be detected against each feature word in the existing text; when matching succeeds, acquiring the absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text; judging whether there exists a group of matched feature words whose absolute positions in the feature word sequence of the text to be detected are linearly related to their absolute positions in the feature word sequence of the existing text; and, if such a group exists, determining the duplicated regions of the text to be detected and the existing text according to the absolute positions of that group of matched feature words in the two feature word sequences. The invention further discloses a device for implementing the method.

Description

A method and device for detecting duplicated text
Technical field
The present invention relates to the field of Internet technology, and in particular to a method for detecting duplicated text.
Background
Media outlets continually produce and publish massive amounts of news, a considerable part of which is duplicated. Duplication takes forms such as reprinting (the text is changed only slightly or not at all), merging (two articles are combined into one) and excerpting (fragments of an article are extracted and published as a separate article). For news portals and the news-reading applications that are now emerging, detecting and filtering this duplicate content is a key task in improving the user experience.
The prior art includes schemes based on word frequency: frequency information, such as which words occur in a text, how many times each word occurs in the article, and in how many articles of the whole collection the word occurs, is used to compute text features, and a measure such as cosine similarity is used to compare the features of two texts. This approach uses word occurrence counts but ignores where keywords appear in the text, so different articles with similar keywords may be misjudged as duplicates (for example, two different news items about the same celebrity). In addition, it cannot reliably detect the merging and excerpting forms of duplication.
Schemes based on word fragments: these schemes treat a text as the set of its contiguous word subsequences. For example, the word fragments of length 2 of the sentence "I love Beijing Tiananmen" are "I love", "love Beijing" and "Beijing Tiananmen". On this basis, the similarity between texts can be measured with indicators such as the degree of overlap between their fragment sets. This approach takes word order into account, but a small rewrite in a news text (for example, a reprinting outlet changing "reported by our newspaper" to "reported by XX media") affects several word fragments at once, so schemes of this type are rather sensitive to words being inserted into or deleted from a document. In addition, this approach also cannot reliably detect the merging and excerpting forms of duplication.
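The sensitivity of word-fragment schemes to small rewrites can be illustrated with a minimal sketch. The function names, the English sample sentences and the choice of Jaccard overlap as the overlap measure are assumptions made for illustration; the patent does not specify them.

```python
def word_shingles(words, k=2):
    """Return the set of contiguous k-word fragments of a word list."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Overlap of two fragment sets (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sentences: only the source attribution is rewritten.
original = "reported by our newspaper on monday".split()
rewritten = "reported by XX media on monday".split()

overlap = jaccard(word_shingles(original), word_shingles(rewritten))
print(overlap)  # 2 shared fragments out of 8 distinct ones
```

Replacing two words out of six changes three of the five fragments, which is why a small edit lowers the measured overlap so sharply.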
Summary of the invention
Embodiments of the invention provide a method and device for detecting text duplication, which detect whether a text is duplicated and improve the accuracy of detection.
A method for detecting text duplication comprises the following steps: obtaining the feature words and feature word sequences of a text to be detected and of an existing text; matching each feature word in the text to be detected against each feature word in the existing text; when feature words match, obtaining the absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text; judging whether there exists a group of matched feature words whose absolute positions in the feature word sequence of the text to be detected are linearly related to their absolute positions in the feature word sequence of the existing text; and, if such a group exists, determining the duplicated regions of the text to be detected and the existing text according to the absolute positions of that group of matched feature words in the two feature word sequences. By using the absolute positions of feature words as a parameter, embodiments of the invention can effectively detect duplication in forms such as reprinting, merging and excerpting.
Preferably, the step of obtaining the feature words and feature word sequences comprises: performing word segmentation on the text to be detected and the existing text; filtering the text to be detected and the existing text to obtain their feature words; and obtaining the feature word sequence of each text according to the order in which the feature words appear in it. Extracting feature words allows duplication to be detected accurately and efficiently and prevents interference from meaningless words.
Preferably, the step of judging whether a group of matched feature words exists comprises: calculating the position difference between the absolute position of each matched feature word in the text to be detected and its absolute position in the existing text; counting how many times each position difference value occurs; calculating, for each position difference, a first ratio of its occurrence count to the sum of the occurrence counts of all position differences; judging whether any first ratio is greater than a first preset threshold; and, if so, classifying the feature words corresponding to the first ratio that is greater than the first preset threshold as a group of matched feature words. By calculating these position differences and counting their occurrences, duplication in forms such as reprinting, merging and excerpting can be detected. Because the computation consists mainly of simple additive arithmetic, a software implementation requires little computation and is efficient.
Preferably, the step of judging whether a group of matched feature words exists further comprises: taking the position differences of several adjacent matched feature words as one group, and calculating a second ratio of the sum of the occurrence counts of the position differences in the group to the sum of the occurrence counts of all position differences; judging whether any second ratio is greater than a second preset threshold; and, if so, classifying the feature words corresponding to the second ratio that is greater than the second preset threshold as a group of matched feature words. This embodiment can detect texts that have been altered only by inserting or deleting words.
Preferably, the step of judging whether a group of matched feature words exists comprises: determining the coordinates of the matched feature words in a rectangular coordinate system according to their absolute positions in the text to be detected and in the existing text, where the axes of the coordinate system are the absolute position in the text to be detected and the absolute position in the existing text; performing line fitting on the coordinates of the matched feature words; judging whether there exists a group of coordinates that can be fitted to a straight line with a slope of approximately 1; and, if such a group of coordinates exists, classifying the feature words corresponding to that group of coordinates as a group of matched feature words. With line fitting, this embodiment can reveal the duplicated parts of two texts more intuitively.
A device for detecting text duplication comprises:
an extraction module, configured to obtain the feature words and feature word sequences of a text to be detected and of an existing text;
a matching module, configured to match each feature word in the text to be detected against each feature word in the existing text;
a position acquisition module, configured to obtain, when the matching module matches feature words successfully, the absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text;
a judgment module, configured to judge whether there exists a group of matched feature words whose absolute positions in the feature word sequence of the text to be detected are linearly related to their absolute positions in the feature word sequence of the existing text;
a duplicated text determination module, configured to determine, when the judgment module finds that a group of matched feature words exists, the duplicated regions of the text to be detected and the existing text according to the absolute positions of that group of matched feature words in the two feature word sequences.
Preferably, the extraction module comprises:
a word segmentation unit, configured to perform word segmentation on the text to be detected and the existing text;
a filtering unit, configured to filter the text to be detected and the existing text to obtain their feature words;
a sequencing unit, configured to obtain the feature word sequence of each text according to the order in which the feature words appear in the text to be detected and in the existing text.
Preferably, the device further comprises a data calculation module, a statistics module and a data processing module.
The data calculation module is configured to calculate the position difference between the absolute position of each matched feature word in the text to be detected and its absolute position in the existing text.
The statistics module is configured to count how many times each position difference value occurs.
The data processing module is configured to calculate, for each position difference, a first ratio of its occurrence count to the sum of the occurrence counts of all position differences.
The judgment module is configured to judge whether any first ratio is greater than a first preset threshold; when such a first ratio exists, the feature words corresponding to it are classified as a group of matched feature words, and the judgment module determines that a group of matched feature words exists.
Preferably, the data processing module is further configured to take the position differences of several adjacent matched feature words as one group and to calculate a second ratio of the sum of the occurrence counts of the position differences in the group to the sum of the occurrence counts of all position differences.
The judgment module is further configured to judge whether any second ratio is greater than a second preset threshold; when such a second ratio exists, the feature words corresponding to it are classified as a group of matched feature words, and the judgment module determines that a group of matched feature words exists.
Preferably, the device further comprises a coordinate extraction module and a line fitting module.
The coordinate extraction module is configured to determine the coordinates of the matched feature words in a rectangular coordinate system according to their absolute positions in the text to be detected and in the existing text, where the axes of the coordinate system are the absolute position in the text to be detected and the absolute position in the existing text.
The line fitting module is configured to perform line fitting on the coordinates of the matched feature words.
The judgment module is configured to judge whether there exists a group of coordinates that can be fitted to a straight line with a slope of approximately 1; if such a group of coordinates exists, the feature words corresponding to it are classified as a group of matched feature words, and the judgment module determines that a group of matched feature words exists.
Other features and advantages of the invention will be set forth in the description that follows and will in part become apparent from the specification or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the written specification, the claims and the drawings.
The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.
Description of drawings
The drawings provide a further understanding of the invention and form part of the specification. Together with the embodiments of the invention they serve to explain the invention and do not limit it. In the drawings:
Fig. 1 is a flowchart of the main method for detecting duplicated text in an embodiment of the invention;
Fig. 2 is a detailed flowchart of the first method for detecting text duplication in an embodiment of the invention;
Fig. 3 is a flowchart of the main method for refining text duplication detection after the occurrence counts of the position differences have been tallied in an embodiment of the invention;
Fig. 4 is a detailed flowchart of the second method for detecting text duplication in an embodiment of the invention;
Fig. 5 is a coordinate plot of the matched feature words in an embodiment of the invention;
Fig. 6 is a diagram of the main structure of a device for detecting text duplication in an embodiment of the invention;
Fig. 7 is a detailed structural diagram of the extraction module in an embodiment of the invention;
Fig. 8 is a first detailed structural diagram of a device for detecting text duplication in an embodiment of the invention;
Fig. 9 is a second detailed structural diagram of a device for detecting text duplication in an embodiment of the invention.
Embodiments
Preferred embodiments of the invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here serve only to describe and explain the invention and do not limit it.
In embodiments of the invention, meaningless words are first removed from the text to be detected, and the remaining words serve as the feature words used for duplication detection. Each feature word together with its absolute positions in the text is then treated as one feature of the text, and the position differences between the feature words of the two texts serve as the key parameter for measuring text similarity.
Referring to Fig. 1, the main flow of the method for detecting duplicated text in this embodiment is as follows:
Step 101: Obtain the feature words and feature word sequences of the text to be detected and the existing text.
In embodiments of the invention, the feature words of the text to be detected and the existing text are extracted first. Words that carry actual meaning can serve as feature words, such as nouns, verbs, technical terms, place names and personal names, for example "automobile", "blood cell" and "Beijing"; prepositions, conjunctions and the like, such as "because", cannot. Extraction consists of two main steps: performing word segmentation on the text to be detected and the existing text, and then filtering them to obtain their feature words. Many word segmentation methods exist; a commonly used one is the maximum matching algorithm (Maximum Matching), which has two main variants: forward maximum matching and reverse maximum matching. Matching against an existing vocabulary with the maximum matching algorithm can effectively reduce the number of feature words and thereby improve detection efficiency. After segmentation, stop words are filtered out of the text according to a preset stop word list, which saves storage space and improves detection efficiency. Once the feature words have been extracted, an ordered feature word sequence is obtained for each text according to the order in which the feature words appear in it.
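The segmentation and filtering described in Step 101 might be sketched as follows. This is a simplified illustration, not the patented implementation: the function names, the tiny vocabulary, the stop word list and the `max_len` limit are all hypothetical, and unknown characters are simply emitted as single-character words.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    vocabulary entry that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


def feature_word_sequence(text, vocabulary, stop_words):
    """Segment the text, then drop stop words while keeping word order."""
    return [w for w in forward_max_match(text, vocabulary)
            if w not in stop_words]


# Hypothetical vocabulary and stop word list, for illustration only.
vocabulary = {"北京", "天安门"}   # "Beijing", "Tiananmen"
stop_words = {"我", "的"}         # "I", possessive particle

print(forward_max_match("我爱北京天安门", vocabulary))
print(feature_word_sequence("我爱北京天安门", vocabulary, stop_words))
```

A reverse maximum matching variant would scan from the end of the text instead; the filtering step is unchanged.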
Step 102: Match each feature word in the text to be detected against each feature word in the existing text; if the match succeeds, continue to Step 103.
Step 103: Obtain the absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text.
In embodiments of the invention, the feature word sequence is treated as a one-dimensional sequence, and the absolute position of a feature word in that sequence may comprise several position labels: because the same feature word may occur more than once, each feature word corresponds to one or more position labels in the sequence. For example, if the feature word "news" occurs at positions 5 and 41 of the feature word sequence of the text to be detected, its absolute positions are 5 and 41. A feature word together with its absolute positions can then be treated as one feature of the text to be detected; in this example, {"news", 5, 41} is one such feature. Only feature words that occur in both the text to be detected and the existing text can be used for duplication detection, so each feature word in the text to be detected is matched against each feature word in the existing text, and the absolute positions of the matched feature words in both feature word sequences are recorded. For example, suppose the feature word "news" occurs at positions 5 and 41 of the feature word sequence of the text to be detected and at positions 9, 37 and 45 of the feature word sequence of the existing text. Then "news" is a matched feature word whose absolute positions are 5 and 41 in the text to be detected and 9, 37 and 45 in the existing text.
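One possible way to record absolute positions and find matched feature words is sketched below. The function names and the dictionary layout are assumptions, as is the use of 1-based positions (chosen to agree with the "position 5" style of the example).

```python
from collections import defaultdict


def position_index(feature_words):
    """Map each feature word to the list of its 1-based absolute positions."""
    index = defaultdict(list)
    for pos, word in enumerate(feature_words, start=1):
        index[word].append(pos)
    return dict(index)


def matched_positions(test_index, existing_index):
    """Keep only words present in both texts, pairing their position lists
    as (positions in text to be detected, positions in existing text)."""
    return {word: (test_index[word], existing_index[word])
            for word in test_index if word in existing_index}


# Position lists taken from the "news" example; the other words are made up.
test_index = {"news": [5, 41], "automobile": [7]}
existing_index = {"news": [9, 37, 45], "Beijing": [3]}

print(matched_positions(test_index, existing_index))
```

Building a dictionary per text avoids comparing every word pair directly: a word matches exactly when it is a key in both indexes.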
Step 104: Judge whether there exists a group of matched feature words whose absolute positions in the feature word sequence of the text to be detected are linearly related to their absolute positions in the feature word sequence of the existing text; if so, continue to Step 105.
In embodiments of the invention, if a paragraph in the text to be detected is identical to a paragraph in the existing text, the relationship between the absolute positions of that paragraph's feature words in the feature word sequence of the text to be detected and their absolute positions in the feature word sequence of the existing text can be represented by a straight line with slope 1. Equivalently, the position difference between the two absolute positions is the same for every feature word in the paragraph. Whether a text is duplicated can therefore be detected by checking whether the absolute positions of the feature words in the two feature word sequences are linearly related.
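The equivalence between "a line of slope 1" and "a constant position difference" can be checked on a small example. The sketch below uses an ordinary least-squares fit, which is an assumption; the patent does not say which fitting method is used, and the sample coordinates are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept.

    points: (x, y) pairs with at least two distinct x values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical matched feature words: x is the absolute position in the
# text to be detected, y the absolute position in the existing text.
points = [(1, 101), (2, 102), (3, 103), (4, 104), (8, 108)]

slope, intercept = fit_line(points)
constant_difference = len({x - y for x, y in points}) == 1
print(round(slope, 6), round(intercept, 6), constant_difference)
```

Here every point has the same position difference, so the fitted slope is 1 (up to floating-point rounding) and the intercept is the offset between the two duplicated regions.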
Step 105: Determine the duplicated regions of the text to be detected and the existing text according to the absolute positions of the group of matched feature words in the feature word sequences of the two texts.
If a group of matched feature words exists, the text covered by that group of feature words is present in both texts and is duplicated; that is, the two texts have been found to share a duplicated part, and the duplicated regions in the text to be detected and in the existing text are output.
If the judgment is negative, no group of matched feature words exists, and a notice is given that the text to be detected and the existing text contain no duplicated paragraph.
In this embodiment, if the judgment finds no group of matched feature words whose absolute positions in the two feature word sequences are linearly related, or finds no matched feature words at all, the two texts contain no duplicated paragraph. In that case either no action is taken or a notice is given that the text to be detected and the existing text contain no duplicated paragraph.
There are several ways to judge whether a group of matched feature words exists. The following embodiment details a first method for detecting text duplication, in which the texts are segmented by forward maximum matching and stop words are filtered out according to a preset stop word list.
Referring to Fig. 2, the detailed flow of the first method for detecting text duplication in this embodiment is as follows:
Step 201: Perform forward maximum matching word segmentation on the text to be detected and the existing text.
Step 202: Filter out the stop words in the text to be detected and the existing text according to a preset stop word list, obtaining the feature words of each text.
Step 203: Obtain the ordered feature word sequence of each text according to the order in which the feature words appear in the text to be detected and in the existing text.
Step 204: Take each feature word together with its absolute positions in the feature word sequence as one feature of the text to be detected or of the existing text.
Step 205: Match each feature word of the text to be detected against each feature word of the existing text; if the match succeeds, continue to Step 206, otherwise continue to Step 213.
Step 206: Determine the absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text.
Step 207: Calculate the position differences between the absolute positions of the matched feature words in the text to be detected and their absolute positions in the existing text.
For example, if the feature word "news" occurs at positions 5 and 41 of the feature word sequence of the text to be detected and at positions 9, 37 and 45 of the feature word sequence of the existing text, then its absolute positions are 5 and 41 in the text to be detected and 9, 37 and 45 in the existing text. Its position differences are -4, -32, -40, 32, 4 and -4, giving 5 distinct position difference values.
Step 208: Count how many times each position difference value occurs.
In the example above, the position differences sorted in ascending order are -40, -32, -4, 4 and 32, and their occurrence counts are 1, 1, 2, 1 and 1 respectively.
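The "news" example from Steps 207 and 208 can be reproduced with a short sketch. The function name is hypothetical; the position lists are the ones given in the example, and every position in the text to be detected is paired with every position in the existing text, as the six listed differences imply.

```python
from collections import Counter
from itertools import product


def position_differences(test_positions, existing_positions):
    """Position in the text to be detected minus position in the existing
    text, for every pairing of the two position lists."""
    return [t - e for t, e in product(test_positions, existing_positions)]


diffs = position_differences([5, 41], [9, 37, 45])
counts = Counter(diffs)

print(diffs)                   # [-4, -32, -40, 32, 4, -4]
print(sorted(counts.items()))  # [(-40, 1), (-32, 1), (-4, 2), (4, 1), (32, 1)]
```

In a full run the differences of all matched feature words would be pooled into one counter before the ratios of Step 209 are computed.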
Step 209: Calculate, for each position difference, the first ratio of its occurrence count to the sum of the occurrence counts of all position differences.
In embodiments of the invention, whether a text is duplicated is detected from the position differences between the absolute positions of the feature words in the feature word sequence of the text to be detected and their absolute positions in the feature word sequence of the existing text, together with the occurrence counts of those position differences. For example, suppose there are $n$ distinct position differences, which in ascending order are $a_1, a_2, a_3, \ldots, a_n$, with occurrence counts $b_1, b_2, b_3, \ldots, b_n$ respectively. Then $P_i = b_i / (b_1 + b_2 + b_3 + \cdots + b_n)$, where $i = 1, 2, 3, \ldots, n$, is used as the measure of text similarity.
Step 210: Judge whether any first ratio is greater than the first preset threshold; if such a first ratio exists, continue to Step 211, otherwise continue to Step 213.
For example, let the first preset threshold be 0.2 and compare each $P_i$ with 0.2. If some $P_i > 0.2$, the two texts contain duplicated paragraphs. Suppose $P_{12} > 0.2$ and $P_{29} > 0.2$. Then the occurrence count $b_{12}$ of position difference $a_{12}$ and the occurrence count $b_{29}$ of position difference $a_{29}$ both account for a large share of the total, which indicates that the text to be detected and the existing text have two parts in common whose positions differ by $a_{12}$ and $a_{29}$ respectively. In other words, the text regions covered by the feature words with position difference $a_{12}$, and likewise those covered by the feature words with position difference $a_{29}$, are very likely identical in the two texts. In this embodiment, the first preset threshold can be chosen according to the lengths of the text to be detected and the existing text; it is not a fixed value and depends on the specific case.
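A minimal sketch of the first ratio and the threshold test of Steps 209 and 210 follows. The function names are hypothetical, the default threshold of 0.2 is the example value from the text, and the input reuses the six differences of the "news" example rather than a realistic document.

```python
from collections import Counter


def first_ratios(diffs):
    """First ratio P_i for every distinct position difference: its
    occurrence count divided by the total number of differences."""
    counts = Counter(diffs)
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


def dominant_differences(diffs, threshold=0.2):
    """Position differences whose first ratio exceeds the threshold."""
    return sorted(d for d, p in first_ratios(diffs).items() if p > threshold)


diffs = [-4, -32, -40, 32, 4, -4]
print(first_ratios(diffs)[-4])      # 2 of 6 differences
print(dominant_differences(diffs))  # only -4 exceeds 0.2
```

With so few differences any repeated value dominates; on real texts the threshold would be tuned to the text lengths, as the paragraph above notes.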
Step 211: Classify the feature words corresponding to a first ratio that is greater than the first preset threshold as a group of matched feature words, and mark those feature words.
Step 212: The text regions covered by the absolute positions of the marked feature words in the feature word sequences of the text to be detected and the existing text are the duplicated regions of the two texts.
Take position difference $a_{12}$ alone as an example. Suppose its occurrence count $b_{12}$ is 100 and its first ratio exceeds the first preset threshold, and that the feature words corresponding to $a_{12}$ have, in ascending order, the absolute positions 1, 2, 3, 4, 8, 9, 10, ..., 150 in the text to be detected and 101, 102, 103, 104, 108, 109, 110, ..., 250 in the existing text. Then the text region spanning absolute positions 1 to 150 in the text to be detected and the text region spanning absolute positions 101 to 250 in the existing text are the duplicated regions.
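Deriving the duplicated regions from the marked feature words might look like the sketch below. It assumes, as the example suggests, that a region is simply the span from the smallest to the largest marked position; the function name and the sample pairs (including the unrelated pair) are hypothetical.

```python
def repeat_region(pairs, diff):
    """pairs: (position in text to be detected, position in existing text).

    Returns ((test_start, test_end), (existing_start, existing_end)) for the
    pairs whose position difference equals diff, or None if there are none.
    """
    selected = [(t, e) for t, e in pairs if t - e == diff]
    if not selected:
        return None
    test_pos = [t for t, _ in selected]
    existing_pos = [e for _, e in selected]
    return (min(test_pos), max(test_pos)), (min(existing_pos), max(existing_pos))


# A few of the positions from the example (difference -100), plus one
# unrelated matched pair that must not affect the region.
pairs = [(p, p + 100) for p in (1, 2, 3, 4, 8, 9, 10, 150)] + [(20, 300)]

print(repeat_region(pairs, -100))  # ((1, 150), (101, 250))
```

Gaps inside the span, such as positions 5 to 7 here, correspond to feature words that did not match, which is how small rewrites inside a reprinted passage are tolerated.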
Step 213: Give a notice that the text to be detected and the existing text contain no duplicated paragraph.
Preferably, after the above detection, the position differences of several adjacent feature words can additionally be taken as one group, and the ratio of the total occurrence count of the position differences in the group to the overall occurrence count used as the measure of text similarity; this makes it possible to detect inserted or deleted words. The following embodiment describes this refined method for detecting text duplication, in which the position differences of three adjacent feature words form one group.
Referring to Fig. 3, after the occurrence counts of the position differences have been tallied, the main flow of the refined method for detecting text duplication in this embodiment is as follows:
Step 301: Count the occurrences of each feature word position difference.
Step 302: Calculate, for each position difference, the first ratio of its occurrence count to the sum of the occurrence counts of all position differences.
Step 303: Judge whether any first ratio is greater than the first preset threshold; if such a first ratio exists, continue to Step 304, otherwise continue to Step 305.
Step 304: Classify the feature words corresponding to a first ratio that is greater than the first preset threshold as a group of matched feature words, and mark those feature words.
Step 305: Take the position differences of three adjacent feature words as one group, and calculate for each group the second ratio of the sum of the occurrence counts of its position differences to the sum of the occurrence counts of all position differences. For example, suppose there are $n$ distinct position differences, which in ascending order are $a_1, a_2, a_3, \ldots, a_n$, with occurrence counts $b_1, b_2, b_3, \ldots, b_n$ respectively. Then $P_i = (b_i + b_{i+1} + b_{i+2}) / (b_1 + b_2 + b_3 + \cdots + b_n)$ is calculated for each $i = 1, 2, 3, \ldots, n-2$ and used as the measure of text similarity.
Step 306: judge whether to exist second ratio greater than second predetermined threshold value, if exist, then continue step 307, otherwise continue step 309.
Step 307: will be classified as one group of consistent feature word of coupling greater than the corresponding feature word of second ratio of second predetermined threshold value, and mark described feature word.
Step 308: it is text filed that the absolute position of the described feature word that marks in the feature word sequence of text to be measured and existing text covers, and is the repeat region of text to be measured and existing text.
Step 309: point out in text to be measured and the existing text not have the repetition paragraph.
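The two ratio tests in steps 301 to 307 can be sketched as follows. This is a hedged illustration only: the function names, the pair representation, and the sign convention of the position difference (existing position minus detected position) are assumptions, and the thresholds are arbitrary example values rather than values given in the patent.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]  # (position in text to be detected, position in existing text)


def difference_counts(pairs: Sequence[Pair]) -> Dict[int, int]:
    """Step 301: count how often each position difference occurs."""
    return dict(Counter(existing - detected for detected, existing in pairs))


def first_ratio_hits(counts: Dict[int, int], threshold: float) -> List[int]:
    """Steps 302-303: differences whose share of all occurrences exceeds the threshold."""
    total = sum(counts.values())
    return [d for d in sorted(counts) if counts[d] / total > threshold]


def second_ratio_hits(counts: Dict[int, int], threshold: float,
                      group_size: int = 3) -> List[List[int]]:
    """Steps 305-306: groups of adjacent (sorted) differences whose P_i exceeds the threshold."""
    total = sum(counts.values())
    diffs = sorted(counts)  # a_1 <= a_2 <= ... <= a_n
    hits = []
    for i in range(len(diffs) - group_size + 1):
        group = diffs[i:i + group_size]
        p_i = sum(counts[d] for d in group) / total  # (b_i + b_{i+1} + b_{i+2}) / sum of all b
        if p_i > threshold:
            hits.append(group)
    return hits


# Ten matched pairs: two inserted words in the existing text shift the difference
# from 100 to 101 and then to 102.
pairs = ([(i, i + 100) for i in range(1, 6)]
         + [(i, i + 101) for i in range(6, 9)]
         + [(i, i + 102) for i in range(9, 11)])
counts = difference_counts(pairs)
print(first_ratio_hits(counts, 0.6))   # no single difference dominates
print(second_ratio_hits(counts, 0.8))  # the three adjacent differences together do
```

In this example no single position difference exceeds the first threshold, but the group of three adjacent differences does exceed the second, which is how the grouped ratio tolerates inserted or deleted words.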
The following embodiment describes how to judge, by line fitting, whether a group of matched feature words exists.
Referring to Fig. 4, the detailed flow of the second method for detecting text repetition in this embodiment is as follows:
Step 401: Perform forward maximum matching word segmentation on the text to be detected and the existing text.
Step 402: Filter out the stop words of the text to be detected and the existing text according to a preset stop-word list, obtaining the feature words of both texts.
Step 403: Obtain an ordered feature word sequence according to the positional order of the feature words in the text to be detected and in the existing text.
Step 404: Take each feature word together with its absolute position in the feature word sequence as one feature of the text to be detected or of the existing text.
Step 405: Match each feature word of the text to be detected against each feature word of the existing text. If a match succeeds, continue with step 406; otherwise, continue with step 411.
Step 406: Determine the coordinates of each matched feature word in a rectangular coordinate system according to its absolute positions in the text to be detected and in the existing text, where the coordinate system is constructed from the absolute position of the feature word in the text to be detected and its absolute position in the existing text.
In this embodiment, the absolute position of a matched feature word in the feature word sequence of the text to be detected is the ordinate, and its absolute position in the feature word sequence of the existing text is the abscissa; each pair of matched feature words in the two texts corresponds to a point in the coordinate system. For example, suppose the feature word "news" appears at positions 5 and 41 in the feature word sequence of the text to be detected, and at positions 9, 37 and 45 in the feature word sequence of the existing text. Then the points (9, 5), (9, 41), (37, 5), (37, 41), (45, 5) and (45, 41) in the coordinate system each correspond to a pair of valid features. If the two texts contain a long repeated paragraph, it appears in the coordinate system as a straight line y = x + b with slope 1. Referring to Fig. 5, each circle represents a pair of matched feature words.
Step 407: Perform line fitting on the coordinates of the matched feature words.
Step 408: Judge whether there is a group of coordinates that can be fitted to a straight line with slope approximately 1. If such a group exists, continue with step 409; otherwise, continue with step 411.
In this embodiment, line fitting is prior art and is not described in detail here.
Step 409: Classify the feature words corresponding to the group of coordinates fitted to a straight line with slope approximately 1 as one group of matched feature words, and mark that group.
Step 410: The text region covered by the absolute positions of the marked feature words in the feature word sequences of the text to be detected and the existing text is the repeated region of the two texts.
Step 411: Report that the text to be detected and the existing text contain no repeated paragraph.
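Steps 405 to 409 can be sketched as below. The patent leaves the line-fitting method open, so this example substitutes a deliberately simple stand-in: it collects the points lying in a narrow band around each candidate intercept b of y = x + b, keeps the largest band, and verifies it with an ordinary least-squares slope. It is not a robust estimator such as RANSAC or a Hough transform. All function names and the parameters `min_points`, `band` and `slope_tol` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[int, int]  # (x: position in existing text, y: position in text to be detected)


def match_points(detected: Sequence[str], existing: Sequence[str]) -> List[Point]:
    """Steps 405-406: one point per pair of matching feature words (1-based positions)."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for x, word in enumerate(existing, start=1):
        positions[word].append(x)
    return [(x, y)
            for y, word in enumerate(detected, start=1)
            for x in positions.get(word, [])]


def least_squares_slope(points: Sequence[Point]) -> Optional[float]:
    """Ordinary least-squares slope of y on x; None if x has no spread."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def find_slope_one_group(points: Sequence[Point], min_points: int = 5,
                         band: int = 1, slope_tol: float = 0.1) -> Optional[List[Point]]:
    """Steps 407-409: largest band of points around some y = x + b, if its slope is close to 1."""
    best: List[Point] = []
    for b in sorted({y - x for x, y in points}):
        group = [(x, y) for x, y in points if abs((y - x) - b) <= band]
        if len(group) > len(best):
            best = group
    if len(best) < min_points:
        return None
    slope = least_squares_slope(best)
    if slope is None or abs(slope - 1.0) > slope_tol:
        return None
    return sorted(best)


existing_words = ["a", "b", "c", "d", "e", "f", "g", "h"]
detected_words = ["x", "c", "d", "e", "f", "g", "y"]
print(find_slope_one_group(match_points(detected_words, existing_words)))
```

The band width plays the same role as grouping adjacent position differences: a small band lets a few inserted or deleted words shift the intercept without breaking the group.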
The above describes the process of detecting repeated text. This process can be implemented by a device, whose internal structure and functions are described below.
Referring to Fig. 6, in this embodiment a device for detecting text repetition comprises: an extraction module 601, a matching module 602, a position acquisition module 603, a judging module 604 and a repeated text determination module 605.
The extraction module 601 is configured to obtain the feature words and feature word sequences of the text to be detected and the existing text;
The matching module 602 is configured to match each feature word in the text to be detected against each feature word in the existing text;
The position acquisition module 603 is configured to obtain, when the matching module matches feature words successfully, the absolute position of each matched feature word in the feature word sequence of the text to be detected and its absolute position in the feature word sequence of the existing text;
The judging module 604 is configured to judge whether there is a group of matched feature words such that the absolute positions of all feature words of the group in the feature word sequence of the text to be detected are linearly related to their absolute positions in the feature word sequence of the existing text;
The repeated text determination module 605 is configured to determine, when the judging module determines that a group of matched feature words exists, the repeated region of the text to be detected and the existing text according to the absolute positions of that group in the feature word sequences of the two texts.
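One way to picture how modules 601 to 605 fit together is the skeleton below. It is only a sketch under stated assumptions: the class name `DuplicateDetector`, the helper names, the use of whitespace splitting as a placeholder extraction step, and the choice of a single dominant position difference as the judging criterion are all hypothetical and are not taken from the patent.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[int, int]  # (position in text to be detected, position in existing text)
Region = Tuple[Tuple[int, int], Tuple[int, int]]


def match_positions(detected: Sequence[str], existing: Sequence[str]) -> List[Pair]:
    """Matching module 602 and position acquisition module 603 (1-based positions)."""
    index = defaultdict(list)
    for q, word in enumerate(existing, start=1):
        index[word].append(q)
    return [(p, q) for p, word in enumerate(detected, start=1) for q in index.get(word, [])]


def dominant_offset_group(pairs: List[Pair], threshold: float = 0.5) -> Optional[List[Pair]]:
    """Judging module 604, using one possible criterion: a single dominant position difference."""
    if not pairs:
        return None
    diff, count = Counter(q - p for p, q in pairs).most_common(1)[0]
    if count / len(pairs) <= threshold:
        return None
    return [(p, q) for p, q in pairs if q - p == diff]


@dataclass
class DuplicateDetector:
    extract: Callable[[str], List[str]]  # extraction module 601 (placeholder)
    judge: Callable[[List[Pair]], Optional[List[Pair]]] = dominant_offset_group

    def detect(self, detected_text: str, existing_text: str) -> Optional[Region]:
        """Repeated text determination module 605: span covered by the matched group."""
        pairs = match_positions(self.extract(detected_text), self.extract(existing_text))
        group = self.judge(pairs)
        if group is None:
            return None
        ps = [p for p, _ in group]
        qs = [q for _, q in group]
        return (min(ps), max(ps)), (min(qs), max(qs))


detector = DuplicateDetector(extract=str.split)
print(detector.detect("x y the quick brown fox", "the quick brown fox jumps"))
```

Because the judging step is passed in as a callable, a grouped-ratio or line-fitting criterion could be swapped in without changing the other modules.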
Preferably, the extraction module 601 comprises a word segmentation unit 701, a filtering unit 702 and a sorting unit 703, as shown in Fig. 7.
The word segmentation unit 701 is configured to perform word segmentation on the text to be detected and the existing text;
The filtering unit 702 is configured to filter the text to be detected and the existing text, obtaining the feature words of both texts;
The sorting unit 703 is configured to obtain the feature word sequences of the text to be detected and of the existing text, respectively, according to the positional order of the feature words in each text.
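The three units of the extraction module can be sketched as follows, using forward maximum matching for segmentation as in step 401. The dictionary, the stop-word list, the maximum word length and the function names are all assumptions chosen for the example; a real system would use a full lexicon and stop-word list.

```python
from typing import List, Set, Tuple


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Word segmentation unit: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def extract_features(text: str, dictionary: Set[str], stop_words: Set[str],
                     max_len: int = 4) -> List[Tuple[str, int]]:
    """Filtering and sorting units: drop stop words, then number the remaining
    feature words by their 1-based position in the feature word sequence."""
    kept = [w for w in forward_max_match(text, dictionary, max_len)
            if w not in stop_words and not w.isspace()]
    return [(word, position) for position, word in enumerate(kept, start=1)]


dictionary = {"今天", "新闻", "重要"}   # toy lexicon (assumption)
stop_words = {"的", "很"}               # toy stop-word list (assumption)
print(extract_features("今天的新闻很重要", dictionary, stop_words))
```

Note that the absolute position is counted after stop-word removal, so it indexes the feature word sequence rather than the raw text.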
Preferably, each feature word together with its absolute position in the feature word sequence of the text to be detected is taken as one feature of the text to be detected; the absolute positions of the matched feature words in the feature word sequence of the text to be detected and in the feature word sequence of the existing text are obtained.
Preferably, the device further comprises a data calculation module 606, a statistics module 607 and a data processing module 608, as shown in Fig. 8.
The data calculation module 606 is configured to calculate the position difference between the absolute position of each matched feature word in the text to be detected and its absolute position in the existing text;
The statistics module 607 is configured to count the occurrences of each position difference;
The data processing module 608 is configured to calculate the first ratio of the occurrence count of each position difference to the sum of the occurrence counts of all position differences;
The judging module 604 is configured to judge whether there is a first ratio greater than the first preset threshold; when it determines that such a first ratio exists, it classifies the feature words corresponding to the first ratio greater than the first preset threshold as one group of matched feature words, and determines that a group of matched feature words exists.
Preferably, the data processing module 608 is further configured to treat the position differences of several adjacent matched feature words as one group, and to calculate the second ratio of the sum of the occurrence counts in the group to the sum of the occurrence counts of all position differences;
The judging module 604 is further configured to judge whether there is a second ratio greater than the second preset threshold; when it determines that such a second ratio exists, it classifies the feature words corresponding to the second ratio greater than the second preset threshold as one group of matched feature words, and determines that a group of matched feature words exists.
Preferably, the device further comprises a coordinate extraction module 609 and a line fitting module 610, as shown in Fig. 9.
The coordinate extraction module 609 is configured to determine the coordinates of each matched feature word in a rectangular coordinate system according to its absolute positions in the text to be detected and in the existing text, where the coordinate system is constructed from the absolute position of the feature word in the text to be detected and its absolute position in the existing text;
The line fitting module 610 is configured to perform line fitting on the coordinates of the matched feature words;
The judging module 604 is configured to judge whether there is a group of coordinates that can be fitted to a straight line with slope approximately 1; if it determines that such a group exists, it classifies the feature words corresponding to the group of coordinates fitted to a straight line with slope approximately 1 as one group of matched feature words, and determines that a group of matched feature words exists.
In the embodiments of the invention, each feature word together with its absolute position in the feature word sequence of the text to be detected is taken as one feature of the text to be detected; the absolute positions of the matched feature words in the feature word sequences of the text to be detected and the existing text are obtained; and whether the two texts contain repeated text is judged either by calculating the position differences of the absolute positions or by line fitting. The method can effectively detect forms of text reuse such as reprinting, merging and excerpting, and can also detect inserted or deleted words.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system or a computer program product. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) containing computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to include them.

Claims (10)

1. A method for detecting text repetition, characterized by comprising the following steps:
Obtaining the feature words and feature word sequences of a text to be detected and an existing text;
Matching each feature word in the text to be detected against each feature word in the existing text;
When feature words are matched successfully, obtaining the absolute position of each matched feature word in the feature word sequence of the text to be detected and its absolute position in the feature word sequence of the existing text;
Judging whether there is a group of matched feature words such that the absolute positions of all feature words of the group in the feature word sequences of the text to be detected and the existing text are linearly related;
If a group of matched feature words exists, determining the repeated region of the text to be detected and the existing text according to the absolute positions of that group in the feature word sequences of the text to be detected and the existing text.
2. The method of claim 1, characterized in that the step of obtaining the feature words in the text to be detected and the feature word sequence of the text to be detected comprises:
Performing word segmentation on the text to be detected and the existing text;
Filtering the text to be detected and the existing text to obtain the feature words of both texts;
Obtaining the feature word sequences of the text to be detected and of the existing text, respectively, according to the positional order of the feature words in each text.
3. The method of claim 1, characterized in that the step of judging whether there is a group of matched feature words comprises:
Calculating the position difference between the absolute position of each matched feature word in the text to be detected and its absolute position in the existing text;
Counting the occurrences of each position difference;
Calculating the first ratio of the occurrence count of each position difference to the sum of the occurrence counts of all position differences;
Judging whether there is a first ratio greater than a first preset threshold;
If there is a first ratio greater than the first preset threshold, classifying the feature words corresponding to the first ratio greater than the first preset threshold as one group of matched feature words.
4. The method of claim 3, characterized in that the step of judging whether there is a group of matched feature words further comprises:
Treating the position differences of several adjacent matched feature words as one group, and calculating the second ratio of the sum of the occurrence counts in the group to the sum of the occurrence counts of all position differences;
Judging whether there is a second ratio greater than a second preset threshold;
If there is a second ratio greater than the second preset threshold, classifying the feature words corresponding to the second ratio greater than the second preset threshold as one group of matched feature words.
5. The method of claim 1, characterized in that the step of judging whether there is a group of matched feature words comprises:
Determining the coordinates of each matched feature word in a rectangular coordinate system according to its absolute positions in the text to be detected and in the existing text, where the coordinate system is constructed from the absolute position of the feature word in the text to be detected and its absolute position in the existing text;
Performing line fitting on the coordinates of the matched feature words;
Judging whether there is a group of coordinates that can be fitted to a straight line with slope approximately 1;
If such a group of coordinates exists, classifying the feature words corresponding to the group of coordinates fitted to a straight line with slope approximately 1 as one group of matched feature words.
6. A device for detecting text repetition, characterized by comprising:
An extraction module, configured to obtain the feature words and feature word sequences of a text to be detected and an existing text;
A matching module, configured to match each feature word in the text to be detected against each feature word in the existing text;
A position acquisition module, configured to obtain, when the matching module matches feature words successfully, the absolute position of each matched feature word in the feature word sequence of the text to be detected and its absolute position in the feature word sequence of the existing text;
A judging module, configured to judge whether there is a group of matched feature words such that the absolute positions of all feature words of the group in the feature word sequence of the text to be detected are linearly related to their absolute positions in the feature word sequence of the existing text;
A repeated text determination module, configured to determine, when it is judged that a group of matched feature words exists, the repeated region of the text to be detected and the existing text according to the absolute positions of that group in the feature word sequences of the text to be detected and the existing text.
7. The device of claim 6, characterized in that the extraction module comprises:
A word segmentation unit, configured to perform word segmentation on the text to be detected and the existing text;
A filtering unit, configured to filter the text to be detected and the existing text, obtaining the feature words of both texts;
A sorting unit, configured to obtain the feature word sequences of the text to be detected and of the existing text, respectively, according to the positional order of the feature words in each text.
8. The device of claim 6, characterized by further comprising: a data calculation module, a statistics module and a data processing module;
The data calculation module is configured to calculate the position difference between the absolute position of each matched feature word in the text to be detected and its absolute position in the existing text;
The statistics module is configured to count the occurrences of each position difference;
The data processing module is configured to calculate the first ratio of the occurrence count of each position difference to the sum of the occurrence counts of all position differences;
The judging module is configured to judge whether there is a first ratio greater than a first preset threshold; when it determines that such a first ratio exists, it classifies the feature words corresponding to the first ratio greater than the first preset threshold as one group of matched feature words, and determines that a group of matched feature words exists.
9. The device of claim 8, characterized in that the data processing module is further configured to treat the position differences of several adjacent matched feature words as one group, and to calculate the second ratio of the sum of the occurrence counts in the group to the sum of the occurrence counts of all position differences;
The judging module is further configured to judge whether there is a second ratio greater than a second preset threshold; when it determines that such a second ratio exists, it classifies the feature words corresponding to the second ratio greater than the second preset threshold as one group of matched feature words, and determines that a group of matched feature words exists.
10. The device of claim 6, characterized by further comprising: a coordinate extraction module and a line fitting module;
The coordinate extraction module is configured to determine the coordinates of each matched feature word in a rectangular coordinate system according to its absolute positions in the text to be detected and in the existing text, where the coordinate system is constructed from the absolute position of the feature word in the text to be detected and its absolute position in the existing text;
The line fitting module is configured to perform line fitting on the coordinates of the matched feature words;
The judging module is configured to judge whether there is a group of coordinates that can be fitted to a straight line with slope approximately 1; if such a group exists, it classifies the feature words corresponding to the group of coordinates fitted to a straight line with slope approximately 1 as one group of matched feature words, and determines that a group of matched feature words exists.
CN201310144339.4A 2013-04-23 2013-04-23 A kind of method and device detecting repeated text Expired - Fee Related CN103246640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310144339.4A CN103246640B (en) 2013-04-23 2013-04-23 A kind of method and device detecting repeated text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310144339.4A CN103246640B (en) 2013-04-23 2013-04-23 A kind of method and device detecting repeated text

Publications (2)

Publication Number Publication Date
CN103246640A true CN103246640A (en) 2013-08-14
CN103246640B CN103246640B (en) 2016-08-03

Family

ID=48926167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310144339.4A Expired - Fee Related CN103246640B (en) 2013-04-23 2013-04-23 A kind of method and device detecting repeated text

Country Status (1)

Country Link
CN (1) CN103246640B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5999896A (en) * 1996-06-25 1999-12-07 Microsoft Corporation Method and system for identifying and resolving commonly confused words in a natural language parser
CN1963807A (en) * 2005-11-11 2007-05-16 威知资讯股份有限公司 Automatic checking method of similitude file
CN101079025A (en) * 2006-06-19 2007-11-28 腾讯科技(深圳)有限公司 File correlation computing system and method
CN101620616A (en) * 2009-05-07 2010-01-06 北京理工大学 Chinese similar web page de-emphasis method based on microcosmic characteristic

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MOHAMED ELHADI等: "Webpage Duplicate Detection Using Combined POS and Sequence Alignment Algorithm", 《2009 WORLD CONGRESS ON COMPUTER SCIENCE AND INFORMATION ENGINEERING》 *
QI ZHANG等: "Efficient Partial-Duplicate Detection Based on Sequence Matching", 《PROCEEDINGS OF THE 33RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL》 *
张进: "中文网页查重方法研究", 《万方学位论文数据库》 *
连浩等: "一种改进的基于内容的快速网页查重算法", 《全国第八届计算语言学联合学术会议(JSCL-2005)论文集》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573027A (en) * 2015-01-13 2015-04-29 清华大学 System and method for excavating feature words from document set
CN106354730A (en) * 2015-07-16 2017-01-25 北京国双科技有限公司 Method and device for recognizing webpage text repeated content in webpage analysis
CN106354730B (en) * 2015-07-16 2019-12-10 北京国双科技有限公司 Method and device for identifying repeated content of webpage text in webpage analysis
CN113326688A (en) * 2021-06-16 2021-08-31 黑龙江八一农垦大学 Ideological and political theory word duplication checking processing method and device

Also Published As

Publication number Publication date
CN103246640B (en) 2016-08-03

Similar Documents

Publication Publication Date Title
CN107679144B (en) News sentence clustering method and device based on semantic similarity and storage medium
US10049096B2 (en) System and method of template creation for a data extraction tool
US8577155B2 (en) System and method for duplicate text recognition
KR101337874B1 (en) System and method for detecting malwares in a file based on genetic map of the file
CN104750795A (en) Intelligent semantic searching system and method
CN110929477B (en) Keyword variant determination method and device
CN105912514B (en) Text copy detection system and method based on fingerprint characteristic
US20150095769A1 (en) Layout Analysis Method And System
CN105022840A (en) News information processing method, news recommendation method and related devices
CN105187242B (en) A kind of user's anomaly detection method excavated based on variable-length pattern
CN102750379B (en) Fast character string matching method based on filtering type
CN103425639A (en) Similar information identifying method based on information fingerprints
CN110019640B (en) Secret-related file checking method and device
CN106407195B (en) Method and system for web page duplication elimination
CN114153962A (en) Data matching method and device and electronic equipment
CN109446410A (en) Knowledge point method for pushing, device and computer readable storage medium
CN105164676A (en) Query features and questions
CN104636319A (en) Text duplicate removal method and device
CN103500158A (en) Method and device for annotating electronic document
CN105095381A (en) Method and device for new word identification
CN103246640A (en) Duplicated text detection method and device
CN112784720A (en) Key information extraction method, device, equipment and medium based on bank receipt
CN110427622A (en) Appraisal procedure, device and the storage medium of corpus labeling
US20130322759A1 (en) Method and device for identifying font
CN104598473A (en) Information processing method and electronic device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: BEIJING KUYUN INTERACTION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: BEIJING TENFEN TECHNOLOGY CO., LTD.

Effective date: 20140120

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100004 CHAOYANG, BEIJING TO: 100007 DONGCHENG, BEIJING

TA01 Transfer of patent application right

Effective date of registration: 20140120

Address after: 100007 Beijing City, Dongcheng District Andingmen East Street, No. 28, building B block 15 layer

Applicant after: KUYUN INTERACTIVE TECHNOLOGY Ltd.

Address before: No. 7 East Hanwei building 18A1 100004 Beijing City Guanghua Road Chaoyang District

Applicant before: Beijing Tenfen Technology Co.,Ltd.

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160803