CN110019642A - Similar text detection method and device - Google Patents

Similar text detection method and device

Info

Publication number
CN110019642A
Authority
CN
China
Prior art keywords
text
screening
target
characteristic value
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710663797.7A
Other languages
Chinese (zh)
Inventor
贺达
徐文斌
宣静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201710663797.7A
Publication of CN110019642A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a similar text detection method and device in the technical field of text processing. It addresses the problem that existing similar text detection takes a long time and is therefore inefficient. The method comprises: screening a set of texts to be detected using keywords extracted from a target text, to obtain a screened text set; calculating the characteristic value of each screened text in the screened text set and the characteristic value of the target text; judging whether the characteristic value of a screened text is identical to the characteristic value of the target text; and, if so, determining that the screened text is similar to the target text. The invention is suitable for the detection of similar texts.

Description

Similar text detection method and device
Technical field
The present invention relates to the technical field of text processing, and in particular to a similar text detection method and device.
Background
With the rapid growth of the network, the internet has become a key channel through which major enterprises place advertisements. To promote product information more effectively, manufacturers have gradually begun to replace earlier forms of advertising with "soft text" (advertorial) advertisements. A "soft text" advertisement combines the product that the manufacturer wants to recommend with a related article, so that readers who agree with the ideas in the article are more receptive to the product it recommends. To verify the promotional effect of a "soft text" advertisement, a manufacturer typically uses existing algorithms or models such as TF-IDF or LDA to compute the features of the original text of the advertisement and of the texts to be judged on the network, compares the computed features one by one to decide whether the texts are similar, and from that judges the promotional effect.
At present, detecting similar texts requires computing the features of every text to be judged and comparing them with the features of the original text. When the number of texts to be judged on the network is large, however, the number of features that must be computed and compared grows accordingly, so detection takes longer and its efficiency drops.
Summary of the invention
In view of the above problems, the present invention provides a similar text detection method and device whose main purpose is to reduce the time spent during similar text detection and thereby improve its efficiency.
To solve the above technical problem, in a first aspect the present invention provides a similar text detection method, the method comprising:
screening a set of texts to be detected using keywords extracted from a target text, to obtain a screened text set;
calculating the characteristic value of each screened text in the screened text set and the characteristic value of the target text;
judging whether the characteristic value of the screened text is identical to the characteristic value of the target text;
if so, determining that the screened text is similar to the target text.
Optionally, calculating the characteristic value of each screened text in the screened text set and the characteristic value of the target text comprises:
extracting a preset number of central-segment sentences from the screened text and from the target text respectively, a central-segment sentence being a sentence obtained by splitting the central segment of a text;
calculating, according to a hash algorithm, the hash value corresponding to each central-segment sentence of the screened text and of the target text, and generating a hash array for the screened text and a hash array for the target text.
Optionally, judging whether the characteristic value of the screened text is identical to the characteristic value of the target text comprises:
setting a threshold according to the number of hash values in the hash array;
judging whether the number of hash values in the hash array of the screened text that are identical to hash values in the hash array of the target text exceeds the threshold;
and determining that the screened text is similar to the target text if the characteristic value of the screened text is identical to the characteristic value of the target text comprises:
if the number of hash values in the hash array of the screened text that are identical to hash values in the hash array of the target text exceeds the threshold, determining that the screened text is similar to the target text.
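The threshold variant above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the source says only that the threshold is set according to the number of hash values, so the rule used here (a fixed percentage of the target's hash count, parameter `percent`) and the function name are assumptions.

```python
def is_similar_by_threshold(screened_hashes, target_hashes, percent=60):
    """Return True when the number of shared hash values exceeds the threshold."""
    # Assumed rule: the threshold is a fixed percentage of the target's hash count.
    threshold = len(target_hashes) * percent // 100
    # Count hash values that occur in both arrays, ignoring order and repeats.
    shared = len(set(screened_hashes) & set(target_hashes))
    return shared > threshold
```

With five target hash values and `percent=60`, the threshold is 3, so a screened text must share at least four hash values to be judged similar.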
Optionally, judging whether the characteristic value of the screened text is identical to the characteristic value of the target text comprises:
judging whether the hash values in the hash array of the screened text are identical to the hash values in the hash array of the target text;
and determining that the screened text is similar to the target text if the characteristic value of the screened text is identical to the characteristic value of the target text comprises:
if the hash values in the hash array of the screened text are identical to the hash values in the hash array of the target text, determining that the screened text is similar to the target text.
Optionally, screening the set of texts to be detected using keywords extracted from the target text, to obtain the screened text set, comprises:
extracting a plurality of keywords from the target text;
judging, one by one, whether each text to be detected in the set of texts to be detected contains the plurality of keywords;
if so, determining that the text to be detected is a screened text.
Optionally, before screening the set of texts to be detected using keywords extracted from the target text to obtain the screened text set, the method further comprises:
parsing the content of the target text, and determining the text category of the target text according to the content;
obtaining the texts corresponding to the text category, to obtain the set of texts to be detected.
In a second aspect, the present invention further provides a similar text detection device, the device comprising:
a screening unit, configured to screen a set of texts to be detected using keywords extracted from a target text, to obtain a screened text set;
a computing unit, configured to calculate the characteristic value of each screened text in the screened text set obtained by the screening unit and the characteristic value of the target text;
a judging unit, configured to judge whether the characteristic value of the screened text calculated by the computing unit is identical to the characteristic value of the target text;
a determination unit, configured to determine that the screened text is similar to the target text if the judging unit judges that the characteristic value of the screened text is identical to the characteristic value of the target text.
Optionally, the computing unit comprises:
an extraction module, configured to extract a preset number of central-segment sentences from the screened text and from the target text respectively, a central-segment sentence being a sentence obtained by splitting the central segment of a text;
a computing module, configured to calculate, according to a hash algorithm, the hash value corresponding to each central-segment sentence extracted by the extraction module from the screened text and the target text, and to generate a hash array for the screened text and a hash array for the target text.
Optionally, the judging unit comprises:
a setting module, configured to set a threshold according to the number of hash values in the hash array;
a first judgment module, configured to judge whether the number of hash values in the hash array of the screened text that are identical to hash values in the hash array of the target text exceeds the threshold set by the setting module;
the determination unit being specifically configured to determine that the screened text is similar to the target text if the number of hash values in the hash array of the screened text that are identical to hash values in the hash array of the target text exceeds the threshold.
Optionally, the judging unit comprises:
a second judgment module, configured to judge whether the hash values in the hash array of the screened text are identical to the hash values in the hash array of the target text;
the determination unit being specifically configured to determine that the screened text is similar to the target text if the hash values in the hash array of the screened text are identical to the hash values in the hash array of the target text.
Optionally, the screening unit comprises:
an extraction module, configured to extract a plurality of keywords from the target text;
a judgment module, configured to judge, one by one, whether each text to be detected in the set of texts to be detected contains the plurality of keywords extracted by the extraction module;
a determining module, configured to determine that a text to be detected is a screened text if the judgment module judges that the text contains the plurality of keywords.
Optionally, the device further comprises:
a parsing unit, configured to parse the content of the target text and determine the text category of the target text according to the content;
an acquiring unit, configured to obtain the texts corresponding to the text category determined by the parsing unit, to obtain the set of texts to be detected.
To achieve the above purpose, according to a third aspect of the present invention, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the similar text detection method described above.
To achieve the above purpose, according to a fourth aspect of the present invention, a processor is provided. The processor is configured to run a program, wherein the program, when running, executes the similar text detection method described above.
With the above technical solution, the similar text detection method and device provided by the present invention address the fact that, in the prior art, detecting similar texts requires computing the features of every text to be judged and comparing them with the features of the original text. The present invention screens the texts to be detected using keywords extracted from the target text, calculates and compares the characteristic values of the screened texts and the target text, and determines that a screened text is similar to the target text when the two characteristic values are identical. Compared with the prior art, the screening operation eliminates the texts to be detected that do not meet the screening conditions, which effectively reduces the number of texts that need to be examined, reduces the computation spent on calculating text characteristic values, and thus improves the efficiency of similar text detection. In addition, comparing the characteristic value of each screened text with that of the target text, and determining that the two are similar when the values are identical, helps ensure the accuracy of the detection result.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the present invention are more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flow chart of a similar text detection method provided by an embodiment of the present invention;
Fig. 2 shows a flow chart of another similar text detection method provided by an embodiment of the present invention;
Fig. 3 shows a block diagram of a similar text detection device provided by an embodiment of the present invention;
Fig. 4 shows a block diagram of another similar text detection device provided by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
In order to improve the efficiency of similar text detection, an embodiment of the present invention provides a similar text detection method. As shown in Fig. 1, the method comprises:
101. Screen a set of texts to be detected using keywords extracted from a target text, to obtain a screened text set.
In general, similar text detection needs an original text to compare against, that is, a text that serves as the reference when the texts to be detected are examined. In the method of this embodiment, that original text is the target text, so detecting similar texts means comparing each text to be detected with the target text.
In this step, the keywords are first extracted from the target text. A keyword can be understood as one of the more important words in the target text. The rules for selecting keywords and the number of keywords are not specifically limited here and can be chosen as needed. For example, the most frequent word in the target text can be chosen as the keyword of this step, or the three most frequent words can be chosen as the keyword group of this step. The set of texts to be detected is then screened according to the keyword or keyword group, and the texts that remain after screening form the screened text set.
It should be noted that how the screening is carried out depends on the number of keywords and on practical needs. For example, when there are few keywords and the set of texts to be detected is especially large, texts that contain only some of the keywords and texts that contain none can be rejected, and only the texts that contain all of the keywords are retained as the screened text set. Alternatively, when there are many keywords and the set of texts to be detected is small, only the texts that contain none of the keywords are rejected, and texts that contain some of the keywords in the keyword group are retained as the screened text set. The specific screening rule and the number and type of keywords are not limited here and can be adjusted to the actual situation.
For example, when the set of texts to be detected contains 10,000,000 texts and there are 3 keywords, the texts that contain all 3 keywords can be retained as the screened text set, which effectively reduces the amount of subsequent computation. When the set contains 400 texts and there are 15 keywords, texts that contain 10 or more of the 15 keywords can be determined to be screened texts and collected into the screened text set.
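The two screening policies described in step 101 (require every keyword, or require only some minimum number of them) can be sketched with one function. The function name and the `min_hits` parameter are hypothetical, and plain substring matching stands in for whatever matching a real system would use.

```python
def screen_texts(texts, keywords, min_hits=None):
    """Keep the texts that contain at least min_hits of the keywords.

    When min_hits is None, a text must contain every keyword (the strict
    policy suggested for very large sets of texts to be detected).
    """
    required = len(keywords) if min_hits is None else min_hits
    kept = []
    for text in texts:
        # Count how many distinct keywords occur in this text.
        hits = sum(1 for kw in keywords if kw in text)
        if hits >= required:
            kept.append(text)
    return kept
```

Under these assumptions, the 400-text example corresponds to calling `screen_texts(texts, keywords, min_hits=10)` with 15 keywords, while the 10,000,000-text example corresponds to the default strict policy with 3 keywords.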
102. Calculate the characteristic value of each screened text in the screened text set and the characteristic value of the target text.
Specifically, a characteristic value in this step can be understood as the numerical value obtained by quantifying a feature of a text with a preset algorithm. A feature of a text can be understood as a word, word group, or sentence that distinguishes the text or can be used to compare texts. For example, the central sentence of each paragraph or the thesis sentence of the text can be chosen as the feature of the text; alternatively, the most frequent word in the text, or the word group formed by the several most frequent words, can be chosen as the feature.
It should be noted that the way features are selected is not specifically limited here, but the feature selection for the screened texts must be the same as that for the target text. For example, if the feature chosen for the target text is the central sentence of each paragraph, the feature chosen for a screened text should also be the central sentence of each of its paragraphs. Once the features of a text have been obtained, they can be converted to characteristic values by a preset algorithm or model. The algorithm or model is not specifically limited here and can be chosen as needed; for example, a hash algorithm or another algorithm can be used. A hash algorithm, also called a hashing algorithm, transforms an input of arbitrary length into an output of fixed length, and that output is the hash value. In short, it is a function that compresses a message of arbitrary length into a message digest of fixed length.
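The fixed-length property of a hash algorithm can be illustrated with Python's standard `hashlib` module. The source does not name a specific hash function; MD5 is assumed here only because it yields 32 hexadecimal digits, and the function name `feature_value` is hypothetical.

```python
import hashlib

def feature_value(sentence: str) -> str:
    """Map a sentence of any length to a fixed-length hexadecimal digest."""
    # MD5 is an assumption for illustration; any hash function with a
    # fixed-length output would fit the description in the text.
    return hashlib.md5(sentence.encode("utf-8")).hexdigest()
```

Identical sentences always produce identical digests, which is what allows characteristic values to be compared for equality instead of comparing the sentences themselves.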
103. Judge whether the characteristic value of the screened text is identical to the characteristic value of the target text.
After the characteristic value of each screened text in the screened text set and the characteristic value of the target text have been calculated, the characteristic values can be compared to judge whether they are identical. Note that the characteristic values used in this step are those calculated in step 102. Therefore, when several characteristic values are calculated for the target text in step 102, several should also be calculated for each screened text, and this step must then establish that the characteristic values of the two texts are exactly the same. Whether they are compared one by one in order or compared as sets is chosen according to actual needs and is not limited here.
104. If the characteristic value of the screened text is judged to be identical to the characteristic value of the target text, determine that the screened text is similar to the target text.
As described in step 103, when several characteristic values are calculated per text, whether the characteristic values of a screened text and the target text are identical can be judged either by one-to-one comparison or by set comparison. With set comparison, the screened text is judged similar to the target text when every characteristic value of the target text is present among the characteristic values of the screened text.
For example, suppose the characteristic values calculated for the screened text are 391c3337c24994e2bb19914ff62fa79f, df803e7100844162686e417f63944a08 and 1c3c0845ab5a5862f95a76e2473e6cfa, and the characteristic values calculated for the target text are df803e7100844162686e417f63944a08, 1c3c0845ab5a5862f95a76e2473e6cfa and 391c3337c24994e2bb19914ff62fa79f. The target text and the screened text then contain the same characteristic values and differ only in their order, so the two can be determined to be similar texts.
In general, a similar text may have been produced from a text by shuffling the order of its paragraphs or of certain sentences, so in similar text detection the order of the characteristic values is likely to differ. For this reason set comparison is usually the appropriate choice for this step. The specific comparison method can nevertheless be chosen according to actual needs; the comparison described in this step is only an example and is not a specific limitation.
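The difference between one-to-one comparison and set comparison can be made concrete with the three characteristic values from the example in step 104. This is a sketch only; the function names are hypothetical.

```python
def same_in_order(screened_values, target_values):
    """One-to-one comparison: values must match position by position."""
    return list(screened_values) == list(target_values)

def same_as_set(screened_values, target_values):
    """Set comparison: every target value must appear among the screened values."""
    return set(target_values) <= set(screened_values)

screened = [
    "391c3337c24994e2bb19914ff62fa79f",
    "df803e7100844162686e417f63944a08",
    "1c3c0845ab5a5862f95a76e2473e6cfa",
]
target = [
    "df803e7100844162686e417f63944a08",
    "1c3c0845ab5a5862f95a76e2473e6cfa",
    "391c3337c24994e2bb19914ff62fa79f",
]
```

For these inputs the ordered comparison fails while the set comparison succeeds, which is why set comparison suits texts whose paragraphs or sentences have been reordered.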
With the similar text detection method provided by this embodiment, in contrast to the prior art, which must compute the features of every text to be judged and compare them with those of the original text, the set of texts to be detected is screened using keywords extracted from the target text. Texts that do not meet the screening conditions are eliminated, which effectively reduces the number of texts that need to be examined, reduces the computation spent on calculating text characteristic values, and thus improves the efficiency of similar text detection. In addition, comparing the characteristic value of each screened text with that of the target text, and determining that the two are similar when the values are identical, helps ensure the accuracy of the detection result.
Further, as a refinement and extension of the embodiment shown in Fig. 1, an embodiment of the present invention provides another similar text detection method, as shown in Fig. 2.
201. Parse the content of the target text, and determine the text category of the target text according to the content.
In this embodiment, the target text is the same as the target text described in step 101 and is not described again here.
In this step, after the target text is obtained, its content is parsed to obtain its category. This can be understood as analysing the content of the target text to determine its specific type, such as news or advertisement, or a more specific category such as skin-care advertisement or health-product advertisement. The type is determined from the target text in actual operation. The parsing can be carried out manually or by software with natural language analysis capability; which to use can be chosen according to the amount of text in the target text or according to actual needs, and is not limited here.
Parsing the content of the target text and determining its category from that content lets the main content of the target text decide the text category. This bounds the range from which the texts of that category are subsequently obtained, which reduces the number of texts obtained and the number of texts examined, and so improves overall detection efficiency.
202. Obtain the texts corresponding to the text category, to obtain the set of texts to be detected.
According to the text category determined in step 201, texts of that category can be obtained from the network to form the set of texts to be detected. The texts can be obtained with a web crawler or in any other way; the choice is not limited here. In addition, when the texts of the text category are obtained, the uniform resource locator (URL) of each text can be obtained by the web crawler at the same time, so that the source of the text can be traced. A URL is a character string that describes the location of an internet resource and the method of accessing it; it can be understood as the standard address of a resource on the internet. Each file on the internet has a unique URL.
In this step, forming the set of texts to be detected from texts of the corresponding category makes it possible, when determining whether texts similar to the target text exist on the network, to control the range and number of texts to be detected in a targeted way. Pointless text retrieval is avoided at the source, which reduces the overall time spent on similar text detection and further improves detection efficiency.
203. Screen the set of texts to be detected using keywords extracted from the target text, to obtain the screened text set.
Specifically, this step comprises: first, extracting a plurality of keywords from the target text; then, judging one by one whether each text to be detected in the set of texts to be detected contains the plurality of keywords; finally, if a text in the set is determined to contain the plurality of keywords, determining that text to be a screened text, and taking the set formed by all screened texts obtained in this way as the screened text set.
The number of keywords in this step can be chosen as needed, as can the model or program used to extract them. For example, when TF-IDF is used as the keyword extraction tool, the several highest-ranked words can be chosen as the keywords of the target text. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It can be regarded as a statistical method for assessing how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query.
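A minimal sketch of TF-IDF keyword extraction follows, using only the Python standard library. It assumes the texts have already been split into word tokens (for Chinese text, by some word segmenter that is not shown), and it uses one common smoothed form of the inverse document frequency; the source does not specify which TF-IDF variant is intended, and the function name is hypothetical.

```python
import math
from collections import Counter

def top_keywords(target_tokens, corpus_tokens, k=3):
    """Return the k tokens of the target text with the highest TF-IDF score.

    corpus_tokens is a list of token lists, one per document in the corpus.
    """
    n_docs = len(corpus_tokens)
    counts = Counter(target_tokens)
    scores = {}
    for term, count in counts.items():
        # Term frequency: share of the target text taken up by this term.
        tf = count / len(target_tokens)
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus_tokens if term in doc)
        # Smoothed inverse document frequency (one common variant, assumed here).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = tf * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

A word that is frequent in the target text but also appears in most corpus documents receives a lower score than an equally frequent word that is rare in the corpus, which matches the behaviour described above.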
In this step, after the keywords have been extracted, the texts to be detected can be stored in a full-text search engine, such as Elasticsearch. The texts to be detected are then retrieved by keyword, and the texts that meet the screening conditions are taken as screened texts.
For example, when the keywords extracted from the target text are the 4 keywords "whitening", "isolation", "saturating white" and "porcelain flesh", the texts to be detected are screened through the full-text search engine. If a text to be detected contains all 4 keywords, it is retained as a screened text; otherwise it is screened out.
In addition, when many keywords are extracted, texts that contain only some of the keywords can also be taken as screened texts in the screened text set, rather than only the texts that contain all of the keywords.
For example, suppose the keywords extracted from the target text are the 7 keywords "water profit", "moisturizing", "talent for swimming", "moist", "tender", "grown support" and "soft and smooth", and this step treats a text that contains more than 4 of the keywords as a screened text. The texts to be detected are screened through the full-text search engine. If a text to be detected contains 6 of the keywords, it is retained as a screened text and added to the screened text set; if it contains 3 of the keywords, it is screened out.
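One way the keyword retrieval could be expressed for Elasticsearch is a `bool` query whose `should` clauses each match one keyword, with `minimum_should_match` set to the required number of keywords. The sketch below only builds the request body as a Python dictionary and makes no client call; the field name `content` and the function name are assumptions, and the source does not say which query form is actually used.

```python
def build_screening_query(keywords, min_hits, field="content"):
    """Build an Elasticsearch request body that matches documents
    containing at least min_hits of the given keywords."""
    return {
        "query": {
            "bool": {
                # One phrase clause per keyword; the field name is hypothetical.
                "should": [{"match_phrase": {field: kw}} for kw in keywords],
                # Require at least this many of the clauses to match.
                "minimum_should_match": min_hits,
            }
        }
    }
```

For the seven-keyword example, "more than 4" corresponds to `min_hits=5`; for the four-keyword example, requiring all keywords corresponds to `min_hits=4`.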
In this way, screening the texts to be detected with the keywords extracted from the target text screens out the texts that do not contain the keywords, which reduces the total number of texts whose characteristic values must subsequently be calculated and thus reduces the subsequent amount of calculation.
204. Calculate the characteristic value of each screening text in the screening text set and the characteristic value of the target text.
This step specifically includes: first, extracting a preset number of central segment sentences from each screening text in the screening text set and from the target text, respectively; then, calculating with a hash algorithm the hash value corresponding to each central segment sentence in the screening text and in the target text, and generating a hash array corresponding to the screening text and a hash array corresponding to the target text. Here, a central segment sentence is a sentence obtained by splitting the central segment of a text.
According to the above, the central segment of a text must be determined before the central segment sentences are extracted. In the embodiments of the present invention, the central segment can be understood as the main paragraphs of the text content, that is, the paragraphs that characterize the overall meaning of the text or contain its gist. Here, the first and last paragraphs of the text may be screened out, and the remaining paragraphs determined to be the central segment of the text.
It should be noted that the method of determining the central segment in this step depends on actual use. Similar texts on the network are often produced by individual media outlets after slight modification. To promote themselves when reprinting, reprinters usually add extra information before or after the original text, for example the official account, website information, or reprinter information of the reprinting media outlet, while the remaining content is rarely modified. Screening out the first and last paragraphs as described in this step therefore removes the parts that reprinters added to the reprinted text, which ensures that the remainder is the main content of the original. Of course, this way of determining the central segment is only an example, and other ways may be chosen as needed.
After the central segment of the text is determined, a preset number of sentences, namely the central segment sentences, are extracted from it. The number of central segment sentences is determined by the content of the target text. For example, when the target text contains 60 sentences, 20 or 30 sentences may be extracted. The extracted sentences then account for 1/3 to 1/2 of the sentences in the text, so they can largely represent the gist of the entire central segment. The preset number in this step is thus a number that makes up a relatively large proportion of the central segment sentences, which ensures that the extracted sentences can represent the whole content. When extracting central segment sentences, extraction may start from the beginning of the central segment; or a certain number of sentences may be extracted from each paragraph of the central segment according to its paragraph distribution; or a certain proportion of sentences may be selected at random according to the total number of sentences in the central segment. The extraction method is not specifically limited here and may be chosen as needed, provided that the same method is used for the screening text and the target text. In addition, when the central segment is split into sentences, the full stop may be used as the splitting symbol; other symbols, such as the comma, may also be chosen as needed, and no specific limitation is imposed here.
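One of the extraction options described above (drop the first and last paragraphs, split the remainder on full stops, and take sentences from the beginning) might be sketched as follows. The function name `central_sentences`, the newline-based paragraph split, and the delimiter pattern are assumptions for illustration only.

```python
import re

def central_sentences(text, limit, delimiter_pattern=r"[。.]"):
    """Return up to `limit` sentences taken from the central segment of `text`."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    # Drop the first and last paragraphs, which a reprinter may have added.
    # Texts with fewer than three paragraphs are kept whole (an assumption).
    central = paragraphs[1:-1] if len(paragraphs) > 2 else paragraphs
    sentences = []
    for paragraph in central:
        for sentence in re.split(delimiter_pattern, paragraph):
            sentence = sentence.strip()
            if sentence:
                sentences.append(sentence)
    # Extract from the beginning of the central segment.
    return sentences[:limit]

# Hypothetical reprinted text with an added header and footer.
reprint = (
    "Follow our account.\n"
    "First point. Second point.\n"
    "Third point.\n"
    "Visit our site."
)
picked = central_sentences(reprint, limit=2)
```

Whatever variant is used, the same function and the same `limit` would need to be applied to both the screening text and the target text so that their hash arrays are comparable.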
After the central segment has been determined and the central segment sentences extracted, the hash values of the extracted sentences can be calculated with a hash algorithm. The hash algorithm is the same as described in step 102 and is not repeated here. Because there are multiple central segment sentences in this step, multiple hash values are obtained for the target text and for the screening text, so the hash algorithm produces a hash array corresponding to the target text and a hash array corresponding to the screening text.
It should be noted that, when the hash value of a central segment sentence is calculated, the punctuation marks in the sentence must first be removed to obtain text without punctuation, and the hash value of that text is then calculated with the hash algorithm.
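Punctuation removal followed by hashing might be sketched as follows. The choice of SHA-1 is an assumption: this section does not name the hash algorithm, although the 40-hexadecimal-digit values in the later examples have the length of a SHA-1 digest. The helper names are hypothetical.

```python
import hashlib
import unicodedata

def strip_punctuation(sentence):
    """Remove every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in sentence
        if not unicodedata.category(ch).startswith("P")
    )

def sentence_hash(sentence):
    """Hash one central segment sentence after removing its punctuation."""
    cleaned = strip_punctuation(sentence)
    # SHA-1 is assumed here purely for illustration.
    return hashlib.sha1(cleaned.encode("utf-8")).hexdigest()

def hash_array(sentences):
    """Build the hash array of a text from its central segment sentences."""
    return [sentence_hash(s) for s in sentences]

hashes = hash_array(["Hello, world!", "Hello world"])
```

Because punctuation is stripped first, two sentences that differ only in punctuation map to the same hash value, so small punctuation edits by a reprinter do not hide a match.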
In this step, extracting a preset number of central segment sentences from the target text and the screening text ensures that the text content can still be characterized while avoiding the calculation of hash values over the full content of each text. This reduces the amount of calculation per text, shortens the calculation time, and improves calculation efficiency.
205. Judge whether the characteristic value of the screening text is identical to the characteristic value of the target text.
In a first aspect, this step may specifically include: setting a threshold according to the number of hash values in the hash array; and judging whether the number of hash values in the hash array of the screening text that are identical to hash values in the hash array of the target text exceeds the threshold.
Because step 204 calculates not one hash value per text but multiple hash values corresponding to multiple central segment sentences, each text corresponds to one hash array. Therefore, when the hash values in the hash array of the screening text are compared with those in the hash array of the target text, a threshold may first be set; the number of identical hash values in the two hash arrays is then compared with the threshold to judge whether it exceeds the threshold.
In a second aspect, this step may specifically be: judging whether the hash values in the hash array of the screening text are identical to the hash values in the hash array of the target text.
Using the hash arrays calculated in step 204, the hash array of the target text is compared with the hash array of the screening text to determine whether the hash values in the two hash arrays are identical.
206. If the characteristic value of the screening text is identical to the characteristic value of the target text, determine that the screening text is similar to the target text.
Corresponding to the method of the first aspect in step 205, this step may specifically be: if the number of hash values in the hash array of the screening text that are identical to hash values in the hash array of the target text exceeds the threshold, determine that the screening text is similar to the target text.
For example, suppose the hash array of the target text is: 5ebeb3d0edb5518f6fd7323644081e930749d0b4, e6552fbd0fd75d2d077623abc05720b570ef7805, d929507eb444f3a2252a67e221233e69dba248e0, 2480b68d4c703efd6e4dbd632b49e63136761448, 625951d58f17acb62e6bc01f8a60144870006567, and the hash array of the screening text is: 5ebeb3d0edb5518f6fd7323644081e930749d0b4, e6552fbd0fd75d2d077623abc05720b570ef7805, d7c608d77977bd21a3153dbf54a44c70393769d4, 1f508912575219724c92c9d681b7a2f735f3024d, d929507eb444f3a2252a67e221233e69dba248e0. With the threshold set to 2, the two hash arrays share 3 identical hash values, which is greater than the threshold of 2, so the screening text can be determined to be similar to the target text.
Corresponding to the method of the second aspect in step 205, this step may specifically be: if the hash values in the hash array of the screening text are identical to the hash values in the hash array of the target text, determine that the screening text is similar to the target text.
For example, suppose the hash array of the target text is: 5ebeb3d0edb5518f6fd7323644081e930749d0b4, e6552fbd0fd75d2d077623abc05720b570ef7805, d929507eb444f3a2252a67e221233e69dba248e0, 2480b68d4c703efd6e4dbd632b49e63136761448, 625951d58f17acb62e6bc01f8a60144870006567, and the hash array of the screening text is: 5ebeb3d0edb5518f6fd7323644081e930749d0b4, e6552fbd0fd75d2d077623abc05720b570ef7805, d7c608d77977bd21a3153dbf54a44c70393769d4, 1f508912575219724c92c9d681b7a2f735f3024d, d929507eb444f3a2252a67e221233e69dba248e0. The hash values in the two hash arrays are not all identical, since only 3 of them match, so the screening text is determined not to be similar to the target text.
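The two comparison modes of steps 205-206 can be checked against the example hash arrays with a short sketch. The function names and the order-insensitive set comparison are assumptions; the example itself implies that position is ignored, because the third shared value appears at different positions in the two arrays. The fraction used to derive a threshold from the array size is hypothetical, as the description only states that the threshold depends on the number of hash values.

```python
target_hashes = [
    "5ebeb3d0edb5518f6fd7323644081e930749d0b4",
    "e6552fbd0fd75d2d077623abc05720b570ef7805",
    "d929507eb444f3a2252a67e221233e69dba248e0",
    "2480b68d4c703efd6e4dbd632b49e63136761448",
    "625951d58f17acb62e6bc01f8a60144870006567",
]
screening_hashes = [
    "5ebeb3d0edb5518f6fd7323644081e930749d0b4",
    "e6552fbd0fd75d2d077623abc05720b570ef7805",
    "d7c608d77977bd21a3153dbf54a44c70393769d4",
    "1f508912575219724c92c9d681b7a2f735f3024d",
    "d929507eb444f3a2252a67e221233e69dba248e0",
]

def threshold_for(hash_array, numerator=2, denominator=5):
    """Derive a threshold from the array size (the fraction is hypothetical)."""
    return len(hash_array) * numerator // denominator

def shared_count(a, b):
    """Number of hash values that occur in both arrays, ignoring position."""
    return len(set(a) & set(b))

def similar_first_aspect(a, b, threshold):
    """First aspect: similar if the shared count exceeds the threshold."""
    return shared_count(a, b) > threshold

def similar_second_aspect(a, b):
    """Second aspect: similar only if both arrays hold the same hash values."""
    return set(a) == set(b)

threshold = threshold_for(target_hashes)                       # 2 for five hashes
first = similar_first_aspect(screening_hashes, target_hashes, threshold)
second = similar_second_aspect(screening_hashes, target_hashes)
```

On these arrays the first aspect reports the texts as similar (3 shared values against a threshold of 2), while the second aspect does not, which matches the two worked examples.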
With the method of the first aspect in steps 205-206, a threshold is set and compared with the number of identical hash values in the two hash arrays to determine whether the two texts are similar. This allows the similar text detection method of the embodiments of the present invention to identify two texts with a high degree of similarity, and thus widens the range of similar texts that can be identified. With the method of the second aspect in steps 205-206, identical texts can be identified, which improves the accuracy of similar text detection.
Further, as an implementation of the method shown in Fig. 1, an embodiment of the present invention also provides a similar text detection device for implementing the method shown in Fig. 1. This device embodiment corresponds to the foregoing method embodiment. For ease of reading, the device embodiment does not repeat the details of the method embodiment one by one, but it should be clear that the device in this embodiment can implement everything in the method embodiment. As shown in Fig. 3, the device includes: a screening unit 31, a computing unit 32, a judging unit 33, and a determination unit 34, wherein
The screening unit 31 may be used to screen a text set to be detected by keywords extracted from a target text to obtain a screening text set.
The computing unit 32 may be used to calculate the characteristic value of each screening text in the screening text set obtained by the screening unit 31 and the characteristic value of the target text.
The judging unit 33 may be used to judge whether the characteristic value of the screening text calculated by the computing unit 32 is identical to the characteristic value of the target text.
The determination unit 34 may be used to determine that the screening text is similar to the target text if the judging unit 33 judges that the characteristic value of the screening text is identical to the characteristic value of the target text.
Further, as an implementation of the method shown in Fig. 2, an embodiment of the present invention also provides another similar text detection device for implementing the method shown in Fig. 2. This device embodiment corresponds to the foregoing method embodiment. For ease of reading, the device embodiment does not repeat the details of the method embodiment one by one, but it should be clear that the device in this embodiment can implement everything in the method embodiment. As shown in Fig. 4, the device includes: a screening unit 41, a computing unit 42, a judging unit 43, and a determination unit 44, wherein
The screening unit 41 may be used to screen a text set to be detected by keywords extracted from a target text to obtain a screening text set.
The computing unit 42 may be used to calculate the characteristic value of each screening text in the screening text set obtained by the screening unit 41 and the characteristic value of the target text.
The judging unit 43 may be used to judge whether the characteristic value of the screening text calculated by the computing unit 42 is identical to the characteristic value of the target text.
The determination unit 44 may be used to determine that the screening text is similar to the target text if the judging unit 43 judges that the characteristic value of the screening text is identical to the characteristic value of the target text.
Further, the computing unit 42 includes:
An extraction module 421, which may be used to extract a preset number of central segment sentences from the screening text and from the target text, respectively, a central segment sentence being a sentence obtained by splitting the central segment of a text.
A computing module 422, which may be used to calculate with a hash algorithm the hash value corresponding to each central segment sentence extracted by the extraction module 421 from the screening text and from the target text, and to generate a hash array corresponding to the screening text and a hash array corresponding to the target text.
Further, the judging unit 43 includes:
A setting module 431, which may be used to set a threshold according to the number of hash values in the hash array.
A first judgment module 432, which may be used to judge whether the number of hash values in the hash array of the screening text that are identical to hash values in the hash array of the target text exceeds the threshold set by the setting module 431.
The determination unit 44 may be specifically used to determine that the screening text is similar to the target text if the number of hash values in the hash array of the screening text that are identical to hash values in the hash array of the target text exceeds the threshold.
Further, the judging unit 43 includes:
A second judgment module 433, which may be used to judge whether the hash values in the hash array of the screening text are identical to the hash values in the hash array of the target text.
The determination unit 44 may be specifically used to determine that the screening text is similar to the target text if the hash values in the hash array of the screening text are identical to the hash values in the hash array of the target text.
Further, the screening unit 41 includes:
An extraction module 411, which may be used to extract multiple keywords from the target text.
A judgment module 412, which may be used to judge, one by one, whether each text to be detected in the text set to be detected contains the multiple keywords extracted by the extraction module 411.
A determining module 413, which may be used to determine that a text to be detected is a screening text if the judgment module 412 judges that the text to be detected contains the multiple keywords.
Further, the device also includes:
A parsing unit 45, which may be used to parse the content of the target text and determine the text category of the target text according to the content.
An acquiring unit 46, which may be used to obtain the texts corresponding to the text category determined by the parsing unit 45, thereby obtaining the text set to be detected, and to send the text set to be detected to the screening unit 41.
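The cooperation of the parsing unit 45 and the acquiring unit 46 might be sketched as follows. This section does not state how the text category is determined, so the term-counting classifier, the `category_terms` mapping, and both function names are purely hypothetical stand-ins.

```python
def classify(text, category_terms):
    """Return the category whose indicator terms occur most often in text.

    Returns None when no indicator term occurs at all.
    """
    best, best_hits = None, 0
    for category, terms in category_terms.items():
        hits = sum(text.count(term) for term in terms)
        if hits > best_hits:
            best, best_hits = category, hits
    return best

def acquire_texts(target_text, library, category_terms):
    """Collect the library texts that share the target text's category."""
    category = classify(target_text, category_terms)
    return [t for t in library if classify(t, category_terms) == category]

# Hypothetical categories and text library.
category_terms = {
    "skincare": ["cream", "skin"],
    "sports": ["match", "goal"],
}
library = [
    "This cream softens skin.",
    "The match ended with a late goal.",
    "A light cream for daily use.",
]
to_detect = acquire_texts("A whitening cream for dry skin.", library, category_terms)
```

Restricting the text set to be detected to one category before keyword screening is what lets the later steps skip texts that could not plausibly be reprints of the target text.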
With the above technical solution, the embodiments of the present invention provide a similar text detection method and device. In the prior art, detecting similar texts requires calculating the features of every text to be judged and comparing them with the features of the original text. The present invention instead screens the text set to be detected with keywords extracted from the target text, which eliminates the texts that do not meet the screening conditions and effectively reduces the number of texts that need to be detected. This reduces the amount of calculation involved in computing text characteristic values and improves the efficiency of similar text detection. Meanwhile, extracting a preset number of central segment sentences from the target text and the screening text ensures that the text content can still be represented while avoiding hash calculation over the full content of each text, which reduces the amount of calculation per text, shortens the calculation time, and improves calculation efficiency. Further, on the one hand, setting a threshold and comparing it with the number of identical hash values in the two hash arrays to determine whether two texts are similar makes it possible to identify two texts with a high degree of similarity, which widens the range of similar texts that can be identified; on the other hand, determining that two texts are similar when the hash values in the two hash arrays are identical makes it possible to detect identical texts, which improves the accuracy of similar text detection. In addition, parsing the content of the target text and determining its category from that content means that the text category is determined from the main content of the target text, which improves the accuracy of category determination. Moreover, generating the texts to be detected by obtaining texts of the same category as the target text makes it possible, when determining whether the network contains a text similar to the target text, to control the scope and number of texts to be detected in a targeted manner. This avoids pointless text acquisition at the source, reduces the overall time consumed by similar text detection, and further improves detection efficiency.
The text processing apparatus includes a processor and a memory. The screening unit, computing unit, judging unit, determination unit, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of the user requirement analysis result is improved by adjusting kernel parameters.
The memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the similar text detection method when executed by a processor.
An embodiment of the present invention provides a processor for running a program, wherein the similar text detection method is executed when the program runs.
An embodiment of the present invention provides a device including a processor, a memory, and a program stored in the memory and executable on the processor. When executing the program, the processor implements the following steps: screening a text set to be detected by keywords extracted from a target text to obtain a screening text set; calculating the characteristic value of each screening text in the screening text set and the characteristic value of the target text; judging whether the characteristic value of the screening text is identical to the characteristic value of the target text; and if so, determining that the screening text is similar to the target text.
Further, calculating the characteristic value of each screening text in the screening text set and the characteristic value of the target text includes:
Extracting a preset number of central segment sentences from the screening text and from the target text, respectively, a central segment sentence being a sentence obtained by splitting the central segment of a text;
Calculating with a hash algorithm the hash value corresponding to each central segment sentence in the screening text and in the target text, and generating a hash array corresponding to the screening text and a hash array corresponding to the target text.
Further, judging whether the characteristic value of the screening text is identical to the characteristic value of the target text includes:
Setting a threshold according to the number of hash values in the hash array;
Judging whether the number of hash values in the hash array of the screening text that are identical to hash values in the hash array of the target text exceeds the threshold;
Determining that the screening text is similar to the target text if the characteristic value of the screening text is identical to the characteristic value of the target text includes:
If the number of hash values in the hash array of the screening text that are identical to hash values in the hash array of the target text exceeds the threshold, determining that the screening text is similar to the target text.
Further, judging whether the characteristic value of the screening text is identical to the characteristic value of the target text includes:
Judging whether the hash values in the hash array of the screening text are identical to the hash values in the hash array of the target text;
Determining that the screening text is similar to the target text if the characteristic value of the screening text is identical to the characteristic value of the target text includes:
If the hash values in the hash array of the screening text are identical to the hash values in the hash array of the target text, determining that the screening text is similar to the target text.
Further, screening the text set to be detected by the keywords extracted from the target text to obtain the screening text set includes:
Extracting multiple keywords from the target text;
Judging, one by one, whether each text to be detected in the text set to be detected contains the multiple keywords;
If so, determining that the text to be detected is a screening text.
Further, before the text set to be detected is screened by the keywords extracted from the target text to obtain the screening text set, the method further includes:
Parsing the content of the target text, and determining the text category of the target text according to the content;
Obtaining the texts corresponding to the text category to obtain the text set to be detected.
The device in the embodiments of the present invention may be a server, a PC, a PAD, a mobile phone, or the like.
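The steps that the processor of such a device performs (screen by keyword, hash the central segment sentences, compare the hash arrays against a threshold) might be combined into one minimal end-to-end sketch. All names are hypothetical, and the sketch assumes SHA-1, newline-separated paragraphs, full-stop sentence splitting, substring keyword matching, and the all-keywords screening rule.

```python
import hashlib
import re
import unicodedata

def central_sentences(text, limit):
    """Drop the first and last paragraphs, then split the rest on full stops."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    central = paragraphs[1:-1] if len(paragraphs) > 2 else paragraphs
    sentences = [s.strip() for p in central for s in re.split(r"[。.]", p)]
    return [s for s in sentences if s][:limit]

def hash_array(sentences):
    """Hash each sentence after removing punctuation (SHA-1 is an assumption)."""
    hashes = []
    for sentence in sentences:
        cleaned = "".join(
            c for c in sentence if not unicodedata.category(c).startswith("P")
        )
        hashes.append(hashlib.sha1(cleaned.encode("utf-8")).hexdigest())
    return hashes

def detect_similar(target, candidates, keywords, limit, threshold):
    """Return the candidates judged similar to the target text."""
    # Step 1: keep only candidates that contain every keyword.
    screened = [c for c in candidates if all(k in c for k in keywords)]
    # Step 2: characteristic value of the target text.
    target_hashes = set(hash_array(central_sentences(target, limit)))
    similar = []
    for text in screened:
        # Steps 3-4: compare hash arrays and apply the threshold.
        shared = target_hashes & set(hash_array(central_sentences(text, limit)))
        if len(shared) > threshold:
            similar.append(text)
    return similar

# Hypothetical texts: one reprint with altered header/footer, two others.
body = "The cream is whitening. It protects skin. It feels light."
target = "Original header.\n" + body + "\nOriginal footer."
repost = "Follow account X!\n" + body + "\nVisit site Y."
other = "Header.\nA whitening cream sold well. Prices rose. Shops opened.\nFooter."
unrelated = "Header.\nThe match ended late.\nFooter."

found = detect_similar(target, [repost, other, unrelated],
                       ["whitening", "cream"], limit=3, threshold=1)
```

Here the unrelated text is removed by keyword screening before any hashing takes place, and the text that merely shares keywords is rejected at the hash comparison, leaving only the reprint.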
An embodiment of the present invention also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: screening a text set to be detected by keywords extracted from a target text to obtain a screening text set; calculating the characteristic value of each screening text in the screening text set and the characteristic value of the target text; judging whether the characteristic value of the screening text is identical to the characteristic value of the target text; and if so, determining that the screening text is similar to the target text.
Further, calculating the characteristic value of each screening text in the screening text set and the characteristic value of the target text includes:
Extracting a preset number of central segment sentences from the screening text and from the target text, respectively, a central segment sentence being a sentence obtained by splitting the central segment of a text;
Calculating with a hash algorithm the hash value corresponding to each central segment sentence in the screening text and in the target text, and generating a hash array corresponding to the screening text and a hash array corresponding to the target text.
Further, judging whether the characteristic value of the screening text is identical to the characteristic value of the target text includes:
Setting a threshold according to the number of hash values in the hash array;
Judging whether the number of hash values in the hash array of the screening text that are identical to hash values in the hash array of the target text exceeds the threshold;
Determining that the screening text is similar to the target text if the characteristic value of the screening text is identical to the characteristic value of the target text includes:
If the number of hash values in the hash array of the screening text that are identical to hash values in the hash array of the target text exceeds the threshold, determining that the screening text is similar to the target text.
Further, judging whether the characteristic value of the screening text is identical to the characteristic value of the target text includes:
Judging whether the hash values in the hash array of the screening text are identical to the hash values in the hash array of the target text;
Determining that the screening text is similar to the target text if the characteristic value of the screening text is identical to the characteristic value of the target text includes:
If the hash values in the hash array of the screening text are identical to the hash values in the hash array of the target text, determining that the screening text is similar to the target text.
Further, screening the text set to be detected by the keywords extracted from the target text to obtain the screening text set includes:
Extracting multiple keywords from the target text;
Judging, one by one, whether each text to be detected in the text set to be detected contains the multiple keywords;
If so, determining that the text to be detected is a screening text.
Further, before the text set to be detected is screened by the keywords extracted from the target text to obtain the screening text set, the method further includes:
Parsing the content of the target text, and determining the text category of the target text according to the content;
Obtaining the texts corresponding to the text category to obtain the text set to be detected.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, calculating equipment includes one or more processors (CPU), input/output interface, net Network interface and memory.
Memory may include the non-volatile memory in computer-readable medium, random access memory (RAM) and/ Or the forms such as Nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM).Memory is computer-readable Jie The example of matter.
Computer-readable medium includes permanent and non-permanent, removable and non-removable media can be by any method Or technology come realize information store.Information can be computer readable instructions, data structure, the module of program or other data. The example of the storage medium of computer includes, but are not limited to phase change memory (PRAM), static random access memory (SRAM), moves State random access memory (DRAM), other kinds of random access memory (RAM), read-only memory (ROM), electric erasable Programmable read only memory (EEPROM), flash memory or other memory techniques, read-only disc read only memory (CD-ROM) (CD-ROM), Digital versatile disc (DVD) or other optical storage, magnetic cassettes, tape magnetic disk storage or other magnetic storage devices Or any other non-transmission medium, can be used for storage can be accessed by a computing device information.As defined in this article, it calculates Machine readable medium does not include temporary computer readable media (transitory media), such as the data-signal and carrier wave of modulation.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall fall within the scope of its claims.

Claims (10)

1. A similar-text detection method, characterized in that the method comprises:
screening a set of texts to be detected using keywords extracted from a target text, to obtain a screened text set;
calculating a feature value of each screened text in the screened text set and a feature value of the target text;
judging whether the feature value of the screened text is identical to the feature value of the target text;
if so, determining that the screened text is similar to the target text.
2. The method according to claim 1, characterized in that calculating the feature value of each screened text in the screened text set and the feature value of the target text comprises:
extracting a preset number of central-segment sentences from the screened text and from the target text respectively, the central-segment sentences being the sentences obtained by splitting the central segment of a text;
calculating, according to a hash algorithm, a hash value for each central-segment sentence in the screened text and in the target text, and generating a hash array corresponding to the screened text and a hash array corresponding to the target text.
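As an illustration of the feature-value calculation in claim 2, a minimal Python sketch follows. The claim does not name a hash algorithm, a sentence-splitting rule, or how the central segment is located. Using MD5, splitting on sentence-ending punctuation, and treating the middle paragraph as the central segment are all assumptions made here, and `central_sentences` and `hash_array` are hypothetical names.

```python
import hashlib
import re


def central_sentences(text, count):
    """Return up to `count` sentences from the middle paragraph of `text`.

    Assumption: the "central segment" is the middle paragraph, and sentences
    end at ., !, ? or their full-width Chinese equivalents.
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return []
    middle = paragraphs[len(paragraphs) // 2]
    sentences = [s.strip() for s in re.split(r"[.!?。！？]", middle) if s.strip()]
    return sentences[:count]


def hash_array(text, count=3):
    """Hash each central-segment sentence; the resulting list is the feature value."""
    return [
        hashlib.md5(s.encode("utf-8")).hexdigest()
        for s in central_sentences(text, count)
    ]


target = "Intro.\nAlpha is first. Beta is second. Gamma is third.\nOutro."
print(hash_array(target, count=2))
```

Hashing sentence by sentence, rather than hashing the whole text once, keeps the comparison tolerant of edits outside the chosen sentences.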
3. The method according to claim 2, characterized in that judging whether the feature value of the screened text is identical to the feature value of the target text comprises:
setting a threshold according to the number of hash values in the hash array;
judging whether the number of hash values in the hash array of the screened text that are identical to hash values in the hash array of the target text exceeds the threshold;
and determining that the screened text is similar to the target text if the feature value of the screened text is identical to the feature value of the target text comprises:
if the number of hash values in the hash array of the screened text that are identical to hash values in the hash array of the target text exceeds the threshold, determining that the screened text is similar to the target text.
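The threshold comparison in claim 3 can be sketched as follows. The claim says only that the threshold is set according to the number of hash values in the hash array. Deriving it as a fixed fraction of the target array's length (the `ratio` parameter, default 0.6) is an assumption for illustration, as are the function names.

```python
def count_matches(screen_hashes, target_hashes):
    """Number of distinct hash values present in both hash arrays."""
    return len(set(screen_hashes) & set(target_hashes))


def is_similar(screen_hashes, target_hashes, ratio=0.6):
    """Judge similarity by comparing the match count with a threshold.

    Assumption: the threshold is `ratio` times the number of hash values
    in the target text's hash array.
    """
    threshold = ratio * len(target_hashes)
    return count_matches(screen_hashes, target_hashes) > threshold


print(is_similar(["a", "b", "c"], ["a", "b", "d"]))  # two of three hashes match
print(is_similar(["a", "x", "y"], ["a", "b", "d"]))  # one of three hashes matches
```

A lower ratio admits more loosely related texts, while a higher one approaches the requirement that every hash value match.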
4. The method according to claim 2, characterized in that judging whether the feature value of the screened text is identical to the feature value of the target text comprises:
judging whether the hash values in the hash array of the screened text are identical to the hash values in the hash array of the target text;
and determining that the screened text is similar to the target text if the feature value of the screened text is identical to the feature value of the target text comprises:
if the hash values in the hash array of the screened text are all identical to the hash values in the hash array of the target text, determining that the screened text is similar to the target text.
5. The method according to any one of claims 1-4, characterized in that screening the set of texts to be detected using keywords extracted from the target text to obtain the screened text set comprises:
extracting a plurality of keywords from the target text;
judging, one by one, whether each text to be detected in the set of texts to be detected contains the plurality of keywords;
if so, determining that the text to be detected is a screened text.
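The keyword screening of claim 5 can be illustrated with a short Python sketch. The claim does not say how keywords are extracted or how containment is tested. This sketch assumes the keywords have already been extracted, reads "contains the plurality of keywords" as containing every keyword, and uses a plain substring test. `screen_texts` is a hypothetical name.

```python
def screen_texts(candidates, keywords):
    """Keep only the texts to be detected that contain every keyword.

    Assumption: containment is a case-sensitive substring test.
    """
    return [text for text in candidates if all(k in text for k in keywords)]


candidates = [
    "hash based similar text detection",
    "keyword weighting for indexing",
    "similar text detection with keyword screening",
]
print(screen_texts(candidates, ["similar", "detection"]))
```

Screening first means the more expensive feature-value calculation runs only on texts that share the target's keywords.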
6. The method according to claim 1, characterized in that, before screening the set of texts to be detected using keywords extracted from the target text to obtain the screened text set, the method further comprises:
parsing the content of the target text, and determining a text category of the target text according to the content;
obtaining the texts corresponding to the text category, to obtain the set of texts to be detected.
7. A similar-text detection device, characterized in that the device comprises:
a screening unit, configured to screen a set of texts to be detected using keywords extracted from a target text, to obtain a screened text set;
a computing unit, configured to calculate a feature value of each screened text in the screened text set obtained by the screening unit and a feature value of the target text;
a judging unit, configured to judge whether the feature value of the screened text calculated by the computing unit is identical to the feature value of the target text;
a determination unit, configured to determine that the screened text is similar to the target text if the judging unit judges that the feature value of the screened text is identical to the feature value of the target text.
8. The device according to claim 7, characterized in that the computing unit comprises:
an extraction module, configured to extract a preset number of central-segment sentences from the screened text and from the target text respectively, the central-segment sentences being the sentences obtained by splitting the central segment of a text;
a computing module, configured to calculate, according to a hash algorithm, a hash value for each central-segment sentence extracted by the extraction module from the screened text and from the target text, and to generate a hash array corresponding to the screened text and a hash array corresponding to the target text.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device on which the storage medium resides is controlled to perform the similar-text detection method according to any one of claims 1 to 6.
10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the similar-text detection method according to any one of claims 1 to 6.
CN201710663797.7A 2017-08-06 2017-08-06 A kind of Similar Text detection method and device Pending CN110019642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710663797.7A CN110019642A (en) 2017-08-06 2017-08-06 A kind of Similar Text detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710663797.7A CN110019642A (en) 2017-08-06 2017-08-06 A kind of Similar Text detection method and device

Publications (1)

Publication Number Publication Date
CN110019642A true CN110019642A (en) 2019-07-16

Family

ID=67186117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710663797.7A Pending CN110019642A (en) 2017-08-06 2017-08-06 A kind of Similar Text detection method and device

Country Status (1)

Country Link
CN (1) CN110019642A (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915295A (en) * 2011-03-31 2013-02-06 百度在线网络技术(北京)有限公司 Document detecting method and document detecting device
CN103123618A (en) * 2011-11-21 2013-05-29 北京新媒传信科技有限公司 Text similarity obtaining method and device
CN103207905A (en) * 2013-03-28 2013-07-17 大连理工大学 Method for calculating text similarity based on target text
CN103294671A (en) * 2012-02-22 2013-09-11 腾讯科技(深圳)有限公司 Document detection method and system
CN103425639A (en) * 2013-09-06 2013-12-04 广州一呼百应网络技术有限公司 Similar information identifying method based on information fingerprints
CN103970722A (en) * 2014-05-07 2014-08-06 江苏金智教育信息技术有限公司 Text content duplicate removal method
CN104636319A (en) * 2013-11-11 2015-05-20 腾讯科技(北京)有限公司 Text duplicate removal method and device
KR101663454B1 (en) * 2016-08-03 2016-10-07 주식회사 비욘드테크 Apparatus of sentence similarity calculation using keyword weight and method thereof
CN106156154A (en) * 2015-04-14 2016-11-23 阿里巴巴集团控股有限公司 The search method of Similar Text and device thereof
US9514312B1 (en) * 2014-09-05 2016-12-06 Symantec Corporation Low-memory footprint fingerprinting and indexing for efficiently measuring document similarity and containment
CN106528508A (en) * 2016-10-27 2017-03-22 乐视控股(北京)有限公司 Repeated text judgment method and apparatus
CN106528581A (en) * 2015-09-15 2017-03-22 阿里巴巴集团控股有限公司 Text detection method and apparatus
CN106547738A (en) * 2016-11-02 2017-03-29 北京亿美软通科技有限公司 A kind of overdue short message intelligent method of discrimination of the financial class based on text mining
CN106649214A (en) * 2016-10-21 2017-05-10 天津海量信息技术股份有限公司 Internet information content similarity definition method
CN106844314A (en) * 2017-02-21 2017-06-13 北京焦点新干线信息技术有限公司 A kind of duplicate checking method and device of article
CN106909535A (en) * 2015-12-23 2017-06-30 北京国双科技有限公司 Similar Text decision method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
徐琴: "基于二次特征提取的中文文本抄袭检测方法", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
王海涛: "基于大规模文本数据集的相似检测关键技术研究", 《中国博士学位论文全文数据库 信息科技辑》 *
董卫博: "中文文档复制检测系统的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304480A (en) * 2017-12-29 2018-07-20 东软集团股份有限公司 A kind of text similarity determines method, apparatus and equipment
CN111221959A (en) * 2019-09-27 2020-06-02 武汉创想外码科技有限公司 WNLP text traceability model
CN112579534A (en) * 2019-09-27 2021-03-30 北京国双科技有限公司 File screening method and device
CN112579534B (en) * 2019-09-27 2024-06-25 北京国双科技有限公司 File screening method and device
WO2021057863A1 (en) * 2019-09-29 2021-04-01 腾讯科技(深圳)有限公司 Blockchain-based content processing method, apparatus, device, and storage medium
CN113704287A (en) * 2020-09-01 2021-11-26 广西云牛动力网络科技有限公司 Big data based data comparison analysis screening system and method
CN112632907A (en) * 2021-01-04 2021-04-09 北京明略软件系统有限公司 Document marking method, device and equipment
CN113688628A (en) * 2021-07-28 2021-11-23 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium
CN113688628B (en) * 2021-07-28 2023-09-22 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN110019642A (en) A kind of Similar Text detection method and device
WO2019174422A1 (en) Method for analyzing entity association relationship, and related apparatus
CN108255857B (en) Statement detection method and device
Schelter et al. Fairprep: Promoting data to a first-class citizen in studies on fairness-enhancing interventions
CN103365997B (en) A kind of opining mining method based on integrated study
CN107704503A (en) User's keyword extracting device, method and computer-readable recording medium
CN110019660A (en) A kind of Similar Text detection method and device
JP6053131B2 (en) Information processing apparatus, information processing method, and program
Srba et al. Auditing YouTube’s recommendation algorithm for misinformation filter bubbles
JPWO2019224891A1 (en) Classification device, classification method, generation method, classification program and generation program
CN106776609A (en) Reprint the statistical method and device of quantity in website
CN106407316B (en) Software question and answer recommendation method and device based on topic model
JP7040535B2 (en) Security information processing equipment, information processing methods and programs
JP5331023B2 (en) Important word extraction device, important word extraction method, and important word extraction program
CN105701085A (en) Network duplicate checking method and system
CN112691379B (en) Game resource text auditing method and device, storage medium and computer equipment
KR102280490B1 (en) Training data construction method for automatically generating training data for artificial intelligence model for counseling intention classification
CN112328469B (en) Function level defect positioning method based on embedding technology
JP6733366B2 (en) Task estimation device, task estimation method, and task estimation program
CN111950265A (en) Domain lexicon construction method and device
Carpineto et al. Automatic assessment of website compliance to the European cookie law with CooLCheck
CN109933791A (en) Material recommended method, device, computer equipment and computer readable storage medium
CN111488452A (en) Webpage tampering detection method, detection system and related equipment
JP6696344B2 (en) Information processing device and program
JP4726683B2 (en) EXPERIENCE INFORMATION EXTRACTION METHOD AND DEVICE, PROGRAM, AND COMPUTER-READABLE RECORDING MEDIUM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190716