CN109241505A - Text deduplication method and device - Google Patents

Publication number
CN109241505A
Authority
CN
China
Prior art keywords
text
similar
centering
cryptographic hash
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811173826.2A
Other languages
Chinese (zh)
Inventor
唐梓毅
汪冠春
胡川
胡一川
张海雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Olsenberg Technology Co ltd
Original Assignee
Beijing Benying Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Benying Network Technology Co Ltd
Priority to CN201811173826.2A
Publication of CN109241505A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a text deduplication method and device. The method includes: obtaining similar text pairs by calculating similarity hash values of the texts to be processed; judging whether the preset text features of the two texts in a similar text pair are identical; if the preset text features are identical, retaining one text of the similar text pair; and if the preset text features are different, retaining both texts of the similar text pair. The application addresses the technical problem that deduplication performs poorly on literally similar texts, and allows text deduplication to be carried out quickly and accurately. It is particularly applicable to deduplicating bidding-document texts on websites.

Description

Text deduplication method and device
Technical field
This application relates to the field of text processing, and in particular to a text deduplication method and device.
Background
Text deduplication typically refers to removing repeated words, sentences, titles, and the like from a target text.
The inventors found that in scenarios with special text requirements, such as bidding documents, commonly used text deduplication methods cannot distinguish between texts that are literally very close to each other.
No effective solution has yet been proposed for the problem in the related art that deduplication performs poorly on literally similar texts.
Summary of the invention
The main purpose of this application is to provide a text deduplication method and device, so as to solve the problem that deduplication performs poorly on literally similar texts.
To achieve the above purpose, according to one aspect of this application, a text deduplication method is provided.
The text deduplication method according to this application includes: obtaining similar text pairs by calculating similarity hash values of the texts to be processed; judging whether the preset text features in a similar text pair are identical; if the preset text features in the similar text pair are identical, retaining one text of the similar text pair; and if the preset text features in the similar text pair are different, retaining the similar text pair.
Further, obtaining similar text pairs by calculating similarity hash values of the texts to be processed includes: calculating the similarity hash value of the title of each text to be processed; extracting the preset text features from the texts to be processed and building a feature index; and searching the similarity hash values through the feature index for document pairs whose distance is less than a threshold, thereby obtaining the similar text pairs.
Further, judging whether the preset text features in the similar text pair are identical includes: when the similar text pair is obtained by calculating similarity hash values of project bidding texts, judging whether the website sources in the similar text pair are identical; if the website sources are identical, judging whether the project numbers in the similar text pair are identical; and if the project numbers are identical, judging whether the announcement types in the similar text pair are identical.
Further, before obtaining similar text pairs by calculating similarity hash values of the texts to be processed, the method also includes: calculating the similarity hash value of the title of the document to be processed; judging whether the similarity hash value meets the condition for a preset similar text pair; and if the similarity hash value does not meet that condition, determining that the document to be processed has no duplicate and retaining the document to be processed.
Further, retaining one text of the similar text pair when the preset text features are identical includes: if the preset text features in the similar text pair are identical, considering the documents duplicates and retaining one text of the similar text pair according to preset rules. Retaining the similar text pair when the preset text features are different includes: if the preset text features in the similar text pair are different, considering the documents not duplicates and retaining all texts of the similar text pair.
To achieve the above purpose, according to another aspect of this application, a text deduplication device is provided.
The text deduplication device according to this application includes: a computing module, for obtaining similar text pairs by calculating similarity hash values of the texts to be processed; a judgment module, for judging whether the preset text features in a similar text pair are identical; a first processing module, for retaining one text of the similar text pair when the preset text features in the similar text pair are judged identical; and a second processing module, for retaining the similar text pair when the preset text features in the similar text pair are judged different.
Further, the computing module includes: a first computing unit, for calculating the similarity hash value of the title of each text to be processed; an extracting unit, for extracting the preset text features from the texts to be processed and building a feature index; and a search unit, for searching the similarity hash values through the feature index for document pairs whose distance is less than a threshold, thereby obtaining the similar text pairs.
Further, the judgment module includes: a first judging unit, for judging whether the website sources in the similar text pair are identical when the similar text pair is obtained by calculating similarity hash values of project bidding texts; a second judging unit, for judging whether the project numbers in the similar text pair are identical when the website sources are judged identical; and a third judging unit, for judging whether the announcement types in the similar text pair are identical when the project numbers are judged identical.
Further, the device also includes a text judgment module, which includes: a preprocessing unit, for calculating the similarity hash value of the title of the document to be processed; a first text judging unit, for judging whether the similarity hash value meets the condition for a preset similar text pair; and a second text judging unit, for determining that the document to be processed has no duplicate and retaining the document to be processed when the similarity hash value is judged not to meet that condition.
Further, the first processing module includes a first retaining unit and the second processing module includes a second retaining unit. The first retaining unit is used, when the preset text features in the similar text pair are judged identical, to consider the documents duplicates and retain one text of the similar text pair according to preset rules. The second retaining unit is used, when the preset text features in the similar text pair are judged different, to consider the documents not duplicates and retain all texts of the similar text pair.
In the embodiments of this application, similar text pairs are obtained by calculating similarity hash values of the texts to be processed, and the preset text features in each similar text pair are then compared. If the preset text features are identical, one text of the similar text pair is retained; if they are different, the similar text pair is retained. This achieves the technical effect of deduplicating text quickly and accurately, and thereby solves the technical problem that deduplication performs poorly on literally similar texts.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the application and to make its other features, objects, and advantages more apparent. The drawings of the illustrative embodiments and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of a text deduplication method according to an embodiment of this application;
Fig. 2 is a schematic diagram of a text deduplication method according to an embodiment of this application;
Fig. 3 is a schematic diagram of a text deduplication method according to an embodiment of this application;
Fig. 4 is a schematic diagram of a text deduplication method according to an embodiment of this application;
Fig. 5 is a schematic diagram of a text deduplication method according to an embodiment of this application;
Fig. 6 is a schematic diagram of a text deduplication device according to an embodiment of this application;
Fig. 7 is a schematic diagram of a text deduplication device according to an embodiment of this application;
Fig. 8 is a schematic diagram of a text deduplication device according to an embodiment of this application;
Fig. 9 is a schematic diagram of a text deduplication device according to an embodiment of this application;
Fig. 10 is a schematic diagram of a text deduplication device according to an embodiment of this application; and
Fig. 11 is a flow chart of the implementation principle of the text deduplication method according to an embodiment of this application.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of this application are used to distinguish similar objects and not to describe a particular order or sequence. Data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
In this application, the orientations or positional relationships indicated by terms such as "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "transverse", and "longitudinal" are based on the orientations or positional relationships shown in the drawings. These terms are used mainly to better describe the application and its embodiments, and are not intended to require that the indicated device, element, or component have a particular orientation or be constructed and operated in a particular orientation.
Moreover, besides indicating an orientation or positional relationship, some of these terms may also carry other meanings; for example, the term "upper" may in some cases indicate a dependency or connection relationship. Those of ordinary skill in the art can understand the specific meaning of these terms in this application according to the circumstances.
In addition, the terms "installed", "arranged", "provided with", "connected", "linked", and "sleeved" should be understood broadly. For example, a connection may be fixed, detachable, or of one-piece construction; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or an internal connection between two devices, elements, or components. Those of ordinary skill in the art can understand the specific meaning of these terms in this application according to the circumstances.
It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
As shown in Fig. 1, the method includes the following steps S102 to S108:
Step S102: obtain similar text pairs by calculating similarity hash values of the texts to be processed;
After the texts to be processed are received, their similarity hash values are calculated with the simhash algorithm; the result of the similarity hash calculation is a set of similar text pairs.
It should be noted that a similar text pair refers to a pair of texts that are literally close and may therefore be duplicates. The calculation result does not appear as individual similar texts; it is obtained in the form of similar text pairs.
Preferably, Simhash is used, because it can quickly find text pairs that are literally close and possibly duplicated.
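The simhash calculation in step S102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the application: the source does not say how titles are tokenized or which base hash is used, so the character bigrams, the MD5-derived 64-bit token hash, and the function names `simhash` and `hamming` are all choices made for this sketch.

```python
import hashlib


def simhash(text: str, bits: int = 64, ngram: int = 2) -> int:
    """Compute a simhash fingerprint of `text` from character n-grams.

    Assumption: character n-grams are used so that no word segmenter is
    needed; the source does not specify the tokenization.
    """
    tokens = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

The "distance" between two similarity hash values is taken here to be the Hamming distance between fingerprints, which is the usual choice for simhash; titles that share most of their n-grams tend to produce fingerprints that differ in few bits.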
Step S104: judge whether the preset text features in the similar text pair are identical;
That is, judge whether the preset text features of the two texts in the similar text pair are the same.
Specifically, a text feature may be a feature specific to the text pair. Introducing text features external to Simhash allows an auxiliary judgment to be made on information that the Simhash value alone does not capture.
Step S106: if the preset text features in the similar text pair are judged identical, retain one text of the similar text pair;
When the preset text features in the similar text pair are judged identical, the two texts are considered duplicates, and only one text of the similar text pair needs to be retained.
Step S108: if the preset text features in the similar text pair are judged different, retain the similar text pair.
When the preset text features in the similar text pair are judged different, the two texts are considered not to be duplicates, and both texts need to be retained.
As shown in Fig. 11, specifically: for all documents, the simhash of the title is calculated, and features such as project number, package number, bid section, website source, serial number, and (announcement) bidding-document type are extracted and indexed. Among all simhash values, all document pairs whose distance is less than a threshold A are found through the index. For each document pair, features such as website source, project number, and announcement type are judged in turn. If the features are identical, the two texts are considered duplicates and one of them is retained according to preset rules; otherwise the two texts are considered not to be duplicates and both are retained. Documents that do not appear in any similar document pair are all regarded as non-duplicates and are all retained. The threshold A can be set according to the actual scenario and is not limited in this application; those skilled in the art will understand how to set it and how to compute the document pairs whose distance is less than the threshold.
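The flow of Fig. 11 can be sketched end to end as follows. This is an illustrative sketch under assumptions rather than the application's implementation: the `Doc` record and its field names are hypothetical, fingerprints are assumed to be precomputed title simhash values, pairs are found by brute force instead of through an index, and "keep the document that comes first in the input" stands in for the unspecified preset retention rule.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Doc:
    doc_id: str
    fingerprint: int   # simhash of the title, computed elsewhere
    source: str        # website source
    project_no: str    # project number
    notice_type: str   # announcement type


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def deduplicate(docs, threshold):
    """Return the documents retained after deduplication."""
    removed = set()
    for a, b in combinations(docs, 2):
        if a.doc_id in removed or b.doc_id in removed:
            continue
        if hamming(a.fingerprint, b.fingerprint) >= threshold:
            continue  # distance not below the threshold: not a similar pair
        same = (a.source, a.project_no, a.notice_type) == (
            b.source, b.project_no, b.notice_type)
        if same:
            # Hypothetical preset rule: keep the earlier document of the pair.
            removed.add(b.doc_id)
    # Documents in no similar pair, or in pairs whose features differ, remain.
    return [d for d in docs if d.doc_id not in removed]
```

Documents that never fall into a similar pair pass through untouched, which matches the statement that such documents are regarded as non-duplicates and retained.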
It can be seen from the above description that this application achieves the following technical effect:
In the embodiments of this application, similar text pairs are obtained by calculating similarity hash values of the texts to be processed, and the preset text features in each similar text pair are then compared. If the preset text features are identical, one text of the similar text pair is retained; if they are different, the similar text pair is retained. This achieves the technical effect of deduplicating text quickly and accurately, and thereby solves the technical problem that deduplication performs poorly on literally similar texts.
According to an embodiment of this application, preferably, as shown in Fig. 2, obtaining similar text pairs by calculating similarity hash values of the texts to be processed includes:
Step S202: calculate the similarity hash value of the title of each text to be processed;
Specifically, when processing a text, the similarity hash value of its title is what needs to be calculated.
For example, title pair 1:
Tender announcement (CB190272018000001) ---- Beijing Normal University Zhuhai Campus
Beijing Normal University Zhuhai Campus - tender announcement (CB190272018000001)
For another example, title pair 2:
Guangxi Xincheng County ridge Pan He Forest Park western movie area Garden Engineering (HCLB2017-540) construction tender notice 2018-01-03 invitation for bid
Xincheng County ridge Pan He Forest Park western movie area Garden Engineering (HCLB2017-540) construction tender announcement 2018-01-03
Step S204: extract the preset text features from the texts to be processed and build a feature index;
The preset text features are extracted from the texts to be processed and indexed, so that when a particular feature is needed, the result can be retrieved through the corresponding text feature index.
Step S206: search the similarity hash values through the feature index for document pairs whose distance is less than a threshold, thereby obtaining the similar text pairs.
Specifically, document pairs whose distance is less than the threshold are searched out among the similarity hash values through the feature index, which locates the similar text pairs.
It should be noted that in this application the index may also be built in any other way to find all text pairs whose similarity hash distance is less than a given threshold; the specific way is not limited in this application.
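Since the application leaves the indexing method open, one possible way to find all pairs within a given distance without comparing every document with every other is a banded index over the fingerprint bits. This technique is not described in the source; it is swapped in here purely as an illustration. It relies on the pigeonhole principle: if two fingerprints differ in at most `max_dist` bits and the bits are split into `max_dist + 1` bands, at least one band must be identical. The function name and parameters are hypothetical, and `max_dist` is inclusive, so a strict "less than threshold A" corresponds to `max_dist = A - 1`.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def candidate_pairs(fingerprints: dict, max_dist: int, bits: int = 64) -> set:
    """Return sorted id pairs whose fingerprints differ in at most max_dist bits."""
    bands = max_dist + 1
    width = -(-bits // bands)  # ceiling division, so the bands cover all bits
    mask = (1 << width) - 1
    buckets = defaultdict(list)
    for doc_id, fp in fingerprints.items():
        for band in range(bands):
            chunk = (fp >> (band * width)) & mask
            buckets[(band, chunk)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(ids, 2):
            # Sharing a band only makes the pair a candidate; verify it.
            if hamming(fingerprints[a], fingerprints[b]) <= max_dist:
                pairs.add(tuple(sorted((a, b))))
    return pairs
```

Only documents that share a bucket are compared, so the exact distance check runs on a small candidate set when fingerprints are well spread.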
According to an embodiment of this application, preferably, as shown in Fig. 3, judging whether the preset text features in the similar text pair are identical includes:
Step S302: when the similar text pair is obtained by calculating similarity hash values of project bidding texts, judge whether the website sources in the similar text pair are identical;
Specifically, when the texts to be processed are project bidding texts, the similar text pairs are obtained by calculating the similarity hash values of the titles of the project bidding texts. It is then further judged whether the website sources in the similar text pair are identical.
Step S304: if the website sources in the similar text pair are judged identical, judge whether the project numbers in the similar text pair are identical;
That is, it is further judged whether the project numbers in the similar text pair are identical.
Step S306: if the project numbers in the similar text pair are judged identical, judge whether the announcement types in the similar text pair are identical.
That is, it is further judged whether the announcement types in the similar text pair are identical.
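The sequential checks of steps S302 to S306 can be sketched as a short cascade that stops at the first feature on which the two texts differ. The dictionary keys and function names below are hypothetical names chosen for this sketch; the source only names the three features and their order.

```python
from typing import Optional

# Order of checks follows steps S302, S304, S306.
CHECK_ORDER = ("website_source", "project_number", "announcement_type")


def first_difference(a: dict, b: dict) -> Optional[str]:
    """Return the first feature on which a and b differ, or None if all match."""
    for key in CHECK_ORDER:
        if a.get(key) != b.get(key):
            return key
    return None


def is_duplicate(a: dict, b: dict) -> bool:
    """A similar pair is a duplicate only if every checked feature matches."""
    return first_difference(a, b) is None
```

Note that in this sketch a feature missing from both texts compares as equal; whether missing features should count as a match is a design decision the source does not address.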
Features such as announcement type, project number, and website source make it possible to recognize that some candidate text pairs differ in key information, such as the project number, and therefore cannot be judged duplicates. For example, similar text pair 1 (different project numbers):
Huaiyuan County Bureau of Land and Resources state-owned land use right tender/auction/listing transfer transaction publicity HYCJ2016-22
Huaiyuan County Bureau of Land and Resources state-owned land use right tender/auction/listing transfer transaction publicity HYCJ2016-12
For another example, similar text pair 2 (different announcement types):
The golden general inter-city passenger rail engineering high speed in public Daliang City, parallel transposition section anticollision barrier engineering (three times) the change bulletin of national highway
The general inter-city passenger rail engineering high speed of Daliang City of Liaoning Province gold, the parallel transposition section anticollision barrier change in the work bulletin of national highway
For another example, similar text pair 3 (different announcement types):
Wuhe County Amitabha Buddha Temple Junior High School battery/charger procurement: transaction announcement
Wuhe County Amitabha Buddha Temple Junior High School battery/charger procurement: procurement announcement
For text pairs like these, text features other than the simhash value need to be introduced for an auxiliary judgment, for example by extracting features such as project number, package number, and announcement type. Only when all of these features are identical are the two texts considered true duplicates.
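Extracting such features from a title can be sketched with simple pattern matching. Everything here is an assumption made for illustration: the regular expression is inferred only from the project numbers that appear in the example titles (such as CB190272018000001, HCLB2017-540, and HYCJ2016-22), the keyword list is hypothetical and written for English renderings of the titles, and the source does not describe how features are actually extracted.

```python
import re

# Hypothetical pattern: two or more capital letters, at least four digits,
# and an optional "-digits" suffix.
PROJECT_NO = re.compile(r"[A-Z]{2,}\d{4,}(?:-\d+)?")

# Hypothetical announcement-type keywords, checked in this order.
NOTICE_KEYWORDS = ("change", "transaction", "procurement", "tender")


def extract_features(title: str) -> dict:
    """Pull a project number and an announcement type out of a title."""
    match = PROJECT_NO.search(title)
    lowered = title.lower()
    notice = next((k for k in NOTICE_KEYWORDS if k in lowered), None)
    return {
        "project_no": match.group(0) if match else None,
        "notice_type": notice,
    }
```

With features extracted this way, the two titles of similar text pair 1 yield different project numbers even though their simhash values are close, so the pair would not be treated as a duplicate.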
According to an embodiment of this application, preferably, as shown in Fig. 4, before obtaining similar text pairs by calculating similarity hash values of the texts to be processed, the method also includes:
Step S402: calculate the similarity hash value of the title of the document to be processed;
Candidate text pairs are obtained by calculating the similarity hash values of the titles of the documents to be processed.
Step S404: judge whether the similarity hash value meets the condition for a preset similar text pair;
That is, judge whether the text pair meets the preset condition for a similar text pair, namely that the similarity hash distance of the pair is less than a given threshold.
Step S406: if the similarity hash value is judged not to meet the condition for a preset similar text pair, consider that the document to be processed has no duplicate and retain the document to be processed.
When the similarity hash value does not meet the condition for a preset similar text pair, it is concluded that the document to be processed has no duplicate; the document is retained and stored, and no further deduplication processing is needed.
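The pre-check of steps S402 to S406 amounts to separating documents that belong to no similar pair, which can be stored directly, from those that still need the feature comparison. A minimal sketch, assuming the similar pairs have already been computed as tuples of document ids; the function name and return shape are hypothetical.

```python
def split_by_pairing(doc_ids, similar_pairs):
    """Split ids into (retain directly, needs feature comparison)."""
    paired = {doc_id for pair in similar_pairs for doc_id in pair}
    keep_directly = [d for d in doc_ids if d not in paired]
    needs_check = [d for d in doc_ids if d in paired]
    return keep_directly, needs_check
```

Documents in the first list skip the later steps entirely, which keeps the feature comparison limited to the candidates that simhash flagged.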
According to an embodiment of this application, preferably, as shown in Fig. 5, retaining one text of the similar text pair when the preset text features in the similar text pair are judged identical includes:
Step S502: if the preset text features in the similar text pair are judged identical, consider the documents duplicates and retain one text of the similar text pair according to preset rules;
The preset text features of the two texts in the similar text pair are judged in turn. If the preset text features are identical, the documents are duplicates, and one text of the similar text pair, selected according to the relevant rules, is retained and stored.
Retaining the similar text pair when the preset text features in the similar text pair are judged different includes:
Step S504: if the preset text features in the similar text pair are judged different, consider the documents not duplicates and retain all texts of the similar text pair.
The preset text features of the two texts in the similar text pair are judged in turn. If the preset text features are different, the documents are not duplicates, and all texts of the similar text pair are retained.
It should be noted that the steps shown in the flow charts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flow charts, in some cases the steps shown or described may be executed in an order different from the one given here.
According to an embodiment of this application, a text deduplication device for implementing the above text deduplication method is also provided. As shown in Fig. 6, the device includes: a computing module 10, for obtaining similar text pairs by calculating similarity hash values of the texts to be processed; a judgment module 20, for judging whether the preset text features in a similar text pair are identical; a first processing module 30, for retaining one text of the similar text pair when the preset text features in the similar text pair are judged identical; and a second processing module 40, for retaining the similar text pair when the preset text features in the similar text pair are judged different.
In the computing module 10 of this embodiment, after the texts to be processed are received, their similarity hash values are calculated with the simhash algorithm; the result of the similarity hash calculation is a set of similar text pairs.
It should be noted that a similar text pair refers to a pair of texts that are literally close and may therefore be duplicates. The calculation result does not appear as individual similar texts; it is obtained in the form of similar text pairs.
Preferably, Simhash is used, because it can quickly find text pairs that are literally close and possibly duplicated.
The judgment module 20 of this embodiment judges whether the preset text features of the two texts in the similar text pair are identical.
Specifically, a text feature may be a feature specific to the text pair. Introducing text features external to Simhash allows an auxiliary judgment to be made on information that the Simhash value alone does not capture.
In the first processing module 30 of this embodiment, when the preset text features in the similar text pair are judged identical, the two texts are considered duplicates, and only one text of the similar text pair needs to be retained.
In the second processing module 40 of this embodiment, when the preset text features in the similar text pair are judged different, the two texts are considered not to be duplicates, and both texts need to be retained.
According to an embodiment of this application, preferably, as shown in Fig. 7, the computing module 10 includes: a first computing unit 101, for calculating the similarity hash value of the title of each text to be processed; an extracting unit 102, for extracting the preset text features from the texts to be processed and building a feature index; and a search unit 103, for searching the similarity hash values through the feature index for document pairs whose distance is less than a threshold, thereby obtaining the similar text pairs.
Specifically, in the first computing unit 101 of this embodiment, when processing a text, the similarity hash value of its title is what needs to be calculated.
For example, title pair 1:
Tender announcement (CB190272018000001) ---- Beijing Normal University Zhuhai Campus
Beijing Normal University Zhuhai Campus - tender announcement (CB190272018000001)
For another example, title pair 2:
Guangxi Xincheng County ridge Pan He Forest Park western movie area Garden Engineering (HCLB2017-540) construction tender notice 2018-01-03 invitation for bid
Xincheng County ridge Pan He Forest Park western movie area Garden Engineering (HCLB2017-540) construction tender announcement 2018-01-03
In the extracting unit 102 of this embodiment, the preset text features are extracted from the texts to be processed and indexed, so that when a particular feature is needed, the result can be retrieved through the corresponding text feature index.
Specifically, in the search unit 103 of this embodiment, document pairs whose distance is less than the threshold are searched out among the similarity hash values through the feature index, which locates the similar text pairs.
It should be noted that in this application the index may also be built in any other way to find all text pairs whose similarity hash distance is less than a given threshold; the specific way is not limited in this application.
According to an embodiment of the present application, and as preferred in this embodiment, as shown in FIG. 8, the judgment module 20 includes: a first judging unit 201, configured to judge, when the similar text pair is one obtained by calculating the similarity hash values of project bidding texts, whether the website sources of the similar text pair are identical; a second judging unit 202, configured to judge, when the website sources of the similar text pair are judged to be identical, whether the project numbers of the similar text pair are identical; and a third judging unit 203, configured to judge, when the project numbers of the similar text pair are judged to be identical, whether the announcement types of the similar text pair are identical.
Specifically, in the first judging unit 201 of this embodiment, when the text to be processed is a project bidding text, the similar text pair is obtained by calculating the similarity hash value of the title of the project bidding text. It is then further judged whether the website sources of the similar text pair are identical.
In the second judging unit 202 of this embodiment, it is further judged whether the project numbers of the similar text pair are identical.
In the third judging unit 203 of this embodiment, it is further judged whether the announcement types of the similar text pair are identical.
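The three judging units can be read as a cascade in which each check runs only if the previous one found the values identical. The following Python sketch shows one possible form of that cascade; the `Notice` record and its field names are hypothetical and are not defined in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notice:
    """Hypothetical record for one bidding announcement."""
    title: str
    site: str          # website source
    project_no: str    # project number
    notice_type: str   # announcement type


def features_match(a: Notice, b: Notice) -> bool:
    """Cascaded comparison: site, then project number, then announcement type."""
    if a.site != b.site:
        return False
    if a.project_no != b.project_no:
        return False
    return a.notice_type == b.notice_type
```

A pair is treated as a duplicate candidate only when `features_match` returns `True`; any single mismatch ends the cascade early.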
Using the announcement type, project number, or website source described above, it can be recognized that some candidate text pairs differ in key information, such as the project number, and therefore cannot be judged to be duplicates. For example, similar text pair 1 (different project numbers):
Huaiyuan County Bureau of Land and Resources publicity of concluded transfers of state-owned land use rights by bidding, auction, and listing HYCJ2016-22
Huaiyuan County Bureau of Land and Resources publicity of concluded transfers of state-owned land use rights by bidding, auction, and listing HYCJ2016-12
For another example, similar text pair 2 (different announcement types):
The golden general inter-city passenger rail engineering high speed in public Daliang City, parallel transposition section anticollision barrier engineering (three times) the change bulletin of national highway
The general inter-city passenger rail engineering high speed of Daliang City of Liaoning Province gold, the parallel transposition section anticollision barrier change in the work bulletin of national highway
For another example, similar text pair 3 (different announcement types):
Transaction announcement for the battery/charger procurement of Wuhe County Amitabha Buddha Temple Junior High School
Procurement announcement for the battery/charger procurement of Wuhe County Amitabha Buddha Temple Junior High School
For such text pairs, text features other than the simhash value need to be introduced to assist the judgment, for example by extracting features such as the project number, package number, and announcement type. Only when all of these features are identical are the two texts considered true duplicates.
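As one illustration of such feature extraction, the following Python sketch pulls a project number out of a title with a regular expression. The pattern is an assumption inferred only from the sample numbers quoted in this description (for example HYCJ2016-22, HCLB2017-540, and CB190272018000001); the application does not define the format of a project number, and the name `extract_project_no` is hypothetical.

```python
import re

# Assumed shape: two or more capital letters, then digits, optionally
# followed by hyphen-separated digit groups. Inferred from the samples only.
PROJECT_NO = re.compile(r"[A-Z]{2,}\d+(?:-\d+)*")


def extract_project_no(title):
    """Return the first substring that looks like a project number, or None."""
    match = PROJECT_NO.search(title)
    return match.group(0) if match else None
```

Applied to the two titles of similar text pair 1, this yields two different project numbers, so the pair would not be treated as a duplicate even though the titles are nearly identical.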
According to an embodiment of the present application, and as preferred in this embodiment, as shown in FIG. 9, the device further includes a text judgment module 50, which includes: a preprocessing unit 501, configured to calculate the similarity hash value of the title of the document to be processed; a first text judging unit 502, configured to judge whether the similarity hash value satisfies the preset condition for a similar text pair; and a second text judging unit 503, configured to, when the similarity hash value is judged not to satisfy the preset condition for a similar text pair, consider that the document to be processed has no duplicate and retain the document to be processed.
In the preprocessing unit 501 of this embodiment, text pairs are obtained by calculating the similarity hash values of the titles of the documents to be processed.
In the first text judging unit 502 of this embodiment, it is judged whether a text pair satisfies the preset condition for a similar text pair, that is, whether its similarity hash distance is less than a given threshold.
In the second text judging unit 503 of this embodiment, when the similarity hash value does not satisfy the preset condition for a similar text pair, the document to be processed is considered to have no duplicate and is retained and stored, and no further deduplication processing is needed.
According to an embodiment of the present application, and as preferred in this embodiment, as shown in FIG. 10, the first processing module 30 includes a first retaining unit 301, and the second processing module 40 includes a second retaining unit 401. The first retaining unit 301 is configured to, when the preset text features of the similar text pair are judged to be identical, consider the documents to be duplicates and retain one text of the similar text pair according to a preset rule. The second retaining unit 401 is configured to, when the preset text features of the similar text pair are judged to be different, consider the documents not to be duplicates and retain all texts of the similar text pair.
In the first retaining unit 301 of this embodiment, the preset text features of the two texts in the similar text pair are compared in turn. If the preset text features are identical, the documents are duplicates, and any one text of the similar text pair is retained and stored according to the relevant rule.
In the second retaining unit 401 of this embodiment, the preset text features of the two texts in the similar text pair are compared in turn. If the preset text features are not identical, the documents are not duplicates, and all texts of the similar text pair are retained.
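The retention behavior of the two retaining units can be sketched as follows in Python. The application leaves the preset rule for choosing which text to keep unspecified, so keeping the text with the smaller index is purely an assumption here, and `deduplicate` and `is_duplicate` are hypothetical names.

```python
def deduplicate(docs, pairs, is_duplicate):
    """Drop one text from every similar pair judged to be a duplicate.

    docs: list of documents.
    pairs: iterable of (i, j) index pairs with i < j (the similar text pairs).
    is_duplicate: callable comparing the preset text features of two documents.
    Assumed preset rule: keep the document with the smaller index.
    """
    dropped = set()
    for i, j in pairs:
        if is_duplicate(docs[i], docs[j]):
            dropped.add(max(i, j))
        # Otherwise the pair is not a duplicate and both texts are kept.
    return [doc for k, doc in enumerate(docs) if k not in dropped]
```

Other rules, such as keeping the most recently crawled text, would only change which index is added to `dropped`.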
Obviously, those skilled in the art should understand that the modules or steps of the present application described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be fabricated as individual integrated circuit modules, or multiple modules or steps among them can be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present application and is not intended to limit it. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (10)

1. A text deduplication method, characterized by comprising:
obtaining a similar text pair by calculating similarity hash values of texts to be processed;
judging whether the preset text features of the similar text pair are identical;
if the preset text features of the similar text pair are judged to be identical, retaining one text of the similar text pair; and
if the preset text features of the similar text pair are judged to be different, retaining the similar text pair.
2. The text deduplication method according to claim 1, characterized in that obtaining a similar text pair by calculating similarity hash values of texts to be processed comprises:
calculating the similarity hash value of the title of the text to be processed;
extracting the preset text features from the text to be processed and building a feature index; and
searching, through the feature index, the similarity hash values for document pairs whose distance is less than a threshold, to obtain the similar text pair.
3. The text deduplication method according to claim 1, characterized in that judging whether the preset text features of the similar text pair are identical comprises:
when the similar text pair is one obtained by calculating the similarity hash values of project bidding texts, judging whether the website sources of the similar text pair are identical;
if the website sources of the similar text pair are judged to be identical, judging whether the project numbers of the similar text pair are identical; and
if the project numbers of the similar text pair are judged to be identical, judging whether the announcement types of the similar text pair are identical.
4. The text deduplication method according to claim 1, characterized in that, before obtaining a similar text pair by calculating similarity hash values of texts to be processed, the method further comprises:
calculating the similarity hash value of the title of the document to be processed;
judging whether the similarity hash value satisfies the preset condition for a similar text pair; and
if the similarity hash value is judged not to satisfy the preset condition for a similar text pair, considering that the document to be processed has no duplicate and retaining the document to be processed.
5. The text deduplication method according to claim 1, characterized in that
retaining one text of the similar text pair if the preset text features of the similar text pair are judged to be identical comprises:
if the preset text features of the similar text pair are judged to be identical, considering the documents to be duplicates and retaining one text of the similar text pair according to a preset rule; and
retaining the similar text pair if the preset text features of the similar text pair are judged to be different comprises:
if the preset text features of the similar text pair are judged to be different, considering the documents not to be duplicates and retaining all texts of the similar text pair.
6. A text deduplication device, characterized by comprising:
a computing module, configured to obtain a similar text pair by calculating similarity hash values of texts to be processed;
a judgment module, configured to judge whether the preset text features of the similar text pair are identical;
a first processing module, configured to retain one text of the similar text pair when the preset text features of the similar text pair are judged to be identical; and
a second processing module, configured to retain the similar text pair when the preset text features of the similar text pair are judged to be different.
7. The text deduplication device according to claim 6, characterized in that the computing module comprises:
a first computing unit, configured to calculate the similarity hash value of the title of the text to be processed;
an extraction unit, configured to extract the preset text features from the text to be processed and build a feature index; and
a search unit, configured to search, through the feature index, the similarity hash values for document pairs whose distance is less than a threshold, to obtain the similar text pair.
8. The text deduplication device according to claim 6, characterized in that the judgment module comprises:
a first judging unit, configured to judge, when the similar text pair is one obtained by calculating the similarity hash values of project bidding texts, whether the website sources of the similar text pair are identical;
a second judging unit, configured to judge, when the website sources of the similar text pair are judged to be identical, whether the project numbers of the similar text pair are identical; and
a third judging unit, configured to judge, when the project numbers of the similar text pair are judged to be identical, whether the announcement types of the similar text pair are identical.
9. The text deduplication device according to claim 6, characterized by further comprising a text judgment module, the text judgment module comprising:
a preprocessing unit, configured to calculate the similarity hash value of the title of the document to be processed;
a first text judging unit, configured to judge whether the similarity hash value satisfies the preset condition for a similar text pair; and
a second text judging unit, configured to, when the similarity hash value is judged not to satisfy the preset condition for a similar text pair, consider that the document to be processed has no duplicate and retain the document to be processed.
10. The text deduplication device according to claim 6, characterized in that the first processing module comprises a first retaining unit and the second processing module comprises a second retaining unit,
the first retaining unit being configured to, when the preset text features of the similar text pair are judged to be identical, consider the documents to be duplicates and retain one text of the similar text pair according to a preset rule; and
the second retaining unit being configured to, when the preset text features of the similar text pair are judged to be different, consider the documents not to be duplicates and retain all texts of the similar text pair.
CN201811173826.2A 2018-10-09 2018-10-09 Text De-weight method and device Pending CN109241505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811173826.2A CN109241505A (en) 2018-10-09 2018-10-09 Text De-weight method and device

Publications (1)

Publication Number Publication Date
CN109241505A true CN109241505A (en) 2019-01-18

Family

ID=65055136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811173826.2A Pending CN109241505A (en) 2018-10-09 2018-10-09 Text De-weight method and device

Country Status (1)

Country Link
CN (1) CN109241505A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209659A (en) * 2019-06-10 2019-09-06 广州合摩计算机科技有限公司 A kind of resume filter method, system and computer readable storage medium
WO2021109850A1 (en) * 2019-12-03 2021-06-10 世强先进(深圳)科技股份有限公司 Method and system for deduplicating and storing pdf files

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8046372B1 (en) * 2007-05-25 2011-10-25 Amazon Technologies, Inc. Duplicate entry detection system and method
CN103218443A (en) * 2013-04-22 2013-07-24 中山大学 Blogging webpage retrieval system and retrieval method
CN103970722A (en) * 2014-05-07 2014-08-06 江苏金智教育信息技术有限公司 Text content duplicate removal method
CN104252445A (en) * 2013-06-26 2014-12-31 华为技术有限公司 Document similarity calculation method and near-duplicate document detection method and device
CN104615768A (en) * 2015-02-13 2015-05-13 广州神马移动信息科技有限公司 Method and device for identifying documents of same works
CN104636319A (en) * 2013-11-11 2015-05-20 腾讯科技(北京)有限公司 Text duplicate removal method and device
CN105808738A (en) * 2016-03-10 2016-07-27 哈尔滨工程大学 Duplication elimination method based on search results of metasearch engine
CN106156154A (en) * 2015-04-14 2016-11-23 阿里巴巴集团控股有限公司 The search method of Similar Text and device thereof
CN106294350A (en) * 2015-05-13 2017-01-04 阿里巴巴集团控股有限公司 A kind of text polymerization and device
CN106528508A (en) * 2016-10-27 2017-03-22 乐视控股(北京)有限公司 Repeated text judgment method and apparatus
CN106708927A (en) * 2016-11-18 2017-05-24 北京二六三企业通信有限公司 Duplicate removal processing method and duplicate removal processing device for files
CN107315799A (en) * 2017-06-19 2017-11-03 重庆誉存大数据科技有限公司 A kind of internet duplicate message screening technique and system
CN107644010A (en) * 2016-07-20 2018-01-30 阿里巴巴集团控股有限公司 A kind of Text similarity computing method and device
CN108009599A (en) * 2017-12-27 2018-05-08 福建中金在线信息科技有限公司 A kind of original document determination methods, device, electronic equipment and storage medium
CN108170650A (en) * 2016-12-07 2018-06-15 北京京东尚科信息技术有限公司 Text comparative approach and text comparison means
CN108334513A (en) * 2017-01-20 2018-07-27 阿里巴巴集团控股有限公司 A kind of identification processing method of Similar Text, apparatus and system
US20180246955A1 (en) * 2015-12-01 2018-08-30 Beijing Gridsum Technology Co., Ltd. Method and device for searching legal provision

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
刘驰等: "基于元信息的云盘资源检索结果去重", 《山东大学学报(理学版)》, no. 07, 31 May 2016 (2016-05-31), pages 11 - 17 *
彭双和 等: "基于Simhash的中文文本去重技术研究", 计算机技术与发展, no. 11, pages 137 - 140 *
杨春明等: "元搜索引擎的结果去重及排序研究", 《软件》, no. 06, 15 June 2012 (2012-06-15), pages 51 - 53 *

Similar Documents

Publication Publication Date Title
CN102262618B (en) Method and device for identifying page information
CN106503148B (en) A kind of table entity link method based on multiple knowledge base
CN107025239B (en) Sensitive word filtering method and device
CN104580027A (en) OpenFlow message forwarding method and equipment
CN104536956A (en) A Microblog platform based event visualization method and system
CN104063383A (en) Information recommendation method and device
CN102722709A (en) Method and device for identifying garbage pictures
CN105630884A (en) Geographic position discovery method for microblog hot event
CN104317891A (en) Method and device for tagging pages
CN106156041A (en) Hot information finds method and system
CN110083722A (en) A kind of electronic drawing lookup method, device, equipment and readable storage medium storing program for executing
CN109241505A (en) Text De-weight method and device
CN105095391A (en) Device and method for identifying organization name by word segmentation program
CN106021556A (en) Address information processing method and device
CN103077250A (en) Method and device for capturing webpage content
CN103488637B (en) A kind of method carrying out expert Finding based on dynamics community's excavation
CN103646029A (en) Similarity calculation method for blog articles
CN104951478A (en) Information processing method and information processing device
CN110825887A (en) Knowledge graph fusion method
CN102999495B (en) A kind of synonym Semantic mapping relation determines method and device
CN106227741B (en) A kind of extensive URL matching process based on multilevel hash index chained list
CN103646035B (en) A kind of information search method based on heuristic
CN105528421B (en) A kind of search dimension method for digging for query word in mass data
Liu et al. Spotting significant changing subgraphs in evolving graphs
CN109299443B (en) News text duplication eliminating method based on minimum vertex coverage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190923

Address after: 410006 Room 2101, Xincheng Science and Technology Park, 588 Yuelu West Avenue, Changsha High-tech Development Zone, Hunan Province

Applicant after: Hunan Olsenberg Technology Co.,Ltd.

Address before: 100083 Beijing Haidian District Academy of Sciences South Road 2 Finance Information Center A Block 701

Applicant before: BEIJING BENYING NETWORK TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information

Address after: 410006 room 2401-3, building F1, Lugu Yuyuan, No. 27, Wenxuan Road, high tech Development Zone, Changsha, Hunan

Applicant after: Hunan laiye Technology Co.,Ltd.

Address before: 410006 room 2101, building 1, Xincheng science and Technology Park, No. 588, Yuelu West Avenue, Changsha high tech Development Zone, Hunan Province

Applicant before: Hunan Olsenberg Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190118