Detailed Description of the Embodiments
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings of those embodiments. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present application are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present application described herein can be implemented. In addition, the terms "include" and "have", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
In the present application, the orientations or positional relationships indicated by terms such as "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "transverse", and "longitudinal" are based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the present application and its embodiments, and are not intended to require that the indicated device, element, or component have a particular orientation or be constructed and operated in a particular orientation.
Moreover, in addition to indicating an orientation or positional relationship, some of the above terms may also carry other meanings; for example, the term "upper" may in some cases indicate a dependency or connection relationship. A person of ordinary skill in the art can understand the specific meaning of these terms in the present application according to the specific circumstances.
In addition, the terms "install", "arrange", "provided with", "connect", "connected", and "sleeved" shall be understood broadly. For example, a connection may be a fixed connection, a detachable connection, or an integral construction; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediary, or an internal connection between two devices, elements, or components. A person of ordinary skill in the art can understand the specific meaning of the above terms in the present application according to the specific circumstances.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
As shown in Figure 1, the method includes the following steps S102 to S108:
Step S102: obtain similar text pairs by calculating the similarity hash value of the text to be processed.
After the text to be processed is received, its similarity hash value is calculated by the simhash algorithm; the result obtained from the similarity hash values is a set of similar text pairs.
It should be noted that a similar text pair is a pair of texts that are literally close and may be duplicates. The calculation result does not appear as individual similar texts; it is obtained in the form of similar text pairs.
Preferably, simhash can be used to quickly find text pairs that are literally close and possibly duplicated.
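The calculation in step S102 can be illustrated with a minimal sketch. The application names the simhash algorithm but does not fix its details, so the tokenization (character bigrams), the 64-bit width, the use of MD5 as the per-token hash, and the function names `simhash` and `hamming_distance` are all assumptions made only for illustration.

```python
import hashlib


def simhash(text: str, bits: int = 64, ngram: int = 2) -> int:
    """Return a SimHash-style fingerprint of `text` (illustrative sketch).

    Assumptions: whitespace is ignored, tokens are character n-grams,
    and every token has weight 1.
    """
    compact = "".join(text.split())
    tokens = [compact[i:i + ngram] for i in range(max(len(compact) - ngram + 1, 1))]
    votes = [0] * bits
    for token in tokens:
        # MD5 is used here only as a stable, well-mixed hash of the token.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 when the summed votes at that position are positive.
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two titles are then treated as a candidate similar text pair when the Hamming distance between their fingerprints is below the chosen threshold.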
Step S104: judge whether the preset text features of the two texts in the similar text pair are identical.
That is, it is judged whether the preset text features in the similar text pair are the same.
Specifically, the text features may be features specific to the text pair. By introducing text features external to simhash, an auxiliary judgment can be made on the basis of features other than the simhash value.
Step S106: if it is judged that the preset text features in the similar text pair are identical, retain one text of the similar text pair.
When the preset text features in the similar text pair are judged to be identical, the two texts are considered duplicates, and only one text of the similar text pair needs to be retained.
Step S108: if it is judged that the preset text features in the similar text pair are different, retain the similar text pair.
When the preset text features in the similar text pair are judged to be different, the two texts are considered not to be duplicates, and both texts need to be retained.
As shown in Figure 11, specifically, the similarity hash value (simhash) of the title is calculated for all documents, and features such as the project number, package number, bid section, source website, serial number, and announcement (tender document) type are extracted and indexed. Among all simhash values, all document pairs whose distance is less than a threshold A are found by searching the index. The features of each document pair, such as the source website, project number, and announcement type, are judged in turn. If the features are identical, the two texts are considered duplicates, and one of them is retained according to a preset rule; otherwise, the two texts are considered not to be duplicates and both are retained. Documents that do not appear in any similar document pair are all regarded as non-duplicates and are all retained. The threshold A may be set according to the actual scenario and is not limited in the present application; those skilled in the art will understand how to set it and how to calculate the document pairs whose distance is less than the threshold.
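The flow of Figure 11 can be sketched end to end as follows. This is only an illustration under stated assumptions: each document is a dictionary that already carries a precomputed `simhash` fingerprint and its extracted features, the candidate pairs are found by brute force rather than through an index, and the "preset rule" is assumed to be "keep the document that appears first". The field names and the function `deduplicate` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence

Doc = Dict[str, object]


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def deduplicate(docs: Sequence[Doc], threshold_a: int, feature_keys: Sequence[str]) -> List[Doc]:
    """Drop a document only if it is close to an earlier one AND all features match."""
    dropped = set()
    for i, j in combinations(range(len(docs)), 2):
        if i in dropped or j in dropped:
            continue
        # Candidate pair: fingerprint distance strictly below threshold A.
        if hamming(docs[i]["simhash"], docs[j]["simhash"]) < threshold_a:
            if all(docs[i][k] == docs[j][k] for k in feature_keys):
                dropped.add(j)  # assumed preset rule: keep the earlier document
            # Otherwise the pair is not a duplicate and both documents stay.
    # Documents that never appear in a similar pair are kept automatically.
    return [d for idx, d in enumerate(docs) if idx not in dropped]
```

In practice the brute-force pair search would be replaced by an index lookup; it is used here only to keep the sketch short.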
It can be seen from the above description that the present application achieves the following technical effect:
In the embodiments of the present application, similar text pairs are obtained by calculating the similarity hash value of the text to be processed, and it is judged whether the preset text features in each similar text pair are identical. If the preset text features are identical, one text of the similar text pair is retained; if they are different, the similar text pair is retained. This achieves the technical effect of deduplicating text quickly and accurately, and thereby solves the technical problem of poor deduplication of literally similar texts.
According to an embodiment of the present application, as a preferred option in this embodiment, as shown in Figure 2, obtaining similar text pairs by calculating the similarity hash value of the text to be processed includes:
Step S202: calculate the similarity hash value of the title in the text to be processed.
Specifically, when the text to be processed is handled, the similarity hash value of its title needs to be calculated.
For example, title 1:
Tender announcement (CB190272018000001) ---- Beijing Normal University Zhuhai Campus
Beijing Normal University Zhuhai Campus - Tender announcement (CB190272018000001)
For another example, title 2:
Guangxi Xincheng County Pan He Ridge Forest Park West Area Landscaping Project (HCLB2017-540) construction tender announcement 2018-01-03 tender announcement
Xincheng County Pan He Ridge Forest Park West Area Landscaping Project (HCLB2017-540) construction tender announcement 2018-01-03
Step S204: extract the preset text features from the text to be processed and build a feature index.
The preset text features are extracted from the text to be processed and a feature index is built, so that when a given feature is needed, the result can be retrieved through the corresponding text feature index.
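Feature extraction and indexing in step S204 might look like the following sketch. The application does not specify how features are extracted, so the regular expression for project numbers (two or more capital letters followed by digits, matching identifiers such as CB190272018000001 and HCLB2017-540), the field names, and the functions `extract_features` and `build_feature_index` are assumptions for illustration only.

```python
import re
from collections import defaultdict
from typing import Dict, Optional, Set

# Hypothetical pattern: at least two capital letters, a digit, then digits/letters/hyphens.
PROJECT_NO = re.compile(r"[A-Z]{2,}\d[\dA-Z-]*")


def extract_features(doc: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Pull the preset features out of one document (illustrative)."""
    match = PROJECT_NO.search(doc["title"])
    return {
        "project_no": match.group(0) if match else None,
        "source_site": doc.get("source_site"),
        "notice_type": doc.get("notice_type"),
    }


def build_feature_index(docs: Dict[int, Dict[str, str]]) -> Dict[str, Dict[Optional[str], Set[int]]]:
    """Map feature name -> feature value -> ids of the documents carrying that value."""
    index: Dict[str, Dict[Optional[str], Set[int]]] = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in docs.items():
        for name, value in extract_features(doc).items():
            index[name][value].add(doc_id)
    return index
```

With such an index, all documents sharing a given project number or source website can be looked up directly instead of scanning the whole collection.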
Step S206: search the similarity hash values through the feature index for document pairs whose distance is less than a threshold, thereby obtaining the similar text pairs.
Specifically, document pairs whose distance is less than the threshold are found among the similarity hash values by means of the feature index, and the similar text pairs are thereby located.
It should be noted that, in the present application, the index may also be built in any other manner in order to find all text pairs whose similarity hash distance is less than a certain threshold; the specific manner is not limited in the present application.
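Since the indexing method is left open, one possible choice is sketched below: a band (block) index based on the pigeonhole principle. If two fingerprints differ in fewer than `threshold` bits and the fingerprint is split into `threshold` bands, at least one band must be identical, so only documents sharing a band need to be compared. This technique is an assumption introduced for illustration and is not stated in the application; the names `band_keys` and `near_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def band_keys(fp: int, bands: int, bits: int = 64) -> List[Tuple[int, int]]:
    """Split a fingerprint into `bands` contiguous bit ranges, tagged by band number."""
    bounds = [bits * i // bands for i in range(bands + 1)]
    keys = []
    for i in range(bands):
        width = bounds[i + 1] - bounds[i]
        keys.append((i, (fp >> bounds[i]) & ((1 << width) - 1)))
    return keys


def near_pairs(fingerprints: Dict[int, int], threshold: int, bits: int = 64) -> Set[Tuple[int, int]]:
    """Return all id pairs whose Hamming distance is strictly less than `threshold`."""
    if not 1 <= threshold <= bits:
        raise ValueError("threshold must be between 1 and the fingerprint width")
    buckets = defaultdict(list)
    for doc_id, fp in fingerprints.items():
        for key in band_keys(fp, threshold, bits):
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            # Sharing a band only makes a pair a candidate; verify the real distance.
            if bin(fingerprints[a] ^ fingerprints[b]).count("1") < threshold:
                pairs.add((a, b))
    return pairs
```

The trade-off is that larger thresholds produce narrower bands and therefore more candidate comparisons per bucket.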
According to an embodiment of the present application, as a preferred option in this embodiment, as shown in Figure 3, judging whether the preset text features in the similar text pair are identical includes:
Step S302: when the similar text pair is one obtained by calculating the similarity hash value of project bidding texts, judge whether the source websites of the two texts in the similar text pair are identical.
Specifically, when the text to be processed is a project bidding text, the similar text pairs are obtained by calculating the similarity hash value of the titles of the project bidding texts. It is then further judged whether the source websites in the similar text pair are identical.
Step S304: if it is judged that the source websites in the similar text pair are identical, judge whether the project numbers in the similar text pair are identical.
That is, it is further judged whether the project numbers in the similar text pair are identical.
Step S306: if it is judged that the project numbers in the similar text pair are identical, judge whether the announcement types in the similar text pair are identical.
That is, it is further judged whether the announcement types in the similar text pair are identical.
By means of the announcement type, project number, or source website described above, it can be recognized that, for some candidate text pairs, certain key information such as the project number may differ, so that the pair cannot be judged to be a duplicate. For example, similar text pair 1 (different project numbers):
Huaiyuan County Bureau of Land and Resources state-owned land use right bidding, auction, and listing transfer transaction publicity HYCJ2016-22
Huaiyuan County Bureau of Land and Resources state-owned land use right bidding, auction, and listing transfer transaction publicity HYCJ2016-12
For another example, similar text pair 2 (different announcement types):
Dalian Jinpu Intercity Railway Project, expressway and national highway parallel and crossing section crash barrier works (third) change announcement
Liaoning Province Dalian Jinpu Intercity Railway Project, expressway and national highway parallel and crossing section crash barrier works change announcement
For another example, similar text pair 3 (different announcement types):
Wuhe County Amitabha Temple Junior High School purchase of batteries/chargers, transaction announcement
Wuhe County Amitabha Temple Junior High School purchase of batteries/chargers, procurement announcement
For such text pairs, text features other than the simhash value need to be introduced for auxiliary judgment, for example by extracting features such as the project number, package number, and announcement type. Only when all of these features are identical are the two texts considered to be true duplicates.
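The cascaded judgment of steps S302 to S306 can be sketched as a function that reports the first feature in which two texts differ. The check order follows the steps above; the field names, the helper `first_difference`, and the labels used for the announcement types are hypothetical, while the project numbers are taken from similar text pair 1.

```python
from typing import Dict, Optional

# Order of the cascade: source website, then project number, then announcement type.
CHECK_ORDER = ("source_site", "project_no", "notice_type")


def first_difference(a: Dict[str, str], b: Dict[str, str]) -> Optional[str]:
    """Return the first feature that differs, or None if the pair is a true duplicate."""
    for key in CHECK_ORDER:
        if a.get(key) != b.get(key):
            return key  # stop at the first mismatch; later features are not checked
    return None


# Similar text pair 1: same site and type, different project numbers.
pair1_a = {"source_site": "site-a", "project_no": "HYCJ2016-22", "notice_type": "transaction"}
pair1_b = {"source_site": "site-a", "project_no": "HYCJ2016-12", "notice_type": "transaction"}

# Similar text pair 3: same site, no project number, different announcement types.
pair3_a = {"source_site": "site-a", "notice_type": "transaction"}
pair3_b = {"source_site": "site-a", "notice_type": "procurement"}
```

A pair is treated as a duplicate only when `first_difference` returns `None`; any other return value names the feature that prevented the pair from being merged.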
According to an embodiment of the present application, as a preferred option in this embodiment, as shown in Figure 4, before obtaining similar text pairs by calculating the similarity hash value of the text to be processed, the method further includes:
Step S402: calculate the similarity hash value of the title in the document to be processed.
Candidate text pairs are obtained from the similarity hash values calculated for the titles of the documents to be processed.
Step S404: judge whether the similarity hash value satisfies the condition for a preset similar text pair.
It is judged whether a text pair satisfies the preset condition for a similar text pair, namely that the distance between the similarity hash values is less than a certain threshold; a text pair whose distance is greater than the threshold does not satisfy the condition.
Step S406: if it is judged that the similarity hash value does not satisfy the condition for a preset similar text pair, consider that the document to be processed has no duplicate and retain the document to be processed.
When the similarity hash value does not satisfy the condition for a preset similar text pair, it is concluded that the document to be processed has no duplicate; the document to be processed is retained and stored, and no further deduplication is required.
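Steps S402 to S406 amount to a pre-check that separates documents with no near neighbor from those that need the feature comparison. A minimal sketch follows, assuming precomputed fingerprints keyed by document id and the "distance less than threshold" condition used elsewhere in this description; the function name `split_by_precheck` is hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


def split_by_precheck(
    fingerprints: Dict[Hashable, int], threshold: int
) -> Tuple[List[Hashable], List[Hashable]]:
    """Return (ids to store directly, ids that still need feature comparison)."""
    ids = list(fingerprints)
    has_neighbor = set()
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            # The similar-pair condition: fingerprint distance below the threshold.
            if bin(fingerprints[a] ^ fingerprints[b]).count("1") < threshold:
                has_neighbor.update((a, b))
    keep_directly = [d for d in ids if d not in has_neighbor]
    candidates = [d for d in ids if d in has_neighbor]
    return keep_directly, candidates
```

Documents in the first list can be stored without further processing; only those in the second list proceed to the feature judgment.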
According to an embodiment of the present application, as a preferred option in this embodiment, as shown in Figure 5, retaining one text of the similar text pair if it is judged that the preset text features in the similar text pair are identical includes:
Step S502: if it is judged that the preset text features in the similar text pair are identical, consider the documents to be duplicates and retain one text of the similar text pair according to a preset rule.
The preset text features of the two texts in the similar text pair are judged in turn. If the preset text features are identical, the documents are duplicates, and any one text of the similar text pair is retained and stored according to the relevant rule.
Retaining the similar text pair if it is judged that the preset text features in the similar text pair are different includes:
Step S504: if it is judged that the preset text features in the similar text pair are different, consider the documents not to be duplicates and retain all texts of the similar text pair.
The preset text features of the two texts in the similar text pair are judged in turn. If the preset text features are not identical, the documents are not duplicates, and all texts of the similar text pair are retained.
It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
According to an embodiment of the present application, a text deduplication device for implementing the above text deduplication method is also provided. As shown in Figure 6, the device includes: a computing module 10, configured to obtain similar text pairs by calculating the similarity hash value of the text to be processed; a judgment module 20, configured to judge whether the preset text features in the similar text pair are identical; a first processing module 30, configured to retain one text of the similar text pair when the preset text features in the similar text pair are judged to be identical; and a second processing module 40, configured to retain the similar text pair when the preset text features in the similar text pair are judged to be different.
In the computing module 10 of the embodiment of the present application, after the text to be processed is received, its similarity hash value is calculated by the simhash algorithm; the result obtained from the similarity hash values is a set of similar text pairs.
It should be noted that a similar text pair is a pair of texts that are literally close and may be duplicates. The calculation result does not appear as individual similar texts; it is obtained in the form of similar text pairs.
Preferably, simhash can be used to quickly find text pairs that are literally close and possibly duplicated.
In the judgment module 20 of the embodiment of the present application, it is judged whether the preset text features in the similar text pair are identical.
Specifically, the text features may be features specific to the text pair. By introducing text features external to simhash, an auxiliary judgment can be made on the basis of features other than the simhash value.
In the first processing module 30 of the embodiment of the present application, when the preset text features in the similar text pair are judged to be identical, the two texts are considered duplicates, and only one text of the similar text pair needs to be retained.
In the second processing module 40 of the embodiment of the present application, when the preset text features in the similar text pair are judged to be different, the two texts are considered not to be duplicates, and both texts need to be retained.
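The division of the device into modules 10, 20, 30, and 40 can be mirrored in software, for example as the classes below. This is only one possible arrangement: the class names, the injected `fingerprint` function, the brute-force pair search, and the choice to keep the first text of a duplicate pair are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Doc = Dict[str, object]


class ComputingModule:  # corresponds to computing module 10
    def __init__(self, fingerprint: Callable[[Doc], int], threshold: int):
        self.fingerprint = fingerprint
        self.threshold = threshold

    def similar_pairs(self, docs: Sequence[Doc]) -> List[Tuple[Doc, Doc]]:
        fps = [self.fingerprint(d) for d in docs]
        return [
            (docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)
            if bin(fps[i] ^ fps[j]).count("1") < self.threshold
        ]


class JudgmentModule:  # corresponds to judgment module 20
    def __init__(self, feature_keys: Sequence[str]):
        self.feature_keys = feature_keys

    def same_features(self, a: Doc, b: Doc) -> bool:
        return all(a.get(k) == b.get(k) for k in self.feature_keys)


class FirstProcessingModule:  # module 30: duplicates, keep one text
    def handle(self, a: Doc, b: Doc) -> List[Doc]:
        return [a]  # assumed rule: keep the first text of the pair


class SecondProcessingModule:  # module 40: not duplicates, keep both
    def handle(self, a: Doc, b: Doc) -> List[Doc]:
        return [a, b]


class TextDeduplicationDevice:
    def __init__(self, computing: ComputingModule, judgment: JudgmentModule):
        self.computing = computing
        self.judgment = judgment
        self.first = FirstProcessingModule()
        self.second = SecondProcessingModule()

    def run(self, docs: Sequence[Doc]) -> List[List[Doc]]:
        """Return, for each similar text pair, the texts that are retained."""
        results = []
        for a, b in self.computing.similar_pairs(docs):
            module = self.first if self.judgment.same_features(a, b) else self.second
            results.append(module.handle(a, b))
        return results
```

Injecting the fingerprint function keeps the sketch independent of any particular simhash implementation.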
According to an embodiment of the present application, as a preferred option in this embodiment, as shown in Figure 7, the computing module 10 includes: a first computing unit 101, configured to calculate the similarity hash value of the title in the text to be processed; an extracting unit 102, configured to extract the preset text features from the text to be processed and build a feature index; and a search unit 103, configured to search the similarity hash values through the feature index for document pairs whose distance is less than a threshold, thereby obtaining the similar text pairs.
Specifically, in the first computing unit 101 of the embodiment of the present application, when the text to be processed is handled, the similarity hash value of its title needs to be calculated.
For example, title 1:
Tender announcement (CB190272018000001) ---- Beijing Normal University Zhuhai Campus
Beijing Normal University Zhuhai Campus - Tender announcement (CB190272018000001)
For another example, title 2:
Guangxi Xincheng County Pan He Ridge Forest Park West Area Landscaping Project (HCLB2017-540) construction tender announcement 2018-01-03 tender announcement
Xincheng County Pan He Ridge Forest Park West Area Landscaping Project (HCLB2017-540) construction tender announcement 2018-01-03
In the extracting unit 102 of the embodiment of the present application, the preset text features are extracted from the text to be processed and a feature index is built, so that when a given feature is needed, the result can be retrieved through the corresponding text feature index.
Specifically, in the search unit 103 of the embodiment of the present application, document pairs whose distance is less than the threshold are found among the similarity hash values by means of the feature index, and the similar text pairs are thereby located.
It should be noted that, in the present application, the index may also be built in any other manner in order to find all text pairs whose similarity hash distance is less than a certain threshold; the specific manner is not limited in the present application.
According to an embodiment of the present application, as a preferred option in this embodiment, as shown in Figure 8, the judgment module 20 includes: a first judging unit 201, configured to judge, when the similar text pair is one obtained by calculating the similarity hash value of project bidding texts, whether the source websites in the similar text pair are identical; a second judging unit 202, configured to judge, when the source websites in the similar text pair are judged to be identical, whether the project numbers in the similar text pair are identical; and a third judging unit 203, configured to judge, when the project numbers in the similar text pair are judged to be identical, whether the announcement types in the similar text pair are identical.
Specifically, in the first judging unit 201 of the embodiment of the present application, when the text to be processed is a project bidding text, the similar text pairs are obtained by calculating the similarity hash value of the titles of the project bidding texts. It is then further judged whether the source websites in the similar text pair are identical.
In the second judging unit 202 of the embodiment of the present application, it is further judged whether the project numbers in the similar text pair are identical.
In the third judging unit 203 of the embodiment of the present application, it is further judged whether the announcement types in the similar text pair are identical.
By means of the announcement type, project number, or source website described above, it can be recognized that, for some candidate text pairs, certain key information such as the project number may differ, so that the pair cannot be judged to be a duplicate. For example, similar text pair 1 (different project numbers):
Huaiyuan County Bureau of Land and Resources state-owned land use right bidding, auction, and listing transfer transaction publicity HYCJ2016-22
Huaiyuan County Bureau of Land and Resources state-owned land use right bidding, auction, and listing transfer transaction publicity HYCJ2016-12
For another example, similar text pair 2 (different announcement types):
Dalian Jinpu Intercity Railway Project, expressway and national highway parallel and crossing section crash barrier works (third) change announcement
Liaoning Province Dalian Jinpu Intercity Railway Project, expressway and national highway parallel and crossing section crash barrier works change announcement
For another example, similar text pair 3 (different announcement types):
Wuhe County Amitabha Temple Junior High School purchase of batteries/chargers, transaction announcement
Wuhe County Amitabha Temple Junior High School purchase of batteries/chargers, procurement announcement
For such text pairs, text features other than the simhash value need to be introduced for auxiliary judgment, for example by extracting features such as the project number, package number, and announcement type. Only when all of these features are identical are the two texts considered to be true duplicates.
According to an embodiment of the present application, as a preferred option in this embodiment, as shown in Figure 9, the device further includes a text judgment module 50, which includes: a preprocessing unit 501, configured to calculate the similarity hash value of the title in the document to be processed; a first text judging unit 502, configured to judge whether the similarity hash value satisfies the condition for a preset similar text pair; and a second text judging unit 503, configured to consider, when the similarity hash value is judged not to satisfy the condition for a preset similar text pair, that the document to be processed has no duplicate, and to retain the document to be processed.
In the preprocessing unit 501 of the embodiment of the present application, candidate text pairs are obtained from the similarity hash values calculated for the titles of the documents to be processed.
In the first text judging unit 502 of the embodiment of the present application, it is judged whether a text pair satisfies the preset condition for a similar text pair, namely that the distance between the similarity hash values is less than a certain threshold; a text pair whose distance is greater than the threshold does not satisfy the condition.
In the second text judging unit 503 of the embodiment of the present application, when the similarity hash value does not satisfy the condition for a preset similar text pair, it is concluded that the document to be processed has no duplicate; the document to be processed is retained and stored, and no further deduplication is required.
According to an embodiment of the present application, as a preferred option in this embodiment, as shown in Figure 10, the first processing module 30 includes a first retaining unit 301, and the second processing module 40 includes a second retaining unit 401. The first retaining unit 301 is configured to consider, when the preset text features in the similar text pair are judged to be identical, that the documents are duplicates, and to retain one text of the similar text pair according to a preset rule. The second retaining unit 401 is configured to consider, when the preset text features in the similar text pair are judged to be different, that the documents are not duplicates, and to retain all texts of the similar text pair.
In the first retaining unit 301 of the embodiment of the present application, the preset text features of the two texts in the similar text pair are judged in turn. If the preset text features are identical, the documents are duplicates, and any one text of the similar text pair is retained and stored according to the relevant rule.
In the second retaining unit 401 of the embodiment of the present application, the preset text features of the two texts in the similar text pair are judged in turn. If the preset text features are not identical, the documents are not duplicates, and all texts of the similar text pair are retained.
Obviously, those skilled in the art should understand that the modules or steps of the present application described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; alternatively, they may each be fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present application and is not intended to limit the present application. For those skilled in the art, various modifications and changes may be made to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall be included within the scope of protection of the present application.