CN110309446A - Fast text content deduplication method, device, computer equipment and storage medium - Google Patents
Fast text content deduplication method, device, computer equipment and storage medium
- Publication number
- CN110309446A (application number CN201910344414.9A)
- Authority
- CN
- China
- Prior art keywords
- content
- text
- keyword
- webpage
- webpage text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention relates to a fast text content deduplication method, device, computer equipment and storage medium. The method includes: crawling several pieces of webpage text content that need deduplication; preprocessing the webpage text content to obtain text content to be deduplicated; extracting feature keywords from the text content to be deduplicated to obtain target feature keywords; calculating weights for the target feature keywords to obtain weight values; signing the target feature keywords to obtain a feature signature; forming a webpage text content fingerprint from the feature signature; storing the webpage text content fingerprint in an inverted index; calculating similarity from the webpage text content fingerprints to obtain the similarity of the webpage text content; and outputting the similarity of the webpage text content. The invention meets the performance requirements of real-time deduplication over massive, large-scale data and improves both accuracy and deduplication performance.
Description
Technical field
The present invention relates to text content deduplication methods, and more specifically to a fast text content deduplication method, device, computer equipment and storage medium.
Background art
The rapid development of Internet technology has made copying and spreading information extremely cheap. Sharing information over the network brings people great convenience, but it also introduces a large amount of duplicate information. Many duplicate pages are reprints whose text content and structure are completely identical; others differ in internal form because of each website's own layout style. A large amount of duplicate webpage content not only increases the burden on users when browsing, but also consumes substantial resources during information collection, indexing and search.
Existing large-scale deduplication techniques mainly use locality-sensitive hashing, a text-content-based deduplication technique that generates hash signatures through dimensionality reduction and then judges the similarity of text content by the similarity of the signatures. This approach has three shortcomings. First, because of the complexity of the Chinese language, existing methods cannot represent text content very accurately: existing text feature extraction assumes that features are mutually independent, whereas in real environments feature keywords have semantic relations that cannot simply be ignored. Second, similarity calculation performs poorly and does not scale to large-scale, massive data environments. Third, because the semantic context between feature keywords is ignored, overall accuracy is low.
It is therefore necessary to design a new method that improves accuracy and deduplication performance and meets the performance requirements of real-time deduplication over massive, large-scale data.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a fast text content deduplication method, device, computer equipment and storage medium.
To achieve the above object, the invention adopts the following technical solution. The fast text content deduplication method comprises:
crawling several pieces of webpage text content that need deduplication;
preprocessing the webpage text content to obtain text content to be deduplicated;
extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
calculating weights for the target feature keywords to obtain weight values;
signing the target feature keywords to obtain a feature signature;
forming a webpage text content fingerprint from the feature signature;
storing the webpage text content fingerprint in an inverted index;
calculating similarity from the webpage text content fingerprints to obtain the similarity of the webpage text content;
outputting the similarity of the webpage text content.
In a further technical solution, crawling several pieces of webpage text content that need deduplication comprises:
distributing URL addresses;
crawling URLs according to the URL addresses to obtain URLs to be crawled;
judging whether a URL to be crawled has already been crawled;
if so, returning to the step of crawling URLs according to the URL addresses to obtain URLs to be crawled;
if not, crawling the webpage text content at the URL to be crawled.
In a further technical solution, preprocessing the webpage text content to obtain text content to be deduplicated comprises:
parsing and cleaning the webpage text content to obtain intermediate text content;
performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In a further technical solution, extracting feature keywords from the text content to be deduplicated to obtain target feature keywords comprises:
dividing the text content to be deduplicated into blocks by position to obtain text blocks;
extracting feature keywords from the text blocks to obtain initial feature keywords;
performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
In a further technical solution, signing the target feature keywords to obtain a feature signature comprises:
computing feature hash values from the target feature keywords to obtain a feature vector;
combining the feature vector and the target feature keywords to form the feature signature.
In a further technical solution, forming a webpage text content fingerprint from the feature signature comprises:
performing the weight calculation on each dimension of the feature vector in the feature signature to obtain a target vector;
setting to one the bit corresponding to each dimension of the target vector whose value is positive, and setting to zero the bit corresponding to each dimension whose value is non-positive, to obtain the webpage text content fingerprint.
In a further technical solution, calculating similarity from the webpage text content fingerprints to obtain the similarity of the webpage text content comprises:
building a webpage fingerprint inverted list from the webpage text content;
obtaining the document IDs that appear in the webpage fingerprint inverted list;
computing the intersection of the document IDs to obtain the similarity of the webpage text content.
The present invention also provides a fast text content deduplication device, comprising:
a crawling unit, configured to crawl several pieces of webpage text content that need deduplication;
a preprocessing unit, configured to preprocess the webpage text content to obtain text content to be deduplicated;
an extraction unit, configured to extract feature keywords from the text content to be deduplicated to obtain target feature keywords;
a weight calculation unit, configured to calculate weights for the target feature keywords to obtain weight values;
a signature unit, configured to sign the target feature keywords to obtain a feature signature;
a fingerprint forming unit, configured to form a webpage text content fingerprint from the feature signature;
a storage unit, configured to store the webpage text content fingerprint in an inverted index;
a similarity calculation unit, configured to calculate similarity from the webpage text content fingerprints to obtain the similarity of the webpage text content;
an output unit, configured to output the similarity of the webpage text content.
The present invention also provides computer equipment comprising a memory and a processor; a computer program is stored in the memory, and the processor implements the above method when executing the computer program.
The present invention also provides a storage medium storing a computer program that, when executed by a processor, implements the above method.
Compared with the prior art, the invention has the following advantages. By extracting, based on word relations, target feature keywords and weights that represent the webpage text content, and by generating a webpage text content fingerprint from those keywords and weights, the invention achieves a compressed representation that saves storage space and computation time. Storing the fingerprints in the Elasticsearch inverted index data structure converts similarity calculation into Elasticsearch Boolean-model retrieval, which meets the performance requirements of real-time deduplication over massive, large-scale data and improves accuracy and deduplication performance.
The invention is further described below with reference to the drawings and specific embodiments.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the fast text content deduplication method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the fast text content deduplication method provided by an embodiment of the present invention;
Fig. 3 is a schematic sub-flowchart of the fast text content deduplication method provided by an embodiment of the present invention;
Fig. 4 is a schematic sub-flowchart of the fast text content deduplication method provided by an embodiment of the present invention;
Fig. 5 is a schematic sub-flowchart of the fast text content deduplication method provided by an embodiment of the present invention;
Fig. 6 is a schematic sub-flowchart of the fast text content deduplication method provided by an embodiment of the present invention;
Fig. 7 is a schematic sub-flowchart of the fast text content deduplication method provided by an embodiment of the present invention;
Fig. 8 is a schematic sub-flowchart of the fast text content deduplication method provided by an embodiment of the present invention;
Fig. 9 is a schematic diagram of the formation of the webpage text content fingerprint provided by an embodiment of the present invention;
Fig. 10 is a schematic diagram of the formation of the target feature keywords provided by an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of the inverted index provided by an embodiment of the present invention;
Fig. 12 is a schematic block diagram of the fast text content deduplication device provided by an embodiment of the present invention;
Fig. 13 is a schematic block diagram of the computer equipment provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprises" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing specific embodiments only and is not intended to limit the invention. As used in this description and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this description and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to Fig. 1 and Fig. 2. Fig. 1 is a schematic diagram of an application scenario of the fast text content deduplication method provided by an embodiment of the present invention, and Fig. 2 is a schematic flowchart of the method. The method is applied in a server that exchanges data with a terminal: the server obtains from the terminal several pieces of webpage text content that need deduplication, deduplicates them quickly, and outputs the result to the terminal for display.
As shown in Fig. 2, the method includes the following steps S110 to S190.
S110, crawl need several webpage text contents of duplicate removal.
In the present embodiment, webpage text content refers to the text with information shown in webpage.
In one embodiment, referring to Fig. 3, above-mentioned step S110 may include step S111~S114.
S111, the distribution address URL;
S112, URL is crawled according to the address URL, to obtain URL to be crawled;
Whether URL to be crawled described in S113, judgement has crawled;
If so, returning to the step S112;
S114, if it is not, then grabbing wait crawl the webpage text content in URL.
Distributed task dispatching program is with distributing URL (uniform resource locator, Uniform Resource Locator)
Crawler application program node is given in location, if URL to be crawled is the URL grabbed, directly abandons, otherwise, passes through crawler
Application program node grabs webpage text content.
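The crawl-once check in steps S111 to S114 can be sketched as follows. This is a minimal single-process illustration, not the distributed scheduler the embodiment describes; the names `crawl_once` and `fake_fetch` and the example URLs are hypothetical, and `fake_fetch` stands in for a real HTTP request.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl_once(seed_urls: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch each URL at most once and return {url: page text}."""
    pending = deque(seed_urls)  # URLs handed out by the scheduler (S111, S112)
    crawled = set()             # URLs that have already been crawled
    pages: Dict[str, str] = {}
    while pending:
        url = pending.popleft()
        if url in crawled:      # S113: already crawled, so discard it
            continue
        crawled.add(url)
        pages[url] = fetch(url)  # S114: crawl the webpage text content
    return pages


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real HTTP request.
    return f"content of {url}"


pages = crawl_once(
    ["http://a.example/1", "http://a.example/2", "http://a.example/1"],
    fake_fetch,
)
print(sorted(pages))
```

In a distributed deployment the `crawled` set would have to be shared between crawler nodes; how the embodiment does this is not stated in the text.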
S120, preprocess the webpage text content to obtain text content to be deduplicated.
In this embodiment, the text content to be deduplicated refers to text content that has been cleaned, filtered and word-segmented.
In one embodiment, referring to Fig. 4, step S120 may include steps S121 to S122.
S121, parse and clean the webpage text content to obtain intermediate text content;
S122, perform word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In this embodiment, intermediate text content refers to the content that remains after unneeded data has been removed.
Parsing and cleaning the webpage text content mainly includes removing HTML tags, converting English uppercase letters to lowercase, and converting between simplified and traditional Chinese. Processing also involves Chinese word segmentation, which cuts the text content into independent, meaningful words.
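A rough sketch of steps S121 and S122 using only the Python standard library is given below. It makes several simplifying assumptions: tags are stripped with a naive regular expression (which would leave script or style text in place), the simplified/traditional conversion is omitted because it needs a conversion table, and `segment` is a placeholder that emits one token per Chinese character rather than a real word segmenter.

```python
import html
import re
from typing import List

TAG_RE = re.compile(r"<[^>]+>")


def clean(raw_html: str) -> str:
    """S121: drop HTML tags, decode entities, lowercase, collapse whitespace."""
    text = TAG_RE.sub(" ", raw_html)
    text = html.unescape(text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def segment(text: str) -> List[str]:
    """S122 placeholder: keep ASCII letter/digit runs whole and emit each
    CJK character as its own token (a real system would use a segmenter)."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text)


tokens = segment(clean("<p>Hello <b>WORLD</b> &amp; 中国</p>"))
print(tokens)
```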
S130, extract feature keywords from the text content to be deduplicated to obtain target feature keywords.
In this embodiment, target feature keywords refer to the words that represent the features and essence of the webpage text content. The whole process of forming the target feature keywords is shown in Fig. 10. There are many ways to select features of the text content to be deduplicated, such as shingles and n-grams. Because multi-word sequences or character strings have no clear semantics of their own and weaken the representation of the document content, single keywords are extracted here as the features of the text content to be deduplicated.
In one embodiment, referring to Fig. 5, step S130 may include steps S131 to S134.
S131, divide the text content to be deduplicated into blocks by position to obtain text blocks.
In this embodiment, a text block refers to the content formed by the text at a particular position.
Divided by position, the text content to be deduplicated falls mainly into a meta-information block, a webpage body block and a title block.
S132, extract feature keywords from the text blocks to obtain initial feature keywords.
In this embodiment, an initial feature keyword refers to a feature word, extracted directly from the content of a text block, that represents what appears in that block.
Specifically, the feature keywords of each text block are extracted according to semantic relations.
S133, perform semantic expansion on the initial feature keywords to obtain intermediate feature keywords.
In this embodiment, an intermediate feature keyword refers to a word synonymous with an initial feature keyword.
S134, merge the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
In this embodiment, merging the expanded keywords with the initial feature keywords makes the target feature keywords more comprehensive and accurate.
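The flow of steps S131 to S134 can be illustrated with the toy sketch below. Two parts are assumptions rather than the embodiment's method: initial keywords are chosen by simple frequency (the text says they are extracted according to semantic relations, without giving the procedure), and `SYNONYMS` is a hypothetical lookup table standing in for a semantic dictionary.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical synonym table standing in for a semantic dictionary.
SYNONYMS: Dict[str, List[str]] = {"excellent": ["good"]}


def initial_keywords(blocks: Dict[str, List[str]], top_k: int = 3) -> List[str]:
    """S132 (assumed frequency-based): top_k tokens of each text block."""
    chosen: List[str] = []
    for tokens in blocks.values():
        for word, _count in Counter(tokens).most_common(top_k):
            if word not in chosen:
                chosen.append(word)
    return chosen


def expand(keywords: List[str]) -> List[str]:
    """S133: semantic expansion by looking up synonyms of each keyword."""
    return [syn for kw in keywords for syn in SYNONYMS.get(kw, [])]


def merge(initial: List[str], intermediate: List[str]) -> List[str]:
    """S134: merge both lists, dropping repeats and keeping first-seen order."""
    return list(dict.fromkeys(initial + intermediate))


blocks = {  # S131: text content divided by position
    "meta": ["environment"],
    "title": ["excellent", "environment"],
    "body": ["china", "excellent", "environment"],
}
init = initial_keywords(blocks)
target = merge(init, expand(init))
print(target)
```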
S140, calculate weights for the target feature keywords to obtain weight values.
In this embodiment, a weight value reflects factors such as the proportion of the text content that a word accounts for and the positions at which it appears.
In one embodiment, step S140 may include steps S141 to S142.
S141, calculate weights according to the frequency and position with which the target feature keywords appear in the webpage text content, to obtain several weights.
In this embodiment, the simplest way to calculate the weight of a target feature keyword uses two indicators, importance and discrimination. Importance (Weight) is based mainly on a variant of the word's frequency of occurrence tf, calculated as:
Weight = log(1 + log(1 + tf));
Discrimination is based on the inverse document frequency factor; its calculation formula is [formula not reproduced in this text], where N is the total number of documents in the document collection and df is the document frequency.
Finally, normalization is performed based on document length; the normalization formula is [formula not reproduced in this text],
where b is an adjustment factor with a default value of 0.85.
Taking the above weight factors together, the final weight formula for a target feature keyword is as follows: [formula not reproduced in this text]
S142, sort the weights to obtain the weight values.
The final set of target feature keywords and weights is obtained by sorting by weight.
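As a small worked illustration of steps S141 and S142, the sketch below applies only the importance formula that survives in the text, Weight = log(1 + log(1 + tf)), and then sorts by weight. The discrimination and length-normalization factors are left out because their formulas are not available here; the function names and the sample term frequencies are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def importance(tf: int) -> float:
    """Importance from the text: Weight = log(1 + log(1 + tf))."""
    return math.log(1.0 + math.log(1.0 + tf))


def rank_keywords(term_freqs: Dict[str, int]) -> List[Tuple[str, float]]:
    """S141/S142: weight every keyword, then sort with the highest weight first."""
    weighted = [(word, importance(tf)) for word, tf in term_freqs.items()]
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)


ranked = rank_keywords({"country": 1, "china": 5, "environment": 2})
for word, weight in ranked:
    print(f"{word}\t{weight:.4f}")
```

The double logarithm dampens raw frequency strongly, so a word that appears five times weighs only modestly more than one that appears twice.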
S150, sign the target feature keywords to obtain a feature signature.
In this embodiment, the feature signature refers to the feature hash values generated from the semantic relations between target feature keywords. Specifically, a semantic feature signature algorithm is used to sign the target feature keywords.
In one embodiment, referring to Fig. 6, step S150 may include steps S151 to S152.
S151, compute feature hash values from the target feature keywords to obtain a feature vector.
In this embodiment, the feature vector refers to the hash values produced by signing the target feature keywords.
In this embodiment, a hash value is a b-dimensional vector, where b is set manually. Webpage text content consists of sentences, and sentences in turn consist of words; the topic a document expresses is determined jointly by the words and the context in which they appear. Consider the following two texts: "China is striving to become a country with an excellent environment" and "China is striving to become a country with a good environment". The two sentences express the same idea, and if they appear in different documents they should be regarded as duplicate content. However, the features of the two sentences are not exactly the same, so the document fingerprints generated by the original locality-sensitive hashing algorithm differ; the documents are therefore treated as different, which is a misjudgment.
Semantic feature signature algorithm pseudocode:
Input: the web document feature set
Feature = {Feature1, Feature2, ..., Featurei, ..., Featuren}, Featurei = {Featurei1, Featurei2, ..., Featureij, ..., Featurein};
and the corresponding weight set
Weight = {Weight1, Weight2, ..., Weighti, ..., Weightn}, Weighti = {Weighti1, Weighti2, ..., Weightij, ..., Weightin};
Output: the set of web document semantic feature hash values
HashVal = {HashVal1, HashVal2, ..., HashVali, ..., HashValn}, HashVali = {HashVali1, HashVali2, ..., HashValij, ..., HashValin};
The pseudocode is as follows: [pseudocode not reproduced in this text]
Here, sim(Featureij, Featurekl) is the similarity function for word features. It is calculated from the semantic relations between concepts, which are organized as a hierarchical semantic dictionary; the hierarchical path distance between words in the semantic dictionary measures their semantic similarity. When the similarity measure between two word features is less than the set threshold (threshold), the hash values of the two features are set to the same value. In the original locality-sensitive hashing algorithm, by contrast, different features always produce different hash values, which ignores the relations between words.
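Since the algorithm body is not available in this text, the following is only a sketch of the idea described above, under stated assumptions: the measure is treated as a path distance, so a smaller value means the words are closer in meaning; a feature reuses the hash of the first earlier feature within the threshold; MD5 truncated to b bits is an arbitrary hash choice; and `GROUP` with `toy_distance` is a hypothetical stand-in for a hierarchical semantic dictionary.

```python
import hashlib
from typing import Callable, Dict, List


def base_hash(word: str, bits: int = 32) -> int:
    """Deterministic hash of a word truncated to `bits` bits (MD5 is arbitrary)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def semantic_hashes(features: List[str],
                    distance: Callable[[str, str], float],
                    threshold: float,
                    bits: int = 32) -> Dict[str, int]:
    """Assign each feature the hash of the first earlier feature within threshold."""
    representatives: List[str] = []
    hashes: Dict[str, int] = {}
    for feature in features:
        match = next((r for r in representatives
                      if distance(feature, r) < threshold), None)
        if match is None:            # no close word seen yet: becomes a representative
            representatives.append(feature)
            match = feature
        hashes[feature] = base_hash(match, bits)
    return hashes


# Hypothetical concept groups standing in for a hierarchical semantic dictionary.
GROUP = {"excellent": "quality", "good": "quality", "china": "place"}


def toy_distance(a: str, b: str) -> float:
    return 0.0 if GROUP.get(a, a) == GROUP.get(b, b) else 1.0


hashes = semantic_hashes(["china", "excellent", "good"], toy_distance, threshold=0.5)
print(hashes["excellent"] == hashes["good"])
```

With this sharing, the two example sentences that differ only in "excellent" versus "good" would contribute identical hash values for that position.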
S152, combine the feature vector and the target feature keywords to form the feature signature.
Compressing the resulting feature signature yields the webpage text content fingerprint. This achieves a compressed representation and saves storage space and computation time.
S160, form the webpage text content fingerprint from the feature signature.
In this embodiment, the webpage text content fingerprint refers to the vector formed by setting bits to zero or one according to the feature vector.
In one embodiment, referring to Fig. 7, step S160 may include steps S161 to S162.
S161, perform the weight calculation on each dimension of the feature vector in the feature signature to obtain a target vector;
S162, set to one the bit corresponding to each dimension of the target vector whose value is positive, and set to zero the bit corresponding to each dimension whose value is non-positive, to obtain the webpage text content fingerprint.
Webpage text content, that is, a document, consists of a sequence of character strings, and operating on the strings directly requires a great deal of storage space and computation time. The original document is therefore analyzed and processed to extract the target feature keywords that represent it, and a hash function generates the webpage text content fingerprint. Duplicate or near-duplicate documents are found by comparing the fingerprints that represent the webpage text content. Two documents are considered duplicates when the number of fingerprints they share, or the ratio of shared fingerprints to total fingerprints, reaches a certain threshold; otherwise they are considered not duplicates.
In the b-dimensional feature vector V, each dimension is calculated separately: if the corresponding bit of a feature's hash value is 1, the weight of that target feature keyword is added; otherwise the weight is subtracted. After all features have been processed, if the i-th dimension of V is positive, the i-th bit of the b-bit fingerprint is set to 1; otherwise it is set to 0. The result is a vector of 0s and 1s, namely the webpage text content fingerprint, as shown in Fig. 9.
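The add-or-subtract rule of steps S161 and S162 can be sketched as below. The per-word hash here is plain truncated MD5, an assumption made for illustration; in the method described above the hash values would come from the semantic feature signature step instead. The names `word_hash` and `fingerprint` are hypothetical.

```python
import hashlib
from typing import Dict


def word_hash(word: str, bits: int) -> int:
    """Stand-in per-word hash: MD5 truncated to `bits` bits."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def fingerprint(weights: Dict[str, float], bits: int = 32) -> int:
    """Build a b-bit fingerprint from {keyword: weight}."""
    v = [0.0] * bits                      # the b-dimensional vector V
    for word, weight in weights.items():
        h = word_hash(word, bits)
        for i in range(bits):
            if (h >> i) & 1:              # hash bit is 1: add the weight
                v[i] += weight
            else:                         # hash bit is 0: subtract the weight
                v[i] -= weight
    fp = 0
    for i in range(bits):
        if v[i] > 0:                      # positive dimension: bit set to 1
            fp |= 1 << i
    return fp


fp = fingerprint({"china": 0.9, "environment": 0.6, "country": 0.4})
print(f"{fp:032b}")
```

Because each dimension is a signed sum of weights, a heavily weighted keyword dominates the bits it touches, while low-weight keywords can only flip bits where the heavier ones disagree.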
S170, store the webpage text content fingerprints in an inverted index.
The webpage fingerprints are stored in the Elasticsearch inverted index data structure, and similarity calculation is converted into Elasticsearch Boolean-model retrieval. Elasticsearch is a search server based on Lucene; it provides a distributed, multi-user full-text search engine with a RESTful web interface.
Using Elasticsearch, the mapping from web document ID to webpage fingerprint is converted into a mapping from webpage fingerprint to web document ID and stored, as shown in Fig. 11. There, webpage fingerprint 1 points to the ID of web document 1 and the ID of web document 2, and webpage fingerprint 2 points to the list of web document IDs that have this target feature keyword; a web document refers to webpage text content.
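The inversion of the document-to-fingerprint mapping can be pictured with an in-memory dictionary, as in the sketch below. This is only a model of the data structure; it does not use or imitate the Elasticsearch API, and the document IDs and fingerprint values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def invert(doc_to_fp: Dict[str, int]) -> Dict[int, List[str]]:
    """Turn {document ID: fingerprint} into {fingerprint: [document IDs]}."""
    index: Dict[int, List[str]] = defaultdict(list)
    for doc_id, fp in doc_to_fp.items():
        index[fp].append(doc_id)
    return dict(index)


index = invert({"doc1": 0b1010, "doc2": 0b1010, "doc3": 0b0111})
print(index)
```

Looking up a fingerprint then returns every document that carries it in a single dictionary access, which is the property the Boolean retrieval step relies on.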
S180: calculate similarity from the webpage text content fingerprints to obtain the similarity of the webpage text content.
In this embodiment, the similarity refers to the similarity between two pieces of webpage text content.
In one embodiment, referring to Fig. 8, step S180 may include steps S181 to S183.
S181: build a web page fingerprint inverted list from the webpage text content;
S182: obtain the document numbers that appear in the web page fingerprint inverted list;
S183: perform an intersection calculation on those documents to obtain the similarity of the webpage text content.
After a webpage text content fingerprint has been calculated for each piece of webpage text content, the similarity between two fingerprints is calculated. Similarity is measured by Hamming distance: a web page fingerprint inverted list is built, the document numbers appearing in that list are queried, and the results are intersected to obtain the final result. Suppose the fingerprints are 32 bits long. Each 32-bit binary signature is split into 2 blocks of 16 bits, and all signatures within a Hamming distance of 1 are to be found. By the pigeonhole principle, if the Hamming distance between two webpage text content fingerprints is at most 1, the single differing bit can fall in only one block, so at least one of the two blocks must be identical. The similarity calculation can therefore be converted into a Boolean retrieval model over the inverted index, which greatly reduces the time needed to calculate document similarity.
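The block-splitting argument can be illustrated with a small sketch. It assumes 32-bit integer fingerprints split into two 16-bit blocks, as in the example above. The candidate step here takes the union of the documents that share at least one block and then verifies each candidate with an exact Hamming distance; that is one common way to apply the pigeonhole argument and is not necessarily the exact query the patent issues against Elasticsearch. `BlockIndex` and its method names are hypothetical.

```python
from collections import defaultdict

BLOCK_BITS = 16
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def blocks(fp):
    """Split a 32-bit fingerprint into (block position, 16-bit value) keys."""
    return [(0, fp >> BLOCK_BITS), (1, fp & BLOCK_MASK)]


class BlockIndex:
    """Inverted index keyed by fingerprint block (illustrative only)."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._fingerprints = {}

    def add(self, doc_id, fp):
        self._fingerprints[doc_id] = fp
        for key in blocks(fp):
            self._postings[key].add(doc_id)

    def near_duplicates(self, fp, max_distance=1):
        # Pigeonhole: within distance 1, at least one of the 2 blocks matches,
        # so only documents sharing a block need an exact comparison.
        candidates = set()
        for key in blocks(fp):
            candidates |= self._postings.get(key, set())
        return sorted(
            d for d in candidates
            if hamming(self._fingerprints[d], fp) <= max_distance
        )
```

Only documents that share a block are ever compared bit by bit, which is where the time saving over an all-pairs comparison comes from.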
S190: output the similarity of the webpage text content.
In this embodiment, the similarity of the webpage text content is output to a terminal for display.
The duplicate removal effect of this method on webpage text content is evaluated using precision (Precision) and recall (Recall).
Precision and Recall each reflect only one aspect of performance and ignore overall performance. The duplicate removal effect evaluation value F1 combines the two and is defined as:
F1 = 2 × Precision × Recall / (Precision + Recall)
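As a worked illustration of the evaluation value, the sketch below computes precision, recall and F1 from counts of duplicate decisions. The patent's own formulas for Precision and Recall did not survive in this text, so the standard definitions are assumed here; the function name and its parameters are hypothetical.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) under the standard definitions.

    Assumed meaning of the counts (not stated in the source):
      true_positives  - duplicate pages correctly flagged as duplicates
      false_positives - non-duplicate pages wrongly flagged
      false_negatives - duplicate pages that were missed
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is used as the overall indicator.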
The running effect of this method was compared with that of simhash, the classic locality-sensitive hashing algorithm; the comparison is shown in Table 1 below:
Table 1. Comparison of algorithm running effect
In terms of running effect, this method shows a considerable improvement in both precision and recall over the locality-sensitive hashing algorithm.
The running efficiency of this method was likewise compared with that of simhash; the comparison is shown in Table 2 below:
Table 2. Comparison of algorithm running efficiency
In terms of running efficiency, this method performs better: as the data volume increases substantially, its performance degrades more slowly than that of locality-sensitive hashing, so it is suitable for large-scale, massive-data environments.
In other embodiments, the duplicate removal effect evaluation value may also be output in addition to the similarity of the webpage text content.
In the quick text content deduplication method described above, target feature keywords that represent the webpage text content, together with their weights, are extracted based on word relations, and the webpage text content fingerprint is generated from these target feature keywords and weights. This yields a compressed representation that saves storage space and computation time. The webpage text content fingerprints are stored using the Elasticsearch inverted index data structure, and similarity calculation is converted into Elasticsearch Boolean model retrieval, which effectively meets the performance requirements of real-time deduplication of massive, large-scale data and improves both accuracy and deduplication performance.
Figure 12 is a schematic block diagram of a quick text content deduplication device 300 provided by an embodiment of the present invention. As shown in Figure 12, corresponding to the quick text content deduplication method above, the present invention also provides a quick text content deduplication device 300. The device 300 includes units for executing the method above and may be configured in a server.
Specifically, referring to Figure 12, the quick text content deduplication device 300 includes:
Crawling unit 301, for crawling several pieces of webpage text content that need deduplication;
Preprocessing unit 302, for preprocessing the several pieces of webpage text content to obtain text content to be deduplicated;
Extraction unit 303, for extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
Weight calculation unit 304, for performing weight calculation on the target feature keywords to obtain weight values;
Signature unit 305, for signing the target feature keywords to obtain a feature signature;
Fingerprint forming unit 306, for forming a webpage text content fingerprint from the feature signature;
Storage unit 307, for storing the webpage text content fingerprint in an inverted index;
Similarity calculation unit 308, for calculating similarity from the webpage text content fingerprint to obtain the similarity of the webpage text content;
Output unit 309, for outputting the similarity of the webpage text content.
In one embodiment, the crawling unit 301 includes:
Address allocation subunit, for allocating URL addresses;
Crawling subunit, for crawling URLs according to the URL addresses to obtain a URL to be crawled;
Crawl judgment subunit, for judging whether the URL to be crawled has already been crawled and, if so, returning to the crawling of URLs according to the URL addresses to obtain a URL to be crawled;
Content crawling subunit, for crawling the webpage text content at the URL to be crawled if it has not already been crawled.
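The crawl flow described by these subunits (take a URL, skip it if already crawled, otherwise fetch its text) can be sketched as a simple frontier loop. No network access is performed here: `fetch` and `extract_links` are hypothetical callables supplied by the caller, and the `limit` parameter is an assumption added to keep the loop bounded.

```python
from collections import deque


def crawl(seed_urls, fetch, extract_links, limit=100):
    """Crawl from the seed URLs, fetching each URL at most once.

    fetch(url) -> page text and extract_links(text) -> iterable of URLs
    are hypothetical callables; they are not part of the patent text.
    """
    frontier = deque(seed_urls)  # allocated URL addresses
    crawled = set()              # URLs that have already been crawled
    pages = {}                   # url -> webpage text content
    while frontier and len(pages) < limit:
        url = frontier.popleft()
        if url in crawled:
            # Already crawled: go back and take the next URL.
            continue
        crawled.add(url)
        text = fetch(url)
        pages[url] = text
        frontier.extend(extract_links(text))
    return pages
```

Keeping the crawled URLs in a set makes the "has this URL been crawled" check a constant-time lookup.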
In one embodiment, the preprocessing unit 302 includes:
Cleaning subunit, for parsing and cleaning the several pieces of webpage text content to obtain intermediate text content;
Word segmentation subunit, for performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
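A minimal sketch of the two preprocessing stages, parsing/cleaning and word segmentation, is shown below using only the Python standard library. The regular-expression tokenizer is an assumption that suits space-delimited text; Chinese text would need a dedicated word segmenter, which this section does not name. The helper names are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(html):
    """Parse and clean a page into intermediate text content."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the markup.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def segment(text):
    """Split intermediate text into lower-cased word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())
```

For example, `segment(clean_html(page))` yields the token list that the later keyword extraction sketch would consume.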
In one embodiment, the extraction unit 303 includes:
Blocking subunit, for dividing the text content to be deduplicated into blocks by position to obtain text blocks;
Extraction subunit, for extracting feature keywords from the text blocks to obtain initial feature keywords;
Expansion subunit, for performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
Merging subunit, for merging the intermediate feature keywords with the initial feature keywords to obtain the target feature keywords.
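The four stages above (blocking, extraction, semantic expansion, merging) can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: extraction is approximated by per-block term frequency, and semantic expansion by a tiny hand-written synonym table (`SYNONYMS`), whereas the patent describes extraction based on word relations. `STOPWORDS`, `block_size` and `top_k` are likewise hypothetical.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}       # assumed list
SYNONYMS = {"car": ["automobile"], "fast": ["quick"]}         # hypothetical lexicon


def split_blocks(tokens, block_size):
    """Divide the token sequence into consecutive blocks by position."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def extract_initial(block, top_k):
    """Pick the most frequent non-stopword tokens of one block."""
    counts = Counter(t for t in block if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]


def target_keywords(tokens, block_size=50, top_k=3):
    """Return initial keywords followed by their semantic expansions."""
    initial = []
    for block in split_blocks(tokens, block_size):
        for word in extract_initial(block, top_k):
            if word not in initial:
                initial.append(word)
    # Semantic expansion yields the intermediate keywords.
    intermediate = [s for word in initial for s in SYNONYMS.get(word, [])]
    # Merge, keeping order and dropping repeats.
    merged = list(initial)
    for word in intermediate:
        if word not in merged:
            merged.append(word)
    return merged
```

Merging the expansions back in means that two pages using different but related words can still end up with overlapping target keywords.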
In one embodiment, the weight calculation unit 304 includes:
Weight obtaining subunit, for calculating weights according to the frequency and position with which the target feature keywords occur in the webpage text content, to obtain several weights;
Sorting subunit, for sorting the several weights to obtain the weight values.
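This section says only that weights depend on frequency and position and are then sorted; it does not give the formula. The sketch below therefore uses an assumed rule purely for illustration: weight = occurrence count, multiplied by a bonus when the keyword first appears in the leading part of the text. `lead_fraction` and `lead_bonus` are hypothetical parameters.

```python
def keyword_weights(tokens, keywords, lead_fraction=0.2, lead_bonus=2.0):
    """Return (keyword, weight) pairs sorted by descending weight.

    Assumed formula (not from the source): frequency times a position
    bonus for keywords whose first occurrence is near the start.
    """
    lead_end = max(1, int(len(tokens) * lead_fraction))
    weights = {}
    for kw in keywords:
        positions = [i for i, t in enumerate(tokens) if t == kw]
        if not positions:
            continue  # keyword does not occur in this text
        bonus = lead_bonus if positions[0] < lead_end else 1.0
        weights[kw] = len(positions) * bonus
    # Sort by weight (descending), breaking ties alphabetically.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
```

Any other monotone combination of frequency and position would fit the description equally well; the point is only that each target keyword ends up with one comparable weight.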
In one embodiment, the signature unit 305 includes:
Vector obtaining subunit, for calculating feature hash values from the target feature keywords to obtain feature vectors;
Integration subunit, for integrating the feature vectors with the target feature keywords to form the feature signature.
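A sketch of the signing step is given below: each target keyword is hashed and the hash is unpacked into a fixed-length bit vector, and the keyword-to-vector pairs together form the signature. The hash function is not named in this section, so SHA-256 from the standard library is an assumption; `feature_vector` and `sign` are hypothetical names, and `bits` is assumed to be a multiple of 8.

```python
import hashlib


def feature_vector(keyword, bits=32):
    """Hash a keyword into a list of `bits` values, each 0 or 1.

    SHA-256 is an assumed choice of hash; bits must be a multiple of 8.
    """
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    value = int.from_bytes(digest[:bits // 8], "big")
    # Most significant bit first.
    return [(value >> (bits - 1 - i)) & 1 for i in range(bits)]


def sign(keywords, bits=32):
    """Pair every target keyword with its feature vector."""
    return {kw: feature_vector(kw, bits) for kw in keywords}
```

The same keyword always maps to the same vector, so identical keywords on different pages contribute identically to their fingerprints.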
In one embodiment, the fingerprint forming unit 306 includes:
Target vector forming subunit, for performing a weighted calculation on each dimension of the feature vectors in the feature signature to obtain a target vector;
Setting subunit, for setting to one each position of the target vector whose value is positive and setting to zero each position whose value is non-positive, to obtain the webpage text content fingerprint.
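The fingerprint formation can be sketched as a weighted vote per bit position followed by thresholding at zero. The thresholding rule (positive becomes 1, non-positive becomes 0) is taken from the description above; the voting rule used here, adding the weight for a 1 bit and subtracting it for a 0 bit as in simhash, is an assumption, since this section does not spell out the weighted calculation. Function names are hypothetical.

```python
def fingerprint(signature, weights):
    """Combine weighted feature vectors into a 0/1 fingerprint.

    signature: keyword -> list of 0/1 bits (all of the same length)
    weights:   keyword -> numeric weight
    Assumed vote: +weight for a 1 bit, -weight for a 0 bit.
    """
    if not signature:
        return []
    bits = len(next(iter(signature.values())))
    target = [0.0] * bits  # the target vector
    for kw, vector in signature.items():
        w = weights.get(kw, 0.0)
        for i, bit in enumerate(vector):
            target[i] += w if bit else -w
    # Positive positions become 1, non-positive positions become 0.
    return [1 if value > 0 else 0 for value in target]


def to_int(bits):
    """Pack a list of 0/1 bits (most significant first) into an integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value
```

With weights `{"car": 3.0, "road": 1.0}` and vectors `[1, 0, 1, 0]` and `[0, 0, 1, 1]`, the target vector is `[2, -4, 4, -2]`, so the heavier keyword decides every disputed position.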
In one embodiment, the similarity calculation unit 308 includes:
Establishing subunit, for building a web page fingerprint inverted list from the webpage text content;
Document number obtaining subunit, for obtaining the document numbers that appear in the web page fingerprint inverted list;
Intersection calculation subunit, for performing an intersection calculation on those documents to obtain the similarity of the webpage text content.
It should be noted that, as those skilled in the art will clearly understand, the specific implementation of the quick text content deduplication device 300 and of each of its units can be found in the corresponding description of the foregoing method embodiment and, for convenience and brevity, is not repeated here.
The quick text content deduplication device 300 may be implemented as a computer program that can run on a computer device as shown in Figure 13.
Referring to Figure 13, Figure 13 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server.
Referring to Figure 13, the computer device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions which, when executed, cause the processor 502 to perform a quick text content deduplication method.
The processor 502 provides computing and control capability to support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it causes the processor 502 to perform a quick text content deduplication method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Figure 13 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different component arrangement.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
Crawling several pieces of webpage text content that need deduplication;
Preprocessing the several pieces of webpage text content to obtain text content to be deduplicated;
Extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
Performing weight calculation on the target feature keywords to obtain weight values;
Signing the target feature keywords to obtain a feature signature;
Forming a webpage text content fingerprint from the feature signature;
Storing the webpage text content fingerprint in an inverted index;
Calculating similarity from the webpage text content fingerprint to obtain the similarity of the webpage text content;
Outputting the similarity of the webpage text content.
In one embodiment, when implementing the step of crawling several pieces of webpage text content that need deduplication, the processor 502 specifically implements the following steps:
Allocating URL addresses;
Crawling URLs according to the URL addresses to obtain a URL to be crawled;
Judging whether the URL to be crawled has already been crawled;
If so, returning to the crawling of URLs according to the URL addresses to obtain a URL to be crawled;
If not, crawling the webpage text content at the URL to be crawled.
In one embodiment, when implementing the step of preprocessing the several pieces of webpage text content to obtain text content to be deduplicated, the processor 502 specifically implements the following steps:
Parsing and cleaning the several pieces of webpage text content to obtain intermediate text content;
Performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In one embodiment, when implementing the step of extracting feature keywords from the text content to be deduplicated to obtain target feature keywords, the processor 502 specifically implements the following steps:
Dividing the text content to be deduplicated into blocks by position to obtain text blocks;
Extracting feature keywords from the text blocks to obtain initial feature keywords;
Performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
Merging the intermediate feature keywords with the initial feature keywords to obtain the target feature keywords.
In one embodiment, when implementing the step of signing the target feature keywords to obtain a feature signature, the processor 502 specifically implements the following steps:
Calculating feature hash values from the target feature keywords to obtain feature vectors;
Integrating the feature vectors with the target feature keywords to form the feature signature.
In one embodiment, when implementing the step of forming a webpage text content fingerprint from the feature signature, the processor 502 specifically implements the following steps:
Performing a weighted calculation on each dimension of the feature vectors in the feature signature to obtain a target vector;
Setting to one each position of the target vector whose value is positive and setting to zero each position whose value is non-positive, to obtain the webpage text content fingerprint.
In one embodiment, when implementing the step of calculating similarity from the webpage text content fingerprint to obtain the similarity of the webpage text content, the processor 502 specifically implements the following steps:
Building a web page fingerprint inverted list from the webpage text content;
Obtaining the document numbers that appear in the web page fingerprint inverted list;
Performing an intersection calculation on those documents to obtain the similarity of the webpage text content.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program includes program instructions and may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:
Crawling several pieces of webpage text content that need deduplication;
Preprocessing the several pieces of webpage text content to obtain text content to be deduplicated;
Extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
Performing weight calculation on the target feature keywords to obtain weight values;
Signing the target feature keywords to obtain a feature signature;
Forming a webpage text content fingerprint from the feature signature;
Storing the webpage text content fingerprint in an inverted index;
Calculating similarity from the webpage text content fingerprint to obtain the similarity of the webpage text content;
Outputting the similarity of the webpage text content.
In one embodiment, when executing the computer program to implement the step of crawling several pieces of webpage text content that need deduplication, the processor specifically implements the following steps:
Allocating URL addresses;
Crawling URLs according to the URL addresses to obtain a URL to be crawled;
Judging whether the URL to be crawled has already been crawled;
If so, returning to the crawling of URLs according to the URL addresses to obtain a URL to be crawled;
If not, crawling the webpage text content at the URL to be crawled.
In one embodiment, when executing the computer program to implement the step of preprocessing the several pieces of webpage text content to obtain text content to be deduplicated, the processor specifically implements the following steps:
Parsing and cleaning the several pieces of webpage text content to obtain intermediate text content;
Performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In one embodiment, when executing the computer program to implement the step of extracting feature keywords from the text content to be deduplicated to obtain target feature keywords, the processor specifically implements the following steps:
Dividing the text content to be deduplicated into blocks by position to obtain text blocks;
Extracting feature keywords from the text blocks to obtain initial feature keywords;
Performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
Merging the intermediate feature keywords with the initial feature keywords to obtain the target feature keywords.
In one embodiment, when executing the computer program to implement the step of signing the target feature keywords to obtain a feature signature, the processor specifically implements the following steps:
Calculating feature hash values from the target feature keywords to obtain feature vectors;
Integrating the feature vectors with the target feature keywords to form the feature signature.
In one embodiment, when executing the computer program to implement the step of forming a webpage text content fingerprint from the feature signature, the processor specifically implements the following steps:
Performing a weighted calculation on each dimension of the feature vectors in the feature signature to obtain a target vector;
Setting to one each position of the target vector whose value is positive and setting to zero each position whose value is non-positive, to obtain the webpage text content fingerprint.
In one embodiment, when executing the computer program to implement the step of calculating similarity from the webpage text content fingerprint to obtain the similarity of the webpage text content, the processor specifically implements the following steps:
Building a web page fingerprint inverted list from the webpage text content;
Obtaining the document numbers that appear in the web page fingerprint inverted list;
Performing an intersection calculation on those documents to obtain the similarity of the webpage text content.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disc, or any other computer-readable storage medium capable of storing program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
The steps of the methods in the embodiments of the present invention may be reordered, merged and deleted according to actual needs. The units of the devices in the embodiments of the present invention may be combined, divided and deleted according to actual needs. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a storage medium. On this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present invention.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and such modifications or substitutions shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.
Claims (10)
1. A quick text content deduplication method, characterized by comprising:
Crawling several pieces of webpage text content that need deduplication;
Preprocessing the several pieces of webpage text content to obtain text content to be deduplicated;
Extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
Performing weight calculation on the target feature keywords to obtain weight values;
Signing the target feature keywords to obtain a feature signature;
Forming a webpage text content fingerprint from the feature signature;
Storing the webpage text content fingerprint in an inverted index;
Calculating similarity from the webpage text content fingerprint to obtain the similarity of the webpage text content;
Outputting the similarity of the webpage text content.
2. The quick text content deduplication method according to claim 1, characterized in that crawling several pieces of webpage text content that need deduplication comprises:
Allocating URL addresses;
Crawling URLs according to the URL addresses to obtain a URL to be crawled;
Judging whether the URL to be crawled has already been crawled;
If so, returning to the crawling of URLs according to the URL addresses to obtain a URL to be crawled;
If not, crawling the webpage text content at the URL to be crawled.
3. The quick text content deduplication method according to claim 1, characterized in that preprocessing the several pieces of webpage text content to obtain text content to be deduplicated comprises:
Parsing and cleaning the several pieces of webpage text content to obtain intermediate text content;
Performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
4. The quick text content deduplication method according to claim 1, characterized in that extracting feature keywords from the text content to be deduplicated to obtain target feature keywords comprises:
Dividing the text content to be deduplicated into blocks by position to obtain text blocks;
Extracting feature keywords from the text blocks to obtain initial feature keywords;
Performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
Merging the intermediate feature keywords with the initial feature keywords to obtain the target feature keywords.
5. The quick text content deduplication method according to claim 1, characterized in that signing the target feature keywords to obtain a feature signature comprises:
Calculating feature hash values from the target feature keywords to obtain feature vectors;
Integrating the feature vectors with the target feature keywords to form the feature signature.
6. The quick text content deduplication method according to claim 1, characterized in that forming a webpage text content fingerprint from the feature signature comprises:
Performing a weighted calculation on each dimension of the feature vectors in the feature signature to obtain a target vector;
Setting to one each position of the target vector whose value is positive and setting to zero each position whose value is non-positive, to obtain the webpage text content fingerprint.
7. The quick text content deduplication method according to claim 1, characterized in that calculating similarity from the webpage text content fingerprint to obtain the similarity of the webpage text content comprises:
Building a web page fingerprint inverted list from the webpage text content;
Obtaining the document numbers that appear in the web page fingerprint inverted list;
Performing an intersection calculation on those documents to obtain the similarity of the webpage text content.
8. A quick text content deduplication device, characterized by comprising:
Crawling unit, for crawling several pieces of webpage text content that need deduplication;
Preprocessing unit, for preprocessing the several pieces of webpage text content to obtain text content to be deduplicated;
Extraction unit, for extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
Weight calculation unit, for performing weight calculation on the target feature keywords to obtain weight values;
Signature unit, for signing the target feature keywords to obtain a feature signature;
Fingerprint forming unit, for forming a webpage text content fingerprint from the feature signature;
Storage unit, for storing the webpage text content fingerprint in an inverted index;
Similarity calculation unit, for calculating similarity from the webpage text content fingerprint to obtain the similarity of the webpage text content;
Output unit, for outputting the similarity of the webpage text content.
9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory stores a computer program, and the processor implements the method according to any one of claims 1 to 7 when executing the computer program.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910344414.9A CN110309446A (en) | 2019-04-26 | 2019-04-26 | The quick De-weight method of content of text, device, computer equipment and storage medium |
PCT/CN2019/116606 WO2020215667A1 (en) | 2019-04-26 | 2019-11-08 | Text content quick duplicate removal method and apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910344414.9A CN110309446A (en) | 2019-04-26 | 2019-04-26 | The quick De-weight method of content of text, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110309446A true CN110309446A (en) | 2019-10-08 |
Family
ID=68075778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910344414.9A Pending CN110309446A (en) | 2019-04-26 | 2019-04-26 | The quick De-weight method of content of text, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110309446A (en) |
WO (1) | WO2020215667A1 (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909019A (en) * | 2019-11-14 | 2020-03-24 | 湖南赛吉智慧城市建设管理有限公司 | Big data duplicate checking method and device, computer equipment and storage medium |
CN110956037A (en) * | 2019-10-16 | 2020-04-03 | 厦门美柚股份有限公司 | Multimedia content repeated judgment method and device |
CN110955751A (en) * | 2019-11-13 | 2020-04-03 | 广州供电局有限公司 | Method, device and system for removing duplication of work ticket text and computer storage medium |
CN111027282A (en) * | 2019-11-21 | 2020-04-17 | 精硕科技(北京)股份有限公司 | Text duplicate removal method and device, electronic equipment and computer readable storage medium |
CN111061934A (en) * | 2019-11-27 | 2020-04-24 | 西安四叶草信息技术有限公司 | Fingerprint identification method, equipment and storage medium |
CN111428180A (en) * | 2020-03-20 | 2020-07-17 | 名创优品(横琴)企业管理有限公司 | Webpage duplicate removal method, device and equipment |
CN111507260A (en) * | 2020-04-17 | 2020-08-07 | 重庆邮电大学 | Video similarity rapid detection method and detection device |
WO2020215667A1 (en) * | 2019-04-26 | 2020-10-29 | 深圳市赛为智能股份有限公司 | Text content quick duplicate removal method and apparatus, computer device, and storage medium |
CN111913912A (en) * | 2020-07-16 | 2020-11-10 | 北京字节跳动网络技术有限公司 | File processing method, file matching device, electronic equipment and medium |
CN113051907A (en) * | 2019-12-26 | 2021-06-29 | 深圳市北科瑞声科技股份有限公司 | News content duplicate checking method, system and device |
WO2022141860A1 (en) * | 2020-12-31 | 2022-07-07 | 平安科技(深圳)有限公司 | Text deduplication method and apparatus, electronic device, and computer readable storage medium |
CN114741468A (en) * | 2022-03-22 | 2022-07-12 | 平安科技(深圳)有限公司 | Text duplicate removal method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101645082A (en) * | 2009-04-17 | 2010-02-10 | 华中科技大学 | Similar web page duplicate-removing system based on parallel programming mode |
CN102831198A (en) * | 2012-08-07 | 2012-12-19 | 人民搜索网络股份公司 | Similar document identifying device and similar document identifying method based on document signature technology |
WO2016177069A1 (en) * | 2015-07-20 | 2016-11-10 | 中兴通讯股份有限公司 | Management method, device, spam short message monitoring system and computer storage medium |
CN108595517A (en) * | 2018-03-26 | 2018-09-28 | 南京邮电大学 | A kind of extensive document similarity detection method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107025218B (en) * | 2017-04-07 | 2021-03-02 | 腾讯科技(深圳)有限公司 | Text duplicate removal method and device |
CN108563636A (en) * | 2018-04-04 | 2018-09-21 | 广州杰赛科技股份有限公司 | Method, apparatus, equipment and storage medium for extracting text keywords |
CN110309446A (en) * | 2019-04-26 | 2019-10-08 | 深圳市赛为智能股份有限公司 | Method and device for quick deduplication of text content, computer equipment and storage medium |
2019
- 2019-04-26: CN application CN201910344414.9A, published as CN110309446A (active; Pending)
- 2019-11-08: WO application PCT/CN2019/116606, published as WO2020215667A1 (active; Application Filing)
Non-Patent Citations (3)
Title |
---|
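Jiang Xue et al.: "Research on a fast similarity detection algorithm for massive text based on semantic fingerprints", Computer Knowledge and Technology (《电脑知识与技术》) * |
Xue Jian et al.: "Research on a deduplication strategy for massive web page text using semantic similarity", Journal of Chinese Computer Systems (《小型微型计算机系统》) * |
Yan Liang et al.: "Approximate detection algorithm based on web page feature keywords", Science Technology and Engineering (《科学技术与工程》) * |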
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020215667A1 (en) * | 2019-04-26 | 2020-10-29 | 深圳市赛为智能股份有限公司 | Text content quick duplicate removal method and apparatus, computer device, and storage medium |
CN110956037A (en) * | 2019-10-16 | 2020-04-03 | 厦门美柚股份有限公司 | Multimedia content repeated judgment method and device |
CN110956037B (en) * | 2019-10-16 | 2022-07-08 | 厦门美柚股份有限公司 | Multimedia content repeated judgment method and device |
CN110955751A (en) * | 2019-11-13 | 2020-04-03 | 广州供电局有限公司 | Method, device and system for removing duplication of work ticket text and computer storage medium |
CN110909019B (en) * | 2019-11-14 | 2022-04-08 | 湖南赛吉智慧城市建设管理有限公司 | Big data duplicate checking method and device, computer equipment and storage medium |
CN110909019A (en) * | 2019-11-14 | 2020-03-24 | 湖南赛吉智慧城市建设管理有限公司 | Big data duplicate checking method and device, computer equipment and storage medium |
CN111027282A (en) * | 2019-11-21 | 2020-04-17 | 精硕科技(北京)股份有限公司 | Text duplicate removal method and device, electronic equipment and computer readable storage medium |
CN111061934A (en) * | 2019-11-27 | 2020-04-24 | 西安四叶草信息技术有限公司 | Fingerprint identification method, equipment and storage medium |
CN111061934B (en) * | 2019-11-27 | 2023-04-07 | 西安四叶草信息技术有限公司 | Fingerprint identification method, equipment and storage medium |
CN113051907A (en) * | 2019-12-26 | 2021-06-29 | 深圳市北科瑞声科技股份有限公司 | News content duplicate checking method, system and device |
CN113051907B (en) * | 2019-12-26 | 2023-05-12 | 深圳市北科瑞声科技股份有限公司 | Method, system and device for searching duplicate of news content |
CN111428180B (en) * | 2020-03-20 | 2022-02-08 | 创优数字科技(广东)有限公司 | Webpage duplicate removal method, device and equipment |
CN111428180A (en) * | 2020-03-20 | 2020-07-17 | 名创优品(横琴)企业管理有限公司 | Webpage duplicate removal method, device and equipment |
CN111507260B (en) * | 2020-04-17 | 2022-08-05 | 重庆邮电大学 | Video similarity rapid detection method and detection device |
CN111507260A (en) * | 2020-04-17 | 2020-08-07 | 重庆邮电大学 | Video similarity rapid detection method and detection device |
CN111913912A (en) * | 2020-07-16 | 2020-11-10 | 北京字节跳动网络技术有限公司 | File processing method, file matching device, electronic equipment and medium |
WO2022141860A1 (en) * | 2020-12-31 | 2022-07-07 | 平安科技(深圳)有限公司 | Text deduplication method and apparatus, electronic device, and computer readable storage medium |
CN114741468A (en) * | 2022-03-22 | 2022-07-12 | 平安科技(深圳)有限公司 | Text duplicate removal method, device, equipment and storage medium |
CN114741468B (en) * | 2022-03-22 | 2024-03-29 | 平安科技(深圳)有限公司 | Text deduplication method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2020215667A1 (en) | 2020-10-29 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110309446A (en) | Method and device for quick deduplication of text content, computer equipment and storage medium | |
CN106202518B (en) | Short text classification method based on CHI and sub-category association rule algorithm | |
CN103514183B (en) | Information search method and system based on interactive document clustering | |
CN101593200B (en) | Method for classifying Chinese webpages based on keyword frequency analysis | |
US6678681B1 (en) | Information extraction from a database | |
Zong et al. | On assigning place names to geography related web pages | |
US8458198B1 (en) | Document analysis and multi-word term detector | |
CN101694668B (en) | Method and device for confirming web structure similarity | |
US20070294223A1 (en) | Text Categorization Using External Knowledge | |
US8825620B1 (en) | Behavioral word segmentation for use in processing search queries | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
CN101582080A (en) | Web image clustering method based on image and text relevant mining | |
CN106372117B (en) | File classification method and device based on term co-occurrence | |
CN110321466A (en) | Security information duplicate checking method and system based on semantic analysis | |
Berberich et al. | Computing n-gram statistics in MapReduce | |
Roy et al. | Discovering and understanding word level user intent in web search queries | |
US10706030B2 (en) | Utilizing artificial intelligence to integrate data from multiple diverse sources into a data structure | |
US20140365494A1 (en) | Search term clustering | |
CN106708926A (en) | Realization method for analysis model supporting massive long text data classification | |
Paulheim | Machine learning with and for semantic web knowledge graphs | |
Kim et al. | Graph-based fake news detection using a summarization technique | |
CN104881446A (en) | Searching method and searching device | |
Kostakos | Strings and things: A semantic search engine for news quotes using named entity recognition | |
Cousseau et al. | Linking place records using multi-view encoders | |
Yuan et al. | A mathematical information retrieval system based on RankBoost |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191008 |