CN110309446A - Fast deduplication method for text content, device, computer equipment and storage medium - Google Patents

Info

Publication number
CN110309446A
Authority
CN
China
Prior art keywords
content
text
keyword
webpage
webpage text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910344414.9A
Other languages
Chinese (zh)
Inventor
耿伟
王英明
周起如
谷国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial & Commercial College Anhui University Of Technology
Shenzhen Sunwin Intelligent Co Ltd
Original Assignee
Industrial & Commercial College Anhui University Of Technology
Shenzhen Sunwin Intelligent Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial & Commercial College Anhui University Of Technology and Shenzhen Sunwin Intelligent Co Ltd
Priority to CN201910344414.9A
Publication of CN110309446A
Priority to PCT/CN2019/116606 (WO2020215667A1)
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to a fast deduplication method for text content, and to a corresponding device, computer equipment and storage medium. The method includes: crawling several web page text contents that need deduplication; preprocessing the web page text contents to obtain the text content to be deduplicated; extracting feature keywords from the text content to be deduplicated to obtain target feature keywords; performing weight calculation on the target feature keywords to obtain weight values; signing the target feature keywords to obtain a feature signature; forming a web page text content fingerprint from the feature signature; storing the web page text content fingerprint in an inverted index; calculating similarity from the web page text content fingerprints to obtain the similarity of the web page text contents; and outputting the similarity of the web page text contents. The invention effectively meets the performance requirements of real-time deduplication of massive, large-scale data, and improves both accuracy and deduplication performance.

Description

Fast deduplication method for text content, device, computer equipment and storage medium
Technical field
The present invention relates to text content deduplication methods, and more specifically to a fast deduplication method for text content, a device, computer equipment and a storage medium.
Background
With the rapid development of Internet technology, the cost of copying and spreading information has become extremely low. Sharing information over the network brings people great convenience, but it also introduces a large amount of duplicate information. Many duplicate pages arise, on the one hand, from reprints whose text content and structure are completely identical and, on the other hand, from differences such as each website's own layout style, which make the internal form not fully identical. Large numbers of duplicate pages not only add to the burden of users' browsing but also consume substantial resources during information collection, indexing and searching.
Existing large-scale deduplication techniques mainly use the locality-sensitive hashing algorithm, a text-content-based deduplication technique that generates hash signatures through dimensionality reduction and then judges the similarity of text content by the similarity of the signatures. This approach has several shortcomings. Because of the complexity of the Chinese language, existing methods cannot represent text content very accurately. Existing text feature extraction assumes that features are independent of one another, whereas in real environments feature keywords have semantic relations that cannot simply be ignored. Similarity calculation performance is low, so the methods cannot be extended to large-scale mass data environments. And because the semantic context relations between feature keywords are ignored, overall accuracy is low.
It is therefore necessary to design a new method that improves accuracy and deduplication performance and effectively meets the performance requirements of real-time deduplication of massive, large-scale data.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a fast deduplication method for text content, a device, computer equipment and a storage medium.
To achieve this object, the invention adopts the following technical solution: a fast deduplication method for text content, comprising:
crawling several web page text contents that need deduplication;
preprocessing the several web page text contents to obtain the text content to be deduplicated;
extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
performing weight calculation on the target feature keywords to obtain weight values;
signing the target feature keywords to obtain a feature signature;
forming a web page text content fingerprint from the feature signature;
storing the web page text content fingerprint in an inverted index;
calculating similarity from the web page text content fingerprints to obtain the similarity of the web page text contents;
outputting the similarity of the web page text contents.
In a further technical solution, crawling several web page text contents that need deduplication comprises:
distributing URL addresses;
crawling URLs according to the URL addresses to obtain a URL to be crawled;
judging whether the URL to be crawled has already been crawled;
if so, returning to the step of crawling URLs according to the URL addresses to obtain a URL to be crawled;
if not, crawling the web page text content at the URL to be crawled.
In a further technical solution, preprocessing the several web page text contents to obtain the text content to be deduplicated comprises:
parsing and cleaning the several web page text contents to obtain intermediate text content;
performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In a further technical solution, extracting feature keywords from the text content to be deduplicated to obtain target feature keywords comprises:
dividing the text content to be deduplicated into blocks by position to obtain text blocks;
extracting feature keywords from the text blocks to obtain initial feature keywords;
performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
merging the intermediate feature keywords with the initial feature keywords to obtain the target feature keywords.
In a further technical solution, signing the target feature keywords to obtain a feature signature comprises:
calculating feature hash values from the target feature keywords to obtain a feature vector;
integrating the feature vector with the target feature keywords to form the feature signature.
In a further technical solution, forming a web page text content fingerprint from the feature signature comprises:
applying the weight values to each dimension of the feature vector in the feature signature to obtain a target vector;
setting to one the positions of the target vector whose values are positive and setting to zero the positions whose values are non-positive, to obtain the web page text content fingerprint.
In a further technical solution, calculating similarity from the web page text content fingerprints to obtain the similarity of the web page text contents comprises:
building a web page fingerprint inverted list from the web page text contents;
obtaining the number of documents that appear in the web page fingerprint inverted list;
computing the intersection of the documents to obtain the similarity of the web page text contents.
The present invention also provides a fast deduplication device for text content, comprising:
a crawling unit, for crawling several web page text contents that need deduplication;
a preprocessing unit, for preprocessing the several web page text contents to obtain the text content to be deduplicated;
an extraction unit, for extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
a weight calculation unit, for performing weight calculation on the target feature keywords to obtain weight values;
a signature unit, for signing the target feature keywords to obtain a feature signature;
a fingerprint forming unit, for forming a web page text content fingerprint from the feature signature;
a storage unit, for storing the web page text content fingerprint in an inverted index;
a similarity calculation unit, for calculating similarity from the web page text content fingerprints to obtain the similarity of the web page text contents;
an output unit, for outputting the similarity of the web page text contents.
The present invention also provides computer equipment comprising a memory and a processor, wherein a computer program is stored in the memory and the processor implements the above method when executing the computer program.
The present invention also provides a storage medium storing a computer program that implements the above method when executed by a processor.
Compared with the prior art, the invention has the following beneficial effects. Target feature keywords and weights that can represent the web page text content are extracted based on word relations. The web page text content fingerprint is generated from the target feature keywords and weights, which gives a compressed representation and saves storage space and computation time. The fingerprints are stored in an Elasticsearch inverted index data structure, and similarity calculation is converted into Boolean-model retrieval in Elasticsearch. Together these effectively meet the performance requirements of real-time deduplication of massive, large-scale data and improve accuracy and deduplication performance.
The invention is further described below with reference to the drawings and specific embodiments.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the fast text content deduplication method provided by an embodiment of the invention;
Fig. 2 is a schematic flow chart of the fast text content deduplication method provided by an embodiment of the invention;
Fig. 3 is a schematic sub-flow chart of the fast text content deduplication method provided by an embodiment of the invention;
Fig. 4 is a schematic sub-flow chart of the fast text content deduplication method provided by an embodiment of the invention;
Fig. 5 is a schematic sub-flow chart of the fast text content deduplication method provided by an embodiment of the invention;
Fig. 6 is a schematic sub-flow chart of the fast text content deduplication method provided by an embodiment of the invention;
Fig. 7 is a schematic sub-flow chart of the fast text content deduplication method provided by an embodiment of the invention;
Fig. 8 is a schematic sub-flow chart of the fast text content deduplication method provided by an embodiment of the invention;
Fig. 9 is a schematic diagram of the formation of the web page text content fingerprint provided by an embodiment of the invention;
Fig. 10 is a schematic diagram of the formation of the target feature keywords provided by an embodiment of the invention;
Fig. 11 is a schematic structural diagram of the inverted index provided by an embodiment of the invention;
Fig. 12 is a schematic block diagram of the fast text content deduplication device provided by an embodiment of the invention;
Fig. 13 is a schematic block diagram of the computer equipment provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprises" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terms used in this description are for the purpose of describing specific embodiments only and are not intended to limit the invention. As used in this description and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this description and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Referring to Fig. 1 and Fig. 2, Fig. 1 is a schematic diagram of an application scenario of the fast text content deduplication method provided by an embodiment of the invention, and Fig. 2 is a schematic flow chart of the method. The method is applied in a server that exchanges data with a terminal: the server obtains from the terminal several web page text contents that need deduplication, deduplicates them quickly, and outputs the result to the terminal for display.
As shown in Fig. 2, the method includes the following steps S110 to S190.
S110, crawl need several webpage text contents of duplicate removal.
In the present embodiment, webpage text content refers to the text with information shown in webpage.
In one embodiment, referring to Fig. 3, above-mentioned step S110 may include step S111~S114.
S111, the distribution address URL;
S112, URL is crawled according to the address URL, to obtain URL to be crawled;
Whether URL to be crawled described in S113, judgement has crawled;
If so, returning to the step S112;
S114, if it is not, then grabbing wait crawl the webpage text content in URL.
Distributed task dispatching program is with distributing URL (uniform resource locator, Uniform Resource Locator) Crawler application program node is given in location, if URL to be crawled is the URL grabbed, directly abandons, otherwise, passes through crawler Application program node grabs webpage text content.
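The loop formed by steps S112 to S114 can be illustrated with a minimal single-process sketch. It is not the distributed scheduler of the embodiment; the function names `crawl` and `fetch` are hypothetical, and `fetch` is assumed to return the page text together with the outgoing links found on the page.

```python
from collections import deque

def crawl(seed_urls, fetch):
    """Visit each URL at most once.

    `fetch` is a hypothetical callable: url -> (page_text, list_of_links).
    """
    seen = set()              # URLs that have already been crawled
    queue = deque(seed_urls)  # URLs waiting to be crawled (S112)
    pages = {}
    while queue:
        url = queue.popleft()
        if url in seen:       # S113: already crawled, so discard it
            continue
        seen.add(url)
        text, links = fetch(url)  # S114: crawl the web page text content
        pages[url] = text
        queue.extend(links)
    return pages
```

In a distributed deployment the `seen` set would have to be shared between crawler nodes; how the embodiment does this is not stated in the text.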
S120: preprocess the several web page text contents to obtain the text content to be deduplicated.
In this embodiment, the text content to be deduplicated refers to text content that has been cleaned, filtered and segmented into words.
In one embodiment, referring to Fig. 4, step S120 may include steps S121 and S122.
S121: parse and clean the several web page text contents to obtain intermediate text content;
S122: perform word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In this embodiment, intermediate text content refers to the content that remains after unneeded data has been removed.
Parsing and cleaning the web page text content mainly includes removing HTML tags, converting uppercase English letters to lowercase, and converting between simplified and traditional Chinese. Processing the web page text content also involves Chinese word segmentation, which cuts the text content into independent, meaningful words.
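The cleaning steps described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions: `clean_html` and `tokenize` are hypothetical names, simplified/traditional conversion is left out because it needs a conversion table, and the tokenizer is only a placeholder that splits Chinese text character by character, whereas a real system would use a proper Chinese word segmenter.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_html(html):
    """S121 (partial): strip tags, collapse whitespace, lowercase."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text):
    """S122 placeholder: ASCII words and digits, single CJK characters."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text)
```

Note that this extractor also keeps the contents of script and style elements; a production cleaner would drop them.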
S130: extract feature keywords from the text content to be deduplicated to obtain target feature keywords.
In this embodiment, target feature keywords are the words that represent the characteristics and essence of the web page text content. The whole process of forming the target feature keywords is shown in Fig. 10. There are many ways to select features of the text content to be deduplicated, such as shingles and n-grams. Because sequences of several words or character strings do not on their own carry clear semantics in a document and weaken the representation of the content, single keywords are extracted here as the features of the text content to be deduplicated.
In one embodiment, referring to Fig. 5, step S130 may include steps S131 to S134.
S131: divide the text content to be deduplicated into blocks by position to obtain text blocks.
In this embodiment, a text block is content formed by the text at one position of the page.
Dividing the text content to be deduplicated by position yields mainly a meta-information block, a body text block and a title block.
S132: extract feature keywords from the text blocks to obtain initial feature keywords.
In this embodiment, initial feature keywords are the feature words extracted directly from the content of a text block that represent the content of that block.
Specifically, the feature keywords of each text block are extracted according to semantic relations.
S133: perform semantic expansion on the initial feature keywords to obtain intermediate feature keywords.
In this embodiment, intermediate feature keywords are words synonymous with the initial feature keywords.
S134: merge the intermediate feature keywords with the initial feature keywords to obtain the target feature keywords.
In this embodiment, merging the expanded keywords with the initial feature keywords makes the target feature keywords more comprehensive and accurate.
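Steps S131 to S134 can be sketched as follows. The function `target_keywords`, the `extract` callable and the `synonyms` table are all hypothetical stand-ins: the text does not specify the keyword extractor or the source of synonyms, so a trivial extractor and a hand-written synonym dictionary are assumed here.

```python
def target_keywords(blocks, extract, synonyms):
    """Merge per-block keywords with their synonym expansions.

    blocks:   dict mapping block name (e.g. "title", "body") to its text
    extract:  hypothetical callable, text -> iterable of keywords (S132)
    synonyms: hypothetical dict, keyword -> list of synonymous words (S133)
    """
    initial = []
    for block_text in blocks.values():
        for keyword in extract(block_text):
            if keyword not in initial:
                initial.append(keyword)
    # S133: semantic expansion gives the intermediate feature keywords.
    expanded = [s for keyword in initial for s in synonyms.get(keyword, [])]
    # S134: merge, keeping first-seen order and dropping duplicates.
    return list(dict.fromkeys(initial + expanded))
```

For example, with blocks `{"title": "good country"}`, `str.split` as the extractor and `{"good": ["outstanding"]}` as the synonym table, the merged result contains "good", "country" and "outstanding".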
S140: perform weight calculation on the target feature keywords to obtain weight values.
In this embodiment, the weight value reflects factors such as how often a word appears in the text content and where it appears.
In one embodiment, step S140 may include steps S141 and S142.
S141: calculate weights from the frequency and position with which the target feature keywords appear in the web page text content, to obtain several weights.
In this embodiment, the simplest way to calculate the weight of a target feature keyword relies mainly on two indicators, importance and discrimination. The importance, Weight, is based mainly on a variant of the term frequency tf, and is calculated as: Weight = log(1 + log(1 + tf));
The discrimination, Discrimination, is based on the inverse document frequency factor; its formula is [formula omitted in the source text], where N is the total number of documents in the document collection and df is the document frequency.
Finally, the result is normalized by document length; the normalization formula is [formula omitted in the source text],
where b is an adjustment factor whose default value is 0.85.
Taking the above influencing factors together, the final weight formula for a target feature keyword is as follows: [formula omitted in the source text]
S142: sort the several weights to obtain the weight values.
Sorting by weight yields the final set of target feature keywords and weights.
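The weighting and ranking of S141 and S142 can be sketched as below. Only the importance term, log(1 + log(1 + tf)), comes from the text. The discrimination term log(N / df), the pivoted length normalization (1 - b) + b * len / avg_len and the way the three factors are combined are assumptions made for illustration, since the corresponding formulas are not reproduced in the text; the position factor is not modeled at all.

```python
import math

def importance(tf):
    # From the text: Weight = log(1 + log(1 + tf))
    return math.log(1 + math.log(1 + tf))

def discrimination(n_docs, df):
    # ASSUMPTION: a common inverse-document-frequency form, log(N / df)
    return math.log(n_docs / df)

def length_norm(doc_len, avg_len, b=0.85):
    # ASSUMPTION: pivoted length normalization with adjustment factor b
    return (1 - b) + b * doc_len / avg_len

def keyword_weights(tf, df, n_docs, doc_len, avg_len, b=0.85):
    """Return (keyword, weight) pairs sorted by descending weight (S142)."""
    norm = length_norm(doc_len, avg_len, b)
    # ASSUMPTION: the factors are combined as importance * discrimination / norm
    weights = {
        k: importance(tf[k]) * discrimination(n_docs, df[k]) / norm
        for k in tf
    }
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```

Under these assumptions, a word that occurs in every document of the collection receives weight zero, while a rarer word with the same term frequency ranks above it.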
S150: sign the target feature keywords to obtain a feature signature.
In this embodiment, the feature signature refers to the feature hash values generated from the semantic relations between target feature keywords. Specifically, a semantic feature signature algorithm is used to sign the target feature keywords.
In one embodiment, referring to Fig. 6, step S150 may include steps S151 and S152.
S151: calculate feature hash values from the target feature keywords to obtain a feature vector.
In this embodiment, the feature vector refers to the hash values produced by signing the target feature keywords.
In this embodiment, a hash value is a b-dimensional vector, where b is set manually. Web page text content is made up of sentences, and sentences are made up of words; the topic a document expresses is determined jointly by its words and their context. Consider the following two texts: "China is striving to become a country with an outstanding environment" and "China is striving to become a country with a good environment". The two sentences express the same idea, and if they appear in different documents they can be regarded as duplicate content. The features of the two sentences are not exactly the same, however, so the document fingerprints generated by the original locality-sensitive hashing algorithm differ; the documents are then treated as different, which is a misjudgment.
Semantic feature signature algorithm pseudocode:
Input: the feature set of the web document;
Feature = {Feature1, Feature2, ..., Featurei, ..., Featuren}, Featurei = {Featurei1, Featurei2, ..., Featureij, ..., Featurein};
and the corresponding weight set;
Weight = {Weight1, Weight2, ..., Weighti, ..., Weightn}, Weighti = {Weighti1, Weighti2, ..., Weightij, ..., Weightin};
Output: the set of semantic feature hash values of the web document;
HashVal = {HashVal1, HashVal2, ..., HashVali, ..., HashValn}, HashVali = {HashVali1, HashVali2, ..., HashValij, ..., HashValin};
The pseudocode is as follows: [pseudocode omitted in the source text]
Here, sim(Featureij, Featurekl) is the similarity function for judging word features. It is calculated from the semantic relations between concepts: the concepts are organized as a hierarchical semantic dictionary, and semantic similarity is measured by the path distance between words in the hierarchy of the dictionary. When the similarity measure between two word features is less than the set threshold, the hash values of the features are set to the same value. In the original locality-sensitive hashing algorithm, by contrast, different features produce different hash values, which ignores the relations between words.
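Because the pseudocode itself is not reproduced, the idea can only be sketched under assumptions. In the sketch below, `semantic_hashes`, `hash_bits` and the `distance` callable are hypothetical: `distance` stands in for the path-distance measure over a hierarchical semantic dictionary, and MD5 is used merely as a convenient stable hash. Features whose distance to an earlier feature is below the threshold reuse that feature's hash value.

```python
import hashlib

def hash_bits(word, b=32):
    """Stable b-bit hash of a word (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << b) - 1)

def semantic_hashes(features, distance, threshold, b=32):
    """Assign hash values so that semantically close features share one.

    distance: hypothetical callable (word, word) -> float, standing in for
              the path distance in a hierarchical semantic dictionary.
    """
    representatives = []  # features that received a hash of their own
    result = {}
    for feature in features:
        for rep in representatives:
            if distance(feature, rep) < threshold:
                result[feature] = result[rep]  # reuse the neighbour's hash
                break
        else:
            representatives.append(feature)
            result[feature] = hash_bits(feature, b)
    return result
```

With such an assignment, the two example sentences about an "outstanding" and a "good" environment would produce the same set of hash values, provided the dictionary places the two words close together.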
S152: integrate the feature vector with the target feature keywords to form the feature signature.
Compressing the resulting feature signature yields the web page text content fingerprint. This gives a compressed representation and saves storage space and computation time.
S160: form a web page text content fingerprint from the feature signature.
In this embodiment, the web page text content fingerprint is the vector obtained by setting each position to zero or one according to the feature vector.
In one embodiment, referring to Fig. 7, step S160 may include steps S161 and S162.
S161: apply the weight values to each dimension of the feature vector in the feature signature to obtain a target vector;
S162: set to one the positions of the target vector whose values are positive and set to zero the positions whose values are non-positive, to obtain the web page text content fingerprint.
Web page text content, that is, a document, consists of a sequence of character strings, and operating directly on the strings requires a large amount of storage space and computation time. The original text is therefore analyzed and processed: the target feature keywords that can represent the original document are extracted, and a web page text content fingerprint is generated from them by a hash function. Duplicate or near-duplicate documents are found by comparing the fingerprints that represent the web page text contents. When the number of identical fingerprints shared by two documents, or the ratio of identical fingerprints to the total number of fingerprints, reaches a certain threshold, the documents are considered duplicates; otherwise they are not.
In the b-dimensional feature vector V, each dimension is calculated separately: if the corresponding bit of a feature's hash value is 1, the weight of that target feature keyword is added; otherwise the weight is subtracted. After all features have been processed, if the i-th dimension of V is positive, the i-th bit of the b-bit fingerprint is set to 1; otherwise it is set to 0. The result is a vector of 0s and 1s, that is, the web page text content fingerprint, as shown in Fig. 9.
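The add-or-subtract procedure described above can be written directly. This is a minimal sketch in which the fingerprint is held as a Python integer whose bit i corresponds to dimension i of V; the function name `fingerprint` and the dictionary-based inputs are illustrative choices, not part of the text.

```python
def fingerprint(hashes, weights, b=32):
    """Build a b-bit fingerprint from per-keyword hash values and weights.

    hashes:  dict keyword -> b-bit integer hash value
    weights: dict keyword -> weight of that keyword
    """
    v = [0.0] * b
    for keyword, h in hashes.items():
        w = weights[keyword]
        for i in range(b):
            if (h >> i) & 1:
                v[i] += w   # bit i of the hash is 1: add the weight
            else:
                v[i] -= w   # bit i of the hash is 0: subtract the weight
    fp = 0
    for i in range(b):
        if v[i] > 0:        # positive dimension gives 1, non-positive gives 0
            fp |= 1 << i
    return fp
```

As a small check with b = 4: hashes 0b1010 (weight 2.0) and 0b0110 (weight 1.0) give V = [-3, 3, -1, 1] from bit 0 to bit 3, so bits 1 and 3 are set and the fingerprint is 0b1010.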
S170: store the web page text content fingerprint in an inverted index.
The web page fingerprints are stored in an Elasticsearch inverted index data structure, and similarity calculation is converted into Boolean-model retrieval in Elasticsearch. Elasticsearch is a search server based on Lucene; it provides a distributed, multi-user full-text search engine with a RESTful web interface.
Using Elasticsearch, the mapping from web document ID to web page fingerprint is converted into a mapping from web page fingerprint to web document ID and stored, as shown in Fig. 11. There, web page fingerprint 1 points to the IDs of web document 1 and web document 2, and web page fingerprint 2 points to the list of IDs of the web documents that have this target feature keyword. A web document here means web page text content.
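The inversion of the mapping can be shown with an in-memory dictionary. This sketch deliberately does not use Elasticsearch or its API; it only illustrates the data structure of Fig. 11, and the name `invert` is hypothetical.

```python
from collections import defaultdict

def invert(doc_to_fingerprint):
    """Turn {document ID: fingerprint} into {fingerprint: [document IDs]}."""
    index = defaultdict(list)
    for doc_id, fp in doc_to_fingerprint.items():
        index[fp].append(doc_id)
    return dict(index)
```

Looking up a fingerprint in the inverted mapping then returns every document that carries it in a single step, instead of scanning all documents.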
S180: calculate similarity from the web page text content fingerprints to obtain the similarity of the web page text contents.
In this embodiment, the similarity refers to the similarity between two web page text contents.
In one embodiment, referring to Fig. 8, step S180 may include steps S181 to S183.
S181: build a web page fingerprint inverted list from the web page text contents;
S182: obtain the number of documents that appear in the web page fingerprint inverted list;
S183: compute the intersection of the documents to obtain the similarity of the web page text contents.
After a web page text content fingerprint has been calculated for every web page text content, the similarity of two fingerprints is calculated. Similarity is measured by Hamming distance: a web page fingerprint inverted list is built, the documents that appear in it are queried, and their intersection gives the final result. Suppose the fingerprints are 32 bits long and each 32-bit binary signature is divided into 2 blocks of 16 bits, and that all signatures within a Hamming distance of 1 are to be found. By the pigeonhole principle, if the Hamming distance between two web page text content fingerprints is at most 1, at least one of their blocks must be identical. The inverted index can therefore be used to convert similarity calculation into Boolean retrieval, which greatly reduces the time needed to calculate document similarity.
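The block-based lookup can be sketched as follows for 32-bit fingerprints split into two 16-bit blocks. The function names are hypothetical, and the candidate handling is one common reading of the scheme rather than the embodiment's exact procedure: documents sharing either block with the query are collected from the index, and each candidate is then confirmed by computing the actual Hamming distance.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def blocks(fp):
    """Split a 32-bit fingerprint into tagged high and low 16-bit blocks."""
    return (("hi", fp >> 16), ("lo", fp & 0xFFFF))

def build_block_index(doc_to_fp):
    """Inverted index from each block value to the documents containing it."""
    index = defaultdict(set)
    for doc_id, fp in doc_to_fp.items():
        for key in blocks(fp):
            index[key].add(doc_id)
    return index

def near_duplicates(fp, index, doc_to_fp, max_distance=1):
    """Documents whose fingerprint is within max_distance bits of fp."""
    candidates = set()
    for key in blocks(fp):
        candidates |= index.get(key, set())
    return sorted(d for d in candidates
                  if hamming(fp, doc_to_fp[d]) <= max_distance)
```

With two blocks the pigeonhole argument guarantees that no fingerprint within distance 1 is missed; for a larger `max_distance` the fingerprint would have to be split into more blocks.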
S190: output the similarity of the web page text contents.
In this embodiment, the similarity of the web page text contents is output to the terminal for display.
The deduplication effect of this method on web page text content is evaluated using precision (Precision) and recall (Recall).
Precision and Recall each reflect only one aspect of performance and ignore overall performance; the deduplication evaluation value F1 combines the two and is defined as: [formula omitted in the source text]
The running effect of this method was compared with that of simhash, the classic locality-sensitive hashing algorithm; the comparison is shown in Table 1 below:
Table 1. Comparison of algorithm running effect
In terms of running effect, the method shows a considerable improvement in precision and recall over the locality-sensitive hashing algorithm.
The running efficiency of this method was compared with that of simhash, the classic locality-sensitive hashing algorithm; the comparison is shown in Table 2 below:
Table 2. Comparison of algorithm running efficiency
In terms of running efficiency, the method performs better: when the data volume grows substantially, its performance declines more slowly than that of locality-sensitive hashing, so it can be applied in large-scale mass data environments.
In other embodiments, the deduplication evaluation value may also be output in addition to the similarity of the web page text contents.
The quick De-weight method of above-mentioned content of text, by that can be indicated in web page text based on word relation extraction The target signature keyword and weight of appearance generate webpage text content fingerprint based on target signature keyword and weight, realize pressure Contracting indicates, saves memory space and calculates the time, stores web page text based on Elasticsearch inverted index data structure The Boolean Model that similarity calculation is converted to Elasticsearch is retrieved, effectively meets magnanimity large-scale data by user supplied video content using fingerprints Real-time repetition removal process performance demand is realized and improves accuracy rate and duplicate removal performance.
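As a rough illustration of how the similarity calculation might be phrased as Boolean retrieval, the following Python sketch builds an Elasticsearch Query DSL body as a plain dictionary. The field names `block_0` and `block_1`, the hexadecimal encoding of the blocks, and the function names are assumptions made for this example; the text does not specify the index mapping.

```python
def block_terms(fingerprint, bits=32, block_bits=16):
    """Split a fingerprint into hex strings, one per block (highest block first)."""
    mask = (1 << block_bits) - 1
    n_blocks = bits // block_bits
    width = block_bits // 4          # hex digits per block
    return [
        format((fingerprint >> (block_bits * (n_blocks - 1 - i))) & mask,
               f"0{width}x")
        for i in range(n_blocks)
    ]


def build_bool_query(fingerprint):
    """Build a bool query that matches documents sharing at least one block."""
    should = [{"term": {f"block_{i}": value}}
              for i, value in enumerate(block_terms(fingerprint))]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}
```

The documents returned by such a query are only candidates; their full fingerprints would still be compared by Hamming distance before two pages are treated as duplicates.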
Figure 12 is a schematic block diagram of a text content quick duplicate removal device 300 provided by an embodiment of the present invention. As shown in Figure 12, corresponding to the above text content quick duplicate removal method, the present invention also provides a text content quick duplicate removal device 300. The device 300 includes units for executing the above method, and the device may be configured in a server.
Specifically, referring to Figure 12, the text content quick duplicate removal device 300 includes:
A crawling unit 301, configured to crawl a number of web page text contents that need deduplication;
A preprocessing unit 302, configured to preprocess the web page text contents to obtain text content to be deduplicated;
An extraction unit 303, configured to extract feature keywords from the text content to be deduplicated to obtain target feature keywords;
A weight calculation unit 304, configured to perform weight calculation on the target feature keywords to obtain weight values;
A signature unit 305, configured to sign the target feature keywords to obtain feature signatures;
A fingerprint forming unit 306, configured to form a web page text content fingerprint according to the feature signatures;
A storage unit 307, configured to store the web page text content fingerprint in an inverted index;
A similarity calculation unit 308, configured to calculate similarity according to the web page text content fingerprints to obtain the similarity of the web page text contents;
An output unit 309, configured to output the similarity of the web page text contents.
In one embodiment, the crawling unit 301 includes:
An address allocation subunit, configured to allocate URL addresses;
A crawling subunit, configured to crawl URLs according to the URL addresses to obtain a URL to be crawled;
A crawl judgment subunit, configured to judge whether the URL to be crawled has already been crawled and, if so, to return to crawling URLs according to the URL addresses to obtain a URL to be crawled;
A content crawling subunit, configured to crawl the web page text content at the URL to be crawled if it has not been crawled.
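The subunits above amount to a crawl loop that skips URLs it has already visited. A minimal Python sketch follows, assuming a caller-supplied `fetch` function (hypothetical, not part of the text) that returns the page text and the outgoing links of a URL; the breadth-first order and the `max_pages` limit are also choices made only for this example.

```python
from collections import deque


def crawl(seed_urls, fetch, max_pages=100):
    """Crawl pages breadth-first, skipping URLs that were already crawled.

    `fetch` is a caller-supplied function that takes a URL and returns
    (page_text, list_of_outgoing_urls).
    """
    queue = deque(seed_urls)       # allocated URL addresses
    crawled = set()                # URLs that have already been crawled
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()      # URL to be crawled
        if url in crawled:         # already crawled: go back and take the next URL
            continue
        crawled.add(url)
        text, links = fetch(url)   # crawl the web page text content
        pages[url] = text
        queue.extend(links)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch free of network code, so the visited-set logic can be exercised with an in-memory stand-in.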
In one embodiment, the preprocessing unit 302 includes:
A cleaning subunit, configured to parse and clean the web page text contents to obtain intermediate text content;
A word segmentation subunit, configured to perform word segmentation on the intermediate text content to obtain the text content to be deduplicated.
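The two preprocessing subunits can be illustrated with the Python standard library. This is a sketch under stated assumptions: the function names `clean` and `segment` are hypothetical, and the regular-expression tokenizer is only a stand-in for word segmentation, since Chinese text would need a dedicated segmenter that is not shown here.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean(html):
    """Parse and clean a page into intermediate text content."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def segment(text):
    """Stand-in for word segmentation: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())
```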
In one embodiment, the extraction unit 303 includes:
A blocking subunit, configured to divide the text content to be deduplicated into blocks by position to obtain text blocks;
An extraction subunit, configured to extract feature keywords from the text blocks to obtain initial feature keywords;
An expansion subunit, configured to perform semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
A merging subunit, configured to merge the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
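The block, extract, expand and merge sequence can be sketched as follows. Everything specific here is an assumption for illustration: the fixed block size, picking the most frequent tokens of each block as initial keywords, and the small `SYNONYMS` table standing in for whatever semantic resource the expansion step actually uses.

```python
from collections import Counter

# Hypothetical synonym table used for semantic expansion in this sketch.
SYNONYMS = {"car": ["automobile"], "quick": ["fast"]}


def split_blocks(tokens, block_size=50):
    """Divide the token sequence into blocks by position."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def extract_initial(blocks, top_k=3):
    """Take the top_k most frequent tokens of each block as initial keywords."""
    keywords = []
    for block in blocks:
        for word, _ in Counter(block).most_common(top_k):
            if word not in keywords:
                keywords.append(word)
    return keywords


def expand(keywords, synonyms=SYNONYMS):
    """Semantic expansion: look up related words for each initial keyword."""
    return [s for k in keywords for s in synonyms.get(k, [])]


def target_keywords(tokens, block_size=50, top_k=3):
    """Merge initial and expanded keywords, keeping order and dropping duplicates."""
    initial = extract_initial(split_blocks(tokens, block_size), top_k)
    intermediate = expand(initial)
    return list(dict.fromkeys(initial + intermediate))
```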
In one embodiment, the weight calculation unit 304 includes:
A weight obtaining subunit, configured to calculate weights according to the frequency and position with which the target feature keywords occur in the web page text content, to obtain several weights;
A sorting subunit, configured to sort the weights to obtain the weight values.
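One way to picture a weight that depends on both frequency and position is sketched below. The concrete rule is an assumption made for this example, since the text does not give the formula: occurrences among the first `title_len` tokens count `title_boost` times, and the total is normalized by the document length.

```python
def keyword_weights(tokens, keywords, title_len=10, title_boost=2.0):
    """Return (keyword, weight) pairs sorted from highest to lowest weight.

    Weight = position-boosted term frequency divided by document length.
    The boost for leading tokens is an assumed position rule.
    """
    weights = {}
    for kw in keywords:
        score = 0.0
        for pos, tok in enumerate(tokens):
            if tok == kw:
                score += title_boost if pos < title_len else 1.0
        weights[kw] = score / len(tokens) if tokens else 0.0
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```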
In one embodiment, the signature unit 305 includes:
A vector obtaining subunit, configured to compute feature hash values from the target feature keywords to obtain feature vectors;
An integration subunit, configured to integrate the feature vectors and the target feature keywords to form the feature signatures.
In one embodiment, the fingerprint forming unit 306 includes:
A target vector forming subunit, configured to perform a weighted calculation on each dimension of the feature vectors in the feature signatures to obtain a target vector;
A setting subunit, configured to set the bit corresponding to each positive component of the target vector to one and the bit corresponding to each non-positive component of the target vector to zero, to obtain the web page text content fingerprint.
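The signature and fingerprint forming subunits can be pictured with a simhash-style sketch in Python. Several details are assumptions rather than statements of the text: MD5 is an arbitrary choice of hash, a 1 bit of a keyword hash adds the keyword weight to that dimension while a 0 bit subtracts it, and the names `feature_hash` and `fingerprint` are hypothetical.

```python
import hashlib

BITS = 32


def feature_hash(keyword, bits=BITS):
    """Hash a keyword to a `bits`-bit integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")


def fingerprint(weighted_keywords, bits=BITS):
    """Combine weighted keyword hashes into one `bits`-bit fingerprint."""
    target = [0.0] * bits
    for keyword, weight in weighted_keywords:
        h = feature_hash(keyword, bits)
        for i in range(bits):
            # A 1 bit adds the weight to that dimension, a 0 bit subtracts it.
            target[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i, value in enumerate(target):
        if value > 0:              # positive component: bit set to one
            fp |= 1 << i
    return fp                      # non-positive components stay zero
```

With a single keyword the fingerprint equals that keyword's hash, and a keyword whose weight exceeds the sum of all other weights determines every bit, which shows how heavily weighted keywords dominate the result.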
In one embodiment, the similarity calculation unit 308 includes:
An establishing subunit, configured to establish a web page fingerprint inverted list according to the web page text contents;
A document number obtaining subunit, configured to obtain the document numbers that appear in the web page fingerprint inverted list;
An intersection calculation subunit, configured to intersect the document numbers to obtain the similarity of the web page text contents.
It should be noted that, as those skilled in the art will clearly understand, for the specific implementation of the above text content quick duplicate removal device 300 and each of its units, reference may be made to the corresponding description in the foregoing method embodiment; for convenience and brevity of description, details are not repeated here.
The above text content quick duplicate removal device 300 may be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Figure 13.
Referring to Figure 13, Figure 13 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server.
Referring to Figure 13, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions that, when executed, cause the processor 502 to execute a text content quick duplicate removal method.
The processor 502 provides computing and control capabilities to support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it causes the processor 502 to execute a text content quick duplicate removal method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Figure 13 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
Crawling a number of web page text contents that need deduplication;
Preprocessing the web page text contents to obtain text content to be deduplicated;
Extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
Performing weight calculation on the target feature keywords to obtain weight values;
Signing the target feature keywords to obtain feature signatures;
Forming a web page text content fingerprint according to the feature signatures;
Storing the web page text content fingerprint in an inverted index;
Calculating similarity according to the web page text content fingerprints to obtain the similarity of the web page text contents;
Outputting the similarity of the web page text contents.
In one embodiment, when implementing the step of crawling a number of web page text contents that need deduplication, the processor 502 specifically implements the following steps:
Allocating URL addresses;
Crawling URLs according to the URL addresses to obtain a URL to be crawled;
Judging whether the URL to be crawled has already been crawled;
If so, returning to the step of crawling URLs according to the URL addresses to obtain a URL to be crawled;
If not, crawling the web page text content at the URL to be crawled.
In one embodiment, when implementing the step of preprocessing the web page text contents to obtain text content to be deduplicated, the processor 502 specifically implements the following steps:
Parsing and cleaning the web page text contents to obtain intermediate text content;
Performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In one embodiment, when implementing the step of extracting feature keywords from the text content to be deduplicated to obtain target feature keywords, the processor 502 specifically implements the following steps:
Dividing the text content to be deduplicated into blocks by position to obtain text blocks;
Extracting feature keywords from the text blocks to obtain initial feature keywords;
Performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
Merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
In one embodiment, when implementing the step of signing the target feature keywords to obtain feature signatures, the processor 502 specifically implements the following steps:
Computing feature hash values from the target feature keywords to obtain feature vectors;
Integrating the feature vectors and the target feature keywords to form the feature signatures.
In one embodiment, when implementing the step of forming a web page text content fingerprint according to the feature signatures, the processor 502 specifically implements the following steps:
Performing a weighted calculation on each dimension of the feature vectors in the feature signatures to obtain a target vector;
Setting the bit corresponding to each positive component of the target vector to one and the bit corresponding to each non-positive component of the target vector to zero, to obtain the web page text content fingerprint.
In one embodiment, when implementing the step of calculating similarity according to the web page text content fingerprints to obtain the similarity of the web page text contents, the processor 502 specifically implements the following steps:
Establishing a web page fingerprint inverted list according to the web page text contents;
Obtaining the document numbers that appear in the web page fingerprint inverted list;
Intersecting the document numbers to obtain the similarity of the web page text contents.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program includes program instructions and may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Therefore, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to execute the following steps:
Crawling a number of web page text contents that need deduplication;
Preprocessing the web page text contents to obtain text content to be deduplicated;
Extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
Performing weight calculation on the target feature keywords to obtain weight values;
Signing the target feature keywords to obtain feature signatures;
Forming a web page text content fingerprint according to the feature signatures;
Storing the web page text content fingerprint in an inverted index;
Calculating similarity according to the web page text content fingerprints to obtain the similarity of the web page text contents;
Outputting the similarity of the web page text contents.
In one embodiment, when executing the computer program to implement the step of crawling a number of web page text contents that need deduplication, the processor specifically implements the following steps:
Allocating URL addresses;
Crawling URLs according to the URL addresses to obtain a URL to be crawled;
Judging whether the URL to be crawled has already been crawled;
If so, returning to the step of crawling URLs according to the URL addresses to obtain a URL to be crawled;
If not, crawling the web page text content at the URL to be crawled.
In one embodiment, when executing the computer program to implement the step of preprocessing the web page text contents to obtain text content to be deduplicated, the processor specifically implements the following steps:
Parsing and cleaning the web page text contents to obtain intermediate text content;
Performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In one embodiment, when executing the computer program to implement the step of extracting feature keywords from the text content to be deduplicated to obtain target feature keywords, the processor specifically implements the following steps:
Dividing the text content to be deduplicated into blocks by position to obtain text blocks;
Extracting feature keywords from the text blocks to obtain initial feature keywords;
Performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
Merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
In one embodiment, when executing the computer program to implement the step of signing the target feature keywords to obtain feature signatures, the processor specifically implements the following steps:
Computing feature hash values from the target feature keywords to obtain feature vectors;
Integrating the feature vectors and the target feature keywords to form the feature signatures.
In one embodiment, when executing the computer program to implement the step of forming a web page text content fingerprint according to the feature signatures, the processor specifically implements the following steps:
Performing a weighted calculation on each dimension of the feature vectors in the feature signatures to obtain a target vector;
Setting the bit corresponding to each positive component of the target vector to one and the bit corresponding to each non-positive component of the target vector to zero, to obtain the web page text content fingerprint.
In one embodiment, when executing the computer program to implement the step of calculating similarity according to the web page text content fingerprints to obtain the similarity of the web page text contents, the processor specifically implements the following steps:
Establishing a web page fingerprint inverted list according to the web page text contents;
Obtaining the document numbers that appear in the web page fingerprint inverted list;
Intersecting the document numbers to obtain the similarity of the web page text contents.
The storage medium may be any computer-readable storage medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may implement the described functions differently for each specific application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
The steps in the methods of the embodiments of the present invention may be reordered, merged, or deleted according to actual needs. The units in the devices of the embodiments of the present invention may be combined, divided, or deleted according to actual needs. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a terminal, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A text content quick duplicate removal method, characterized by comprising:
Crawling a number of web page text contents that need deduplication;
Preprocessing the web page text contents to obtain text content to be deduplicated;
Extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
Performing weight calculation on the target feature keywords to obtain weight values;
Signing the target feature keywords to obtain feature signatures;
Forming a web page text content fingerprint according to the feature signatures;
Storing the web page text content fingerprint in an inverted index;
Calculating similarity according to the web page text content fingerprints to obtain the similarity of the web page text contents;
Outputting the similarity of the web page text contents.
2. The text content quick duplicate removal method according to claim 1, characterized in that crawling a number of web page text contents that need deduplication comprises:
Allocating URL addresses;
Crawling URLs according to the URL addresses to obtain a URL to be crawled;
Judging whether the URL to be crawled has already been crawled;
If so, returning to the step of crawling URLs according to the URL addresses to obtain a URL to be crawled;
If not, crawling the web page text content at the URL to be crawled.
3. The text content quick duplicate removal method according to claim 1, characterized in that preprocessing the web page text contents to obtain text content to be deduplicated comprises:
Parsing and cleaning the web page text contents to obtain intermediate text content;
Performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
4. The text content quick duplicate removal method according to claim 1, characterized in that extracting feature keywords from the text content to be deduplicated to obtain target feature keywords comprises:
Dividing the text content to be deduplicated into blocks by position to obtain text blocks;
Extracting feature keywords from the text blocks to obtain initial feature keywords;
Performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
Merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
5. The text content quick duplicate removal method according to claim 1, characterized in that signing the target feature keywords to obtain feature signatures comprises:
Computing feature hash values from the target feature keywords to obtain feature vectors;
Integrating the feature vectors and the target feature keywords to form the feature signatures.
6. The text content quick duplicate removal method according to claim 1, characterized in that forming a web page text content fingerprint according to the feature signatures comprises:
Performing a weighted calculation on each dimension of the feature vectors in the feature signatures to obtain a target vector;
Setting the bit corresponding to each positive component of the target vector to one and the bit corresponding to each non-positive component of the target vector to zero, to obtain the web page text content fingerprint.
7. The text content quick duplicate removal method according to claim 1, characterized in that calculating similarity according to the web page text content fingerprints to obtain the similarity of the web page text contents comprises:
Establishing a web page fingerprint inverted list according to the web page text contents;
Obtaining the document numbers that appear in the web page fingerprint inverted list;
Intersecting the document numbers to obtain the similarity of the web page text contents.
8. A text content quick duplicate removal device, characterized by comprising:
A crawling unit, configured to crawl a number of web page text contents that need deduplication;
A preprocessing unit, configured to preprocess the web page text contents to obtain text content to be deduplicated;
An extraction unit, configured to extract feature keywords from the text content to be deduplicated to obtain target feature keywords;
A weight calculation unit, configured to perform weight calculation on the target feature keywords to obtain weight values;
A signature unit, configured to sign the target feature keywords to obtain feature signatures;
A fingerprint forming unit, configured to form a web page text content fingerprint according to the feature signatures;
A storage unit, configured to store the web page text content fingerprint in an inverted index;
A similarity calculation unit, configured to calculate similarity according to the web page text content fingerprints to obtain the similarity of the web page text contents;
An output unit, configured to output the similarity of the web page text contents.
9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory stores a computer program, and the processor implements the method according to any one of claims 1 to 7 when executing the computer program.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN201910344414.9A 2019-04-26 2019-04-26 Text content quick duplicate removal method and apparatus, computer device, and storage medium Pending CN110309446A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910344414.9A CN110309446A (en) 2019-04-26 2019-04-26 Text content quick duplicate removal method and apparatus, computer device, and storage medium
PCT/CN2019/116606 WO2020215667A1 (en) 2019-04-26 2019-11-08 Text content quick duplicate removal method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910344414.9A CN110309446A (en) 2019-04-26 2019-04-26 Text content quick duplicate removal method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
CN110309446A true CN110309446A (en) 2019-10-08

Family

ID=68075778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910344414.9A Pending CN110309446A (en) 2019-04-26 2019-04-26 Text content quick duplicate removal method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110309446A (en)
WO (1) WO2020215667A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909019A (en) * 2019-11-14 2020-03-24 湖南赛吉智慧城市建设管理有限公司 Big data duplicate checking method and device, computer equipment and storage medium
CN110956037A (en) * 2019-10-16 2020-04-03 厦门美柚股份有限公司 Multimedia content repeated judgment method and device
CN110955751A (en) * 2019-11-13 2020-04-03 广州供电局有限公司 Method, device and system for removing duplication of work ticket text and computer storage medium
CN111027282A (en) * 2019-11-21 2020-04-17 精硕科技(北京)股份有限公司 Text duplicate removal method and device, electronic equipment and computer readable storage medium
CN111061934A (en) * 2019-11-27 2020-04-24 西安四叶草信息技术有限公司 Fingerprint identification method, equipment and storage medium
CN111428180A (en) * 2020-03-20 2020-07-17 名创优品(横琴)企业管理有限公司 Webpage duplicate removal method, device and equipment
CN111507260A (en) * 2020-04-17 2020-08-07 重庆邮电大学 Video similarity rapid detection method and detection device
WO2020215667A1 (en) * 2019-04-26 2020-10-29 深圳市赛为智能股份有限公司 Text content quick duplicate removal method and apparatus, computer device, and storage medium
CN111913912A (en) * 2020-07-16 2020-11-10 北京字节跳动网络技术有限公司 File processing method, file matching device, electronic equipment and medium
CN113051907A (en) * 2019-12-26 2021-06-29 深圳市北科瑞声科技股份有限公司 News content duplicate checking method, system and device
WO2022141860A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Text deduplication method and apparatus, electronic device, and computer readable storage medium
CN114741468A (en) * 2022-03-22 2022-07-12 平安科技(深圳)有限公司 Text duplicate removal method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645082A (en) * 2009-04-17 2010-02-10 华中科技大学 Similar web page duplicate-removing system based on parallel programming mode
CN102831198A (en) * 2012-08-07 2012-12-19 人民搜索网络股份公司 Similar document identifying device and similar document identifying method based on document signature technology
WO2016177069A1 (en) * 2015-07-20 2016-11-10 中兴通讯股份有限公司 Management method, device, spam short message monitoring system and computer storage medium
CN108595517A (en) * 2018-03-26 2018-09-28 南京邮电大学 A kind of extensive document similarity detection method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025218B (en) * 2017-04-07 2021-03-02 腾讯科技(深圳)有限公司 Text duplicate removal method and device
CN108563636A (en) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Extract method, apparatus, equipment and the storage medium of text key word
CN110309446A (en) * 2019-04-26 2019-10-08 深圳市赛为智能股份有限公司 The quick De-weight method of content of text, device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
姜雪等: "基于语义指纹的海量文本快速相似检测算法研究", 《电脑知识与技术》 *
薛剑等: "应用语义相似的海量网页文本去重策略研究", 《小型微型计算机系统》 *
闫亮等: "基于网页特征关键词的近似检测算法", 《科学技术与工程》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020215667A1 (en) * 2019-04-26 2020-10-29 深圳市赛为智能股份有限公司 Text content quick duplicate removal method and apparatus, computer device, and storage medium
CN110956037A (en) * 2019-10-16 2020-04-03 厦门美柚股份有限公司 Multimedia content repeated judgment method and device
CN110956037B (en) * 2019-10-16 2022-07-08 厦门美柚股份有限公司 Multimedia content repeated judgment method and device
CN110955751A (en) * 2019-11-13 2020-04-03 广州供电局有限公司 Method, device and system for removing duplication of work ticket text and computer storage medium
CN110909019B (en) * 2019-11-14 2022-04-08 湖南赛吉智慧城市建设管理有限公司 Big data duplicate checking method and device, computer equipment and storage medium
CN110909019A (en) * 2019-11-14 2020-03-24 湖南赛吉智慧城市建设管理有限公司 Big data duplicate checking method and device, computer equipment and storage medium
CN111027282A (en) * 2019-11-21 2020-04-17 精硕科技(北京)股份有限公司 Text duplicate removal method and device, electronic equipment and computer readable storage medium
CN111061934A (en) * 2019-11-27 2020-04-24 西安四叶草信息技术有限公司 Fingerprint identification method, equipment and storage medium
CN111061934B (en) * 2019-11-27 2023-04-07 西安四叶草信息技术有限公司 Fingerprint identification method, equipment and storage medium
CN113051907A (en) * 2019-12-26 2021-06-29 深圳市北科瑞声科技股份有限公司 News content duplicate checking method, system and device
CN113051907B (en) * 2019-12-26 2023-05-12 深圳市北科瑞声科技股份有限公司 Method, system and device for searching duplicate of news content
CN111428180B (en) * 2020-03-20 2022-02-08 创优数字科技(广东)有限公司 Webpage duplicate removal method, device and equipment
CN111428180A (en) * 2020-03-20 2020-07-17 名创优品(横琴)企业管理有限公司 Webpage duplicate removal method, device and equipment
CN111507260B (en) * 2020-04-17 2022-08-05 重庆邮电大学 Video similarity rapid detection method and detection device
CN111507260A (en) * 2020-04-17 2020-08-07 重庆邮电大学 Video similarity rapid detection method and detection device
CN111913912A (en) * 2020-07-16 2020-11-10 北京字节跳动网络技术有限公司 File processing method, file matching device, electronic equipment and medium
WO2022141860A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Text deduplication method and apparatus, electronic device, and computer readable storage medium
CN114741468A (en) * 2022-03-22 2022-07-12 平安科技(深圳)有限公司 Text duplicate removal method, device, equipment and storage medium
CN114741468B (en) * 2022-03-22 2024-03-29 平安科技(深圳)有限公司 Text deduplication method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2020215667A1 (en) 2020-10-29

Similar Documents

Publication Publication Date Title
CN110309446A (en) Method and device for quick deduplication of text content, computer equipment and storage medium
CN106202518B (en) Short text classification method based on CHI and sub-category association rule algorithm
CN103514183B (en) Information search method and system based on interactive document clustering
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
US6678681B1 (en) Information extraction from a database
Zong et al. On assigning place names to geography related web pages
US8458198B1 (en) Document analysis and multi-word term detector
CN101694668B (en) Method and device for confirming web structure similarity
US20070294223A1 (en) Text Categorization Using External Knowledge
US8825620B1 (en) Behavioral word segmentation for use in processing search queries
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN101582080A (en) Web image clustering method based on image and text relevant mining
CN106372117B (en) File classification method and device based on term co-occurrence
CN110321466A (en) Security information duplicate checking method and system based on semantic analysis
Berberich et al. Computing n-gram statistics in MapReduce
Roy et al. Discovering and understanding word level user intent in web search queries
US10706030B2 (en) Utilizing artificial intelligence to integrate data from multiple diverse sources into a data structure
US20140365494A1 (en) Search term clustering
CN106708926A (en) Realization method for analysis model supporting massive long text data classification
Paulheim Machine learning with and for semantic web knowledge graphs
Kim et al. Graph-based fake news detection using a summarization technique
CN104881446A (en) Searching method and searching device
Kostakos Strings and things: A semantic search engine for news quotes using named entity recognition
Cousseau et al. Linking place records using multi-view encoders
Yuan et al. A mathematical information retrieval system based on RankBoost

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191008