CN110309446A

CN110309446A - Rapid text content deduplication method, device, computer equipment and storage medium

Info

Publication number: CN110309446A
Application number: CN201910344414.9A
Authority: CN
Inventors: 耿伟; 王英明; 周起如; 谷国栋
Original assignee: Industrial & Commercial College Anhui University Of Technology; Shenzhen Sunwin Intelligent Co Ltd
Current assignee: Industrial & Commercial College Anhui University Of Technology; Shenzhen Sunwin Intelligent Co Ltd
Priority date: 2019-04-26
Filing date: 2019-04-26
Publication date: 2019-10-08
Also published as: WO2020215667A1

Abstract

The present invention relates to a rapid text content deduplication method, device, computer equipment and storage medium. The method includes: crawling several web page text contents that need deduplication; preprocessing the web page text contents to obtain the text content to be deduplicated; extracting feature keywords from the text content to be deduplicated to obtain target feature keywords; calculating weights for the target feature keywords to obtain weight values; signing the target feature keywords to obtain a feature signature; forming a web page text content fingerprint according to the feature signature; storing the web page text content fingerprint in an inverted index; calculating similarity according to the web page text content fingerprint to obtain the similarity of the web page text contents; and outputting the similarity of the web page text contents. The invention effectively meets the performance requirements of real-time deduplication of massive, large-scale data and improves accuracy and deduplication performance.

Description

Rapid text content deduplication method, device, computer equipment and storage medium

Technical field

The present invention relates to text content deduplication methods, and more specifically to a rapid text content deduplication method, device, computer equipment and storage medium.

Background technique

With the rapid development of Internet technology, the cost of copying and spreading information has become extremely low. Sharing information over the network brings great convenience, but it also introduces a large amount of duplicate information. Many duplicate pages are reprints whose text content and structure are completely identical; others differ in internal form because of differences such as the layout style of each website. A large number of duplicate web pages not only adds to the burden of browsing for users but also consumes considerable resources during information collection, indexing and search.

Existing large-scale deduplication techniques mainly use locality-sensitive hashing, a text-content-based deduplication technique that generates hash signatures through dimensionality reduction and then judges the similarity of text contents by the similarity of the signatures. These methods have several shortcomings. Because of the complexity of the Chinese language, they cannot represent text content very accurately: existing text feature extraction assumes that features are mutually independent, whereas in real environments feature keywords have semantic relations that cannot simply be ignored. Similarity calculation performance is low, so the methods cannot be extended to large-scale mass data environments. Because the semantic context relations between feature keywords are ignored, overall accuracy is low.

Therefore, it is necessary to design a new method that improves accuracy and deduplication performance and effectively meets the performance requirements of real-time deduplication of massive, large-scale data.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art and to provide a rapid text content deduplication method, device, computer equipment and storage medium.

To achieve the above object, the present invention adopts the following technical solution: a rapid text content deduplication method, comprising:

Crawling several web page text contents that need deduplication;

Preprocessing the several web page text contents to obtain the text content to be deduplicated;

Extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;

Calculating weights for the target feature keywords to obtain weight values;

Signing the target feature keywords to obtain a feature signature;

Forming a web page text content fingerprint according to the feature signature;

Storing the web page text content fingerprint in an inverted index;

Calculating similarity according to the web page text content fingerprint to obtain the similarity of the web page text contents;

Outputting the similarity of the web page text contents.

In a further technical solution, crawling several web page text contents that need deduplication comprises:

Distributing URL addresses;

Crawling URLs according to the URL addresses to obtain a URL to be crawled;

Judging whether the URL to be crawled has already been crawled;

If so, returning to the step of crawling URLs according to the URL addresses to obtain a URL to be crawled;

If not, grabbing the web page text content at the URL to be crawled.

In a further technical solution, preprocessing the several web page text contents to obtain the text content to be deduplicated comprises:

Parsing and cleaning the several web page text contents to obtain intermediate text content;

Performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.

In a further technical solution, extracting feature keywords from the text content to be deduplicated to obtain target feature keywords comprises:

Dividing the text content to be deduplicated into blocks by position to obtain text blocks;

Extracting feature keywords from the text blocks to obtain initial feature keywords;

Performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;

Merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.

In a further technical solution, signing the target feature keywords to obtain a feature signature comprises:

Calculating feature hash values from the target feature keywords to obtain a feature vector;

Integrating the feature vector and the target feature keywords to form the feature signature.

In a further technical solution, forming a web page text content fingerprint according to the feature signature comprises:

Applying the weight values to each dimension of the feature vector in the feature signature to obtain a target vector;

Setting to one the bits corresponding to dimensions of the target vector whose value is positive, and setting to zero the bits corresponding to dimensions whose value is non-positive, to obtain the web page text content fingerprint.

In a further technical solution, calculating similarity according to the web page text content fingerprint to obtain the similarity of the web page text contents comprises:

Establishing a web page fingerprint inverted list according to the web page text contents;

Obtaining the documents that appear in the web page fingerprint inverted list;

Computing the intersection of the documents to obtain the similarity of the web page text contents.

The present invention also provides a rapid text content deduplication device, comprising:

A crawling unit, configured to crawl several web page text contents that need deduplication;

A preprocessing unit, configured to preprocess the several web page text contents to obtain the text content to be deduplicated;

An extraction unit, configured to extract feature keywords from the text content to be deduplicated to obtain target feature keywords;

A weight calculation unit, configured to calculate weights for the target feature keywords to obtain weight values;

A signature unit, configured to sign the target feature keywords to obtain a feature signature;

A fingerprint forming unit, configured to form a web page text content fingerprint according to the feature signature;

A storage unit, configured to store the web page text content fingerprint in an inverted index;

A similarity calculation unit, configured to calculate similarity according to the web page text content fingerprint to obtain the similarity of the web page text contents;

An output unit, configured to output the similarity of the web page text contents.

The present invention also provides computer equipment comprising a memory and a processor, where a computer program is stored in the memory and the processor implements the above method when executing the computer program.

The present invention also provides a storage medium storing a computer program that implements the above method when executed by a processor.

Compared with the prior art, the present invention has the following advantages. It extracts, based on word relations, target feature keywords and weights that can represent the web page text content, and generates a web page text content fingerprint from the target feature keywords and weights, which gives a compressed representation and saves storage space and computation time. It stores the web page text content fingerprints using the Elasticsearch inverted index data structure and converts similarity calculation into Boolean-model retrieval in Elasticsearch. It thereby effectively meets the performance requirements of real-time deduplication of massive, large-scale data and improves accuracy and deduplication performance.

The invention is further described below with reference to the drawings and specific embodiments.

Detailed description of the invention

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an application scenario of the rapid text content deduplication method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of the rapid text content deduplication method provided by an embodiment of the present invention;

Fig. 3 is a schematic sub-flowchart of the rapid text content deduplication method provided by an embodiment of the present invention;

Fig. 4 is a schematic sub-flowchart of the rapid text content deduplication method provided by an embodiment of the present invention;

Fig. 5 is a schematic sub-flowchart of the rapid text content deduplication method provided by an embodiment of the present invention;

Fig. 6 is a schematic sub-flowchart of the rapid text content deduplication method provided by an embodiment of the present invention;

Fig. 7 is a schematic sub-flowchart of the rapid text content deduplication method provided by an embodiment of the present invention;

Fig. 8 is a schematic sub-flowchart of the rapid text content deduplication method provided by an embodiment of the present invention;

Fig. 9 is a schematic diagram of the formation of the web page text content fingerprint provided by an embodiment of the present invention;

Fig. 10 is a schematic diagram of the formation of the target feature keywords provided by an embodiment of the present invention;

Fig. 11 is a schematic structural diagram of the inverted index provided by an embodiment of the present invention;

Fig. 12 is a schematic block diagram of the rapid text content deduplication device provided by an embodiment of the present invention;

Fig. 13 is a schematic block diagram of the computer equipment provided by an embodiment of the present invention.

Specific embodiment

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprising" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.

It should also be understood that the terms used in this specification are for the purpose of describing specific embodiments only and are not intended to limit the present invention. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

Please refer to Fig. 1 and Fig. 2. Fig. 1 is a schematic diagram of an application scenario of the rapid text content deduplication method provided by an embodiment of the present invention. Fig. 2 is a schematic flowchart of the method. The method is applied in a server that exchanges data with a terminal: the server obtains from the terminal several web page text contents that need deduplication, performs rapid deduplication on them, and outputs the result to the terminal for display.

As shown in Fig. 2, the method includes the following steps S110 to S190.

S110, crawl need several webpage text contents of duplicate removal.

In the present embodiment, webpage text content refers to the text with information shown in webpage.

In one embodiment, referring to Fig. 3, above-mentioned step S110 may include step S111~S114.

S111, the distribution address URL；

S112, URL is crawled according to the address URL, to obtain URL to be crawled；

Whether URL to be crawled described in S113, judgement has crawled；

If so, returning to the step S112；

S114, if it is not, then grabbing wait crawl the webpage text content in URL.

Distributed task dispatching program is with distributing URL (uniform resource locator, Uniform Resource Locator) Crawler application program node is given in location, if URL to be crawled is the URL grabbed, directly abandons, otherwise, passes through crawler Application program node grabs webpage text content.
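The crawl loop of steps S111 to S114 can be sketched as follows. This is a minimal single-process illustration, not the distributed scheduler of the embodiment; the function name crawl, the fetch callback and the example URLs are assumptions introduced here.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl(seed_urls: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch each distinct URL once and return {url: page text}."""
    pending = deque(seed_urls)   # URLs handed out by the scheduler (S111/S112)
    crawled = set()              # URLs that have already been fetched
    pages = {}
    while pending:
        url = pending.popleft()
        if url in crawled:       # S113: already crawled, discard and take the next URL
            continue
        crawled.add(url)
        pages[url] = fetch(url)  # S114: grab the web page text content
    return pages


# Hypothetical stand-in for a real HTTP fetch, so the sketch runs offline.
fake_web = {"http://a.example/1": "page one", "http://a.example/2": "page two"}
pages = crawl(
    ["http://a.example/1", "http://a.example/2", "http://a.example/1"],
    fake_web.__getitem__,
)
```

The repeated third URL is skipped, so each page is fetched only once.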

S120, preprocessing the several web page text contents to obtain the text content to be deduplicated.

In this embodiment, the text content to be deduplicated refers to text content that has been cleaned, filtered and segmented into words.

In one embodiment, referring to Fig. 4, the above step S120 may include steps S121 to S122.

S121, parsing and cleaning the several web page text contents to obtain intermediate text content;

S122, performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.

In this embodiment, intermediate text content refers to the content that remains after unneeded data has been removed.

Parsing and cleaning the web page text content mainly includes removing HTML tags, converting English uppercase letters to lowercase, and converting between simplified and traditional Chinese. Processing web page text content also involves Chinese word segmentation, in which a segmentation technique cuts the text content into independent, meaningful words.
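A minimal sketch of the cleaning step, using only the Python standard library, is given below. It strips tags, drops script and style bodies, and lowercases the text. Simplified/traditional conversion is omitted, and the segment function is only a stand-in tokenizer (ASCII word runs and single CJK characters), not a real Chinese word segmenter; both function names are assumptions.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """S121: remove HTML tags, collapse whitespace, lowercase."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip().lower()


def segment(text: str) -> list:
    """S122 stand-in: ASCII word runs or single CJK characters."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text)
```

In practice the stand-in tokenizer would be replaced by a dictionary-based or statistical Chinese segmenter.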

S130, extracting feature keywords from the text content to be deduplicated to obtain target feature keywords.

In this embodiment, target feature keywords are words that express the features and essence of the web page text content. The overall process of forming the target feature keywords is shown in Fig. 10. There are many ways to select features of the text content to be deduplicated, such as shingles and n-grams. Because sequences of several words or character strings do not carry a clear meaning on their own, which weakens the representation of document content, single keywords are extracted here as the features of the text content to be deduplicated.

In one embodiment, referring to Fig. 5, the above step S130 may include steps S131 to S134.

S131, dividing the text content to be deduplicated into blocks by position to obtain text blocks.

In this embodiment, a text block refers to content formed by the text at a particular position.

The text content to be deduplicated is divided by position mainly into a meta-information block, a page body block and a title block.

S132, extracting feature keywords from the text blocks to obtain initial feature keywords.

In this embodiment, an initial feature keyword is a feature word, extracted directly from the content of a text block, that represents the content of that block.

Specifically, the feature keywords of each text block are extracted according to semantic relations.

S133, performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords.

In this embodiment, intermediate feature keywords are words synonymous with the initial feature keywords.

S134, merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.

In this embodiment, merging the expanded keywords with the initial feature keywords makes the target feature keywords more comprehensive and accurate.
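Steps S131 to S134 can be illustrated with the sketch below. It assumes a simple frequency-based extractor in place of the semantic-relation extraction described above, and a small hypothetical synonym table (SYNONYMS) in place of a real semantic dictionary; all names are introduced here for illustration only.

```python
from collections import Counter

# Hypothetical synonym table standing in for a real semantic dictionary.
SYNONYMS = {"excellent": ["good"], "nation": ["country"]}


def extract_keywords(tokens, top_k=3):
    """Stand-in extractor: the top_k most frequent tokens in one text block."""
    return [word for word, _ in Counter(tokens).most_common(top_k)]


def target_keywords(blocks):
    """blocks maps a block name (meta, title, body) to its token list."""
    initial = []
    for tokens in blocks.values():              # S132: per-block extraction
        for word in extract_keywords(tokens):
            if word not in initial:
                initial.append(word)
    # S133: semantic expansion through the synonym table
    expanded = [s for w in initial for s in SYNONYMS.get(w, [])]
    merged = list(initial)                      # S134: merge, keeping order
    for word in expanded:
        if word not in merged:
            merged.append(word)
    return merged


result = target_keywords({
    "title": ["excellent", "nation"],
    "body": ["nation", "environment", "nation"],
})
```

The merged list keeps the initial keywords first and appends any synonyms that are not already present.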

S140, calculating weights for the target feature keywords to obtain weight values.

In this embodiment, the weight value reflects factors such as the proportion and position of a word's occurrences in the text content.

In one embodiment, the above step S140 may include steps S141 to S142.

S141, calculating weights according to the frequency and position with which the target feature keywords occur in the web page text content, to obtain several weights.

In this embodiment, the simplest way to calculate the weight of a target feature keyword mainly uses two indicators, importance and discrimination. The importance Weight is based on a variant of the frequency tf with which the word occurs, calculated as: Weight = log(1 + log(1 + tf));

The discrimination Discrimination is based on the inverse document frequency factor. In its calculation formula (not reproduced in this text), N is the total number of documents in the document collection and df is the document frequency.

Finally, normalization is performed based on document length. In the normalization formula (not reproduced in this text), b is an adjustment factor whose default value is 0.85.

Taking the above weight factors together gives the final weight formula of the target feature keyword (not reproduced in this text).

S142, sorting the several weights to obtain the weight values.

The final set of target feature keywords and weights is obtained according to the weight ordering.
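The sketch below illustrates only the part of the weighting that is fully stated in the text, namely the importance term Weight = log(1 + log(1 + tf)), followed by the sort of step S142. The discrimination and length-normalization factors are left out because their formulas are not available here; the function names are assumptions.

```python
import math
from collections import Counter


def importance(tf: int) -> float:
    """Importance term as given in the text: log(1 + log(1 + tf))."""
    return math.log(1 + math.log(1 + tf))


def ranked_weights(tokens, keywords):
    """S141/S142: weight each keyword by importance only, then sort descending."""
    counts = Counter(tokens)
    weights = {k: importance(counts[k]) for k in keywords}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


ranked = ranked_weights(["a", "a", "a", "b"], ["a", "b"])
```

The double logarithm damps the effect of very frequent words, so a keyword that appears ten times is weighted only moderately more than one that appears twice.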

S150, signing the target feature keywords to obtain a feature signature.

In this embodiment, the feature signature refers to the feature hash values generated from the semantic relations between target feature keywords. Specifically, the target feature keywords are signed using a semantic feature signature algorithm.

In one embodiment, referring to Fig. 6, the above step S150 may include steps S151 to S152.

S151, calculating feature hash values from the target feature keywords to obtain a feature vector.

In this embodiment, the feature vector refers to the hash values produced by signing the target feature keywords.

In this embodiment, the hash value is a b-dimensional vector, where b is set manually. Web page text content is made up of sentences, and sentences are made up of words; the topic a document expresses is determined jointly by the words and their context. Consider the following two texts: "China strives to become a country with an excellent environment" and "China strives to become a country with a good environment". The two sentences express the same point, and if they appear in different documents they can be regarded as duplicate content. However, the feature words of the two sentences are not exactly the same, so the document fingerprints produced by the original locality-sensitive hashing algorithm would differ; the documents would be treated as different, producing a misjudgment.

Semantic feature signature algorithm pseudocode:

Input: the feature set of the web documents;

Feature = {Feature1, Feature2, ..., Featurei, ..., Featuren}, Featurei = {Featurei1, Featurei2, ..., Featureij, ..., Featurein};

The corresponding weight set;

Weight = {Weight1, Weight2, ..., Weighti, ..., Weightn}, Weighti = {Weighti1, Weighti2, ..., Weightij, ..., Weightin};

Output: the set of semantic feature hash values of the web documents;

HashVal = {HashVal1, HashVal2, ..., HashVali, ..., HashValn}, HashVali = {HashVali1, HashVali2, ..., HashValij, ..., HashValin};

The pseudocode is as follows (the listing is not reproduced in this text):

Here, sim(Featureij, Featurekl) is the similarity function for word features. It is calculated from the semantic relations between concepts, which are organized as a hierarchical semantic dictionary; semantic similarity is measured by the hierarchical path distance between words in the dictionary. When the similarity value between word features is less than the set threshold, the hash values of those features are set to the same value. In the original locality-sensitive hashing algorithm, by contrast, different features always produce different hash values, which ignores the relations between words.
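Since the pseudocode listing itself is unavailable, the following is only a sketch of the idea described above, under stated assumptions: the similarity function is treated as a path distance (smaller means closer), MD5 is used purely as a stable hash, and the toy distance table and all function names are hypothetical.

```python
import hashlib


def word_hash(word: str, bits: int = 32) -> int:
    """Deterministic b-bit hash of a word (MD5 is used only as a stable hash)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def semantic_hashes(features, distance, threshold, bits=32):
    """Give features whose semantic distance is below threshold the same hash."""
    representatives = []   # first-seen feature of each semantic group
    hashes = {}
    for feat in features:
        for rep in representatives:
            if distance(feat, rep) < threshold:
                hashes[feat] = hashes[rep]   # reuse the group's hash value
                break
        else:
            representatives.append(feat)
            hashes[feat] = word_hash(feat, bits)
    return hashes


# Hypothetical path distances in a toy semantic hierarchy.
TOY_DISTANCE = {frozenset(("excellent", "good")): 1}


def toy_distance(a, b):
    return 0 if a == b else TOY_DISTANCE.get(frozenset((a, b)), 10)


hashes = semantic_hashes(
    ["china", "environment", "excellent", "good"], toy_distance, threshold=2
)
```

With this table, "excellent" and "good" receive the same hash value, so the two example sentences above would contribute identical features to the fingerprint.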

S152, integrating the feature vector and the target feature keywords to form the feature signature.

Compressing the resulting feature signature yields the web page text content fingerprint. This gives a compressed representation and saves storage space and computation time.

S160, forming a web page text content fingerprint according to the feature signature;

In this embodiment, the web page text content fingerprint refers to the vector formed after bits are set to zero or one according to the feature vector.

In one embodiment, referring to Fig. 7, the above step S160 may include steps S161 to S162.

S161, applying the weight values to each dimension of the feature vector in the feature signature to obtain a target vector;

S162, setting to one the bits corresponding to dimensions of the target vector whose value is positive, and setting to zero the bits corresponding to dimensions whose value is non-positive, to obtain the web page text content fingerprint.

Web page text content, that is, a document, consists of a series of character strings; operating directly on the strings requires a large amount of storage space and computation time. Therefore the original text is analyzed and processed, target feature keywords that can represent the original document are extracted, and a web page text content fingerprint is generated through a hash function. Duplicate or near-duplicate documents are found by comparing the fingerprints that represent the web page text contents. Two documents are considered duplicates when the number of identical fingerprints they share, or the ratio of identical fingerprints to total fingerprints, reaches a certain threshold; otherwise they are considered not to be duplicates.

For the b-dimensional feature vector V, each dimension is calculated separately: if the corresponding bit of a feature's hash value is 1, the weight of that target feature keyword is added; otherwise the weight is subtracted. After all features have been processed, if the i-th dimension of V is positive, the i-th bit of the b-bit fingerprint is set to 1; otherwise it is set to 0. The result is a vector of 0s and 1s, namely the web page text content fingerprint, as shown in Fig. 9.
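The accumulation and thresholding just described can be sketched as below. The function name fingerprint and the representation of the input as (hash value, weight) pairs are assumptions made for illustration.

```python
def fingerprint(weighted_hashes, bits=32):
    """Build a b-bit fingerprint from (hash_value, weight) pairs.

    For every bit position, add the weight when the feature hash has a 1
    there and subtract it otherwise; positive totals become 1 bits.
    """
    v = [0.0] * bits
    for h, w in weighted_hashes:
        for i in range(bits):
            v[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i in range(bits):
        if v[i] > 0:        # non-positive totals leave the bit at 0
            fp |= 1 << i
    return fp


# Two 4-bit feature hashes with weights 2.0 and 1.0.
fp = fingerprint([(0b1010, 2.0), (0b0110, 1.0)], bits=4)
```

In this example the heavier feature dominates wherever the two hashes disagree, so the result equals 0b1010.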

S170, storing the web page text content fingerprint in an inverted index.

The web page fingerprints are stored using the Elasticsearch inverted index data structure, and similarity calculation is converted into Boolean-model retrieval in Elasticsearch. Elasticsearch is a search server based on Lucene; it provides a distributed, multi-user full-text search engine with a RESTful web interface.

Based on Elasticsearch, the mapping from web document ID to web page fingerprint is converted into a mapping from web page fingerprint to web document IDs and stored, as shown in Fig. 11. There, web page fingerprint 1 points to the ID of web document 1 and the ID of web document 2, and web page fingerprint 2 points to the list of IDs of web documents that have this target feature keyword; a web document here means web page text content.
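The inversion of the mapping can be illustrated with a plain in-memory dictionary. This sketch does not use Elasticsearch and makes no claim about its API; the function name and the document IDs are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(doc_fingerprints):
    """Invert {doc_id: fingerprint} into {fingerprint: sorted list of doc_ids}."""
    index = defaultdict(list)
    for doc_id, fp in doc_fingerprints.items():
        index[fp].append(doc_id)
    return {fp: sorted(ids) for fp, ids in index.items()}


index = build_inverted_index({"doc1": 0xA, "doc2": 0xA, "doc3": 0x3})
```

Documents that share a fingerprint end up in the same posting list, so exact duplicates are found with a single lookup.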

S180, calculating similarity according to the web page text content fingerprint to obtain the similarity of the web page text contents.

In this embodiment, the above similarity refers to the similarity of two web page text contents.

In one embodiment, referring to Fig. 8, the above step S180 may include steps S181 to S183.

S181, establishing a web page fingerprint inverted list according to the web page text contents;

S182, obtaining the documents that appear in the web page fingerprint inverted list;

S183, computing the intersection of the documents to obtain the similarity of the web page text contents.

After a web page text content fingerprint has been calculated for each web page text content, the similarity of two fingerprints is calculated. Similarity is measured by Hamming distance: a web page fingerprint inverted list is established, and the final result is obtained by intersecting the documents found by querying the inverted list. Suppose the fingerprints are 32 bits long. Each 32-bit binary signature is divided into 2 blocks of 16 bits, and all signatures within Hamming distance 1 are to be found. By the pigeonhole principle, if the Hamming distance between two fingerprints is within 1, at least one of their blocks must be identical. The inverted index can therefore turn similarity calculation into a Boolean retrieval model, greatly reducing the time needed to calculate document similarity.
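The block-based lookup can be sketched as follows for 32-bit fingerprints split into two 16-bit blocks. As an assumption of this sketch, candidates are gathered as the union of the per-block posting lists and then verified by Hamming distance, which is one way to apply the pigeonhole argument; the index is an in-memory dictionary and all names are hypothetical.

```python
from collections import defaultdict


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def blocks(fp: int):
    """Split a 32-bit fingerprint into (position, 16-bit value) keys."""
    return [(0, fp >> 16), (1, fp & 0xFFFF)]


def build_block_index(doc_fingerprints):
    """Map every (position, block value) key to the documents containing it."""
    index = defaultdict(set)
    for doc_id, fp in doc_fingerprints.items():
        for key in blocks(fp):
            index[key].add(doc_id)
    return index


def near_duplicates(query_fp, doc_fingerprints, index, max_distance=1):
    """Return documents within max_distance bits of query_fp."""
    candidates = set()
    for key in blocks(query_fp):            # Boolean lookup per block
        candidates |= index.get(key, set())
    return sorted(
        d for d in candidates
        if hamming(query_fp, doc_fingerprints[d]) <= max_distance
    )


docs = {"a": 0x12345678, "b": 0x12345679, "c": 0x1234FFFF, "d": 0xFFFF0000}
found = near_duplicates(0x12345678, docs, build_block_index(docs))
```

Document "c" shares the high block with the query and becomes a candidate, but the Hamming check rejects it; document "d" shares no block and is never examined.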

S190, outputting the similarity of the web page text contents.

In this embodiment, the similarity of the web page text contents is output to the terminal for display.

The effect of this method on web page text content deduplication is evaluated using precision (Precision) and recall (Recall).

Precision and Recall each reflect only one aspect of performance and ignore overall performance. The deduplication effect evaluation value F1 combines the two and is defined as: F1 = 2 × Precision × Recall / (Precision + Recall)
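As a worked illustration, the sketch below computes F1 as the harmonic mean of precision and recall, which is the usual definition of the measure. The function name f1_score is hypothetical, and returning 0.0 when both inputs are zero is a convention chosen for this sketch.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is pulled toward the smaller value, a method with perfect precision but a recall of 0.5 scores only about 0.67.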

This method was compared with the classic locality-sensitive algorithm simhash in terms of deduplication effect; the comparison is shown in Table 1 below:

Table 1. Comparison of algorithm effect

In terms of effect, this method shows a considerable improvement in precision and recall over the locality-sensitive hashing algorithm.

This method was also compared with the classic locality-sensitive algorithm simhash in terms of running efficiency; the comparison is shown in Table 2 below:

Table 2. Comparison of algorithm running efficiency

In terms of running efficiency, this method performs better: compared with locality-sensitive hashing, its performance degrades more slowly as the load increases significantly, so it can be applied in large-scale mass data environments.

In other embodiments, the deduplication effect evaluation value may also be output in addition to the similarity of the web page text contents.

The above rapid text content deduplication method extracts, based on word relations, target feature keywords and weights that can represent the web page text content, and generates a web page text content fingerprint from the target feature keywords and weights, which gives a compressed representation and saves storage space and computation time. It stores the web page text content fingerprints using the Elasticsearch inverted index data structure and converts similarity calculation into Boolean-model retrieval in Elasticsearch. It thereby effectively meets the performance requirements of real-time deduplication of massive, large-scale data and improves accuracy and deduplication performance.

A kind of schematic block diagram for the quick duplicate removal device 300 of content of text that Figure 12 inventive embodiments provide.It is right such as Figure 12 The quick De-weight method of Ying Yu or more content of text, the present invention also provides a kind of quick duplicate removal devices 300 of content of text.In the text Holding quick duplicate removal device 300 includes the unit for executing the quick De-weight method of above-mentioned content of text, which can be configured In server.

Specifically, Figure 12 is please referred to, the quick duplicate removal device 300 of text content includes:

A crawling unit 301, for crawling several webpage text contents that need deduplication;

A preprocessing unit 302, for preprocessing the several webpage text contents, to obtain the text content to be deduplicated;

An extraction unit 303, for extracting feature keywords from the text content to be deduplicated, to obtain target feature keywords;

A weight calculation unit 304, for performing weight calculation on the target feature keywords, to obtain weight values;

A signature unit 305, for signing the target feature keywords, to obtain a feature signature;

A fingerprint forming unit 306, for forming a webpage text content fingerprint according to the feature signature;

A storage unit 307, for performing inverted-index storage of the webpage text content fingerprint;

A similarity calculation unit 308, for calculating similarity according to the webpage text content fingerprint, to obtain the similarity of the webpage text content;

An output unit 309, for outputting the similarity of the webpage text content.

In one embodiment, the crawling unit 301 includes:

A URL allocation subunit, for allocating URL addresses;

A URL crawling subunit, for crawling URLs according to the URL addresses, to obtain a URL to be crawled;

A crawl judgment subunit, for judging whether the URL to be crawled has already been crawled; if so, returning to the crawling of URLs according to the URL addresses, to obtain a URL to be crawled;

A content crawling subunit, for crawling the webpage text content at the URL to be crawled if it has not yet been crawled.
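The interplay of these four subunits can be sketched as a small crawl loop. This is a minimal sketch under stated assumptions: the `fetch` callable, which returns a page's text and its outgoing links, is hypothetical, and the breadth-first queue and `max_pages` limit are illustrative choices not taken from the text above.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], Tuple[str, List[str]]],
          max_pages: int = 100) -> Dict[str, str]:
    """Crawl from seed_urls, skipping URLs that were already crawled.

    `fetch` is a hypothetical callable returning (page_text, outgoing_links).
    """
    queue = deque(seed_urls)   # URL addresses allocated for crawling
    crawled = set()            # URLs whose content was already grabbed
    pages: Dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()  # the URL to be crawled
        if url in crawled:     # already crawled: go back for the next URL
            continue
        text, links = fetch(url)  # not yet crawled: grab the page text
        crawled.add(url)
        pages[url] = text
        queue.extend(links)
    return pages
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library.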

In one embodiment, the preprocessing unit 302 includes:

A cleaning subunit, for parsing and cleaning the several webpage text contents, to obtain intermediate text content;

A word segmentation subunit, for performing word segmentation on the intermediate text content, to obtain the text content to be deduplicated.
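A minimal sketch of the cleaning and segmentation steps follows, using only the Python standard library. The regex tokenizer is a stand-in assumption: it handles space-delimited text only, and Chinese text would need a dedicated word segmenter, which the text above does not name.

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """Parse a page and return its visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def segment(text: str) -> List[str]:
    """Stand-in tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())
```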

In one embodiment, the extraction unit 303 includes:

A blocking subunit, for dividing the text content to be deduplicated into blocks by position, to obtain text blocks;

An extraction subunit, for extracting feature keywords from the text blocks, to obtain initial feature keywords;

An expansion subunit, for performing semantic expansion on the initial feature keywords, to obtain intermediate feature keywords;

A merging subunit, for merging the intermediate feature keywords and the initial feature keywords, to obtain the target feature keywords.
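The block, extract, expand, and merge sequence can be sketched as follows. Several details are assumptions for illustration only: fixed-size blocks, frequency-based keyword selection, and the small hypothetical `SYNONYMS` table standing in for whatever semantic resource an implementation would use.

```python
from collections import Counter

# Hypothetical synonym table standing in for a real semantic resource.
SYNONYMS = {"car": ["automobile"], "fast": ["quick"]}


def split_by_position(tokens, block_size=50):
    """Partition a token list into consecutive position-based blocks."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def extract_keywords(block, top_k=3):
    """Pick the top_k most frequent tokens of a block as its keywords."""
    return [word for word, _ in Counter(block).most_common(top_k)]


def expand(keywords, synonyms=SYNONYMS):
    """Semantic expansion: look up related words for each keyword."""
    return [syn for kw in keywords for syn in synonyms.get(kw, [])]


def target_keywords(tokens, block_size=50, top_k=3):
    """Return initial keywords merged with their semantic expansions."""
    initial = []
    for block in split_by_position(tokens, block_size):
        initial.extend(extract_keywords(block, top_k))
    intermediate = expand(initial)
    # Merge, preserving first-seen order and dropping duplicates.
    return list(dict.fromkeys(initial + intermediate))
```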

In one embodiment, the weight calculation unit 304 includes:

A weight obtaining subunit, for calculating weights according to the frequency and position with which the target feature keywords occur in the webpage text content, to obtain several weights;

A sorting subunit, for sorting the several weights, to obtain the weight values.
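One way to combine frequency and position into a weight is sketched below. The specific rule, counting occurrences in the leading part of the text more heavily, is an assumption; the text above states only that frequency and position are both used. The parameters `lead_fraction` and `lead_boost` are hypothetical.

```python
def keyword_weights(tokens, keywords, lead_fraction=0.2, lead_boost=2.0):
    """Weight = occurrence count, with occurrences in the leading part
    of the text counted lead_boost times (an assumed position rule).

    Returns (keyword, weight) pairs sorted by descending weight.
    """
    lead_end = max(1, int(len(tokens) * lead_fraction))
    wanted = set(keywords)
    weights = {kw: 0.0 for kw in keywords}
    for pos, tok in enumerate(tokens):
        if tok in wanted:
            weights[tok] += lead_boost if pos < lead_end else 1.0
    # Sort by descending weight; ties are broken alphabetically.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
```

Keywords that come only from semantic expansion and never occur in the text receive a weight of 0.0 under this rule.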

In one embodiment, the signature unit 305 includes:

A vector obtaining subunit, for calculating a feature hash value from each target feature keyword, to obtain a feature vector;

An integration subunit, for integrating the feature vectors and the target feature keywords, to form the feature signature.
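The signing step can be sketched as hashing each keyword into a fixed-width bit vector and pairing the two. The choice of MD5 and a 64-bit width are assumptions for illustration; the text above does not name a hash function or a vector length.

```python
import hashlib
from typing import Dict, Iterable, List


def feature_vector(keyword: str, bits: int = 64) -> List[int]:
    """Hash a keyword and return the low `bits` bits of the digest
    as a list of 0/1 values, most significant bit first."""
    digest = hashlib.md5(keyword.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") & ((1 << bits) - 1)
    return [(value >> (bits - 1 - i)) & 1 for i in range(bits)]


def sign(keywords: Iterable[str]) -> Dict[str, List[int]]:
    """Pair each keyword with its feature vector to form the signature."""
    return {kw: feature_vector(kw) for kw in keywords}
```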

In one embodiment, the fingerprint forming unit 306 includes:

A target vector forming subunit, for performing a weighted calculation on each dimension of the feature vectors in the feature signature, to obtain a target vector;

A setting subunit, for setting to one each position of the target vector whose value is positive and setting to zero each position whose value is non-positive, to obtain the webpage text content fingerprint.
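The two subunits above resemble the SimHash construction, and a minimal sketch in that style follows. The convention of adding the weight for a 1 bit and subtracting it for a 0 bit is the usual SimHash rule and is assumed here; the text above says only that each dimension is weighted.

```python
def fingerprint(signature, weights):
    """Form a bit fingerprint from weighted feature vectors.

    signature: {keyword: [0/1 bits]}, weights: {keyword: weight}.
    """
    bits = len(next(iter(signature.values())))
    target = [0.0] * bits
    for kw, vec in signature.items():
        w = weights.get(kw, 0.0)
        for i, bit in enumerate(vec):
            # Assumed SimHash convention: +w for a 1 bit, -w for a 0 bit.
            target[i] += w if bit else -w
    # Positive dimensions become 1; non-positive dimensions become 0.
    return [1 if v > 0 else 0 for v in target]
```

For example, with vectors `[1, 0, 1, 0]` (weight 3) and `[0, 1, 1, 0]` (weight 1), the target vector is `[2, -2, 4, -4]` and the fingerprint is `[1, 0, 1, 0]`.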

In one embodiment, the similarity calculation unit 308 includes:

An establishing subunit, for establishing a webpage fingerprint inverted list according to the webpage text content;

A document number obtaining subunit, for obtaining the number of documents that occur in the webpage fingerprint inverted list;

An intersection calculation subunit, for performing an intersection calculation on the documents, to obtain the similarity of the webpage text content.
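One possible reading of these subunits is sketched below as an in-memory inverted list keyed by fingerprint blocks. Splitting the fingerprint into blocks, the `min_shared` threshold, and the class name `FingerprintIndex` are all assumptions for illustration; the text above does not describe the layout of the inverted list.

```python
from collections import Counter, defaultdict


def blocks_of(bits, parts=4):
    """Split a bit list into `parts` equal (position, bit-tuple) keys."""
    width = len(bits) // parts
    return [(i, tuple(bits[i * width:(i + 1) * width])) for i in range(parts)]


class FingerprintIndex:
    """Hypothetical inverted list: fingerprint block -> document ids."""

    def __init__(self, parts=4):
        self.parts = parts
        self.postings = defaultdict(set)

    def add(self, doc_id, bits):
        for key in blocks_of(bits, self.parts):
            self.postings[key].add(doc_id)

    def similar(self, bits, min_shared=3):
        """Return ids of documents sharing at least min_shared blocks."""
        hits = Counter()
        for key in blocks_of(bits, self.parts):
            hits.update(self.postings.get(key, ()))
        return sorted(doc for doc, n in hits.items() if n >= min_shared)
```

Setting `min_shared` equal to `parts` amounts to a full intersection of the posting lists, which returns only documents with an identical fingerprint.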

It should be noted that, as those skilled in the art will clearly understand, for the specific implementation of the above fast text-content deduplication device 300 and each of its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity, the details are not repeated here.

The above fast text-content deduplication device 300 may be implemented in the form of a computer program, which can run on a computer device as shown in Figure 13.

Referring to Figure 13, Figure 13 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server.

Referring to Figure 13, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.

The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions which, when executed, cause the processor 502 to perform a fast text-content deduplication method.

The processor 502 provides computing and control capabilities to support the operation of the entire computer device 500.

The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it causes the processor 502 to perform a fast text-content deduplication method.

The network interface 505 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Figure 13 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:

Crawling several webpage text contents that need deduplication;

Signing the target feature keywords, to obtain a feature signature;

Performing inverted-index storage of the webpage text content fingerprint;

Outputting the similarity of the webpage text content.

In one embodiment, when implementing the step of crawling several webpage text contents that need deduplication, the processor 502 specifically implements the following steps:

Allocating URL addresses;

Crawling URLs according to the URL addresses, to obtain a URL to be crawled;

Judging whether the URL to be crawled has already been crawled;

If not, crawling the webpage text content at the URL to be crawled.

In one embodiment, when implementing the step of preprocessing the several webpage text contents to obtain the text content to be deduplicated, the processor 502 specifically implements the following steps:

In one embodiment, when implementing the step of extracting feature keywords from the text content to be deduplicated to obtain target feature keywords, the processor 502 specifically implements the following steps:

In one embodiment, when implementing the step of signing the target feature keywords to obtain a feature signature, the processor 502 specifically implements the following steps:

In one embodiment, when implementing the step of forming a webpage text content fingerprint according to the feature signature, the processor 502 specifically implements the following steps:

In one embodiment, when implementing the step of calculating similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content, the processor 502 specifically implements the following steps:

It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program includes program instructions and can be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the above method embodiments.

Therefore, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:

Crawling several webpage text contents that need deduplication;

Signing the target feature keywords, to obtain a feature signature;

Performing inverted-index storage of the webpage text content fingerprint;

Outputting the similarity of the webpage text content.

In one embodiment, when executing the computer program to implement the step of crawling several webpage text contents that need deduplication, the processor specifically implements the following steps:

Allocating URL addresses;

Crawling URLs according to the URL addresses, to obtain a URL to be crawled;

Judging whether the URL to be crawled has already been crawled;

If not, crawling the webpage text content at the URL to be crawled.

In one embodiment, when executing the computer program to implement the step of preprocessing the several webpage text contents to obtain the text content to be deduplicated, the processor specifically implements the following steps:

In one embodiment, when executing the computer program to implement the step of extracting feature keywords from the text content to be deduplicated to obtain target feature keywords, the processor specifically implements the following steps:

In one embodiment, when executing the computer program to implement the step of signing the target feature keywords to obtain a feature signature, the processor specifically implements the following steps:

In one embodiment, when executing the computer program to implement the step of forming a webpage text content fingerprint according to the feature signature, the processor specifically implements the following steps:

In one embodiment, when executing the computer program to implement the step of calculating similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content, the processor specifically implements the following steps:

The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disc, or any other computer-readable storage medium that can store program code.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.

The steps in the methods of the embodiments of the present invention may be reordered, merged, or deleted according to actual needs. The units in the devices of the embodiments of the present invention may be combined, divided, or deleted according to actual needs. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.

The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims

1. A fast text-content deduplication method, characterized by comprising:

Crawling several webpage text contents that need deduplication;

Signing the target feature keywords, to obtain a feature signature;

Performing inverted-index storage of the webpage text content fingerprint;

Outputting the similarity of the webpage text content.

2. The fast text-content deduplication method according to claim 1, characterized in that the crawling of several webpage text contents that need deduplication comprises:

Allocating URL addresses;

Crawling URLs according to the URL addresses, to obtain a URL to be crawled;

Judging whether the URL to be crawled has already been crawled;

If not, crawling the webpage text content at the URL to be crawled.

3. The fast text-content deduplication method according to claim 1, characterized in that the preprocessing of the several webpage text contents, to obtain the text content to be deduplicated, comprises:

4. The fast text-content deduplication method according to claim 1, characterized in that the extracting of feature keywords from the text content to be deduplicated, to obtain target feature keywords, comprises:

5. The fast text-content deduplication method according to claim 1, characterized in that the signing of the target feature keywords, to obtain a feature signature, comprises:

6. The fast text-content deduplication method according to claim 1, characterized in that the forming of a webpage text content fingerprint according to the feature signature comprises:

Setting to one each position of the target vector whose value is positive and setting to zero each position whose value is non-positive, to obtain the webpage text content fingerprint.

7. The fast text-content deduplication method according to claim 1, characterized in that the calculating of similarity according to the webpage text content fingerprint, to obtain the similarity of the webpage text content, comprises:

8. A fast text-content deduplication device, characterized by comprising:

A preprocessing unit, for preprocessing the several webpage text contents, to obtain the text content to be deduplicated;

A similarity calculation unit, for calculating similarity according to the webpage text content fingerprint, to obtain the similarity of the webpage text content;

An output unit, for outputting the similarity of the webpage text content.

9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory stores a computer program, and the processor implements the method according to any one of claims 1 to 7 when executing the computer program.

10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.