CN114417102A - Text duplicate removal method and device and electronic equipment - Google Patents

Info

Publication number
CN114417102A
Authority
CN
China
Prior art keywords
webpage
deduplicated
web page
text
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111618080.3A
Other languages
Chinese (zh)
Inventor
张洵
刘青松
刘博伟
彭辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingge Technology Co ltd
Original Assignee
Beijing Qingge Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingge Technology Co ltd
Priority to CN202111618080.3A
Publication of CN114417102A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

Embodiments of the present disclosure disclose a text deduplication method, a text deduplication apparatus, and an electronic device. One embodiment of the method comprises: acquiring a set of web pages to be deduplicated; for each web page to be deduplicated in the set, extracting web page features from the web page data of the web page, determining, by using the web page title and the web page text of the web page and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a web page similar to the web page to be deduplicated exists in a candidate web page set, and if so, setting a similarity flag bit of the web page to be deduplicated; grouping the web pages to be deduplicated in the set by using the similarity flag bits; and based on the web page features, selecting a target web page from each group of web pages to be deduplicated and deleting the web pages other than the target web page to obtain a deduplicated web page set. This embodiment improves the deduplication quality for web page text while also improving deduplication efficiency and saving memory.

Description

Text duplicate removal method and device and electronic equipment
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a text duplicate removal method and device and electronic equipment.
Background
Internet sites often repost each other's content in large quantities; for example, local government sites repost news from central government sites. As a result, the crawled data produces duplicate results in a search engine, leading to a poor user experience.
Existing text deduplication approaches are based on either the vector space hash (sim-hash) algorithm or the minimum common substring matching (Jaccard) algorithm. However, duplicate checking based on the Jaccard algorithm involves a large amount of computation, is relatively slow, and places demands on computer memory; if the web page text to be checked is too long, it can cause a memory overflow. Duplicate checking based on the sim-hash algorithm is better suited to comparing long texts and can be inaccurate when deduplicating short texts.
Disclosure of Invention
This disclosure is provided to introduce concepts in a simplified form that are further described below in the detailed description. This disclosure is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The embodiment of the disclosure provides a text duplicate removal method and device and electronic equipment, which improve the duplicate removal efficiency and save the memory while improving the duplicate removal effect of a webpage text.
In a first aspect, an embodiment of the present disclosure provides a text deduplication method, including: acquiring a set of web pages to be deduplicated; for each web page to be deduplicated in the set, extracting web page features from the web page data of the web page, determining, by using the web page title and the web page text of the web page and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a web page similar to the web page to be deduplicated exists in a candidate web page set, and if so, setting a similarity flag bit of the web page to be deduplicated, wherein the web page features include the web page title and the web page text, and similar web pages have the same similarity flag bit; grouping the web pages to be deduplicated in the set by using the similarity flag bits; and based on the web page features, selecting a target web page from each group of web pages to be deduplicated and deleting the web pages other than the target web page to obtain a deduplicated web page set.
In a second aspect, an embodiment of the present disclosure provides a text deduplication apparatus, including: an acquisition unit configured to acquire a set of web pages to be deduplicated; a setting unit configured to, for each web page to be deduplicated in the set, extract web page features from the web page data of the web page, determine, by using the web page title and the web page text of the web page and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a web page similar to the web page to be deduplicated exists in a candidate web page set, and if so, set a similarity flag bit of the web page to be deduplicated, wherein the web page features include the web page title and the web page text, and similar web pages have the same similarity flag bit; a grouping unit configured to group the web pages to be deduplicated in the set by using the similarity flag bits; and a deduplication unit configured to select, based on the web page features, a target web page from each group of web pages to be deduplicated and delete the web pages other than the target web page to obtain a deduplicated web page set.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the text deduplication method as described in the first aspect.
In a fourth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the steps of the text deduplication method according to the first aspect.
According to the text deduplication method, apparatus, and electronic device provided by the embodiments of the present disclosure, a set of web pages to be deduplicated is acquired; for each web page to be deduplicated in the set, web page features are extracted from the web page data of the web page, it is determined, by using the web page title and the web page text of the web page and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a web page similar to the web page to be deduplicated exists in a candidate web page set, and if so, a similarity flag bit of the web page to be deduplicated is set; then, the web pages to be deduplicated in the set are grouped by using the similarity flag bits; and finally, based on the web page features, a target web page is selected from each group of web pages to be deduplicated and the web pages other than the target web page are deleted to obtain a deduplicated web page set. By combining the vector space hash algorithm with the minimum common substring matching algorithm in this way, the deduplication quality for web page text is improved while deduplication efficiency is improved and memory is saved.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is an exemplary system architecture diagram in which various embodiments of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a text deduplication method according to the present disclosure;
FIG. 3 is a flow diagram of yet another embodiment of a text deduplication method according to the present disclosure;
FIG. 4 is a schematic block diagram of one embodiment of a text deduplication apparatus according to the present disclosure;
FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the text deduplication method of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 1011, 1012, 1013, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 1011, 1012, 1013 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may interact with the server 103 via the network 102 using the terminal devices 1011, 1012, 1013 to send or receive messages or the like, for example, the server 103 may obtain a set of web pages to be deduplicated from the terminal devices 1011, 1012, 1013. Various communication client applications, such as a browser application, an instant messaging software, and the like, may be installed on the terminal devices 1011, 1012, 1013.
The terminal devices 1011, 1012, 1013 may be hardware or software. When the terminal devices 1011, 1012, 1013 are hardware, they may be various electronic devices having a display screen and supporting information interaction, including but not limited to smart phones, tablet computers, laptop computers, and the like. When the terminal devices 1011, 1012, 1013 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
The server 103 may be a server that provides various services. For example, it may be a background server that deduplicates a set of web pages to be deduplicated. The server 103 may first obtain a set of web pages to be deduplicated; then, extracting webpage features from webpage data of the webpage to be deduplicated for each webpage to be deduplicated in the webpage set to be deduplicated, determining whether a webpage similar to the webpage to be deduplicated exists in a candidate webpage set or not by using a webpage title and a webpage text of the webpage to be deduplicated and based on a vector space hash algorithm and a minimum common substring matching algorithm, and if so, setting a similar flag bit of the webpage to be deduplicated; then, the similar flag bits can be used for grouping the web pages to be deduplicated in the web page set to be deduplicated; finally, a target webpage can be selected from each group of webpages to be deduplicated based on the webpage characteristics, and other webpages except the target webpage are deleted to obtain a webpage set after deduplication.
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No particular limitation is imposed here.
It should be further noted that the text deduplication method provided by the embodiments of the present disclosure is generally executed by the server 103, and accordingly the text deduplication apparatus is generally disposed in the server 103.
It should be further noted that if the collection of web pages to be deduplicated is stored locally on the server 103, the system architecture 100 may not have the terminal devices 1011, 1012, 1013 and the network 102.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a text deduplication method according to the present disclosure is shown. The text deduplication method comprises the following steps:
step 201, acquiring a to-be-deduplicated webpage set.
In this embodiment, an execution body of the text deduplication method (e.g., the server shown in fig. 1) may acquire a set of web pages to be deduplicated. The set may contain web pages whose content is the same or similar, in which case the set needs to be deduplicated.
Here, the web pages to be deduplicated may include web pages from government websites on the internet.
Step 202, for each webpage to be deduplicated in the set of webpages to be deduplicated, extracting webpage features from the webpage data of the webpage to be deduplicated.
In this embodiment, for each to-be-deduplicated web page in the to-be-deduplicated web page set acquired in step 201, the execution main body may analyze the web page data of the to-be-deduplicated web page, and extract the web page features from the web page data of the to-be-deduplicated web page. The web page features may include a web page body and a web page title. Furthermore, the web page features may also include, but are not limited to: the name of the website to which the webpage belongs, webpage content keywords and webpage character length.
Step 203, determining, by using the web page title and the web page text of the web page to be deduplicated and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a web page similar to the web page to be deduplicated exists in the candidate web page set.
In this embodiment, the execution main body may determine whether a webpage similar to the webpage to be deduplicated exists in the candidate webpage set based on a vector space hash algorithm and a minimum common substring matching algorithm by using the webpage title and the webpage text of the webpage to be deduplicated. Duplicate web pages do not exist in the candidate web page set.
The sim-hash algorithm is used to detect near-duplicate web pages. Sim-hash generates an n-bit fingerprint for an input text. Unlike conventional hash functions such as MD5 (MD5 Message-Digest Algorithm) and SHA-1 (Secure Hash Algorithm 1), sim-hash produces similar fingerprints for similar texts: the more similar two texts are, the fewer the binary digits (denoted k) in which their fingerprints differ. For two texts, the sim-hash similarity is compared as follows:
Step 1, word segmentation: the text is segmented into words. To reduce the influence of stop words and other common words (for example, "is", "at", …), each word is weighted by TF-IDF (Term Frequency-Inverse Document Frequency), where TF is the frequency with which a word appears in the text and IDF reflects how common the word is (that is, the number of texts containing it); the more common a word, the lower its IDF value. The TF-IDF weight is the product of TF and IDF, and the words with relatively high TF-IDF weights serve as the keywords of the text.
Step 2, hashing: the hash value of each word is calculated; converting words into numbers improves computational efficiency.
Step 3, weighting: each hash value is weighted bit by bit, so that a 1 bit contributes the positive weight of the word and a 0 bit contributes the negative weight.
Step 4, merging: the weighted hash values are added column by column to obtain a sequence of numbers.
Step 5, dimension reduction: the sequence obtained in step 4 is converted into a string of 0s and 1s, with numbers greater than 0 converted to 1 and numbers less than 0 converted to 0.
Step 6, similarity comparison: the Hamming distance between the generated sim-hash values is compared.
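The sim-hash steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of the disclosure: the function names, the 64-bit fingerprint width, the use of MD5 as the per-word hash, and the smoothed IDF formula are all hypothetical choices, and word segmentation is assumed to have been done already.

```python
import hashlib
import math
from collections import Counter


def tf_idf_weights(tokens, doc_freq, num_docs):
    """Step 1 (after segmentation): weight each distinct word by TF * IDF.

    doc_freq maps a word to the number of texts containing it.
    The smoothed IDF formula below is an assumption, not from the source.
    """
    counts = Counter(tokens)
    total = len(tokens)
    weights = {}
    for token, count in counts.items():
        tf = count / total
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(token, 0))) + 1.0
        weights[token] = tf * idf
    return weights


def token_hash(token, bits=64):
    """Step 2: map a word to a `bits`-bit integer (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weights, bits=64):
    """Steps 3-5: weight each hash bit, add column by column, reduce to bits."""
    columns = [0.0] * bits
    for token, weight in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # A 1 bit adds the word's weight; a 0 bit subtracts it.
            columns[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, value in enumerate(columns):
        # Columns greater than 0 become 1; all other columns become 0.
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint
```

With a single word of positive weight, the fingerprint equals that word's hash, which is a quick sanity check of the weighting and reduction steps.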
The Jaccard algorithm measures the similarity of two sets as the proportion of their intersection to their union.
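As a minimal sketch of this definition, the following Python function computes the ratio of intersection size to union size. The function name is hypothetical, and returning 1.0 for two empty sets is an assumption, since the source does not address that case.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty sets: treated as identical here (an assumption).
        return 1.0
    return len(set_a & set_b) / len(union)
```

For example, the character sets of "abcd" and "bcde" share three of five distinct characters, so `jaccard_similarity("abcd", "bcde")` evaluates to 0.6.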
If it is determined that there are web pages similar to the to-be-deduplicated web page in the candidate web page set, the executing entity may execute step 204.
Step 204, if such a web page exists, setting a similarity flag bit of the web page to be deduplicated.
In this embodiment, if it is determined in step 203 that there is a web page similar to the to-be-deduplicated web page in the candidate web page set, the execution subject may set a similar flag bit of the to-be-deduplicated web page. Here, web pages typically have a similarity flag that indicates whether two or more web pages are similar. If the similarity flag bits of the two web pages are the same, it can be shown that the two web pages are similar.
Specifically, the executing entity may set the value of the similarity flag of the to-be-deduplicated web page to the value of the similarity flag of the web page similar to the to-be-deduplicated web page in the candidate web page set. For example, if the similarity flag of the web page similar to the to-be-deduplicated web page in the candidate web page set is 2, the execution main body may set the value of the similarity flag of the to-be-deduplicated web page to 2.
It should be noted that, at the beginning, the similarity flag bit of each web page to be deduplicated in the set of web pages to be deduplicated may be initialized to a target value, for example, -1. When it is determined that a web page similar to the web page to be deduplicated exists in the candidate web page set, the similarity flag bit of the web page to be deduplicated may be updated from -1 to the value of the similarity flag bit of that similar web page.
Step 205, grouping the web pages to be deduplicated in the set of web pages to be deduplicated by using the similarity flag bits.
In this embodiment, the execution body may group the web pages to be deduplicated in the set by using the similarity flag bits. Specifically, the execution body may place the web pages in the set that have the same similarity flag bit into one group.
It should be noted that, if the value of the similar flag bit of the to-be-deduplicated web pages in the to-be-deduplicated web page set is initialized to the target value at the beginning, at this time, each to-be-deduplicated web page whose value of the similar flag bit is the value at the time of initialization may be separately divided into a group.
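The flag assignment of step 204 and the grouping of step 205 might look like the following sketch. The dictionary layout, the `flag` key, and the function names are hypothetical; only the initial value -1 and the rule that pages still holding the initial value each form their own group come from the description above.

```python
from collections import defaultdict

UNSET_FLAG = -1  # initial similarity flag value, following the example above


def assign_flag(page, similar_candidate):
    """Copy the similar candidate's flag to the page, if such a candidate exists."""
    if similar_candidate is not None:
        page["flag"] = similar_candidate["flag"]
    return page


def group_by_flag(pages):
    """Pages sharing a flag form one group; pages still unset stay alone."""
    groups = defaultdict(list)
    singletons = []
    for page in pages:
        if page["flag"] == UNSET_FLAG:
            singletons.append([page])
        else:
            groups[page["flag"]].append(page)
    return list(groups.values()) + singletons
```

Treating unset pages as singleton groups keeps pages with no detected duplicate in the output of the later selection step.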
Step 206, selecting a target web page from each group of web pages to be deduplicated based on the web page features, and deleting the web pages other than the target web page to obtain a deduplicated web page set.
In this embodiment, the execution main body may select a target webpage from each group of webpages to be deduplicated based on the webpage features, and delete other webpages except the target webpage to obtain a webpage set after deduplication. For example, if the web page features include a web page character length, the execution main body may select a web page with the largest web page character length from each group of web pages to be deduplicated as a target web page.
In the method provided by the above embodiment of the present disclosure, a set of web pages to be deduplicated is acquired; for each web page to be deduplicated in the set, web page features are extracted from the web page data of the web page, it is determined, by using the web page title and the web page text of the web page and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a web page similar to the web page to be deduplicated exists in a candidate web page set, and if so, a similarity flag bit of the web page to be deduplicated is set; then, the web pages to be deduplicated in the set are grouped by using the similarity flag bits; and finally, based on the web page features, a target web page is selected from each group of web pages to be deduplicated and the web pages other than the target web page are deleted to obtain a deduplicated web page set. By combining the vector space hash algorithm with the minimum common substring matching algorithm in this way, the deduplication quality for web page text is improved while deduplication efficiency is improved and memory is saved.
In some alternative implementations, the web page characteristics may include a weight of a website to which the web page belongs. The execution main body can select a target webpage from each group of webpages to be deduplicated based on the webpage features in the following way: the execution main body can select a webpage with the highest weight of a website to which the webpage belongs from each group of webpages to be deduplicated as a target webpage.
In some alternative implementations, the web page features may include a web page publishing time. The execution body may select a target web page from each group of web pages to be deduplicated based on the web page features in the following way: for each group of web pages to be deduplicated, if the weights of the websites to which the web pages in the group belong are the same, the execution body may select the web page with the earliest publishing time from the group as the target web page.
In some optional implementations, the executing body may select a target webpage from each group of webpages to be deduplicated based on the webpage features in the following manner: for each group of to-be-deduplicated web pages, if the weights of websites to which the web pages belong in the group of to-be-deduplicated web pages are the same and the web page release time is the same, the execution main body can select the web page with the earliest capturing time from the group of to-be-deduplicated web pages as a target web page.
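Taken together, the three optional implementations above suggest a tie-breaking order: highest website weight, then earliest publishing time, then earliest capture time. The sketch below combines them into one sort key; this combination, the field names `site_weight`, `publish_time`, and `crawl_time`, and the function names are assumptions made for illustration, and the time values are assumed to be mutually comparable (for example, ISO date strings or timestamps).

```python
def select_target(group):
    """Pick the page to keep from one group of similar pages.

    Priority: highest site weight, then earliest publishing time,
    then earliest capture time.
    """
    return min(
        group,
        key=lambda p: (-p["site_weight"], p["publish_time"], p["crawl_time"]),
    )


def deduplicate(groups):
    """Keep one target page per group and drop the others."""
    return [select_target(group) for group in groups]
```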
With further reference to fig. 3, a flow 300 of yet another embodiment of a text deduplication method is illustrated. The process 300 of the text deduplication method includes the following steps:
Step 301, acquiring a set of web pages to be deduplicated.
Step 302, for each webpage to be deduplicated in the set of webpages to be deduplicated, extracting webpage features from the webpage data of the webpage to be deduplicated.
In the present embodiment, the steps 301-302 can be performed in a similar manner to the steps 201-202, and will not be described herein again.
Step 303, using a vector space hash algorithm to obtain a vector space hash value for the webpage title and the webpage text of the webpage to be deduplicated.
In this embodiment, an executing body (for example, the server shown in fig. 1) of the text deduplication method may use a vector space hash algorithm to obtain a vector space hash value for the web page title and the web page body of the webpage to be deduplicated.
Here, the specific manner of using the vector space hash algorithm to obtain the vector space hash value for the text is not described herein again.
Step 304, determining a hamming distance between the vector space hash value of the web page text of the web page to be deduplicated and the vector space hash value of the web page text of the candidate web page as a first hamming distance for each candidate web page in the candidate web page set.
In this embodiment, for each candidate web page in the candidate web page set, the execution body may determine the Hamming distance between the vector space hash value of the web page text of the web page to be deduplicated and the vector space hash value of the web page text of the candidate web page as the first Hamming distance. The Hamming distance is the number of positions at which two equal-length strings differ. It can be obtained by performing a bitwise exclusive OR (XOR) operation on the two binary strings and counting the number of 1s in the result.
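For fingerprints stored as integers, the XOR-and-count computation can be written in one line of Python. The function name is hypothetical, and both fingerprints are assumed to have the same bit width.

```python
def hamming_distance(hash_a, hash_b):
    """Count the bit positions at which two equal-width fingerprints differ."""
    # XOR leaves a 1 exactly where the two fingerprints disagree.
    return bin(hash_a ^ hash_b).count("1")
```

For instance, `0b1011` and `0b1101` differ in two positions, so their Hamming distance is 2.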
Here, the vector space hash value of the web page body of each candidate web page in the candidate web page set is usually obtained in advance.
Step 305, determining whether a first webpage exists in the candidate webpage set or not by using the first hamming distance.
In this embodiment, the execution subject may determine whether the first webpage exists in the candidate webpage set by using the first hamming distance. The web page text of the first web page is generally similar to the web page text of the web page to be deduplicated. Specifically, the executing body may determine whether the first hamming distance is smaller than a preset first distance threshold. If the first hamming distance is smaller than the first distance threshold, it may be determined that a first web page exists in the candidate web page set.
If the first web page exists in the candidate web page set, the executing entity may execute step 306.
Step 306, if the first webpage exists in the candidate webpage set, determining that a webpage similar to the webpage to be deduplicated exists in the candidate webpage set.
In this embodiment, if it is determined in step 305 that the first webpage exists in the candidate webpage set, the executing entity may determine that a webpage similar to the webpage to be deduplicated exists in the candidate webpage set, where the first webpage in the candidate webpage set is a webpage similar to the webpage to be deduplicated.
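Steps 304 to 306 can be sketched as a scan over precomputed candidate fingerprints. The threshold value 3, the `body_hash` key, and the function name are assumptions; the source only states that the first Hamming distance is compared against a preset first distance threshold.

```python
def find_first_page(page_body_hash, candidates, first_threshold=3):
    """Return a candidate whose body fingerprint is within the threshold, else None.

    Each candidate is a dict with a precomputed integer "body_hash".
    """
    for candidate in candidates:
        # First Hamming distance: differing bits between the two body fingerprints.
        distance = bin(page_body_hash ^ candidate["body_hash"]).count("1")
        if distance < first_threshold:
            return candidate
    return None
```

A return value of `None` corresponds to the case where no first web page exists and the title-based comparison of step 307 applies.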
Step 307, if the first webpage does not exist in the candidate webpage set, determining, for each candidate webpage in the candidate webpage set, a hamming distance between the vector space hash value of the webpage title of the webpage to be deduplicated and the vector space hash value of the webpage title of the candidate webpage as a second hamming distance.
In this embodiment, if it is determined in step 305 that the first web page does not exist in the candidate web page set, the execution body may determine, for each candidate web page in the candidate web page set, the Hamming distance between the vector space hash value of the web page title of the web page to be deduplicated and the vector space hash value of the web page title of the candidate web page as a second Hamming distance. Specifically, the execution body may perform a bitwise XOR operation on the two vector space hash values; the number of 1s in the XOR result is the Hamming distance.
Here, the vector space hash value of the web page title of each candidate web page in the candidate web page set is usually obtained in advance.
Step 308, determining whether a second web page set exists in the candidate web page set by using the second Hamming distance.
In this embodiment, the execution subject may determine whether a second set of web pages exists in the candidate set of web pages by using the second hamming distance. The web page title of the second web page in the second web page set is similar to the web page title of the web page to be deduplicated. Specifically, the executing body may determine whether the second hamming distance is smaller than a preset second distance threshold. If the second hamming distance is smaller than the second distance threshold, it may be determined that a second web page set exists in the candidate web page set.
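As a non-limiting illustration, the selection of the second web page set may be sketched as follows. The function name `select_second_pages`, the dictionary keyed by page identifier, and the threshold value used in the example are assumptions made for the sketch and are not specified by the disclosure.

```python
from typing import Dict, List


def select_second_pages(
    title_hash: int,
    candidate_title_hashes: Dict[str, int],
    second_distance_threshold: int,
) -> List[str]:
    """Return the ids of candidate pages whose title hash is close enough."""
    second_pages = []
    for page_id, candidate_hash in candidate_title_hashes.items():
        # Second Hamming distance: number of 1 bits in the XOR of the title hashes.
        distance = bin(title_hash ^ candidate_hash).count("1")
        if distance < second_distance_threshold:
            second_pages.append(page_id)
    return second_pages


# Hypothetical 4-bit hashes and threshold, for illustration only.
candidates = {"a": 0b1111, "b": 0b0000, "c": 0b1110}
second_page_set = select_second_pages(0b1111, candidates, second_distance_threshold=2)
```

An empty result corresponds to the case in which no second web page set exists in the candidate web page set.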
If the second set of web pages exists in the candidate set of web pages, the executing entity may execute step 309.
Step 309, if a second web page set exists in the candidate web page set, performing minimum common substring comparison on the web page text of the second web page and the web page text of the web page to be deduplicated for each second web page in the second web page set.
In this embodiment, if it is determined in step 308 that the second web page set exists in the candidate web page set, the executing entity may perform, for each second web page in the second web page set, a minimum common substring comparison between the web page text of the second web page and the web page text of the to-be-deduplicated web page. Specifically, the executing entity may first obtain an intersection of the web page text of the second web page and the web page text of the to-be-deduplicated web page; then, obtaining a union of the webpage text of the second webpage and the webpage text of the webpage to be deduplicated; and finally, the proportion of the intersection to the union can be obtained as the similarity between the webpage text of the second webpage and the webpage text of the webpage to be deduplicated.
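As a non-limiting illustration, the intersection-over-union similarity described above may be sketched as follows. The disclosure does not state which elements the intersection and union are taken over; this sketch assumes sets of fixed-length character windows, and the names `jaccard_similarity` and `window` are hypothetical.

```python
from typing import Set


def _char_windows(text: str, window: int) -> Set[str]:
    """Set of all runs of `window` consecutive characters (an assumed unit)."""
    if len(text) < window:
        return {text} if text else set()
    return {text[i:i + window] for i in range(len(text) - window + 1)}


def jaccard_similarity(text_a: str, text_b: str, window: int = 4) -> float:
    """Ratio of the intersection to the union of the two window sets."""
    set_a = _char_windows(text_a, window)
    set_b = _char_windows(text_b, window)
    union = set_a | set_b
    if not union:
        # Two empty texts: treated as not similar in this sketch (an assumption).
        return 0.0
    return len(set_a & set_b) / len(union)
```

The returned value may then be compared with the preset similarity threshold of step 310.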
Step 310, determining whether the web page text of the second web page is similar to the web page text of the web page to be deduplicated.
In this embodiment, the executing entity may determine whether the web page text of the second web page is similar to the web page text of the to-be-deduplicated web page. Specifically, the execution main body may determine whether the similarity between the web page text of the second web page and the web page text of the to-be-deduplicated web page is greater than a preset similarity threshold. If the similarity is greater than the similarity threshold, step 311 may be executed.
Step 311, if the web page text of the second web page is similar to the web page text of the web page to be deduplicated, determining that a web page similar to the web page to be deduplicated exists in the candidate web page set.
In this embodiment, if it is determined in step 310 that the web page text of the second web page is similar to the web page text of the to-be-deduplicated web page, the executing body may determine that a web page similar to the to-be-deduplicated web page exists in the candidate web page set. And the second webpage with the webpage text similar to the webpage text of the webpage to be deduplicated in the candidate webpage set is the webpage similar to the webpage to be deduplicated.
Step 312, if it is determined in steps 306 and 311 that there is a web page similar to the web page to be deduplicated in the candidate web page set, setting a similar flag bit of the web page to be deduplicated.
Step 313, grouping the web pages to be deduplicated in the web page set to be deduplicated by using the similar flag bits.
Step 314, selecting a target web page from each group of web pages to be deduplicated based on the web page features, and deleting the web pages other than the target web page to obtain a deduplicated web page set.
In this embodiment, steps 312 to 314 may be performed in a manner similar to steps 204 to 206, and are not described herein again.
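As a non-limiting illustration, the grouping and deletion of steps 313 and 314 may be sketched as follows. The dictionary layout, the `similar_flag` key, and the `choose_target` callback are hypothetical; the sketch also assumes that every web page carries a flag value, with similar web pages sharing the same value.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def group_by_similar_flag(pages: List[dict]) -> Dict[Hashable, List[dict]]:
    """Group pages so that pages sharing a similar flag land in one group."""
    groups = defaultdict(list)
    for page in pages:
        groups[page["similar_flag"]].append(page)
    return dict(groups)


def deduplicate(pages: List[dict], choose_target: Callable[[List[dict]], dict]) -> List[dict]:
    """Keep one target page per group and drop the other pages of the group."""
    return [choose_target(group) for group in group_by_similar_flag(pages).values()]


# Hypothetical pages, for illustration only.
pages = [
    {"url": "https://example.com/a", "similar_flag": 1},
    {"url": "https://example.org/a-copy", "similar_flag": 1},
    {"url": "https://example.net/b", "similar_flag": 2},
]
# The first page of each group is kept here purely as a placeholder rule.
deduplicated = deduplicate(pages, choose_target=lambda group: group[0])
```

In practice `choose_target` would apply the web page features, such as the weight of the website to which the web page belongs.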
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the process 300 of the text deduplication method in this embodiment details the steps of determining the similarity of the web page texts by using the sim-hash algorithm, determining the similarity of the web page titles by using the sim-hash algorithm if no similar web page is found in that way, and comparing the web page texts of the web pages with similar titles by using the Jaccard algorithm. The scheme described in this embodiment therefore provides a way of determining similar web pages by combining a vector space hash algorithm with a minimum common substring matching algorithm, which improves the accuracy of short-text deduplication, is significantly faster than the Jaccard algorithm, and optimizes memory usage.
In some optional implementations, if it is determined in step 310 that the web page text of the second web page is not similar to the web page text of the web page to be deduplicated, the executing entity may add the web page to be deduplicated to the candidate web page set.
In some optional implementations, if it is determined in step 308 that the second set of web pages does not exist in the candidate set of web pages, the executing entity may add the web page to be deduplicated to the candidate set of web pages.
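As a non-limiting illustration, steps 304 to 311 together with the two optional implementations above may be sketched as a single cascade. The `Page` structure, the function names, and all three threshold values are hypothetical; the text similarity function is passed in as a parameter, and a simple word-set ratio is used in the example only to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Page:
    page_id: str
    title_hash: int  # vector space hash value of the web page title
    text_hash: int   # vector space hash value of the web page text
    text: str


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def find_similar(
    page: Page,
    candidates: List[Page],
    text_similarity: Callable[[str, str], float],
    first_threshold: int = 3,           # assumed value
    second_threshold: int = 3,          # assumed value
    similarity_threshold: float = 0.8,  # assumed value
) -> Optional[Page]:
    # Steps 304 to 306: compare the web page text hashes.
    for candidate in candidates:
        if hamming(page.text_hash, candidate.text_hash) < first_threshold:
            return candidate
    # Steps 307 and 308: fall back to the web page title hashes.
    second_pages = [
        c for c in candidates
        if hamming(page.title_hash, c.title_hash) < second_threshold
    ]
    # Steps 309 to 311: compare texts only for pages with similar titles.
    for candidate in second_pages:
        if text_similarity(page.text, candidate.text) > similarity_threshold:
            return candidate
    # Optional implementations: no similar page, so the page becomes a candidate.
    candidates.append(page)
    return None


def word_jaccard(a: str, b: str) -> float:
    """Placeholder similarity over word sets, used only for this example."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

Because the text comparison runs only on candidates that pass the title check, the more expensive comparison is applied to a small subset of the candidate web page set.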
In some optional implementations, the executing entity may obtain the vector space hash values of the web page title and the web page text of the web page to be deduplicated by using the vector space hash algorithm as follows: when segmenting the web page title and the web page text for the vector space hash algorithm, the executing entity may perform sliding-window word segmentation on them, so that every word segmentation result has the same character length. Because every word segmentation result has the same length, the results carry equal weight when the sim-hash is calculated, which avoids the inaccuracy in the sim-hash Hamming distance that arises when segments of unequal length skew the segmented content.
As an example, if the text is "how to perform de-duplication on the text", sliding-window word segmentation may be performed on the text with a window of four characters, giving the following word segmentation results: how to perform text, how to perform text entry, how to perform text removal, and how to perform deduplication.
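As a non-limiting illustration, sliding-window word segmentation and an equal-weight sim-hash may be sketched as follows. The function names, the 64-bit fingerprint width, and the use of SHA-256 as the per-token hash are assumptions of this sketch; the disclosure does not specify them.

```python
import hashlib
from typing import List


def sliding_window_tokens(text: str, window: int = 4) -> List[str]:
    """Split text into overlapping tokens of `window` characters each."""
    if len(text) <= window:
        return [text] if text else []
    return [text[i:i + window] for i in range(len(text) - window + 1)]


def simhash(text: str, window: int = 4, bits: int = 64) -> int:
    """Sim-hash fingerprint in which every token carries the same weight."""
    votes = [0] * bits
    for token in sliding_window_tokens(text, window):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:bits // 8], "big")
        for bit in range(bits):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            votes[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(bits):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


tokens = sliding_window_tokens("abcdef", window=4)
```

Here `tokens` contains "abcd", "bcde", and "cdef", each four characters long, so no token dominates the vote because of its length.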
With further reference to fig. 4, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a text deduplication apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 4, the text deduplication apparatus 400 of the present embodiment includes: an acquisition unit 401, a setting unit 402, a grouping unit 403, and a deduplication unit 404. The acquiring unit 401 is configured to acquire a to-be-deduplicated web page set; the setting unit 402 is configured to extract, for each to-be-deduplicated web page in the to-be-deduplicated web page set, a web page feature from web page data of the to-be-deduplicated web page, determine, by using a web page title and a web page text of the to-be-deduplicated web page, whether a web page similar to the to-be-deduplicated web page exists in a candidate web page set based on a vector space hash algorithm and a minimum common substring matching algorithm, and if the web page exists, set a similar flag bit of the to-be-deduplicated web page, where the web page feature includes the web page title and the web page text, and similar flag bits of similar web pages are the same; the grouping unit 403 is configured to group the to-be-deduplicated web pages in the to-be-deduplicated web page set by using the similar flag bit; the duplication removal unit 404 is configured to select a target webpage from each group of webpages to be duplicated based on the webpage features, and delete other webpages except the target webpage to obtain a duplicate-removed webpage set.
In this embodiment, the specific processing of the obtaining unit 401 of the text deduplication device 400 may refer to step 201 in the embodiment corresponding to fig. 2, the specific processing of the setting unit 402 may refer to step 202, step 203, and step 204 in the embodiment corresponding to fig. 2, the specific processing of the grouping unit 403 may refer to step 205 in the embodiment corresponding to fig. 2, and the specific processing of the deduplication unit 404 may refer to step 206 in the embodiment corresponding to fig. 2.
In some optional implementation manners, the setting unit 402 may be further configured to determine whether a webpage similar to the webpage to be deduplicated exists in the candidate webpage set based on a vector space hash algorithm and a minimum common substring matching algorithm by using a webpage title and a webpage text of the webpage to be deduplicated as follows: the setting unit 402 may use a vector space hash algorithm to obtain a vector space hash value for the web title and the web text of the to-be-deduplicated web page; then, determining a hamming distance between the vector space hash value of the web page text of the web page to be deduplicated and the vector space hash value of the web page text of the candidate web page as a first hamming distance for each candidate web page in the candidate web page set; then, determining whether a first webpage exists in the candidate webpage set or not by using the first hamming distance, wherein the webpage text of the first webpage is similar to the webpage text of the webpage to be deduplicated; if the first web page exists in the candidate web page set, the setting unit 402 may determine that a web page similar to the to-be-deduplicated web page exists in the candidate web page set.
In some optional implementations, after determining whether a first web page exists in the candidate web page set by using the first hamming distance, if the first web page does not exist in the candidate web page set, the setting unit 402 may determine, as a second hamming distance, a hamming distance between a vector space hash value of a web page title of the web page to be deduplicated and a vector space hash value of a web page title of the candidate web page for each candidate web page in the candidate web page set; then, determining whether a second webpage set exists in the candidate webpage set or not by using the second hamming distance, wherein the webpage title of the second webpage is similar to the webpage title of the webpage to be deduplicated; if the second webpage set exists in the candidate webpage set, performing minimum common substring comparison on the webpage text of the second webpage and the webpage text of the webpage to be deduplicated for each second webpage in the second webpage set, and determining whether the webpage text of the second webpage is similar to the webpage text of the webpage to be deduplicated; if the web page text of the second web page is similar to the web page text of the to-be-deduplicated web page, the setting unit 402 may determine that a web page similar to the to-be-deduplicated web page exists in the candidate web page set.
In some optional implementation manners, after determining whether the web page text of the second web page is similar to the web page text of the to-be-deduplicated web page, if the web page text of the second web page is not similar to the web page text of the to-be-deduplicated web page, the setting unit 402 may add the to-be-deduplicated web page to the candidate web page set.
In some optional implementation manners, after determining whether a second set of web pages exists in the candidate set of web pages by using the second hamming distance, if the second set of web pages does not exist in the candidate set of web pages, the setting unit 402 may add the web page to be deduplicated to the candidate set of web pages.
In some optional implementations, the setting unit 402 may be further configured to obtain a vector space hash value for the web page title and the web page text of the web page to be deduplicated by using a vector space hash algorithm as follows: the setting unit 402 may perform window sliding segmentation on the web title and the web text of the to-be-deduplicated web page, where the character length of each segmentation result is equal.
In some optional implementations, the web page feature may include a weight of a website to which the web page belongs; and the deduplication unit 404 may be further configured to select a target webpage from each group of webpages to be deduplicated based on the webpage features as follows: the deduplication unit 404 may select, from each group of webpages to be deduplicated, a webpage with the highest weight of a website to which the webpage belongs as a target webpage.
In some optional implementations, the web page feature may include a web page publishing time; and the deduplication unit 404 may be further configured to select a target webpage from each group of webpages to be deduplicated based on the webpage features as follows: for each group of to-be-deduplicated web pages, if the weights of the websites to which the web pages in the group of to-be-deduplicated web pages belong are the same, the deduplication unit 404 may select, as the target web page, a web page with the earliest web page release time from the group of to-be-deduplicated web pages.
In some optional implementations, the deduplication unit 404 may be further configured to select a target webpage from each group of webpages to be deduplicated based on the webpage features by: for each group of to-be-deduplicated web pages, if the weights of the websites to which the web pages belong in the group of to-be-deduplicated web pages are the same and the web page release times are the same, the deduplication unit 404 may select the web page with the earliest crawling time from the group of to-be-deduplicated web pages as the target web page.
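As a non-limiting illustration, the three selection rules above (highest website weight, then earliest web page release time, then earliest crawling time) may be sketched as a single ordered comparison. The `PageInfo` structure, its field names, and the use of numeric timestamps are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageInfo:
    url: str
    site_weight: float   # weight of the website to which the page belongs
    publish_time: float  # web page release time, as a timestamp
    crawl_time: float    # crawling time, as a timestamp


def select_target(group: List[PageInfo]) -> PageInfo:
    """Pick the target page of one group of similar pages."""
    # Negating the weight makes the highest weight sort first; ties fall
    # through to the earliest release time and then the earliest crawl time.
    return min(group, key=lambda p: (-p.site_weight, p.publish_time, p.crawl_time))
```

Expressing the rules as one tuple key keeps the tie-breaking order explicit and in one place.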
Referring now to FIG. 5, a schematic diagram of an electronic device (e.g., the server of FIG. 1) 500 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage device 508 into a Random Access Memory (RAM) 503. The RAM 503 also stores various programs and data necessary for the operation of the electronic device 500. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 5 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a web page set to be deduplicated; for each web page to be deduplicated in the web page set to be deduplicated, extract web page features from the web page data of the web page to be deduplicated, determine, by using the web page title and the web page text of the web page to be deduplicated and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a web page similar to the web page to be deduplicated exists in a candidate web page set, and if so, set a similar flag bit of the web page to be deduplicated, where the web page features include the web page title and the web page text, and similar web pages have the same similar flag bit; group the web pages to be deduplicated in the web page set to be deduplicated by using the similar flag bits; and, based on the web page features, select a target web page from each group of web pages to be deduplicated and delete the web pages other than the target web page to obtain a deduplicated web page set.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
According to one or more embodiments of the present disclosure, there is provided a text deduplication method, including: acquiring a web page set to be deduplicated; for each web page to be deduplicated in the web page set to be deduplicated, extracting web page features from the web page data of the web page to be deduplicated, determining, by using the web page title and the web page text of the web page to be deduplicated and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a web page similar to the web page to be deduplicated exists in a candidate web page set, and if so, setting a similar flag bit of the web page to be deduplicated, where the web page features include the web page title and the web page text, and similar web pages have the same similar flag bit; grouping the web pages to be deduplicated in the web page set to be deduplicated by using the similar flag bits; and, based on the web page features, selecting a target web page from each group of web pages to be deduplicated and deleting the web pages other than the target web page to obtain a deduplicated web page set.
According to one or more embodiments of the present disclosure, determining whether a webpage similar to the webpage to be deduplicated exists in a candidate webpage set based on a vector space hash algorithm and a minimum common substring matching algorithm by using a webpage title and a webpage text of the webpage to be deduplicated includes: utilizing a vector space hash algorithm to obtain a vector space hash value for the webpage title and the webpage text of the webpage to be deduplicated; determining a Hamming distance between a vector space Hash value of a webpage text of the webpage to be deduplicated and a vector space Hash value of a webpage text of the candidate webpage as a first Hamming distance for each candidate webpage in the candidate webpage set; determining whether a first webpage exists in the candidate webpage set or not by using the first Hamming distance, wherein the webpage text of the first webpage is similar to the webpage text of the webpage to be deduplicated; and if the first webpage exists in the candidate webpage set, determining that the webpage similar to the webpage to be deduplicated exists in the candidate webpage set.
According to one or more embodiments of the present disclosure, after determining whether a first web page exists in the candidate web page set by using the first Hamming distance, the method further includes: if the first web page does not exist in the candidate web page set, determining, for each candidate web page in the candidate web page set, the Hamming distance between the vector space hash value of the web page title of the web page to be deduplicated and the vector space hash value of the web page title of the candidate web page as a second Hamming distance; determining whether a second web page set exists in the candidate web page set by using the second Hamming distance, where the web page title of each second web page in the second web page set is similar to the web page title of the web page to be deduplicated; if the second web page set exists in the candidate web page set, performing, for each second web page in the second web page set, minimum common substring comparison between the web page text of the second web page and the web page text of the web page to be deduplicated, and determining whether the web page text of the second web page is similar to the web page text of the web page to be deduplicated; and if the web page text of the second web page is similar to the web page text of the web page to be deduplicated, determining that a web page similar to the web page to be deduplicated exists in the candidate web page set.
According to one or more embodiments of the present disclosure, after determining whether the web page text of the second web page is similar to the web page text of the web page to be deduplicated, the method further includes: if the web page text of the second web page is not similar to the web page text of the web page to be deduplicated, adding the web page to be deduplicated to the candidate web page set.
According to one or more embodiments of the present disclosure, after determining whether a second web page set exists in the candidate web page set by using the second Hamming distance, the method further includes: if the second web page set does not exist in the candidate web page set, adding the web page to be deduplicated to the candidate web page set.
According to one or more embodiments of the present disclosure, obtaining vector space hash values of the web page title and the web page text of the web page to be deduplicated by using the vector space hash algorithm includes: performing sliding-window word segmentation on the web page title and the web page text of the web page to be deduplicated, where every word segmentation result has the same character length.
According to one or more embodiments of the present disclosure, the web page features include a weight of the website to which the web page belongs; and selecting a target web page from each group of web pages to be deduplicated based on the web page features includes: selecting, from each group of web pages to be deduplicated, the web page whose website has the highest weight as the target web page.
According to one or more embodiments of the present disclosure, the web page features include a web page release time; and selecting a target web page from each group of web pages to be deduplicated based on the web page features includes: for each group of web pages to be deduplicated, if the weights of the websites to which the web pages in the group belong are the same, selecting the web page with the earliest web page release time from the group as the target web page.
According to one or more embodiments of the present disclosure, selecting a target web page from each group of web pages to be deduplicated based on the web page features includes: for each group of web pages to be deduplicated, if the weights of the websites to which the web pages in the group belong are the same and the web page release times are the same, selecting the web page with the earliest crawling time from the group as the target web page.
According to one or more embodiments of the present disclosure, there is provided a text deduplication apparatus, the apparatus including: an acquisition unit, configured to acquire a web page set to be deduplicated; a setting unit, configured to, for each web page to be deduplicated in the web page set to be deduplicated, extract web page features from the web page data of the web page to be deduplicated, determine, by using the web page title and the web page text of the web page to be deduplicated and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a web page similar to the web page to be deduplicated exists in a candidate web page set, and if so, set a similar flag bit of the web page to be deduplicated, where the web page features include the web page title and the web page text, and similar web pages have the same similar flag bit; a grouping unit, configured to group the web pages to be deduplicated in the web page set to be deduplicated by using the similar flag bits; and a deduplication unit, configured to select a target web page from each group of web pages to be deduplicated based on the web page features and delete the web pages other than the target web page to obtain a deduplicated web page set.
According to one or more embodiments of the present disclosure, the setting unit is further configured to determine whether a webpage similar to the webpage to be deduplicated exists in the candidate webpage set based on a vector space hash algorithm and a minimum common substring matching algorithm by using a webpage title and a webpage text of the webpage to be deduplicated as follows: utilizing a vector space hash algorithm to obtain a vector space hash value for the webpage title and the webpage text of the webpage to be deduplicated; determining a Hamming distance between a vector space Hash value of a webpage text of the webpage to be deduplicated and a vector space Hash value of a webpage text of the candidate webpage as a first Hamming distance for each candidate webpage in the candidate webpage set; determining whether a first webpage exists in the candidate webpage set or not by using the first Hamming distance, wherein the webpage text of the first webpage is similar to the webpage text of the webpage to be deduplicated; and if the first webpage exists in the candidate webpage set, determining that the webpage similar to the webpage to be deduplicated exists in the candidate webpage set.
According to one or more embodiments of the present disclosure, after determining whether a first web page exists in the candidate web page set by using the first Hamming distance, the setting unit is further configured to: if the first web page does not exist in the candidate web page set, determine, for each candidate web page in the candidate web page set, the Hamming distance between the vector space hash value of the web page title of the web page to be deduplicated and the vector space hash value of the web page title of the candidate web page as a second Hamming distance; determine whether a second web page set exists in the candidate web page set by using the second Hamming distance, where the web page title of each second web page in the second web page set is similar to the web page title of the web page to be deduplicated; if the second web page set exists in the candidate web page set, perform, for each second web page in the second web page set, minimum common substring comparison between the web page text of the second web page and the web page text of the web page to be deduplicated, and determine whether the web page text of the second web page is similar to the web page text of the web page to be deduplicated; and if the web page text of the second web page is similar to the web page text of the web page to be deduplicated, determine that a web page similar to the web page to be deduplicated exists in the candidate web page set.
According to one or more embodiments of the present disclosure, after determining whether the webpage text of the second webpage is similar to the webpage text of the webpage to be deduplicated, the setting unit is further configured to add the webpage to be deduplicated to the candidate webpage set if the webpage text of the second webpage is not similar to the webpage text of the webpage to be deduplicated.
According to one or more embodiments of the present disclosure, after determining whether the second webpage set exists in the candidate webpage set by using the second Hamming distance, the setting unit is further configured to add the webpage to be deduplicated to the candidate webpage set if the second webpage set does not exist in the candidate webpage set.
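The cascade described in the preceding paragraphs (text hash first, then title hash, then substring comparison, otherwise registration as a new candidate) might look roughly like the sketch below. The disclosure does not define the minimum common substring comparison, so the sketch substitutes an assumed check: the longest common substring must cover a given fraction of the shorter text. The `Page` class, the thresholds, and `min_ratio` are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Page:
    title_hash: int  # fingerprint of the webpage title
    text_hash: int   # fingerprint of the webpage text
    text: str        # raw webpage text, used by the substring comparison


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def texts_similar_by_substring(a, b, min_ratio=0.5):
    """Assumed stand-in for the substring comparison: the longest common
    substring must cover at least min_ratio of the shorter text."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return False
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / shorter >= min_ratio


def find_similar_or_register(page, candidates, text_max=3, title_max=3):
    """Return a similar candidate, or None after adding page to candidates."""
    # Stage 1: first Hamming distance, computed on the text hashes.
    for cand in candidates:
        if hamming(page.text_hash, cand.text_hash) <= text_max:
            return cand
    # Stage 2: second Hamming distance, computed on the title hashes.
    second_set = [c for c in candidates
                  if hamming(page.title_hash, c.title_hash) <= title_max]
    # Stage 3: substring comparison on the texts of title-similar candidates.
    for cand in second_set:
        if texts_similar_by_substring(page.text, cand.text):
            return cand
    # No similar webpage was found: the page becomes a new candidate.
    candidates.append(page)
    return None
```

Running the cheap hash comparisons first means the more expensive string comparison is only performed on candidates whose titles already look alike.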
According to one or more embodiments of the present disclosure, the setting unit is further configured to obtain vector space hash values for the webpage title and the webpage text of the webpage to be deduplicated by using the vector space hash algorithm as follows: performing sliding-window word segmentation on the webpage title and the webpage text of the webpage to be deduplicated, wherein every segmentation result has the same character length.
According to one or more embodiments of the present disclosure, the webpage features include a weight of the website to which a webpage belongs; and the deduplication unit is further configured to select a target webpage from each group of webpages to be deduplicated based on the webpage features as follows: selecting, from each group of webpages to be deduplicated, the webpage whose website has the highest weight as the target webpage.
According to one or more embodiments of the present disclosure, the webpage features include a webpage publication time; and the deduplication unit is further configured to select a target webpage from each group of webpages to be deduplicated based on the webpage features as follows: for each group of webpages to be deduplicated, if the websites to which the webpages in the group belong have the same weight, selecting the webpage with the earliest webpage publication time from the group as the target webpage.
According to one or more embodiments of the present disclosure, the deduplication unit is further configured to select a target webpage from each group of webpages to be deduplicated based on the webpage features as follows: for each group of webpages to be deduplicated, if the websites to which the webpages in the group belong have the same weight and the webpage publication times are the same, selecting the webpage with the earliest capture time from the group as the target webpage.
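A minimal sketch of equal-length sliding-window segmentation is given below. The window length of 4 characters and the step of 1 are assumptions chosen for illustration; the disclosure only requires that every segmentation result have the same character length.

```python
def sliding_window_tokens(text, window=4, step=1):
    """Split text into overlapping tokens of exactly `window` characters.

    A text shorter than `window` yields no tokens, so every returned token
    has the same length.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [text[i:i + window]
            for i in range(0, len(text) - window + 1, step)]


print(sliding_window_tokens("abcdef"))  # ['abcd', 'bcde', 'cdef']
```

Because the segmentation works on characters rather than on dictionary words, it needs no language-specific tokenizer, which is convenient for text without whitespace word boundaries.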
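The grouping and target-selection rules can be sketched together as a single ordering: highest website weight first, then earliest publication time, then earliest capture time. The `WebPage` class, its field names, and the use of numeric timestamps are hypothetical, and the sketch assumes every webpage already carries a similar flag value.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WebPage:
    url: str
    flag: int            # similar flag bit; similar webpages share one value
    site_weight: float   # weight of the website the webpage belongs to
    publish_time: float  # webpage publication time (e.g. a Unix timestamp)
    crawl_time: float    # time at which the webpage was captured


def select_targets(pages):
    """Group webpages by similar flag and keep one target webpage per group."""
    groups = defaultdict(list)
    for page in pages:
        groups[page.flag].append(page)
    # Tie-break order: highest site weight, then earliest publication time,
    # then earliest capture time.
    return [
        min(group, key=lambda p: (-p.site_weight, p.publish_time, p.crawl_time))
        for group in groups.values()
    ]
```

Expressing the three rules as one sort key keeps the tie-breaking order explicit and avoids nested conditionals.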
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor, which may be described as follows: a processor includes an acquisition unit, a setting unit, a grouping unit, and a deduplication unit. In some cases the names of the units do not limit the units themselves; for example, the grouping unit may also be described as a "unit that groups the webpages to be deduplicated in the webpage set to be deduplicated by using the similar flag bits".
The foregoing description covers only preferred embodiments of the disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above features, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept defined above, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (12)

1. A method for text deduplication, comprising:
acquiring a webpage set to be deduplicated;
extracting webpage features from webpage data of the webpage to be deduplicated, determining whether a webpage similar to the webpage to be deduplicated exists in a candidate webpage set or not by utilizing a webpage title and a webpage text of the webpage to be deduplicated and based on a vector space hash algorithm and a minimum common substring matching algorithm, and if so, setting similar flag bits of the webpage to be deduplicated, wherein the webpage features comprise a webpage title and a webpage text, and the similar flag bits of the similar webpages are the same;
grouping the webpages to be deduplicated in the webpage set to be deduplicated by using the similar flag bits;
and based on the webpage features, selecting a target webpage from each group of webpages to be deduplicated, deleting the webpages other than the target webpage, and obtaining a deduplicated webpage set.
2. The method according to claim 1, wherein the determining whether the web page similar to the web page to be deduplicated exists in the candidate web page set based on a vector space hash algorithm and a minimum common substring matching algorithm by using the web page title and the web page text of the web page to be deduplicated comprises:
utilizing a vector space hash algorithm to obtain a vector space hash value for the webpage title and the webpage text of the webpage to be deduplicated;
for each candidate webpage in the candidate webpage set, determining a Hamming distance between the vector space hash value of the webpage text of the webpage to be deduplicated and the vector space hash value of the webpage text of the candidate webpage as a first Hamming distance;
determining whether a first webpage exists in the candidate webpage set by using the first Hamming distance, wherein the webpage text of the first webpage is similar to the webpage text of the webpage to be deduplicated;
and if the first webpage exists in the candidate webpage set, determining that the webpage similar to the webpage to be deduplicated exists in the candidate webpage set.
3. The method of claim 2, wherein after said determining whether a first webpage exists in the candidate webpage set by using the first Hamming distance, the method further comprises:
if the first webpage does not exist in the candidate webpage set, determining, for each candidate webpage in the candidate webpage set, a Hamming distance between the vector space hash value of the webpage title of the webpage to be deduplicated and the vector space hash value of the webpage title of the candidate webpage as a second Hamming distance;
determining whether a second webpage set exists in the candidate webpage set by using the second Hamming distance, wherein the webpage title of each second webpage in the second webpage set is similar to the webpage title of the webpage to be deduplicated;
if the second webpage set exists in the candidate webpage set, performing minimum common substring comparison on the webpage text of the second webpage and the webpage text of the webpage to be deduplicated for each second webpage in the second webpage set, and determining whether the webpage text of the second webpage is similar to the webpage text of the webpage to be deduplicated;
and if the webpage text of the second webpage is similar to the webpage text of the webpage to be deduplicated, determining that the webpage similar to the webpage to be deduplicated exists in the candidate webpage set.
4. The method of claim 3, wherein after said determining whether the webpage text of the second webpage is similar to the webpage text of the webpage to be deduplicated, the method further comprises:
if the webpage text of the second webpage is not similar to the webpage text of the webpage to be deduplicated, adding the webpage to be deduplicated to the candidate webpage set.
5. The method of claim 3, wherein after said determining whether a second webpage set exists in the candidate webpage set by using the second Hamming distance, the method further comprises:
if the second webpage set does not exist in the candidate webpage set, adding the webpage to be deduplicated to the candidate webpage set.
6. The method of claim 2, wherein the using a vector space hash algorithm to obtain a vector space hash value for the web page title and the web page text of the web page to be deduplicated comprises:
performing sliding-window word segmentation on the webpage title and the webpage text of the webpage to be deduplicated, wherein every segmentation result has the same character length.
7. The method of claim 1, wherein the webpage features include a weight of the website to which a webpage belongs; and
the selecting a target webpage from each group of webpages to be deduplicated based on the webpage features comprises:
selecting, from each group of webpages to be deduplicated, the webpage whose website has the highest weight as the target webpage.
8. The method of claim 7, wherein the webpage features include a webpage publication time; and
the selecting a target webpage from each group of webpages to be deduplicated based on the webpage features comprises:
for each group of webpages to be deduplicated, if the websites to which the webpages in the group belong have the same weight, selecting the webpage with the earliest webpage publication time from the group as the target webpage.
9. The method of claim 8, wherein the selecting a target webpage from each group of webpages to be deduplicated based on the webpage features comprises:
for each group of webpages to be deduplicated, if the websites to which the webpages in the group belong have the same weight and the webpage publication times are the same, selecting the webpage with the earliest capture time from the group as the target webpage.
10. A text deduplication apparatus, comprising:
an acquisition unit, configured to acquire a webpage set to be deduplicated;
a setting unit, configured to extract webpage features from webpage data of each webpage to be deduplicated in the webpage set to be deduplicated, determine, using the webpage title and the webpage text of the webpage to be deduplicated and based on a vector space hash algorithm and a minimum common substring matching algorithm, whether a webpage similar to the webpage to be deduplicated exists in a candidate webpage set, and if so, set a similar flag bit of the webpage to be deduplicated, wherein the webpage features comprise the webpage title and the webpage text, and similar webpages have the same similar flag bit;
a grouping unit, configured to group the webpages to be deduplicated in the webpage set to be deduplicated by using the similar flag bits;
and a deduplication unit, configured to select a target webpage from each group of webpages to be deduplicated based on the webpage features, delete the webpages other than the target webpage, and obtain a deduplicated webpage set.
11. An electronic device, comprising:
one or more processors; and
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-9.
Application CN202111618080.3A (priority date 2021-12-27, filing date 2021-12-27): Text duplicate removal method and device and electronic equipment. Status: Pending. Publication: CN114417102A (en)

Publications (1)

Publication Number Publication Date
CN114417102A (en) 2022-04-29

Family

ID=81269185

Country Status (1)

Country Link
CN (1) CN114417102A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034475A1 (en) * 2017-07-28 2019-01-31 Enigma Technologies, Inc. System and method for detecting duplicate data records
CN110750731A (en) * 2019-09-27 2020-02-04 成都数联铭品科技有限公司 Duplicate removal method and system for news public sentiment
CN113688629A (en) * 2021-08-04 2021-11-23 德邦证券股份有限公司 Text deduplication method and device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093717A (en) * 2023-10-20 2023-11-21 湖南财信数字科技有限公司 Similar text aggregation method, device, equipment and storage medium thereof
CN117093717B (en) * 2023-10-20 2024-01-30 湖南财信数字科技有限公司 Similar text aggregation method, device, equipment and storage medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination