WO2020215667A1 - Method, apparatus, computer device, and storage medium for fast deduplication of text content - Google Patents
- Publication number
- WO2020215667A1 (PCT/CN2019/116606)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text content
- webpage
- feature
- similarity
- keywords
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Definitions
- This application relates to methods for deduplication of text content, and more specifically to methods, devices, computer equipment, and storage media for rapid deduplication of text content.
- Existing methods for large-scale deduplication mainly use locality-sensitive hashing, a content-based technique that reduces dimensionality to generate a hash signature and then judges the similarity of text content by the similarity of the signatures. Because of the complexity of the Chinese language, these methods cannot represent text content very accurately.
- Existing text feature extraction assumes that features are independent of each other.
- The semantic relationships between feature keywords cannot simply be ignored; similarity calculation performs poorly and does not scale to large, massive data environments; and overall accuracy is low because the semantic context between feature keywords is ignored.
- the purpose of this application is to overcome the defects of the prior art and provide a method, device, computer equipment and storage medium for fast deduplication of text content.
- a method for fast deduplication of text content including:
- the crawling of several webpage text contents that need to be deduplicated includes:
- the preprocessing of the text content of a plurality of web pages to obtain the text content to be deduplicated includes:
- the further technical solution is: extracting characteristic keywords from the text content to be de-duplicated to obtain target characteristic keywords, including:
- the intermediate feature keywords and the initial feature keywords are merged to obtain the target feature keywords.
- said signing the target characteristic keyword to obtain a characteristic signature including:
- said forming a fingerprint of webpage text content according to a characteristic signature includes:
- the calculation of the similarity according to the fingerprint of the text content of the webpage to obtain the similarity of the text content of the webpage includes:
- intersection calculation is performed on the documents to obtain the similarity of the text content of the webpage.
- This application also provides a fast deduplication device for text content, including:
- the crawling unit is used to crawl the text content of several web pages that need to be deduplicated;
- a preprocessing unit configured to preprocess a number of the webpage text content to obtain the text content to be deduplicated
- the extraction unit is used to extract feature keywords from the text content to be de-duplicated to obtain target feature keywords
- a weight calculation unit configured to perform weight calculation on the target feature keyword to obtain a weight value
- the signature unit is used to sign the target characteristic keyword to obtain a characteristic signature
- the fingerprint forming unit is used to form the fingerprint of the text content of the webpage according to the characteristic signature
- the storage unit is used for inverted index storage of webpage text content fingerprints
- the similarity calculation unit is configured to calculate the similarity according to the fingerprint of the webpage text content to obtain the similarity of the webpage text content;
- the output unit is used to output the similarity of the text content of the webpage.
- the present application also provides a computer device that includes a memory and a processor, the memory stores a computer program, and the processor implements the above-mentioned method when the computer program is executed.
- the present application also provides a storage medium that stores a computer program, and the computer program can implement the above-mentioned method when executed by a processor.
- this application extracts target feature keywords and weights that can represent webpage text content based on word semantic relations, and generates webpage text content fingerprints based on the target feature keywords and weights.
- the compressed representation saves storage space and computing time.
- Based on the Elasticsearch inverted index data structure, it stores the fingerprints of the webpage text content and converts the similarity calculation into Elasticsearch Boolean model retrieval, which meets the real-time deduplication performance requirements of massive, large-scale data and improves accuracy and deduplication performance.
- FIG. 1 is a schematic diagram of an application scenario of a method for fast deduplication of text content provided by an embodiment of the application;
- FIG. 2 is a schematic flowchart of a method for fast deduplication of text content provided by an embodiment of the application;
- FIG. 3 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application;
- FIG. 4 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application;
- FIG. 5 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application;
- FIG. 6 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application;
- FIG. 7 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application.
- FIG. 8 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application.
- FIG. 9 is a schematic diagram of the formation of a fingerprint of webpage text content provided by an embodiment of the application.
- FIG. 10 is a schematic diagram of the formation of target feature keywords provided by an embodiment of the application.
- FIG. 11 is a schematic structural diagram of an inverted index provided by an embodiment of the application.
- FIG. 12 is a schematic block diagram of an apparatus for fast deduplication of text content provided by an embodiment of the application.
- FIG. 13 is a schematic block diagram of a computer device provided by an embodiment of the application.
- The method for fast deduplication of text content is applied in a server. The server interacts with a terminal, obtains from the terminal several pieces of webpage text content that need to be deduplicated, quickly deduplicates them, and outputs the result to the terminal for display.
- FIG. 2 is a schematic flowchart of a method for fast deduplication of text content provided by an embodiment of the present application. As shown in Figure 2, the method includes the following steps S110 to S190.
- the text content of the webpage refers to the text with information displayed in the webpage.
- the above-mentioned step S110 may include steps S111 to S114.
- The distributed task scheduler assigns URL (Uniform Resource Locator) addresses to the crawler application nodes. If a URL to be crawled has already been crawled, it is discarded; otherwise, the crawler application node crawls the webpage text content.
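The crawl-once rule described above can be sketched as follows. This is a minimal single-process illustration, not the distributed scheduler of the application; `schedule_crawl` and the caller-supplied `fetch_text` downloader are hypothetical names introduced here.

```python
from collections import deque


def schedule_crawl(seed_urls, fetch_text):
    """Crawl each distinct URL once.

    fetch_text is a hypothetical, caller-supplied function that downloads
    a URL and returns its webpage text content.
    """
    seen = set()             # URLs that have already been crawled
    queue = deque(seed_urls)
    pages = {}
    while queue:
        url = queue.popleft()
        if url in seen:
            continue         # already crawled: discard
        seen.add(url)
        pages[url] = fetch_text(url)
    return pages
```

In a distributed setting the `seen` set would have to be shared between crawler nodes; how the application does that is not stated in this text.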
- the text content to be deduplicated refers to the text content that has been cleaned, filtered, and subjected to word segmentation.
- the above-mentioned step S120 may include steps S121 to S122.
- S122 Perform word segmentation processing on the intermediate text content to obtain the text content to be deduplicated.
- the intermediate text content refers to the content remaining after removing unnecessary data.
- The target feature keywords are words that represent the characteristics and essence of the webpage text content. Refer to Figure 10 for how the target feature keywords are formed. There are many ways to select features of the text content to be deduplicated, such as shingles and n-grams. Because sequences of multiple words or characters do not carry separate, clear semantics, which weakens the representation of the document content, a single keyword is used here as the feature of the text content to be deduplicated.
- the above-mentioned step S130 may include steps S131 to S134.
- the text content to be deduplicated is divided into blocks according to positions to obtain text blocks.
- the text block refers to the content formed by text at different positions.
- the text content to be deduplicated is divided into blocks by location, which are mainly divided into meta-information block, web page body block and title block.
- the initial characteristic keywords refer to the characteristic words representing the content of the text block directly extracted from the content of the text block.
- the feature keywords of each text block are extracted according to the semantic relationship.
- S133 Perform semantic expansion on the initial feature keywords to obtain intermediate feature keywords.
- the intermediate feature keywords refer to words that are synonymous with the initial feature keywords.
- combining the expanded keywords with the initial feature keywords can make the target feature keywords more comprehensive and accurate.
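The expansion and merge steps (S133 and the merge that follows it) can be sketched as below. The `SYNONYMS` table is a hypothetical stand-in for whatever semantic dictionary is actually used, and `expand_keywords` is a name introduced only for this illustration.

```python
# Hypothetical synonym table standing in for a real semantic dictionary.
SYNONYMS = {
    "good": ["fine", "excellent"],
    "country": ["nation"],
}


def expand_keywords(initial_keywords, synonyms=SYNONYMS):
    """Merge the initial keywords with their synonyms, dropping duplicates.

    The result keeps first-seen order: each initial keyword is followed
    by the intermediate (synonym) keywords derived from it.
    """
    target = []
    seen = set()
    for word in initial_keywords:
        for candidate in [word, *synonyms.get(word, [])]:
            if candidate not in seen:
                seen.add(candidate)
                target.append(candidate)
    return target
```

For example, `expand_keywords(["good", "country", "environment"])` returns `["good", "fine", "excellent", "country", "nation", "environment"]`.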
- S140 Perform weight calculation on the target feature keyword to obtain a weight value.
- the weight value refers to the proportion and position of words appearing in the text content.
- the aforementioned step S140 may include steps S141 to S142.
- S141 Calculate weights according to the frequency and position of the target feature keywords in the text content of the webpage to obtain several weights.
- the simplest calculation method for the weight corresponding to the target feature keyword mainly adopts two indexes of importance and discrimination.
- N represents the total number of documents in the document collection
- df represents the document frequency
- b is the adjustment factor, and the default value is 0.85.
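As an illustration only, one common way to combine an importance term (frequency and position) with a discrimination term (inverse document frequency, log(N/df)) is sketched below. This is an assumption, not the weight formula of this application: the block boosts in `BLOCK_BOOST` are hypothetical values, and the adjustment factor b is not used in this sketch.

```python
import math

# Hypothetical position boosts for the title, meta-information and body blocks.
BLOCK_BOOST = {"title": 3.0, "meta": 2.0, "body": 1.0}


def keyword_weight(occurrences, doc_count, doc_freq, block_boost=BLOCK_BOOST):
    """Weight one keyword as importance * discrimination.

    occurrences: dict mapping block name -> number of times the keyword
                 appears in that block of the document.
    doc_count:   N, the total number of documents in the collection.
    doc_freq:    df, the number of documents containing the keyword.
    """
    # Importance: occurrences counted with a position-dependent boost.
    importance = sum(block_boost[block] * count
                     for block, count in occurrences.items())
    # Discrimination: inverse document frequency.
    discrimination = math.log(doc_count / doc_freq)
    return importance * discrimination
```

A keyword that appears in every document (df equal to N) gets weight 0 under this scheme, since it does not help tell documents apart.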
- the final weight calculation formula of the target feature keyword is as follows:
- the characteristic signature refers to the characteristic hash value generated by the semantic relationship between the target characteristic keywords.
- a semantic feature signature algorithm is used to sign the target feature keyword.
- the above-mentioned step S150 may include steps S151 to S152.
- the feature vector refers to the hash value produced by the target feature keyword.
- the hash value is a b-dimensional vector, where b is set manually;
- the text content of the web page is composed of sentences, and the sentences are composed of words.
- The theme of a document is jointly determined by its words and by their position, context, and environment. For example, consider the following two sentences: "strive to build our country into a country with an outstanding environment"; "strive to build our country into a country with a good environment". The main thrust of the two sentences is the same, so if they appear in different documents they can be considered duplicate content.
- The features of the two sentences are not exactly the same, however. The document fingerprints generated by the original locality-sensitive hashing algorithm will therefore differ, and the documents will be treated as different, resulting in misjudgment.
- Output a set of semantic feature hash values of web documents
- sim is the similarity function for judging word features. It is calculated from the semantic relationships between concepts, which are organized as a hierarchical semantic dictionary; the path distance between words in the hierarchy measures their semantic similarity.
- When sim between two word features is less than the set threshold, their feature hash values are set to the same value. The original locality-sensitive hashing algorithm produces different hash values for different features, ignoring the relationships between words.
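The idea that semantically close words share one hash value can be sketched as follows. For brevity this sketch replaces the threshold test on sim with a precomputed lookup, `CONCEPT_OF`, that maps each word to a representative concept; that table, the function name `feature_hash`, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical concept groups: words mapped to the same concept are
# treated as semantically equivalent (standing in for sim < threshold).
CONCEPT_OF = {
    "outstanding": "good",
    "excellent": "good",
    "good": "good",
}


def feature_hash(word, bits=32, concept_of=CONCEPT_OF):
    """Hash a keyword to a `bits`-bit integer.

    Words in the same concept group hash to the same value; words that
    are not in the table are hashed as themselves.
    """
    concept = concept_of.get(word, word)
    digest = hashlib.sha256(concept.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)
```

With this table, "outstanding" and "good" from the two example sentences produce the same feature hash, so the two sentences no longer differ in that feature.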
- the formed characteristic signature can be compressed to obtain the fingerprint of the web page text content. Achieve compressed representation, saving storage space and computing time.
- the webpage text content fingerprint refers to a vector formed by setting zero and one according to the feature vector.
- the above-mentioned step S160 may include steps S161 to S162.
- S161 Perform weight value calculation on each dimension vector of the feature vector in the feature signature to obtain a target vector
- the text content of a web page that is, a document
- the text content of a web page is composed of a series of strings, and direct manipulation of strings requires a lot of storage space and computing time. Therefore, the original text is analyzed and processed, the target feature keywords that can represent the original document are extracted, and the web page text content fingerprint is generated through the hash function.
- Duplicate or near-duplicate documents are found by comparing the webpage text content fingerprints that represent each document. When the number of identical fingerprints shared by two documents, or the ratio of identical fingerprints to the total number of fingerprints, reaches a certain threshold, the documents are considered duplicates; otherwise they are not.
- Each dimension is calculated separately: if the corresponding bit of a feature's hash value is 1, the weight of that target feature keyword is added; otherwise the weight is subtracted.
- If the i-th dimension of the vector V is positive, the i-th bit of the b-bit fingerprint is set to 1; otherwise it is set to 0. The resulting vector of 0s and 1s is the fingerprint of the webpage text content, as shown in Figure 9.
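The add-or-subtract accumulation and the final thresholding at zero can be sketched as below. The function name `fingerprint` and the representation of the fingerprint as a Python integer (bit i of the integer is the i-th position) are choices made for this illustration.

```python
def fingerprint(weighted_hashes, bits=32):
    """Build a `bits`-bit fingerprint from (feature_hash, weight) pairs.

    For every bit position, a feature adds its weight if its hash has a 1
    there and subtracts it otherwise; positive totals become 1 bits.
    """
    totals = [0.0] * bits                # the vector V
    for value, weight in weighted_hashes:
        for i in range(bits):
            if (value >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    result = 0
    for i, total in enumerate(totals):
        if total > 0:                    # non-positive totals stay 0
            result |= 1 << i
    return result
```

For example, with 4-bit hashes, `fingerprint([(0b1010, 2.0), (0b0110, 1.0)], bits=4)` gives per-bit totals of -3, 3, -1 and 1 (from bit 0 to bit 3), so the fingerprint is `0b1010`.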
- S170 Perform inverted index storage on the fingerprint of the text content of the webpage.
- Elasticsearch is a Lucene-based search server that provides a distributed, multi-user full-text search engine with a RESTful web interface.
- The mapping from web document ID to web page fingerprint is converted into a mapping from web page fingerprint to web document ID and stored, as shown in Figure 11. There, web page fingerprint 1 refers to the IDs of web document 1 and web document 2, and web page fingerprint 2 refers to the list of web document IDs with this target feature keyword; a web document is the text content of a web page.
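The inversion of the mapping can be sketched with a plain in-memory dictionary. This only illustrates the data structure; it does not use Elasticsearch, and `build_inverted_index` is a name introduced here.

```python
from collections import defaultdict


def build_inverted_index(doc_fingerprints):
    """Turn {doc_id: fingerprint} into {fingerprint: [doc_id, ...]}."""
    index = defaultdict(list)
    for doc_id, fp in doc_fingerprints.items():
        index[fp].append(doc_id)
    return dict(index)
```

For example, `build_inverted_index({"doc1": 0xA, "doc2": 0xA, "doc3": 0xB})` returns `{0xA: ["doc1", "doc2"], 0xB: ["doc3"]}`, so documents sharing a fingerprint are found with a single lookup.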
- S180 Calculate the similarity according to the fingerprint of the text content of the webpage to obtain the similarity of the text content of the webpage.
- the aforementioned similarity refers to the similarity of the text content of two web pages.
- the above-mentioned step S180 may include steps S181 to S183.
- the similarity of the fingerprints of the text content of the two web pages is calculated.
- The Hamming distance is used to measure similarity. A web page fingerprint inverted table is built, and the final result is obtained by querying the document numbers that appear in it.
- Each 32-bit binary signature is divided into 2 blocks of 16 bits, and all signatures within a Hamming distance of 1 are found.
- If two webpage text content fingerprints are within a Hamming distance of 1, at least one of their two blocks must be exactly the same. The similarity calculation can therefore be converted into Boolean retrieval over the inverted index, which greatly reduces the time needed to compute document similarity.
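The block-based lookup can be sketched as follows: each fingerprint is indexed under both of its 16-bit blocks, a query retrieves every document that shares at least one block, and a full Hamming check confirms the candidates. The in-memory index and all function names are assumptions for illustration; the application performs the retrieval in Elasticsearch.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")


def blocks(fp):
    """Split a 32-bit fingerprint into (block position, 16-bit value) keys."""
    return [(0, fp >> 16), (1, fp & 0xFFFF)]


def build_block_index(doc_fingerprints):
    """Index every document under each of its two blocks."""
    index = defaultdict(set)
    for doc_id, fp in doc_fingerprints.items():
        for key in blocks(fp):
            index[key].add(doc_id)
    return index


def near_duplicates(fp, index, doc_fingerprints, max_distance=1):
    """Candidates share at least one block; confirm with a Hamming check."""
    candidates = set()
    for key in blocks(fp):
        candidates |= index.get(key, set())
    return sorted(d for d in candidates
                  if hamming(fp, doc_fingerprints[d]) <= max_distance)
```

Because a single differing bit can fall in only one of the two blocks, no fingerprint within distance 1 is missed by the candidate lookup; the final check only removes candidates that share a block but differ in more bits elsewhere.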
- the similarity of the text content of the webpage is output to the terminal for display.
- the Precision and Recall indicators only represent unilateral performance indicators and ignore the overall performance.
- the deduplication effect evaluation value F1 combines the two, which is defined as:
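Assuming F1 here is the standard measure, that is, the harmonic mean of precision P and recall R, F1 = 2PR / (P + R), it can be computed as in this short sketch. The function name `f1_score` is introduced only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, F1 stays low when either precision or recall is low, which is why it reflects overall performance better than either indicator alone.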
- the operating effect has a greater improvement in accuracy and recall than the local sensitive hash algorithm.
- the deduplication effect evaluation value is also output.
- The above method for fast deduplication of text content extracts, based on the semantic relationships between words, target feature keywords and weights that represent the webpage text content, and generates webpage text content fingerprints from them, achieving a compressed representation that saves storage space and computing time. It stores the fingerprints in Elasticsearch's inverted index data structure and converts similarity calculation into Elasticsearch Boolean model retrieval, which meets the real-time deduplication performance requirements of massive, large-scale data and improves accuracy and deduplication performance.
- FIG. 12 is a schematic block diagram of an apparatus 300 for fast deduplication of text content according to an application embodiment. As shown in Fig. 12, corresponding to the above method for fast deduplication of text content, the present application also provides an apparatus 300 for fast deduplication of text content.
- the text content rapid deduplication device 300 includes a unit for executing the above-mentioned text content rapid deduplication method, and the device may be configured in a server.
- the text content quick deduplication device 300 includes:
- the grabbing unit 301 is used to grab the text content of several webpages that need to be deduplicated;
- the preprocessing unit 302 is configured to preprocess a number of the webpage text content to obtain the text content to be deduplicated;
- the extraction unit 303 is configured to extract feature keywords from the text content to be deduplicated to obtain target feature keywords;
- the weight calculation unit 304 is configured to perform weight calculation on the target feature keyword to obtain a weight value
- the signature unit 305 is used to sign the target characteristic keyword to obtain a characteristic signature
- the fingerprint forming unit 306 is configured to form a fingerprint of the text content of the webpage according to the characteristic signature
- the storage unit 307 is configured to perform inverted index storage on the fingerprint of the text content of the webpage
- the similarity calculation unit 308 is configured to calculate the similarity according to the fingerprint of the webpage text content, so as to obtain the similarity of the webpage text content;
- the output unit 309 is used to output the similarity of the text content of the webpage.
- the grabbing unit 301 includes:
- Address allocation subunit used to allocate URL addresses
- the crawling subunit is used to crawl the URL according to the URL address to obtain the URL to be crawled;
- the crawling judgment subunit is used to judge whether the URL to be crawled has already been crawled; if so, the flow returns to crawling URLs according to the URL address to obtain a URL to be crawled;
- the content crawling subunit is used to crawl the webpage text content in the URL to be crawled if it has not been crawled.
- the preprocessing unit 302 includes:
- the cleaning subunit is used to parse and clean the text content of a plurality of web pages to obtain intermediate text content
- the word segmentation processing subunit is used to segment the intermediate text content to obtain the text content to be deduplicated.
- the extraction unit 303 includes:
- the block subunit is used to block the text content to be deduplicated according to the position to obtain the text block;
- the extraction subunit is used to extract feature keywords from the text block to obtain the initial feature keywords
- the expansion subunit is used for semantic expansion of the initial feature keywords to obtain intermediate feature keywords
- the merging subunit is used to merge the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
- the weight calculation unit 304 includes:
- the weight obtaining subunit is used to calculate the weight value according to the frequency and position of the target feature keyword in the text content of the webpage to obtain several weight values;
- the sorting subunit is used to sort several weights to obtain the weights.
- the signature unit 305 includes:
- the vector acquisition subunit is used to calculate and generate a feature hash value according to the target feature keyword to obtain a feature vector
- the integration subunit is used to integrate the feature vector with the target feature keyword to form a feature signature.
- the fingerprint forming unit 306 includes:
- the target vector forming subunit is used to calculate the weight value of each dimension vector of the feature vector in the feature signature to obtain the target vector;
- the setting subunit is used to set the position corresponding to the positive value vector in the target vector to one, and set the position corresponding to the non-positive value vector in the target vector to zero, so as to obtain the webpage text content fingerprint.
- the similarity calculation unit 308 includes:
- the document number obtaining subunit is used to obtain the document number appearing in the fingerprint inversion table of the webpage;
- intersection calculation subunit is used to perform intersection calculation on the documents to obtain the similarity of the text content of the webpage.
- the foregoing text content rapid deduplication apparatus 300 may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 13.
- FIG. 13 is a schematic block diagram of a computer device according to an embodiment of the present application.
- the computer device 500 is a server.
- the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
- the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
- the computer program 5032 includes program instructions.
- the processor 502 can execute a method for fast deduplication of text content.
- the processor 502 is used to provide calculation and control capabilities to support the operation of the entire computer device 500.
- the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
- the network interface 505 is used for network communication with other devices.
- the structure shown in FIG. 13 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
- the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
- the processor 502 is configured to run a computer program 5032 stored in the memory to implement the following steps:
- the processor 502 specifically implements the following steps when implementing the steps of crawling several webpage text content that need to be deduplicated:
- when implementing the step of preprocessing the text content of a plurality of web pages to obtain the text content to be deduplicated, the processor 502 specifically implements the following steps:
- when implementing the step of extracting feature keywords from the text content to be deduplicated to obtain target feature keywords, the processor 502 specifically implements the following steps:
- the intermediate feature keywords and the initial feature keywords are merged to obtain the target feature keywords.
- when implementing the step of signing the target feature keyword to obtain a feature signature, the processor 502 specifically implements the following steps:
- the processor 502 specifically implements the following steps when implementing the step of forming a fingerprint of the text content of a webpage according to the characteristic signature:
- when implementing the step of calculating the similarity based on the fingerprint of the webpage text content to obtain the similarity of the webpage text content, the processor 502 specifically implements the following steps:
- intersection calculation is performed on the documents to obtain the similarity of the text content of the webpage.
- the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
- the computer program includes program instructions, and the computer program can be stored in a storage medium, which is a computer-readable storage medium.
- the program instructions are executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiments.
- the storage medium may be a computer-readable storage medium.
- the storage medium stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
- when the processor executes the computer program to implement the step of preprocessing the text content of a plurality of the web pages to obtain the text content to be deduplicated, the following steps are specifically implemented:
- the intermediate feature keywords and the initial feature keywords are merged to obtain the target feature keywords.
- when the processor executes the computer program to implement the step of forming a fingerprint of webpage text content according to a feature signature, the following steps are specifically implemented:
- when the processor executes the computer program to implement the step of calculating the similarity according to the fingerprint of the webpage text content to obtain the similarity of the webpage text content, the following steps are specifically implemented:
- intersection calculation is performed on the documents to obtain the similarity of the text content of the webpage.
- the storage medium may be a U disk, a mobile hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk or an optical disk, and other computer-readable storage media that can store program codes.
- the disclosed device and method may be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of each unit is only a logical function division, and there may be other division methods in actual implementation.
- multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
- the steps in the method of the embodiment of the present application can be adjusted, merged, and deleted in order according to actual needs.
- the units in the devices in the embodiments of the present application may be combined, divided, and deleted according to actual needs.
- the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
- the technical solution of this application, in essence, or the part of it that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which can be a personal computer, a terminal, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Description
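Algorithm | Number of webpages | Time (ms) |
---|---|---|
Simhash method | 60000 | 18990.56 |
Semantic similarity method | 60000 | 80.3 |
Simhash method | 276000 | 89535.32 |
Semantic similarity method | 276000 | 116.5 |
Simhash method | 1436000 | 513050.70 |
Semantic similarity method | 1436000 | 120.9 |
Simhash method | 25435017 | 7206347.65 |
Semantic similarity method | 25435017 | 386.5 |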
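Claims (10)
- A method for rapid deduplication of text content, characterized by comprising: crawling a plurality of web page text contents to be deduplicated; preprocessing the plurality of web page text contents to obtain text content to be deduplicated; extracting feature keywords from the text content to be deduplicated to obtain target feature keywords; performing weight calculation on the target feature keywords to obtain weight values; signing the target feature keywords to obtain a feature signature; forming a web page text content fingerprint from the feature signature; storing the web page text content fingerprint in an inverted index; calculating similarity from the web page text content fingerprint to obtain the similarity of the web page text contents; and outputting the similarity of the web page text contents.
- The method for rapid deduplication of text content according to claim 1, characterized in that crawling the plurality of web page text contents to be deduplicated comprises: allocating URL addresses; crawling URLs according to the URL addresses to obtain a URL to be crawled; determining whether the URL to be crawled has already been crawled; if so, returning to the step of crawling URLs according to the URL addresses to obtain a URL to be crawled; if not, crawling the web page text content at the URL to be crawled.
- The method for rapid deduplication of text content according to claim 1, characterized in that preprocessing the plurality of web page text contents to obtain the text content to be deduplicated comprises: parsing and cleaning the plurality of web page text contents to obtain intermediate text content; and performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
- The method for rapid deduplication of text content according to claim 1, characterized in that extracting feature keywords from the text content to be deduplicated to obtain the target feature keywords comprises: dividing the text content to be deduplicated into blocks by position to obtain text blocks; extracting feature keywords from the text blocks to obtain initial feature keywords; semantically expanding the initial feature keywords to obtain intermediate feature keywords; and merging the intermediate feature keywords with the initial feature keywords to obtain the target feature keywords.
- The method for rapid deduplication of text content according to claim 1, characterized in that signing the target feature keywords to obtain the feature signature comprises: computing feature hash values from the target feature keywords to obtain a feature vector; and combining the feature vector with the target feature keywords to form the feature signature.
- The method for rapid deduplication of text content according to claim 1, characterized in that forming the web page text content fingerprint from the feature signature comprises: applying the weight values to each dimension of the feature vector in the feature signature to obtain a target vector; and setting to one the positions corresponding to positive values in the target vector and setting to zero the positions corresponding to non-positive values, to obtain the web page text content fingerprint.
- The method for rapid deduplication of text content according to claim 1, characterized in that calculating similarity from the web page text content fingerprint to obtain the similarity of the web page text contents comprises: building a web page fingerprint inverted table from the web page text contents; obtaining the document numbers that appear in the web page fingerprint inverted table; and computing the intersection of the document numbers to obtain the similarity of the web page text contents.
- A device for rapid deduplication of text content, characterized by comprising: a crawling unit for crawling a plurality of web page text contents to be deduplicated; a preprocessing unit for preprocessing the plurality of web page text contents to obtain text content to be deduplicated; an extraction unit for extracting feature keywords from the text content to be deduplicated to obtain target feature keywords; a weight calculation unit for performing weight calculation on the target feature keywords to obtain weight values; a signature unit for signing the target feature keywords to obtain a feature signature; a fingerprint forming unit for forming a web page text content fingerprint from the feature signature; a storage unit for storing the web page text content fingerprint in an inverted index; a similarity calculation unit for calculating similarity from the web page text content fingerprint to obtain the similarity of the web page text contents; and an output unit for outputting the similarity of the web page text contents.
- A computer device, characterized in that the computer device comprises a memory and a processor, the memory stores a computer program, and the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
- A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.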
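The claims on signing, fingerprint formation, and inverted-index lookup can be illustrated with a short sketch. It is one possible reading, not the applicant's implementation: the 64-bit width, the use of MD5 as the feature hash, the split of each fingerprint into four 16-bit segments as inverted-index keys, and every name (`feature_hash`, `fingerprint`, `FingerprintIndex`) are assumptions made here for illustration, since the claims do not fix any of them. Keyword weights are taken as given.

```python
import hashlib
from collections import defaultdict

BITS = 64                    # assumed fingerprint width
SEGMENTS = 4                 # assumed number of index keys per fingerprint
SEG_BITS = BITS // SEGMENTS


def feature_hash(keyword: str) -> int:
    """Map a keyword to a BITS-bit integer (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[: BITS // 8], "big")


def fingerprint(weighted_keywords: dict) -> int:
    """Weight each hash bit by +w or -w, sum per dimension, keep the sign."""
    totals = [0.0] * BITS
    for keyword, weight in weighted_keywords.items():
        h = feature_hash(keyword)
        for i in range(BITS):
            totals[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i, total in enumerate(totals):
        if total > 0:        # positive -> 1, non-positive -> 0
            fp |= 1 << i
    return fp


def segments(fp: int) -> list:
    """Split a fingerprint into (segment position, segment value) index keys."""
    mask = (1 << SEG_BITS) - 1
    return [(i, (fp >> (i * SEG_BITS)) & mask) for i in range(SEGMENTS)]


class FingerprintIndex:
    """Inverted table: fingerprint segment -> set of document numbers."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id: int, fp: int) -> None:
        for key in segments(fp):
            self.postings[key].add(doc_id)

    def matches(self, fp: int) -> set:
        """Intersect the posting lists of all segments of the query fingerprint."""
        lists = [self.postings.get(key, set()) for key in segments(fp)]
        return set.intersection(*lists)


if __name__ == "__main__":
    index = FingerprintIndex()
    index.add(1, fingerprint({"dedup": 2.0, "web": 1.0}))
    index.add(2, fingerprint({"video": 1.0}))
    print(index.matches(fingerprint({"web": 1.0, "dedup": 2.0})))
```

Intersecting all four posting lists returns only documents whose fingerprints equal the query fingerprint. A near-duplicate variant could instead accept documents that share some smaller number of segments, at the cost of more candidates to verify.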
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910344414.9 | 2019-04-26 | ||
CN201910344414.9A CN110309446A (zh) | 2019-04-26 | 2019-04-26 | Method, device, computer equipment, and storage medium for rapid deduplication of text content |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020215667A1 (zh) | 2020-10-29 |
Family
ID=68075778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/116606 WO2020215667A1 (zh) | 2019-04-26 | 2019-11-08 | Method, device, computer equipment, and storage medium for rapid deduplication of text content |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110309446A (zh) |
WO (1) | WO2020215667A1 (zh) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
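CN110309446A (zh) * | 2019-04-26 | 2019-10-08 | 深圳市赛为智能股份有限公司 | Method, device, computer equipment, and storage medium for rapid deduplication of text content |
CN110956037B (zh) * | 2019-10-16 | 2022-07-08 | 厦门美柚股份有限公司 | Method and device for determining duplicate multimedia content |
CN110955751A (zh) * | 2019-11-13 | 2020-04-03 | 广州供电局有限公司 | Work ticket text deduplication method, device, system, and computer storage medium |
CN110909019B (zh) * | 2019-11-14 | 2022-04-08 | 湖南赛吉智慧城市建设管理有限公司 | Big data duplicate-checking method, device, computer equipment, and storage medium |
CN111027282A (zh) * | 2019-11-21 | 2020-04-17 | 精硕科技(北京)股份有限公司 | Text deduplication method and device, electronic equipment, and computer-readable storage medium |
CN111061934B (zh) * | 2019-11-27 | 2023-04-07 | 西安四叶草信息技术有限公司 | Fingerprint identification method, equipment, and storage medium |
CN113051907B (zh) * | 2019-12-26 | 2023-05-12 | 深圳市北科瑞声科技股份有限公司 | Method, system, and device for duplicate checking of news content |
CN111428180B (zh) * | 2020-03-20 | 2022-02-08 | 创优数字科技(广东)有限公司 | Web page deduplication method, device, and equipment |
CN111507260B (zh) * | 2020-04-17 | 2022-08-05 | 重庆邮电大学 | Rapid video similarity detection method and detection device |
CN111913912A (zh) * | 2020-07-16 | 2020-11-10 | 北京字节跳动网络技术有限公司 | File processing method, file matching method, device, electronic equipment, and medium |
CN114510919A (zh) * | 2020-11-16 | 2022-05-17 | 中国移动通信有限公司研究院 | Method, device, and equipment for monitoring web page text information |
WO2022141860A1 (zh) * | 2020-12-31 | 2022-07-07 | 平安科技(深圳)有限公司 | Text deduplication method, device, electronic equipment, and computer-readable storage medium |
CN114741468B (zh) * | 2022-03-22 | 2024-03-29 | 平安科技(深圳)有限公司 | Text deduplication method, device, equipment, and storage medium |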
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
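CN101645082A (zh) * | 2009-04-17 | 2010-02-10 | 华中科技大学 | Similar web page deduplication system based on a parallel programming model |
CN107025218A (zh) * | 2017-04-07 | 2017-08-08 | 腾讯科技(深圳)有限公司 | Text deduplication method and device |
CN108563636A (zh) * | 2018-04-04 | 2018-09-21 | 广州杰赛科技股份有限公司 | Method, device, equipment, and storage medium for extracting text keywords |
CN110309446A (zh) * | 2019-04-26 | 2019-10-08 | 深圳市赛为智能股份有限公司 | Method, device, computer equipment, and storage medium for rapid deduplication of text content |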
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
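CN102831198A (zh) * | 2012-08-07 | 2012-12-19 | 人民搜索网络股份公司 | Similar document identification device and method based on document signature technology |
CN106376002B (zh) * | 2015-07-20 | 2021-10-12 | 中兴通讯股份有限公司 | Management method and device, and spam SMS monitoring system |
CN108595517B (zh) * | 2018-03-26 | 2021-03-09 | 南京邮电大学 | Large-scale document similarity detection method |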
2019
- 2019-04-26 CN CN201910344414.9A patent/CN110309446A/zh active Pending
- 2019-11-08 WO PCT/CN2019/116606 patent/WO2020215667A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110309446A (zh) | 2019-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
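WO2020215667A1 (zh) | Method, device, computer equipment, and storage medium for rapid deduplication of text content | |
CN109241274B (zh) | Text clustering method and device | |
WO2019091026A1 (zh) | Method for rapid retrieval of knowledge base documents, application server, and computer-readable storage medium | |
US7962491B1 (en) | Document near-duplicate detection | |
US10346257B2 (en) | Method and device for deduplicating web page | |
CN101872351B (zh) | Method and device for identifying synonyms, and method and device for searching using the same | |
WO2015180432A1 (zh) | Cluster storage method and device | |
CN102918532B (zh) | Detection of spam in search result ranking | |
CN108763348B (zh) | Improved classification method for expanding word feature vectors of short texts | |
CN102043851A (zh) | Multi-document automatic summarization method based on frequent itemsets | |
CN108804642A (zh) | Retrieval method, device, computer equipment, and storage medium | |
CN110321466B (zh) | Securities news duplicate-checking method and system based on semantic analysis | |
JP2005302043A (ja) | Reinforced clustering of multi-type data objects for search term suggestion | |
CN108647322B (zh) | Method for identifying the similarity of large volumes of Web text information based on a word net | |
WO2017113592A1 (zh) | Model generation method, word weighting method, device, equipment, and computer storage medium | |
WO2020228182A1 (zh) | Big-data-based data deduplication method, device, equipment, and storage medium | |
CN112395875A (zh) | Keyword extraction method, device, terminal, and storage medium | |
CN109815401A (zh) | Person name disambiguation method for Web people search | |
CN110705285B (zh) | Method, device, server, and readable storage medium for constructing a government affairs text subject thesaurus | |
Yuan et al. | A mathematical information retrieval system based on RankBoost | |
Bama et al. | A mathematical approach for mining web content outliers using term frequency ranking | |
CN113609247A (zh) | Big data text deduplication technique based on an improved Simhash algorithm | |
CN114547233A (zh) | Data duplicate-checking method, device, and electronic equipment | |
Duan et al. | Error correction for search engine by mining bad case | |
CN115905577B (zh) | Knowledge graph construction method and device, and regulation retrieval method and device |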
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19926181; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 19926181; Country of ref document: EP; Kind code of ref document: A1 |
 | 32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 180322) |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 19926181; Country of ref document: EP; Kind code of ref document: A1 |