WO2020215667A1 - Method, apparatus, computer device and storage medium for fast deduplication of text content - Google Patents

Method, apparatus, computer device and storage medium for fast deduplication of text content

Info

Publication number
WO2020215667A1
Authority
WO
WIPO (PCT)
Prior art keywords
text content
webpage
feature
similarity
keywords
Prior art date
Application number
PCT/CN2019/116606
Other languages
English (en)
French (fr)
Inventor
耿伟
王英明
周起如
谷国栋
Original Assignee
深圳市赛为智能股份有限公司
安徽工业大学工商学院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市赛为智能股份有限公司, 安徽工业大学工商学院
Publication of WO2020215667A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Definitions

  • This application relates to methods for deduplication of text content, and more specifically to methods, devices, computer equipment, and storage media for rapid deduplication of text content.
  • Existing large-scale, massive-data deduplication methods mainly use the locality-sensitive hashing algorithm, a deduplication technique based on text content that reduces dimensionality to generate hash signatures and then judges the similarity of the text content from the similarity of the signatures. Due to the complexity of the Chinese language, the existing methods cannot represent text content very accurately.
  • the existing text feature extraction is based on the assumption that the features are independent of each other.
  • in a real environment, however, feature keywords have semantic relationships that cannot simply be ignored; the similarity calculation performance is low and cannot be extended to large-scale, massive-data environments; and the overall accuracy is low because the semantic context between feature keywords is ignored.
  • the purpose of this application is to overcome the defects of the prior art and provide a method, device, computer equipment and storage medium for fast deduplication of text content.
  • a method for fast deduplication of text content including:
  • the crawling of several webpage text contents that need to be deduplicated includes:
  • the preprocessing of the text content of a plurality of web pages to obtain the text content to be deduplicated includes:
  • the further technical solution is: extracting characteristic keywords from the text content to be de-duplicated to obtain target characteristic keywords, including:
  • the intermediate feature keywords and the initial feature keywords are merged to obtain the target feature keywords.
  • said signing the target characteristic keyword to obtain a characteristic signature including:
  • said forming a fingerprint of webpage text content according to a characteristic signature includes:
  • the calculation of the similarity according to the fingerprint of the text content of the webpage to obtain the similarity of the text content of the webpage includes:
  • intersection calculation is performed on the documents to obtain the similarity of the text content of the webpage.
  • This application also provides a fast deduplication device for text content, including:
  • the crawling unit is used to crawl the text content of several web pages that need to be deduplicated;
  • a preprocessing unit configured to preprocess a number of the webpage text content to obtain the text content to be deduplicated
  • the extraction unit is used to extract feature keywords from the text content to be de-duplicated to obtain target feature keywords
  • a weight calculation unit configured to perform weight calculation on the target feature keyword to obtain a weight value
  • the signature unit is used to sign the target characteristic keyword to obtain a characteristic signature
  • the fingerprint forming unit is used to form the fingerprint of the text content of the webpage according to the characteristic signature
  • the storage unit is used for inverted index storage of webpage text content fingerprints
  • the similarity calculation unit is configured to calculate the similarity according to the fingerprint of the webpage text content to obtain the similarity of the webpage text content;
  • the output unit is used to output the similarity of the text content of the webpage.
  • the present application also provides a computer device that includes a memory and a processor, the memory stores a computer program, and the processor implements the above-mentioned method when the computer program is executed.
  • the present application also provides a storage medium that stores a computer program, and the computer program can implement the above-mentioned method when executed by a processor.
  • this application extracts target feature keywords and weights that can represent webpage text content based on word semantic relations, and generates webpage text content fingerprints based on the target feature keywords and weights.
  • the compressed representation saves storage space and computing time.
  • based on the Elasticsearch inverted-index data structure, it stores the fingerprints of the webpage text content and converts the similarity calculation into Elasticsearch Boolean-model retrieval, which effectively meets the real-time deduplication performance requirements of massive, large-scale data and improves accuracy and deduplication performance.
  • FIG. 1 is a schematic diagram of an application scenario of a method for fast deduplication of text content provided by an embodiment of the application;
  • FIG. 2 is a schematic flowchart of a method for fast deduplication of text content provided by an embodiment of the application
  • Fig. 3 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application;
  • FIG. 5 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application
  • FIG. 6 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application
  • FIG. 7 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application.
  • FIG. 8 is a schematic diagram of a sub-flow of a method for fast deduplication of text content provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of the formation of a fingerprint of webpage text content provided by an embodiment of the application.
  • FIG. 10 is a schematic diagram of the formation of target feature keywords provided by an embodiment of the application.
  • FIG. 11 is a schematic structural diagram of an inverted index provided by an embodiment of the application.
  • FIG. 12 is a schematic block diagram of an apparatus for fast deduplication of text content provided by an embodiment of the application.
  • FIG. 13 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic diagram of an application scenario of a method for fast deduplication of text content provided by an embodiment of the application.
  • FIG. 2 is a schematic flowchart of a method for fast deduplication of text content provided by an embodiment of the application.
  • the method for fast deduplication of text content is applied in a server; the server interacts with a terminal, obtains from the terminal several webpage text contents that need to be deduplicated, quickly deduplicates the text content of these webpages, and outputs the result to the terminal for display.
  • FIG. 2 is a schematic flowchart of a method for fast deduplication of text content provided by an embodiment of the present application. As shown in Figure 2, the method includes the following steps S110 to S190.
  • the text content of the webpage refers to the text with information displayed in the webpage.
  • the above-mentioned step S110 may include steps S111 to S114.
  • the distributed task scheduler assigns URL (Uniform Resource Locator) addresses to the crawler application nodes. If a URL to be crawled has already been crawled, it is discarded; otherwise, the crawler application node crawls the webpage text content.
  • the text content to be deduplicated refers to the text content that has been cleaned, filtered, and subjected to word segmentation.
  • the above-mentioned step S120 may include steps S121 to S122.
  • S122 Perform word segmentation processing on the intermediate text content to obtain the text content to be deduplicated.
  • the intermediate text content refers to the content remaining after removing unnecessary data.
  • the target feature keywords refer to words that represent the characteristics and essence of the webpage text content. Refer to Figure 10 for the formation process of the target feature keywords. There are many ways to select features of the text content to be deduplicated, such as shingles and n-grams; because sequences of multiple words or characters do not have separate, clear semantics, which weakens the representation of the document content, single keywords are used here as the features of the text content to be deduplicated.
  • the above-mentioned step S130 may include steps S131 to S134.
  • the text content to be deduplicated is divided into blocks according to positions to obtain text blocks.
  • the text block refers to the content formed by text at different positions.
  • the text content to be deduplicated is divided into blocks by location, which are mainly divided into meta-information block, web page body block and title block.
  • the initial characteristic keywords refer to the characteristic words representing the content of the text block directly extracted from the content of the text block.
  • the feature keywords of each text block are extracted according to the semantic relationship.
  • S133 Perform semantic expansion on the initial feature keywords to obtain intermediate feature keywords.
  • the intermediate feature keywords refer to words that are synonymous with the initial feature keywords.
  • combining the expanded keywords with the initial feature keywords can make the target feature keywords more comprehensive and accurate.
  • S140 Perform weight calculation on the target feature keyword to obtain a weight value.
  • the weight value refers to the proportion and position of words appearing in the text content.
  • the aforementioned step S140 may include steps S141 to S142.
  • S141 Calculate weights according to the frequency and position of the target feature keywords in the text content of the webpage to obtain several weights.
  • the simplest calculation method for the weight corresponding to the target feature keyword mainly adopts two indexes of importance and discrimination.
  • N represents the total number of documents in the document collection
  • df represents the document frequency
  • b is the adjustment factor, and the default value is 0.85.
  • the final weight of a target feature keyword is computed by combining the importance, the discrimination and the document-length normalization factor.
  • the characteristic signature refers to the characteristic hash value generated by the semantic relationship between the target characteristic keywords.
  • a semantic feature signature algorithm is used to sign the target feature keyword.
  • the above-mentioned step S150 may include steps S151 to S152.
  • the feature vector refers to the hash value produced by the target feature keyword.
  • the hash value is a b-dimensional vector, where b is set manually;
  • the text content of the web page is composed of sentences, and the sentences are composed of words.
  • the topic a document expresses is determined jointly by the words and the context in which they appear; for example, consider the following two sentences: "strive to build our country into a country with an outstanding environment" and "strive to build our country into a country with a good environment". The two sentences convey the same message; if they appear in different documents, the documents can be considered duplicate content.
  • the features of the two sentences, however, are obviously not exactly the same; therefore, the document fingerprints generated by the original locality-sensitive hashing algorithm will differ, the documents are regarded as different documents, and a misjudgment results.
  • the algorithm outputs the set of semantic feature hash values of the web documents.
  • sim is the similarity function for judging word features; it is computed using the semantic relationships between concepts, organized as a hierarchical semantic dictionary, and the hierarchical path distance of words in the semantic dictionary is used to measure semantic similarity.
  • when the similarity between word features is less than the set threshold, the hash values of the features are set to the same value, whereas the original locality-sensitive hashing algorithm generates different hash values for different features and ignores the relationships between words.
  • the formed feature signature can be compressed to obtain the fingerprint of the webpage text content, achieving a compressed representation and saving storage space and computing time.
  • the webpage text content fingerprint refers to a vector formed by setting zero and one according to the feature vector.
  • the above-mentioned step S160 may include steps S161 to S162.
  • S161 Perform weight value calculation on each dimension vector of the feature vector in the feature signature to obtain a target vector
  • the text content of a webpage, that is, a document,
  • the text content of a web page is composed of a series of strings, and direct manipulation of strings requires a lot of storage space and computing time. Therefore, the original text is analyzed and processed, the target feature keywords that can represent the original document are extracted, and the web page text content fingerprint is generated through the hash function.
  • the fingerprints of the webpage text content representing the webpage text content find out duplicate or nearly duplicate documents. When two documents have the same number of fingerprints or the ratio of the same fingerprint to the total number of fingerprints reaches a certain threshold, they are considered to be duplicates, otherwise they are not considered to be duplicates.
  • each dimension vector is calculated separately, that is, if the hash value of the corresponding bit of the feature is 1, then the weight corresponding to the target key feature is added, otherwise the weight is subtracted.
  • if the i-th dimension of the vector V is positive, the i-th position of the b-bit fingerprint is set to 1; otherwise it is set to 0. This yields a vector whose values are 0s and 1s, namely the fingerprint of the webpage text content, as shown in Figure 9.
  • S170 Perform inverted index storage on the fingerprint of the text content of the webpage.
  • ElasticSearch is a Lucene-based search server that provides a distributed, multi-user full-text search engine with a RESTful web interface.
  • the mapping from webpage document ID to webpage fingerprint is converted into a mapping from webpage fingerprint to webpage document ID and stored, as shown in Figure 11, where webpage fingerprint 1 points to the ID of web document 1 and the ID of web document 2, and webpage fingerprint 2 points to the list of web document IDs containing this target feature keyword; a web document here refers to the webpage text content.
  • S180 Calculate the similarity according to the fingerprint of the text content of the webpage to obtain the similarity of the text content of the webpage.
  • the aforementioned similarity refers to the similarity of the text content of two web pages.
  • the above-mentioned step S180 may include steps S181 to S183.
  • the similarity of the fingerprints of the text content of the two web pages is calculated.
  • the Hamming distance is used to find the similarity, and the web page fingerprint inverted table is established, and the final result is obtained by querying the document number that appears in the web fingerprint inverted table.
  • the 32-bit binary signatures are divided into 2 blocks of 16 bits each, and all signatures whose Hamming distance is within 1 are computed.
  • by the pigeonhole principle, if the Hamming distance between two webpage text content fingerprints is within 1, at least one of their blocks must be exactly identical. In this way, the similarity calculation can be converted into a Boolean retrieval model through the inverted index, which greatly reduces the document similarity calculation time.
  • the similarity of the text content of the webpage is output to the terminal for display.
  • the Precision and Recall indicators each represent only one aspect of performance and ignore the overall performance.
  • the deduplication effect evaluation value F1 combines the two and is defined as F1 = 2 × Precision × Recall / (Precision + Recall).
  • compared with the locality-sensitive hashing algorithm, the method achieves a substantial improvement in both precision and recall.
  • the deduplication effect evaluation value is also output.
  • the above method for fast deduplication of text content extracts, based on the semantic relationships between words, target feature keywords and weights that can represent webpage text content, and generates webpage text content fingerprints from the target feature keywords and weights to achieve a compressed representation and save storage space and computing time; it stores the webpage text content fingerprints in an Elasticsearch inverted-index data structure and converts the similarity calculation into Elasticsearch Boolean-model retrieval, which effectively meets the real-time deduplication performance requirements of massive, large-scale data and improves accuracy and deduplication performance.
  • FIG. 12 is a schematic block diagram of an apparatus 300 for fast deduplication of text content according to an application embodiment. As shown in Fig. 12, corresponding to the above method for fast deduplication of text content, the present application also provides an apparatus 300 for fast deduplication of text content.
  • the text content rapid deduplication device 300 includes a unit for executing the above-mentioned text content rapid deduplication method, and the device may be configured in a server.
  • the text content quick deduplication device 300 includes:
  • the crawling unit 301 is used to crawl the text content of several webpages that need to be deduplicated;
  • the preprocessing unit 302 is configured to preprocess a number of the webpage text content to obtain the text content to be deduplicated;
  • the extraction unit 303 is configured to extract feature keywords from the text content to be deduplicated to obtain target feature keywords;
  • the weight calculation unit 304 is configured to perform weight calculation on the target feature keyword to obtain a weight value
  • the signature unit 305 is used to sign the target characteristic keyword to obtain a characteristic signature
  • the fingerprint forming unit 306 is configured to form a fingerprint of the text content of the webpage according to the characteristic signature
  • the storage unit 307 is configured to perform inverted index storage on the fingerprint of the text content of the webpage
  • the similarity calculation unit 308 is configured to calculate the similarity according to the fingerprint of the webpage text content, so as to obtain the similarity of the webpage text content;
  • the output unit 309 is used to output the similarity of the text content of the webpage.
  • the crawling unit 301 includes:
  • the address allocation subunit is used to allocate URL addresses;
  • the crawling subunit is used to crawl the URL according to the URL address to obtain the URL to be crawled;
  • the crawling judgment subunit is used to judge whether the URL to be crawled has been crawled; if so, return to the crawling URL based on the URL address to obtain the URL to be crawled;
  • the content crawling subunit is used to crawl the text content of the webpage in the URL to be crawled if not.
  • the preprocessing unit 302 includes:
  • the cleaning subunit is used to parse and clean the text content of a plurality of web pages to obtain intermediate text content
  • the word segmentation processing subunit is used to segment the intermediate text content to obtain the text content to be deduplicated.
  • the extraction unit 303 includes:
  • the block subunit is used to block the text content to be deduplicated according to the position to obtain the text block;
  • the extraction subunit is used to extract feature keywords from the text block to obtain the initial feature keywords
  • the expansion subunit is used for semantic expansion of the initial feature keywords to obtain intermediate feature keywords
  • the merging subunit is used to merge the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
  • the weight calculation unit 304 includes:
  • the weight obtaining subunit is used to calculate the weight value according to the frequency and position of the target feature keyword in the text content of the webpage to obtain several weight values;
  • the sorting subunit is used to sort several weights to obtain the weights.
  • the signature unit 305 includes:
  • the vector acquisition subunit is used to calculate and generate a feature hash value according to the target feature keyword to obtain a feature vector
  • the integration subunit is used to integrate the feature vector with the target feature keyword to form a feature signature.
  • the fingerprint forming unit 306 includes:
  • the target vector forming subunit is used to calculate the weight value of each dimension vector of the feature vector in the feature signature to obtain the target vector;
  • the setting subunit is used to set the position corresponding to the positive value vector in the target vector to one, and set the position corresponding to the non-positive value vector in the target vector to zero, so as to obtain the webpage text content fingerprint.
  • the similarity calculation unit 308 includes:
  • the document number obtaining subunit is used to obtain the document number appearing in the fingerprint inversion table of the webpage;
  • intersection calculation subunit is used to perform intersection calculation on the documents to obtain the similarity of the text content of the webpage.
  • the foregoing text content rapid deduplication apparatus 300 may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 13.
  • FIG. 13 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 is a server.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • the computer program 5032 includes program instructions.
  • the processor 502 can execute a method for fast deduplication of text content.
  • the processor 502 is used to provide calculation and control capabilities to support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
  • the processor 502 can execute a method for fast deduplication of text content.
  • the network interface 505 is used for network communication with other devices.
  • the structure shown in FIG. 13 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 502 is configured to run a computer program 5032 stored in the memory to implement the following steps:
  • the processor 502 specifically implements the following steps when implementing the steps of crawling several webpage text content that need to be deduplicated:
  • the processor 502 when the processor 502 implements the step of preprocessing the text content of a plurality of web pages to obtain the text content to be deduplicated, the processor 502 specifically implements the following steps:
  • the processor 502 when the processor 502 implements the step of extracting characteristic keywords from the text content to be deduplicated to obtain target characteristic keywords, the processor 502 specifically implements the following steps:
  • the intermediate feature keywords and the initial feature keywords are merged to obtain the target feature keywords.
  • the processor 502 when the processor 502 implements the step of signing the target feature keyword to obtain a feature signature, the processor 502 specifically implements the following steps:
  • the processor 502 specifically implements the following steps when implementing the step of forming a fingerprint of the text content of a webpage according to the characteristic signature:
  • the processor 502 when the processor 502 implements the step of calculating the similarity based on the fingerprint of the webpage text content to obtain the similarity of the webpage text content, the processor 502 specifically implements the following steps:
  • intersection calculation is performed on the documents to obtain the similarity of the text content of the webpage.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the computer program includes program instructions, and the computer program can be stored in a storage medium, which is a computer-readable storage medium.
  • the program instructions are executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiments.
  • the storage medium may be a computer-readable storage medium.
  • the storage medium stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the processor executes the computer program to realize the step of preprocessing the text content of a plurality of the web pages to obtain the text content to be deduplicated, the following steps are specifically implemented:
  • the intermediate feature keywords and the initial feature keywords are merged to obtain the target feature keywords.
  • the processor executes the computer program to implement the step of forming a fingerprint of webpage text content according to a characteristic signature, the following steps are specifically implemented:
  • the processor executes the computer program to realize the step of calculating the similarity according to the fingerprint of the webpage text content to obtain the similarity of the webpage text content, the following steps are specifically implemented:
  • intersection calculation is performed on the documents to obtain the similarity of the text content of the webpage.
  • the storage medium may be a U disk, a mobile hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk or an optical disk, and other computer-readable storage media that can store program codes.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of each unit is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the steps in the method of the embodiment of the present application can be adjusted, merged, and deleted in order according to actual needs.
  • the units in the devices in the embodiments of the present application may be combined, divided, and deleted according to actual needs.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which can be a personal computer, a terminal, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

This application relates to a method, apparatus, computer device and storage medium for fast deduplication of text content. The method includes: crawling several webpage text contents that need to be deduplicated; preprocessing the several webpage text contents to obtain text content to be deduplicated; extracting feature keywords from the text content to be deduplicated to obtain target feature keywords; performing weight calculation on the target feature keywords to obtain weight values; signing the target feature keywords to obtain a feature signature; forming a webpage text content fingerprint according to the feature signature; storing the webpage text content fingerprint in an inverted index; calculating similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content; and outputting the similarity of the webpage text content. This application effectively meets the real-time deduplication performance requirements of massive, large-scale data and improves accuracy and deduplication performance.

Description

Method, apparatus, computer device and storage medium for fast deduplication of text content
This application is based on and claims priority to the Chinese patent application with application number 201910344414.9 and filing date of April 26, 2019, the entire content of which is hereby incorporated into this application by reference.
Technical Field
This application relates to methods for deduplicating text content, and more specifically to a method, apparatus, computer device and storage medium for fast deduplication of text content.
Background
The rapid development of Internet technology has made the cost of copying and spreading information extremely low. The sharing of network information brings people great convenience, but it also introduces a large amount of duplicate information. Many duplicate webpages come, on the one hand, from reposts whose text content and structure are completely identical and, on the other hand, from differences in layout and style between websites, which make the internal formats not fully consistent. A large amount of duplicate webpage content not only increases the browsing burden on users, but also consumes a large amount of resources during information collection, indexing and search.
Existing large-scale, massive-data deduplication methods mainly use locality-sensitive hashing, a deduplication technique based on text content that reduces dimensionality to generate hash signatures and then judges the similarity of the text content from the similarity of the signatures. Because of the complexity of the Chinese language, existing methods cannot represent text content very accurately. Existing text feature extraction assumes that features are independent of each other; in a real environment, feature keywords have semantic relationships that cannot simply be ignored. The performance of the similarity calculation is low and cannot scale to large-scale, massive-data environments, and ignoring the semantic context between feature keywords lowers the overall accuracy.
Therefore, it is necessary to design a new method that improves accuracy and deduplication performance and effectively meets the real-time deduplication performance requirements of massive, large-scale data.
Summary of the Application
The purpose of this application is to overcome the defects of the prior art and to provide a method, apparatus, computer device and storage medium for fast deduplication of text content.
To achieve the above purpose, this application adopts the following technical solution: a method for fast deduplication of text content, including:
crawling several webpage text contents that need to be deduplicated;
preprocessing the several webpage text contents to obtain text content to be deduplicated;
extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
performing weight calculation on the target feature keywords to obtain weight values;
signing the target feature keywords to obtain a feature signature;
forming a webpage text content fingerprint according to the feature signature;
storing the webpage text content fingerprint in an inverted index;
calculating similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content;
outputting the similarity of the webpage text content.
In a further technical solution, the crawling of several webpage text contents that need to be deduplicated includes:
allocating URL addresses;
crawling URLs according to the URL addresses to obtain URLs to be crawled;
judging whether a URL to be crawled has already been crawled;
if so, returning to the crawling of URLs according to the URL addresses to obtain URLs to be crawled;
if not, crawling the webpage text content at the URL to be crawled.
In a further technical solution, the preprocessing of the several webpage text contents to obtain the text content to be deduplicated includes:
parsing and cleaning the several webpage text contents to obtain intermediate text content;
performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In a further technical solution, the extracting of feature keywords from the text content to be deduplicated to obtain target feature keywords includes:
dividing the text content to be deduplicated into blocks by position to obtain text blocks;
extracting feature keywords from the text blocks to obtain initial feature keywords;
performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
In a further technical solution, the signing of the target feature keywords to obtain a feature signature includes:
calculating and generating feature hash values according to the target feature keywords to obtain feature vectors;
integrating the feature vectors with the target feature keywords to form the feature signature.
In a further technical solution, the forming of a webpage text content fingerprint according to the feature signature includes:
calculating a weight value for each dimension of the feature vectors in the feature signature to obtain a target vector;
setting the positions corresponding to positive values in the target vector to one and the positions corresponding to non-positive values in the target vector to zero, to obtain the webpage text content fingerprint.
In a further technical solution, the calculating of similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content includes:
building a webpage fingerprint inverted table according to the webpage text content;
obtaining the document numbers appearing in the webpage fingerprint inverted table;
performing intersection calculation on the document numbers to obtain the similarity of the webpage text content.
This application also provides an apparatus for fast deduplication of text content, including:
a crawling unit, configured to crawl several webpage text contents that need to be deduplicated;
a preprocessing unit, configured to preprocess the several webpage text contents to obtain text content to be deduplicated;
an extraction unit, configured to extract feature keywords from the text content to be deduplicated to obtain target feature keywords;
a weight calculation unit, configured to perform weight calculation on the target feature keywords to obtain weight values;
a signature unit, configured to sign the target feature keywords to obtain a feature signature;
a fingerprint forming unit, configured to form a webpage text content fingerprint according to the feature signature;
a storage unit, configured to store the webpage text content fingerprint in an inverted index;
a similarity calculation unit, configured to calculate similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content;
an output unit, configured to output the similarity of the webpage text content.
This application also provides a computer device. The computer device includes a memory and a processor; a computer program is stored in the memory, and the processor implements the above method when executing the computer program.
This application also provides a storage medium. The storage medium stores a computer program which, when executed by a processor, can implement the above method.
Compared with the prior art, the beneficial effects of this application are: based on the semantic relationships between words, this application extracts target feature keywords and weights that can represent webpage text content, generates webpage text content fingerprints from the target feature keywords and weights to achieve a compressed representation and save storage space and computing time, stores the webpage text content fingerprints in an Elasticsearch inverted-index data structure, and converts the similarity calculation into Elasticsearch Boolean-model retrieval, which effectively meets the real-time deduplication performance requirements of massive, large-scale data and improves accuracy and deduplication performance.
This application is further described below with reference to the accompanying drawings and specific embodiments.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the method for fast deduplication of text content provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of the method for fast deduplication of text content provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a sub-flow of the method for fast deduplication of text content provided by an embodiment of this application;
FIG. 4 is a schematic diagram of a sub-flow of the method for fast deduplication of text content provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a sub-flow of the method for fast deduplication of text content provided by an embodiment of this application;
FIG. 6 is a schematic diagram of a sub-flow of the method for fast deduplication of text content provided by an embodiment of this application;
FIG. 7 is a schematic diagram of a sub-flow of the method for fast deduplication of text content provided by an embodiment of this application;
FIG. 8 is a schematic diagram of a sub-flow of the method for fast deduplication of text content provided by an embodiment of this application;
FIG. 9 is a schematic diagram of the formation of a webpage text content fingerprint provided by an embodiment of this application;
FIG. 10 is a schematic diagram of the formation of target feature keywords provided by an embodiment of this application;
FIG. 11 is a schematic structural diagram of an inverted index provided by an embodiment of this application;
FIG. 12 is a schematic block diagram of the apparatus for fast deduplication of text content provided by an embodiment of this application;
FIG. 13 is a schematic block diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or combinations thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the method for fast deduplication of text content provided by an embodiment of this application, and FIG. 2 is a schematic flowchart of the method. The method is applied in a server; the server exchanges data with a terminal, obtains from the terminal several webpage text contents that need to be deduplicated, quickly deduplicates these webpage text contents, and outputs the deduplicated result to the terminal for display.
FIG. 2 is a schematic flowchart of the method for fast deduplication of text content provided by an embodiment of this application. As shown in FIG. 2, the method includes the following steps S110 to S190.
S110: Crawl several webpage text contents that need to be deduplicated.
In this embodiment, webpage text content refers to the informative text displayed in a webpage.
In one embodiment, referring to FIG. 3, the above step S110 may include steps S111 to S114.
S111: Allocate URL addresses;
S112: Crawl URLs according to the URL addresses to obtain URLs to be crawled;
S113: Judge whether the URL to be crawled has already been crawled;
If so, return to step S112;
S114: If not, crawl the webpage text content at the URL to be crawled.
The distributed task scheduler allocates URL (Uniform Resource Locator) addresses to the crawler application nodes. If a URL to be crawled has already been crawled, it is discarded directly; otherwise, the crawler application node crawls the webpage text content.
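To make the dispatch step concrete, the following minimal Python sketch (the class and function names are hypothetical; the embodiment does not prescribe an implementation) keeps a record of already-crawled URLs and assigns the remaining URLs to crawler application nodes. A real distributed deployment would keep the crawled-URL set in shared storage rather than in process memory.

class CrawlScheduler:
    """Minimal sketch of the URL-dispatch step: the scheduler records crawled URLs
    and hands the remaining ones to crawler application nodes."""

    def __init__(self):
        # in a distributed deployment this set would live in shared storage (e.g. a key-value store)
        self.crawled = set()

    def dispatch(self, urls, nodes):
        """Assign not-yet-crawled URLs to nodes round-robin; discard already-crawled URLs."""
        assignments = {node: [] for node in nodes}
        for i, url in enumerate(urls):
            if url in self.crawled:      # already crawled -> discard
                continue
            self.crawled.add(url)
            assignments[nodes[i % len(nodes)]].append(url)
        return assignments

scheduler = CrawlScheduler()
print(scheduler.dispatch(
    ["http://a.example/1", "http://a.example/1", "http://b.example/2"],
    ["node-1", "node-2"]))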
S120: Preprocess the several webpage text contents to obtain the text content to be deduplicated.
In this embodiment, the text content to be deduplicated refers to text content that has been cleaned, filtered and word-segmented.
In one embodiment, referring to FIG. 4, the above step S120 may include steps S121 to S122.
S121: Parse and clean the several webpage text contents to obtain intermediate text content;
S122: Perform word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In this embodiment, the intermediate text content refers to the content remaining after unnecessary data has been removed.
Parsing and cleaning the webpage text content mainly includes removing HTML tags, converting uppercase English letters to lowercase, converting traditional Chinese characters to simplified Chinese, and so on. Processing the webpage text content also involves Chinese word segmentation, which splits the text content into separate, meaningful words.
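A rough Python sketch of this preprocessing step is given below; it strips HTML tags with regular expressions, lowercases English letters and segments the text with the jieba segmenter. The choice of jieba and the regular-expression cleaning are illustrative assumptions only, since the embodiment does not name specific tools, and traditional-to-simplified conversion is only indicated by a comment.

import re
import jieba  # Chinese word segmenter; an assumed tool choice for this sketch

def clean_and_segment(html):
    """Rough sketch of parsing/cleaning a page and segmenting it into words."""
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)  # drop script/style blocks
    text = re.sub(r"<[^>]+>", " ", text)   # remove remaining HTML tags
    text = text.lower()                    # uppercase English letters -> lowercase
    # traditional-to-simplified Chinese conversion would be applied here as well
    return [tok.strip() for tok in jieba.lcut(text) if tok.strip()]

print(clean_and_segment("<p>努力把我国建设成为环境良好的国家 Hello WORLD</p>"))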
S130: Extract feature keywords from the text content to be deduplicated to obtain target feature keywords.
In this embodiment, the target feature keywords are words that represent the characteristics and essence of the webpage text content. The formation process of the target feature keywords is shown in FIG. 10. There are many ways to select features of the text content to be deduplicated, such as shingles and n-grams; because sequences of multiple words or characters do not have separate, clear semantics, which weakens the representation of the document content, single keywords are used here as the features of the text content to be deduplicated.
In one embodiment, referring to FIG. 5, the above step S130 may include steps S131 to S134.
S131: Divide the text content to be deduplicated into blocks by position to obtain text blocks.
In this embodiment, a text block refers to content formed by text at different positions.
The text content to be deduplicated is divided into blocks by position, mainly into a meta-information block, a webpage body block and a title block.
S132: Extract feature keywords from the text blocks to obtain initial feature keywords.
In this embodiment, the initial feature keywords are feature words, extracted directly from the content of a text block, that represent the content of that text block.
Specifically, the feature keywords of each text block are extracted according to semantic relationships.
S133: Perform semantic expansion on the initial feature keywords to obtain intermediate feature keywords.
In this embodiment, the intermediate feature keywords are words that are synonymous with the initial feature keywords.
S134: Merge the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
In this embodiment, merging the expanded keywords with the initial feature keywords makes the target feature keywords more comprehensive and accurate.
S140: Perform weight calculation on the target feature keywords to obtain weight values.
In this embodiment, the weight value reflects factors such as the proportion and position of a word's occurrences in the text content.
In one embodiment, the above step S140 may include steps S141 to S142.
S141: Calculate weights according to the frequency and position of the target feature keywords in the webpage text content to obtain several weights.
In this embodiment, the simplest way to calculate the weight of a target feature keyword mainly uses two indicators, importance and discrimination, where the importance Weight is based on a variant of the word frequency tf and is calculated as: Weight = log(1 + log(1 + tf));
The discrimination Discrimination is based on an inverse document frequency factor and is calculated as
[Formula not reproduced in the text: Discrimination as a function of N and df.]
where N is the total number of documents in the document collection and df is the document frequency.
Finally, normalization is performed based on the document length; the normalization formula is
[Formula not reproduced in the text: document-length normalization with adjustment factor b.]
where b is an adjustment factor with a default value of 0.85.
Taking the above weighting factors into account, the final weight formula for a target feature keyword is as follows:
[Formula not reproduced in the text: final weight combining importance, discrimination and length normalization.]
S142: Sort the several weights to obtain the weight values.
The final target feature keywords and their weight set are obtained by sorting by weight.
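The sketch below shows one way the weight of a single target feature keyword could be computed from the quantities named above. The importance formula log(1 + log(1 + tf)) and the adjustment factor b = 0.85 are taken from the text; the exact discrimination and length-normalization formulas appear only as images in the original, so the idf-style factor log(N/df), the pivoted-length normalization and the multiplicative combination of the three factors are assumptions made for illustration.

import math

def keyword_weight(tf, df, n_docs, doc_len, avg_doc_len, b=0.85):
    """Sketch of the weight of one target feature keyword.
    importance     = log(1 + log(1 + tf))                 (formula given in the text)
    discrimination = log(N / df)                          (assumed idf-style form)
    normalization  = 1 - b + b * doc_len / avg_doc_len    (assumed pivoted-length form)
    The multiplicative combination of the three factors is also an assumption."""
    importance = math.log(1 + math.log(1 + tf))
    discrimination = math.log(n_docs / df) if df else 0.0
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return importance * discrimination / norm

# keywords sorted by weight give the final target-feature-keyword/weight set
stats = {"环境": (3, 1200), "国家": (5, 40000)}   # keyword -> (tf, df), toy numbers
weights = {w: keyword_weight(tf, df, n_docs=100000, doc_len=800, avg_doc_len=600.0)
           for w, (tf, df) in stats.items()}
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))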
S150: Sign the target feature keywords to obtain a feature signature.
In this embodiment, the feature signature refers to the feature hash values generated from the semantic relationships between the target feature keywords. Specifically, a semantic feature signature algorithm is used to sign the target feature keywords.
In one embodiment, referring to FIG. 6, the above step S150 may include steps S151 to S152.
S151: Calculate and generate feature hash values according to the target feature keywords to obtain feature vectors.
In this embodiment, a feature vector refers to the hash value generated from a target feature keyword.
In this embodiment, the hash value is a b-dimensional vector, where b is set manually. Webpage text content is composed of sentences, and sentences are composed of words; the topic a document expresses is determined jointly by the words and the context in which they appear. For example, consider the following two sentences: "strive to build our country into a country with an outstanding environment" and "strive to build our country into a country with a good environment". The two sentences convey the same message; if they appear in different documents, the documents can be considered duplicate content. However, the features of the two sentences are obviously not exactly the same, so the document fingerprints generated by the original locality-sensitive hashing algorithm will differ, the documents will be regarded as different documents, and a misjudgment results.
Pseudocode of the semantic feature signature algorithm:
Input: the feature set of the web documents,
Feature={Feature1,Feature2,...,Featurei,...,Featuren},Featurei={Featurei1,Featurei2,...,Featureij,...,Featurein};
and the corresponding weight set,
Weight={Weight1,Weight2,...,Weighti,...,Weightn},Weighti={Weighti1,Weighti2,...,Weightij,...,Weightin};
Output: the set of semantic feature hash values of the web documents,
HashVal={HashVal1,HashVal2,...,HashVali,...,HashValn},HashVali={HashVali1,HashVali2,...,HashValij,...,HashValin};
The pseudocode is as follows:
[Pseudocode not reproduced in the text.]
Here, sim(Featureij, Featurekl) is a similarity function for judging word features. It is computed using the semantic relationships between concepts, organized as a hierarchical semantic dictionary, and the hierarchical path distance of words in the semantic dictionary is used to measure semantic similarity. When the similarity between word features is less than the set threshold, the hash values of the features are set to the same value, whereas the original locality-sensitive hashing algorithm generates different hash values for different features and ignores the relationships between words.
S152: Integrate the feature vectors with the target feature keywords to form the feature signature.
Compressing the formed feature signature yields the webpage text content fingerprint. This achieves a compressed representation and saves storage space and computing time.
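The following Python sketch illustrates the core idea of the semantic feature signature step: features whose semantic distance, as returned by a sim() function, falls below the threshold are assigned the same hash value, so near-synonymous keywords later contribute identical bits to the fingerprint. The truncated-MD5 hash, the toy sim() function and the threshold value are illustrative assumptions; the embodiment measures similarity with hierarchical path distances in a semantic dictionary.

import hashlib

def feature_hash(word, b=64):
    """b-bit hash of a feature keyword (truncated MD5; the concrete hash function is not specified)."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << b) - 1)

def semantic_signature(features, sim, threshold=0.2, b=64):
    """Assign hash values so that features whose semantic distance sim() is below the
    threshold share the same hash value; other features get independent hashes."""
    hashes = {}
    for f in features:
        for g, h in list(hashes.items()):
            if sim(f, g) < threshold:    # semantically close -> reuse the earlier feature's hash
                hashes[f] = h
                break
        else:
            hashes[f] = feature_hash(f, b)
    return hashes

# toy sim(): distance 0.0 for listed synonym pairs, 1.0 otherwise; the embodiment instead
# uses hierarchical path distances in a semantic dictionary
synonyms = {frozenset({"良好", "杰出"})}
sim = lambda x, y: 0.0 if frozenset({x, y}) in synonyms else 1.0
print(semantic_signature(["环境", "良好", "杰出"], sim))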
S160: Form a webpage text content fingerprint according to the feature signature.
In this embodiment, the webpage text content fingerprint refers to the vector formed by setting zeros and ones according to the feature vectors.
In one embodiment, referring to FIG. 7, the above step S160 may include steps S161 to S162.
S161: Calculate a weight value for each dimension of the feature vectors in the feature signature to obtain a target vector;
S162: Set the positions corresponding to positive values in the target vector to one and the positions corresponding to non-positive values to zero, to obtain the webpage text content fingerprint.
Webpage text content, i.e. a document, is composed of a series of strings, and operating directly on strings requires a large amount of storage space and computing time. Therefore, the original text is analyzed and processed, target feature keywords that can represent the original document are extracted, and the webpage text content fingerprint is generated through a hash function. By comparing the webpage text content fingerprints that represent the webpage text content, duplicate or nearly duplicate documents are found. Two documents are considered duplicates when they have the same number of identical fingerprints, or when the ratio of identical fingerprints to the total number of fingerprints reaches a certain threshold; otherwise they are not considered duplicates.
In the b-dimensional feature vector V, each dimension is computed separately: if the hash value of the corresponding bit of a feature is 1, the weight of that target feature keyword is added; otherwise the weight is subtracted. After all features have been processed, if the i-th dimension of the vector V is positive, the i-th position of the b-bit fingerprint is set to 1; otherwise it is set to 0. This yields a vector whose values are 0s and 1s, namely the webpage text content fingerprint, as shown in FIG. 9.
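A minimal sketch of this fingerprint-forming step follows: for every bit position, the keyword weight is added when the corresponding bit of the feature hash is 1 and subtracted otherwise, and the sign of each accumulated value becomes the fingerprint bit. The 4-bit example values are toy data for illustration.

def fingerprint(feature_hashes, weights, b=64):
    """Sketch of forming the b-bit webpage text content fingerprint: for each bit,
    add the keyword weight if that bit of its hash is 1, otherwise subtract it;
    positive totals become 1-bits, non-positive totals become 0-bits."""
    v = [0.0] * b
    for feat, h in feature_hashes.items():
        w = weights.get(feat, 1.0)
        for i in range(b):
            v[i] += w if (h >> i) & 1 else -w
    bits = 0
    for i in range(b):
        if v[i] > 0:
            bits |= 1 << i
    return bits

# toy 4-bit example: two keywords with hand-picked hashes and weights
fp = fingerprint({"环境": 0b1011, "国家": 0b0110}, {"环境": 2.0, "国家": 1.0}, b=4)
print(format(fp, "04b"))   # -> 1011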
S170: Store the webpage text content fingerprints in an inverted index.
The webpage fingerprints are stored in an Elasticsearch inverted-index data structure, and the similarity calculation is converted into Elasticsearch Boolean-model retrieval. ElasticSearch is a Lucene-based search server that provides a distributed, multi-user full-text search engine with a RESTful web interface.
Based on Elasticsearch, the mapping from webpage document ID to webpage fingerprint is converted into a mapping from webpage fingerprint to webpage document ID and stored, as shown in FIG. 11, where webpage fingerprint 1 points to the ID of web document 1 and the ID of web document 2, and webpage fingerprint 2 points to the list of web document IDs containing this target feature keyword; a web document here refers to the webpage text content.
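The sketch below shows the inversion itself, with a plain dictionary standing in for the Elasticsearch index: the mapping from document ID to fingerprint is turned into a mapping from fingerprint to the list of document IDs, which is the structure shown in FIG. 11.

from collections import defaultdict

def build_inverted_index(doc_fingerprints):
    """Invert {document ID: fingerprint} into {fingerprint: [document IDs]} — the
    structure of FIG. 11; a plain dict stands in here for the Elasticsearch index."""
    inverted = defaultdict(list)
    for doc_id, fp in doc_fingerprints.items():
        inverted[fp].append(doc_id)
    return dict(inverted)

index = build_inverted_index({"doc-1": 0b1011, "doc-2": 0b1011, "doc-3": 0b0110})
print(index)   # fingerprint 0b1011 -> ['doc-1', 'doc-2']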
S180: Calculate the similarity according to the webpage text content fingerprints to obtain the similarity of the webpage text content.
In this embodiment, the above similarity refers to the similarity between the text contents of two webpages.
In one embodiment, referring to FIG. 8, the above step S180 may include steps S181 to S183.
S181: Build a webpage fingerprint inverted table according to the webpage text content;
S182: Obtain the document numbers appearing in the webpage fingerprint inverted table;
S183: Perform intersection calculation on the document numbers to obtain the similarity of the webpage text content.
After the webpage text content fingerprint of each webpage has been computed, the similarity of two webpage text content fingerprints is calculated. The Hamming distance is used to measure similarity: a webpage fingerprint inverted table is built, the document numbers appearing in the table are queried, and the final result is obtained by taking the intersection. For a 32-bit webpage text content fingerprint, for example, the 32-bit binary signature is divided evenly into 2 blocks of 16 bits each, and all signatures whose Hamming distance is within 1 are computed. By the pigeonhole principle, if the Hamming distance between two webpage text content fingerprints is within 1, at least one of their blocks must be exactly identical. In this way, the similarity calculation can be converted into a Boolean retrieval model through the inverted index, greatly reducing the document similarity calculation time.
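The following sketch illustrates the block-based candidate search described above for 32-bit fingerprints: each fingerprint is split into two 16-bit blocks, a block-level inverted table maps each block value to the documents containing it, and only documents reached through at least one identical block are checked with an exact Hamming-distance computation. Function names and the example fingerprints are illustrative.

def blocks(fp, n_blocks=2, bits=32):
    """Split a fingerprint into equal bit blocks (2 blocks of 16 bits for a 32-bit fingerprint)."""
    width = bits // n_blocks
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(n_blocks)]

def hamming(a, b):
    return bin(a ^ b).count("1")

def find_near_duplicates(query_fp, block_index, fingerprints, max_dist=1):
    """Pigeonhole search: a fingerprint within Hamming distance 1 of the query shares at
    least one identical block, so only documents reached through the block inverted table
    need an exact Hamming-distance check."""
    candidates = set()
    for key in blocks(query_fp):
        candidates.update(block_index.get(key, ()))
    return sorted(d for d in candidates if hamming(query_fp, fingerprints[d]) <= max_dist)

# build the block-level inverted table once, then query it
fingerprints = {"doc-1": 0xABCD1234, "doc-2": 0xABCD1235, "doc-3": 0x0F0F0F0F}
block_index = {}
for doc_id, fp in fingerprints.items():
    for key in blocks(fp):
        block_index.setdefault(key, []).append(doc_id)
print(find_near_duplicates(0xABCD1234, block_index, fingerprints))   # -> ['doc-1', 'doc-2']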
S190: Output the similarity of the webpage text content.
In this embodiment, the similarity of the webpage text content is output to the terminal for display.
The deduplication effect of this method on webpage text content is evaluated using the precision (Precision) and recall (Recall) metrics.
[Formulas not reproduced in the text: Precision and Recall.]
The Precision and Recall metrics each describe only one aspect of performance and ignore the overall performance. The deduplication effect evaluation value F1 combines the two and is defined as:
F1 = 2 × Precision × Recall / (Precision + Recall)
The running effect of this method is compared with that of the classic locality-sensitive simhash algorithm; the comparison of running effects is shown in Table 1 below:
Table 1. Comparison of algorithm effectiveness
[Table 1 not reproduced in the text: precision and recall comparison with the simhash algorithm.]
In terms of effectiveness, the method improves both precision and recall substantially compared with the locality-sensitive hashing algorithm.
The running efficiency of this method is compared with that of the classic locality-sensitive simhash algorithm; the efficiency comparison is shown in Table 2 below:
Table 2. Comparison of algorithm running efficiency
Algorithm                      Number of webpages    Time (ms)
Simhash method                 60000                 18990.56
Semantic similarity method     60000                 80.3
Simhash method                 276000                89535.32
Semantic similarity method     276000                116.5
Simhash method                 1436000               513050.70
Semantic similarity method     1436000               120.9
Simhash method                 25435017              7206347.65
Semantic similarity method     25435017              386.5
In terms of running efficiency, this method performs well: as the data volume increases sharply, its performance degrades much more slowly than locality-sensitive hashing, so it can be applied in large-scale, massive-data environments.
In other embodiments, in addition to the similarity of the webpage text content, the deduplication effect evaluation value is also output.
The above method for fast deduplication of text content extracts, based on the semantic relationships between words, target feature keywords and weights that can represent webpage text content, generates webpage text content fingerprints from the target feature keywords and weights to achieve a compressed representation and save storage space and computing time, stores the webpage text content fingerprints in an Elasticsearch inverted-index data structure, and converts the similarity calculation into Elasticsearch Boolean-model retrieval, which effectively meets the real-time deduplication performance requirements of massive, large-scale data and improves accuracy and deduplication performance.
FIG. 12 is a schematic block diagram of an apparatus 300 for fast deduplication of text content provided by an embodiment of this application. As shown in FIG. 12, corresponding to the above method for fast deduplication of text content, this application also provides an apparatus 300 for fast deduplication of text content. The apparatus 300 includes units for executing the above method for fast deduplication of text content, and the apparatus may be configured in a server.
Specifically, referring to FIG. 12, the apparatus 300 for fast deduplication of text content includes:
a crawling unit 301, configured to crawl several webpage text contents that need to be deduplicated;
a preprocessing unit 302, configured to preprocess the several webpage text contents to obtain text content to be deduplicated;
an extraction unit 303, configured to extract feature keywords from the text content to be deduplicated to obtain target feature keywords;
a weight calculation unit 304, configured to perform weight calculation on the target feature keywords to obtain weight values;
a signature unit 305, configured to sign the target feature keywords to obtain a feature signature;
a fingerprint forming unit 306, configured to form a webpage text content fingerprint according to the feature signature;
a storage unit 307, configured to store the webpage text content fingerprint in an inverted index;
a similarity calculation unit 308, configured to calculate similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content;
an output unit 309, configured to output the similarity of the webpage text content.
In one embodiment, the crawling unit 301 includes:
an address allocation subunit, configured to allocate URL addresses;
a crawling subunit, configured to crawl URLs according to the URL addresses to obtain URLs to be crawled;
a crawling judgment subunit, configured to judge whether a URL to be crawled has already been crawled, and if so, to return to the crawling of URLs according to the URL addresses to obtain URLs to be crawled;
a content crawling subunit, configured to crawl, if not, the webpage text content at the URL to be crawled.
In one embodiment, the preprocessing unit 302 includes:
a cleaning subunit, configured to parse and clean the several webpage text contents to obtain intermediate text content;
a word segmentation subunit, configured to perform word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In one embodiment, the extraction unit 303 includes:
a blocking subunit, configured to divide the text content to be deduplicated into blocks by position to obtain text blocks;
an extraction subunit, configured to extract feature keywords from the text blocks to obtain initial feature keywords;
an expansion subunit, configured to perform semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
a merging subunit, configured to merge the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
In one embodiment, the weight calculation unit 304 includes:
a weight obtaining subunit, configured to calculate weights according to the frequency and position of the target feature keywords in the webpage text content to obtain several weights;
a sorting subunit, configured to sort the several weights to obtain the weight values.
In one embodiment, the signature unit 305 includes:
a vector obtaining subunit, configured to calculate and generate feature hash values according to the target feature keywords to obtain feature vectors;
an integration subunit, configured to integrate the feature vectors with the target feature keywords to form the feature signature.
In one embodiment, the fingerprint forming unit 306 includes:
a target vector forming subunit, configured to calculate a weight value for each dimension of the feature vectors in the feature signature to obtain a target vector;
a setting subunit, configured to set the positions corresponding to positive values in the target vector to one and the positions corresponding to non-positive values to zero, to obtain the webpage text content fingerprint.
In one embodiment, the similarity calculation unit 308 includes:
a building subunit, configured to build a webpage fingerprint inverted table according to the webpage text content;
a document number obtaining subunit, configured to obtain the document numbers appearing in the webpage fingerprint inverted table;
an intersection calculation subunit, configured to perform intersection calculation on the document numbers to obtain the similarity of the webpage text content.
It should be noted that those skilled in the art can clearly understand that, for the specific implementation process of the above apparatus 300 for fast deduplication of text content and of each unit, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, details are not repeated here.
The above apparatus 300 for fast deduplication of text content may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 13.
Please refer to FIG. 13, which is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 500 is a server.
Referring to FIG. 13, the computer device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions which, when executed, can cause the processor 502 to perform a method for fast deduplication of text content.
The processor 502 is used to provide computing and control capabilities to support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can perform a method for fast deduplication of text content.
The network interface 505 is used for network communication with other devices. Those skilled in the art can understand that the structure shown in FIG. 13 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device 500 to which the solution of this application is applied; the specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
crawling several webpage text contents that need to be deduplicated;
preprocessing the several webpage text contents to obtain text content to be deduplicated;
extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
performing weight calculation on the target feature keywords to obtain weight values;
signing the target feature keywords to obtain a feature signature;
forming a webpage text content fingerprint according to the feature signature;
storing the webpage text content fingerprint in an inverted index;
calculating similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content;
outputting the similarity of the webpage text content.
In one embodiment, when implementing the step of crawling several webpage text contents that need to be deduplicated, the processor 502 specifically implements the following steps:
allocating URL addresses;
crawling URLs according to the URL addresses to obtain URLs to be crawled;
judging whether a URL to be crawled has already been crawled;
if so, returning to the crawling of URLs according to the URL addresses to obtain URLs to be crawled;
if not, crawling the webpage text content at the URL to be crawled.
In one embodiment, when implementing the step of preprocessing the several webpage text contents to obtain the text content to be deduplicated, the processor 502 specifically implements the following steps:
parsing and cleaning the several webpage text contents to obtain intermediate text content;
performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In one embodiment, when implementing the step of extracting feature keywords from the text content to be deduplicated to obtain target feature keywords, the processor 502 specifically implements the following steps:
dividing the text content to be deduplicated into blocks by position to obtain text blocks;
extracting feature keywords from the text blocks to obtain initial feature keywords;
performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
In one embodiment, when implementing the step of signing the target feature keywords to obtain a feature signature, the processor 502 specifically implements the following steps:
calculating and generating feature hash values according to the target feature keywords to obtain feature vectors;
integrating the feature vectors with the target feature keywords to form the feature signature.
In one embodiment, when implementing the step of forming a webpage text content fingerprint according to the feature signature, the processor 502 specifically implements the following steps:
calculating a weight value for each dimension of the feature vectors in the feature signature to obtain a target vector;
setting the positions corresponding to positive values in the target vector to one and the positions corresponding to non-positive values in the target vector to zero, to obtain the webpage text content fingerprint.
In one embodiment, when implementing the step of calculating similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content, the processor 502 specifically implements the following steps:
building a webpage fingerprint inverted table according to the webpage text content;
obtaining the document numbers appearing in the webpage fingerprint inverted table;
performing intersection calculation on the document numbers to obtain the similarity of the webpage text content.
It should be understood that, in the embodiments of this application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program. The computer program includes program instructions and may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Therefore, this application also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:
crawling several webpage text contents that need to be deduplicated;
preprocessing the several webpage text contents to obtain text content to be deduplicated;
extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
performing weight calculation on the target feature keywords to obtain weight values;
signing the target feature keywords to obtain a feature signature;
forming a webpage text content fingerprint according to the feature signature;
storing the webpage text content fingerprint in an inverted index;
calculating similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content;
outputting the similarity of the webpage text content.
In one embodiment, when executing the computer program to implement the step of crawling several webpage text contents that need to be deduplicated, the processor specifically implements the following steps:
allocating URL addresses;
crawling URLs according to the URL addresses to obtain URLs to be crawled;
judging whether a URL to be crawled has already been crawled;
if so, returning to the crawling of URLs according to the URL addresses to obtain URLs to be crawled;
if not, crawling the webpage text content at the URL to be crawled.
In one embodiment, when executing the computer program to implement the step of preprocessing the several webpage text contents to obtain the text content to be deduplicated, the processor specifically implements the following steps:
parsing and cleaning the several webpage text contents to obtain intermediate text content;
performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
In one embodiment, when executing the computer program to implement the step of extracting feature keywords from the text content to be deduplicated to obtain target feature keywords, the processor specifically implements the following steps:
dividing the text content to be deduplicated into blocks by position to obtain text blocks;
extracting feature keywords from the text blocks to obtain initial feature keywords;
performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
In one embodiment, when executing the computer program to implement the step of signing the target feature keywords to obtain a feature signature, the processor specifically implements the following steps:
calculating and generating feature hash values according to the target feature keywords to obtain feature vectors;
integrating the feature vectors with the target feature keywords to form the feature signature.
In one embodiment, when executing the computer program to implement the step of forming a webpage text content fingerprint according to the feature signature, the processor specifically implements the following steps:
calculating a weight value for each dimension of the feature vectors in the feature signature to obtain a target vector;
setting the positions corresponding to positive values in the target vector to one and the positions corresponding to non-positive values in the target vector to zero, to obtain the webpage text content fingerprint.
In one embodiment, when executing the computer program to implement the step of calculating similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content, the processor specifically implements the following steps:
building a webpage fingerprint inverted table according to the webpage text content;
obtaining the document numbers appearing in the webpage fingerprint inverted table;
performing intersection calculation on the document numbers to obtain the similarity of the webpage text content.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as going beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other division methods in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The steps in the methods of the embodiments of this application may be reordered, combined and deleted according to actual needs. The units in the apparatuses of the embodiments of this application may be combined, divided and deleted according to actual needs. In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application.
The above are only specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily think of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall all fall within the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (10)

  1. A method for fast deduplication of text content, characterized by comprising:
    crawling several webpage text contents that need to be deduplicated;
    preprocessing the several webpage text contents to obtain text content to be deduplicated;
    extracting feature keywords from the text content to be deduplicated to obtain target feature keywords;
    performing weight calculation on the target feature keywords to obtain weight values;
    signing the target feature keywords to obtain a feature signature;
    forming a webpage text content fingerprint according to the feature signature;
    storing the webpage text content fingerprint in an inverted index;
    calculating similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content;
    outputting the similarity of the webpage text content.
  2. The method for fast deduplication of text content according to claim 1, characterized in that the crawling of several webpage text contents that need to be deduplicated comprises:
    allocating URL addresses;
    crawling URLs according to the URL addresses to obtain URLs to be crawled;
    judging whether a URL to be crawled has already been crawled;
    if so, returning to the crawling of URLs according to the URL addresses to obtain URLs to be crawled;
    if not, crawling the webpage text content at the URL to be crawled.
  3. The method for fast deduplication of text content according to claim 1, characterized in that the preprocessing of the several webpage text contents to obtain the text content to be deduplicated comprises:
    parsing and cleaning the several webpage text contents to obtain intermediate text content;
    performing word segmentation on the intermediate text content to obtain the text content to be deduplicated.
  4. The method for fast deduplication of text content according to claim 1, characterized in that the extracting of feature keywords from the text content to be deduplicated to obtain target feature keywords comprises:
    dividing the text content to be deduplicated into blocks by position to obtain text blocks;
    extracting feature keywords from the text blocks to obtain initial feature keywords;
    performing semantic expansion on the initial feature keywords to obtain intermediate feature keywords;
    merging the intermediate feature keywords and the initial feature keywords to obtain the target feature keywords.
  5. The method for fast deduplication of text content according to claim 1, characterized in that the signing of the target feature keywords to obtain a feature signature comprises:
    calculating and generating feature hash values according to the target feature keywords to obtain feature vectors;
    integrating the feature vectors with the target feature keywords to form the feature signature.
  6. The method for fast deduplication of text content according to claim 1, characterized in that the forming of a webpage text content fingerprint according to the feature signature comprises:
    calculating a weight value for each dimension of the feature vectors in the feature signature to obtain a target vector;
    setting the positions corresponding to positive values in the target vector to one and the positions corresponding to non-positive values in the target vector to zero, to obtain the webpage text content fingerprint.
  7. The method for fast deduplication of text content according to claim 1, characterized in that the calculating of similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content comprises:
    building a webpage fingerprint inverted table according to the webpage text content;
    obtaining the document numbers appearing in the webpage fingerprint inverted table;
    performing intersection calculation on the document numbers to obtain the similarity of the webpage text content.
  8. An apparatus for fast deduplication of text content, characterized by comprising:
    a crawling unit, configured to crawl several webpage text contents that need to be deduplicated;
    a preprocessing unit, configured to preprocess the several webpage text contents to obtain text content to be deduplicated;
    an extraction unit, configured to extract feature keywords from the text content to be deduplicated to obtain target feature keywords;
    a weight calculation unit, configured to perform weight calculation on the target feature keywords to obtain weight values;
    a signature unit, configured to sign the target feature keywords to obtain a feature signature;
    a fingerprint forming unit, configured to form a webpage text content fingerprint according to the feature signature;
    a storage unit, configured to store the webpage text content fingerprint in an inverted index;
    a similarity calculation unit, configured to calculate similarity according to the webpage text content fingerprint to obtain the similarity of the webpage text content;
    an output unit, configured to output the similarity of the webpage text content.
  9. A computer device, characterized in that the computer device comprises a memory and a processor, a computer program is stored in the memory, and the processor implements the method according to any one of claims 1 to 7 when executing the computer program.
  10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, can implement the method according to any one of claims 1 to 7.
PCT/CN2019/116606 2019-04-26 2019-11-08 Method, apparatus, computer device and storage medium for fast deduplication of text content WO2020215667A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910344414.9 2019-04-26
CN201910344414.9A CN110309446A (zh) 2019-04-26 Method, apparatus, computer device and storage medium for fast deduplication of text content

Publications (1)

Publication Number Publication Date
WO2020215667A1 true WO2020215667A1 (zh) 2020-10-29

Family

ID=68075778

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116606 WO2020215667A1 (zh) 2019-04-26 2019-11-08 Method, apparatus, computer device and storage medium for fast deduplication of text content

Country Status (2)

Country Link
CN (1) CN110309446A (zh)
WO (1) WO2020215667A1 (zh)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309446A (zh) * 2019-04-26 2019-10-08 深圳市赛为智能股份有限公司 文本内容快速去重方法、装置、计算机设备及存储介质
CN110956037B (zh) * 2019-10-16 2022-07-08 厦门美柚股份有限公司 多媒体内容重复判断方法及装置
CN110955751A (zh) * 2019-11-13 2020-04-03 广州供电局有限公司 工作票文本去重方法、装置、系统及计算机存储介质
CN110909019B (zh) * 2019-11-14 2022-04-08 湖南赛吉智慧城市建设管理有限公司 大数据查重方法、装置、计算机设备及存储介质
CN111027282A (zh) * 2019-11-21 2020-04-17 精硕科技(北京)股份有限公司 文本去重方法和装置、电子设备及计算机可读存储介质
CN111061934B (zh) * 2019-11-27 2023-04-07 西安四叶草信息技术有限公司 指纹识别方法、设备和存储介质
CN113051907B (zh) * 2019-12-26 2023-05-12 深圳市北科瑞声科技股份有限公司 一种新闻内容的查重方法、系统及装置
CN111428180B (zh) * 2020-03-20 2022-02-08 创优数字科技(广东)有限公司 一种网页去重方法、装置和设备
CN111507260B (zh) * 2020-04-17 2022-08-05 重庆邮电大学 一种视频相似度快速检测方法及检测装置
CN111913912A (zh) * 2020-07-16 2020-11-10 北京字节跳动网络技术有限公司 文件处理方法、文件匹配方法、装置、电子设备和介质
CN114510919A (zh) * 2020-11-16 2022-05-17 中国移动通信有限公司研究院 一种网页文本信息的监控方法、装置及设备
WO2022141860A1 (zh) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 文本去重方法、装置、电子设备及计算机可读存储介质
CN114741468B (zh) * 2022-03-22 2024-03-29 平安科技(深圳)有限公司 文本去重方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645082A (zh) * 2009-04-17 2010-02-10 华中科技大学 Similar webpage deduplication system based on a parallel programming model
CN107025218A (zh) * 2017-04-07 2017-08-08 腾讯科技(深圳)有限公司 Text deduplication method and device
CN108563636A (zh) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Method, device, equipment and storage medium for extracting text keywords
CN110309446A (zh) * 2019-04-26 2019-10-08 深圳市赛为智能股份有限公司 Method, apparatus, computer device and storage medium for fast deduplication of text content

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831198A (zh) * 2012-08-07 2012-12-19 人民搜索网络股份公司 Device and method for identifying similar documents based on document signature technology
CN106376002B (zh) * 2015-07-20 2021-10-12 中兴通讯股份有限公司 Management method and device, and spam SMS monitoring system
CN108595517B (zh) * 2018-03-26 2021-03-09 南京邮电大学 Large-scale document similarity detection method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645082A (zh) * 2009-04-17 2010-02-10 华中科技大学 Similar webpage deduplication system based on a parallel programming model
CN107025218A (zh) * 2017-04-07 2017-08-08 腾讯科技(深圳)有限公司 Text deduplication method and device
CN108563636A (zh) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Method, device, equipment and storage medium for extracting text keywords
CN110309446A (zh) * 2019-04-26 2019-10-08 深圳市赛为智能股份有限公司 Method, apparatus, computer device and storage medium for fast deduplication of text content

Also Published As

Publication number Publication date
CN110309446A (zh) 2019-10-08

Similar Documents

Publication Publication Date Title
WO2020215667A1 (zh) Method, apparatus, computer device and storage medium for fast deduplication of text content
CN109241274B (zh) Text clustering method and device
WO2019091026A1 (zh) Fast knowledge-base document retrieval method, application server and computer-readable storage medium
US7962491B1 Document near-duplicate detection
US10346257B2 Method and device for deduplicating web page
CN101872351B (zh) Method and device for identifying synonyms, and method and device for searching using the same
WO2015180432A1 (zh) Cluster storage method and device
CN102918532B (zh) Detection of spam in search result ranking
CN108763348B (zh) Improved classification method extending word feature vectors of short texts
CN102043851A (zh) Multi-document automatic summarization method based on frequent itemsets
CN108804642A (zh) Retrieval method, device, computer device and storage medium
CN110321466B (zh) Method and system for duplicate checking of securities information based on semantic analysis
JP2005302043A (ja) Enhanced clustering of multi-type data objects for search term suggestion
CN108647322B (zh) Method for identifying the similarity of large amounts of Web text information based on a word network
WO2017113592A1 (zh) Model generation method, word weighting method, device, equipment and computer storage medium
WO2020228182A1 (zh) Big-data-based data deduplication method, device, equipment and storage medium
CN112395875A (zh) Keyword extraction method, device, terminal and storage medium
CN109815401A (zh) Person-name disambiguation method applied to Web person search
CN110705285B (zh) Method, device, server and readable storage medium for constructing a topic lexicon for government-affairs texts
Yuan et al. A mathematical information retrieval system based on RankBoost
Bama et al. A mathematical approach for mining web content outliers using term frequency ranking
CN113609247A (zh) Big-data text deduplication technique based on an improved Simhash algorithm
CN114547233A (zh) Data duplicate checking method, device and electronic equipment
Duan et al. Error correction for search engine by mining bad case
CN115905577B (zh) Knowledge graph construction method and device, and regulation retrieval method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19926181

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19926181

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 180322)

122 Ep: pct application non-entry in european phase

Ref document number: 19926181

Country of ref document: EP

Kind code of ref document: A1