WO2020228182A1 - Method, apparatus, device and storage medium for data deduplication based on big data - Google Patents

Method, apparatus, device and storage medium for data deduplication based on big data

Info

Publication number
WO2020228182A1
Authority
WO
WIPO (PCT)
Prior art keywords
hash function
text data
word segmentation
data
binary string
Prior art date
Application number
PCT/CN2019/103446
Other languages
English (en)
French (fr)
Inventor
王保军
江腾飞
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020228182A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/316 - Indexing structures
    • G06F16/325 - Hash tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of Internet technology, and in particular to a method, device, equipment, and storage medium for data deduplication based on big data.
  • Big data refers to a collection of data that cannot be captured, managed, and processed with conventional software tools within a certain time frame; it is a massive, high-growth-rate, and diversified information asset that requires a new processing model to provide stronger decision-making power, insight and discovery capability, and process optimization capability.
  • Capacity (Volume): the size of the data determines the value and potential information of the data under consideration
  • Velocity: the speed at which data is obtained
  • Veracity: the quality of the data
  • Big data processing technologies mainly include massively parallel processing (MPP) databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
  • Duplicate data greatly increases the I/O and CPU processing pressure of the analysis system. Without deduplication, data analysis efficiency decreases, the hardware overhead of the analysis system increases, and the extra analysis cost is unacceptable for items charged by total analysis traffic. Duplicate data is especially serious in big data, because the Internet is currently full of nearly duplicate information; for big data mining, duplicate data can lead to misjudgments in certain aspects, that is, invalid big data.
  • The existing data deduplication technology compares data according to the data payload, the full data, or custom rules to determine whether duplication exists, and then filters out the redundant data.
  • The data deduplication technology described above can be well applied to scenarios with a small amount of data, but when there is a large amount of duplicate data on the Internet, it is difficult to apply to massive data processing; otherwise it greatly increases the I/O and CPU processing pressure of the analysis system and wastes resources.
  • The purpose of the embodiments of this application is to propose a method, device, computer equipment, and storage medium for data deduplication based on big data, which use a hash algorithm to reduce the dimensionality of the data, thereby reducing the comparison time between two texts and the text storage overhead.
  • an embodiment of the present application provides a data deduplication method based on big data, which adopts the following technical solutions:
  • The k-bit binary string is equally divided into j sub-binary strings, where j is a positive integer greater than or equal to 1;
  • an embodiment of the present application further provides a device for data deduplication based on big data, adopting the following technical solution, the device for data deduplication based on big data includes:
  • the collection module is used to collect at least two text data according to preset keywords
  • the splitting module is used to equally divide the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1;
  • the adjustment module adjusts the arrangement order of the j sub-binary strings, and uses different sub-binary strings as the foremost binary string to generate corresponding j sets and store them in a preset sample library;
  • a matching module configured to match the sample library with the foremost binary string of each of the j sets, and obtain candidate results of each set returned by the sample library;
  • the calculation module is used to calculate the Hamming distance of any two text data according to the candidate results of each text data, and if the Hamming distance is less than or equal to the threshold, perform deduplication.
  • the embodiments of the present application also provide a computer device, which includes a memory, a processor, and computer-readable instructions stored in the memory and running on the processor.
  • the processor executes the computer-readable instructions, the steps of the method for data deduplication based on big data are implemented.
  • the embodiments of the present application also provide one or more non-volatile readable storage media storing computer readable instructions.
  • the one or more processors execute the steps of the method for data deduplication based on big data.
  • In the above-mentioned method, device, equipment, and storage medium for data deduplication based on big data, for each text data a k-bit binary string is generated according to the similar hash function and the hash function, divided into j sub-binary strings, and rearranged so that different sub-binary strings serve as the foremost binary string, generating corresponding j sets that are stored in the preset sample library; the sample library is matched with the foremost binary string of each set to obtain candidate results, and the Hamming distance of any two text data is then calculated to determine whether to perform deduplication. Therefore, using a hash algorithm to reduce the dimensionality of big data can reduce the comparison time of two texts and reduce the text storage overhead.
  • Fig. 1 is a flowchart of an embodiment of a method for data deduplication based on big data according to the present application
  • FIG. 2 is a flowchart of a specific implementation of step 102 in FIG. 1;
  • Fig. 3 is a schematic structural diagram of an embodiment of an apparatus for data deduplication based on big data according to the present application
  • Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • FIG. 1 it is a schematic flow chart of a method for data deduplication based on big data according to an embodiment of this application.
  • the method for data deduplication based on big data can be described as follows.
  • Step 101 Collect at least two text data according to preset keywords.
  • Web crawlers are also known as web spiders, web robots, or web chasers.
  • Traditional crawlers start from the URL of one or several initial webpages and obtain the URL on the initial webpage. During the process of crawling the webpage, they continuously extract new URLs from the current page and put them into the queue until a certain stopping condition of the system is met.
  • multiple text data can be collected by a focused web crawler.
  • The focused web crawler, based on a predetermined crawling target (for example, customer investment category information), selects access log records, APP feedback, WeChat, or World Wide Web pages and related links to obtain the required information.
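The crawl loop described above (start from seed URLs, keep pages matching the preset keywords, keep queueing newly extracted links until a stop condition) can be sketched as follows. The `fetch`, `extract_links`, and `matches_keywords` hooks are hypothetical caller-supplied stand-ins; the patent does not prescribe a concrete crawler implementation.

```python
from collections import deque

def focused_crawl(seed_urls, fetch, extract_links, matches_keywords, max_pages=100):
    """Breadth-first focused crawl: collect pages whose text matches the
    preset keywords, queueing newly found URLs until the page budget is hit."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    collected = []                      # (url, text) pairs containing keywords
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        text = fetch(url)               # download the page content
        if matches_keywords(text):
            collected.append((url, text))
        for link in extract_links(url, text):
            if link not in seen:        # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return collected
```

Because the hooks are injected, the same loop runs against stub data in tests or a real HTTP fetcher in practice.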
  • The keywords can be: name, ID number, address, phone number, bank account, email address, city, zip code, password (such as account query password, withdrawal password, login password, etc.), organization name, business license number, bank account number, transaction date, transaction amount, etc.
  • the web crawler grabs text data related to keywords from log records, APP feedback, WeChat or web pages on the World Wide Web.
  • Correlation means that the text data contains the keywords; the collected text data is stored in a data warehouse in a buffer or memory according to various dimensions, and the data in the data warehouse is the big data.
  • the simhash function is used to convert the text data into a hash code (hashcode), as described below.
  • Step 1021 Select the number of bits k of the simhash function.
  • Step 1022 initialize the bits of the simhash function to 0.
  • Step 1023 Perform word segmentation extraction on each text data, and extract multiple word segmentation_weight pairs.
  • the predetermined number is 2 or 3.
  • the space is also counted as a letter.
  • Step 1024 Perform hash function processing on the word segmentation (feature) in each word segmentation_weight pair (feature_weight_pairs).
  • Step 1025 Perform bitwise vertical accumulation on the word segmentation_weight pairs processed by the hash function to generate k values.
  • The word segmentation_weight pairs processed by the hash function are accumulated vertically bit by bit: if the bit is 1, add 1; if it is 0, subtract 1; finally, k (i.e., bits_count) values are generated.
  • the number of bits generated by the hash is 32.
  • For each bit of the hashcode of each word, if the bit is 1, the value of the corresponding bit of simhash is increased by 1; otherwise, it is decreased by 1.
  • This yields 32 values (that is, simhash includes 32 values).
  • Step 1026 Convert the generated k values into k-bit binary strings.
  • a 64- or 128-bit binary string can also be generated, which is not limited in this embodiment.
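Steps 1021-1026 can be sketched as follows. This is a minimal illustration that assumes an MD5-based token hash and caller-supplied word segmentation_weight pairs; the patent does not fix a particular k-bit hash function.

```python
import hashlib

def simhash(segment_weight_pairs, k=32):
    """Steps 1021-1026: build a k-bit fingerprint from (segment, weight) pairs."""
    v = [0] * k                           # step 1022: initialize the k bits to 0
    for segment, weight in segment_weight_pairs:
        # step 1024: hash each segment to a k-bit value (MD5 is an assumption)
        h = int(hashlib.md5(segment.encode("utf-8")).hexdigest(), 16) % (1 << k)
        for i in range(k):
            # step 1025: bitwise vertical accumulation, weighted by the segment
            v[i] += weight if (h >> i) & 1 else -weight
    # step 1026: collapse the k accumulated values into a k-bit binary string
    return "".join("1" if x > 0 else "0" for x in reversed(v))
```

Similar inputs share most segments, so most of the k counters move the same way and the resulting fingerprints differ in only a few bits, which is what makes the later Hamming-distance comparison meaningful.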
  • The Hamming distance of two texts is the number of differing bits in their two binary strings.
  • Step 103 Divide the k-bit binary string equally into j sub-binary strings, where j is a positive integer greater than or equal to 1.
  • A 32-bit or 64-bit binary string is equally divided into four parts.
  • When a 32-bit binary string is divided, each part includes an 8-bit sub-binary string.
  • When a 64-bit binary string is divided, each part includes a 16-bit sub-binary string.
  • j is a positive integer greater than or equal to 1; for example, j can be 2, 3, 4, 5, 6, 7, 8, etc.
  • Step 104 Adjust the arrangement order of the j sub-binary strings, and generate corresponding j sets with different sub-binary strings as the foremost binary string and store them in a preset sample library.
  • Any 16-bit sub-binary string can be adjusted to be the foremost binary string of the four sub-binary strings. For example, the sub-binary strings L1-16, L17-32, L33-48, and L49-64 can each be adjusted to be the foremost binary string of all the binary strings, giving 4 sets, which can be stored as tables in a sample library preset in memory; that is, 4 tables are stored in the memory.
  • The 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), (L49-64, L1-16, L17-32, L33-48).
  • the above-mentioned embodiment only uses the first sub-binary string for group classification, and the arrangement of the subsequent sub-binary strings is not limited.
  • The set classification can also be performed in other ways. For example, a 64-bit binary string is equally divided into two parts, each including a 32-bit sub-binary string, namely the sub-binary strings L1-32 and L33-64; either 32-bit sub-binary string can be adjusted to be the foremost binary string.
  • When the sub-binary strings L1-32 and L33-64 are each adjusted to be the foremost binary string, there are two sets, which can be stored as tables in the sample library in memory; that is, 2 tables are stored in the memory.
  • The 2 sets are (L1-32, L33-64) and (L33-64, L1-32).
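The set construction of step 104 can be sketched as follows: the fingerprint is split into j equal sub-binary strings, and set i moves block i to the front while the remaining blocks keep their original order. This is a sketch; the table-per-set storage in memory is left out.

```python
def set_permutations(fingerprint: str, j: int = 4):
    """Step 104: split a binary fingerprint into j equal sub-binary strings
    and build the j sets, each with a different block as the foremost one."""
    assert len(fingerprint) % j == 0, "fingerprint must divide evenly into j blocks"
    size = len(fingerprint) // j
    blocks = [fingerprint[b * size:(b + 1) * size] for b in range(j)]
    # set i: block i first, remaining blocks in their original order
    return [tuple([blocks[i]] + blocks[:i] + blocks[i + 1:]) for i in range(j)]

# A 64-bit fingerprint divided into four 16-bit blocks yields 4 sets:
sets_ = set_permutations("01" * 32, j=4)
```

With single-character blocks the pattern is easy to see: `set_permutations("abcd", j=4)` produces (a,b,c,d), (b,a,c,d), (c,a,b,d), (d,a,b,c), mirroring the 4 sets listed above.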
  • Step 105 Match the sample library with the foremost binary string of each of the j sets, and obtain candidate results of each set returned by the sample library.
  • Matching the sample library with the foremost binary string of each of the j sets specifically includes: determining whether the foremost binary string of each of the j sets is exactly the same as a foremost binary string stored in the memory. If they are the same, a match is determined, that is, the candidate result currently fed back by the sample library is a correct candidate result; if they are different, no match is determined, that is, the candidate result currently fed back by the sample library is an incorrect candidate result.
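A minimal sketch of the matching in step 105, assuming the sample library is stored as a mapping from the foremost binary string to the fingerprints filed under it; this dictionary layout is an illustrative assumption, not the patent's exact table structure.

```python
def find_candidates(sample_library, permuted_sets):
    """Step 105: for each permuted fingerprint (a tuple of blocks with the
    foremost binary string first), look up the stored entries that share
    the same foremost block; their union is the candidate result set."""
    candidates = set()
    for perm in permuted_sets:
        foremost = perm[0]                      # the foremost binary string
        candidates.update(sample_library.get(foremost, ()))
    return candidates
```

Only fingerprints agreeing exactly on one leading block are returned, so the expensive full Hamming-distance comparison in step 106 runs on a small candidate set instead of the whole library.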
  • Step 106 Calculate the Hamming distance of any two text data according to the candidate results of each text data, and if the Hamming distance is less than or equal to the threshold, perform deduplication (that is, discard or delete one of the texts).
  • The Hamming distance between binary string A and binary string B is the number of 1s in the binary result of A xor B.
  • The high-dimensional feature vector is mapped into an f-bit fingerprint through the simhash algorithm, and the Hamming distance between the f-bit fingerprints of the two texts is compared to determine whether the two texts are duplicated or highly similar: the smaller the Hamming distance, the more similar; when the Hamming distance equals zero, the two compared texts are identical; the larger the Hamming distance, the less similar.
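The Hamming-distance test of step 106 can be sketched on the binary strings used above; the threshold value and the keep-the-first-seen policy are illustrative assumptions.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of differing bit positions: the count of 1s in A xor B."""
    assert len(a) == len(b), "fingerprints must have the same bit length"
    return bin(int(a, 2) ^ int(b, 2)).count("1")

def deduplicate(texts_with_fp, threshold=3):
    """Step 106: among (text, fingerprint) pairs, keep a text only if its
    fingerprint is farther than the threshold from every kept fingerprint;
    a distance <= threshold means near-duplicate, so the text is discarded."""
    kept = []
    for text, fp in texts_with_fp:
        if all(hamming_distance(fp, kept_fp) > threshold for _, kept_fp in kept):
            kept.append((text, fp))
    return kept
```

A distance of zero means the fingerprints are identical; small nonzero distances indicate near-duplicates, which is why the comparison is against a threshold rather than strict equality.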
  • The biggest difference between the simhash function and an ordinary hash function is that, although a hash function can also be used for mapping to compare text repetition, texts that differ by even a single byte are mapped to two completely different hash results, whereas the simhash function maps similar texts to similar hash results.
  • the method for data deduplication based on big data described in the embodiments of the present application uses a hash algorithm to reduce the dimensionality of the big data, which can reduce the comparison time of two texts and reduce the text storage overhead.
  • the method for data deduplication based on big data is generally executed by the server/terminal device. Accordingly, the method and device for data deduplication based on big data are generally set in the server/terminal device.
  • the terminal device may be a wireless terminal or a wired terminal.
  • the wireless terminal may be a device that provides voice and/or data connectivity to users, a handheld device with a wireless connection function, or other processing devices connected to a wireless modem.
  • the terminal can be a portable, pocket-sized, handheld, computer built-in or vehicle-mounted mobile device.
  • terminal devices are merely illustrative. According to implementation needs, there can be any number of terminal devices, networks and servers.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM).
  • This application provides an embodiment of a data deduplication device based on big data, and the device embodiment corresponds to the method embodiment shown in FIG. 1.
  • the device can be specifically applied to various electronic equipment.
  • the device 300 for data deduplication based on big data in this embodiment includes: a collection module 301, a processing module 302, a splitting module 303, an adjustment module 304, a matching module 305, a calculation module 306, and a bus 307.
  • the collection module 301, the processing module 302, the splitting module 303, the adjustment module 304, the matching module 305, and the calculation module 306 are connected to each other through the bus 307.
  • the module division in this embodiment is only illustrative, and respective logical divisions can also be made according to respective method actions.
  • the bus 307 is used to implement connection and communication between these components.
  • the bus 307 may be an Industry Standard Architecture (ISA) bus, Peripheral Component Interconnect (PCI) bus, or Extended Industry Standard Architecture (EISA) bus, etc.
  • the bus system can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the collection module 301 is configured to collect at least two text data according to preset keywords
  • the splitting module 303 is configured to equally divide the k-bit binary string into j sub-binary strings, where j is a positive integer greater than or equal to 1;
  • the adjustment module 304 is configured to adjust the arrangement order of the j sub-binary strings, and generate corresponding j sets with different sub-binary strings as the foremost binary string and store them in a preset sample library;
  • the matching module 305 is configured to match the sample library with the foremost binary string of each of the j sets, and obtain the candidate results of each set returned by the sample library.
  • The matching module 305 is configured to match the sample library with the foremost binary string of each of the j sets; if the sample library holds 2^m hash fingerprints in total, it returns 2^(m-j) candidate results for each set, where m is an integer greater than 2 and m > j;
  • the calculation module 306 is configured to calculate the Hamming distance of any two text data according to the candidate results of each text data, and if the Hamming distance is less than or equal to the threshold, perform deduplication.
  • The processing module 302 is used to convert each text data of the big data (for example, Doc text or web text) into a hash code (hashcode). For example, the processing module 302 further includes: a selection subunit, an initialization subunit, an extraction subunit, a hash function processing subunit, an accumulation subunit, and a processing subunit, wherein any two of the selection subunit, the initialization subunit, the extraction subunit, the hash function processing subunit, the accumulation subunit, and the processing subunit can communicate with each other.
  • the initialization subunit is used to initialize the bits of similar hash functions to 0;
  • the extraction subunit is used for word segmentation extraction of each text data to extract multiple word segmentation_weight pairs.
  • the extraction subunit performs hash function processing on the word segmentation in each word segmentation_weight pair.
  • The extraction subunit is used to calculate, with a k-bit hash function, the hash code of every predetermined number of word-segmentation letters of each text data.
  • the predetermined number is 2 or 3.
  • For each text data, n word segmentation_weight pairs are extracted, where n is a positive integer greater than or equal to 2.
  • A word segmentation method with a predetermined number of letters is generally used; for example, the predetermined number is 2 or 3.
  • the hash function processing subunit is used to perform hash function processing on the word segmentation in each word segmentation_weight pair.
  • The hash function processing subunit is used to calculate, with a 32-bit hash function, the hash code of every predetermined number of word-segmentation letters of the text data, for example, the hash code of every 2 or 3 letters of the text data.
  • The accumulation subunit is used to vertically accumulate the word segmentation_weight pairs processed by the hash function to generate k values. For example, the accumulation subunit performs bitwise accumulation on the word segmentation_weight pairs after hash function processing.
  • The accumulation is performed vertically bit by bit: if a bit is 1, the weight is added; if it is 0, the weight is subtracted; finally, k values are generated.
  • When the accumulation subunit adopts a 32-bit hash function, the number of bits generated by the hash is 32. For each bit of the hashcode of each word, if the bit is 1, the value of the corresponding bit of simhash is increased by 1; otherwise, it is decreased by 1, yielding 32 values (that is, simhash includes 32 values).
  • the processing subunit is used to convert the generated k values into k-bit binary strings.
  • the k is 32, 64, or 128.
  • For the finally obtained 32-bit simhash, the processing subunit sets each bit whose accumulated value is greater than 0 to 1; otherwise, it is set to 0.
  • the processing subunit may also generate a 64- or 128-bit binary string, which is not limited in this embodiment.
  • The Hamming distance of two texts is the number of differing bits in their two binary strings.
  • The matching module 305 is configured to match the sample library with the foremost binary string of each of the j sets, specifically: the matching module 305 judges whether the foremost binary string of each of the j sets is the same as the foremost binary string of each set stored in the memory. If they are the same, a match is determined, that is, the candidate result currently fed back by the sample library is a correct candidate result; if they are different, no match is determined, that is, the candidate result currently fed back by the sample library is an incorrect candidate result.
  • The splitting module 303 is also used to equally divide a 32-bit or 64-bit binary string into four parts. For example, when a 32-bit binary string is equally divided into four parts, each part includes an 8-bit sub-binary string; when a 64-bit binary string is equally divided into four parts, each part includes a 16-bit sub-binary string.
  • The splitting module 303 is also used to equally divide a 64-bit binary string into four 16-bit sub-binary strings: L1-16, L17-32, L33-48, and L49-64, each of which is a 16-bit binary string.
  • j is a positive integer greater than or equal to 1; for example, j can be an even number such as 2, 4, 6, or 8.
  • Any 16-bit sub-binary string can be adjusted to be the foremost of the four sub-binary strings.
  • The sub-binary strings L1-16, L17-32, L33-48, and L49-64 can each be adjusted to be the foremost binary string of all the binary strings.
  • The 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), (L49-64, L1-16, L17-32, L33-48).
  • the calculation module 306 is also used to calculate the Hamming distance between two text data (for example, the first text data and the second text data).
  • The Hamming distance between binary string A and binary string B is the number of 1s in the binary result of A xor B.
  • The calculation module 306 is also used to map the high-dimensional feature vector into an f-bit fingerprint through the simhash algorithm, and to compare the Hamming distance of the f-bit fingerprints of two texts to determine whether the two texts are duplicated or highly similar; that is, the smaller the Hamming distance, the more similar. When the Hamming distance equals zero, the two compared texts are identical; the larger the Hamming distance, the less similar.
  • the aforementioned modules may all be implemented by one or more processors, chips or integrated circuits, which is not limited in this embodiment.
  • the device for data deduplication based on big data described in the embodiment of the present application uses a hash algorithm to reduce the dimensionality of the big data, which can reduce the comparison time of two texts and reduce the text storage overhead.
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 4 includes a memory 41, one or more processors 42, and a network interface 43 that are connected to each other in communication through a system bus. It should be pointed out that the figure only shows the computer device 4 with components 41-43, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes but is not limited to microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • The memory 41 includes at least one type of non-volatile readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4.
  • the memory 41 may also be an external storage device of the computer device 4, for example, a plug-in hard disk equipped on the computer device 4, a smart media card (SMC), a secure digital (Secure Digital, SD) card, Flash Card, etc.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store an operating system and various application software installed in the computer device 4, such as computer-readable instructions of the above-mentioned data processing method.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 42 is generally used to control the overall operation of the computer device 4.
  • The processor 42 is configured to run the computer-readable instructions stored in the memory 41, for example, the computer-readable instructions of the big-data-based data deduplication method.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is usually used to establish a communication connection between the computer device 4 and other electronic devices.
  • Adjust the arrangement order of the j sub-binary strings, using different sub-binary strings as the foremost binary string to generate corresponding j sets and store them in the preset sample library; match the sample library with the foremost binary string of each of the j sets to obtain the candidate results of each set returned by the sample library; calculate the Hamming distance of any two text data according to the candidate results of each text data, and if the Hamming distance is less than or equal to the threshold, perform deduplication.
  • The processor 42 is further configured to: when the 64-bit binary string is equally divided into four parts, adjust any 16-bit sub-binary string to be the foremost binary string of the four sub-binary strings; for example, the sub-binary strings L1-16, L17-32, L33-48, and L49-64 can each be adjusted to be the foremost binary string of all the binary strings.
  • The 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), (L49-64, L1-16, L17-32, L33-48).
  • the processor 42 is also used to calculate the Hamming distance between two text data (for example, the first text data and the second text data).
  • The Hamming distance between binary string A and binary string B is the number of 1s in the binary result of A xor B.
  • The processor is also used to map the high-dimensional feature vector into an f-bit fingerprint through the simhash algorithm, where f is an integer greater than or equal to 2, and to compare the Hamming distance of the f-bit fingerprints of two texts to determine whether the two texts are duplicated or highly similar; that is, the smaller the Hamming distance, the more similar. When the Hamming distance equals zero, the two compared texts are identical; the larger the value, the less similar.
  • This application also provides another implementation manner, that is, a non-volatile readable storage medium is provided, and the non-volatile readable storage medium stores readable instructions for data processing.
  • the instructions may be executed by at least one processor, so that the at least one processor executes the steps of the data processing method described above.
  • The following content is executed: at least two text data are collected according to preset keywords; for each text data, a k-bit binary string is generated according to a similar hash function (simhash) and a hash function;
  • Adjust the arrangement order of the j sub-binary strings, using different sub-binary strings as the foremost binary string to generate corresponding j sets and store them in a preset sample library; match the sample library with the foremost binary string of each of the j sets to obtain the candidate results of each set returned by the sample library; calculate the Hamming distance of any two text data according to the candidate results of each text data, and if the Hamming distance is less than or equal to the threshold, perform deduplication.
  • when a 64-bit binary string is divided into four equal parts, any one of the 16-bit sub-binary strings can be adjusted to be the frontmost binary string of the four sub-binary strings.
  • for example, the sub-binary strings L1-16, L17-32, L33-48, and L49-64 can each be adjusted to be the frontmost binary string of the whole binary string, respectively.
  • the 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), (L49-64, L1-16, L17-32, L33-48).
  • the method of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; it can, of course, also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A big-data-based data deduplication method, apparatus, device, and storage medium, belonging to the Internet field. The method comprises: collecting at least two pieces of text data according to preset keywords (101); for each piece of text data, generating a k-bit binary string according to a similarity hash function and a hash function (102); dividing the k-bit binary string into j equal sub-binary strings (103); adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string, and storing them in a preset sample library (104); matching the sample library with the frontmost binary string of each of the j sets to obtain the candidate results of each set returned by the sample library (105); and computing the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, and performing deduplication if the Hamming distance is less than or equal to a threshold (106). Hashing is used to reduce the dimensionality of the data, shortening the comparison time between two texts and lowering the text storage overhead.

Description

Big-data-based data deduplication method, apparatus, device, and storage medium
【Cross-Reference】
This application is based on, and claims the priority of, the Chinese invention patent application No. 201910401427.5, filed on May 15, 2019 and entitled "Big-data-based data deduplication method, apparatus, and storage medium".
【Technical Field】
This application relates to the field of Internet technology, and in particular to a big-data-based data deduplication method, apparatus, device, and storage medium.
【Background Art】
Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a given time frame; they are massive, fast-growing, and diverse information assets that require new processing models to deliver stronger decision-making, insight, and process-optimization capabilities.
Big data has the following main characteristics:
Volume: the size of the data determines the value and potential information of the data under consideration;
Variety: the diversity of data types;
Velocity: the speed at which the data is obtained;
Variability: hinders the process of handling and effectively managing the data;
Veracity: the quality of the data;
Complexity: the volume of data is enormous and comes from many sources;
Value: using big data reasonably creates high value at low cost.
The smallest basic unit of big data is the bit; in order, the units are: bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, DB, where 1 Byte = 8 bit and each subsequent unit is 1024 (2 to the tenth power) times the previous one.
With the arrival of the information-explosion era and the application of cloud technology, big data has attracted growing attention. Big-data processing technologies mainly include massively parallel processing (MPP) databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
Duplicate data arises when the same file is backed up repeatedly from the same directory on a network, or backed up from multiple addresses. Duplicate data greatly increases the I/O and CPU load of an analysis system; without deduplication, analysis efficiency drops and the hardware overhead of the analysis system grows, and for projects billed by the total analyzed traffic, the extra analysis cost is unacceptable. The problem is especially severe in big-data processing, because the Internet is currently flooded with large amounts of near-duplicate information; for big-data mining, duplicate data can lead to misjudgments in some respect, i.e., invalid big data.
It is therefore necessary to deduplicate the duplicate data to avoid the above problems.
One existing data deduplication technique compares data according to its payload, the full data, or custom rules to determine whether duplicates exist, and then filters out the redundant data.
Another existing data deduplication technique compares the similarity of two texts, usually by segmenting the texts into words and converting them into a distance measure between feature vectors, such as the common Euclidean distance, Hamming distance, or cosine angle.
The deduplication techniques described above work well in scenarios with small data volumes, but when the Internet contains large amounts of duplicate data they are difficult to apply to massive-data-processing scenarios; otherwise they greatly increase the I/O and CPU load of the analysis system and waste resources.
【Summary of the Invention】
The purpose of the embodiments of this application is to propose a big-data-based data deduplication method, apparatus, computer device, and storage medium that use hashing to reduce the dimensionality of the data, which can shorten the comparison time between two texts and lower the text storage overhead.
To solve the above technical problem, an embodiment of this application provides a big-data-based data deduplication method that adopts the following technical solution:
collecting at least two pieces of text data according to preset keywords;
for each piece of text data, generating a k-bit binary string according to a similarity hash function and a hash function, where k = 2^n and n is a positive integer greater than or equal to 2;
dividing the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1;
adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string, and storing them in a preset sample library;
matching the sample library with the frontmost binary string of each of the j sets to obtain the candidate results of each set returned by the sample library;
computing the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, and performing deduplication if the Hamming distance is less than or equal to a threshold.
To solve the above technical problem, an embodiment of this application further provides a big-data-based data deduplication apparatus that adopts the following technical solution; the apparatus includes:
a collection module configured to collect at least two pieces of text data according to preset keywords;
a processing module configured to generate, for each piece of text data, a k-bit binary string according to a similarity hash function and a hash function, where k = 2^n and n is a positive integer greater than or equal to 2;
a splitting module configured to divide the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1;
an adjustment module configured to adjust the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string and storing them in a preset sample library;
a matching module configured to match the sample library with the frontmost binary string of each of the j sets and obtain the candidate results of each set returned by the sample library;
a calculation module configured to compute the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data and perform deduplication if the Hamming distance is less than or equal to a threshold.
To solve the above technical problem, an embodiment of this application further provides a computer device that includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the big-data-based data deduplication method when executing the computer-readable instructions.
To solve the above technical problem, an embodiment of this application further provides one or more non-volatile readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to execute the steps of the big-data-based data deduplication method.
In the above big-data-based data deduplication method, apparatus, device, and storage medium, a k-bit binary string is generated for each piece of text data according to a similarity hash function and a hash function, divided into j sub-binary strings and rearranged; j corresponding sets, each led by a different sub-binary string, are generated and stored in a preset sample library; the sample library is matched with the frontmost binary string of each set to obtain candidate results; and whether to deduplicate is then decided by computing the Hamming distance between any two pieces of text data. Using hashing to reduce the dimensionality of big data can therefore shorten the comparison time between two texts and lower the text storage overhead.
【Brief Description of the Drawings】
To explain the solutions in this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of a big-data-based data deduplication method according to this application;
Fig. 2 is a flowchart of a specific implementation of step 102 in Fig. 1;
Fig. 3 is a schematic structural diagram of an embodiment of a big-data-based data deduplication apparatus according to this application;
Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to this application.
Reference numerals: 301 - collection module, 302 - processing module, 303 - splitting module, 304 - adjustment module, 305 - matching module, 306 - calculation module, 307 - bus, 41 - memory, 42 - processor, 43 - network interface
【Detailed Description】
Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit this application. The terms "include" and "have" and any variations thereof in the specification, claims, and above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", etc. in the specification, claims, or above drawings are used to distinguish different objects, not to describe a particular order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
To enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings.
As shown in Fig. 1, which is a schematic flowchart of a big-data-based data deduplication method according to an embodiment of this application, the method may proceed as follows.
Step 101: collect at least two pieces of text data according to preset keywords.
For example, web-crawler technology is used to crawl, according to preset keywords, at least two pieces of text data related to the keywords, and the at least two pieces of text data are saved in a data warehouse in a cache or memory.
A web crawler (also called a web spider, web robot, or web chaser) is a set of computer-readable instructions or a script that automatically crawls information on the World Wide Web according to certain rules. It downloads web pages from the World Wide Web for search engines and is an important component of them. A traditional crawler starts from the URLs of one or more initial pages, obtains the URLs on the initial pages, and, while crawling pages, continually extracts new URLs from the current page into a queue until a certain stop condition of the system is met.
In this embodiment, multiple pieces of text data can be collected by a focused web crawler, which, according to a predetermined crawling target (for example, customers' investment-category information), selectively accesses log records, app feedback, WeChat, or web pages on the World Wide Web and related links to obtain the required information. For example, when searching with a focused web crawler, keywords related to investment data are set; the keywords may be, for example: name, ID card number, address, telephone number, bank account number, e-mail address, city, postal code, passwords (such as account-inquiry password, withdrawal password, login password), organization name, business-license number, bank account, transaction date, transaction amount, and so on. The web crawler then crawls text data related to the keywords from log records, app feedback, WeChat, or web pages on the World Wide Web; for example, "related" means text data containing the keywords. The collected text data are saved in the data warehouse of the cache or memory according to various dimensions, and the data in this warehouse constitute the big data.
Step 102: for each piece of text data, generate a k-bit binary string according to a similarity hash function (simhash) and a hash function (hash), where k = 2^n and n is a positive integer greater than or equal to 2.
For example, for each piece of text data in the big data (e.g., a Doc text or a web text), the simhash function is used to convert the text data into a hash code (hashcode), as described below.
The following three texts are taken as an example: p1 = the cat sat on the mat; p2 = the cat sat on a mat; p3 = we all scream for ice cream. The whole process may be as follows; Fig. 2 shows a flowchart of a specific implementation of step 102 in Fig. 1.
Step 1021: select the number of bits k of the simhash function.
For example, k is selected according to the storage cost and the size of the data set, where k = 2^n and n is a positive integer greater than or equal to 2, e.g., k = 16, 32, 64, or 128 bits.
Step 1022: initialize each bit of the simhash function to 0.
Step 1023: segment each piece of text data and extract multiple word-weight pairs.
For example, each piece of text data is segmented (which includes word segmentation and weight calculation); e.g., n word-weight pairs (feature_weight_pairs) are extracted, denoted feature_weight_pairs = [fw1, fw2 ... fwn], where fwn = (feature_n, weight_n) and n is a positive integer greater than or equal to 2.
For example, segmentation into groups of a predetermined number of letters is generally used, e.g., the predetermined number is 2 or 3; for "the cat sat on the mat", pairwise segmentation gives the following result: {"th","he","e","c","ca","at","t","s","sa","o","on","n","t","m","ma"}, where a space also counts as a letter.
Step 1024: apply the hash function to the word (feature) of each word-weight pair (feature_weight_pairs).
For example, a 32-bit hash function is used to compute the hash code (hashcode) of each predetermined-number group of letters (word) of the text data, i.e., the hash code of every 2 or 3 letters, for example: "th".hash = -502157718, "he".hash = -369049682, ....
Step 1025: accumulate the hashed word-weight pairs bit by bit to generate k values.
For example, the hashed word-weight pairs are accumulated column-wise by bit: if the bit is 1, add 1; if it is 0, subtract 1; finally k (i.e., bits_count) values are generated.
For example, with a 32-bit hash function, the number of generated bits is bits_count = 32; for each bit of each word's hashcode, if the bit is 1, the value of the corresponding simhash position is incremented by 1, otherwise it is decremented by 1, yielding 32 values (i.e., the simhash comprises 32 values).
Step 1026: convert the k generated values into a k-bit binary string.
For example, for the finally obtained 32-value simhash, if a value is greater than 0, the corresponding bit is set to 1; otherwise it is set to 0.
In another embodiment of this application, a 64-bit or 128-bit binary string may also be generated; this embodiment is not limiting.
Using simhash should produce results similar to the following:
irb(main):003:0>p1.simhash=>851459198 00110010110000000011110001111110
irb(main):004:0>p2.simhash=>847263864 00110010100000000011100001111000
irb(main):002:0>p3.simhash=>984968088 00111010101101010110101110011000
After the simhash computation, the Hamming distance between these texts is the number of differing bits between two binary strings.
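Steps 1021-1026 can be sketched in Python. This is a minimal illustration, not the patent's reference implementation: the patent does not name a specific 32-bit hash function, so a truncated MD5 digest stands in for it, every shingle is given a weight of 1, and two-letter shingles are used as in the example above.

```python
import hashlib

def shingles(text, n=2):
    # Step 1023: split the text into overlapping n-letter groups
    # (a space also counts as a letter).
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def hash32(word):
    # Step 1024: deterministic 32-bit hash of a shingle.
    # Truncated MD5 is an assumption; the patent's 32-bit hash is unspecified.
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:4], "big")

def simhash(text, k=32):
    counts = [0] * k  # step 1022: every position starts at 0
    for word in shingles(text):
        h = hash32(word)
        for bit in range(k):
            # Step 1025: add 1 when the hash bit is 1, subtract 1 when it is 0
            # (all weights assumed to be 1 here).
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # Step 1026: positive sums become 1-bits, the rest 0-bits.
    fp = 0
    for bit in range(k):
        if counts[bit] > 0:
            fp |= 1 << bit
    return fp

p1 = simhash("the cat sat on the mat")
p2 = simhash("the cat sat on a mat")
p3 = simhash("we all scream for ice cream")
# Similar texts tend to get close fingerprints;
# bin(x ^ y).count("1") is the Hamming distance between fingerprints.
d12 = bin(p1 ^ p2).count("1")
d13 = bin(p1 ^ p3).count("1")
```

With a different underlying hash the concrete fingerprint values will differ from the irb output shown above, but the property that similar texts yield nearby fingerprints is preserved.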
Step 103: divide the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1.
For example, a 32-bit or 64-bit binary string is divided into four equal parts; e.g., when a 32-bit binary string is divided into four equal parts, each part comprises an 8-bit sub-binary string, and when a 64-bit binary string is divided into four equal parts, each part comprises a 16-bit sub-binary string. For example, a 64-bit binary string is divided into four equal 16-bit sub-binary strings L1-16, L17-32, L33-48, and L49-64, each comprising 16 bits.
The above embodiment only takes dividing a 32-bit or 64-bit binary string into four equal parts as an example; the embodiments of this application do not limit the number of parts. For example, j is a positive integer greater than or equal to 1; e.g., j may be 2, 3, 4, 5, 6, 7, 8, etc.
Step 104: adjust the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string and storing them in a preset sample library.
When a 64-bit binary string is divided into four equal parts, any one of the 16-bit sub-binary strings can be adjusted to be the frontmost binary string of the four sub-binary strings. For example, the sub-binary strings L1-16, L17-32, L33-48, and L49-64 can each be adjusted to be the frontmost binary string of the whole string, giving 4 sets that can be stored as tables in a preset sample library, e.g., in a preset memory, i.e., 4 tables are stored in the memory. For example, the 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), and (L49-64, L1-16, L17-32, L33-48).
The above embodiment classifies the sets only by the frontmost sub-binary string; how the remaining sub-binary strings are arranged is not limited. For example, in another embodiment of this application, the sets may be formed in other ways; e.g., a 64-bit binary string may be divided into two equal 32-bit sub-binary strings L1-32 and L33-64. Either 32-bit sub-binary string can be adjusted to be the frontmost binary string; e.g., when L1-32 and L33-64 are each adjusted to be the frontmost binary string, there are 2 sets, which can be stored as tables in the sample library of the memory, i.e., 2 tables are stored in the memory; e.g., the 2 sets are (L1-32, L33-64) and (L33-64, L1-32).
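Steps 103-105 can be sketched as follows. This is a hypothetical illustration in which the preset sample library is modelled as plain Python dictionaries, one table per arrangement, keyed on the frontmost 16-bit sub-binary string exactly as described above; a production system would use a persistent store instead.

```python
def split_blocks(fp, k=64, j=4):
    # Step 103: divide a k-bit fingerprint into j equal blocks,
    # most significant first (L1-16 ... L49-64 in the text).
    w = k // j
    mask = (1 << w) - 1
    return [(fp >> (k - w * (i + 1))) & mask for i in range(j)]

def arrangements(fp, k=64, j=4):
    # Step 104: for each block, form the arrangement that moves that block
    # to the front while keeping the others in order.
    blocks = split_blocks(fp, k, j)
    return [[blocks[i]] + blocks[:i] + blocks[i + 1:] for i in range(j)]

def build_tables(fingerprints, k=64, j=4):
    # One table per arrangement; the frontmost block is the exact-match key.
    tables = [{} for _ in range(j)]
    for fp in fingerprints:
        for i, arr in enumerate(arrangements(fp, k, j)):
            tables[i].setdefault(arr[0], []).append(fp)
    return tables

def candidates(tables, fp, k=64, j=4):
    # Step 105: look up each table by the query's corresponding frontmost
    # block and merge the returned candidate results.
    result = set()
    for i, arr in enumerate(arrangements(fp, k, j)):
        result.update(tables[i].get(arr[0], []))
    return result
```

Because each table is keyed on 16 bits, a library of 2^34 fingerprints returns about 2^(34-16) = 262144 candidates per table, as computed in the text, rather than all 2^34.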
Step 105: match the sample library with the frontmost binary string of each of the j sets, and obtain the candidate results of each set returned by the sample library.
For example, the frontmost binary string of each of the j sets is matched against the sample library; if the sample library holds 2^m hash fingerprints in total, 2^(m-j) candidate results are returned for each set, where m is an integer greater than 2 and m > j.
For example, when the above 64-bit binary string generates four tables, matching is used to look up the first 16-bit sub-binary string; if the sample library stores 2^34 (roughly one billion) hash fingerprints, each table returns 2^(34-16) = 262144 candidate results, which, compared with the prior art's returning of 2^34 hash fingerprints, greatly reduces the cost of computing Hamming distances.
In another embodiment of this application, matching the sample library with the frontmost binary string of each of the j sets specifically includes: determining whether the frontmost binary string of each of the j sub-binary strings is exactly the same as a frontmost binary string stored in the memory; if they are the same, a match is determined, i.e., the candidate result currently returned by the sample library is determined to be a correct candidate result; if they are different, no match is determined, i.e., the candidate result currently returned by the sample library is determined to be an incorrect candidate result.
Step 106: compute the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data; if the Hamming distance is less than or equal to a threshold, perform deduplication (i.e., discard or delete one of the texts).
For example, the Hamming distance between binary string A and binary string B is the number of 1s in the binary result of A xor B.
For example, with binary string A = 100111 and binary string B = 101010, hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3.
The simhash algorithm maps a high-dimensional feature vector to an f-bit fingerprint; whether two texts are duplicates or highly similar is determined by comparing the Hamming Distance of the f-bit fingerprints of the two texts: the smaller the Hamming distance, the more similar; when the Hamming distance equals zero, the two compared texts are identical; the larger the value, the less similar.
For example, for the simhash results of the three texts p1, p2, and p3 above, the pairwise Hamming distances are (p1, p2) = 4, (p1, p3) = 16, and (p2, p3) = 12; in terms of pairwise similarity, p1 and p2 are far more similar to each other than either is to p3.
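The Hamming-distance check of step 106 is a one-liner on integer fingerprints; the worked example A = 100111, B = 101010 can be reproduced directly. The threshold value of 3 below is only an assumed example, since the patent leaves the threshold unspecified.

```python
def hamming_distance(a, b):
    # The number of 1s in the binary representation of a xor b.
    return bin(a ^ b).count("1")

def is_duplicate(fp_a, fp_b, threshold=3):
    # Step 106: fingerprints within the threshold are treated as duplicates
    # (threshold=3 is an assumed example value, not from the patent).
    return hamming_distance(fp_a, fp_b) <= threshold

print(hamming_distance(0b100111, 0b101010))  # → 3
```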
In summary, in the big-data-based data deduplication method described in the above embodiments, the biggest difference between the simhash computation and an ordinary hash computation is that, although an ordinary hash function can also be used, via its mapping, to compare texts for duplication, it maps texts that may differ by only a single byte to two completely different hash results, whereas simhash maps similar texts to similar hash results. For example, the simhash function is set to 64 bits, i.e., f = 64, and the weighted feature set of a text is mapped onto a 64-bit hash fingerprint.
For example, with a 64-bit simhash function, the 64-bit binary string is divided into 4 equal sub-binary strings, and the 64-bit string is then rearranged so that any one sub-binary string forms the first 16 bits; there are four such combinations in total, so four tables are generated and stored in the sample library, and exact matching is used to look up the first 16 bits. If the sample library stores 2^34 (roughly one billion) hash fingerprints, each table returns 2^(34-16) = 262144 candidate results, greatly reducing the cost of computing Hamming distances.
Therefore, the big-data-based data deduplication method described in the embodiments of this application uses hashing to reduce the dimensionality of big data, which can shorten the comparison time between two texts and lower the text storage overhead.
It should be noted that the big-data-based data deduplication method provided by the embodiments of this application is generally executed by a server/terminal device, and accordingly the apparatus for the method is generally disposed in the server/terminal device. The terminal device may be a wireless terminal or a wired terminal; a wireless terminal may be a device providing voice and/or data connectivity to users, a handheld device with wireless connectivity, or another processing device connected to a wireless modem. The terminal may be a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile apparatus.
It should be understood that the numbers of terminal devices, networks, and servers are merely illustrative; there may be any number of terminal devices, networks, and servers as required by the implementation.
Those of ordinary skill in the art can understand that all or part of the processes of the above method embodiments may be accomplished by computer-readable instructions instructing the relevant hardware; the computer-readable instructions may be stored in a non-volatile readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM).
It should be understood that although the steps in the flowcharts of the drawings are displayed sequentially as indicated by the arrows, they are not necessarily executed in the order indicated. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to Fig. 3, as an implementation of the method shown in Fig. 1, this application provides an embodiment of a big-data-based data deduplication apparatus; this apparatus embodiment corresponds to the method embodiment shown in Fig. 1, and the apparatus can be applied to various electronic devices.
As shown in Fig. 3, the big-data-based data deduplication apparatus 300 of this embodiment includes: a collection module 301, a processing module 302, a splitting module 303, an adjustment module 304, a matching module 305, a calculation module 306, and a bus 307. The collection module 301, the processing module 302, the splitting module 303, the adjustment module 304, the matching module 305, and the calculation module 306 are connected to one another through the bus 307. The module division of this embodiment is merely illustrative; other logical divisions may be made according to the respective method actions.
The bus 307 is used to implement connection and communication among these components. For example, the bus 307 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus. The bus system may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The collection module 301 is configured to collect at least two pieces of text data according to preset keywords;
the processing module 302 is configured to generate, for each piece of text data, a k-bit binary string according to a similarity hash function (simhash) and a hash function (hash), where k = 2^n and n is a positive integer greater than or equal to 2;
the splitting module 303 is configured to divide the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1;
the adjustment module 304 is configured to adjust the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string and storing them in a preset sample library;
the matching module 305 is configured to match the sample library with the frontmost binary string of each of the j sets and obtain the candidate results of each set returned by the sample library; for example, if the sample library holds 2^m hash fingerprints in total, 2^(m-j) candidate results are returned for each set, where m is an integer greater than 2 and m > j;
the calculation module 306 is configured to compute the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data and perform deduplication if the Hamming distance is less than or equal to a threshold.
The following three texts are taken as an example: p1 = the cat sat on the mat; p2 = the cat sat on a mat; p3 = we all scream for ice cream.
In another embodiment of this application, for example, the processing module 302 is configured to convert each piece of text data in the big data (e.g., a Doc text or a web text) into a hash code (hashcode) using the simhash function; for example, the processing module 302 further includes: a selection subunit, an initialization subunit, an extraction subunit, a hash-function processing subunit, an accumulation subunit, and a processing subunit, any two of which may be communicatively connected.
The selection subunit is configured to select the number of bits k of the similarity hash function; for example, the selection subunit selects k according to the storage cost and the size of the data set, where k = 2^n and n is a positive integer greater than or equal to 2, e.g., k = 16, 32, 64, or 128 bits.
The initialization subunit is configured to initialize each bit of the similarity hash function to 0.
The extraction subunit is configured to segment each piece of text data and extract multiple word-weight pairs; for example, a k-bit hash function is then used to compute the hash code of a predetermined number of segmented letters of each piece of text data, e.g., the predetermined number is 2 or 3.
The extraction subunit is further configured to segment each piece of text data (which includes word segmentation and weight calculation); e.g., n word-weight pairs (feature_weight_pairs) are extracted, denoted feature_weight_pairs = [fw1, fw2 ... fwn], where fwn = (feature_n, weight_n) and n is a positive integer greater than or equal to 2. For example, segmentation into groups of a predetermined number of letters is generally used, e.g., the predetermined number is 2 or 3; for "the cat sat on the mat", pairwise segmentation gives: {"th","he","e","c","ca","at","t","s","sa","o","on","n","t","m","ma"}, where a space also counts as a letter.
The hash-function processing subunit is configured to apply the hash function to the word of each word-weight pair. For example, the hash-function processing subunit uses a 32-bit hash function to compute the hash code (hashcode) of each predetermined-number group of letters (word) of the text data, i.e., the hash code of every 2 or 3 letters, for example: "th".hash = -502157718, "he".hash = -369049682, ....
The accumulation subunit is configured to accumulate the hashed word-weight pairs bit by bit to generate k values; for example, the accumulation subunit accumulates the hashed word-weight pairs column-wise by bit: if the bit is 1, the weight is added, and if it is 0, the weight is subtracted, finally generating k values. For example, with a 32-bit hash function, the number of generated bits is bits_count = 32; for each bit of each word's hashcode, if the bit is 1, the value of the corresponding simhash position is incremented by 1, otherwise it is decremented by 1, yielding 32 values (i.e., the simhash comprises 32 values).
The processing subunit is configured to convert the k generated values into a k-bit binary string, where k is, for example, 32, 64, or 128; e.g., for the finally obtained 32-value simhash, if a value is greater than 0, the corresponding bit is set to 1, otherwise it is set to 0.
In another embodiment of this application, the processing subunit may also generate a 64-bit or 128-bit binary string; this embodiment is not limiting.
Using simhash should produce results similar to the following:
irb(main):003:0>p1.simhash=>851459198 00110010110000000011110001111110
irb(main):004:0>p2.simhash=>847263864 00110010100000000011100001111000
irb(main):002:0>p3.simhash=>984968088 00111010101101010110101110011000
After the simhash computation, the Hamming distance between these texts is the number of differing bits between two binary strings.
In another embodiment of this application, the matching module 305 being configured to match the sample library with the frontmost binary string of each of the j sets specifically includes: the matching module 305 is configured to compare the frontmost binary string of each of the j sets with the frontmost binary string of each set stored in the memory for identity; if they are the same, a match is determined, i.e., the candidate result currently returned by the sample library is determined to be a correct candidate result; if they are different, no match is determined, i.e., the candidate result currently returned by the sample library is determined to be an incorrect candidate result.
In another embodiment of this application, the splitting module 303 is further configured to divide a 32-bit or 64-bit binary string into four equal parts; e.g., when a 32-bit binary string is divided into four equal parts, each part comprises an 8-bit sub-binary string, and when a 64-bit binary string is divided into four equal parts, each part comprises a 16-bit sub-binary string. For example, the splitting module 303 is further configured to divide a 64-bit binary string into four equal 16-bit sub-binary strings L1-16, L17-32, L33-48, and L49-64, each comprising 16 bits.
The above embodiment only takes dividing a 32-bit or 64-bit binary string into four equal parts as an example; the embodiments of this application do not limit the number of parts; e.g., j is a positive integer greater than or equal to 1, e.g., j may be an even number such as 2, 4, 6, or 8.
In another embodiment of this application, the adjustment module 304 is further configured so that, when a 64-bit binary string is divided into four equal parts, any one of the 16-bit sub-binary strings can be adjusted to be the frontmost binary string of the four sub-binary strings; e.g., the sub-binary strings L1-16, L17-32, L33-48, and L49-64 can each be adjusted to be the frontmost binary string of the whole string, giving 4 sets that can be stored as tables in the memory, i.e., 4 tables are stored in the memory; e.g., the 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), and (L49-64, L1-16, L17-32, L33-48).
For example, the calculation module 306 is further configured to calculate the Hamming distance between two pieces of text data (e.g., first text data and second text data); the Hamming distance between binary string A and binary string B is the number of 1s in the binary result of A xor B.
For example, with binary string A = 100111 and binary string B = 101010, hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3.
The calculation module 306 is further configured to map, through the simhash algorithm, a high-dimensional feature vector to an f-bit fingerprint and to determine, by comparing the Hamming Distance of the f-bit fingerprints of two texts, whether the two texts are duplicates or highly similar: the smaller the Hamming distance, the more similar; when the Hamming distance equals zero, the two compared texts are identical; the larger the value, the less similar.
For example, for the simhash results of the three texts p1, p2, and p3 above, the pairwise Hamming distances are (p1, p2) = 4, (p1, p3) = 16, and (p2, p3) = 12; in terms of pairwise similarity, p1 and p2 are far more similar to each other than either is to p3.
In this embodiment, each of the above modules may be implemented by one or more processors, chips, or integrated circuits; this embodiment is not limiting.
Therefore, the big-data-based data deduplication apparatus described in the embodiments of this application uses hashing to reduce the dimensionality of big data, which can shorten the comparison time between two texts and lower the text storage overhead.
To solve the above technical problem, an embodiment of this application further provides a computer device. Referring to Fig. 4, Fig. 4 is a basic structural block diagram of the computer device of this embodiment.
The computer device 4 includes a memory 41, one or more processors 42, and a network interface 43, which are communicatively connected to one another through a system bus. It should be noted that the figure only shows the computer device 4 with components 41-43, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device may interact with the user through a keyboard, mouse, remote control, touchpad, or voice-control device.
The memory 41 includes at least one type of non-volatile readable storage medium, which includes flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, e.g., the hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, e.g., a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the computer device 4. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as the computer-readable instructions of the above data processing method, and may also be used to temporarily store various data that have been output or are to be output.
The processor 42 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is used to run the computer-readable instructions or data stored in the memory 41, e.g., to run the computer-readable instructions of the big-data-based data deduplication method.
The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish communication connections between the computer device 4 and other electronic devices.
The processor 42 is configured to: collect at least two pieces of text data according to preset keywords; for each piece of text data, generate a k-bit binary string according to a similarity hash function (simhash) and a hash function (hash), where k = 2^n and n is a positive integer greater than or equal to 2; divide the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1; adjust the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string and storing them in a preset sample library; match the sample library with the frontmost binary string of each of the j sets and obtain the candidate results of each set returned by the sample library; and compute the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data and, if the Hamming distance is less than or equal to a threshold, perform deduplication.
In another embodiment of this application, the processor 42 is further configured so that, when a 64-bit binary string is divided into four equal parts, any one of the 16-bit sub-binary strings can be adjusted to be the frontmost binary string of the four sub-binary strings; e.g., the sub-binary strings L1-16, L17-32, L33-48, and L49-64 can each be adjusted to be the frontmost binary string of the whole string, giving 4 sets that can be stored as tables in the memory, i.e., 4 tables are stored in the memory; e.g., the 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), and (L49-64, L1-16, L17-32, L33-48).
The processor 42 is further configured to calculate the Hamming distance between two pieces of text data (e.g., first text data and second text data); e.g., the Hamming distance between binary string A and binary string B is the number of 1s in the binary result of A xor B.
For example, with binary string A = 100111 and binary string B = 101010, hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3.
The processor is further configured to map, through the simhash algorithm, a high-dimensional feature vector to an f-bit fingerprint, where f is an integer greater than or equal to 2, and to determine, by comparing the Hamming Distance of the f-bit fingerprints of two texts, whether the two texts are duplicates or highly similar: the smaller the Hamming distance, the more similar; when the Hamming distance equals zero, the two compared texts are identical; the larger the value, the less similar.
For example, for the simhash results of the three texts p1, p2, and p3 above, the pairwise Hamming distances are (p1, p2) = 4, (p1, p3) = 16, and (p2, p3) = 12; in terms of pairwise similarity, p1 and p2 are far more similar to each other than either is to p3.
This application also provides another implementation, namely a non-volatile readable storage medium storing readable instructions for data processing, the readable instructions being executable by at least one processor so that the at least one processor executes the steps of the data processing method described above.
For example, when the readable instructions for data processing are executed by at least one processor, the following is performed: collecting at least two pieces of text data according to preset keywords; for each piece of text data, generating a k-bit binary string according to a similarity hash function (simhash) and a hash function (hash), where k = 2^n and n is a positive integer greater than or equal to 2; dividing the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1; adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string and storing them in a preset sample library; matching the sample library with the frontmost binary string of each of the j sets and obtaining the candidate results of each set returned by the sample library; and computing the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data and, if the Hamming distance is less than or equal to a threshold, performing deduplication.
In another embodiment of this application, when the readable instructions for data processing are executed by at least one processor, the following is performed: when a 64-bit binary string is divided into four equal parts, any one of the 16-bit sub-binary strings can be adjusted to be the frontmost binary string of the four sub-binary strings; e.g., the sub-binary strings L1-16, L17-32, L33-48, and L49-64 can each be adjusted to be the frontmost binary string of the whole string, giving 4 sets that can be stored as tables in the memory, i.e., 4 tables are stored in the memory; e.g., the 4 sets are: (L1-16, L17-32, L33-48, L49-64), (L17-32, L1-16, L33-48, L49-64), (L33-48, L1-16, L17-32, L49-64), and (L49-64, L1-16, L17-32, L33-48).
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; they can, of course, also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the method described in each embodiment of this application.
Obviously, the embodiments described above are only some of the embodiments of this application rather than all of them; the drawings show preferred embodiments of this application but do not limit its patent scope. This application can be implemented in many different forms; on the contrary, these embodiments are provided so that the disclosure of this application will be understood more thoroughly and comprehensively. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent substitutions for some of the technical features therein. Any equivalent structure made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, likewise falls within the patent protection scope of this application.

Claims (20)

  1. A big-data-based data deduplication method, characterized by comprising:
    collecting at least two pieces of text data according to preset keywords;
    for each piece of text data, generating a k-bit binary string according to a similarity hash function and a hash function, where k = 2^n and n is a positive integer greater than or equal to 2;
    dividing the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1;
    adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string, and storing them in a preset sample library;
    matching the sample library with the frontmost binary string of each of the j sets to obtain candidate results of each set returned by the sample library;
    computing the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, and performing deduplication if the Hamming distance is less than or equal to a threshold.
  2. The big-data-based data deduplication method according to claim 1, characterized in that generating a k-bit binary string for each piece of text data according to a similarity hash function and a hash function specifically comprises:
    selecting the number of bits k of the similarity hash function;
    initializing each bit of the similarity hash function to 0;
    segmenting each piece of text data and extracting multiple word-weight pairs;
    applying the hash function to the word of each word-weight pair;
    accumulating the hashed word-weight pairs bit by bit to generate k values;
    converting the k generated values into a k-bit binary string.
  3. The big-data-based data deduplication method according to claim 2, characterized in that selecting the number of bits k of the similarity hash function specifically comprises:
    selecting the number of bits k of the similarity hash function according to the storage cost and the size of the data set.
  4. The big-data-based data deduplication method according to claim 2, characterized in that applying the hash function to the word of each word-weight pair specifically comprises:
    using a k-bit hash function to compute the hash code of a predetermined number of segmented letters of each piece of text data.
  5. The big-data-based data deduplication method according to claim 4, characterized in that collecting the at least two pieces of text data according to the preset keywords specifically comprises: using web-crawler technology to crawl, according to the preset keywords, at least two pieces of text data related to the keywords.
  6. The big-data-based data deduplication method according to any one of claims 1-5, characterized in that matching the sample library with the frontmost binary string of each of the j sets and obtaining the candidate results of each set returned by the sample library specifically comprises:
    determining whether the frontmost binary string of each of the j sub-binary strings is exactly the same as a frontmost binary string stored in the sample library; if so, determining that the candidate result currently returned by the sample library is a correct candidate result, and if not, determining that the candidate result currently returned by the sample library is an incorrect candidate result.
  7. The big-data-based data deduplication method according to claim 2, characterized in that accumulating the hashed word-weight pairs bit by bit to generate k values specifically comprises:
    accumulating the hashed word-weight pairs bit by bit: if the bit is 1, adding 1, and if it is 0, subtracting 1, finally generating k values.
  8. A big-data-based data deduplication apparatus, characterized by comprising:
    a collection module configured to collect at least two pieces of text data according to preset keywords;
    a processing module configured to generate, for each piece of text data, a k-bit binary string according to a similarity hash function and a hash function, where k = 2^n and n is a positive integer greater than or equal to 2;
    a splitting module configured to divide the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1;
    an adjustment module configured to adjust the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string and storing them in a preset sample library;
    a matching module configured to match the sample library with the frontmost binary string of each of the j sets and obtain the candidate results of each set returned by the sample library;
    a calculation module configured to compute the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data and perform deduplication if the Hamming distance is less than or equal to a threshold.
  9. The big-data-based data deduplication apparatus according to claim 8, characterized in that generating a k-bit binary string for each piece of text data according to a similarity hash function and a hash function specifically involves:
    a selection subunit configured to select the number of bits k of the similarity hash function;
    an initialization subunit configured to initialize each bit of the similarity hash function to 0;
    an extraction subunit configured to segment each piece of text data and extract multiple word-weight pairs;
    a hash-function processing subunit configured to apply the hash function to the word of each word-weight pair;
    an accumulation subunit configured to accumulate the hashed word-weight pairs bit by bit to generate k values;
    a processing subunit configured to convert the k generated values into a k-bit binary string.
  10. The big-data-based data deduplication apparatus according to claim 9, characterized in that selecting the number of bits k of the similarity hash function specifically involves:
    a bit-number selection subunit configured to select the number of bits k of the similarity hash function according to the storage cost and the size of the data set.
  11. A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the computer-readable instructions, implements the steps of the following big-data-based data deduplication method:
    collecting at least two pieces of text data according to preset keywords;
    for each piece of text data, generating a k-bit binary string according to a similarity hash function and a hash function, where k = 2^n and n is a positive integer greater than or equal to 2;
    dividing the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1;
    adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string, and storing them in a preset sample library;
    matching the sample library with the frontmost binary string of each of the j sets to obtain candidate results of each set returned by the sample library;
    computing the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, and performing deduplication if the Hamming distance is less than or equal to a threshold.
  12. The computer device according to claim 11, characterized in that generating a k-bit binary string for each piece of text data according to a similarity hash function and a hash function specifically comprises:
    selecting the number of bits k of the similarity hash function;
    initializing each bit of the similarity hash function to 0;
    segmenting each piece of text data and extracting multiple word-weight pairs;
    applying the hash function to the word of each word-weight pair;
    accumulating the hashed word-weight pairs bit by bit to generate k values;
    converting the k generated values into a k-bit binary string.
  13. The computer device according to claim 12, characterized in that selecting the number of bits k of the similarity hash function specifically comprises:
    selecting the number of bits k of the similarity hash function according to the storage cost and the size of the data set.
  14. The computer device according to claim 12, characterized in that applying the hash function to the word of each word-weight pair specifically comprises:
    using a k-bit hash function to compute the hash code of a predetermined number of segmented letters of each piece of text data.
  15. The computer device according to claim 14, characterized in that collecting the at least two pieces of text data according to the preset keywords specifically comprises: using web-crawler technology to crawl, according to the preset keywords, at least two pieces of text data related to the keywords.
  16. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, implement the steps of the following big-data-based data deduplication method:
    collecting at least two pieces of text data according to preset keywords;
    for each piece of text data, generating a k-bit binary string according to a similarity hash function and a hash function, where k = 2^n and n is a positive integer greater than or equal to 2;
    dividing the k-bit binary string into j equal sub-binary strings, where j is a positive integer greater than or equal to 1;
    adjusting the arrangement order of the j sub-binary strings, generating j corresponding sets each led by a different sub-binary string, and storing them in a preset sample library;
    matching the sample library with the frontmost binary string of each of the j sets to obtain candidate results of each set returned by the sample library;
    computing the Hamming distance between any two pieces of text data according to the candidate results of each piece of text data, and performing deduplication if the Hamming distance is less than or equal to a threshold.
  17. The non-volatile readable storage medium according to claim 16, characterized in that generating a k-bit binary string for each piece of text data according to a similarity hash function and a hash function specifically comprises:
    selecting the number of bits k of the similarity hash function;
    initializing each bit of the similarity hash function to 0;
    segmenting each piece of text data and extracting multiple word-weight pairs;
    applying the hash function to the word of each word-weight pair;
    accumulating the hashed word-weight pairs bit by bit to generate k values;
    converting the k generated values into a k-bit binary string.
  18. The non-volatile readable storage medium according to claim 17, characterized in that selecting the number of bits k of the similarity hash function specifically comprises:
    selecting the number of bits k of the similarity hash function according to the storage cost and the size of the data set.
  19. The non-volatile readable storage medium according to claim 17, characterized in that applying the hash function to the word of each word-weight pair specifically comprises:
    using a k-bit hash function to compute the hash code of a predetermined number of segmented letters of each piece of text data.
  20. The non-volatile readable storage medium according to claim 19, characterized in that collecting the at least two pieces of text data according to the preset keywords specifically comprises: using web-crawler technology to crawl, according to the preset keywords, at least two pieces of text data related to the keywords.
PCT/CN2019/103446 2019-05-15 2019-08-29 基于大数据的数据去重的方法、装置、设备及存储介质 WO2020228182A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910401427.5 2019-05-15
CN201910401427.5A CN110297879B (zh) 2019-05-15 2019-05-15 一种基于大数据的数据去重的方法、装置及存储介质

Publications (1)

Publication Number Publication Date
WO2020228182A1 true WO2020228182A1 (zh) 2020-11-19

Family

ID=68026845

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103446 WO2020228182A1 (zh) 2019-05-15 2019-08-29 基于大数据的数据去重的方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN110297879B (zh)
WO (1) WO2020228182A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733140A (zh) * 2020-12-28 2021-04-30 上海观安信息技术股份有限公司 一种针对模型倾斜攻击的检测方法及系统
CN112861505A (zh) * 2021-02-04 2021-05-28 北京百度网讯科技有限公司 重复度检测方法、装置和电子设备
CN113129056A (zh) * 2021-04-15 2021-07-16 微梦创科网络科技(中国)有限公司 一种控制广告投放频次的方法及系统
CN117150518A (zh) * 2023-08-04 2023-12-01 中国移动通信集团四川有限公司 一种通信运营商数据安全加密方法及系统
CN117251445B (zh) * 2023-10-11 2024-06-04 杭州今元标矩科技有限公司 一种基于深度学习的crm数据查重筛选方法、系统及介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765756B (zh) * 2019-10-29 2023-12-01 北京齐尔布莱特科技有限公司 一种文本处理方法、装置、计算设备及介质
CN113377294B (zh) * 2021-08-11 2021-10-22 武汉泰乐奇信息科技有限公司 一种基于二值化数据转换的大数据存储方法和装置
CN113836208A (zh) * 2021-08-16 2021-12-24 深圳希施玛数据科技有限公司 一种数据处理方法、装置及终端设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160171009A1 (en) * 2014-12-10 2016-06-16 International Business Machines Corporation Method and apparatus for data deduplication
CN108132929A (zh) * 2017-12-25 2018-06-08 上海大学 一种海量非结构化文本的相似性计算方法
CN108345586A (zh) * 2018-02-09 2018-07-31 重庆誉存大数据科技有限公司 一种文本去重方法及系统
CN109271487A (zh) * 2018-09-29 2019-01-25 浪潮软件股份有限公司 一种相似文本分析方法

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8661341B1 (en) * 2011-01-19 2014-02-25 Google, Inc. Simhash based spell correction
US9996764B2 (en) * 2014-04-29 2018-06-12 Institute Of Automation Chinese Academy Of Sciences Image matching method based on cascaded binary encoding
CN105095162A (zh) * 2014-05-19 2015-11-25 腾讯科技(深圳)有限公司 文本相似度确定方法、装置、电子设备及系统
CN107977347B (zh) * 2017-12-04 2021-12-21 海南云江科技有限公司 一种题目去重方法和计算设备
CN108280127A (zh) * 2017-12-15 2018-07-13 广州艾媒数聚信息咨询股份有限公司 一种海量相似新闻查重甄选方法、系统及装置
CN108573045B (zh) * 2018-04-18 2021-12-24 同方知网数字出版技术股份有限公司 一种基于多阶指纹的比对矩阵相似度检索方法
CN109359183B (zh) * 2018-10-11 2021-04-23 南京中孚信息技术有限公司 文本信息的查重方法、装置及电子设备
CN109670153B (zh) * 2018-12-21 2023-11-17 北京城市网邻信息技术有限公司 一种相似帖子的确定方法、装置、存储介质及终端

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160171009A1 (en) * 2014-12-10 2016-06-16 International Business Machines Corporation Method and apparatus for data deduplication
CN108132929A (zh) * 2017-12-25 2018-06-08 上海大学 一种海量非结构化文本的相似性计算方法
CN108345586A (zh) * 2018-02-09 2018-07-31 重庆誉存大数据科技有限公司 一种文本去重方法及系统
CN109271487A (zh) * 2018-09-29 2019-01-25 浪潮软件股份有限公司 一种相似文本分析方法

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733140A (zh) * 2020-12-28 2021-04-30 上海观安信息技术股份有限公司 一种针对模型倾斜攻击的检测方法及系统
CN112733140B (zh) * 2020-12-28 2023-12-22 上海观安信息技术股份有限公司 一种针对模型倾斜攻击的检测方法及系统
CN112861505A (zh) * 2021-02-04 2021-05-28 北京百度网讯科技有限公司 重复度检测方法、装置和电子设备
CN113129056A (zh) * 2021-04-15 2021-07-16 微梦创科网络科技(中国)有限公司 一种控制广告投放频次的方法及系统
CN117150518A (zh) * 2023-08-04 2023-12-01 中国移动通信集团四川有限公司 一种通信运营商数据安全加密方法及系统
CN117251445B (zh) * 2023-10-11 2024-06-04 杭州今元标矩科技有限公司 一种基于深度学习的crm数据查重筛选方法、系统及介质

Also Published As

Publication number Publication date
CN110297879B (zh) 2023-05-30
CN110297879A (zh) 2019-10-01

Similar Documents

Publication Publication Date Title
WO2020228182A1 (zh) 基于大数据的数据去重的方法、装置、设备及存储介质
US8364686B1 (en) Document near-duplicate detection
US8321434B1 (en) Two tiered architecture of named entity recognition engine
WO2020215667A1 (zh) 文本内容快速去重方法、装置、计算机设备及存储介质
Sood et al. Probabilistic near-duplicate detection using simhash
US20160171052A1 (en) Method and system for document indexing and data querying
WO2017000610A1 (zh) 一种网页分类的方法和装置
CN110808987B (zh) 识别恶意域名的方法及计算设备
WO2012075884A1 (zh) 书签智能分类的方法和服务器
WO2020114100A1 (zh) 一种信息处理方法、装置和计算机存储介质
EP3926484B1 (en) Improved fuzzy search using field-level deletion neighborhoods
US20180276244A1 (en) Method and system for searching for similar images that is nearly independent of the scale of the collection of images
CN113688954A (zh) 一种计算文本相似度的方法、系统、设备和存储介质
WO2022105497A1 (zh) 文本筛选方法、装置、设备及存储介质
CN103324886A (zh) 一种网络攻击检测中指纹库的提取方法和系统
Manaa et al. Web documents similarity using k-shingle tokens and minhash technique
CN109359090A (zh) 基于卷积神经网络的文件碎片分类方法及系统
CN110399464B (zh) 一种相似新闻判别方法、系统及电子设备
CN109918661B (zh) 同义词获取方法及装置
CN109460500B (zh) 热点事件发现方法、装置、计算机设备和存储介质
US11709798B2 (en) Hash suppression
CN111985217B (zh) 一种关键词提取方法、计算设备及可读存储介质
Smith et al. Classification of text to subject using LDA
Sinha et al. Introduction to data deduplication approaches
Zhao et al. MapReduce-based clustering for near-duplicate image identification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19928626

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19928626

Country of ref document: EP

Kind code of ref document: A1