CN110147363A - Data deduplication and cleaning method and system for information full-text retrieval - Google Patents

Data deduplication and cleaning method and system for information full-text retrieval

Info

Publication number
CN110147363A
Authority
CN
China
Prior art keywords
data
data cell
retrieval
cell
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910280637.3A
Other languages
Chinese (zh)
Inventor
何宬呈
赵鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huadi Computer Group Co Ltd
Original Assignee
Huadi Computer Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huadi Computer Group Co Ltd
Priority to CN201910280637.3A
Publication of CN110147363A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a data deduplication and cleaning method and system for information full-text retrieval. The method comprises: performing format processing on each data unit in the acquired initial retrieval data to obtain retrieval data consisting of unformatted plain-text content; performing digest calculation on each data unit to obtain its digest code, and deduplicating the unformatted plain-text retrieval data according to the digest code of each data unit to obtain deduplicated retrieval data; and performing legitimacy screening on each data unit in the deduplicated retrieval data according to a preset legitimacy screening strategy, so that the legal retrieval data obtained are added to the index library. For deduplication, the invention uses a two-pass digest comparison, which improves efficiency and accuracy at the same time. By calculating the sensitivity value of each data unit, it marks the threat level of document data in a quantified way, helping to ensure the security and the correct political orientation of the retrieval system.

Description

Data deduplication and cleaning method and system for information full-text retrieval
Technical field
The present invention relates to the technical field of data processing, and more particularly to a data deduplication and cleaning method and system for information full-text retrieval.
Background
Mainstream full-text retrieval systems in the industry generally acquire data in two ways: crawler collection and direct acquisition from a database. Direct database acquisition is generally used for existing, controllable applications; the data are standardized and concentrated, with little duplicate or invalid data. Crawler collection covers a wider range but yields more uncontrollable data: the format of the acquired data is fairly regular, but the content is miscellaneous and contains a large amount of duplicate data and data that do not comply with relevant laws and regulations. The collected data therefore need to be cleaned and screened.
A full-text retrieval system consists of four parts: a collector, an indexer, a searcher, and a user interface. As an important component of full-text retrieval, the collector synchronizes and gathers information within its domain and deduplicates and cleans the data, so as to provide an accurate and safe source of data for searching.
In most existing retrieval systems, the crawler part of the collector acquires data according to certain collection rules and stores them in the search library without any processing, so the duplication rate is high and invalid data are numerous. Retrieval systems with data quality requirements generally deduplicate in one of two ways: (1) deduplicating entity data on the actual access address of the resource, that is, URL uniqueness; (2) comparing the matching rate of the full content and rejecting items with a high matching rate. Content legitimacy cleaning is generally done as follows: based on a fixed sensitive-word dictionary, documents that match a sensitive word are rejected when entering the search library. These schemes have the following shortcomings. For deduplication, the first way filters out, to some extent, physical duplication and the risks of hotlinking and cyclic fetching, but it does nothing about logical duplication such as syndicated copy and duplicates, so a considerable amount of logically duplicated data can still enter the library. The second way avoids logical duplication, but because it matches the full text it is slow, and the transaction time for batch index updates increases greatly. The legitimacy processing guarantees data legality to a considerable degree, but it is too coarse and is prone to mistaken and excessive deletion.
Therefore, an efficient and accurate data deduplication and cleaning method is needed to ensure, as far as possible, that the data are accurate and legal.
Summary of the invention
The present invention proposes a data deduplication and cleaning method and system for information full-text retrieval, to solve the problem of how to deduplicate and clean collected data efficiently and accurately.
To solve the above problem, according to one aspect of the present invention, a data deduplication and cleaning method for information full-text retrieval is provided, characterized in that the method comprises:
performing format processing on each data unit in the acquired initial retrieval data to obtain retrieval data consisting of unformatted plain-text content;
performing digest calculation on each data unit in the retrieval data of unformatted plain-text content to obtain the digest code of each data unit, and deduplicating the retrieval data of unformatted plain-text content according to the digest code of each data unit to obtain deduplicated retrieval data;
performing legitimacy screening on each data unit in the deduplicated retrieval data according to a preset legitimacy screening strategy, so that the legal retrieval data obtained are added to the index library.
Preferably, data are collected by a crawler program to obtain the initial retrieval data.
Preferably, performing format processing on each data unit in the acquired initial retrieval data to obtain retrieval data consisting of unformatted plain-text content comprises:
separating each data unit in the acquired initial retrieval data according to preset information categories to obtain retrieval data consisting of unformatted plain-text content; wherein the preset information categories include: format descriptors, spaces, special characters, and text.
Preferably, performing digest calculation on each data unit in the retrieval data of unformatted plain-text content to obtain the digest code of each data unit, and deduplicating the retrieval data of unformatted plain-text content according to the digest code of each data unit to obtain deduplicated retrieval data, comprises:
performing CRC digest calculation and MD5 digest calculation on each data unit in the retrieval data of unformatted plain-text content to obtain the CRC digest code and the MD5 digest code of each data unit;
judging in turn whether the CRC digest code of each data unit is in the alternative library;
wherein, if the CRC digest code of a data unit is not in the alternative library, the data unit is stored in the alternative library;
if the CRC digest code of a data unit is in the alternative library, judging whether the MD5 digest code of the data unit is in the alternative library; if the MD5 digest code of the data unit is not in the alternative library, the data unit is stored in the alternative library; otherwise, the data unit is discarded directly;
taking the data units in the alternative library as the deduplicated retrieval data.
Preferably, the preset legitimacy screening strategy comprises:
calculating the sensitivity value of each data unit, and judging for each data unit whether its sensitivity value is greater than a preset sensitivity threshold; if so, the data unit is discarded, that is, it is not added to the index library; otherwise, the data unit is determined to be legal retrieval data.
Preferably, calculating the sensitivity value of each data unit comprises:
determining the sensitivity value of each data unit according to the weights of the sensitivity levels of the sensitive words of different sensitivity levels in the data unit and the weights of the corresponding matching degree levels;
wherein the sensitivity levels include: a high sensitivity level, a medium sensitivity level, and a low sensitivity level; and the matching degree levels include: a high matching degree level, a medium matching degree level, and a low matching degree level.
Preferably, the method further comprises:
replacing the sensitive words in the data units of the obtained legal retrieval data with a predetermined symbol before adding them to the index library, to reduce the hit rate of searches that use a sensitive word as the query condition.
According to another aspect of the present invention, a data deduplication and cleaning system for information full-text retrieval is provided, characterized in that the system comprises:
a data preprocessing module, configured to perform format processing on each data unit in the acquired initial retrieval data to obtain retrieval data consisting of unformatted plain-text content;
a data deduplication processing module, configured to perform digest calculation on each data unit in the retrieval data of unformatted plain-text content to obtain the digest code of each data unit, and to deduplicate the retrieval data of unformatted plain-text content according to the digest code of each data unit to obtain deduplicated retrieval data;
a data legitimacy screening module, configured to perform legitimacy screening on each data unit in the deduplicated retrieval data according to a preset legitimacy screening strategy, so that the legal retrieval data obtained are added to the index library.
Preferably, the system further comprises:
a data acquisition module, configured to collect data by a crawler program to obtain the initial retrieval data.
Preferably, the data preprocessing module performing format processing on each data unit in the acquired initial retrieval data to obtain retrieval data consisting of unformatted plain-text content comprises:
separating each data unit in the acquired initial retrieval data according to preset information categories to obtain retrieval data consisting of unformatted plain-text content; wherein the preset information categories include: format descriptors, spaces, special characters, and text.
Preferably, the data deduplication processing module comprises:
a digest code calculation submodule, configured to perform CRC digest calculation and MD5 digest calculation on each data unit in the retrieval data of unformatted plain-text content to obtain the CRC digest code and the MD5 digest code of each data unit;
a judging submodule, configured to judge in turn whether the CRC digest code of each data unit is in the alternative library;
wherein, if the CRC digest code of a data unit is not in the alternative library, the data unit is stored in the alternative library;
if the CRC digest code of a data unit is in the alternative library, judging whether the MD5 digest code of the data unit is in the alternative library; if the MD5 digest code of the data unit is not in the alternative library, the data unit is stored in the alternative library; otherwise, the data unit is discarded directly;
a deduplicated data determination submodule, configured to take the data units in the alternative library as the deduplicated retrieval data.
Preferably, the preset legitimacy screening strategy in the data legitimacy screening module comprises:
calculating the sensitivity value of each data unit, and judging for each data unit whether its sensitivity value is greater than a preset sensitivity threshold; if so, the data unit is discarded, that is, it is not added to the index library; otherwise, the data unit is determined to be legal retrieval data.
Preferably, the sensitivity value of each data unit is calculated as follows:
determining the sensitivity value of each data unit according to the weights of the sensitivity levels of the sensitive words of different sensitivity levels in the data unit and the weights of the corresponding matching degree levels;
wherein the sensitivity levels include: a high sensitivity level, a medium sensitivity level, and a low sensitivity level; and the matching degree levels include: a high matching degree level, a medium matching degree level, and a low matching degree level.
Preferably, the system further comprises:
a sensitive word replacement module, configured to replace the sensitive words in the data units of the obtained legal retrieval data with a predetermined symbol before adding them to the index library, to reduce the hit rate of searches that use a sensitive word as the query condition.
The present invention provides a data deduplication and cleaning method and system for information full-text retrieval, comprising: performing format processing on each data unit in the acquired initial retrieval data to obtain retrieval data consisting of unformatted plain-text content; performing digest calculation on each data unit in the retrieval data of unformatted plain-text content to obtain the digest code of each data unit, and deduplicating the retrieval data of unformatted plain-text content according to the digest code of each data unit to obtain deduplicated retrieval data; and performing legitimacy screening on each data unit in the deduplicated retrieval data according to a preset legitimacy screening strategy, so that the legal retrieval data obtained are added to the index library. For deduplication, the invention uses a two-pass digest comparison: the CRC digest code used in the first pass is short and extremely fast to compare, and screens out the vast majority of the data, while the MD5 digest code used in the second pass ensures the accuracy of deduplication, so efficiency and accuracy are improved together. In legitimacy screening, the sensitivity value of each data unit is calculated so that the threat level of document data is marked in a quantified way, and a customizable sensitivity threshold gives control over the level of the system's security policy, helping to ensure the security and the correct political orientation of the retrieval system.
Brief description of the drawings
Exemplary embodiments of the present invention can be understood more fully by reference to the following drawings:
Fig. 1 is a flowchart of the data deduplication and cleaning method 100 for information full-text retrieval according to an embodiment of the present invention;
Fig. 2 is an example diagram of deduplication processing according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of legitimacy verification according to an embodiment of the present invention; and
Fig. 4 is a schematic structural diagram of the data deduplication and cleaning system 400 for information full-text retrieval according to an embodiment of the present invention.
Detailed description of embodiments
Exemplary embodiments of the present invention are now described with reference to the drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here; these embodiments are provided so that the disclosure is thorough and complete and fully conveys the scope of the invention to those of ordinary skill in the art. The terms used in the exemplary embodiments illustrated in the drawings do not limit the invention. In the drawings, identical units or elements use identical reference numerals.
Unless otherwise indicated, the terms used here (including scientific and technical terms) have the meanings commonly understood by those of ordinary skill in the art. It should further be understood that terms defined in commonly used dictionaries are to be interpreted as having meanings consistent with the context of the relevant field, and not in an idealized or overly formal sense.
Fig. 1 is a flowchart of the data deduplication and cleaning method 100 for information full-text retrieval according to an embodiment of the present invention. As shown in Fig. 1, the method provided by embodiments of the present invention deduplicates by two-pass digest comparison: the CRC digest code used in the first pass is short and extremely fast to compare, and screens out the vast majority of the data, while the MD5 digest code used in the second pass ensures the accuracy of deduplication, so efficiency and accuracy are improved together. In legitimacy screening, the sensitivity value of each data unit is calculated so that the threat level of document data is marked in a quantified way, and a customizable sensitivity threshold gives control over the level of the system's security policy, helping to ensure the security and the correct political orientation of the retrieval system. The method 100 starts at step 101, in which format processing is performed on each data unit in the acquired initial retrieval data to obtain retrieval data consisting of unformatted plain-text content.
Preferably, data are collected by a crawler program to obtain the initial retrieval data.
Preferably, performing format processing on each data unit in the acquired initial retrieval data to obtain retrieval data consisting of unformatted plain-text content comprises:
separating each data unit in the acquired initial retrieval data according to preset information categories to obtain retrieval data consisting of unformatted plain-text content; wherein the preset information categories include: format descriptors, spaces, special characters, and text.
The content data (data unit) obtained by a crawler program is usually a formatted content fragment. The format may be that of an HTML document, an XML document, or a formatted Word document, but the format is generally unrelated to the content. Therefore, in embodiments of the present invention, the first step is to separate the format descriptors, spaces, special characters, and intelligently recognized text, so as to obtain unformatted plain-text content.
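The format processing of step 101 can be illustrated with a minimal Python sketch. It is only one possible reading of the step, not the patented implementation: the function name `to_plain_text` is hypothetical, markup tags stand in for "format descriptors", and the lowercasing is an assumption inferred from the Fig. 2 example rather than something the description states. The "intelligently recognized text" part is not modeled.

```python
import html
import re

# Assumption: "format descriptors" are approximated here by markup tags.
_TAG = re.compile(r"<[^>]+>")
# Assumption: spaces and special characters are everything that is not a
# letter or digit (underscore is treated as a special character too).
_NON_TEXT = re.compile(r"[\W_]+")


def to_plain_text(raw: str) -> str:
    """Reduce one crawled data unit to unformatted plain text."""
    without_tags = _TAG.sub(" ", raw)          # drop format descriptors
    unescaped = html.unescape(without_tags)    # decode entities such as &nbsp;
    text_only = _NON_TEXT.sub("", unescaped)   # drop spaces and special characters
    return text_only.lower()                   # assumed case normalization


# Hypothetical input modeled on the Fig. 2 example.
print(to_plain_text("AaBBc, cc \n<br>d##,!!@$"))
```

Normalizing before hashing matters because the digest comparison in step 102 only detects byte-identical content; two copies of an article that differ in markup or spacing would otherwise produce different digests.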
In step 102, digest calculation is performed on each data unit in the retrieval data of unformatted plain-text content to obtain the digest code of each data unit, and the retrieval data of unformatted plain-text content are deduplicated according to the digest code of each data unit to obtain deduplicated retrieval data.
Preferably, performing digest calculation on each data unit in the retrieval data of unformatted plain-text content to obtain the digest code of each data unit, and deduplicating the retrieval data of unformatted plain-text content according to the digest code of each data unit to obtain deduplicated retrieval data, comprises:
performing CRC digest calculation and MD5 digest calculation on each data unit in the retrieval data of unformatted plain-text content to obtain the CRC digest code and the MD5 digest code of each data unit;
judging in turn whether the CRC digest code of each data unit is in the alternative library;
wherein, if the CRC digest code of a data unit is not in the alternative library, the data unit is stored in the alternative library;
if the CRC digest code of a data unit is in the alternative library, judging whether the MD5 digest code of the data unit is in the alternative library; if the MD5 digest code of the data unit is not in the alternative library, the data unit is stored in the alternative library; otherwise, the data unit is discarded directly;
taking the data units in the alternative library as the deduplicated retrieval data.
In embodiments of the present invention, CRC digest calculation and MD5 digest calculation are performed on the obtained content data to obtain a CRC digest code and an MD5 digest code. Both digest algorithms are very efficient. Because every bit takes part in the digest calculation, a change in any single bit changes the digest result, so the digests can be used for content comparison. Adding two fixed-length digest fields in storage at collection time has no great impact on either performance or data volume.
Then, before entry into the index library, it is checked whether the CRC digest code of each piece of content data already exists. If it does not exist, the data go directly into the alternative library. If it exists, the MD5 code of the data is then compared: if the MD5 code does not exist, the data go into the alternative library; if it exists, the data content is a duplicate and is discarded.
Because the CRC digest code has few bits and is fast to compare, it is used for the first screening. If no identical code exists in the alternative library, the library contains no identical content data and the data can be stored.
Because the CRC digest code is short, collisions are theoretically possible, so when the CRC digest code is repeated, the MD5 digest code is compared as well. Because the MD5 code is longer and MD5 is an entirely different algorithm from CRC, the content data can essentially be determined to be identical only when both digest codes match.
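The collision risk mentioned above can be estimated with the standard birthday approximation. The description does not state the width of the CRC; the figures below assume a 32-bit CRC and digests that are uniformly distributed, so they are an illustration rather than a property of the claimed method. For n distinct data units:

```latex
P(\text{at least one CRC collision}) \approx 1 - \exp\!\left(-\frac{n(n-1)}{2 \cdot 2^{32}}\right)
```

Under these assumptions, n = 10^5 distinct data units already give a probability of roughly 0.69 that some pair shares a CRC code, which is why a CRC match alone cannot be treated as proof of duplication. Replacing 2^32 with 2^128 for MD5 makes the same expression negligible for accidental collisions at any realistic n.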
Fig. 2 is an example diagram of deduplication processing according to an embodiment of the present invention. As shown in Fig. 2, suppose the content of a collected initial data unit is "AaBBc, cc n<br>d##,!!@$ Kangxu, Yongzheng". Format processing is performed to obtain the text content t = "aabbcccd Kangxu with Yongzheng". Hash calculation is performed on the text content to obtain the CRC digest code crc(t) and the MD5 digest code md5(t). crc(t) is compared first: if it does not match any existing code, the data unit is stored directly; if it matches, md5(t) is compared. If md5(t) also matches, the data unit is discarded; if md5(t) does not match, the data unit is stored.
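The two-pass comparison can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the class name `AlternativeLibrary` and its method are hypothetical, the CRC is taken to be CRC-32 as provided by `zlib.crc32` (the description does not name a CRC variant), and the library is held in memory as two sets of digest codes.

```python
import hashlib
import zlib


class AlternativeLibrary:
    """In-memory stand-in for the alternative library (hypothetical helper)."""

    def __init__(self) -> None:
        self._crc_codes: set[int] = set()
        self._md5_codes: set[str] = set()
        self.units: list[str] = []

    def add_if_new(self, plain_text: str) -> bool:
        """Store the data unit unless both digest codes are already known.

        Returns True if the unit was stored, False if it was discarded.
        """
        data = plain_text.encode("utf-8")
        crc = zlib.crc32(data)               # short digest, first pass (assumed CRC-32)
        md5 = hashlib.md5(data).hexdigest()  # long digest, second pass

        # First pass: an unknown CRC code means the content is new.
        # Second pass: a known CRC code is only a duplicate if MD5 matches too.
        if crc in self._crc_codes and md5 in self._md5_codes:
            return False

        self._crc_codes.add(crc)
        self._md5_codes.add(md5)
        self.units.append(plain_text)
        return True


library = AlternativeLibrary()
for unit in ["aabbcccd", "some other text", "aabbcccd"]:
    library.add_if_new(unit)
print(library.units)  # the repeated unit is stored only once
```

Both digests are computed up front here because the description stores them as two fixed-length fields per data unit. Computing MD5 only after a CRC match would also be possible, but the MD5 code of every stored unit would still need to be recorded for later comparisons.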
In step 103, legitimacy screening is performed on each data unit in the deduplicated retrieval data according to a preset legitimacy screening strategy, so that the legal retrieval data obtained are added to the index library.
Preferably, the preset legitimacy screening strategy comprises:
calculating the sensitivity value of each data unit, and judging for each data unit whether its sensitivity value is greater than a preset sensitivity threshold; if so, the data unit is discarded, that is, it is not added to the index library; otherwise, the data unit is determined to be legal retrieval data.
Preferably, calculating the sensitivity value of each data unit comprises:
determining the sensitivity value of each data unit according to the weights of the sensitivity levels of the sensitive words of different sensitivity levels in the data unit and the weights of the corresponding matching degree levels;
wherein the sensitivity levels include: a high sensitivity level, a medium sensitivity level, and a low sensitivity level; and the matching degree levels include: a high matching degree level, a medium matching degree level, and a low matching degree level.
Preferably, the method further comprises:
replacing the sensitive words in the data units of the obtained legal retrieval data with a predetermined symbol before adding them to the index library, to reduce the hit rate of searches that use a sensitive word as the query condition.
The data crawled by a crawler vary widely: some are normal, positive information, and some do not comply with national laws and regulations. Examples include text containing content whose purpose is publicity or promotion; text containing pornographic or obscene content; text containing information on articles restricted by national laws and regulations; and text containing politically sensitive or other harmful information contrary to laws and regulations. Such information is very harmful and will have a very bad effect if it is not screened out, so it needs to be filtered.
Fig. 3 is a schematic diagram of legitimacy verification according to an embodiment of the present invention. As shown in Fig. 3, when normal documents, violent or antisocial documents, politically sensitive documents, and pornographic or obscene documents undergo legitimacy screening, documents with serious sensitivity are not stored, while documents with low sensitivity are stored as normal documents after their sensitive words have been hidden.
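The hiding of sensitive words before storage can be sketched as below. The function name `mask_sensitive_words`, the exact-substring matching, and the choice to emit one symbol per masked character are all assumptions made for illustration; the description only states that sensitive words are replaced with a predetermined symbol.

```python
import re


def mask_sensitive_words(text: str, sensitive_words: list[str], symbol: str = "*") -> str:
    """Replace each occurrence of a sensitive word with the masking symbol."""
    if not sensitive_words:
        return text
    # Try longer words first so that a short word does not split a longer one.
    ordered = sorted(sensitive_words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in ordered))
    # Assumption: one symbol per masked character keeps the text length unchanged.
    return pattern.sub(lambda match: symbol * len(match.group(0)), text)


# Hypothetical dictionary entry and document text.
print(mask_sensitive_words("buy foo now", ["foo"]))
```

Because the masked text is what gets indexed, a query for the original sensitive word no longer matches the stored document, which is the effect the description aims for.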
In embodiments of the present invention, a sensitive dictionary is built from collected improper sensitive characters, words, and phrases.
The characters, words, and phrases in the sensitive dictionary are managed by grade. Three sensitivity levels are set, namely high sensitivity, medium sensitivity, and low sensitivity, and a corresponding weight is set for each. Weights are likewise set for the different matching degree levels.
Then, the sensitivity value of each data unit is determined according to the weights of the sensitivity levels of the sensitive words of different sensitivity levels in the data unit and the weights of the corresponding matching degree levels. For each data unit it is judged whether the sensitivity value is greater than the preset sensitivity threshold; if so, the data unit is discarded, that is, it is not added to the index library; otherwise, the data unit is determined to be legal retrieval data.
The dictionary classification in embodiments of the present invention accepts flexible rule customization and supports custom keywords. An intelligent machine learning algorithm built on semantic analysis efficiently filters complex variant text. Recognition of ethnic minority languages such as Uyghur and Tibetan is supported, as are strategies for foreign languages such as English, Japanese, German, and French.
When calculating the sensitivity value, words and phrases can be divided by their degree of sensitivity into three levels, low sensitivity, medium sensitivity, and high sensitivity, with weights of 1, 3, and 10 respectively. The weights of the low, medium, and high matching degree levels are likewise set to 1, 3, and 10. The sensitivity values corresponding to the different sensitivity levels and matching degree levels are shown in Table 1.
Table 1  Sensitivity values for different sensitivity levels and matching degree levels
                             Low sensitivity (1)   Medium sensitivity (3)   High sensitivity (10)
Low matching degree (1)      1                     3                        10
Medium matching degree (3)   3                     9                        30
High matching degree (10)    10                    30                       100
The sum of the scores of the sensitive words of each level in a data unit is the final score of the document, that is, its sensitivity value.
For example, if in data unit 1 the low-sensitivity words have a high matching degree, the medium-sensitivity words have a low matching degree, and the high-sensitivity words have a medium matching degree, the final score of the document is 43 (10+3+30);
if in data unit 2 the low-sensitivity words have a low matching degree, the medium-sensitivity words have a low matching degree, and the high-sensitivity words have a high matching degree, the final score of the document is 104 (1+3+100).
If the sensitivity threshold is set to 80, data unit 2 is discarded and data unit 1 is added to the document library, with the sensitive words in its content replaced by * on storage, to reduce the hit rate of searches that use a sensitive word as the query condition.
The user can set the sensitivity threshold according to the actual security requirements of the system.
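The scoring and threshold decision can be reproduced with a short Python sketch. The names `LEVEL_WEIGHT`, `sensitivity_value`, and `is_legal` are hypothetical, and treating each cell of Table 1 as the product of the two weights is an inference from the table values rather than a formula stated in the description. How matching degree is measured is not modeled; each hit is simply given as a pair of levels.

```python
# Weights from the description: low = 1, medium = 3, high = 10, used for both
# the sensitivity level and the matching degree level.
LEVEL_WEIGHT = {"low": 1, "medium": 3, "high": 10}


def sensitivity_value(hits: list[tuple[str, str]]) -> int:
    """Sum the scores of all hits, each given as (sensitivity level, matching level).

    Assumption: a hit scores the product of its two weights, as Table 1 suggests.
    """
    return sum(LEVEL_WEIGHT[sens] * LEVEL_WEIGHT[match] for sens, match in hits)


def is_legal(hits: list[tuple[str, str]], threshold: int = 80) -> bool:
    """A data unit is discarded only if its value is greater than the threshold."""
    return sensitivity_value(hits) <= threshold


# The two worked examples from the description.
unit_1 = [("low", "high"), ("medium", "low"), ("high", "medium")]
unit_2 = [("low", "low"), ("medium", "low"), ("high", "high")]

print(sensitivity_value(unit_1), is_legal(unit_1))  # 43 True
print(sensitivity_value(unit_2), is_legal(unit_2))  # 104 False
```

With the threshold of 80 used in the example, data unit 1 is kept and data unit 2 is discarded, matching the outcome described above. Raising or lowering `threshold` is the knob the description refers to when it says the threshold follows the system's security requirements.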
Fig. 4 is the structural representation that system 400 is cleared up according to the data deduplication of the information full-text search of embodiment of the present invention Figure.As shown in figure 4, the data deduplication for the information full-text search that embodiments of the present invention provide clears up system 400, comprising: number Data preprocess module 401, data deduplication processing module 402 and data validation screening module 403.
Preferably, the data preprocessing module 401 is configured to perform format processing on each data cell in the acquired initial retrieval data, so as to obtain retrieval data of unformatted plain text content.
Preferably, the system further comprises a data acquisition module, configured to collect data using a crawler program so as to obtain the initial retrieval data.
Preferably, the data preprocessing module performing format processing on each data cell in the acquired initial retrieval data to obtain the retrieval data of unformatted plain text content comprises: separating each data cell in the acquired initial retrieval data according to preset information categories, so as to obtain the retrieval data of unformatted plain text content; wherein the preset information categories include: format descriptors, spaces, special symbols, and text.
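The separation by information category might look roughly like the sketch below. The text does not define the categories any further, so several things here are assumptions: format descriptors are treated as HTML-style tags, only tokens of the text category are kept, and the kept tokens are joined with single spaces. The names `TOKEN` and `to_plain_text` are hypothetical.

```python
import re

# One alternative per information category named in the text. Treating
# format descriptors as HTML-style tags is an assumption for illustration.
TOKEN = re.compile(
    r"(?P<format><[^>]+>)"   # format descriptors
    r"|(?P<space>\s+)"       # spaces
    r"|(?P<text>\w+)"        # text
    r"|(?P<symbol>.)",       # special symbols (anything else)
    re.DOTALL,
)


def to_plain_text(raw):
    """Classify each token by category and keep only the text tokens."""
    words = [m.group() for m in TOKEN.finditer(raw) if m.lastgroup == "text"]
    return " ".join(words)


print(to_plain_text("<p>Hello,   <b>world</b>!</p>"))  # Hello world
```

Whether punctuation or other special symbols should survive into the plain text is a design choice that the text leaves open; this sketch simply drops them.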
Preferably, the data deduplication processing module 402 is configured to perform digest calculation on each data cell in the retrieval data of the unformatted plain text content to obtain the digest code of each data cell, and to perform deduplication on the retrieval data of the unformatted plain text content according to the digest code of each data cell, so as to obtain deduplicated retrieval data.
Preferably, the data deduplication processing module 402 comprises: a digest code calculation submodule 4021, a judging submodule 4022, and a deduplicated data determination submodule 4023.
Preferably, the digest code calculation submodule 4021 is configured to perform CRC digest calculation and MD5 digest calculation on each data cell in the retrieval data of the unformatted plain text content, so as to obtain the CRC digest code and the MD5 digest code of each data cell.
Preferably, the judging submodule 4022 is configured to judge in turn whether the CRC digest code of each data cell is in the alternative library; wherein, if the CRC digest code of a data cell is not in the alternative library, the data cell is stored in the database; if the CRC digest code of a data cell is in the alternative library, it is further judged whether the MD5 digest code of the data cell is in the alternative library; if the MD5 digest code of the data cell is not in the alternative library, the data cell is stored in the alternative library; otherwise, the data cell is discarded directly.
Preferably, the deduplicated data determination submodule 4023 is configured to take the data cells in the alternative library as the deduplicated retrieval data.
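The two-stage CRC and MD5 check can be sketched as follows. This is a minimal sketch under stated assumptions: CRC-32 from Python's `zlib` stands in for the unspecified CRC variant, the "database" mentioned in the first branch is read as the alternative library, and the library is modeled as an in-memory list plus two digest sets. The function name `deduplicate` is hypothetical.

```python
import hashlib
import zlib


def deduplicate(data_cells):
    """Return the data cells kept in the alternative library, in first-seen order."""
    crc_codes = set()         # CRC digest codes already in the alternative library
    md5_codes = set()         # MD5 digest codes already in the alternative library
    alternative_library = []  # data cells that survive deduplication

    for text in data_cells:
        data = text.encode("utf-8")
        crc_code = zlib.crc32(data)
        md5_code = hashlib.md5(data).hexdigest()

        # Discard only when the CRC code is already known AND the MD5 code is
        # also already known; a CRC hit alone may be a collision.
        if crc_code in crc_codes and md5_code in md5_codes:
            continue

        crc_codes.add(crc_code)
        md5_codes.add(md5_code)
        alternative_library.append(text)

    return alternative_library


print(deduplicate(["alpha", "beta", "alpha"]))  # ['alpha', 'beta']
```

The cheap CRC lookup acts as a first filter, and the MD5 comparison guards against distinct data cells that happen to share a CRC value.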
Preferably, the data validation screening module 403 is configured to perform legitimacy screening on each data cell in the deduplicated retrieval data according to a preset legitimacy screening strategy, so as to obtain legal retrieval data and add it to the index database.
Preferably, the preset legitimacy screening strategy in the data validation screening module 403 comprises: calculating the sensitivity value of each data cell, and judging for each data cell whether its text sensitivity is greater than a preset sensitivity threshold; if so, discarding the data cell, i.e. the data cell is not added to the index database; otherwise, determining that the data cell is legal retrieval data.
Preferably, the sensitivity of each data cell is calculated as follows: the sensitivity value of each data cell is determined according to the weights of the sensitivity levels of the sensitive words of different sensitivity levels in the data cell and the weights of the corresponding matching-degree levels; wherein the sensitivity levels include: a high sensitivity level, a medium sensitivity level, and a low sensitivity level; and the matching-degree levels include: a high matching-degree level, a medium matching-degree level, and a low matching-degree level.
Preferably, the system further comprises a sensitive word replacement module, configured to replace the sensitive words in the data cells of the obtained legal retrieval data with a predetermined symbol before the data cells are added to the index database, so as to reduce the search hit rate when sensitive words are used as query conditions.
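Sensitive word replacement can be sketched as below. The text only states that sensitive words are replaced with a predetermined symbol such as *; masking one symbol per character and matching longer words first are assumptions made for this illustration, and `mask_sensitive_words` is a hypothetical name.

```python
import re


def mask_sensitive_words(text, sensitive_words, symbol="*"):
    """Replace every occurrence of each sensitive word with the given symbol."""
    if not sensitive_words:
        return text
    # Try longer words first so that a word containing a shorter sensitive
    # word is masked as a whole.
    ordered = sorted(sensitive_words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in ordered))
    return pattern.sub(lambda m: symbol * len(m.group()), text)


print(mask_sensitive_words("the secret plan", ["secret"]))  # the ****** plan
```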
The data deduplication and cleaning system 400 for information full-text search of this embodiment of the present invention corresponds to the data deduplication and cleaning method 100 for information full-text search of another embodiment of the present invention, and the details are not repeated here.
The present invention has been described with reference to a small number of embodiments. However, as is known to those skilled in the art, and as defined by the appended patent claims, embodiments other than those disclosed above equally fall within the scope of the present invention.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/the/said [device, component, etc.]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

Claims (10)

1. A data deduplication and cleaning method for information full-text search, characterized in that the method comprises:
performing format processing on each data cell in the acquired initial retrieval data, so as to obtain retrieval data of unformatted plain text content;
performing digest calculation on each data cell in the retrieval data of the unformatted plain text content to obtain the digest code of each data cell, and performing deduplication on the retrieval data of the unformatted plain text content according to the digest code of each data cell, so as to obtain deduplicated retrieval data;
performing legitimacy screening on each data cell in the deduplicated retrieval data according to a preset legitimacy screening strategy, so as to obtain legal retrieval data and add it to the index database.
2. The method according to claim 1, characterized in that performing format processing on each data cell in the acquired initial retrieval data to obtain the retrieval data of unformatted plain text content comprises:
separating each data cell in the acquired initial retrieval data according to preset information categories, so as to obtain the retrieval data of unformatted plain text content; wherein the preset information categories include: format descriptors, spaces, special symbols, and text.
3. The method according to claim 1, characterized in that performing digest calculation on each data cell in the retrieval data of the unformatted plain text content to obtain the digest code of each data cell, and performing deduplication on the retrieval data of the unformatted plain text content according to the digest code of each data cell to obtain the deduplicated retrieval data, comprises:
performing CRC digest calculation and MD5 digest calculation on each data cell in the retrieval data of the unformatted plain text content, so as to obtain the CRC digest code and the MD5 digest code of each data cell;
judging in turn whether the CRC digest code of each data cell is in the alternative library;
wherein, if the CRC digest code of a data cell is not in the alternative library, the data cell is stored in the database;
if the CRC digest code of a data cell is in the alternative library, further judging whether the MD5 digest code of the data cell is in the alternative library; if the MD5 digest code of the data cell is not in the alternative library, storing the data cell in the alternative library; otherwise, discarding the data cell directly;
taking the data cells in the alternative library as the deduplicated retrieval data.
4. The method according to claim 1, characterized in that the preset legitimacy screening strategy comprises:
calculating the sensitivity value of each data cell, and judging for each data cell whether its text sensitivity is greater than a preset sensitivity threshold; if so, discarding the data cell, i.e. the data cell is not added to the index database; otherwise, determining that the data cell is legal retrieval data.
5. The method according to claim 4, characterized in that calculating the sensitivity of each data cell comprises:
determining the sensitivity value of each data cell according to the weights of the sensitivity levels of the sensitive words of different sensitivity levels in the data cell and the weights of the corresponding matching-degree levels;
wherein the sensitivity levels include: a high sensitivity level, a medium sensitivity level, and a low sensitivity level; and the matching-degree levels include: a high matching-degree level, a medium matching-degree level, and a low matching-degree level.
6. A data deduplication and cleaning system for information full-text search, characterized in that the system comprises:
a data preprocessing module, configured to perform format processing on each data cell in the acquired initial retrieval data, so as to obtain retrieval data of unformatted plain text content;
a data deduplication processing module, configured to perform digest calculation on each data cell in the retrieval data of the unformatted plain text content to obtain the digest code of each data cell, and to perform deduplication on the retrieval data of the unformatted plain text content according to the digest code of each data cell, so as to obtain deduplicated retrieval data;
a data validation screening module, configured to perform legitimacy screening on each data cell in the deduplicated retrieval data according to a preset legitimacy screening strategy, so as to obtain legal retrieval data and add it to the index database.
7. The system according to claim 6, characterized in that the data preprocessing module performing format processing on each data cell in the acquired initial retrieval data to obtain the retrieval data of unformatted plain text content comprises:
separating each data cell in the acquired initial retrieval data according to preset information categories, so as to obtain the retrieval data of unformatted plain text content; wherein the preset information categories include: format descriptors, spaces, special symbols, and text.
8. The system according to claim 6, characterized in that the data deduplication processing module comprises:
a digest code calculation submodule, configured to perform CRC digest calculation and MD5 digest calculation on each data cell in the retrieval data of the unformatted plain text content, so as to obtain the CRC digest code and the MD5 digest code of each data cell;
a judging submodule, configured to judge in turn whether the CRC digest code of each data cell is in the alternative library;
wherein, if the CRC digest code of a data cell is not in the alternative library, the data cell is stored in the database;
if the CRC digest code of a data cell is in the alternative library, it is further judged whether the MD5 digest code of the data cell is in the alternative library; if the MD5 digest code of the data cell is not in the alternative library, the data cell is stored in the alternative library; otherwise, the data cell is discarded directly;
a deduplicated data determination submodule, configured to take the data cells in the alternative library as the deduplicated retrieval data.
9. The system according to claim 6, characterized in that the preset legitimacy screening strategy in the data validation screening module comprises:
calculating the sensitivity value of each data cell, and judging for each data cell whether its text sensitivity is greater than a preset sensitivity threshold; if so, discarding the data cell, i.e. the data cell is not added to the index database; otherwise, determining that the data cell is legal retrieval data.
10. The system according to claim 9, characterized in that the sensitivity of each data cell is calculated as follows:
the sensitivity value of each data cell is determined according to the weights of the sensitivity levels of the sensitive words of different sensitivity levels in the data cell and the weights of the corresponding matching-degree levels;
wherein the sensitivity levels include: a high sensitivity level, a medium sensitivity level, and a low sensitivity level; and the matching-degree levels include: a high matching-degree level, a medium matching-degree level, and a low matching-degree level.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910280637.3A CN110147363A (en) 2019-04-09 2019-04-09 A kind of the data deduplication method for cleaning and system of information full-text search

Publications (1)

Publication Number Publication Date
CN110147363A true CN110147363A (en) 2019-08-20

Family

ID=67588259

Country Status (1)

Country Link
CN (1) CN110147363A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0346259A2 (en) * 1988-06-07 1989-12-13 International Business Machines Corporation Single data stream architecture for presentation, revision and resource document types
CN102402537A (en) * 2010-09-15 2012-04-04 盛乐信息技术(上海)有限公司 Chinese web page text deduplication system and method
CN102682085A (en) * 2012-04-18 2012-09-19 北京十分科技有限公司 Method for removing duplicated web page
CN106708927A (en) * 2016-11-18 2017-05-24 北京二六三企业通信有限公司 Duplicate removal processing method and duplicate removal processing device for files
CN108280130A (en) * 2017-12-22 2018-07-13 中国电子科技集团公司第三十研究所 A method of finding sensitive data in text big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190820
