CN107766581A - Method for cleaning duplicate data records for URLs - Google Patents
- Publication number
- CN107766581A CN107766581A CN201711181477.4A CN201711181477A CN107766581A CN 107766581 A CN107766581 A CN 107766581A CN 201711181477 A CN201711181477 A CN 201711181477A CN 107766581 A CN107766581 A CN 107766581A
- Authority
- CN
- China
- Prior art keywords
- data
- url
- record
- crawler
- needing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for cleaning duplicate data records for URLs, comprising: step 1, capturing web page content from the Internet with a web crawler and extracting the required attribute content; step 2, supplying the crawler, via a URL queue, with the URLs of the sites whose data is to be captured; step 3, processing the content captured by the crawler with a data processing module; step 4, storing, with a data storage module, the URL information of the target sites, the data the crawler extracts from the web pages, and the data after processing. The invention obtains data from websites through a web crawler or a site's public API. It can extract unstructured data from web pages, save it as unified local data files, store it in a structured way, and supports collecting pictures, audio, video and attachments, automatically associating attachments with their text. This improves the collection and capture speed of network information as well as the storage speed of the information after capture.
Description
Technical field
The present invention relates to the field of big data, and in particular to a method for cleaning duplicate data records for URLs.
Background technology
Similar terms have appeared over the course of data development, such as ultra-large-scale data and mass data. "Ultra-large-scale" generally denotes data at the GB level (1 GB = 1024 MB), "mass" typically denotes data at the TB level (1 TB = 1024 GB), while today's "big data" refers to data at the PB (1 PB = 1024 TB), EB (1 EB = 1024 PB) or even ZB (1 ZB = 1024 EB) level and above. Gartner predicted in 2013 that the data stored worldwide would reach 1.2 ZB; if these data were burned onto CD-R discs and the discs stacked, the pile would reach five times the distance from the Earth to the Moon. Behind these different scales lie different technical problems, challenges and research puzzles.
Big data refers to data sets that cannot be captured, managed and processed with conventional software tools within a reasonable time. It is a massive, high-growth-rate and diversified information asset that requires new processing models to deliver stronger decision-making power, insight and process-optimization capability. In the fast-changing IT industry, every enterprise has its own interpretation of big data, but it is commonly agreed that big data has four "V" characteristics, namely Volume (large capacity), Variety (many kinds), Velocity (high speed) and, most importantly, Value (low value density):
(1) Volume (large quantity). Data magnitude has grown from TB (2^10 GB) to PB (2^10 TB) and even ZB (2^20 PB), and can be called massive or even excessive.
(2) Variety (diverse types). Data types are various; an increasing share is semi-structured and unstructured data such as web pages, pictures, video, images and location information.
(3) Velocity (high speed). Data streams are often high-speed real-time flows that generally require fast, continuous, real-time processing; the processing tools themselves are also evolving rapidly, with software engineering, artificial intelligence and the like increasingly involved.
(4) Value (high value, low density). Take video surveillance as an example: in a continuous monitoring stream, the truly valuable part may be only one or two seconds of the data flow; yet at a "blind spot" of a 360-degree panoramic video monitor, the most valuable image information may be mined.
(5) Complexity: processing and analysis are very difficult.
Network data is large in quantity and messy in content, and the capture methods of existing big data acquisition techniques for network information are complicated and time-consuming.
Summary of the invention
The technical problem to be solved by the invention is that network data is large in quantity and messy in content, and that the capture methods of existing big data acquisition techniques for network information are complicated and time-consuming. The object of the present invention is to provide a method for cleaning duplicate data records for URLs that improves the capture of network information and its storage speed.
The present invention is achieved through the following technical solutions:
A method for cleaning duplicate data records for URLs, comprising:
Step 1: capturing web page content from the Internet with a web crawler and extracting the required attribute content;
Step 2: supplying the crawler, via a URL queue, with the URLs of the sites whose data is to be captured; the URLs are first pre-processed, duplicate record detection is then realized through field matching and record matching, a database-level duplicate-record detection algorithm clusters the duplicate records in the whole data set, and the duplicate records detected within the same duplicate cluster are merged or deleted according to compatibility rules, keeping only the one correct record;
Step 3: processing the content captured by the crawler with a data processing module;
Step 4: storing, with a data storage module, the URL information of the sites whose data is to be captured, the data the crawler extracts from the web pages, and the data after processing.
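The URL pre-processing mentioned in step 2 can be illustrated with a minimal Python sketch. The patent does not fix a normalization rule, so the choices here (lower-casing scheme and host, sorting query parameters, dropping the fragment) are assumptions; they merely show how trivially different spellings of the same URL can be made to compare equal before duplicate detection.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url):
    """Pre-process a URL so that trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    query = urlencode(sorted(parse_qsl(parts.query)))  # canonical parameter order
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", query, ""))  # drop the #fragment

def dedupe_urls(urls):
    """Keep only the first occurrence of each normalized URL."""
    seen, unique = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

With this normalization, `HTTP://Example.com/a?b=2&a=1#x` and `http://example.com/a?a=1&b=2` map to the same key, so only one of them enters the crawl queue.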
Further, in the method for cleaning duplicate data records for URLs, step 1 comprises:
Step 11: writing the URL information of the sites whose data is to be captured into the URL queue;
Step 12: the crawler obtains the site URL information from the URL queue;
Step 13: the crawler captures the corresponding web page content from the Internet and extracts the content values of particular attributes;
Step 14: the crawler writes the data extracted from the web pages into the database;
Step 15: the data processor (DP) reads the crawled data (SpiderDATA) and processes it;
Step 16: the data processor writes the processed data into the database.
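Steps 11 to 16 can be sketched as a single loop. The patent does not specify a fetcher, extractor or database, so `fetch` and `extract` are caller-supplied placeholders here and plain lists stand in for the two database tables; the whitespace-stripping pass for step 15 is likewise only an illustrative stand-in for the DP's processing.

```python
from collections import deque

def crawl(seed_urls, fetch, extract):
    """Steps 11-16 as one loop: URL queue -> crawler -> database tables.

    `fetch(url)` returns raw page content; `extract(html)` returns the
    attribute values of interest.  Both are assumptions supplied by the
    caller, since the patent fixes no concrete implementation.
    """
    url_queue = deque(seed_urls)          # step 11: URLs written to the queue
    spider_data, processed = [], []       # stand-ins for the tables of steps 14/16
    while url_queue:
        url = url_queue.popleft()         # step 12: crawler takes a URL
        html = fetch(url)                 # step 13: fetch the page content
        record = extract(html)            # step 13: extract attribute values
        spider_data.append(record)        # step 14: raw crawl data stored
    for record in spider_data:            # step 15: DP reads SpiderDATA
        processed.append({k: v.strip() for k, v in record.items()})
    return spider_data, processed         # step 16: processed data stored
```

A toy run with a dictionary in place of the network shows the two tables side by side:

```python
pages = {"http://example.com": "<h1> Title </h1>"}
raw, clean = crawl(["http://example.com"], pages.get, lambda h: {"title": h[4:-5]})
```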
Further, in the method for cleaning duplicate data records for URLs, the processing performed by the data processing module in step 3 includes data cleaning, data de-noising and further integrated storage.
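As a rough illustration of that three-stage pass, the sketch below drops empty records (cleaning), collapses stray whitespace (de-noising), and de-duplicates before storage (integrated storage). The field names `url` and `text` and the concrete rules are assumptions, not part of the patent.

```python
def process(records):
    """Illustrative data-processing pass: cleaning, de-noising, then one
    integrated, de-duplicated structure ready for storage."""
    cleaned = []
    for rec in records:
        # cleaning: drop records with no usable content
        if not rec.get("text"):
            continue
        # de-noising: collapse runs of whitespace and newlines
        text = " ".join(rec["text"].split())
        cleaned.append({"url": rec.get("url", ""), "text": text})
    # integrated storage: keep one record per (url, text) pair before writing out
    return list({(r["url"], r["text"]): r for r in cleaned}.values())
```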
The present invention obtains data from websites through a web crawler or a site's public API. It can extract unstructured data from web pages, save it as unified local data files and store it in a structured way, and it supports collecting files such as pictures, audio, video and attachments, automatically associating attachments with their text. This improves the collection and capture speed of network information as well as the storage speed of the information after capture.
In the course of constructing a data warehouse, a great deal of data must be imported from various data sources. Ideally, an entity in the real world should correspond to only one record in the database or data warehouse. However, when multiple data sources representing the same kind of information are integrated, problems in the real data such as entry errors and differences in format or spelling may prevent the DBMS from correctly recognizing multiple records that identify the same entity, so that one and the same real-world subject may have several different representations in the warehouse. Through pre-processing, duplicate record detection, database-level clustering of duplicate records and conflict handling, the present invention simplifies existing duplicate-data cleaning approaches and improves cleaning efficiency.
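The field-matching, clustering and merge stages described above can be sketched as follows. The patent leaves the matching criterion and the compatibility rules unspecified, so the case-insensitive equality test on chosen key fields and the "keep the most complete record" merge rule are illustrative assumptions only.

```python
def fields_match(a, b, key_fields):
    """Field-level match: the records agree on every key field (case-insensitive)."""
    return all(str(a.get(f, "")).lower() == str(b.get(f, "")).lower()
               for f in key_fields)

def cluster_duplicates(records, key_fields):
    """Group records whose key fields match into duplicate clusters."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if fields_match(rec, cluster[0], key_fields):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

def clean(records, key_fields):
    """Keep one record per cluster; picking the record with the most
    non-empty fields stands in for the patent's compatibility rules."""
    return [max(c, key=lambda r: sum(1 for v in r.values() if v))
            for c in cluster_duplicates(records, key_fields)]
```

For example, `{"name": "Ann", "city": ""}` and `{"name": "ann", "city": "Rome"}` fall into one cluster on the key field `name`, and only the more complete second record survives.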
Compared with the prior art, the present invention has the following advantages and beneficial effects. In step 1, web page content is captured from the Internet with a web crawler and the required attribute content is extracted; in step 2, the crawler is supplied, via a URL queue, with the URLs of the sites whose data is to be captured; in step 3, the captured content is processed by a data processing module; in step 4, a data storage module stores the URL information of the target sites, the data the crawler extracts from the web pages, and the data after processing. Data is obtained from websites through a web crawler or a site's public API; unstructured data can be extracted from web pages, saved as unified local data files and stored in a structured way; files such as pictures, audio, video and attachments are collected, with attachments automatically associated with their text. This improves the collection and capture speed of network information as well as the storage speed of the information after capture.
Embodiment
To make the object, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to embodiments. The exemplary embodiments of the invention and their explanation are only used to explain the present invention and are not intended as a restriction of the invention.
Embodiment
A method for cleaning duplicate data records for URLs, comprising:
Step 1: capturing web page content from the Internet with a web crawler and extracting the required attribute content;
Step 2: supplying the crawler, via a URL queue, with the URLs of the sites whose data is to be captured; the URLs are first pre-processed, duplicate record detection is then realized through field matching and record matching, a database-level duplicate-record detection algorithm clusters the duplicate records in the whole data set, and the duplicate records detected within the same duplicate cluster are merged or deleted according to compatibility rules, keeping only the one correct record;
Step 3: processing the content captured by the crawler with a data processing module;
Step 4: storing, with a data storage module, the URL information of the sites whose data is to be captured, the data the crawler extracts from the web pages, and the data after processing.
Step 1 includes:
Step 11: writing the URL information of the sites whose data is to be captured into the URL queue;
Step 12: the crawler obtains the site URL information from the URL queue;
Step 13: the crawler captures the corresponding web page content from the Internet and extracts the content values of particular attributes;
Step 14: the crawler writes the data extracted from the web pages into the database;
Step 15: the data processor (DP) reads the crawled data (SpiderDATA) and processes it;
Step 16: the data processor writes the processed data into the database.
In step 3, the processing performed by the data processing module includes data cleaning, data de-noising and further integrated storage.
The above embodiments further describe the object, technical solution and beneficial effects of the present invention in detail. It should be understood that the foregoing is only an embodiment of the present invention and is not intended to limit its scope of protection; any modification, equivalent substitution, improvement and the like made within the spirit and principles of the invention shall be included within the protection scope of the present invention.
Claims (3)
1. A method for cleaning duplicate data records for URLs, characterized by comprising:
Step 1: capturing web page content from the Internet with a web crawler and extracting the required attribute content;
Step 2: supplying the crawler, via a URL queue, with the URLs of the sites whose data is to be captured; the URLs are first pre-processed, duplicate record detection is then realized through field matching and record matching, a database-level duplicate-record detection algorithm clusters the duplicate records in the whole data set, and the duplicate records detected within the same duplicate cluster are merged or deleted according to compatibility rules, keeping only the one correct record;
Step 3: processing the content captured by the crawler with a data processing module;
Step 4: storing, with a data storage module, the URL information of the sites whose data is to be captured, the data the crawler extracts from the web pages, and the data after processing.
2. The method for cleaning duplicate data records for URLs according to claim 1, characterized in that step 1 comprises:
Step 11: writing the URL information of the sites whose data is to be captured into the URL queue;
Step 12: the crawler obtains the site URL information from the URL queue;
Step 13: the crawler captures the corresponding web page content from the Internet and extracts the content values of particular attributes;
Step 14: the crawler writes the data extracted from the web pages into the database;
Step 15: the data processor (DP) reads the crawled data (SpiderDATA) and processes it;
Step 16: the data processor writes the processed data into the database.
3. The method for cleaning duplicate data records for URLs according to claim 1, characterized in that in step 3 the processing performed by the data processing module includes data cleaning, data de-noising and further integrated storage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711181477.4A CN107766581A (en) | 2017-11-23 | 2017-11-23 | The method that Data duplication record cleaning is carried out to URL |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107766581A true CN107766581A (en) | 2018-03-06 |
Family
ID=61280322
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711181477.4A Withdrawn CN107766581A (en) | 2017-11-23 | 2017-11-23 | The method that Data duplication record cleaning is carried out to URL |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107766581A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106547914A (en) * | 2016-11-25 | 2017-03-29 | 国信优易数据有限公司 | A kind of data acquisition management system and its method |
CN106776768A (en) * | 2016-11-23 | 2017-05-31 | 福建六壬网安股份有限公司 | A kind of URL grasping means of distributed reptile engine and system |
CN107273409A (en) * | 2017-05-03 | 2017-10-20 | 广州赫炎大数据科技有限公司 | A kind of network data acquisition, storage and processing method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Papadopoulou et al. | A corpus of debunked and verified user-generated videos | |
Feng et al. | Satar: A self-supervised approach to twitter account representation learning and its application in bot detection | |
CN107992764B (en) | Sensitive webpage identification and detection method and device | |
CN108234462A (en) | A kind of method that intelligent intercept based on cloud protection threatens IP | |
CN111953697B (en) | APT attack recognition and defense method | |
US9563770B2 (en) | Spammer group extraction apparatus and method | |
Huggett | Digital haystacks: Open data and the transformation of archaeological knowledge | |
CN103927398A (en) | Microblog hype group discovering method based on maximum frequent item set mining | |
CN105718590A (en) | Multi-tenant oriented SaaS public opinion monitoring system and method | |
CN103455597B (en) | Distributed information towards magnanimity web graph picture hides detection method | |
MX2011005771A (en) | Method and device for intercepting spam. | |
CN107070897A (en) | Network log storage method based on many attribute Hash duplicate removals in intruding detection system | |
CN106844588A (en) | A kind of analysis method and system of the user behavior data based on web crawlers | |
CN105069158B (en) | Data digging method and system | |
CN107832450A (en) | Method for cleaning Data duplication record | |
CN107992533A (en) | A kind of network data acquisition method | |
Yang et al. | Hadoop-based dark web threat intelligence analysis framework | |
CN109194605A (en) | A kind of suspected threat index Proactive authentication method and system based on open source information | |
CN108595466A (en) | A kind of filtering of internet information and Internet user's information and net note structure analysis method | |
CN107895032A (en) | Carry out the network data acquisition method that data are tentatively cleaned | |
CN107766581A (en) | The method that Data duplication record cleaning is carried out to URL | |
CN100367295C (en) | Intelligent imaging implicit writting analytical system based on three-layer frame | |
CN107992534A (en) | The method that improved sort key sorts data set | |
Wang | Research on the collection method of financial blockchain risk prompt information from sandbox perspective | |
Babu et al. | Examining login urls to identify phishing threats |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20180306 |