CN102937988A - Parallel distributed internet data extract method and system - Google Patents

Parallel distributed internet data extract method and system

Info

Publication number
CN102937988A
CN102937988A CN2012104215747A CN201210421574A CN102937988A CN 102937988 A CN102937988 A CN 102937988A CN 2012104215747 A CN2012104215747 A CN 2012104215747A CN 201210421574 A CN201210421574 A CN 201210421574A CN 102937988 A CN102937988 A CN 102937988A
Authority
CN
China
Prior art keywords
data
pick
code conversion
parallelization
submodule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012104215747A
Other languages
Chinese (zh)
Inventor
杨睿尘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tengyi Science & Technology Development Co Ltd
Original Assignee
Beijing Tengyi Science & Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tengyi Science & Technology Development Co Ltd
Priority to CN2012104215747A
Publication of CN102937988A
Legal status: Pending

Abstract

The invention provides a parallel distributed internet data extraction method and system. The method comprises the following steps: acquiring the sequence of web pages obtained by crawling, sequentially acquiring web page configuration information, and performing data extraction on each web page; performing code conversion on the extracted content; performing data cleaning on the converted content; and judging whether the cleaned content duplicates existing information and, if not, storing the content in a database. The method and system offer high quality and high efficiency.

Description

Parallel distributed internet data extraction method and system
Technical field
The present invention relates to the fields of computer application technology and information technology, and specifically to a parallel distributed internet data extraction method and system.
Background
Today the internet is developing rapidly, and the number of internet users in China is growing explosively. The internet is gradually replacing traditional media (newspapers, books, radio, television, and so on) as the main channel through which people obtain and publish information. At the same time, because the internet is free and open, simple to use, fast to propagate, and has a huge user base, information on it can spread and exert influence quickly. Because the internet plays an increasingly important role, research on internet information is flourishing. Such research first requires extracting and processing massive amounts of web page information in differing formats and converting it into a unified format to facilitate later analysis; second, it requires an extraction technique that is both high-quality and efficient. The parallel distributed internet data extraction system was developed to meet this pressing need.
Summary of the invention
The present invention aims to solve at least one of the above technical problems at least to some extent, or at least to provide a useful commercial alternative. To this end, one object of the invention is to propose a high-quality, high-efficiency parallel distributed internet data extraction method and system.
One aspect of the present invention proposes a parallel distributed internet data extraction method comprising the steps of: acquiring the sequence of web pages obtained by crawling, acquiring the web page configuration information in turn, and performing data extraction on each web page; performing code conversion on the extracted content; performing data cleaning on the converted content; and judging whether the cleaned content duplicates existing information and, if it does not, storing it in a database.
In one embodiment of the method of the invention, the data extraction is performed in a parallel, distributed mode.
In one embodiment of the method of the invention, the code conversion comprises: converting numeric information into integers or floating-point numbers of uniform length; converting all time information into absolute time in a unified format; and converting unit information into unified data units and units of measurement.
In one embodiment of the method of the invention, the data cleaning comprises data cleaning of body text and data cleaning of comments.
Another aspect of the present invention proposes a parallel distributed internet data extraction system comprising: a data extraction module for acquiring the sequence of web pages obtained by crawling, acquiring the web page configuration information in turn, and performing data extraction on each web page; a code conversion module for performing code conversion on the extracted content; a data cleaning module for performing data cleaning on the converted content; a duplicate-judging module for judging whether the cleaned content duplicates existing information; and a storage module that stores the cleaned content in a database if the duplicate-judging module finds no duplication.
In one embodiment of the system of the invention, the data extraction module has a parallel, distributed structure.
In one embodiment of the system of the invention, the code conversion module comprises: a numeric conversion submodule for converting numeric information into integers or floating-point numbers of uniform length; a time conversion submodule for converting all time information into absolute time in a unified format; and a unit conversion submodule for converting unit information into unified data units and units of measurement.
In one embodiment of the system of the invention, the data cleaning module comprises a body-text data cleaning submodule and a comment data cleaning submodule.
First, the method and system of the invention allow the set of target sites to be extended freely through configuration, and the parallel, distributed design safeguards the efficiency and timeliness of data extraction. Second, the invention applies a cleaning mechanism to the extracted data: it filters and purifies the data and removes illegal or meaningless content that may be present, which greatly improves the validity of the extracted data and supports the accuracy and reliability of later analysis. Third, the invention converts the extracted data to a unified data format and character encoding, which greatly simplifies later program processing and data storage. The method and system of the invention therefore have the advantages of high quality and high efficiency.
Additional aspects and advantages of the invention are given in part in the following description; in part they will become apparent from that description or be learned through practice of the invention.
Description of drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the description of the embodiments in conjunction with the following drawings, in which:
Fig. 1 is a flowchart of the parallel distributed internet data extraction method of an embodiment of the invention;
Fig. 2 is a structural block diagram of the parallel distributed internet data extraction system of an embodiment of the invention;
Fig. 3 is a detailed flowchart of the parallel distributed internet data extraction method of an embodiment of the invention; and
Fig. 4 is a structural diagram of the parallel, distributed data extraction module of an embodiment of the invention.
Embodiments
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
In the description of the invention, it should be understood that orientation or positional terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", and "counterclockwise" are based on the orientations or positional relationships shown in the drawings. They are used only to simplify the description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of the invention, "a plurality of" means two or more, unless specifically limited otherwise.
In the invention, unless otherwise expressly specified and limited, terms such as "mounted", "linked", "connected", and "fixed" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or internal between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the invention according to the circumstances.
In the invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may include the two features being in direct contact, or being in contact through another feature between them rather than directly. Moreover, a first feature being "on", "above", or "over" a second feature includes the first feature being directly above or obliquely above the second feature, or simply means that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or simply means that the first feature is at a lower level than the second feature.
The invention belongs to the fields of computer application technology and information technology. It mainly concerns data extraction from crawled web pages, cleaning and filtering of the data, and unification of data formats and character encodings. Data extraction is the foundation and precondition of internet information analysis: every analysis operation builds on extracted data that is clean and has a unified encoding and format.
The main purpose of the invention is to solve three problems: efficient and accurate extraction of massive heterogeneous internet data, cleaning and filtering of the extracted data, and unification of data formats and character encodings. A prominent requirement of analysis based on internet data is timeliness. Because the volume of internet data is enormous and web page structures vary widely, a technique that can extract massive heterogeneous internet data efficiently and accurately is needed to keep the analysis timely, comprehensive, and convenient. The parallel distributed internet data extraction system addresses this need. The extracted data may, however, contain many meaningless or illegal characters or other content, which must be cleaned and filtered out or it will degrade later analysis. Finally, web pages on the internet vary widely and so do the data structures they use; time values alone appear in many formats. The character encodings used by different pages also often differ. To simplify later analysis, the extracted data must therefore be converted to a unified data format and a unified character encoding before it is saved. After this processing, the extracted data finally stored in the database has a single unified format and encoding, which makes program processing more convenient and faster.
The parallel distributed internet data extraction method of the invention, as shown in Fig. 1, comprises the following steps:
Step S1. Acquire the sequence of web pages obtained by crawling, acquire the web page configuration information in turn, and perform data extraction on each web page;
Step S2. Perform code conversion on the extracted content;
Step S3. Perform data cleaning on the converted content;
Step S4. Judge whether the cleaned content duplicates existing information; if it does not, store it in a database.
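Steps S1 to S4 can be read as a simple sequential pipeline. The following is only a minimal sketch under stated assumptions: the function name `run_pipeline` and the five stage callables are hypothetical and do not appear in the patent, which does not disclose concrete interfaces.

```python
from typing import Callable, Iterable, Optional


def run_pipeline(
    pages: Iterable[str],
    extract: Callable[[str], dict],
    convert: Callable[[dict], dict],
    clean: Callable[[dict], Optional[dict]],
    is_duplicate: Callable[[dict], bool],
    store: Callable[[dict], object],
) -> int:
    """Run steps S1-S4 on each crawled page; return how many records were stored."""
    stored = 0
    for page in pages:
        record = extract(page)        # S1: data extraction
        record = convert(record)      # S2: code conversion
        record = clean(record)        # S3: data cleaning (None means "discard")
        if record is None:
            continue
        if not is_duplicate(record):  # S4: store only non-duplicates
            store(record)
            stored += 1
    return stored
```

Passing the stages in as callables keeps the order of the steps fixed while leaving each stage replaceable, which is one way to mirror the module structure described for the system.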
In one embodiment of the method of the invention, the data extraction is performed in a parallel, distributed mode.
In one embodiment of the method of the invention, the code conversion comprises: converting numeric information into integers or floating-point numbers of uniform length; converting all time information into absolute time in a unified format; and converting unit information into unified data units and units of measurement.
In one embodiment of the method of the invention, the data cleaning comprises data cleaning of body text and data cleaning of comments.
The parallel distributed internet data extraction system of the invention, as shown in Fig. 2, comprises the following parts:
Data extraction module 100, which acquires the sequence of web pages obtained by crawling, acquires the web page configuration information in turn, and performs data extraction on each web page;
Code conversion module 200, which performs code conversion on the extracted content;
Data cleaning module 300, which performs data cleaning on the converted content;
Duplicate-judging module 400, which judges whether the cleaned content duplicates existing information; and
Storage module 500, which stores the cleaned content in a database if the duplicate-judging module finds no duplication.
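The patent does not say how duplication is judged. As a hedged illustration only, the sketch below assumes exact-match detection through a SHA-256 fingerprint of title and body, with SQLite standing in for the database; the table layout and the names `open_store`, `fingerprint`, and `store_if_new` are hypothetical.

```python
import hashlib
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (assumed) records table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "fingerprint TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    return conn


def fingerprint(title: str, body: str) -> str:
    """Exact-match fingerprint; an assumption, not the patent's stated method."""
    return hashlib.sha256(f"{title}\x00{body}".encode("utf-8")).hexdigest()


def store_if_new(conn: sqlite3.Connection, title: str, body: str) -> bool:
    """Insert the record unless its fingerprint exists; return True if stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO records (fingerprint, title, body) VALUES (?, ?, ?)",
        (fingerprint(title, body), title, body),
    )
    conn.commit()
    return cur.rowcount == 1
```

Letting the primary-key constraint reject duplicates combines the judging and storing steps into one statement, which avoids a race between a separate lookup and insert when several extraction threads write to the same database.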
In one embodiment of the system of the invention, the data extraction module has a parallel, distributed structure.
In one embodiment of the system of the invention, the code conversion module comprises: a numeric conversion submodule for converting numeric information into integers or floating-point numbers of uniform length; a time conversion submodule for converting all time information into absolute time in a unified format; and a unit conversion submodule for converting unit information into unified data units and units of measurement.
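The three conversion submodules could look roughly like the sketch below. Everything concrete here is an assumption made for illustration: the accepted input formats, the target time format `%Y-%m-%d %H:%M:%S`, the choice of bytes as the unified data unit, and the function names.

```python
import re
from datetime import datetime, timedelta


def to_number(text: str):
    """Numeric conversion: '1,234' becomes int 1234, '3.50' becomes float 3.5."""
    cleaned = text.replace(",", "").strip()
    return float(cleaned) if "." in cleaned else int(cleaned)


_RELATIVE = re.compile(r"^(\d+)\s+(minute|hour|day)s?\s+ago$")
_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}
_ABSOLUTE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y/%m/%d %H:%M")


def to_absolute_time(text: str, now: datetime) -> str:
    """Time conversion: relative or variously formatted times to one absolute format."""
    text = text.strip()
    match = _RELATIVE.match(text)
    if match:
        delta = timedelta(seconds=int(match.group(1)) * _SECONDS[match.group(2)])
        moment = now - delta
    else:
        for fmt in _ABSOLUTE_FORMATS:
            try:
                moment = datetime.strptime(text, fmt)
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"unrecognized time format: {text!r}")
    return moment.strftime("%Y-%m-%d %H:%M:%S")


_BYTES = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def to_bytes(text: str) -> int:
    """Unit conversion: sizes such as '1.5 MB' to a single unit (bytes)."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMG]?B)\s*", text, re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * _BYTES[match.group(2).upper()])
```

The reference time `now` is passed in explicitly so that relative times such as "3 hours ago" are resolved against the time the page was crawled rather than the time the conversion happens to run.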
To help those skilled in the art better understand the technical solution of the invention, it is described further below with reference to Fig. 3 and Fig. 4.
Fig. 3 is a refined flowchart of Fig. 1. As it shows, the overall design of the parallel distributed internet data extraction of the invention can be summarized as follows: the inputs are the extraction configuration information, manually prepared in advance for the page layouts of the target websites, and the web pages that the data crawling system has already fetched to local storage. The data extraction system is started and extracts the information from the crawled pages while cleaning and format-converting the extracted information. The final result, in a uniform format for every page, includes the title, author, publication time, article source, article body, comment author, comment time, commenter address (IP), comment content, and similar information.
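The patent does not disclose the format of the extraction configuration. One plausible reading is a per-site table that maps each output field to an extraction rule. The sketch below assumes regular-expression rules; the site name `news.example.com`, the patterns, and the function `extract_fields` are all hypothetical.

```python
import re

# Hypothetical per-site configuration: field name -> regular expression whose
# first capture group holds the field value.
SITE_CONFIG = {
    "news.example.com": {
        "title": r"<h1[^>]*>(.*?)</h1>",
        "author": r'<span class="author">(.*?)</span>',
        "published": r"<time[^>]*>(.*?)</time>",
    },
}


def extract_fields(site: str, html: str, config: dict = SITE_CONFIG) -> dict:
    """Apply the site's configured rules to one page; missing fields become None."""
    record = {}
    for field, pattern in config[site].items():
        match = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
        record[field] = match.group(1).strip() if match else None
    return record
```

With this layout, supporting a new target site means adding one entry to the configuration rather than changing the extraction code, which matches the claim that target sites can be extended through configuration.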
The following sections describe in turn how the invention implements each of three aspects: efficient and accurate extraction of massive heterogeneous internet data, cleaning and filtering of the extracted data, and unified data format conversion and character encoding conversion.
1. Efficient and accurate extraction of massive heterogeneous internet data
The invention addresses efficient and accurate extraction of massive heterogeneous internet data from two directions. The first is parallelization: a single extraction server starts multiple extraction instances at once, which extract network data in parallel. The second is distribution: the extraction program is deployed on multiple servers at the same time, and the program on each server works independently. Fig. 4 shows the hardware arrangement for parallel distributed internet data extraction: the whole data extraction system is built around a central database, around which multiple extraction servers are deployed, each running multiple extraction threads simultaneously. This program structure and implementation are intended to ensure that massive heterogeneous internet data is extracted efficiently, accurately, and in real time.
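As a rough illustration of the two levels described above, the sketch below runs several extraction threads inside one process and uses a stable hash to decide which server handles a page. The thread pool mirrors the patent's "multiple extraction threads per server"; the hash-based assignment in `shard_for` is purely an assumption, since the patent does not say how pages are divided among servers.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def extract_all(
    pages: Sequence[str],
    extract_one: Callable[[str], object],
    workers: int = 4,
) -> List[object]:
    """Parallelization: run extract_one over pages with several threads.

    Results are returned in the same order as the input pages.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, pages))


def shard_for(page_id: str, server_count: int) -> int:
    """Distribution (assumed scheme): stable page-to-server assignment by CRC32."""
    return zlib.crc32(page_id.encode("utf-8")) % server_count
```

Each server would call `extract_all` only on the pages whose `shard_for` value equals its own index and write the results to the central database.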
2. Cleaning and filtering of the extracted data
The cleaning and filtering module has two parts: cleaning and filtering of body text, and cleaning and filtering of comments.
1) Cleaning and filtering of body text: for example, removing everything in the body other than its text, such as pictures, audio, video, and advertisements.
2) Cleaning and filtering of comments: for example, removing non-text information from the comment content (emoticons, pictures, garbled characters, and so on); removing superfluous spaces and line breaks; discarding comments that are empty or shorter than 4 characters; discarding randomly typed strings with no real meaning, such as "adfpaetljfofdsf"; and discarding comments that contain only punctuation marks, such as "。。。。。" or "......".
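The comment rules listed above can be sketched as a single filter function. The patent gives the rules but not how they are detected, so every detection heuristic here is an assumption: bracketed codes such as `[smile]` are treated as emoticons, U+FFFD marks garbled text, and a single ASCII word of 8 or more letters containing 4 consecutive consonants is treated as random typing (a crude rule that will also reject some real words).

```python
import re
import unicodedata
from typing import Optional

_TAG = re.compile(r"<[^>]+>")                  # leftover markup such as <img ...>
_EMOTICON = re.compile(r"\[[^\[\]]{1,12}\]")   # assumed emoticon codes, e.g. [smile]
_CONSONANT_RUN = re.compile(r"[^aeiou]{4,}")   # heuristic marker of random typing


def _only_punctuation(text: str) -> bool:
    """True if every character is Unicode punctuation or a symbol."""
    return all(unicodedata.category(ch)[0] in "PS" for ch in text)


def _looks_random(text: str) -> bool:
    """Crude heuristic: one long ASCII word with a run of 4+ consonants."""
    return (
        text.isascii()
        and text.isalpha()
        and len(text) >= 8
        and _CONSONANT_RUN.search(text.lower()) is not None
    )


def clean_comment(raw: str) -> Optional[str]:
    """Return the cleaned comment, or None if the comment should be discarded."""
    text = _TAG.sub("", raw)                   # remove pictures and other markup
    text = _EMOTICON.sub("", text)             # remove emoticon codes
    text = text.replace("\ufffd", "")          # remove garbled-character markers
    text = re.sub(r"\s+", " ", text).strip()   # collapse spaces and line breaks
    if len(text) < 4:                          # empty or shorter than 4 characters
        return None
    if _only_punctuation(text) or _looks_random(text):
        return None
    return text
```

The length check runs after markup and whitespace removal, so a comment that consists only of an image or of blank lines is discarded as empty.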
3. Data format unification and code conversion
The valid data obtained by extraction is converted to a unified format so that data formats are uniform throughout the system. The concrete steps are: converting numeric information into integers or floating-point numbers of uniform length; converting all time information into absolute time in a unified format; and converting unit information into unified data units and units of measurement. The character encoding of the extracted data must then be converted, because the extracted pages may each use a different encoding. To simplify the later analysis and storage built on the extracted data, the system converts all extracted data to a single character encoding, generally UTF-8.
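The character encoding step can be illustrated with Python's built-in codecs. This is a minimal sketch under assumptions: the function name `to_utf8`, the idea of trying a declared charset first, and the fallback list (`utf-8`, `gb18030`, `big5`) are illustrative choices, not taken from the patent.

```python
from typing import Optional, Sequence


def to_utf8(
    raw: bytes,
    declared: Optional[str] = None,
    fallbacks: Sequence[str] = ("utf-8", "gb18030", "big5"),
) -> bytes:
    """Decode page bytes with the declared or a fallback charset, re-encode as UTF-8."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong charset or unknown charset name: try the next candidate.
            continue
    # Last resort: keep what can be decoded and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace").encode("utf-8")
```

Trying strict decoders in order is only a heuristic, because some byte sequences are valid in more than one legacy encoding; a charset declared by the page or the HTTP response, when available, is usually the more reliable signal.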
In summary: first, the method and system of the invention allow the set of target sites to be extended freely through configuration, and the parallel, distributed design safeguards the efficiency and timeliness of data extraction. Second, the invention applies a cleaning mechanism to the extracted data: it filters and purifies the data and removes illegal or meaningless content that may be present, which greatly improves the validity of the extracted data and supports the accuracy and reliability of later analysis. Third, the invention converts the extracted data to a unified data format and character encoding, which greatly simplifies later program processing and data storage. The method and system of the invention therefore have the advantages of high quality and high efficiency.
It should be noted that any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the invention belong.
In this specification, a description that refers to "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such references do not necessarily refer to the same embodiment or example, and the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described above, it will be understood that they are exemplary and are not to be construed as limiting the invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention without departing from its principles and purpose.

Claims (8)

1. A parallel distributed internet data extraction method, characterized in that it comprises the steps of:
acquiring the sequence of web pages obtained by crawling, acquiring the web page configuration information in turn, and performing data extraction on each web page;
performing code conversion on the extracted content;
performing data cleaning on the converted content; and
judging whether the cleaned content duplicates existing information and, if it does not, storing it in a database.
2. The parallel distributed internet data extraction method according to claim 1, characterized in that the data extraction is performed in a parallel, distributed mode.
3. The parallel distributed internet data extraction method according to claims 1 and 2, characterized in that the code conversion comprises:
converting numeric information into integers or floating-point numbers of uniform length;
converting all time information into absolute time in a unified format; and
converting unit information into unified data units and units of measurement.
4. The parallel distributed internet data extraction method according to claims 1-3, characterized in that the data cleaning comprises data cleaning of body text and data cleaning of comments.
5. A parallel distributed internet data extraction system, characterized in that it comprises:
a data extraction module for acquiring the sequence of web pages obtained by crawling, acquiring the web page configuration information in turn, and performing data extraction on each web page;
a code conversion module for performing code conversion on the extracted content;
a data cleaning module for performing data cleaning on the converted content;
a duplicate-judging module for judging whether the cleaned content duplicates existing information; and
a storage module that stores the cleaned content in a database if the duplicate-judging module finds no duplication.
6. The parallel distributed internet data extraction system according to claim 5, characterized in that the data extraction module has a parallel, distributed structure.
7. The parallel distributed internet data extraction method according to claims 5 and 6, characterized in that the code conversion module comprises:
a numeric conversion submodule for converting numeric information into integers or floating-point numbers of uniform length;
a time conversion submodule for converting all time information into absolute time in a unified format; and
a unit conversion submodule for converting unit information into unified data units and units of measurement.
8. The parallel distributed internet data extraction method according to claims 5-7, characterized in that the data cleaning module comprises a body-text data cleaning submodule and a comment data cleaning submodule.
CN2012104215747A 2012-10-29 2012-10-29 Parallel distributed internet data extract method and system Pending CN102937988A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012104215747A CN102937988A (en) 2012-10-29 2012-10-29 Parallel distributed internet data extract method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012104215747A CN102937988A (en) 2012-10-29 2012-10-29 Parallel distributed internet data extract method and system

Publications (1)

Publication Number Publication Date
CN102937988A true CN102937988A (en) 2013-02-20

Family

ID=47696885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012104215747A Pending CN102937988A (en) 2012-10-29 2012-10-29 Parallel distributed internet data extract method and system

Country Status (1)

Country Link
CN (1) CN102937988A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679819A (en) * 2014-12-22 2015-06-03 上海钢富电子商务有限公司 Data analysis method and system of spot resources for steel trading industry
CN105912140A (en) * 2016-04-08 2016-08-31 乐视控股(北京)有限公司 Information storage method and device
CN109254967A (en) * 2018-08-29 2019-01-22 河南智慧云大数据有限公司 A kind of depth analysis method and device based on multi-source heterogeneous mass data
CN110059077A (en) * 2019-04-19 2019-07-26 深圳乐信软件技术有限公司 A kind of verification of data method, apparatus, equipment and storage medium
CN115409075A (en) * 2022-11-03 2022-11-29 成都中科合迅科技有限公司 Feature analysis system based on wireless signal analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李学凯: "《面向多任务、多通道并行爬虫的技术研究》", 《中国优秀硕士学位论文全文数据库》 *
龚秋艳: "《并行网络爬虫设计与实现》", 《中国优秀硕士学位论文全文数据库》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679819A (en) * 2014-12-22 2015-06-03 上海钢富电子商务有限公司 Data analysis method and system of spot resources for steel trading industry
CN104679819B (en) * 2014-12-22 2018-03-23 上海找钢网信息科技股份有限公司 The data analysis method and system of steel trade industry stock resource
CN105912140A (en) * 2016-04-08 2016-08-31 乐视控股(北京)有限公司 Information storage method and device
CN109254967A (en) * 2018-08-29 2019-01-22 河南智慧云大数据有限公司 A kind of depth analysis method and device based on multi-source heterogeneous mass data
CN110059077A (en) * 2019-04-19 2019-07-26 深圳乐信软件技术有限公司 A kind of verification of data method, apparatus, equipment and storage medium
CN115409075A (en) * 2022-11-03 2022-11-29 成都中科合迅科技有限公司 Feature analysis system based on wireless signal analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130220