CN108133027A - An automatic machine classification method based on web crawlers - Google Patents

An automatic machine classification method based on web crawlers

Info

Publication number
CN108133027A
Authority
CN
China
Prior art keywords
data
network address
data processing
captured
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711461953.8A
Other languages
Chinese (zh)
Inventor
梁镇爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Tone Communication Technology Qingdao Co Ltd
Original Assignee
Global Tone Communication Technology Qingdao Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Tone Communication Technology Qingdao Co Ltd
Priority to CN201711461953.8A
Publication of CN108133027A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972: Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The present invention provides an automatic machine classification method based on web crawlers, comprising: setting up a data processing container; selecting seed URLs; determining whether each URL has already been collected; collecting data with a web crawler; sending the collected data to the data processing container; and storing the data, as classified by the data processing container, in a database by category. The invention greatly improves data quality and reduces the waste of manpower and material resources.

Description

An automatic machine classification method based on web crawlers
Technical field
The present invention relates to an automatic machine classification method based on the web crawler principle.
Background
The tide of informatization has swept the globe. With the spread of the internet and the continuous improvement of network technology, the internet has become the largest and richest information resource in the world. Because the internet is open, all kinds of information can be published to it in many forms as soon as they appear; precisely because of this openness, information on the internet is redundant and disorganized. Automatic classification technology has therefore developed rapidly to meet the demands of the data age. As an effective information processing method, automatic classification organizes all kinds of information according to a given classification scheme, which greatly improves the efficiency with which users gather information and reduces the considerable waste of resources caused by manual classification.
Summary of the invention
The technical problem to be solved by the invention is to provide an automatic machine classification method based on web crawlers that can organize redundant, disordered data into orderly groups of similar content.
The technical solution adopted by the invention to solve this problem is an automatic machine classification method based on web crawlers, comprising the following steps:
(1) Set up a data processing container;
(2) Select a number of seed URLs and put them into the to-be-crawled URL queue;
(3) Determine whether each URL in the queue has already been collected; if so, send it directly to the collected-URL queue; if not, proceed to the next step;
(4) Use a web crawler to collect data from the to-be-crawled URL queue;
(5) Send the collected data to the data processing container, which classifies the data;
(6) Store the data classified by the data processing container in the database by category.
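The flow of steps (2) to (6) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `run_pipeline`, the injected `fetch` and `classify` callables, and the use of an in-memory dictionary in place of the database are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def run_pipeline(
    seed_urls: Iterable[str],
    already_collected: Set[str],
    fetch: Callable[[str], str],
    classify: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Walk through steps (2)-(6) once for a fixed set of seed URLs."""
    to_crawl = deque(seed_urls)          # step (2): to-be-crawled URL queue
    collected = set(already_collected)   # URLs collected in earlier runs
    store: Dict[str, List[str]] = {}     # hypothetical stand-in for the database

    while to_crawl:
        url = to_crawl.popleft()
        if url in collected:             # step (3): skip URLs already collected
            continue
        text = fetch(url)                # step (4): data collection by the crawler
        collected.add(url)
        category = classify(text)        # step (5): the data processing container
        store.setdefault(category, []).append(url)  # step (6): store by category
    return store


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without network access.
    pages = {
        "http://example.com/a": "election results announced",
        "http://example.com/b": "football match tonight",
    }
    result = run_pipeline(
        seed_urls=list(pages) + ["http://example.com/a"],
        already_collected=set(),
        fetch=lambda u: pages[u],
        classify=lambda t: "sports" if "football" in t else "politics",
    )
    print(result)
```

Passing `fetch` and `classify` in as parameters mirrors the separation the text describes: the crawler only collects, and classification is delegated to a separate component.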
Further, the data processing container is a data processing model that has been trained in advance for information classification. Information classification training means that data for predefined categories is manually screened and labeled, and the labeled data is then used to train the data processing model.
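The source does not say which kind of model is trained. As one possible illustration only, the sketch below trains a small multinomial naive Bayes text classifier on manually labeled examples; the choice of naive Bayes, the class name `NaiveBayesTextClassifier`, the whitespace tokenizer, and the sample sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        self.vocabulary: Set[str] = set()

    @staticmethod
    def tokenize(text: str) -> List[str]:
        # Naive whitespace tokenizer; real text would need proper segmentation.
        return text.lower().split()

    def train(self, labeled_docs: Iterable[Tuple[str, str]]) -> None:
        """Learn word statistics from (text, label) pairs labeled by hand."""
        for text, label in labeled_docs:
            tokens = self.tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text: str) -> Optional[str]:
        """Return the most probable label, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        # The extra +1 reserves probability mass for words never seen in training.
        smoothing_slots = len(self.vocabulary) + 1
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in self.tokenize(text):
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + smoothing_slots))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = NaiveBayesTextClassifier()
    model.train([
        ("government passes election law", "politics"),
        ("central bank cuts interest rates", "economy"),
        ("team wins football match", "sports"),
    ])
    print(model.predict("football team wins"))
```

Any supervised text classifier could take this role; the only property the description relies on is that the model is trained beforehand on labeled data and then assigns a category to each collected document.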
Further, the step of collecting data with the web crawler comprises:
(4.1) Take the URL to be crawled from the to-be-crawled URL queue;
(4.2) Resolve the URL's host name through DNS to obtain the IP address of the host;
(4.3) Download the web page corresponding to the URL and store it in the downloaded-page library;
(4.4) Put the URLs from the crawled-URL queue into the to-be-crawled URL queue, thereby entering the next cycle.
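Sub-steps (4.1) to (4.4) correspond to a conventional crawl loop, sketched below with the Python standard library. The function names, the `max_pages` limit, and the injected `extract_links` callable are hypothetical. Step (4.4) is interpreted here as queueing links discovered in downloaded pages, which is an assumption about the intended meaning.

```python
import socket
import urllib.request
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Set, Tuple
from urllib.parse import urlsplit


def resolve_host(url: str) -> str:
    """Step (4.2): resolve the URL's host name to an IP address via DNS."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    return socket.gethostbyname(host)


def download_page(url: str) -> str:
    """Step (4.3): download the page body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    to_crawl: Deque[str],
    extract_links: Callable[[str, str], Iterable[str]],
    resolve: Callable[[str], str] = resolve_host,
    download: Callable[[str], str] = download_page,
    max_pages: int = 10,
) -> Dict[str, Tuple[str, str]]:
    """Run the crawl loop; returns url -> (ip_address, page_text)."""
    downloaded: Dict[str, Tuple[str, str]] = {}  # the downloaded-page library
    seen: Set[str] = set(to_crawl)               # URLs queued at least once
    while to_crawl and len(downloaded) < max_pages:
        url = to_crawl.popleft()                 # (4.1) take the next URL
        ip_address = resolve(url)                # (4.2) DNS lookup
        page = download(url)                     # (4.3) download and store
        downloaded[url] = (ip_address, page)
        for link in extract_links(url, page):    # (4.4) queue newly found URLs
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return downloaded
```

The `resolve` and `download` parameters default to real network calls but can be replaced with stubs, which keeps the loop testable offline.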
The beneficial effects of the invention are as follows. In implementing the web crawler, the invention defines a custom URL container that is responsible only for storing data, so that it has a single job. The data processing container is an automatic classification model trained on a large amount of data; the model classifies the data collected from websites and stores it in the corresponding database by category, thereby bringing disordered data into a reasonable order. This approach not only reduces the amount of useless data entering storage but also ensures that data is stored in an orderly way. Throughout the process the crawler concentrates only on collection; the collected data is sent to the data processing container, which performs the automatic classification and the storage operations. The invention greatly improves data quality and reduces the waste of manpower and material resources.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Detailed description
Specific embodiments of the invention are described in further detail below with reference to the drawing. Note that the specific embodiments only describe the technical solution of the invention in detail and do not limit the invention.
As shown in Fig. 1, the automatic machine classification method based on web crawlers of the present invention comprises the following steps:
(1) Set up a data processing container. The data processing container is a data processing model that has been trained in advance for information classification. Information classification training means that data for predefined categories is prepared, where the categories can be chosen according to the content of the information, for example politics, economy, or sports; the data is manually screened and labeled, and the labeled data is then used to train the data processing model.
(2) Select a number of seed URLs and put them into the to-be-crawled URL queue.
(3) Determine whether each URL in the queue has already been collected; if so, send it directly to the collected-URL queue; if not, proceed to the next step.
(4) Use a web crawler to collect data from the to-be-crawled URL queue:
(4.1) Take the URL to be crawled from the to-be-crawled URL queue;
(4.2) Resolve the URL's host name through DNS to obtain the IP address of the host;
(4.3) Download the web page corresponding to the URL and store it in the downloaded-page library;
(4.4) Put the URLs from the crawled-URL queue into the to-be-crawled URL queue, thereby entering the next cycle.
(5) Send the collected data to the data processing container, which classifies the data.
(6) Store the data classified by the data processing container in the database by category.
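Step (6), storage by category, could look roughly like the following sketch using SQLite. The single `documents` table with a `category` column, the function names, and the in-memory default are assumptions; the source only states that classified data is stored in the database according to its category.

```python
import sqlite3
from typing import Iterable, List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical store with one row per URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " url TEXT PRIMARY KEY,"
        " category TEXT NOT NULL,"
        " content TEXT NOT NULL)"
    )
    # Index on category so that per-category lookups stay cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_category ON documents(category)")
    return conn


def store_classified(
    conn: sqlite3.Connection, items: Iterable[Tuple[str, str, str]]
) -> None:
    """Insert (url, category, content) rows; a repeated URL overwrites its row."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO documents (url, category, content)"
            " VALUES (?, ?, ?)",
            items,
        )


def urls_in_category(conn: sqlite3.Connection, category: str) -> List[str]:
    """Return the URLs stored under one category, sorted alphabetically."""
    rows = conn.execute(
        "SELECT url FROM documents WHERE category = ? ORDER BY url", (category,)
    )
    return [row[0] for row in rows]
```

An alternative reading of the text is one database or table per category; the single-table layout is used here only because it keeps the example short.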

Claims (3)

1. An automatic machine classification method based on web crawlers, characterized in that it comprises the following steps:
(1) Set up a data processing container;
(2) Select a number of seed URLs and put them into the to-be-crawled URL queue;
(3) Determine whether each URL in the queue has already been collected; if so, send it directly to the collected-URL queue; if not, proceed to the next step;
(4) Use a web crawler to collect data from the to-be-crawled URL queue;
(5) Send the collected data to the data processing container, which classifies the data;
(6) Store the data classified by the data processing container in the database by category.
2. The automatic machine classification method based on web crawlers according to claim 1, characterized in that the data processing container is a data processing model that has been trained in advance for information classification, where information classification training means that data for predefined categories is manually screened and labeled, and the labeled data is then used to train the data processing model.
3. The automatic machine classification method based on web crawlers according to claim 1, characterized in that the step of collecting data with the web crawler comprises:
(4.1) Take the URL to be crawled from the to-be-crawled URL queue;
(4.2) Resolve the URL's host name through DNS to obtain the IP address of the host;
(4.3) Download the web page corresponding to the URL and store it in the downloaded-page library;
(4.4) Put the URLs from the crawled-URL queue into the to-be-crawled URL queue, thereby entering the next cycle.
CN201711461953.8A 2017-12-28 2017-12-28 A kind of machine automatic classification method based on web crawlers Pending CN108133027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711461953.8A CN108133027A (en) 2017-12-28 2017-12-28 A kind of machine automatic classification method based on web crawlers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711461953.8A CN108133027A (en) 2017-12-28 2017-12-28 A kind of machine automatic classification method based on web crawlers

Publications (1)

Publication Number Publication Date
CN108133027A true CN108133027A (en) 2018-06-08

Family

ID=62393662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711461953.8A Pending CN108133027A (en) 2017-12-28 2017-12-28 A kind of machine automatic classification method based on web crawlers

Country Status (1)

Country Link
CN (1) CN108133027A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753171A (en) * 2020-06-09 2020-10-09 北京天空卫士网络安全技术有限公司 Malicious website identification method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1437140A (en) * 2002-02-05 2003-08-20 国际商业机器公司 Method and system for queuing uncalled web based on path
CN101101601A (en) * 2007-07-10 2008-01-09 北京大学 Subject crawling method based on link hierarchical classification in network search
CN101751438A (en) * 2008-12-17 2010-06-23 中国科学院自动化研究所 Theme webpage filter system for driving self-adaption semantics
CN105447018A (en) * 2014-08-20 2016-03-30 阿里巴巴集团控股有限公司 Method and apparatus for verifying web page classification model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753171A (en) * 2020-06-09 2020-10-09 北京天空卫士网络安全技术有限公司 Malicious website identification method and device
CN111753171B (en) * 2020-06-09 2024-04-26 北京天空卫士网络安全技术有限公司 Malicious website identification method and device

Similar Documents

Publication Publication Date Title
CN104077402B (en) Data processing method and data handling system
CN104135498B (en) A kind of cross-platform information transmission system and its method for pushing
CN106951925A (en) Data processing method, device, server and system
CN109948506A (en) A kind of multi-angle garbage classification cloud platform based on deep learning
CN104899261B (en) A kind of apparatus and method for building structuring video image information
CN105915438A (en) Message pushing method, apparatus, and system
CN104850662A (en) User portrait based mobile terminal intelligent message pushing method, server and system
CN107358194A (en) A kind of violence sorting express delivery determination methods based on computer vision
CN109117879A (en) Image classification method, apparatus and system
CN101419614A (en) Video resource clustering method and device
CN110288007A (en) The method, apparatus and electronic equipment of data mark
CN109857862A (en) File classification method, device, server and medium based on intelligent decision
CN111368895A (en) Garbage bag target detection method and detection system in wet garbage
CN109934255A (en) A kind of Model Fusion method for delivering object Classification and Identification suitable for beverage bottle recycling machine
CN107957940A (en) A kind of test log processing method, system and terminal
CN108650195A (en) A kind of APP flows automatic identification model building method
CN108241662A (en) The optimization method and device of data mark
CN105791543A (en) Method, device, client and system for cleaning short messages
CN108133027A (en) A kind of machine automatic classification method based on web crawlers
CN107087006A (en) A kind of agreement shunt method, system and server
CN110414321A (en) The method and system of automatic identification shaking video
CN103475532B (en) Hardware detection method and system
CN102902794B (en) Web page classification system and method
CN104021170B (en) A kind of information acquisition method and cloud server
CN105956069A (en) Network information collection and analysis method and network information collection and analysis system

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180608