CN108133027A

CN108133027A - A machine automatic classification method based on web crawlers

Info

Publication number: CN108133027A
Application number: CN201711461953.8A
Authority: CN
Inventors: 梁镇爽
Original assignee: Global Tone Communication Technology Qingdao Co Ltd
Current assignee: Global Tone Communication Technology Qingdao Co Ltd
Priority date: 2017-12-28
Filing date: 2017-12-28
Publication date: 2018-06-08

Abstract

The present invention provides a machine automatic classification method based on web crawlers, comprising: setting up a data processing container; selecting seed URLs and determining whether each URL has already been collected; collecting data with a web crawler; sending the collected data to the data processing container; and having the data processing container store the data in a data store according to its category. The invention greatly improves data quality and reduces the waste of manpower and material resources.

Description

A machine automatic classification method based on web crawlers

Technical field

The present invention relates to a machine automatic classification method based on the web crawler principle.

Background

The tide of informatization has swept the globe. With the spread of the Internet and the continuous improvement of network technology, the Internet has become the largest and richest information resource in the world. Because the Internet is open, all kinds of information can be published to it immediately and in many forms, and it is precisely this openness that makes the information redundant and disorderly. Automatic classification technology has therefore developed rapidly to meet the demands of the data age. As an effective information processing method, automatic classification organizes information according to a given taxonomy, which greatly improves the efficiency with which users gather information and reduces the large waste of resources caused by manual classification.

Summary of the invention

The technical problem to be solved by the invention is to provide a machine automatic classification method based on web crawlers that can organize redundant, disorderly data into orderly groups of similar content.

The technical solution adopted by the invention to solve this problem is a machine automatic classification method based on web crawlers, comprising the following steps:

(1) Set up a data processing container;

(2) Select a set of seed URLs and put them into the queue of URLs to be crawled;

(3) Determine whether these URLs have already been collected; if so, send them directly to the collected-URL queue; if not, proceed to the next step;

(4) Use a web crawler to collect data from the queue of URLs to be crawled;

(5) Send the collected data to the data processing container, which classifies it;

(6) Store the data classified by the data processing container in the data store according to its category.
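
The six steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the patent's implementation: `run_pipeline`, `fetch`, and `classify` are hypothetical names, the crawler and the trained model are passed in as plain functions, and an in-memory dictionary stands in for the data store.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, Iterable, List, Set


def run_pipeline(
    seeds: Iterable[str],
    collected: Set[str],
    fetch: Callable[[str], str],
    classify: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Crawl seed URLs, classify each result, and store it by category.

    `classify` plays the role of the data processing container from step (1).
    """
    to_crawl = deque(seeds)                          # step (2): URLs to be crawled
    store: Dict[str, List[str]] = defaultdict(list)  # stand-in for the data store

    while to_crawl:
        url = to_crawl.popleft()
        if url in collected:                         # step (3): already collected
            continue
        text = fetch(url)                            # step (4): data collection
        collected.add(url)
        category = classify(text)                    # step (5): classification
        store[category].append(text)                 # step (6): store by category
    return dict(store)
```

Passing the fetcher and classifier as arguments keeps the crawler limited to collection, which matches the separation of duties the method describes.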

Further, the data processing container is a data processing model that has been trained in advance for information classification. Information classification training means that data for predefined categories is manually screened and labeled, and the data processing model is then trained on this data.
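
The patent does not name a model type, so the following is only a sketch of what training on manually labeled data could look like. It assumes a small multinomial naive Bayes classifier over whitespace-separated tokens; the class name `NaiveBayesClassifier` and the sample texts are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


class NaiveBayesClassifier:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()

    def train(self, labelled: List[Tuple[str, str]]) -> None:
        """Learn word frequencies from (text, label) pairs labeled by hand."""
        for text, label in labelled:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text: str) -> str:
        """Return the label with the highest log-probability for `text`."""
        if not self.class_counts:
            raise ValueError("classifier has not been trained")
        total_docs = sum(self.class_counts.values())
        best_label, best_score = "", -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)          # class prior
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]      # 0 for unseen words
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labeled training data.
training_data = [
    ("election parliament vote", "politics"),
    ("minister policy election", "politics"),
    ("stock market inflation", "economy"),
    ("bank interest market", "economy"),
    ("football match goal", "sports"),
    ("tennis match champion", "sports"),
]
model = NaiveBayesClassifier()
model.train(training_data)
```

A real deployment would need far more labeled data and proper tokenization, especially for Chinese text, where splitting on whitespace is not sufficient.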

Further, the step of collecting data with the web crawler comprises:

(4.1) Take a URL to be crawled from the queue of URLs to be crawled;

(4.2) Resolve the URL's host name through DNS to obtain the host's IP address;

(4.3) Download the web page corresponding to the URL and store it in the downloaded-page library;

(4.4) Put the URLs captured from the crawled-URL queue into the queue of URLs to be crawled, thereby entering the next cycle.
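
Steps (4.1) to (4.4) can be sketched as a single loop. This is a simplified illustration using only the Python standard library. It assumes that step (4.4) means extracting links from the downloaded pages and queuing them, which is one reading of the text. The names `crawl`, `resolve_host`, `download`, and `LinkParser` are hypothetical.

```python
import socket
import urllib.request
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Deque, Dict, List, Set
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def resolve_host(url: str) -> str:
    """Step (4.2): resolve the URL's host name to an IP address via DNS."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {url}")
    return socket.gethostbyname(host)


def download(url: str) -> str:
    """Step (4.3): download the page body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    to_crawl: Deque[str],
    max_pages: int,
    resolve: Callable[[str], str] = resolve_host,
    fetch: Callable[[str], str] = download,
) -> Dict[str, Dict[str, str]]:
    """Return the downloaded-page library: URL -> {"ip": ..., "html": ...}."""
    pages: Dict[str, Dict[str, str]] = {}
    crawled: Set[str] = set()
    while to_crawl and len(pages) < max_pages:
        url = to_crawl.popleft()                     # (4.1) take a URL
        if url in crawled:
            continue
        ip = resolve(url)                            # (4.2) DNS lookup
        html = fetch(url)                            # (4.3) download and store
        pages[url] = {"ip": ip, "html": html}
        crawled.add(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:                    # (4.4) queue new URLs
            link = urljoin(url, href)
            if link.startswith(("http://", "https://")) and link not in crawled:
                to_crawl.append(link)
    return pages
```

The resolver and downloader are parameters so the loop can be exercised without network access. A production crawler would also need politeness delays, robots.txt handling, and error handling, none of which the text covers.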

The beneficial effects of the invention are as follows. In implementing the web crawler, the invention defines a custom URL container that is only responsible for storing data, so that each component has a single job. The data processing container is an automatic classification model trained on a large amount of data. The model classifies the content collected from websites and stores it in the corresponding database according to category, thereby bringing disorderly data into a rational order. This approach not only reduces the amount of useless data entering storage but also ensures that data enters the database in an orderly way. Throughout the process the crawler focuses only on collection: the collected data is sent to the data processing container, which classifies and stores it automatically. The invention greatly improves data quality and reduces the waste of manpower and material resources.
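
The division of labor described above, where the crawler only collects and the container classifies and stores, could look like the following sketch. It makes several assumptions: SQLite stands in for the database, a single table with a `category` column replaces the per-category databases the text mentions, and `DataProcessingContainer`, `receive`, and `by_category` are hypothetical names.

```python
import sqlite3
from typing import Callable, List


class DataProcessingContainer:
    """Receives crawled text, classifies it, and stores it by category."""

    def __init__(self, classify: Callable[[str], str], db_path: str = ":memory:") -> None:
        self.classify = classify          # the pre-trained model, passed in
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "url TEXT PRIMARY KEY, category TEXT NOT NULL, body TEXT NOT NULL)"
        )

    def receive(self, url: str, body: str) -> str:
        """Classify one crawled document, store it, and return its category."""
        category = self.classify(body)
        with self.db:                     # commits on success
            self.db.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
                (url, category, body),
            )
        return category

    def by_category(self, category: str) -> List[str]:
        """List the URLs stored under one category, sorted by URL."""
        rows = self.db.execute(
            "SELECT url FROM documents WHERE category = ? ORDER BY url",
            (category,),
        )
        return [row[0] for row in rows]
```

The crawler would call `receive` for each page and never touch the database itself.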

Description of the drawings

Fig. 1 is the flow chart of the present invention.

Detailed description

Specific embodiments of the invention are described in further detail below with reference to the accompanying drawing. Note that the specific embodiments only illustrate the technical solution of the invention and do not limit it.

As shown in Fig. 1, the machine automatic classification method based on web crawlers of the invention comprises the following steps:

(1) Set up a data processing container. The data processing container is a data processing model that has been trained in advance for information classification. Information classification training means that data for predefined categories is manually screened and labeled, and the data processing model is then trained on this data. The categories can be defined by the content of the information, for example politics, economy, sports, and so on.

(2) Select a set of seed URLs and put them into the queue of URLs to be crawled.

(3) Determine whether these URLs have already been collected; if so, send them directly to the collected-URL queue; if not, proceed to the next step;

(4.2) Resolve the URL's host name through DNS to obtain the host's IP address;
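
The seed handling in steps (2) and (3) can be sketched as follows. The patent does not say how "already collected" is decided, so this example assumes a set of normalized URLs is kept and that URLs are compared after light normalization. `normalize` and `enqueue_seeds` are hypothetical names.

```python
from collections import deque
from typing import Deque, Iterable, Set, Tuple
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    # Lower-case the scheme and host, and drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def enqueue_seeds(
    seeds: Iterable[str], collected: Set[str]
) -> Tuple[Deque[str], Deque[str]]:
    """Split seeds into a to-be-crawled queue and a collected-URL queue.

    `collected` is assumed to hold URLs already passed through `normalize`.
    """
    to_crawl: Deque[str] = deque()
    already_collected: Deque[str] = deque()
    queued: Set[str] = set()
    for seed in seeds:
        url = normalize(seed)
        if url in collected:
            already_collected.append(url)   # step (3), "yes" branch
        elif url not in queued:
            to_crawl.append(url)            # step (3), "no" branch
            queued.add(url)
    return to_crawl, already_collected
```

For a very large crawl, the in-memory set could be replaced by a persistent store or a probabilistic structure such as a Bloom filter; that choice is outside what the text specifies.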

Claims

1. A machine automatic classification method based on web crawlers, characterized in that it comprises the following steps:

(1) Set up a data processing container,

2. The machine automatic classification method based on web crawlers according to claim 1, characterized in that the data processing container is a data processing model that has been trained in advance for information classification, where information classification training means that data for predefined categories is manually screened and labeled, and the data processing model is then trained on this data.

3. The machine automatic classification method based on web crawlers according to claim 1, characterized in that the step of collecting data with the web crawler comprises:

(4.2) Resolve the URL's host name through DNS to obtain the host's IP address;