CN108133027A - A kind of machine automatic classification method based on web crawlers - Google Patents
- Publication number
- CN108133027A (application number CN201711461953.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- network address
- data processing
- captured
- url
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/972—Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer And Data Communications (AREA)
Abstract
The present invention provides a machine automatic classification method based on web crawlers, comprising: setting up a data processing container; selecting seed URLs; determining whether each URL has already been collected; collecting data with a web crawler; sending the collected data to the data processing container; and having the data processing container store the data in a data storage by category. The invention greatly improves data quality and reduces the waste of manpower and material resources.
Description
Technical field
The present invention relates to a machine automatic classification method based on the principle of web crawlers.
Background art
The tide of informatization has swept the globe. With the spread of the Internet and the continuous improvement of network technology, the Internet has become the largest and richest information resource in the world. Because the Internet is open, information of every kind can be published to it in many forms at the earliest moment, and it is precisely this openness that makes the information redundant and disorderly. Automatic classification technology has therefore developed rapidly to meet the demands of the data age. As an effective information processing method, automatic classification organizes information of all kinds according to a given classification system, which greatly improves the efficiency with which users collect information and reduces the considerable waste of resources caused by manual classification.
Summary of the invention
The technical problem to be solved by the invention is to provide a machine automatic classification method based on web crawlers that can organize redundant, disorderly data into well-ordered groups of similar content.
The technical solution adopted by the invention to solve this problem is a machine automatic classification method based on web crawlers, comprising the following steps:
(1) Set up a data processing container;
(2) Select a set of seed URLs and put them into the queue of URLs to be crawled;
(3) Determine whether each of these URLs has already been collected; if so, send it directly to the collected-URL queue; if not, proceed to the next step;
(4) Use a web crawler to collect data from the queue of URLs to be crawled;
(5) Send the collected data to the data processing container, which classifies it;
(6) Store the data classified by the data processing container in the data storage according to category.
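The six steps above can be sketched as a single loop. This is only an illustrative outline: the function name `run_pipeline`, the in-memory dictionary standing in for the data storage, and the caller-supplied `fetch` and `classify` callables are all assumptions, not part of the patent text.

```python
from collections import defaultdict, deque


def run_pipeline(seed_urls, already_collected, fetch, classify):
    """Walk steps (2)-(6) using caller-supplied fetch and classify callables."""
    to_crawl = deque(seed_urls)            # step (2): queue of URLs to be crawled
    collected = list(already_collected)    # queue of URLs already collected
    seen = set(already_collected)
    store = defaultdict(list)              # hypothetical stand-in for the data storage

    while to_crawl:
        url = to_crawl.popleft()
        if url in seen:                    # step (3): already collected, do not crawl again
            continue
        page = fetch(url)                  # step (4): data acquisition
        seen.add(url)
        collected.append(url)
        category = classify(page)          # step (5): the container classifies the data
        store[category].append((url, page))  # step (6): store by category
    return dict(store), collected


# Offline usage with toy data; the URLs and the keyword rule are made up.
pages = {
    "http://a.example/": "election vote parliament",
    "http://b.example/": "match goal team",
}


def toy_classify(text):
    return "sports" if "goal" in text else "politics"


store, collected = run_pipeline(list(pages), [], pages.get, toy_classify)
```

Keeping `fetch` and `classify` as parameters mirrors the separation the text describes: the crawler only collects, and the data processing container only classifies.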
Further, the data processing container is a data processing model that has been trained for information classification in advance. Information classification training means that data for predefined categories is manually screened and labeled, and the data processing model is then trained on this labeled data.
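The patent does not name a particular model. As one possible illustration of training on manually labeled data, the sketch below uses a small multinomial naive Bayes text classifier with add-one smoothing; the class name, the whitespace tokenization, and the sample sentences are assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing (an assumed model choice)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of training documents
        self.vocabulary = set()

    def train(self, labeled_docs):
        # labeled_docs: iterable of (text, category) pairs labeled by hand
        for text, category in labeled_docs:
            words = text.lower().split()
            self.word_counts[category].update(words)
            self.doc_counts[category] += 1
            self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                # smoothed log likelihood of each word under this category
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hand-labeled toy examples for three predefined categories.
labeled = [
    ("parliament passed the election law", "politics"),
    ("the minister gave a speech on policy", "politics"),
    ("stock market and inflation figures", "economy"),
    ("central bank raised interest rates", "economy"),
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
]
model = NaiveBayesTextClassifier()
model.train(labeled)
```

A production system would need far more labeled data and proper tokenization, especially for Chinese text, which is not whitespace-delimited.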
Further, the step of collecting data with the web crawler comprises:
(4.1) Take the URL of a page to be crawled from the queue of URLs to be crawled;
(4.2) Resolve the URL's host name through DNS to obtain the IP address of the host;
(4.3) Download the web page corresponding to the URL and store it in the downloaded-page library;
(4.4) Put the URLs from the crawled-URL queue into the queue of URLs to be crawled, thereby entering the next cycle.
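Steps (4.1) to (4.4) can be sketched with the Python standard library. Step (4.4) does not say where the next URLs come from; this sketch assumes they are the links found in each downloaded page. The names `crawl`, `LinkExtractor`, and `default_fetch`, and the `max_pages` limit, are hypothetical, and `resolve` and `fetch` are injectable so the loop can be exercised without network access.

```python
import socket
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def default_fetch(url):
    # Real download; not used in the offline demo below.
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_urls, max_pages=10, resolve=socket.gethostbyname, fetch=default_fetch):
    to_crawl = deque(seed_urls)   # queue of URLs to be crawled
    crawled = []                  # crawled-URL queue
    page_library = {}             # downloaded-page library
    while to_crawl and len(page_library) < max_pages:
        url = to_crawl.popleft()                     # (4.1) take a URL from the queue
        if url in page_library:
            continue
        host_ip = resolve(urlparse(url).hostname)    # (4.2) DNS lookup of the host
        html = fetch(url)                            # (4.3) download and store the page
        page_library[url] = {"ip": host_ip, "html": html}
        crawled.append(url)
        parser = LinkExtractor()                     # (4.4) queue newly found URLs
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link.startswith(("http://", "https://")) and link not in page_library:
                to_crawl.append(link)
    return page_library, crawled


# Offline demo: a two-page fake site and a stub resolver.
site = {
    "http://a.test/": '<a href="/b">b</a>',
    "http://a.test/b": '<a href="/">home</a>',
}
library, crawled = crawl(
    ["http://a.test/"], resolve=lambda host: "192.0.2.1", fetch=site.__getitem__
)
```

A real crawler would also need error handling, politeness delays, and robots.txt checks, none of which the patent text discusses.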
The beneficial effects of the invention are as follows. When implementing the web crawler, the invention defines a custom URL container that is responsible only for storing data, so that each component has a single job. The container is an automatic classification model trained on a large amount of data. The model classifies the data collected from websites and stores it in the corresponding database by category, so that disorderly data is organized sensibly. This approach not only reduces the amount of useless data that is stored, but also puts the data into the database in an orderly way. Throughout the process the crawler concentrates only on collection: the collected data is sent to the data processing container, which classifies it automatically and stores it. The invention greatly improves data quality and reduces the waste of manpower and material resources.
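Storing classified data by category might look like the following sketch, which uses SQLite from the Python standard library. The patent speaks of a "corresponding database" per category without giving a schema; a single `pages` table with an indexed `category` column is an assumption made here for brevity, as are the function names.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the page store; the schema is a hypothetical choice."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, category TEXT NOT NULL, content TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_pages_category ON pages(category)")
    return conn


def store_page(conn, url, content, classify):
    """Classify one collected page and write it under its category."""
    category = classify(content)
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, category, content) VALUES (?, ?, ?)",
        (url, category, content),
    )
    conn.commit()
    return category


def pages_in_category(conn, category):
    rows = conn.execute(
        "SELECT url FROM pages WHERE category = ? ORDER BY url", (category,)
    )
    return [row[0] for row in rows]


# Toy usage with a made-up keyword rule in place of the trained model.
conn = open_store()
toy_classify = lambda text: "sports" if "goal" in text else "economy"
first_category = store_page(conn, "http://x.test/1", "late goal", toy_classify)
store_page(conn, "http://x.test/2", "interest rates", toy_classify)
```

Because the crawler hands pages to `store_page` and never touches the schema, collection and classified storage stay separate, as the paragraph above describes.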
Description of the drawings
Fig. 1 is the flow chart of the present invention.
Specific embodiment
Specific embodiments of the invention are described in further detail below with reference to the accompanying drawing. Note that the specific embodiments only describe the technical solution of the invention in detail and do not limit the invention.
As shown in Fig. 1, the machine automatic classification method based on web crawlers of the present invention comprises the following steps:
(1) Set up a data processing container. The data processing container is a data processing model that has been trained for information classification in advance. Information classification training means that data for predefined categories is manually screened and labeled, and the data processing model is then trained on this labeled data. The categories can follow the content of the information, for example politics, economy, sports, and so on.
(2) Select a set of seed URLs and put them into the queue of URLs to be crawled.
(3) Determine whether each of these URLs has already been collected; if so, send it directly to the collected-URL queue; if not, proceed to the next step.
(4) Use a web crawler to collect data from the queue of URLs to be crawled:
(4.1) Take the URL of a page to be crawled from the queue of URLs to be crawled;
(4.2) Resolve the URL's host name through DNS to obtain the IP address of the host;
(4.3) Download the web page corresponding to the URL and store it in the downloaded-page library;
(4.4) Put the URLs from the crawled-URL queue into the queue of URLs to be crawled, thereby entering the next cycle.
(5) Send the collected data to the data processing container, which classifies it.
(6) Store the data classified by the data processing container in the data storage according to category.
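The check in step (3) depends on recognizing that two differently written URLs point to the same page. The patent does not describe how URLs are compared; the sketch below assumes a simple normalization (lowercase scheme and host, default path `/`, fragment dropped) and routes each seed either to the collected-URL queue or to the queue of URLs to be crawled. The names `normalize` and `split_seeds` are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a comparison key (an assumed, deliberately simple rule)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lowercase the scheme and host, keep path and query, drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def split_seeds(seed_urls, collected_keys):
    """Route seeds as in step (3): collected ones aside, the rest to be crawled."""
    to_crawl, collected = deque(), deque()
    queued = set()
    for url in seed_urls:
        key = normalize(url)
        if key in collected_keys:
            collected.append(url)      # already collected: goes to the collected-URL queue
        elif key not in queued:
            queued.add(key)            # first sighting: schedule it for crawling
            to_crawl.append(url)
        # a repeat of an already queued seed is silently dropped
    return to_crawl, collected


to_crawl, collected = split_seeds(
    ["HTTP://Example.COM", "http://example.com/news#top", "http://example.com/news"],
    {"http://example.com/"},
)
```

For large crawls the set of collected keys is often replaced by a persistent store or a probabilistic structure such as a Bloom filter, which trades a small false-positive rate for memory.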
Claims (3)
1. A machine automatic classification method based on web crawlers, characterized in that it comprises the following steps:
(1) Set up a data processing container;
(2) Select a set of seed URLs and put them into the queue of URLs to be crawled;
(3) Determine whether each of these URLs has already been collected; if so, send it directly to the collected-URL queue; if not, proceed to the next step;
(4) Use a web crawler to collect data from the queue of URLs to be crawled;
(5) Send the collected data to the data processing container, which classifies it;
(6) Store the data classified by the data processing container in the data storage according to category.
2. The machine automatic classification method based on web crawlers according to claim 1, characterized in that the data processing container is a data processing model that has been trained for information classification in advance, where information classification training means that data for predefined categories is manually screened and labeled, and the data processing model is then trained on this labeled data.
3. The machine automatic classification method based on web crawlers according to claim 1, characterized in that the step of collecting data with the web crawler comprises:
(4.1) Take the URL of a page to be crawled from the queue of URLs to be crawled;
(4.2) Resolve the URL's host name through DNS to obtain the IP address of the host;
(4.3) Download the web page corresponding to the URL and store it in the downloaded-page library;
(4.4) Put the URLs from the crawled-URL queue into the queue of URLs to be crawled, thereby entering the next cycle.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711461953.8A CN108133027A (en) | 2017-12-28 | 2017-12-28 | A kind of machine automatic classification method based on web crawlers |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711461953.8A CN108133027A (en) | 2017-12-28 | 2017-12-28 | A kind of machine automatic classification method based on web crawlers |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108133027A (en) | 2018-06-08 |
Family
ID=62393662
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711461953.8A Pending CN108133027A (en) | 2017-12-28 | 2017-12-28 | A kind of machine automatic classification method based on web crawlers |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108133027A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753171A (en) * | 2020-06-09 | 2020-10-09 | 北京天空卫士网络安全技术有限公司 | Malicious website identification method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1437140A (en) * | 2002-02-05 | 2003-08-20 | 国际商业机器公司 | Method and system for queuing uncalled web based on path |
CN101101601A (en) * | 2007-07-10 | 2008-01-09 | 北京大学 | Subject crawling method based on link hierarchical classification in network search |
CN101751438A (en) * | 2008-12-17 | 2010-06-23 | 中国科学院自动化研究所 | Theme webpage filter system for driving self-adaption semantics |
CN105447018A (en) * | 2014-08-20 | 2016-03-30 | 阿里巴巴集团控股有限公司 | Method and apparatus for verifying web page classification model |
-
2017
- 2017-12-28 CN CN201711461953.8A patent/CN108133027A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753171A (en) * | 2020-06-09 | 2020-10-09 | 北京天空卫士网络安全技术有限公司 | Malicious website identification method and device |
CN111753171B (en) * | 2020-06-09 | 2024-04-26 | 北京天空卫士网络安全技术有限公司 | Malicious website identification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104077402B (en) | Data processing method and data handling system | |
CN104135498B (en) | A kind of cross-platform information transmission system and its method for pushing | |
CN106951925A (en) | Data processing method, device, server and system | |
CN109948506A (en) | A kind of multi-angle garbage classification cloud platform based on deep learning | |
CN104899261B (en) | A kind of apparatus and method for building structuring video image information | |
CN105915438A (en) | Message pushing method, apparatus, and system | |
CN104850662A (en) | User portrait based mobile terminal intelligent message pushing method, server and system | |
CN107358194A (en) | A kind of violence sorting express delivery determination methods based on computer vision | |
CN109117879A (en) | Image classification method, apparatus and system | |
CN101419614A (en) | Video resource clustering method and device | |
CN110288007A (en) | The method, apparatus and electronic equipment of data mark | |
CN109857862A (en) | File classification method, device, server and medium based on intelligent decision | |
CN111368895A (en) | Garbage bag target detection method and detection system in wet garbage | |
CN109934255A (en) | A kind of Model Fusion method for delivering object Classification and Identification suitable for beverage bottle recycling machine | |
CN107957940A (en) | A kind of test log processing method, system and terminal | |
CN108650195A (en) | A kind of APP flows automatic identification model building method | |
CN108241662A (en) | The optimization method and device of data mark | |
CN105791543A (en) | Method, device, client and system for cleaning short messages | |
CN108133027A (en) | A kind of machine automatic classification method based on web crawlers | |
CN107087006A (en) | A kind of agreement shunt method, system and server | |
CN110414321A (en) | The method and system of automatic identification shaking video | |
CN103475532B (en) | Hardware detection method and system | |
CN102902794B (en) | Web page classification system and method | |
CN104021170B (en) | A kind of information acquisition method and cloud server | |
CN105956069A (en) | Network information collection and analysis method and network information collection and analysis system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180608 |