CN112632356A

CN112632356A - Network information data classification collection method

Info

Publication number: CN112632356A
Application number: CN202011565108.7A
Authority: CN
Inventors: 李锦基; 黄永权; 王勋; 符伟杰; 骆新坤; 李明东
Original assignee: Gold Sea Comm Corp
Current assignee: Gold Sea Comm Corp
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-04-09

Abstract

The invention discloses a network information data classification collection method comprising the following steps. Collection preparation: a database is created inside a cloud server, information acquisition classification software is run inside the cloud server, and a plurality of web crawlers are used at the information acquisition end of the software, wherein each web crawler corresponds to at least one classification parameter, and a parameter value corresponding to each classification parameter is determined. Link establishment: the type of network information data to be collected is determined first, a website or network address suitable for collecting that data is then selected, and a link is established with the corresponding website or network address through the information acquisition classification software. The invention captures the data of the target website or network address through web crawlers, determines in advance at least one classification parameter of the data to be collected, and collects the data category by category; because the list page corresponding to each category has less content, the web crawler can collect all of the data on that list page.

Description

Network information data classification collection method

Technical Field

The invention relates to the technical field of network information, in particular to a network information data classification and collection method.

Background

The big data era is quietly emerging. A large amount of public information is available on the network, and large-scale internet sites are common, so these websites have become key targets of data collection work, and various network information data classification collection methods have begun to appear. However, most of the websites from which information data is currently collected are large-scale internet sites whose total amount of data is very large; existing network information data classification collection methods cannot achieve complete coverage of a website's information, and data is easily missed.

Disclosure of Invention

The invention aims to provide a network information data classification collection method that solves the following problem: most websites from which information data is currently collected are large-scale internet sites with a very large total amount of data, so existing network information data classification collection methods cannot achieve complete coverage of website information, and data is easily missed.

To achieve this purpose, the invention provides the following technical solution: a network information data classification collection method comprising the following steps:

Step 1: Collection preparation: a database is created inside a cloud server, information acquisition classification software is run inside the cloud server, and a plurality of web crawlers are used at the information acquisition end of the software, wherein each web crawler corresponds to at least one classification parameter, and a parameter value corresponding to each classification parameter is determined;

Step 2: Link establishment: the type of network information data to be collected is determined first, a website or network address suitable for collecting that data is then selected, a link is established with the corresponding website or network address through the information acquisition classification software, and a plurality of web crawlers are set up according to the data types, wherein each web crawler is responsible for collecting one or two types of data;

Step 3: Data collection: the web crawler can directly enter the target list page and the paging pages of the website or network address, can capture network information data in both the longitudinal and transverse directions, and then transmits the corresponding information data back to the database;

Step 4: Classified storage: the database is divided in advance into classified directories according to the classification requirements, the information data transmitted back by each web crawler is stored directly in the directory of the corresponding category, and the user can select suitable classification software to reclassify, display and output the data in the database as required.
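
As an illustration only, the interplay of the crawlers and the pre-divided classified directories can be sketched in Python. The `Crawler` class, the `collect` function, and the in-memory dictionary standing in for the cloud database are all hypothetical names introduced for this sketch; the `fetch` callables replace real page fetching so that the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Crawler:
    """Hypothetical crawler bound to one category (classification parameter)."""
    category: str
    fetch: Callable[[], Iterable[dict]]  # stand-in for real page fetching


def collect(crawlers: List[Crawler], categories: List[str]) -> Dict[str, List[dict]]:
    # The store is divided into one "directory" per category in advance.
    store: Dict[str, List[dict]] = {name: [] for name in categories}
    for crawler in crawlers:
        if crawler.category not in store:
            raise KeyError(f"no directory for category {crawler.category!r}")
        # Records returned by a crawler are filed directly under its category.
        store[crawler.category].extend(crawler.fetch())
    return store


crawlers = [
    Crawler("news", lambda: [{"title": "a"}, {"title": "b"}]),
    Crawler("jobs", lambda: [{"title": "c"}]),
]
store = collect(crawlers, ["news", "jobs"])
```

Because every crawler writes only into its own category, no separate classification pass is needed at storage time in this sketch.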

Preferably, in step 1, determining a parameter value corresponding to each classification parameter includes: determining a target website or network address where data to be acquired is located; acquiring a list page corresponding to data to be acquired from a target website or a network address; selecting each classification parameter one by one in a list page to obtain a classification link corresponding to each classification parameter; and determining a parameter value corresponding to each classification parameter according to each obtained classification link.
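
A minimal sketch of this determination follows, under the assumption (not stated in the source) that each classification link carries its parameter value in the URL query string. The function name `parameter_values` and the `example.com` links are hypothetical.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit


def parameter_values(classification_links: Dict[str, str]) -> Dict[str, str]:
    """Map each classification parameter to the value found in its link's query string."""
    values: Dict[str, str] = {}
    for parameter, link in classification_links.items():
        query = parse_qs(urlsplit(link).query)
        if parameter in query:
            # parse_qs returns a list per key; take the first occurrence.
            values[parameter] = query[parameter][0]
    return values


# Links obtained by selecting each classification parameter on the list page.
links = {
    "category": "https://example.com/list?category=books",
    "region": "https://example.com/list?region=south",
}
values = parameter_values(links)
```

Sites that encode the category in the URL path rather than the query string would need a different extraction rule.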

Preferably, in step 1, a link corresponding to each classification parameter is generated according to that classification parameter and its parameter value: for each current classification parameter and its current parameter value, the current classification parameter, the current parameter value and a set character are spliced together in a set form; and the spliced content is added to the classification link corresponding to the current classification parameter to obtain the entry link corresponding to the current classification parameter.
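
The splicing can be sketched as follows. The source does not specify the set form or the set character, so this example assumes the form `parameter=value` with `=` as the set character, appended to the classification link as a query component; `build_entry_link` is a hypothetical helper name.

```python
from urllib.parse import quote


def build_entry_link(classification_link: str, parameter: str, value: str,
                     set_character: str = "=") -> str:
    """Splice parameter, set character and value, then add them to the link."""
    # Assumed set form: <parameter><set character><value>, percent-encoded.
    spliced = f"{quote(parameter, safe='')}{set_character}{quote(str(value), safe='')}"
    # Use '?' for the first query component and '&' for any later one.
    joiner = "&" if "?" in classification_link else "?"
    return f"{classification_link}{joiner}{spliced}"


first = build_entry_link("https://example.com/list", "category", "books")
second = build_entry_link("https://example.com/list?sort=new", "region", "south china")
```

Each resulting entry link leads a crawler straight to the list page of one category.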

Preferably, in step 3, when the information to be collected is paid information, payment may be made through secure payment software inside the cloud server.

Preferably, in step 3, duplicate information data is filtered out by dedicated software, and similar information is merged.
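
The source does not say how duplicates are detected or what counts as similar. The sketch below assumes that an exact duplicate is a record with the same URL and title, and that records are similar when their titles match after case and whitespace normalization. The record fields `url` and `title` and the function names are hypothetical.

```python
from typing import Dict, List


def normalize(title: str) -> str:
    """Lower-case a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def dedupe_and_merge(records: List[dict]) -> List[dict]:
    merged: Dict[str, dict] = {}
    seen = set()
    for record in records:
        fingerprint = (record["url"], record["title"])
        if fingerprint in seen:
            continue  # exact duplicate: filter it out
        seen.add(fingerprint)
        key = normalize(record["title"])
        if key in merged:
            # Similar record: merge by keeping one entry with all source URLs.
            merged[key]["urls"].append(record["url"])
        else:
            merged[key] = {"title": record["title"], "urls": [record["url"]]}
    return list(merged.values())


records = [
    {"url": "https://example.com/a", "title": "Price list"},
    {"url": "https://example.com/a", "title": "Price list"},   # exact duplicate
    {"url": "https://example.com/b", "title": "price  LIST"},  # similar title
    {"url": "https://example.com/c", "title": "Contact"},
]
result = dedupe_and_merge(records)
```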

Preferably, in step 3, the web crawler acquires data corresponding to the corresponding classification parameters one by one for each link.

Preferably, in step 3, during information data collection, the target list page corresponding to the current link is obtained, the target list page comprising at least one paging page; the detail links in each paging page are accessed, and data is collected from each accessed detail link.
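
The traversal from list page to paging pages to detail links can be sketched as below. To keep the example runnable offline, fetching is injected as a callable and the site is simulated with a dictionary; the `Page` shape (lists of links under the keys `paging` and `details`) and the name `crawl_entry` are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Page = Dict[str, List[str]]  # hypothetical parsed page: lists of links by kind


def crawl_entry(entry_link: str, fetch: Callable[[str], Page]) -> List[str]:
    """Visit every paging page of one list page and collect each detail link once."""
    list_page = fetch(entry_link)
    # A list page without explicit paging links is treated as its only paging page.
    paging_links = list_page.get("paging", []) or [entry_link]
    collected: List[str] = []
    visited = set()
    for paging_link in paging_links:
        for detail_link in fetch(paging_link).get("details", []):
            if detail_link not in visited:
                visited.add(detail_link)
                collected.append(detail_link)
    return collected


# Simulated site: one list page for a category, split into two paging pages.
site = {
    "/list?category=books": {"paging": ["/list?category=books&page=1",
                                        "/list?category=books&page=2"]},
    "/list?category=books&page=1": {"details": ["/item/1", "/item/2"]},
    "/list?category=books&page=2": {"details": ["/item/2", "/item/3"]},
}
details = crawl_entry("/list?category=books", site.__getitem__)
```

In a real crawler, `fetch` would download and parse the page; the traversal order stays the same.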

Preferably, in step 4, the original information data is stored in the cloud server, and the information acquisition classification software stores the collected information data by category.

Preferably, in step 4, the user may subscribe to information data of a given category, and when that information data is updated, the update is pushed to the user promptly.
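
One possible shape for this subscription mechanism is a per-category callback registry, sketched below. The source does not describe the push channel; `CategoryFeed`, `subscribe` and `publish` are hypothetical names, and a plain function call stands in for whatever delivery mechanism is actually used.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

Callback = Callable[[str, dict], None]


class CategoryFeed:
    """Hypothetical registry that pushes new records to category subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, category: str, callback: Callback) -> None:
        self._subscribers[category].append(callback)

    def publish(self, category: str, record: dict) -> None:
        # Push the update only to users subscribed to this category.
        for callback in self._subscribers.get(category, []):
            callback(category, record)


inbox: List[Tuple[str, str]] = []
feed = CategoryFeed()
feed.subscribe("news", lambda category, record: inbox.append((category, record["title"])))
feed.publish("news", {"title": "update 1"})
feed.publish("jobs", {"title": "update 2"})  # no subscriber, nothing is pushed
```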

Compared with the prior art, the invention has the following beneficial effects: the invention captures the data of the target website or network address through web crawlers, determines in advance at least one classification parameter of the data to be collected, and collects the data category by category; because the list page corresponding to each category has less content, the web crawler can collect all of the data on that list page.

Detailed Description

The present invention will now be described in more detail by way of examples, which are given by way of illustration only and are not intended to limit the scope of the present invention in any way.

The invention provides the following technical solution: a network information data classification collection method comprising the following steps:

Example one:

Step 1: Collection preparation: a database is created inside a cloud server, information acquisition classification software is run inside the cloud server, and a plurality of web crawlers are used at the information acquisition end of the software, wherein each web crawler corresponds to at least one classification parameter, and a parameter value corresponding to each classification parameter is determined.

Step 2: Link establishment: the type of network information data to be collected is determined first, a website or network address suitable for collecting that data is then selected, a link is established with the corresponding website or network address through the information acquisition classification software, and a plurality of web crawlers are set up according to the data types, wherein each web crawler is responsible for collecting one or two types of data.

Step 3: Data collection: the web crawler can directly enter the target list page and the paging pages of the website or network address, can capture network information data in both the longitudinal and transverse directions, and then transmits the corresponding information data back to the database.

Step 4: Classified storage: the database is divided in advance into classified directories according to the classification requirements, the information data transmitted back by each web crawler is stored directly in the directory of the corresponding category, and the user can select suitable classification software to reclassify, display and output the data in the database as required.

Example two:

On the basis of Example one, the following is added:

In step 1, determining the parameter value corresponding to each classification parameter comprises: determining the target website or network address where the data to be collected is located; obtaining the list page corresponding to the data to be collected from the target website or network address; selecting each classification parameter one by one in the list page to obtain the classification link corresponding to each classification parameter; and determining the parameter value corresponding to each classification parameter from each obtained classification link. This makes it easier to collect information data accurately.

Example three:

On the basis of Example two, the following is added:

In step 1, a link corresponding to each classification parameter is generated according to that classification parameter and its parameter value: for each current classification parameter and its current parameter value, the current classification parameter, the current parameter value and a set character are spliced together in a set form; and the spliced content is added to the classification link corresponding to the current classification parameter to obtain the entry link corresponding to the current classification parameter. This improves the accuracy of information collection.

Example four:

On the basis of Example three, the following is added:

In step 3, when the information to be collected is paid information, payment can be made through secure payment software inside the cloud server; duplicate information data is filtered out by dedicated software and similar information is merged, which avoids duplication in the collected information data.

Example five:

On the basis of Example four, the following is added:

In step 3, for each link, the web crawler collects the data corresponding to that link's classification parameter, one link at a time. During information data collection, the target list page corresponding to the current link is obtained, the target list page comprising at least one paging page; the detail links in each paging page are accessed, and data is collected from each accessed detail link, which makes the information convenient to crawl.

Example six:

On the basis of Example five, the following is added:

In step 4, the original information data is stored in the cloud server, and the information acquisition classification software stores the collected information data by category; the user can subscribe to information data of a given category, and when that information data is updated, the update is pushed to the user promptly. This makes it convenient to store the information data for the long term.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A network information data classification collection method is characterized in that: the method comprises the following steps:

2. The method for classifying and collecting network information data according to claim 1, wherein: in step 1, determining the parameter value corresponding to each classification parameter comprises: determining the target website or network address where the data to be collected is located; obtaining the list page corresponding to the data to be collected from the target website or network address; selecting each classification parameter one by one in the list page to obtain the classification link corresponding to each classification parameter; and determining the parameter value corresponding to each classification parameter from each obtained classification link.

3. The method for classifying and collecting network information data according to claim 2, wherein: in step 1, a link corresponding to each classification parameter is generated according to that classification parameter and its parameter value: for each current classification parameter and its current parameter value, the current classification parameter, the current parameter value and a set character are spliced together in a set form; and the spliced content is added to the classification link corresponding to the current classification parameter to obtain the entry link corresponding to the current classification parameter.

4. The method for classifying and collecting network information data according to claim 1, wherein: in step 3, when the information to be collected is paid information, payment can be made through secure payment software inside the cloud server.

5. The method for classifying and collecting network information data according to claim 1, wherein: in step 3, duplicate information data is filtered out by dedicated software, and similar information is merged.

6. The method for classifying and collecting network information data according to claim 1, wherein: in step 3, the web crawler acquires data corresponding to the corresponding classification parameters one by one for each link.

7. The method for classifying and collecting network information data according to claim 6, wherein: in the step 3, when information data is acquired, a target list page corresponding to the current link is acquired, wherein the target list page comprises at least one paging page; and accessing the detail link in each paging page, and acquiring data of the accessed detail link.

8. The method for classifying and collecting network information data according to claim 1, wherein: in step 4, the original information data is stored in the cloud server, and the information acquisition classification software stores the collected information data by category.

9. The method for classifying and collecting network information data according to claim 1, wherein: in step 4, the user may subscribe to the information data of the same category, and when the information data is updated, the update is pushed in time.