CN112632356A - Network information data classification collection method - Google Patents
Network information data classification collection method
- Publication number
- CN112632356A
- Authority
- CN
- China
- Prior art keywords
- classification
- data
- information data
- information
- parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/972—Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
Abstract
The invention discloses a network information data classification collection method comprising the following steps. Collection preparation: a database is created inside a cloud server and information collection and classification software runs inside the cloud server; the information collection end of the software uses a plurality of web crawlers, each web crawler corresponds to at least one classification parameter, and a parameter value is determined for each classification parameter. Link establishment: the type of network information data to be collected is determined first, a website or network address suitable for collecting that data is then selected, and a link with the corresponding website or network address is established through the information collection and classification software. The invention captures the data of the target website or network address with web crawlers, determines in advance at least one classification parameter for the data to be collected, and collects the data by classification. Because the list page corresponding to each classification contains less content, the web crawlers can collect all the data on that list page.
Description
Technical Field
The invention relates to the technical field of network information, in particular to a network information data classification and collection method.
Background
The big data era is quietly emerging, and a large amount of public information is available on the network. Large-scale internet sites are comparatively common and have become key targets of data collection work, so various methods for classified collection of network information data have begun to appear. However, most of the websites from which information data is collected today are large-scale internet sites whose total data volume is very large. Existing classified collection methods cannot fully cover the information on such websites, so data is easily omitted.
Disclosure of Invention
The invention aims to provide a network information data classification collection method that solves the following problems: most websites from which information data is currently collected are large-scale internet sites with a very large total data volume, existing classified collection methods cannot fully cover the website information, and data is easily omitted.
To achieve this aim, the invention provides the following technical scheme: a network information data classification collection method comprising the following steps:
Step 1: Collection preparation: create a database inside a cloud server and run information collection and classification software inside the cloud server; the information collection end of the software uses a plurality of web crawlers, each web crawler corresponds to at least one classification parameter, and a parameter value is determined for each classification parameter;
Step 2: Link establishment: first determine the type of network information data to be collected, then select a website or network address suitable for collecting that data, establish a link with the corresponding website or network address through the information collection and classification software, and set up a plurality of web crawlers according to the data types, each web crawler being responsible for collecting one or two types of data;
Step 3: Data collection: the web crawlers directly enter the target list pages and paging pages of the website or network address, capture the data and information of the network information in both the longitudinal and transverse directions, and transmit the corresponding information data back to the database;
Step 4: Classified storage: the database is divided in advance into classification directories according to the classification requirements, the information data returned by each web crawler is stored directly in the directory of the corresponding classification, and a user can select suitable classification software to reclassify, display, and output the data in the database as required.
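The four steps can be sketched roughly as follows. This is only an illustrative outline, not the patent's implementation: the `Crawler` class, the `run_collection` function, the `fetch` callback, and the `example.com` links are all assumed names, and an in-memory dictionary stands in for both the target website and the cloud database.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Crawler:
    """One crawler bound to one classification (hypothetical structure)."""
    category: str
    entry_link: str

    def collect(self, fetch: Callable[[str], List[str]]) -> List[str]:
        # `fetch` stands in for real page retrieval and parsing (step 3).
        return fetch(self.entry_link)


def run_collection(crawlers: List[Crawler],
                   fetch: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Collect with every crawler and store results per category (step 4)."""
    database: Dict[str, List[str]] = defaultdict(list)
    for crawler in crawlers:
        # Records go straight into the directory of the crawler's category.
        database[crawler.category].extend(crawler.collect(fetch))
    return dict(database)


# Stand-in website: maps an entry link to the records found behind it.
FAKE_SITE = {
    "https://example.com/list?type=news": ["news-1", "news-2"],
    "https://example.com/list?type=jobs": ["job-1"],
}

# Steps 1 and 2: one crawler per data type, each with its own entry link.
crawlers = [
    Crawler("news", "https://example.com/list?type=news"),
    Crawler("jobs", "https://example.com/list?type=jobs"),
]

result = run_collection(crawlers, lambda url: FAKE_SITE.get(url, []))
print(result)
```

Binding each crawler to a single category means the storage step needs no further classification logic: the category is already known when the data arrives.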
Preferably, in step 1, determining a parameter value corresponding to each classification parameter includes: determining a target website or network address where data to be acquired is located; acquiring a list page corresponding to data to be acquired from a target website or a network address; selecting each classification parameter one by one in a list page to obtain a classification link corresponding to each classification parameter; and determining a parameter value corresponding to each classification parameter according to each obtained classification link.
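One way to read a parameter value out of a classification link is to parse the link's query string. The sketch below assumes that each classification parameter appears as a query key in its classification link, which the patent does not state; `parameter_values` and the sample links are hypothetical.

```python
from typing import Dict
from urllib.parse import parse_qs, urlparse


def parameter_values(classification_links: Dict[str, str]) -> Dict[str, str]:
    """Map each classification parameter to the value found in its link.

    Assumes the parameter name is a key in the link's query string.
    """
    values: Dict[str, str] = {}
    for parameter, link in classification_links.items():
        query = parse_qs(urlparse(link).query)
        if parameter in query:
            # parse_qs returns a list per key; take the first occurrence.
            values[parameter] = query[parameter][0]
    return values


# Links obtained by selecting each classification parameter on the list page.
links = {
    "region": "https://example.com/list?region=beijing",
    "year": "https://example.com/list?year=2020&page=1",
}
values = parameter_values(links)
print(values)
```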
Preferably, in step 1, a link corresponding to each classification parameter is generated from the classification parameter and its parameter value: for each current classification parameter and its current parameter value, the current classification parameter, the current parameter value, and a set character are spliced in a set form; the spliced content is then added to the classification link corresponding to the current classification parameter to obtain the entry link corresponding to that parameter.
Preferably, in step 3, when the information to be collected is paid information, payment may be made through secure payment software inside the cloud server.
Preferably, in step 3, repeated information data is filtered out by dedicated software, and similar information is merged.
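The splicing step might look like the sketch below. The patent does not specify the set character or the set form, so `=` as the joiner and `?`/`&` as the separators are assumptions modelled on ordinary URL query strings; `build_entry_link` is a hypothetical helper.

```python
from urllib.parse import quote


def build_entry_link(classification_link: str, parameter: str, value: str,
                     joiner: str = "=", separator: str = "&") -> str:
    """Splice parameter, set character, and value, then add to the link."""
    # Percent-encode the value so it is safe inside a URL.
    spliced = f"{parameter}{joiner}{quote(value, safe='')}"
    # Start a query string if the link has none, otherwise extend it.
    lead = separator if "?" in classification_link else "?"
    return f"{classification_link}{lead}{spliced}"


print(build_entry_link("https://example.com/list", "region", "beijing"))
print(build_entry_link("https://example.com/list?sort=new", "year", "2020"))
```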
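The patent leaves the filtering software unspecified. As a rough illustration, exact repeats can be dropped by hashing normalized text, and similar records can be merged by keeping one representative when a string-similarity ratio passes a threshold. The use of `difflib.SequenceMatcher`, the 0.9 threshold, and the `deduplicate` function are assumptions made for this sketch.

```python
import hashlib
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(records: List[str], threshold: float = 0.9) -> List[str]:
    """Drop exact repeats and merge near-duplicates into the first one seen."""
    seen_hashes = set()
    kept: List[str] = []
    for record in records:
        norm = normalize(record)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact repeat after normalization
        seen_hashes.add(digest)
        if any(SequenceMatcher(None, norm, normalize(k)).ratio() >= threshold
               for k in kept):
            continue  # similar to a kept record: merged into it
        kept.append(record)
    return kept


records = [
    "Cloud server price list 2020",
    "cloud  server price list 2020",
    "Cloud server price list 2020.",
    "Weather report",
]
unique = deduplicate(records)
print(unique)
```

Comparing every record against all kept records is quadratic, so a real collector handling large sites would likely need an indexing scheme instead; the sketch only shows the filtering logic.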
Preferably, in step 3, the web crawler acquires data corresponding to the corresponding classification parameters one by one for each link.
Preferably, in step 3, during information data acquisition, a target list page corresponding to the current link is obtained, where the target list page includes at least one paging page; and accessing the detail link in each paging page, and acquiring data of the accessed detail link.
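The traversal from a target list page through its paging pages to the detail links can be sketched as below. The HTML layout (anchors with class `page` for paging links and class `detail` for detail links), the `fetch` callback, and the sample pages are assumptions for illustration; a real site would need its own selectors and real HTTP requests.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Tuple


class LinkParser(HTMLParser):
    """Collect (class, href) pairs from every anchor tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr = dict(attrs)
            href = attr.get("href")
            if href:
                self.links.append((attr.get("class") or "", href))


def links_of(html: str, css_class: str) -> List[str]:
    """Return the hrefs of anchors whose class equals css_class."""
    parser = LinkParser()
    parser.feed(html)
    return [href for cls, href in parser.links if cls == css_class]


def crawl_list_page(entry_link: str,
                    fetch: Callable[[str], str]) -> Dict[str, str]:
    """Visit every paging page of a list page and collect each detail link."""
    paging_pages = [entry_link]
    for page_url in links_of(fetch(entry_link), "page"):
        if page_url not in paging_pages:
            paging_pages.append(page_url)
    details: Dict[str, str] = {}
    for page_url in paging_pages:
        for detail_url in links_of(fetch(page_url), "detail"):
            if detail_url not in details:
                details[detail_url] = fetch(detail_url)
    return details


# In-memory stand-in for the target website.
PAGES = {
    "/news/page/1": '<a class="detail" href="/item/1">A</a>'
                    '<a class="page" href="/news/page/1">1</a>'
                    '<a class="page" href="/news/page/2">2</a>',
    "/news/page/2": '<a class="detail" href="/item/2">B</a>',
    "/item/1": "first item",
    "/item/2": "second item",
}

collected = crawl_list_page("/news/page/1", lambda url: PAGES[url])
print(collected)
```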
Preferably, in step 4, the original information data is stored in the cloud server, and the information collection and classification software stores the collected information data by classification.
Preferably, in step 4, the user may subscribe to information data of the same category, and when the information data is updated, the update is pushed in time.
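Subscription and push could be modelled with a simple callback registry per category. The patent does not describe the push mechanism, so the `CategoryStore` class and its methods are hypothetical, and a plain function call stands in for whatever delivery channel a real system would use.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

Callback = Callable[[str, str], None]


class CategoryStore:
    """Classified storage that notifies subscribers of a category on update."""

    def __init__(self) -> None:
        self.records: DefaultDict[str, List[str]] = defaultdict(list)
        self.subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, category: str, callback: Callback) -> None:
        """Register a callback to run whenever the category gains a record."""
        self.subscribers[category].append(callback)

    def add(self, category: str, record: str) -> None:
        """Store a record under its category and push it to subscribers."""
        self.records[category].append(record)
        for callback in self.subscribers[category]:
            callback(category, record)


store = CategoryStore()
inbox: List[Tuple[str, str]] = []
store.subscribe("news", lambda category, record: inbox.append((category, record)))

store.add("news", "item-1")   # pushed: a subscriber exists for "news"
store.add("jobs", "job-1")    # stored only: nobody subscribed to "jobs"
print(inbox)
```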
Compared with the prior art, the invention has the following beneficial effects: the invention captures the data of the target website or network address with web crawlers, determines in advance at least one classification parameter for the data to be collected, and collects the data by classification. Because the list page corresponding to each classification contains less content, the web crawlers can collect all the data on that list page.
Detailed Description
The present invention will now be described in more detail by way of examples, which are given by way of illustration only and are not intended to limit the scope of the present invention in any way.
Embodiment one:
Step 1: Collection preparation: create a database inside a cloud server and run information collection and classification software inside the cloud server; the information collection end of the software uses a plurality of web crawlers, each web crawler corresponds to at least one classification parameter, and a parameter value is determined for each classification parameter;
Step 2: Link establishment: first determine the type of network information data to be collected, then select a website or network address suitable for collecting that data, establish a link with the corresponding website or network address through the information collection and classification software, and set up a plurality of web crawlers according to the data types, each web crawler being responsible for collecting one or two types of data;
Step 3: Data collection: the web crawlers directly enter the target list pages and paging pages of the website or network address, capture the data and information of the network information in both the longitudinal and transverse directions, and transmit the corresponding information data back to the database;
Step 4: Classified storage: the database is divided in advance into classification directories according to the classification requirements, the information data returned by each web crawler is stored directly in the directory of the corresponding classification, and a user can select suitable classification software to reclassify, display, and output the data in the database as required.
Embodiment two:
On the basis of embodiment one, the following is added:
In step 1, determining the parameter value corresponding to each classification parameter includes: determining the target website or network address where the data to be collected is located; obtaining the list page corresponding to the data to be collected from the target website or network address; selecting each classification parameter one by one in the list page to obtain the classification link corresponding to each classification parameter; and determining the parameter value corresponding to each classification parameter from each obtained classification link. This makes it easier to collect the information data accurately.
Embodiment three:
On the basis of embodiment two, the following is added:
In step 1, a link corresponding to each classification parameter is generated from the classification parameter and its parameter value: for each current classification parameter and its current parameter value, the current classification parameter, the current parameter value, and a set character are spliced in a set form; the spliced content is then added to the classification link corresponding to the current classification parameter to obtain the entry link corresponding to that parameter. This improves the accuracy of information collection.
Embodiment four:
On the basis of embodiment three, the following is added:
In step 3, when the information to be collected is paid information, payment can be made through secure payment software inside the cloud server; repeated information data is filtered out by dedicated software and similar information is merged, which avoids duplication in the collected information data.
Embodiment five:
On the basis of embodiment four, the following is added:
In step 3, the web crawler collects, link by link, the data corresponding to the respective classification parameter. During collection, the target list page corresponding to the current link is obtained, the target list page comprising at least one paging page; the detail links in each paging page are accessed, and data is collected from each accessed detail link. This makes the information easier to crawl.
Embodiment six:
On the basis of embodiment five, the following is added:
In step 4, the original information data is stored in the cloud server, and the information collection and classification software stores the collected information data by classification. A user can subscribe to information data of the same category, and updates are pushed promptly when that information data is updated. This makes long-term storage of the information data convenient.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (9)
1. A network information data classification collection method, characterized in that the method comprises the following steps:
Step 1: Collection preparation: create a database inside a cloud server and run information collection and classification software inside the cloud server; the information collection end of the software uses a plurality of web crawlers, each web crawler corresponds to at least one classification parameter, and a parameter value is determined for each classification parameter;
Step 2: Link establishment: first determine the type of network information data to be collected, then select a website or network address suitable for collecting that data, establish a link with the corresponding website or network address through the information collection and classification software, and set up a plurality of web crawlers according to the data types, each web crawler being responsible for collecting one or two types of data;
Step 3: Data collection: the web crawlers directly enter the target list pages and paging pages of the website or network address, capture the data and information of the network information in both the longitudinal and transverse directions, and transmit the corresponding information data back to the database;
Step 4: Classified storage: the database is divided in advance into classification directories according to the classification requirements, the information data returned by each web crawler is stored directly in the directory of the corresponding classification, and a user can select suitable classification software to reclassify, display, and output the data in the database as required.
2. The method for classifying and collecting network information data according to claim 1, wherein: in step 1, determining the parameter value corresponding to each classification parameter comprises: determining the target website or network address where the data to be collected is located; obtaining the list page corresponding to the data to be collected from the target website or network address; selecting each classification parameter one by one in the list page to obtain the classification link corresponding to each classification parameter; and determining the parameter value corresponding to each classification parameter from each obtained classification link.
3. The method for classifying and collecting network information data according to claim 2, wherein: in step 1, a link corresponding to each classification parameter is generated from the classification parameter and its parameter value: for each current classification parameter and its current parameter value, the current classification parameter, the current parameter value, and a set character are spliced in a set form; and the spliced content is added to the classification link corresponding to the current classification parameter to obtain the entry link corresponding to that parameter.
4. The method for classifying and collecting network information data according to claim 1, wherein: in step 3, when the information to be collected is paid information, payment can be made through secure payment software inside the cloud server.
5. The method for classifying and collecting network information data according to claim 1, wherein: in step 3, repeated information data is filtered out by dedicated software, and similar information is merged.
6. The method for classifying and collecting network information data according to claim 1, wherein: in step 3, the web crawler acquires data corresponding to the corresponding classification parameters one by one for each link.
7. The method for classifying and collecting network information data according to claim 6, wherein: in the step 3, when information data is acquired, a target list page corresponding to the current link is acquired, wherein the target list page comprises at least one paging page; and accessing the detail link in each paging page, and acquiring data of the accessed detail link.
8. The method for classifying and collecting network information data according to claim 1, wherein: in step 4, the original information data is stored in the cloud server, and the information collection and classification software stores the collected information data by classification.
9. The method for classifying and collecting network information data according to claim 1, wherein: in step 4, the user may subscribe to the information data of the same category, and when the information data is updated, the update is pushed in time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011565108.7A CN112632356A (en) | 2020-12-25 | 2020-12-25 | Network information data classification collection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011565108.7A CN112632356A (en) | 2020-12-25 | 2020-12-25 | Network information data classification collection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112632356A (en) | 2021-04-09 |
Family
ID=75325033
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011565108.7A Pending CN112632356A (en) | 2020-12-25 | 2020-12-25 | Network information data classification collection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112632356A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114024979A (en) * | 2021-10-25 | 2022-02-08 | 深圳市高德信通信股份有限公司 | Distributed edge computing data storage system |
CN114945180A (en) * | 2022-04-06 | 2022-08-26 | 徐州工业职业技术学院 | Network high-order structure community generation and division method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104765823A (en) * | 2015-04-08 | 2015-07-08 | 天脉聚源(北京)传媒科技有限公司 | Method and device for collecting website data |
CN104951512A (en) * | 2015-05-27 | 2015-09-30 | 中国科学院信息工程研究所 | Public sentiment data collection method and system based on Internet |
CN105893583A (en) * | 2016-04-01 | 2016-08-24 | 北京鼎泰智源科技有限公司 | Data acquisition method and system based on artificial intelligence |
CN106168973A (en) * | 2016-07-11 | 2016-11-30 | 浪潮软件集团有限公司 | Network data classified collection method and device |
CN109213912A (en) * | 2018-08-16 | 2019-01-15 | 北京神州泰岳软件股份有限公司 | A kind of method and network data crawl dispatching device of crawl network data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||