CN112632356A - Network information data classification collection method - Google Patents

Info

Publication number
CN112632356A
Authority
CN
China
Prior art keywords
classification
data
information data
information
parameter
Prior art date
Legal status
Pending
Application number
CN202011565108.7A
Other languages
Chinese (zh)
Inventor
李锦基
黄永权
王勋
符伟杰
骆新坤
李明东
Current Assignee
Gold Sea Comm Corp
Original Assignee
Gold Sea Comm Corp
Priority date
2020-12-25
Filing date
2020-12-25
Publication date
2021-04-09
Application filed by Gold Sea Comm Corp
Priority to CN202011565108.7A
Publication of CN112632356A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation

Abstract

The invention discloses a network information data classification collection method comprising the following steps. Collecting preparation: a database is created inside a cloud server, information acquisition classification software is run inside the cloud server, and the information acquisition end of that software uses a plurality of web crawlers, each corresponding to at least one classification parameter; the parameter value corresponding to each classification parameter is determined. Building a link: the type of network information data to be acquired is first determined, a website or network address suitable for acquiring that data is then selected, and a link with the corresponding website or network address is established through the information acquisition classification software. The web crawlers capture the data of the target website or network address. Because at least one classification parameter of the data to be collected is determined in advance and the data are collected category by category, the list page corresponding to each category holds less content, so a web crawler can collect all of the data on that list page.

Description

Network information data classification collection method
Technical Field
The invention relates to the technical field of network information, in particular to a network information data classification and collection method.
Background
At present, the big data era is quietly emerging. A large amount of public information is available on the network and large-scale internet sites are common, so such websites have become key targets of data acquisition work, and various network information data classification collection methods have begun to appear. However, most websites targeted for information data acquisition are large-scale internet sites whose total data volume is very large; existing network information data classification collection methods cannot completely cover the information on such a website, so data are easily omitted.
Disclosure of Invention
The invention aims to provide a network information data classification collection method that solves the following problem: most websites from which information data is currently collected are large-scale internet sites whose total data volume is very large, existing network information data classification collection methods cannot completely cover the information on such a website, and data are therefore easily omitted.
To achieve this purpose, the invention provides the following technical solution: a network information data classification collection method comprising the following steps:
step 1: collecting preparation: creating a database inside a cloud server, operating information acquisition classification software inside the cloud server, adopting a plurality of web crawlers at an information acquisition end of the information acquisition classification software, wherein each web crawler corresponds to at least one classification parameter, and determining a parameter value corresponding to each classification parameter;
step 2: building a link: firstly, determining the type of network information data to be acquired, then selecting a website or a network address suitable for acquiring the network information data, establishing a link with the corresponding website or network address through information acquisition classification software, and respectively setting a plurality of web crawlers according to the data types, wherein each web crawler is responsible for acquiring one type or two types of data;
step 3: data acquisition: the web crawlers directly enter the target list page and the paging pages of the website or network address, capture the network information data in both the longitudinal and transverse directions, and then transmit the corresponding information data back to the database;
step 4: classified storage: the database is divided in advance into classified catalogues according to the classification requirements, the information data transmitted back by each web crawler is stored directly in the catalogue of the corresponding classification, and the user can select suitable classification software to reclassify, display and output the data in the database as required.
Preferably, in step 1, determining a parameter value corresponding to each classification parameter includes: determining a target website or network address where data to be acquired is located; acquiring a list page corresponding to data to be acquired from a target website or a network address; selecting each classification parameter one by one in a list page to obtain a classification link corresponding to each classification parameter; and determining a parameter value corresponding to each classification parameter according to each obtained classification link.
Preferably, in step 1, a link is generated for each classification parameter from that parameter and its corresponding parameter value: for each current classification parameter and its current parameter value, the current classification parameter, the current parameter value and a set character are spliced according to a set form; the spliced content is then added to the classification link corresponding to the current classification parameter to obtain the entry link corresponding to the current classification parameter.
Preferably, in step 3, when the information to be collected must be paid for, payment may be made through secure payment software inside the cloud server.
Preferably, in step 3, duplicate information data is filtered out by dedicated software, and similar information is merged.
Preferably, in step 3, the web crawler works through the links one by one, acquiring for each link the data corresponding to that link's classification parameter.
Preferably, in step 3, during information data acquisition, a target list page corresponding to the current link is obtained, where the target list page includes at least one paging page; and accessing the detail link in each paging page, and acquiring data of the accessed detail link.
Preferably, in step 4, the original information data is stored in the cloud server, and the information acquisition classification software stores the acquired information data by classification.
Preferably, in step 4, the user may subscribe to information data of a given category, and when that information data is updated, the update is pushed to the user promptly.
Compared with the prior art, the invention has the following beneficial effects: the web crawlers capture the data of the target website or network address; because at least one classification parameter of the data to be collected is determined in advance and the data are collected category by category, the list page corresponding to each category holds less content, so a web crawler can collect all of the data on that list page.
Detailed Description
The present invention will now be described in more detail by way of examples, which are given by way of illustration only and are not intended to limit the scope of the present invention in any way.
The invention provides the following technical solution: a network information data classification collection method comprising the following steps:
step 1: collecting preparation: creating a database inside a cloud server, operating information acquisition classification software inside the cloud server, adopting a plurality of web crawlers at an information acquisition end of the information acquisition classification software, wherein each web crawler corresponds to at least one classification parameter, and determining a parameter value corresponding to each classification parameter;
step 2: building a link: firstly, determining the type of network information data to be acquired, then selecting a website or a network address suitable for acquiring the network information data, establishing a link with the corresponding website or network address through information acquisition classification software, and respectively setting a plurality of web crawlers according to the data types, wherein each web crawler is responsible for acquiring one type or two types of data;
step 3: data acquisition: the web crawlers directly enter the target list page and the paging pages of the website or network address, capture the network information data in both the longitudinal and transverse directions, and then transmit the corresponding information data back to the database;
step 4: classified storage: the database is divided in advance into classified catalogues according to the classification requirements, the information data transmitted back by each web crawler is stored directly in the catalogue of the corresponding classification, and the user can select suitable classification software to reclassify, display and output the data in the database as required.
Example one:
Collecting preparation: a database is created inside a cloud server, the information acquisition classification software is run inside the cloud server, and the information acquisition end of that software uses a plurality of web crawlers, each corresponding to at least one classification parameter; the parameter value corresponding to each classification parameter is determined.
Building a link: the type of network information data to be acquired is first determined, a website or network address suitable for acquiring that data is then selected, a link with the corresponding website or network address is established through the information acquisition classification software, and a plurality of web crawlers are set up according to the data types, each responsible for acquiring one or two types of data.
Data acquisition: the web crawlers directly enter the target list page and the paging pages of the website or network address, capture the network information data in both the longitudinal and transverse directions, and then transmit the corresponding information data back to the database.
Classified storage: the database is divided in advance into classified catalogues according to the classification requirements, the information data transmitted back by each web crawler is stored directly in the catalogue of the corresponding classification, and the user can select suitable classification software to reclassify, display and output the data in the database as required.
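The classified-storage step of this embodiment can be illustrated with a minimal sketch. It assumes an in-memory SQLite database in place of the cloud-server database, and the table names, column names and category names are hypothetical; the patent does not specify a schema.

```python
import sqlite3


def create_database(categories):
    """Create the database and pre-divide it into one catalogue per category."""
    conn = sqlite3.connect(":memory:")  # assumption: stands in for the cloud database
    conn.execute("CREATE TABLE catalogue (name TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE record (catalogue TEXT NOT NULL, url TEXT, title TEXT)")
    conn.executemany("INSERT INTO catalogue (name) VALUES (?)",
                     [(name,) for name in categories])
    return conn


def store(conn, category, url, title):
    """Store one record returned by a crawler in the catalogue of its category."""
    known = conn.execute("SELECT 1 FROM catalogue WHERE name = ?",
                         (category,)).fetchone()
    if known is None:
        raise ValueError(f"no catalogue for category {category!r}")
    conn.execute("INSERT INTO record (catalogue, url, title) VALUES (?, ?, ?)",
                 (category, url, title))


def list_catalogue(conn, category):
    """Return the (url, title) pairs stored under one catalogue, in insertion order."""
    rows = conn.execute(
        "SELECT url, title FROM record WHERE catalogue = ? ORDER BY rowid",
        (category,))
    return rows.fetchall()
```

Each crawler would call `store` with the category it is responsible for, so records never need to be re-sorted after collection; a record for a category with no catalogue is rejected rather than silently misfiled.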
Example two:
On the basis of the first embodiment, the following is added:
In step 1, determining the parameter value corresponding to each classification parameter includes: determining the target website or network address where the data to be acquired is located; acquiring, from the target website or network address, the list page corresponding to the data to be acquired; selecting each classification parameter one by one in the list page to obtain the classification link corresponding to each classification parameter; and determining the parameter value corresponding to each classification parameter from the obtained classification links. This makes it easier to acquire information data accurately.
The remaining steps are the same as in the first embodiment.
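As an illustration of how parameter values might be read from the classification links, the following sketch assumes that each classification link carries its classification parameter in the URL query string. That is an assumption for the example only; the patent does not say where in the link the value appears, and the function name and URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def parameter_values(classification_links):
    """Map each classification parameter to the values found in the links.

    Assumption: the links look like https://example.com/list?region=north,
    i.e. the parameter and its value sit in the query string.
    """
    values = {}
    for link in classification_links:
        for name, value in parse_qsl(urlsplit(link).query):
            found = values.setdefault(name, [])
            if value not in found:  # keep each value once, in first-seen order
                found.append(value)
    return values
```

For example, the links `https://example.com/list?region=north`, `https://example.com/list?region=south` and `https://example.com/list?type=sale` would give two values for `region` and one for `type`.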
Example three:
On the basis of the second embodiment, the following is added:
In step 1, a link is generated for each classification parameter from that parameter and its corresponding parameter value: for each current classification parameter and its current parameter value, the current classification parameter, the current parameter value and a set character are spliced according to a set form; the spliced content is then added to the classification link corresponding to the current classification parameter to obtain the entry link corresponding to the current classification parameter. This improves the accuracy of information acquisition.
The remaining steps are the same as in the first embodiment.
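The splicing step can be sketched as follows. The sketch assumes that the "set form" is `parameter=value`, that the "set character" is `=`, and that the spliced content is appended to the link as a query-string component; all three are assumptions made for illustration, since the patent leaves the form and character unspecified.

```python
from urllib.parse import quote


def build_entry_link(classification_link, parameter, value, set_character="="):
    """Splice parameter, set character and value, then add them to the link."""
    spliced = f"{quote(parameter)}{set_character}{quote(value)}"
    # Assumption: the spliced content joins the link as a query component.
    separator = "&" if "?" in classification_link else "?"
    return f"{classification_link}{separator}{spliced}"
```

Under these assumptions, `build_entry_link("https://example.com/list", "region", "north")` yields `https://example.com/list?region=north`, and a value containing a space is percent-encoded before splicing.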
Example four:
On the basis of the third embodiment, the following is added:
In step 3, when the information to be collected must be paid for, payment can be made through secure payment software inside the cloud server; duplicate information data is filtered out by dedicated software and similar information is merged, which avoids problems such as duplicated records in the collected information data.
The remaining steps are the same as in the first embodiment.
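The patent does not say how the dedicated software detects duplicates or similar items. One possible approach, shown here purely as an assumption, compares normalised titles with the standard-library `difflib.SequenceMatcher` and merges the source URLs of records whose similarity reaches a threshold; the record layout and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def dedupe_and_merge(records, threshold=0.9):
    """Drop duplicates and merge similar records, keeping every source URL.

    Each record is a dict with a "title" string and a "urls" list
    (a hypothetical layout chosen for this sketch).
    """
    merged = []
    for record in records:
        title = normalise(record["title"])
        for kept in merged:
            similarity = SequenceMatcher(None, title, normalise(kept["title"])).ratio()
            if similarity >= threshold:
                # Same or similar item: fold its URLs into the kept record.
                for url in record["urls"]:
                    if url not in kept["urls"]:
                        kept["urls"].append(url)
                break
        else:
            merged.append({"title": record["title"], "urls": list(record["urls"])})
    return merged
```

Pairwise comparison is quadratic in the number of records, so a real deployment on a large site would probably need hashing or indexing in front of it.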
Example five:
On the basis of the fourth embodiment, the following is added:
In step 3, the web crawler works through the links one by one, acquiring for each link the data corresponding to that link's classification parameter. During information data acquisition, the target list page corresponding to the current link is obtained, the target list page comprising at least one paging page; the detail links in each paging page are accessed, and data is acquired from each accessed detail link. This makes the information easier to crawl.
The remaining steps are the same as in the first embodiment.
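The traversal from an entry link through its paging pages to the detail links can be sketched with the standard-library HTML parser. The sketch assumes that paging links and detail links are distinguishable by the hypothetical CSS classes `page` and `detail`, and that page content is supplied by a caller-provided `fetch` function; real sites would need their own selectors and an HTTP client.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (css_class, href) pairs for every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            self.links.append((attributes.get("class") or "", href))


def crawl_entry_link(entry_link, fetch):
    """Visit every paging page reachable from entry_link; return detail links."""
    seen_pages, details = set(), []
    queue = [entry_link]
    while queue:
        page_url = queue.pop(0)
        if page_url in seen_pages:
            continue  # paging pages usually link back to each other
        seen_pages.add(page_url)
        parser = LinkExtractor()
        parser.feed(fetch(page_url))
        for css_class, href in parser.links:
            target = urljoin(page_url, href)
            if css_class == "page":        # assumption: marks a paging link
                queue.append(target)
            elif css_class == "detail" and target not in details:
                details.append(target)     # assumption: marks a detail link
    return details
```

Because `fetch` is injected, the same function can be exercised against a dictionary of canned pages in tests and against a real HTTP client in use.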
Example six:
On the basis of the fifth embodiment, the following is added:
In step 4, the original information data is stored in the cloud server and the information acquisition classification software stores the acquired information data by classification; the user can subscribe to information data of a given category, and when that information data is updated, the update is pushed to the user promptly. This makes it convenient to keep the information data for a long time.
The remaining steps are the same as in the first embodiment.
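The subscription and push behaviour of this embodiment can be illustrated with a small observer-style sketch. The class and method names are hypothetical, and delivery is modelled as a plain callback; the patent does not describe the push channel.

```python
from collections import defaultdict


class SubscriptionHub:
    """Track which users follow which category and push updates to them."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, category, callback):
        """Register a callback to be invoked when the category is updated."""
        self._subscribers[category].append(callback)

    def publish_update(self, category, record):
        """Push a new or changed record to every subscriber of the category.

        Returns the number of subscribers notified.
        """
        callbacks = self._subscribers.get(category, [])
        for callback in callbacks:
            callback(category, record)
        return len(callbacks)
```

The storage step would call `publish_update` whenever it writes a record into a catalogue, so subscribers of that category are notified without polling the database.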
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A network information data classification collection method, characterized in that the method comprises the following steps:
step 1: collecting preparation: creating a database inside a cloud server, operating information acquisition classification software inside the cloud server, adopting a plurality of web crawlers at an information acquisition end of the information acquisition classification software, wherein each web crawler corresponds to at least one classification parameter, and determining a parameter value corresponding to each classification parameter;
step 2: building a link: firstly, determining the type of network information data to be acquired, then selecting a website or a network address suitable for acquiring the network information data, establishing a link with the corresponding website or network address through information acquisition classification software, and respectively setting a plurality of web crawlers according to the data types, wherein each web crawler is responsible for acquiring one type or two types of data;
step 3: data acquisition: the web crawlers directly enter the target list page and the paging pages of the website or network address, capture the network information data in both the longitudinal and transverse directions, and then transmit the corresponding information data back to the database;
step 4: classified storage: the database is divided in advance into classified catalogues according to the classification requirements, the information data transmitted back by each web crawler is stored directly in the catalogue of the corresponding classification, and the user can select suitable classification software to reclassify, display and output the data in the database as required.
2. The method for classifying and collecting network information data according to claim 1, wherein: in step 1, determining the parameter value corresponding to each classification parameter includes: determining the target website or network address where the data to be acquired is located; acquiring, from the target website or network address, the list page corresponding to the data to be acquired; selecting each classification parameter one by one in the list page to obtain the classification link corresponding to each classification parameter; and determining the parameter value corresponding to each classification parameter from the obtained classification links.
3. The method for classifying and collecting network information data according to claim 2, wherein: in step 1, a link is generated for each classification parameter from that parameter and its corresponding parameter value: for each current classification parameter and its current parameter value, the current classification parameter, the current parameter value and a set character are spliced according to a set form; and the spliced content is added to the classification link corresponding to the current classification parameter to obtain the entry link corresponding to the current classification parameter.
4. The method for classifying and collecting network information data according to claim 1, wherein: in step 3, when the information to be collected must be paid for, payment can be made through secure payment software inside the cloud server.
5. The method for classifying and collecting network information data according to claim 1, wherein: in step 3, duplicate information data is filtered out by dedicated software, and similar information is merged.
6. The method for classifying and collecting network information data according to claim 1, wherein: in step 3, the web crawler works through the links one by one, acquiring for each link the data corresponding to that link's classification parameter.
7. The method for classifying and collecting network information data according to claim 6, wherein: in the step 3, when information data is acquired, a target list page corresponding to the current link is acquired, wherein the target list page comprises at least one paging page; and accessing the detail link in each paging page, and acquiring data of the accessed detail link.
8. The method for classifying and collecting network information data according to claim 1, wherein: in step 4, the original information data is stored in the cloud server, and the information acquisition classification software stores the acquired information data by classification.
9. The method for classifying and collecting network information data according to claim 1, wherein: in step 4, the user may subscribe to information data of a given category, and when that information data is updated, the update is pushed to the user promptly.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011565108.7A (published as CN112632356A) | 2020-12-25 | 2020-12-25 | Network information data classification collection method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011565108.7A (published as CN112632356A) | 2020-12-25 | 2020-12-25 | Network information data classification collection method

Publications (1)

Publication Number | Publication Date
CN112632356A | 2021-04-09

Family

ID=75325033

Family Applications (1)

Application Number | Status | Priority Date | Filing Date | Title
CN202011565108.7A (published as CN112632356A) | Pending | 2020-12-25 | 2020-12-25 | Network information data classification collection method

Country Status (1)

Country | Link
CN (1) | CN112632356A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114024979A * | 2021-10-25 | 2022-02-08 | 深圳市高德信通信股份有限公司 | Distributed edge computing data storage system
CN114945180A * | 2022-04-06 | 2022-08-26 | 徐州工业职业技术学院 | Network high-order structure community generation and division method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104765823A * | 2015-04-08 | 2015-07-08 | 天脉聚源(北京)传媒科技有限公司 | Method and device for collecting website data
CN104951512A * | 2015-05-27 | 2015-09-30 | 中国科学院信息工程研究所 | Public sentiment data collection method and system based on Internet
CN105893583A * | 2016-04-01 | 2016-08-24 | 北京鼎泰智源科技有限公司 | Data acquisition method and system based on artificial intelligence
CN106168973A * | 2016-07-11 | 2016-11-30 | 浪潮软件集团有限公司 | Network data classified collection method and device
CN109213912A * | 2018-08-16 | 2019-01-15 | 北京神州泰岳软件股份有限公司 | A kind of method and network data crawl dispatching device of crawl network data


DE202014010918U1 (en) The clustering of ads with organic map content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination