CN112035725A - Data acquisition system and method - Google Patents

Data acquisition system and method

Info

Publication number
CN112035725A
Authority
CN
China
Prior art keywords
data
crawler
module
management module
filtering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010914439.0A
Other languages
Chinese (zh)
Inventor
张学颖
曹六一
杨飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN202010914439.0A
Publication of CN112035725A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The embodiment of the invention provides a data acquisition system and method. The system comprises a crawler management module, a collection cluster module and a data landing module. The crawler management module is provided with a plurality of crawler threads, and different crawler threads correspond to different data acquisition modes. The crawler management module is used for controlling, based on a scheduling mechanism, the corresponding crawler thread to collect data in a webpage through the collection cluster module, and for processing the collected data based on a duplicate-filtering mechanism to obtain effective data. The data landing module is used for acquiring the effective data and writing it into a corresponding memory according to a predetermined landing path. By uniformly managing the crawler threads and filtering duplicate data, the data acquisition system greatly reduces the workload of developers and improves the efficiency and precision of data acquisition.

Description

Data acquisition system and method
Technical Field
The embodiment of the invention relates to the technical field of data acquisition, in particular to a data acquisition system and a data acquisition method.
Background
With the rapid development of networks, the internet has become a carrier of a large amount of information, including public opinion information, employment information, social event information and industry information. Different webpages carry different kinds of information: for example, webpages of entertainment websites mainly carry public opinion information, while medical webpages mainly carry information about the medical industry. Effectively collecting this information from each webpage is the basis of big data analysis.
At present, the web crawler is a very important part of a data analysis system. It is responsible for collecting webpages from the internet and extracting the information in them. The collected information supports subsequent big data analysis, and the type and collection speed of that information directly determine the content richness and the analysis quality of the whole data analysis system.
However, a general-purpose crawler framework cannot meet the collection and development requirements of many personalized websites. If a collection framework is developed separately for each personalized website, developers must develop modules such as scheduling, parsing, filtering and landing for every such website. As the number of personalized websites grows, the developers' workload increases, which reduces the efficiency and precision of data acquisition.
Disclosure of Invention
The embodiment of the invention provides a data acquisition system and a data acquisition method, which aim to overcome the technical problems of low efficiency and low precision of internet webpage data acquisition in the prior art.
In a first aspect, an embodiment of the present invention provides a data acquisition system, including:
a crawler management module, a collection cluster module and a data landing module;
the crawler management module is provided with a plurality of crawler threads, and the data acquisition modes corresponding to different crawler threads are different;
the crawler management module is used for: controlling the corresponding crawler thread to collect data in the webpage through the collection cluster module based on a scheduling mechanism, and processing the collected data based on a filtering mechanism to obtain effective data;
and the data landing module is used for acquiring the effective data and writing the effective data into a corresponding memory according to a predetermined landing path.
Optionally, the crawler management module includes a scheduling unit, and the scheduling unit is configured to:
controlling a corresponding crawler thread to create a corresponding collection task, and sending the collection task to the collection cluster module so that the collection cluster module collects list pages and content pages in a corresponding website according to the collection task;
and analyzing the list page and the content page to obtain a derivative task or a write-in file, wherein the derivative task comprises a list page task and a content page task.
Optionally, the crawler management module further includes a filtering unit, and the filtering unit is configured to:
and filtering the list pages and the content pages according to the hash values of the Uniform Resource Locators (URLs) corresponding to the list pages and the content pages.
Optionally, the filtering unit is further configured to:
and carrying out filtering operation on the list page and the content page according to a preset filtering time point.
Optionally, the filtering unit is further configured to:
and carrying out filtering operation on the list page and the content page in the valid time range according to the preset valid time range.
Optionally, the crawler management module further includes a checking unit, and the checking unit is configured to:
and checking each field contained in the written file, and taking the field meeting preset conditions as valid data.
Optionally, the data landing module is specifically configured to:
obtaining effective data sent by the plurality of crawler threads, wherein each crawler thread comprises data landing path information;
and writing the effective data sent by each crawler thread into the memory corresponding to that thread's data landing path information.
In a second aspect, an embodiment of the present invention provides a data acquisition method, including:
the crawler management module controls a plurality of crawler threads to collect data in a webpage through the collection cluster module based on a scheduling mechanism, the crawler threads are deployed in the crawler management module in advance, and the data collection modes corresponding to different crawler threads are different;
the crawler management module processes the collected data based on a duplicate-filtering mechanism to obtain effective data;
and the data landing module writes the effective data into a corresponding memory according to a predetermined landing path.
In a third aspect, an embodiment of the present invention provides a computer device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the data acquisition method described in the second aspect and its various possible designs.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the data acquisition method described in the second aspect and its various possible designs is implemented.
The embodiment of the invention provides a data acquisition system and method. The system comprises a crawler management module, a collection cluster module and a data landing module. The crawler management module is provided with a plurality of crawler threads, and different crawler threads correspond to different data acquisition modes. Based on an internally deployed scheduling mechanism, the crawler management module can control the corresponding crawler threads to collect data in a webpage through the collection cluster module. Because the crawler management module manages the crawler work in a unified manner, repeated deployment work by developers is reduced and data acquisition efficiency is improved. The crawler management module also processes the collected data based on an internally deployed duplicate-filtering mechanism to obtain effective data, which avoids collecting the same data repeatedly and improves the precision of data acquisition. Finally, the data landing module acquires the effective data and writes it into a corresponding memory according to a predetermined landing path. By uniformly managing the crawler threads and filtering duplicate data, the data acquisition system greatly reduces the workload of developers and improves the efficiency and precision of data acquisition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a first schematic structural diagram of a data acquisition system according to an embodiment of the present invention;
Fig. 2 is a second schematic structural diagram of a data acquisition system according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart of a data acquisition method according to an embodiment of the present invention;
Fig. 4 is an application scenario diagram of the data acquisition method according to an embodiment of the present invention;
Fig. 5 is a schematic hardware structure diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the related art, with the rapid development of networks, the internet has become a carrier of a large amount of information, including public opinion information, employment information, social event information and industry information. Different webpages carry different kinds of information: for example, webpages of entertainment websites mainly carry public opinion information, while medical webpages mainly carry information about the medical industry. Effectively collecting this information from each webpage is the basis of big data analysis. The web crawler is an important part of a data analysis system. It is responsible for collecting webpages from the internet and extracting the information in them. The collected information supports subsequent big data analysis, and the type and collection speed of that information directly determine the content richness and the analysis quality of the whole data analysis system. However, a general-purpose crawler framework cannot meet the collection and development requirements of many personalized websites. If a collection framework is developed separately for each personalized website, developers must develop modules such as scheduling, parsing, filtering and landing for every such website. As the number of personalized websites grows, the developers' workload increases, which reduces the efficiency and precision of data acquisition. For example, existing crawler frameworks such as Scrapy and Nutch do not provide enough support for personalized collection, whereas most users need crawlers that extract data precisely.
Nutch is a crawler designed for search engines, and two thirds of its workflow is designed for search engines, so its support for precise extraction is insufficient. That is, using Nutch for data extraction wastes much time on unnecessary computation. The other framework, Scrapy, does not support JavaScript rendering and requires Selenium to be downloaded separately. Its fully command-driven operation improves ease of use, but greatly reduces extensibility. Scrapy is also not friendly to deduplication of uniform resource locators (URLs): a Bloom filter is used for deduplication, and because a Bloom filter is a probabilistic filter, it has a misjudgment (false-positive) rate. As a framework, Scrapy's default debug mode outputs too much information, which makes it hard to debug and hard to read. The framework offers little room for customization and requires a lot of background knowledge, so completing one crawler takes a long time.
In order to overcome the above defects, the technical idea of the present application is to design a data acquisition system with which developers can collect data from a personalized website simply by implementing two functions, task generation and content extraction, and then performing simple configuration. The data acquisition system specifically comprises a crawler management module, a collection cluster module and a data landing module. The crawler management module is provided with a plurality of crawler threads, and different crawler threads correspond to different data acquisition modes. Based on an internally deployed scheduling mechanism, the crawler management module can control the corresponding crawler threads to collect data in a webpage through the collection cluster module. Because the crawler management module manages the crawler work in a unified manner, repeated deployment work by developers is reduced and data acquisition efficiency is improved. The crawler management module also processes the collected data based on an internally deployed duplicate-filtering mechanism to obtain effective data, which avoids collecting the same data repeatedly and improves the precision of data acquisition. Finally, the data landing module acquires the effective data and writes it into a corresponding memory according to a predetermined landing path. By uniformly managing the crawler threads and filtering duplicate data, the data acquisition system greatly reduces the workload of developers and improves the efficiency and precision of data acquisition.
Fig. 1 is a first schematic structural diagram of a data acquisition system according to an embodiment of the present invention.
As shown in fig. 1, a system provided in an embodiment of the present invention includes: the crawler management module 11, the collection cluster module 12 and the data landing module 13; the crawler management module is provided with a plurality of crawler threads, and the data acquisition modes corresponding to different crawler threads are different; the crawler management module is used for: controlling the corresponding crawler thread to collect data in the webpage through the collection cluster module based on a scheduling mechanism, and processing the collected data based on a filtering mechanism to obtain effective data; and the data landing module is used for acquiring the effective data and writing the effective data into a corresponding memory according to a predetermined landing path.
Specifically, a developer can deploy a plurality of crawler threads in the crawler management module in advance, and the crawler threads can run simultaneously and independently without mutual influence. The crawler threads can be added, deleted or updated by developers according to different personalized websites, so that the requirement for acquiring personalized website data is met, the developers do not need to repeatedly deploy for different personalized websites, and the data acquisition efficiency is improved.
In a possible embodiment, a developer may preset different acquisition modes for different crawler threads. For example, the acquisition mode of the first crawler thread is set to collect webpage data repeatedly at intervals, where the time interval may be determined according to actual requirements; as another example, the acquisition mode of the second crawler thread is set to collect all data on the webpage at one time. The crawler management module automatically wakes up the corresponding crawler thread to crawl data on the webpage according to the acquisition mode. For example, because the entertainment information on a first webpage may change at any time and needs to be re-collected at intervals, the crawler management module may call the first crawler thread, which supports repeated collection at each interval, to crawl the data in the first webpage. For the industry bulletin information on a second webpage, the crawler management module can call the second crawler thread to crawl the data in the second webpage.
In a possible embodiment, after the crawler management module wakes up the corresponding crawler thread, the crawler thread creates a collection task and sends it to the collection cluster module. The collection cluster module executes the collection task, collects data from a webpage and returns the collected data to the crawler management module, which performs duplicate filtering and other related processing on the collected data to obtain effective data. The duplicate filtering mainly removes webpage data that has been collected repeatedly, which improves the precision of the collected data. Finally, the crawler thread sends the effective data to the data landing module, and the data landing module writes the effective data into a corresponding memory according to a predetermined landing path.
In a possible embodiment, each crawler thread may store data landing path information in advance. The data landing path information may be added by a developer when developing the crawler thread, and different crawler threads correspond to different data landing paths; for example, the data landing path information of the first crawler thread is a first storage address, and that of the second crawler thread is a second storage address. After receiving the effective data sent by the first crawler thread, the data landing module stores it in the memory corresponding to the first storage address; after receiving the effective data sent by the second crawler thread, the data landing module stores it in the memory corresponding to the second storage address. Therefore, file storage congestion does not occur when the data landing module writes a large amount of effective data into memory.
Fig. 2 is a schematic structural diagram of a data acquisition system according to an embodiment of the present invention, and the embodiment of the present invention further details an internal structure of the crawler management module based on the embodiment of the system shown in fig. 1.
As shown in fig. 2, the system provided by the present embodiment includes: the crawler management module 11, the collection cluster module 12 and the data landing module 13; the crawler management module is provided with a plurality of crawler threads, and the data acquisition modes corresponding to different crawler threads are different; the crawler management module is used for: controlling the corresponding crawler thread to collect data in the webpage through the collection cluster module based on a scheduling mechanism, and processing the collected data based on a filtering mechanism to obtain effective data; and the data landing module is used for acquiring the effective data and writing the effective data into a corresponding memory according to a predetermined landing path.
Further, the crawler management module includes a scheduling unit 111, and the scheduling unit is configured to: controlling a corresponding crawler thread to create a corresponding collection task, and sending the collection task to the collection cluster module so that the collection cluster module collects list pages and content pages in a corresponding website according to the collection task; and analyzing the list page and the content page to obtain a derivative task or a write-in file, wherein the derivative task comprises a list page task and a content page task.
Specifically, download tasks are divided into two categories: list pages and content pages. Parsing a list page may derive list page tasks and content page tasks. Parsing a content page may derive a content page task or write a file; when a content page task is derived, the information of the current page must be stored with it so that the information is not lost. The scheduling unit manages all download tasks (namely data acquisition tasks). After a download task has been downloaded by the collection cluster, parsing the task either generates derivative tasks or writes a file. When derivative tasks are generated, the deduplication mechanism is called so that tasks that have already been collected are not downloaded again. The filtering mechanism can also be customized by the user.
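The task derivation described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Task` class, the `parse` and `run` functions, and the page keys `next_lists`, `items`, `fields` and `next_content` are all hypothetical names chosen for illustration, and `fetch` stands in for the collection cluster.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    url: str
    kind: str                                    # "list" or "content"
    context: dict = field(default_factory=dict)  # page info carried into derived tasks


def parse(task, page):
    """Return (derived_tasks, record) for one downloaded page.

    `page` is a hypothetical pre-parsed dict; real parsing is site-specific.
    """
    if task.kind == "list":
        # A list page may derive further list page tasks and content page tasks.
        derived = [Task(u, "list") for u in page.get("next_lists", [])]
        derived += [Task(u, "content") for u in page.get("items", [])]
        return derived, None
    # A content page either derives another content page task, carrying the
    # fields gathered so far so nothing is lost, or yields a record to write.
    fields = {**task.context, **page.get("fields", {})}
    if page.get("next_content"):
        return [Task(page["next_content"], "content", fields)], None
    return [], fields


def run(seed, fetch):
    """Process tasks breadth-first, skipping URLs that were already queued."""
    queue, seen, records = deque([seed]), {seed.url}, []
    while queue:
        task = queue.popleft()
        derived, record = parse(task, fetch(task.url))
        if record is not None:
            records.append(record)
        for new_task in derived:
            if new_task.url not in seen:         # deduplicate derived tasks
                seen.add(new_task.url)
                queue.append(new_task)
    return records
```

In this sketch a developer would only supply the site-specific parsing, which matches the stated goal of implementing task generation and content extraction while the framework handles scheduling and deduplication.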
In a possible embodiment, a developer may preset different acquisition modes for different crawler threads. For example, the acquisition mode of the first crawler thread is set to collect webpage data repeatedly at intervals, where the time interval may be determined according to actual requirements; as another example, the acquisition mode of the second crawler thread is set to collect all data on the webpage at one time. The scheduling unit in the crawler management module automatically wakes up the corresponding crawler thread to crawl data on the webpage according to the acquisition mode. For example, because the entertainment information on a first webpage may change at any time and needs to be re-collected at intervals, the scheduling unit may invoke the first crawler thread, which supports repeated collection at each interval, to crawl the data in the first webpage. For the industry bulletin information on a second webpage, the scheduling unit can call the second crawler thread to crawl the data in the second webpage.
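As a rough illustration of how a scheduling unit might decide which crawler to wake, the following Python sketch models the two acquisition modes mentioned above (repeated collection at an interval, and one-time collection). The `CrawlerSpec` class and the `due` function are hypothetical names, and timestamps are passed in explicitly so the logic stays deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlerSpec:
    name: str
    interval: Optional[float] = None   # seconds between runs; None means collect once
    last_run: Optional[float] = None   # time of the most recent wake-up, if any


def due(specs, now):
    """Return the names of crawlers to wake at time `now` and mark them as run."""
    woken = []
    for spec in specs:
        if spec.last_run is None:
            ready = True                               # never run yet
        elif spec.interval is None:
            ready = False                              # one-time crawler already ran
        else:
            ready = now - spec.last_run >= spec.interval
        if ready:
            spec.last_run = now
            woken.append(spec.name)
    return woken


# Usage: an interval crawler for changing pages, a one-time crawler for bulletins.
specs = [CrawlerSpec("entertainment", interval=60.0), CrawlerSpec("bulletin")]
first_round = due(specs, now=0.0)
```

A real scheduler would run this check in a loop against a clock and hand the woken crawlers to worker threads; that part is omitted here.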
In a possible embodiment, after the scheduling unit wakes up the corresponding crawler thread according to the acquisition mode suited to each personalized webpage, the crawler thread creates a collection task and sends it to the collection cluster module. The collection cluster module executes the collection task, collects data from the webpage and returns the collected data to the crawler management module, which performs duplicate filtering and other related processing on the collected data to obtain effective data. The duplicate filtering mainly removes webpage data that has been collected repeatedly, which improves the precision of the collected data. Finally, the crawler thread sends the effective data to the data landing module, and the data landing module writes the effective data into a corresponding memory according to a predetermined landing path.
Further, the crawler management module further includes a filtering unit 112, where the filtering unit is configured to: and filtering the list pages and the content pages according to the hash values of the Uniform Resource Locators (URLs) corresponding to the list pages and the content pages.
Specifically, the filtering unit filters the list pages and the content pages based on the duplicate-filtering mechanism to filter out webpages containing repeated content. This avoids collecting a large amount of repeated data, saves data acquisition time and improves data acquisition precision.
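A minimal Python sketch of duplicate filtering by URL hash is given below. The source does not specify the hash function or the data structure, so SHA-1 and an in-memory set are assumptions, and the `UrlFilter` class is a hypothetical name. Unlike a Bloom filter, exact membership in a set of digests has no false positives apart from hash collisions.

```python
import hashlib


class UrlFilter:
    """Remember the hash of every URL seen so far and reject repeats."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def fingerprint(url: str) -> str:
        # SHA-1 is an assumption; any stable hash of the URL would do.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def is_new(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        fp = self.fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


url_filter = UrlFilter()
pending = [u for u in ["https://example.com/a", "https://example.com/a",
                       "https://example.com/b"] if url_filter.is_new(u)]
```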
In a possible embodiment, the above filtering mechanism is built as a deep encapsulation of an embedded database. The embedded database stores its data on disk (external storage), so that massive numbers of URLs can be filtered without occupying memory.
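The source does not name the embedded database. As a stand-in, the following Python sketch uses SQLite from the standard library to keep URL fingerprints on disk; the `DiskUrlFilter` class, the `seen` table and the choice of SHA-1 are assumptions made for illustration.

```python
import hashlib
import sqlite3


class DiskUrlFilter:
    """Disk-backed URL filter: fingerprints live in an SQLite file, not in RAM."""

    def __init__(self, path: str):
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS seen (fp TEXT PRIMARY KEY)")
        self._db.commit()

    def is_new(self, url: str) -> bool:
        """Return True and record the URL if it has not been seen before."""
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        row = self._db.execute("SELECT 1 FROM seen WHERE fp = ?", (fp,)).fetchone()
        if row is not None:
            return False
        self._db.execute("INSERT INTO seen (fp) VALUES (?)", (fp,))
        self._db.commit()
        return True

    def close(self) -> None:
        self._db.close()
```

Because the records are stored in a file, the filter also survives a restart of the crawler, which an in-memory set would not.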
In a possible embodiment, the filtering unit 112 is further configured to: and carrying out filtering operation on the list page and the content page according to a preset filtering time point.
Specifically, the developer can configure the filtering time point in the filtering unit according to actual requirements, and the filtering unit then performs the filtering operation at the configured time point. For example, the filtering record can be written before a task is downloaded or after the task has been downloaded.
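One way to read the configurable filtering time point is sketched below in Python. The `collect` function and its `record_when` parameter are hypothetical. The sketch also shows a consequence that the source does not state and that is only an inference: if the record is written before the download, a failed download is still marked as seen, whereas writing it afterwards leaves failed URLs eligible for retry.

```python
def collect(urls, fetch, seen, record_when="after"):
    """Fetch each unseen URL; `record_when` picks when the URL is added to `seen`."""
    if record_when not in ("before", "after"):
        raise ValueError("record_when must be 'before' or 'after'")
    pages = {}
    for url in urls:
        if url in seen:
            continue                      # already recorded: skip the download
        if record_when == "before":
            seen.add(url)                 # record before the download starts
        try:
            pages[url] = fetch(url)
        except OSError:
            continue                      # download failed: nothing to store
        if record_when == "after":
            seen.add(url)                 # record only after a successful download
    return pages
```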
In a possible embodiment, the filtering unit 112 is further configured to: and carrying out filtering operation on the list page and the content page in the valid time range according to the preset valid time range.
Specifically, the developer can configure the effective time range for filtering in the filtering unit. For example, only data from the last three days is filtered, and data older than three days is treated as expired, which allows a personalized data expiry time.
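Under one possible reading of the effective time range, a filtering record only counts while it is younger than a configured lifetime; once it expires, the same URL is treated as new again. The Python sketch below assumes that reading. The `ExpiringUrlFilter` class is a hypothetical name, and timestamps are passed in explicitly for clarity.

```python
class ExpiringUrlFilter:
    """URL filter whose records expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._seen = {}                    # url -> time of last accepted collection

    def is_new(self, url: str, now: float) -> bool:
        last = self._seen.get(url)
        if last is not None and now - last < self.ttl:
            return False                   # still inside the effective time range
        self._seen[url] = now              # first sighting, or the old record expired
        return True


DAY = 86400
three_day_filter = ExpiringUrlFilter(ttl_seconds=3 * DAY)
```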
In a possible embodiment, the crawler management module further comprises a checking unit 113 for: and checking each field contained in the written file, and taking the field meeting preset conditions as valid data.
Specifically, in order to ensure the correctness and integrity of the data, the checking unit checks each field in all the collected files, determines whether a file contains typos, incomplete information and the like, and takes the correct and complete field content as effective data.
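The preset conditions are not detailed in the source. As a simple illustration, the following Python sketch treats a field as valid when it is present and non-empty after trimming whitespace; the `check_record` function and the example field names are hypothetical, and checks such as typo detection are outside the scope of this sketch.

```python
def check_record(record, required):
    """Split a record into valid fields and the names of fields that failed.

    A field passes here if it exists and is not empty; real preset
    conditions would be configured per website.
    """
    valid, problems = {}, []
    for name in required:
        value = record.get(name)
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "":
            problems.append(name)          # missing or incomplete field
        else:
            valid[name] = value
    return valid, problems


valid_fields, failed = check_record({"title": " A ", "body": ""},
                                    ["title", "body", "date"])
```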
Further, the data landing module is specifically configured to: obtain effective data sent by the plurality of crawler threads, wherein each crawler thread comprises data landing path information; and write the effective data sent by each crawler thread into the memory corresponding to that thread's data landing path information.
Specifically, all crawler threads deployed in the crawler management module are registered in the data landing module in advance by a developer, and the data landing module receives, through a queue, the effective data transmitted by all the crawler threads registered with it. The data landing module supports custom data landing path information in each crawler thread; for example, the data landing path information of the first crawler thread is a first storage address, and that of the second crawler thread is a second storage address. After receiving the effective data sent by the first crawler thread, the data landing module stores it in the memory corresponding to the first storage address; after receiving the effective data sent by the second crawler thread, the data landing module stores it in the memory corresponding to the second storage address. Therefore, file storage congestion does not occur when the data landing module writes a large amount of effective data into memory.
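The registration, queueing and per-crawler landing paths described above might be sketched in Python as follows. The `LandingModule` class, its method names and the JSON Lines output format are assumptions made for illustration; the source only states that data arrives through a queue and is written to the path registered for each crawler thread.

```python
import json
import queue
from pathlib import Path


class LandingModule:
    """Receive records through a queue and append them to per-crawler files."""

    def __init__(self):
        self._paths = {}                   # crawler name -> landing path
        self._queue = queue.Queue()        # thread-safe hand-off from crawler threads

    def register(self, crawler: str, path) -> None:
        self._paths[crawler] = Path(path)

    def submit(self, crawler: str, record: dict) -> None:
        if crawler not in self._paths:
            raise KeyError(f"crawler {crawler!r} is not registered")
        self._queue.put((crawler, record))

    def drain(self) -> int:
        """Write every queued record to its crawler's path; return the count."""
        written = 0
        while True:
            try:
                crawler, record = self._queue.get_nowait()
            except queue.Empty:
                return written
            path = self._paths[crawler]
            path.parent.mkdir(parents=True, exist_ok=True)
            with path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
```

Crawler threads would call `submit`, while a single writer calls `drain`; keeping one writer per module is a design choice of this sketch that avoids concurrent writes to the same file.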
Fig. 3 is a schematic flow chart of a data acquisition method according to an embodiment of the present invention.
As shown in fig. 3, the method provided by the present embodiment may include the following steps.
S301, the crawler management module controls, based on a scheduling mechanism, a plurality of crawler threads to collect data in a webpage through the collection cluster module, wherein the crawler threads are deployed in the crawler management module in advance, and different crawler threads correspond to different data acquisition modes.
Specifically, a developer can deploy a plurality of crawler threads in the crawler management module in advance, and each crawler thread can run simultaneously and independently without mutual influence. The crawler threads can be added, deleted or updated by developers according to different personalized websites, so that the requirement for acquiring personalized website data is met, the developers do not need to repeatedly deploy for different personalized websites, and the data acquisition efficiency is improved.
In a possible embodiment, a developer may preset different acquisition modes for different crawler threads. For example, the acquisition mode of the first crawler thread is set to collect webpage data repeatedly at intervals, where the time interval may be determined according to actual requirements; as another example, the acquisition mode of the second crawler thread is set to collect all data on the webpage at one time. The crawler management module automatically wakes up the corresponding crawler thread to crawl data on the webpage according to the acquisition mode. For example, because the entertainment information on a first webpage may change at any time and needs to be re-collected at intervals, the crawler management module may call the first crawler thread, which supports repeated collection at each interval, to crawl the data in the first webpage. For the industry bulletin information on a second webpage, the crawler management module can call the second crawler thread to crawl the data in the second webpage.
S302, the crawler management module processes the collected data based on a duplicate-filtering mechanism to obtain effective data.
S303, the data landing module writes the effective data into a corresponding memory according to a predetermined landing path.
After the crawler management module wakes the corresponding crawler thread, the crawler thread creates a collection task and sends it to the collection cluster module. The collection cluster module executes the collection task, collects data from the webpage, and returns the collected data to the crawler management module, which performs duplicate filtering and other related processing on the collected data to obtain effective data. The duplicate filtering mainly removes data of the same webpage that has been collected repeatedly, thereby improving the accuracy of the collected data. Finally, the crawler thread sends the effective data to the data landing module, and the data landing module writes the effective data into a corresponding memory according to the predetermined landing path.
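A minimal sketch of the duplicate filtering follows. Claim 3 recites filtering by the hash value of the URL; the choice of SHA-256, the in-memory dictionary, and the class name `UrlDuplicateFilter` are assumptions made only for illustration. The optional `valid_seconds` parameter is one possible reading of the valid time range in claim 5, not a confirmed implementation.

```python
import hashlib
from typing import Dict, Optional


class UrlDuplicateFilter:
    """Hypothetical duplicate filter keyed by the hash of each page URL."""

    def __init__(self, valid_seconds: Optional[float] = None) -> None:
        # None means a URL is treated as a duplicate forever (assumption).
        self.valid_seconds = valid_seconds
        self._seen: Dict[str, float] = {}  # URL hash -> time it was recorded

    @staticmethod
    def url_hash(url: str) -> str:
        """Hash the URL; SHA-256 is an illustrative choice."""
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def is_duplicate(self, url: str, now: float) -> bool:
        """Return True if the URL was already recorded and is still valid."""
        key = self.url_hash(url)
        recorded = self._seen.get(key)
        if recorded is not None:
            expired = (self.valid_seconds is not None
                       and now - recorded >= self.valid_seconds)
            if not expired:
                return True
        # First sighting, or the earlier record expired: record it again.
        self._seen[key] = now
        return False
```

Storing only the hash keeps the record size fixed regardless of URL length; a production system would likely keep this set in shared storage so that all crawler threads see the same records.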
In a possible embodiment, each crawler thread may store data landing path information in advance. The data landing path information may be added by a developer when developing the crawler thread, and the data landing paths corresponding to different crawler threads are different; for example, the data landing path information corresponding to the first crawler thread is a first storage address, and the data landing path information corresponding to the second crawler thread is a second storage address. After receiving the effective data sent by the first crawler thread, the data landing module lands the effective data and stores it in the memory corresponding to the first storage address; after receiving the effective data sent by the second crawler thread, the data landing module lands the effective data and stores it in the memory corresponding to the second storage address. Because the writes are spread across different storage addresses, storage congestion does not occur when the data landing module writes a large amount of effective data into the memory.
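The per-thread landing path can be sketched as follows. This is an illustration only: it assumes that each storage address is a file path and that records are appended as JSON lines, neither of which is stated in the embodiment, and the class name `DataLandingModule` and method `land` are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


class DataLandingModule:
    """Hypothetical landing module: appends records to a per-crawler file."""

    def land(self, landing_path: str, records: List[Dict]) -> int:
        """Append each record as one JSON line and return the count written."""
        path = Path(landing_path)
        # Create the target directory on first use.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        return len(records)
```

Each crawler thread would pass its own landing path together with its effective data, so data from different threads never lands in the same file.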
For specific implementation of each step in the method provided in this embodiment, reference may be made to a method executed by each module in the system, and details of this embodiment are not described herein again.
Fig. 4 is an application scenario diagram of the data acquisition method according to the embodiment of the present invention.
As shown in fig. 4, an application scenario of the method provided in the embodiment of the present invention mainly includes a server 401 and a display terminal 402. The data acquisition system is deployed in the server. The scheduling unit in the crawler management module schedules the corresponding crawler threads to download webpages from the Internet through the collection cluster module, and the crawler threads crawl the data on the webpages. The duplicate-filtering unit and the checking unit in the crawler management module then respectively deduplicate the crawled data and perform correctness checks, integrity checks and the like on it, finally obtaining the effective data.
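The checking step can be sketched as a set of per-field rules, in line with claim 6, which takes the fields that meet preset conditions as valid data. The field names (`title`, `url`, `body`), the rules themselves, and the function names are assumptions chosen only for illustration.

```python
from typing import Any, Callable, Dict

Rule = Callable[[Any], bool]


def non_empty_text(value: Any) -> bool:
    """Correctness rule: the value is a string with visible content."""
    return isinstance(value, str) and value.strip() != ""


def http_url(value: Any) -> bool:
    """Correctness rule: the value is a string that looks like an HTTP(S) URL."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))


# Hypothetical preset conditions, one rule per expected field.
RULES: Dict[str, Rule] = {
    "title": non_empty_text,
    "url": http_url,
    "body": non_empty_text,
}


def valid_fields(record: Dict[str, Any], rules: Dict[str, Rule]) -> Dict[str, Any]:
    """Keep only the fields that are present and satisfy their rule."""
    return {
        name: record[name]
        for name, rule in rules.items()
        if name in record and rule(record[name])
    }


def is_complete(record: Dict[str, Any], rules: Dict[str, Rule]) -> bool:
    """Integrity check: every expected field is present and valid."""
    return len(valid_fields(record, rules)) == len(rules)
```

For example, a record whose `body` is blank would keep its `title` and `url` as valid fields but would fail the integrity check.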
Fig. 5 is a schematic hardware structure diagram of a computer device according to an embodiment of the present invention. As shown in fig. 5, the computer device 50 of the present embodiment includes: a processor 501 and a memory 502; wherein
A memory 502 for storing computer-executable instructions;
a processor 501 for executing computer-executable instructions stored in the memory to implement the steps performed in the above-described method embodiments. Reference may be made in particular to the description relating to the method embodiments described above.
Alternatively, the memory 502 may be separate or integrated with the processor 501.
When the memory 502 is provided separately, the computer device further comprises a bus 503 for connecting said memory 502 and the processor 501.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer execution instruction is stored in the computer-readable storage medium, and when a processor executes the computer execution instruction, the data acquisition method described above is implemented.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to implement the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules may be integrated into one unit. The unit formed by the modules may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in a computer device or host device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A data acquisition system, comprising: a crawler management module, a collection cluster module and a data landing module;
the crawler management module is provided with a plurality of crawler threads, and the data acquisition modes corresponding to different crawler threads are different;
the crawler management module is used for: controlling the corresponding crawler thread to collect data in the webpage through the collection cluster module based on a scheduling mechanism, and processing the collected data based on a duplicate-filtering mechanism to obtain effective data;
and the data landing module is used for acquiring the effective data and writing the effective data into a corresponding memory according to a predetermined landing path.
2. The system of claim 1, wherein the crawler management module comprises a scheduling unit configured to:
controlling a corresponding crawler thread to create a corresponding collection task, and sending the collection task to the collection cluster module so that the collection cluster module collects list pages and content pages in a corresponding website according to the collection task;
and analyzing the list page and the content page to obtain a derivative task or a write-in file, wherein the derivative task comprises a list page task and a content page task.
3. The system of claim 2, wherein the crawler management module further comprises a duplicate-filtering unit configured to:
perform duplicate filtering on the list pages and the content pages according to the hash values of the Uniform Resource Locators (URLs) corresponding to the list pages and the content pages.
4. The system of claim 3, wherein the duplicate-filtering unit is further configured to:
perform the duplicate-filtering operation on the list pages and the content pages according to a preset filtering time point.
5. The system of claim 3, wherein the duplicate-filtering unit is further configured to:
perform, according to a preset valid time range, the duplicate-filtering operation on the list pages and the content pages within the valid time range.
6. The system of any of claims 1-5, wherein the crawler management module further comprises a checking unit to:
and checking each field contained in the written file, and taking the field meeting preset conditions as valid data.
7. The system of claim 6, wherein the data landing module is specifically configured to:
obtain the effective data sent by the plurality of crawler threads, wherein each crawler thread comprises data landing path information;
and write the effective data sent by each crawler thread into the memory corresponding to that crawler thread's data landing path information.
8. A method of data acquisition, comprising:
the crawler management module controls a plurality of crawler threads to collect data in a webpage through the collection cluster module based on a scheduling mechanism, the crawler threads are deployed in the crawler management module in advance, and the data collection modes corresponding to different crawler threads are different;
the crawler management module processes the collected data based on a duplicate-filtering mechanism to obtain effective data;
and the data landing module writes the effective data into a corresponding memory according to a predetermined landing path.
9. A computer device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the data acquisition method of claim 8.
10. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the data acquisition method of claim 8.
CN202010914439.0A 2020-09-03 2020-09-03 Data acquisition system and method Pending CN112035725A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010914439.0A CN112035725A (en) 2020-09-03 2020-09-03 Data acquisition system and method

Publications (1)

Publication Number Publication Date
CN112035725A true CN112035725A (en) 2020-12-04

Family

ID=73591766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010914439.0A Pending CN112035725A (en) 2020-09-03 2020-09-03 Data acquisition system and method

Country Status (1)

Country Link
CN (1) CN112035725A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605764A (en) * 2013-11-26 2014-02-26 Tcl集团股份有限公司 Web crawler system and web crawler multitask executing and scheduling method
CN104376063A (en) * 2014-11-11 2015-02-25 南京邮电大学 Multithreading web crawler method based on sort management and real-time information updating system
CN104899323A (en) * 2015-06-19 2015-09-09 成都国腾实业集团有限公司 Crawler system used for IDC harmful information monitoring platform
CN106484886A (en) * 2016-10-17 2017-03-08 金蝶软件(中国)有限公司 A kind of method of data acquisition and its relevant device
CN107071009A (en) * 2017-03-28 2017-08-18 江苏飞搏软件股份有限公司 A kind of distributed big data crawler system of load balancing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230627

Address after: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New District, Zhuhai City, Guangdong Province, 519030

Applicant after: New founder holdings development Co.,Ltd.

Applicant after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Applicant before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Applicant before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.