CN112764908B - Network data acquisition processing method and device and electronic equipment - Google Patents

Publication number
CN112764908B
Authority
CN
China
Prior art keywords
data
file
scheduling
network data
target network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110106093.6A
Other languages
Chinese (zh)
Other versions
CN112764908A (en)
Inventor
刘龙强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING TIP TECHNOLOGY CO LTD
Original Assignee
BEIJING TIP TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING TIP TECHNOLOGY CO LTD
Priority to CN202110106093.6A
Publication of CN112764908A
Application granted
Publication of CN112764908B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806: Task transfer initiation or dispatching
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/906: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574: Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/54: Interprogram communication
    • G06F9/546: Message passing systems or structures, e.g. queues
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00: Indexing scheme relating to G06F9/00
    • G06F2209/48: Indexing scheme relating to G06F9/48
    • G06F2209/484: Precedence
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00: Indexing scheme relating to G06F9/00
    • G06F2209/54: Indexing scheme relating to G06F9/54
    • G06F2209/548: Queue

Abstract

The embodiment of the invention discloses a network data acquisition processing method, a device, and electronic equipment. The method comprises the following steps: acquiring target network data; generating a scheduling data file and a scheduling control file according to the target network data, wherein the scheduling data file is used for storing tasks to be acquired, each record in the scheduling data file comprises a data length and data information, the scheduling control file is used for controlling the scheduling data file, and the scheduling control file comprises the data source, priority levels, and information related to data reading; and parsing the data information in the scheduling data file under the control of the scheduling control file, and storing the parsed data information in a file queue. The invention achieves high network data acquisition and storage efficiency and a high resource utilization rate.

Description

Network data acquisition processing method and device and electronic equipment
Technical Field
The embodiment of the invention relates to the field of network data acquisition, in particular to a network data acquisition processing method, a device and electronic equipment.
Background
When network data is acquired, data from multiple sites must be acquired in a multi-task mode, and acquisition efficiency is improved by a distributed approach: one acquisition scheduler and multiple acquisition crawlers acquire the tasks of multiple sites simultaneously.
To acquire network data, important URLs with outgoing links (out-degree) in a website are selected as the entry addresses of the site to be acquired (called seed URLs). Crawlers start from the seed URLs; after a web page is acquired, the data elements in the page are parsed and the URLs in the page are extracted for further acquisition. Each such URL can yield a new batch of URLs, and the process repeats until all URLs in the site have been collected.
The relationships between web pages in a site can be regarded as a forest: each seed URL corresponds to an entry into the forest through which the whole forest can be discovered. Explosive growth in the number of URLs is therefore common during network data acquisition, and how these URL resources are managed, stored, and allocated is important for scheduling crawlers and acquisition tasks reasonably.
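The seed-URL expansion described above can be sketched as a breadth-first traversal. This is only an illustration: the function name `crawl` and the `extract_links` callback (which stands in for fetching and parsing a page) are assumptions, not part of the patent.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seed_urls: Iterable[str],
          extract_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Visit every URL reachable from the seeds exactly once (breadth-first)."""
    seeds = list(seed_urls)
    seen = set(seeds)          # URLs already queued, to avoid repeat acquisition
    frontier = deque(seeds)    # URLs waiting to be acquired
    visited: List[str] = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        # extract_links is a hypothetical stand-in for "fetch page, parse URLs".
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

The `frontier` here lives entirely in memory, which is the situation the problems listed next are concerned with.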
Currently, the ways in which these URL resources are managed, saved, and allocated have the following problems:
1. No scheduling: each site is collected independently by its own crawler, and the URL tasks are scattered across the crawlers. Because each crawler works independently, reasonable resource allocation and simultaneous multi-task, multi-site acquisition scheduling cannot be achieved.
2. URL data to be acquired is stored in an in-memory queue. Memory consumption is large, data is easily lost when memory is insufficient, and the cached data is lost when the application exits or the system restarts.
3. URL address data is cached in a database. Scheduling efficiency is low: each URL scheduling operation requires adding and deleting a database record, which reduces data acquisition efficiency.
4. URL address data is cached using middleware such as Kafka.
Disclosure of Invention
The embodiment of the invention aims to provide a network data acquisition processing method, a device and electronic equipment, which are used for solving the problems in the existing data acquisition and storage.
In order to achieve the above purpose, the embodiment of the present invention mainly provides the following technical solutions:
in a first aspect, an embodiment of the present invention provides a network data acquisition processing method, including:
acquiring target network data;
generating a scheduling data file and a scheduling control file according to the target network data, wherein the scheduling data file is used for storing tasks to be acquired, each record in the scheduling data file comprises data length and data information, the scheduling control file is used for controlling the scheduling data file, and the scheduling control file comprises data sources, priority levels and data reading related information;
and controlling, through the scheduling control file, the data information in the scheduling data file to be stored into a file queue.
According to one embodiment of the present invention, controlling, through the scheduling control file, the data information in the scheduling data file to be stored into the file queue includes:
storing, under the control of the scheduling control file, the data information in the scheduling data file into the file queue according to the priority level and the resource serial number of the resource.
According to one embodiment of the present invention, generating a scheduling data file and a scheduling control file according to the target network data includes:
storing the analysis result of the target network data analysis into a cache;
and when the number of records in the cache reaches a preset data threshold or the cache duration exceeds a preset time threshold, generating the scheduling data file and the scheduling control file according to the analysis result of the target network data analysis.
According to one embodiment of the present invention, generating a scheduling data file and a scheduling control file according to the target network data further includes:
classifying the analysis result of the target network data;
and writing the analysis results of the same type into a plurality of scheduling data files.
In a second aspect, an embodiment of the present invention further provides a network data acquisition processing apparatus, including:
the acquisition module is used for acquiring the target network data;
the generation module is used for generating a scheduling data file and a scheduling control file according to the target network data, wherein the scheduling data file is used for storing tasks to be acquired, each record in the scheduling data file comprises data length and data information, the scheduling control file is used for controlling the scheduling data file, and the scheduling control file comprises data sources, priority levels and data reading related information;
and the storage module is used for controlling, through the scheduling control file, the data information in the scheduling data file to be stored in a file queue.
According to one embodiment of the invention, the scheduling control file stores the data information in the scheduling data file into the file queue through the storage module according to the priority level and the resource serial number of the resource.
According to one embodiment of the present invention, the apparatus further comprises a cache module, which is used for caching the analysis result of the target network data; when the number of records in the cache reaches a preset data threshold or the cache duration exceeds a preset time threshold, the scheduling data file and the scheduling control file are generated according to the analysis result of the target network data analysis.
According to one embodiment of the present invention, the generating module is further configured to classify a result of the parsing of the target network data; and writing the analysis results of the same type into a plurality of scheduling data files.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: at least one processor and at least one memory; the memory is used for storing one or more program instructions; the processor is configured to execute one or more program instructions to perform the network data acquisition processing method according to the first aspect.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium containing one or more program instructions for executing the network data acquisition processing method according to the first aspect.
The technical scheme provided by the embodiment of the invention has at least the following advantages:
According to the network data acquisition processing method, device, and electronic equipment provided by the embodiments of the invention, data is stored as a binary file stream, and resource scheduling operations on the data (both reading and writing) are performed sequentially and in one direction to store, allocate, and manage URL resources. As a result, network data acquisition and storage efficiency is high, and the resource utilization rate is high.
Drawings
Fig. 1 is a flowchart of a network data acquisition processing method according to an embodiment of the present invention.
Fig. 2 is a block diagram of a network data acquisition and processing device according to an embodiment of the present invention.
Detailed Description
Further advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure of the present invention, which is described by the following specific examples.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In the description of the present invention, it is to be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Fig. 1 is a flowchart of a network data acquisition processing method according to an embodiment of the present invention. As shown in fig. 1, the network data acquisition processing method in the embodiment of the invention includes:
S1: Acquire target network data.
Specifically, a specified network resource is acquired as a target network resource by a web crawler using a predetermined algorithm. The target network resource may be a type of resource in a website, such as a news resource of website A.
S2: Generate a scheduling data file and a scheduling control file according to the target network data. The scheduling data file is used for storing tasks to be acquired. Each record in the scheduling data file includes a data length and data information. The scheduling control file is used for controlling the scheduling data file. The scheduling control file includes the data source, priority levels, and information related to data reading.
In one embodiment of the present invention, step S2 includes: storing an analysis result of target network data analysis into a cache; and when the number of records in the cache reaches a preset data threshold or the cache duration exceeds a preset time threshold, generating the scheduling data file and the scheduling control file according to the analysis result of target network data analysis.
Specifically, the parsing program caches its results while parsing data, and a trigger condition causes the cached results to be saved to files. There are two trigger conditions for writing the scheduling data file: the number of records in the cache reaches a preset data threshold, or the cache duration exceeds a preset time threshold.
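The two trigger conditions can be sketched as a small buffer that flushes on either a record-count threshold or an age threshold. The class name `ParseResultCache`, the default threshold values, and the `flush` callback (which stands in for generating the scheduling files) are assumptions made for illustration.

```python
import time
from typing import Callable, List, Optional


class ParseResultCache:
    """Buffers parsed records and flushes on a count or age threshold."""

    def __init__(self, flush: Callable[[List[str]], None],
                 max_records: int = 1000, max_age_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush              # hypothetical "write scheduling files" hook
        self._max_records = max_records  # preset data threshold (assumed value)
        self._max_age = max_age_seconds  # preset time threshold (assumed value)
        self._clock = clock
        self._records: List[str] = []
        self._first_added: Optional[float] = None

    def add(self, record: str) -> None:
        if not self._records:
            self._first_added = self._clock()  # cache duration starts here
        self._records.append(record)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush if either threshold is met; return True if a flush happened."""
        if not self._records:
            return False
        too_many = len(self._records) >= self._max_records
        too_old = self._clock() - self._first_added >= self._max_age
        if too_many or too_old:
            batch, self._records = self._records, []
            self._first_added = None
            self._flush(batch)
            return True
        return False
```

Because `add` only checks the thresholds when a record arrives, a caller would also need to invoke `maybe_flush` periodically (for example from a timer) so that a quiet cache still flushes once the time threshold passes.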
In acquisition scheduling, this embodiment uses the site as the basic unit. Above the site level, sites are grouped into categories such as information, portals, and forums; each such group is called a task group. Two parameters, the task group and the site, must be specified explicitly whenever tasks are scheduled (both writing and reading). Tasks are classified by task group plus site; each type of task has one scheduling control file and a plurality of scheduling data files, which share the same naming prefix (for example, task group and site).
The naming rule for scheduling control files is used to realize data storage, resource allocation, task debugging, and the like for resources of the same category.
The scheduling control file manages the scheduling data files and implements task scheduling. It has a small data volume and is used frequently, so it is loaded into memory and cached during scheduling to improve efficiency. The content of the scheduling control file comprises a data header and data records. The data header includes the task group, the site, and the like. Each data record includes the priority level, task number, total number of records, read record, and read position. The read record and read position identify where the next read should start, which avoids re-reading records and allows records to be read quickly.
The naming rule for scheduling data files covers three states: initialization (record writing), waiting for scheduling, and scheduling reading. A file is in exactly one state at any time (that is, a file may not be read for task scheduling while it is being initialized and written). A scheduling data file stores a batch of tasks to be scheduled and acquired. Its content format is simple: one record is written at a time, and each record comprises two parts, the data length and the data content. The data length is the number of bytes occupied by the corresponding data content.
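The control file layout described above (a header plus per-priority records) might be modelled as follows. The field names mirror the description, but the class names and the JSON encoding are assumptions chosen for readability; the patent itself speaks of a binary file stream and does not specify an encoding.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ControlRecord:
    priority: int           # priority level this record tracks
    task_number: int        # sequence number of the current data file
    total_records: int      # records written to that data file so far
    read_record: int = 0    # index of the next record to read
    read_position: int = 0  # byte offset of the next record to read


@dataclass
class ControlFile:
    task_group: str         # data header: task group
    site: str               # data header: site
    records: List[ControlRecord] = field(default_factory=list)

    def to_bytes(self) -> bytes:
        # JSON is an illustrative choice, not the patent's format.
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "ControlFile":
        doc = json.loads(raw.decode("utf-8"))
        records = [ControlRecord(**r) for r in doc["records"]]
        return cls(doc["task_group"], doc["site"], records)
```

Keeping `read_record` and `read_position` in the control data is what lets a reader resume without rescanning the data file.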
In one embodiment of the present invention, step S2 further comprises: classifying the target network data; the same type of network data is written to a plurality of schedule data files.
Specifically, once the tasks in a scheduling data file have been scheduled, the file has no further value and the system reclaims it. An overly large scheduling data file holds too many resources and is hard to reclaim. To reclaim resources effectively, this embodiment writes task resources of the same type into a plurality of scheduling data files.
In one embodiment of the invention, the scheduling control file analyzes the data information in the scheduling data file according to the priority level and the resource serial number of the resource and stores the analyzed data information in the file queue.
To support multiple scheduling data files, this embodiment introduces task sequence numbers for managing resources of the same type. The task sequence number defaults to 1 in the initial state and is incremented by 1 each time a new number is needed; after exceeding the maximum allowed value N (for example, 1000) it is reset to 1, so the numbers 1 to N are reused cyclically. The number of digits is chosen according to the number of scheduling data files required, so that a write never overwrites a still-valid data file of the same name.
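The cyclic task sequence number can be sketched in a few lines. The function names and the zero-padded formatting are assumptions; only the 1-to-N wrap-around and the example value N = 1000 come from the description.

```python
def next_task_number(current: int, max_value: int = 1000) -> int:
    """Return the sequence number after `current`, wrapping from N back to 1."""
    return 1 if current >= max_value else current + 1


def format_task_number(number: int, max_value: int = 1000) -> str:
    """Zero-pad to the digit count of N so file names sort consistently."""
    return str(number).zfill(len(str(max_value)))
```

For example, `next_task_number(1000)` wraps to 1 and `format_task_number(7)` yields `"0007"`. N has to be larger than the number of data files that can be live at once; otherwise a reused number could collide with a file that has not yet been reclaimed.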
During data acquisition, different types of resources differ in importance. For example, web pages and Word documents are important and receive more attention, while pictures and style files receive less, so the acquired URL resources need to be distinguished by priority. This embodiment divides resource tasks into 9 priority levels, numbered 1 to 9; a higher number means a higher priority.
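One simple way to assign such priorities is by file extension. The mapping below is purely hypothetical: the description only states that there are levels 1 to 9, that higher numbers mean higher priority, and that pages and documents matter more than pictures and style files.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical mapping; the specific levels are assumptions for illustration.
PRIORITY_BY_EXTENSION = {
    "": 9, ".html": 9, ".htm": 9,    # web pages
    ".doc": 8, ".docx": 8,           # Word documents
    ".jpg": 3, ".png": 3,            # pictures
    ".css": 2,                       # style files
}
DEFAULT_PRIORITY = 5                 # assumed level for unlisted types


def url_priority(url: str) -> int:
    """Return a priority level from 1 to 9 based on the URL path's extension."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return PRIORITY_BY_EXTENSION.get(extension, DEFAULT_PRIORITY)
```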
In this embodiment, the scheduling control file is small and frequently used. To improve efficiency, its data is cached when the system starts and is read directly from the cache when used. The control data stores the list of scheduling data files, the sequence number in use, and the read/write state for each priority level of the current type. The scheduling algorithm calculates the priority level of the current resource from the parameters of the scheduling control file, and then obtains the current task sequence number and the total number of records written to the file from the data header of the scheduling control file. When the maximum number of records for a single file is exceeded, the task sequence number is incremented and the total number of records is reset to 0, so that scheduling data of one type is stored across several different files of the same type.
After the target scheduling data file has been determined from the scheduling control file, data can be written to it. The length of each record is written along with the record, and the following operations are executed in a loop: serialize the data record to generate a record stream; calculate the length of the generated record stream; write the data length to the file; write the data content to the scheduling data file.
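The rollover rule can be sketched as follows. The `WriteState` class, the limits, and the file name pattern (task group, site, priority, sequence number) are assumptions; the description states only that files of one type share a naming prefix and that the sequence number advances when a file is full.

```python
from dataclasses import dataclass


@dataclass
class WriteState:
    task_number: int = 1     # current task sequence number
    total_records: int = 0   # records already written to the current file


def target_file(state: WriteState, task_group: str, site: str, priority: int,
                max_records_per_file: int = 10000,
                max_task_number: int = 1000) -> str:
    """Pick the data file for the next record, rolling over when it is full."""
    if state.total_records >= max_records_per_file:
        # Current file is full: advance the sequence number (wrapping to 1)
        # and start counting records for the new file from zero.
        state.task_number = (1 if state.task_number >= max_task_number
                             else state.task_number + 1)
        state.total_records = 0
    state.total_records += 1
    # Hypothetical naming pattern built from the shared prefix.
    return f"{task_group}_{site}_p{priority}_{state.task_number:04d}.dat"
```

With `max_records_per_file=2`, three successive calls return file names ending in `0001`, `0001`, and `0002`.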
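The write loop above can be sketched as a length-prefixed record writer. The 4-byte big-endian length prefix and JSON serialization are assumptions; the description specifies only that each record consists of a data length followed by the data content.

```python
import io
import json
import struct
from typing import BinaryIO, Iterable

# Assumed prefix format: 4-byte unsigned big-endian length.
LENGTH_PREFIX = struct.Struct(">I")


def write_records(stream: BinaryIO, records: Iterable[dict]) -> int:
    """Append each record as <length><content>; return the number written."""
    count = 0
    for record in records:
        payload = json.dumps(record).encode("utf-8")     # serialize the record
        stream.write(LENGTH_PREFIX.pack(len(payload)))   # write the data length
        stream.write(payload)                            # write the data content
        count += 1
    return count
```

A short usage example with an in-memory stream:

```python
buffer = io.BytesIO()
write_records(buffer, [{"url": "http://example.com/"}])
```

Because records are only ever appended, writes stay sequential, which matches the one-directional access pattern the description emphasizes.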
S3: the data information in the scheduling data file is stored in the file queue under the control of the scheduling control file.
Specifically, URL records are read from the scheduling data file. When reading, the type of task records to load is determined by two parameters: the specified task group and the site. The scheduling control data of the current type is loaded from the cache, and the scheduling data file for the current task is calculated as follows: all scheduling data files of the current task type are listed, the files are sorted by priority in descending order and then by task sequence number in ascending order, and the first file is taken for task acquisition.
Once the scheduling data file is determined, reading a record is simple: the data only needs to be read back in the agreed format. The steps are: read the read position, read the data of the current length, and deserialize the data.
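The file selection order can be sketched as a sort key. The `DataFileInfo` tuple is a hypothetical stand-in for whatever metadata the control data holds about each scheduling data file.

```python
from typing import List, NamedTuple, Optional


class DataFileInfo(NamedTuple):
    name: str
    priority: int      # 1 to 9, higher means more important
    task_number: int   # task sequence number of the file


def pick_next_file(files: List[DataFileInfo]) -> Optional[DataFileInfo]:
    """Return the file to read next: highest priority, then lowest task number."""
    if not files:
        return None
    # Negating priority gives descending priority with ascending task number.
    return min(files, key=lambda f: (-f.priority, f.task_number))
```

Using `min` with a composite key selects the same file as sorting the full list and taking the first element, without building the sorted list.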
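Reading a record back can be sketched as the inverse of a length-prefixed write. As an assumption for illustration, the record format is a 4-byte big-endian length followed by a JSON payload; the patent does not fix these details.

```python
import io
import json
import struct
from typing import BinaryIO, Optional, Tuple

# Assumed prefix format: 4-byte unsigned big-endian length.
LENGTH_PREFIX = struct.Struct(">I")


def read_record(stream: BinaryIO, position: int) -> Optional[Tuple[dict, int]]:
    """Read one record at `position`; return (record, next_position) or None."""
    stream.seek(position)                      # go to the saved read position
    header = stream.read(LENGTH_PREFIX.size)
    if len(header) < LENGTH_PREFIX.size:
        return None                            # no further records
    (length,) = LENGTH_PREFIX.unpack(header)   # read the data length
    payload = stream.read(length)              # read that many bytes of content
    if len(payload) < length:
        return None                            # truncated record
    record = json.loads(payload.decode("utf-8"))   # deserialize
    return record, position + LENGTH_PREFIX.size + length
```

The returned `next_position` is the value a caller would store back as the read position in the control data so that the next read resumes after this record.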
The network data acquisition and processing method provided by the embodiment of the invention stores data in a binary file stream, and performs resource scheduling of data in a sequential and unidirectional mode (including reading and writing) to realize the storage, distribution and management of URL resources, so that the network data acquisition and storage efficiency is high, and the resource utilization rate is high.
Fig. 2 is a block diagram of a network data acquisition and processing device according to an embodiment of the present invention. As shown in fig. 2, a network data acquisition processing device according to an embodiment of the present invention includes: the device comprises an acquisition module 100, a generation module 200 and a storage module 300.
The acquiring module 100 is configured to acquire target network data. The generating module 200 is configured to generate a scheduling data file and a scheduling control file according to the target network data. The scheduling data file is used for storing tasks needing to be acquired. Each record in the schedule data file includes a data length and data information. The scheduling control file is used for controlling the scheduling data file. The scheduling control file includes data sources, priority levels, and data reading related information. The storage module 300 is configured to store data information in a scheduled data file into a file queue under the control of a scheduling control file.
In one embodiment of the invention, the scheduling control file analyzes the data information in the scheduling data file according to the priority level and the resource serial number of the resource and stores the analyzed data information in the file queue through the storage module.
In one embodiment of the present invention, the network data acquisition processing device further includes a buffer module. The caching module is used for caching the analysis result of the target network data. And when the number of records in the cache reaches a preset data threshold or the cache duration exceeds a preset time threshold, generating the scheduling data file and the scheduling control file according to the analysis result of target network data analysis.
In one embodiment of the present invention, the generating module 200 is further configured to classify the parsing result of the target network data; and writing the analysis results of the same type into a plurality of scheduling data files.
It should be noted that the specific implementation of the network data acquisition processing device in the embodiment of the present invention is similar to that of the network data acquisition processing method; refer to the description of the method for details, which are not repeated here.
In addition, other structures and functions of the network data acquisition processing device according to the embodiments of the present invention are known to those skilled in the art and are not described in detail, to reduce redundancy.
The embodiment of the invention also provides electronic equipment, which comprises: at least one processor and at least one memory; the memory is used for storing one or more program instructions; the processor is configured to execute one or more program instructions to perform the network data acquisition processing method according to the first aspect.
The disclosed embodiments provide a computer readable storage medium having stored therein computer program instructions that, when executed on a computer, cause the computer to perform the network data acquisition processing method described above.
In the embodiment of the invention, the processor may be an integrated circuit chip with signal processing capability. The processor may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field programmable gate array (Field Programmable Gate Array, FPGA for short), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components.
The disclosed methods, steps, and logic blocks in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The processor reads the information in the storage medium and, in combination with its hardware, performs the steps of the above method.
The storage medium may be memory, for example, may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
The nonvolatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable ROM (Electrically EPROM, EEPROM), or a flash Memory.
The volatile memory may be a random access memory (Random Access Memory, RAM for short) which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (Double Data Rate SDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), and direct memory bus RAM (Direct Rambus RAM, DRRAM).
The storage media described in embodiments of the present invention are intended to comprise, without being limited to, these and any other suitable types of memory.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in a combination of hardware and software. When the software is applied, the corresponding functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The foregoing embodiments have been provided to illustrate the general principles of the present invention in further detail. They are not to be construed as limiting the scope of the invention; any modifications, equivalents, improvements, and the like made on the basis of the teachings of the invention are intended to fall within its scope.

Claims (9)

1. The network data acquisition and processing method is characterized by comprising the following steps of:
acquiring target network data, and storing an analysis result of analyzing the target network data into a cache;
when the number of records in the cache reaches a preset data threshold or when the cache duration exceeds a preset time threshold, generating the scheduling data file and the scheduling control file according to an analysis result of target network data analysis;
generating a scheduling data file and a scheduling control file according to the target network data, wherein the scheduling data file is used for storing tasks to be acquired, each record in the scheduling data file comprises data length and data information, the scheduling control file is used for controlling the scheduling data file, and the scheduling control file comprises data sources, priority levels and data reading related information;
and controlling, through the scheduling control file, the data information in the scheduling data file to be stored into a file queue.
2. The network data acquisition and processing method according to claim 1, wherein the storing of the data information in the schedule data file into the file queue is controlled by the schedule control file, comprising:
storing, by the scheduling control file, the data information in the scheduling data file into the file queue according to the priority level and the resource sequence number of the resource.
3. The network data acquisition processing method according to claim 1, wherein a scheduling data file and a scheduling control file are generated from the target network data, further comprising:
classifying the analysis result of the target network data;
and writing the analysis results of the same type into a plurality of scheduling data files.
4. A network data acquisition and processing device, comprising:
the acquisition module is used for acquiring target network data and storing an analysis result of analyzing the target network data into a cache; when the number of records in the cache reaches a preset data threshold or when the cache duration exceeds a preset time threshold, generating the scheduling data file and the scheduling control file according to the analysis result of the target network data;
the generation module is used for generating a scheduling data file and a scheduling control file according to the target network data, wherein the scheduling data file is used for storing tasks to be acquired, each record in the scheduling data file comprises data length and data information, the scheduling control file is used for controlling the scheduling data file, and the scheduling control file comprises data sources, priority levels and data reading related information;
and the storage module is used for controlling, through the scheduling control file, the data information in the scheduling data file to be stored in a file queue.
5. The network data acquisition and processing device according to claim 4, wherein the scheduling control file stores the data information in the scheduling data file into the file queue through the storage module according to the priority level and the resource sequence number of the resource.
6. The network data acquisition and processing device according to claim 4, further comprising a buffer module, wherein the buffer module is configured to buffer the analysis result of the target network data; and when the number of records in the cache reaches a preset data threshold or the cache duration exceeds a preset time threshold, generating the scheduling data file and the scheduling control file according to the analysis result of the target network data analysis.
7. The network data acquisition and processing device according to claim 6, wherein the generating module is further configured to classify a result of the parsing of the target network data; and writing the analysis results of the same type into a plurality of scheduling data files.
8. An electronic device, the electronic device comprising: at least one processor and at least one memory;
the memory is used for storing one or more program instructions;
the processor is configured to execute the one or more program instructions to perform the network data acquisition processing method of any one of claims 1-3.
9. A computer readable storage medium, wherein one or more program instructions are included in the computer readable storage medium, the one or more program instructions being configured to perform the network data acquisition processing method of any one of claims 1-3.
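The flow recited in claims 1 and 2 (cache the analysis results, flush on a count or time threshold, write length-prefixed records, then queue them by priority level and resource sequence number) can be sketched as follows. This is only an illustrative sketch and not the patented implementation: the names `ParseCache`, `encode_data_file`, `decode_data_file`, `make_control` and `enqueue` are hypothetical, as are the 4-byte big-endian length prefix, the dictionary layout of the control file, the in-memory heap standing in for the file queue, and the convention that a smaller number means a higher priority.

```python
import heapq
import struct
import time
from dataclasses import dataclass, field


@dataclass
class ParseCache:
    """Buffers analysis results until a count or age threshold is reached."""
    max_records: int = 1000        # assumed "preset data threshold"
    max_age_seconds: float = 5.0   # assumed "preset time threshold"
    records: list = field(default_factory=list)
    started_at: float = field(default_factory=time.monotonic)

    def add(self, result: bytes) -> None:
        if not self.records:
            # The cache duration is measured from the first buffered record.
            self.started_at = time.monotonic()
        self.records.append(result)

    def should_flush(self) -> bool:
        if not self.records:
            return False
        age = time.monotonic() - self.started_at
        return len(self.records) >= self.max_records or age > self.max_age_seconds

    def drain(self) -> list:
        drained, self.records = self.records, []
        return drained


def encode_data_file(records: list) -> bytes:
    """Scheduling data file: each record is <data length><data information>."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)


def decode_data_file(blob: bytes) -> list:
    """Reads the length-prefixed records back out of a data file image."""
    out, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        out.append(blob[offset:offset + length])
        offset += length
    return out


def make_control(source: str, priority: int, record_count: int) -> dict:
    """Scheduling control file: data source, priority level, read-related info."""
    return {"source": source, "priority": priority,
            "read_offset": 0, "record_count": record_count}


def enqueue(queue: list, control: dict, data_blob: bytes, resource_seq: int) -> None:
    """Pushes records ordered by (priority level, resource sequence number)."""
    for index, info in enumerate(decode_data_file(data_blob)):
        heapq.heappush(queue, (control["priority"], resource_seq, index, info))
```

A caller would add results to the cache, check `should_flush()`, encode the drained records, and pass the resulting data file together with its control record to `enqueue`; `heapq.heappop` then yields the record with the smallest priority value first, with ties broken by the resource sequence number.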
CN202110106093.6A 2021-01-26 2021-01-26 Network data acquisition processing method and device and electronic equipment Active CN112764908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110106093.6A CN112764908B (en) 2021-01-26 2021-01-26 Network data acquisition processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110106093.6A CN112764908B (en) 2021-01-26 2021-01-26 Network data acquisition processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112764908A CN112764908A (en) 2021-05-07
CN112764908B true CN112764908B (en) 2024-01-26

Family

ID=75707425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110106093.6A Active CN112764908B (en) 2021-01-26 2021-01-26 Network data acquisition processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112764908B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101277272A (en) * 2008-05-16 2008-10-01 北京航空航天大学 Method for implementing magnanimity broadcast data warehouse-in
CN103559217A (en) * 2013-10-17 2014-02-05 北京航空航天大学 Heterogeneous database oriented massive multicast data storage implementation method
CN106020986A (en) * 2016-05-26 2016-10-12 中国建设银行股份有限公司 Data processing method and device
CN107273409A (en) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 A kind of network data acquisition, storage and processing method and system
CN107870928A (en) * 2016-09-26 2018-04-03 上海泓智信息科技有限公司 File reading and device
CN109840298A (en) * 2018-12-29 2019-06-04 中国科学院计算技术研究所 The multi information source acquisition method and system of large scale network data
CN110704381A (en) * 2019-09-06 2020-01-17 平安城市建设科技(深圳)有限公司 Data analysis method, device and storage medium
CN111221744A (en) * 2020-04-23 2020-06-02 杭州海康威视数字技术股份有限公司 Data acquisition method and device and electronic equipment
CN111241447A (en) * 2020-01-13 2020-06-05 浙江省北大信息技术高等研究院 Webpage data acquisition method, system and storage medium
CN111367925A (en) * 2020-02-27 2020-07-03 深圳壹账通智能科技有限公司 Data dynamic real-time updating method, device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7769742B1 (en) * 2005-05-31 2010-08-03 Google Inc. Web crawler scheduler that utilizes sitemaps from websites
US8935389B2 (en) * 2011-05-17 2015-01-13 Guavus, Inc. Method and system for collecting and managing network data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"The Implementation of Crawling News Page Based on Incremental Web Crawler";Zejian Shi;《2016 4th Intl Conf on Applied Computing and Information Technology/3rd Intl Conf on Computational Science/Intelligence and Applied Informatics/1st Intl Conf on Big Data, Cloud Computing, Data Science & Engineering (ACIT-CSII-BCD)》;全文 *
"主题爬虫搜索策略的设计与实现 ";田磊;《中国优秀硕士学位论文全文数据库信息科技辑》;全文 *
基于众包的社交网络数据采集模型设计与实现;高梦超;胡庆宝;程耀东;周旭;李海波;杜然;;计算机工程(第04期);全文 *

Also Published As

Publication number Publication date
CN112764908A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
US8959490B2 (en) Optimizing heap memory usage
DE102013206744A1 (en) DEDUPLICATING STORAGE WITH IMPROVED DETECTION OF COMMON STOPS
CN111324427B (en) Task scheduling method and device based on DSP
CN109656779A (en) Internal memory monitoring method, device, terminal and storage medium
CN108763572B (en) Method and device for realizing Apache Solr read-write separation
CN113760189B (en) Load data filling and storing method and system
CN111177271B (en) Data storage method, device and computer equipment for persistence of kafka data to hdfs
US20140281060A1 (en) Low-contention update buffer queuing for large systems
CN103595571A (en) Preprocessing method, device and system for website access logs
CN108829345B (en) Data processing method of log file and terminal equipment
CN112148736A (en) Method, device and storage medium for caching data
CN111694806B (en) Method, device, equipment and storage medium for caching transaction log
CN112764908B (en) Network data acquisition processing method and device and electronic equipment
CN111858393A (en) Memory page management method, memory page management device, medium and electronic device
CN114443595A (en) Method and device for processing file
CN113626483B (en) Front-end caching method, system, equipment and storage medium for filling forms
CN114490576A (en) Database storage method, device, equipment and storage medium
CN114116790A (en) Data processing method and device
US11003578B2 (en) Method and system for parallel mark processing
CN110888588B (en) Flash memory controller and related access method and electronic device
CN106371770A (en) Data write-in method and device
CN107643892B (en) Interface processing method, device, storage medium and processor
CN116303125B (en) Request scheduling method, cache, device, computer equipment and storage medium
CN116700940B (en) Request handling method, system and device based on encapsulation class and medium
CN117453643B (en) File caching method, device, terminal and medium based on distributed file system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant