CN116244486A - Crawling data processing method and system based on data stream

Info

Publication number: CN116244486A
Application number: CN202310244348.4A
Authority: CN
Inventors: 程宇浩; 王丹琛; 万振华; 王颉; 李华; 董燕
Original assignee: Seczone Technology Co Ltd
Current assignee: Seczone Technology Co Ltd
Priority date: 2023-03-06
Filing date: 2023-03-06
Publication date: 2023-06-09

Abstract

The invention discloses a crawling data processing method and system based on a data stream. The method comprises: crawling target data based on key information bars to generate a plurality of data items, and transmitting the data items to a first data pipeline; receiving the data items through the first data pipeline, inputting each data item into the data cleaning function corresponding to its category for cleaning, and transmitting the cleaned data items that meet the requirements to a second data pipeline; and creating a plurality of data warehousing functions of different types, receiving the data items through the second data pipeline, inputting each data item into the data warehousing function corresponding to its category for warehousing query processing, and updating the database according to the query processing result. This data processing mode has a clear logical structure, is easy to extend, allows a data acquisition project to be built quickly, and is less error-prone when data meeting the conditions are stored in the database.

Description

Crawling data processing method and system based on data stream

Technical Field

The present invention relates to the field of crawling data processing technologies, and in particular, to a crawling data processing method and system based on data flow.

Background

With the development of artificial intelligence technology, more and more functions require large amounts of data as support, and a significant portion of enterprise users employ crawler tools to collect data and then analyze it with big-data techniques. Crawler technology captures data from web pages, device information, and other sources according to certain rules and methods. However, the quality of the data collected by a crawler tool is far from directly usable, so the data must pass through extensive cleaning and warehousing procedures. An enterprise or a project often needs to collect data in tens or hundreds of dimensions, so in the prior art a multitask data collection program generally adopts a parallel processing mode: the processing modules (such as cleaning and warehousing) are independent of one another, each is communicatively connected to the database, and each places its output in the database after completing its own task. For example, the cleaning module takes data out of the database, cleans it, and places the data that meet the requirements back in the database, and the warehousing module then extracts from the database the data placed there by the cleaning module for further processing. With this traditional mode, when there are many data types, building the framework for data acquisition and processing is a huge amount of work, management is difficult, the logic easily becomes confused, and data are easily warehoused repeatedly.

Disclosure of Invention

The invention aims to provide a crawling data processing method and system based on a data stream, with which a data acquisition and processing program framework can be built quickly, and which has a clear logical structure and is less error-prone.

In order to achieve the above object, the present invention discloses a crawling data processing method based on data flow, which includes:

creating a plurality of key information bars which respectively belong to different dimensions and are used for crawling data;

crawling target data based on the key information bars by using a crawler tool to generate a plurality of data items, wherein each data item comprises one item of target data, and transmitting the data items to a first data pipeline;

creating a plurality of data cleaning functions of different types;

receiving the data items through the first data pipeline, inputting each data item into the data cleaning function corresponding to its category for cleaning, and transmitting the cleaned data items that meet the requirements to a second data pipeline;

creating a plurality of data warehousing functions of different types;

and receiving the data items through the second data pipeline, inputting each data item into the data warehousing function corresponding to its category for warehousing query processing, and updating a database according to the query processing result.

Preferably, each key information bar includes a plurality of data fields, field names of data fields representing the same content in the key information bars of different dimensions are the same, and table names and table unique indexes corresponding to each key information bar are integrated in the same information table to perform unified management.

Preferably, when a data item is newly added to the database, all data items in the database of the same category as the newly added data item are re-sorted as a whole.

Preferably, the method for sorting the data items as a whole comprises the following steps:

when a data item is processed by the data warehousing function, transmitting the data item that meets the warehousing condition to a third data pipeline;

receiving the data item from the third data pipeline by using a data marking function, marking the data item, and writing the feature name of the data item into Redis;

and reading the corresponding feature name from Redis by using a data sorting function, and sorting the marks of the data items of the same category in the database based on the read feature name.

The invention also discloses a crawling data processing system based on the data stream, which comprises:

the data preparation module is used for creating a plurality of key information bars which respectively belong to different dimensions and are used for crawling data;

the data acquisition module is used for crawling target data based on the key information bars by using a crawler tool to generate a plurality of data items, wherein each data item comprises one item of target data, and for transmitting the data items to a first data pipeline;

the data cleaning module is used for creating a plurality of data cleaning functions of different types, receiving the data items through the first data pipeline, inputting each data item into the data cleaning function corresponding to its category for cleaning, and transmitting the cleaned data items that meet the requirements to a second data pipeline;

the data warehousing module is used for creating a plurality of data warehousing functions of different types, receiving the data items through the second data pipeline, inputting each data item into the data warehousing function corresponding to its category for warehousing query processing, and updating the database according to the query processing result.

Preferably, each key information bar includes a plurality of data fields, and field names of data fields representing the same content in the key information bars of different dimensions are the same, and the data preparation module further integrates a table name and a table unique index corresponding to each key information bar into the same information table for unified management.

Preferably, the system further comprises a data post-processing module, which is used for re-sorting as a whole, when a data item is newly added to the database, all data items in the database of the same category as the newly added data item.

Preferably, the data post-processing module comprises a marking module and a sorting module; the marking module is used for receiving the data item from a third data pipeline by using a data marking function, marking the data item, and writing the feature name of the data item into Redis; the third data pipeline is used for receiving data items that meet the warehousing condition; the sorting module is used for reading the corresponding feature name from Redis by using a data sorting function and sorting the marks of the data items of the same category in the database based on the read feature name.

The invention also discloses another crawling data processing system based on the data stream, which comprises:

one or more processors;

a memory;

and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs including instructions for performing the data stream based crawling data processing method as described above.

The invention also discloses a computer readable storage medium comprising a computer program executable by a processor to perform a data stream based crawling data processing method as described above.

Compared with the prior art, the technical scheme of the invention designs the framework of the program that processes the crawled data around the idea of a data stream: the processing stages are connected in series, and only the final warehousing stage is communicatively connected to the database. The data to be processed therefore flows in sequence from the acquisition stage into the cleaning stage and from the cleaning stage into the warehousing stage, and the data meeting the requirements is finally written to the database in the warehousing stage. This data processing mode has a clear logical structure, is easy to extend, allows a data acquisition project to be built quickly, and is less error-prone when data meeting the conditions are stored in the database.

Drawings

FIG. 1 is a schematic diagram of a crawling data processing method in an embodiment of the present invention.

FIG. 2 is a flowchart of a crawling data processing method in an embodiment of the present invention.

Detailed Description

In order to describe the technical content, the constructional features, the achieved objects and effects of the present invention in detail, the following description is made in connection with the embodiments and the accompanying drawings.

This embodiment discloses a crawling data processing method based on a data stream, which is used for crawling data from web pages, other devices, and other sources by means of a crawler tool. Specifically, as shown in FIGS. 1 and 2, the data processing method includes:

S1, data preparation stage: according to project requirements, creating a plurality of key information bars which respectively belong to different dimensions and are used for crawling data;

S2, data acquisition stage: crawling target data from a target web page or other device based on the key information bars by using a crawler tool to generate a plurality of data items (namely data containers), wherein each data item comprises one item of target data, and transmitting the data items to a first data pipeline;

S3, data cleaning stage: first, creating a plurality of data cleaning functions of different types;

S4, receiving the data items through the first data pipeline, inputting each data item into the data cleaning function corresponding to its category for cleaning, and transmitting the cleaned data items that meet the requirements to a second data pipeline;

S5, data warehousing stage: first, creating a plurality of data warehousing functions of different types;

S6, receiving the data items through the second data pipeline, inputting each data item into the data warehousing function corresponding to its category for warehousing query processing, and updating a database according to the query processing result. That is, the database is queried for an object identical to the data in the current data item. If no such object exists, the creation time and the update time are initialized for the current data, and the data insertion operation is then performed. If the object exists, the field values of the new data are compared with those of the old data: fields whose values are unchanged are skipped, the changed fields are collected into an update dictionary, the update time is added, and the data update operation is then performed.
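The query-then-insert-or-update logic of step S6 can be sketched as follows. This is a minimal illustration, not the patented implementation: a plain Python dict stands in for the database, and the function name `warehouse_item`, the `key_fields` parameter, and the `created_at` / `updated_at` column names are all hypothetical. It also assumes that unchanged fields are the ones skipped and that only changed fields go into the update dictionary.

```python
from datetime import datetime, timezone


def warehouse_item(db, item, key_fields=("name", "version"), now=None):
    """Insert the item if no matching object exists, else update changed fields.

    `db` is a plain dict standing in for a database table (assumption).
    Returns "inserted", "updated", or "skipped".
    """
    now = now or datetime.now(timezone.utc)
    key = tuple(item[f] for f in key_fields)  # identity of the stored object
    existing = db.get(key)
    if existing is None:
        # New object: initialise creation and update times, then insert.
        db[key] = {**item, "created_at": now, "updated_at": now}
        return "inserted"
    # Existing object: build an update dictionary from the changed fields only.
    changes = {k: v for k, v in item.items() if existing.get(k) != v}
    if not changes:
        return "skipped"  # nothing differs, so the database is left untouched
    changes["updated_at"] = now
    existing.update(changes)
    return "updated"
```

Because an identical item produces an empty update dictionary, sending the same crawled record through the stage twice does not create a second row, which is the repeated-warehousing problem the background section describes.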

In the data processing method of this embodiment, the framework of the program that processes the crawled data is designed around the idea of a data stream. As shown in FIG. 1, the processing stages are connected in series, and only the final warehousing stage is communicatively connected to the database, so that the data to be processed flows from the acquisition stage into the cleaning stage and from the cleaning stage into the warehousing stage, where the data meeting the requirements is finally written to the database. This data processing mode therefore has a clear logical structure, is easy to extend, allows a data acquisition project to be built quickly, and is less error-prone when data meeting the conditions are stored in the database.
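The serial arrangement of stages can be sketched with two in-process queues standing in for the first and second data pipelines. Everything named here is an assumption made for illustration: the item layout (`category` plus `data`), the two example categories, the cleaning rules, and the dict used as the database. Only `warehousing_stage` receives the database handle, mirroring the point that only the last stage is connected to the database.

```python
from queue import Queue


def clean_component(item):
    # Hypothetical cleaner: trim whitespace and reject items without a version.
    data = {k: v.strip() for k, v in item["data"].items()}
    return {**item, "data": data} if data.get("version") else None


def clean_vulnerability(item):
    # Hypothetical cleaner: require an identifier and normalise it to upper case.
    data = dict(item["data"])
    if not data.get("id"):
        return None
    data["id"] = data["id"].strip().upper()
    return {**item, "data": data}


def store_component(db, item):
    db.setdefault("component", []).append(item["data"])


def store_vulnerability(db, item):
    db.setdefault("vulnerability", []).append(item["data"])


# One cleaning function and one warehousing function per item category.
CLEANERS = {"component": clean_component, "vulnerability": clean_vulnerability}
STORERS = {"component": store_component, "vulnerability": store_vulnerability}


def cleaning_stage(first_pipe, second_pipe):
    """Route each item to its cleaner; forward only items that pass."""
    while not first_pipe.empty():
        item = first_pipe.get()
        cleaner = CLEANERS.get(item["category"])
        cleaned = cleaner(item) if cleaner else None
        if cleaned is not None:
            second_pipe.put(cleaned)


def warehousing_stage(second_pipe, db):
    """The only stage that touches the database."""
    while not second_pipe.empty():
        item = second_pipe.get()
        STORERS[item["category"]](db, item)


def run(crawled_items):
    first_pipe, second_pipe, db = Queue(), Queue(), {}
    for item in crawled_items:  # acquisition stage output
        first_pipe.put(item)
    cleaning_stage(first_pipe, second_pipe)
    warehousing_stage(second_pipe, db)
    return db
```

Adding a new data dimension in this sketch means registering one more entry in `CLEANERS` and `STORERS`; the stage functions themselves do not change.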

Further, each key information bar includes a plurality of data fields, field names of the data fields representing the same content (such as release time) in the key information bars with different dimensions are the same, and a table name and a table unique index corresponding to each key information bar are integrated in the same information table so as to perform unified management and facilitate subsequent unified call.
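As a rough illustration of this arrangement, the sketch below models two key information bars and a single information table that holds each bar's table name and unique index. The class name `KeyInfoBar`, the table names, and the field lists are hypothetical; only the idea of a shared field name (here `release_time`) and a unified lookup table comes from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyInfoBar:
    dimension: str       # which data dimension this bar crawls
    table_name: str      # target database table (hypothetical name)
    unique_index: tuple  # fields that identify one record in that table
    fields: tuple        # all data fields carried by the bar


COMPONENT = KeyInfoBar(
    "component", "t_component",
    ("name", "version"),
    ("name", "version", "release_time"),
)
VULNERABILITY = KeyInfoBar(
    "vulnerability", "t_vulnerability",
    ("vuln_id",),
    ("vuln_id", "severity", "release_time"),
)

# The unified information table: one place to look up any bar's
# table name and unique index.
INFO_TABLE = {bar.dimension: bar for bar in (COMPONENT, VULNERABILITY)}


def unique_key(dimension, record):
    """Return (table name, unique-index values) for a record of this dimension."""
    bar = INFO_TABLE[dimension]
    return bar.table_name, tuple(record[f] for f in bar.unique_index)


def shared_fields(*bars):
    """Field names that mean the same thing across all the given bars."""
    common = set(bars[0].fields)
    for bar in bars[1:]:
        common &= set(bar.fields)
    return common
```

With the lookup centralised, a generic warehousing helper can call `unique_key` for any dimension instead of hard-coding table names per data type.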

Furthermore, the data processing method in this embodiment further includes a data post-processing stage: when a data item is newly added to the database, all data items in the database of the same category as the newly added data item are re-sorted as a whole, so as to facilitate subsequent calls.

Specifically, the method for sorting the data items as a whole includes:

first, when a data item is processed by the data warehousing function, transmitting the data item that meets the warehousing condition to a third data pipeline;

then, a data marking function receives the data item from the third data pipeline, marks the data item, and writes the feature name of the data item into Redis (Remote Dictionary Server, an open-source, log-type key-value database written in ANSI C that supports network access, can run in memory with optional persistence, and provides APIs for multiple languages); for example, if the data stored in the database is component A with version number 1.0, the feature name "component A" is written into Redis;

and then, a data sorting function reads the corresponding feature name from Redis and sorts the marks of the data items of the same category in the database based on the read feature name. For example, if "component A" is read, all data of component A are queried in the database. Suppose three records are found: component A (version 1.0), component A (version 2.0), and component A (version 3.0), of which versions 1.0 and 2.0 are existing data. Before version 3.0 entered the database, the mark of component A (version 1.0) had sequence number 2 and the mark of component A (version 2.0) had sequence number 1 (representing the latest). After component A (version 3.0) enters and the data sorting function has run, the mark of version 1.0 has sequence number 3, the mark of version 2.0 has sequence number 2, and the mark of version 3.0 has sequence number 1 (representing the latest).
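The component A example can be reproduced with a short sketch. Several things here are assumptions rather than details from the description: a Python `set` stands in for Redis (with a real server, a Redis set written with `SADD` and drained with `SPOP` would play a similar role), the database is a list of dicts, the mark is stored in a `rank` field, and records are ordered by a dotted numeric version string.

```python
def mark_item(item, pending_names):
    """Marking function: give the new item a provisional mark and record
    its feature name for the sorter to pick up."""
    item["rank"] = 0  # provisional; the sorter assigns the real sequence number
    pending_names.add(item["name"])


def version_key(version):
    # Assumes dotted numeric versions such as "1.0" or "2.10".
    return tuple(int(part) for part in version.split("."))


def resort(db_rows, pending_names):
    """Sorting function: for every recorded feature name, renumber all rows
    of that name so that sequence number 1 is the latest version."""
    while pending_names:
        name = pending_names.pop()
        rows = [r for r in db_rows if r["name"] == name]
        rows.sort(key=lambda r: version_key(r["version"]), reverse=True)
        for rank, row in enumerate(rows, start=1):
            row["rank"] = rank
```

Keeping only the feature name in the intermediate store means the sorter touches just the categories that actually received new data, rather than renumbering the whole database after every insert.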

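In another preferred embodiment of the present invention, a crawling data processing system based on a data stream is also disclosed, which includes the following functional modules:

a data preparation module, used for creating a plurality of key information bars which respectively belong to different dimensions and are used for crawling data;

a data acquisition module, used for crawling target data based on the key information bars by using a crawler tool to generate a plurality of data items, wherein each data item comprises one item of target data, and for transmitting the data items to a first data pipeline;

a data cleaning module, used for creating a plurality of data cleaning functions of different types, receiving the data items through the first data pipeline, inputting each data item into the data cleaning function corresponding to its category for cleaning, and transmitting the cleaned data items that meet the requirements to a second data pipeline;

a data warehousing module, used for creating a plurality of data warehousing functions of different types, receiving the data items through the second data pipeline, inputting each data item into the data warehousing function corresponding to its category for warehousing query processing, and updating the database according to the query processing result.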

Further, each key information bar includes a plurality of data fields, field names of data fields representing the same content in the key information bars of different dimensions are the same, and the data preparation module integrates a table name and a table unique index corresponding to each key information bar into the same information table so as to perform unified management.

Furthermore, the processing system in this embodiment further includes a data post-processing module, which is configured to re-sort as a whole, when a data item is newly added to the database, all data items in the database of the same category as the newly added data item.

Specifically, the data post-processing module comprises a marking module and a sorting module; the marking module is used for receiving the data item from a third data pipeline by using a data marking function, marking the data item, and writing the feature name of the data item into Redis; the third data pipeline is used for receiving data items that meet the warehousing condition; the sorting module is used for reading the corresponding feature name from Redis by using a data sorting function and sorting the marks of the data items of the same category in the database based on the read feature name.

The present invention also discloses another data stream based crawling data processing system comprising one or more processors, a memory and one or more programs, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the data stream based crawling data processing method as described above. The processor may employ a general-purpose central processing unit (Central Processing Unit, CPU), microprocessor, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits for executing associated programs to perform the functions required to be performed by the modules in the data flow based crawling data processing system of the embodiments of the present application or to perform the data flow based crawling data processing method of the embodiments of the present application.

The invention also discloses a computer readable storage medium comprising a computer program executable by a processor to perform a data stream based crawling data processing method as described above. The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a read-only memory (ROM), or a random-access memory (random access memory, RAM), or a magnetic medium, for example, a floppy disk, a hard disk, a magnetic tape, a magnetic disk, or an optical medium, for example, a digital versatile disk (digital versatile disc, DVD), or a semiconductor medium, for example, a Solid State Disk (SSD), or the like.

The present application also discloses a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the electronic device to perform the data stream based crawling data processing method described above.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope, which is defined by the claims that follow.

Claims

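1. A crawling data processing method based on a data stream, comprising:

creating a plurality of key information bars which respectively belong to different dimensions and are used for crawling data;

crawling target data based on the key information bars by using a crawler tool to generate a plurality of data items, wherein each data item comprises one item of target data, and transmitting the data items to a first data pipeline;

creating a plurality of data cleaning functions of different types;

receiving the data items through the first data pipeline, inputting each data item into the data cleaning function corresponding to its category for cleaning, and transmitting the cleaned data items that meet the requirements to a second data pipeline;

creating a plurality of data warehousing functions of different types;

and receiving the data items through the second data pipeline, inputting each data item into the data warehousing function corresponding to its category for warehousing query processing, and updating a database according to the query processing result.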

2. The method according to claim 1, wherein each key information bar includes a plurality of data fields, and field names of data fields representing the same content in the key information bars of different dimensions are the same, and a table name and a table unique index corresponding to each key information bar are integrated in the same information table for unified management.

3. The crawling data processing method based on a data stream according to claim 1, wherein when a data item is newly added to the database, all data items in the database of the same category as the newly added data item are re-sorted as a whole.

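4. The crawling data processing method based on a data stream according to claim 3, wherein the method for sorting the data items as a whole comprises:

when a data item is processed by the data warehousing function, transmitting the data item that meets the warehousing condition to a third data pipeline;

receiving the data item from the third data pipeline by using a data marking function, marking the data item, and writing the feature name of the data item into Redis;

and reading the corresponding feature name from Redis by using a data sorting function, and sorting the marks of the data items of the same category in the database based on the read feature name.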

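5. A crawling data processing system based on a data stream, comprising:

a data preparation module, used for creating a plurality of key information bars which respectively belong to different dimensions and are used for crawling data;

a data acquisition module, used for crawling target data based on the key information bars by using a crawler tool to generate a plurality of data items, wherein each data item comprises one item of target data, and for transmitting the data items to a first data pipeline;

a data cleaning module, used for creating a plurality of data cleaning functions of different types, receiving the data items through the first data pipeline, inputting each data item into the data cleaning function corresponding to its category for cleaning, and transmitting the cleaned data items that meet the requirements to a second data pipeline;

a data warehousing module, used for creating a plurality of data warehousing functions of different types, receiving the data items through the second data pipeline, inputting each data item into the data warehousing function corresponding to its category for warehousing query processing, and updating the database according to the query processing result.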

6. The system of claim 5, wherein each key information bar includes a plurality of data fields, the field names of data fields representing the same content in the key information bars of different dimensions are the same, and the data preparation module further integrates the table name and the table unique index corresponding to each key information bar into the same information table for unified management.

7. The crawling data processing system based on a data stream of claim 5, further comprising a data post-processing module, wherein the data post-processing module is configured to re-sort as a whole, when a data item is newly added to the database, all data items in the database of the same category as the newly added data item.

8. The crawling data processing system based on a data stream of claim 7, wherein the data post-processing module comprises a marking module and a sorting module; the marking module is used for receiving the data item from a third data pipeline by using a data marking function, marking the data item, and writing the feature name of the data item into Redis; the third data pipeline is used for receiving data items that meet the warehousing condition; the sorting module is used for reading the corresponding feature name from Redis by using a data sorting function and sorting the marks of the data items of the same category in the database based on the read feature name.

9. A data flow based crawling data processing system, comprising:

one or more processors;

a memory;

and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the data flow based crawling data processing method of any of claims 1-4.

10. A computer readable storage medium comprising a computer program executable by a processor to perform the data stream based crawling data processing method of any of claims 1 to 4.