CN111061719B

CN111061719B - Data collection method, device, equipment and storage medium

Info

Publication number: CN111061719B
Application number: CN201911369301.0A
Authority: CN
Inventors: 张浩然
Original assignee: Guangzhou Baiguoyuan Information Technology Co Ltd
Current assignee: Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2023-08-29
Anticipated expiration: 2039-12-26
Also published as: CN111061719A

Abstract

The invention discloses a data collection method, device, equipment and storage medium. The method comprises: collecting service data in at least one storage node according to a preset task information table; deleting repeated data in the service data; and taking the service data after deleting the repeated data as target data collected by a service end. According to the technical scheme provided by the embodiment of the invention, the data in each storage node is collected in parallel through the task information table, the data processing performance is improved, and the data consistency is ensured.

Description

Data collection method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to a data collection method, a device, equipment and a storage medium.

Background

With the development of internet technology, data has become an increasingly important part of daily life. With the growing use of distributed technology, a system is formed by interconnecting a plurality of processing nodes through communication lines; the processing nodes are geographically dispersed and may be scattered across an organization, a city, a country or even the world, and data is stored and processed in each processing node. A data acquisition device needs to acquire data from outside the system and input it into the system. Data collection is widely used in various fields, and the requirement for distributed data collection also presents challenges.

Conventional data collection methods are typically of two types, centralized data collection and decentralized centralized data collection, specified as follows: 1) centralized data collection inputs all data into the same computer for processing; 2) decentralized centralized data collection distributes the data set to different computers for processing, and the data collection on each computer is independent. However, both methods have obvious disadvantages. Centralized data collection is handled by a single computer, and when that computer fails or its processing capacity is exceeded, the whole collection system stops working. Decentralized centralized data collection solves the single-point problem of centralized data collection, but the consistency of transactions cannot be ensured because each computer processes independently.

Disclosure of Invention

The invention provides a data collection method, a device, equipment and a storage medium, which are used for realizing collection of mass data, enhancing data processing capacity and ensuring data consistency.

In a first aspect, an embodiment of the present invention provides a data collection method, including:

collecting service data in at least one storage node according to a preset task information table;

deleting repeated data in the service data;

and taking the service data after deleting the repeated data as target data collected by a service end.

In a second aspect, an embodiment of the present invention provides a data collection apparatus, including:

the data reading module is used for collecting service data in at least one storage node according to a preset task information table;

the data deduplication module is used for deleting the duplicate data in the service data;

and the data collection module is used for taking the service data after the repeated data is deleted as target data collected by the service end.

In a third aspect, the present invention provides an apparatus comprising:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data collection method as described in any of the embodiments of the present invention.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements a data collection method according to any of the embodiments of the present invention.

According to the technical scheme, the service data in each storage node is collected through the preset task information table, repeated data in the service data are deleted, the service data after de-duplication is used as target data collected by the service end, parallel collection of the data in each storage node is achieved through the task information table, data processing performance is improved, and consistency of the collected target data is guaranteed through de-duplication operation.

Drawings

FIG. 1 is a flow chart of a data collection method according to a first embodiment of the present invention;

FIG. 2 is a flow chart of a data collection method according to a second embodiment of the present invention;

FIG. 3 is an exemplary diagram of a data collection method according to a second embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data collection device according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings, and furthermore, embodiments of the present invention and features in the embodiments may be combined with each other without conflict.

Because existing data centers are distributed across many cities on all continents, and because service logic must handle non-repeatable operations such as placing orders, issuing rewards and updating order states, existing data collection methods often suffer from data loss and repeated collection, so that errors easily occur when the service end implements the service logic according to the collected data.

Example 1

FIG. 1 is a flowchart of a data collection method provided in the first embodiment of the present invention. The embodiment is applicable to the case of collecting distributed storage data. The method may be performed by a data collection device, and the device may be implemented in hardware and/or software. Referring to FIG. 1, the method provided in the embodiment of the present invention specifically includes the following steps:

Step 101, collecting service data in at least one storage node according to a preset task information table.

The task information table may be a data table that stores data reading tasks. A plurality of data reading tasks can be stored in the task information table, and each can be used to read service data in the storage nodes. The storage nodes may be constituent nodes of a storage cluster, and the data stored in different storage nodes may be the same or different: for example, if two storage nodes are redundant nodes, they store the same data, and if two storage nodes provide different service functions, they store different data. The service data may be data used to implement the service functions and may include order data, user information, prize information and the like.

Specifically, the data acquisition tasks in the task information table may be extracted, and the service data may be read in each storage node according to those tasks; it may be understood that the service data acquired by different data acquisition tasks may be different. Furthermore, when a data acquisition task in the task information table is executed, the task can be locked, so that one data acquisition task is executed by one storage node: after the task is locked, other storage nodes cannot execute it. This reduces the amount of data to be processed in the subsequent deduplication and improves the data collection efficiency.
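
The locking behaviour described above can be illustrated with a minimal sketch. It assumes an in-memory stand-in for the task information table; the class name TaskTable, the method try_lock and the node names are hypothetical and do not come from the embodiment, which does not specify how the lock is implemented.

```python
import threading


class TaskTable:
    """In-memory stand-in for the task information table (hypothetical)."""

    def __init__(self, task_ids):
        # None means the task is not yet locked by any storage node.
        self._owners = {task_id: None for task_id in task_ids}
        self._guard = threading.Lock()

    def try_lock(self, task_id, node):
        """Return True if `node` holds the task after the call, else False."""
        with self._guard:
            if self._owners[task_id] is None:
                self._owners[task_id] = node
            return self._owners[task_id] == node

    def owner(self, task_id):
        return self._owners[task_id]


table = TaskTable(["collect_orders"])
first = table.try_lock("collect_orders", "node-A")   # node-A acquires the task
second = table.try_lock("collect_orders", "node-B")  # node-B is refused
```

Only the node that acquired the lock would go on to execute the task, so the same task is not run twice against different nodes.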

Step 102, deleting repeated data in the service data.

The repeated data may be the same service data stored in different storage nodes. For example, if service data a, b and c are read from storage node A and service data a and d are read from storage node B, two copies of service data a are present in the read service data, and service data a is repeated data.

In the embodiment of the invention, the service data read from each storage node can be stored in a cache, and a deduplication operation can be performed on the service data in the cache. For example, a hash value can be generated for each piece of service data; if several pieces of service data have the same hash value, the redundant copies can be deleted so that only one piece of service data with that hash value is retained. Furthermore, when processing mass data, the service data can be read into a distributed cluster for caching, and the repeated data can be deleted within the distributed cluster.
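
As an illustration of the hash-based deduplication, the following sketch assumes that each piece of service data is a JSON-serializable record and uses SHA-256 as the hash function. Both choices, and the function names record_hash and deduplicate, are assumptions made for the example; the embodiment does not name a hash algorithm or a record format.

```python
import hashlib
import json


def record_hash(record):
    """Hash a record by its canonical JSON form (assumed record format)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each hash value and drop the rest."""
    seen = set()
    unique = []
    for record in records:
        digest = record_hash(record)
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


# Service data a, b, c from storage node A and a, d from storage node B.
node_a = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
node_b = [{"id": "a"}, {"id": "d"}]
result = deduplicate(node_a + node_b)
```

Here the second copy of record a is dropped, leaving a, b, c and d.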

Step 103, taking the service data after deleting the repeated data as target data collected by a service end.

The service end may be an application end that needs to acquire service data and may implement a service function according to the read service data, for example, updating an order status or issuing a prize. The target data may be the service data corresponding to the service function implemented by the service end and may be obtained by querying the service data.

Specifically, the service end can search the service data according to a data type or a data identifier, and can use the service data found by the query as the target data to be collected.

According to the technical scheme, the service data in each storage node is collected through the preset task information table, repeated data in the service data are deleted, the service data after the duplication removal is used as target data collected by the service end, the data in each storage node is quickly collected through the task information table, the data processing performance is improved, the consistency of the collected target data is guaranteed through the duplication removal operation, and the occurrence probability of data loss and repeated collection is reduced.

Example 2

Fig. 2 is a flowchart of a data collection method provided by a second embodiment of the present invention, where the embodiment is implemented based on the foregoing embodiment, and the technical solution of the embodiment of the present invention is applicable to a case of collecting mass data, and referring to fig. 2, the data collection method of the embodiment of the present invention includes:

Step 201, establishing a task information table according to a data acquisition request sent by a service end.

The data acquisition request may be a request for acquiring service data, and the data acquisition request may include information such as a service data type and a service data identifier that the service end requests to acquire.

In the embodiment of the invention, the information such as the data type and the data identification of the service data in the data acquisition request can be extracted, the acquired information can be stored into a task information table as task information, and further, the task information table can be stored into each storage node.
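
A minimal sketch of this step follows, assuming that each data acquisition request is a dictionary carrying a data type and an optional data identifier. The function build_task_table and the key names data_type and data_id are hypothetical and are used only for illustration.

```python
def build_task_table(requests):
    """Turn data acquisition requests into task information rows."""
    table = []
    for task_id, request in enumerate(requests, start=1):
        table.append({
            "task_id": task_id,
            "data_type": request["data_type"],
            # The identifier is optional in this sketch.
            "data_id": request.get("data_id"),
        })
    return table


requests = [
    {"data_type": "order", "data_id": "order_id"},
    {"data_type": "user"},
]
task_table = build_task_table(requests)
```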

In one embodiment, the task information in the task information table includes at least one of a data reading interface and a reading frequency.

Specifically, the task information table may specify how service data is acquired, and may include the data reading interface that is called in the storage node to acquire service data and the reading frequency, which determines how often service data is acquired. The service data may be read based on the information within the storage node.

For example, the data dictionary of task information stored in the task information table may include the task ID, the task name, the person responsible for the task, the service data collection interface, the frequency, the task state, the version number and the like, which serve as the task information for reading the service data in the storage node.
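
The fields named above could be modelled as a simple record type. In the sketch below the field names, the types and the choice of seconds as the frequency unit are assumptions; the embodiment lists the fields but not their types.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    """One row of the task information table (field types are assumed)."""
    task_id: int
    task_name: str
    owner: str               # person responsible for the task
    collect_interface: str   # name of the service data collection interface
    frequency_seconds: int   # reading interval; the unit is an assumption
    state: str
    version: int


task = TaskInfo(
    task_id=1,
    task_name="collect_orders",
    owner="ops",
    collect_interface="read_orders",
    frequency_seconds=5,
    state="enabled",
    version=1,
)
```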

Step 202, storing the task information table in each storage node, and extracting a data reading interface and a reading frequency in the task information table.

The data reading interface may be a program interface for acquiring service data. It may be implemented in software and preset in a storage node, where it is used to acquire the service data in that storage node. It may be understood that a rule for acquiring service data may be stored in the data reading interface, so that the service data acquired by calling different data reading interfaces may be different. Further, the storage node may be a storage node of a multi-master storage cluster, for example MySQL, Galera Cluster or RocketMQ.

In the embodiment of the present invention, the reading frequency may be a speed of reading the service data in the storage node, for example, the service data in the storage node may be read every 5 seconds. It is understood that the reading frequency may be to read the service data at fixed time intervals, and the time units of the time intervals may be seconds, minutes, hours, days, and the like. Specifically, the task information table may be stored in each storage node in the data storage cluster, further, the task information table in each storage node may store only locally corresponding task information, and the data reading interface and the reading frequency corresponding to each data reading task may be extracted from the task information table stored in the storage node.

Step 203, calling the data reading interface in the storage node to read the service data according to the reading frequency.

Specifically, the data reading interface may be called in the storage node at intervals corresponding to the reading frequency to obtain service data in the storage node, and it may be understood that the data reading interface may be a data extraction rule generated in a software manner, and different service data may be obtained in the storage node by calling the data reading interface. It can be appreciated that in the embodiment of the present invention, service data in the storage node may be continuously acquired through the reading frequency and the data reading interface.
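
The periodic call can be sketched as a simple polling loop. The function poll and its parameters are hypothetical; the sleep function is passed in as a parameter so that the example can run without actually waiting, and a fixed number of rounds replaces the continuous acquisition described above.

```python
import time


def poll(read_interface, interval_seconds, rounds, sleep=time.sleep):
    """Call the reading interface once per interval and gather the results."""
    collected = []
    for _ in range(rounds):
        collected.extend(read_interface())
        sleep(interval_seconds)
    return collected


# A fake reading interface that returns one batch per call.
batches = iter([[{"order_id": 1}], [], [{"order_id": 2}]])
pauses = []  # records the requested pauses instead of sleeping
records = poll(lambda: next(batches), 5, rounds=3, sleep=pauses.append)
```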

In one embodiment, the storage node includes at least one data reading interface, and the data reading interface reads the service data according to the data type.

Specifically, the storage node may be preset with a plurality of data reading interfaces, and one data reading interface may read service data of only one data type. For example, if data interface A is preset with an extraction rule for reading video data, calling data interface A obtains only service data of the video type. The data types can be divided into text, voice, video, pictures and the like according to the storage format, and can be divided into user information, commodity information, order information, state information and the like according to the service function.
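
One way to picture a set of type-specific reading interfaces is a registry of reader functions, each bound to a single data type. The names STORE, make_reader and READERS and the sample records are hypothetical.

```python
# Sample contents of one storage node (hypothetical records).
STORE = [
    {"id": 1, "type": "video", "body": "clip"},
    {"id": 2, "type": "order", "body": "order #2"},
    {"id": 3, "type": "video", "body": "trailer"},
]


def make_reader(data_type):
    """Build a reading interface that returns only one data type."""
    def read():
        return [record for record in STORE if record["type"] == data_type]
    return read


# One reading interface per data type.
READERS = {
    "video": make_reader("video"),
    "order": make_reader("order"),
}

videos = READERS["video"]()
```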

Step 204, summarizing service data through a stream processing queue, wherein the service data corresponds to at least two storage nodes.

The stream processing queue can be a queue that processes service data as it arrives. Because service data can be acquired continuously from the storage nodes, the stream processing queue can improve the processing efficiency of the service data and thus the data collection speed.

In the embodiment of the invention, the stream processing queue may be a Kafka stream processing cluster. The Kafka cluster may establish a connection with each storage node, and the service data continuously read from each storage node may be sent to and stored in the Kafka cluster, so that the service data is aggregated.
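
The aggregation step can be sketched with a thread-safe in-process queue standing in for the stream processing cluster. This is only an analogy: queue.Queue is not Kafka, and the function produce and the node names are hypothetical.

```python
import queue
import threading


def produce(node_name, records, topic):
    """Send every record read from one storage node to the shared queue."""
    for record in records:
        topic.put((node_name, record))


topic = queue.Queue()  # stand-in for a topic in the stream processing cluster
sources = {"node-A": ["a", "b", "c"], "node-B": ["a", "d"]}

workers = [
    threading.Thread(target=produce, args=(name, records, topic))
    for name, records in sources.items()
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

# Drain the queue: records from both nodes now sit in one place.
aggregated = []
while not topic.empty():
    aggregated.append(topic.get())
```

The aggregated list still contains record a twice, once per node; removing that duplicate is the job of the later deduplication step.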

Step 205, reading the service data to a cache data cluster, and obtaining a data identifier of the service data.

The cache data cluster can be a distributed cache cluster, a large amount of service data can be stored in a cache, a service end can conveniently acquire target data, and the data acquisition speed is improved, so that the data collection efficiency is improved.

Specifically, the cache data cluster subscribes to topics in the stream processing queue, when service data exists in the stream processing queue, the cache data cluster can actively read the service data, and before the acquired service data is stored in the cache data cluster, a data identifier of the service data can be extracted, wherein the data identifier can be a unique identifier number of the service data, and data identifiers corresponding to different service data are different.

Step 206, discarding the service data if the data identifier is found in the cache data cluster, otherwise, storing the service data in the cache data cluster.

In the embodiment of the invention, the data identifier of the service data can be used to judge whether the service data duplicates service data already cached in the cache data cluster. The data identifier can be looked up in the cache data cluster. If it is found, the service data has already been stored in the cache data cluster, so the newly acquired service data is a duplicate; it is not stored in the cache data cluster and can be deleted. If the data identifier cannot be found in the cache data cluster, the service data has not been stored yet, and it is stored in the cache data cluster.
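
The identifier check can be sketched with a dictionary standing in for the cache data cluster. The class DedupCache, the method offer and the use of order_id as the identifier field are assumptions for the example.

```python
class DedupCache:
    """Dictionary-backed stand-in for the cache data cluster (hypothetical)."""

    def __init__(self):
        self._store = {}

    def offer(self, record):
        """Store the record unless its identifier is already cached.

        Returns True if the record was stored, False if it was discarded.
        """
        key = record["order_id"]  # assumed identifier field
        if key in self._store:
            return False
        self._store[key] = record
        return True

    def values(self):
        return list(self._store.values())


cache = DedupCache()
incoming = [{"order_id": 1}, {"order_id": 2}, {"order_id": 1}]
outcomes = [cache.offer(record) for record in incoming]
```

In a real distributed cache the lookup and the store would presumably need to be a single atomic operation, so that two consumers cannot both store the same identifier; the sketch above runs in one thread and does not address that.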

Step 207, taking the service data after deleting the repeated data as target data collected by a service end.

Specifically, the service end can search target data in a cache data cluster for storing service data, and can collect service data required by realizing service functions according to data types, data formats and service scenes. For example, the service end is used for realizing the service function of order status update, and service data serving as order information can be searched in the cache data cluster to serve as target data.

According to the technical scheme, the task information table is built according to the data acquisition request sent by the service end, the task information such as the data reading interface and the reading frequency in the task information table is extracted, the data acquisition task corresponding to the task information is executed to acquire the service data, the acquired service data can be summarized through the stream processing queue, the service data is subjected to the deduplication operation through the cache data cluster from the stream processing queue, the service data after the duplicate data is deleted is used as the target data collected by the service end, the consistency of the collected target data is guaranteed through the deduplication operation, the rapid collection of the service data is realized, the data processing performance is enhanced, and the occurrence probability of data loss and repeated collection is reduced.

For example, FIG. 3 is an exemplary diagram of a data collection method according to the second embodiment of the present invention. Referring to FIG. 3, data collection is implemented by a MySQL cluster, a Kafka cluster and a Codis cluster working together. Task information may first be entered through the background to form a task information table, which may be stored in the MySQL cluster. A table scanning process can be created according to the task information stored in the MySQL cluster, and the task information can be locked while the table scanning process is created, to prevent multiple machines from generating the same task. A table scanning process that is created successfully can read the service data produced by the service data collection interface according to the reading frequency in the task information, and can write the read service data into the Kafka cluster. The service process corresponding to each service can read the service data stored in Kafka, deduplicate the read data through the Codis cluster according to the identification number order_id of the service data, and then send the service data to the service layer.
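
The flow of this example can be traced end to end with in-memory stand-ins: a deque in place of the queue cluster and a set in place of the deduplication cluster. The functions collect and consume and the sample records are hypothetical, and no real cluster client is used.

```python
from collections import deque


def collect(nodes):
    """Table scanning stage: every node writes its records to one queue."""
    topic = deque()  # stand-in for the queue cluster
    for records in nodes.values():
        topic.extend(records)
    return topic


def consume(topic):
    """Service process stage: drop records whose order_id was already seen."""
    seen = set()  # stand-in for the deduplication cluster
    delivered = []
    while topic:
        record = topic.popleft()
        if record["order_id"] in seen:
            continue
        seen.add(record["order_id"])
        delivered.append(record)
    return delivered


nodes = {
    "node-A": [{"order_id": 1, "status": "paid"}, {"order_id": 2, "status": "paid"}],
    "node-B": [{"order_id": 1, "status": "paid"}, {"order_id": 3, "status": "new"}],
}
delivered = consume(collect(nodes))
```

Order 1 appears on both nodes but reaches the service layer once.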

Example 3

FIG. 4 is a schematic structural diagram of a data collection device according to a third embodiment of the present invention. The device can execute the data collection method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method. The device may be implemented by software and/or hardware, and specifically includes: a data reading module 301, a data deduplication module 302 and a data collection module 303.

The data reading module 301 is configured to collect service data in at least one storage node according to a preset task information table.

A data deduplication module 302, configured to delete duplicate data in the service data.

And the data collection module 303 is configured to use the service data from which the repeated data is deleted as target data collected by the service end.

According to the technical scheme, the data reading module collects service data in each storage node through the preset task information table, the data deduplication module deletes repeated data in the service data, the data collection module takes the deduplicated service data as target data collected by a service end, parallel collection of the data in each storage node is achieved through the task information table, data processing performance is improved, and consistency of the collected target data is guaranteed through deduplication operation.

Further, on the basis of the above embodiment of the invention, the device further includes:

An information table module, configured to establish a task information table according to the data acquisition request sent by the service end.

Further, on the basis of the above embodiment of the present invention, the data reading module 301 includes:

An information extraction unit, configured to store the task information table into each storage node and extract a data reading interface and a reading frequency in the task information table.

A task execution unit, configured to call the data reading interface in the storage node to read the service data according to the reading frequency.

Further, on the basis of the above embodiment of the present invention, the task information in the task information table of the data reading module 301 includes at least one of a data reading interface and a reading frequency.

Further, on the basis of the above embodiment of the present invention, the storage node of the data reading module 301 includes at least one data reading interface, where the data reading interface reads service data according to a data type.

Further, in accordance with the above embodiment of the present invention, the data deduplication module 302 includes:

An identifier acquisition unit, configured to read the service data to the cache data cluster and acquire the data identifier of the service data.

A data deduplication unit, configured to discard the service data if the data identifier is found in the cache data cluster, and to store the service data to the cache data cluster if the data identifier is not found in the cache data cluster.

Further, on the basis of the above embodiment of the present invention, the device further includes:

A data summarizing module, configured to summarize service data through the stream processing queue, wherein the service data corresponds to at least two storage nodes.

Example 4

FIG. 5 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention. As shown in FIG. 5, the apparatus includes a processor 40, a memory 41, an input device 42 and an output device 43; the number of processors 40 in the apparatus may be one or more, and one processor 40 is taken as an example in FIG. 5; the processor 40, the memory 41, the input device 42 and the output device 43 in the apparatus may be connected by a bus or by other means, and connection by a bus is taken as an example in FIG. 5.

The memory 41 is a computer readable storage medium, and may be used to store a software program, a computer executable program, and a module, such as a program module corresponding to a data collecting method in an embodiment of the present invention (for example, the data reading module 301, the data deduplication module 302, and the data collecting module 303 in the data collecting apparatus). The processor 40 performs various functional applications of the device and data processing, i.e., implements the data collection method described above, by running software programs, instructions and modules stored in the memory 41.

The memory 41 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 41 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 41 may further include memory located remotely from processor 40, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input means 42 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output means 43 may comprise a display device such as a display screen.

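Example 5

A fifth embodiment of the present invention also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a data collection method comprising:

collecting service data in at least one storage node according to a preset task information table;

deleting repeated data in the service data;

and taking the service data after deleting the repeated data as target data collected by a service end.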

Of course, the storage medium containing computer executable instructions provided in the embodiments of the present invention is not limited to the above-described method operations, and may also perform related operations in the data collection method provided in any embodiment of the present invention.

From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.

It should be noted that, in the embodiment of the data collection device, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.

Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims

1. A method of data collection, comprising:

collecting service data in at least one storage node according to a preset task information table, including: storing the task information table into each storage node, and extracting a data reading interface and a reading frequency in the task information table; calling the data reading interface in the storage node according to the reading frequency to read the service data;

deleting the repeated data in the service data, including: reading the service data to a cache data cluster, and acquiring a data identifier of the service data; discarding the service data if the data identifier is found in the cache data cluster, otherwise, storing the service data in the cache data cluster;

and taking the service data after deleting the repeated data as target data collected by a service end.

2. The method of claim 1, further comprising, prior to said collecting service data in at least one storage node according to a preset task information table:

and establishing a task information table according to the data acquisition request sent by the service end.

3. The method according to any of claims 1-2, wherein the task information in the task information table comprises at least one of a data reading interface and a reading frequency.

4. The method of claim 1, wherein the storage node includes at least one data reading interface, the data reading interface reading service data according to a data type.

5. The method of claim 1, further comprising, prior to said deleting the duplicate data in the service data:

and summarizing service data through a stream processing queue, wherein the service data corresponds to at least two storage nodes.

6. A data collection device, comprising:

the data reading module is used for collecting service data in at least one storage node according to a preset task information table; the data reading module comprises: the information extraction unit is used for storing the task information table into each storage node and extracting a data reading interface and a reading frequency in the task information table; the task execution unit is used for calling the data reading interface in the storage node to read the service data according to the reading frequency;

the data deduplication module is used for deleting the duplicate data in the service data; the data deduplication module comprises: the identifier acquisition unit is used for reading the service data to a cache data cluster and acquiring a data identifier of the service data; the data deduplication unit is used for discarding the service data if the data identifier is found in the cache data cluster, and storing the service data to the cache data cluster if the data identifier is not found in the cache data cluster;

and the data collection module is used for taking the service data after the repeated data is deleted as target data collected by the service end.

7. A data collection device, comprising:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data collection method of any of claims 1-5.

8. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the data collection method according to any one of claims 1-5.