CN110851473A

CN110851473A - Data processing method, device and system

Info

Publication number: CN110851473A
Application number: CN201810825305.4A
Authority: CN
Inventors: 王爱东
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2018-07-25
Filing date: 2018-07-25
Publication date: 2020-02-28

Abstract

An embodiment of the invention discloses a data processing method, device and system. The data processing method comprises: a monitoring acquisition module acquires data from a data source according to pre-configured resource configuration information of the data source and pushes the acquired data to a streaming big data calculation module; a service module defines a deployment and control task; and the streaming big data calculation module processes the data according to the deployment and control task and a pre-configured data structure of metadata of the data source to obtain a deployment and control result, and pushes the deployment and control result to a specified channel of an in-memory database. The monitoring acquisition module connects to the data source quickly based on the resource configuration information of the data source, and the data is processed and analyzed efficiently based on the deployment and control task and the streaming big data calculation module.

Description

Data processing method, device and system

Technical Field

The present invention relates to, but is not limited to, big data computing and analysis technologies, and in particular to a data processing method, device, and system.

Background

The rapid development of emerging information technologies and application modes such as cloud computing, the Internet of Things, the mobile Internet, and social media has driven a rapid increase in global data volume and brought human society into the big data era. As services continue to develop, the requirements on data analysis are becoming higher and higher.

At present, the computing mode of big data can be divided into two forms of batch computing and streaming computing.

Research on big data batch computing is relatively mature. Efficient and stable batch computing systems, represented by Google's MapReduce programming model and the open-source Hadoop computing system, have been established, and remarkable results have been obtained in both theory and practice.

Early research on streaming computing often focused on streaming data in a database environment, where the data scale was small and the data objects were relatively uniform. Streaming big data today is characterized by real-time arrival, volatility, burstiness, disorder, and unboundedness, which places many new and higher demands on the system. For example, with the rapid development of the Internet of Things and Internet technology, the public security field generates massive streaming data characterized by volatility, timeliness, and so on. Volatility means that the data is not stored for long and is cleared periodically; timeliness means that the value of the data decreases over time. Conventional computing architectures have difficulty supporting the rapid connection to, rapid acquisition of, and efficient processing and analysis of this type of data.

Disclosure of Invention

The embodiment of the invention provides a data processing method, a data processing device and a data processing system, which can be used for rapidly accessing a data source and efficiently processing and analyzing data.

The embodiment of the invention provides a data processing method, which comprises the following steps:

the monitoring acquisition module acquires data from the data source according to the pre-configured resource configuration information of the data source and pushes the acquired data to the streaming big data calculation module;

the service module defines a deployment and control task;

and the streaming big data calculation module processes the data according to the deployment and control task and a data structure of metadata of a pre-configured data source to obtain a deployment and control result, and pushes the deployment and control result to a specified channel of a memory database.

In another embodiment of the present invention, before collecting data from the data source, the method further comprises:

the service module configures the resource configuration information of the data source; wherein the resource configuration information includes: a resource interface and a resource path;

before the streaming big data calculation module processes data according to the deployment and control task and the data structure of the metadata of the pre-configured data source to obtain the deployment and control result, the method further comprises the following steps:

the service module configures a data structure of metadata of the data source.

In another embodiment of the present invention, the method further comprises:

and the service module subscribes to the specified channel of the in-memory database and receives subscription messages pushed by the specified channel, thereby acquiring the data of the specified channel of the in-memory database.

In an embodiment of the present invention, the deployment and control task includes: basic information of the deployment and control task, deployment and control object information, and deployment and control dimensions, wherein the deployment and control dimensions include a deployment and control algorithm;

the streaming big data calculation module processes data according to the deployment and control task and the data structure of the metadata of the pre-configured data source to obtain a deployment and control result, and the method comprises the following steps:

and the streaming big data calculation module parses the data according to the data structure of the metadata, matches the parsed data using the deployment and control algorithm according to the basic information of the deployment and control task, and outputs the successfully matched data as the deployment and control result.

In the embodiment of the present invention, the deployment and control algorithm includes at least one of the following: a general algorithm provided by the algorithm library; an extended algorithm provided by the algorithm library;

wherein the extended algorithm comprises at least one of: a code-class custom algorithm, a function-dependency-class algorithm, and a regular-rule-class algorithm.

In another embodiment of the present invention, before the streaming big data calculation module processes data according to a deployment and control task to obtain a deployment and control result, the method further includes:

and the streaming big data calculation module shunts the deployment and control task according to the data source.

The embodiment of the invention provides a data processing device, which comprises at least one of the following modules:

the monitoring acquisition module is used for acquiring data from the data source according to the pre-configured resource configuration information of the data source and pushing the acquired data to the streaming big data calculation module;

the service module is used for defining a deployment and control task;

the streaming big data calculation module is used for caching the data; processing the data according to the deployment and control task and a data structure of metadata of a pre-configured data source to obtain a deployment and control result, and pushing the deployment and control result to a specified channel of a memory database;

and the storage module is used for storing the deployment and control result to a specified channel of the memory database.

An embodiment of the present invention provides a data processing apparatus, including a processor and a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed by the processor, at least one step of any one of the data processing methods is implemented.

An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements at least one step of any one of the data processing methods described above.

An embodiment of the present invention provides a data processing system, including:

the service module is used for defining a deployment and control task;

The embodiment of the invention comprises the following: the monitoring acquisition module acquires data from the data source according to the pre-configured resource configuration information of the data source and pushes the acquired data to the streaming big data calculation module; the service module defines a deployment and control task; and the streaming big data calculation module processes the data according to the deployment and control task and a pre-configured data structure of metadata of the data source to obtain a deployment and control result, and pushes the deployment and control result to a specified channel of an in-memory database. The monitoring acquisition module connects to the data source quickly based on the resource configuration information of the data source, and the data is processed and analyzed efficiently based on the deployment and control task and the streaming big data calculation module.

Additional features and advantages of embodiments of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of embodiments of the invention. The objectives and other advantages of the embodiments of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the examples of the invention serve to explain the principles of the embodiments of the invention and not to limit the embodiments of the invention.

FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a data processing system according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a detailed structure of a data processing system according to an embodiment of the present invention.

Detailed Description

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments of the present invention may be arbitrarily combined with each other without conflict.

The steps illustrated in the flow charts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one shown here.

Referring to fig. 1, an embodiment of the present invention provides a data processing method, including:

step 100, the monitoring acquisition module acquires data from the data source according to the pre-configured resource configuration information of the data source, and pushes the acquired data to the streaming big data calculation module.

In another embodiment of the present invention, before collecting data from the data source, the method further comprises:

the service module configures the resource configuration information of the data source; wherein the resource configuration information includes: resource interfaces and resource paths.

For example, the resource interface of the data source includes, but is not limited to, at least one of:

a database (DB), File Transfer Protocol (FTP), and Hadoop Distributed File System (HDFS).

Table 1 is an example of resource configuration information of a data source according to an embodiment of the present invention. As shown in Table 1, the monitoring acquisition module can rapidly connect to a data source through its resource interface and can rapidly locate the data to be acquired based on the resource path.

TABLE 1

In the embodiment of the present invention, the monitoring acquisition module acquires data from data sources through monitoring acquisition processes, where each data source corresponds to one monitoring acquisition process, as shown in fig. 3.

The monitoring acquisition module provided by the embodiment of the invention has high reliability (transactional data transmission is carried out to ensure data reliability) and high availability.

High reliability means:

(1) data is transferred in independent transactions, and the original data is marked as completed only if the data is successfully committed;

(2) deployment is distributed, so that when one node fails, data can be transmitted to other nodes without loss;

(3) multiple data reliability options are provided (from strong to weak): persisting data to disk combined with an acknowledgement (Ack) from the receiving end; writing data locally when the receiving end is unavailable and resuming sending after recovery; and sending data to the receiving end without any Quality of Service (QoS) guarantee.
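The three reliability options in item (3) can be illustrated with a minimal sketch. It assumes a `transmit` callable that returns True when the receiving end acknowledges a record; the `Reliability` and `Sender` names, and the in-memory list standing in for local disk storage, are hypothetical and not part of the described module.

```python
from enum import Enum
from typing import Callable, List


class Reliability(Enum):
    # Hypothetical labels for the three options, from strong to weak.
    DISK_AND_ACK = "disk_and_ack"
    STORE_ON_FAILURE = "store_on_failure"
    BEST_EFFORT = "best_effort"


class Sender:
    """Toy sender; `transmit` returns True when the receiver acknowledges."""

    def __init__(self, transmit: Callable[[str], bool], mode: Reliability):
        self.transmit = transmit
        self.mode = mode
        self.local_buffer: List[str] = []  # stands in for local disk storage

    def send(self, record: str) -> None:
        if self.mode is Reliability.BEST_EFFORT:
            self.transmit(record)  # result ignored: no delivery guarantee
        elif self.mode is Reliability.STORE_ON_FAILURE:
            if not self.transmit(record):
                self.local_buffer.append(record)  # keep for a later resend
        else:  # DISK_AND_ACK
            self.local_buffer.append(record)  # persist before sending
            if self.transmit(record):
                self.local_buffer.pop()  # drop only after the Ack arrives

    def resend_pending(self) -> None:
        """Retry buffered records, for example after the receiver recovers."""
        pending, self.local_buffer = self.local_buffer, []
        for record in pending:
            if not self.transmit(record):
                self.local_buffer.append(record)
```

In this sketch the strongest mode writes every record locally before sending, while the middle mode only writes locally on failure, which is one plausible reading of the ordering "from strong to weak".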

High availability means:

(1) configuration is simple: only a data access mode, a pipeline transmission mode, and a data output mode (fixed to Kafka in this example) need to be configured;

(2) access to and output from various types of data sources are supported;

(3) the bottom-layer architecture is uniformly managed by the top-layer architecture, which facilitates system monitoring and maintenance; deployment is distributed, with ZooKeeper used for management and load balancing.

In the embodiment of the invention, the streaming big data calculation module comprises a high-speed message queue and a calculation engine module. And the monitoring acquisition module pushes the acquired data to a high-speed message queue of the streaming big data calculation module.

The high-speed message queue can be implemented using any of a variety of high-speed message queue components. For example, a Kafka component may be used (other high-speed message queue components may be used instead). As a distributed intermediate system for data stream processing, the high-speed message queue can provide throughput of several hundred megabits per second, which satisfies the processing efficiency required for big data, and it plays an indispensable role in balancing acquisition and computation.

The Kafka message system consists of publishers (producers), brokers, and subscribers (consumers), which are located on different nodes and exchange data through messages. A publisher pushes messages to a topic, and subscribers subscribe to and pull the messages they are interested in, in units of groups. In this embodiment of the present invention, the monitoring acquisition module can be regarded as the publisher, the high-speed message queue as the broker, and the calculation engine (Spark Streaming in this embodiment) as the subscriber. The monitoring acquisition module pushes the acquired data into the high-speed message queue; message streams from different types of data sources can correspond to different topics, and the logic processing of different topics is relatively independent.
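The publisher, broker, and subscriber roles described above can be sketched with an in-process stand-in. This is not the Kafka client API; `MiniBroker`, its methods, and the group name "engine" are hypothetical names used only to show per-topic message lists and per-group read positions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class MiniBroker:
    """In-process stand-in for a message broker; not the Kafka API."""

    def __init__(self) -> None:
        self.topics: Dict[str, List[str]] = defaultdict(list)
        # Next unread position for each (group, topic) pair.
        self.offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, message: str) -> None:
        self.topics[topic].append(message)

    def pull(self, group: str, topic: str) -> List[str]:
        """Return messages the group has not read yet, then advance its offset."""
        start = self.offsets[(group, topic)]
        messages = self.topics[topic][start:]
        self.offsets[(group, topic)] = start + len(messages)
        return messages


broker = MiniBroker()
# Publisher side (the monitoring acquisition module in the text).
broker.publish("vehicle_records", "record-1")
broker.publish("vehicle_records", "record-2")
# Subscriber side (the calculation engine in the text), reading as one group.
first = broker.pull("engine", "vehicle_records")
second = broker.pull("engine", "vehicle_records")  # nothing new since last pull
```

Because each topic keeps its own message list and each group its own offset, the processing of one topic does not affect another, which mirrors the independence between topics noted above.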

Step 101, a service module defines a deployment and control task.

In the embodiment of the invention, the deployment and control task includes: basic information of the deployment and control task, deployment and control object information, and deployment and control dimensions, wherein the deployment and control dimensions include a deployment and control algorithm.

The basic information of the deployment and control task includes the start time of the deployment and control task. Optionally, the basic information further includes the termination time of the deployment and control task or other basic information related to the task.

The deployment and control object information identifies the tracked object, for example by a person's name, and can be customized.

Optionally, the deployment dimension further includes at least one of: data source, field.

The deployment and control algorithm includes at least one of the following: a general algorithm provided by the algorithm library; an extended algorithm provided by the algorithm library.

In a typical public security scenario, the most common deployment and control algorithms analyze and screen trajectory data: they check whether a deployment and control object appears in the trajectory data and extract key fields for early-warning display.

For example, the string global matching algorithm performs string matching between a value defined in a deployment and control dimension of a deployment and control object and the data stream; when the data stream contains that value, the data is captured and output as a deployment and control result.

For another example, the specified field matching algorithm performs string matching between a value defined in a deployment and control dimension of a deployment and control object and a specific field of the data stream; when that field contains the value, the data is captured and output as a deployment and control result.

When the specified field matching algorithm is used, the field to be matched must be included in the deployment and control dimension.
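The two general algorithms described above can be sketched as follows. The sketch assumes each record has already been parsed into a dictionary of field name to string value; the function names and the field names in the usage example are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]  # one parsed data record: field name -> value


def global_match(record: Record, value: str) -> bool:
    """String global matching: the value may appear in any field."""
    return any(value in field_value for field_value in record.values())


def field_match(record: Record, field: str, value: str) -> bool:
    """Specified field matching: the value must appear in the named field."""
    return value in record.get(field, "")


def run_matching(records: List[Record], value: str,
                 field: Optional[str] = None) -> List[Record]:
    """Return the records captured as deployment and control results."""
    if field is None:
        return [r for r in records if global_match(r, value)]
    return [r for r in records if field_match(r, field, value)]


records = [
    {"plate": "A12345", "note": "seen near B99999"},
    {"plate": "B99999", "note": ""},
]
anywhere = run_matching(records, "B99999")               # both records match
plate_only = run_matching(records, "B99999", "plate")    # only the second one
```

Global matching captures more records because it inspects every field, whereas specified field matching needs the field name as part of the deployment and control dimension.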

When the general algorithms cannot meet the requirements of a practical application, the algorithm library can be extended as needed. The extended algorithms include at least one of the following: a code-class custom algorithm, a function-dependency-class algorithm, and a regular-rule-class algorithm.

The code-class custom algorithm is defined by uploading a jar package that implements the custom algorithm;

the function-dependency-class algorithm extends the algorithm library by having the user enter two associated fields and the association relationship between them;

the regular-rule-class algorithm extends the algorithm library by, for example, having the user enter a regular expression.
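One way to picture an extensible algorithm library is a registry of named predicates. The sketch below is an assumption-laden illustration: `ALGORITHM_LIBRARY`, `register`, `regex_rule`, and `dependency_rule` are hypothetical names, and treating the function-dependency class as "two fields plus a relation between their values" is only one possible reading of the text. A code-class custom algorithm would correspond here to registering any user-written callable.

```python
import re
from typing import Callable, Dict

Record = Dict[str, str]
Algorithm = Callable[[Record], bool]

ALGORITHM_LIBRARY: Dict[str, Algorithm] = {}  # hypothetical in-memory library


def register(name: str, algorithm: Algorithm) -> None:
    ALGORITHM_LIBRARY[name] = algorithm


def regex_rule(field: str, pattern: str) -> Algorithm:
    """Regular-rule class: build an algorithm from a user-supplied regex."""
    compiled = re.compile(pattern)
    return lambda record: compiled.search(record.get(field, "")) is not None


def dependency_rule(field_a: str, field_b: str,
                    relation: Callable[[str, str], bool]) -> Algorithm:
    """Function-dependency class: two associated fields plus their relation."""
    return lambda record: (field_a in record and field_b in record
                           and relation(record[field_a], record[field_b]))


# Field names and patterns below are made up for illustration.
register("phone_prefix", regex_rule("phone", r"^138\d{8}$"))
register("same_city",
         dependency_rule("home_city", "checkin_city", lambda a, b: a == b))
```

A deployment and control task could then refer to an algorithm by its registered name, so adding a new rule does not require changing the matching engine.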

By making the deployment and control algorithms extensible, the embodiment of the invention improves the scalability of the system and thereby meets the requirements of specific scenarios.

Step 102, the streaming big data calculation module processes the data according to the deployment and control task and a pre-configured data structure of metadata of the data source to obtain a deployment and control result, and pushes the deployment and control result to a specified channel of an in-memory database.

In the embodiment of the invention, the calculation engine module of the streaming big data calculation module parses the data in the high-speed message queue according to the pre-configured data structure of the metadata of the data source, and processes the parsed data according to the deployment and control task to obtain the deployment and control result.

In another embodiment of the present invention, before the calculation engine module of the streaming big data calculation module parses the data in the high-speed message queue according to the pre-configured data structure of the metadata of the data source, the method further includes: the service module configures the data structure of the metadata of the data source.

In the embodiment of the present invention, when the data in the data source is stored in the form of a table, the data structure of the metadata includes: field content, field order, output field. For example, table 2 is one example of a data structure of metadata of a data source. The monitoring acquisition module can quickly acquire data under a resource path based on a data structure of metadata of a data source.

TABLE 2
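As a rough illustration of how a metadata structure with field content, field order, and output field might drive parsing, consider the sketch below. It assumes records arrive as delimiter-separated lines; `FieldMeta`, the comma delimiter, and the sample field names are hypothetical and do not reproduce Table 2.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FieldMeta:
    name: str     # field content: what the column holds
    order: int    # field order: zero-based position in the raw record
    output: bool  # output field: whether it appears in the result


def parse_record(line: str, metadata: List[FieldMeta],
                 delimiter: str = ",") -> Dict[str, str]:
    """Split a raw line and label each value using the field order."""
    values = line.split(delimiter)
    return {m.name: values[m.order] for m in metadata if m.order < len(values)}


def select_output(record: Dict[str, str],
                  metadata: List[FieldMeta]) -> Dict[str, str]:
    """Keep only the fields marked as output fields."""
    return {m.name: record[m.name] for m in metadata
            if m.output and m.name in record}


metadata = [
    FieldMeta("plate", 0, True),
    FieldMeta("checkpoint", 1, True),
    FieldMeta("raw_flags", 2, False),
]
parsed = parse_record("A12345,K07,0x1", metadata)
shown = select_output(parsed, metadata)
```

Separating parsing from output selection lets the matching step see every field while the result carries only the fields meant for display.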

In the embodiment of the invention, the streaming big data calculation module matches data using the deployment and control algorithm according to the basic information of the deployment and control task, and outputs the successfully matched data as the deployment and control result. Specifically, the calculation engine module of the streaming big data calculation module matches the data in the high-speed message queue using the deployment and control algorithm according to the basic information of the deployment and control task, and outputs the successfully matched data as the deployment and control result.

For example, when the basic information of the deployment and control task includes only the start time of the task, the calculation engine module matches data using the deployment and control algorithm from the start time onward, and outputs the successfully matched data as the deployment and control result. When the basic information includes both the start time and the termination time of the task, the calculation engine module matches data using the deployment and control algorithm within the period between the start time and the termination time, and outputs the successfully matched data as the deployment and control result.

When the deployment and control dimension does not include a data source, the calculation engine module uses the deployment and control algorithm to match data acquired from all data sources; when the deployment and control dimension includes a data source, the calculation engine module uses the deployment and control algorithm to match only the data acquired from that data source.
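The time-window and data-source rules above amount to a simple applicability check performed before matching. The following sketch assumes each record carries an event time and the name of its data source; `ControlTask` and `applies` are hypothetical names, and an empty `sources` set is used here to mean "all data sources".

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Set


@dataclass
class ControlTask:
    target: str                     # deployment and control object
    start: datetime                 # required start time
    end: Optional[datetime] = None  # optional termination time
    sources: Set[str] = field(default_factory=set)  # empty = all data sources


def applies(task: ControlTask, source: str, event_time: datetime) -> bool:
    """Decide whether a record from `source` at `event_time` should be matched."""
    if event_time < task.start:
        return False
    if task.end is not None and event_time > task.end:
        return False
    return not task.sources or source in task.sources
```

Only records for which `applies` returns True would then be passed to the deployment and control algorithm itself.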

The computing engine module in the present example may be implemented using a variety of components. For example, a Spark Streaming module (not limited to Spark Streaming module) may be employed.

In the embodiment of the present invention, the data structure of the deployment and control result may adopt a data structure shown in table 3, and of course, other data structures may also be adopted, which is not limited in the embodiment of the present invention.

TABLE 3

In another embodiment of the present invention, before the streaming big data calculation module processes data according to the deployment task to obtain the deployment result, the method further includes:

and the streaming big data calculation module shunts the deployment and control tasks according to the data source. Specifically, the calculation engine module of the streaming big data calculation module shunts the deployment and control tasks according to the data source.

That is, the calculation engine module may implement data matching using a plurality of nodes, where each node matches the data acquired from one data source and pushes its matching results to the specified channel of the in-memory database.
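Shunting by data source can be pictured as grouping tasks under the data source whose stream each node handles. This is a minimal sketch under the assumption that a task with no data source listed applies to every source; the `shunt` function and the task and source names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (task id, data sources named in the task; an empty list means all sources)
Task = Tuple[str, List[str]]


def shunt(tasks: Iterable[Task], all_sources: List[str]) -> Dict[str, List[str]]:
    """Group task ids by the data source whose stream they must inspect."""
    by_source: Dict[str, List[str]] = defaultdict(list)
    for task_id, sources in tasks:
        for source in (sources or all_sources):
            by_source[source].append(task_id)
    return dict(by_source)


tasks = [("t1", ["hotel"]), ("t2", [])]
plan = shunt(tasks, ["hotel", "vehicle"])
```

Each key of `plan` would correspond to one node, which then only evaluates the tasks listed for its own data source.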

In another embodiment of the present invention, the method further comprises:

and the service module subscribes to the specified channel of the in-memory database and receives subscription messages pushed by the specified channel, thereby acquiring the data of the specified channel of the in-memory database. Through its subscription to the specified channel, the service module can perceive and display the deployment and control results in real time.
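The subscription mechanism can be sketched with an in-process publish/subscribe hub. This stands in for the channel feature of an in-memory database and does not use any real database client; `ChannelHub` and the channel name "control_results" are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[str], None]


class ChannelHub:
    """In-process stand-in for the pub/sub channels of an in-memory database."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self.handlers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        """Deliver the message to every subscriber; return how many received it."""
        for handler in self.handlers[channel]:
            handler(message)
        return len(self.handlers[channel])


hub = ChannelHub()
displayed: List[str] = []
# Service module side: react to each pushed result as soon as it arrives.
hub.subscribe("control_results", displayed.append)
# Calculation engine side: push a result to the specified channel.
delivered = hub.publish("control_results", "match: plate A12345 at K07")
```

Because results are pushed to subscribers rather than polled, the service module sees each result as soon as it is published.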

In another embodiment of the present invention, the service module periodically dumps the data in the in-memory database, for example into a relational database or a search engine, so that users can query and analyze historical results.

The monitoring acquisition module of the embodiment of the invention connects to the data source quickly based on the resource configuration information of the data source and can quickly acquire the data of the data source without knowing the data structure of its metadata; efficient processing and analysis of the data is achieved based on the deployment and control task and the streaming big data calculation module.

After the data sources are configured and connected, massive streaming data is processed by the distributed monitoring acquisition module, which offers high reliability, high throughput, and high concurrency, together with the streaming big data calculation framework. This greatly improves the response speed for streaming data with strict real-time requirements and improves the accuracy of data processing and analysis to a certain extent. The method provided by the embodiment of the invention is also broadly applicable.

Example

In this example, the method comprises:

Step 1, a suspect A in a certain case is placed under deployment and control, and the daily track information of the suspect (the deployment and control object) needs to be monitored.

The following information about the suspect is known: identification number, vehicle license plate number, mobile phone number, and so on.

Several new data sources need to be placed under deployment and control: Internet cafe access records (HDFS), checkpoint vehicle-passing records (Oracle), hotel check-in information (MySQL), and mobile phone call records (FTP).

Step 2, the service module configures a resource interface of the data source, as shown in Table 4.

Data source type	Interface name	Interface definition
HDFS	hdfs_01	hdfs://ip:9000
Oracle	oracle_02	db://ip:1521/oracle32#oracle
MySQL	mysql_03	db://ip:3306/test#mysql
FTP	ftp_04	ftp://ip:21

TABLE 4

The detailed interface definitions differ between types of data resources and are not described further here.

Step 3, the service module configures a resource path of the data source, as shown in Table 5.
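The interface definitions in Table 4 follow a URL-like shape, so they can be split with a standard URL parser. The sketch below assumes that, for `db://` entries, the part after `#` names the database product and the path names the database instance; that reading, the `ResourceInterface` type, and `parse_interface` are assumptions for illustration.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class ResourceInterface(NamedTuple):
    kind: str                # "hdfs", "ftp", or the database type after '#'
    host: Optional[str]
    port: Optional[int]
    database: Optional[str]  # path component, if any


def parse_interface(definition: str) -> ResourceInterface:
    parts = urlsplit(definition)
    # Assumption: for "db://" entries the fragment names the database product.
    kind = parts.fragment if parts.scheme == "db" else parts.scheme
    database = parts.path.lstrip("/") or None
    return ResourceInterface(kind, parts.hostname, parts.port, database)


oracle = parse_interface("db://ip:1521/oracle32#oracle")
hdfs = parse_interface("hdfs://ip:9000")
```

With the interface parsed into typed fields, a monitoring acquisition process could choose the appropriate connector from `kind` alone.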

TABLE 5

Step 4, the service module configures a data structure of the metadata of the data source, as shown in Table 6.

TABLE 6

Step 5, the service module manages the algorithm library.

The algorithm library comprises general algorithms. When the general algorithms cannot meet the requirements of a practical application, the algorithm library can be extended as needed, and the extended algorithms include at least one of the following: a code-class custom algorithm, a function-dependency-class algorithm, and a regular-rule-class algorithm.

Step 6, the service module defines a deployment and control task, as shown in Table 7.

TABLE 7

Depending on the algorithm, a specific data source (and a specific field) can be selected, or no data source is selected, in which case all data sources are matched.

Step 7, the monitoring acquisition module acquires data from the data sources in real time according to the resource interface and the resource path of each data source and pushes the data to Kafka. Kafka topics are defined per data source, so the four data sources above yield four topics:

Topic1: T_ZYK_RK_WBSWRY

Topic2: T_VEH_KK_PASSREC

Topic3: T_QB_LG_RY_CGUEST

Topic4: T_ZA_XTBA_RY_PH_THJL

A corresponding background task is generated in Spark Streaming according to the deployment and control task defined by the service module. Spark Streaming consumes all topics from Kafka, shunts the data by topic, and parses the data stream in each topic according to the data structure of the metadata. The Spark Streaming task then calls the algorithm to perform the logical operation and, in combination with the output definition of the data source, pushes the real-time deployment and control result.

The real-time deployment and control result is pushed to the in-memory database. The application process subscribes to the specified channel of the in-memory database and can therefore perceive and display the deployment and control result in real time.

The application process periodically dumps the data of the in-memory database to a relational database or a search engine (only a preset amount of the latest data is retained in the in-memory database) for querying historical deployment and control results or performing secondary analysis (for example, generating a suspect trajectory diagram or performing judgment analysis, which are not described further here).
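The periodic dump with a retention limit can be sketched as follows. The sketch assumes that "dumping" moves everything except the latest preset number of results out of memory; `ResultStore`, `keep_latest`, and the list standing in for the relational database or search engine are hypothetical.

```python
from collections import deque
from typing import Deque, List


class ResultStore:
    """Stand-in for the in-memory database plus a historical store."""

    def __init__(self, keep_latest: int) -> None:
        self.keep_latest = keep_latest
        self.in_memory: Deque[str] = deque()
        self.history: List[str] = []  # stands in for a relational DB or search engine

    def push(self, result: str) -> None:
        self.in_memory.append(result)

    def dump(self) -> int:
        """Move all but the latest `keep_latest` results to history."""
        moved = 0
        while len(self.in_memory) > self.keep_latest:
            self.history.append(self.in_memory.popleft())
            moved += 1
        return moved


store = ResultStore(keep_latest=2)
for name in ["r1", "r2", "r3", "r4", "r5"]:
    store.push(name)
moved = store.dump()  # in a real deployment this would run on a timer
```

Keeping only recent results in memory bounds its size, while the historical store remains available for queries and secondary analysis.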

Another embodiment of the present invention provides a data processing apparatus, including at least one of the following modules:

the monitoring acquisition module 202 is used for acquiring data from a data source according to the pre-configured resource configuration information of the data source 201 and pushing the acquired data to the streaming big data calculation module;

the service module 204 is used for defining a deployment and control task;

the streaming big data calculation module 203 is configured to cache the data; processing the data according to the deployment and control task and a data structure of metadata of a pre-configured data source 201 to obtain a deployment and control result, and pushing the deployment and control result to a specified channel of an in-memory database;

the storage module 205 is configured to store the deployment and control result to the specified channel of the in-memory database.

In another embodiment of the present invention, the service module 204 is further configured to:

configuring resource configuration information of the data source 201 and a data structure of metadata of the data source, wherein the resource configuration information includes: resource interfaces and resource paths;

and subscribing to the specified channel of the in-memory database, and receiving subscription messages pushed by the specified channel to acquire the data of the specified channel of the in-memory database.

In another embodiment of the present invention, the deployment and control task includes: basic information of the deployment and control task, deployment and control object information, and deployment and control dimensions, wherein the deployment and control dimensions include a deployment and control algorithm;

the streaming big data calculation module 203 is specifically configured to:

caching the data; and parsing the data according to the data structure of the metadata, matching the parsed data using the deployment and control algorithm according to the basic information of the deployment and control task, outputting the successfully matched data as the deployment and control result, and pushing the deployment and control result to the specified channel of the in-memory database.

In another embodiment of the present invention, the streaming big data calculation module 203 is further configured to:

and shunting the deployment and control task according to the data source.

Another embodiment of the present invention provides a data processing apparatus, including a processor and a computer-readable storage medium, wherein the computer-readable storage medium stores instructions, and when the instructions are executed by the processor, at least one step of any one of the data processing methods is implemented.

Another embodiment of the invention proposes a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out at least one step of any one of the data processing methods described above.

Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.

Referring to fig. 2, another embodiment of the present invention provides a data processing system, including:

the service module 204 is used for defining a deployment and control task;

the streaming big data calculation module 203 is configured to cache the data, process the data according to the deployment and control task and a pre-configured data structure of the metadata of the data source to obtain a deployment and control result, and push the deployment and control result to a specified channel of a memory database;

The monitoring acquisition module 202, the service module 204, the streaming big data calculation module 203, and the storage module 205 in the embodiment of the present invention may be arranged in one node, may be arranged in different nodes, may be implemented by a plurality of nodes, or may be implemented by a cluster.

In another embodiment of the invention, the streaming big data calculation module 203 comprises a high-speed message queue and a calculation engine module;

a high-speed message queue for caching the data;

and the calculation engine module is used for analyzing the data according to the data structure of the metadata, processing the analyzed data according to the deployment and control task to obtain a deployment and control result, and pushing the deployment and control result to a specified channel of the memory database.
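The split between a buffering queue and a calculation engine can be sketched as below. This is a minimal single-process illustration, assuming Python's standard `queue.Queue` as a stand-in for the high-speed message queue and a plain dictionary as a stand-in for the channels of the memory database. The class name `ComputeEngine`, its `drain` method, and the comma-separated record format are hypothetical.

```python
import queue
from typing import Callable, Dict, List


class ComputeEngine:
    """Hypothetical calculation engine: parses buffered records and filters them."""

    def __init__(self, metadata_fields: List[str],
                 predicate: Callable[[dict], bool], channel: str) -> None:
        self.metadata_fields = metadata_fields  # configured field order of the source
        self.predicate = predicate              # stands in for the deployment and control task
        self.channel = channel                  # channel that receives matching records

    def drain(self, buffer: "queue.Queue[str]", sink: Dict[str, List[dict]]) -> int:
        """Consume everything currently buffered; return how many records were read."""
        processed = 0
        while True:
            try:
                raw = buffer.get_nowait()
            except queue.Empty:
                break
            # Parse according to the metadata structure, then apply the task.
            record = dict(zip(self.metadata_fields, raw.split(",")))
            if self.predicate(record):
                sink.setdefault(self.channel, []).append(record)
            processed += 1
        return processed
```

Keeping the queue separate from the engine means the acquisition side can keep pushing records while the engine consumes at its own pace, which is the role the description assigns to the cache.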

In embodiments of the present invention, each high-speed message queue may be implemented using one node and is used for caching the data of one data source.

The compute engine module may be implemented using one node or using one cluster. One node implements processing of data of one data source.

The streaming big data calculation module 203 is specifically configured to:

shunt the deployment and control tasks according to the data source.

That is, the compute engine module is implemented using a plurality of nodes, one node of the compute engine module being for processing data of one data source.
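Shunting by data source, with one node per data source, can be sketched as follows. This is an illustrative sketch only: the function names `shunt_tasks` and `assign_nodes`, the `(task_id, data_source)` tuple layout, and the node names are assumptions, and a real cluster scheduler would handle placement differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def shunt_tasks(tasks: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (task_id, data_source) pairs by data source, preserving task order."""
    by_source: Dict[str, List[str]] = defaultdict(list)
    for task_id, source in tasks:
        by_source[source].append(task_id)
    return dict(by_source)


def assign_nodes(sources: List[str], nodes: List[str]) -> Dict[str, str]:
    """Give each data source its own node; fail if there are too few nodes."""
    unique_sources = sorted(set(sources))
    if len(unique_sources) > len(nodes):
        raise ValueError("not enough nodes for one node per data source")
    return dict(zip(unique_sources, nodes))
```

With the tasks grouped per source and each source mapped to a node, every node only evaluates the tasks that concern the data source it processes.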

Although the embodiments of the present invention have been described above, the descriptions are only used for understanding the embodiments of the present invention, and are not intended to limit the embodiments of the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the embodiments of the invention as defined by the appended claims.

Claims

1. A method of data processing, comprising:

the business module defines a deployment and control task;

2. The data processing method of claim 1, wherein prior to collecting data from the data source, the method further comprises:

the business module configures a data structure of metadata of the data source.

3. The data processing method of claim 1, further comprising:

4. The data processing method according to any one of claims 1 to 3, wherein the deployment and control task comprises: deployment and control task basic information, deployment and control object information, and deployment and control dimensions, wherein the deployment and control dimensions comprise a deployment and control algorithm;

5. The data processing method of claim 4, wherein the deployment and control algorithm comprises at least one of: a general algorithm and an extended algorithm provided by an algorithm library;

6. The data processing method according to any one of claims 1 to 3, wherein before the streaming big data calculation module processes data according to the deployment and control task to obtain a deployment and control result, the method further comprises:

7. A data processing apparatus comprising at least one of the following modules:

the service module is used for defining a deployment and control task;

8. A data processing apparatus comprising a processor and a computer readable storage medium having instructions stored thereon, wherein the instructions, when executed by the processor, carry out at least one step of a data processing method according to any one of claims 1 to 6.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out at least one step of a data processing method according to any one of claims 1 to 6.

10. A data processing system comprising:

the service module is used for defining a deployment and control task;