CN113094154A

CN113094154A - Big data processing method and system based on Aliyun

Info

Publication number: CN113094154A
Application number: CN202110355586.3A
Authority: CN
Inventors: 徐胜国; 郭靓; 曾锃; 金倩倩
Original assignee: Nari Technology Co Ltd; Nari Information and Communication Technology Co
Current assignee: Nari Technology Co Ltd; Nari Information and Communication Technology Co
Priority date: 2021-04-01
Filing date: 2021-04-01
Publication date: 2021-07-09

Abstract

The invention discloses a big data processing method based on Aliyun, which comprises: preprocessing collected data through Aliyun; multitasking the preprocessed data through Aliyun; storing alarms generated during the multitasking of data into a database; and storing the multitasked data through Aliyun. The Aliyun-based computing and storage platform makes full use of the data processing capacity of Aliyun (the EB-level big data storage and analysis capacity of MaxCompute). Through Blink, esper runs on Aliyun in a distributed multi-task mode, which solves the low efficiency of single-process esper and provides stronger data analysis capacity, and the various components of the Aliyun platform improve data processing efficiency.

Description

Big data processing method and system based on Aliyun

Technical Field

The invention belongs to the technical field of information processing, and particularly relates to a big data processing method based on Aliyun.

Background

With the development of internet technology, big data is applied more and more widely. In big data processing, a traditional architecture requires many virtual machine resources, and the analysis tool esper runs as a single process on a single server, so data processing is slow and prone to a single point of failure.

Disclosure of Invention

The invention aims to provide a big data processing method based on Aliyun that can improve the efficiency of big data processing.

In order to achieve the purpose, the invention provides the following technical scheme:

In a first aspect, a big data processing method based on Aliyun is provided, which includes:

preprocessing the acquired data through Aliyun;

multitasking the preprocessed data through Aliyun;

storing alarms generated in the multitasking of data into a database;

and storing the multitasked data through Aliyun.

With reference to the first aspect, further, the preprocessing of the acquired data through Aliyun specifically includes:

sending the collected data to a kafka message queue of Aliyun;

and taking the data out of the kafka message queue through the Aliyun Blink computing engine; identifying, deduplicating, denoising, normalizing and enhancing the data; and sending the processed data to a data analysis queue and a big data storage queue, respectively.

With reference to the first aspect, further, the multitasking of the preprocessed data through Aliyun specifically includes:

converting the data in the data analysis queue into corresponding data objects according to a data model, converting the data objects into esper events, and sending the esper events to an esper rule engine for analysis and processing, wherein the esper rule engine runs in a multi-task mode on the Blink computing engine of Aliyun.

With reference to the first aspect, further, an alarm event is generated for an esper event that hits an alarm rule loaded in advance, and is sent to the alarm warehousing queue of kafka.

With reference to the first aspect, further, storing the alarms generated in the multitasking of data into a database specifically includes:

judging, through the whiteteliststream tool in the Aliyun Blink computing engine, whether the alarm event is in a white list; if so, the alarm event is not warehoused; otherwise, the warehousing operation is performed on the alarm event.

With reference to the first aspect, further, in the process of warehousing operation, if there is no alarm with the same source IP, the same destination IP, and the same alarm type in the database, the alarm is inserted into the database, otherwise, the number of times of the existing alarm is incremented by 1, and the number of times is updated into the database.

With reference to the first aspect, further, the storing of the multitasked data through Aliyun includes:

reading data from the big data storage queue, and deserializing the data into corresponding data objects according to the data types;

and storing the data objects through the MaxCompute service of Aliyun.

With reference to the first aspect, further, historical data stored in MaxCompute that is more than 6 months old is automatically deleted.

In a second aspect, an Aliyun-based big data processing system is provided, including:

a preprocessing module, configured to preprocess the acquired data through Aliyun;

a multitask processing module, configured to multitask the preprocessed data through Aliyun;

an alarm storage module, configured to store alarms generated in the multitasking of data into a database;

a data storage module, configured to store the multitasked data through Aliyun.

The beneficial technical effects are as follows: compared with the prior art, the Aliyun-based computing and storage platform provided by the invention makes full use of the data processing capacity of Aliyun (the EB-level big data storage and analysis capacity of MaxCompute). Through Blink, esper runs on Aliyun in a distributed multi-task mode, which solves the low efficiency of single-process esper and provides stronger data analysis capacity, and the various components of the Aliyun platform improve data processing efficiency.

Drawings

FIG. 1 is a system architecture diagram of the present invention;

FIG. 2 is a data flow diagram of the present invention;

FIG. 3 is a flow diagram of a big data preprocessing module according to the present invention;

FIG. 4 is a flow chart of big data analysis in the present invention;

FIG. 5 is a flow chart of the alarm warehousing process of the present invention;

FIG. 6 is a flow chart of big data storage according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

As shown in FIG. 1 to FIG. 6, the present invention provides a big data processing method based on Aliyun, comprising the following steps:

Step one, preprocessing the acquired data through Aliyun

The method specifically comprises the following steps:

1) Data parsing: the collected data are sent to a kafka message queue of Aliyun; the data to be preprocessed are consumed from the kafka queue and deserialized into corresponding objects according to their data types. For most of the data, the resulting objects are an alarm object (eventWarnng), an index object (indicator Single) and a heartbeat object (DataQualityHeartBase).

2) Data identification: the data from step 1) are identified by looking up a pre-cached classification table according to the port value of the data, selecting the data identification code required for the problem at hand, and determining the data type.

3) Data denoising: data from step 2) whose format or number of attributes does not meet the requirements are deleted directly, thereby improving the data quality.

4) Data deduplication: the result data of step 3) are deduplicated. If records that differ only in time and are identical in all other attributes appear multiple times within a specified time interval, only the last of the repeated records is kept.

5) Data normalization: the deduplicated log data from step 4) are normalized, converting logs in various formats into a uniform description form. Analysts then do not need to be familiar with the differing log formats of different manufacturers, which greatly improves the efficiency of analysis and audit work. The system provides normalized fields including log receiving time, log generating time, log duration, user name, source address, source MAC address, source port, operation, destination address, destination MAC address, destination port, log event name, abstract, level, original type, network protocol, network application protocol, device address, device name, device type and the like. Technical personnel also manually classify and analyze each log according to best practice and the relevant technical standards and add a new log type field, which enriches the information content of the log and makes the log information easier to understand.

6) Data forwarding: the normalized data from step 5) are sent to ES, to the analysis queue of kafka and to the big data storage queue of kafka, respectively. ES extracts tokens from the full text of the original log and indexes them, so that both the formatted fields and the full text are indexed; with this full-text indexing, the system provides analysts with a flexible and convenient analysis tool. The analysis queue of kafka serves as the input of the data analysis module for the subsequent real-time correlation analysis of big data. The big data storage queue of kafka serves as the input of the big data storage module for subsequent offline analysis tasks.
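The denoising and deduplication rules of steps 3) and 4) can be illustrated with a minimal Python sketch. It is not the Blink job described in the patent. The field names in `REQUIRED_FIELDS`, the dictionary record layout, and the choice to anchor each deduplication window at the first record of a run are assumptions made only for illustration.

```python
# Hypothetical attribute set; the patent does not list the fields checked in step 3).
REQUIRED_FIELDS = ("time", "src_ip", "dst_ip", "event")


def denoise(records):
    """Drop records whose attributes do not match the expected set."""
    expected = set(REQUIRED_FIELDS)
    return [r for r in records if set(r) == expected]


def deduplicate(records, interval):
    """Keep only the last of the records that differ only in time.

    A run of duplicates is measured from its first record: any later
    record with identical non-time attributes arriving within `interval`
    seconds replaces the one kept so far (assumed window semantics).
    """
    windows = {}  # key -> (window start time, index of kept record in out)
    out = []
    for rec in sorted(records, key=lambda r: r["time"]):
        key = tuple(sorted((k, v) for k, v in rec.items() if k != "time"))
        win = windows.get(key)
        if win is not None and rec["time"] - win[0] <= interval:
            out[win[1]] = rec          # overwrite with the later duplicate
        else:
            windows[key] = (rec["time"], len(out))
            out.append(rec)
    return sorted(out, key=lambda r: r["time"])
```

With a 60-second interval, two identical records at times 0 and 5 collapse to the record at time 5, while a third identical record at time 100 starts a new window and is kept.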

Step two, multitasking (big data analysis) is carried out on the preprocessed data through Aliyun

Step 1) Data parsing: the analysis data are consumed from the kafka queue and deserialized into corresponding objects according to their data types.

Step 2) Converting the object into an esper event and sending the event to esper: metadata are registered for the data source to be monitored, the metadata describing that data source; the data processing program then defines the alarm rule of the data source according to the attributes of the metadata and translates the definition into an esper SQL-like statement; the data processing program then monitors the data source; and when the data source triggers the alarm rule, the data processing program sends alarm information to the kafka alarm warehousing queue.

Step 3) Splitting the objects to be warehoused: the data generated in step 2) are split into alarm data, index data, heartbeat data and the like.

Step 4) Thread pool processing: different types of data are put into the blocking queues of different thread pools; when a monitoring thread finds that a queue is not empty, it takes the latest data and hands them to a thread for processing, after which the data are put into the relational database.
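The routing in steps 3) and 4) can be sketched as follows in Python, assuming one blocking queue and one worker thread per data type. The type names in `TYPES`, the `_STOP` sentinel, and the in-memory `store` dictionary that stands in for the relational database are hypothetical and are not taken from the patent.

```python
import queue
import threading

TYPES = ("alarm", "indicator", "heartbeat")   # hypothetical type labels
_STOP = object()                               # sentinel that ends a worker


def run_workers(items, store):
    """Route (type, payload) pairs to per-type blocking queues and store them.

    `store` is a plain dict standing in for the relational database.
    """
    queues = {t: queue.Queue() for t in TYPES}
    lock = threading.Lock()

    def worker(kind):
        q = queues[kind]
        while True:
            item = q.get()                     # blocks until the queue is non-empty
            if item is _STOP:
                break
            with lock:                         # serialize writes to the shared store
                store.setdefault(kind, []).append(item)

    threads = [threading.Thread(target=worker, args=(t,)) for t in TYPES]
    for th in threads:
        th.start()
    for kind, payload in items:
        queues[kind].put(payload)
    for q in queues.values():
        q.put(_STOP)                           # queued after all data, so nothing is lost
    for th in threads:
        th.join()
    return store
```

Keeping a separate queue per type means a burst of one kind of data, for example heartbeats, does not delay the warehousing of alarms.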

Step three, storing the alarms generated in the multitasking of data into a database, comprising the following steps:

1) Reading the kafka alarm warehousing queue data: the alarm data generated by esper in step two are read and deserialized into alarm objects.

2) Judging whether the alarm is in the white list: the database white list table is loaded into memory in advance and is queried according to the attack source IP, the attack destination IP and the alarm level in the alarm object. If the alarm is in the white list, the program returns to step 1) and continues to read the next piece of alarm data in the queue; if it is not in the white list, the warehousing operation is performed. If no alarm with the same source IP, the same destination IP and the same type exists in the database, the alarm is inserted into the database directly; otherwise, the alarm count of the existing record is incremented by 1 and the record is updated in the database.
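The white list check and the insert-or-increment rule can be sketched in Python with an in-memory SQLite database. The table name `alarm`, the column names, and the treatment of a white list entry as a (source IP, destination IP, level) triple are assumptions; the patent does not give the schema or the exact matching rule.

```python
import sqlite3


def open_db():
    """Create an in-memory table keyed by (source IP, destination IP, alarm type)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE alarm ("
        "src_ip TEXT, dst_ip TEXT, alarm_type TEXT, hits INTEGER NOT NULL, "
        "PRIMARY KEY (src_ip, dst_ip, alarm_type))"
    )
    return conn


def store_alarm(conn, whitelist, alarm):
    """Return False for a whitelisted alarm; otherwise insert it or bump its count."""
    # Assumed matching rule: the whole triple must appear in the white list.
    if (alarm["src_ip"], alarm["dst_ip"], alarm["level"]) in whitelist:
        return False
    key = (alarm["src_ip"], alarm["dst_ip"], alarm["alarm_type"])
    cur = conn.execute(
        "UPDATE alarm SET hits = hits + 1 "
        "WHERE src_ip = ? AND dst_ip = ? AND alarm_type = ?",
        key,
    )
    if cur.rowcount == 0:                      # no existing row: first occurrence
        conn.execute("INSERT INTO alarm VALUES (?, ?, ?, 1)", key)
    conn.commit()
    return True
```

Counting repeats on a single row rather than inserting every occurrence keeps the alarm table small when the same source repeatedly triggers the same rule.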

Step four, storing the multitasked data through Aliyun, comprising the following steps:

1) Data parsing: the big data storage data are consumed from the kafka queue and deserialized into corresponding objects according to their data types.

2) Storing data to MaxCompute of Aliyun: MaxCompute (formerly ODPS) is a fast, fully managed EB-level data warehouse solution.

With the continuous enrichment of data collection means and the accumulation of large amounts of industry data, data volumes have grown to a massive scale (hundreds of TB, PB and EB) that the traditional software industry cannot handle. MaxCompute is dedicated to the storage and computation of batch structured data and provides a massive data warehouse solution together with analysis and modeling services. In this step, a MaxCompute data table is created with the odps_cmd client tool according to preset table creation rules. MaxCompute stores data in blocks of 64 megabytes each, so to avoid producing a large number of small files, this step uses timed batch insertion: when the number of records reaches 3 thousand or one day has elapsed, a batch operation is performed, and the data are stored into the pre-created table according to the data model (datamode) using the Tunnel data transmission method of MaxCompute, for use by offline analysis programs.
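The timed batch insertion can be sketched in Python as a buffer that flushes when either a record count or an age limit is reached. The class name `BatchBuffer`, its parameters, and the `flush` callback that stands in for the Tunnel upload are hypothetical; no MaxCompute API is called here.

```python
import time


class BatchBuffer:
    """Collect records and hand them to `flush` in batches.

    A batch is emitted when it holds `max_records` items or when
    `max_age_s` seconds have passed since its first record.
    """

    def __init__(self, flush, max_records, max_age_s, clock=time.monotonic):
        self._flush = flush          # stand-in for the bulk upload call
        self._max_records = max_records
        self._max_age_s = max_age_s
        self._clock = clock          # injectable so the age rule can be tested
        self._items = []
        self._started = 0.0

    def add(self, record):
        if not self._items:
            self._started = self._clock()   # age is measured from the first record
        self._items.append(record)
        too_many = len(self._items) >= self._max_records
        too_old = self._clock() - self._started >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._items:
            batch, self._items = self._items, []
            self._flush(batch)
```

In this sketch the age limit is only checked when a record arrives, so a real job would also need a timer that calls `flush()` when the stream is idle. The thresholds described above would correspond to `max_records` set to the stated record count and `max_age_s=86400`.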

3) Deleting MaxCompute historical data at regular intervals: to prevent the data from growing indefinitely, this step sets a timed task that deletes MaxCompute historical data older than 6 months.
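The selection of data to delete can be sketched in Python, assuming the table is partitioned by date and that 6 months is approximated as 183 days. Both points are assumptions, since the patent does not state the partitioning scheme or the exact cutoff. The sketch only computes which partitions have expired; the deletion itself would be issued against MaxCompute by the timed task.

```python
from datetime import date, timedelta


def expired_partitions(partitions, today, keep_days=183):
    """Return, oldest first, the partition dates that fall before the cutoff.

    `keep_days=183` approximates the 6-month retention period (assumption).
    """
    cutoff = today - timedelta(days=keep_days)
    return sorted(p for p in partitions if p < cutoff)
```

For example, with `today` set to 2021-07-09 the cutoff is 2021-01-07, so a partition dated 2021-01-01 is selected while partitions dated 2021-01-07 and 2021-06-01 are kept.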

Example 2

Provided is an Aliyun-based big data processing system, which comprises:

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A big data processing method based on Aliyun, characterized by comprising the following steps:

preprocessing the acquired data through Aliyun;

multitasking the preprocessed data through Aliyun;

storing alarms generated in the multitasking of data into a database;

and storing the multitasked data through Aliyun.

2. The big data processing method based on Aliyun according to claim 1, wherein the preprocessing of the collected data through Aliyun specifically comprises the following steps:

sending the collected data to a kafka message queue of Aliyun;

3. The big data processing method based on Aliyun according to claim 2, wherein the multitasking of the preprocessed data through Aliyun is specifically as follows:

4. The big data processing method based on Aliyun according to claim 3, wherein an alarm event is generated for an esper event that hits an alarm rule loaded in advance and is sent to the alarm warehousing queue of kafka.

5. The big data processing method based on Aliyun according to claim 4, wherein storing the alarms generated in the multitasking of data into a database specifically comprises the following steps:

6. The big data processing method based on Aliyun according to claim 4, wherein in the process of the warehousing operation, if there is no alarm with the same source IP, the same destination IP and the same alarm type in the database, the alarm is inserted into the database; otherwise, the recorded count of the existing alarm is incremented by 1 and the count is updated in the database.

7. The big data processing method based on Aliyun according to claim 1, wherein the storing of the multitasked data through Aliyun comprises the following steps:

and storing the data objects through the MaxCompute service of Aliyun.

8. The big data processing method based on Aliyun according to claim 1, wherein historical data stored in MaxCompute that is more than 6 months old is automatically deleted.

9. A big data processing system based on Aliyun, comprising: