CN108021696B

CN108021696B - Data association analysis method and system

Info

Publication number: CN108021696B
Application number: CN201711371356.6A
Authority: CN
Inventors: 曾毅; 喻波; 王志海; 董爱华; 安鹏
Original assignee: Beijing Wondersoft Technology Co Ltd
Current assignee: Beijing Wondersoft Technology Co Ltd
Priority date: 2017-12-19
Filing date: 2017-12-19
Publication date: 2021-02-05
Anticipated expiration: 2037-12-19
Also published as: CN108021696A

Abstract

The invention discloses a data association analysis method and device. The method comprises the following steps: pre-classifying the data to be processed, and placing the data to be processed into different data cache queues according to the classification results; determining basic data and loading the basic data into memory; performing association analysis on the data placed in the data cache queues according to the basic data; and outputting the association analysis result to the user. This technical scheme improves the efficiency of data analysis and allows suspicious data to be located quickly.

Description

Data association analysis method and system

Technical Field

The invention relates to the field of data security, in particular to a data association analysis method and system.

Background

Spark Streaming decomposes stream computation into a series of short batch jobs, with Spark as the batch processing engine. The input data of Spark Streaming is divided into segments (a Discretized Stream, or DStream) according to the batch size (e.g., 1 second). Each segment is converted into an RDD (Resilient Distributed Dataset) in Spark, each Transformation operation on a DStream in Spark Streaming becomes a Transformation operation on RDDs in Spark, and the resulting intermediate RDDs are kept in memory.
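The micro-batch idea described above can be illustrated without Spark itself. The following is a minimal sketch in plain Python and does not use Spark's API; the function names `micro_batches` and `transform` and the `(timestamp, payload)` record layout are assumptions made for illustration only.

```python
from collections import defaultdict


def micro_batches(records, batch_seconds=1.0):
    """Group (timestamp, payload) records into consecutive time windows.

    Each returned list plays the role of one small batch
    (one RDD, in Spark terms). Hypothetical helper, not a Spark call.
    """
    buckets = defaultdict(list)
    for ts, payload in records:
        # Records whose timestamps fall in the same window share a bucket.
        buckets[int(ts // batch_seconds)].append(payload)
    return [buckets[k] for k in sorted(buckets)]


def transform(batch):
    # A per-batch operation, analogous to a transformation applied to one RDD.
    return [payload.upper() for payload in batch]


stream = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]
results = [transform(batch) for batch in micro_batches(stream)]
print(results)
```

With a 1-second batch size, the five records fall into three batches, and the same transformation is applied to each batch independently.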

Association analysis searches for the associations that exist between objects in business data; specifically, business data is associated according to specified field information in the business log.

With the continuous popularization and application of internet technology in various industries, the scale of business data generated at each stage of the enterprise workflow is expanding rapidly. Managers have increasingly strong requirements for report management, event warning and behavior auditing based on business data. The development of big data technology provides a technical basis for centralized collection and association analysis of business data. With association analysis of service data, auditing and event alarming can be carried out rapidly on the service data, and the original information related to a problem can be located.

In the prior art, log-based analysis systems such as Splunk, as well as log-based association analysis technologies, already exist.

The basic analysis scheme of the prior art includes:

Step 1, extracting the service data to be processed from a data source.

Step 2, performing Spark stream processing on the service data.

Step 3, placing the stream-processed service data into a data cache area according to the classification result.

Step 4, performing association analysis on the data placed in the data cache area according to the specified association rule.

Step 5, judging whether the association is successful; if so, placing the associated result data into a cache area, displaying the data, and ending; otherwise, placing the data that failed association into the association-failure data cache area.

Step 6, judging whether the data that failed association is still within its life cycle; if so, returning to step 4 for association analysis; otherwise, abandoning the association.

However, the above prior art has the following problems:

the association rule and the association data are fixed and cannot adapt to real-time changes in the data; and

when suspicious information appears, it is difficult to locate the problem quickly from that information alone.

Disclosure of Invention

The invention aims to achieve the following objectives:

1. Using Spark-based stream processing, association analysis can be performed on service data in near real time, services can be tracked dynamically based on the analysis results, and an alarm is triggered for problems that meet a threshold.

2. Result data is associated after analysis, so that the original data associated with a result can be located quickly at query time.

The invention provides a data association analysis method, which is characterized by comprising the following steps:

pre-classifying the data to be processed, and putting the data to be processed into different data cache queues according to classification results;

determining basic data and loading the basic data to a memory;

performing correlation analysis on the data put into the data cache queue according to the basic data;

and outputting the correlation analysis result to the user.

According to the method of the present invention, preferably, the basic data is dynamically managed through machine learning, so that the basic data loaded into the memory is used most frequently.

According to the method of the present invention, preferably, an association is generated when the data to be processed placed in the data cache queue and the basic data have the same identifier, and the data to be processed is analyzed according to the associated basic data.

According to the method of the present invention, preferably, the basic data at least includes: user information, location information associated with the IP address, and organizational information.

According to the method of the present invention, preferably, processing of the association analysis result includes: real-time report display and/or data early warning.

The invention provides a data association analysis device, which is characterized by comprising:

the data classification module is used for performing pre-classification on the data to be processed and putting the data to be processed into different data cache queues according to classification results;

the basic data loading module is used for determining basic data and loading the basic data into the memory;

the association analysis module is used for performing association analysis on the data put into the data cache queue according to the basic data;

and the result output module is used for outputting the correlation analysis result to the user.

According to the apparatus of the present invention, preferably, the basic data loading module further includes: a basic data management submodule, used for dynamically managing the basic data through machine learning so that the basic data loaded into the memory has the highest use frequency.

According to the apparatus of the present invention, preferably, the association analysis module generates an association when the data to be processed placed in the data cache queue has the same identifier as the basic data, and analyzes the data to be processed according to the associated basic data.

According to the device provided by the invention, preferably, the result output module outputs the correlation analysis result to the user through real-time report display and/or data early warning.

The invention further provides a computer-readable storage medium storing computer program instructions which, when executed, implement one of the methods described above.

By adopting the technical scheme of the invention, the following technical effects are achieved:

Functional extensibility: the user can flexibly specify association rules according to specific business requirements. The basic data to be preloaded can be flexibly specified, and the data in the preloading area is managed through machine learning, so that the frequency of database queries during association analysis is effectively reduced and processing efficiency is improved.

Real-time performance: association analysis based on Spark stream processing allows the association analysis of business data to be completed at close to real-time speed, enhancing the timeliness of the system's auditing and alarming functions.

Drawings

FIG. 1 is a prior art data analysis flow diagram.

FIG. 2 is a flow chart of data association analysis according to the present invention.

FIG. 3 is a flow chart of the overall data correlation analysis of the present invention.

Detailed Description

< correlation analysis method >

The data association analysis method is described with reference to fig. 2 and 3.

The invention provides a data association analysis method, which comprises the following steps:

Step 1, pre-classifying the data to be processed, and placing the data to be processed into different data cache queues according to the classification results.

Service data is collected from the message queue using Spark stream processing and classified according to a preset condition, and the processed data is sent to the cache queue for its type. The data is transmitted in JSON format and is classified by the service type field it carries, that is, by the value of a predefined data type field, after which it waits for the subsequent association operation.

For example, two types of service data are included: service data A and service data B.

Service data A and B share the following data content: user ID, data type, software name, event message, IP address, recording time and other related information; each also contains other data content that differs between the two types.

After receiving the two types of service data A and B, the stream processing module parses the data into JSON format, places each record into the corresponding queue according to the value of the predefined data type field, and proceeds to the next association operation.
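The pre-classification step can be sketched as follows. This is a minimal illustration in plain Python rather than the Spark-based implementation; the field name `data_type`, the function name `classify` and the sample records are assumptions, since the description only states that a predefined data type field is used.

```python
import json
from collections import defaultdict, deque


def classify(raw_messages, type_field="data_type"):
    """Parse JSON messages and place each one in the queue for its type.

    `type_field` is a hypothetical name for the predefined data type field.
    Records without that field go to an "unknown" queue.
    """
    queues = defaultdict(deque)
    for raw in raw_messages:
        record = json.loads(raw)
        queues[record.get(type_field, "unknown")].append(record)
    return queues


raw = [
    '{"user_id": "u1", "data_type": "A", "ip": "10.0.0.1"}',
    '{"user_id": "u2", "data_type": "B", "ip": "10.0.0.2"}',
    '{"user_id": "u3", "data_type": "A", "ip": "10.0.0.3"}',
]
queues = classify(raw)
for data_type, queue in queues.items():
    print(data_type, len(queue))
```

Each queue then holds only one type of service data, so the later association step can apply the rule appropriate to that type.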

Step 2, determining basic data and loading the basic data into a memory.

And dynamically managing the basic data through machine learning, so that the basic data loaded into the memory has the highest use frequency. The base data includes at least: user information, location information associated with the IP address, and organizational information.

The basic data to be preloaded for association analysis can be specified in advance. Basic data is data that changes relatively little, such as personnel identity information and basic equipment information. The loaded data is dynamically managed through machine learning so that the basic data in the loading area is always the data with the highest current use frequency. For example, the data in the container is ranked by use frequency, the data ranked in the bottom 10% is cleaned out of the container, and this is repeated to keep the use frequency of the retained basic data as high as possible. This is only a preferred embodiment, and other variations based on the same idea fall within the protection scope of the invention. For example, the basic data includes user information data and user unit data. The user information data includes: user ID, mobile phone number, user unit name ID, user location, etc. The user unit data includes: user unit name ID, unit registration code, unit registration time, city of the unit, province of the unit, etc.

The basic data above is only an example; other basic information, such as software-related information and service-related information, may also be included in actual operation, and no limitation is imposed here. With the basic information preloaded, the basic data related to the service data to be analyzed can be extracted quickly through the user ID, the user unit ID and the like, so that the service data can be analyzed quickly.
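The frequency-ranking example given above (clean out the data ranked in the bottom 10% by use frequency) can be sketched as follows. This sketch uses plain hit counting and is not a machine learning model; the class name `PreloadArea`, its methods and the `evict_percent` parameter are assumptions introduced for illustration.

```python
class PreloadArea:
    """In-memory container for basic data with usage-based eviction.

    Hypothetical sketch: counts how often each entry is read and can
    drop the least-used share of entries on request.
    """

    def __init__(self, evict_percent=10):
        self.data = {}
        self.hits = {}
        self.evict_percent = evict_percent

    def load(self, key, value):
        self.data[key] = value
        self.hits.setdefault(key, 0)

    def get(self, key):
        # Returns None on a miss; a caller could then query the database.
        if key in self.data:
            self.hits[key] += 1
            return self.data[key]
        return None

    def evict_least_used(self):
        # Rank entries by use frequency and remove the bottom share.
        count = len(self.data) * self.evict_percent // 100
        ranked = sorted(self.data, key=lambda k: self.hits[k])
        evicted = ranked[:count]
        for key in evicted:
            del self.data[key]
            del self.hits[key]
        return evicted


area = PreloadArea()
for i in range(10):
    area.load(f"u{i}", {"user_id": f"u{i}"})
for i in range(1, 10):
    area.get(f"u{i}")  # every entry except u0 is used once
evicted = area.evict_least_used()
print(evicted)
```

Running the eviction periodically keeps the container biased toward the entries that association analysis actually uses; how the freed space is refilled is left open here, as it is in the description.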

The function can effectively reduce IO with the database during correlation analysis, and greatly improves the correlation analysis processing efficiency. The specific implementation of this function is as follows: for example, to audit outgoing information of a person in a certain network, corresponding service data includes: user ID, data type, software name, event message, information source IP address, server IP address, recording time and other related information.

This information identifies the user and the software by corresponding identifiers and also contains information such as the sending IP address and the server IP address; when suspicious information appears, it is difficult to locate the problem quickly from this information alone. It is therefore very important to supplement the information identified by these fields through association. The method for discovering suspicious (or sensitive) information itself belongs to the prior art and is not described in detail here.

However, with a large amount of data, querying the database directly to supplement the basic information each time a message is received requires frequent database access, which places great pressure on the database. For this reason, because the basic information is relatively fixed, the basic data that may be used in the analysis (such as user information, location information related to an IP address, organization information, and the like) can be loaded in advance, before association analysis starts, and can be specified through a page. With the basic data preloading function, the service information can be supplemented, which facilitates subsequent operations; for example, basic information about a person, such as organization, name and age, is supplemented according to the user ID.

When a service log is received, information such as the owner and location of a file is located quickly through association analysis of the service log with the basic data. The invention improves the efficiency of this part by providing a preloading module and a machine learning function: frequently used basic data is loaded into the memory, which effectively reduces the number of database reads, and the data set in the preloading module is optimized through machine learning, which reduces the number of lookups needed to find target data during association analysis.

The basic content loaded in the preloading module can be specified by the user according to specific requirements, which effectively improves adaptability to different scenarios. The preloaded basic data is nevertheless always the data with a high, or the highest, current use frequency, so that subsequent association analysis can be performed quickly.

Step 3, performing association analysis on the data placed in the data cache queue according to the basic data.

An association is generated when the data to be processed placed in the data cache queue and the basic data have the same identifier, and the data to be processed is analyzed according to the associated basic data.

Association analysis is performed on the data according to the specified association rule. If the association succeeds, the result data is stored in the result data cache area and the next data display operation is performed (including data reports, behavior audit, alarm triggering, and the like). If it fails, the data enters the association-failure cache area and is re-associated according to a preset rule (this mainly addresses association errors caused when the time difference between service data entering the stream processor exceeds the stream processing interval). The association rule is specified in advance by the user, for example: associating user information by user ID, associating geographical location information by IP address, associating hardware information by hardware ID, and the like.

The specific association process is as follows:

finding associated user basic information such as user name, mobile number, corresponding unit name ID and the like through the user ID in the outgoing information service log;

then, unit basic information, such as unit registration code, city of the unit, province of the unit, unit registration time and other related information, is found through the unit name ID.

The two types of data are associated on the condition that the user ID in the service log equals the user ID in the user information data, yielding associated service log data. That data is in turn associated on the condition that its unit name ID equals the unit name ID in the organization information, yielding further associated service log data. Any other basic data to be associated is associated on the same principle. As described above, the associated log data includes the user basic information and the unit basic information. For successfully associated data, the result is stored in a specified data set. For data that fails association, the service log is placed into a failure queue and re-associated according to a specified retry rule, for example, waiting 10 minutes and then 20 minutes after a failure before re-associating. If the re-association succeeds, the data is placed into the data set of successfully associated data; if it fails, association of the data is abandoned.
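The two-stage association and the retry rule can be sketched as follows. This is an illustrative sketch under stated assumptions: the field names (`user_id`, `unit_id`, and so on), the function names `associate` and `process`, and the reading of the retry rule as delays of 600 and 1200 seconds are all hypothetical. The `wait` hook stands in for scheduling; a real system would re-queue the log rather than block.

```python
def associate(log, users, units):
    """Enrich a service log with user and unit basic data.

    Returns the merged record, or None if either lookup fails.
    """
    user = users.get(log.get("user_id"))
    if user is None:
        return None
    unit = units.get(user.get("unit_id"))
    if unit is None:
        return None
    return {**log, **user, **unit}


def process(logs, users, units, retry_delays=(600, 1200),
            wait=lambda seconds: None):
    """Associate each log, retrying after each delay before giving up."""
    succeeded, abandoned = [], []
    for log in logs:
        merged = associate(log, users, units)
        for delay in retry_delays:
            if merged is not None:
                break
            wait(delay)  # hypothetical hook: pause or reschedule the retry
            merged = associate(log, users, units)
        if merged is not None:
            succeeded.append(merged)
        else:
            abandoned.append(log)
    return succeeded, abandoned


users = {"u1": {"user_name": "Alice", "unit_id": "org1"}}
units = {"org1": {"unit_city": "Beijing", "unit_province": "Beijing"}}
logs = [
    {"user_id": "u1", "event": "file sent"},
    {"user_id": "u9", "event": "file sent"},  # no matching basic data
]
waits = []
succeeded, abandoned = process(logs, users, units, wait=waits.append)
print(len(succeeded), len(abandoned), waits)
```

The first log is associated on the first attempt; the second has no matching user ID, goes through both retries, and is then abandoned.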

Step 4, outputting the association analysis result to the user.

The output includes real-time report display, early warning functions, and the like.

Reports, early warnings and the like are presented for the service results across multiple dimensions based on the associated service logs. For example, the regional distribution of the service logs can be counted based on the location information, and the distribution of service logs across departments can be counted based on the organizational structure information.
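The statistics mentioned above can be sketched with simple counting over the associated logs. The field names `unit_city` and `department`, the sample records and the alarm threshold are assumptions for illustration; the description does not fix a record layout or a threshold value.

```python
from collections import Counter

# Hypothetical associated service logs (output of the association step).
associated_logs = [
    {"user_id": "u1", "unit_city": "Beijing", "department": "R&D"},
    {"user_id": "u2", "unit_city": "Beijing", "department": "Sales"},
    {"user_id": "u3", "unit_city": "Shanghai", "department": "R&D"},
]

# Regional distribution based on location information.
by_city = Counter(log["unit_city"] for log in associated_logs)

# Distribution per department based on organization information.
by_department = Counter(log["department"] for log in associated_logs)

# Early warning: flag any region whose count reaches an assumed threshold.
ALERT_THRESHOLD = 2
alerts = [city for city, count in by_city.items() if count >= ALERT_THRESHOLD]

print(dict(by_city), dict(by_department), alerts)
```

Because each associated log already carries the location and organization fields, reports along these dimensions need no further database lookups.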

< correlation analysis device >

The invention also discloses a data association analysis device, which comprises:

The basic data loading module further comprises: a basic data management submodule, used for dynamically managing the basic data through machine learning so that the basic data loaded into the memory has the highest use frequency.

The association analysis module generates an association when the data to be processed placed in the data cache queue and the basic data have the same identifier, and analyzes the data to be processed according to the associated basic data.

The result output module outputs the association analysis result to the user through real-time report display and/or data early warning.

The base data includes at least: user information, location information associated with the IP address, and organizational information.

Through the scheme of the invention, the user can flexibly specify association rules according to specific service requirements. The basic data to be preloaded can be flexibly specified, and the data in the preloading area is managed through machine learning, so that the frequency of database queries during association analysis is effectively reduced and processing efficiency is improved. Association analysis based on Spark stream processing allows the association analysis of business data to be completed at close to real-time speed, enhancing the timeliness of the system's auditing and alarming functions.

The above examples are merely illustrative of the protection scheme of the present invention and do not limit the specific embodiments of the present invention.

Claims

1. A log data association analysis method is characterized by comprising the following steps:

determining basic data and loading the basic data into a memory, wherein the basic data is dynamically managed through machine learning, so that the basic data loaded into the memory has the highest use frequency, and the basic data at least comprises user information, position information related to an IP address and organization information;

performing association analysis on data placed in a data cache queue according to the basic data, generating association according to the fact that the data to be processed placed in the data cache queue and the basic data have the same identifier, and analyzing the data to be processed according to the associated basic data, wherein an association rule is specified in advance by a user, and the association rule is used for associating user information according to a user ID, and/or associating geographical position information according to an IP address, and/or associating hardware information according to a hardware ID;

and outputting the correlation analysis result to the user through real-time report display and/or data early warning.

2. An apparatus for analyzing log data association, the apparatus comprising:

the basic data loading module is used for determining basic data and loading the basic data into the memory, wherein the basic data is dynamically managed through machine learning, so that the basic data loaded into the memory has the highest use frequency, and the basic data at least comprises user information, position information related to an IP address and organization information;

the association analysis module is used for performing association analysis on the data placed in the data cache queue according to the basic data, generating association according to the fact that the data to be processed placed in the data cache queue and the basic data have the same identifier, and analyzing the data to be processed according to the associated basic data, wherein an association rule is specified in advance by a user, and the association rule is used for associating user information according to a user ID, and/or associating geographic position information according to an IP address, and/or associating hardware information according to a hardware ID;

3. A computer-readable storage medium storing computer program instructions which, when executed, implement the method of claim 1.