CN108021696B - Data association analysis method and system - Google Patents


Info

Publication number
CN108021696B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711371356.6A
Other languages
Chinese (zh)
Other versions
CN108021696A (en)
Inventor
曾毅
喻波
王志海
董爱华
安鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wondersoft Technology Co Ltd
Original Assignee
Beijing Wondersoft Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wondersoft Technology Co Ltd
Priority to CN201711371356.6A
Publication of CN108021696A
Application granted
Publication of CN108021696B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a data association analysis method and device. The method comprises the following steps: pre-classifying the data to be processed and putting it into different data cache queues according to the classification results; determining basic data and loading the basic data into memory; performing association analysis on the data in the data cache queues according to the basic data; and outputting the association analysis result to the user. This technical scheme improves the efficiency of data analysis and allows suspicious data to be located quickly.

Description

Data association analysis method and system
Technical Field
The invention relates to the field of data security, in particular to a data association analysis method and system.
Background
Spark Streaming: Spark Streaming decomposes streaming computation into a series of short batch jobs. The batch processing engine is Spark. The input data of Spark Streaming is divided into segments (a Discretized Stream, or DStream) according to the batch size (e.g. 1 second), and each segment is converted into an RDD (Resilient Distributed Dataset) in Spark. Each Transformation operation on a DStream in Spark Streaming then becomes a Transformation operation on RDDs in Spark, and the intermediate results of the RDD operations are kept in memory.
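The micro-batch idea described above can be illustrated without Spark itself. The following is a minimal plain-Python sketch, not the Spark Streaming API; the function names `micro_batches` and `transform` and the `(timestamp, payload)` record layout are assumptions made only for illustration.

```python
from collections import defaultdict


def micro_batches(records, batch_seconds=1.0):
    """Group (timestamp, payload) records into fixed-width time batches.

    Each batch plays the role of one RDD in a DStream: a record whose
    timestamp falls in [k * batch_seconds, (k + 1) * batch_seconds)
    lands in batch k.
    """
    batches = defaultdict(list)
    for timestamp, payload in records:
        batches[int(timestamp // batch_seconds)].append(payload)
    # Return batches in time order; intervals without records are absent.
    return [batches[k] for k in sorted(batches)]


def transform(batch):
    """A per-batch operation, analogous to a Transformation on one RDD."""
    return [payload.upper() for payload in batch]


# Hypothetical input: four records spread over three one-second intervals.
stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
results = [transform(batch) for batch in micro_batches(stream)]
```

Here the same transformation is applied batch by batch, which is the sense in which a streaming computation becomes a series of short batch jobs.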
Association analysis: searching for the associations that exist between objects in business data; specifically, associating business data according to specified field information in the business log.
With the continuous popularization and application of internet technology in various industries, the scale of the business data generated in each link of the enterprise workflow is expanding rapidly. Managers increasingly need report management, event alerting and behavior auditing based on business data. The development of big data technology provides a technical basis for the centralized collection and association analysis of business data. With association analysis of business data, auditing and event alerting can be carried out quickly, and the original information related to a problem can be located.
In the prior art, log-based analysis systems such as Splunk exist, along with log-based association analysis technologies.
The basic analysis scheme of the prior art includes:
Step 1, extracting the service data to be processed from a data source.
Step 2, performing Spark stream processing on the service data.
Step 3, putting the service data processed by Spark streaming into data cache areas according to the classification result.
Step 4, performing association analysis on the data in the data cache area according to the specified association rule.
Step 5, judging whether the association succeeded. If it succeeded, putting the associated result data into a cache area, displaying the data, and ending; otherwise, putting the data that failed association into the association-failure data cache area.
Step 6, judging whether the data that failed association is still within its life cycle. If so, returning to step 4 for association analysis; otherwise, abandoning the association.
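Steps 4 to 6 of this prior-art flow can be sketched as a retry loop. This is a minimal illustration, not an implementation taken from any existing system: the life cycle is modeled as a maximum number of attempts (the source does not say how it is measured), and `associate_with_lifetime` and `demo_rule` are hypothetical names.

```python
from collections import deque


def associate_with_lifetime(records, rule, lifetime):
    """Retry failed records until their life cycle is used up.

    `rule(record, attempt)` returns an associated result, or None on
    failure. `lifetime` is the maximum number of attempts per record
    (an assumption standing in for the life cycle in step 6).
    """
    pending = deque((record, 1) for record in records)
    associated, abandoned = [], []
    while pending:
        record, attempt = pending.popleft()
        result = rule(record, attempt)
        if result is not None:
            associated.append(result)              # step 5: success
        elif attempt < lifetime:
            pending.append((record, attempt + 1))  # step 6: retry
        else:
            abandoned.append(record)               # step 6: give up
    return associated, abandoned


def demo_rule(record, attempt):
    """Hypothetical fixed rule: "x" matches at once, "y" on a retry."""
    if record == "x":
        return "x-ok"
    if record == "y" and attempt >= 2:
        return "y-ok"
    return None


associated, abandoned = associate_with_lifetime(["x", "y", "z"], demo_rule, 3)
```

The rule here is fixed for the whole run, which mirrors the limitation the text goes on to describe.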
However, the above prior art has the following problems:
the association rules and the association data are fixed and cannot adapt to real-time changes in the data;
and when suspicious information appears, it is difficult to quickly locate the problem directly from that information.
Disclosure of Invention
The invention aims to achieve the following objectives:
1. Based on Spark stream processing technology, association analysis can be performed on service data in near real time, services can be tracked dynamically based on the analysis results, and an alarm is triggered for problems that meet a threshold.
2. The result data is associated after analysis, so that the original data material associated with it can be located quickly when queried.
The invention provides a data association analysis method, which is characterized by comprising the following steps:
pre-classifying the data to be processed, and putting the data to be processed into different data cache queues according to classification results;
determining basic data and loading the basic data to a memory;
performing correlation analysis on the data put into the data cache queue according to the basic data;
and outputting the correlation analysis result to the user.
According to the method of the present invention, preferably, the basic data is dynamically managed through machine learning, so that the basic data loaded into the memory is the most frequently used.
According to the method of the present invention, preferably, an association is generated when the data to be processed in the data cache queue and the basic data share the same identifier, and the data to be processed is analyzed according to the associated basic data.
According to the method of the present invention, preferably, the basic data at least includes: user information, location information associated with the IP address, and organizational information.
According to the method of the present invention, preferably, outputting the association analysis result includes: real-time report display and/or data early warning.
The invention provides a data association analysis device, which is characterized by comprising:
the data classification module is used for performing pre-classification on the data to be processed and putting the data to be processed into different data cache queues according to classification results;
the basic data loading module is used for determining basic data and loading the basic data into the memory;
the association analysis module is used for performing association analysis on the data put into the data cache queue according to the basic data;
and the result output module is used for outputting the correlation analysis result to the user.
According to the apparatus of the present invention, preferably, the basic data loading module further includes a basic data management submodule, which is used for dynamically managing the basic data through machine learning so that the use frequency of the basic data loaded into the memory is highest.
According to the apparatus of the present invention, preferably, the association analysis module generates an association when the data to be processed in the data cache queue and the basic data share the same identifier, and analyzes the data to be processed according to the associated basic data.
According to the apparatus of the present invention, preferably, the result output module outputs the association analysis result to the user through real-time report display and/or data early warning.
The invention also provides a computer-readable storage medium storing computer program instructions which, when executed, implement one of the methods described above.
By adopting the technical scheme of the invention, the following technical effects are achieved:
Functional extensibility: the user can flexibly specify association rules according to specific business requirements. The basic data to preload can be specified flexibly, and the data in the preloading area is managed through machine learning, so that the frequency of database queries during association analysis is effectively reduced and processing efficiency is improved.
Real-time performance: the correlation analysis based on spark stream processing enables the correlation analysis of the business data to be completed at a speed close to real time, and the timeliness of the auditing function and the alarming function of the system is enhanced.
Drawings
FIG. 1 is a prior art data analysis flow diagram.
FIG. 2 is a flow chart of data association analysis according to the present invention.
FIG. 3 is a flow chart of the overall data correlation analysis of the present invention.
Detailed Description
< correlation analysis method >
The data association analysis method is described with reference to fig. 2 and 3.
The invention provides a data association analysis method, which comprises the following steps:
Step 1, pre-classifying the data to be processed and putting it into different data cache queues according to the classification results.
Service data is collected from the message queue by Spark stream processing and classified according to a preset condition. The data is transmitted in JSON format, classified by the value of the predefined data type (service type) field it carries, and sent to the cache queue for its type to await the subsequent association operation.
For example, suppose there are two types of service data: service data A and service data B.
Service data A and B share some common content, namely the user ID, data type, software name, event message, IP address, recording time and other related information, and each also contains other content that differs between the two;
after receiving service data of types A and B, the stream processing module parses the data into JSON format, puts each record into the corresponding queue according to the value of the predefined data type field, and proceeds to the subsequent association operation.
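The routing step just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the field name `data_type`, the sample messages, and the catch-all `"unknown"` queue for records without a type field are not from the source.

```python
import json
from collections import defaultdict, deque


def classify(raw_messages, type_field="data_type"):
    """Parse JSON messages and route each one to the queue for its type."""
    queues = defaultdict(deque)
    for raw in raw_messages:
        record = json.loads(raw)
        # Assumption: records missing the type field go to a catch-all queue.
        queues[record.get(type_field, "unknown")].append(record)
    return queues


# Hypothetical service data of types A and B.
messages = [
    '{"user_id": "u1", "data_type": "A", "software": "mail"}',
    '{"user_id": "u2", "data_type": "B", "software": "chat"}',
    '{"user_id": "u3", "data_type": "A", "software": "ftp"}',
]
queues = classify(messages)
```

Each queue preserves arrival order, so the later association step can consume records of one type in the order they were received.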
Step 2, determining basic data and loading the basic data into memory.
The basic data is dynamically managed through machine learning, so that the basic data loaded into memory has the highest use frequency. The basic data includes at least: user information, location information associated with the IP address, and organizational information.
The basic data to be preloaded for association analysis can be specified in advance. Basic data is data that changes relatively little, such as personnel identity information and basic equipment information. The loaded data is dynamically managed through machine learning so that the basic data in the loading area is always the data with the highest current use frequency. For example, the data in the container is ranked by use frequency, the data ranked in the bottom 10% is cleaned out of the container, and this is repeated so that the use frequency of the retained basic data stays the highest. This is only a preferred embodiment, and other variations based on the same idea are within the protection scope of the invention. For example, the basic data includes user information data and user unit data. The user information data includes: user ID, mobile number, user unit name ID, user location, etc.; the user unit data includes: user unit name ID, unit registration code, unit registration time, city of the unit, province of the unit, etc.
The basic data above is only an example; in actual operation other basic information, such as software-related information and service-related information, may also be included, which is not limited here. With the basic information preloaded, the basic data relevant to the service data being analyzed can be extracted quickly through the user ID, the user unit ID and the like, so that the service data can be analyzed quickly.
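The frequency-based cleanup given as a preferred embodiment (rank by use frequency, clean out the bottom 10%) can be sketched as follows. The source attributes this management to machine learning without giving details; the sketch substitutes a plain hit counter, and the class name `BaseDataContainer` and its methods are hypothetical.

```python
from collections import Counter


class BaseDataContainer:
    """In-memory basic data with usage counts and bottom-percent cleanup."""

    def __init__(self, evict_percent=10):
        self.data = {}
        self.hits = Counter()
        self.evict_percent = evict_percent

    def load(self, key, value):
        self.data[key] = value
        self.hits.setdefault(key, 0)  # start tracking with zero uses

    def get(self, key):
        if key in self.data:
            self.hits[key] += 1
            return self.data[key]
        return None

    def evict_least_used(self):
        """Remove the least-used entries and return their keys."""
        count = len(self.data) * self.evict_percent // 100
        # Ascending sort by hit count puts the least-used keys first.
        victims = sorted(self.data, key=lambda k: self.hits[k])[:count]
        for key in victims:
            del self.data[key]
            del self.hits[key]
        return victims


container = BaseDataContainer()
for i in range(10):
    container.load(f"u{i}", {"user_id": f"u{i}"})
for i in range(1, 10):
    container.get(f"u{i}")  # every entry except u0 is used once
evicted = container.evict_least_used()
```

With ten entries and a 10% threshold, one entry is cleaned out per pass; running the cleanup repeatedly keeps only the entries that continue to be used.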
The preloading function effectively reduces I/O with the database during association analysis and greatly improves association analysis processing efficiency. A specific implementation is as follows: for example, to audit the outgoing information of a person in a certain network, the corresponding service data includes: user ID, data type, software name, event message, information source IP address, server IP address, recording time and other related information.
This information identifies the user and the software only through the corresponding identifiers, and also contains fields such as the sending IP address and the server IP address; when suspicious information appears, it is difficult to locate the problem quickly from this information alone. It is therefore very important to supplement the information identified by these fields through association. The method for discovering suspicious information (or sensitive information) itself belongs to the prior art and is not described in detail here.
However, with a large amount of data, querying the database directly to supplement the basic information each time a message is received requires frequent database access and puts the database under great pressure. For this reason, because basic information is relatively fixed, the basic data that may be used in the analysis (such as user information, location information related to an IP address, organization information, and the like) can be loaded in advance, before the association analysis starts; which data to load is specified through a page. Through the basic data preloading function, the service information can be supplemented, which facilitates subsequent operations; for example, basic information about a person, such as organization, name and age, is supplemented according to the user ID.
When a service log is received, information such as the owner and location of a file is located quickly through association analysis of the service log with the basic data. The invention improves the efficiency of this part by providing a preloading module and a machine learning function: frequently used basic data is loaded into memory, which effectively reduces the number of database reads, and the data set in the preloading module is optimized through machine learning, which reduces the number of lookups needed to find target data during association analysis.
The basic content loaded in the preloading module can be specified by the user according to specifications and specific requirements, which effectively improves adaptability to different scenarios. The preloaded basic data is nevertheless always the data with high or the highest current use frequency, so that subsequent association analysis can be performed quickly.
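The effect of preloading on database access can be made concrete with a small sketch. It is an illustration only: `CountingDatabase` is a hypothetical stand-in for a real database, and falling back to the database on a miss in the preloaded area is an assumption the source does not spell out.

```python
class CountingDatabase:
    """Stand-in for a real database that counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.queries = 0

    def lookup(self, user_id):
        self.queries += 1
        return self.rows.get(user_id)


def enrich(logs, database, preloaded=None):
    """Attach basic data to each log, consulting the preloaded area first."""
    preloaded = preloaded or {}
    enriched = []
    for log in logs:
        base = preloaded.get(log["user_id"])
        if base is None:
            base = database.lookup(log["user_id"])  # assumed fallback
        enriched.append({**log, "user": base})
    return enriched


# Hypothetical basic data and service logs.
rows = {"u1": {"name": "Zhang", "org": "Sales"}, "u2": {"name": "Li", "org": "R&D"}}
logs = [{"user_id": uid} for uid in ("u1", "u1", "u2", "u1")]

cold = CountingDatabase(rows)
enrich(logs, cold)                               # no preloading

warm = CountingDatabase(rows)
supplemented = enrich(logs, warm, preloaded={"u1": rows["u1"]})
```

In this example `cold.queries` ends at 4 and `warm.queries` at 1, because only the log for the user that was not preloaded reaches the database.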
Step 3, performing association analysis on the data put into the data cache queue according to the basic data.
An association is generated when the data to be processed in the data cache queue and the basic data share the same identifier, and the data to be processed is analyzed according to the associated basic data.
Association analysis is performed on the data according to a specified association rule. If the association succeeds, the result data is stored in a result data cache area and the subsequent data display operations (including data reports, behavior audit, alarm triggering and the like) are performed; if it fails, the data enters an association-failure cache area and is re-associated according to a preset rule (this mainly addresses association analysis errors caused when the time difference between service data entering the stream processor exceeds the stream processing interval). The association rule is specified in advance by the user, for example: associating user information according to the user ID, associating geographical position information according to the IP address, associating hardware information according to the hardware ID, and the like.
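User-specified association rules of the kind listed above can be represented as data and applied generically. The following is a minimal sketch, assuming a rule is a triple of log field, basic data table, and output key; the names `RULES`, `apply_rules`, and the sample tables are hypothetical.

```python
# Each rule: (field in the log, basic data table keyed by it, output key).
RULES = [
    ("user_id", "users", "user"),
    ("ip", "locations", "location"),
    ("hardware_id", "hardware", "hardware"),
]


def apply_rules(log, base_tables, rules=RULES):
    """Return the log enriched by every rule, or None if any rule fails."""
    enriched = dict(log)
    for field, table, output_key in rules:
        match = base_tables[table].get(log.get(field))
        if match is None:
            return None  # no basic data shares this identifier
        enriched[output_key] = match
    return enriched


base_tables = {
    "users": {"u1": {"name": "Zhang"}},
    "locations": {"10.0.0.1": {"city": "Beijing"}},
    "hardware": {"h1": {"model": "X1"}},
}
result_cache, failure_cache = [], []
for log in (
    {"user_id": "u1", "ip": "10.0.0.1", "hardware_id": "h1"},
    {"user_id": "u9", "ip": "10.0.0.1", "hardware_id": "h1"},
):
    enriched = apply_rules(log, base_tables)
    (result_cache if enriched is not None else failure_cache).append(enriched or log)
```

Keeping the rules as data means a user can add or remove an association without changing the matching code; here a log counts as failed when any selected rule finds no match, which is one possible policy.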
The specific association process is as follows:
the associated user basic information, such as user name, mobile number and corresponding unit name ID, is found through the user ID in the outgoing-information service log;
then the unit basic information, such as unit registration code, city of the unit, province of the unit, unit registration time and other related information, is found through the unit name ID.
The service log and the user information data are associated on the condition that their user IDs are equal, yielding associated service log data; the associated service log data and the organization information are then associated on the condition that their unit name IDs are equal, again yielding associated service log data; any other basic data that needs to be associated is associated on the same principle. As described above, the associated log data includes the user basic information and the unit basic information. For data that is associated successfully, the result is stored in a specified data set; for data that fails association, the service log is put into a failure queue and re-associated according to a specified retry rule, for example, re-associating after waiting 10 minutes and again after waiting 20 minutes following a failure. If a re-association succeeds, the data is put into the successfully associated data set; if it still fails, association of the data is abandoned.
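The chained association and the retry rule can be sketched together. This is an illustration under stated assumptions: the retry schedule is read as one retry after 10 minutes and a second after a further 20 minutes, time is simulated rather than waited for, and `associate`, `run_with_retries`, and `base_data_at` are hypothetical names.

```python
def associate(log, users, units):
    """Join log -> user (by user ID) -> unit (by unit name ID)."""
    user = users.get(log["user_id"])
    if user is None:
        return None
    unit = units.get(user["unit_id"])
    if unit is None:
        return None
    return {**log, "user": user, "unit": unit}


def run_with_retries(logs, base_data_at, retry_delays=(10, 20)):
    """Associate at minute 0, then retry failures after each delay.

    `base_data_at(minute)` returns the (users, units) tables visible at
    that simulated time, so late-arriving basic data can be modeled.
    """
    succeeded, failure_queue = [], list(logs)
    elapsed = 0
    for delay in (0, *retry_delays):
        elapsed += delay
        users, units = base_data_at(elapsed)
        still_failing = []
        for log in failure_queue:
            result = associate(log, users, units)
            if result is None:
                still_failing.append(log)
            else:
                succeeded.append(result)
        failure_queue = still_failing
    return succeeded, failure_queue  # what remains is abandoned


UNITS = {"org1": {"city": "Beijing", "province": "Beijing"}}


def base_data_at(minute):
    users = {"u1": {"name": "Zhang", "unit_id": "org1"}}
    if minute >= 10:  # u2's basic data only becomes available later
        users["u2"] = {"name": "Li", "unit_id": "org1"}
    return users, UNITS


logs = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u3"}]
succeeded, abandoned = run_with_retries(logs, base_data_at)
```

In this run `u1` associates immediately, `u2` succeeds on the first retry, and `u3` is abandoned after both retries fail.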
Step 4, outputting the association analysis result to the user.
The output includes real-time report display, early warning functions, and the like.
Reports, early warnings and the like are presented for the service results along multiple dimensions according to the associated service logs. For example, the regional distribution of the service logs can be counted based on the location information, and the distribution of service logs across departments can be counted based on the organization structure information.
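The multi-dimensional statistics mentioned above amount to counting associated logs along a chosen field. The following is a minimal sketch; the record layout, the function names `distribution` and `early_warnings`, and the warning threshold are assumptions made for illustration.

```python
from collections import Counter


def distribution(associated_logs, key_path):
    """Count associated logs per value found by following key_path."""
    counts = Counter()
    for log in associated_logs:
        value = log
        for key in key_path:
            value = value[key]
        counts[value] += 1
    return counts


def early_warnings(counts, threshold):
    """Return the names whose count reaches the threshold, sorted."""
    return sorted(name for name, n in counts.items() if n >= threshold)


# Hypothetical associated service logs.
associated_logs = [
    {"location": {"province": "Hebei"}, "unit": {"department": "Sales"}},
    {"location": {"province": "Hebei"}, "unit": {"department": "R&D"}},
    {"location": {"province": "Hunan"}, "unit": {"department": "Sales"}},
    {"location": {"province": "Hebei"}, "unit": {"department": "Sales"}},
]
by_region = distribution(associated_logs, ("location", "province"))
by_department = distribution(associated_logs, ("unit", "department"))
flagged = early_warnings(by_region, threshold=3)
```

The same counting function serves both the regional and the departmental report because the dimension is passed in as a key path rather than hard-coded.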
< correlation analysis device >
The invention also discloses a data association analysis device, which comprises:
the data classification module is used for performing pre-classification on the data to be processed and putting the data to be processed into different data cache queues according to classification results;
the basic data loading module is used for determining basic data and loading the basic data into the memory;
the association analysis module is used for performing association analysis on the data put into the data cache queue according to the basic data;
and the result output module is used for outputting the correlation analysis result to the user.
The basic data loading module further comprises a basic data management submodule, which is used for dynamically managing the basic data through machine learning so that the use frequency of the basic data loaded into the memory is highest.
The association analysis module generates an association when the data to be processed in the data cache queue and the basic data share the same identifier, and analyzes the data to be processed according to the associated basic data.
The result output module outputs the association analysis result to the user through real-time report display and/or data early warning.
The base data includes at least: user information, location information associated with the IP address, and organizational information.
Through the scheme of the invention, the user can flexibly specify association rules according to specific service requirements. The basic data to preload can be specified flexibly, and the data in the preloading area is managed through machine learning, so that the frequency of database queries during association analysis is effectively reduced and processing efficiency is improved. Association analysis based on Spark stream processing allows the association analysis of business data to be completed at near-real-time speed, strengthening the timeliness of the auditing and alarming functions of the system.
The above examples are merely illustrative of the protection scheme of the present invention and do not limit the specific embodiments of the present invention.

Claims (3)

1. A log data association analysis method is characterized by comprising the following steps:
pre-classifying the data to be processed, and putting the data to be processed into different data cache queues according to classification results;
determining basic data and loading the basic data into a memory, wherein the basic data is dynamically managed through machine learning, so that the basic data loaded into the memory has the highest use frequency, and the basic data at least comprises user information, position information related to an IP address and organization information;
performing association analysis on data placed in a data cache queue according to the basic data, generating association according to the fact that the data to be processed placed in the data cache queue and the basic data have the same identifier, and analyzing the data to be processed according to the associated basic data, wherein an association rule is specified in advance by a user, and the association rule is used for associating user information according to a user ID, and/or associating geographical position information according to an IP address, and/or associating hardware information according to a hardware ID;
and outputting the correlation analysis result to the user through real-time report display and/or data early warning.
2. An apparatus for analyzing log data association, the apparatus comprising:
the data classification module is used for performing pre-classification on the data to be processed and putting the data to be processed into different data cache queues according to classification results;
the basic data loading module is used for determining basic data and loading the basic data into the memory, wherein the basic data is dynamically managed through machine learning, so that the basic data loaded into the memory has the highest use frequency, and the basic data at least comprises user information, position information related to an IP address and organization information;
the association analysis module is used for performing association analysis on the data placed in the data cache queue according to the basic data, generating association according to the fact that the data to be processed placed in the data cache queue and the basic data have the same identifier, and analyzing the data to be processed according to the associated basic data, wherein an association rule is specified in advance by a user, and the association rule is used for associating user information according to a user ID, and/or associating geographic position information according to an IP address, and/or associating hardware information according to a hardware ID;
and the result output module outputs the correlation analysis result to the user through real-time report display and/or data early warning.
3. A computer-readable storage medium storing computer program instructions which, when executed, implement the method of claim 1.
CN201711371356.6A 2017-12-19 2017-12-19 Data association analysis method and system Active CN108021696B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711371356.6A CN108021696B (en) 2017-12-19 2017-12-19 Data association analysis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711371356.6A CN108021696B (en) 2017-12-19 2017-12-19 Data association analysis method and system

Publications (2)

Publication Number Publication Date
CN108021696A CN108021696A (en) 2018-05-11
CN108021696B true CN108021696B (en) 2021-02-05

Family

ID=62074140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711371356.6A Active CN108021696B (en) 2017-12-19 2017-12-19 Data association analysis method and system

Country Status (1)

Country Link
CN (1) CN108021696B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960960A (en) * 2018-06-01 2018-12-07 中国平安人寿保险股份有限公司 A kind of method and server handling high concurrent data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102508844A (en) * 2011-09-26 2012-06-20 北京金马甲产权网络交易有限公司 Cache system for dynamic sharing data of network bidding and cache method for dynamic sharing data of network bidding
CN103812676A (en) * 2012-11-08 2014-05-21 深圳中兴网信科技有限公司 Apparatus and method for realizing log data real-time association
US20170214716A1 (en) * 2016-01-26 2017-07-27 Korea Internet & Security Agency Violation information management module forming violation information intelligence analysis system

Also Published As

Publication number Publication date
CN108021696A (en) 2018-05-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant