CN110704698B

CN110704698B - Correlation and query method for unstructured massive network security data

Info

Publication number: CN110704698B
Application number: CN201911278901.6A
Authority: CN
Inventors: 潘祖烈; 张旻; 王文浩; 陈加根; 宁剑; 许成喜
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-12-13
Filing date: 2019-12-13
Publication date: 2020-04-10
Anticipated expiration: 2039-12-13
Also published as: CN110704698A

Abstract

The invention provides a method for associating and querying unstructured massive network security data, comprising the following steps: establishing a secondary index and rapidly querying the massive network security data based on the secondary index to complete a preliminary association of the network security data; performing iterative computation based on the credibility value of the network security data to associate the network security data; judging the validity of the associated network security data; and, when a user submits a data query task through a data query interface, querying the established secondary index data to obtain the data primary key value corresponding to the data association task, then querying the association data table of the network security database by that primary key value to obtain the corresponding network security data and its associated data. The invention markedly improves the efficiency of associating massive internet user information or data with similar characteristics, and ensures the validity and accuracy of the network security data association results.

Description

Correlation and query method for unstructured massive network security data

Technical Field

The invention belongs to the technical field of big data and concerns a big data processing and analysis method, mainly used for associating and querying unstructured massive data from different sources. In particular, it relates to a method for processing, analyzing and querying unstructured massive network security data in internet information.

Background

With the continuous development of internet technology, data volume is growing exponentially. How to rapidly and efficiently extract, associate and mine the valuable information stored in massive unstructured data from different sources is an important subject of current big data research. A key link in big data mining and analysis is data association technology: by effectively associating massive data, the rules among the data can be obtained, thereby creating value for technical innovation or commercial application.

Big data association technology studies the direct or potential relations that exist among massive data. Currently, the mainstream association rule algorithms include the Apriori algorithm, the FP-Growth algorithm, the Eclat algorithm and others; their main function is to discover rules or relations, including indirect relations, among data.

The main research object of the invention is massive internet network security data. In network security, identifying the exact identity of a user (such as a hacker or the user behind a network attack source) is of great significance for protection, tracking and counterattack. The specific research object of the invention is user information in network security data, or data whose characteristics are similar to those of massive internet user information.

Massive internet user information data, or similar data, has the following characteristics: ① the data are unstructured; ② the data have multiple dimensions; ③ the correspondence between data and users is complex. According to preliminary statistics, the data have 25 dimensions in total, such as user name, password and birth date, while a single record has 3 to 4 dimensions. Data sources are numerous. A user attribute such as a user name that appears in different sources may belong to different users or to the same user; identical mailbox field information from different sources belongs to the same user; and records from different sources whose user name, name and birth date are all the same can be identified as belonging to the same user.
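The identity rules described above can be illustrated with a minimal Python sketch. The field names (`email`, `username`, `name`, `birth_date`) and the sample records are hypothetical and chosen only for illustration; the text does not prescribe a record layout.

```python
from typing import Mapping, Optional

Record = Mapping[str, Optional[str]]


def _same(a: Record, b: Record, field: str) -> bool:
    """True when both records carry a non-empty, equal value for `field`."""
    va, vb = a.get(field), b.get(field)
    return bool(va) and va == vb


def likely_same_user(a: Record, b: Record) -> bool:
    # Rule 1: identical mailbox information from different sources
    # is treated as belonging to the same user.
    if _same(a, b, "email"):
        return True
    # Rule 2: user name, name and birth date must all agree.
    # A matching user name alone is not sufficient.
    return all(_same(a, b, f) for f in ("username", "name", "birth_date"))


# Hypothetical records from three different sources.
forum = {"username": "neo", "name": "T. Anderson", "birth_date": "1971-09-13"}
shop = {"username": "neo", "email": "neo@example.com"}
mail = {"email": "neo@example.com", "name": "T. Anderson"}
```

Here `shop` and `mail` match through the mailbox rule, whereas `forum` and `shop` share only a user name and are therefore not merged.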

An unstructured database must retrieve data by primary key value and cannot support queries on combinations of fields or on other fields. As a result, each data association query requires a full table scan, the association operation is slow, and the association efficiency for massive data is low.

The application scenarios of the current mainstream algorithms differ greatly from that of the invention, so these algorithms have difficulty processing the association and query tasks of massive internet user information efficiently.

Disclosure of Invention

Aiming at the problems of efficiency, effectiveness and accuracy in the process of associating massive internet user information data, the invention provides an association method of unstructured massive data, which obviously improves the efficiency of data association and ensures the effectiveness and accuracy of data association results.

The method is implemented on a system for associating and querying unstructured massive network security data. The system comprises, from bottom to top, a hardware layer, a data layer, a processing layer and an application layer that communicate with each other; the hardware layer comprises a management server and N computing and storage servers, where N is greater than 1, and is used for computing massive network security data and storing the corresponding computing results; the data layer comprises a network security database used for storing network security data; characterized in that the method comprises the following steps:

step S201: establishing a secondary index, and rapidly querying the massive network security data based on the secondary index to complete the preliminary association of the network security data;

step S202: performing iterative computation based on the credibility value of the network security data to associate the network security data;

step S203: judging the validity of the associated network security data, and deleting from the network security database any associated network security data that does not meet the requirements, together with its associated data;

step S204: a user submits a data query task through a data query interface; the established secondary index data is queried to obtain the data primary key value corresponding to the data association task, and the association data table of the network security database is queried by that primary key value to obtain the corresponding network security data and its associated data.

Further, the step S201 of establishing a secondary index and rapidly querying the massive network security data based on the secondary index to complete the preliminary association of the network security data comprises the following steps:

step S301: analyzing the characteristics of the updated or newly added unstructured network security data, extracting the data corresponding to security attribute fields capable of characterizing the network security data as security attribute feature field data, and storing it in the network security database; the number of security attribute fields of the network security data is greater than or equal to 1;

step S302: establishing a mapping between the security attribute feature field data and the primary key values of the network security data to form secondary index data, and storing the secondary index data in the network security database;

step S303: periodically scanning the network security data to obtain updated or newly added network security data, and looking up the corresponding primary key values in the secondary index data according to the security attribute feature field data;

step S304: rapidly locating data in the network security database according to the primary key values found, completing the preliminary association of the network security data, and setting a mark on the network security data that has undergone preliminary association.
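Steps S301 to S304 can be sketched roughly as follows. This is a minimal in-memory illustration, not the storage design of the invention: the database is modelled as a Python dict keyed by primary key, and the feature fields in `FEATURE_FIELDS`, the class `SecondaryIndex` and the flag name `preliminarily_associated` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical security attribute feature fields selected in S301.
FEATURE_FIELDS = ("email", "username")


class SecondaryIndex:
    """Maps (field, value) to the set of primary keys holding that value (S302)."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, primary_key, record):
        for field in FEATURE_FIELDS:
            value = record.get(field)
            if value:
                self._index[(field, value)].add(primary_key)

    def lookup(self, field, value):
        # Return a copy so callers cannot mutate the index.
        return set(self._index.get((field, value), ()))


def preliminary_association(database, index, new_keys):
    """S303/S304: match new records to existing ones through the index."""
    associations = {}
    for key in new_keys:
        record = database[key]
        matches = set()
        for field in FEATURE_FIELDS:
            value = record.get(field)
            if value:
                matches |= index.lookup(field, value)
        matches.discard(key)
        associations[key] = matches
        record["preliminarily_associated"] = True  # mark set in S304
        index.add(key, record)  # keep the index up to date
    return associations


database = {
    "pk1": {"email": "a@example.com", "username": "alice"},
    "pk2": {"email": "b@example.com", "username": "bob"},
}
index = SecondaryIndex()
for pk, rec in database.items():
    index.add(pk, rec)

# A newly added record shares its mailbox with pk1.
database["pk3"] = {"email": "a@example.com", "password": "hunter2"}
result = preliminary_association(database, index, ["pk3"])
```

Each lookup touches only the index entries for the record's own field values, so no full table scan is needed to find `pk1` as the match for `pk3`.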

Further, the step S202: performing iterative computation based on the credibility value of the network security data to realize the association of the network security data, wherein the iterative computation comprises the following steps:

step S401: retrieving the network security data that has undergone preliminary association, or the network security data obtained from a data association task sent by the user, as the network security data to be associated;

step S402: according to the characteristics of the network security data, the contents of the network security data field are divided into three types: unique representation class data, probabilistic representation class data and invalid representation class data;

step S403: for the first class of data, namely the unique representation class data, carrying out query operation according to the content of the field to obtain all result data, marking the reliability value of the correlation result as the maximum, then carrying out classification correlation on the obtained data, and marking the reliability value of the correlation result;

for the second class data, namely the probability representation class data, data query is carried out according to the content of the field, all results are correlated, and the reliability value of the correlation result is marked according to the reliability of the representation data;

for the third class of data, namely invalid representation class data, no correlation is carried out, and the reliability value is set to be 0;

step S404: judging whether the reliability value is higher than a preset threshold value, if so, entering a step S402; if not, the method ends.

Further, the unique representation class data are security attribute field data that constitute a unique attribute of a user; the probabilistic representation class data are field contents that belong to a certain user only with some probability and may not belong exclusively to a single user.
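One way to picture the iteration in steps S401 to S404 is the sketch below. The text does not specify how credibility values combine across iterations, so multiplying the credibility of the current record by a per-class weight is purely an assumption, as are the field names, the weights in `CLASS_CREDIBILITY` and the threshold of 0.5.

```python
# Hypothetical assignment of fields to the three classes of S402.
FIELD_CLASS = {
    "email": "unique",
    "id_number": "unique",
    "name": "probabilistic",
    "phone": "probabilistic",
}
# Hypothetical credibility weights; unique data gets the maximum (S403).
CLASS_CREDIBILITY = {"unique": 1.0, "probabilistic": 0.6, "invalid": 0.0}


def associate(records, seed_key, threshold=0.5):
    """Expand associations from a seed until credibility drops to the threshold."""
    credibility = {seed_key: 1.0}
    frontier = [seed_key]
    while frontier:
        current = frontier.pop()
        for field, value in records[current].items():
            if not value:
                continue
            cls = FIELD_CLASS.get(field, "invalid")
            score = credibility[current] * CLASS_CREDIBILITY[cls]
            if score <= threshold:
                continue  # S404: not above the threshold, stop expanding here
            for key, other in records.items():
                # Only raise a record's credibility, which guarantees termination.
                if other.get(field) == value and score > credibility.get(key, 0.0):
                    credibility[key] = score
                    frontier.append(key)  # S404: iterate again from this record
    return credibility


records = {
    "r1": {"email": "a@example.com", "name": "Li Lei"},
    "r2": {"email": "a@example.com", "phone": "555-0100"},
    "r3": {"phone": "555-0100", "name": "Han Mei"},
    "r4": {"name": "Han Mei", "nickname": "hm"},
}
linked = associate(records, "r1")
```

In this example `r2` is reached through a unique field and keeps the maximum credibility, `r3` is reached through a probabilistic field and is marked lower, and `r4` is never associated because a second probabilistic hop would fall below the threshold.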

Further, the step S203 of judging the validity of the associated network security data, and deleting from the network security database any associated network security data that does not meet the requirements together with its associated data, comprises the following steps:

step S601: acquiring associated data associated with the network security data realizing association;

step S602: detecting whether the content of the correlation result field has conflict or not; if yes, go to step S603; if not, go to step S604;

step S603: deleting the low-reliability data and the associated data thereof from the network security database according to the reliability value; the method is ended;

step S604: detecting whether the content of the data field is valid; if not, go to step S606; otherwise, go to step S605;

step S605: detecting whether the data volume corresponding to the association result field is smaller than a threshold value, if so, entering a step S606, and if not, entering a step S607;

step S606: deleting the data and its associated data from the network security database; the method is ended;

step S607: and storing the network security data and the associated data thereof into an associated data table of the network security database.
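The decision flow of steps S601 to S607 can be sketched as a single function. The sketch reads step S604 as a validity check whose failure leads to deletion in step S606, which is an interpretation. The conflict rule (different values in a field that one user can hold only once), the validity rule (every value is a non-blank string) and the field-count threshold are hypothetical stand-ins for checks the text leaves open.

```python
from enum import Enum


class Outcome(Enum):
    DELETE_LOW_CREDIBILITY = "S603"
    DELETE = "S606"
    STORE = "S607"


# Hypothetical fields for which a single user can hold only one value.
SINGLE_VALUED_FIELDS = ("birth_date", "id_number")


def has_conflict(group):
    """S602: the same single-valued field carries different values."""
    for field in SINGLE_VALUED_FIELDS:
        values = {r[field] for r in group if r.get(field)}
        if len(values) > 1:
            return True
    return False


def content_is_valid(record):
    """S604 (assumed rule): every field value is a non-blank string."""
    return all(isinstance(v, str) and v.strip() for v in record.values())


def judge(record, associated, min_fields=3):
    group = [record, *associated]  # S601: record plus its associated data
    if has_conflict(group):
        return Outcome.DELETE_LOW_CREDIBILITY
    if not all(content_is_valid(r) for r in group):
        return Outcome.DELETE
    # S605: count the distinct fields gathered by the association.
    merged_fields = {field for r in group for field in r}
    if len(merged_fields) < min_fields:
        return Outcome.DELETE
    return Outcome.STORE


a = {"email": "a@example.com", "name": "Li Lei"}
b = {"email": "a@example.com", "birth_date": "1990-01-01"}
c = {"email": "a@example.com", "birth_date": "1985-05-05"}
```

With these records, `judge(a, [b])` stores the result, while `judge(a, [b, c])` reports a conflict because two different birth dates were associated with the same mailbox.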

Further, before the step S301, the method further comprises a preprocessing operation, namely cleaning and filtering the acquired network security data and supplementing missing data.

Further, the step of using the network security data obtained from the data association task sent by the user as the network security data to be associated includes:

a user sends a data association task through a data association interface of an application layer according to the use requirement of the user;

querying the established secondary index data, acquiring the data primary key value corresponding to the data association task, and acquiring seed data through the secondary index data stored in the network security database;

based on the seed data, acquiring from the network security database the data that has a related field attribute and the same field attribute value as the seed data, as the data to be associated;

and taking the seed data and the data to be associated as network security data obtained from a data association task sent by a user.
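The user-initiated path described above might be sketched as follows. The secondary index is modelled as a plain dict from `(field, value)` to a set of primary keys, and the function name `build_association_input` and the sample data are hypothetical.

```python
def build_association_input(database, secondary_index, field, value):
    """Return (seed data, data to be associated) for a user's association task."""
    # Look up primary keys for the user's query, then fetch the seed records.
    seed_keys = secondary_index.get((field, value), set())
    seeds = {pk: database[pk] for pk in seed_keys}

    # Collect records that share a field and field value with any seed.
    candidates = {}
    for seed in seeds.values():
        for f, v in seed.items():
            if not v:
                continue
            for pk in secondary_index.get((f, v), set()):
                if pk not in seeds:
                    candidates[pk] = database[pk]
    return seeds, candidates


database = {
    "pk1": {"email": "a@example.com", "username": "alice"},
    "pk2": {"username": "alice", "phone": "555-0100"},
    "pk3": {"email": "c@example.com", "username": "carol"},
}

# Build a toy secondary index over every field of every record.
secondary_index = {}
for pk, rec in database.items():
    for f, v in rec.items():
        secondary_index.setdefault((f, v), set()).add(pk)

seeds, candidates = build_association_input(
    database, secondary_index, "email", "a@example.com"
)
```

The seed is `pk1`; `pk2` becomes data to be associated because it shares the user name with the seed, and `pk3` is left out.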

Beneficial effects:

the invention provides a method for associating and inquiring unstructured massive network security data aiming at the characteristics of massive internet user information or data characteristics similar to the characteristics of the massive internet user information, solves the problems by methods such as data secondary index design, data association credibility design, data validity judgment and the like, and achieves the following effects:

(1) the mass network security data can be quickly searched, analyzed and calculated by utilizing the secondary index design of the data, the data calculation time is greatly shortened, and the data association efficiency is improved;

(2) the method can realize automatic data discrimination in the mass network security data association process, effectively improves the size of the associated data set on the premise of ensuring the accuracy through the data association credibility design, and ensures the validity of the data association result through the data validity judgment;

(3) by mining user information in massive unstructured network security data, the method generates, as far as possible, the maximum attribute set for each single user;

(4) by the method and the device, the user information related to the network security can be quickly and accurately determined from the mass network security data, and measures such as subsequent protection, tracking, counterattack and the like related to the network security can be conveniently carried out.

Drawings

FIG. 1 is a system architecture diagram of the present invention for implementing a method for associating and querying unstructured massive data;

FIG. 2 is a flow chart of a method for associating and querying unstructured mass data in accordance with the present invention;

FIG. 3 is a schematic diagram of establishing a secondary index and performing fast query of mass network security data based on the secondary index according to the present invention;

FIG. 4 is a flow chart of the present invention for implementing efficient association of network security data based on trustworthiness values;

FIG. 5 is a diagram illustrating an operation process of a user initiating a data association task according to an embodiment of the present invention;

FIG. 6 is a flow chart of the network security data validity judgment of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings.

First, the system architecture for implementing the association and query method of unstructured massive data according to the present invention is described with reference to FIG. 1. As shown in FIG. 1:

the system for realizing the association and query method of the unstructured massive data comprises a hardware layer, a data layer, a processing layer and an application layer which are communicated with each other from bottom to top.

The hardware layer comprises a management server and N computing and storage servers, where N > 1, and provides data computing and storage support, namely computing the massive network security data and storing the corresponding computing results.

The data layer includes a network security database for storing network security data, which in this embodiment includes but is not limited to internet user data.

The processing layer is provided with a credibility judging module, a validity judging module and a secondary index module, and is used for associating network security data, cleaning invalid network security data and constructing the secondary index of the massive network security data.

The application layer provides a man-machine interaction interface for a user, and the man-machine interaction interface comprises a data association interface and a data query interface which are respectively used for realizing association and query of the unstructured massive network security data. This embodiment includes the following steps, as shown in fig. 2:

Step S201: establishing a secondary index, and rapidly querying the massive network security data based on the secondary index to complete the preliminary association of the network security data.

Step S202: performing iterative computation based on the credibility value of the network security data to associate the network security data.

Step S203: judging the validity of the associated network security data, and deleting from the network security database any associated network security data that does not meet the requirements, together with its associated data.

Step S201: establishing a secondary index, and quickly inquiring mass network security data based on the secondary index to complete the primary association of the network security data; the method comprises the following steps:

Preprocessing operations such as cleaning, filtering and supplementing missing data are carried out on the acquired network security data.
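As a rough illustration of such preprocessing, the sketch below cleans, filters and supplements a batch of raw records. The concrete rules are assumptions made for the example: trimming whitespace and normalizing case as cleaning, discarding records with fewer than two usable fields as filtering, and filling a default `source` value as missing-data supplementation.

```python
def preprocess(raw_records, min_fields=2):
    """Clean, filter and supplement raw records (all rules are illustrative)."""
    cleaned = []
    for raw in raw_records:
        # Cleaning: normalize field names, trim values, drop empty values.
        record = {
            key.strip().lower(): value.strip()
            for key, value in raw.items()
            if isinstance(value, str) and value.strip()
        }
        if "email" in record:
            record["email"] = record["email"].lower()

        # Filtering: discard records that carry too little information.
        if len(record) < min_fields:
            continue

        # Supplementing: fill in a default for a missing field.
        record.setdefault("source", "unknown")
        cleaned.append(record)
    return cleaned


raw = [
    {" Email ": " A@Example.com ", "name": "Li Lei", "phone": ""},
    {"name": "   "},
]
prepared = preprocess(raw)
```

The first record survives with a normalized mailbox and a default source, and the second is filtered out because no usable field remains after cleaning.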

Data association operations are executed rapidly by establishing a secondary index, which effectively improves the association efficiency for massive data. The task of establishing the secondary index is an automatic task: whether the network security data in the network security database has been updated is monitored in real time, and if it has, the secondary index establishment task is triggered automatically and the secondary index module of the processing layer establishes the secondary index.

FIG. 3 is a schematic diagram illustrating establishment of a secondary index and fast query of mass network security data based on the secondary index, as shown in FIG. 3:

the specific implementation process is as follows:

step S301: analyzing the characteristics of the updated or newly added unstructured network security data, extracting data corresponding to a security attribute field capable of representing the network security data as security attribute characteristic field data, and storing the data into a network security database; the number of the security attribute fields of the network security data is greater than or equal to 1.

Step S302: establishing a mapping between the security attribute feature field data and the primary key values of the network security data to form secondary index data, and storing the secondary index data in the network security database.

Step S303: periodically scanning the network security data to obtain updated or newly added network security data, and looking up the corresponding primary key values in the secondary index data according to the security attribute feature field data.

Step S202: performing iterative computation based on the credibility value of the network security data to realize the association of the network security data, wherein the iterative computation comprises the following steps:

When associating network security data, a credibility value is designed for the network security data and the association is performed based on it: the credibility value is evaluated through continuous iterative queries of the network security data, and effective association of the network security data is achieved based on the credibility value. As shown in FIG. 4:

the specific implementation process is as follows:

Step S401: retrieving the network security data that has undergone preliminary association, or the network security data obtained from a data association task sent by the user, as the network security data to be associated.

Further, in another embodiment of the present invention, as shown in fig. 5, the acquiring the network security data from the data association task sent by the user as the network security data to be associated includes:

The user sends a data association task through the data association interface of the application layer according to the user's own needs.

The established secondary index data is queried to obtain the data primary key value corresponding to the data association task, and seed data is acquired through the secondary index data stored in the network security database.

Based on the seed data, the data in the network security database that has a related field attribute and the same field attribute value as the seed data is acquired as the data to be associated.

Step S402: according to the characteristics of the network security data, the contents of the network security data fields are divided into three types: unique representation class data, probabilistic representation class data and invalid representation class data.

Unique representation class data are data fields that constitute a unique attribute of a user, such as a mailbox account or an identity card number. Probabilistic representation class data are field contents that belong to a certain user with some probability but may not belong exclusively to a single user, such as a name or a mobile phone number. For invalid representation class data, no association operation is carried out.

Step S403: for the first class of data, namely the unique representation class data, a query is performed according to the content of the field to obtain all result data, the credibility value of the association result is marked as the maximum, the obtained data is then classified and associated, and the credibility value of the association result is marked.

For the second class of data, namely the probabilistic representation class data, a data query is performed according to the content of the field, all results are associated, and the credibility value of the association result is marked according to the reliability of the representation data.

For the third class of data, namely the invalid representation class data, no association is carried out and the credibility value is set to 0.
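The classification of step S402 and the credibility marking of step S403 can be expressed as a small lookup. The mailbox account and identity card number are named in the text as unique data, and the name and mobile phone number as probabilistic data; the field identifiers and the numeric reliability values below are assumptions for illustration only.

```python
MAX_CREDIBILITY = 1.0

# Fields the text gives as examples of unique representation class data.
UNIQUE_FIELDS = {"email", "id_number"}

# Fields the text gives as examples of probabilistic representation class data.
# The reliability numbers are hypothetical.
PROBABILISTIC_FIELDS = {"name": 0.4, "mobile": 0.8}


def mark_credibility(field):
    """Return (class, credibility value) for an association made through `field`."""
    if field in UNIQUE_FIELDS:
        return "unique", MAX_CREDIBILITY
    if field in PROBABILISTIC_FIELDS:
        return "probabilistic", PROBABILISTIC_FIELDS[field]
    # Anything else is treated as invalid: no association, credibility 0.
    return "invalid", 0.0
```

For example, `mark_credibility("email")` yields the maximum value, `mark_credibility("name")` yields the lower value assigned to names, and an unlisted field such as `"nickname"` yields zero.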

Step S203: judging the validity of the associated network security data, and deleting from the network security database any associated network security data that does not meet the requirements, together with its associated data, comprises the following steps:

To make the association result as accurate as possible, the validity of the associated network security data is further judged so as to obtain, as far as possible, a true result. As shown in FIG. 6:

step S601: obtaining association data associated with the network security data that implements the association.

Step S602: detecting whether the content of the correlation result field has conflict or not; if yes, go to step S603; if not, the process proceeds to step S604.

The low-credibility data can be identified according to a preset threshold value.

Step S604: detecting whether the content of the data field is valid; if not, go to step S606; otherwise, the process proceeds to step S605.

Further, regular expression rules can be utilized to match whether the field contents meet requirements.

Step S605: and detecting whether the data volume corresponding to the association result field is smaller than a threshold value, if so, entering step S606, and if not, entering step S607.

That is, whether the number of association result fields is sufficient is detected; specifically, a threshold is set to judge whether the number of data fields meets the requirement.

Step S606: deleting the data and its associated data from the network security database; the method ends.

The embodiment can detect error data generated under various conditions, and effectively improves the accuracy of data association.
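The regular-expression check mentioned for step S604 might look like the following sketch. The patterns are deliberately simple and hypothetical (a loose mailbox shape, an 11-digit mobile number starting with 1, and an ISO-style date); real deployments would need rules matched to their own data.

```python
import re

# Hypothetical per-field validation rules.
FIELD_RULES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "mobile": re.compile(r"1[3-9]\d{9}"),
    "birth_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def invalid_fields(record):
    """Return the names of fields whose content does not match its rule."""
    return [
        field
        for field, value in record.items()
        if field in FIELD_RULES and not FIELD_RULES[field].fullmatch(str(value))
    ]


sample = {"email": "a@example.com", "mobile": "12345", "name": "Li Lei"}
problems = invalid_fields(sample)
```

Fields without a rule, such as `name` here, are not checked; only `mobile` is reported because it does not have the expected shape.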

Step S204: a user submits a data query task through a data query interface; the established secondary index data is queried to obtain the primary key value of the data corresponding to the data association task, and the association data table of the network security database is queried by that primary key value to obtain the corresponding network security data and its associated data. The method further comprises the following:

the storage position of the corresponding network security data in the network security database can be quickly located through the primary key value.
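The two-stage lookup of step S204 can be sketched as below: the secondary index resolves a field value to primary keys, and the primary keys then address the association data table directly. Both structures are modelled as Python dicts, and their layout (including the `data` and `associated` keys) is an assumption for illustration.

```python
def query(secondary_index, association_table, field, value):
    """Resolve a field value to primary keys, then fetch the associated data."""
    primary_keys = secondary_index.get((field, value), set())
    return {
        pk: association_table[pk]
        for pk in sorted(primary_keys)
        if pk in association_table  # skip records removed by validity judgment
    }


secondary_index = {
    ("email", "a@example.com"): {"pk1", "pk3"},
    ("username", "bob"): {"pk2"},
}
association_table = {
    "pk1": {"data": {"email": "a@example.com", "username": "alice"},
            "associated": ["pk3"]},
    "pk3": {"data": {"email": "a@example.com", "phone": "555-0100"},
            "associated": ["pk1"]},
}

hits = query(secondary_index, association_table, "email", "a@example.com")
```

A query for the mailbox returns both `pk1` and `pk3` with their association lists, while a query for `bob` returns nothing because `pk2` has no entry in the association data table.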

It will be evident to those skilled in the art that the embodiments of the present invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention are capable of being embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention and not for limiting, and although the embodiments of the present invention are described in detail with reference to the above preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the embodiments of the present invention without departing from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A correlation and query method of unstructured massive network security data is realized based on a system for realizing the correlation and query method of unstructured massive data, wherein the system comprises a hardware layer, a data layer, a processing layer and an application layer which are communicated with each other from bottom to top; the hardware layer comprises a management server and N computing and storing servers, wherein N is greater than 1 and is used for computing massive network security data and storing corresponding computing results; the data layer comprises a network security database used for storing network security data; characterized in that the method comprises the following steps:

step S204: a user submits a data query task through a data query interface, queries the established secondary index data, acquires a data primary key value corresponding to the data association task, queries an association data table of the network security database through the data primary key value, and acquires corresponding network security data and association data thereof;

the step S202: performing iterative computation based on the credibility value of the network security data to realize the association of the network security data, wherein the iterative computation comprises the following steps:

2. The method for associating and querying unstructured massive network security data according to claim 1, wherein the step S201: establishing a secondary index, and quickly inquiring mass network security data based on the secondary index to complete the primary association of the network security data; the method comprises the following steps:

3. The method for associating and querying unstructured massive network security data as claimed in claim 2, wherein the unique representation class data are security attribute field data that constitute a unique attribute of a user; the probabilistic representation class data are field contents that belong to a certain user only with some probability and may not belong exclusively to a single user.

4. The method for associating and querying unstructured massive network security data according to claim 1, wherein step S203: the method comprises the following steps of judging the effectiveness of the network security data which is associated, and deleting the network security data and the associated data thereof from a network security database for the network security data which is not qualified and is associated, wherein the method comprises the following steps:

5. The method for associating and querying unstructured massive network security data according to claim 2, wherein before the step S301, the method further comprises a preprocessing operation, namely, the acquired network security data is cleaned, filtered and the missing data is supplemented.

6. The method for associating and querying unstructured massive network security data as claimed in claim 2, wherein the step of obtaining the network security data from the data association task sent by the user as the network security data to be associated comprises: