CN113254641A

CN113254641A - Information data fusion method and device

Info

Publication number: CN113254641A
Application number: CN202110588184.8A
Authority: CN
Inventors: 任传伦; 王淮; 刘晓影; 乌吉斯古愣; 俞赛赛; 张先国; 王玥
Original assignee: CETC 15 Research Institute; CETC 30 Research Institute
Current assignee: CETC 15 Research Institute; CETC 30 Research Institute
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2021-08-13
Anticipated expiration: 2041-05-27
Also published as: CN113254641B

Abstract

The invention provides an intelligence data fusion method and device. A decision tree is trained with the ID3 algorithm to generate Smart rules; entity extraction, entity classification, attribute identification and attribute extraction are performed on raw network intelligence data, and a fusion rule is then selected automatically so that the network intelligence data can be fused into the intelligence base. The invention mainly aims to solve the low fusion efficiency and inconsistent fusion quality of existing intelligence data fusion, to achieve efficient, rapid and standardized fusion of network intelligence data, and to reduce the dependence of such fusion on domain expert knowledge.

Description

Information data fusion method and device

Technical Field

The invention relates to the technical field of network security, in particular to an intelligence data fusion method and device.

Background

Intelligence data fusion processes newly added intelligence data so that its entities and attribute values are stored in the intelligence base. Specifically, entity fusion, attribute fusion and similar operations are performed on the intelligence data, and the entities and attribute values are merged into the existing intelligence base by adding new records or updating existing ones.

At present, intelligence data fusion mainly relies on program scripts to check entities and attribute values, on manual extraction of those entities and attribute values, and on manual entry of network intelligence data into the intelligence base through back-end or front-end visual operations guided by domain expert knowledge. New data must be written with manual intervention, and experts must take part in verifying the data attributes. This approach requires a large amount of manual work and depends heavily on domain experts; when faced with massive volumes of intelligence data it is difficult to complete the fusion in limited time, so fusion efficiency is low and the fusion result varies from person to person.

Disclosure of Invention

In view of the above, the invention provides an intelligence data fusion method and apparatus, mainly to solve the low fusion efficiency and inconsistent fusion quality of existing intelligence data fusion. The method avoids excessive dependence on domain experts and reduces heavy manual work. Smart rules are designed around the characteristics of network intelligence data, such as its wide range of sources, missing attributes and low reliability, so that the data can be fused rapidly and automatically.

According to one aspect of the present invention, there is provided an intelligence data fusion method comprising the following steps: S1, preprocessing raw network intelligence data to obtain structured data that conforms to the intelligence base data model; S2, collecting a large amount of structured data and labeling each piece of data with its fusion mode to form training data, then training a decision tree model on the training data by machine learning to obtain a Smart rule decision tree model; S3, inputting the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence base data model; S4, writing the structured data into the intelligence base according to the fusion rule.

As a further improvement of the present invention, the machine learning training of the decision tree model using the training data is specifically training using a decision tree ID3 classification algorithm.

As a further improvement of the present invention, the preprocessing comprises: S101, entity extraction: identifying the intelligence entities in the raw network intelligence data, and extracting and saving the entity fields; S102, entity classification: classifying the intelligence entities, and mapping the entity fields onto the intelligence base data model according to the constraints of that data model; S103, attribute identification: identifying the entity attributes of the intelligence entities; S104, attribute extraction: matching the entity attributes against the intelligence base data model, and extracting and processing the attribute values of the matched entity attributes to form formatted entity attribute data.

As a further improvement of the present invention, the training data is formed as follows: m types of intelligence entities are defined, and n types of entity attributes are defined for the intelligence entities; each piece of raw network intelligence data is preprocessed into structured data in the form of an (m + n)-dimensional data vector; the fusion mode labels comprise a label for the intelligence entity fusion mode and a label for the entity attribute fusion mode; the intelligence entity fusion modes comprise data overwrite writing, new data writing and duplicate data discarding; the entity attribute fusion modes comprise data overwrite writing, new data writing, duplicate data discarding, data append writing and partial replacement writing.

As a further improvement of the present invention, the training using the decision tree ID3 classification algorithm specifically comprises: step one: computing the current information entropy of the training data, computing the branch information entropy for each of the n entity attributes, computing the conditional entropy from the branch information entropies, and from these computing the information gain of each of the n attributes; the attribute with the maximum information gain is selected as a decision point and added to the decision tree; step two: removing the column of the attribute with the maximum information gain from the training data, and repeating step one on the remaining training data until all entity attributes have been added to the decision tree.

According to another aspect of the present invention, there is provided an intelligence data fusion apparatus comprising: a preprocessing module, configured to preprocess raw network intelligence data to obtain structured data that conforms to the intelligence base data model; a model training module, configured to collect a large amount of structured data and label each piece of data with its fusion mode to form training data, and to train a decision tree model on the training data by machine learning to obtain a Smart rule decision tree model; a fusion rule generation module, configured to input the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence base data model; and a data writing module, configured to write the structured data into the intelligence base according to the fusion rule.

As a further improvement of the present invention, the preprocessing module comprises: an entity extraction submodule, configured to identify intelligence entities in the raw network intelligence data and to extract and save the entity fields; an entity classification submodule, configured to classify the intelligence entities and map the entity fields onto the intelligence base data model according to the constraints of that data model; an attribute identification submodule, configured to identify the entity attributes of the intelligence entities; and an attribute extraction submodule, configured to match the entity attributes against the intelligence base data model and to extract and process the attribute values of the matched entity attributes to form formatted entity attribute data.

As a further improvement of the present invention, the training using the decision tree ID3 classification algorithm in the apparatus specifically comprises: step one: computing the current information entropy of the training data, computing the branch information entropy for each of the n entity attributes, computing the conditional entropy from the branch information entropies, and from these computing the information gain of each of the n attributes; the attribute with the maximum information gain is selected as a decision point and added to the decision tree; step two: removing the column of the attribute with the maximum information gain from the training data, and repeating step one on the remaining training data until all entity attributes have been added to the decision tree.

By the technical scheme, the beneficial effects provided by the invention are as follows:

(1) A large amount of raw intelligence data is used to train a Smart rule decision tree model with the decision tree ID3 classification algorithm. Given input intelligence data, the model automatically generates the fusion rules between the entities and entity attributes of that data and the intelligence base data model, so that intelligence data can be fused and stored automatically.

(2) Because the trained Smart rule decision tree model generates the fusion rules, extensive manual handling of each piece of intelligence data is avoided and the efficiency of intelligence data fusion improves.

(3) The excessive dependence on domain experts and the inconsistent fusion quality of manual fusion are avoided.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flowchart of an intelligence data fusion method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of the data preprocessing steps in an intelligence data fusion method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the entity classification and attribute classification of network intelligence data in an intelligence data fusion method according to an embodiment of the present invention;

FIG. 4 is a flowchart of decision tree training in an intelligence data fusion method according to an embodiment of the present invention;

FIG. 5 shows an example of Smart rules generated in an intelligence data fusion method according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Decision tree algorithm: a decision tree is a tree structure built from decisions. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values: each node represents an object, each branching path represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf. A decision tree has only a single output; if multiple outputs are required, separate decision trees can be built to handle the different outputs.

Information entropy: information entropy is a concept used in information theory to measure the amount of information. The more ordered a system is, the lower the information entropy is; conversely, the more chaotic a system is, the higher the entropy of the information becomes. Therefore, entropy can also be said to be a measure of the degree of system ordering.
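The relationship between order and entropy can be illustrated with a short calculation. The following is a minimal sketch, assuming Shannon entropy measured in bits; the function name `entropy` and the label values are illustrative and do not come from the patent.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of p * log2(1/p) over the classes that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A fully ordered set (one class only) versus an evenly mixed set.
ordered = ["discard"] * 8
mixed = ["discard"] * 4 + ["append"] * 4

print("ordered:", entropy(ordered))
print("mixed:  ", entropy(mixed))
```

The ordered set has the lowest possible entropy, while the evenly mixed two-class set has the highest entropy a two-class set can reach.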

The ID3 algorithm: ID3 (Iterative Dichotomiser 3) is a decision tree algorithm invented by Ross Quinlan. It is based on the principle of Occam's razor, i.e. accomplishing more with less: a smaller decision tree is preferred over a larger one. ID3 is nevertheless a heuristic and does not always produce the smallest tree. In information theory, the smaller the expected information (entropy) remaining after a split, the greater the information gain and the higher the purity of the resulting subsets. The core idea of ID3 is to use information gain as the attribute selection measure and to split on the attribute that yields the largest information gain. The algorithm searches the space of possible decision trees with a top-down greedy search.

Example 1

As shown in fig. 1, the method for fusing information data is divided into several stages in implementation, including preprocessing, generating a fusion rule, and fusing information data and information database data according to the rule.

S1, preprocessing the original network intelligence data to obtain the structured data in accordance with the intelligence database data model;

Table 1 is an exemplary intelligence base data model:

Table 1. Intelligence base data model

As shown in FIG. 2, the preprocessing step mainly handles raw intelligence data. The data sources may include ThreatBook (Weibu Online) report data, 360 threat intelligence data, Knownsec (Zhidao Chuangyu) intelligence data, QiAnXin intelligence data, self-developed intelligence data and the like. The intelligence attributes are determined according to the data source, and the HanLP open-source NLP toolkit is used to extract intelligence entities and entity attribute values. After the data preprocessing stage, the network intelligence data becomes structured data that conforms to the intelligence base data model. The stage comprises entity extraction, entity classification, attribute identification and attribute value extraction.

S101, entity extraction: the entities in the network intelligence data are identified; the HanLP toolkit is used to accurately identify the entity fields of the network intelligence data, which are then extracted and saved.

S102, entity classification: the extracted entities are classified, and the extracted entity fields are mapped onto the intelligence base data model according to the constraints of that data model.

Illustratively, the entity fields fall into the categories listed in Table 1: IP, domain name, sample, URL, account, and APT organization.

S103, attribute identification: the attributes related to each entity are identified; the HanLP toolkit is used to accurately identify the entity attributes of the network intelligence data.

S104, attribute extraction: the identified entity attributes are matched against the intelligence base data model, and the attribute values of the matched entity attributes are extracted and processed to form formatted attribute data.

Illustratively, the entity attributes include those in Table 1: geographic location, country, recording time, registrant, process behavior, etc.
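Steps S101 to S104 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the patent extracts entities with HanLP from raw text, whereas the sketch below substitutes the Python standard library (`ipaddress` and a regular expression) and assumes the raw record is already a key/value mapping. The names `DATA_MODEL`, `ATTRIBUTE_ALIASES`, `classify_entity`, `preprocess` and the field `indicator` are hypothetical.

```python
import ipaddress
import re

# Hypothetical data model: attributes allowed for each entity type.
DATA_MODEL = {
    "ip": {"geographic_location", "country", "recording_time"},
    "domain": {"registrant", "country", "recording_time"},
}

# Hypothetical mapping from source-specific field names to model attributes.
ATTRIBUTE_ALIASES = {"geo": "geographic_location", "first_seen": "recording_time"}

DOMAIN_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def classify_entity(value):
    """S102: return the entity type of a value, or None if unrecognized."""
    try:
        ipaddress.ip_address(value)
        return "ip"
    except ValueError:
        pass
    return "domain" if DOMAIN_RE.match(value) else None


def preprocess(raw):
    """Turn one raw record into structured data that fits DATA_MODEL."""
    entity = raw["indicator"].strip()              # S101: extract the entity field
    entity_type = classify_entity(entity)          # S102: classify the entity
    if entity_type is None:
        return None
    attributes = {}
    for name, value in raw.items():                # S103: identify attributes
        if name == "indicator" or value in (None, ""):
            continue
        canonical = ATTRIBUTE_ALIASES.get(name, name)
        if canonical in DATA_MODEL[entity_type]:   # S104: match and format values
            attributes[canonical] = str(value).strip()
    return {"entity_type": entity_type, "entity": entity, "attributes": attributes}


record = {"indicator": " 203.0.113.7 ", "geo": "Beijing", "registrant": "n/a"}
print(preprocess(record))
```

Attributes that the data model does not define for the entity type (here `registrant` on an IP entity) are dropped, which mirrors the matching described in S104.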

S2, collecting a large amount of structured data and labeling each data in a fusion mode to form training data; performing machine learning training on the decision tree model by using training data to obtain a Smart rule decision tree model;

S21: Domain experts study and assess the intelligence data against the data in the intelligence base and label the fusion mode of each preprocessed piece of intelligence data, forming the training set. The decision tree ID3 classification algorithm is then trained on the entity types and entity attributes of the intelligence data, with the labeled fusion modes as the targets. Training produces the Smart rule decision tree model, and the model generates the fusion rules for intelligence data.

The fusion rules comprise entity fusion rules (data overwrite writing, new data writing and duplicate data discarding) and entity attribute fusion rules (data overwrite writing, new data writing, duplicate data discarding, data append writing and partial replacement writing).

Based on the ID3 decision tree algorithm, the network intelligence data is classified by entity fusion rule and by entity attribute fusion rule, which forms the Smart rules. This mainly comprises three stages: network intelligence data classification, construction of the network intelligence data training set, and decision tree training.

S211, intelligence data classification: the entity types and attribute types of the intelligence data are constructed.

As shown in FIG. 3, the intelligence data is divided by entity into IP, domain name, sample, URL, account, and APT organization. To make effective use of the ID3 decision tree algorithm, and following the idea of one-hot encoding, 18 attributes are defined for each entity: geographic location, affiliated institution, country, recording time, registrant, process behavior, attack target, attack intention, string storage, list storage, set storage, file storage, 360 intelligence, VT intelligence, ThreatBook intelligence, QiAnXin intelligence, Knownsec intelligence, and self-produced intelligence. Each piece of intelligence data is thus mapped to an 18-dimensional data vector. This step maps network intelligence data of many different forms to a fixed-length data vector, which both matches the logic of manual intelligence fusion and meets the input format required for machine learning.
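The mapping to a fixed-length vector might look like the following minimal sketch. It assumes that each of the 18 positions is a presence flag (1 when the attribute carries data, 0 otherwise), as the training-set construction step describes; the identifier names in `ATTRIBUTES` and the function `encode_attributes` are hypothetical English renderings, not names used by the patent.

```python
# The 18 attributes in the order listed in the text (identifier names assumed).
ATTRIBUTES = [
    "geographic_location", "affiliated_institution", "country", "recording_time",
    "registrant", "process_behavior", "attack_target", "attack_intention",
    "string_storage", "list_storage", "set_storage", "file_storage",
    "intel_360", "intel_vt",
    "intel_threatbook",     # "micro-step" intelligence in the text
    "intel_qianxin",
    "intel_knownsec",       # "know chuangyu" intelligence in the text
    "intel_self_produced",
]


def has_value(value):
    """Treat None and empty containers or strings as 'no data'."""
    return value is not None and value not in ("", [], {}, ())


def encode_attributes(record):
    """Map an attribute dictionary to an 18-dimensional 0/1 vector."""
    return [1 if has_value(record.get(name)) else 0 for name in ATTRIBUTES]


vector = encode_attributes({"country": "CN", "intel_vt": {"score": 3}, "registrant": ""})
print(len(vector), vector)
```

Records from different sources, whatever fields they happen to carry, end up with the same length and column order, which is what lets a single decision tree be trained on all of them.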

S212, construction of the intelligence data training set: raw intelligence data is obtained from open-source intelligence, third-party intelligence and self-produced intelligence, each piece of intelligence data is decomposed into an 18-dimensional data vector, and each piece of data is labeled with the fusion mode of its entity and of its entity attributes. The entity fusion rule has 3 options (data overwrite writing, new data writing and duplicate data discarding), labeled 0, 1 and 2 respectively; the attribute fusion rule has 5 options (data overwrite writing, new data writing, duplicate data discarding, data append writing and partial replacement writing), labeled 0, 1, 2, 3 and 4 respectively. The attributes of the network intelligence data are encoded so that an attribute that has data is marked 1 and an attribute that has no data is marked 0, which completes the construction of the network intelligence data training set. Table 2 is an example of a network intelligence data training set.

TABLE 2 example of network intelligence data training set
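One labeled training record of the kind described above might be represented as follows. This is a sketch under assumptions: the numeric label values follow the text (0 to 2 for entities, 0 to 4 for attributes), but the class names `EntityFusion`, `AttributeFusion` and `TrainingSample` are hypothetical, and the sample values are made up for illustration rather than taken from Table 2.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class EntityFusion(IntEnum):
    OVERWRITE = 0
    NEW = 1
    DISCARD_DUPLICATE = 2


class AttributeFusion(IntEnum):
    OVERWRITE = 0
    NEW = 1
    DISCARD_DUPLICATE = 2
    APPEND = 3
    PARTIAL_REPLACE = 4


@dataclass(frozen=True)
class TrainingSample:
    entity_type: str                 # e.g. "ip", "domain", "url"
    features: Tuple[int, ...]        # 18 presence flags, each 0 or 1
    entity_label: EntityFusion
    attribute_label: AttributeFusion

    def __post_init__(self):
        if len(self.features) != 18 or any(f not in (0, 1) for f in self.features):
            raise ValueError("features must be 18 flags with value 0 or 1")


# Illustrative record: only 'country' (index 2) and 'recording_time' (index 3) carry data.
flags = tuple(1 if i in (2, 3) else 0 for i in range(18))
sample = TrainingSample("ip", flags, EntityFusion.NEW, AttributeFusion.DISCARD_DUPLICATE)
print(sample.entity_type, int(sample.entity_label), int(sample.attribute_label))
```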

S213, decision tree training: the information entropy of the current set is computed from the intelligence data training set. Let $p_k$ denote the proportion of samples in the current set $D$ that belong to the $k$-th class (fusion mode label); the information entropy of $D$ is defined as

$$\mathrm{Ent}(D) = -\sum_{k} p_k \log_2 p_k .$$

The set is then divided into subsets $D^v$ according to the values $v$ of an attribute $k$ (here $k$ denotes the attribute being evaluated, not a class index), and the entropy $\mathrm{Ent}(D^v)$ of each subset is computed. The subset entropies are weighted, with each weight defined as the ratio of the number of samples in the subset to the total number of samples, and the information gain of attribute $k$ is computed as

$$\mathrm{Gain}(D, k) = \mathrm{Ent}(D) - \sum_{v} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v) .$$

The attribute with the maximum information gain is selected as a decision point and added to the decision tree. The column of that attribute is then removed from the training set, and the process is repeated until no attributes remain in the set D. The decision tree training process is shown in FIG. 4.
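The training loop described above can be sketched in a few lines. This is a minimal illustration of textbook ID3, not the patent's implementation: it also stops early when a subset contains a single label, which is the usual ID3 stopping rule and an assumption here. The function names, the two-attribute toy data and the label values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Ent(D): Shannon entropy of the class labels, in bits."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Gain(D, attr): entropy of D minus the weighted entropy of its subsets."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    conditional = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional


def build_id3(rows, labels, attrs):
    """Return a nested dict tree; leaves are fusion-mode labels."""
    if len(set(labels)) == 1:          # pure subset: nothing left to decide
        return labels[0]
    if not attrs:                      # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]   # drop the chosen column
    node = {"attribute": best, "branches": {}}
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = build_id3(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


# Toy training set: presence flags for two attributes, attribute fusion labels 1, 2, 3.
rows = [
    {"recording_time": 0, "geographic_location": 0},
    {"recording_time": 0, "geographic_location": 1},
    {"recording_time": 1, "geographic_location": 1},
    {"recording_time": 1, "geographic_location": 0},
]
labels = [2, 2, 3, 1]
tree = build_id3(rows, labels, ["recording_time", "geographic_location"])
print(tree)
```

In this toy set `recording_time` has the larger information gain, so it becomes the root decision point and `geographic_location` is only consulted on the branch where the recording time is present.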

S3, inputting the structured data into a Smart rule decision tree model to obtain a fusion rule of the structured data and the intelligence database data model;

Based on the trained Smart rule decision tree model, Smart rules are formed for the fusion mode of network intelligence data, providing intelligent, simple and convenient entity fusion rules and entity attribute fusion rules for fusing that data. This mainly comprises two stages: intelligence data decomposition and fusion rule calculation.

S31, intelligence data decomposition: the intelligence data to be fused is decomposed to obtain its entity type and entity attributes, which form the input data for the ID3 decision tree.

S32, fusion rule calculation: the trained decision tree is used to obtain the entity fusion rule and the entity attribute fusion rule of the current intelligence data, i.e. the Smart rule.
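Obtaining a fusion rule from a trained tree amounts to walking from the root to a leaf. The sketch below assumes the tree is stored as nested dictionaries with `attribute` and `branches` keys; that representation, the function name `predict` and the example tree are all hypothetical.

```python
def predict(tree, sample, default=None):
    """Follow the sample's attribute flags down the tree and return the leaf label."""
    node = tree
    while isinstance(node, dict):
        value = sample.get(node["attribute"], 0)   # a missing attribute counts as empty
        if value not in node["branches"]:
            return default                         # value never seen during training
        node = node["branches"][value]
    return node


# Hypothetical trained tree; leaves are attribute fusion labels (2 = discard, 3 = append).
smart_rule_tree = {
    "attribute": "recording_time",
    "branches": {
        0: 2,
        1: {"attribute": "geographic_location", "branches": {0: 2, 1: 3}},
    },
}

decomposed = {"recording_time": 1, "geographic_location": 1}   # output of S31
print(predict(smart_rule_tree, decomposed))
```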

S4, writing the structured data into the intelligence base according to the fusion rule.

The preprocessed network intelligence data is processed according to the Smart rule obtained in the previous step and written into the intelligence base, completing the fusion of the network intelligence data.
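The write step can be illustrated with an in-memory dictionary standing in for the intelligence base. The patent names the five attribute fusion modes but does not define their exact storage semantics, so the behavior below (in particular for append and partial replacement) is an assumed interpretation, and `write_to_base` is a hypothetical helper.

```python
from enum import IntEnum


class AttributeFusion(IntEnum):
    OVERWRITE = 0
    NEW = 1
    DISCARD_DUPLICATE = 2
    APPEND = 3
    PARTIAL_REPLACE = 4


def write_to_base(base, entity, new_attrs, mode):
    """Apply one fusion mode; `base` maps entity -> {attribute: list of values}."""
    current = base.get(entity)
    if mode == AttributeFusion.DISCARD_DUPLICATE:
        return base                                  # assumed: drop the incoming data
    if mode == AttributeFusion.NEW:
        if current is None:                          # assumed: only add unseen entities
            base[entity] = {k: [v] for k, v in new_attrs.items()}
        return base
    if mode == AttributeFusion.OVERWRITE or current is None:
        base[entity] = {k: [v] for k, v in new_attrs.items()}
        return base
    for name, value in new_attrs.items():
        if mode == AttributeFusion.APPEND:           # assumed: keep old values, add new
            values = current.setdefault(name, [])
            if value not in values:
                values.append(value)
        elif mode == AttributeFusion.PARTIAL_REPLACE:  # assumed: replace listed attrs only
            current[name] = [value]
    return base


base = {}
write_to_base(base, "203.0.113.7", {"country": "CN"}, AttributeFusion.NEW)
write_to_base(base, "203.0.113.7", {"country": "US", "registrant": "x"}, AttributeFusion.APPEND)
write_to_base(base, "203.0.113.7", {"country": "DE"}, AttributeFusion.PARTIAL_REPLACE)
print(base)
```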

The following describes the use of the method according to the invention by way of a specific application scenario example.

As shown in FIG. 5, part of the Smart rules formed after training apply as follows when IP data is input:

(1) for data whose recording time is empty, the duplicate data discarding operation is performed;

(2) for data whose recording time, geographic location, attack target and attack intention are all non-empty, the data append writing operation is performed;

(3) for data whose recording time, geographic location and attack target are non-empty but whose attack intention is empty, the duplicate data discarding operation is performed;

(4) for data whose recording time and geographic location are non-empty but whose attack target is empty, the duplicate data discarding operation is performed;

(5) for data whose recording time is non-empty, whose geographic location is empty, and whose country, registrant and string storage are non-empty, the new data writing operation is performed if 360 intelligence is empty but VT intelligence is non-empty, or if 360 intelligence and VT intelligence are both empty but ThreatBook intelligence is non-empty;

(6) for data whose geographic location is empty and whose country, registrant and string storage are non-empty, but whose 360 intelligence, VT intelligence and ThreatBook intelligence are all empty, the duplicate data discarding operation is performed;

(7) for data whose recording time is non-empty but whose geographic location and country are empty, the duplicate data discarding operation is performed;

(8) for data whose recording time is non-empty, whose geographic location is empty and whose country is non-empty, but whose registrant is empty, the duplicate data discarding operation is performed;

(9) for data whose recording time is non-empty, whose geographic location is empty, and whose country and registrant are non-empty but whose string storage is empty, the partial replacement writing operation is performed.
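The partial rules listed for IP data can be written out as nested conditions, which makes the underlying tree structure easier to see. This is a transcription sketch only: the function name, the attribute keys and the returned operation names are hypothetical, and the branch where 360 intelligence is present is not described in the text, so the function returns `None` there rather than guessing.

```python
def ip_fusion_rule(flags):
    """Return the operation for an IP record given attribute presence flags."""
    def has(name):
        return bool(flags.get(name))

    if not has("recording_time"):
        return "discard_duplicate"
    if has("geographic_location"):
        if not has("attack_target"):
            return "discard_duplicate"
        return "append_write" if has("attack_intention") else "discard_duplicate"
    # From here on the geographic location is empty.
    if not has("country") or not has("registrant"):
        return "discard_duplicate"
    if not has("string_storage"):
        return "partial_replace_write"
    if has("intel_360"):
        return None  # this branch is not part of the described excerpt
    if has("intel_vt") or has("intel_threatbook"):  # "micro-step" intelligence in the text
        return "new_write"
    return "discard_duplicate"


example = {"recording_time": 1, "country": 1, "registrant": 1,
           "string_storage": 1, "intel_vt": 1}
print(ip_fusion_rule(example))
```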

Example 2

Further, as an implementation of the method shown in the above embodiment, another embodiment of the present invention further provides an intelligence data fusion apparatus. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method. In the apparatus of this embodiment, there are the following modules:

1. A preprocessing module, configured to preprocess raw network intelligence data to obtain structured data that conforms to the intelligence base data model. This module corresponds to step S1 in Example 1.

The preprocessing module comprises the following sub-modules:

an entity extraction submodule, configured to identify intelligence entities in the raw network intelligence data and to extract and save the entity fields;

an entity classification submodule, configured to classify the intelligence entities and map the entity fields onto the intelligence base data model according to the constraints of that data model;

an attribute identification submodule, configured to identify the entity attributes of the intelligence entities;

an attribute extraction submodule, configured to match the entity attributes against the intelligence base data model and to extract and process the attribute values of the matched entity attributes to form formatted entity attribute data.

2. A model training module, configured to collect a large amount of structured data and label each piece of data with its fusion mode to form training data, and to train a decision tree model on the training data by machine learning to obtain a Smart rule decision tree model. This module corresponds to step S2 in Example 1.

3. A fusion rule generation module, configured to input the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence base data model. This module corresponds to step S3 in Example 1.

4. A data writing module, configured to write the structured data into the intelligence base according to the fusion rule. This module corresponds to step S4 in Example 1.
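The way the four modules hand data to one another can be sketched as a small composition. The class `FusionDevice`, its method names and the stand-in callables below are hypothetical; they only illustrate the data flow between the modules, not how any module is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class FusionDevice:
    preprocess: Callable[[Any], dict]                 # module 1 (step S1)
    train: Callable[[List[dict], List[int]], Any]     # module 2 (step S2)
    generate_rule: Callable[[Any, dict], int]         # module 3 (step S3)
    write: Callable[[dict, int], None]                # module 4 (step S4)
    model: Any = None

    def fit(self, raw_records, labels):
        """Preprocess labeled raw records and train the rule model."""
        structured = [self.preprocess(r) for r in raw_records]
        self.model = self.train(structured, labels)

    def fuse(self, raw_record):
        """Preprocess one record, derive its fusion rule, and write it."""
        structured = self.preprocess(raw_record)
        rule = self.generate_rule(self.model, structured)
        self.write(structured, rule)
        return rule


# Trivial stand-ins so the flow can be run end to end.
written = []
device = FusionDevice(
    preprocess=lambda raw: {"has_time": int(bool(raw.get("time")))},
    train=lambda rows, labels: dict(zip((r["has_time"] for r in rows), labels)),
    generate_rule=lambda model, row: model[row["has_time"]],
    write=lambda row, rule: written.append((row, rule)),
)
device.fit([{"time": "2021-05-01"}, {}], [3, 2])
print(device.fuse({"time": "2021-06-01"}), written)
```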

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Claims

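1. An intelligence data fusion method, comprising:

S1, preprocessing raw network intelligence data to obtain structured data that conforms to an intelligence base data model;

S2, collecting a large amount of structured data and labeling each piece of data with its fusion mode to form training data; training a decision tree model on the training data by machine learning to obtain a Smart rule decision tree model;

S3, inputting the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence base data model;

S4, writing the structured data into the intelligence base according to the fusion rule.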

2. The intelligence data fusion method of claim 1, wherein the machine learning training of the decision tree model using the training data is specifically training using a decision tree ID3 classification algorithm.

3. The intelligence data fusion method of claim 1, wherein the preprocessing comprises:

S101, entity extraction: identifying the intelligence entities in the raw network intelligence data, and extracting and saving the entity fields;

S102, entity classification: classifying the intelligence entities, and mapping the entity fields onto the intelligence base data model according to the constraints of that data model;

S103, attribute identification: identifying the entity attributes of the intelligence entities;

S104, attribute extraction: matching the entity attributes against the intelligence base data model, and extracting and processing the attribute values of the matched entity attributes to form formatted entity attribute data.

4. The intelligence data fusion method of claim 2 or 3, wherein the training data is specifically:

defining m types of intelligence entities, and defining n types of entity attributes of the intelligence entities;

preprocessing each piece of raw network intelligence data to form the structured data as an (m + n)-dimensional data vector;

the fusion mode labels comprise a label for the intelligence entity fusion mode and a label for the entity attribute fusion mode;

the intelligence entity fusion modes comprise data overwrite writing, new data writing and duplicate data discarding;

the entity attribute fusion modes comprise data overwrite writing, new data writing, duplicate data discarding, data append writing and partial replacement writing.

5. The intelligence data fusion method of claim 4, wherein the training using decision tree ID3 classification algorithm is specifically:

step one: computing the current information entropy of the training data, computing the branch information entropy for each of the n entity attributes, computing the conditional entropy from the branch information entropies, and from these computing the information gain of each of the n attributes; selecting the attribute with the maximum information gain as a decision point and adding it to the decision tree;

step two: removing the column of the attribute with the maximum information gain from the training data, and repeating step one on the remaining training data until all entity attributes have been added to the decision tree.

6. An intelligence data fusion device, comprising:

a preprocessing module: is configured to preprocess raw network intelligence data to obtain structured data that conforms to an intelligence repository data model;

a model training module: configured to collect a large amount of structured data and label each piece of data with its fusion mode to form training data, and to train a decision tree model on the training data by machine learning to obtain a Smart rule decision tree model;

a fusion rule generation module: configured to input the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence base data model;

a data writing module: configured to write the structured data to the intelligence repository according to the fusion rule.

7. An intelligence data fusion apparatus according to claim 6, wherein the machine learning training of the decision tree model using the training data is specifically training using a decision tree ID3 classification algorithm.

8. The intelligence data fusion apparatus of claim 6, wherein the preprocessing module comprises:

9. An intelligence data fusion apparatus according to claim 7 or 8, wherein the training data is specifically:

10. The intelligence data fusion apparatus of claim 9, wherein the training using decision tree ID3 classification algorithm is specifically:

step two: removing the column of the attribute with the maximum information gain from the training data, and repeating step one on the remaining training data until all entity attributes have been added to the decision tree.