CN113254641A - Information data fusion method and device - Google Patents

Info

Publication number
CN113254641A
Authority
CN
China
Prior art keywords
data
entity
intelligence
fusion
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110588184.8A
Other languages
Chinese (zh)
Other versions
CN113254641B (en)
Inventor
任传伦
王淮
刘晓影
乌吉斯古愣
俞赛赛
张先国
王玥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
CETC 30 Research Institute
Original Assignee
CETC 15 Research Institute
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute, CETC 30 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202110588184.8A priority Critical patent/CN113254641B/en
Publication of CN113254641A publication Critical patent/CN113254641A/en
Application granted granted Critical
Publication of CN113254641B publication Critical patent/CN113254641B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The invention provides an intelligence data fusion method and device. A decision tree is trained with the ID3 algorithm to generate Smart rules, and a fusion rule is selected automatically after entity extraction, entity classification, attribute identification, and attribute extraction are performed on raw network intelligence data, thereby fusing the network intelligence data. The invention mainly aims to solve the low fusion efficiency and inconsistent fusion results of existing intelligence data fusion, to achieve efficient, rapid, and standardized fusion of network intelligence data, and to reduce the dependence of such fusion on domain expert knowledge.

Description

Information data fusion method and device
Technical Field
The invention relates to the technical field of network security, in particular to an intelligence data fusion method and device.
Background
Intelligence data fusion mainly processes newly added intelligence data so that its entities and attribute values are stored. Fusion performs operations such as entity fusion and attribute fusion on the intelligence data, merging entities and attribute values into the existing intelligence base by adding new records or updating existing ones.
At present, intelligence data fusion mainly relies on program scripts to check entities and attribute values, on manual extraction of those entities and attribute values, and on manual storage of the network intelligence data in the intelligence base through back-end or front-end visual operations, guided by domain expert knowledge. New data must be written with manual intervention, and experts must take part in verifying data attributes. This approach requires a large amount of manual work and depends heavily on domain experts. When faced with massive volumes of intelligence data, fusion is difficult to complete within a limited time, so fusion efficiency is low and the fusion result varies from person to person.
Disclosure of Invention
In view of the above, the invention provides an intelligence data fusion method and apparatus, mainly aiming to solve the low fusion efficiency and inconsistent fusion results of existing intelligence data fusion. The method avoids excessive dependence on domain experts, reduces heavy manual work, designs Smart rules around the characteristics of network intelligence data (wide range of sources, missing attributes, low reliability, and so on), and achieves rapid, automatic fusion of network intelligence data.
According to an aspect of the present invention, there is provided an intelligence data fusion method, the method including the steps of: S1, preprocessing raw network intelligence data to obtain structured data that conforms to the intelligence base data model; S2, collecting a large amount of structured data and labeling each piece of data with its fusion mode to form training data, then performing machine learning training on a decision tree model using the training data to obtain a Smart rule decision tree model; S3, inputting the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence base data model; S4, writing the structured data into the intelligence base according to the fusion rule.
As a further improvement of the present invention, the machine learning training of the decision tree model using the training data is specifically training using a decision tree ID3 classification algorithm.
As a further improvement of the present invention, the preprocessing comprises: S101, entity extraction: identifying the intelligence entities in the raw network intelligence data, and extracting and saving the entity fields; S102, entity classification: classifying the intelligence entities, and mapping the entity fields onto the intelligence base data model according to the constraints of that data model; S103, attribute identification: identifying the entity attributes of the intelligence entities; S104, attribute extraction: matching the entity attributes against the intelligence base data model, and extracting and processing the attribute values of the matched entity attributes to form formatted entity attribute data.
As a further improvement of the present invention, the training data is specifically formed as follows: m types of intelligence entities are defined, and n types of entity attributes are defined for the intelligence entities; each piece of raw network intelligence data is preprocessed into structured data in the form of an (m + n)-dimensional data vector; the fusion-mode labels comprise a label for the intelligence entity fusion mode and a label for the entity attribute fusion mode; the intelligence entity fusion modes comprise data overwrite writing, newly added data writing, and duplicate data discarding; the entity attribute fusion modes comprise data overwrite writing, newly added data writing, duplicate data discarding, data append writing, and partial replacement writing.
As a further improvement of the present invention, the training using the decision tree ID3 classification algorithm specifically comprises: step one: computing the current information entropy of the training data, computing the branch information entropy under each of the n entity attributes, computing the conditional entropy from the branch information entropies, then computing the information gain of each of the n attributes, and selecting the attribute with the maximum information gain as a decision point and adding it to the decision tree; step two: removing the column of the attribute with the maximum information gain from the training data, and repeating step one on the remaining training data until all entity attributes have been added to the decision tree.
According to another aspect of the present invention, there is provided an intelligence data fusion apparatus, the apparatus comprising: a preprocessing module, configured to preprocess raw network intelligence data to obtain structured data that conforms to the intelligence base data model; a model training module, configured to collect a large amount of structured data and label each piece of data with its fusion mode to form training data, and to perform machine learning training on a decision tree model using the training data to obtain a Smart rule decision tree model; a fusion rule generation module, configured to input the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence base data model; and a data writing module, configured to write the structured data into the intelligence base according to the fusion rule.
As a further improvement of the present invention, the machine learning training of the decision tree model using the training data is specifically training using a decision tree ID3 classification algorithm.
As a further improvement of the present invention, the preprocessing module comprises: an entity extraction submodule, configured to identify the intelligence entities in the raw network intelligence data, and to extract and save the entity fields; an entity classification submodule, configured to classify the intelligence entities and to map the entity fields onto the intelligence base data model according to the constraints of that data model; an attribute identification submodule, configured to identify the entity attributes of the intelligence entities; and an attribute extraction submodule, configured to match the entity attributes against the intelligence base data model, and to extract and process the attribute values of the matched entity attributes to form formatted entity attribute data.
As a further improvement of the present invention, the training data is specifically formed as follows: m types of intelligence entities are defined, and n types of entity attributes are defined for the intelligence entities; each piece of raw network intelligence data is preprocessed into structured data in the form of an (m + n)-dimensional data vector; the fusion-mode labels comprise a label for the intelligence entity fusion mode and a label for the entity attribute fusion mode; the intelligence entity fusion modes comprise data overwrite writing, newly added data writing, and duplicate data discarding; the entity attribute fusion modes comprise data overwrite writing, newly added data writing, duplicate data discarding, data append writing, and partial replacement writing.
As a further improvement of the present invention, the training using the decision tree ID3 classification algorithm specifically comprises: step one: computing the current information entropy of the training data, computing the branch information entropy under each of the n entity attributes, computing the conditional entropy from the branch information entropies, then computing the information gain of each of the n attributes, and selecting the attribute with the maximum information gain as a decision point and adding it to the decision tree; step two: removing the column of the attribute with the maximum information gain from the training data, and repeating step one on the remaining training data until all entity attributes have been added to the decision tree.
By the technical scheme, the beneficial effects provided by the invention are as follows:
(1) A decision tree is trained with the ID3 classification algorithm on a large amount of raw intelligence data to obtain a Smart rule decision tree model. Given input intelligence data, the model automatically generates the rules for fusing the data's entities and entity attributes with the intelligence base data model, so that intelligence data can be fused and stored automatically.
(2) Because the trained Smart rule decision tree model generates the fusion rules, extensive manual work on each piece of intelligence data is avoided and fusion efficiency is improved.
(3) The excessive dependence on domain experts and the inconsistent fusion results of manual fusion are avoided.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a flowchart of an intelligence data fusion method provided by an embodiment of the present invention;
FIG. 2 is a flow chart showing the data preprocessing steps in an intelligence data fusion method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating network intelligence data entity classification and attribute classification in an intelligence data fusion method according to an embodiment of the present invention;
FIG. 4 is a flow chart of decision tree training in an intelligence data fusion method according to an embodiment of the present invention;
FIG. 5 shows an example of Smart rules generated in an intelligence data fusion method according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Decision tree algorithm: a decision tree is a tree built from a sequence of decisions. In machine learning, a decision tree is a predictive model representing a mapping between object attributes and object values: each internal node represents a test on an attribute, each branch represents a possible attribute value, and each leaf node corresponds to the value of the object reached along the path from the root node to that leaf. A decision tree has only a single output; if multiple outputs are needed, separate decision trees can be built to handle the different outputs.
Information entropy: information entropy is a concept used in information theory to measure the amount of information. The more ordered a system is, the lower the information entropy is; conversely, the more chaotic a system is, the higher the entropy of the information becomes. Therefore, entropy can also be said to be a measure of the degree of system ordering.
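The definition above can be illustrated with a minimal sketch in Python. The function name `entropy` and the label strings are assumptions chosen for illustration; the patent does not prescribe an implementation. A set whose samples all carry the same label is fully ordered and has entropy 0, while an evenly mixed two-label set has entropy 1 bit.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of -p * log2(p) over the classes that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical fusion-mode labels, used only to show ordered vs. mixed sets.
ordered = ["discard"] * 8
mixed = ["discard"] * 4 + ["append"] * 4

print(entropy(ordered))  # 0.0
print(entropy(mixed))    # 1.0
```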
The ID3 algorithm: ID3 is a decision tree algorithm based on the principle of Occam's razor, that is, accomplishing more with as little as possible. ID3 (Iterative Dichotomiser 3) was invented by Ross Quinlan. Following Occam's razor, it prefers smaller decision trees over larger ones; however, it is a heuristic and does not always produce the smallest tree. In information theory, the smaller the expected information (entropy) remaining after a split, the greater the information gain and the higher the purity of the resulting subsets. The core idea of ID3 is to use information gain as the attribute selection measure and to split on the attribute with the largest information gain. The algorithm traverses the space of possible decision trees with a top-down greedy search.
Example 1
As shown in fig. 1, the method for fusing information data is divided into several stages in implementation, including preprocessing, generating a fusion rule, and fusing information data and information database data according to the rule.
S1, preprocessing the original network intelligence data to obtain the structured data in accordance with the intelligence database data model;
Table 1 is an exemplary intelligence base data model:
TABLE 1 Intelligence base data model
[Table 1 appears as an image in the original publication: Figure BDA0003087777490000061]
FIG. 2 is a flow chart showing the data preprocessing steps in an intelligence data fusion method according to an embodiment of the present invention;
As shown in FIG. 2, the preprocessing step mainly processes raw intelligence data. The data sources can include ThreatBook (micro-step online) report data, 360 threat intelligence data, Knownsec (Zhidao Chuangyu) intelligence data, QiAnXin intelligence data, self-developed intelligence data, and the like. The intelligence attributes are determined according to the data source, and the open-source HanLP NLP toolkit is used to extract intelligence entities and entity attribute values. After the data preprocessing stage, the network intelligence data becomes structured data that conforms to the intelligence base data model. Preprocessing consists of entity extraction, entity classification, attribute identification, and attribute value extraction.
S101, entity extraction: identify the entities in the network intelligence data, using the HanLP toolkit to accurately recognize the entity fields, then extract and save them.
S102, entity classification: classify the extracted entities, and map the extracted entity fields onto the intelligence base data model according to the constraints of that data model.
Illustratively, the entity fields are categorized as in Table 1: IP/domain name/sample/URL/account/APT organization.
S103, attribute identification: identify the attributes related to each entity, using the HanLP toolkit to accurately recognize the entity attributes of the network intelligence data.
S104, attribute extraction: match the obtained entity attributes against the intelligence base data model, then extract and process the attribute values of the matched entity attributes to form formatted attribute data.
Illustratively, the entity attributes include those in Table 1: geographic location, country, time of recording, registrant, process behavior, etc.
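Steps S101 to S104 can be sketched roughly as follows. This is a simplified stand-in, not the patented implementation: the patent performs entity and attribute recognition with the HanLP toolkit, whereas this sketch swaps in plain regular expressions so that it stays self-contained. The pattern table, the alias table, and all function and field names are hypothetical.

```python
import re

# Hypothetical patterns standing in for HanLP-based entity recognition (S101/S102).
ENTITY_PATTERNS = {
    "URL": re.compile(r"https?://[^\s,;]+"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "sample": re.compile(r"\b[a-fA-F0-9]{32}\b"),  # MD5-style file hash
}

# Hypothetical mapping from source field names to data-model attribute names (S103).
ATTRIBUTE_ALIASES = {
    "geo": "geographic_location",
    "country": "country",
    "seen": "recording_time",
}


def extract_entities(text):
    """S101/S102: find entity fields in raw text and tag each with its entity type."""
    found = []
    for entity_type, pattern in ENTITY_PATTERNS.items():
        for value in pattern.findall(text):
            found.append({"entity_type": entity_type, "entity": value})
    return found


def preprocess(text, raw_attributes):
    """S103/S104: keep only attributes known to the data model and normalise their values."""
    attributes = {}
    for name, value in raw_attributes.items():
        cleaned = str(value).strip()
        if name in ATTRIBUTE_ALIASES and cleaned:
            attributes[ATTRIBUTE_ALIASES[name]] = cleaned
    return [dict(entity, attributes=dict(attributes)) for entity in extract_entities(text)]


records = preprocess("Beacon to 203.0.113.7 observed", {"geo": " Beijing ", "note": "x"})
print(records)
```

Attributes that the data model does not know about (here `note`) are dropped, which mirrors the matching step in S104.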
S2, collecting a large amount of structured data and labeling each piece of data with its fusion mode to form training data; performing machine learning training on the decision tree model using the training data to obtain a Smart rule decision tree model;
S21, domain experts study and judge the intelligence data against the data in the intelligence base and label the fusion mode of each piece of preprocessed intelligence data, forming the training set. The decision tree ID3 classification algorithm is trained on the entity types and entity attributes of the intelligence data, with the labeled fusion modes as the targets. Training yields the Smart rule decision tree model, which generates the fusion rules for intelligence data.
The fusion rules comprise entity fusion rules (data overwrite writing, newly added data writing, duplicate data discarding) and entity attribute fusion rules (data overwrite writing, newly added data writing, duplicate data discarding, data append writing, partial replacement writing).
Based on the ID3 decision tree algorithm, the network intelligence data is classified by entity fusion rule and by entity attribute fusion rule, which forms the Smart rules. The process has three stages: intelligence data classification, construction of the intelligence data training set, and decision tree training.
S211, intelligence data classification: construct the entity types and attribute types of the intelligence data.
Fig. 3 is a schematic diagram illustrating network intelligence data entity classification and attribute classification in an intelligence data fusion method according to an embodiment of the present invention;
As shown in FIG. 3, the intelligence data is divided by entity into IP, domain name, sample, URL, account, and APT organization. To make effective use of the ID3 decision tree algorithm, and following the idea of one-hot encoding, 18 attributes are defined for each entity: geographic location, affiliated institution, country, recording time, registrant, process behavior, attack target, attack intention, string storage, list storage, set storage, file storage, 360 intelligence, VT intelligence, ThreatBook (micro-step) intelligence, QiAnXin intelligence, Knownsec (Zhidao Chuangyu) intelligence, and self-produced intelligence. Each piece of intelligence data is thereby mapped to an 18-dimensional data vector. This step maps network intelligence data of varied forms to a fixed-length data vector, which both matches the logic of manual intelligence fusion and satisfies the input format required for machine learning computation.
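The mapping to a fixed-length vector might look like the sketch below. The attribute identifiers are hypothetical English names for the 18 attributes listed above, and the record layout (a dict with an `attributes` field) is an assumption; the 1/0 presence encoding follows the description of the training set.

```python
# Hypothetical identifiers for the 18 attributes named in the description, in order.
ATTRIBUTES = [
    "geographic_location", "affiliated_institution", "country", "recording_time",
    "registrant", "process_behavior", "attack_target", "attack_intention",
    "string_storage", "list_storage", "set_storage", "file_storage",
    "intel_360", "intel_vt", "intel_microstep", "intel_qianxin",
    "intel_knownsec", "intel_self_produced",
]


def encode(record):
    """Map one structured record to an 18-dimensional 0/1 presence vector."""
    attrs = record.get("attributes", {})
    # 1 when the attribute carries data, 0 when it is missing or empty.
    return [1 if attrs.get(name) else 0 for name in ATTRIBUTES]


record = {
    "entity_type": "IP",
    "entity": "203.0.113.7",
    "attributes": {"country": "CN", "recording_time": "2021-05-01", "intel_vt": ""},
}
vector = encode(record)
print(len(vector), vector)
```

The entity type is kept next to the vector rather than inside it here; an (m + n)-dimensional variant would prepend a one-hot block for the m entity types.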
S212, constructing the intelligence data training set: raw intelligence data is obtained from open-source intelligence, third-party intelligence, and self-produced intelligence. Each piece of intelligence data is decomposed into an 18-dimensional data vector and labeled with the fusion mode of its entity and of its entity attributes. The entity fusion rule offers 3 choices, namely data overwrite writing, newly added data writing, and duplicate data discarding, labeled 0, 1, and 2 respectively. The attribute fusion rule offers 5 choices, namely data overwrite writing, newly added data writing, duplicate data discarding, data append writing, and partial replacement writing, labeled 0, 1, 2, 3, and 4 respectively. The attributes of the network intelligence data are encoded so that an attribute with data is marked 1 and an attribute without data is marked 0, which completes the construction of the network intelligence data training set. Table 2 is an example of a network intelligence data training set.
TABLE 2 Example of a network intelligence data training set
[Table 2 appears as an image in the original publication: Figure BDA0003087777490000091]
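Since the table itself is not reproduced, the label coding described in S212 can be sketched as follows. The numeric codes 0 to 2 and 0 to 4 come from the text; the class names, field names, and the sample row are hypothetical and are not taken from Table 2.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class EntityFusion(IntEnum):
    """Entity fusion modes with the numeric labels given in S212."""
    OVERWRITE = 0
    ADD_NEW = 1
    DISCARD_DUPLICATE = 2


class AttributeFusion(IntEnum):
    """Entity attribute fusion modes with the numeric labels given in S212."""
    OVERWRITE = 0
    ADD_NEW = 1
    DISCARD_DUPLICATE = 2
    APPEND = 3
    PARTIAL_REPLACE = 4


@dataclass
class TrainingSample:
    entity_type: str
    features: List[int]            # 18 presence flags, 1 = attribute has data
    entity_label: EntityFusion
    attribute_label: AttributeFusion


# Invented example row, for illustration only.
sample = TrainingSample(
    entity_type="IP",
    features=[1, 0, 1, 1] + [0] * 14,
    entity_label=EntityFusion.ADD_NEW,
    attribute_label=AttributeFusion.APPEND,
)
print(int(sample.entity_label), int(sample.attribute_label))
```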
S213, decision tree training: compute the information entropy of the current intelligence data training set. Let p_k denote the proportion of samples in the current set D that belong to the k-th class (fusion-mode label). The information entropy of D, written Ent(D), is defined in the standard ID3 form as

$$\mathrm{Ent}(D) = -\sum_{k} p_k \log_2 p_k$$

Next, D is divided into subsets according to the values of attribute k, the information entropy Ent(D_v) of each subset D_v is computed, and the subset entropies are weighted, with each weight defined as the ratio of the number of samples in the subset to the total number of samples. In Gain(D, k), k denotes the attribute being split on rather than the class index used in Ent(D). The information gain of attribute k is

$$\mathrm{Gain}(D, k) = \mathrm{Ent}(D) - \sum_{v} \frac{|D_v|}{|D|}\,\mathrm{Ent}(D_v)$$
The attribute with the maximum information gain is selected as a decision point and added to the decision tree. The column of that attribute is then removed from the training set, and the process is repeated until no attributes remain in the set D. The decision tree training process is shown in FIG. 4.
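The training loop can be sketched as below. This is a textbook ID3 under stated assumptions, not the patented code: samples are dicts of 0/1 attribute flags, labels are strings, and the tree is a nested dict. One deliberate difference is that this sketch also stops early when a subset is pure, which is the usual ID3 stopping rule, whereas the description repeats until no attributes remain. The toy data is invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Ent(D): Shannon entropy of the class labels, in bits."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Gain(D, attr) = Ent(D) minus the size-weighted entropy of each subset D_v."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    conditional = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional


def build_tree(rows, labels, attrs):
    """Recursively split on the attribute with the largest information gain."""
    if len(set(labels)) == 1:
        return labels[0]                       # pure subset: leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]  # no attributes left: majority label
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]      # remove the chosen attribute column
    node = {"attribute": best, "branches": {}}
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


# Invented toy training set: two presence flags and a fusion-mode label per sample.
rows = [
    {"recording_time": 0, "geographic_location": 0},
    {"recording_time": 0, "geographic_location": 1},
    {"recording_time": 1, "geographic_location": 1},
    {"recording_time": 1, "geographic_location": 0},
]
labels = ["discard", "discard", "append", "add_new"]

tree = build_tree(rows, labels, ["recording_time", "geographic_location"])
print(tree)
```

On this toy set, splitting on `recording_time` yields a gain of 1.0 bit against 0.5 bit for `geographic_location`, so `recording_time` becomes the root decision point.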
S3, inputting the structured data into a Smart rule decision tree model to obtain a fusion rule of the structured data and the intelligence database data model;
The trained Smart rule decision tree model yields Smart rules for the fusion mode of network intelligence data, providing intelligent and convenient entity fusion rules and entity attribute fusion rules for fusing that data. This step has two stages: intelligence data decomposition and fusion rule calculation.
S31, intelligence data decomposition: decompose the intelligence data to be fused to obtain its entity type and entity attributes, which form the input data for the ID3 decision tree.
S32, fusion rule calculation: use the trained decision tree to obtain the entity fusion rule and the entity attribute fusion rule for the current intelligence data, that is, the Smart rule.
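Obtaining a fusion rule from a trained tree amounts to walking from the root to a leaf. The sketch below assumes a nested-dict tree with `attribute` and `branches` keys and 0/1 presence flags as input; this representation, the function name, and the example tree are illustrative assumptions rather than the format used in the patent.

```python
def classify(tree, flags, default=None):
    """Walk the tree using the record's presence flags and return the leaf label."""
    node = tree
    while isinstance(node, dict):
        # An attribute missing from the input is treated as empty (0).
        value = flags.get(node["attribute"], 0)
        if value not in node["branches"]:
            return default  # attribute value never seen during training
        node = node["branches"][value]
    return node


# Hypothetical trained tree in nested-dict form.
tree = {
    "attribute": "recording_time",
    "branches": {
        0: "discard",
        1: {"attribute": "geographic_location", "branches": {0: "add_new", 1: "append"}},
    },
}

print(classify(tree, {"recording_time": 1, "geographic_location": 1}))
```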
S4, writing the structured data into the intelligence base according to the fusion rule.
The preprocessed network intelligence data is processed according to the Smart rule obtained in the previous step and written into the intelligence base, which completes the fusion of the network intelligence data.
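A minimal sketch of the write step follows. The patent names the five attribute fusion modes but does not spell out their exact semantics, so the behaviour of each branch here is an assumption, as are the in-memory dict standing in for the intelligence base and all names used.

```python
def fuse_attribute(store, entity, attribute, value, mode):
    """Apply one attribute fusion mode to an in-memory stand-in for the intelligence base."""
    if mode == "discard":
        return store                              # assumed: duplicate data, nothing written
    attrs = store.setdefault(entity, {})
    existing = attrs.get(attribute)
    if mode == "overwrite":
        attrs[attribute] = value                  # assumed: replace whatever was stored
    elif mode == "add_new":
        attrs.setdefault(attribute, value)        # assumed: write only if nothing is stored
    elif mode == "append":
        # assumed: keep earlier values and add the new one to a list
        items = existing if isinstance(existing, list) else ([] if existing is None else [existing])
        attrs[attribute] = items + [value]
    elif mode == "partial_replace":
        # assumed: value is a dict of sub-fields; only those sub-fields change
        merged = dict(existing) if isinstance(existing, dict) else {}
        merged.update(value)
        attrs[attribute] = merged
    else:
        raise ValueError(f"unknown fusion mode: {mode}")
    return store


store = {"203.0.113.7": {"attack_target": "bank"}}
fuse_attribute(store, "203.0.113.7", "attack_target", "telecom", "append")
fuse_attribute(store, "203.0.113.7", "country", "CN", "add_new")
fuse_attribute(store, "203.0.113.7", "country", "US", "discard")
print(store)
```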
The following describes the use of the method according to the invention by way of a specific application scenario example.
Fig. 5 shows an example of Smart rules generated in an intelligence data fusion method according to an embodiment of the present invention.
As shown in FIG. 5, some of the Smart rules formed after training apply to input IP data as follows: for data whose recording time is empty, the duplicate data discarding operation is performed; for data whose recording time, geographic location, attack target, and attack intention are all non-empty, the data append write operation is performed; for data whose recording time, geographic location, and attack target are non-empty but whose attack intention is empty, the duplicate data discarding operation is performed; for data whose recording time and geographic location are non-empty but whose attack target is empty, the duplicate data discarding operation is performed; for data whose recording time is non-empty, whose geographic location is empty, and whose country, registrant, and string storage are non-empty, where either the 360 intelligence is empty but the VT intelligence is non-empty, or the 360 intelligence and VT intelligence are both empty but the ThreatBook (micro-step) intelligence is non-empty, the newly added data write operation is performed; for data whose geographic location is empty and whose country, registrant, and string storage are non-empty, but whose 360 intelligence, VT intelligence, and ThreatBook (micro-step) intelligence are all empty, the duplicate data discarding operation is performed; for data whose recording time is non-empty but whose geographic location and country are empty, the duplicate data discarding operation is performed; for data whose recording time is non-empty, whose geographic location is empty, and whose country is non-empty but whose registrant is empty, the duplicate data discarding operation is performed; and for data whose recording time is non-empty, whose geographic location is empty, and whose country and registrant are non-empty but whose string storage is empty, the partial replacement write operation is performed.
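The rules listed for IP data can be transcribed into a nested conditional, as sketched below. The function and field names are hypothetical. The text describes only part of the Smart rules, so the case it does not cover (geographic location empty with non-empty 360 intelligence) returns `None` here instead of a guessed mode, and the rule that does not mention recording time is assumed to sit under the non-empty recording time branch.

```python
def ip_fusion_rule(flags):
    """Return the fusion mode for IP data, given a dict of attribute -> non-empty flag."""

    def has(name):
        return bool(flags.get(name))

    if not has("recording_time"):
        return "discard_duplicate"
    if has("geographic_location"):
        if not has("attack_target"):
            return "discard_duplicate"
        return "append" if has("attack_intention") else "discard_duplicate"
    # Geographic location is empty from here on.
    if not has("country") or not has("registrant"):
        return "discard_duplicate"
    if not has("string_storage"):
        return "partial_replace"
    if has("intel_360"):
        return None  # not covered by the rules described in the text
    if has("intel_vt") or has("intel_microstep"):
        return "add_new"
    return "discard_duplicate"


example = {
    "recording_time": True, "country": True, "registrant": True,
    "string_storage": True, "intel_vt": True,
}
print(ip_fusion_rule(example))
```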
Example 2
Further, as an implementation of the method shown in the above embodiment, another embodiment of the present invention further provides an intelligence data fusion apparatus. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method. In the apparatus of this embodiment, there are the following modules:
1. A preprocessing module: configured to preprocess raw network intelligence data to obtain structured data that conforms to the intelligence repository data model; this module corresponds to step S1 in Example 1.
The module comprises the following submodules:
an entity extraction submodule: configured to identify intelligence entities in the raw network intelligence data, and to extract and save entity fields;
an entity classification submodule: configured to classify the intelligence entities and to map the entity fields onto the intelligence repository data model according to the constraints of that data model;
an attribute identification submodule: configured to identify the entity attributes of the intelligence entities;
an attribute extraction submodule: configured to match the entity attributes against the intelligence repository data model, and to extract and process the attribute values of the matched entity attributes to form formatted entity attribute data.
2. A model training module: configured to collect a large amount of structured data and label each piece of data with its fusion mode to form training data, and to perform machine learning training on the decision tree model using the training data to obtain a Smart rule decision tree model; this module corresponds to step S2 in Example 1.
3. A fusion rule generation module: configured to input the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence repository data model; this module corresponds to step S3 in Example 1.
4. A data writing module: configured to write the structured data to the intelligence repository according to the fusion rule; this module corresponds to step S4 in Example 1.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Claims (10)

1. An intelligence data fusion method, comprising:
S1, preprocessing raw network intelligence data to obtain structured data that conforms to an intelligence repository data model;
S2, collecting a large amount of the structured data and labeling each piece of data with a fusion mode to form training data; performing machine learning training on a decision tree model using the training data to obtain a Smart rule decision tree model;
S3, inputting the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence repository data model;
S4, writing the structured data into the intelligence repository according to the fusion rule.
2. The intelligence data fusion method of claim 1, wherein the machine learning training of the decision tree model using the training data is specifically training using a decision tree ID3 classification algorithm.
3. The intelligence data fusion method of claim 1, wherein the preprocessing comprises:
S101, entity extraction: identifying intelligence entities in the raw network intelligence data, and extracting and storing entity fields;
S102, entity classification: classifying the intelligence entities, and mapping the entity fields onto the intelligence repository data model according to the constraints of the intelligence repository data model;
S103, attribute identification: identifying entity attributes of the intelligence entities;
S104, attribute extraction: matching the entity attributes against the intelligence repository data model, and extracting and processing the attribute values of the matched entity attributes to form formatted entity attribute data.
4. The intelligence data fusion method of claim 2 or 3, wherein the training data is specifically:
m types of intelligence entities and n types of entity attributes of the intelligence entities are defined;
each piece of raw network intelligence data is preprocessed to form structured data in the form of an (m + n)-dimensional data vector;
the fusion mode labels comprise labels of an intelligence entity fusion mode and labels of an entity attribute fusion mode;
the intelligence entity fusion mode comprises data overwrite writing, data new-addition writing and duplicate data discarding;
the entity attribute fusion mode comprises data overwrite writing, data new-addition writing, duplicate data discarding, data append writing and partial replacement writing.
5. The intelligence data fusion method of claim 4, wherein the training using decision tree ID3 classification algorithm is specifically:
step one: calculating the current information entropy of the training data, calculating the branch information entropy under each of the n entity attributes, calculating the conditional entropy from the branch information entropies, then calculating the information gain of each of the n attributes, and selecting the attribute with the maximum information gain as a decision point and adding it to the decision tree;
step two: removing the column of the attribute with the maximum information gain from the training data, and repeating step one on the remaining training data until all entity attributes have been added to the decision tree.
6. An intelligence data fusion device, comprising:
a preprocessing module: configured to preprocess raw network intelligence data to obtain structured data that conforms to an intelligence repository data model;
a model training module: configured to collect a large amount of the structured data and label each piece of data with a fusion mode to form training data, and to perform machine learning training on a decision tree model using the training data to obtain a Smart rule decision tree model;
a fusion rule generation module: configured to input the structured data into the Smart rule decision tree model to obtain a fusion rule between the structured data and the intelligence repository data model;
a data writing module: configured to write the structured data to the intelligence repository according to the fusion rule.
7. The intelligence data fusion device of claim 6, wherein the machine learning training of the decision tree model using the training data is specifically training using a decision tree ID3 classification algorithm.
8. The intelligence data fusion device of claim 6, wherein the preprocessing module comprises:
an entity extraction submodule: configured to identify intelligence entities in the raw network intelligence data, and extract and store entity fields;
an entity classification submodule: configured to classify the intelligence entities and map the entity fields onto the intelligence repository data model according to the constraints of the intelligence repository data model;
an attribute identification submodule: configured to identify entity attributes of the intelligence entities;
an attribute extraction submodule: configured to match the entity attributes against the intelligence repository data model, and extract and process the attribute values of the matched entity attributes to form formatted entity attribute data.
9. The intelligence data fusion device of claim 7 or 8, wherein the training data is specifically:
m types of intelligence entities and n types of entity attributes of the intelligence entities are defined;
each piece of raw network intelligence data is preprocessed to form structured data in the form of an (m + n)-dimensional data vector;
the fusion mode labels comprise labels of an intelligence entity fusion mode and labels of an entity attribute fusion mode;
the intelligence entity fusion mode comprises data overwrite writing, data new-addition writing and duplicate data discarding;
the entity attribute fusion mode comprises data overwrite writing, data new-addition writing, duplicate data discarding, data append writing and partial replacement writing.
10. The intelligence data fusion device of claim 9, wherein the training using the decision tree ID3 classification algorithm is specifically:
step one: calculating the current information entropy of the training data, calculating the branch information entropy under each of the n entity attributes, calculating the conditional entropy from the branch information entropies, then calculating the information gain of each of the n attributes, and selecting the attribute with the maximum information gain as a decision point and adding it to the decision tree;
step two: removing the column of the attribute with the maximum information gain from the training data, and repeating step one on the remaining training data until all entity attributes have been added to the decision tree.
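
The entropy and information-gain computation recited in steps one and two can be sketched in Python as follows. This is an illustrative sketch only, not the claimed implementation: the function names, the dict-per-record data layout, and the toy attributes `exists` and `newer` are hypothetical. The claim wording does not state whether step one is repeated on each branch subset, so the sketch assumes the standard ID3 recursion over branch subsets.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of fusion-mode labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Current entropy minus the conditional entropy after splitting on `attr`."""
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[attr], []).append(label)
    # Conditional entropy: branch entropies weighted by branch size.
    conditional = sum(len(b) / len(labels) * entropy(b) for b in branches.values())
    return entropy(labels) - conditional


def build_id3(rows, labels, attrs):
    """Return a nested dict tree whose leaves are fusion-mode labels."""
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]  # majority label
    # Step one: pick the attribute with the maximum information gain.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    # Step two: drop that attribute and repeat on the remaining ones.
    remaining = [a for a in attrs if a != best]
    tree = {"attribute": best, "branches": {}}
    for value in set(row[best] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree["branches"][value] = build_id3(
            [r for r, _ in pairs], [l for _, l in pairs], remaining
        )
    return tree


# Hypothetical toy training data: two attributes, three fusion-mode labels.
rows = [
    {"exists": "no", "newer": "yes"},
    {"exists": "no", "newer": "no"},
    {"exists": "yes", "newer": "yes"},
    {"exists": "yes", "newer": "no"},
]
labels = ["add_new", "add_new", "overwrite", "discard"]
tree = build_id3(rows, labels, ["exists", "newer"])
```

On this toy data the label entropy is 1.5 bits, splitting on `exists` gains 1.0 bit and splitting on `newer` gains 0.5 bit, so `exists` becomes the root decision point.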
CN202110588184.8A 2021-05-27 2021-05-27 Information data fusion method and device Active CN113254641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110588184.8A CN113254641B (en) 2021-05-27 2021-05-27 Information data fusion method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110588184.8A CN113254641B (en) 2021-05-27 2021-05-27 Information data fusion method and device

Publications (2)

Publication Number Publication Date
CN113254641A true CN113254641A (en) 2021-08-13
CN113254641B CN113254641B (en) 2021-11-16

Family

ID=77185022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110588184.8A Active CN113254641B (en) 2021-05-27 2021-05-27 Information data fusion method and device

Country Status (1)

Country Link
CN (1) CN113254641B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925757A (en) * 2022-05-09 2022-08-19 中国电信股份有限公司 Multi-source threat intelligence fusion method, device, equipment and storage medium
CN115630288A (en) * 2022-12-20 2023-01-20 中国电子科技集团公司第十四研究所 Multi-source characteristic multi-level comprehensive identification processing framework

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
US20160014094A1 (en) * 2014-07-10 2016-01-14 Empire Technology Development Llc Protection of private data
CN108052528A (en) * 2017-11-09 2018-05-18 华中科技大学 A kind of storage device sequential classification method for early warning
CN108375808A (en) * 2018-03-12 2018-08-07 南京恩瑞特实业有限公司 Dense fog forecasting procedures of the NRIET based on machine learning
CN109508453A (en) * 2018-09-28 2019-03-22 西南电子技术研究所(中国电子科技集团公司第十研究所) Across media information target component correlation analysis systems and its association analysis method
CN109614904A (en) * 2018-12-03 2019-04-12 东北大学 A kind of activity recognition method of the Multi-sensor Fusion based on Shapelet
CN110365666A (en) * 2019-07-01 2019-10-22 中国电子科技集团公司第十五研究所 Multiterminal fusion collaboration command system of the military field based on augmented reality
CN110709864A (en) * 2017-08-30 2020-01-17 谷歌有限责任公司 Man-machine loop interactive model training
CN110781249A (en) * 2019-10-16 2020-02-11 华电国际电力股份有限公司技术服务分公司 Knowledge graph-based multi-source data fusion method and device for thermal power plant
US20200074052A1 (en) * 2018-08-28 2020-03-05 International Business Machines Corporation Intelligent user identification
CN111078868A (en) * 2019-06-04 2020-04-28 中国人民解放军92493部队参谋部 Knowledge graph analysis-based equipment test system planning decision method and system
CN111126504A (en) * 2019-12-27 2020-05-08 西北工业大学 Multi-source incomplete information fusion image target classification method
CN111149141A (en) * 2017-09-04 2020-05-12 Nng软件开发和商业有限责任公司 Method and apparatus for collecting and using sensor data from a vehicle
CN111222536A (en) * 2019-11-19 2020-06-02 南京林业大学 City green space information extraction method based on decision tree classification
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
US20200201840A1 (en) * 2018-12-21 2020-06-25 Amadeus S.A.S. Self-adaptive data source aggregation system and method
CN111738343A (en) * 2020-06-24 2020-10-02 杭州电子科技大学 Image labeling method based on semi-supervised learning
CN111767325A (en) * 2020-09-03 2020-10-13 国网浙江省电力有限公司营销服务中心 Multi-source data deep fusion method based on deep learning
CN111914037A (en) * 2020-07-15 2020-11-10 国网能源研究院有限公司 Power grid development-oriented multivariate information mining and analyzing method and system
CN112286901A (en) * 2019-11-19 2021-01-29 中建材信息技术股份有限公司 Database fusion association system
EP3789935A1 (en) * 2019-09-03 2021-03-10 Sap Se Automated data processing based on machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
L. WANG et al.: "Research on Multi-source Data Security Protection of Smart Grid Based on Quantum Key Combination", 《2019 IEEE 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA ANALYSIS》 *
王淮 et al.: "Network Threat Intelligence Correlation Analysis Technology" (网络威胁情报关联分析技术), 《信息技术》 (Information Technology) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925757A (en) * 2022-05-09 2022-08-19 中国电信股份有限公司 Multi-source threat intelligence fusion method, device, equipment and storage medium
CN114925757B (en) * 2022-05-09 2023-10-03 中国电信股份有限公司 Multisource threat information fusion method, device, equipment and storage medium
CN115630288A (en) * 2022-12-20 2023-01-20 中国电子科技集团公司第十四研究所 Multi-source characteristic multi-level comprehensive identification processing framework

Also Published As

Publication number Publication date
CN113254641B (en) 2021-11-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant