CN114398428A - Data analysis method, device, equipment and storage medium - Google Patents

Publication number
CN114398428A
Authority
CN
China
Prior art keywords
data
temporary table
field
threat intelligence
storage field
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210060977.7A
Other languages
Chinese (zh)
Inventor
吴脂娟
郝伟
刘加瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Huayun'an Technology Co ltd
Original Assignee
Anhui Huayun'an Technology Co ltd
Application filed by Anhui Huayun'an Technology Co ltd
Priority to CN202210060977.7A
Publication of CN114398428A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G06F16/288 Entity relationship models
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data analysis method, apparatus, device and storage medium, applied in the technical field of network security. The method comprises: acquiring threat intelligence text data; performing deduplication processing on the threat intelligence text data according to preset data types to obtain data to be processed corresponding to each data type; storing the data to be processed corresponding to each data type into a temporary table of an HBase database according to preset storage fields; determining association relationship data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table; and importing the association relationship data into a graph database to obtain threat intelligence relationship data with multi-level association relationships. The method avoids deduplicating and aggregating the key fields of repeated data of different data types within a data table holding hundreds of millions of records, which improves data search efficiency. The detailed element information of each storage field is statistically analyzed using the graph database, which improves statistical efficiency.

Description

Data analysis method, device, equipment and storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a data analysis method, apparatus, device, and storage medium.
Background
With the rapid development of network technology, many industries have adopted network technologies to improve productivity, which brings network information security problems with it. As the network information security situation grows more complex, dynamic defense driven by threat intelligence has become a focus of the industry. Threat intelligence is rich in content, highly accurate and timely, and can reflect the attack chain of an entire attack event, so it has very high application and analysis value.
By source, threat intelligence research covers intelligence intercepted by security devices, sandbox execution, honeypot technology, and text data. Unlike the other kinds of threat data, text threat intelligence is written by security researchers and is collected by means such as crawlers, and the hundreds of millions of records obtained are stored in a single data table. Because the data in that table is of multiple types and is not stored under primary keys, analyzing and counting threat intelligence text data of different types severely degrades query efficiency and statistical analysis efficiency.
Disclosure of Invention
In view of this, the embodiments of the present application provide a data analysis method in which an HBase database stores data of different data types in a temporary table according to storage fields. This avoids deduplicating and aggregating the key fields of repeated data of different data types within a data table holding hundreds of millions of records and improves data search efficiency. The detailed element information of each storage field is statistically analyzed by data type using a graph database, which improves statistical efficiency.
In a first aspect, an embodiment of the present application provides a data analysis method, where the method includes:
acquiring threat intelligence text data, where the threat intelligence text data refers to data describing attacks on and intrusions into other people's computers that exploit software security vulnerabilities in application systems;
classifying the threat intelligence text data according to preset data types, and performing deduplication processing on the data in each data type to obtain data to be processed corresponding to each data type;
storing the data to be processed corresponding to each data type into a temporary table of an HBase database according to preset storage fields;
determining the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table of the HBase database, and determining association relationship data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table;
importing the association relationship data into a graph database to obtain threat intelligence relationship data with multi-level association relationships;
and searching and performing statistics on the threat intelligence relationship data to determine the statistical result of the multi-level association relationship corresponding to each storage field.
With reference to the first aspect, an embodiment of the present application provides a first possible implementation manner of the first aspect, where acquiring threat intelligence text data includes:
and acquiring open-source threat intelligence text data from the Internet according to a crawler program, and storing the acquired threat intelligence text data in an HBase database.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present application provides a second possible implementation manner of the first aspect, where classifying the threat intelligence text data according to a preset data type and performing deduplication processing on the data in each data type to obtain data to be processed corresponding to each data type includes:
classifying the threat intelligence text data according to IP address data, domain name data, sample data and Url data, wherein the preset data type comprises: IP address data, domain name data, sample data, Url data;
and if the classified threat intelligence text data has repeated ID identifications, performing deduplication processing on key fields corresponding to the repeated ID identifications in the classified threat intelligence text data to obtain data to be processed corresponding to each data type.
With reference to the first possible implementation manner or the second possible implementation manner of the first aspect, an embodiment of the present application provides a third possible implementation manner of the first aspect, where the storing, according to a preset storage field, to-be-processed data corresponding to each data type in a temporary table of an HBase database includes:
respectively storing the data to be processed corresponding to each data type into a temporary table of an HBase database according to a preset storage field, wherein the storage field comprises: country field, city field, community field, port field, address field, document field, mailbox field, Url field, and scope field;
and establishing a master key ID corresponding to the temporary table according to the temporary table of the HBase database.
With reference to the first possible implementation manner or the second possible implementation manner of the first aspect, an embodiment of the present application provides a fourth possible implementation manner of the first aspect, where determining a slave key ID corresponding to each storage field in the temporary table and a master key ID corresponding to the temporary table of the HBase database, and determining association data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table includes:
determining the storage address of each storage field in the temporary table as the slave key ID corresponding to each storage field;
determining the storage address corresponding to the temporary table of the HBase database as the master key ID;
and binding the slave key ID corresponding to each storage field with the master key ID corresponding to the temporary table of the HBase database to generate association relation data.
With reference to the first possible implementation manner or the second possible implementation manner of the first aspect, an embodiment of the present application provides a fifth possible implementation manner of the first aspect, where importing the incidence relation data into a graph database to obtain threat intelligence relation data of a multi-level incidence relation, includes:
establishing, in the graph database, a first-level association relationship according to each data type, the ID of the association relationship data in the graph database, and the master key ID corresponding to the temporary table of the HBase database;
establishing a second-level association relationship of the association relationship data according to the first-level association relationship and the attribute corresponding to each storage field;
and obtaining threat intelligence relationship data with multi-level association relationships according to the first-level association relationship and the second-level association relationship.
With reference to the first possible implementation manner or the second possible implementation manner of the first aspect, an embodiment of the present application provides a sixth possible implementation manner of the first aspect, where the searching and statistics processing is performed on the threat intelligence relationship data, and determining a statistical result of a multi-level association relationship corresponding to each storage field includes:
searching and processing the threat intelligence relationship data according to each storage field to obtain statistical data corresponding to each storage field;
and determining the statistical data corresponding to all the storage fields as statistical results, and performing visual processing on the statistical results.
In a second aspect, an embodiment of the present application further provides a data analysis apparatus, where the apparatus includes:
an acquisition module, configured to acquire threat intelligence text data, where the threat intelligence text data refers to data describing attacks on and intrusions into other people's computers that exploit software security vulnerabilities in application systems;
a classification module, configured to classify the threat intelligence text data according to preset data types and to perform deduplication processing on the data in each data type to obtain data to be processed corresponding to each data type;
a storage module, configured to store the data to be processed corresponding to each data type into a temporary table of the HBase database according to preset storage fields;
a determining module, configured to determine the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table of the HBase database, and to determine association relationship data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table;
an association module, configured to import the association relationship data into a graph database to obtain threat intelligence relationship data with multi-level association relationships;
and a statistical module, configured to search and perform statistics on the threat intelligence relationship data and to determine the statistical result of the multi-level association relationship corresponding to each storage field.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the data analysis method when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, performs the steps of the data analysis method.
Compared with the prior art, in which hundreds of millions of threat intelligence text records are stored in a single data table, the data analysis method, apparatus, device and storage medium provided by the embodiments of the present application search and count data more efficiently. Threat intelligence text data is acquired; the threat intelligence text data is classified according to preset data types, and the data in each data type is deduplicated to obtain data to be processed corresponding to each data type; the data to be processed corresponding to each data type is stored into a temporary table of an HBase database according to preset storage fields; the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table of the HBase database are determined, together with the association relationship data between them; the association relationship data is imported into a graph database to obtain threat intelligence relationship data with multi-level association relationships; and the threat intelligence relationship data is searched and counted to determine the statistical result of the multi-level association relationship corresponding to each storage field.
Specifically, the threat intelligence data in this scheme is network attack process information written by network information security researchers, and it is used to proactively issue early warnings about network security threats and to deploy security policies. According to the source of the threat intelligence data, threat intelligence text data of multiple data types is deduplicated to obtain the data to be processed corresponding to each data type, and that data is stored in a temporary table of the HBase database according to preset storage fields. This avoids deduplicating and aggregating repeated data of different data types within a table holding hundreds of millions of records and improves data search efficiency. The slave key ID corresponding to each storage field in the temporary table is bound to the master key ID corresponding to the temporary table to generate association relationship data, which is imported into a graph database. The graph database looks up the detailed element information of each storage field by data type, aggregates and counts data of different data types, and queries threat intelligence relationship data with multi-level association relationships. This reduces the complexity of analyzing the data, lets the statistical results be presented clearly by relationship level, and improves statistical efficiency.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 shows a schematic flow chart of a data analysis method provided in an embodiment of the present application.
Fig. 2 shows a schematic flow chart of obtaining the data to be processed in the data analysis method provided in the embodiment of the present application.
Fig. 3 shows a schematic flow chart of storing data in a temporary table in the data analysis method provided in the embodiment of the present application.
Fig. 4 shows a schematic flow chart of generating association relation data in the data analysis method provided in the embodiment of the present application.
Fig. 5 is a schematic flow chart illustrating a process of obtaining threat intelligence relationship data of a multilevel association relationship in the data analysis method according to the embodiment of the present application.
Fig. 6 shows a schematic structural diagram of a data analysis apparatus provided in an embodiment of the present application.
Fig. 7 shows a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
As informatization advances and big data infrastructure is built out at speed, technologies such as big data, the Internet of Things, cloud computing and the mobile Internet are becoming increasingly widespread. Threat intelligence data for network information security is structured database data that is exclusive and encrypted, and its format and content are highly complex. A statistical method is therefore urgently needed that compiles intelligence statistics from existing threat intelligence text data, reduces the complexity of data analysis, and clearly shows the association relationships in the data.
To address the situation in which hundreds of millions of threat intelligence text records are stored in one data table, the embodiments of the present application provide a data analysis method and apparatus, which are described below through embodiments.
Fig. 1 is a schematic flow chart illustrating a data analysis method provided in an embodiment of the present application; as shown in fig. 1, the method specifically comprises the following steps:
Step S10, acquiring threat intelligence text data, where the threat intelligence text data refers to data describing attacks on and intrusions into other people's computers that exploit software security vulnerabilities in application systems.
When step S10 is implemented, a distributed crawler crawls web pages, then parses them and extracts threat intelligence text data. This data is network attack process information written by network information security researchers, and it is used to proactively issue early warnings about network security threats and to deploy security policies.
Step S20, classifying the threat intelligence text data according to preset data types, and performing deduplication processing on the data in each data type to obtain the data to be processed corresponding to each data type.
When step S20 is implemented, the threat intelligence text data stored in the HBase database is inspected using a swift serialization technique and classified according to preset data types such as IP address data, domain name data, sample data and Url data. After classification, the data in each data type is deduplicated to obtain the data to be processed corresponding to each data type.
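The classification part of step S20 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `classify` and `group_by_type`, the regular expressions, and the assumption that "sample data" is represented by an MD5, SHA-1 or SHA-256 hash are all hypothetical, since the source only names the four data types.

```python
import ipaddress
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical patterns: the source names the data types but not how to detect them.
HASH_RE = re.compile(r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$")
DOMAIN_RE = re.compile(
    r"^(?=.{1,253}$)(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}$"
)


def classify(value: str) -> str:
    """Return 'ip', 'url', 'sample', 'domain' or 'unknown' for one indicator."""
    value = value.strip()
    try:
        ipaddress.ip_address(value)
        return "ip"
    except ValueError:
        pass
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https", "ftp") and parsed.netloc:
        return "url"
    if HASH_RE.match(value):  # assumption: a sample is identified by its file hash
        return "sample"
    if DOMAIN_RE.match(value):
        return "domain"
    return "unknown"


def group_by_type(values):
    """Group raw indicator strings by their detected data type."""
    groups = defaultdict(list)
    for v in values:
        groups[classify(v)].append(v.strip())
    return dict(groups)
```

Each group returned by `group_by_type` would then be deduplicated separately, so that repeated values are removed within a data type rather than across the whole table.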
Step S30, storing the data to be processed corresponding to each data type into a temporary table of the HBase database according to preset storage fields.
When step S30 is implemented, the HBase database uses the generated preset partitions and timestamps to store the data to be processed corresponding to each data type at the storage addresses of the preset country field, city field, community field, port field, address field, document field, mailbox field, Url field and scope field in a temporary table of the HBase database.
Step S40, determining the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table of the HBase database, and determining the association relationship data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table.
When step S40 is implemented, after the data to be processed corresponding to each data type has been stored in the temporary table of the HBase database, the master key ID corresponding to the temporary table is determined from the temporary table created in the HBase database. The storage address of the to-be-processed data item stored under each storage field is parsed to determine the slave key ID corresponding to each storage field in the temporary table. The parsed slave key IDs are then bound to the master key ID corresponding to the temporary table, which generates the association relationship data.
Step S50, importing the association relationship data into a graph database to obtain threat intelligence relationship data with multi-level association relationships.
When step S50 is implemented, the generated association relationship data is imported into a graph database. The graph database establishes the multi-level association relationships of the association relationship data according to each data type, the master key ID corresponding to the temporary table of the HBase database, and the slave key ID corresponding to each storage field. The multi-level association relationships include a first-level association relationship, a second-level association relationship, and further multi-level association relationships.
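The two relationship levels of step S50 can be illustrated with a small in-memory graph. This is only a sketch under stated assumptions: the source does not name a graph database product, so the `RelationGraph` class, the edge labels `HAS_TABLE` and `HAS_FIELD`, and the row layout passed to `build_graph` are hypothetical stand-ins for whatever the graph database provides.

```python
from collections import defaultdict


class RelationGraph:
    """Tiny in-memory stand-in for a graph database (assumption, not a real API)."""

    def __init__(self):
        self.edges = defaultdict(set)  # node -> set of (relation, node)

    def add_edge(self, src, relation, dst):
        self.edges[src].add((relation, dst))

    def neighbors(self, node):
        return sorted(dst for _, dst in self.edges.get(node, ()))

    def reachable(self, start, max_depth):
        """Map every node within max_depth hops of start to its depth."""
        seen = {start: 0}
        frontier = [start]
        for depth in range(1, max_depth + 1):
            next_frontier = []
            for node in frontier:
                for _, dst in self.edges.get(node, ()):
                    if dst not in seen:
                        seen[dst] = depth
                        next_frontier.append(dst)
            frontier = next_frontier
        return seen


def build_graph(association_rows):
    """Import association rows and create first-level and second-level edges."""
    graph = RelationGraph()
    for row in association_rows:
        type_node = ("type", row["data_type"])
        master_node = ("master", row["master_key_id"])
        slave_node = ("slave", row["slave_key_id"], row["field"])
        graph.add_edge(type_node, "HAS_TABLE", master_node)   # first level
        graph.add_edge(master_node, "HAS_FIELD", slave_node)  # second level
    return graph
```

Calling `reachable` with a larger `max_depth` walks further levels in the same way, which is one simple reading of the multi-level relationships described above.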
Step S60, searching and performing statistics on the threat intelligence relationship data, and determining the statistical result of the multi-level association relationship corresponding to each storage field.
When step S60 is implemented, the threat intelligence relationship data is searched in the graph database to obtain the statistical data of the temporary table of the HBase database and the statistical data corresponding to each storage field, and the statistical result is visualized.
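The per-field statistics of step S60 can be sketched as a simple count of values under each storage field. The function names `field_statistics` and `top_values` and the `(storage_field, value)` record layout are assumptions made for illustration; in the described method the records would come from a graph database query.

```python
from collections import Counter, defaultdict


def field_statistics(records):
    """Count how often each value occurs under each storage field.

    records: iterable of (storage_field, value) pairs.
    """
    stats = defaultdict(Counter)
    for field, value in records:
        stats[field][value] += 1
    return {field: dict(counter) for field, counter in stats.items()}


def top_values(stats, field, n=3):
    """Return the n most frequent values of one storage field, ties broken by value."""
    items = stats.get(field, {}).items()
    return sorted(items, key=lambda kv: (-kv[1], kv[0]))[:n]
```

The resulting dictionary is a convenient input for any charting tool used for the visualization step.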
In one possible implementation, in step S10, obtaining the text data of threat intelligence includes:
Step S101, acquiring open-source threat intelligence text data from the Internet by means of a crawler program, and storing the acquired threat intelligence text data in an HBase database.
When step S101 is implemented, a distributed crawler with a master/slave structure is used, with one master end and a plurality of slave ends. Crawl requests are deployed at the master end, and the crawler program is deployed at the slave ends to crawl web pages and to parse and extract open-source threat intelligence text data. Each slave end stores the parsed threat intelligence text data in the HBase database.
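The master/slave arrangement can be sketched with a work queue and worker threads. This is a single-process illustration under assumptions: `run_crawl`, the injected `fetch` callable and the dictionary standing in for the HBase table are hypothetical, and a real deployment would run the slave ends on separate machines.

```python
import queue
import threading


def run_crawl(urls, fetch, num_slaves=3):
    """Master enqueues crawl requests; slave threads fetch and store the results."""
    pending = queue.Queue()
    store = {}  # stand-in for the HBase table (assumption)
    lock = threading.Lock()

    def slave():
        while True:
            url = pending.get()
            if url is None:  # sentinel: the master has no more requests
                return
            try:
                text = fetch(url)
            except Exception:
                continue  # a real slave end would log the failure and retry
            with lock:
                store[url] = text

    workers = [threading.Thread(target=slave) for _ in range(num_slaves)]
    for worker in workers:
        worker.start()
    for url in urls:  # master end: deploy the crawl requests
        pending.put(url)
    for _ in workers:  # one sentinel per slave end
        pending.put(None)
    for worker in workers:
        worker.join()
    return store
```

Passing `fetch` in as a parameter keeps the sketch free of network access; a real slave end would download the page and extract the threat intelligence text inside that function.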
In a possible implementation, fig. 2 shows a schematic flow chart of obtaining the data to be processed in a data analysis method provided by an embodiment of the present application; in step S20, classifying the threat intelligence text data according to a preset data type, and performing deduplication processing on the data in each data type to obtain data to be processed corresponding to each data type, includes:
Step S201, classifying the threat intelligence text data according to IP address data, domain name data, sample data and Url data, where the preset data types comprise: IP address data, domain name data, sample data, Url data.
Step S202, if the classified threat intelligence text data has repeated ID identifications, performing deduplication processing on the key fields corresponding to the repeated ID identifications in the classified threat intelligence text data to obtain the data to be processed corresponding to each data type.
When steps S201 and S202 are implemented, the HBase command line tool is used as the interface and the threat intelligence text data stored in the HBase database is accessed through SQL. The data is classified into IP address data, domain name data, sample data and Url data, and the classified data is checked for repeated ID identifications. If any are found, the key fields corresponding to the repeated ID identifications are deduplicated. For example, a script program finds the IP address data and prints it, the IP address data is sorted with any sorting function, and records that repeat the IP key field value 1.10.10.16 are deduplicated to obtain the data to be processed corresponding to the IP address data.
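The sort-then-deduplicate example for IP address data can be sketched as below. The record layout (dictionaries with an `ip` key and a `source` attribute) and the function names are hypothetical; the sketch keeps the first record seen for each key field value.

```python
import ipaddress


def ip_sort_key(text):
    """Sort IPs numerically; the version comes first so IPv4 and IPv6 never compare directly."""
    ip = ipaddress.ip_address(text)
    return (ip.version, int(ip))


def dedup_by_key(records, key="ip"):
    """Sort records by the key field and keep one record per key value."""
    ordered = sorted(records, key=lambda rec: ip_sort_key(rec[key]))
    result = []
    last = None
    for rec in ordered:
        if rec[key] != last:  # after sorting, duplicates are adjacent
            result.append(rec)
            last = rec[key]
    return result
```

Because Python's `sorted` is stable, the record that survives for a repeated address such as 1.10.10.16 is the one that appeared first in the input.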
In a possible implementation, fig. 3 shows a schematic flow chart of storing data in a temporary table in a data analysis method provided in an embodiment of the present application; in step S30, storing the data to be processed corresponding to each data type into a temporary table of the HBase database according to preset storage fields includes:
Step S301, storing the data to be processed corresponding to each data type into a temporary table of the HBase database according to preset storage fields, where the storage fields comprise: country field, city field, community field, port field, address field, document field, mailbox field, Url field, and scope field.
Step S302, establishing a master key ID corresponding to the temporary table according to the temporary table of the HBase database.
When steps S301 and S302 are implemented, a namespace for the temporary table is created in the HBase database with the create command. Within that namespace, the preset partitions of the temporary table are set according to the number of preset storage fields and a hash of the preset storage fields. Using the generated preset partitions and timestamps, the storage addresses of the preset country field, city field, community field, port field, address field, document field, mailbox field, Url field and scope field are each stored in the temporary table of the HBase database. Finally, a unique index for the master key ID is created automatically with the create command according to the temporary table of the HBase database.
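One way to read the partitioning rule is a hash of the storage field name taken modulo the number of storage fields, combined with a timestamp into a row key. The sketch below follows that reading as an assumption: the `row_key` layout, the use of SHA-256 and the function names are hypothetical and are not taken from the source or from the HBase API.

```python
import hashlib

# Storage fields listed in the description; lower-cased names are an assumption.
STORAGE_FIELDS = [
    "country", "city", "community", "port", "address",
    "document", "mailbox", "url", "scope",
]


def partition_for(field, num_partitions=len(STORAGE_FIELDS)):
    """Map a storage field to a partition with a stable hash (hash modulo count)."""
    digest = hashlib.sha256(field.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def row_key(field, record_id, timestamp_ms):
    """Build a hypothetical row key: partition | field | timestamp | record ID."""
    return f"{partition_for(field):02d}|{field}|{timestamp_ms}|{record_id}"
```

Putting the partition prefix first spreads writes across regions, while keeping the field name and timestamp in the key lets a scan retrieve one field's records in time order.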
In a possible implementation, Fig. 4 is a schematic flowchart illustrating a process of generating association relation data in a data analysis method provided by an embodiment of the present application; in step S40, determining the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table of the HBase database, and determining the association relation data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table includes:
Step S401, determining the storage address of each storage field in the temporary table as the slave key ID corresponding to each storage field.
Step S402, determining the storage address corresponding to the temporary table of the HBase database as the primary key ID.
Step S403, binding the slave key ID corresponding to each storage field with the master key ID corresponding to the temporary table of the HBase database to generate association relation data.
In a specific implementation of steps S401, S402 and S403, the preset partitions stored in the temporary table of the HBase database are arranged in rows and columns, where each row of a preset partition records the serial number of the temporary table and each column records a storage field of the temporary table. By obtaining the timestamp of the temporary table recorded in the HBase database, the storage address and data length of the to-be-processed data item stored under each storage field in the temporary table are analyzed, and the slave key ID corresponding to each storage field in the temporary table is determined. A unique index of the master key ID of the temporary table of the HBase database is created according to the to-be-processed data corresponding to each data type. The master key ID of the created temporary table is then bound to the slave key IDs corresponding to the preset country field, city field, community field, port field, address field, document field, mailbox field, Url field and scope field, generating the association relation data.
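The binding of slave key IDs to a master key ID might be sketched as below. The ID formats are hypothetical: the table name stands in for the storage address of the temporary table (master key ID), and a `table/row/field` string stands in for the storage address of each storage field (slave key ID). The `Association` record and `bind_keys` helper are names invented for the example.

```python
from dataclasses import dataclass

STORAGE_FIELDS = (
    "country", "city", "community", "port", "address",
    "document", "mailbox", "url", "scope",
)


@dataclass(frozen=True)
class Association:
    master_key_id: str  # stands in for the temporary table's storage address
    slave_key_id: str   # stands in for the storage field's storage address
    field: str
    value: str


def bind_keys(table_name, row_key, row):
    """Bind every populated storage field of one row to the table's master key."""
    return [
        Association(table_name, f"{table_name}/{row_key}/{field}", field, row[field])
        for field in STORAGE_FIELDS
        if field in row
    ]


associations = bind_keys("tmp_ip", "1.10.10.16", {"country": "CN", "port": "443"})
```

Each `Association` is one row of association relation data that a later step could load into a graph database as an edge between the table node and a field node.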
In a possible implementation, Fig. 5 is a schematic flowchart illustrating a process of obtaining threat intelligence relationship data of a multilevel association relationship in a data analysis method provided in an embodiment of the present application; in step S50, importing the association relation data into a graph database to obtain threat intelligence relationship data of a multilevel association relationship includes:
Step S501, for each data type, establishing a first-level association relationship in the graph database according to the primary key ID corresponding to the temporary table of the HBase database and the ID of the association relation data in the graph database;
Step S502, establishing a second-level association relationship of the association relation data according to the first-level association relationship and the attribute corresponding to each storage field;
Step S503, obtaining threat intelligence relationship data of the multilevel association relationship according to the first-level association relationship and the second-level association relationship.
In a specific implementation of steps S501, S502 and S503, according to the data type (for example, IP address data or Url data), the primary key ID corresponding to the temporary table of the HBase database is bound in the graph database to the ID of the association relation data, establishing the first-level association relationship of the association relation data. For example, according to the association relationships between different types of data, a first-level association relationship is established between the temporary table and association relation data such as the country field, city field, community field, port field, address field and mailbox field. According to the first-level association relationship and the attribute corresponding to each storage field, the second-level association relationship of the association relation data is established, for example between the country field and the mailbox field, or between the city field and the community field. The threat intelligence relationship data of the multilevel association relationship, such as a multilevel association between the temporary table and the Url data or domain name data, is then obtained from the first-level and second-level association relationships.
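The two levels of association can be pictured with a small in-memory adjacency map standing in for the graph database. The text does not name a graph database product or query language, so nothing here reflects a real graph API; `build_graph`, `two_level_paths` and the node names are hypothetical.

```python
from collections import defaultdict


def build_graph(master_id, field_names, second_level_pairs):
    """Build first-level edges (table -> field) and second-level edges (field -> field)."""
    graph = defaultdict(set)
    for name in field_names:
        graph[master_id].add(name)  # first-level association
    for left, right in second_level_pairs:
        if left in field_names and right in field_names:
            graph[left].add(right)  # second-level association
    return graph


def two_level_paths(graph, master_id):
    """List every table -> field -> field path, in sorted order."""
    paths = []
    for first in sorted(graph[master_id]):
        for second in sorted(graph.get(first, ())):
            paths.append((master_id, first, second))
    return paths


graph = build_graph(
    "tmp_ip",
    ["country", "city", "community", "mailbox"],
    [("country", "mailbox"), ("city", "community")],
)
paths = two_level_paths(graph, "tmp_ip")
```

The two second-level pairs are the ones given as examples in the text (country with mailbox, city with community); each resulting path is one item of multilevel relationship data.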
In a possible implementation, in step S60, the searching and counting the threat intelligence relationship data, and determining the statistical result of the multilevel association relationship corresponding to each storage field includes:
Step S601, searching and processing threat intelligence relationship data according to each storage field to obtain statistical data corresponding to each storage field;
Step S602, determining the statistical data corresponding to all the storage fields as statistical results, and performing visualization processing on the statistical results.
In a specific implementation of steps S601 and S602, the threat intelligence relationship data is searched according to the number of preset storage fields of the created temporary table. After the search, the data corresponding to the preset country field, city field, community field, port field, address field, document field, mailbox field, Url field and scope field in the created temporary table is counted according to a custom formula, and the statistical data corresponding to all the storage fields is determined as the statistical result. The threat intelligence relationship data in each preset storage field is then displayed visually on a page through an automatic search script plug-in.
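A minimal sketch of the per-field statistics follows. The "custom formula" is not specified in the text, so a plain occurrence count per field value is used as a stand-in, and the relation record layout (`field`, `value`) is assumed for the example.

```python
from collections import Counter


def field_statistics(relations, storage_fields):
    """Count how often each value occurs under each storage field."""
    stats = {name: Counter() for name in storage_fields}
    for relation in relations:
        if relation["field"] in stats:
            stats[relation["field"]][relation["value"]] += 1
    return stats


relations = [
    {"field": "country", "value": "CN"},
    {"field": "country", "value": "CN"},
    {"field": "country", "value": "US"},
    {"field": "port", "value": "443"},
]
stats = field_statistics(relations, ["country", "port", "city"])
```

The resulting mapping, one counter per storage field, is the kind of structure a page could render as a chart for each field.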
Fig. 6 is a schematic structural diagram of a data analysis apparatus 70 according to an embodiment of the present application, and as shown in fig. 6, the apparatus includes:
an obtaining module 701, configured to obtain threat intelligence text data, where the threat intelligence text data refers to data that attacks and invades computers of other people by using software security vulnerabilities in an application system;
the classification module 702 is configured to classify threat intelligence text data according to a preset data type, and perform deduplication processing on data in each data type to obtain to-be-processed data corresponding to each data type;
the storage module 703 is configured to store the to-be-processed data corresponding to each data type into a temporary table of the HBase database according to a preset storage field;
a determining module 704, configured to determine a slave key ID corresponding to each storage field in the temporary table and a master key ID corresponding to the temporary table of the HBase database, and determine association data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table;
the association module 705 is used for importing association relationship data into a graph database to obtain threat intelligence relationship data of a multilevel association relationship;
and the statistical module 706 is configured to perform search and statistical processing on the threat intelligence relationship data, and determine a statistical result of the multilevel association relationship corresponding to each storage field.
In a specific implementation of the apparatus, threat intelligence data is information about network attack processes written by network information security researchers, and early warning and security policy deployment against network security threats are carried out proactively according to the threat intelligence. According to the source of the threat intelligence data, the threat intelligence text data of multiple data types is deduplicated to obtain the to-be-processed data corresponding to each data type, and the to-be-processed data corresponding to each data type is stored into a temporary table of the HBase database according to preset storage fields. This avoids deduplicating and summarizing the repeated data of different data types within a data table at the hundred-million-row scale, and improves data search efficiency. The slave key ID corresponding to each storage field in the temporary table is bound to the master key ID corresponding to the temporary table to generate association relation data, which is imported into a graph database. The graph database searches the detailed element information of each storage field according to each data type, summarizes and counts the data of different data types, and queries the threat intelligence relationship data of the multilevel association relationship. This reduces the complexity of analyzing the data, allows the statistical results of the relationship data to be shown clearly by relationship level, and improves the efficiency of data statistics.
Corresponding to the data analysis method in Fig. 1, an embodiment of the present application further provides a computer device 80. As shown in Fig. 7, the device includes a memory 801, a processor 802, and a computer program stored on the memory 801 and executable on the processor 802, wherein the processor 802 implements the following steps when executing the computer program:
acquiring threat intelligence text data, wherein the threat intelligence text data refers to data which attacks and invades computers of other people by using software security vulnerabilities in an application system;
classifying the threat intelligence text data according to a preset data type, and performing deduplication processing on the data in each data type to obtain data to be processed corresponding to each data type;
storing the data to be processed corresponding to each data type into a temporary table of an HBase database according to a preset storage field;
determining the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table of the HBase database, and determining the association relation data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table;
importing the association relation data into a graph database to obtain threat intelligence relationship data of a multilevel association relationship;
and searching and counting the threat intelligence relationship data, and determining the statistical result of the multilevel association relationship corresponding to each storage field.
Corresponding to the data analysis method in fig. 1, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the following steps:
acquiring threat intelligence text data, wherein the threat intelligence text data refers to data which attacks and invades computers of other people by using software security vulnerabilities in an application system;
classifying the threat intelligence text data according to a preset data type, and performing deduplication processing on the data in each data type to obtain data to be processed corresponding to each data type;
storing the data to be processed corresponding to each data type into a temporary table of an HBase database according to a preset storage field;
determining the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table of the HBase database, and determining the association relation data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table;
importing the association relation data into a graph database to obtain threat intelligence relationship data of a multilevel association relationship;
and searching and counting the threat intelligence relationship data, and determining the statistical result of the multilevel association relationship corresponding to each storage field.
Based on the above analysis, compared with the related art, in which billions of items of threat intelligence text data are stored in a single data table, the data analysis method provided by the embodiment of the present application deduplicates the threat intelligence text data of multiple data types according to the source of the threat intelligence data to obtain the to-be-processed data corresponding to each data type, and then stores the to-be-processed data corresponding to each data type into a temporary table of the HBase database according to preset storage fields. This avoids deduplicating and summarizing the repeated data of different data types within a billion-row data table, and improves data search efficiency. The slave key ID corresponding to each storage field in the temporary table is bound to the master key ID corresponding to the temporary table to generate association relation data, the association relation data is imported into a graph database, and the graph database searches the detailed element information of each storage field according to each data type, which improves the efficiency of data statistics.
The data analysis device provided by the embodiment of the present application may be specific hardware on a device, or software or firmware installed on a device. The device provided by the embodiment of the present application has the same implementation principle and technical effect as the foregoing method embodiments; for brevity, where the device embodiments do not mention something, reference may be made to the corresponding content in the foregoing method embodiments. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the foregoing systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the present disclosure and are intended to be covered by the scope of the present application.

Claims (10)

1. A method of data analysis, the method comprising:
acquiring threat intelligence text data, wherein the threat intelligence text data refers to data which attacks and invades computers of other people by using software security vulnerabilities in an application system;
classifying the threat intelligence text data according to a preset data type, and performing deduplication processing on the data in each data type to obtain data to be processed corresponding to each data type;
storing the data to be processed corresponding to each data type into a temporary table of an HBase database according to a preset storage field;
determining a slave key ID corresponding to each storage field in a temporary table and a master key ID corresponding to the temporary table of the HBase database, and determining association relation data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table;
importing the association relation data into a graph database to obtain threat intelligence relationship data of a multilevel association relationship;
and searching and counting the threat intelligence relationship data, and determining the statistical result of the multilevel association relationship corresponding to each storage field.
2. The data analysis method of claim 1, wherein obtaining threat intelligence text data comprises:
and acquiring open-source threat intelligence text data from the Internet according to a crawler program, and storing the acquired threat intelligence text data in an HBase database.
3. The data analysis method of claim 1, wherein classifying the threat intelligence text data according to a preset data type, and performing deduplication processing on data in each data type to obtain to-be-processed data corresponding to each data type, comprises:
classifying the threat intelligence text data according to IP address data, domain name data, sample data and Url data, wherein the preset data type comprises: IP address data, domain name data, sample data, Url data;
and if the classified threat intelligence text data has repeated ID identifications, performing deduplication processing on key fields corresponding to the repeated ID identifications in the classified threat intelligence text data to obtain data to be processed corresponding to each data type.
4. The data analysis method according to claim 1, wherein storing the to-be-processed data corresponding to each data type in a temporary table of the HBase database according to a preset storage field comprises:
respectively storing the data to be processed corresponding to each data type into a temporary table of an HBase database according to a preset storage field, wherein the storage field comprises: country field, city field, community field, port field, address field, document field, mailbox field, Url field, and scope field;
and establishing a primary key ID corresponding to the temporary table according to the temporary table of the HBase database.
5. The data analysis method according to claim 1, wherein determining the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table of the HBase database, and determining the association data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table comprises:
determining the storage address of each storage field in the temporary table as the slave key ID corresponding to each storage field;
determining a storage address corresponding to a temporary table of an HBase database as a primary key ID;
and binding the slave key ID corresponding to each storage field with the master key ID corresponding to the temporary table of the HBase database to generate association relation data.
6. The data analysis method of claim 1, wherein importing the association relation data into a graph database to obtain threat intelligence relationship data of a multilevel association relationship comprises:
establishing, for each data type, a first-level association relationship in the graph database according to the ID of the association relation data in the graph database and the primary key ID corresponding to the temporary table of the HBase database;
establishing a second-level association relationship of the association relation data according to the first-level association relationship and the attribute corresponding to each storage field;
and obtaining threat intelligence relationship data of the multilevel association relationship according to the first-level association relationship and the second-level association relationship.
7. The data analysis method of claim 1, wherein searching and counting the threat intelligence relationship data, and determining the statistical result of the multilevel association relationship corresponding to each storage field, comprises:
searching and processing the threat intelligence relationship data according to each storage field to obtain statistical data corresponding to each storage field;
and determining the statistical data corresponding to all the storage fields as statistical results, and performing visual processing on the statistical results.
8. A data analysis apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring threat intelligence text data, wherein the threat intelligence text data refers to data which attacks and invades computers of other people by using software security vulnerabilities in an application system;
the classification module is used for classifying the threat intelligence text data according to a preset data type and carrying out deduplication processing on the data in each data type to obtain data to be processed corresponding to each data type;
the storage module is used for storing the data to be processed corresponding to each data type into a temporary table of the HBase database according to a preset storage field;
the determining module is used for determining the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table of the HBase database, and determining the association relation data between the slave key ID corresponding to each storage field in the temporary table and the master key ID corresponding to the temporary table;
the association module is used for importing the association relation data into a graph database to obtain threat intelligence relationship data of a multilevel association relationship;
and the statistical module is used for searching and statistically processing the threat intelligence relationship data and determining the statistical result of the multilevel association relationship corresponding to each storage field.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of the preceding claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 7.
CN202210060977.7A 2022-01-19 2022-01-19 Data analysis method, device, equipment and storage medium Pending CN114398428A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210060977.7A CN114398428A (en) 2022-01-19 2022-01-19 Data analysis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210060977.7A CN114398428A (en) 2022-01-19 2022-01-19 Data analysis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114398428A true CN114398428A (en) 2022-04-26

Family

ID=81230818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210060977.7A Pending CN114398428A (en) 2022-01-19 2022-01-19 Data analysis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114398428A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115344563A (en) * 2022-08-17 2022-11-15 中国电信股份有限公司 Data deduplication method and device, storage medium and electronic equipment
CN115344563B (en) * 2022-08-17 2024-02-02 中国电信股份有限公司 Data deduplication method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination