CN115344563A

CN115344563A - Data deduplication method and device, storage medium and electronic equipment

Info

Publication number: CN115344563A
Application number: CN202210987429.9A
Authority: CN
Inventors: 高岩; 袁涵; 郭实秋; 鞠港
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2022-08-17
Filing date: 2022-08-17
Publication date: 2022-11-15
Anticipated expiration: 2042-08-17
Also published as: CN115344563B

Abstract

The disclosure belongs to the technical field of network security and relates to a data deduplication method and device, a storage medium, and electronic equipment. The method comprises the following steps: acquiring threat intelligence data, and preprocessing the threat intelligence data to determine its data type; when the data type is an unstructured type, performing text similarity calculation on the threat intelligence data to obtain a semantic feature vector, and performing deduplication processing on the threat intelligence data according to the semantic feature vector; or when the data type is a structured type, performing data compression processing on the data type, and storing the compressed threat intelligence data for deduplication processing. The method addresses the excessive memory occupation and time-consuming processing flow encountered when deduplicating threat intelligence data, addresses the inability of the original deduplication method to capture text information, improves the retrieval efficiency of unstructured threat intelligence data, and reduces the excessive system resource consumption caused by deduplicating and storing massive threat intelligence data.

Description

Data deduplication method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of network security technologies, and in particular, to a data deduplication method, a data deduplication apparatus, a computer-readable storage medium, and an electronic device.

Background

With the rapid development of the internet, particularly the mobile internet, more and more network devices and Internet of Things devices are connected to backbone networks. The internet topology has become more complex, attack behaviors have become more industrialized, and intrusion methods have become more diverse and sophisticated, so that traditional security solutions are continuously challenged. Meanwhile, as China's international standing continues to rise, the network attacks it faces are also becoming more diverse and sophisticated. Against this background, enterprises are paying increasing attention to threat intelligence: security devices are more effective when combined with threat intelligence, and enterprise security operations can respond to security events more quickly.

As network attacks become more frequent, millions of pieces of threat intelligence are generated every day. However, the quality of both commercial threat intelligence and threat intelligence from open-source websites is uneven: a large amount of duplicate data exists across threat intelligence from different sources, and threat intelligence from the same source also contains duplicates. As a result, excessive memory is occupied, which adversely affects platform operation, storage, and operations and maintenance.

In view of the above, there is a need in the art to develop a new data deduplication method and apparatus.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure is directed to a data deduplication method, a data deduplication apparatus, a computer-readable storage medium, and an electronic device, so as to overcome, at least to some extent, the technical problem of an excessively large occupied memory due to limitations of related technologies.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.

According to a first aspect of the embodiments of the present invention, there is provided a data deduplication method, the method including:

acquiring threat intelligence data, and preprocessing the threat intelligence data to determine the data type;

when the data type is an unstructured type, performing text similarity calculation on the threat intelligence data to obtain a semantic feature vector, and performing deduplication processing on the threat intelligence data according to the semantic feature vector; or

when the data type is a structured type, performing data compression processing on the data type, and storing the compressed threat intelligence data for deduplication processing.

In an exemplary embodiment of the invention, the preprocessing the threat intelligence data to determine a data type comprises:

carrying out data standardization processing on the threat intelligence data, and extracting and processing the processed threat intelligence data to obtain key data;

and carrying out data cleaning processing on the key data, and classifying the cleaned key data to obtain a data type.

In an exemplary embodiment of the present invention, the performing data compression processing on the data type includes:

coding the data type to obtain a first bit vector, and performing hash calculation on the key data to obtain a second bit vector;

and calculating the first bit vector and the second bit vector to obtain a target bit vector so as to obtain the compressed threat intelligence data.
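The compression embodiment above can be sketched as follows. This is only an illustration under stated assumptions, since the text does not fix the encoding, the hash function, or how the two bit vectors are combined: here the data type is assumed to be a 3-bit code, the key data is hashed with SHA-256 truncated to 61 bits, and the target bit vector is their concatenation into one 64-bit integer. The names `TYPE_CODES`, `compress`, and `is_duplicate` are hypothetical.

```python
import hashlib

# Hypothetical 3-bit type codes (assumption, not specified by the source).
TYPE_CODES = {"basic": 0, "vulnerability": 1, "asset": 2, "ioc": 4}
HASH_BITS = 61  # 3 type bits + 61 hash bits = 64-bit target bit vector


def compress(data_type: str, key_data: str) -> int:
    """Pack a type code and a truncated hash of the key data into 64 bits."""
    first = TYPE_CODES[data_type]  # first bit vector: encoded data type
    digest = hashlib.sha256(key_data.encode("utf-8")).digest()
    # Second bit vector: the top 61 bits of the first 8 digest bytes.
    second = int.from_bytes(digest[:8], "big") >> (64 - HASH_BITS)
    # Target bit vector: concatenation (assumed combination rule).
    return (first << HASH_BITS) | second


seen = set()  # stores only the compressed 64-bit values, not the raw records


def is_duplicate(data_type: str, key_data: str) -> bool:
    """Return True if an identical (type, key data) pair was stored before."""
    fingerprint = compress(data_type, key_data)
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False
```

Because the hash is truncated, two different records can in principle collide and be reported as duplicates; a real system would have to choose the vector width against the expected data volume.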

In an exemplary embodiment of the invention, before the text similarity calculation on the threat intelligence data to obtain the semantic feature vector, the method further comprises:

inputting the threat intelligence data into a joint extraction model so that the joint extraction model outputs intelligence keywords and intelligence categories;

and scoring the intelligence keywords and the intelligence categories by using a structured deduplication algorithm to obtain a first deduplication score.

In an exemplary embodiment of the present invention, the joint extraction model is trained by the following method:

performing character vector training on a training sample by using a pre-training algorithm to obtain a text vector, and encoding the text vector to obtain an encoded vector;

and performing sequence label prediction on the encoded vector to obtain keyword data, and performing category prediction on the encoded vector to obtain category data.

In an exemplary embodiment of the invention, the semantic feature vectors include high-level semantic vectors and medium-level semantic vectors,

the text similarity calculation of the threat intelligence data to obtain a semantic feature vector comprises the following steps:

inputting the threat intelligence data into a fully binary quantized language representation model such that the language representation model outputs the high-level semantic vector and the medium-level semantic vector.

In an exemplary embodiment of the invention, the performing the deduplication processing on the threat intelligence data according to the semantic feature vector includes:

acquiring stored intelligence data in an intelligence database, and performing a first distance calculation on the medium-level semantic vector and the stored intelligence data to determine an intelligence candidate set;

performing a second distance calculation on the high-level semantic vectors and the stored intelligence data in the intelligence candidate set to determine a second deduplication score, and calculating the first deduplication score and the second deduplication score to obtain a repetition confidence;

and carrying out deduplication processing on the threat intelligence data according to the repetition confidence.
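The two-stage flow of this embodiment can be sketched as follows. The sketch rests on several assumptions that the text does not state: the semantic vectors are treated as binary codes stored as integers, both distance calculations use Hamming distance, the second deduplication score is the best normalized similarity within the candidate set, and the repetition confidence is a weighted average of the two scores. The function names, the `coarse_max` threshold, and the `weight` parameter are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def repetition_confidence(query, store, first_score,
                          bits=64, coarse_max=16, weight=0.5):
    """Estimate how likely `query` repeats an item in `store`.

    `query` and each entry of `store` are dicts holding a medium-level
    code under "mid" and a high-level code under "high".
    """
    # Stage 1: coarse filter with the medium-level code (first distance).
    candidates = [s for s in store
                  if hamming(query["mid"], s["mid"]) <= coarse_max]
    if not candidates:
        return 0.0
    # Stage 2: fine comparison with the high-level code (second distance),
    # turned into a similarity in [0, 1] as the second deduplication score.
    second_score = max(1.0 - hamming(query["high"], s["high"]) / bits
                       for s in candidates)
    # Combine both scores (assumed rule: weighted average).
    return weight * first_score + (1.0 - weight) * second_score
```

A record would then be discarded as a duplicate when the returned confidence exceeds a chosen threshold; filtering on the medium-level code first keeps the more precise comparison limited to a small candidate set.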

According to a second aspect of the embodiments of the present invention, there is provided a data deduplication apparatus, including:

a data acquisition module configured to acquire threat intelligence data and preprocess the threat intelligence data to determine the data type;

a first deduplication module configured to perform text similarity calculation on the threat intelligence data to obtain a semantic feature vector when the data type is an unstructured type, and to perform deduplication processing on the threat intelligence data according to the semantic feature vector; or

a second deduplication module configured to perform data compression processing on the data type and store the compressed threat intelligence data for deduplication processing when the data type is a structured type.

According to a third aspect of an embodiment of the present invention, there is provided an electronic apparatus including: a processor and a memory; wherein the memory has stored thereon computer readable instructions which, when executed by the processor, implement the data deduplication method in any of the exemplary embodiments described above.

According to a fourth aspect of embodiments of the present invention, there is provided a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the data deduplication method in any of the exemplary embodiments described above.

According to the technical scheme, the data deduplication method, the data deduplication device, the computer storage medium and the electronic device in the exemplary embodiment of the disclosure have at least the following advantages and positive effects:

In the method and the device provided by the exemplary embodiments of the disclosure, determining the data type of the threat intelligence data provides a data basis and theoretical support for applying different deduplication modes to different types of threat intelligence data. On the one hand, deduplicating the threat intelligence data according to the semantic feature vector addresses the excessive memory occupation and the time-consuming processing flow of threat intelligence deduplication, addresses the inability of the original deduplication method to capture text information, and improves the retrieval efficiency of unstructured threat intelligence data. On the other hand, performing data compression processing on the data type of structured threat intelligence data addresses the excessive consumption of system resources caused by deduplicating massive threat intelligence data and reduces the resource consumption of storing the threat intelligence data.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.

FIG. 1 schematically illustrates a flow diagram of a data deduplication method in an exemplary embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a method of determining a data type in an exemplary embodiment of the disclosure;

FIG. 3 schematically illustrates a flow diagram of a method for determining a first deduplication score for threat intelligence data using a joint extraction model in an exemplary embodiment of the present disclosure;

FIG. 4 schematically illustrates a flow diagram of a method of training a joint extraction model in an exemplary embodiment of the disclosure;

FIG. 5 schematically illustrates a flow chart of a method of deduplication processing of threat intelligence data in an exemplary embodiment of the present disclosure;

FIG. 6 is a flow diagram that schematically illustrates a method for data compression processing of data types in an exemplary embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a data deduplication system in an application scenario in an exemplary embodiment of the present disclosure;

FIG. 8 schematically illustrates a block diagram of a data processing module in an exemplary embodiment of the disclosure;

FIG. 9 schematically shows a structural diagram of an intelligence keyword-type joint extraction model in an exemplary embodiment of the disclosure;

FIG. 10 is a schematic diagram illustrating the structure of a fully binary quantized language representation model in an exemplary embodiment of the present disclosure;

FIG. 11 schematically illustrates a block diagram of a data compression module in an exemplary embodiment of the disclosure;

FIG. 12 schematically illustrates a structural diagram of a data deduplication apparatus in an exemplary embodiment of the present disclosure;

FIG. 13 schematically illustrates an electronic device for implementing a data deduplication method in exemplary embodiments of the present disclosure;

FIG. 14 schematically illustrates a computer-readable storage medium for implementing a data deduplication method in an exemplary embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/parts/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second," etc. are used merely as labels, and are not limiting on the number of their objects.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.

With the rapid development of the internet, particularly the mobile internet, more and more network devices and Internet of Things devices are connected to backbone networks. The internet topology has become more complex, attack behaviors have become more industrialized, and intrusion methods have become more diverse and sophisticated, so that traditional security solutions are continuously challenged. Meanwhile, as China's international standing continues to rise, the network attacks it faces are also becoming more diverse and sophisticated.

Against this background, enterprises are paying increasing attention to threat intelligence: security devices are more effective when combined with threat intelligence, and enterprise security operations can respond to security events more quickly. The role of threat intelligence in network security is therefore becoming more and more important.

Threat intelligence refers to an intelligence knowledge base covering multiple types and multiple dimensions. It may include vulnerability intelligence, asset intelligence, IOC (Indicator of Compromise, i.e., threat indicator) intelligence, event intelligence, and the like.

As an evidence-based body of knowledge comprising scenarios, mechanisms, indicators, and actionable advice, threat intelligence can effectively cover blind spots in network security defense and turn passive protection into active defense. While detecting existing attacks, it also supports threat tracing, evidence discovery, attack prediction, attack map construction, and the like. It thereby improves the overall protection capability of network security devices, reduces the impact of network attacks, and provides security decision makers with an important reference for network defense.

As network attacks become more frequent, millions of pieces of threat intelligence are produced every day. However, the quality of commercial threat intelligence and of threat intelligence from open-source websites is uneven: a large amount of duplicate data exists across threat intelligence from different sources, and threat intelligence from the same source may also duplicate earlier data. A threat intelligence platform needs to provide accurate, high-quality data, and the large volume of duplicate data produced by data sources every day affects the platform's operation, storage, and operations and maintenance. Deduplicating threat intelligence data has therefore become an important part of intelligence processing and bears directly on intelligence quality and on the construction of the threat intelligence platform.

In view of the problems in the related art, the present disclosure provides a data deduplication method. FIG. 1 shows a flow chart of the data deduplication method. As shown in FIG. 1, the data deduplication method includes at least the following steps:

Step S110: obtaining threat intelligence data, and preprocessing the threat intelligence data to determine the data type.

Step S120: when the data type is an unstructured type, performing text similarity calculation on the threat intelligence data to obtain a semantic feature vector, and performing deduplication processing on the threat intelligence data according to the semantic feature vector.

Step S130: when the data type is a structured type, performing data compression processing on the data type, and storing the compressed threat intelligence data for deduplication processing.

In an exemplary embodiment of the present disclosure, determining the data type of the threat intelligence data provides a data basis and theoretical support for applying different deduplication modes to different types of threat intelligence data. On the one hand, deduplicating the threat intelligence data according to the semantic feature vector addresses the excessive memory occupation and the time-consuming processing flow of threat intelligence deduplication, addresses the inability of the original deduplication method to capture text information, and improves the retrieval efficiency of unstructured threat intelligence data. On the other hand, performing data compression processing on the data type of structured threat intelligence data addresses the excessive consumption of system resources caused by deduplicating massive threat intelligence data and reduces the resource consumption of storing the threat intelligence data.

The following describes each step of the data deduplication method in detail.

In step S110, threat intelligence data is obtained and preprocessed to determine the data type.

Classifying threat intelligence data according to its attributes allows the threat intelligence to be matched to usage scenarios.

On this basis, threat intelligence data can be classified into a basic intelligence class, an asset class, a vulnerability class, an event class, an IOC class, an attack organization class, other intelligence types, and the like.

Basic intelligence covers common objects in the network, such as IP (Internet Protocol) addresses (192.168.0.x), domain names (www.xxxxx.com), mailboxes (example@xx.com), URLs (Uniform Resource Locators) (http://www.xxxxxx.com), and certificates.

For each of these objects, the basic intelligence includes, for example, the ports used, the type of service provided, whois (domain name query protocol) information (including whether the domain name has been registered and the registration details), and the geographic location of the IP address, domain name, or URL, such as its longitude and latitude and the country, region, and city to which it belongs.

Asset intelligence is divided by content into three categories: risk asset intelligence, asset change intelligence, and asset discovery intelligence. Assets are physical or virtual devices on the internet, such as routers, switches, servers, and hosts.

Vulnerability intelligence refers to a knowledge base formed by applying threat intelligence techniques to collect, analyze, and describe existing vulnerabilities.

For example, national vulnerability databases such as NVD (National Vulnerability Database), CNVD (China National Vulnerability Database, the national information security vulnerability sharing platform), and CNNVD (China National Vulnerability Database of Information Security), as well as CVE (Common Vulnerabilities and Exposures), mainly describe the name, description, type, severity score, underlying principle, impact, and patching measures of each vulnerability.

Event intelligence refers to intelligence about security events of various types, such as the time of occurrence and the resulting impact. A detailed textual description of the type, source, and potential impact of a security event, together with the associated vulnerabilities or attack organizations, helps security operators and non-specialists understand the external threat situation in time and respond to it.

IOC refers to threat indicators that describe the detection characteristics of a network attack, such as the attack source IP address, the domain name, the MD5 (Message-Digest Algorithm 5) hash value of a malicious file, traffic characteristics, or the mailbox from which a phishing email was sent. Security personnel can use IOC intelligence for risk assessment, security hardening, and the like.

Attack organization intelligence contains the name of the threat actor, such as the name of a hacker organization; the role of the attacker, such as hacker or white hat; and the industries, countries, and the like that the organization targets.

Other intelligence types may include threat reports, major activity assurance intelligence, and internal intelligence, among others.

After threat intelligence data is acquired, the threat intelligence data may be preprocessed to determine a data type of the threat intelligence data.

In an alternative embodiment, FIG. 2 shows a flow chart of a method for determining a data type. As shown in FIG. 2, the method may include at least the following steps. In step S210, data normalization processing is performed on the threat intelligence data, and key data is extracted from the processed threat intelligence data.

Data normalization processing is performed on threat intelligence data from different intelligence sources. For example, the data may be converted into JSON (JavaScript Object Notation) format or the like, which is not particularly limited in this exemplary embodiment.

Further, key data such as the attacker IP, attack type, and threat level can be extracted from the normalized threat intelligence data.

The keywords that the platform needs to extract differ across types of threat intelligence; the extracted keywords can then be assembled into JSON documents in a uniform format to serve as the key data.
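Step S210 can be illustrated with a minimal sketch that maps source-specific field names onto one schema and emits a uniform JSON document. The alias table `FIELD_ALIASES` and the function name `normalize` are hypothetical; the text names only attacker IP, attack type, and threat level as examples of key data and does not specify any field names.

```python
import json

# Hypothetical mapping from source-specific field names to a unified schema.
FIELD_ALIASES = {
    "attacker_ip": ("attacker_ip", "src_ip", "ip"),
    "attack_type": ("attack_type", "category"),
    "threat_level": ("threat_level", "severity"),
}


def normalize(raw: dict) -> str:
    """Return a JSON document holding only the unified key fields."""
    key_data = {}
    for field, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            if alias in raw:
                key_data[field] = raw[alias]
                break  # first matching alias wins
    # Sorted keys give the same text for the same content across sources.
    return json.dumps(key_data, sort_keys=True, ensure_ascii=False)
```

With this sketch, `normalize({"src_ip": "192.168.0.1", "severity": "high"})` and `normalize({"ip": "192.168.0.1", "threat_level": "high"})` produce the same JSON text, which is what makes records from different sources comparable in later steps.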

In step S220, a data cleaning process is performed on the key data, and the cleaned key data is classified to obtain a data type.

Because the quality of intelligence from different sources varies and the data may contain characters such as the line feed "\n" and the tab "\t", data cleaning processing can delete or replace such characters in the key data and remove sensitive words, stop words, and the like, so that the cleaned key data meets the requirements of the subsequent processing flow.
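A minimal sketch of such a cleaning step is shown below. The stop-word list is a hypothetical placeholder, and sensitive-word removal, which would work the same way with a second word list, is omitted.

```python
import re

STOP_WORDS = {"the", "a", "an", "of"}  # hypothetical stop-word list


def clean_text(text: str) -> str:
    """Replace line feeds and tabs with spaces and drop stop words."""
    text = re.sub(r"[\r\n\t]+", " ", text)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)
```

Splitting on whitespace and rejoining with single spaces also collapses repeated blanks, so texts that differ only in layout become identical after cleaning.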

Data classification is a step of the deduplication process for threat intelligence data. According to the type of the threat intelligence data, the cleaned key data can be divided as follows: attack organization intelligence, event intelligence, reports, and the like are classified as unstructured intelligence data, while basic intelligence, vulnerability intelligence, IOC intelligence, and the like are classified as structured intelligence data.

Referring to table 1, different ways of dividing threat intelligence data are shown:

TABLE 1

In particular, the intelligence type can be divided into structured intelligence and unstructured intelligence according to the intelligence data format.

Structured threat intelligence refers to data that can be uniquely identified by a character string, such as IP, asset, and vulnerability intelligence; for example, a specific IP address or a vulnerability number uniquely identifies a piece of intelligence. Threat reports, major activity assurance intelligence, and internal intelligence included in the other intelligence types may also be of the structured type.

Unstructured threat intelligence data refers to event-type threat intelligence and the like, which describes an attack event in prose and may contain vulnerability information, attack organization information, and so on. Such intelligence cannot be used directly: it typically requires human or machine reading to extract and organize the required information before usable intelligence is produced.
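The split between the two data types can be sketched as a simple lookup. The category strings and the function name `data_type` are hypothetical labels chosen for illustration; the grouping itself follows the description (basic, asset, vulnerability, and IOC intelligence as structured; attack organization intelligence, event intelligence, and reports as unstructured).

```python
# Hypothetical category labels; the grouping follows the description.
STRUCTURED = {"basic", "asset", "vulnerability", "ioc"}
UNSTRUCTURED = {"attack_organization", "event", "report"}


def data_type(intel_category: str) -> str:
    """Map an intelligence category to the deduplication branch it takes."""
    if intel_category in STRUCTURED:
        return "structured"      # handled by compression (step S130)
    if intel_category in UNSTRUCTURED:
        return "unstructured"    # handled by semantic vectors (step S120)
    raise ValueError(f"unknown intelligence category: {intel_category!r}")
```

Raising an error for unknown categories is a design choice of this sketch; a platform could instead route such records to a default branch.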

In the exemplary embodiment, the data type of the threat intelligence data can be determined through preprocessing, a data base and theoretical support are provided for subsequent deduplication processing, and the accuracy and timeliness of data deduplication are guaranteed.

In step S120, when the data type is an unstructured type, a semantic feature vector is obtained by performing text similarity calculation on the threat intelligence data, and deduplication processing is performed on the threat intelligence data according to the semantic feature vector.

In an exemplary embodiment of the present disclosure, when the threat intelligence data is unstructured intelligence data, observation of a large amount of data shows that threat intelligence texts are often similar while their intelligence keywords differ. For example, intelligence 1: "Trojan backdoor, vulnerability exploitation: CVE-2022-26134"; intelligence 2: "Trojan backdoor, security vulnerability: CVE-2022-30716".

If only a text similarity calculation based on word co-occurrence is used, co-occurring words such as "Trojan backdoor" and "vulnerability" yield a similarity of 62.5% under SimHash (a hash method commonly used for web page deduplication). However, the key CVE vulnerability numbers of the two texts are different, so they are clearly two different pieces of intelligence.
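For reference, a minimal SimHash sketch is given below. It assumes whitespace tokenization, equal token weights, and a 64-bit fingerprint taken from MD5; none of these choices comes from the source, so the sketch is not expected to reproduce the 62.5% figure quoted above. It only shows why texts that share most of their tokens receive a high similarity regardless of which token differs.

```python
import hashlib

BITS = 64  # fingerprint width (assumption)


def simhash(tokens):
    """Compute a 64-bit SimHash fingerprint with equal token weights."""
    counts = [0] * BITS
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def similarity(a: int, b: int) -> float:
    """Share of bit positions on which two fingerprints agree."""
    return 1.0 - bin(a ^ b).count("1") / BITS


text1 = "trojan backdoor vulnerability exploitation CVE-2022-26134".split()
text2 = "trojan backdoor security vulnerability CVE-2022-30716".split()
score = similarity(simhash(text1), simhash(text2))
```

The CVE number is just one token among several here, so it has no more influence on `score` than any other word, which is the weakness the paragraph describes.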

Therefore, an intelligence keyword-type joint extraction model is proposed to address the above problem. The model extracts intelligence keywords, such as IP addresses, attack organizations, IOC intelligence, and vulnerability numbers (CVE), from the intelligence text; for example, from "Trojan backdoor, security vulnerability: CVE-2022-30716" it extracts CVE-2022-30716. The model determines the intelligence type at the same time as it extracts the intelligence keywords.
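To show what the extracted keywords contribute, the sketch below uses regular expressions as a stand-in for the learned joint extraction model. This is a deliberate simplification, not the method of the disclosure: the patterns cover only CVE numbers and IPv4-like addresses, and the function name `extract_keywords` is hypothetical.

```python
import re

# Regex stand-ins for the learned extractor (assumption, not the model).
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_keywords(text: str) -> set:
    """Collect CVE numbers and IPv4-like addresses found in the text."""
    cves = {m.upper() for m in CVE_RE.findall(text)}
    ips = set(IPV4_RE.findall(text))
    return cves | ips


k1 = extract_keywords("Trojan backdoor, vulnerability exploitation: CVE-2022-26134")
k2 = extract_keywords("Trojan backdoor, security vulnerability: CVE-2022-30716")
```

The two keyword sets are disjoint even though the surrounding wording is nearly identical, so comparing keywords separates the two pieces of intelligence where word co-occurrence alone does not.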

In an alternative embodiment, FIG. 3 shows a flow diagram of a method for determining a first deduplication score of threat intelligence data using a joint extraction model. As shown in FIG. 3, the method may include at least the following steps. In step S310, threat intelligence data is input into the joint extraction model so that the joint extraction model outputs intelligence keywords and intelligence categories.

In an alternative embodiment, FIG. 4 shows a flow diagram of a method for training a joint extraction model. As shown in FIG. 4, the method may include at least the following steps. In step S410, character vector training is performed on the training samples by using a pre-training algorithm to obtain text vectors, and the text vectors are encoded to obtain encoded vectors.

The joint extraction model follows the idea of joint training: it takes the intelligence text vector of unstructured intelligence as input, passes it through a word embedding layer, an encoding layer, a conditional random field layer, a sequence prediction layer, and a type prediction layer, and finally outputs the start and end positions of the intelligence keywords together with the intelligence type.

The sequence tags are [B_T, O_T, E_T, X], which respectively represent the start position of a keyword, an interval position of a keyword, the end position of a keyword, and a non-keyword position.

The intelligence type tags are {0: basic intelligence, 1: vulnerability intelligence, 2: asset intelligence, 3: event intelligence, 4: IOC intelligence, 5: attack organization intelligence, 6: other types of intelligence}.

Specifically, character vector training is performed on all the labeled data through a threat intelligence vector pre-training algorithm.

The basic idea of the character vector is to represent each character as a K-dimensional vector. The relations between characters can be learned while the character vectors are trained, and the vector form of the vocabulary representation facilitates computation. The specific calculation formula is as follows:

$$e_i = E(x_i) \tag{1}$$

where $E$ is the character embedding matrix, $x_i$ is the index number of the i-th character, and $e_i$ is the vector representation of the i-th character.

The text vector is encoded by a bidirectional long short-term memory (BiLSTM) neural network to obtain a deep representation of the text vector, with the following formulas:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(e_i) \tag{2}$$

$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(e_i) \tag{3}$$

$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}] \tag{4}$$

where the character vector $e_i$ is encoded by the forward and reverse long short-term memory networks to obtain the forward hidden layer state $\overrightarrow{h_i}$ and the reverse hidden layer state $\overleftarrow{h_i}$, and the two vectors are concatenated as the encoded text representation, denoted by $h_i$.

In step S420, the sequence label prediction is performed on the coded vector to obtain keyword data, and the category prediction is performed on the coded vector to obtain category data.

In sequence label prediction, the information before and after each position of the encoded vector influences the predicted label, so the encoded vector can be processed by a conditional random field layer.

Label prediction is then performed on each hidden layer state of the aligned encoded features by a sequence label prediction method.

In general, in the label prediction stage of the model, the softmax function (the normalized exponential function) can be used. For each character, the probabilities that the character is a keyword start, a keyword end, a keyword interval, or a non-keyword are predicted; the item with the maximum probability is selected as the label of that character; and the keywords are extracted by means of the start and end labels to obtain the keyword data.
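The extraction of keywords from the start and end labels can be illustrated with a minimal sketch. It assumes the per-character labels have already been predicted; the function name `extract_keywords` and the handling of inconsistent label sequences are illustrative assumptions, not part of the disclosed method.

```python
from typing import List

def extract_keywords(chars: List[str], labels: List[str]) -> List[str]:
    """Collect character spans that run from a B_T label to the next E_T label."""
    keywords: List[str] = []
    current = None  # characters of the keyword currently being assembled
    for ch, label in zip(chars, labels):
        if label == "B_T":                       # keyword start: open a new span
            current = [ch]
        elif label == "O_T" and current is not None:
            current.append(ch)                   # keyword interval: extend the span
        elif label == "E_T" and current is not None:
            current.append(ch)                   # keyword end: close the span
            keywords.append("".join(current))
            current = None
        else:                                    # X, or a label with no open span
            current = None
    return keywords

text = "hole: CVE-2022-30716"
labels = ["X"] * 6 + ["B_T"] + ["O_T"] * 12 + ["E_T"]
print(extract_keywords(list(text), labels))
```

In this sketch a span is discarded whenever a non-keyword label interrupts it, which is one simple way to tolerate inconsistent predictions.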

Meanwhile, the intelligence category is predicted from the hidden layer vector $h_i$ by a feedforward neural network, with the following formula:

$$P = \mathrm{softmax}(W h_i + b), \quad i \in \{1, \dots, M\} \tag{5}$$

where W and b are parameters to be learned, $h_i$ is the hidden vector representation of a piece of threat intelligence text, and P is the probability that the threat intelligence text belongs to each category. The category with the highest probability is taken as the category of the piece of intelligence, so as to obtain the category data.
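Formula (5) can be illustrated with a small pure-Python sketch. The weights `W`, bias `b`, and the 2-dimensional hidden vector are toy values chosen for illustration only; in the described method these parameters are learned and the hidden vector comes from the encoding layer.

```python
import math
from typing import List, Tuple

# Category ids follow the intelligence category tags given in the description.
CATEGORIES = {
    0: "basic", 1: "vulnerability", 2: "asset", 3: "event",
    4: "IOC", 5: "attack organization", 6: "other",
}

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_category(W: List[List[float]], b: List[float],
                     h: List[float]) -> Tuple[int, List[float]]:
    """Compute P = softmax(W h + b) and return the most probable category id."""
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda k: probs[k])
    return best, probs

# Toy parameters (assumed): 7 categories, 2-dimensional hidden vector.
W = [[0.0, 0.0] for _ in CATEGORIES]
W[1] = [5.0, 0.0]
b = [0.0] * len(CATEGORIES)
category_id, probs = predict_category(W, b, [1.0, 0.0])
print(CATEGORIES[category_id])
```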

As for the loss function, since keyword extraction and category prediction are learned jointly, the loss function is as follows:

$$L = \alpha l_{\alpha} + \beta l_{\beta} \tag{6}$$

where $l_{\alpha}$ and $l_{\beta}$ are the loss functions of the keyword extraction model and the category prediction model, respectively, and α and β are hyper-parameters to be learned, which are iteratively optimized through training.

The training of the intelligence keyword-type joint extraction model is completed by labeling a free threat intelligence data set, so that the model performance reaches the expected result.

Furthermore, at prediction time, the input text is a piece of unstructured threat intelligence data to be recognized, and the extracted intelligence keywords and intelligence category are output through model prediction.

For example, when the input unstructured threat intelligence data is "trojan backdoor, security hole: CVE-2022-30716", the intelligence keyword output for the threat intelligence is "CVE-2022-30716", and the intelligence category output is "vulnerability".

The joint extraction model can extract intelligence categories and intelligence keywords from event intelligence and attack organization intelligence, so that these features yield the repetition confidence score A through the structured deduplication calculation method.

In step S320, a structured deduplication algorithm is used to score the intelligence keywords and the intelligence categories to obtain a first deduplication score.

Specifically, the structured deduplication algorithm stores a series of rules for determining the first deduplication score, and the first deduplication score corresponding to the intelligence keyword and the intelligence category can be obtained through the combination of the rules.

For example, the rules include the score assigned when the intelligence keyword and/or the intelligence category is in the database, the score assigned when it is not in the database, the score assigned when the intelligence category is consistent with the threat intelligence data source, the score assigned when it is inconsistent, and the like.

For unstructured intelligence, deduplication cannot be achieved by keywords alone. Consider a piece of intelligence such as "Vim is a cross-platform text editor. Versions prior to Vim 8.2 have a security vulnerability, which is caused by a use-after-free problem." The text of this intelligence contains no obvious keywords, such as an attack source IP address or a CVE number. Therefore, text intelligence without intelligence keywords needs to be deduplicated against the local intelligence library through text similarity calculation.

However, text similarity calculation algorithms based on hash features generally rely on word co-occurrence; they are not well suited to intelligence data and produce a certain degree of misjudgment, so deep semantic features are required for similarity judgment. In recent years, text similarity judgment based on deep learning has developed, but few similarity measurement algorithms target textual threat intelligence. Meanwhile, the platform must check intelligence data on the order of millions of items for duplicates every day, and deep learning similarity calculation alone cannot meet this performance requirement.

Therefore, through a Bit-BERT algorithm (where BERT stands for Bidirectional Encoder Representations from Transformers), the semantic vector is passed through a feedforward neural network to learn a binary bit vector; coarse-grained similarity calculation is performed to obtain a coarse-grained candidate set, and fine-grained similarity calculation is then performed within the coarse-grained candidate set to obtain the repetition confidence B.

The BERT pre-training model is a language representation model trained by Google in an unsupervised manner on massive unlabeled text. It is a general semantic representation model with strong transfer capability; it takes the Transformer as its basic network component, takes the Masked Language Model (MLM) and Next Sentence Prediction (NSP) as training objectives, and obtains a general semantic representation through pre-training.

Whether learning is supervised depends on whether the input data is labeled. If the input data has labels, the learning is supervised; if it has no labels, the learning is unsupervised.

Compared with traditional word vectors such as Word2Vec (word to vector, a family of models used to generate word vectors) and GloVe (Global Vectors for Word Representation, a word representation tool based on global word frequency statistics), BERT provides the contextual word representation that has become popular in recent years; that is, context is taken into account, and the same word has different representations in different contexts. Intuitively, this also matches human natural language, in which the meaning of the same word is likely to differ from one situation to another.

In an alternative embodiment, the semantic feature vector includes a high-level semantic vector and a medium-level semantic vector.

Threat intelligence data is input into a full binary quantized language representation model such that the language representation model outputs high-level semantic vectors and medium-level semantic vectors.

The unstructured threat intelligence text is encoded: a BERT pre-trained language model encodes the character vectors, and a text vector is generated through a max pooling layer. The generated text vector can be used to calculate the similarity of intelligence texts, generally with cosine similarity or the like. However, because the amount of threat intelligence data is large, this approach alone cannot meet the performance requirements. A bit encoding layer is therefore adopted to generate a representative hash value for the text vector, and a binary-code representation learning layer is introduced.

Specifically, a layer for hash representation learning is added between the output layer and the semantic hidden layer. This layer has a fully connected structure and uses a sigmoid activation function (an S-shaped function commonly used as a neural network activation and in logistic regression), so that each floating-point dimension of the hidden representation is represented as a Boolean binary value in {0, 1}. Through training, a binary code (Bit Encoding) carrying medium-level semantic features and a high-level semantic representation (Semantic High Layer) are generated. The specific formulas are as follows:

$$h_{bert} = \mathrm{BERT}(x) \tag{7}$$

$$h_{semantic} = \mathrm{MaxPool}(h_{bert}) \tag{8}$$

$$h_{bit} = \mathrm{sigmoid}(W h_{semantic} + B) \tag{9}$$

where $x$ is the character vector, BERT is the pre-trained language model used for text feature extraction and representation, MaxPool is the max pooling layer used to extract the important components of the features, $h_{semantic}$ is the 256-dimensional high-level semantic vector extracted by the model, and $h_{bit}$ is the 64-dimensional medium-level semantic vector obtained through the fully connected layer and the activation function.
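The pooling and bit-encoding steps can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the BERT encoder is replaced by hand-written toy token vectors, the weights are fixed rather than learned, the dimensions are far smaller than the 256 and 64 used in the description, and binarizing the sigmoid output at 0.5 is an assumed rule that the text does not specify.

```python
import math
from typing import List

def max_pool(token_vectors: List[List[float]]) -> List[float]:
    """Element-wise maximum over all token vectors (the high-level semantic vector)."""
    return [max(column) for column in zip(*token_vectors)]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bit_encode(h_semantic: List[float], W: List[List[float]],
               B: List[float], threshold: float = 0.5) -> int:
    """Apply sigmoid(W h + B) and pack the thresholded outputs into an integer."""
    bits = 0
    for row, bias in zip(W, B):
        activation = sigmoid(sum(w * x for w, x in zip(row, h_semantic)) + bias)
        bits = (bits << 1) | (1 if activation >= threshold else 0)
    return bits

# Toy stand-in for BERT output: two tokens, two dimensions each.
token_vectors = [[0.1, -2.0], [0.5, 1.0]]
h_semantic = max_pool(token_vectors)            # [0.5, 1.0]
W = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
B = [0.0, 0.0, 0.0, 0.0]
h_bit = bit_encode(h_semantic, W, B)
print(format(h_bit, "04b"))
```

Packing the bits into a single integer keeps the code compact to store and allows later comparisons to use integer exclusive-OR.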

After the semantic feature vectors are obtained, threat intelligence data may be deduplicated with the semantic feature vectors.

In an alternative embodiment, fig. 5 shows a flow diagram of a method for deduplication processing of threat intelligence data, which may include at least the following steps, as shown in fig. 5: in step S510, the stored information in the information database is obtained, and a first distance calculation is performed on the medium-level semantic vector and the stored information to determine an information candidate set.

For each piece of intelligence data, the high-performance storage database Redis stores medium-level semantic features and high-level semantic features, so that the stored intelligence data in the intelligence database can be acquired.

For threat intelligence data to be checked for duplicates, the medium-level semantic vector and the high-level semantic vector can be obtained after the data passes through the Bit-BERT model. Therefore, the Hamming distance can be computed between the medium-level semantic vector and the stored intelligence data, and the stored intelligence data whose resulting similarity is higher than the preset threshold can be placed in the intelligence candidate set.
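The coarse-grained filtering step can be sketched as follows. The sketch assumes bit vectors are stored as Python integers and converts the Hamming distance into a similarity of 1 - distance / n_bits so that "higher than the threshold" selects candidates; this conversion, the function names, and the in-memory dictionary standing in for Redis are all illustrative assumptions.

```python
from typing import Dict, List

def hamming_distance(a: int, b: int) -> int:
    """Count differing bits: XOR the two codes, then count the set bits."""
    return bin(a ^ b).count("1")

def bit_similarity(a: int, b: int, n_bits: int = 64) -> float:
    """Map a Hamming distance to a similarity in [0, 1]."""
    return 1.0 - hamming_distance(a, b) / n_bits

def coarse_candidates(query_bits: int, stored: Dict[str, int],
                      threshold: float, n_bits: int = 64) -> List[str]:
    """Return ids of stored entries whose bit similarity reaches the threshold."""
    return [intel_id for intel_id, bits in stored.items()
            if bit_similarity(query_bits, bits, n_bits) >= threshold]

# Toy 4-bit codes standing in for the 64-bit medium-level semantic vectors.
stored = {"intel-1": 0b1010, "intel-2": 0b0101}
print(coarse_candidates(0b1011, stored, threshold=0.7, n_bits=4))
```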

In step S520, a second distance calculation is performed on the high-level semantic vectors and the stored information data in the information candidate set to determine a second deduplication score, and the first deduplication score and the second deduplication score are calculated to obtain a repetition confidence.

After the intelligence candidate set is determined, similarity calculation may be performed between the high-level semantic vector and the corresponding stored intelligence data using cosine similarity or a similar measure. A similarity score higher than the corresponding threshold is used as the second deduplication score, while a similarity score lower than the threshold indicates that the data is not a duplicate.
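A minimal sketch of the fine-grained step is given below, assuming cosine similarity is the chosen measure and that the best-matching candidate determines the score. The function names, the use of `None` to signal "not a duplicate", and the toy 2-dimensional vectors are assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional

def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def second_dedup_score(query: List[float], candidates: Dict[str, List[float]],
                       threshold: float) -> Optional[float]:
    """Return the best cosine similarity if it reaches the threshold, else None."""
    best = max((cosine_similarity(query, vec) for vec in candidates.values()),
               default=0.0)
    return best if best >= threshold else None

candidates = {"intel-1": [1.0, 0.0], "intel-2": [0.0, 1.0]}
print(second_dedup_score([1.0, 0.0], candidates, threshold=0.8))
```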

The Hamming distance is computed with an exclusive-OR operation, which greatly improves calculation efficiency. The fast coarse-grained calculation narrows the comparison range, and the fine-grained calculation then produces a similarity score with higher confidence, so that the method balances accuracy and calculation efficiency.

After the first deduplication score A and the second deduplication score B are calculated, the final repetition confidence may be calculated as: repetition confidence = 0.6 × confidence A + 0.4 × confidence B.

In step S530, the threat intelligence data is deduplicated according to the repetition confidence.

After the confidence of repetition is computed, the confidence of repetition can be compared to a corresponding threshold to deduplicate the threat intelligence data. In general, the threshold may be set to 0.6.

When the repetition confidence coefficient is greater than or equal to 0.6, determining that the threat intelligence data is repeated; when the repetition confidence is less than 0.6, the threat intelligence data may be stored, and the storage format may be < medium level semantic vector, high level semantic vector >.
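The weighting and threshold decision can be written out as a short sketch. The weights 0.6 and 0.4 and the threshold 0.6 are the values given in the description; the function names are illustrative.

```python
def repetition_confidence(score_a: float, score_b: float,
                          weight_a: float = 0.6, weight_b: float = 0.4) -> float:
    """Weighted combination of the first (A) and second (B) deduplication scores."""
    return weight_a * score_a + weight_b * score_b

def is_duplicate(score_a: float, score_b: float, threshold: float = 0.6) -> bool:
    """Data is judged a duplicate when the confidence reaches the threshold."""
    return repetition_confidence(score_a, score_b) >= threshold

print(is_duplicate(1.0, 0.0))  # keyword match only
print(is_duplicate(0.0, 1.0))  # text similarity only
```

With these values, a keyword match alone (A = 1, B = 0) gives 0.6 and reaches the threshold, whereas a perfect text similarity with no keyword match (A = 0, B = 1) gives 0.4 and does not.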

In the exemplary embodiment, the similarity calculation and vector hash learning of unstructured threat intelligence data can be realized through a full-binary quantized language representation model, the detection rate of repeated texts is effectively improved, and the similarity calculation efficiency can be effectively improved by a similarity measurement method based on medium-level semantic vectors and high-level semantic vectors with medium and high granularity.

In step S130, when the data type is a structured type, data compression processing is performed on the data type, and the compressed threat intelligence data is stored to perform deduplication processing.

In the exemplary embodiment of the disclosure, when the threat intelligence data is structured intelligence data, the deduplication service for structured intelligence data often uses a high-performance storage database such as Redis because of its high concurrency. However, since the data volume is huge, loading all of the data into Redis may cause problems such as memory overflow. Therefore, the deduplication service needs to compress the data in order to reduce its system overhead.

In an alternative embodiment, fig. 6 is a flowchart illustrating a method for performing data compression processing on data types, and as shown in fig. 6, the method may include at least the following steps: in step S610, the data type is encoded to obtain a first bit vector, and the key data is hashed to obtain a second bit vector.

The structured type of threat intelligence data is processed by bit hashing. Specifically, a 4-bit vector is generated as a first bit vector according to the data type of the threat intelligence data, and a 60-bit vector is generated as a second bit vector by a hash algorithm according to the key data of the threat intelligence data.

In step S620, the first bit vector and the second bit vector are calculated to obtain a target bit vector, so as to obtain compressed threat intelligence data.

After the first bit vector and the second bit vector are generated, a weighted summation of the first bit vector and the second bit vector can be performed to obtain a 64-bit vector as the target bit vector, so that each feature is converted into a 64-bit hash vector, for example:

1 0 0 1 0 1 … 0 0 1 0    feature value 1

1 1 0 1 0 … 0 0 1 0    feature value 2

1 1 0 1 0 0 0 1 0 0 … 0 0 0 0 0    feature value 3

The weight of the first bit vector is the frequency with which its corresponding content appears in the threat intelligence data, and the weight of the second bit vector is likewise the frequency with which its corresponding content appears in the threat intelligence data.
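To make the 4-bit plus 60-bit layout concrete, the following sketch builds a 64-bit value by simple concatenation: the type code occupies the top 4 bits and a truncated hash of the key data occupies the low 60 bits. This is a simplification and not the frequency-weighted summation described above; the type codes in `TYPE_CODES` and the choice of SHA-256 are assumptions, since the description only refers to "a hash algorithm".

```python
import hashlib

# Hypothetical 4-bit codes for structured intelligence types.
TYPE_CODES = {"ip": 0x1, "domain": 0x2, "vulnerability": 0x3, "ioc": 0x4}

HASH_MASK = (1 << 60) - 1  # keep the low 60 bits of the digest

def compress(data_type: str, key_data: str) -> int:
    """Pack a 4-bit type code and a 60-bit key hash into one 64-bit integer."""
    type_bits = TYPE_CODES[data_type] & 0xF
    digest = hashlib.sha256(key_data.encode("utf-8")).digest()
    hash_bits = int.from_bytes(digest[:8], "big") & HASH_MASK
    return (type_bits << 60) | hash_bits

vector = compress("vulnerability", "CVE-2022-30716")
print(format(vector, "064b"))
```

An 8-byte integer per item is far smaller than the original keyword string plus metadata, which is the point of compressing before the data is placed in an in-memory store.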

In this exemplary embodiment, the deduplication program uses Redis as a temporary database, and Redis keeps data in memory. When massive data is deduplicated, converting the keywords into bit features can therefore effectively reduce the resources required by the system and speed up the retrieval of duplicate data.

After compressed threat intelligence data is obtained, the threat intelligence data may also be stored to speed retrieval and reduce resource consumption during deduplication processing.

The following describes the data deduplication method in the embodiment of the present disclosure in detail with reference to an application scenario.

Against this background, enterprises pay more and more attention to threat intelligence: security devices can play a greater role when combined with threat intelligence, and enterprise security operations can respond to security events more quickly when combined with threat intelligence. Thus, threat intelligence plays an increasingly important role in network security.

Threat intelligence refers to an intelligence knowledge base that contains multiple types and dimensions. Wherein, the threat intelligence may include vulnerability intelligence, asset intelligence, IOC intelligence, event intelligence, etc.

As an evidence-based body of knowledge covering context, mechanisms, indicators, and actionable advice, threat intelligence can effectively cover blind spots in network security defense and turn passive protection into active defense. While existing attacks are being detected, it also supports threat tracing, evidence discovery, attack prediction, construction of attack maps, and the like; it improves the overall protection capability of network security devices, thereby reducing the impact of network attacks and providing security decision makers with an important reference for network defense.

As network attack events become more frequent, millions of pieces of threat intelligence are produced every day. However, the quality of commercial threat intelligence and of threat intelligence from open-source websites is uneven; a large amount of duplicate data exists across threat intelligence from different sources, and intelligence from the same source may also repeat earlier data. A threat intelligence platform must provide accurate, high-quality data, and the large amount of duplicate data generated by data sources every day affects platform computation, storage, and operation and maintenance. The deduplication of threat intelligence data has therefore become an important part of intelligence processing and directly affects intelligence quality and the construction of the threat intelligence platform.

Fig. 7 shows a schematic structural diagram of a data deduplication system in an application scenario, and as shown in fig. 7, the system includes a data processing module, a data compression module, an intelligence keyword-type joint extraction model, a Bit-BERT semantic coding model, and a data storage module.

Fig. 8 shows a schematic structural diagram of a data processing module, and as shown in fig. 8, the data processing module is composed of three parts, namely data extraction, data cleaning and data classification.

Threat intelligence refers to an intelligence knowledge base that contains multiple types and multiple dimensions.

Based on this, threat intelligence data may be divided into base intelligence classes, asset classes, vulnerability classes, event classes, IOC classes, attack organization classes, other intelligence types, and so on.

The basic intelligence includes common objects in the network, such as an IP (Internet Protocol) address (192.168.0.x), a domain name (www.xxxxx.com), a mailbox (example@xx.com), a URL (http://www.xxxxxx.com), and a certificate.

The asset information is roughly classified into risk asset information, asset alteration information, and asset discovery information according to the content. Assets are physical or virtual devices in the internet, such as routers, switches, servers, hosts, etc.

Vulnerability intelligence, for example from a national vulnerability database (e.g., NVD, CNVD, CNNVD) or from Common Vulnerabilities and Exposures (CVE), mainly describes the name, description, type, severity score, implementation principle, impact, and patching measures of a vulnerability.

IOC (indicator of compromise) refers to a threat indicator that describes the detection characteristics of a network attack, such as the attack source IP, the domain name, the MD5 hash value of a malicious file, traffic characteristics, or the mailbox from which a phishing email is sent. Security personnel can use IOC intelligence for risk assessment, security hardening, and the like.

Wherein, the data extraction part carries out data standardization processing on threat intelligence data of different intelligence sources. For example, the data normalization processing may be to form a JSON format or the like, and this exemplary embodiment is not particularly limited thereto.

For different types of threat intelligence, the keywords required by the platform are extracted, and a JSON document in a uniform format can then be formed as the key data.

The data cleaning part addresses the fact that intelligence from different sources varies in quality and may contain characters such as the line feed "\n" and the tab "\t". Through data cleaning, operations such as character deletion, character replacement, sensitive word removal, and stop word removal can be performed on the key data, so that the cleaned key data meets the requirements of the subsequent processing.
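A minimal sketch of such cleaning is shown below. It only covers control-character replacement and stop word removal; the stop word list and the whitespace tokenization are illustrative assumptions (text without whitespace-delimited words, such as Chinese, would need a proper tokenizer), and sensitive word removal is omitted.

```python
import re
from typing import Iterable

# Hypothetical stop word list for illustration.
STOP_WORDS = frozenset({"the", "a", "an", "is"})

def clean_text(text: str, stop_words: Iterable[str] = STOP_WORDS) -> str:
    """Replace line feeds and tabs with spaces, then drop stop words."""
    stop = {word.lower() for word in stop_words}
    text = re.sub(r"[\n\t\r]+", " ", text)
    tokens = [token for token in text.split() if token.lower() not in stop]
    return " ".join(tokens)

print(clean_text("Vim is\ta cross-platform\ntext editor"))
```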

The data classification part serves the threat intelligence deduplication process. According to the type of the threat intelligence data, the cleaned key data is divided so that attack organization intelligence, event intelligence, reports, and the like are treated as unstructured intelligence data, while basic intelligence, vulnerability intelligence, IOC intelligence, and the like are treated as structured intelligence data.

Structured threat intelligence refers to data that can be uniquely identified by a character string, such as an IP, an asset, or a vulnerability; for example, a specific IP address or a vulnerability number uniquely identifies a piece of intelligence. Threat reports, major-event security assurance intelligence, and internal intelligence included in other intelligence types may also be of the structured type.

Unstructured threat intelligence data is event-type threat intelligence or the like, in which an attack event is described in prose and may include vulnerability information, attack organization information, and so on. Such intelligence cannot be used directly; it typically requires human or machine reading to extract and organize the required information before usable intelligence is produced.

The data type of threat intelligence data can be determined through preprocessing, a data basis and a theoretical support are provided for subsequent deduplication processing, and the accuracy and timeliness of data deduplication are guaranteed.

When the threat intelligence data is unstructured intelligence data, processing is divided into two steps. First, keywords are extracted from the intelligence text and, at the same time, the intelligence category of the text is determined; the extracted intelligence keywords and intelligence category are passed through the structured deduplication process to obtain a repetition confidence A. Then, the similarity between the intelligence text and the existing texts in the database is calculated to obtain a repetition confidence B. The two scores are weighted to obtain a final score, which determines whether the text is a duplicate.

Observation of a large amount of data shows that threat intelligence texts can be similar while their intelligence keywords differ, for example: intelligence 1: Trojan backdoor, vulnerability exploitation: CVE-2022-26134; intelligence 2: Trojan backdoor, security hole: CVE-2022-30716.

With a text similarity calculation method based only on word co-occurrence, co-occurring words such as "Trojan backdoor" and "vulnerability" lead SimHash to report a similarity of 62.5%, although the CVE vulnerability numbers of the two texts differ and the texts are obviously two different pieces of intelligence. The intelligence keyword-type joint extraction model is proposed to address this problem. It extracts intelligence keywords such as IP addresses, attack organizations, IOC intelligence, and vulnerability numbers (CVE) from the intelligence text; for example, from "trojan backdoor, security hole: CVE-2022-30716" it extracts CVE-2022-30716. The model extracts the intelligence keywords and determines the intelligence category at the same time.
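The point that similar texts can carry different keywords can be illustrated with a small sketch. A regular expression for CVE numbers is used here purely as a stand-in for the trained joint extraction model, which it does not reproduce; the function name and the wording of the two example texts are illustrative.

```python
import re
from typing import Set

# CVE identifiers have the form CVE-<year>-<four or more digits>.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")

def cve_keywords(text: str) -> Set[str]:
    """Return the set of CVE numbers mentioned in the text."""
    return set(CVE_PATTERN.findall(text))

intel_1 = "Trojan backdoor, vulnerability exploitation: CVE-2022-26134"
intel_2 = "Trojan backdoor, security hole: CVE-2022-30716"

# The texts share most of their words, yet their keywords do not overlap.
print(cve_keywords(intel_1), cve_keywords(intel_2))
print(cve_keywords(intel_1).isdisjoint(cve_keywords(intel_2)))
```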

Fig. 9 shows a schematic structural diagram of the intelligence keyword-type joint extraction model. As shown in fig. 9, using the idea of joint training, the joint extraction model takes the intelligence text vector of unstructured intelligence as input, passes it through a word embedding layer, an encoding layer, a conditional random field layer, a sequence prediction layer, and a category prediction layer, and finally outputs the start and end positions of the intelligence keywords together with the intelligence category.

The sequence labels are [B_T, O_T, E_T, X], which respectively represent the start position of a keyword, an interval position of a keyword, the end position of a keyword, and a non-keyword.

Specifically, all the labeled data are subjected to character vector training through a threat intelligence vector pre-training algorithm.

The basic idea of the character vector is to represent each character as a K-dimensional vector, the relation between the characters can be learned in the training process of the character vector, and meanwhile, the vocabulary representation mode in the form of the vector is beneficial to calculation. The specific calculation formula is shown in (1).

And coding the text vector through a bidirectional long-short-term memory neural network to obtain a deep expression of the text vector, wherein the formulas are shown in (2) - (4).

In general, the model can be processed by using a softmax function in the label prediction stage. And for each character, predicting the probability of the character being the start of a keyword, the end of the keyword, the interval of the keyword and a non-keyword, finally selecting an item with the maximum probability as a label of each character, and extracting the keyword through a start label and an end label to obtain keyword data.

While passing through the hidden layer vector h _i The intelligence type is predicted by a feedforward neural network, and the formula is shown in (5).

In terms of the loss function, since keyword extraction and class prediction are used for joint learning, the loss function formula is as shown in formula (6).

The training of the intelligence keyword-type joint extraction model is completed by labeling the free threat intelligence data set, so that the model performance reaches the expected result.

For example, when the input unstructured threat intelligence data is "trojan backdoor, security hole: CVE-2022-30716", the intelligence keyword output for the threat intelligence is "CVE-2022-30716", and the intelligence category output is "vulnerability".

The joint extraction model can extract intelligence categories and intelligence keywords from event intelligence and attack organization intelligence, so that these features yield the repetition confidence score A through the structured deduplication method.

For unstructured intelligence, deduplication cannot be achieved by keywords alone. Consider a piece of intelligence such as "Vim is a cross-platform text editor. Versions prior to Vim 8.2 have a security vulnerability, which is caused by a use-after-free problem." The text of this intelligence contains no obvious keywords, such as an attack source IP address or a CVE number. Therefore, text intelligence without intelligence keywords needs to be deduplicated against the local intelligence library through text similarity calculation.

Therefore, through the Bit-BERT algorithm, the semantic vector is passed through a feedforward neural network to learn a binary bit vector; coarse-grained similarity calculation is performed to obtain a coarse-grained candidate set, and fine-grained similarity calculation is then performed within the coarse-grained candidate set to obtain the repetition confidence B.

Fig. 10 shows a schematic structural diagram of the full binary quantized language representation model. As shown in fig. 10, the unstructured threat intelligence text is encoded: a BERT pre-trained language model encodes the character vectors, and a text vector is generated through a max pooling layer. The generated text vector can be used to calculate the similarity of intelligence texts, generally with cosine similarity or the like. However, because the amount of threat intelligence data is large, this approach alone cannot meet the performance requirements. The scheme therefore adopts a bit encoding layer to generate a representative hash value for the text vector and introduces a binary-code representation learning layer. Specifically, a layer for hash representation learning is added between the output layer and the semantic hidden layer; this layer has a fully connected structure and uses a sigmoid activation function, so that each floating-point dimension of the hidden representation is represented as a Boolean binary value in {0, 1}. Through training, a binary code carrying medium-level semantic features and a high-level semantic representation are generated, as shown in formulas (7) to (9).

For each piece of intelligence data, the high-performance storage database Redis stores medium-level semantic features and high-level semantic features, so that stored intelligence data in the intelligence database can be obtained.

The Hamming distance is computed with an exclusive-OR operation, which greatly improves calculation efficiency. The fast coarse-grained calculation narrows the comparison range, and the fine-grained calculation then produces a similarity score with higher confidence, so that the method balances accuracy and calculation efficiency.

After the first deduplication score A and the second deduplication score B are calculated, the final repetition confidence may be calculated as: repetition confidence = 0.6 × confidence A + 0.4 × confidence B.

After the confidence of repetition is calculated, the confidence of repetition can be compared to a corresponding threshold to deduplicate the threat intelligence data. In general, the threshold may be set to 0.6.

Similarity calculation and vector hash learning of unstructured threat information data can be achieved through the full-binary quantitative language representation model, the detection rate of repeated texts is effectively improved, and the similarity measurement method based on medium-level semantic vectors and high-level semantic vectors with medium and high granularities can effectively improve the calculation efficiency of the similarity.

When the threat intelligence data is structured intelligence data, the deduplication service for structured intelligence data often uses a high-performance storage database such as Redis because of its high concurrency. However, since the data volume is huge, loading all of the data into Redis may cause problems such as memory overflow. Therefore, the deduplication service needs to compress the data in order to reduce its system overhead.

Fig. 11 shows a schematic diagram of a data compression module, which processes structured type threat intelligence data by bit hashing as shown in fig. 11. Specifically, a 4-bit vector is generated as a first bit vector according to the data type of the threat intelligence data, and a 60-bit vector is generated as a second bit vector by a hash algorithm according to the key data of the threat intelligence data.

After the first bit vector and the second bit vector are generated, a weighted sum calculation can be performed on the first bit vector and the second bit vector to obtain a 64-bit vector as a target bit vector, so as to convert each feature into a 64-bit hash vector.
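One way to read the steps above is sketched below. Several points are assumptions rather than details from this application: the 4-bit type codes in `TYPE_CODES` are hypothetical, SHA-256 truncated to 60 bits stands in for the unspecified hash algorithm, and the "weighted sum" is interpreted as using weights 2^60 and 1, which places the 4-bit type vector in the top four bits of the 64-bit result.

```python
import hashlib

# Hypothetical 4-bit codes for intelligence types; the application does not list them.
TYPE_CODES = {"ip": 0x1, "domain": 0x2, "vulnerability": 0x3}

HASH_BITS = 60
HASH_MASK = (1 << HASH_BITS) - 1


def first_bit_vector(data_type: str) -> int:
    """4-bit vector derived from the data type."""
    return TYPE_CODES[data_type] & 0xF


def second_bit_vector(key_data: str) -> int:
    """60-bit vector derived from the key data (SHA-256 is an assumed choice of hash)."""
    digest = hashlib.sha256(key_data.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & HASH_MASK


def target_bit_vector(data_type: str, key_data: str) -> int:
    """64-bit target vector: a weighted sum with assumed weights 2**60 and 1."""
    return first_bit_vector(data_type) * (1 << HASH_BITS) + second_bit_vector(key_data)


vector = target_bit_vector("ip", "203.0.113.7")
print(f"{vector:016x}")  # 16 hexadecimal digits, i.e. 64 bits
```

Under this reading, records of different types can never collide, because the top four bits always differ, and each keyword occupies a fixed 8 bytes regardless of its original length.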

The deduplication process uses Redis as a temporary database, and Redis keeps its data in memory. Therefore, when massive data is deduplicated, converting the keywords into bit features effectively reduces the resources required by the system and speeds up the retrieval of repeated data.

Based on this, for structured threat intelligence data, a bit hash algorithm can be used to map intelligence keywords, such as IP intelligence and vulnerability numbers, into a bit vector, and the storage format is <bit vector, intelligence ID>.

For unstructured threat intelligence data, intelligence keywords and intelligence types are extracted by a joint intelligence keyword-type extraction model, and the keywords and intelligence types are sent to the structured intelligence database, where the storage format is <bit vector, intelligence ID>. If the element already exists in the structured database, a repetition confidence A is generated (the default value is 1); if not, the data is stored in the database.
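The lookup-or-store behaviour described above can be sketched with an in-memory dictionary standing in for the Redis-backed structured database. The class name `StructuredStore`, the method name `check_or_store`, and the value 0.0 returned on a miss are assumptions made for illustration; the text only specifies the value 1 for a hit.

```python
from typing import Dict


class StructuredStore:
    """In-memory stand-in for the <bit vector, intelligence ID> store."""

    def __init__(self) -> None:
        self._ids: Dict[int, str] = {}

    def check_or_store(self, bit_vector: int, intel_id: str) -> float:
        """Return repetition confidence A for a record.

        A hit on an existing bit vector yields 1.0 (the default in the text).
        On a miss the record is stored and 0.0 is returned (an assumed value).
        """
        if bit_vector in self._ids:
            return 1.0
        self._ids[bit_vector] = intel_id
        return 0.0


store = StructuredStore()
print(store.check_or_store(0x1A2B, "intel-001"))  # 0.0, first occurrence is stored
print(store.check_or_store(0x1A2B, "intel-002"))  # 1.0, same bit vector seen before
```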

Further, a medium-level semantic vector (bit vector) and a high-level semantic vector (deep semantic vector) are generated by the Bit-BERT model. The medium-level semantic vector is used to search the semantic storage module (Redis) and generate a candidate set, and finally the maximum similarity value is calculated with the high-level semantic vector to generate a repetition confidence B (the default value is 1).
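The coarse-to-fine retrieval can be sketched as below. This is an illustration under stated assumptions, not the method of this application: the store is a plain list rather than Redis, the Hamming radius `max_hamming` is a hypothetical parameter, cosine similarity is assumed as the fine-grained measure, and 0.0 is returned when no candidate is found.

```python
import math
from typing import List, Sequence, Tuple

# Each stored entry is <medium-level bit vector, high-level semantic vector>.
Entry = Tuple[int, Sequence[float]]


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors via XOR."""
    return bin(a ^ b).count("1")


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def confidence_b(mid: int, high: Sequence[float], store: List[Entry],
                 max_hamming: int = 3) -> float:
    """Coarse filter by Hamming distance, then take the best fine-grained similarity."""
    # Coarse step: keep only entries whose bit vector is close to the query.
    candidates = [h for m, h in store if hamming(mid, m) <= max_hamming]
    # Fine step: the maximum similarity over the candidate set is the score.
    return max((cosine(high, h) for h in candidates), default=0.0)
```

The coarse step touches every stored bit vector but costs only an XOR and a bit count per entry, so the more expensive floating-point comparison runs only on the small candidate set.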

Further, the final repetition confidence is obtained as 0.6 × confidence A + 0.4 × confidence B. If the value is greater than or equal to 0.6, the data is judged to be a duplicate; otherwise, the data is written into the semantic vector retrieval semantic storage module, and the storage format is <medium-level semantic vector, high-level semantic vector>.

For threat intelligence data newly added each day, if the data is judged to be a duplicate, it is not written into the system background database; otherwise, it is written into the database and passed on to subsequent processes such as intelligence data fusion and aging.

The data deduplication method in this application scenario determines the data type of the threat intelligence data, which provides a data basis and theoretical support for applying different deduplication modes to different types of threat intelligence data. On the one hand, deduplicating the threat intelligence data according to the semantic feature vector solves the problems of excessive memory usage and a time-consuming processing flow during deduplication, effectively solves the problem that the original deduplication method cannot capture text information, and improves the retrieval efficiency of unstructured threat intelligence data. On the other hand, performing data compression processing on the data type of structured threat intelligence data solves the problem of excessive system resource consumption when deduplicating massive threat intelligence data, and reduces the resources consumed in storing the threat intelligence data.

Fig. 12 is a schematic diagram illustrating a structure of a data deduplication apparatus, and as shown in fig. 12, the data deduplication apparatus 1200 may include: a data acquisition module 1210, a first deduplication module 1220, and a second deduplication module 1230. Wherein:

a data acquisition module 1210 configured to acquire threat intelligence data and preprocess the threat intelligence data to determine a data type;

a first deduplication module 1220, configured to, when the data type is an unstructured type, perform text similarity calculation on the threat intelligence data to obtain a semantic feature vector, and perform deduplication processing on the threat intelligence data according to the semantic feature vector; or

a second deduplication module 1230 configured to perform data compression processing on the data type and store the compressed threat intelligence data for deduplication processing when the data type is a structured type.
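The division of work among the three modules can be sketched as a simple dispatcher. The class and parameter names below are hypothetical, and the three callables are placeholders for the preprocessing and deduplication logic described in the method; the sketch shows only the routing by data type.

```python
from typing import Callable, Dict

Record = Dict[str, str]


class DataDeduplicationApparatus:
    """Routes each record to a deduplication module according to its data type."""

    def __init__(self,
                 classify: Callable[[Record], str],
                 first_module: Callable[[Record], bool],
                 second_module: Callable[[Record], bool]) -> None:
        self.classify = classify            # data acquisition / preprocessing step
        self.first_module = first_module    # handles the unstructured type
        self.second_module = second_module  # handles the structured type

    def process(self, record: Record) -> bool:
        """Return True when the record is judged to be a duplicate."""
        data_type = self.classify(record)
        if data_type == "unstructured":
            return self.first_module(record)
        if data_type == "structured":
            return self.second_module(record)
        raise ValueError(f"unknown data type: {data_type}")
```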

In an exemplary embodiment of the present invention, the joint extraction model is obtained by training as follows:

inputting the threat intelligence data into a full binary quantized language representation model such that the language representation model outputs the high level semantic vectors and the medium level semantic vectors.

acquiring stored information data in an information database, and calculating a first distance between the medium-level semantic vector and the stored information data to determine an information candidate set;

The specific details of the data deduplication apparatus 1200 have already been described in the corresponding data deduplication method and are therefore not repeated here.

It should be noted that although several modules or units of the data deduplication apparatus 1200 are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.

In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

An electronic device 1300 according to this embodiment of the invention is described below with reference to fig. 13. The electronic device 1300 shown in fig. 13 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present invention.

As shown in fig. 13, the electronic device 1300 takes the form of a general-purpose computing device. The components of the electronic device 1300 may include, but are not limited to: at least one processing unit 1310, at least one storage unit 1320, a bus 1330 connecting the various system components (including the storage unit 1320 and the processing unit 1310), and a display unit 1340.

The storage unit 1320 stores program code that is executable by the processing unit 1310 to cause the processing unit 1310 to perform steps according to various exemplary embodiments of the present invention as described in the "exemplary methods" section above in this specification.

The storage unit 1320 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 1321 and/or a cache memory unit 1322, and may further include a read only memory unit (ROM) 1323.

The storage unit 1320 may also include a program/utility 1324 having a set (at least one) of program modules 1325, such program modules 1325 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may include an implementation of a network environment.

The bus 1330 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 1300 may also communicate with one or more external devices 1500 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1300, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 1300 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1350. In addition, the electronic device 1300 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 1360. As shown, the network adapter 1360 communicates with the other modules of the electronic device 1300 via the bus 1330. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1300, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that enable a computing device (such as a personal computer, a server, a terminal device, or a network device) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.

Referring to fig. 14, a program product 1400 for implementing the above method according to an embodiment of the present invention is described. The program product may employ a portable compact disc read only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited in this respect. In this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. A method for data deduplication, the method comprising:

obtaining threat intelligence data, and preprocessing the threat intelligence data to determine the data type;

2. The data deduplication method of claim 1, wherein the preprocessing the threat intelligence data to determine a data type comprises:

3. The method according to claim 2, wherein the performing data compression processing on the data type includes:

4. The data deduplication method of claim 1, wherein prior to the text similarity calculation on the threat intelligence data to obtain semantic feature vectors, the method further comprises:

5. The data deduplication method of claim 4, wherein the joint extraction model is trained by:

carrying out character vector training on a training sample by utilizing a pre-training algorithm to obtain a text vector, and coding the text vector to obtain a coding vector;

and performing sequence label prediction on the coding vector to obtain key word data, and performing category prediction on the coding vector to obtain category data.

6. The data deduplication method of claim 4, wherein the semantic feature vector comprises a high-level semantic vector and a medium-level semantic vector,

7. The data deduplication method of claim 6, wherein the deduplication processing of the threat intelligence data according to the semantic feature vector comprises:

8. A data deduplication apparatus, comprising:

the first deduplication module is configured to perform text similarity calculation on the threat intelligence data to obtain a semantic feature vector when the data type is an unstructured type, and perform deduplication processing on the threat intelligence data according to the semantic feature vector; or

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data deduplication method of any one of claims 1-7.

10. An electronic device, comprising:

a processor;

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the data deduplication method of any one of claims 1-7 via execution of the executable instructions.