CN115334039B - Feature construction method and device based on artificial intelligence model - Google Patents

Feature construction method and device based on artificial intelligence model

Info

Publication number
CN115334039B
Authority
CN
China
Prior art keywords
feature
field
features
log
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210951261.6A
Other languages
Chinese (zh)
Other versions
CN115334039A (en)
Inventor
吕晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianrongxin Xiongan Network Security Technology Co ltd
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Tianrongxin Xiongan Network Security Technology Co ltd
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianrongxin Xiongan Network Security Technology Co ltd, Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Tianrongxin Xiongan Network Security Technology Co ltd
Priority to CN202210951261.6A priority Critical patent/CN115334039B/en
Publication of CN115334039A publication Critical patent/CN115334039A/en
Application granted granted Critical
Publication of CN115334039B publication Critical patent/CN115334039B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/028 Capturing of monitoring data by filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a feature construction method and device based on an artificial intelligence model. The method comprises the following steps: collecting DNS logs; sorting and deduplicating the DNS logs to obtain a deduplicated log; extracting feature fields from the deduplicated log, the feature fields comprising a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field; performing feature vectorization on the feature fields to obtain feature vectors; and constructing multi-semantic features based on the feature vectors. Implementing this embodiment generates a large number of multi-semantic features with which a higher-quality DGA artificial intelligence detection model can be trained, thereby significantly reducing the detection false alarm rate.

Description

Feature construction method and device based on artificial intelligence model
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a feature construction method and device based on an artificial intelligence model.
Background
In recent years, DGA (domain name generation algorithm) technology has evolved, and more and more DGA-based malware poses a threat to government networks and critical infrastructure networks. Specifically, an attacker can use a DGA to generate a large number of domain names and thereby evade domain name blacklist detection.
However, the domain name filtering method currently employed is blacklist filtering, which cannot effectively filter such a large and flexible set of domain names.
To solve this problem, those skilled in the art have proposed machine-learning-based DGA domain name detection methods, hoping that a classifier built on statistical information of domain name characters can detect DGA domains efficiently. In practice, however, such methods still suffer from a high false alarm rate and cannot filter domain names effectively.
Disclosure of Invention
The embodiments of the present application aim to provide a feature construction method and device based on an artificial intelligence model, which can generate a large number of multi-semantic features so that a higher-quality DGA artificial intelligence detection model can be trained, thereby significantly reducing the detection false alarm rate.
An embodiment of the present application provides a feature construction method based on an artificial intelligence model, including:
collecting DNS logs;
sorting and deduplicating the DNS logs to obtain a deduplicated log;
extracting feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
performing feature vectorization on the feature fields to obtain feature vectors;
and constructing multi-semantic features based on the feature vectors.
In the implementation described above, the method first collects DNS logs; the training samples are therefore based on DNS data, the extracted features are high-quality DNS-based features, and the subsequent model training effect is guaranteed. The method then sorts and deduplicates the DNS logs to obtain a deduplicated log; this simplifies the DNS logs so that only valid and useful samples are retained, further ensuring that the extracted features are accurate and effective. The method then extracts feature fields from the deduplicated log, the feature fields comprising a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field; extracting these fields separately provides basic category information for the construction of multi-semantic features, so that the multi-semantic features can be constructed accurately. Finally, the method performs feature vectorization on the feature fields to obtain feature vectors and constructs multi-semantic features based on the feature vectors; recombining the vectorized features into multi-semantic features makes the constructed features more structured, which in turn makes it easier to generate a large number of high-quality multi-semantic features.
Further, the step of sorting and deduplicating the DNS logs to obtain a deduplicated log includes:
sorting the DNS logs into a json file in which each row comprises field names, field values and a category label;
calculating the md5 value of the field names and field values of each row in the json file;
and filtering out json rows corresponding to the same md5 value to obtain the deduplicated log.
In this implementation, when sorting and deduplicating the DNS logs to obtain the deduplicated log, the method first sorts the DNS logs into a json file in which each row comprises field names, field values and a category label; the DNS logs are thus organized into a multi-row json file with a fixed per-row field structure, which makes subsequent access and computation more concise. The method then calculates the md5 value of the field names and field values of each row in the json file, and filters out json rows corresponding to the same md5 value to obtain the deduplicated log. Filtering the json rows based on their md5 values therefore achieves simple, fast and accurate deduplication.
Further, the step of performing feature vectorization on the feature fields to obtain feature vectors includes:
acquiring character string features and category features included in the feature fields;
performing feature vectorization on the category features to obtain semantic feature vectors;
performing feature vectorization on the character string features to obtain domain name feature vectors;
and combining the semantic feature vectors and the domain name feature vectors to obtain the feature vectors.
In this implementation, when performing feature vectorization on the feature fields to obtain feature vectors, the method first acquires the character string features and category features included in the feature fields, where the category features comprise all field-name category features and the values of all fields other than the request domain name, and the character string features comprise only the value of the request domain name; this classification step prepares the features for vectorization. The method then performs feature vectorization on the category features to obtain semantic feature vectors, and on the character string features to obtain domain name feature vectors; the two types of features are thus vectorized separately, yielding feature vectors of different dimensions. Finally, the method combines the semantic feature vectors and the domain name feature vectors to obtain the feature vectors. Combining the feature vectors according to their inherent grouping facilitates subsequent group-wise feature extraction and use, and yields higher-quality features.
Further, the step of constructing the multi-semantic features based on the feature vectors includes:
dividing the feature vectors based on different field semantics to obtain a plurality of semantic features;
and splicing the plurality of semantic features to obtain the multi-semantic features.
In this implementation, when constructing the multi-semantic features based on the feature vectors, the method first divides the feature vectors according to different field semantics to obtain a plurality of semantic features, and then splices the plurality of semantic features to obtain the multi-semantic features. This construction yields high-order features that fuse syntactic and semantic information, which facilitates subsequent model training.
Further, the method further comprises:
constructing an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and training the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain a DGA domain name detection model.
In this implementation, the method can also construct an artificial intelligence model comprising a self-attention module and a fully-connected classification module, and then train the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain the DGA domain name detection model. By feeding the multi-semantic features into a model with this specified structure and training it with this specified method, the resulting artificial intelligence model can perform deep DGA domain name detection more effectively and accurately.
A second aspect of the embodiments of the present application provides an artificial intelligence model-based feature construction apparatus, including:
a collecting unit for collecting DNS logs;
a deduplication unit for sorting and deduplicating the DNS logs to obtain a deduplicated log;
an extracting unit for extracting feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
a processing unit for performing feature vectorization on the feature fields to obtain feature vectors;
and a construction unit for constructing multi-semantic features based on the feature vectors.
In this implementation, the apparatus collects DNS logs through the collecting unit; the training samples are therefore based on DNS data, the extracted features are high-quality DNS-based features, and the subsequent model training effect is guaranteed. The DNS logs are then sorted and deduplicated by the deduplication unit to obtain a deduplicated log; this simplifies the DNS logs so that only valid and useful samples are retained, further ensuring that the extracted features are accurate and effective. The extracting unit then extracts the feature fields from the deduplicated log, comprising a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field; extracting these fields separately provides basic category information for the construction of multi-semantic features, so that the multi-semantic features can be constructed accurately. The processing unit then performs feature vectorization on the feature fields to obtain feature vectors, and the construction unit constructs the multi-semantic features based on the feature vectors; the apparatus thus recombines the vectorized features into multi-semantic features, yielding more structured multi-semantic features.
Further, the deduplication unit includes:
a sorting subunit, configured to sort the DNS logs into a json file in which each row includes field names, field values and a category label;
a calculating subunit, configured to calculate the md5 value of the field names and field values of each row in the json file;
and a deduplication subunit, configured to filter out json rows corresponding to the same md5 value to obtain the deduplicated log.
In this implementation, the deduplication unit sorts the DNS logs into a json file in which each row includes field names, field values and a category label through the sorting subunit; the DNS logs are thus organized into a multi-row json file with a fixed per-row field structure, which makes subsequent access and computation more concise. The apparatus then calculates the md5 value of the field names and field values of each row in the json file through the calculating subunit, and filters out json rows corresponding to the same md5 value through the deduplication subunit to obtain the deduplicated log. Filtering the json rows based on their md5 values therefore achieves simple, fast and accurate deduplication.
Further, the processing unit includes:
an obtaining subunit, configured to obtain the character string features and category features included in the feature fields;
a processing subunit, configured to perform feature vectorization on the category features to obtain semantic feature vectors;
the processing subunit being further configured to perform feature vectorization on the character string features to obtain domain name feature vectors;
and a combining subunit, configured to combine the semantic feature vectors and the domain name feature vectors to obtain the feature vectors.
In this implementation, the processing unit acquires the character string features and category features included in the feature fields through the obtaining subunit, where the category features comprise all field-name category features and the values of all fields other than the request domain name, and the character string features comprise only the value of the request domain name; this classification step prepares the features for vectorization. The apparatus then performs feature vectorization on the category features through the processing subunit to obtain semantic feature vectors, and on the character string features to obtain domain name feature vectors; the two types of features are thus vectorized separately, yielding feature vectors of different dimensions. Finally, the apparatus combines the semantic feature vectors and the domain name feature vectors through the combining subunit to obtain the feature vectors. Combining the feature vectors according to their inherent grouping facilitates subsequent group-wise feature extraction and use, and yields higher-quality features.
Further, the construction unit includes:
a dividing subunit, configured to divide the feature vectors based on different field semantics to obtain a plurality of semantic features;
and a splicing subunit, configured to splice the plurality of semantic features to obtain the multi-semantic features.
In this implementation, the construction unit divides the feature vectors according to different field semantics through the dividing subunit to obtain a plurality of semantic features, and then splices the plurality of semantic features through the splicing subunit to obtain the multi-semantic features. This construction yields high-order features that fuse syntactic and semantic information, which facilitates subsequent model training.
Further, the artificial intelligence model-based feature construction apparatus further includes:
a building unit for constructing an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and a training unit for training the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain a DGA domain name detection model.
In this implementation, the apparatus constructs an artificial intelligence model comprising a self-attention module and a fully-connected classification module through the building unit, and then trains the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features through the training unit to obtain the DGA domain name detection model. By feeding the multi-semantic features into a model with this specified structure and training it in this specified manner, the resulting artificial intelligence model can perform deep DGA domain name detection more effectively and accurately.
A third aspect of the embodiments of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to perform the artificial intelligence model-based feature construction method according to any one of the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing computer program instructions which, when read and executed by a processor, perform the method for constructing features based on an artificial intelligence model according to any one of the first aspect of the embodiments of the present application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be considered as limiting its scope; other related drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a feature construction method based on an artificial intelligence model according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a feature building device based on an artificial intelligence model according to an embodiment of the present application;
fig. 3 is a diagram of an example of feature construction based on an artificial intelligence model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Example 1
Referring to fig. 1, fig. 1 is a schematic flow chart of a feature construction method based on an artificial intelligence model according to the present embodiment. The feature construction method based on the artificial intelligence model comprises the following steps:
s101, collecting DNS logs.
S102, sorting the DNS logs into a json file in which each row comprises field names, field values and a category label.
In this embodiment, the method sorts the collected DNS logs into a json file.
In this embodiment, each line of the json file is a dictionary comprising a plurality of field names and their corresponding values (i.e., field values), for example {key1: value1, key2: value2, ..., label: 0}. The preceding fields serve as the data portion, and the last field is the category label.
S103, calculating the md5 value of the field names and field values of each row in the json file.
S104, filtering out json rows corresponding to the same md5 value to obtain the deduplicated log.
In this embodiment, the method may initialize a set, then traverse the file and calculate the md5 value of the data portion of each line; if the md5 value is not in the set, the line is kept and the md5 value is added to the set, otherwise the line is deleted. Finally, the set is written out to a file so that it can be reused when new data is added later.
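As an illustration of this step, the following is a minimal sketch of md5-based deduplication over a json-lines file, assuming one json object per line with a trailing "label" field as the category label; the file names and the helper name deduplicate are illustrative only.

```python
# Minimal sketch of md5-based deduplication over a json-lines file (illustrative
# file and field names; assumes one json object per line with a trailing "label").
import hashlib
import json

def deduplicate(in_path="dns_log.jsonl", out_path="dedup_log.jsonl", seen_path="seen_md5.txt"):
    seen = set()  # md5 values of the data portions already kept
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            # The data portion is every field except the trailing category label.
            data = {k: v for k, v in record.items() if k != "label"}
            digest = hashlib.md5(json.dumps(data, sort_keys=True).encode("utf-8")).hexdigest()
            if digest not in seen:       # unseen data portion: keep the line
                seen.add(digest)
                fout.write(line)
    # Write the set out so it can be reused when new data is added later.
    with open(seen_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(seen)))
```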
S105, extracting feature fields from the deduplicated log; the feature fields include a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field.
In this embodiment, the method may extract these 7 fields from the DNS log by constructing regular expressions; for each field, the field name and the corresponding value are extracted separately.
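The exact log syntax and regular expressions are not given in this description, so the following sketch only illustrates the idea of pulling the seven field names and values out of a raw log line; the key names (qname, rlength, qtype, rcode, qdcount, nscount, ttl) and the patterns are assumptions.

```python
# Illustrative extraction of the seven feature fields with regular expressions.
# The raw-log key names and patterns below are assumptions, not the patent's own.
import re

FIELD_PATTERNS = {
    "qname":   re.compile(r"qname=(\S+)"),    # request domain name
    "rlength": re.compile(r"rlength=(\d+)"),  # response length
    "qtype":   re.compile(r"qtype=(\w+)"),    # query type
    "rcode":   re.compile(r"rcode=(\w+)"),    # return code
    "qdcount": re.compile(r"qdcount=(\d+)"),  # entities in the question section
    "nscount": re.compile(r"nscount=(\d+)"),  # entities in the authority section
    "ttl":     re.compile(r"ttl=(\d+)"),      # time to live
}

def extract_fields(log_line: str) -> dict:
    """Return {field_name: field_value} for every field found in one log line."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(log_line)
        if match:
            fields[name] = match.group(1)
    return fields
```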
In this embodiment, it is noted that existing DGA detection schemes model and detect only the value of the request domain name field; however, DGA domain names (i.e., request domain name field values) are not generated according to a single fixed rule, so high false alarm rates are common in the prior art.
In this embodiment, to reduce false alarms, the method adds the values of the response length, query type, return code, question-section entity count, authority-section entity count and time-to-live fields to the request domain name field value to enhance the feature representation. Practical experience shows that these DNS log fields are commonly used to filter false alarms in real-world work.
In this embodiment, the request domain name field, response length field, query type field, return code field, question-section entity count field, authority-section entity count field and time-to-live field come from isolated semantic spaces that reflect different information; if their values were simply concatenated, the model could not effectively distinguish the different semantic spaces. The method therefore prepends the field name to the field value as a category feature to indicate the semantic boundary to the model.
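A minimal sketch of this prepending step is shown below, assuming the fields have already been extracted into a dictionary as above; the helper name is illustrative.

```python
# Sketch of prepending each field name to its value, so the field name acts as a
# category token that marks the semantic boundary between fields for the model.
def to_category_value_pairs(fields: dict) -> list:
    """e.g. {"qtype": "A", "rcode": "NOERROR"} -> [("qtype", "A"), ("rcode", "NOERROR")]"""
    return [(name, str(value)) for name, value in fields.items()]
```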
S106, acquiring character string features and category features included in the feature field.
In this embodiment, the extracted features fall into two categories, category features and character string features, where the category features include all field-name category features and the values of all features other than the request domain name feature, and the character string features include only the value of the request domain name feature.
And S107, performing feature vectorization processing on the category features to obtain semantic feature vectors.
In this embodiment, the method first vectorizes the category features. Specifically, the method may randomly initialize a 7*M matrix in which each row vector represents one semantic category feature. Each semantic category feature both prompts the model and naturally segments the different semantic features.
In this embodiment, the method then vectorizes the corresponding value features. For the values of the response length, query type, return code, question-section entity count, authority-section entity count and time-to-live fields, the number of distinct values M of each field is first counted, and an M*N matrix is then constructed in which each row represents the semantic vector of one value and N is the dimension of the semantic vector.
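The following sketch illustrates the two kinds of lookup tables just described, one row per category feature and, for each non-domain field, one row per distinct value; the dimensions M and N and the example vocabularies are assumptions rather than values fixed by this description.

```python
# Sketch of the two lookup tables: a 7 x M matrix with one row per category
# feature, and for each non-domain field an (values x N) matrix with one row per
# distinct field value. M, N and the example vocabularies are assumed values.
import numpy as np

rng = np.random.default_rng(0)

CATEGORY_FIELDS = ["qname", "rlength", "qtype", "rcode", "qdcount", "nscount", "ttl"]
M = 10  # dimension of a category-feature vector (assumed)
N = 10  # dimension of a value vector (assumed)

# 7 x M matrix: row i is the semantic vector of category feature i.
category_embedding = rng.normal(size=(len(CATEGORY_FIELDS), M))

# Per-field value vocabularies (examples only) and their (values x N) matrices.
value_vocab = {"qtype": ["A", "AAAA", "CNAME", "TXT"], "rcode": ["NOERROR", "NXDOMAIN"]}
value_embeddings = {f: rng.normal(size=(len(v), N)) for f, v in value_vocab.items()}

def lookup(field: str, value: str) -> np.ndarray:
    """Concatenate the category vector and the value vector of one field."""
    cat_vec = category_embedding[CATEGORY_FIELDS.index(field)]
    val_vec = value_embeddings[field][value_vocab[field].index(value)]
    return np.concatenate([cat_vec, val_vec])  # length M + N
```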
S108, performing feature vectorization processing on the character string features to obtain domain name feature vectors.
In this embodiment, for the value of the request domain name field, i.e., the domain name string, the method uses LSTM encoding. Specifically, the method constructs an LSTM deep encoding model, splits the domain name string into characters, feeds the resulting character sequence into the LSTM deep encoding model, and takes the output of its last step as the request domain name feature.
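A minimal Keras sketch of such an LSTM encoder is given below; the character vocabulary, the embedding size and the maximum length are assumptions, and the 128-dimensional output only mirrors the string-feature dimension used in the worked example later in this description.

```python
# Keras sketch of the LSTM encoder for the request domain name string. The
# character vocabulary, embedding size and maximum length are assumptions; the
# 128-dimensional output matches the string-feature dimension used later.
import numpy as np
from tensorflow import keras

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789.-_"
CHAR_INDEX = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is reserved for padding
MAX_LEN = 64

def encode_domain(domain: str) -> np.ndarray:
    """Map a domain name to a fixed-length sequence of character indices."""
    ids = [CHAR_INDEX.get(c, 0) for c in domain.lower()][:MAX_LEN]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

domain_encoder = keras.Sequential([
    keras.layers.Embedding(input_dim=len(CHARS) + 1, output_dim=32, mask_zero=True),
    keras.layers.LSTM(128),  # the output of the last step is the domain name feature
])

# Example: a (1, 128) feature for one domain name.
feature = domain_encoder(encode_domain("ac5y7.hhmukp.com")[None, :])
```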
S109, combining the semantic feature vector and the domain name feature vector to obtain a feature vector.
S110, dividing the feature vectors based on different field semantics to obtain a plurality of semantic features.
In this embodiment, the multi-semantic features include a request domain name semantic module, a response length semantic module, a query type semantic module, a return code semantic module, a question-section entity count semantic module, an authority-section entity count semantic module and a time-to-live semantic module. Within each semantic module, the category feature vector and the corresponding feature value vector are spliced to form one semantic representation, where the category feature vector plays a role similar to part of speech and the corresponding feature value vector plays a role similar to word sense. The 7 features are processed in this way into 7 semantic modules, and the 7 semantic modules are finally spliced into one vector to be detected.
S111, splicing the plurality of semantic features to obtain the multi-semantic features.
In this embodiment, the multi-semantic feature is a high-order representation that fuses syntactic and semantic information.
In this embodiment, the multi-semantic feature vector proposed by the method is formed by fusing a request domain name semantic module, a response length semantic module, a query type semantic module, a return code semantic module, a question-section entity count semantic module, an authority-section entity count semantic module and a time-to-live semantic module, which distinguishes it from the request-domain-name-only features used in existing DGA detection techniques. In addition, each semantic module fuses two feature vectors, a category feature vector and a feature value vector, which distinguishes it from the plain feature value vectors used in existing DGA detection techniques.
As an alternative embodiment, the method further comprises:
constructing an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and training the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain the DGA domain name detection model.
In this embodiment, the method may build a model comprising a self-attention module and a fully-connected classification module.
In this embodiment, the input of the self-attention module is the multi-semantic feature described above, and its output is the multi-semantic feature re-weighted by attention. The self-attention module is a model built on the self-attention mechanism and can capture long-distance, skipping feature combinations. The self-attention module is used because the semantic units of the multi-semantic feature differ in importance, and the module can automatically assign larger weights to the parts that require attention.
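The exact structure of the self-attention module is not specified here, so the following is only a hedged sketch of a scaled dot-product self-attention layer over the semantic units, assuming the units have been projected to a common dimension; it also returns the attention weights, which ties in with the rule-extraction idea discussed next.

```python
# Hedged sketch of a scaled dot-product self-attention layer over the semantic
# units (batch, units, unit_dim); layer sizes are assumptions. It returns both the
# re-weighted units and the attention weights themselves.
import tensorflow as tf
from tensorflow import keras

class SelfAttention(keras.layers.Layer):
    def __init__(self, dim=64, **kwargs):
        super().__init__(**kwargs)
        self.dim = dim
        self.q = keras.layers.Dense(dim)
        self.k = keras.layers.Dense(dim)
        self.v = keras.layers.Dense(dim)

    def call(self, units):
        q, k, v = self.q(units), self.k(units), self.v(units)
        scores = tf.matmul(q, k, transpose_b=True) / (self.dim ** 0.5)
        weights = tf.nn.softmax(scores, axis=-1)  # attention weight of each semantic unit
        return tf.matmul(weights, v), weights     # weighted units + scores for inspection
```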
In this embodiment, the method proposes, for the first time, an auxiliary method that extracts the output of the self-attention module as rules. The self-attention model maintains an attention matrix; when a new sample is processed, each feature is automatically scored, and the parts with higher scores can be extracted and used as rules for judging malicious samples.
In this embodiment, the method may build two fully connected layers to perform feature interaction and then produce the classification result.
In this embodiment, the method may train the classification model with a gradient back-propagation method, mainly adjusting the learning rate and the batch size to obtain the best result.
Referring to FIG. 3, FIG. 3 is a diagram illustrating an example of feature construction based on an artificial intelligence model. As shown in the figure, the method provides a DGA detection method based on self-attention over multi-semantic features: it constructs multi-semantic features by extracting each field (including the field name and field value) from the DNS log and builds a self-attention learning model, so as to solve the false alarm problem in DGA domain name detection.
For example, the method can be applied to deep DGA domain name detection and can also be used to extract judgment rules for DGA domains, so that detection false alarms can be effectively reduced. A concrete example flow is as follows:
(1) Samples are collected and deduplicated. The white samples and black samples are taken from online logs; after deduplication there are 108,000 white samples and 46,000 black samples in total.
(2) Two lists, data and labels, are initialized. The files under the path are traversed according to their white/black category; for each file, a regular expression extracts the string formed by the feature categories and feature values in each sample and stores it in data, while the corresponding label, white (0) or black (1), is stored in labels. The features include: the request domain name feature, response length feature, query type feature, return code feature, question-section entity count, authority-section entity count and time-to-live feature.
(3) The feature representation comprises two parallel steps: vectorization of the category features and vectorization of the string type.
(4) Vectorization of the category features is performed first. A 7*N1 matrix is initialized, where 7 corresponds to the 7 category features (query type, return code, question-section entity count, authority-section entity count, time to live, response length and request domain name) and N1 is the semantic dimension of each category feature. Because the semantic gap between different categories of features is large, the category feature vectors can prompt the model to focus on the important content.
(5) Next, the values of the query type, return code, question-section entity count, authority-section entity count, time-to-live and response length features are vectorized. The values of these 6 features are numeric and fall roughly into two types: the query type and return code take discrete values, while the remaining 4 features take continuous values. For a given feature, the continuous values are discretized according to a preset interval; for example, with an interval of 100, the value range 0-600 is discretized into 6 values. An M1*N2 matrix is then initialized, where M1 is the number of discrete values of the feature and N2 is the feature dimension.
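A small sketch of the discretization just described, using the interval of 100 over the range 0-600 from the example; the function name and the clipping behaviour are assumptions.

```python
# Sketch of discretizing a continuous field value with a preset interval; with an
# interval of 100, the range 0-600 maps onto buckets 0..5 (clipping is an assumption).
def discretize(value: float, interval: int = 100, max_value: int = 600) -> int:
    """Return the bucket index of a continuous value, clipped to [0, max_value)."""
    value = min(max(value, 0), max_value - 1)
    return int(value // interval)

# discretize(0) -> 0, discretize(257) -> 2, discretize(599) -> 5
```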
(6) Finally, the string type is vectorized. The value of the request domain name feature is a string, e.g. "ac5y7.hhmukp.com". First, the string is split into characters; second, each character is mapped to a vector; third, the matrix formed by the mapping vectors of the characters in the string is fed into an LSTM (long short-term memory network); fourth, the final output of the LSTM is taken as the representation of "ac5y7.hhmukp.com".
(7) After the above feature representation has been applied to all the features, the category feature vector and the feature value vector are spliced to form one semantic unit, and the 7 semantic units are then spliced into one vector to be detected. The vector to be detected has 258 dimensions: each feature category vector has 10 dimensions, each numeric feature value vector has 10 dimensions, and the string-type feature has 128 dimensions.
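The dimension bookkeeping above can be checked with a short sketch, assuming 10-dimensional category vectors, 10-dimensional value vectors and a 128-dimensional domain-name vector: one request-domain unit of 138 dimensions plus six numeric units of 20 dimensions gives 258 dimensions.

```python
# Worked check of the 258 dimensions quoted above, assuming 10-dimensional
# category vectors, 10-dimensional value vectors and a 128-dimensional domain vector.
import numpy as np

def build_detection_vector(category_vecs, value_vecs, domain_vec):
    """category_vecs: seven (10,) arrays; value_vecs: six (10,) arrays;
    domain_vec: one (128,) array. Returns the vector to be detected."""
    units = [np.concatenate([category_vecs[0], domain_vec])]  # request-domain unit: 10 + 128
    units += [np.concatenate([c, v])                          # six numeric units: 10 + 10 each
              for c, v in zip(category_vecs[1:], value_vecs)]
    return np.concatenate(units)  # 138 + 6 * 20 = 258 dimensions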
(8) A self-attention network is built. Because the semantic units of the vector to be detected differ in importance, and the sample category can be judged from skipping combinations of information, a self-attention deep learning network is built with the keras tool. Different attention weights are assigned automatically according to the sample, which realizes high-order interaction between features and improves the generalization performance of the model.
(9) The model is trained for 50 epochs; ReduceLROnPlateau with patience=2 is set so that the model converges to a better position, and EarlyStopping with patience=5 reduces the time cost of model training.
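Steps (8) and (9) together can be sketched as below, reusing the SelfAttention layer sketched earlier; the input shape, hidden sizes and optimizer are assumptions, while the 50 epochs and the ReduceLROnPlateau/EarlyStopping patience values follow the description above.

```python
# Sketch of steps (8)-(9): self-attention over the semantic units followed by two
# fully connected layers, trained with the callbacks named above. Input shape,
# hidden sizes and optimizer are assumptions; SelfAttention is the layer sketched earlier.
from tensorflow import keras

inputs = keras.Input(shape=(7, 64))               # 7 semantic units, projected to a common size
attended, _ = SelfAttention(dim=64)(inputs)
x = keras.layers.Flatten()(attended)
x = keras.layers.Dense(64, activation="relu")(x)  # two fully connected layers for feature interaction
x = keras.layers.Dense(32, activation="relu")(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)  # white (0) / black (1)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    keras.callbacks.ReduceLROnPlateau(patience=2),  # converge to a better position
    keras.callbacks.EarlyStopping(patience=5),      # cut the time cost of training
]
# model.fit(train_x, train_y, validation_split=0.1, epochs=50, batch_size=64, callbacks=callbacks)
```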
In this embodiment, the execution subject of the method may be a computing device such as a computer or a server, which is not limited in this embodiment.
In this embodiment, the execution body of the method may be an intelligent device such as a smart phone or a tablet computer, which is not limited in this embodiment.
It can therefore be seen that, by implementing the artificial intelligence model-based feature construction method described in this embodiment, response length, query type, return code, question-section entity count, authority-section entity count and time-to-live features can be introduced to enhance the semantic representation of the features. Moreover, because these features come from isolated semantic spaces that reflect different information, the application proposes extracting the corresponding category features to solve the problem that the model cannot distinguish the boundaries between different semantic features. According to the experimental results, training the model with the features obtained by this method reduces the false alarm rate by 8.52% and improves the detection rate by 1.64%. In addition, the self-attention model proposed by the method assigns attention scores to the relevant features of the text to be detected, so the parts that lead to the result can be clearly captured and the extracted features can be further processed into judgment rules.
Example 2
Referring to fig. 2, fig. 2 is a schematic structural diagram of a feature constructing apparatus based on an artificial intelligence model according to the present embodiment. As shown in fig. 2, the artificial intelligence model-based feature construction apparatus includes:
a collecting unit 210 for collecting DNS logs;
a deduplication unit 220, configured to sort and deduplicate the DNS logs to obtain a deduplicated log;
an extracting unit 230, configured to extract feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
a processing unit 240, configured to perform feature vectorization on the feature fields to obtain feature vectors;
and a construction unit 250, configured to construct multi-semantic features based on the feature vectors.
As an alternative embodiment, the deduplication unit 220 includes:
a sorting subunit 221, configured to sort the DNS logs into a json file in which each row includes field names, field values and a category label;
a calculating subunit 222, configured to calculate the md5 value of the field names and field values of each row in the json file;
and a deduplication subunit 223, configured to filter out json rows corresponding to the same md5 value to obtain the deduplicated log.
As an alternative embodiment, the processing unit 240 includes:
an obtaining subunit 241, configured to obtain the character string features and category features included in the feature fields;
a processing subunit 242, configured to perform feature vectorization on the category features to obtain semantic feature vectors;
the processing subunit 242 being further configured to perform feature vectorization on the character string features to obtain domain name feature vectors;
and a combining subunit 243, configured to combine the semantic feature vectors and the domain name feature vectors to obtain the feature vectors.
As an alternative embodiment, the construction unit 250 includes:
a dividing subunit 251, configured to divide the feature vectors based on different field semantics to obtain a plurality of semantic features;
and a splicing subunit 252, configured to splice the plurality of semantic features to obtain the multi-semantic features.
As an alternative embodiment, the feature building apparatus based on the artificial intelligence model further includes:
a building unit 260, configured to construct an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and a training unit 270, configured to train the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain a DGA domain name detection model.
In this embodiment, the explanation of the feature constructing apparatus based on the artificial intelligence model may refer to the description in embodiment 1, and the description is not repeated in this embodiment.
It can be seen that, by implementing the artificial intelligence model-based feature construction apparatus described in this embodiment, response length, query type, return code, question-section entity count, authority-section entity count and time-to-live features can be introduced to enhance the semantic representation of the features. Moreover, because these features come from isolated semantic spaces that reflect different information, the application proposes extracting the corresponding category features to solve the problem that the model cannot distinguish the boundaries between different semantic features. According to the experimental results, training the model with the features obtained by this apparatus reduces the false alarm rate by 8.52% and improves the detection rate by 1.64%. In addition, the self-attention model provided by the apparatus assigns attention scores to the relevant features of the text to be detected, so the parts that lead to the result can be clearly captured and the extracted features can be further processed into judgment rules.
An embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to execute a feature construction method based on an artificial intelligence model in embodiment 1 of the present application.
The present embodiment provides a computer readable storage medium storing computer program instructions that, when read and executed by a processor, perform the artificial intelligence model-based feature construction method of embodiment 1 of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners as well. The apparatus embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application; various modifications and variations may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (7)

1. A feature construction method based on an artificial intelligence model, characterized by comprising the following steps:
collecting DNS logs;
sorting and deduplicating the DNS logs to obtain a deduplicated log;
extracting feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
performing feature vectorization on the feature fields to obtain feature vectors;
and constructing multi-semantic features based on the feature vectors;
wherein the step of performing feature vectorization on the feature fields to obtain feature vectors comprises:
acquiring character string features and category features included in the feature fields;
performing feature vectorization on the category features to obtain semantic feature vectors;
performing feature vectorization on the character string features to obtain domain name feature vectors;
and combining the semantic feature vectors and the domain name feature vectors to obtain the feature vectors;
and wherein the step of constructing the multi-semantic features based on the feature vectors comprises:
dividing the feature vectors based on different field semantics to obtain a plurality of semantic features;
and splicing the plurality of semantic features to obtain the multi-semantic features.
2. The artificial intelligence model-based feature construction method according to claim 1, wherein the step of sorting and deduplicating the DNS logs to obtain a deduplicated log comprises:
sorting the DNS logs into a json file in which each row comprises field names, field values and a category label;
calculating the md5 value of the field names and field values of each row in the json file;
and filtering out json rows corresponding to the same md5 value to obtain the deduplicated log.
3. The artificial intelligence model-based feature construction method according to claim 1, further comprising:
constructing an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and training the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain a DGA domain name detection model.
4. A feature construction apparatus based on an artificial intelligence model, characterized in that the apparatus comprises:
a collecting unit for collecting DNS logs;
a deduplication unit for sorting and deduplicating the DNS logs to obtain a deduplicated log;
an extracting unit for extracting feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
a processing unit for performing feature vectorization on the feature fields to obtain feature vectors;
and a construction unit for constructing multi-semantic features based on the feature vectors;
wherein the processing unit comprises:
an obtaining subunit, configured to obtain character string features and category features included in the feature fields;
a processing subunit, configured to perform feature vectorization on the category features to obtain semantic feature vectors;
the processing subunit being further configured to perform feature vectorization on the character string features to obtain domain name feature vectors;
and a combining subunit, configured to combine the semantic feature vectors and the domain name feature vectors to obtain the feature vectors;
and wherein the construction unit comprises:
a dividing subunit, configured to divide the feature vectors based on different field semantics to obtain a plurality of semantic features;
and a splicing subunit, configured to splice the plurality of semantic features to obtain the multi-semantic features.
5. The artificial intelligence model-based feature construction apparatus according to claim 4, wherein the deduplication unit comprises:
a sorting subunit, configured to sort the DNS logs into a json file in which each row includes field names, field values and a category label;
a calculating subunit, configured to calculate the md5 value of the field names and field values of each row in the json file;
and a deduplication subunit, configured to filter out json rows corresponding to the same md5 value to obtain the deduplicated log.
6. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the artificial intelligence model-based feature construction method of any one of claims 1 to 3.
7. A readable storage medium having stored therein computer program instructions which, when read and executed by a processor, perform the artificial intelligence model based feature construction method of any one of claims 1 to 3.
CN202210951261.6A 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model Active CN115334039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210951261.6A CN115334039B (en) 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210951261.6A CN115334039B (en) 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model

Publications (2)

Publication Number Publication Date
CN115334039A CN115334039A (en) 2022-11-11
CN115334039B (en) 2024-02-20

Family

ID=83921814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210951261.6A Active CN115334039B (en) 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model

Country Status (1)

Country Link
CN (1) CN115334039B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108200054A (en) * 2017-12-29 2018-06-22 北京奇安信科技有限公司 A kind of malice domain name detection method and device based on dns resolution
CN110266647A (en) * 2019-05-22 2019-09-20 北京金睛云华科技有限公司 It is a kind of to order and control communication check method and system
CN111708860A (en) * 2020-06-15 2020-09-25 北京优特捷信息技术有限公司 Information extraction method, device, equipment and storage medium
CN112769755A (en) * 2020-12-18 2021-05-07 国家计算机网络与信息安全管理中心 DNS log statistical feature extraction method for threat detection
CN112839012A (en) * 2019-11-22 2021-05-25 中国移动通信有限公司研究院 Zombie program domain name identification method, device, equipment and storage medium
CN114386410A (en) * 2022-01-11 2022-04-22 腾讯科技(深圳)有限公司 Training method and text processing method of pre-training model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9720901B2 (en) * 2015-11-19 2017-08-01 King Abdulaziz City For Science And Technology Automated text-evaluation of user generated text

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108200054A (en) * 2017-12-29 2018-06-22 北京奇安信科技有限公司 A kind of malice domain name detection method and device based on dns resolution
CN110266647A (en) * 2019-05-22 2019-09-20 北京金睛云华科技有限公司 It is a kind of to order and control communication check method and system
CN112839012A (en) * 2019-11-22 2021-05-25 中国移动通信有限公司研究院 Zombie program domain name identification method, device, equipment and storage medium
CN111708860A (en) * 2020-06-15 2020-09-25 北京优特捷信息技术有限公司 Information extraction method, device, equipment and storage medium
CN112769755A (en) * 2020-12-18 2021-05-07 国家计算机网络与信息安全管理中心 DNS log statistical feature extraction method for threat detection
CN114386410A (en) * 2022-01-11 2022-04-22 腾讯科技(深圳)有限公司 Training method and text processing method of pre-training model

Also Published As

Publication number Publication date
CN115334039A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN109960724B (en) Text summarization method based on TF-IDF
CN106294350B (en) A kind of text polymerization and device
CN110750640A (en) Text data classification method and device based on neural network model and storage medium
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
Maharjan et al. A multi-task approach to predict likability of books
CN106909669B (en) Method and device for detecting promotion information
CN107291895B (en) Quick hierarchical document query method
CN106909575B (en) Text clustering method and device
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN110858217A (en) Method and device for detecting microblog sensitive topics and readable storage medium
CN110990676A (en) Social media hotspot topic extraction method and system
CN112651025A (en) Webshell detection method based on character-level embedded code
KR101379128B1 (en) Dictionary generation device, dictionary generation method, and computer readable recording medium storing the dictionary generation program
CN111797247B (en) Case pushing method and device based on artificial intelligence, electronic equipment and medium
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
Krokos et al. A look into twitter hashtag discovery and generation
CN106933818A (en) A kind of quick multiple key text matching technique and device
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN108475265B (en) Method and device for acquiring unknown words
CN115334039B (en) Feature construction method and device based on artificial intelligent model
CN115344563B (en) Data deduplication method and device, storage medium and electronic equipment
CN110874408B (en) Model training method, text recognition device and computing equipment
CN108875060B (en) Website identification method and identification system
CN112528021B (en) Model training method, model training device and intelligent equipment
CN109344397A (en) The extracting method and device of text feature word, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20240115

Address after: 071800 Conference Center 1-184, South Section of Baojin Expressway, Xiong'an Area, Xiong'an New District, Baoding City, Hebei Province

Applicant after: Tianrongxin Xiongan Network Security Technology Co.,Ltd.

Applicant after: Beijing Topsec Network Security Technology Co.,Ltd.

Applicant after: Topsec Technologies Inc.

Applicant after: BEIJING TOPSEC SOFTWARE Co.,Ltd.

Address before: 100085 4th floor, building 3, yard 1, Shangdi East Road, Haidian District, Beijing

Applicant before: Beijing Topsec Network Security Technology Co.,Ltd.

Applicant before: Topsec Technologies Inc.

Applicant before: BEIJING TOPSEC SOFTWARE Co.,Ltd.

GR01 Patent grant
GR01 Patent grant