CN115334039B - Feature construction method and device based on artificial intelligence model - Google Patents

Feature construction method and device based on artificial intelligence model

Info

Publication number
CN115334039B
Authority
CN
China
Prior art keywords
feature
field
features
log
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210951261.6A
Other languages
Chinese (zh)
Other versions
CN115334039A (en)
Inventor
吕晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianrongxin Xiongan Network Security Technology Co ltd
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Tianrongxin Xiongan Network Security Technology Co ltd
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianrongxin Xiongan Network Security Technology Co ltd, Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Tianrongxin Xiongan Network Security Technology Co ltd
Priority to CN202210951261.6A priority Critical patent/CN115334039B/en
Publication of CN115334039A publication Critical patent/CN115334039A/en
Application granted granted Critical
Publication of CN115334039B publication Critical patent/CN115334039B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/028 Capturing of monitoring data by filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a feature construction method and device based on an artificial intelligence model. The method comprises the following steps: collecting DNS logs; sorting and deduplicating the DNS logs to obtain a deduplicated log; extracting feature fields from the deduplicated log, the feature fields comprising a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field; performing feature vectorization on the feature fields to obtain feature vectors; and constructing multi-semantic features based on the feature vectors. Implementing this embodiment generates a large number of multi-semantic features with which a higher-quality DGA artificial intelligence detection model can be trained, thereby significantly reducing the detection false alarm rate.

Description

Feature construction method and device based on artificial intelligence model
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a feature construction method and device based on an artificial intelligence model.
Background
In recent years, DGA (domain name generation algorithm) technology has evolved, and more and more DGA-based malware poses a threat to government networks and critical infrastructure networks. Specifically, an attacker can use a DGA to generate a large number of domain names and thereby evade domain name blacklist detection.
However, the domain name filtering method currently employed is blacklist filtering, which cannot effectively filter such a large and flexible set of domain names.
To solve this problem, those skilled in the art have proposed machine-learning-based DGA domain name detection methods, hoping that a classifier built on statistical information of domain name characters can detect DGA domains efficiently. In practice, however, such methods still suffer from a high false alarm rate and cannot filter domain names effectively.
Disclosure of Invention
The embodiments of the present application aim to provide a feature construction method and device based on an artificial intelligence model, which can generate a large number of multi-semantic features so that a higher-quality DGA artificial intelligence detection model can be trained, thereby significantly reducing the detection false alarm rate.
An embodiment of the present application provides a feature construction method based on an artificial intelligence model, including:
collecting DNS logs;
sorting and deduplicating the DNS logs to obtain a deduplicated log;
extracting feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
performing feature vectorization on the feature fields to obtain feature vectors;
and constructing multi-semantic features based on the feature vectors.
In the implementation described above, the method first collects DNS logs; the training samples are therefore based on DNS data, the extracted features are high-quality DNS-based features, and the subsequent model training effect is guaranteed. The method then sorts and deduplicates the DNS logs to obtain a deduplicated log; this simplifies the DNS logs so that only valid and useful samples are retained, further ensuring that the extracted features are accurate and effective. The method then extracts feature fields from the deduplicated log, the feature fields comprising a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field; extracting these fields separately provides basic category information for the construction of multi-semantic features, so that the multi-semantic features can be constructed accurately. Finally, the method performs feature vectorization on the feature fields to obtain feature vectors and constructs multi-semantic features based on the feature vectors; recombining the vectorized features into multi-semantic features makes the constructed features more structured, which in turn makes it easier to generate a large number of high-quality multi-semantic features.
Further, the step of sorting and deduplicating the DNS logs to obtain a deduplicated log includes:
sorting the DNS logs into a json file in which each row comprises field names, field values and a category label;
calculating the md5 value of the field names and field values of each row in the json file;
and filtering out json rows corresponding to the same md5 value to obtain the deduplicated log.
In this implementation, when sorting and deduplicating the DNS logs to obtain the deduplicated log, the method first sorts the DNS logs into a json file in which each row comprises field names, field values and a category label; the DNS logs are thus organized into a multi-row json file with a fixed per-row field structure, which makes subsequent access and computation more concise. The method then calculates the md5 value of the field names and field values of each row in the json file, and filters out json rows corresponding to the same md5 value to obtain the deduplicated log. Filtering the json rows based on their md5 values therefore achieves simple, fast and accurate deduplication.
Further, the step of performing feature vectorization on the feature fields to obtain feature vectors includes:
acquiring character string features and category features included in the feature fields;
performing feature vectorization on the category features to obtain semantic feature vectors;
performing feature vectorization on the character string features to obtain domain name feature vectors;
and combining the semantic feature vectors and the domain name feature vectors to obtain the feature vectors.
In this implementation, when performing feature vectorization on the feature fields to obtain feature vectors, the method first acquires the character string features and category features included in the feature fields, where the category features comprise all field-name category features and the values of all fields other than the request domain name, and the character string features comprise only the value of the request domain name; this classification step prepares the features for vectorization. The method then performs feature vectorization on the category features to obtain semantic feature vectors, and on the character string features to obtain domain name feature vectors; the two types of features are thus vectorized separately, yielding feature vectors of different dimensions. Finally, the method combines the semantic feature vectors and the domain name feature vectors to obtain the feature vectors. Combining the feature vectors according to their inherent grouping facilitates subsequent group-wise feature extraction and use, and yields higher-quality features.
Further, the step of constructing the multi-semantic features based on the feature vectors includes:
dividing the feature vectors based on different field semantics to obtain a plurality of semantic features;
and splicing the plurality of semantic features to obtain the multi-semantic features.
In this implementation, when constructing the multi-semantic features based on the feature vectors, the method first divides the feature vectors according to different field semantics to obtain a plurality of semantic features, and then splices the plurality of semantic features to obtain the multi-semantic features. This construction yields high-order features that fuse syntactic and semantic information, which facilitates subsequent model training.
Further, the method further comprises:
constructing an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and training the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain a DGA domain name detection model.
In this implementation, the method can also construct an artificial intelligence model comprising a self-attention module and a fully-connected classification module, and then train the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain the DGA domain name detection model. By feeding the multi-semantic features into a model with this specified structure and training it with this specified method, the resulting artificial intelligence model can perform deep DGA domain name detection more effectively and accurately.
A second aspect of the embodiments of the present application provides an artificial intelligence model-based feature construction apparatus, including:
a collecting unit for collecting DNS logs;
a deduplication unit for sorting and deduplicating the DNS logs to obtain a deduplicated log;
an extracting unit for extracting feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
a processing unit for performing feature vectorization on the feature fields to obtain feature vectors;
and a construction unit for constructing multi-semantic features based on the feature vectors.
In this implementation, the apparatus collects DNS logs through the collecting unit; the training samples are therefore based on DNS data, the extracted features are high-quality DNS-based features, and the subsequent model training effect is guaranteed. The DNS logs are then sorted and deduplicated by the deduplication unit to obtain a deduplicated log; this simplifies the DNS logs so that only valid and useful samples are retained, further ensuring that the extracted features are accurate and effective. The extracting unit then extracts the feature fields from the deduplicated log, comprising a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field; extracting these fields separately provides basic category information for the construction of multi-semantic features, so that the multi-semantic features can be constructed accurately. The processing unit then performs feature vectorization on the feature fields to obtain feature vectors, and the construction unit constructs the multi-semantic features based on the feature vectors; the apparatus thus recombines the vectorized features into multi-semantic features, yielding more structured multi-semantic features.
Further, the deduplication unit includes:
a sorting subunit, configured to sort the DNS logs into a json file in which each row includes field names, field values and a category label;
a calculating subunit, configured to calculate the md5 value of the field names and field values of each row in the json file;
and a deduplication subunit, configured to filter out json rows corresponding to the same md5 value to obtain the deduplicated log.
In this implementation, the deduplication unit sorts the DNS logs into a json file in which each row includes field names, field values and a category label through the sorting subunit; the DNS logs are thus organized into a multi-row json file with a fixed per-row field structure, which makes subsequent access and computation more concise. The apparatus then calculates the md5 value of the field names and field values of each row in the json file through the calculating subunit, and filters out json rows corresponding to the same md5 value through the deduplication subunit to obtain the deduplicated log. Filtering the json rows based on their md5 values therefore achieves simple, fast and accurate deduplication.
Further, the processing unit includes:
an obtaining subunit, configured to obtain the character string features and category features included in the feature fields;
a processing subunit, configured to perform feature vectorization on the category features to obtain semantic feature vectors;
the processing subunit being further configured to perform feature vectorization on the character string features to obtain domain name feature vectors;
and a combining subunit, configured to combine the semantic feature vectors and the domain name feature vectors to obtain the feature vectors.
In this implementation, the processing unit acquires the character string features and category features included in the feature fields through the obtaining subunit, where the category features comprise all field-name category features and the values of all fields other than the request domain name, and the character string features comprise only the value of the request domain name; this classification step prepares the features for vectorization. The apparatus then performs feature vectorization on the category features through the processing subunit to obtain semantic feature vectors, and on the character string features to obtain domain name feature vectors; the two types of features are thus vectorized separately, yielding feature vectors of different dimensions. Finally, the apparatus combines the semantic feature vectors and the domain name feature vectors through the combining subunit to obtain the feature vectors. Combining the feature vectors according to their inherent grouping facilitates subsequent group-wise feature extraction and use, and yields higher-quality features.
Further, the construction unit includes:
a dividing subunit, configured to divide the feature vectors based on different field semantics to obtain a plurality of semantic features;
and a splicing subunit, configured to splice the plurality of semantic features to obtain the multi-semantic features.
In this implementation, the construction unit divides the feature vectors according to different field semantics through the dividing subunit to obtain a plurality of semantic features, and then splices the plurality of semantic features through the splicing subunit to obtain the multi-semantic features. This construction yields high-order features that fuse syntactic and semantic information, which facilitates subsequent model training.
Further, the artificial intelligence model-based feature construction apparatus further includes:
a building unit for constructing an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and a training unit for training the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain a DGA domain name detection model.
In this implementation, the apparatus constructs an artificial intelligence model comprising a self-attention module and a fully-connected classification module through the building unit, and then trains the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features through the training unit to obtain the DGA domain name detection model. By feeding the multi-semantic features into a model with this specified structure and training it in this specified manner, the resulting artificial intelligence model can perform deep DGA domain name detection more effectively and accurately.
A third aspect of the embodiments of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to perform the artificial intelligence model-based feature construction method according to any one of the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing computer program instructions which, when read and executed by a processor, perform the method for constructing features based on an artificial intelligence model according to any one of the first aspect of the embodiments of the present application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be considered as limiting its scope; other related drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a feature construction method based on an artificial intelligence model according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a feature building device based on an artificial intelligence model according to an embodiment of the present application;
fig. 3 is a diagram of an example of feature construction based on an artificial intelligence model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Example 1
Referring to fig. 1, fig. 1 is a schematic flow chart of a feature construction method based on an artificial intelligence model according to the present embodiment. The feature construction method based on the artificial intelligence model comprises the following steps:
s101, collecting DNS logs.
S102, sorting the DNS logs into a json file in which each row comprises field names, field values and a category label.
In this embodiment, the method sorts the collected DNS logs into a json file.
In this embodiment, each line of the json file is a dictionary comprising a plurality of field names and their corresponding values (i.e., field values), for example {key1: value1, key2: value2, ..., label: 0}. The preceding fields serve as the data portion, and the last field is the category label.
S103, calculating the md5 value of the field names and field values of each row in the json file.
S104, filtering out json rows corresponding to the same md5 value to obtain the deduplicated log.
In this embodiment, the method may initialize a set, then traverse the file and calculate the md5 value of the data portion of each line; if the md5 value is not in the set, the line is kept and the md5 value is added to the set, otherwise the line is deleted. Finally, the set is written out to a file so that it can be reused when new data is added later.
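As an illustration of this step, the following is a minimal sketch of md5-based deduplication over a json-lines file, assuming one json object per line with a trailing "label" field as the category label; the file names and the helper name deduplicate are illustrative only.

```python
# Minimal sketch of md5-based deduplication over a json-lines file (illustrative
# file and field names; assumes one json object per line with a trailing "label").
import hashlib
import json

def deduplicate(in_path="dns_log.jsonl", out_path="dedup_log.jsonl", seen_path="seen_md5.txt"):
    seen = set()  # md5 values of the data portions already kept
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            # The data portion is every field except the trailing category label.
            data = {k: v for k, v in record.items() if k != "label"}
            digest = hashlib.md5(json.dumps(data, sort_keys=True).encode("utf-8")).hexdigest()
            if digest not in seen:       # unseen data portion: keep the line
                seen.add(digest)
                fout.write(line)
    # Write the set out so it can be reused when new data is added later.
    with open(seen_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(seen)))
```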
S105, extracting feature fields from the deduplicated log; the feature fields include a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field.
In this embodiment, the method may extract these 7 fields from the DNS log by constructing regular expressions; for each field, the field name and the corresponding value are extracted separately.
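The exact log syntax and regular expressions are not given in this description, so the following sketch only illustrates the idea of pulling the seven field names and values out of a raw log line; the key names (qname, rlength, qtype, rcode, qdcount, nscount, ttl) and the patterns are assumptions.

```python
# Illustrative extraction of the seven feature fields with regular expressions.
# The raw-log key names and patterns below are assumptions, not the patent's own.
import re

FIELD_PATTERNS = {
    "qname":   re.compile(r"qname=(\S+)"),    # request domain name
    "rlength": re.compile(r"rlength=(\d+)"),  # response length
    "qtype":   re.compile(r"qtype=(\w+)"),    # query type
    "rcode":   re.compile(r"rcode=(\w+)"),    # return code
    "qdcount": re.compile(r"qdcount=(\d+)"),  # entities in the question section
    "nscount": re.compile(r"nscount=(\d+)"),  # entities in the authority section
    "ttl":     re.compile(r"ttl=(\d+)"),      # time to live
}

def extract_fields(log_line: str) -> dict:
    """Return {field_name: field_value} for every field found in one log line."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(log_line)
        if match:
            fields[name] = match.group(1)
    return fields
```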
In this embodiment, it is noted that existing DGA detection schemes model and detect only the value of the request domain name field; however, DGA domain names (i.e., request domain name field values) are not generated according to a single fixed rule, so high false alarm rates are common in the prior art.
In this embodiment, to reduce false alarms, the method adds the values of the response length, query type, return code, question-section entity count, authority-section entity count and time-to-live fields to the request domain name field value to enhance the feature representation. Practical experience shows that these DNS log fields are commonly used to filter false alarms in real-world work.
In this embodiment, the request domain name field, response length field, query type field, return code field, question-section entity count field, authority-section entity count field and time-to-live field come from isolated semantic spaces that reflect different information; if their values were simply concatenated, the model could not effectively distinguish the different semantic spaces. The method therefore prepends the field name to the field value as a category feature to indicate the semantic boundary to the model.
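A minimal sketch of this prepending step is shown below, assuming the fields have already been extracted into a dictionary as above; the helper name is illustrative.

```python
# Sketch of prepending each field name to its value, so the field name acts as a
# category token that marks the semantic boundary between fields for the model.
def to_category_value_pairs(fields: dict) -> list:
    """e.g. {"qtype": "A", "rcode": "NOERROR"} -> [("qtype", "A"), ("rcode", "NOERROR")]"""
    return [(name, str(value)) for name, value in fields.items()]
```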
S106, acquiring character string features and category features included in the feature field.
In this embodiment, the extracted features fall into two categories, category features and character string features, where the category features include all field-name category features and the values of all features other than the request domain name feature, and the character string features include only the value of the request domain name feature.
And S107, performing feature vectorization processing on the category features to obtain semantic feature vectors.
In this embodiment, the method first vectorizes the category features. Specifically, the method may randomly initialize a 7*M matrix in which each row vector represents one semantic category feature. Each semantic category feature both prompts the model and naturally segments the different semantic features.
In this embodiment, the method then vectorizes the corresponding value features. For the values of the response length, query type, return code, question-section entity count, authority-section entity count and time-to-live fields, the number of distinct values M of each field is first counted, and an M*N matrix is then constructed in which each row represents the semantic vector of one value and N is the dimension of the semantic vector.
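The following sketch illustrates the two kinds of lookup tables just described, one row per category feature and, for each non-domain field, one row per distinct value; the dimensions M and N and the example vocabularies are assumptions rather than values fixed by this description.

```python
# Sketch of the two lookup tables: a 7 x M matrix with one row per category
# feature, and for each non-domain field an (values x N) matrix with one row per
# distinct field value. M, N and the example vocabularies are assumed values.
import numpy as np

rng = np.random.default_rng(0)

CATEGORY_FIELDS = ["qname", "rlength", "qtype", "rcode", "qdcount", "nscount", "ttl"]
M = 10  # dimension of a category-feature vector (assumed)
N = 10  # dimension of a value vector (assumed)

# 7 x M matrix: row i is the semantic vector of category feature i.
category_embedding = rng.normal(size=(len(CATEGORY_FIELDS), M))

# Per-field value vocabularies (examples only) and their (values x N) matrices.
value_vocab = {"qtype": ["A", "AAAA", "CNAME", "TXT"], "rcode": ["NOERROR", "NXDOMAIN"]}
value_embeddings = {f: rng.normal(size=(len(v), N)) for f, v in value_vocab.items()}

def lookup(field: str, value: str) -> np.ndarray:
    """Concatenate the category vector and the value vector of one field."""
    cat_vec = category_embedding[CATEGORY_FIELDS.index(field)]
    val_vec = value_embeddings[field][value_vocab[field].index(value)]
    return np.concatenate([cat_vec, val_vec])  # length M + N
```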
S108, performing feature vectorization processing on the character string features to obtain domain name feature vectors.
In this embodiment, for the value of the request domain name field, i.e., the domain name string, the method uses LSTM encoding. Specifically, the method constructs an LSTM deep encoding model, splits the domain name string into characters, feeds the resulting character sequence into the LSTM deep encoding model, and takes the output of its last step as the request domain name feature.
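A minimal Keras sketch of such an LSTM encoder is given below; the character vocabulary, the embedding size and the maximum length are assumptions, and the 128-dimensional output only mirrors the string-feature dimension used in the worked example later in this description.

```python
# Keras sketch of the LSTM encoder for the request domain name string. The
# character vocabulary, embedding size and maximum length are assumptions; the
# 128-dimensional output matches the string-feature dimension used later.
import numpy as np
from tensorflow import keras

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789.-_"
CHAR_INDEX = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is reserved for padding
MAX_LEN = 64

def encode_domain(domain: str) -> np.ndarray:
    """Map a domain name to a fixed-length sequence of character indices."""
    ids = [CHAR_INDEX.get(c, 0) for c in domain.lower()][:MAX_LEN]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

domain_encoder = keras.Sequential([
    keras.layers.Embedding(input_dim=len(CHARS) + 1, output_dim=32, mask_zero=True),
    keras.layers.LSTM(128),  # the output of the last step is the domain name feature
])

# Example: a (1, 128) feature for one domain name.
feature = domain_encoder(encode_domain("ac5y7.hhmukp.com")[None, :])
```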
S109, combining the semantic feature vector and the domain name feature vector to obtain a feature vector.
S110, dividing the feature vectors based on different field semantics to obtain a plurality of semantic features.
In this embodiment, the multi-semantic features include a request domain name semantic module, a response length semantic module, a query type semantic module, a return code semantic module, a question-section entity count semantic module, an authority-section entity count semantic module and a time-to-live semantic module. Within each semantic module, the category feature vector and the corresponding feature value vector are spliced to form one semantic representation, where the category feature vector plays a role similar to part of speech and the corresponding feature value vector plays a role similar to word sense. The 7 features are processed in this way into 7 semantic modules, and the 7 semantic modules are finally spliced into one vector to be detected.
S111, splicing the plurality of semantic features to obtain the multi-semantic features.
In this embodiment, the multi-semantic feature is a high-order representation that fuses syntactic and semantic information.
In this embodiment, the multi-semantic feature vector proposed by the method is formed by fusing a request domain name semantic module, a response length semantic module, a query type semantic module, a return code semantic module, a question-section entity count semantic module, an authority-section entity count semantic module and a time-to-live semantic module, which distinguishes it from the request-domain-name-only features used in existing DGA detection techniques. In addition, each semantic module fuses two feature vectors, a category feature vector and a feature value vector, which distinguishes it from the plain feature value vectors used in existing DGA detection techniques.
As an alternative embodiment, the method further comprises:
constructing an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and training the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain the DGA domain name detection model.
In this embodiment, the method may build a model comprising a self-attention module and a fully-connected classification module.
In this embodiment, the input of the self-attention module is the multi-semantic feature described above, and its output is the multi-semantic feature re-weighted by attention. The self-attention module is a model built on the self-attention mechanism and can capture long-distance, skipping feature combinations. The self-attention module is used because the semantic units of the multi-semantic feature differ in importance, and the module can automatically assign larger weights to the parts that require attention.
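The exact structure of the self-attention module is not specified here, so the following is only a hedged sketch of a scaled dot-product self-attention layer over the semantic units, assuming the units have been projected to a common dimension; it also returns the attention weights, which ties in with the rule-extraction idea discussed next.

```python
# Hedged sketch of a scaled dot-product self-attention layer over the semantic
# units (batch, units, unit_dim); layer sizes are assumptions. It returns both the
# re-weighted units and the attention weights themselves.
import tensorflow as tf
from tensorflow import keras

class SelfAttention(keras.layers.Layer):
    def __init__(self, dim=64, **kwargs):
        super().__init__(**kwargs)
        self.dim = dim
        self.q = keras.layers.Dense(dim)
        self.k = keras.layers.Dense(dim)
        self.v = keras.layers.Dense(dim)

    def call(self, units):
        q, k, v = self.q(units), self.k(units), self.v(units)
        scores = tf.matmul(q, k, transpose_b=True) / (self.dim ** 0.5)
        weights = tf.nn.softmax(scores, axis=-1)  # attention weight of each semantic unit
        return tf.matmul(weights, v), weights     # weighted units + scores for inspection
```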
In this embodiment, the method proposes, for the first time, an auxiliary method that extracts the output of the self-attention module as rules. The self-attention model maintains an attention matrix; when a new sample is processed, each feature is automatically scored, and the parts with higher scores can be extracted and used as rules for judging malicious samples.
In this embodiment, the method may build two fully connected layers to perform feature interaction and then produce the classification result.
In this embodiment, the method may train the classification model with a gradient back-propagation method, mainly adjusting the learning rate and the batch size to obtain the best result.
Referring to FIG. 3, FIG. 3 is a diagram illustrating an example of feature construction based on an artificial intelligence model. As shown in the figure, the method provides a DGA detection method based on self-attention over multi-semantic features: it constructs multi-semantic features by extracting each field (including the field name and field value) from the DNS log and builds a self-attention learning model, so as to solve the false alarm problem in DGA domain name detection.
For example, the method can be applied to deep DGA domain name detection and can also be used to extract judgment rules for DGA domains, so that detection false alarms can be effectively reduced. A concrete example flow is as follows:
(1) Samples are collected and deduplicated. The white samples and black samples are taken from online logs; after deduplication there are 108,000 white samples and 46,000 black samples in total.
(2) Two lists, data and labels, are initialized. The files under the path are traversed according to their white/black category; for each file, a regular expression extracts the string formed by the feature categories and feature values in each sample and stores it in data, while the corresponding label, white (0) or black (1), is stored in labels. The features include: the request domain name feature, response length feature, query type feature, return code feature, question-section entity count, authority-section entity count and time-to-live feature.
(3) The feature representation comprises two parallel steps: vectorization of the category features and vectorization of the string type.
(4) Vectorization of the category features is performed first. A 7*N1 matrix is initialized, where 7 corresponds to the 7 category features (query type, return code, question-section entity count, authority-section entity count, time to live, response length and request domain name) and N1 is the semantic dimension of each category feature. Because the semantic gap between different categories of features is large, the category feature vectors can prompt the model to focus on the important content.
(5) Next, the values of the query type, return code, question-section entity count, authority-section entity count, time-to-live and response length features are vectorized. The values of these 6 features are numeric and fall roughly into two types: the query type and return code take discrete values, while the remaining 4 features take continuous values. For a given feature, the continuous values are discretized according to a preset interval; for example, with an interval of 100, the value range 0-600 is discretized into 6 values. An M1*N2 matrix is then initialized, where M1 is the number of discrete values of the feature and N2 is the feature dimension.
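A small sketch of the discretization just described, using the interval of 100 over the range 0-600 from the example; the function name and the clipping behaviour are assumptions.

```python
# Sketch of discretizing a continuous field value with a preset interval; with an
# interval of 100, the range 0-600 maps onto buckets 0..5 (clipping is an assumption).
def discretize(value: float, interval: int = 100, max_value: int = 600) -> int:
    """Return the bucket index of a continuous value, clipped to [0, max_value)."""
    value = min(max(value, 0), max_value - 1)
    return int(value // interval)

# discretize(0) -> 0, discretize(257) -> 2, discretize(599) -> 5
```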
(6) Finally, the string type is vectorized. The value of the request domain name feature is a string, e.g. "ac5y7.hhmukp.com". First, the string is split into characters; second, each character is mapped to a vector; third, the matrix formed by the mapping vectors of the characters in the string is fed into an LSTM (long short-term memory network); fourth, the final output of the LSTM is taken as the representation of "ac5y7.hhmukp.com".
(7) After the above feature representation has been applied to all the features, the category feature vector and the feature value vector are spliced to form one semantic unit, and the 7 semantic units are then spliced into one vector to be detected. The vector to be detected has 258 dimensions: each feature category vector has 10 dimensions, each numeric feature value vector has 10 dimensions, and the string-type feature has 128 dimensions.
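The dimension bookkeeping above can be checked with a short sketch, assuming 10-dimensional category vectors, 10-dimensional value vectors and a 128-dimensional domain-name vector: one request-domain unit of 138 dimensions plus six numeric units of 20 dimensions gives 258 dimensions.

```python
# Worked check of the 258 dimensions quoted above, assuming 10-dimensional
# category vectors, 10-dimensional value vectors and a 128-dimensional domain vector.
import numpy as np

def build_detection_vector(category_vecs, value_vecs, domain_vec):
    """category_vecs: seven (10,) arrays; value_vecs: six (10,) arrays;
    domain_vec: one (128,) array. Returns the vector to be detected."""
    units = [np.concatenate([category_vecs[0], domain_vec])]  # request-domain unit: 10 + 128
    units += [np.concatenate([c, v])                          # six numeric units: 10 + 10 each
              for c, v in zip(category_vecs[1:], value_vecs)]
    return np.concatenate(units)  # 138 + 6 * 20 = 258 dimensions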
(8) A self-attention network is built. Because the semantic units of the vector to be detected differ in importance, and the sample category can be judged from skipping combinations of information, a self-attention deep learning network is built with the keras tool. Different attention weights are assigned automatically according to the sample, which realizes high-order interaction between features and improves the generalization performance of the model.
(9) The model is trained for 50 epochs; ReduceLROnPlateau with patience=2 is set so that the model converges to a better position, and EarlyStopping with patience=5 reduces the time cost of model training.
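Steps (8) and (9) together can be sketched as below, reusing the SelfAttention layer sketched earlier; the input shape, hidden sizes and optimizer are assumptions, while the 50 epochs and the ReduceLROnPlateau/EarlyStopping patience values follow the description above.

```python
# Sketch of steps (8)-(9): self-attention over the semantic units followed by two
# fully connected layers, trained with the callbacks named above. Input shape,
# hidden sizes and optimizer are assumptions; SelfAttention is the layer sketched earlier.
from tensorflow import keras

inputs = keras.Input(shape=(7, 64))               # 7 semantic units, projected to a common size
attended, _ = SelfAttention(dim=64)(inputs)
x = keras.layers.Flatten()(attended)
x = keras.layers.Dense(64, activation="relu")(x)  # two fully connected layers for feature interaction
x = keras.layers.Dense(32, activation="relu")(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)  # white (0) / black (1)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    keras.callbacks.ReduceLROnPlateau(patience=2),  # converge to a better position
    keras.callbacks.EarlyStopping(patience=5),      # cut the time cost of training
]
# model.fit(train_x, train_y, validation_split=0.1, epochs=50, batch_size=64, callbacks=callbacks)
```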
In this embodiment, the execution subject of the method may be a computing device such as a computer or a server, which is not limited in this embodiment.
In this embodiment, the execution body of the method may be an intelligent device such as a smart phone or a tablet computer, which is not limited in this embodiment.
It can therefore be seen that, by implementing the artificial intelligence model-based feature construction method described in this embodiment, response length, query type, return code, question-section entity count, authority-section entity count and time-to-live features can be introduced to enhance the semantic representation of the features. Moreover, because these features come from isolated semantic spaces that reflect different information, the application proposes extracting the corresponding category features to solve the problem that the model cannot distinguish the boundaries between different semantic features. According to the experimental results, training the model with the features obtained by this method reduces the false alarm rate by 8.52% and improves the detection rate by 1.64%. In addition, the self-attention model proposed by the method assigns attention scores to the relevant features of the text to be detected, so the parts that lead to the result can be clearly captured and the extracted features can be further processed into judgment rules.
Example 2
Referring to fig. 2, fig. 2 is a schematic structural diagram of a feature constructing apparatus based on an artificial intelligence model according to the present embodiment. As shown in fig. 2, the artificial intelligence model-based feature construction apparatus includes:
a collecting unit 210 for collecting DNS logs;
a deduplication unit 220, configured to sort and deduplicate the DNS logs to obtain a deduplicated log;
an extracting unit 230, configured to extract feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
a processing unit 240, configured to perform feature vectorization on the feature fields to obtain feature vectors;
and a construction unit 250, configured to construct multi-semantic features based on the feature vectors.
As an alternative embodiment, the deduplication unit 220 includes:
a sorting subunit 221, configured to sort the DNS logs into a json file in which each row includes field names, field values and a category label;
a calculating subunit 222, configured to calculate the md5 value of the field names and field values of each row in the json file;
and a deduplication subunit 223, configured to filter out json rows corresponding to the same md5 value to obtain the deduplicated log.
As an alternative embodiment, the processing unit 240 includes:
an obtaining subunit 241, configured to obtain the character string features and category features included in the feature fields;
a processing subunit 242, configured to perform feature vectorization on the category features to obtain semantic feature vectors;
the processing subunit 242 being further configured to perform feature vectorization on the character string features to obtain domain name feature vectors;
and a combining subunit 243, configured to combine the semantic feature vectors and the domain name feature vectors to obtain the feature vectors.
As an alternative embodiment, the construction unit 250 includes:
a dividing subunit 251, configured to divide the feature vectors based on different field semantics to obtain a plurality of semantic features;
and a splicing subunit 252, configured to splice the plurality of semantic features to obtain the multi-semantic features.
As an alternative embodiment, the feature building apparatus based on the artificial intelligence model further includes:
a building unit 260, configured to construct an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and a training unit 270, configured to train the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain a DGA domain name detection model.
In this embodiment, the explanation of the feature constructing apparatus based on the artificial intelligence model may refer to the description in embodiment 1, and the description is not repeated in this embodiment.
It can be seen that, by implementing the artificial intelligence model-based feature construction apparatus described in this embodiment, response length, query type, return code, question-section entity count, authority-section entity count and time-to-live features can be introduced to enhance the semantic representation of the features. Moreover, because these features come from isolated semantic spaces that reflect different information, the application proposes extracting the corresponding category features to solve the problem that the model cannot distinguish the boundaries between different semantic features. According to the experimental results, training the model with the features obtained by this apparatus reduces the false alarm rate by 8.52% and improves the detection rate by 1.64%. In addition, the self-attention model provided by the apparatus assigns attention scores to the relevant features of the text to be detected, so the parts that lead to the result can be clearly captured and the extracted features can be further processed into judgment rules.
An embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to execute a feature construction method based on an artificial intelligence model in embodiment 1 of the present application.
The present embodiment provides a computer readable storage medium storing computer program instructions that, when read and executed by a processor, perform the artificial intelligence model-based feature construction method of embodiment 1 of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners as well. The apparatus embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application; various modifications and variations may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (7)

1. A feature construction method based on an artificial intelligence model, characterized by comprising the following steps:
collecting DNS logs;
sorting and deduplicating the DNS logs to obtain a deduplicated log;
extracting feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
performing feature vectorization on the feature fields to obtain feature vectors;
and constructing multi-semantic features based on the feature vectors;
wherein the step of performing feature vectorization on the feature fields to obtain feature vectors comprises:
acquiring character string features and category features included in the feature fields;
performing feature vectorization on the category features to obtain semantic feature vectors;
performing feature vectorization on the character string features to obtain domain name feature vectors;
and combining the semantic feature vectors and the domain name feature vectors to obtain the feature vectors;
and wherein the step of constructing the multi-semantic features based on the feature vectors comprises:
dividing the feature vectors based on different field semantics to obtain a plurality of semantic features;
and splicing the plurality of semantic features to obtain the multi-semantic features.
2. The artificial intelligence model-based feature construction method according to claim 1, wherein the step of sorting and deduplicating the DNS logs to obtain a deduplicated log comprises:
sorting the DNS logs into a json file in which each row comprises field names, field values and a category label;
calculating the md5 value of the field names and field values of each row in the json file;
and filtering out json rows corresponding to the same md5 value to obtain the deduplicated log.
3. The artificial intelligence model-based feature construction method according to claim 1, further comprising:
constructing an artificial intelligence model comprising a self-attention module and a fully-connected classification module;
and training the artificial intelligence model with a gradient back-propagation method based on the multi-semantic features to obtain a DGA domain name detection model.
4. A feature construction apparatus based on an artificial intelligence model, characterized in that the apparatus comprises:
a collecting unit for collecting DNS logs;
a deduplication unit for sorting and deduplicating the DNS logs to obtain a deduplicated log;
an extracting unit for extracting feature fields from the deduplicated log, wherein the feature fields comprise a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field and a time-to-live field;
a processing unit for performing feature vectorization on the feature fields to obtain feature vectors;
and a construction unit for constructing multi-semantic features based on the feature vectors;
wherein the processing unit comprises:
an obtaining subunit, configured to obtain character string features and category features included in the feature fields;
a processing subunit, configured to perform feature vectorization on the category features to obtain semantic feature vectors;
the processing subunit being further configured to perform feature vectorization on the character string features to obtain domain name feature vectors;
and a combining subunit, configured to combine the semantic feature vectors and the domain name feature vectors to obtain the feature vectors;
and wherein the construction unit comprises:
a dividing subunit, configured to divide the feature vectors based on different field semantics to obtain a plurality of semantic features;
and a splicing subunit, configured to splice the plurality of semantic features to obtain the multi-semantic features.
5. The artificial intelligence model-based feature construction apparatus according to claim 4, wherein the deduplication unit comprises:
a sorting subunit, configured to sort the DNS logs into a json file in which each row includes field names, field values and a category label;
a calculating subunit, configured to calculate the md5 value of the field names and field values of each row in the json file;
and a deduplication subunit, configured to filter out json rows corresponding to the same md5 value to obtain the deduplicated log.
6. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the artificial intelligence model-based feature construction method of any one of claims 1 to 3.
7. A readable storage medium having stored therein computer program instructions which, when read and executed by a processor, perform the artificial intelligence model based feature construction method of any one of claims 1 to 3.
CN202210951261.6A 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model Active CN115334039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210951261.6A CN115334039B (en) 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210951261.6A CN115334039B (en) 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model

Publications (2)

Publication Number Publication Date
CN115334039A CN115334039A (en) 2022-11-11
CN115334039B (en) 2024-02-20

Family

ID=83921814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210951261.6A Active CN115334039B (en) 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model

Country Status (1)

Country Link
CN (1) CN115334039B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108200054A (en) * 2017-12-29 2018-06-22 北京奇安信科技有限公司 A kind of malice domain name detection method and device based on dns resolution
CN110266647A (en) * 2019-05-22 2019-09-20 北京金睛云华科技有限公司 It is a kind of to order and control communication check method and system
CN111708860A (en) * 2020-06-15 2020-09-25 北京优特捷信息技术有限公司 Information extraction method, device, equipment and storage medium
CN112769755A (en) * 2020-12-18 2021-05-07 国家计算机网络与信息安全管理中心 DNS log statistical feature extraction method for threat detection
CN112839012A (en) * 2019-11-22 2021-05-25 中国移动通信有限公司研究院 Zombie program domain name identification method, device, equipment and storage medium
CN114386410A (en) * 2022-01-11 2022-04-22 腾讯科技(深圳)有限公司 Training method and text processing method of pre-training model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9720901B2 (en) * 2015-11-19 2017-08-01 King Abdulaziz City For Science And Technology Automated text-evaluation of user generated text

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108200054A (en) * 2017-12-29 2018-06-22 北京奇安信科技有限公司 A kind of malice domain name detection method and device based on dns resolution
CN110266647A (en) * 2019-05-22 2019-09-20 北京金睛云华科技有限公司 It is a kind of to order and control communication check method and system
CN112839012A (en) * 2019-11-22 2021-05-25 中国移动通信有限公司研究院 Zombie program domain name identification method, device, equipment and storage medium
CN111708860A (en) * 2020-06-15 2020-09-25 北京优特捷信息技术有限公司 Information extraction method, device, equipment and storage medium
CN112769755A (en) * 2020-12-18 2021-05-07 国家计算机网络与信息安全管理中心 DNS log statistical feature extraction method for threat detection
CN114386410A (en) * 2022-01-11 2022-04-22 腾讯科技(深圳)有限公司 Training method and text processing method of pre-training model

Also Published As

Publication number Publication date
CN115334039A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN109960724B (en) Text summarization method based on TF-IDF
CN106294350B (en) A kind of text polymerization and device
CN110750640A (en) Text data classification method and device based on neural network model and storage medium
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
Maharjan et al. A multi-task approach to predict likability of books
CN106909669B (en) Method and device for detecting promotion information
CN107291895B (en) Quick hierarchical document query method
CN106909575B (en) Text clustering method and device
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN110858217A (en) Method and device for detecting microblog sensitive topics and readable storage medium
CN110990676A (en) Social media hotspot topic extraction method and system
CN112651025A (en) Webshell detection method based on character-level embedded code
KR101379128B1 (en) Dictionary generation device, dictionary generation method, and computer readable recording medium storing the dictionary generation program
CN111797247B (en) Case pushing method and device based on artificial intelligence, electronic equipment and medium
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
Krokos et al. A look into twitter hashtag discovery and generation
CN106933818A (en) A kind of quick multiple key text matching technique and device
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN108475265B (en) Method and device for acquiring unknown words
CN115334039B (en) Feature construction method and device based on artificial intelligent model
CN115344563B (en) Data deduplication method and device, storage medium and electronic equipment
CN110874408B (en) Model training method, text recognition device and computing equipment
CN108875060B (en) Website identification method and identification system
CN112528021B (en) Model training method, model training device and intelligent equipment
CN109344397A (en) The extracting method and device of text feature word, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20240115

Address after: 071800 Conference Center 1-184, South Section of Baojin Expressway, Xiong'an Area, Xiong'an New District, Baoding City, Hebei Province

Applicant after: Tianrongxin Xiongan Network Security Technology Co.,Ltd.

Applicant after: Beijing Topsec Network Security Technology Co.,Ltd.

Applicant after: Topsec Technologies Inc.

Applicant after: BEIJING TOPSEC SOFTWARE Co.,Ltd.

Address before: 100085 4th floor, building 3, yard 1, Shangdi East Road, Haidian District, Beijing

Applicant before: Beijing Topsec Network Security Technology Co.,Ltd.

Applicant before: Topsec Technologies Inc.

Applicant before: BEIJING TOPSEC SOFTWARE Co.,Ltd.

GR01 Patent grant
GR01 Patent grant