CN115334039A - Artificial intelligence model-based feature construction method and device - Google Patents

Publication number
CN115334039A
Authority
CN
China
Prior art keywords
feature
field
artificial intelligence
features
intelligence model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210951261.6A
Other languages
Chinese (zh)
Other versions
CN115334039B (en)
Inventor
吕晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianrongxin Xiongan Network Security Technology Co ltd
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd
Priority to CN202210951261.6A
Publication of CN115334039A
Application granted
Publication of CN115334039B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/028 Capturing of monitoring data by filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a feature construction method and device based on an artificial intelligence model. The method comprises: collecting DNS logs; organizing and deduplicating the DNS logs to obtain a deduplicated log; extracting feature fields from the deduplicated log, the feature fields comprising a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field, and a time-to-live field; performing feature vectorization on the feature fields to obtain a feature vector; and constructing polysemous features based on the feature vector. By implementing this embodiment, a large number of polysemous features can be generated and used to train a better DGA artificial intelligence detection model, which can significantly reduce the false detection rate.

Description

Artificial intelligence model-based feature construction method and device
Technical Field
The application relates to the field of artificial intelligence, and in particular to a feature construction method and device based on an artificial intelligence model.
Background
In recent years, with the development of DGA (domain generation algorithm) technology, a growing amount of DGA-based malware has posed a continuing threat to government networks and critical infrastructure networks. Specifically, an attacker may use a DGA to generate a large number of domain names and thereby evade domain name blacklist detection.
However, the domain name filtering method currently in use is blacklist filtering, which cannot effectively filter such a large and rapidly changing set of domain names.
To solve this problem, those skilled in the art have proposed machine-learning-based DGA domain name detection, in which a classifier is expected to detect DGA domain names from statistical information about the characters of the domain name. In practice, however, this method still suffers from a high false alarm rate and cannot filter such domain names effectively.
Disclosure of Invention
An object of the embodiments of the present application is to provide a feature construction method and device based on an artificial intelligence model that can generate a large number of polysemous features, so that these features can be used to train a better DGA artificial intelligence detection model and thereby significantly reduce the false detection rate.
The first aspect of the embodiments of the present application provides a feature construction method based on an artificial intelligence model, including:
collecting DNS logs;
organizing and deduplicating the DNS logs to obtain a deduplicated log;
extracting feature fields from the deduplicated log, the feature fields comprising a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field, and a time-to-live field;
performing feature vectorization on the feature fields to obtain a feature vector;
and constructing a polysemous feature based on the feature vector.
In this implementation, the method first collects DNS logs. This provides the training samples and ensures that the extracted features are high-quality DNS-based features, which in turn supports the subsequent model training. The method then organizes and deduplicates the DNS logs to obtain a deduplicated log. This step condenses the DNS logs so that only valid, useful samples are retained, which further ensures that the extracted features are accurate and effective. Next, the method extracts the feature fields from the deduplicated log: a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field, and a time-to-live field. Extracting these fields separately provides the basic classification information needed to construct the polysemous features accurately. Finally, the method performs feature vectorization on the feature fields to obtain a feature vector and constructs the polysemous feature from that vector. Because the polysemous features are recombined from vectorized features, they are more structured, which makes it easier to generate a large number of high-quality polysemous features.
Further, the step of organizing and deduplicating the DNS logs to obtain a deduplicated log includes:
organizing the DNS logs into a JSON file in which each row comprises field names, field values, and a category label;
calculating the MD5 value of the field names and field values of each row in the JSON file;
and filtering out JSON rows that share the same MD5 value to obtain the deduplicated log.
In this implementation, the method first organizes the DNS logs into a JSON file in which each row comprises field names, field values, and a category label. The DNS logs are thus collated into a multi-row, polysemous JSON file, and the fixed field structure of each row keeps subsequent lookups and calculations simple. The method then calculates the MD5 value of the field names and field values of each row in the JSON file and filters out rows that share the same MD5 value to obtain the deduplicated log. Filtering rows by MD5 value makes deduplication simple, fast, and accurate.
Further, the step of performing feature vectorization on the feature fields to obtain a feature vector includes:
acquiring the character string features and category features included in the feature fields;
performing feature vectorization on the category features to obtain semantic feature vectors;
performing feature vectorization on the character string features to obtain a domain name feature vector;
and combining the semantic feature vectors and the domain name feature vector to obtain the feature vector.
In this implementation, the method first acquires the character string features and category features included in the feature fields, so that the features are classified at extraction time in preparation for vectorization. The method then vectorizes the category features to obtain semantic feature vectors and vectorizes the character string features to obtain a domain name feature vector. The two types of features are thus vectorized separately, yielding feature vectors of different dimensions. Finally, the method combines the semantic feature vectors and the domain name feature vector to obtain the feature vector. All feature vectors are combined according to their inherent grouping, so that features can later be extracted and applied group by group, which improves feature quality.
Further, the step of constructing the polysemous feature based on the feature vector includes:
dividing the feature vector according to the semantics of the different fields to obtain a plurality of semantic features;
and splicing the plurality of semantic features to obtain the polysemous feature.
In this implementation, the method first divides the feature vector according to the semantics of the different fields to obtain a plurality of semantic features, and then splices those semantic features to obtain the polysemous feature. The result is a high-order feature that fuses syntactic and semantic information, which facilitates subsequent model training.
Further, the method further comprises:
building an artificial intelligence model comprising a self-attention module and a fully connected classification module;
and training the artificial intelligence model on the polysemous features using gradient backpropagation to obtain a DGA domain name detection model.
In this implementation, the method can also build an artificial intelligence model comprising a self-attention module and a fully connected classification module, and then train it on the polysemous features using gradient backpropagation to obtain a DGA domain name detection model. Building the model with the specified structure and training it on the polysemous features with the specified method allows the model to detect DGA domain names more effectively and accurately.
A second aspect of the embodiments of the present application provides a feature construction device based on an artificial intelligence model, the device including:
a collecting unit, configured to collect DNS logs;
a deduplication unit, configured to organize and deduplicate the DNS logs to obtain a deduplicated log;
an extracting unit, configured to extract feature fields from the deduplicated log, the feature fields comprising a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field, and a time-to-live field;
a processing unit, configured to perform feature vectorization on the feature fields to obtain a feature vector;
and a constructing unit, configured to construct the polysemous feature based on the feature vector.
In this implementation, the device collects DNS logs through the collecting unit. This provides the training samples and ensures that the extracted features are high-quality DNS-based features, which in turn supports the subsequent model training. The deduplication unit then organizes and deduplicates the DNS logs to obtain a deduplicated log; this step condenses the DNS logs so that only valid, useful samples are retained, which further ensures that the extracted features are accurate and effective. Next, the extracting unit extracts the feature fields from the deduplicated log: a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field, and a time-to-live field. Extracting these fields separately provides the basic classification information needed to construct the polysemous features accurately. Finally, the processing unit performs feature vectorization on the feature fields to obtain a feature vector, and the constructing unit constructs the polysemous feature from that vector. The device thus recombines vectorized features into polysemous features, which makes the constructed polysemous features more structured.
Further, the deduplication unit includes:
an organizing subunit, configured to organize the DNS logs into a JSON file in which each row comprises field names, field values, and a category label;
a calculation subunit, configured to calculate the MD5 value of the field names and field values of each row in the JSON file;
and a deduplication subunit, configured to filter out JSON rows that share the same MD5 value to obtain the deduplicated log.
In this implementation, the deduplication unit organizes the DNS logs, through the organizing subunit, into a JSON file in which each row comprises field names, field values, and a category label. The DNS logs are thus collated into a multi-row, polysemous JSON file, and the fixed field structure of each row keeps subsequent lookups and calculations simple. The calculation subunit then calculates the MD5 value of the field names and field values of each row in the JSON file, and the deduplication subunit filters out rows that share the same MD5 value to obtain the deduplicated log. Filtering rows by MD5 value makes deduplication simple, fast, and accurate.
Further, the processing unit includes:
an obtaining subunit, configured to obtain the character string features and category features included in the feature fields;
a processing subunit, configured to perform feature vectorization on the category features to obtain semantic feature vectors;
the processing subunit being further configured to perform feature vectorization on the character string features to obtain a domain name feature vector;
and a combination subunit, configured to combine the semantic feature vectors and the domain name feature vector to obtain the feature vector.
In this implementation, the processing unit obtains, through the obtaining subunit, the character string features and category features included in the feature fields, so that the features are classified at extraction time in preparation for vectorization. The processing subunit then vectorizes the category features to obtain semantic feature vectors and vectorizes the character string features to obtain a domain name feature vector. The two types of features are thus vectorized separately, yielding feature vectors of different dimensions. Finally, the combination subunit combines the semantic feature vectors and the domain name feature vector to obtain the feature vector. All feature vectors are combined according to their inherent grouping, so that features can later be extracted and applied group by group, which improves feature quality.
Further, the constructing unit includes:
a dividing subunit, configured to divide the feature vector according to the semantics of the different fields to obtain a plurality of semantic features;
and a splicing subunit, configured to splice the plurality of semantic features to obtain the polysemous feature.
In this implementation, the constructing unit divides the feature vector according to the semantics of the different fields through the dividing subunit to obtain a plurality of semantic features, and then splices those semantic features through the splicing subunit to obtain the polysemous feature. The result is a high-order feature that fuses syntactic and semantic information, which facilitates subsequent model training.
Further, the artificial intelligence model-based feature construction device also includes:
a building unit, configured to build an artificial intelligence model comprising a self-attention module and a fully connected classification module;
and a training unit, configured to train the artificial intelligence model on the polysemous features using gradient backpropagation to obtain a DGA domain name detection model.
In this implementation, the device can build, through the building unit, an artificial intelligence model comprising a self-attention module and a fully connected classification module, and then train it through the training unit on the polysemous features using gradient backpropagation to obtain a DGA domain name detection model. Building the model with the specified structure and training it on the polysemous features with the specified method allows the model to detect DGA domain names more effectively and accurately.
A third aspect of the embodiments of the present application provides an electronic device including a memory and a processor, where the memory stores a computer program and the processor runs the computer program to cause the electronic device to execute the artificial intelligence model-based feature construction method according to any implementation of the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing computer program instructions which, when read and executed by a processor, perform the artificial intelligence model-based feature construction method according to any implementation of the first aspect of the embodiments of the present application.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a feature construction method based on an artificial intelligence model according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a feature construction device based on an artificial intelligence model according to an embodiment of the present application;
Fig. 3 is an example diagram of feature construction based on an artificial intelligence model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Example 1
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the artificial intelligence model-based feature construction method of this embodiment. The method includes the following steps:
S101, collecting DNS logs.
S102, organizing the DNS logs into a JSON file in which each row comprises field names, field values, and a category label.
In this embodiment, the method organizes the collected DNS logs into a JSON file.
In this embodiment, each row of the JSON file is a dictionary containing a number of field names and their corresponding values (i.e., field values), such as {key1: value1, key2: value2, ..., label: 0}. The leading fields form the data portion, and the last field is the category label.
S103, calculating the MD5 value of the field names and field values of each row in the JSON file.
S104, filtering out JSON rows that share the same MD5 value to obtain the deduplicated log.
In this embodiment, the method may initialize a set and then traverse the file, calculating the MD5 value of the data portion of each row. If the MD5 value is not in the set, the row is retained and the MD5 value is added to the set; otherwise, the row is deleted. Finally, the set is written out to a file so that it can be reused when new data is added later.
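As a minimal sketch of steps S103 and S104, the set-based deduplication could look as follows in Python. The field names `qname` and `rcode`, the key name `label`, and the choice to serialize the data portion with sorted keys before hashing are assumptions made for illustration; the embodiment does not specify them.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """MD5 of the data portion (every field except the category label)."""
    data = {k: v for k, v in row.items() if k != "label"}
    # Assumption: sorting keys makes the digest independent of field order.
    payload = json.dumps(data, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def deduplicate(lines, seen=None):
    """Keep the first row for each digest; return (kept_rows, seen_digests)."""
    seen = set() if seen is None else seen
    kept = []
    for line in lines:
        row = json.loads(line)
        digest = row_digest(row)
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept, seen


sample = [
    '{"qname": "example.com", "rcode": 0, "label": 0}',
    '{"rcode": 0, "qname": "example.com", "label": 0}',
    '{"qname": "xkqzpwv.net", "rcode": 3, "label": 1}',
]
rows, digests = deduplicate(sample)
print(len(rows), len(digests))
```

Passing the returned set back in as `seen` on a later call plays the role of the persisted set: rows already hashed in an earlier batch are dropped from newly added data.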
S105, extracting feature fields from the deduplicated log; the feature fields include a request domain name field, a response length field, a query type field, a return code field, a question-section entity count field, an authority-section entity count field, and a time-to-live field.
In this embodiment, the method can extract 7 fields from the DNS log by constructing regular expressions: the request domain name field, the response length field, the query type field, the return code field, the question-section entity count field, the authority-section entity count field, and the time-to-live field. For each field, both the field name and the corresponding value are extracted.
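The regular-expression extraction of step S105 can be illustrated with the following sketch. The `key=value` log layout and the short field names (`qname`, `resp_len`, `qtype`, `rcode`, `qdcount`, `nscount`, `ttl`) are hypothetical; real DNS log formats depend on the collector, and the embodiment does not give one.

```python
import re

# Hypothetical key=value log layout; real DNS log formats differ by collector.
FIELD_PATTERN = re.compile(
    r"\b(qname|resp_len|qtype|rcode|qdcount|nscount|ttl)=(\S+)"
)
EXPECTED = ("qname", "resp_len", "qtype", "rcode", "qdcount", "nscount", "ttl")


def extract_fields(log_line: str) -> dict:
    """Return {field name: field value} for the seven feature fields."""
    found = dict(FIELD_PATTERN.findall(log_line))
    missing = [name for name in EXPECTED if name not in found]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {name: found[name] for name in EXPECTED}


line = (
    "ts=1660000000 qname=xkqzpwv.net resp_len=98 qtype=A "
    "rcode=NXDOMAIN qdcount=1 nscount=1 ttl=60"
)
fields = extract_fields(line)
print(fields["qname"], fields["rcode"])
```

The function keeps the field name alongside each value, since both are needed later when the field name is used as a category feature.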
In this embodiment, existing DGA detection schemes model only the value of the request domain name field. However, the generation of DGA domain names (i.e., the values of the request domain name field) follows no fixed rule, so the prior art suffers from a high false alarm rate.
In this embodiment, to reduce false alarms, the method adds the field values of the response length, query type, return code, question-section entity count, authority-section entity count, and time to live to the request domain name field value in order to enhance the feature representation. Practical experience shows that these DNS log fields are commonly used to filter out false alarms in day-to-day work.
In this embodiment, the request domain name field, the response length field, the query type field, the return code field, the question-section entity count field, the authority-section entity count field, and the time-to-live field come from separate semantic spaces that reflect different information. If the values of these fields were simply concatenated, the model could not effectively distinguish the different semantic spaces. The method therefore places the field name in front of each field value as a category feature, to indicate the semantic boundaries to the model.
S106, acquiring the character string features and category features included in the feature fields.
In this embodiment, the extracted features fall into two classes: category-type features and text string features. The category-type features include all the category features (field names) and the values of every feature other than the request domain name; the text string features include only the value of the request domain name.
S107, performing feature vectorization on the category features to obtain semantic feature vectors.
In this embodiment, the method may first vectorize the category features. Specifically, the method can randomly initialize a 7 × M matrix in which each row vector represents one semantic category feature. Each semantic category feature acts as a prompt to the model and naturally separates the different semantic features.
In this embodiment, the method may then vectorize the corresponding value features. For the values of the response length field, the query type field, the return code field, the question-section entity count field, the authority-section entity count field, and the time-to-live field, the method first counts the number M of distinct values each field can take and then constructs an M × N matrix, in which each row is the semantic vector of one value and N is the dimension of the semantic vector.
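The two lookup tables described for step S107 can be sketched as follows: one randomly initialized row per field name, and, per field, one randomly initialized row per distinct value. The field names, the vector widths, and the use of plain Python lists instead of a tensor library are assumptions for illustration; in a trained model these rows would be learnable parameters.

```python
import random

random.seed(0)
FIELDS = ["qname", "resp_len", "qtype", "rcode", "qdcount", "nscount", "ttl"]
CATEGORY_DIM = 4  # width of each category vector (arbitrary for this sketch)
VALUE_DIM = 4     # width of each value's semantic vector (arbitrary)


def random_matrix(rows, cols):
    """Randomly initialized rows x cols matrix as a list of row vectors."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# One row per field name: the category-feature matrix.
category_table = random_matrix(len(FIELDS), CATEGORY_DIM)


def build_value_table(observed_values):
    """Index the distinct values of one field and allocate one row per value."""
    vocab = {value: row for row, value in enumerate(sorted(set(observed_values)))}
    return vocab, random_matrix(len(vocab), VALUE_DIM)


def lookup(field, value, vocab, table):
    """Return (category vector, value vector) for one field of one log row."""
    return category_table[FIELDS.index(field)], table[vocab[value]]


rcode_vocab, rcode_table = build_value_table(
    ["NOERROR", "NXDOMAIN", "NOERROR", "SERVFAIL"]
)
cat_vec, val_vec = lookup("rcode", "NXDOMAIN", rcode_vocab, rcode_table)
print(len(category_table), len(rcode_table), len(cat_vec), len(val_vec))
```

Counting distinct values before allocating the table mirrors the text: the number of rows equals the number of values the field actually takes in the data.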
S108, performing feature vectorization on the character string features to obtain a domain name feature vector.
In this embodiment, the value of the request domain name field, i.e., the domain name string, is encoded with an LSTM. Specifically, the method first builds a deep LSTM encoding model, then splits the domain name string into characters, feeds the resulting character sequence into the model, and finally takes the output of the model's last step as the request domain name feature.
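To make the character-level encoding concrete, the sketch below runs a single-layer LSTM over one-hot characters and returns the hidden state of the last step. It is a simplification under stated assumptions: the embodiment calls for a deep LSTM encoder, whereas this example uses one layer, untrained random weights, a hypothetical alphabet, and pure Python arithmetic.

```python
import math
import random

random.seed(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."  # assumed character set
HIDDEN = 8  # hidden size, arbitrary for this sketch


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class CharLSTM:
    """Single-layer LSTM over one-hot characters with untrained weights."""

    def __init__(self, vocab_size, hidden):
        self.hidden = hidden
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.w = {g: rand_matrix(hidden, vocab_size + hidden) for g in "ifog"}
        self.b = {g: [0.0] * hidden for g in "ifog"}

    def _gate(self, g, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) + b
                for row, b in zip(self.w[g], self.b[g])]

    def encode(self, domain):
        h = [0.0] * self.hidden
        c = [0.0] * self.hidden
        for ch in domain.lower():
            # Characters outside ALPHABET become an all-zero input vector.
            one_hot = [1.0 if ch == a else 0.0 for a in ALPHABET]
            x = one_hot + h
            i = [sigmoid(v) for v in self._gate("i", x)]
            f = [sigmoid(v) for v in self._gate("f", x)]
            o = [sigmoid(v) for v in self._gate("o", x)]
            g = [math.tanh(v) for v in self._gate("g", x)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h  # output of the last time step


encoder = CharLSTM(len(ALPHABET), HIDDEN)
domain_vec = encoder.encode("xkqzpwv.net")
print(len(domain_vec))
```

Whatever the length of the domain name, the encoder yields a fixed-width vector, which is what allows it to be combined with the other feature vectors.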
S109, combining the semantic feature vectors and the domain name feature vector to obtain the feature vector.
S110, dividing the feature vector according to the semantics of the different fields to obtain a plurality of semantic features.
In this embodiment, the polysemous feature includes a request domain name semantic module, a response length semantic module, a query type semantic module, a return code semantic module, a question-section entity count semantic module, an authority-section entity count semantic module, and a time-to-live semantic module. In each semantic module, a category feature vector and the corresponding feature value vector are spliced to form a semantic representation; the category feature vector is analogous to a part of speech, and its corresponding feature value vector is analogous to a word sense. The method processes all 8 of the above features into 8 semantic modules in this way, and finally splices the 8 semantic modules into 1 vector to be detected.
S111, splicing the plurality of semantic features to obtain the polysemous feature.
In this embodiment, the polysemous feature is a high-order representation that fuses syntactic and semantic information.
In this embodiment, the polysemous feature vector proposed by the method fuses the request domain name semantic module, the response length semantic module, the query type semantic module, the return code semantic module, the question-section entity count semantic module, the authority-section entity count semantic module, and the time-to-live semantic module; this distinguishes it from existing DGA detection techniques, which use only the request domain name feature. In addition, each semantic module integrates two kinds of feature vectors, a category feature vector and a feature value vector; this distinguishes it from existing DGA detection techniques, which use only feature value vectors.
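The splicing in steps S110 and S111 amounts to two levels of concatenation, which the following sketch shows with toy numbers. The field names and the 2-dimensional placeholder vectors are assumptions; in practice the category and value vectors would come from the embedding tables and the LSTM encoder, and the example uses the seven listed fields.

```python
def build_module(category_vec, value_vec):
    """One semantic module: category vector followed by its value vector."""
    return list(category_vec) + list(value_vec)


def build_polysemous(modules):
    """Splice the semantic modules, in a fixed field order, into one vector."""
    flat = []
    for module in modules:
        flat.extend(module)
    return flat


FIELDS = ["qname", "resp_len", "qtype", "rcode", "qdcount", "nscount", "ttl"]
# Toy 2-dimensional vectors standing in for learned embeddings and LSTM output.
category_vecs = {name: [float(i), 0.0] for i, name in enumerate(FIELDS)}
value_vecs = {name: [0.5, float(i)] for i, name in enumerate(FIELDS)}

modules = [build_module(category_vecs[n], value_vecs[n]) for n in FIELDS]
polysemous = build_polysemous(modules)
print(len(modules), len(polysemous))
```

Keeping the modules as a list before flattening is convenient, because an attention layer can then treat each module as one unit.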
As an optional implementation, the method further comprises:
building an artificial intelligence model comprising an automatic attention module and a fully connected classification module;
and training the artificial intelligence model by gradient backpropagation based on the polysemous feature to obtain a DGA domain name detection model.
In this embodiment, the method may build a model including an automatic attention module and a fully connected classification module.
In this embodiment, the input of the automatic attention module is the polysemous feature described above, and its output is the attention-weighted polysemous feature. The automatic attention module is a model built on a self-attention mechanism and can capture long-distance combinations of non-adjacent features. It is used because the semantic units of the polysemous feature differ in importance, and the module automatically assigns larger weights to the parts that need attention.
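The attention weighting described above can be illustrated with a minimal sketch of scaled dot-product self-attention. The patent does not give the module's internals, so this sketch assumes identity query/key/value projections and assumes the semantic units have already been brought to a common length; the function names and toy values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(units):
    """Scaled dot-product self-attention with identity projections (assumption).

    units: list of equal-length vectors, one per semantic unit.
    Returns (attention-weighted units, attention matrix).
    """
    d = len(units[0])
    # Score every unit (query) against every unit (key).
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in units]
              for q in units]
    attn = [softmax(row) for row in scores]
    # Each output unit is the attention-weighted sum of all units.
    weighted = [[sum(w * v[j] for w, v in zip(row, units)) for j in range(d)]
                for row in attn]
    return weighted, attn

# Three toy 2-dimensional "semantic units" (hypothetical values).
units = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weighted, attn = self_attention(units)
```

Each row of `attn` sums to 1, so a row can be read as how strongly one semantic unit attends to every other unit.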
In this embodiment, the method proposes, for the first time, using the output of the automatic attention module to assist rule extraction. The automatic attention module maintains an attention matrix, so each feature is scored automatically when a new sample is processed, and the higher-scoring parts are extracted as rules for judging malicious samples.
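One possible way to turn an attention matrix into candidate rules is to rank the semantic units by the attention they receive. The sketch below assumes the score of a unit is the column mean of the attention matrix and that the top k units are kept; the patent does not specify either choice, and the field names and matrix values are hypothetical.

```python
def top_attended_features(attn, names, k=2):
    """Average the attention each unit receives and return the k highest."""
    n = len(attn)
    received = [sum(attn[i][j] for i in range(n)) / n for j in range(n)]
    ranked = sorted(zip(names, received), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Hypothetical attention matrix for one sample (each row sums to 1).
names = ["request_domain", "query_type", "ttl"]
attn = [
    [0.6, 0.1, 0.3],
    [0.5, 0.2, 0.3],
    [0.7, 0.1, 0.2],
]
rules = top_attended_features(attn, names, k=2)
```

Here the request domain name and the time-to-live receive the most attention, so they would be the first candidates to inspect when writing a judgment rule for this sample.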
In this embodiment, the method may construct two fully connected layers to perform feature interaction and then generate the classification result.
In this embodiment, the method may train the classification model by gradient backpropagation, mainly tuning the learning rate and the batch size to obtain the best result.
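As a rough illustration of two fully connected layers trained by gradient backpropagation, the following pure-Python sketch trains a tiny network on toy data. The hidden size, tanh and sigmoid activations, binary cross-entropy loss, learning rate, and batch size of 1 are all assumptions made for the example; the patent names none of them.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """First layer with tanh, second layer with a sigmoid output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    p = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, p

def train(X, y, hidden=4, lr=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    n_in = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, t in zip(X, y):  # batch size 1 (plain SGD)
            h, p = forward(x, W1, b1, w2, b2)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            d_out = p - t  # gradient of the loss w.r.t. the output pre-activation
            for j in range(hidden):
                # Backpropagate through the second layer and the tanh.
                d_h = d_out * w2[j] * (1 - h[j] ** 2)
                w2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_h
                for i in range(n_in):
                    W1[j][i] -= lr * d_h * x[i]
            b2 -= lr * d_out
        losses.append(total / len(X))
    return (W1, b1, w2, b2), losses

# Toy data: the label equals the first input coordinate.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 1, 1]
params, losses = train(X, y)
```

In practice the learning rate and batch size would be tuned as the text describes; the values here only keep the toy run stable.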
Referring to fig. 3, fig. 3 is a diagram illustrating an example of feature construction based on an artificial intelligence model. As the figure shows, the method provides a DGA detection method based on automatic attention over polysemous features: it constructs the polysemous features by extracting fields (including field names and field values) from DNS logs and builds an automatic attention learning model, thereby addressing the false-alarm problem in DGA domain name detection.
For example, the method can be applied to in-depth detection of DGA domain names and to extracting judgment rules for DGA domain names, and it can effectively reduce false alarms. A specific exemplary process is as follows:
(1) Samples are collected and de-duplicated. The white samples and the black samples come from online logs; after de-duplication there are 108,000 white samples and 46,000 black samples.
(2) Two lists, data and labels, are initialized. The files under the sample path are traversed by white/black category; for each sample, a regular expression extracts the character string formed by the feature categories and feature values, which is stored in data, and the corresponding label, white (0) or black (1), is stored in labels. The features are: the request domain name feature, the response length feature, the query type feature, the return code feature, the question-section entity count, the authority-section entity count, and the time-to-live feature.
(3) Feature representation comprises two parallel steps: vectorization of category features and vectorization of character string features.
(4) First, the category features are vectorized. A 7 × N1 matrix is initialized, where 7 is the number of category features (query type, return code, question-section entity count, authority-section entity count, time-to-live, response length, and request domain name) and N1 is the semantic dimension of each category feature. Because the semantics of different feature categories differ widely, the category feature vectors prompt the model to focus on the important content.
(5) Second, the values of the query type, return code, question-section entity count, authority-section entity count, time-to-live, and response length features are vectorized. The values of these 6 features are all numerical and fall into two types: the query type and return code values are discrete, and the remaining 4 features are continuous. For a given feature, the first step discretizes continuous values according to a preset interval; for example, with an interval of 100, the value range 0 to 600 is discretized into 6 values. The second step initializes an M1 × N2 matrix, where M1 is the number of discrete values of the feature and N2 is the dimension of the feature value vector.
(6) Finally, the character string feature is vectorized. The value of the request domain name feature is a character string, for example "ac5y7.hhmukp.com". The first step splits the string into individual characters; the second step maps each character to a vector; the third step feeds the matrix formed by these character vectors into an LSTM (long short-term memory network); and the fourth step takes the final output of the LSTM as the representation of "ac5y7.hhmukp.com".
(7) After all features have been represented as above, each category feature vector is spliced with its feature value vector to form 1 semantic unit, and the 7 semantic units are then spliced into 1 vector to be detected. The vector to be detected has 258 dimensions: each category feature vector has 10 dimensions, each numerical feature value vector has 10 dimensions, and the character string feature value vector has 128 dimensions (7 × 10 + 6 × 10 + 128 = 258).
(8) A self-attention network is built. Because the semantic units of the vector to be detected differ in importance and the sample type can be judged from combinations of non-adjacent units, Keras is used to build a self-attention deep learning network that automatically assigns different attention weights to different samples, thereby achieving high-order feature interaction and improving the generalization performance of the model.
(9) The model is trained. Training is set to 50 epochs; ReduceLROnPlateau with patience=2 helps the model converge to a better position, and EarlyStopping with patience=5 reduces the time overhead of model training.
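Step (2) of the example can be sketched as follows. The patent does not show the log layout or the regular expression, so the `name=value` line format, the field names, and the in-memory sample list (standing in for traversing files under the white and black directories) are assumptions made for illustration.

```python
import re

# Hypothetical field names; the patent does not specify the log layout.
FIELDS = ["qname", "resp_len", "qtype", "rcode", "qdcount", "nscount", "ttl"]
PAIR_RE = re.compile(r"(\w+)=(\S+)")

def extract_features(line):
    """Return the 'name=value' pairs of the seven feature fields, in order."""
    pairs = dict(PAIR_RE.findall(line))
    return " ".join(f"{name}={pairs[name]}" for name in FIELDS if name in pairs)

# Stand-in for traversing the files under the white/black category paths.
samples = [
    ("white", "ts=1 qname=example.com resp_len=120 qtype=1 rcode=0 qdcount=1 nscount=0 ttl=300"),
    ("black", "ts=2 qname=ac5y7.hhmukp.com resp_len=95 qtype=1 rcode=3 qdcount=1 nscount=1 ttl=60"),
]

data, labels = [], []
for category, line in samples:
    data.append(extract_features(line))
    labels.append(0 if category == "white" else 1)  # white (0), black (1)
```

Fields that are not among the seven features, such as the hypothetical `ts` timestamp, are dropped, so `data` holds only the feature categories and their values.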
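Steps (4) to (7) of the example can be put together in a short sketch that builds one 258-dimensional vector. The dimensions (10, 10, 128) and the interval of 100 over the range 0 to 600 come from the example; everything else is an assumption: the field names, the bucket settings for the two discrete fields, the random initialization, and in particular the averaging of character vectors, which is only a placeholder for the LSTM of step (6).

```python
import random
import string

CATEGORY_DIM, NUMERIC_DIM, STRING_DIM = 10, 10, 128  # dimensions from step (7)
# Hypothetical (interval, upper bound) per numeric field; discrete fields use interval 1.
NUMERIC_FIELDS = {
    "query_type": (1, 16), "return_code": (1, 16),
    "question_count": (100, 600), "authority_count": (100, 600),
    "ttl": (100, 600), "response_length": (100, 600),
}
rng = random.Random(0)

def init_table(rows, cols):
    """Randomly initialized lookup table with the given shape."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]

def discretize(value, interval=100, upper=600):
    """Step (5): map a value to one of upper // interval buckets (clamped)."""
    return max(0, min(int(value // interval), upper // interval - 1))

# Step (4): one category vector per field (a 7 x N1 matrix with N1 = 10).
category_table = init_table(len(NUMERIC_FIELDS) + 1, CATEGORY_DIM)
# Step (5): one M1 x N2 value table per numeric field.
value_tables = {name: init_table(upper // interval, NUMERIC_DIM)
                for name, (interval, upper) in NUMERIC_FIELDS.items()}
# Step (6): character vocabulary and per-character vectors (index 0 = unknown).
VOCAB = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase + string.digits + "-.")}
char_table = init_table(len(VOCAB) + 1, STRING_DIM)

def encode_domain(domain):
    """Placeholder for the LSTM: average the character vectors."""
    rows = [char_table[VOCAB.get(ch, 0)] for ch in domain.lower()]
    if not rows:
        return [0.0] * STRING_DIM
    return [sum(col) / len(rows) for col in zip(*rows)]

def build_vector(record):
    """Step (7): splice category and value vectors, then splice the 7 units."""
    units = []
    for i, (name, (interval, upper)) in enumerate(NUMERIC_FIELDS.items()):
        value_vec = value_tables[name][discretize(record[name], interval, upper)]
        units.append(category_table[i] + value_vec)
    units.append(category_table[-1] + encode_domain(record["request_domain"]))
    return units, [x for unit in units for x in unit]

record = {"request_domain": "ac5y7.hhmukp.com", "query_type": 1, "return_code": 3,
          "question_count": 1, "authority_count": 0, "ttl": 60, "response_length": 95}
units, vector = build_vector(record)
```

The six numeric units have 20 dimensions each and the domain name unit has 138, which gives the 258 dimensions stated in step (7).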
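The two patience settings in step (9) presumably refer to the Keras callbacks `ReduceLROnPlateau` and `EarlyStopping`. The following plain-Python simulation is only a simplified illustration of how patience=2 and patience=5 interact; it is not Keras's implementation, and the reduction factor of 0.1, the starting learning rate, and the loss values are assumed.

```python
def simulate(val_losses, lr=1e-3, lr_patience=2, stop_patience=5, factor=0.1):
    """Simplified patience logic: shrink lr after lr_patience epochs without
    improvement, stop after stop_patience epochs without improvement."""
    best = float("inf")
    lr_wait = stop_wait = 0
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            lr_wait = stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor   # reduce the learning rate on a plateau
                lr_wait = 0
        history.append((epoch, lr))
        if stop_wait >= stop_patience:
            break              # early stopping
    return history

# Hypothetical validation losses: improvement stops after epoch 2.
val_losses = [1.0, 0.8, 0.9, 0.9, 0.9, 0.9, 0.9, 0.7]
history = simulate(val_losses)
```

With these losses the learning rate is reduced at epochs 4 and 6 and training stops at epoch 7, well before the 50-epoch limit, which is how early stopping saves training time.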
In this embodiment, the execution subject of the method may be a computing device such as a computer and a server, and is not limited in this embodiment.
In this embodiment, an execution subject of the method may also be an intelligent device such as a smart phone and a tablet computer, which is not limited in this embodiment.
It can be seen that, by implementing the feature construction method based on the artificial intelligence model described in this embodiment, the response length feature, the query type feature, the return code feature, the question-section entity count feature, the authority-section entity count feature, and the time-to-live feature can be introduced, thereby enhancing the semantic representation of the features. Because these features come from isolated semantic spaces that reflect different information, extracting the response-class features as separate semantic modules addresses the problem that a model cannot distinguish the boundaries between features with different semantics. According to the experimental results, training the model with the features obtained by the method reduces the false alarm rate of the model by 8.52% and improves the detection rate by 1.64%. In addition, by assigning attention scores to the relevant features of the text to be detected, the automatic attention model provided by the method can clearly capture the parts that lead to the response result, and the extracted text features can be further processed into judgment rules.
Example 2
Referring to fig. 2, fig. 2 is a schematic structural diagram of a feature constructing apparatus based on an artificial intelligence model according to this embodiment. As shown in fig. 2, the artificial intelligence model-based feature construction apparatus includes:
a collecting unit 210 for collecting DNS logs;
a deduplication unit 220, configured to sort and deduplicate the DNS logs to obtain a deduplication log;
an extracting unit 230, configured to extract a feature field in the deduplication log; the characteristic field comprises a request domain name field, a response length field, a query type field, a return code field, a question part containing entity number field, an authority area containing entity number field and a time-to-live characteristic field;
a processing unit 240, configured to perform feature vectorization processing on the feature field to obtain a feature vector;
a construction unit 250, configured to construct the polysemous feature based on the feature vector.
As an alternative embodiment, the deduplication unit 220 includes:
a sorting subunit 221, configured to sort the DNS logs into a json file in which each row includes a field name, a field value, and a category label;
a calculation subunit 222, configured to calculate the md5 value of the field names and field values of each row in the json file;
and a deduplication subunit 223, configured to filter out json rows that share the same md5 value, so as to obtain the deduplication log.
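The md5-based deduplication performed by these subunits can be sketched as follows. The sketch assumes one JSON object per row, hashes the field names and values in sorted key order, and leaves the category label out of the hash; the key name `label` and the sample rows are hypothetical.

```python
import hashlib
import json

def dedupe_json_lines(lines):
    """Keep the first JSON row for each md5 of its field names and values."""
    seen, kept = set(), []
    for line in lines:
        record = json.loads(line)
        # Hash only field names and values; the category label is excluded (assumption).
        fields = {k: v for k, v in record.items() if k != "label"}
        digest = hashlib.md5(
            json.dumps(fields, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(line)
    return kept

lines = [
    '{"qname": "example.com", "ttl": 300, "label": 0}',
    '{"ttl": 300, "qname": "example.com", "label": 0}',
    '{"qname": "ac5y7.hhmukp.com", "ttl": 60, "label": 1}',
]
deduped = dedupe_json_lines(lines)
```

Sorting the keys before hashing means two rows with the same fields in a different order are treated as duplicates, so the second row above is filtered out.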
As an alternative embodiment, the processing unit 240 includes:
an obtaining subunit 241, configured to obtain a character string feature and a category feature included in the feature field;
a processing subunit 242, configured to perform feature vectorization processing on the category features to obtain semantic feature vectors;
the processing subunit 242 is further configured to perform feature vectorization processing on the character string features to obtain a domain name feature vector;
and a combining subunit 243, configured to combine the semantic feature vector and the domain name feature vector to obtain a feature vector.
As an alternative embodiment, the construction unit 250 includes:
a dividing subunit 251, configured to divide the feature vector based on different field semantics to obtain a plurality of semantic features;
and a splicing subunit 252, configured to splice the plurality of semantic features to obtain the polysemous feature.
As an optional implementation manner, the artificial intelligence model-based feature construction apparatus further includes:
the building unit 260 is used for building an artificial intelligence model comprising an automatic attention module and a fully-connected classification module;
and the training unit 270 is configured to train the artificial intelligence model by using a gradient back propagation method based on the polysemous feature, so as to obtain a DGA domain name detection model.
In this embodiment, for the explanation of the feature constructing apparatus based on the artificial intelligence model, reference may be made to the description in embodiment 1, and details are not repeated in this embodiment.
It can be seen that, by implementing the feature construction device based on the artificial intelligence model described in this embodiment, the response length feature, the query type feature, the return code feature, the question-section entity count feature, the authority-section entity count feature, and the time-to-live feature can be introduced, thereby enhancing the semantic representation of the features. Because these features come from isolated semantic spaces that reflect different information, extracting the response-class features as separate semantic modules addresses the problem that a model cannot distinguish the boundaries between features with different semantics. According to the experimental results, training the model with the features obtained by the device reduces the false alarm rate of the model by 8.52% and improves the detection rate by 1.64%. In addition, by assigning attention scores to the relevant features of the text to be detected, the automatic attention model provided by the device can clearly capture the parts that lead to the response result, and the extracted text features can be further processed into judgment rules.
The embodiment of the application provides an electronic device, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic device to execute the feature construction method based on the artificial intelligence model in the embodiment 1 of the application.
The embodiment of the present application provides a computer-readable storage medium, which stores computer program instructions, and when the computer program instructions are read and executed by a processor, the method for constructing features based on an artificial intelligence model in embodiment 1 of the present application is performed.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A feature construction method based on an artificial intelligence model is characterized by comprising the following steps:
collecting DNS logs;
sorting and removing duplication of the DNS log to obtain a duplicate removal log;
extracting a characteristic field from the duplicate removal log; the characteristic field comprises a request domain name field, a response length field, a query type field, a return code field, a question part containing entity number field, an authority area containing entity number field and a time-to-live characteristic field;
performing feature vectorization processing on the feature field to obtain a feature vector;
constructing a polysemous feature based on the feature vector.
2. The artificial intelligence model-based feature construction method according to claim 1, wherein the step of sorting and deduplicating the DNS logs to obtain a deduplication log comprises:
sorting the DNS logs into a json file in which each row comprises a field name, a field value, and a category label;
calculating the md5 value of the field names and field values of each row in the json file;
and filtering out json rows that share the same md5 value to obtain the deduplication log.
3. The method for constructing features based on artificial intelligence model according to claim 1, wherein the step of performing feature vectorization on the feature fields to obtain feature vectors includes:
acquiring character string characteristics and category characteristics included in the characteristic field;
performing feature vectorization processing on the category features to obtain semantic feature vectors;
performing feature vectorization processing on the character string features to obtain a domain name feature vector;
and combining the semantic feature vector and the domain name feature vector to obtain a feature vector.
4. The artificial intelligence model-based feature construction method according to claim 1, wherein the step of constructing the polysemous feature based on the feature vector comprises:
dividing the feature vector based on different field semantics to obtain a plurality of semantic features;
and splicing the plurality of semantic features to obtain the polysemous feature.
5. The artificial intelligence model-based feature construction method of claim 1, further comprising:
building an artificial intelligence model comprising an automatic attention module and a fully connected classification module;
and training the artificial intelligence model by gradient backpropagation based on the polysemous feature to obtain a DGA domain name detection model.
6. An artificial intelligence model-based feature construction apparatus, comprising:
a collecting unit for collecting DNS logs;
the duplicate removal unit is used for sorting and removing the duplicate of the DNS log to obtain a duplicate removal log;
an extracting unit, configured to extract a feature field from the deduplication log; the characteristic field comprises a request domain name field, a response length field, a query type field, a return code field, a question part containing entity number field, an authority area containing entity number field and a time-to-live characteristic field;
a processing unit, configured to perform feature vectorization processing on the feature field to obtain a feature vector;
and a construction unit, configured to construct the polysemous feature based on the feature vector.
7. The artificial intelligence model-based feature construction apparatus of claim 6, wherein the deduplication unit comprises:
a sorting subunit, configured to sort the DNS logs into a json file in which each row comprises a field name, a field value, and a category label;
a calculation subunit, configured to calculate the md5 value of the field names and field values of each row in the json file;
and a deduplication subunit, configured to filter out json rows that share the same md5 value to obtain the deduplication log.
8. The artificial intelligence model-based feature construction apparatus of claim 6, wherein the processing unit comprises:
an obtaining subunit, configured to obtain a character string feature and a category feature that are included in the feature field;
the processing subunit is used for carrying out feature vectorization processing on the category features to obtain semantic feature vectors;
the processing subunit is further configured to perform feature vectorization processing on the character string features to obtain a domain name feature vector;
and the combination subunit is used for combining the semantic feature vector and the domain name feature vector to obtain a feature vector.
9. An electronic device, comprising a memory for storing a computer program and a processor for executing the computer program to cause the electronic device to perform the artificial intelligence model based feature construction method of any one of claims 1 to 5.
10. A readable storage medium storing computer program instructions, which when read and executed by a processor, perform the method for constructing artificial intelligence model-based features according to any one of claims 1 to 5.
CN202210951261.6A 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model Active CN115334039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210951261.6A CN115334039B (en) 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210951261.6A CN115334039B (en) 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model

Publications (2)

Publication Number Publication Date
CN115334039A true CN115334039A (en) 2022-11-11
CN115334039B CN115334039B (en) 2024-02-20

Family

ID=83921814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210951261.6A Active CN115334039B (en) 2022-08-09 2022-08-09 Feature construction method and device based on artificial intelligent model

Country Status (1)

Country Link
CN (1) CN115334039B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170147682A1 (en) * 2015-11-19 2017-05-25 King Abdulaziz City For Science And Technology Automated text-evaluation of user generated text
CN108200054A (en) * 2017-12-29 2018-06-22 北京奇安信科技有限公司 A kind of malice domain name detection method and device based on dns resolution
CN110266647A (en) * 2019-05-22 2019-09-20 北京金睛云华科技有限公司 It is a kind of to order and control communication check method and system
CN111708860A (en) * 2020-06-15 2020-09-25 北京优特捷信息技术有限公司 Information extraction method, device, equipment and storage medium
CN112769755A (en) * 2020-12-18 2021-05-07 国家计算机网络与信息安全管理中心 DNS log statistical feature extraction method for threat detection
CN112839012A (en) * 2019-11-22 2021-05-25 中国移动通信有限公司研究院 Zombie program domain name identification method, device, equipment and storage medium
CN114386410A (en) * 2022-01-11 2022-04-22 腾讯科技(深圳)有限公司 Training method and text processing method of pre-training model

Also Published As

Publication number Publication date
CN115334039B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
Nouh et al. Understanding the radical mind: Identifying signals to detect extremist content on twitter
CN111695033A (en) Enterprise public opinion analysis method, device, electronic equipment and medium
CN112241481B (en) Cross-modal news event classification method and system based on graph neural network
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
Maharjan et al. A multi-task approach to predict likability of books
CN110750640A (en) Text data classification method and device based on neural network model and storage medium
CN109739986A (en) A kind of complaint short text classification method based on Deep integrating study
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
Junnarkar et al. E-mail spam classification via machine learning and natural language processing
CN106909575B (en) Text clustering method and device
CN112463774B (en) Text data duplication eliminating method, equipment and storage medium
CN112052451A (en) Webshell detection method and device
CN110990676A (en) Social media hotspot topic extraction method and system
CN112651025A (en) Webshell detection method based on character-level embedded code
Sah et al. An approach for malicious spam detection in email with comparison of different classifiers
CN117195220A (en) Intelligent contract vulnerability detection method and system based on Tree-LSTM and BiLSTM
CN115495744A (en) Threat information classification method, device, electronic equipment and storage medium
CN111797247B (en) Case pushing method and device based on artificial intelligence, electronic equipment and medium
CN113220964B (en) Viewpoint mining method based on short text in network message field
Krokos et al. A look into twitter hashtag discovery and generation
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN115309899B (en) Method and system for identifying and storing specific content in text
CN113971283A (en) Malicious application program detection method and device based on features
CN108875060B (en) Website identification method and identification system
CN115344563B (en) Data deduplication method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240115

Address after: 071800 Conference Center 1-184, South Section of Baojin Expressway, Xiong'an Area, Xiong'an New District, Baoding City, Hebei Province

Applicant after: Tianrongxin Xiongan Network Security Technology Co.,Ltd.

Applicant after: Beijing Topsec Network Security Technology Co.,Ltd.

Applicant after: Topsec Technologies Inc.

Applicant after: BEIJING TOPSEC SOFTWARE Co.,Ltd.

Address before: 100085 4th floor, building 3, yard 1, Shangdi East Road, Haidian District, Beijing

Applicant before: Beijing Topsec Network Security Technology Co.,Ltd.

Applicant before: Topsec Technologies Inc.

Applicant before: BEIJING TOPSEC SOFTWARE Co.,Ltd.

GR01 Patent grant