CN108710608A

CN108710608A - Method for generating a malicious domain name corpus based on contextual semantics

Info

Publication number: CN108710608A
Application number: CN201810408635.3A
Authority: CN
Inventors: 黄诚; 方勇; 刘亮; 彭嘉毅; 刘勇成
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2018-04-28
Filing date: 2018-04-28
Publication date: 2018-10-26

Abstract

The present invention provides a method for generating a malicious domain name corpus based on contextual semantics. The method performs semantic analysis on the context words and phrases of the sentences in which malicious domain names appear, and uses natural language processing techniques to automatically generate a corpus that describes malicious domain names.

Description

Method for generating a malicious domain name corpus based on contextual semantics

Technical field

The present invention provides a method for generating a malicious domain name corpus based on contextual semantics. The method performs semantic analysis on the context words and phrases of the sentences in which malicious domain names appear, and uses natural language processing techniques to automatically generate a corpus that describes malicious domain names.

Background art

In recent years, as the Internet has become increasingly integrated with the core business of enterprises, more and more organizations and companies have suffered hacker attacks, and APT (Advanced Persistent Threat) attacks emerge one after another. To keep pace with rapidly changing cybercrime techniques, security vendors and related institutions continually discover and trace major security attacks, and disclose attack technique details, malicious domain names, and similar information through various channels (blogs, forums, microblogs, professional reports, etc.). These published attack analysis reports are generally written in English, and mainly describe and analyze the target of the attack, the malicious domain names and IP addresses used by the attacker, the malicious tools used, and so on. The malicious domain names or IP addresses in such reports may be reused by hackers in other attacks. To detect and block these potential attacks, security vendors often collect these malicious domain names and add them to the blacklists of firewalls or antivirus software.

At present, techniques for extracting malicious domain names from text rely mainly on regular expressions and whitelists. Such techniques have a high false positive rate, because a domain name that is not on the whitelist is not necessarily malicious. Therefore, automatically extracting malicious domain names from massive volumes of technical text plays an important role in network attack detection and defense.
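The regular-expression-plus-whitelist approach criticized above can be illustrated with a minimal sketch. The domain pattern, the whitelist contents, and the sample report text are all assumptions made for illustration; they are not taken from the patent.

```python
import re

# Simplified domain pattern (an assumption): real extractors also handle
# defanged forms such as "example[.]com" and internationalized names.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
)

# Hypothetical whitelist of known-benign domains.
WHITELIST = {"microsoft.com", "github.com"}


def extract_candidate_domains(text, whitelist=WHITELIST):
    """Return every domain found in the text that is not on the whitelist."""
    found = {m.group(0).lower() for m in DOMAIN_RE.finditer(text)}
    return sorted(found - whitelist)


report = (
    "The dropper was hosted on github.com and beaconed to "
    "update-check.example.net; see vendor-blog.example.org for details."
)
candidates = extract_candidate_domains(report)
print(candidates)
```

In this sample, the vendor blog domain is reported alongside the command-and-control domain simply because it is absent from the whitelist, which is the false positive behavior the paragraph above describes.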

Building an automated extraction model for malicious domain names with a low false positive rate presupposes a malicious domain name corpus generated from a large amount of text data. Malicious domain name corpus entries are the words or phrases in a text that describe the security of a domain name; they describe the domain name's context at the level of text semantics. These entries can also be used to train a supervised learning model that labels or predicts the security of domain names in text data. The malicious domain name corpus extraction model therefore offers a new approach to extracting malicious domain names from massive text, and the generated corpus data can be used for automatic domain name classification in various threat systems.

Generating a malicious domain name corpus mainly requires solving the following problems:

(1) How to overcome the problems introduced by the traditional BOW model: excessive redundancy, low information content per phrase, and extracted normal domain name corpus entries that are nearly identical in content to the malicious ones.

(2) How to analyze the importance of feature words with a ranking algorithm when the feature word set contains a very large number of words.

(3) How to address problems such as the sparsity of vector space model representations and reduce the dimension of the final malicious corpus.

The present invention focuses on the three problems above and implements a method for generating a malicious domain name corpus based on contextual semantics.

Summary of the invention

The invention combines several techniques, namely a malicious corpus extraction algorithm based on contextual semantics, a feature-word principal component analysis method, a selection method based on unigram frequency, and a corpus weight calculation method based on the TF-IDF algorithm, to solve the feature extraction problem in machine learning classification models.

The invention aims to achieve the following objectives:

(1) Analysis is performed only on the contextual semantics of the sentences in which malicious domain names appear, generating corpus entries (context words and 2-gram phrases) that can describe malicious domain names and thereby improving the validity of the corpus.

(2) Many English stop words and punctuation marks occur repeatedly in text, yet most of them contribute very little to the meaning of a sentence. These stop words and punctuation marks are therefore deleted from the text to increase the effective information entropy of the corpus.

(3) The malicious domain name corpus data is processed with a dictionary-based method: a dictionary lookup maps each inflected form to its base form, restoring the root form of the word and reducing the dimension of the final malicious corpus.

(4) The IDF value of each corpus entry is obtained from a statistical analysis interface backed by massive data. The computed TF-IDF value represents how relevant the entry is to describing malicious domain names, and gives a quantitative measure of its importance in describing domain name security.

(5) Overall, the corpus generated with machine learning theory and models can be used directly to extract malicious domain names from rich text, providing a new direction for malicious domain name extraction from massive text.

To achieve the above objectives, the invention adopts the following technical solution: the malicious domain name corpus generation model based on contextual semantics consists of three parts, namely a data input layer, a business logic layer, and a data output layer.

The data input layer collects external data and formats the different categories of data. The main sources of external data are publicly available analysis articles, documents, and blog posts on malicious and APT attacks. Because these sources use different data formats, a data format processing component is needed to process them.

The business logic layer is the core technical layer of the malicious domain name corpus extraction model. It implements every function between the formatted text data and the final malicious domain name corpus, including malicious domain name extraction, the corpus extraction algorithm, corpus dimensionality reduction, and weight calculation.

The data output layer provides the weighted malicious domain name corpus data, from which a corpus can be built for use by other machine classification models.

Description of the drawings

Fig. 1 is an overall framework diagram of the extraction model of the present invention.

Fig. 2 is a flow chart of the malicious domain name corpus extraction algorithm of the present invention.

Fig. 3 is a flow chart of the malicious domain name corpus dimensionality reduction method of the present invention.

Fig. 4 is a flow chart of the malicious domain name corpus weight calculation of the present invention.

Detailed description of the embodiments:

The method for generating a malicious domain name corpus based on contextual semantics comprises five main steps: data collection and formatting, malicious domain name extraction, corpus extraction, corpus dimensionality reduction, and corpus weight calculation.

Fig. 1 shows the main framework of the model and describes in detail the design and deployment architecture of the malicious domain name corpus extraction model. External data is collected and formatted, and then passes through malicious domain name extraction, corpus extraction, corpus dimensionality reduction, and corpus weight calculation. The resulting malicious domain name corpus entries are aggregated and stored for use by other machine classification models.

Fig. 2 shows the malicious domain name corpus extraction algorithm flow of the model, which describes in detail how the malicious domain name corpus is processed and generated. External data is collected from a document set containing malicious domain name analysis content and formatted according to its category, and only sentences containing domain names are extracted. All domain names in these sentences are extracted, an online domain name detection platform is used to label the security of each domain name, and all malicious domain names are selected. The sentences containing malicious domain names are then selected from all sentences. These sentences are tokenized, stripped of stop words, and restored to base tense, and the processed words are combined into a bag of words, yielding the context word set. At the same time, a 2-gram generation algorithm extracts phrases from the sentences containing malicious domain names, producing the 2-gram phrases of the malicious corpus. Finally, the 2-gram phrases and context words obtained above are deduplicated.
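The extraction flow of Fig. 2 might be sketched as follows. This is a minimal illustration under several assumptions: the `is_malicious` callback is a hypothetical stand-in for the online domain name detection platform, the stop-word set is a tiny illustrative list, the domain itself is dropped from the token stream, and 2-grams are taken over the full token sequence before stop-word removal. Tense restoration is omitted here.

```python
import re

# Simplified domain pattern (an assumption, not from the patent).
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
)

# Tiny illustrative stop-word set; a real run would use a full list.
STOP_WORDS = {"the", "a", "to", "is", "are", "on", "for", "and", "of"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence, drop the domains, and keep alphabetic tokens."""
    without_domains = DOMAIN_RE.sub(" ", sentence.lower())
    return re.findall(r"[a-z]+", without_domains)


def build_corpus(text, is_malicious):
    """Return (context_words, bigrams) from sentences with a malicious domain."""
    words, bigrams = set(), set()  # sets deduplicate the entries
    for sentence in split_sentences(text):
        domains = [d.lower() for d in DOMAIN_RE.findall(sentence)]
        if not any(is_malicious(d) for d in domains):
            continue  # no malicious domain in this sentence: skip it
        tokens = tokenize(sentence)
        bigrams.update(zip(tokens, tokens[1:]))  # 2-gram phrases
        words.update(t for t in tokens if t not in STOP_WORDS)
    return words, bigrams


text = (
    "The malware sends stolen data to evil.example.net. "
    "Updates are published on good.example.org."
)
# Hypothetical verdict function standing in for the online detection platform.
words, bigrams = build_corpus(text, lambda d: d == "evil.example.net")
print(sorted(words))
print(sorted(bigrams))
```

Only the first sentence contributes entries, because the second sentence contains no domain name labelled as malicious.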

Fig. 3 shows the malicious domain name corpus dimensionality reduction flow of the model, which describes in detail how effective information is extracted from the original malicious domain name corpus and how morphological normalization is applied to reduce the corpus dimension. Based on an analysis of existing dimensionality reduction methods in text classification, and taking into account the actual content of the malicious domain name corpus and the characteristics of English writing, the present invention proposes two methods for reducing the dimension of the corpus: a selection method based on unigram frequency and a feature-word principal component analysis method.

The selection method based on unigram frequency rests on the observation that many English stop words and punctuation marks occur repeatedly in text, while most of them contribute very little to the meaning of a sentence and carry very little information entropy. They can therefore be deleted directly from the text. Stop words are words that mainly connect other words but carry no meaning of their own in a sentence. Analysis of the NLTK corpus shows that there are only 127 common English stop words, but some of them carry a certain sentiment or subjective attitude that can affect the meaning of the whole sentence or its target, such as: no, not, too, very. Although these words are English stop words, such meaningful stop words were not removed in the experiments, while the other stop words were deleted after tokenization. Similarly, certain punctuation marks in a sentence (such as "!") can affect the described domain name to some extent, so these characters are retained. Other characters (such as the double quotation mark and "$") contribute nothing to the description of the target domain name and are also deleted.
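A minimal sketch of this filtering rule is given below. The stop-word set is a small illustrative subset rather than the 127-word NLTK list, and the function and constant names are hypothetical; only the retained words (no, not, too, very) and the retained "!" come from the description above.

```python
# Small illustrative subset of English stop words (an assumption).
STOP_WORDS = {
    "the", "is", "a", "an", "of", "to", "and", "this",
    "no", "not", "too", "very",
}

# Stop words that carry sentiment or attitude and are therefore retained.
KEPT_STOP_WORDS = {"no", "not", "too", "very"}

# Punctuation that can influence the description of a domain name.
KEPT_PUNCTUATION = {"!"}


def filter_tokens(tokens):
    """Drop low-information stop words and symbols, keep meaningful ones."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if tok in KEPT_PUNCTUATION:
            kept.append(tok)
        elif low.isalnum():
            if low in STOP_WORDS and low not in KEPT_STOP_WORDS:
                continue  # ordinary stop word: delete
            kept.append(low)
        # any other symbol (for example '"' or '$') is deleted
    return kept


tokens = ["This", "domain", "is", "not", "safe", "!", '"', "$"]
filtered = filter_tokens(tokens)
print(filtered)
```

Keeping "not" matters here: without it, the sentence would reduce to "domain safe", which reverses the security judgement being described.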

The feature-word principal component analysis method merges the different forms of a word, that is, it performs morphological normalization, in order to reduce the dimension of the whole corpus. It consists mainly of lemmatization and stemming: lemmatization reduces a word in any form to its general form, and stemming extracts the stem or root form of the word. Lemmatization mainly restores the verb forms that differ across contexts and sentences, such as the third-person singular, the simple present tense, and the past tense. Current approaches to this operation include rule-based methods, dictionary-based methods, machine-learning-based methods, and hybrid methods, of which dictionary-based lemmatization is the most widely used. To implement lemmatization and stemming, the present invention processes the malicious domain name corpus data with a dictionary-based method, whose main idea is to look up each inflected form in a dictionary mapping to find its base form and thereby restore the root form of the word. In the implementation, the present invention mainly uses the dictionaries of the NLTK and WordNet projects to restore the corpus entries: the existing dictionaries are used to identify word forms and map each inflected form to its base form, reducing the dimension of the final malicious corpus.
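The dictionary lookup idea can be sketched in a few lines. The mapping below is a hypothetical miniature dictionary used only to keep the example self-contained; the description names the NLTK and WordNet dictionaries as the actual resources.

```python
# Hypothetical miniature form-to-lemma dictionary standing in for the
# NLTK/WordNet resources named in the description.
LEMMA_DICT = {
    "connects": "connect",
    "connected": "connect",
    "connecting": "connect",
    "hosts": "host",
    "hosted": "host",
    "domains": "domain",
}


def lemmatize(token, dictionary=LEMMA_DICT):
    """Map an inflected form to its base form; unknown words pass through."""
    low = token.lower()
    return dictionary.get(low, low)


def reduce_dimension(entries):
    """Merge the different forms of the same word into a single entry."""
    return {lemmatize(entry) for entry in entries}


entries = {"connects", "connected", "connecting", "hosts", "hosted", "malware"}
reduced = reduce_dimension(entries)
print(len(entries), "->", len(reduced), sorted(reduced))
```

Six surface forms collapse into three entries, which is the dimensionality reduction effect that morphological normalization is meant to achieve.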

Fig. 4 shows the malicious domain name corpus weight calculation flow of the model, which describes in detail how the weight expressing each entry's relevance to malicious domain names is calculated. To describe the importance of each entry in the malicious domain name corpus more accurately, the weight of each entry in the corpus must be calculated after dimensionality reduction. Weight calculation makes it possible to select the entries that are most useful to a classifier. After the preceding dimensionality reduction, the dimension of the corpus has been reduced, but the weight of each entry in the corpus still needs to be calculated in detail with the TF-IDF algorithm.

The detailed steps of the corpus weight calculation are as follows. Step 1: calculate the TF (term frequency) value of each entry (w) in the malicious domain name corpus (U), using the counts from before the deduplication operation. Step 2: calculate the IDF (inverse document frequency) value of each entry (w) through the query interface of a Microsoft online API (Application Programming Interface). Step 3: calculate the weight of each entry (w) with the TF-IDF formula. Step 4: sort all entries by weight and return the result. This processing yields the weight of each entry, whose value represents its importance in describing domain name security.
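The four steps above can be sketched as follows. The document counts are invented placeholder statistics standing in for the online query interface, whose actual API is not described in the source, and the smoothed logarithmic IDF is one common form chosen as an assumption because the exact formula is not given.

```python
import math
from collections import Counter

# Hypothetical web-scale statistics standing in for the online query
# interface; every number here is a made-up placeholder.
TOTAL_DOCS = 1_000_000
DOC_COUNTS = {"malicious": 1_000, "server": 100_000, "not": 900_000}


def idf(entry):
    """Step 2: inverse document frequency (smoothed form, an assumption)."""
    return math.log(TOTAL_DOCS / (1 + DOC_COUNTS.get(entry, 0)))


def rank_entries(occurrences):
    """Steps 1, 3 and 4: TF before deduplication, TF-IDF weight, sorting."""
    counts = Counter(occurrences)   # occurrences still contain duplicates
    total = len(occurrences)
    weights = {w: (c / total) * idf(w) for w, c in counts.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


occurrences = ["malicious", "server", "malicious", "not", "server", "malicious"]
ranking = rank_entries(occurrences)
for entry, weight in ranking:
    print(f"{entry}: {weight:.4f}")
```

With these placeholder statistics, a frequent but globally common word such as "not" ends up with a much lower weight than "malicious", even though both occur in the malicious corpus.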

The working process of the present invention is as follows:

Using the malicious corpus extraction algorithm based on contextual semantics, external data is collected from a document set containing malicious domain name analysis content and formatted according to its category, and only sentences containing domain names are extracted. All domain names in these sentences are extracted, an online domain name detection platform is used to label the security of each domain name, and all malicious domain names are selected. The sentences containing malicious domain names are selected from all sentences and processed in sequence by three flows, namely the corpus extraction algorithm, the corpus dimensionality reduction algorithm, and the corpus weight calculation algorithm, to generate the malicious domain name corpus. The model solves the feature extraction problem in machine learning classification models, and the corpus generated with machine learning theory and models can be used directly to extract malicious domain names from rich text.

The improvement made by the malicious corpus extraction algorithm based on contextual semantics is as follows:

1) Contextual semantics is introduced on top of the traditional BOW (bag-of-words) model. Analysis is performed only on the contextual semantics of the sentences in which malicious domain names appear, generating corpus entries (context words and 2-gram phrases) that can describe malicious domain names and thereby improving the validity of the corpus.

The improvement addressing problems such as the sparsity of vector space model representations is as follows:

1) The corpus dimensionality reduction method combines unigram frequency with the feature-word principal component analysis method to reduce the dimension of the corpus. The selection method based on unigram frequency mainly deletes stop words and punctuation marks with low information entropy. The feature-word principal component analysis method mainly uses a dictionary lookup to find the base form of each inflected form, restoring the root form of the word and achieving morphological normalization.

The improvement to the calculation of each entry's IDF value is as follows:

1) The IDF value of each entry is obtained from a statistical analysis interface backed by massive data and reflects the actual distribution of the phrase on the Internet, so the computed TF-IDF value better represents how relevant the entry is to describing malicious domain names.

Claims

1. A method for generating a malicious domain name corpus based on contextual semantics, characterized by comprising the following steps:

(1) Step 1: using the malicious corpus extraction algorithm based on contextual semantics, collecting external data from a document set containing malicious domain name analysis content, formatting the different categories of data, and extracting only the sentences that contain domain names;

(2) Step 2: extracting all domain names in the sentences, labelling the security of each domain name using an online domain name detection platform, and selecting all malicious domain names;

(3) Step 3: selecting the sentences containing malicious domain names from all sentences, and processing these sentences further;

(4) Step 4: extracting phrases from the sentences obtained in the previous step using a 2-gram generation algorithm, to generate the 2-gram phrases of the malicious corpus;

(5) Step 5: tokenizing the sentences of Step 3, removing stop words, and restoring tense, and then combining the processed words into a bag of words to obtain the context word set;

(6) Step 6: deduplicating the 2-gram phrases obtained in Step 4 and the context words obtained in Step 5;

(7) Step 7: screening English stop words and punctuation marks using the selection method based on unigram frequency;

(8) Step 8: applying lemmatization and stemming to the corpus using the feature-word principal component analysis method, merging the different forms of each word, that is, performing morphological normalization, to reduce the dimension of the whole corpus;

(9) Step 9: calculating the weight of each entry in the corpus using the corpus weight calculation method based on the TF-IDF algorithm, the value of which represents the entry's importance in describing domain name security.

2. The malicious corpus extraction algorithm based on contextual semantics according to claim 1, characterized in that: contextual semantics is introduced on top of the traditional BOW (bag-of-words) model, and analysis is performed only on the contextual semantics of the sentences in which malicious domain names appear, generating corpus entries (context words and 2-gram phrases) that can describe malicious domain names, thereby improving the validity of the corpus.

3. The selection method based on unigram frequency according to claim 1, characterized in that: many English stop words and punctuation marks occur repeatedly in text, while most of them contribute very little to the meaning of a sentence and carry very little information entropy, and these stop words and punctuation marks are therefore deleted directly from the text.

4. The feature-word principal component analysis method according to claim 1, characterized in that: the malicious domain name corpus data is processed with a dictionary-based method, in which a dictionary lookup finds the base form of each inflected form and restores the root form of the word, reducing the dimension of the final malicious corpus.

5. The corpus weight calculation method based on the TF-IDF algorithm according to claim 1, characterized in that: the IDF value of each entry is obtained from a statistical analysis interface backed by massive data, the computed TF-IDF value represents how relevant the entry is to describing malicious domain names, and the number of features can be chosen according to the TF-IDF values during actual feature selection.