CN108710608A - A method for generating a malicious domain name corpus based on context semantics - Google Patents

A method for generating a malicious domain name corpus based on context semantics

Info

Publication number
CN108710608A
CN108710608A CN201810408635.3A CN201810408635A CN108710608A CN 108710608 A CN108710608 A CN 108710608A CN 201810408635 A CN201810408635 A CN 201810408635A CN 108710608 A CN108710608 A CN 108710608A
Authority
CN
China
Prior art keywords
domain name
corpus
malicious
context
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810408635.3A
Other languages
Chinese (zh)
Inventor
黄诚
方勇
刘亮
彭嘉毅
刘勇成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201810408635.3A
Publication of CN108710608A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a method for generating a malicious domain name corpus based on context semantics. The method performs semantic analysis on the context words and phrases of the sentences in which malicious domain names appear, and uses natural language processing techniques to automatically generate a corpus that describes malicious domain names.

Description

A method for generating a malicious domain name corpus based on context semantics
Technical field
The present invention provides a method for generating a malicious domain name corpus based on context semantics. The method performs semantic analysis on the context words and phrases of the sentences in which malicious domain names appear, and uses natural language processing techniques to automatically generate a corpus that describes malicious domain names.
Background
In recent years, as the Internet has merged with the core business of enterprises, more and more organizations and companies have suffered hacker attacks, and APT (Advanced Persistent Threat) attacks keep emerging. To keep pace with rapidly changing cybercrime techniques, security vendors and related institutions continually discover and trace major security attacks, and disclose attack technique details, malicious domain names, and similar information through various channels (blogs, forums, microblogs, professional reports, and so on). These published attack analysis reports are generally written in English and mainly describe and analyze the attack target, the malicious domain names and IP addresses used by the attacker, the malicious tools, and so on. Because the malicious domain names or IP addresses in such reports may be reused by hackers in other attacks, security vendors often collect them and add them to the blacklists of firewalls or antivirus software in order to detect and block these potential attacks.
At present, techniques for extracting malicious domain names from text rely mainly on regular expressions and whitelists. This approach has a high false positive rate, because a domain name that is absent from the whitelist is not necessarily malicious. Automatically extracting malicious domain names from massive volumes of technical text therefore plays an important role in network attack detection and defense.
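The regular-expression-plus-whitelist baseline described above can be sketched as follows. This is a minimal illustration, not the tooling used by any particular vendor: the domain pattern is deliberately simplified, and the whitelist contents and sample text are hypothetical. It shows why the baseline over-reports: any domain missing from the whitelist is flagged, whether or not it is malicious.

```python
import re

# Simplified domain pattern (an assumption; production patterns are more involved).
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)

# Hypothetical whitelist of known-benign domains.
WHITELIST = {"example.com", "microsoft.com"}


def extract_suspect_domains(text):
    """Return every domain in the text that is not on the whitelist."""
    found = {m.group(0).lower() for m in DOMAIN_RE.finditer(text)}
    return sorted(found - WHITELIST)


text = ("The dropper contacts evil-c2.net for commands. "
        "Details were posted on blog.vendor.org and example.com.")
suspects = extract_suspect_domains(text)
print(suspects)
```

In this sample, `blog.vendor.org` is reported alongside `evil-c2.net` only because it is absent from the whitelist, which is the false positive behavior the paragraph describes.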
Building an automated extraction model for malicious domain names with a low false positive rate presupposes a malicious domain name corpus generated from a large amount of text data. Malicious domain name corpus entries are the words or phrases in a text that describe the security of a domain name; they describe its context at the level of text semantics. These entries can also be used to train supervised learning models that label or predict the security of domain names in text data. A malicious domain name corpus extraction model therefore offers a new approach to extracting malicious domain names from massive text, and the corpus data it generates can be used for automatic domain name classification in various threat systems.
The main problems to be solved in generating a malicious domain name corpus are:
(1) How to overcome the drawbacks of the traditional BOW model: excessive redundancy, low information content per phrase, and extracted corpus entries for normal domain names that closely resemble those for malicious domain names.
(2) How to analyze the importance of feature words with a ranking algorithm when the feature word set is very large.
(3) How to address problems such as the sparsity of vector space model representations and reduce the dimensionality of the final malicious corpus.
The present invention focuses on these three problems and implements a method for generating a malicious domain name corpus based on context semantics.
Summary of the invention
The invention combines several techniques: a context-semantics-based malicious corpus extraction algorithm, a feature-word principal component analysis method, a selection method based on word frequency (uni-gram frequency), and a TF-IDF-based corpus weight calculation method. Together they address the feature extraction problem in machine learning classification models.
The invention has the following goals:
(1) Analyze only the context semantics of the sentences in which malicious domain names appear, and generate corpus entries (context words and 2-gram phrases) that can describe malicious domain names, thereby improving the validity of the corpus.
(2) Many English stop words and punctuation marks occur repeatedly in text, and most of them have little effect on the meaning of a sentence. These stop words and punctuation marks therefore need to be deleted from the text to increase the effective information entropy of the corpus.
(3) Process the malicious domain name corpus data with a dictionary-based method: use dictionary lookups to find the base form of each word form and restore the word to its root form, reducing the dimensionality of the final malicious corpus.
(4) Obtain the IDF value of each corpus entry from a statistical analysis interface backed by massive data. The resulting TF-IDF value represents how relevant the entry is to describing malicious domain names and provides a quantitative measure of its importance in describing domain name security.
(5) Overall, make the corpus generated with machine learning theory and models directly usable for extracting malicious domain names from rich text, offering a new direction for malicious domain name extraction from massive text.
To achieve these goals, the invention adopts the following technical solution: the context-semantics-based malicious domain name corpus generation model consists of three parts, namely a data input layer, a business logic layer, and a data output layer.
The data input layer collects external data and formats the different categories of data. The main external data sources are publicly available APT attack analysis articles, documents, and blog posts. Because these come in different data formats, a data format processing component is needed to handle them.
The business logic layer is the core technical layer of the malicious domain name corpus extraction model. It implements every function between the formatted text data and the final malicious domain name corpus, including malicious domain name extraction, the corpus extraction algorithm, corpus dimensionality reduction, and weight calculation.
The data output layer provides weighted malicious domain name corpus data, from which a corpus can be built for use by other machine learning classification models.
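The three-layer structure can be pictured with a small skeleton. This is only a sketch of how the layers might hand data to one another; the function names, the `WeightedTerm` type, and the length-based scoring callback are all hypothetical placeholders and stand in for the real extraction, dimensionality reduction, and weighting steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class WeightedTerm:
    term: str      # a context word or 2-gram phrase
    weight: float  # relevance weight, e.g. a TF-IDF value


def input_layer(raw_documents: List[str]) -> List[str]:
    """Data input layer: normalize heterogeneous sources into plain text."""
    return [" ".join(doc.split()) for doc in raw_documents]


def business_layer(texts: List[str], score: Callable[[str], float]) -> Dict[str, float]:
    """Business logic layer (placeholder): score each lower-cased token."""
    weights: Dict[str, float] = {}
    for text in texts:
        for token in text.lower().split():
            weights[token] = score(token)
    return weights


def output_layer(weights: Dict[str, float]) -> List[WeightedTerm]:
    """Data output layer: emit weighted entries, highest weight first."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [WeightedTerm(term, weight) for term, weight in ranked]


docs = ["malware  beacons to\n evil-c2.net"]
# Token length is used as a stand-in score purely to make the sketch runnable.
corpus = output_layer(business_layer(input_layer(docs), score=lambda t: float(len(t))))
print([entry.term for entry in corpus])
```

Keeping the layers as separate functions mirrors the description: formatting concerns stay in the input layer, and consumers of the corpus only see weighted entries.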
Description of the drawings
Fig. 1 is the overall framework diagram of the extraction model of the present invention.
Fig. 2 is the flow chart of the malicious domain name corpus extraction algorithm of the present invention.
Fig. 3 is the flow chart of the malicious domain name corpus dimensionality reduction method of the present invention.
Fig. 4 is the flow chart of the malicious domain name weight calculation of the present invention.
Detailed description:
The method for generating a malicious domain name corpus based on context semantics comprises five main steps: data collection and formatting, malicious domain name extraction, corpus extraction, corpus dimensionality reduction, and corpus weight calculation.
Fig. 1 shows the main framework of the model and describes in detail the design and deployment architecture of the malicious domain name corpus extraction model. External data is collected and formatted, then passed through malicious domain name extraction, corpus extraction, corpus dimensionality reduction, and corpus weight calculation. The resulting malicious domain name corpus entries are aggregated and stored for use by other machine learning classification models.
Fig. 2 shows the malicious domain name corpus extraction algorithm contained in the model and describes in detail how the corpus is processed and generated. External data, namely a set of documents containing malicious domain name analysis, is collected and formatted according to its category, and only the sentences containing domain names are extracted. All domain names in these sentences are extracted and labeled for security using an online domain name detection platform, and all malicious domain names are selected. The sentences containing malicious domain names are then selected from all sentences and subjected to tokenization, stop word removal, tense restoration, and similar operations; the processed words are combined into a bag of words, yielding the context word set. In parallel, a 2-gram generation algorithm extracts phrases from the same sentences to produce the 2-gram phrases of the malicious corpus. Finally, the 2-gram phrases and context words obtained in this way are deduplicated.
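The extraction flow of Fig. 2 can be sketched roughly as below. Several details are assumptions made for illustration rather than taken from the source: the stop word list is a tiny sample, the malicious domain itself is masked out before tokenization, and the 2-grams are built from the stop-word-filtered tokens. The domain labels are passed in directly instead of coming from an online detection platform.

```python
import re

# Tiny illustrative stop word list (an assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "to", "is", "of", "and", "for", "from"}


def tokenize(sentence):
    """Lower-case the sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def extract_corpus(sentences, malicious_domains):
    """Return (context_words, two_gram_phrases), both deduplicated in first-seen order."""
    words, phrases = [], []
    for sentence in sentences:
        hits = [d for d in malicious_domains if d in sentence]
        if not hits:
            continue  # only sentences mentioning a malicious domain are analyzed
        for domain in hits:
            sentence = sentence.replace(domain, " ")  # mask the domain (assumption)
        tokens = [t for t in tokenize(sentence) if t not in STOP_WORDS]
        words.extend(tokens)
        phrases.extend(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(words)), list(dict.fromkeys(phrases))


sentences = [
    "The trojan connects to evil-c2.net for commands.",
    "Patches are available from vendor.org today.",
]
context_words, two_grams = extract_corpus(sentences, ["evil-c2.net"])
print(context_words)
print(two_grams)
```

The second sentence contributes nothing because it mentions no malicious domain, which is the restriction that distinguishes this flow from a plain bag-of-words pass over the whole document.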
Fig. 3 shows the malicious domain name corpus dimensionality reduction method contained in the model and describes in detail how effective information is extracted from the raw malicious domain name corpus and word forms are normalized to reduce corpus dimensionality. Based on an analysis of existing dimensionality reduction methods in text classification, and taking into account the actual content of the malicious domain name corpus and the characteristics of English writing, the present invention proposes two methods for reducing the dimensionality of the corpus: a selection method based on word frequency (uni-gram frequency) and a feature-word principal component analysis method.
The word-frequency-based selection method rests on the observation that many English stop words and punctuation marks occur repeatedly in text, while most of them have little effect on the meaning of a sentence and carry very little information entropy. They can therefore be deleted from the text directly. Stop words mainly serve to connect other words and carry no meaning of their own in a sentence. Analysis of the NLTK corpus shows that there are only 127 common English stop words, but some of them carry emotion or a subjective attitude and can affect the meaning of the whole sentence or its target, for example: no, not, too, very. Although these words are English stop words, the experiments did not remove these meaningful stop words; all other stop words were deleted after tokenization. Similarly, certain punctuation marks in a sentence (such as: !,) can affect the described domain name to some extent and are retained. Other characters (such as: ", $) contribute nothing to the description of the target domain name and also need to be deleted.
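A minimal sketch of this selective filtering follows. The stop word set here is a small hand-picked sample rather than the 127-word NLTK list mentioned above, and treating "!" and "," as the retained punctuation marks is an assumed reading of the example in the text.

```python
import re

# Illustrative subset of English stop words (assumption; not the full NLTK list).
STOP_WORDS = {"the", "a", "is", "it", "to", "and", "of", "this",
              "no", "not", "too", "very"}
# Stop words that carry sentiment or attitude and are therefore kept.
KEPT_STOP_WORDS = {"no", "not", "too", "very"}
# Punctuation retained because it can colour the description (assumed reading).
KEPT_PUNCTUATION = {"!", ","}


def filter_tokens(sentence):
    """Drop low-information stop words and symbols, keep meaningful ones."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", sentence)
    kept = []
    for token in tokens:
        lowered = token.lower()
        if lowered.isalpha():
            if lowered in STOP_WORDS and lowered not in KEPT_STOP_WORDS:
                continue  # ordinary stop word: delete
            kept.append(lowered)
        elif token in KEPT_PUNCTUATION:
            kept.append(token)
        # every other symbol (quotes, $, digits, ...) is dropped
    return kept


filtered = filter_tokens('This site is not safe, "avoid" it!')
print(filtered)
```

Keeping "not" here matters: without it, "not safe" would collapse to "safe" and invert the meaning attached to the domain.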
The feature-word principal component analysis method mainly concerns merging the different forms of a word, that is, morphological normalization, to reduce the dimensionality of the whole corpus. It consists of lemmatization and stemming: lemmatization reduces a word in any form to its general form, while stemming extracts its stem or root form. Lemmatization mainly restores the verb forms that vary across contexts and sentences, such as the third-person singular, the simple present tense, and the past tense. Current approaches include rule-based methods, dictionary-based methods, machine-learning-based methods, and hybrid methods, of which dictionary-based lemmatization is the most widely used. To implement lemmatization and stemming, the malicious domain name corpus data is processed with a dictionary-based method, whose main idea is to use dictionary lookups to find the base form of each word form and thus restore the word's root form. In its implementation, the present invention mainly uses the dictionaries in the NLTK and WordNet projects to restore corpus entries: word forms are recognized and mapped to their base forms through the existing dictionaries, reducing the dimensionality of the final malicious corpus.
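The dictionary-lookup idea can be illustrated with a toy mapping. The dictionary below is hypothetical and hand-written so that the sketch runs without external data; it stands in for the NLTK and WordNet dictionaries named in the text.

```python
# Hypothetical toy dictionary mapping inflected forms to base forms.
LEMMA_DICT = {
    "connects": "connect",
    "connected": "connect",
    "connecting": "connect",
    "hosts": "host",
    "hosted": "host",
    "domains": "domain",
}


def lemmatize(token, dictionary=LEMMA_DICT):
    """Look the token up in the dictionary; unknown tokens are returned lower-cased."""
    lowered = token.lower()
    return dictionary.get(lowered, lowered)


def reduce_dimension(tokens):
    """Map every token to its base form and drop duplicates, keeping order."""
    return list(dict.fromkeys(lemmatize(t) for t in tokens))


tokens = ["connects", "connected", "Connecting", "hosts", "hosted", "malware"]
reduced = reduce_dimension(tokens)
print(len(tokens), "->", len(reduced), reduced)
```

Six surface forms collapse to three entries, which is the dimensionality reduction effect described above. With NLTK installed and its WordNet data downloaded, `nltk.stem.WordNetLemmatizer().lemmatize(word, pos="v")` could play the role of `lemmatize` for verbs.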
Fig. 4 shows the malicious domain name weight calculation flow contained in the model and describes in detail how the importance weight of each entry's relevance to malicious domain names is calculated. To describe the importance of each entry in the malicious domain name corpus more accurately, the weight of each entry in the corpus must be calculated after dimensionality reduction. Weight calculation makes it possible to select the entries that are most useful to the classifier. After the preceding dimensionality reduction the dimensionality of the corpus has been reduced, but the weight of each entry still needs to be calculated in detail using the TF-IDF algorithm.
The detailed steps of the corpus weight calculation are as follows:
Step 1: Calculate the TF (term frequency) value of each entry (w) in the malicious domain name corpus (U), using counts taken before corpus deduplication.
Step 2: Calculate the IDF (inverse document frequency) value of each entry (w) through a Microsoft online API (Application Programming Interface) query interface.
Step 3: Calculate the weight of each entry (w) with the TF-IDF formula.
Step 4: Sort all entries by weight and return the result.
This processing yields the weight of each entry, whose value represents its importance in describing domain name security.
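The four steps can be sketched as below. The source does not spell out its exact TF-IDF formula, so this sketch assumes the common form tf × log(N / (1 + df)). The online API query is replaced by a hypothetical local table of document frequencies, and the figures in that table are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(terms, doc_freq, total_docs):
    """Rank corpus entries by TF-IDF weight, highest first.

    terms      -- corpus entries *before* deduplication (Step 1 input)
    doc_freq   -- hypothetical lookup: entry -> number of documents containing it
    total_docs -- hypothetical size of the reference document collection
    """
    counts = Counter(terms)
    total = len(terms)
    weights = {}
    for term, count in counts.items():
        tf = count / total                                          # Step 1
        idf = math.log(total_docs / (1 + doc_freq.get(term, 0)))    # Step 2 (local stand-in)
        weights[term] = tf * idf                                    # Step 3
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))  # Step 4


terms = ["malicious", "server", "malicious", "download", "server", "malicious"]
doc_freq = {"malicious": 50, "server": 5000, "download": 9000}  # invented figures
ranked = tf_idf_weights(terms, doc_freq, total_docs=1_000_000)
for term, weight in ranked:
    print(f"{term}: {weight:.3f}")
```

An entry that is frequent near malicious domains but rare in general text ("malicious" in this toy data) ends up with the highest weight, while common words are pushed down by their low IDF.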
The working process of the present invention is as follows:
The context-semantics-based malicious corpus extraction algorithm collects external data, namely a set of documents containing malicious domain name analysis, formats the different categories of data, and extracts only the sentences containing domain names. All domain names in these sentences are extracted and labeled for security using an online domain name detection platform, and all malicious domain names are selected. The sentences containing malicious domain names are then selected from all sentences and processed in order by the corpus extraction algorithm, the corpus dimensionality reduction algorithm, and the corpus weight calculation algorithm to generate the malicious domain name corpus. The model addresses the feature extraction problem in machine learning classification models, and the corpus generated using machine learning theory and models can be used directly to extract malicious domain names from rich text.
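The sentence selection stage at the start of this process might look like the sketch below. The sentence splitter and domain pattern are deliberately naive, and the `is_malicious` callback is a hypothetical stand-in for the online domain name detection platform, which the source does not identify.

```python
import re

# Simplified domain pattern (assumption).
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_malicious_sentences(text, is_malicious):
    """Return (sentence, malicious_domains) pairs for sentences naming a malicious domain."""
    selected = []
    for sentence in split_sentences(text):
        domains = {m.group(0).lower() for m in DOMAIN_RE.finditer(sentence)}
        bad = sorted(d for d in domains if is_malicious(d))
        if bad:
            selected.append((sentence, bad))
    return selected


KNOWN_BAD = {"evil-c2.net"}  # hypothetical labels standing in for the detection platform
report = ("The implant beacons to evil-c2.net every hour. "
          "Vendor advisories are mirrored on vendor.org. "
          "No network activity was seen afterwards.")
result = select_malicious_sentences(report, lambda d: d in KNOWN_BAD)
print(result)
```

Passing the labeling function in as a parameter keeps the selection logic independent of whichever detection service is actually queried.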
The improvements of the context-semantics-based malicious corpus extraction algorithm are as follows:
1) Context semantics is introduced on top of the traditional BOW (bag-of-words) model: only the context semantics of the sentences in which malicious domain names appear are analyzed, generating corpus entries (context words and 2-gram phrases) that can describe malicious domain names and thereby improving the validity of the corpus.
The improvements addressing problems such as the sparsity of vector space model representations are as follows:
1) The corpus dimensionality reduction method combines word frequency and feature-word principal component analysis to reduce corpus dimensionality. The word-frequency-based selection method mainly deletes stop words and punctuation marks with low information entropy; the feature-word principal component analysis method mainly uses dictionary lookups to find the base form of each word form and restore the word's root form, achieving morphological normalization.
The improvements to the IDF calculation for each corpus entry are as follows:
1) The IDF value of each entry is obtained from a statistical analysis interface backed by massive data and represents the actual distribution of the phrase on the Internet, so the resulting TF-IDF value better represents how relevant the entry is to describing malicious domain names.

Claims (5)

1. A method for generating a malicious domain name corpus based on context semantics, characterized by comprising the following steps:
(1) Step 1: the context-semantics-based malicious corpus extraction algorithm collects external data, namely a set of documents containing malicious domain name analysis, formats the different categories of data, and extracts only the sentences containing domain names;
(2) Step 2: all domain names in the sentences are extracted and labeled for security using an online domain name detection platform, and all malicious domain names are selected;
(3) Step 3: the sentences containing malicious domain names are selected from all sentences and processed further;
(4) Step 4: a 2-gram generation algorithm extracts phrases from the sentences obtained in the previous step to generate the 2-gram phrases of the malicious corpus;
(5) Step 5: the sentences of step 3 are further subjected to tokenization, stop word removal, tense restoration, and similar operations, and the processed words are combined into a bag of words to obtain the context word set;
(6) Step 6: the 2-gram phrases obtained in step 4 and the context words obtained in step 5 are deduplicated;
(7) Step 7: English stop words and punctuation marks are screened using the selection method based on word frequency (uni-gram frequency);
(8) Step 8: the feature-word principal component analysis method applies lemmatization and stemming to the corpus, merging the different forms of each word, that is, morphological normalization, to reduce the dimensionality of the whole corpus;
(9) Step 9: the TF-IDF-based corpus weight calculation method calculates the weight of each entry in the corpus, whose value represents its importance in describing domain name security.
2. The context-semantics-based malicious corpus extraction algorithm according to claim 1, characterized in that: context semantics is introduced on top of the traditional BOW (bag-of-words) model, and only the context semantics of the sentences in which malicious domain names appear are analyzed, generating corpus entries (context words and 2-gram phrases) that can describe malicious domain names and thereby improving the validity of the corpus.
3. The selection method based on word frequency (uni-gram frequency) according to claim 1, characterized in that: many English stop words and punctuation marks occur repeatedly in text, while most of them have little effect on the meaning of a sentence and carry very little information entropy, so these stop words and punctuation marks are deleted from the text directly.
4. The feature-word principal component analysis method according to claim 1, characterized in that: the malicious domain name corpus data is processed with a dictionary-based method, in which dictionary lookups are used to find the base form of each word form and restore the word's root form, reducing the dimensionality of the final malicious corpus.
5. The TF-IDF-based corpus weight calculation method according to claim 1, characterized in that: the IDF value of each corpus entry is obtained from a statistical analysis interface backed by massive data, the resulting TF-IDF value represents how relevant the entry is to describing malicious domain names, and the number of features can be chosen according to the TF-IDF values during actual feature selection.
CN201810408635.3A 2018-04-28 2018-04-28 A method for generating a malicious domain name corpus based on context semantics Pending CN108710608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810408635.3A CN108710608A (en) 2018-04-28 2018-04-28 A method for generating a malicious domain name corpus based on context semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810408635.3A CN108710608A (en) 2018-04-28 2018-04-28 A method for generating a malicious domain name corpus based on context semantics

Publications (1)

Publication Number Publication Date
CN108710608A true CN108710608A (en) 2018-10-26

Family

ID=63867637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810408635.3A Pending CN108710608A (en) 2018-04-28 2018-04-28 A method for generating a malicious domain name corpus based on context semantics

Country Status (1)

Country Link
CN (1) CN108710608A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378163A (en) * 2020-03-10 2021-09-10 四川大学 Android malicious software family classification method based on DEX file partition characteristics

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629328A (en) * 2012-03-12 2012-08-08 北京工业大学 Probabilistic latent semantic model object image recognition method with fusion of significant characteristic of color
CN107015963A (en) * 2017-03-22 2017-08-04 重庆邮电大学 Natural language semantic parsing system and method based on deep neural network
CN207977315U (en) * 2018-01-05 2018-10-16 广东迅扬科技股份有限公司 Micro LED multi-color display array structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄诚 et al.: "Research on a malicious domain name corpus extraction model based on context semantics", CNKI online publication: 2017-08-29, HTTP://KNS.CNKI.NET/KCMS/DETAIL/11.2127.TP.20170829.1420.004.HTML *

Similar Documents

Publication Publication Date Title
US8489689B1 (en) Apparatus and method for obfuscation detection within a spam filtering model
JP5744228B2 (en) Method and apparatus for blocking harmful information on the Internet
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
De Silva et al. User type classification of tweets with implications for event recognition
CN105183717A (en) OSN user emotion analysis method based on random forest and user relationship
CN111866004B (en) Security assessment method, apparatus, computer system, and medium
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
Jayan et al. A hybrid statistical approach for named entity recognition for malayalam language
CN104346382B (en) Use the text analysis system and method for language inquiry
CN108536868A (en) The data processing method of short text data and application on social networks
CN107688630A (en) A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme
Mestry et al. Automation in social networking comments with the help of robust fasttext and cnn
CN106250365A (en) The extracting method of item property Feature Words in consumer reviews based on text analyzing
CN109857869A (en) A kind of hot topic prediction technique based on Ap increment cluster and network primitive
Jain et al. Sentiment analysis: An empirical comparative study of various machine learning approaches
US9396177B1 (en) Systems and methods for document tracking using elastic graph-based hierarchical analysis
Alves et al. Leveraging BERT's Power to Classify TTP from Unstructured Text
CN107688594B (en) The identifying system and method for risk case based on social information
Ergin et al. Turkish anti-spam filtering using binary and probabilistic models
CN108710608A (en) A method for generating a malicious domain name corpus based on context semantics
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
Zhai et al. A girl has a name, and it's... adversarial authorship attribution for deobfuscation
CN111538893A (en) Method for extracting network security new words from unstructured data
Huang et al. An unsupervised method for short-text sentiment analysis based on analysis of massive data
JP4326713B2 (en) News topic analysis device

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 2018-10-26)