CN108710608A - Method for generating a malicious domain name corpus based on contextual semantics - Google Patents
Method for generating a malicious domain name corpus based on contextual semantics
- Publication number
- CN108710608A (application CN201810408635.3A)
- Authority
- CN
- China
- Prior art keywords
- domain name
- corpus
- malicious
- context
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a method for generating a malicious domain name corpus based on contextual semantics. The method performs semantic analysis on the context words and phrases of the sentences in which malicious domain names appear, and uses natural language processing techniques to automatically generate a corpus that describes malicious domain names.
Description
Technical field
The present invention provides a method for generating a malicious domain name corpus based on contextual semantics. The method performs semantic analysis on the context words and phrases of the sentences in which malicious domain names appear, and uses natural language processing techniques to automatically generate a corpus that describes malicious domain names.
Background art
In recent years, as the Internet has gradually merged with the core business of enterprises, more and more organizations and companies have suffered hacker attacks, and APT (Advanced Persistent Threat) attacks keep emerging. To keep pace with fast-changing cybercrime techniques, security vendors and related institutions continually discover and trace major security attacks, and disclose attack technique details, malicious domain names, and other information through various channels (blogs, forums, microblogs, professional reports, and so on). These published attack analysis reports are generally written in English and mainly describe and analyze the target of the attack and the malicious domain names, IP addresses, and malicious tools used by the attacker. Because the malicious domain names or IP addresses in these reports may be reused by hackers in other attacks, security vendors often collect them and add them to the blacklists of firewalls or antivirus software in order to detect and block such potential attacks.
At present, techniques for extracting malicious domain names from text rely mainly on regular expressions and whitelists. Such techniques have a high false alarm rate, because a domain name that is not on the whitelist is not necessarily malicious. Automatically extracting malicious domain names from massive volumes of technical text therefore plays an important role in network attack detection and defense.
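The limitation of the regular expression and whitelist approach can be illustrated with a minimal sketch. The pattern, the whitelist contents, and the function name below are assumptions made for illustration and are not taken from the patent.

```python
import re

# Hypothetical pattern: dot-separated labels ending in an alphabetic suffix.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)

# Hypothetical whitelist of known-benign domains.
WHITELIST = {"microsoft.com", "google.com"}


def extract_candidate_domains(text):
    """Return every domain-like string in text that is not whitelisted."""
    found = {m.group(0).lower() for m in DOMAIN_RE.finditer(text)}
    return sorted(found - WHITELIST)


report = ("The dropper contacts evil-c2.net and then checks google.com; "
          "see report.pdf for details.")
candidates = extract_candidate_domains(report)
```

In this sketch the file name `report.pdf` is returned next to `evil-c2.net`, because anything that looks like a domain and is missing from the whitelist is treated as a candidate. This is the kind of false alarm the passage describes.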
Building an automated extraction model for malicious domain names with a low false alarm rate presupposes a malicious domain name corpus generated from a large amount of text data. A malicious domain name corpus term is a word or phrase in the text that describes the safety of a domain name; such words and phrases describe the domain name's context at the level of text semantics. These terms can also be used to train supervised learning models that label or predict the safety of domain names in text data. A malicious domain name corpus extraction model thus offers a new approach to extracting malicious domain names from massive text, and the generated corpus data can be used for automatic domain name classification in various threat systems.
The main problems to be solved in generating a malicious domain name corpus are:
(1) How to overcome the excessive redundancy and low information content per phrase of the traditional BOW model, and the resulting similarity between the corpus terms extracted for normal domain names and those extracted for malicious domain names.
(2) How to use a ranking algorithm to analyze the importance of feature words when the feature word set is very large.
(3) How to address problems such as the sparsity of the vector space model representation and reduce the dimensionality of the final malicious corpus.
The present invention focuses on these three problems and implements a method for generating a malicious domain name corpus based on contextual semantics.
Summary of the invention
The invention combines several techniques, namely a malicious corpus extraction algorithm based on contextual semantics, a feature-word principal component analysis method, a selection method based on word frequency, and a corpus weight calculation method based on the TF-IDF algorithm, to address the feature extraction problem in machine learning classification models.
The invention aims to achieve the following:
(1) Analyze only the contextual semantics of the sentences in which malicious domain names appear, and generate corpus terms (context words and 2-gram phrases) that describe malicious domain names, thereby improving the validity of the corpus.
(2) Many English stop words and punctuation marks occur repeatedly in text, and most of them contribute very little to the meaning of a sentence. These stop words and punctuation marks are therefore deleted from the text to increase the effective information entropy of the corpus.
(3) Process the malicious domain name corpus data with a dictionary-based method: look up the base form of each word form through a dictionary mapping and restore the word to its root form, reducing the dimensionality of the final malicious corpus.
(4) Obtain the IDF value of each corpus term from a statistical analysis interface backed by massive data. The resulting TF-IDF value represents how relevant the term is to describing malicious domain names, and provides a quantitative measure of its importance in describing domain name safety.
(5) Overall, enable the generated corpus to be used directly, with machine learning theory and models, to extract malicious domain names from rich text, providing a new direction for malicious domain name extraction from massive text.
To achieve the above objectives, the invention adopts the following technical solution: the contextual-semantics-based malicious domain name corpus generation model consists of three parts, a data input layer, a business logic layer, and a data output layer.
The data input layer acquires external data and formats the different categories of data. The main external data sources are published APT attack analysis articles, documents, and blog posts. Because these come in different data formats, a data format processing component is needed to handle them.
The business logic layer is the core technical layer of the malicious domain name corpus extraction model. It implements all functions in the process that runs from formatted text data to the final malicious domain name corpus, including malicious domain name extraction, the corpus extraction algorithm, corpus dimensionality reduction, and weight calculation.
The data output layer provides weighted malicious domain name corpus data, from which a corpus can be built for use by other machine learning classification models.
Brief description of the drawings
Fig. 1 is the overall framework diagram of the extraction model of the present invention.
Fig. 2 is the flow chart of the malicious domain name corpus extraction algorithm of the present invention.
Fig. 3 is the flow chart of the malicious domain name corpus dimensionality reduction method of the present invention.
Fig. 4 is the flow chart of the malicious domain name weight calculation of the present invention.
Detailed description of the embodiments
The method for generating a malicious domain name corpus based on contextual semantics comprises five main steps: data acquisition and formatting, malicious domain name extraction, corpus extraction, corpus dimensionality reduction, and corpus weight calculation.
Fig. 1 shows the main framework of the model and describes in detail the design and deployment architecture of the malicious domain name corpus extraction model. External data is acquired and formatted, then passed through the steps of malicious domain name extraction, corpus extraction, corpus dimensionality reduction, and corpus weight calculation. The resulting malicious domain name corpus terms are aggregated and stored for use by other machine learning classification models.
Fig. 2 shows the malicious domain name corpus extraction algorithm flow of the model and describes in detail how the malicious domain name corpus is processed and generated. External data is acquired from a document set containing malicious domain name analysis, the different categories of data are formatted, and only the sentences that contain domain names are extracted. All domain names in these sentences are extracted, an online domain name detection platform labels the safety of each domain name, and all malicious domain names are selected. The sentences that contain malicious domain names are then selected from all sentences and undergo word segmentation, stop word removal, tense restoration, and similar operations; the processed words are combined into a bag of words, yielding the context word set. In parallel, a 2-gram generation algorithm extracts phrases from the same sentences to produce the 2-gram phrases of the malicious corpus. Finally, the 2-gram phrases and context words obtained above are deduplicated.
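The extraction flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the set of malicious domains stands in for the labels returned by the online detection platform, the sentence splitter and tokenizer are naive regular expressions, domain names are dropped from the token stream, and all function names are hypothetical.

```python
import re

# Hypothetical domain pattern used only for this illustration.
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Assumption: the domain names themselves are not context words.
    without_domains = DOMAIN_RE.sub(" ", sentence)
    return re.findall(r"[a-z]+", without_domains.lower())


def bigrams(tokens):
    # 2-gram phrases are pairs of adjacent tokens.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def extract_corpus(text, malicious_domains):
    """Return deduplicated context words and 2-gram phrases."""
    words, phrases = set(), set()
    for sentence in split_sentences(text):
        domains = {d.lower() for d in DOMAIN_RE.findall(sentence)}
        if not domains & malicious_domains:
            continue  # keep only sentences that mention a malicious domain
        tokens = tokenize(sentence)
        words.update(tokens)            # sets deduplicate automatically
        phrases.update(bigrams(tokens))
    return words, phrases


text = ("The backdoor beacons to evil-c2.net every hour. "
        "Updates are fetched from example.org daily.")
words, phrases = extract_corpus(text, {"evil-c2.net"})
```

Only the first sentence contributes terms here, because the second sentence mentions a domain that is not labeled malicious. Stop word removal and tense restoration are left out of this sketch.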
Fig. 3 shows the malicious domain name corpus dimensionality reduction flow of the model and describes in detail how effective information is extracted from the raw malicious domain name corpus and word forms are normalized to reduce the corpus dimensionality. Based on an analysis of existing dimensionality reduction methods in text classification, and taking into account the actual content of malicious domain name corpus terms and the characteristics of English writing, the invention proposes two methods for reducing the dimensionality of the malicious domain name corpus: a selection method based on word frequency and a feature-word principal component analysis method.
The word-frequency-based selection method rests on the observation that many English stop words and punctuation marks occur repeatedly in text while contributing very little to the meaning of a sentence; their information entropy is very small, so they can be deleted directly from the text. Stop words are words that mainly connect other words but carry no meaning of their own in a sentence. An analysis of the NLTK corpus shows that there are only 127 common English stop words, but some of them carry a degree of emotion or subjective attitude and can affect the meaning of the whole sentence or its target, for example: no, not, too, very. Although these words are English stop words, the experiments did not remove these meaningful stop words, while the other stop words were deleted after word segmentation. Similarly, certain punctuation marks in a sentence (such as: !,) can to some extent affect the domain name being described, so these characters are retained. Other characters (such as: ", $) contribute nothing to the description of the target domain name and are deleted.
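A minimal sketch of this selective filtering follows. The stop word set is a small illustrative subset rather than the NLTK list the text refers to, and the constant and function names are hypothetical.

```python
# Illustrative subset of English stop words (assumption, not the NLTK list).
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in",
              "this", "no", "not", "too", "very"}

# Stop words the text says are kept because they carry attitude or emotion.
KEEP_WORDS = {"no", "not", "too", "very"}

# Characters the text says do not help describe the target domain name.
DROP_PUNCT = {'"', "$"}


def filter_tokens(tokens):
    """Drop low-information stop words and characters, keep the rest."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in STOP_WORDS and low not in KEEP_WORDS:
            continue  # ordinary stop word: removed
        if low in DROP_PUNCT:
            continue  # uninformative character: removed
        kept.append(low)
    return kept


filtered = filter_tokens(["This", "is", "not", "a", "safe", "domain", "!", '"'])
```

In this example "not" and "!" survive the filter while "this", "is", "a", and the quotation mark are removed, which preserves the negation that changes the meaning of the sentence.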
The feature-word principal component analysis method mainly merges the different forms of a word, that is, morphological normalization, to reduce the dimensionality of the whole corpus. It consists of lemmatization and stemming: lemmatization reduces a word in any form to its general form, and stemming extracts the stem or root form of the word. Lemmatization mainly restores the different verb tenses and forms found in different contexts and sentences, such as third-person singular, simple present, and past tense. Current approaches include rule-based methods, dictionary-based methods, machine-learning-based methods, and hybrid methods, of which dictionary-based lemmatization is the most mainstream. To implement lemmatization and stemming, the invention processes the malicious domain name corpus data with a dictionary-based method. The main idea is to look up the base form of each word form through a dictionary mapping and thereby restore the root form of the word. In its implementation, the invention mainly uses the dictionaries of the NLTK and WordNet projects to restore corpus terms: the existing dictionaries identify each word form and map it to its base form, which reduces the dimensionality of the final malicious corpus.
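The dictionary lookup idea can be sketched with a tiny hand-written mapping. The mapping below is a hypothetical stand-in for the NLTK and WordNet dictionaries named in the text; with NLTK installed and its WordNet data downloaded, `nltk.stem.WordNetLemmatizer().lemmatize(word, pos="v")` would be one way to perform a comparable lookup.

```python
# Hypothetical word-form to base-form dictionary (stand-in for WordNet).
LEMMA_MAP = {
    "connects": "connect",
    "connected": "connect",
    "connecting": "connect",
    "domains": "domain",
    "was": "be",
    "is": "be",
}


def lemmatize(token):
    """Map a word form to its base form; unknown words are left unchanged."""
    low = token.lower()
    return LEMMA_MAP.get(low, low)


def reduce_dimension(tokens):
    """Merge different forms of the same word into one corpus term."""
    return sorted({lemmatize(t) for t in tokens})


terms = reduce_dimension(
    ["connects", "connected", "connecting", "domain", "domains"]
)
```

Five surface forms collapse into two terms here, which is the dimensionality reduction effect the passage describes.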
Fig. 4 shows the malicious domain name weight calculation flow of the model and describes in detail how the importance weight of each term's relevance to describing malicious domain names is calculated. To describe more accurately the importance of each term in the malicious domain name corpus, the weight of each term in the corpus is calculated after dimensionality reduction. Weight calculation makes it possible to select the terms that are most useful to a classifier. After the preceding dimensionality reduction the corpus has fewer dimensions, but the weight of each term still needs to be calculated in detail with the TF-IDF algorithm.
The detailed steps of the corpus weight calculation are as follows:
Step 1: Calculate the TF (term frequency) value of each corpus term (w) in the malicious domain name corpus (U), using counts taken before deduplication.
Step 2: Calculate the IDF (inverse document frequency) value of each corpus term (w) through a Microsoft online API (Application Programming Interface) query interface.
Step 3: Calculate the weight of each corpus term (w) with the TF-IDF formula.
Step 4: Sort all corpus terms by weight and return the result.
This processing yields the weight of each corpus term, whose value represents the importance of the term in describing domain name safety.
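The four steps can be sketched as follows. The patent does not spell out its TF-IDF formula, so this sketch assumes the common form tf(w) × log(N / df(w)). The `doc_freq` table and `total_docs` value are hypothetical stand-ins for the statistics that the text obtains from an online query interface.

```python
import math
from collections import Counter


def tfidf_weights(terms, doc_freq, total_docs):
    """Rank corpus terms by TF-IDF weight.

    terms:      corpus terms BEFORE deduplication (step 1 needs raw counts)
    doc_freq:   hypothetical map of term -> number of documents containing it
    total_docs: hypothetical size of the reference document collection
    """
    counts = Counter(terms)
    total = len(terms)
    weights = {}
    for term, count in counts.items():
        tf = count / total                  # step 1: term frequency
        df = doc_freq.get(term, 1)          # assumption: unseen terms get df = 1
        idf = math.log(total_docs / df)     # step 2: inverse document frequency
        weights[term] = tf * idf            # step 3: TF-IDF weight
    # step 4: sort by weight, highest first
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


ranked = tfidf_weights(
    ["malicious", "malicious", "server", "the", "the", "the"],
    doc_freq={"malicious": 10, "server": 100, "the": 1000},
    total_docs=1000,
)
```

In this toy input "the" is the most frequent term but receives a weight of zero because it appears in every reference document, while "malicious" ranks first. This is the behavior that makes the weight useful for feature selection.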
The working process of the present invention is as follows:
The contextual-semantics-based malicious corpus extraction algorithm acquires external data from a document set containing malicious domain name analysis, formats the different categories of data, and extracts only the sentences that contain domain names. All domain names in these sentences are extracted, an online domain name detection platform labels the safety of each domain name, and all malicious domain names are selected. The sentences that contain malicious domain names are selected from all sentences and processed in order by three procedures: the corpus extraction algorithm, the corpus dimensionality reduction algorithm, and the corpus weight calculation algorithm, producing the malicious domain name corpus. The model addresses the feature extraction problem in machine learning classification models, and the generated corpus can be used directly, with machine learning theory and models, to extract malicious domain names from rich text.
The contextual-semantics-based malicious corpus extraction algorithm improves on prior methods as follows:
1) Contextual semantics is introduced on the basis of the traditional BOW (bag-of-words) model. Only the contextual semantics of the sentences in which malicious domain names appear is analyzed, and corpus terms (context words and 2-gram phrases) that describe malicious domain names are generated, thereby improving the validity of the corpus.
The improvement addressing problems such as the sparsity of the vector space model representation is as follows:
1) The corpus dimensionality reduction method combines word frequency with the feature-word principal component analysis method to reduce the dimensionality of the corpus. The word-frequency-based selection method mainly deletes stop words and punctuation marks with low information entropy. The feature-word principal component analysis method mainly looks up the base form of each word form through a dictionary mapping and restores the root form of the word, achieving morphological normalization.
The improvement to the IDF calculation for each corpus term is as follows:
1) The IDF value of each corpus term is obtained from a statistical analysis interface backed by massive data and reflects the actual distribution of the phrase on the Internet, so the resulting TF-IDF value better represents how relevant the term is to describing malicious domain names.
Claims (5)
1. A method for generating a malicious domain name corpus based on contextual semantics, characterized by comprising the following steps:
(1) Step 1: using the contextual-semantics-based malicious corpus extraction algorithm, acquire external data from a document set containing malicious domain name analysis, format the different categories of data, and extract only the sentences that contain domain names;
(2) Step 2: extract all domain names in the sentences, label the safety of each domain name using an online domain name detection platform, and select all malicious domain names;
(3) Step 3: select the sentences that contain malicious domain names from all sentences, and process these sentences further;
(4) Step 4: extract phrases from the sentences obtained in the previous step with a 2-gram generation algorithm, to generate the 2-gram phrases of the malicious corpus;
(5) Step 5: apply word segmentation, stop word removal, tense restoration, and similar operations to the sentences of step 3, then combine the processed words into a bag of words to obtain the context word set;
(6) Step 6: deduplicate the 2-gram phrases obtained in step 4 and the context words obtained in step 5;
(7) Step 7: filter English stop words and punctuation marks using the selection method based on word frequency;
(8) Step 8: using the feature-word principal component analysis method, apply lemmatization and stemming to the corpus terms, merging the different forms of each word (morphological normalization) and reducing the dimensionality of the whole corpus;
(9) Step 9: using the corpus weight calculation method based on the TF-IDF algorithm, calculate the weight of each corpus term in the corpus, whose value represents the importance of the term in describing domain name safety.
2. The contextual-semantics-based malicious corpus extraction algorithm according to claim 1, characterized in that: contextual semantics is introduced on the basis of the traditional BOW (bag-of-words) model, only the contextual semantics of the sentences in which malicious domain names appear is analyzed, and corpus terms (context words and 2-gram phrases) that describe malicious domain names are generated, thereby improving the validity of the corpus.
3. The selection method based on word frequency according to claim 1, characterized in that: many English stop words and punctuation marks occur repeatedly in text while contributing very little to the meaning of a sentence, and their information entropy is very small, so these stop words and punctuation marks are deleted directly from the text.
4. The feature-word principal component analysis method according to claim 1, characterized in that: the malicious domain name corpus data is processed with a dictionary-based method, the base form of each word form is looked up through a dictionary mapping, and the root form of the word is restored, reducing the dimensionality of the final malicious corpus.
5. The corpus weight calculation method based on the TF-IDF algorithm according to claim 1, characterized in that: the IDF value of each corpus term is obtained from a statistical analysis interface backed by massive data, the resulting TF-IDF value represents how relevant the term is to describing malicious domain names, and the number of features can be chosen according to the TF-IDF values during actual feature selection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810408635.3A CN108710608A (en) | 2018-04-28 | 2018-04-28 | Method for generating a malicious domain name corpus based on contextual semantics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810408635.3A CN108710608A (en) | 2018-04-28 | 2018-04-28 | Method for generating a malicious domain name corpus based on contextual semantics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108710608A (en) | 2018-10-26 |
Family
ID=63867637
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810408635.3A Pending CN108710608A (en) | 2018-04-28 | 2018-04-28 | Method for generating a malicious domain name corpus based on contextual semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108710608A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113378163A (en) * | 2020-03-10 | 2021-09-10 | 四川大学 | Android malicious software family classification method based on DEX file partition characteristics |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102629328A (en) * | 2012-03-12 | 2012-08-08 | 北京工业大学 | Probabilistic latent semantic model object image recognition method with fusion of significant characteristic of color |
CN107015963A (en) * | 2017-03-22 | 2017-08-04 | 重庆邮电大学 | Natural language semantic parsing system and method based on deep neural network |
CN207977315U (en) * | 2018-01-05 | 2018-10-16 | 广东迅扬科技股份有限公司 | A kind of Micro LED multi-color display array structures |
Non-Patent Citations (1)
Title |
---|
Huang Cheng et al.: "Research on a malicious domain name corpus extraction model based on contextual semantics", CNKI online publication: 2017-08-29, HTTP://KNS.CNKI.NET/KCMS/DETAIL/11.2127.TP.20170829.1420.004.HTML * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8489689B1 (en) | Apparatus and method for obfuscation detection within a spam filtering model | |
JP5744228B2 (en) | Method and apparatus for blocking harmful information on the Internet | |
WO2019080863A1 (en) | Text sentiment classification method, storage medium and computer | |
De Silva et al. | User type classification of tweets with implications for event recognition | |
CN105183717A (en) | OSN user emotion analysis method based on random forest and user relationship | |
CN111866004B (en) | Security assessment method, apparatus, computer system, and medium | |
CN111488732B (en) | Method, system and related equipment for detecting deformed keywords | |
Jayan et al. | A hybrid statistical approach for named entity recognition for malayalam language | |
CN104346382B (en) | Use the text analysis system and method for language inquiry | |
CN108536868A (en) | The data processing method of short text data and application on social networks | |
CN107688630A (en) | A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme | |
Mestry et al. | Automation in social networking comments with the help of robust fasttext and cnn | |
CN106250365A (en) | The extracting method of item property Feature Words in consumer reviews based on text analyzing | |
CN109857869A (en) | A kind of hot topic prediction technique based on Ap increment cluster and network primitive | |
Jain et al. | Sentiment analysis: An empirical comparative study of various machine learning approaches | |
US9396177B1 (en) | Systems and methods for document tracking using elastic graph-based hierarchical analysis | |
Alves et al. | Leveraging BERT's Power to Classify TTP from Unstructured Text | |
CN107688594B (en) | The identifying system and method for risk case based on social information | |
Ergin et al. | Turkish anti-spam filtering using binary and probabilistic models | |
CN108710608A (en) | Method for generating a malicious domain name corpus based on contextual semantics | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
Zhai et al. | A girl has a name, and it's... adversarial authorship attribution for deobfuscation | |
CN111538893A (en) | Method for extracting network security new words from unstructured data | |
Huang et al. | An unsupervised method for short-text sentiment analysis based on analysis of massive data | |
JP4326713B2 (en) | News topic analysis device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181026 |