CN114928472B - Bad site gray list filtering method based on full circulation main domain name - Google Patents

Publication number
CN114928472B
CN114928472B CN202210416876.9A CN202210416876A CN114928472B CN 114928472 B CN114928472 B CN 114928472B CN 202210416876 A CN202210416876 A CN 202210416876A CN 114928472 B CN114928472 B CN 114928472B
Authority
CN
China
Prior art keywords
bad
domain name
domain names
site
model
Prior art date
Legal status
Active
Application number
CN202210416876.9A
Other languages
Chinese (zh)
Other versions
CN114928472A (en)
Inventor
张兆心
陈俊仁
柴婷婷
赵东
孟月阳
Current Assignee
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai
Priority to CN202210416876.9A
Publication of CN114928472A
Application granted
Publication of CN114928472B
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/101 Access control lists [ACL]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a method for filtering a bad-site gray list from the full set of circulating main domain names, comprising the following steps: step 1, constructing a character-similarity-based discrimination model for bad-site domain names to coarsely filter suspected bad-site domain names out of the full domain name set; step 2, identifying whether each domain name can be resolved and is used for Web services; step 3, performing coarse filtering based on IP similarity; step 4, classifying the domain names by geographical region using IP geolocation; step 5, analyzing the accuracy of the bad-site domain name gray list obtained by coarse filtering; and step 6, iteratively optimizing the coarse filtering of step 1 and step 3. By filtering on domain name character similarity and service IP similarity, the method sharply reduces the number of candidate domain names, greatly reduces the time spent acquiring and analyzing web page text and snapshots, and achieves efficient and accurate filtering of the full domain name set.

Description

Bad site gray list filtering method based on full circulation main domain name
Technical Field
The invention relates to the technical field of constructing domain name gray lists of bad sites, and in particular to a method for filtering a bad-site gray list from the full set of circulating main domain names.
Background
With the rapid development of computer networks, the Internet has become an integral part of human life. The Domain Name System (DNS) provides the mapping between IP addresses and domain names for applications and services on the network, and domain names make the Internet more convenient to access. Today, however, the network is flooded with bad sites. They not only harm people's minds but can also seriously endanger the security of property. Identifying, monitoring, and controlling bad sites is therefore important.
About 260 million main domain names are in circulation worldwide, with about 300,000 newly added and about 300,000 expiring every day. At present, the main methods for identifying bad sites rely on web page text and web page snapshots, but acquiring and analyzing these is very time-consuming. An efficient, systematic method for filtering the full set of circulating main domain names is therefore lacking, so a complete bad-site domain name gray list cannot be constructed effectively.
Disclosure of Invention
To address the long processing time and high cost of existing full-set domain name gray list filtering methods based on web page text and web page snapshots, the invention provides a method for filtering a bad-site gray list from the full set of circulating main domain names.
To this end, the technical solution of the invention is a method for filtering a bad-site gray list based on the full set of circulating main domain names, comprising the following steps:
step 1, extracting features from the character strings of known bad-site domain names, establishing a bad-site keyword phrase library, and constructing a character-similarity-based discrimination model for bad-site domain names, so as to coarsely filter suspected bad-site domain names out of the full domain name set;
step 2, constructing a fast IP and port scanning model, acquiring the service IP and port attribute information of the suspected bad-site domain names, and identifying whether each domain name can be resolved and is used for Web services;
step 3, establishing an IP mapping range model for bad-site domain names from the service IPs of known bad sites, and performing coarse filtering based on IP similarity;
step 4, classifying the domain names by geographical region using IP geolocation;
step 5, analyzing the accuracy of the coarsely filtered bad-site domain name gray list using existing bad-site identification techniques;
step 6, iteratively optimizing the coarse filtering of step 1 and step 3;
bad-site domain names are constructed in two main forms: in the first, the domain name contains English words or Chinese pinyin; in the second, the domain name is a random character sequence. In the character similarity model, a bad keyword phrase library is constructed for keyword matching against domain names of the first type, and an LSTM neural network model is trained to judge whether the character sequence of a domain name of the second type is randomly generated.
The bad keyword phrase library is constructed by combining a dictionary of 370,000 English words with 405 Chinese pinyin syllables into an English and Chinese pinyin dictionary, performing longest-word matching over a set of 390,000 bad domain names, and extracting the frequently occurring bad English and pinyin phrases to form the library used for subsequent keyword matching and filtering.
The LSTM neural network model is trained using 700,000 Alexa domain names and 780,000 random-character-sequence domain names as the training and test sets. The network has 3 layers: 1. a preprocessing layer, which pads the domain name character sequence to length 75, maps the characters to integer indexes, and finally converts the positive integer indexes into fixed-size dense vectors used as character embeddings; 2. a long short-term memory (LSTM) layer with 128 units and dropout set to 0.5 to avoid overfitting; 3. an output layer that produces a binary classification.
Suspected bad-site domain names are coarsely filtered as follows: keywords are first matched against the constructed bad keyword phrase library to judge whether the domain name contains a bad keyword phrase; if it does, the domain name is considered likely to be used for a bad site. If no sensitive keyword is present, the trained LSTM neural network model judges whether the domain name is a random character sequence; if it is, the domain name is likewise considered likely to be used for a bad site.
Coarse filtering based on IP similarity analyzes the IPs stored in step 2 against the existing IP mapping range model; if an IP falls within one of the model's mapping ranges, the IP is considered to be used for serving bad content.
Further, the coarse filtering of step 1 and step 3 is iteratively optimized as follows:
step S1, dynamically updating the bad keyword phrase library: new bad English and pinyin phrases that occur with high frequency are added to the library, and phrases that have not been matched for a long time are deleted from it;
step S2, dynamically updating the IP mapping range model: service IPs of newly appearing bad sites are merged into the model, and IP ranges that have not been hit for a long time are removed from it.
The advantage of the method is that, when filtering a bad-site gray list from the full set of circulating main domain names, filtering on domain name character similarity and service IP similarity reduces the set of 260 million domain names by 90%. This greatly reduces the time spent acquiring and analyzing web page text and snapshots, so the full domain name set can be filtered at high speed and with high precision.
Drawings
FIG. 1 is a schematic diagram of the construction of keyword phrase libraries, LSTM neural network models and IP mapping range models in accordance with the present invention;
FIG. 2 is a flow chart of the present invention for performing the filtering of bad site gray lists.
Detailed Description
The invention is further described below with reference to examples.
As shown in FIG. 1, the first stage of the invention constructs a character similarity model and an IP mapping range model in two steps, as follows:
Step (1): Analysis of bad-site domain names shows that they are constructed in two main forms. In the first, the domain name contains English words or Chinese pinyin (the language varies; domain names of bad Chinese-language websites more often contain pinyin), for example: ponsvideo.com, tiyubocai.cn, etc. In the second, the domain name is a random character sequence (possibly generated by an algorithm), for example: vdqw-96.com, 12034.cn. Therefore, in the character similarity model, a bad keyword phrase library is constructed for keyword matching against domain names of the first type, and an LSTM neural network is trained to judge whether the character sequence of a domain name of the second type is randomly generated.
(1) Constructing the bad keyword phrase library: a dictionary of 370,000 English words and 405 Chinese pinyin syllables (without tone marks) are combined into an English and Chinese pinyin dictionary. Longest-word matching is performed over the set of 390,000 bad domain names, and the frequently occurring bad English and pinyin phrases are extracted to form the bad keyword phrase library used for subsequent keyword matching and filtering.
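The longest-word matching and frequency extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a greedy forward longest-match, counts each phrase once per domain, and uses a hypothetical `min_count` threshold for "high occurrence frequency". The function names and the tiny dictionary are invented for the example.

```python
from collections import Counter


def longest_match_segments(label, dictionary, min_len=2):
    """Greedy forward longest-match: at each position take the longest dictionary word."""
    max_len = max(map(len, dictionary))
    segments, i = [], 0
    while i < len(label):
        # Try the longest candidate first, down to min_len characters.
        for j in range(min(len(label), i + max_len), i + min_len - 1, -1):
            if label[i:j] in dictionary:
                segments.append(label[i:j])
                i = j
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return segments


def build_phrase_library(bad_domains, dictionary, min_count=2):
    """Keep phrases that occur in at least min_count bad domains (assumed threshold)."""
    counts = Counter()
    for domain in bad_domains:
        label = domain.lower().split(".")[0]  # leftmost label only
        counts.update(set(longest_match_segments(label, dictionary)))
    return {phrase for phrase, n in counts.items() if n >= min_count}


# Toy English/pinyin dictionary and bad-domain set, for illustration only.
toy_dictionary = {"ti", "yu", "tiyu", "bo", "cai", "video"}
toy_bad_domains = ["tiyubocai.cn", "bocai88.com", "video1.net"]
library = build_phrase_library(toy_bad_domains, toy_dictionary)
```

With a real dictionary of several hundred thousand entries, a set lookup per candidate substring keeps the matching cost proportional to the label length times the longest word length.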
(2) Training the LSTM neural network model: the LSTM model is trained using 700,000 Alexa domain names and 780,000 random-character-sequence domain names (drawn from the random-character-sequence domain names in the 390,000 bad domain names and from DGA domain names) as the training and test sets. The neural network has 3 layers: 1. a preprocessing layer, which pads the domain name character sequence to length 75, maps the characters to integer indexes, and finally converts the positive integer indexes into fixed-size dense vectors used as character embeddings. 2. A long short-term memory (LSTM) layer: the number of units is set to 128, and dropout is set to 0.5 to avoid overfitting. 3. An output layer: binary classification output. The resulting accuracy is 94% on the training set and 96% on the test set.
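The first two operations of the preprocessing layer (padding to length 75 and mapping characters to integer indexes) can be sketched in plain Python. The alphabet, the use of index 0 for padding, left-side padding, and the handling of unknown characters are all assumptions made for this example; the text above specifies only the target length of 75 and positive integer indexes.

```python
import string

MAX_LEN = 75  # sequence length stated in the description
# Assumed alphabet: lowercase letters, digits, hyphen, and dot.
ALPHABET = string.ascii_lowercase + string.digits + "-."
# Index 0 is reserved for padding, so real characters get positive indexes.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN_INDEX = len(ALPHABET) + 1  # assumed bucket for any other character


def encode_domain(domain, max_len=MAX_LEN):
    """Map a domain name to a fixed-length list of integer indexes (left-padded with 0)."""
    indexes = [CHAR_TO_INDEX.get(ch, UNKNOWN_INDEX) for ch in domain.lower()[:max_len]]
    return [0] * (max_len - len(indexes)) + indexes


encoded = encode_domain("vdqw-96.com")
```

The resulting index sequences would then feed an embedding layer, an LSTM layer with 128 units and dropout 0.5, and a two-class output, as the description states.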
Step (2): DNS resolution is performed on the 390,000 known bad domain names to obtain all of their service IP addresses. Because a batch of consecutive IP addresses is usually requested together as a reserve when applying for IPs, the ranges covered by all bad IPs are mapped, and a bad-IP mapping range model is constructed for subsequent filtering.
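One simple way to realize such a range model is sketched below with Python's standard `ipaddress` module. The choice of a fixed /24 block around each known bad IP is an assumption for illustration; the description does not state how wide a mapping range is. The addresses are reserved documentation ranges, not real data.

```python
import ipaddress


def build_ip_range_model(bad_ips, prefix_len=24):
    """Map each known bad IP to its enclosing network block (assumed /24)."""
    return {ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) for ip in bad_ips}


def in_bad_range(ip, model, prefix_len=24):
    """Return True if the IP falls inside any block of the model."""
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) in model


# Illustration with documentation-only address ranges (RFC 5737).
model = build_ip_range_model(["203.0.113.7", "203.0.113.200", "198.51.100.1"])
suspect = in_bad_range("203.0.113.99", model)
```

Storing blocks in a set makes each membership check a single hash lookup, which matters when tens of millions of candidate IPs are tested.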
As shown in FIG. 2, the method for filtering a bad-site gray list based on the full set of circulating main domain names comprises the following steps:
Step 1: Features are extracted from the character strings of known bad-site domain names, a bad-site keyword phrase library is established, and a character-similarity-based discrimination model for bad-site domain names is constructed, so as to coarsely filter suspected bad-site domain names out of the full domain name set. The 260 million main domain names of the full set are taken as input and filtered by domain name character similarity in two parts. First, keywords are matched against the constructed bad keyword phrase library to judge whether a domain name contains a bad keyword phrase. If it does, the domain name is considered likely to be used for a bad site. If no sensitive keyword is present, the trained LSTM model judges the randomness of the character sequence, that is, whether the domain name is a random character sequence. If it is, the domain name is likewise considered likely to be used for a bad site. Together, these two parts coarsely filter the suspected bad-site domain names.
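The two-part decision in Step 1 can be sketched as a short function. The randomness classifier is passed in as a callable; `toy_looks_random` below is a hypothetical vowel-ratio stand-in used only so the example runs, and is not the trained LSTM model. Substring matching against the phrase library is likewise an assumed matching rule.

```python
def contains_bad_phrase(domain, phrase_library):
    """Part 1: keyword matching against the bad keyword phrase library."""
    label = domain.lower().rsplit(".", 1)[0]  # drop the top-level domain
    return any(phrase in label for phrase in phrase_library)


def toy_looks_random(domain):
    """Hypothetical stand-in for the LSTM: few vowels (or no letters) counts as random."""
    label = domain.lower().rsplit(".", 1)[0]
    letters = [c for c in label if c.isalpha()]
    vowels = sum(c in "aeiou" for c in letters)
    return not letters or vowels / len(letters) < 0.2


def coarse_filter(domains, phrase_library, looks_random):
    """Keep a domain if it matches a bad phrase or, failing that, looks random."""
    suspects = []
    for domain in domains:
        # The classifier runs only when no keyword matched (short-circuit `or`).
        if contains_bad_phrase(domain, phrase_library) or looks_random(domain):
            suspects.append(domain)
    return suspects


candidates = ["tiyubocai.cn", "vdqw-96.com", "12034.cn", "example.org"]
suspects = coarse_filter(candidates, {"bocai"}, toy_looks_random)
```

Running the cheap keyword check first means the more expensive model is evaluated only on domains the phrase library did not already flag.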
Step 2: A fast IP and port scanning model is constructed to acquire the service IP and port attribute information of the suspected bad-site domain names and to identify whether each domain name can be resolved and is used for Web services. The service IPs and port attributes are obtained for the domain name set produced by the previous step. DNS resolution retrieves the A records of each domain name, and all usable IP addresses are stored. Port scanning then checks whether ports such as 80, 443, and 8080 are open, thereby selecting the IPs used for Web services.
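A minimal sketch of the resolve-then-probe logic in Step 2, using only Python's standard `socket` module. It is sequential and therefore far slower than the "fast scanning model" the description refers to; it also relies on the system resolver via `socket.getaddrinfo` rather than querying A records directly. The function names and the one-second timeout are assumptions.

```python
import socket

WEB_PORTS = (80, 443, 8080)  # ports named in the description


def resolve_ipv4(domain):
    """Resolve a domain to its IPv4 addresses via the system resolver."""
    try:
        infos = socket.getaddrinfo(domain, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # the domain cannot be resolved
    return sorted({info[4][0] for info in infos})


def is_port_open(ip, port, timeout=1.0):
    """Attempt a TCP connection; connect_ex returns 0 on success."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((ip, port)) == 0


def web_service_ips(domain, ports=WEB_PORTS):
    """Return the resolved IPs that accept connections on at least one Web port."""
    return [ip for ip in resolve_ipv4(domain)
            if any(is_port_open(ip, port) for port in ports)]
```

At the scale described here, a production scanner would more plausibly use asynchronous or raw-packet probing; this sketch only shows the decision being made for each domain. Scan only hosts you are authorized to probe.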
Step 3: An IP mapping range model for bad-site domain names is established from the service IPs of known bad sites, and coarse filtering is performed based on IP similarity. The IPs stored in step 2 are analyzed for similarity against the existing IP mapping range model. If an IP falls within one of the model's mapping ranges, the IP is considered to be used for bad content. Bad sites are thus coarsely filtered by IP similarity.
Step 4: The domain names are classified by geographical region using IP geolocation. Service IP geolocation yields the physical location attribute of each IP, which is subdivided into domestic and foreign. The IPs retained by the preceding steps are stored together with their corresponding domain names.
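The domestic/foreign split in Step 4 can be illustrated with a prefix lookup. The prefix table below is entirely hypothetical and uses reserved documentation ranges; a real deployment would query an IP geolocation database, which the description does not name.

```python
import ipaddress

# Hypothetical table of "domestic" prefixes; placeholder documentation range only.
DOMESTIC_PREFIXES = [ipaddress.ip_network("203.0.113.0/24")]


def classify_region(ip, domestic_prefixes=DOMESTIC_PREFIXES):
    """Label an IP as 'domestic' if it falls in a listed prefix, else 'foreign'."""
    addr = ipaddress.ip_address(ip)
    return "domestic" if any(addr in net for net in domestic_prefixes) else "foreign"


def group_by_region(domain_to_ip):
    """Store each domain under the region of its service IP."""
    groups = {"domestic": [], "foreign": []}
    for domain, ip in domain_to_ip.items():
        groups[classify_region(ip)].append(domain)
    return groups


groups = group_by_region({"a.example": "203.0.113.5", "b.example": "198.51.100.9"})
```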
Step 5: The accuracy of the coarsely filtered bad-site domain name gray list is analyzed using existing bad-site identification techniques. The domain names retained by the preceding steps are judged precisely by an existing bad-site discrimination model. Because that model is based on web page content and snapshots, acquiring the text content and snapshots is time-consuming. However, the preceding filtering has already reduced the domain name set by 90%, and the remaining domain names are all highly suspected of being used for bad sites. This step can therefore effectively produce the bad-site domain name gray list and evaluate the filtering effect of the preceding steps.
Step 6: The gray list of bad domain names for the full set is stored, and the coarse filtering of steps 1 and 3 is iteratively optimized. The optimization proceeds as follows: step S1, dynamically updating the bad keyword phrase library: new bad English and pinyin phrases that occur with high frequency are added to the library, and phrases that have not been matched for a long time are deleted from it. Step S2, dynamically updating the IP mapping range model: service IPs of newly appearing bad sites are merged into the model, and IP ranges that have not been hit for a long time are removed from it.
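The add-and-expire update of step S1 can be sketched as below. The bookkeeping (a phrase mapped to the last filtering round in which it matched), the `min_count` threshold for "high frequency", and the `max_idle_rounds` limit for "not used for a long time" are all assumptions introduced for this example. The same pattern could be applied to IP ranges for step S2.

```python
def update_phrase_library(library, new_phrase_counts, round_no,
                          min_count=3, max_idle_rounds=5):
    """Return an updated {phrase: last_round_matched} mapping.

    library           -- current mapping of phrase to the last round it matched
    new_phrase_counts -- phrase counts observed among newly confirmed bad domains
    """
    updated = dict(library)
    for phrase, count in new_phrase_counts.items():
        is_frequent_new = count >= min_count
        is_known_and_seen = phrase in updated and count > 0
        if is_frequent_new or is_known_and_seen:
            updated[phrase] = round_no  # add the phrase or refresh its last-use round
    # Drop phrases that have gone unmatched for too many rounds.
    return {p: last for p, last in updated.items()
            if round_no - last <= max_idle_rounds}


library = {"bocai": 9, "old": 2}
counts = {"bocai": 1, "caipiao": 4, "rare": 1}
library = update_phrase_library(library, counts, round_no=10)
```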
When the method is used to filter a bad-site gray list from the full set of circulating main domain names, filtering on domain name character similarity and service IP similarity reduces the set of 260 million domain names by 90%. This greatly reduces the time spent acquiring and analyzing web page text and snapshots, so the full domain name set can be filtered at high speed and with high precision.
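As a rough worked figure, reading "reduced by 90%" to mean that 10% of the input remains for the web-page-based discrimination model:

```latex
N_{\text{remaining}} \approx 2.6 \times 10^{8} \times (1 - 0.9) = 2.6 \times 10^{7}
```

That is, on the order of 26 million domain names would still require web page text and snapshot analysis, instead of 260 million.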
The foregoing description is only illustrative of the invention and is not intended to limit its scope; equivalent substitutions, variations, and modifications made within the scope of the invention are intended to fall within the scope of the claims.

Claims (2)

1. A method for filtering a bad-site gray list based on the full set of circulating main domain names, characterized by comprising the following steps:
step 1, extracting features from the character strings of known bad-site domain names, establishing a bad-site keyword phrase library, and constructing a character-similarity-based discrimination model for bad-site domain names, so as to coarsely filter suspected bad-site domain names out of the full domain name set;
step 2, constructing a fast IP and port scanning model, acquiring the service IP and port attribute information of the suspected bad-site domain names, and identifying whether each domain name can be resolved and is used for Web services;
step 3, establishing an IP mapping range model for bad-site domain names from the service IPs of known bad sites, and performing coarse filtering based on IP similarity;
step 4, classifying the domain names by geographical region using IP geolocation;
step 5, analyzing the accuracy of the coarsely filtered bad-site domain name gray list using existing bad-site identification techniques;
step 6, iteratively optimizing the coarse filtering of step 1 and step 3;
wherein bad-site domain names are constructed in two main forms, the first being domain names that contain English words or Chinese pinyin and the second being domain names that are random character sequences; in the character similarity model, a bad keyword phrase library is constructed for keyword matching against domain names of the first type, and an LSTM neural network model is trained to judge whether the character sequence of a domain name of the second type is randomly generated;
the bad keyword phrase library is constructed by combining a dictionary of 370,000 English words with 405 Chinese pinyin syllables into an English and Chinese pinyin dictionary, performing longest-word matching over a set of 390,000 bad domain names, and extracting the frequently occurring bad English and pinyin phrases to form the library used for subsequent keyword matching and filtering;
the LSTM neural network model is trained using 700,000 Alexa domain names and 780,000 random-character-sequence domain names as the training and test sets, and the LSTM neural network has 3 layers: (1) a preprocessing layer, which pads the domain name character sequence to length 75, maps the characters to integer indexes, and finally converts the positive integer indexes into fixed-size dense vectors used as character embeddings; (2) a long short-term memory (LSTM) layer with 128 units and dropout set to 0.5 to avoid overfitting; (3) an output layer that produces a binary classification;
suspected bad-site domain names are coarsely filtered by first matching keywords against the constructed bad keyword phrase library to judge whether the domain name contains a bad keyword phrase; if it does, the domain name is considered likely to be used for a bad site; if no sensitive keyword is present, the trained LSTM neural network model judges whether the domain name is a random character sequence, and if it is, the domain name is considered likely to be used for a bad site;
coarse filtering based on IP similarity analyzes the IPs stored in step 2 against the existing IP mapping range model, and if an IP falls within one of the model's mapping ranges, the IP is considered to be used for serving bad content.
2. The method for filtering a bad-site gray list based on the full set of circulating main domain names according to claim 1, wherein the coarse filtering of step 1 and step 3 is iteratively optimized as follows:
step S1, dynamically updating the bad keyword phrase library: new bad English and pinyin phrases that occur with high frequency are added to the library, and phrases that have not been matched for a long time are deleted from it;
step S2, dynamically updating the IP mapping range model: service IPs of newly appearing bad sites are merged into the model, and IP ranges that have not been hit for a long time are removed from it.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210416876.9A CN114928472B (en) 2022-04-20 2022-04-20 Bad site gray list filtering method based on full circulation main domain name

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210416876.9A CN114928472B (en) 2022-04-20 2022-04-20 Bad site gray list filtering method based on full circulation main domain name

Publications (2)

Publication Number Publication Date
CN114928472A (en) 2022-08-19
CN114928472B (en) 2023-07-18

Family

ID=82807565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210416876.9A Active CN114928472B (en) 2022-04-20 2022-04-20 Bad site gray list filtering method based on full circulation main domain name

Country Status (1)

Country Link
CN (1) CN114928472B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105897714A (en) * 2016-04-11 2016-08-24 天津大学 Botnet detection method based on DNS (Domain Name System) flow characteristics
CN110191103A (en) * 2019-05-10 2019-08-30 长安通信科技有限责任公司 A kind of DGA domain name detection classification method
US10440042B1 (en) * 2016-05-18 2019-10-08 Area 1 Security, Inc. Domain feature classification and autonomous system vulnerability scanning
CN111866196A (en) * 2019-04-26 2020-10-30 深信服科技股份有限公司 Domain name traffic characteristic extraction method, device, equipment and readable storage medium
CN112948725A (en) * 2021-03-02 2021-06-11 北京六方云信息技术有限公司 Phishing website URL detection method and system based on machine learning
CN114095176A (en) * 2021-10-29 2022-02-25 北京天融信网络安全技术有限公司 Malicious domain name detection method and device
CN114266251A (en) * 2021-12-27 2022-04-01 北京天融信网络安全技术有限公司 Malicious domain name detection method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1168031C (en) * 2001-09-07 2004-09-22 联想(北京)有限公司 Content filter based on text content characteristic similarity and theme correlation degree comparison
US10257082B2 (en) * 2017-02-06 2019-04-09 Silver Peak Systems, Inc. Multi-level learning for classifying traffic flows
US20200349430A1 (en) * 2019-05-03 2020-11-05 Webroot Inc. System and method for predicting domain reputation
US11637863B2 (en) * 2020-04-03 2023-04-25 Paypal, Inc. Detection of user interface imitation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Identifying Gambling and Porn Websites with Image Recognition"; Longxi Li et al.; Springer International Publishing AG; full text *
"VegaStar: An Illegal Domain Detection System on Large-Scale Video Traffic"; Xiang Tian et al.; 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE); full text *
"Unknown bad domain name discovery technology based on artificial intelligence creation capability"; Du Gang et al.; Telecom Engineering Technics and Standardization; full text *
"Research on website content monitoring based on IP address segments"; Liu Lequn; Shi Junhua; Modern Electronics Technique (21); full text *

Also Published As

Publication number Publication date
CN114928472A (en) 2022-08-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Zhang Zhaoxin; Chen Junren; Chai Tingting; Zhao Dong; Meng Yueyang
Inventor before: Zhang Zhaoxin; Meng Yueyang; Chai Tingting; Zhao Dong; Chen Junren
GR01 Patent grant