WO2023054858A1 - Method and system for automatically classifying URL categories based on machine learning - Google Patents

Method and system for automatically classifying URL categories based on machine learning

Info

Publication number
WO2023054858A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
machine learning
analysis target
category
target url
Prior art date
Application number
PCT/KR2022/009723
Other languages
English (en)
Korean (ko)
Inventor
김영중
노주영
Original Assignee
(주)모니터랩
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by (주)모니터랩
Publication of WO2023054858A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • The present invention relates to a method and system for automatically classifying URL categories based on machine learning.
  • A secure web gateway is a security solution that blocks harmful sites that hinder productivity, effectively controlling the web-use environment itself and protecting corporate assets.
  • Building URL category classification data, one of the core data sets of a secure web gateway that blocks harmful websites, is therefore of primary importance.
  • Conventionally, classification was carried out manually by human staff. Because a large number of URLs must be classified in a short period, this approach requires considerable time and effort as well as the cost of deploying many people. In addition, because human classifiers may not classify consistently, accuracy varies widely.
  • The technical problem to be solved by the present invention is to provide a method and system for automatically classifying URL categories based on machine learning.
  • A method for automatically classifying URL categories based on machine learning includes the steps of: receiving an analysis target URL; determining whether the analysis target URL is subject to machine learning-based category classification; if the analysis target URL is a machine learning-based category classification target, acquiring web page data corresponding to the analysis target URL; inputting text data extracted from the acquired web page data into a machine learning model to classify the category corresponding to the analysis target URL; and storing category classification information for the analysis target URL in a database.
  • The step of determining whether the analysis target URL is subject to machine learning-based category classification may include a preprocessing step of separating a protocol, domain, and path from the analysis target URL and, if category classification information for a URL created by combining at least some of the separated protocol, domain, and path is included in the database, determining that the analysis target URL is not a machine learning-based category classification target.
  • Web page data corresponding to the analysis target URL may be obtained by accessing a website corresponding to the analysis target URL.
  • The method may further include the step of determining that the analysis target URL is not subject to machine learning-based category classification if the analysis target URL matches a URL pattern rule in a URL pattern rule list.
  • The URL pattern rule list may include a plurality of URL pattern rules classified into categories in advance.
  • A category corresponding to a URL pattern rule matched to the analysis target URL may be stored in the database as category classification information for the analysis target URL.
  • The machine learning model may be trained with learning data constructed from text data extracted from web page data obtained from a plurality of websites and category classification information pre-assigned to the plurality of websites.
  • The machine learning model may receive text data consisting only of nouns, obtained by removing formal morphemes from the text extracted from the web page data, calculate the similarity for each predefined category, and classify the analysis target URL into the category with the highest similarity.
  • The invention may include a computer-readable recording medium on which a program for executing the method on a computer is recorded.
  • The machine learning-based URL category automatic classification system includes: a URL input unit that receives an analysis target URL; a control unit that determines whether the analysis target URL is subject to machine learning-based category classification and, if it is, obtains web page data corresponding to the analysis target URL and extracts text data; an artificial intelligence unit that inputs the text data extracted from the acquired web page data into a machine learning model to classify the category corresponding to the analysis target URL; and a database that stores category classification information for the analysis target URL.
  • The control unit may perform preprocessing to separate the domain from the analysis target URL and, if category classification information for the separated domain is included in the database, determine that the analysis target URL is not subject to machine learning-based category classification.
  • Category classification for analysis target URLs can be performed accurately, efficiently, and automatically through a machine learning model.
  • Processing speed and efficiency can be improved by performing machine learning-based category classification only for URLs that require it, as selected by a URL pre-processing filter.
  • A large amount of URL category classification data can be created because a list of URLs requiring category classification can be input and processed in batch.
  • FIG. 1 is a block diagram of a machine learning-based URL category automatic classification system according to an embodiment of the present invention.
  • FIG. 2 is a diagram provided to explain URL pre-processing according to the present invention.
  • FIG. 3 shows an example of extracting text data through web crawling according to the present invention.
  • FIG. 4 shows an example of a result of morphological analysis of text data extracted through web crawling according to the present invention.
  • FIG. 5 is an operation flowchart of a machine learning-based URL category automatic classification system according to an embodiment of the present invention.
  • FIG. 6 is a flowchart illustrating the machine learning-based classification procedure of FIG. 5 in detail.
  • FIG. 1 is a block diagram of a machine learning-based URL category automatic classification system according to an embodiment of the present invention.
  • A system 100 may include a URL input unit 110, a control unit 120, an artificial intelligence unit 130, and a database 140.
  • The URL input unit 110 may receive an analysis target URL.
  • Analysis target URLs can be collected in three main ways.
  • First, a URL to be analyzed can be extracted from a user visit log of a product such as the secure web gateway 200.
  • Second, the threat information collector can collect URLs through crawling.
  • Third, a URL classification request list may be received from a customer. Such lists may be input into the URL input unit 110 in single or bulk form.
  • The secure web gateway 200 loads URL category classification information and logs URLs visited by clients (not shown). It can control access either by blocking access to URLs in URL categories set to be blocked by the administrator or by allowing access only to URLs in URL categories set to be allowed.
  • The secure web gateway 200 may separately store user visit URL logs and, in particular, extract URLs that do not exist in the URL category classification information, or URLs designated as reclassification targets by the administrator, as analysis target URLs and provide them to the URL input unit 110.
  • The control unit 120 may determine whether the analysis target URL is subject to machine learning-based category classification.
  • FIG. 2 is a diagram provided to explain URL pre-processing according to the present invention.
  • The control unit 120 may perform preprocessing of separating a protocol, domain, and path from an analysis target URL.
  • The control unit 120 may determine that the analysis target URL is not subject to machine learning-based category classification if category classification information for a URL generated by combining at least some of the separated protocol, domain, and path is included in the database 140.
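The preprocessing and database check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `candidate_keys` and `lookup_category`, the use of a plain dictionary in place of the database 140, and the choice to try the longest path prefix first are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def candidate_keys(url):
    """Split a URL into protocol, domain, and path, then build lookup keys
    from combinations of those parts (hypothetical key scheme)."""
    parts = urlsplit(url)
    scheme = parts.scheme
    domain = parts.netloc.lower()
    segments = [s for s in parts.path.split("/") if s]
    keys = []
    # Try the most specific combination first, then shorter path prefixes.
    for n in range(len(segments), -1, -1):
        path = "/".join(segments[:n])
        bare = domain + ("/" + path if path else "")
        keys.append(f"{scheme}://{bare}")  # protocol + domain + path prefix
        keys.append(bare)                  # domain + path prefix only
    return keys


def lookup_category(url, known):
    """Return the stored category for the first matching key, or None if the
    URL still needs machine learning-based classification."""
    for key in candidate_keys(url):
        if key in known:
            return known[key]
    return None
```

For instance, with `known = {"example.com/news": "news"}`, a lookup for `https://example.com/news/2022/article?id=1` finds the stored entry through the `example.com/news` key, so the URL would not be passed on to the model.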
  • If the analysis target URL matches a URL pattern rule included in a URL pattern rule list, the control unit 120 may also determine that the analysis target URL is not a machine learning-based category classification target.
  • For a URL whose domain and path are randomly generated, it is not possible to check whether the URL is subject to machine learning-based category classification by referring to the category classification information.
  • For such cases, a URL pattern rule list for classifying URLs to be analyzed according to a certain rule may be prepared in advance. For example, if '*.aaa.com/news' is included in the URL pattern rule list as a URL rule corresponding to the news category, a URL matching that rule is classified into the news category and judged not to be subject to machine learning-based category classification.
  • The two methods differ in that the first queries the database using only text combinations, whereas the second classifies URLs against a predefined whitelist of rules that include * (asterisk) wildcards.
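A wildcard rule such as '*.aaa.com/news' can be matched with shell-style globbing, as in the sketch below. The rule table, the function name `match_rule`, and the decision to treat a rule as also covering deeper paths under it are assumptions for illustration; the source does not say how rules are evaluated.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rule list: (pattern, category). Only the first rule comes
# from the text; the table layout itself is an assumption.
RULES = [("*.aaa.com/news", "news")]


def match_rule(url, rules=RULES):
    """Return the category of the first pattern rule the URL matches, else None."""
    parts = urlsplit(url)
    # Match against domain + path, ignoring protocol and query string.
    target = parts.netloc.lower() + parts.path
    for pattern, category in rules:
        # Assumption: a rule matches the exact path or anything beneath it.
        if fnmatchcase(target, pattern) or fnmatchcase(target, pattern + "/*"):
            return category
    return None
```

Because `*` in `fnmatch` patterns matches any run of characters, a randomly generated subdomain and path tail still match the rule, which a plain database lookup on text combinations could not achieve.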
  • The control unit 120 may obtain web page data corresponding to the analysis target URL. To this end, the control unit 120 can connect to the website corresponding to the analysis target URL through web crawling and extract all text data of the page, as illustrated in FIG. 3.
  • FIG. 3 shows an example of extracting text data through web crawling according to the present invention.
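As a rough illustration of the crawling step, the sketch below fetches a page and keeps only its visible text using the Python standard library. The names `extract_text` and `fetch_text`, the timeout value, and the decision to skip `script` and `style` content are assumptions; the patent does not specify a crawler or parser.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_text(url, timeout=10):
    """Download a page and extract its text (network access required)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return extract_text(resp.read().decode(charset, errors="replace"))
```

A production crawler would likely also need to handle redirects, robots policies, and pages rendered by JavaScript, none of which this sketch attempts.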
  • The artificial intelligence unit 130 may classify a category corresponding to an analysis target URL by inputting text data extracted from the obtained web page data to a machine learning model. To this end, the artificial intelligence unit 130 may train the machine learning model with learning data built from text data extracted from web page data obtained from a plurality of websites and category classification information pre-assigned to the plurality of websites.
  • The machine learning model may take the form of a learning algorithm such as a convolutional neural network (CNN), recurrent neural network (RNN), gated recurrent unit (GRU), long short-term memory (LSTM), or sequence-to-sequence (Seq2Seq) model.
  • The machine learning model may be pre-trained to receive text data consisting only of nouns, obtained by removing formal morphemes from the text extracted from web page data, calculate the similarity for each predefined category, and classify the analysis target URL into the category with the highest similarity.
  • FIG. 4 shows an example of a result of morphological analysis of text data extracted through web crawling according to the present invention.
  • Predefined categories (example): Ad/Pop-up, Alcohol/Tobacco, Business, Vehicle/Transportation, Computer/Technology, Education, Finance/Banking, Health/Medicine, Recruitment/Job search, News, Non-profit, Real estate, Religion, Restaurant/Restaurant, Search engine/Portal, Shopping, Sports, Travel, Leisure/Entertainment, Fashion/Beauty
  • Training data for machine learning model training may be constructed by matching predefined category classifications with text data extracted from web page data. Specifically, after formal morphemes are removed from the text data extracted from web page data and only nouns are selected from the remaining substantive morphemes, learning data can be generated through pattern clustering of the features of each category using those nouns.
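The idea of scoring noun-only text against each category and picking the most similar one can be illustrated with a deliberately simple stand-in. The sketch below uses bag-of-words cosine similarity against per-category noun counts, not the CNN/RNN-style models named in the text, and it assumes the nouns have already been produced by a morphological analyzer. The function names and the `threshold` parameter (standing in for the success criterion applied in step S537) are hypothetical.

```python
import math
from collections import Counter


def train_centroids(samples):
    """Aggregate noun counts per category from (noun_list, category) pairs."""
    centroids = {}
    for nouns, category in samples:
        centroids.setdefault(category, Counter()).update(nouns)
    return centroids


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(nouns, centroids, threshold=0.3):
    """Return (category, score); category is None when the best score is
    below the threshold, i.e. classification failure."""
    query = Counter(nouns)
    scores = {cat: cosine(query, vec) for cat, vec in centroids.items()}
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]
```

Trained on two toy samples, one labeled finance/banking with the nouns loan, bank, and interest, and one labeled sports, the query nouns bank, loan, and rate score highest against the finance category, while a query sharing no nouns with any category falls below the threshold and is reported as a failure.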
  • The database 140 may temporarily or permanently store various types of information and data related to the operation of the system 100.
  • The database 140 may store learning data built for machine learning model training.
  • The database 140 may store category classification information for URLs, including category classification information for analysis target URLs classified by a machine learning model.
  • FIG. 5 is an operation flowchart of a machine learning-based URL category automatic classification system according to an embodiment of the present invention.
  • The URL input unit 110 may receive an analysis target URL (S510).
  • The control unit 120 may determine whether the URL to be analyzed is subject to machine learning-based category classification (S520).
  • Step S520 is a procedure for determining whether machine learning-based automatic URL category classification is necessary for the URL to be analyzed.
  • Step S520 may include procedures such as a preprocessing step (S521), URL pattern rule filtering (S523), and database reference classification (S525). Depending on the embodiment, some steps may be omitted or modified, and the order of execution may change.
  • The control unit 120 may perform preprocessing of separating the protocol, domain, and path from the analysis target URL (S521).
  • If the analysis target URL matches a URL pattern rule in the URL pattern rule list, the control unit 120 may determine that the analysis target URL is not a machine learning-based category classification target (S523-N).
  • The control unit 120 may determine that the analysis target URL is not subject to machine learning-based category classification if category classification information for a URL generated by combining at least some of the separated protocol, domain, and path is included in the database (S525-N).
  • The control unit 120 may store in the database 140 category classification information of the analysis target URL determined not to be subject to machine learning-based category classification (S540).
  • If the analysis target URL does not match any URL pattern rule, the control unit 120 may determine that the analysis target URL is a machine learning-based category classification target (S523-Y).
  • The control unit 120 may also determine that the analysis target URL is subject to machine learning-based category classification when the database does not include category classification information for a URL generated by combining at least some of the separated protocol, domain, and path (S525-Y).
  • The control unit 120 may perform a machine learning-based classification procedure on the analysis target URL determined to be a machine learning-based category classification target (S530).
  • FIG. 6 is a flowchart illustrating the machine learning-based classification procedure of FIG. 5 in detail.
  • The control unit 120 accesses the website corresponding to the analysis target URL (S531), obtains web page data corresponding to the analysis target URL (S533), and removes formal morphemes from the text extracted from the obtained web page data, processing it into text data consisting only of nouns (S535).
  • The text data extracted and processed in step S535 may be input to a machine learning model to perform category classification of the URL to be analyzed (S537).
  • In step S537, if there is a category classification result whose degree of similarity is equal to or greater than a certain criterion, it may be treated as classification success, and if the degree of similarity is less than the criterion, it may be treated as classification failure.
  • If classification succeeds, the control unit 120 may store category classification information of the analysis target URL classified based on machine learning in the database 140 (S540).
  • If classification fails, the control unit 120 may store the analysis target URL as unclassified data in the database 140 (S550).
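The overall flow of steps S510 to S550 can be summarized in a short orchestration sketch. Everything here is illustrative: `process_url` and the `UNCLASSIFIED` marker are hypothetical names, the database is a plain dictionary, and the four callables are injected stand-ins for the rule filter, the database lookup, the crawl-and-noun-extraction step, and the model. The order of the rule check and the database check follows the step numbering, although the text notes the order may vary by embodiment.

```python
UNCLASSIFIED = "unclassified"  # hypothetical marker for S550


def process_url(url, db, match_rule, lookup, fetch_nouns, classify):
    """Run one analysis target URL through the filter and, if needed, the model.

    match_rule(url)   -> category or None   (S523, URL pattern rule filtering)
    lookup(url, db)   -> category or None   (S525, database reference)
    fetch_nouns(url)  -> list of nouns      (S531 to S535)
    classify(nouns)   -> category or None   (S537, None means failure)
    """
    category = match_rule(url)
    if category is None:
        category = lookup(url, db)
    if category is None:
        # Neither filter resolved the URL, so it is a machine learning target (S530).
        category = classify(fetch_nouns(url))
    # S540 on success, S550 when the model could not classify the URL.
    db[url] = category if category is not None else UNCLASSIFIED
    return db[url]
```

Passing the steps in as callables keeps the sketch self-contained and makes the short-circuit visible: when a rule or database entry resolves the URL, the page is never fetched.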
  • The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components.
  • The devices, methods, and components described in the embodiments may be implemented using, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.
  • A processing device may run an operating system (OS) and one or more software applications running on the operating system.
  • A processing device may also access, store, manipulate, process, and generate data in response to execution of software.
  • It will be understood that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements.
  • For example, a processing device may include a plurality of processors or a processor and a controller. Other processing configurations, such as parallel processors, are also possible.
  • Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or command the processing device independently or collectively.
  • Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device.
  • Software may be distributed on networked computer systems and stored or executed in a distributed manner.
  • Software and data may be stored on one or more computer-readable media.
  • The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium.
  • The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination.
  • Program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and usable to those skilled in computer software.
  • Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.
  • Examples of program instructions include high-level language code that can be executed by a computer using an interpreter, as well as machine language code such as that produced by a compiler.
  • The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method and system for automatically classifying URL categories based on machine learning. The method for automatically classifying URL categories based on machine learning according to the present invention comprises the steps of: receiving an analysis target URL as input; determining whether the analysis target URL is subject to machine learning-based category classification; if the analysis target URL is subject to machine learning-based category classification, acquiring web page data corresponding to the analysis target URL; inputting text data extracted from the acquired web page data into a machine learning model to classify the category corresponding to the analysis target URL; and storing category classification information for the analysis target URL in a database.
PCT/KR2022/009723 2021-09-30 2022-07-06 Method and system for automatically classifying URL categories based on machine learning WO2023054858A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020210129546A KR20230046494A (ko) 2021-09-30 2021-09-30 머신러닝 기반 url 카테고리 자동 분류 방법 및 시스템
KR10-2021-0129546 2021-09-30

Publications (1)

Publication Number Publication Date
WO2023054858A1 (fr) 2023-04-06

Family

ID=85783049

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/009723 WO2023054858A1 (fr) 2021-09-30 2022-07-06 Method and system for automatically classifying URL categories based on machine learning

Country Status (2)

Country Link
KR (1) KR20230046494A (fr)
WO (1) WO2023054858A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188120A (zh) * 2023-04-28 2023-05-30 北京华阅嘉诚科技发展有限公司 一种有声书的推荐方法、装置、系统及存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102580512B1 (ko) * 2023-04-12 2023-09-20 (주)유알피 자동 문장 클러스터링 딥러닝 모델 학습을 위한 자동화된 rpa 학습 장치 및 방법

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080052097A (ko) * 2006-12-07 2008-06-11 한국전자통신연구원 웹 구조정보를 이용한 유해 사이트 차단 방법 및 장치
US20120158626A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation Detection and categorization of malicious urls
KR20180115111A (ko) * 2017-04-12 2018-10-22 주식회사 리메인 키워드 라벨링의 학습화를 통한 불법/유해 정보에 대한 차단 방법 및 이를 수행하는 장치
KR20200119534A (ko) * 2019-04-10 2020-10-20 인천대학교 산학협력단 유해 콘텐츠 웹 페이지 url 필터링 장치
KR20210054799A (ko) * 2019-11-06 2021-05-14 삼성에스디에스 주식회사 Url 클러스터링을 위한 url의 요약을 생성하는 방법 및 장치

Also Published As

Publication number Publication date
KR20230046494A (ko) 2023-04-06

Similar Documents

Publication Publication Date Title
WO2023054858A1 (fr) Method and system for automatically classifying URL categories based on machine learning
David et al. Deepsign: Deep learning for automatic malware signature generation and classification
CN101971591B (zh) 分析网址的系统及方法
Buber et al. NLP based phishing attack detection from URLs
WO2012108623A1 (fr) Procédé, système et support d'enregistrement lisible par ordinateur pour ajouter une nouvelle image et des informations sur la nouvelle image à une base de données d'images
CN107786575A (zh) 一种基于dns流量的自适应恶意域名检测方法
US7617090B2 (en) Contents filter based on the comparison between similarity of content character and correlation of subject matter
CN105956180A (zh) 一种敏感词过滤方法
Wazirali et al. Sustaining accurate detection of phishing URLs using SDN and feature selection approaches
KR102060766B1 (ko) 다크웹 범죄 사이트 모니터링 시스템
CN105653563B (zh) 对网页抓取的控制方法、动态更新黑名单和白名单的方法及相关装置
CN106549980A (zh) 一种恶意c&c服务器确定方法及装置
CN106708952A (zh) 一种网页聚类方法及装置
CN107547490A (zh) 一种扫描器识别方法、装置及系统
CN108769001A (zh) 基于网络行为特征聚类分析的恶意代码检测方法
Yang et al. Scalable detection of promotional website defacements in black hat {SEO} campaigns
CN108229131A (zh) 仿冒app识别方法及装置
CN107341371A (zh) 一种适用于web组态的脚本控制方法
WO2018101506A1 (fr) Dispositif et procédé de classification multiple de documents permettant de classer un document dans une pluralité de catégories à l'aide d'un motif lexico-sémantique obtenu en reconfigurant une catégorie sémantique de mots constituant une phrase
Pan et al. Webshell detection based on executable data characteristics of php code
CN112199569A (zh) 一种违禁网址识别方法、系统、计算机设备及存储介质
WO2022191596A1 (fr) Dispositif et procédé permettant de détecter automatiquement un comportement anormal de paquet de réseau sur la base d'un auto-profilage
Alshammery et al. Crawling and mining the dark web: A survey on existing and new approaches
CN111639250B (zh) 企业描述信息获取方法、装置、电子设备及存储介质
Nguyen et al. Improving web application firewalls with automatic language detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22876633

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE