WO2023054858A1

WO2023054858A1 - Method and system for automatic classification of url category on basis of machine learning

Info

Publication number: WO2023054858A1
Application number: PCT/KR2022/009723
Authority: WO
Inventors: 김영중; 노주영
Original assignee: (주)모니터랩
Priority date: 2021-09-30
Filing date: 2022-07-06
Publication date: 2023-04-06
Also published as: KR20230046494A

Abstract

The present invention relates to a method and system for automatic classification of a URL category on the basis of machine learning, and the method for automatic classification of a URL category on the basis of machine learning according to the present invention comprises the steps of: receiving an input of a URL to be analyzed; determining whether the URL to be analyzed is subject to category classification on the basis of machine learning; if the URL to be analyzed is subject to category classification on the basis of machine learning, acquiring web page data corresponding to the URL to be analyzed; inputting text data extracted from the acquired web page data to a machine learning model to classify a category corresponding to the URL to be analyzed; and storing, in a database, category classification information for the URL to be analyzed.

Description

Machine learning-based URL category automatic classification method and system

The present invention relates to a method and system for automatically classifying URL categories based on machine learning.

Recently, as business has increasingly moved to the web, the number of websites that threaten corporate assets or productivity has grown. Accessing non-business sites wastes working time and reduces productivity, and corporate assets may be stolen when users visit websites in which malicious code is hidden. In addition, confidential information can easily be leaked through the web. A secure web gateway is a security solution that blocks harmful sites in order to control the web usage environment itself, preventing productivity loss and protecting corporate assets.

Creating URL category classification data, one of the core data sets of a secure web gateway that blocks harmful websites, is therefore of primary importance. Conventionally, classification was carried out manually by human staff. Because a large number of URLs must be classified in a short period of time, this approach demands considerable time and effort, as well as the cost of assigning many people to the task. In addition, because human classifiers may not classify consistently, accuracy varies widely.

Therefore, the technical problem to be solved by the present invention is to provide a method and system for automatically classifying URL categories based on machine learning.

To solve the above technical problem, a method for automatically classifying URL categories based on machine learning according to the present invention includes the steps of: receiving an analysis target URL; determining whether the analysis target URL is subject to machine learning-based category classification; if the analysis target URL is subject to machine learning-based category classification, acquiring web page data corresponding to the analysis target URL; classifying a category corresponding to the analysis target URL by inputting text data extracted from the acquired web page data into a machine learning model; and storing category classification information for the analysis target URL in a database.

The step of determining whether the analysis target URL is subject to machine learning-based category classification may include: a preprocessing step of separating a protocol, a domain, and a path from the analysis target URL; and a step of determining that the analysis target URL is not subject to machine learning-based category classification if category classification information for a URL generated by combining at least some of the separated protocol, domain, and path is included in the database.

Web page data corresponding to the analysis target URL may be obtained by accessing a website corresponding to the analysis target URL.

The step of determining whether the analysis target URL is subject to machine learning-based category classification may further include a step of determining that the analysis target URL is not subject to machine learning-based category classification if a URL pattern rule matching the analysis target URL exists in a URL pattern rule list.

The URL pattern rule list may include a plurality of URL pattern rules classified into categories in advance.

A category corresponding to a URL pattern rule matched to the analysis target URL may be stored in the database as category classification information for the analysis target URL.

The machine learning model may be trained with text data extracted from web page data obtained from a plurality of websites and learning data constructed with category classification information pre-assigned to the plurality of websites.

The machine learning model may receive text data consisting only of nouns, obtained by removing formal morphemes from the text extracted from the web page data, calculate a similarity for each predefined category, and classify the analysis target URL into the category with the highest similarity.

The present invention may include a computer-readable recording medium on which a program for executing the method on a computer is recorded.

To solve the above technical problem, a machine learning-based system for automatically classifying URL categories according to the present invention includes: a URL input unit that receives an analysis target URL; a control unit that determines whether the analysis target URL is subject to machine learning-based category classification and, if it is, acquires web page data corresponding to the analysis target URL and extracts text data; an artificial intelligence unit that classifies a category corresponding to the analysis target URL by inputting the text data extracted from the acquired web page data into a machine learning model; and a database that stores category classification information for the analysis target URL.

The control unit may perform preprocessing to separate the domain from the analysis target URL and, if category classification information for the separated domain is included in the database, determine that the analysis target URL is not subject to machine learning-based category classification.

According to the present invention, category classification of analysis target URLs can be performed automatically, accurately, and efficiently through a machine learning model. In particular, processing speed and efficiency can be improved by using a URL preprocessing filter so that machine learning-based category classification is performed only for URLs that require it. In addition, because a list of URLs requiring category classification can be input and processed in batches, a large amount of URL category classification data can be generated.

FIG. 1 is a block diagram of a machine learning-based URL category automatic classification system according to an embodiment of the present invention.

FIG. 2 is a diagram provided to explain URL preprocessing according to the present invention.

FIG. 3 shows an example of extracting text data through web crawling according to the present invention.

FIG. 4 shows an example of a result of morphological analysis of text data extracted through web crawling according to the present invention.

FIG. 5 is an operation flowchart of a machine learning-based URL category automatic classification system according to an embodiment of the present invention.

FIG. 6 is a flowchart illustrating the machine learning-based classification procedure of FIG. 5 in detail.

Then, with reference to the accompanying drawings, embodiments of the present invention will be described in detail so that those skilled in the art can easily practice the present invention.

Referring to FIG. 1, a system 100 according to the present invention may include a URL input unit 110, a control unit 120, an artificial intelligence unit 130, and a database 140.

The URL input unit 110 may receive an analysis target URL.

Analysis target URLs can be collected in three main ways. First, an analysis target URL can be extracted from the user visit log of a product such as the secure web gateway 200. Second, a threat information collector can collect URLs through crawling. Third, a URL classification request list may be received from a customer. These lists may be input to the URL input unit 110 one URL at a time or in bulk.

The secure web gateway 200 loads URL category classification information and logs the URLs visited by clients (not shown). It can control access either by blocking access to URLs that belong to URL categories the administrator has set to be blocked, or by allowing access only to URLs that belong to URL categories set to be allowed.

The secure web gateway 200 separately stores user visit URL logs. In particular, it may extract URLs that do not exist in the URL category classification information, or URLs designated by the administrator as reclassification targets, as analysis target URLs and provide them to the URL input unit 110.
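The block-or-allow behavior of the gateway described above can be illustrated with a minimal sketch. The names `URL_CATEGORIES` and `is_allowed`, the per-host lookup, and the sample data are assumptions made for illustration only; the source does not specify how the secure web gateway 200 stores or evaluates its policy.

```python
from urllib.parse import urlsplit

# Hypothetical category data keyed by host name (illustrative values only).
URL_CATEGORIES = {"www.a.com": "news", "www.b.com": "shopping"}


def is_allowed(url, categories, blocked=None, allowed=None):
    """Return True if access to `url` should be permitted.

    blocked: set of categories to deny (block-list mode).
    allowed: set of categories to permit exclusively (allow-list mode).
    """
    host = urlsplit(url).hostname or ""
    category = categories.get(host)  # None when the URL is not yet classified
    if allowed is not None:
        # Allow-list mode: only URLs in an allowed category pass.
        return category in allowed
    if blocked is not None:
        # Block-list mode: everything passes except blocked categories.
        return category not in blocked
    return True
```

In this sketch an unclassified URL passes in block-list mode and is denied in allow-list mode, which is one reason unclassified URLs are worth forwarding for classification.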

After performing preprocessing on the analysis target URL, the control unit 120 may determine whether the analysis target URL is subject to machine learning-based category classification.

Referring to FIG. 2, the control unit 120 may perform preprocessing of separating a protocol, domain, and path from an analysis target URL.

The control unit 120 may determine that the analysis target URL is not subject to machine learning-based category classification if category classification information for a URL generated by combining at least some of the separated protocol, domain, and path is included in the database 140.

For example, assuming that "https://www.a.com/path1/" is entered as the analysis target URL, it can be separated into text elements such as the protocol 'https', the domain 'www.a.com', and the path 'path1/'.

Then, by combining at least some of the separated text elements, URLs such as https://www.a.com/, http://www.a.com/, https://www.a.com/path1/, and http://www.a.com/path1/ can be generated. If category classification information for a generated URL is already included in the database 140, it may be determined that the analysis target URL is not subject to machine learning-based category classification.
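The preprocessing and database check described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the database 140 is stood in for by a plain dictionary, and the set of combinations (both protocols, with and without the path) simply mirrors the four example URLs in the text.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Separate a URL into protocol, domain, and path (preprocessing)."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path.lstrip("/")


def candidate_urls(url):
    """Combine the separated parts into the URLs to look up in the database."""
    _, domain, path = split_url(url)
    candidates = []
    for scheme in ("https", "http"):
        candidates.append(f"{scheme}://{domain}/")
        if path:
            candidates.append(f"{scheme}://{domain}/{path}")
    return candidates


def needs_ml_classification(url, category_db):
    """True when none of the candidate URLs already has a stored category."""
    return not any(c in category_db for c in candidate_urls(url))
```

For "https://www.a.com/path1/", `candidate_urls` yields the four URLs listed in the example, so a stored entry for any of them is enough to skip machine learning-based classification.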

Meanwhile, if there is a URL pattern rule matching the analysis target URL in the URL pattern rule list, the control unit 120 may determine that the analysis target URL is not a machine learning-based category classification target.

For example, the domain and path of the URL below contain random strings, so it is not possible to check whether the URL is subject to machine learning-based category classification by referring to the category classification information.

https://231231231.aaa.com/news/1HIWVx3reLiggzMftCc/I8yzSRrqU98Sj5Euo8QAtnPLg/

Therefore, in order to determine whether a URL whose domain or path contains random strings is subject to machine learning-based category classification, a URL pattern rule list for classifying analysis target URLs according to certain rules may be prepared in advance. If '*.aaa.com/news' is included in the URL pattern rule list as a URL rule corresponding to the news category, the above URL is classified into the news category and determined not to be subject to machine learning-based category classification.

The two methods used to determine whether an analysis target URL is subject to machine learning-based category classification differ as follows: the first method queries the database using only text combinations, whereas the second method classifies URLs by defining in advance a whitelist of rules that include the asterisk (*) wildcard.
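A minimal sketch of URL pattern rule matching is shown below. The rule '*.aaa.com/news' and its news category come from the text; everything else is an assumption for illustration, including the function name, the use of Python's `fnmatch` wildcards, and the choice to treat a rule as covering the path it names and everything beneath it.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rule list of (pattern, category) pairs.
URL_PATTERN_RULES = [
    ("*.aaa.com/news", "news"),
]


def match_pattern_rule(url, rules=URL_PATTERN_RULES):
    """Return the category of the first rule matching the URL, else None."""
    parts = urlsplit(url)
    target = parts.netloc + parts.path  # compare without the protocol
    for pattern, category in rules:
        # Assumption: a rule matches the path it names and anything below it.
        if fnmatchcase(target, pattern) or fnmatchcase(target, pattern + "/*"):
            return category
    return None
```

Note that the `*` wildcard in `fnmatch` also matches "/" characters, so a stricter matcher that handles the domain and path separately may be preferable in practice.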

If the analysis target URL is subject to machine learning-based category classification, the control unit 120 may obtain web page data corresponding to the analysis target URL. To this end, the control unit 120 can connect to the website corresponding to the analysis target URL through web crawling and extract all text data of the page, as illustrated in FIG. 3.

The artificial intelligence unit 130 may classify a category corresponding to an analysis target URL by inputting text data extracted from the obtained web page data into a machine learning model. To this end, the artificial intelligence unit 130 may train the machine learning model with learning data constructed from text data extracted from web page data obtained from a plurality of websites and category classification information pre-assigned to the plurality of websites.

Here, the machine learning model may take the form of a machine learning algorithm such as a convolutional neural network (CNN), recurrent neural network (RNN), gated recurrent unit (GRU), long short-term memory (LSTM), or sequence-to-sequence (Seq2Seq) model.

The machine learning model may be pre-trained to receive text data consisting only of nouns, obtained by removing formal morphemes from the text extracted from web page data, calculate the similarity for each predefined category, and classify the analysis target URL into the category with the highest similarity. FIG. 4 shows an example of a result of morphological analysis of text data extracted through web crawling according to the present invention.
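Extracting the text of a crawled page can be sketched with the Python standard library alone. The class and function names here are hypothetical, and skipping the contents of script and style elements is an assumption; the source only states that all text data of the page is extracted.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect page text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_page_text(url, timeout=10):
    """Download the page at `url` and return its extracted text."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return extract_text(resp.read().decode(charset, errors="replace"))
```

A real crawler would likely also need to handle redirects, errors, and pages rendered by JavaScript, none of which this sketch attempts.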

Table 1
Ad/Pop-up	Alcohol/Tobacco	Business	Vehicle/Transportation
Computer/Technology	Education	Finance/Banking	Health/Medicine
Recruitment/Job search	News	Non-profit	Real estate
Religion	Restaurant/Food service	Search engine/Portal	Shopping
Sports	Travel	Leisure/Entertainment	Fashion/Beauty

As illustrated in Table 1, training data for machine learning model training may be constructed by matching predefined category classifications with text data extracted from web page data. Specifically, after formal morphemes are removed from the text data extracted from web page data and only the nouns among the remaining substantive morphemes are selected, learning data can be generated through pattern clustering of the features of each category using those nouns. Of course, in addition to what has been described here, it is also possible to train a machine learning model in other ways to receive text data included in a web page and classify it into predefined categories, and to classify the category of the analysis target URL using the trained machine learning model.

The database 140 may temporarily or permanently store various types of information and data related to the operation of the system 100. The database 140 may store learning data built for machine learning model training. The database 140 may store category classification information for URLs, and may store category classification information for analysis target URLs classified by a machine learning model.
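The idea of building per-category learning data from nouns and scoring a page against each category can be illustrated with a deliberately simple stand-in. The sketch below uses bag-of-words counts and cosine similarity, which is not the neural network model named in the description; the function names are hypothetical, and the input is assumed to be noun lists already produced by a morphological analyzer.

```python
import math
from collections import Counter


def build_category_profiles(training_data):
    """Aggregate noun counts per category.

    training_data: iterable of (category, list_of_nouns) pairs.
    Returns a dict mapping category -> Counter of noun frequencies.
    """
    profiles = {}
    for category, nouns in training_data:
        profiles.setdefault(category, Counter()).update(nouns)
    return profiles


def cosine_similarity(a, b):
    """Cosine similarity between two noun-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def category_similarities(nouns, profiles):
    """Score a page's nouns against every category profile."""
    doc = Counter(nouns)
    return {c: cosine_similarity(doc, p) for c, p in profiles.items()}
```

The category with the highest score can then be picked with `max(sims, key=sims.get)`, where `sims` is the dictionary returned by `category_similarities`.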

Referring to FIG. 5, the URL input unit 110 may receive an analysis target URL (S510).

The control unit 120 may determine whether the URL to be analyzed is subject to machine learning-based category classification (S520). Step S520 is a procedure for determining whether machine learning-based automatic URL category classification is necessary for the target URL to be analyzed.

Specifically, step S520 may include procedures such as a preprocessing step (S521), URL pattern rule filtering (S523), and database reference classification (S525). Depending on the embodiment, some steps may be omitted or modified, and the order of execution may change.

The control unit 120 may perform preprocessing of separating the protocol, domain, and path from the analysis target URL (S521).

If there is a URL pattern rule matching the analysis target URL in the URL pattern rule list, the control unit 120 may determine that the analysis target URL is not a machine learning-based category classification target (S523-N).

The control unit 120 may determine that the analysis target URL is not subject to machine learning-based category classification if category classification information for a URL generated by combining at least some of the separated protocol, domain, and path is included in the database (S525-N).

The control unit 120 may store in the database 140 category classification information of the analysis target URL determined not to be subject to machine learning-based category classification (S540).

Meanwhile, if there is no URL pattern rule matching the analysis target URL in the URL pattern rule list, the control unit 120 may determine that the analysis target URL is a machine learning-based category classification target (S523-Y).

In addition, the control unit 120 may determine that the analysis target URL is subject to machine learning-based category classification when the database does not include category classification information for a URL generated by combining at least some of the separated protocol, domain, and path (S525-Y).

The control unit 120 may perform a machine learning-based classification procedure on the analysis target URL determined to be a machine learning-based category classification target (S530).

Referring to FIG. 6, the control unit 120 first accesses the website corresponding to the analysis target URL (S531) and obtains web page data corresponding to the analysis target URL (S533). Formal morphemes may then be removed from the text extracted from the obtained web page data so that it is processed into text data consisting only of nouns (S535).

Next, the text data extracted and processed in step S535 may be input to a machine learning model to perform category classification of the URL to be analyzed (S537). In step S537, if there is a category classification result in which the degree of similarity is equal to or greater than a certain criterion, it may be treated as classification success, and if the degree of similarity is less than a certain criterion, it may be treated as classification failure.
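The success-or-failure decision in step S537 can be sketched as a threshold check on the best similarity score. The function name and the threshold value of 0.5 are assumptions; the source only refers to "a certain criterion" without giving a value.

```python
def decide_category(similarities, threshold=0.5):
    """Return (category, True) on success or (None, False) on failure.

    similarities: dict mapping category name -> similarity score.
    threshold: hypothetical criterion; not specified in the source.
    """
    if not similarities:
        return None, False
    best = max(similarities, key=similarities.get)
    if similarities[best] >= threshold:
        return best, True  # classification success
    return None, False  # classification failure
```

A score exactly equal to the threshold counts as success here, matching the "equal to or greater than" wording of the step.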

Referring back to FIG. 5, if the machine learning-based classification succeeds (S530-Y), the control unit 120 may store category classification information of the analysis target URL classified based on machine learning in the database 140 (S540).

Meanwhile, if the machine learning-based classification fails (S530-N), the control unit 120 may store the analysis target URL as unclassified data in the database 140 (S550).
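The overall flow of FIG. 5 can be summarized in a short sketch. The three callables passed in are hypothetical placeholders for the URL pattern rule filter, the database reference check, and the machine learning-based classifier; the dictionary and set stand in for the database 140, and the order of the two filters is one possible arrangement, since the description notes that the order may change.

```python
def process_url(url, category_db, unclassified, match_rule, lookup_db, classify_ml):
    """Run one analysis target URL through the flow from S510 to S550.

    match_rule(url) -> category or None               (S523, pattern rules)
    lookup_db(url, category_db) -> category or None   (S525, database reference)
    classify_ml(url) -> category or None              (S530, ML classification)
    """
    category = match_rule(url)
    if category is None:
        category = lookup_db(url, category_db)
    if category is None:
        # Neither filter resolved the URL, so it is an ML classification target.
        category = classify_ml(url)
    if category is not None:
        category_db[url] = category  # S540: store the classification
        return category
    unclassified.add(url)  # S550: keep as unclassified data
    return None
```

Because batch input is simply a list of URLs, the same function can be applied in a loop over a URL classification request list.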

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components. For example, the devices, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications running on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to execution of software. For convenience of understanding, a single processing device is sometimes described, but those skilled in the art will understand that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or a processor and a controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or command the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. Program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include high-level language code that can be executed by a computer using an interpreter, as well as machine language code such as that produced by a compiler. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

As described above, although the embodiments have been described with reference to a limited set of drawings, those skilled in the art can apply various technical modifications and variations based on the above. For example, appropriate results can be achieved even if the described techniques are performed in an order different from the method described, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the method described, or are replaced or substituted by other components or equivalents.

Claims

A machine learning-based method for automatically classifying URL categories, comprising:

receiving an analysis target URL;

determining whether the analysis target URL is subject to machine learning-based category classification;

if the analysis target URL is subject to machine learning-based category classification, obtaining web page data corresponding to the analysis target URL;

classifying a category corresponding to the analysis target URL by inputting text data extracted from the obtained web page data into a machine learning model; and

storing category classification information for the analysis target URL in a database,

wherein the determining of whether the analysis target URL is subject to machine learning-based category classification comprises:

a preprocessing step of separating a protocol, a domain, and a path from the analysis target URL; and

determining that the analysis target URL is not subject to machine learning-based category classification if category classification information for a URL generated by combining at least some of the separated protocol, domain, and path is included in the database.
The method of claim 1,

wherein the web page data corresponding to the analysis target URL is obtained by accessing a website corresponding to the analysis target URL.
The method of claim 2,

wherein the determining of whether the analysis target URL is subject to machine learning-based category classification further comprises determining that the analysis target URL is not subject to machine learning-based category classification if a URL pattern rule matching the analysis target URL exists in a URL pattern rule list,

and wherein the URL pattern rule list includes a plurality of URL pattern rules classified into categories in advance.
The method of claim 3,

wherein a category corresponding to the URL pattern rule matched to the analysis target URL is stored in the database as the category classification information for the analysis target URL.
The method of any one of claims 1 to 4,

wherein the machine learning model is trained with learning data constructed from text data extracted from web page data obtained from a plurality of websites and category classification information pre-assigned to the plurality of websites.
The method of claim 5,

wherein the machine learning model receives text data consisting only of nouns, obtained by removing formal morphemes from the text extracted from the web page data, calculates a similarity for each predefined category, and classifies the analysis target URL into the category with the highest similarity.
A computer-readable recording medium on which a program for executing, on a computer, the method according to any one of claims 1 to 4 is recorded.
A machine learning-based system for automatically classifying URL categories, comprising:

a URL input unit that receives an analysis target URL;

a control unit that determines whether the analysis target URL is subject to machine learning-based category classification and, if the analysis target URL is subject to machine learning-based category classification, obtains web page data corresponding to the analysis target URL and extracts text data;

an artificial intelligence unit that classifies a category corresponding to the analysis target URL by inputting the text data extracted from the obtained web page data into a machine learning model; and

a database that stores category classification information for the analysis target URL,

wherein the control unit performs preprocessing to separate a domain from the analysis target URL and, if category classification information for the separated domain is included in the database, determines that the analysis target URL is not subject to machine learning-based category classification.
The system of claim 8,

wherein the web page data corresponding to the analysis target URL is obtained by accessing a website corresponding to the analysis target URL.
The system of claim 9,

wherein the control unit determines that the analysis target URL is not subject to machine learning-based category classification if a URL pattern rule matching the analysis target URL exists in a URL pattern rule list,

and wherein the URL pattern rule list includes a plurality of URL pattern rules classified into categories in advance.
The system of claim 10,

wherein a category corresponding to the URL pattern rule matched to the analysis target URL is stored in the database as the category classification information for the analysis target URL.
The system of any one of claims 8 to 11,

wherein the machine learning model is trained with learning data constructed from text data extracted from web page data obtained from a plurality of websites and category classification information pre-assigned to the plurality of websites.
The system of claim 12,

wherein the machine learning model receives text data consisting only of nouns, obtained by removing formal morphemes from the text extracted from the web page data, calculates a similarity for each predefined category, and classifies the analysis target URL into the category with the highest similarity.