WO2020044469A1

WO2020044469A1 - Illicit webpage detection device, illicit webpage detection device control method, and control program

Info

Publication number: WO2020044469A1
Application number: PCT/JP2018/031993
Authority: WO
Inventors: 隆一田代
Original assignee: Ｂｂソフトサービス株式会社
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2020-03-05
Also published as: JP7182764B2; JPWO2020044469A1

Abstract

Provided are an illicit webpage detection device, illicit webpage detection device control method, and control program, with which a determination as to whether a webpage is an illicit webpage can be made with high precision. This illicit webpage detection device comprises: a storage part for storing feature vectors based on the states of association of a plurality of character strings in each of a plurality of illicit HTML documents which configure each of a plurality of illicit webpages; an acquisition part for acquiring an HTML document to be inspected which configures a webpage to be inspected; a vector computation part for computing a feature vector of the HTML document under inspection; a similarity computation part for computing similarities between the feature vector of the HTML document under inspection and each of the feature vectors of the plurality of illicit HTML documents; a determination part for, on the basis of each of the computed similarities and a threshold, determining whether the webpage under inspection is an illicit webpage; and a determination result output part for outputting the result of the determination performed by the determination part.

Description

Unauthorized Web page detection apparatus, control method of unauthorized Web page detection apparatus, and control program

The present disclosure relates to an unauthorized Web page detection device, a control method of the unauthorized Web page detection device, and a control program.

In order to respond to the increase in phishing scams using the Internet, technologies for preventing damage from phishing scams are becoming widespread.

For example, Patent Document 1 discloses a communication control device that prohibits access to a URL (Uniform Resource Locator) of a phishing site. The communication control device is provided on a communication path between the user's terminal and another device with which the user's terminal communicates, and compares the URL of the access destination content included in communication data transmitted by the terminal with the URLs included in a phishing site list, that is, a blacklist. When the URL of the content accessed by the terminal matches a URL included in the blacklist, the communication control device prohibits access to the content.

Patent Document 1: International Publication No. WO 2006/087908

In recent years, tools for constructing phishing sites have become widely distributed among criminals who perform phishing scams, and criminals can easily and quickly generate phishing sites by using these tools. Criminals use the tools to create new phishing sites, direct users to fraudulent Web pages on the new sites, perform phishing scams, and then quickly close the phishing sites they created. A criminal can thus execute a phishing scam before the fraudulent Web page is placed on the blacklist, so the conventional blacklist method may fail to detect the fraudulent Web page.

The purpose of the unauthorized Web page detection device, the control method of the unauthorized Web page detection device, and the control program is to make it possible to determine with high accuracy whether or not the Web page is an unauthorized Web page.

The unauthorized Web page detection device according to the present embodiment includes: a storage unit that stores, for each of a plurality of unauthorized HTML (HyperText Markup Language) documents constituting each of a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in that HTML document; an acquisition unit that acquires an HTML document to be inspected that constitutes a Web page to be inspected; a vector calculation unit that calculates a feature vector of the HTML document to be inspected; a similarity calculation unit that calculates a similarity between the feature vector of the HTML document to be inspected and each of the feature vectors of the plurality of unauthorized HTML documents; a determination unit that determines whether the inspection target Web page is an unauthorized Web page based on each of the calculated similarities and a threshold value; and a determination result output unit that outputs a determination result by the determination unit.

In the unauthorized Web page detection device according to the present embodiment, it is preferable that the storage unit further stores a feature vector of each of a plurality of regular HTML documents constituting each of a plurality of regular Web pages in association with a regular URL (Uniform Resource Locator) indicating the regular Web page, that the acquisition unit further acquires an inspection target URL indicating the inspection target Web page, and that, when the domain name in the inspection target URL does not match any of the domain names in the plurality of regular URLs, the similarity calculation unit further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents.

In the unauthorized Web page detection device according to the present embodiment, it is preferable that the similarity calculation unit does not calculate the similarity for an unauthorized HTML document when the difference between the size of that unauthorized HTML document and the size of the inspection target HTML document is equal to or larger than a predetermined value.
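The size-based skip described above can be illustrated with a minimal sketch. The function name `candidates_by_size`, the use of the UTF-8 byte length as the "size", and the parameter `max_size_diff` are assumptions made for illustration; the source only states that documents whose size difference is equal to or larger than a predetermined value are excluded from the similarity calculation.

```python
def candidates_by_size(inspected_html: str,
                       known_docs: dict[str, str],
                       max_size_diff: int) -> list[str]:
    """Return the IDs of stored documents that still need a similarity check.

    A stored document is skipped when its size differs from the inspected
    document by max_size_diff or more (size = UTF-8 byte length, an assumption).
    """
    inspected_size = len(inspected_html.encode("utf-8"))
    kept = []
    for doc_id, html in known_docs.items():
        size_diff = abs(len(html.encode("utf-8")) - inspected_size)
        if size_diff < max_size_diff:  # "equal to or larger" means skip
            kept.append(doc_id)
    return kept
```

A filter of this kind would reduce the number of vector comparisons when many unauthorized HTML documents are stored.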

In the unauthorized Web page detection device according to the present embodiment, it is preferable that the plurality of character strings include an HTML tag and a word.

In the unauthorized Web page detection device according to the present embodiment, the plurality of character strings are preferably continuous character strings.

In the method according to the present embodiment for controlling an unauthorized Web page detection device having a storage unit and an output unit, the unauthorized Web page detection device: stores in the storage unit, for each of a plurality of unauthorized HTML documents constituting each of a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in that HTML document; acquires an HTML document to be inspected that constitutes a Web page to be inspected; calculates a feature vector of the HTML document to be inspected; calculates a similarity between the feature vector of the HTML document to be inspected and each of the feature vectors of the plurality of unauthorized HTML documents; determines, based on each of the calculated similarities and a threshold value, whether or not the inspection target Web page is an unauthorized Web page; and outputs the result of the determination to the output unit.

The control program according to the present embodiment, for an unauthorized Web page detection device having a storage unit and an output unit, causes the unauthorized Web page detection device to: store in the storage unit, for each of a plurality of unauthorized HTML documents constituting each of a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in that HTML document; acquire an HTML document to be inspected that constitutes a Web page to be inspected; calculate a feature vector of the HTML document to be inspected; calculate a similarity between the feature vector of the HTML document to be inspected and each of the feature vectors of the plurality of unauthorized HTML documents; determine, based on each of the calculated similarities and a threshold value, whether or not the inspection target Web page is an unauthorized Web page; and output the result of the determination to the output unit.

According to the present embodiment, the unauthorized Web page detection device, the control method of the unauthorized Web page detection device, and the control program make it possible to determine with high accuracy whether or not the Web page is an unauthorized Web page.

The objects and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims. Both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, which is set forth in the following claims.

FIG. 1 is a diagram illustrating an example of a processing outline in an unauthorized Web page detection device. FIG. 2 is a diagram illustrating an example of a schematic configuration of a communication system 1. FIG. 3 is a diagram illustrating an example of a schematic configuration of an unauthorized Web page detection device 4. FIG. 4A is a diagram illustrating an example of a data structure of an unauthorized Web page table, and FIG. 4B is a diagram illustrating an example of a data structure of a regular Web page table. FIG. 5 is a flowchart illustrating an example of an operation of the unauthorized Web page detection device 4. FIG. 6 is a flowchart illustrating an example of the initial processing. FIG. 7 is a flowchart illustrating an example of the inspection processing. FIG. 8A is an example of input data to the morphological analysis unit 433, and FIG. 8B is an example of output data of the morphological analysis unit 433. FIG. 9 is a diagram illustrating an example of a feature vector processing outline. (a) to (d) are diagrams illustrating examples of screens displayed by the terminal 2.

Hereinafter, various embodiments of the present invention will be described with reference to the drawings. However, it should be noted that the technical scope of the present invention is not limited to these embodiments, but extends to the inventions described in the claims and their equivalents.

FIG. 1 is a diagram showing an example of a processing outline in the unauthorized Web page detection device.

The unauthorized web page detection device stores a plurality of unauthorized HTML documents constituting each of a plurality of known unauthorized web pages. The fraudulent Web page is a Web page used in phishing scams, and the URL of a known fraudulent Web page is provided by an organization such as the Anti-Phishing Council, for example. The Web page includes an HTML document and an image described in the HTML document.

First, the unauthorized Web page detection device calculates, for each of the plurality of unauthorized HTML documents, feature vectors 1 to n based on the related state of a plurality of character strings in each HTML document. A character string is an HTML tag or a word. The related state of a plurality of character strings is a relation between the character strings, for example, the arrangement of a predetermined plurality of character strings in each HTML document. The plurality of character strings may include HTML tags and words, and may be continuous character strings. The feature vector is a vector having a plurality of dimensions, for example, 1 × 150. Each feature vector is calculated such that the feature vectors of two HTML documents with similar character string arrangements are more similar to each other than the feature vectors of two HTML documents with dissimilar arrangements.

Next, the unauthorized Web page detection device acquires the inspection target HTML document included in the inspection target Web page. The inspection target Web page is a Web page that is inspected to determine whether or not it is a Web page used in phishing fraud, and is, for example, a Web page to which a terminal other than the unauthorized Web page detection device has requested access. The unauthorized Web page detection device calculates the feature vector A for the inspection target HTML document in the same way as for the unauthorized HTML documents.

Next, the unauthorized Web page detection device calculates similarities 1 to n between the calculated feature vector A and each of the feature vectors 1 to n.

Next, the unauthorized Web page detection device determines whether the inspection target Web page is an unauthorized Web page by comparing the maximum value of the calculated similarities 1 to n with a threshold value. When the maximum value of the similarities 1 to n is equal to or greater than the threshold value, the unauthorized Web page detection device regards the inspection target Web page as similar to the unauthorized Web page corresponding to the feature vector for which the maximum similarity was calculated, and determines that the inspection target Web page is an unauthorized Web page.

The unauthorized Web page detection device calculates a feature vector based on a related state of a plurality of character strings in each HTML document for each of a plurality of known unauthorized HTML documents and the inspection target HTML document. The unauthorized Web page detection device determines whether the inspection target Web page is an unauthorized Web page based on the similarity of the feature vectors. Unauthorized Web pages are often generated by a common tool, and a plurality of unauthorized Web pages generated by a common tool have common features attributable to the tool and are likely to be similar. Therefore, even if the URL of the inspection target Web page is different from the URL of any known unauthorized Web page, the unauthorized Web page detection device can use the feature vector of the HTML document to determine with high accuracy whether the inspection target Web page is an unauthorized Web page.

<Embodiment>
FIG. 2 is a diagram illustrating an example of a schematic configuration of the communication system 1.

The communication system 1 includes a terminal 2, a Web server 3, an unauthorized Web page detection device 4, and the like. The terminal 2, the Web server 3, and the unauthorized Web page detection device 4 are connected via a communication network 5 such as the Internet.

The terminal 2 is a terminal used by the user for browsing Web pages. The terminal 2 communicates with the Web server 3 and the unauthorized Web page detection device 4 via the communication network 5 by a communication method such as TCP/IP (Transmission Control Protocol/Internet Protocol) and performs display according to the content of the communication.

The Web server 3 is a server that transmits a Web page in response to a request from the terminal 2 or the unauthorized Web page detection device 4. The Web server 3 communicates with the terminal 2 and the unauthorized Web page detection device 4 via the communication network 5 by a communication method such as TCP/IP.

When the terminal 2 accesses a Web page of the Web server 3 by specifying a URL, the terminal 2 transmits the same URL to the unauthorized Web page detection device 4. The unauthorized Web page detection device 4 requests the HTML document from the Web server 3 by specifying the transmitted URL, and receives the HTML document from the Web server 3. The unauthorized Web page detection device 4 determines whether the received HTML document is an unauthorized HTML document, and transmits the determination result to the terminal 2. The terminal 2 displays either the Web page transmitted from the Web server 3 or a warning screen, according to the transmitted determination result.

FIG. 3 is a diagram showing an example of a schematic configuration of the unauthorized Web page detection device 4.

The unauthorized Web page detection device 4 includes a communication unit 41, a storage unit 42, and a processing unit 43.

The communication unit 41 has a wired communication interface circuit such as a wired LAN or a wireless communication interface circuit such as a wireless LAN. The communication unit 41 communicates with the terminal 2, the Web server 3, and the like via the communication network 5 by a communication method such as TCP/IP. The communication unit 41 supplies data received from the terminal 2, the Web server 3, and the like to the processing unit 43. The communication unit 41 transmits the data supplied from the processing unit 43 to the terminal 2, the Web server 3, and the like. The communication unit 41 is an example of an output unit.

The storage unit 42 has, for example, at least one of a semiconductor memory, a magnetic disk device, and an optical disk device. The storage unit 42 stores a driver program, an operating system program, an application program, data, and the like used for processing by the processing unit 43.

For example, the storage unit 42 stores a communication device driver program for controlling the communication unit 41 as a driver program. Further, the storage unit 42 stores a connection control program or the like according to a communication method such as TCP/IP as an operating system program. Further, the storage unit 42 stores a data processing program for transmitting and receiving various data and the like as an application program. The computer programs may be installed in the storage unit 42 from a computer-readable portable recording medium such as a CD-ROM (Compact Disk Read Only Memory) or a DVD-ROM (Digital Versatile Disk Read Only Memory) using a known setup program.

The storage unit 42 stores an unauthorized Web page table, a regular Web page table, and the like as data. The details of the unauthorized Web page table and the regular Web page table will be described later.

The processing unit 43 has one or a plurality of processors and their peripheral circuits, and controls the overall operation of the unauthorized Web page detection device 4. The processing unit 43 is, for example, a CPU (Central Processing Unit). Note that the processing unit 43 may be a DSP (digital signal processor), an LSI (large scale integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or the like.

The processing unit 43 controls operations of the communication unit 41 and the like so that various processes of the unauthorized Web page detection device 4 are executed in an appropriate procedure according to a program or the like stored in the storage unit 42. The processing unit 43 executes a process based on a program (a driver program, an operating system program, an application program, etc.) stored in the storage unit 42. Further, the processing unit 43 can execute a plurality of programs (such as application programs) in parallel.

The processing unit 43 includes an acquisition unit 431, a preprocessing unit 432, a morphological analysis unit 433, a vector calculation unit 434, a similarity calculation unit 435, a determination unit 436, a determination result output unit 437, and the like. Each of these units included in the processing unit 43 is a functional module implemented by a program executed on a processor included in the processing unit 43. Alternatively, each of these units included in the processing unit 43 may be mounted on the unauthorized Web page detection device 4 as an independent integrated circuit, microprocessor, or firmware.

FIG. 4A is a diagram showing an example of the data structure of the unauthorized Web page table.

In the unauthorized Web page table, an ID for identifying an unauthorized Web page, a URL indicating the unauthorized Web page, an unauthorized HTML document included in the unauthorized Web page, a feature vector calculated based on the unauthorized HTML document, and the like are stored in association with one another. A plurality of unauthorized HTML documents are stored in the unauthorized Web page table, and the plurality of unauthorized HTML documents constitute the respective unauthorized Web pages. Note that the feature vector may be stored in the storage unit 42 in association with an ID, a URL, and the like, separately from the unauthorized Web page table. The URL does not have to be included in the unauthorized Web page table. Regardless of whether or not the feature vector is stored in the unauthorized Web page table, the storage unit 42 stores, for each of the plurality of unauthorized HTML documents constituting each of the plurality of unauthorized Web pages, the feature vector based on the related state of the plurality of character strings in that HTML document.

FIG. 4B is a diagram showing an example of the data structure of the regular Web page table.

In the regular Web page table, an ID for identifying a regular Web page, a regular URL indicating the regular Web page, a regular HTML document included in the regular Web page, a feature vector calculated based on the regular HTML document, and the like are stored in association with one another. The feature vector may be stored in the storage unit 42 in association with an ID, a regular URL, or the like, separately from the regular Web page table. Regardless of whether or not the feature vectors are stored in the regular Web page table, the storage unit 42 stores the feature vector of each of the plurality of regular HTML documents constituting each of the plurality of regular Web pages in association with the regular URL indicating the regular Web page.
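One possible in-memory shape for the two tables is sketched below. The class name `WebPageRecord`, its field names, and the helper `build_table` are hypothetical; the source only specifies that an ID, a URL, an HTML document, and a feature vector are stored in association with one another.

```python
from dataclasses import dataclass, field


@dataclass
class WebPageRecord:
    """One row of the unauthorized or regular Web page table (illustrative)."""
    page_id: int
    url: str
    html: str
    # Empty until the initial processing calculates and stores the vector.
    feature_vector: list[float] = field(default_factory=list)


def build_table(records: list[WebPageRecord]) -> dict[int, WebPageRecord]:
    """Index the rows by ID so that a row can be looked up directly."""
    return {record.page_id: record for record in records}
```

The same record type serves both tables here because the described columns are identical; the two tables would simply be kept as separate dictionaries.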

FIG. 5 is a flowchart showing an example of the operation of the unauthorized Web page detection device 4.

Hereinafter, an example of the operation of the unauthorized Web page detection device 4 will be described with reference to the flowchart shown in FIG. 5. The operation described below is mainly executed by the processing unit 43 in cooperation with each element based on a program stored in the storage unit 42 in advance.

First, the acquisition unit 431 reads out the unauthorized Web page table and the regular Web page table from the storage unit 42, and acquires a plurality of unauthorized HTML documents and a plurality of regular HTML documents, respectively (step S11).

Next, the unauthorized Web page detection device 4 executes the initial processing (step S12). In the initial processing, the vector calculation unit 434 of the unauthorized Web page detection device 4 calculates a feature vector for each of the plurality of unauthorized HTML documents and the plurality of regular HTML documents. Details of the initial processing will be described later. The processing in steps S11 and S12 is executed immediately after the unauthorized Web page detection device 4 is started.

Next, the acquisition unit 431 of the unauthorized Web page detection device 4 waits until a URL is received from the terminal 2 (Step S13). The terminal 2 transmits a Web page transmission request to the Web server 3 by specifying a URL, and transmits the same URL to the unauthorized Web page detection device 4. The acquisition unit 431 of the unauthorized Web page detection device 4 receives the URL transmitted from the terminal 2 via the communication unit 41, and acquires the URL as the inspection target URL indicating the inspection target Web page.

Next, the acquisition unit 431 transmits an HTML document transmission request specifying the acquired URL to the Web server 3 via the communication unit 41 (step S14).

Next, when receiving the transmission request of the HTML document, the Web server 3 transmits the HTML document specified by the URL to the unauthorized Web page detection device 4. The acquisition unit 431 receives the HTML document from the Web server 3 via the communication unit 41, and acquires the HTML document as the inspection-target HTML document constituting the inspection-target Web page (Step S15).

Next, the determination unit 436 of the unauthorized Web page detection device 4 performs an inspection process on the inspection-target HTML document (step S16). The determining unit 436 determines whether or not the inspection target Web page including the inspection target HTML document is an unauthorized Web page in the inspection processing. Details of the inspection processing will be described later.

Next, the determination result output unit 437 outputs the determination result in the inspection process by transmitting it to the terminal 2 via the communication unit 41 (step S17). Next, the determination result output unit 437 returns the processing to step S13, and repeats the processing from step S13 to step S17.

On the other hand, when receiving the determination result, the terminal 2 identifies the received determination result. When the determination result indicates a regular Web page, the terminal 2 displays the Web page received from the Web server 3. When the determination result indicates an unauthorized Web page, the terminal 2 displays a warning screen without displaying the Web page received from the Web server 3.

Note that the terminal 2 may receive and display the Web page from the Web server 3 before receiving the determination result indicating that the Web page is the unauthorized Web page from the unauthorized Web page detection device 4. In that case, the terminal 2 displays a warning screen instead of the displayed Web page.

FIG. 6 is a flowchart showing an example of the initial processing. The initial processing is executed in step S12 of FIG. 5.

First, the preprocessing unit 432 performs preprocessing on each of the plurality of unauthorized HTML documents and the plurality of regular HTML documents acquired in step S11 (step S21). As preprocessing, the preprocessing unit 432 analyzes the contents of each HTML document based on HTML grammar rules and deletes some characters in each HTML document based on the analysis result. For example, the preprocessing unit 432 deletes line feed codes, which are control characters indicating a line feed in each HTML document, blank characters before and after the line feed codes, comment character strings, JavaScript execution code, and the like. Further, the preprocessing unit 432 may delete the URL paths described in the HTML tags of each HTML document, may delete some of the HTML tags, and may process other parts of each HTML document.
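The deletions listed above can be approximated with a short sketch. It uses regular expressions rather than a real HTML parser, which is a simplification and an assumption: the source says the contents are analyzed based on HTML grammar rules, and the function name `preprocess` is hypothetical.

```python
import re

# Comment strings such as <!-- ... -->, possibly spanning several lines.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# <script> elements including the JavaScript code they contain.
_SCRIPT = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)
# Line feed codes together with the blanks immediately before and after them.
_LINE_BREAK = re.compile(r"[ \t]*[\r\n]+[ \t]*")


def preprocess(html: str) -> str:
    """Delete comments, script elements, and line breaks with adjacent blanks."""
    text = _COMMENT.sub("", html)
    text = _SCRIPT.sub("", text)
    text = _LINE_BREAK.sub("", text)
    return text
```

For example, `preprocess("<p>a</p>\n  <!-- x -->\n<p>b</p>")` yields `"<p>a</p><p>b</p>"`. Regular expressions of this kind can mis-handle unusual markup, so a grammar-aware parser would be the more robust choice in practice.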

Next, the morphological analysis unit 433 performs a morphological analysis process on each HTML document processed by the preprocessing unit 432 (step S22). The morphological analysis unit 433 performs morphological analysis on each HTML document, thereby converting the contents of each HTML document into a set of a plurality of character strings. The morphological analysis unit 433 performs the morphological analysis process using a known morphological analysis engine such as MeCab. In the morphological analysis process, the morphological analysis unit 433 treats, for example, each HTML tag such as <p> and each word other than an HTML tag as one character string.
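The idea of treating each HTML tag and each word as one character string can be shown with a simplified stand-in. This sketch does not use MeCab: it splits words on whitespace, which is an assumption that only suits space-delimited text, whereas a morphological analysis engine would also segment Japanese text. The function name `tokenize` is hypothetical.

```python
import re

# Either a whole HTML tag ("<...>") or a run of non-space, non-"<" characters.
_TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(preprocessed_html: str) -> list[str]:
    """Split an HTML document into character strings: tags and words."""
    return _TOKEN.findall(preprocessed_html)
```

Calling `tokenize("<p>Please log in</p>")` returns `["<p>", "Please", "log", "in", "</p>"]`, so the order of tags and words in the document is preserved for the later feature vector calculation.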

Next, the vector calculation unit 434 calculates, for each HTML document processed by the morphological analysis unit 433, a feature vector based on the associated state of a plurality of character strings in each HTML document (step S23).

The vector calculation unit 434 calculates the feature vector using a learning device that has been trained in advance to output a feature vector of an HTML document when an HTML document having a plurality of character strings is input. This learning device is trained in advance on HTML documents of existing Web pages by, for example, a neural network, and is stored in the storage unit 42 in advance. The learning device is trained to output similar feature vectors for HTML documents in which the arrangement of character strings is similar, and to output dissimilar feature vectors for HTML documents in which the arrangement of character strings is not similar. This training is performed using a known method such as Doc2Vec. The HTML documents used for the training are, for example, Wikipedia HTML documents.

Note that the vector calculation unit 434 may calculate the feature vector without using a learning device. In that case, the vector calculation unit 434 calculates a feature vector whose elements are the numbers of appearances, in each document, of each of a predetermined number (two or more) of character strings. The predetermined character strings are set in advance and stored in the storage unit 42. In this case, the related state of a plurality of character strings is the magnitude relation of the numbers of appearances of the character strings, and for similar HTML documents, the magnitude relations of the numbers of appearances are similar. Therefore, the vector calculation unit 434 calculates similar feature vectors for HTML documents in which the numbers of appearances of the character strings are similar to each other, and calculates dissimilar feature vectors for HTML documents in which the numbers of appearances of the character strings are not similar.
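The count-based alternative can be sketched in a few lines. The function name `count_vector` and the example vocabulary are hypothetical; the source only states that the elements are the numbers of appearances of predetermined character strings.

```python
from collections import Counter


def count_vector(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Return one element per predetermined character string: its number
    of appearances among the document's character strings."""
    counts = Counter(tokens)
    # Counter returns 0 for strings that never appear in the document.
    return [counts[term] for term in vocabulary]
```

With the vocabulary `["<p>", "<form>", "login"]`, the tokens `["<p>", "login", "<p>", "password"]` give `[2, 0, 1]`. Because the vocabulary is fixed in advance, every document maps to a vector of the same length, which is what allows the vectors to be compared.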

Next, the vector calculation unit 434 stores each of the calculated feature vectors in the unauthorized Web page table or the regular Web page table in association with the corresponding unauthorized HTML document or regular HTML document (step S24). Thus, the series of processing ends.

FIG. 7 is a flowchart showing an example of the inspection processing. The inspection processing is executed in step S16 of FIG. 5.

First, the preprocessing unit 432 performs preprocessing on the inspection target HTML document acquired in step S15 (step S31). This preprocessing is the same as the preprocessing described in step S21 except that the target is an HTML document to be inspected.

Next, the morphological analysis unit 433 performs a morphological analysis process on the inspection-target HTML document processed by the preprocessing unit 432 (step S32). This morphological analysis processing is the same as the morphological analysis processing described in step S22 except that the target is an HTML document to be inspected.

Next, the vector calculation unit 434 calculates the feature vector of the inspection-target HTML document processed by the morphological analysis unit 433 (step S33). This feature vector calculation process is the same as the feature vector calculation process described in step S23 except that the target is the inspection target HTML document. As described for steps S23 and S33, the vector calculation unit 434 calculates, for each of the plurality of unauthorized HTML documents, the plurality of regular HTML documents, and the inspection target HTML document, the feature vector based on the related state of the plurality of character strings in that HTML document.

Next, the similarity calculator 435 calculates the similarity between the feature vector of the inspection-target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents stored in step S24 (step S34).

Next, the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page based on the calculated similarities and the threshold (step S35).

If the maximum value of the similarities is equal to or greater than the threshold value (step S35-Y), the determination unit 436 determines that the inspection target Web page is an unauthorized Web page similar to the unauthorized Web page corresponding to the feature vector for which the maximum similarity was calculated (step S36), and the series of processing ends.

On the other hand, if the maximum value of the similarities is less than the threshold value (step S35-N), the determination unit 436 reads the regular Web page table and acquires a plurality of regular URLs (step S37).

Next, the determination unit 436 determines whether or not the domain name in the URL to be inspected acquired in step S13 matches any of the domain names in the plurality of regular URLs acquired in step S37 (step S38).

If the domain name in the URL to be inspected matches any one of the domain names in the plurality of regular URLs (step S38-Y), the determination unit 436 determines that the Web page to be inspected belongs to a regular Web site, It is determined that the page is not a page (step S39). Thus, a series of processing ends.

If the domain name in the URL to be inspected does not match any of the domain names in the plurality of regular URLs (step S38-N), the determination unit 436 determines that the Web page to be inspected does not belong to a legitimate Web site. . Next, the similarity calculation unit 435 calculates the similarity between the feature vector of the inspection target HTML and each of the feature vectors of the plurality of normal HTML documents (step S40).

Next, the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page by comparing the maximum value of the calculated similarities with a second threshold value (step S41). The second threshold value may be the same as the threshold value used in step S35 or may be a different value.

In step S38, the determination unit 436 has already determined that the inspection target Web page does not belong to a regular Web site. Therefore, when the maximum value of the similarities is equal to or larger than the second threshold, the determination unit 436 determines that the inspection target Web page is an unauthorized Web page that resembles a registered regular Web page (step S42).

On the other hand, when the maximum value of the similarities is less than the second threshold, the inspection target Web page does not belong to a registered regular Web site, and its content is not similar to any of the registered regular Web pages. The determination unit 436 therefore determines that the page is an unregistered regular Web page (step S43). The series of processing then ends.
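The branching in steps S35 to S43 can be illustrated with a minimal Python sketch. It is not part of the disclosed embodiment: the function name `classify`, the label strings, and the default second threshold are assumptions made for illustration, and the host name returned by `urlsplit` is used as a stand-in for the "domain name" of a URL. The similarities are passed in as precomputed lists, whereas the flow above computes the regular-page similarities only when step S40 is reached.

```python
from urllib.parse import urlsplit


def classify(url, sims_unauthorized, sims_regular, regular_urls,
             threshold=0.8, second_threshold=0.8):
    """Return a label for the inspected page, following steps S35 to S43.

    sims_unauthorized: similarities to the stored unauthorized documents (S34)
    sims_regular:      similarities to the stored regular documents (S40)
    regular_urls:      registered regular URLs (S37)
    """
    # S35/S36: similar enough to a known unauthorized page.
    if sims_unauthorized and max(sims_unauthorized) >= threshold:
        return "unauthorized (matches known unauthorized page)"

    # S37 to S39: the host belongs to a registered regular site.
    host = urlsplit(url).hostname
    regular_hosts = {urlsplit(u).hostname for u in regular_urls}
    if host in regular_hosts:
        return "regular (registered domain)"

    # S40 to S42: unregistered host, but the content imitates a regular page.
    if sims_regular and max(sims_regular) >= second_threshold:
        return "unauthorized (imitates registered regular page)"

    # S43: unregistered host and dissimilar content.
    return "regular (unregistered)"
```

For example, a page on an unregistered host whose content scores 0.95 against a registered regular page falls through to step S42 in this sketch.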

FIG. 8A shows an example of input data to the morphological analysis unit 433, and FIG. 8B shows an example of output data of the morphological analysis unit 433.

As shown in FIG. 8A, the input data to the morphological analysis unit 433 is the HTML document of an unauthorized Web page, a regular Web page, or the inspection target Web page, from which the preprocessing unit 432 has deleted some characters such as line feed codes.

As shown in FIG. 8B, the output data of the morphological analysis unit 433 is obtained by performing morphological analysis on the input data, grouping the resulting morphemes into words, and enclosing each word in double quotes. Note that the morphological analysis unit 433 may generate the output data by removing the HTML tags from the input data, performing the morphological analysis, grouping the morphemes into words, and then inserting each HTML tag, enclosed in double quotes, at its original position.

FIG. 9 is a diagram showing an example of a feature vector processing outline.

The storage unit 42 stores the unauthorized HTML documents 1 to n of the plurality of unauthorized Web pages 1 to n. First, in step S23, the vector calculation unit 434 calculates feature vectors 1 to n for the unauthorized HTML documents 1 to n stored in the storage unit 42, respectively. Then, in step S33, the vector calculation unit 434 calculates the feature vector A for the inspection target HTML document of the inspection target Web page acquired by the acquisition unit 431. In step S34, the similarity calculation unit 435 calculates cosine similarities 1 to n between the feature vector A and each of the feature vectors 1 to n. Two feature vectors are similar when their cosine similarity is close to 1 and dissimilar when it is close to -1. In the example shown in FIG. 9, the similarity 1 is 0.9, the similarity 2 is 0.4, and the similarity n is -0.9.

In step S35, the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page by comparing the maximum value of the similarities 1 to n, here 0.9, with a threshold value. For example, when the threshold value is 0.8, the maximum value 0.9 is equal to or larger than the threshold value, and therefore the inspection target Web page is determined to be an unauthorized Web page corresponding to the unauthorized Web page 1.
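The cosine similarity and the maximum-value comparison of steps S34 and S35 can be sketched as follows. This is a minimal illustration assuming feature vectors are plain lists of floats; the function names `cosine_similarity` and `best_match` are hypothetical, and returning 0.0 for a zero-length vector is an assumption the description does not address.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in the range -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar
    return dot / (norm_a * norm_b)


def best_match(target, stored_vectors):
    """Return (index, similarity) of the stored vector most similar to target."""
    sims = [cosine_similarity(target, v) for v in stored_vectors]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


# The values from FIG. 9: the maximum 0.9 meets the example threshold 0.8.
similarities = [0.9, 0.4, -0.9]
is_unauthorized = max(similarities) >= 0.8
```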

FIGS. 10(a) to 10(d) are diagrams showing examples of screens displayed by the terminal 2.

As shown in FIG. 10A, when the user instructs to start the Web browser, the terminal 2 starts and displays the Web browser. The display screen 60 of the Web browser includes a URL input area 61 and a display area 62. When the terminal 2 activates the Web browser, the terminal 2 activates an application program that communicates with the unauthorized Web page detection device 4.

As shown in FIG. 10B, when the user inputs a URL in the URL input area 61 of the display screen 70 of the Web browser, the terminal 2 accesses the Web server 3 indicated by the specified URL and receives a Web page from the Web server 3. Further, the terminal 2 transmits the URL input to the Web browser to the unauthorized Web page detection device 4 according to the application program.

The unauthorized Web page detection device 4 acquires the URL transmitted from the terminal 2 in step S13, executes the processes in steps S14 to S17, and transmits the determination result to the terminal 2.

As shown in FIG. 10C, when the terminal 2 receives from the unauthorized Web page detection device 4 a determination result indicating that the Web page corresponding to the URL transmitted from the terminal 2 is a regular Web page, the terminal 2 displays the Web page 81 received from the Web server 3 on the display screen 80.

As shown in FIG. 10D, when the terminal 2 receives from the unauthorized Web page detection device 4 a determination result indicating that the Web page corresponding to the URL transmitted from the terminal 2 is an unauthorized Web page, the terminal 2 displays a warning screen 90. The data for the warning screen is stored in the terminal 2 in advance. The warning screen 90 shows a character display 91 and an end button 92. The character display 91 is a text warning that the Web page received from the Web server 3 may be a phishing page. When the end button 92 is pressed, the terminal 2 closes the warning screen 90.

As described above, the unauthorized Web page detection device 4 calculates, for each of the plurality of known unauthorized HTML documents and for the inspection target HTML document, a feature vector based on the related state of a plurality of character strings in that HTML document. The unauthorized Web page detection device 4 then determines whether the inspection target Web page is an unauthorized Web page based on the similarity of the calculated feature vectors. Unauthorized Web pages are often generated by a common tool, and a plurality of unauthorized Web pages generated by a common tool share features attributable to the tool and are therefore likely to be similar. For this reason, by using the feature vector of the HTML document, the unauthorized Web page detection device 4 can determine with high accuracy whether the inspection target Web page is an unauthorized Web page, even if its URL differs from the URLs of the known unauthorized Web pages.

In addition, when the domain name in the inspection target URL does not match any of the domain names in the plurality of regular URLs, the unauthorized Web page detection device 4 further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents, and determines whether the inspection target HTML document is similar to a regular HTML document. Therefore, the unauthorized Web page detection device 4 can detect an unauthorized Web page that was created to resemble a regular Web page and has not yet been registered as an unauthorized Web page.

Further, the unauthorized Web page detection device 4 calculates the feature vector based on the related state of a plurality of character strings including HTML tags and words. A plurality of unauthorized Web pages generated by a common tool are likely to exhibit a specific association between HTML tags and words that is attributable to the tool. Because the unauthorized Web page detection device 4 determines whether the state of association between HTML tags and words is similar between the inspection target Web page and each unauthorized Web page, it can detect unauthorized Web pages with higher accuracy.

Further, the unauthorized Web page detection device 4 calculates the feature vector based on the related state of a plurality of consecutive character strings. Web pages that tend to use similar sets of HTML tags and/or words in consecutive character strings are likely to be similar Web pages. Therefore, the unauthorized Web page detection device 4 can detect, with higher accuracy, an unauthorized Web page similar to a Web page registered as an unauthorized Web page.
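To make the idea of a vector built from consecutive character strings concrete, the following Python sketch counts adjacent token pairs (HTML tags and words) and hashes each pair into a fixed-length vector. This is a deliberately simple stand-in chosen for illustration, not the feature vector calculation of the embodiment: the function name `bigram_vector`, the dimension 64, and the use of CRC-32 hashing are all assumptions. Because the entries are counts, cosine similarities between such vectors are never negative, unlike the values shown in FIG. 9.

```python
import zlib


def bigram_vector(tokens, dim=64):
    """Hash each pair of adjacent tokens into one of `dim` count buckets."""
    vec = [0.0] * dim
    for left, right in zip(tokens, tokens[1:]):
        # Join the pair with a separator that cannot occur inside a token.
        key = f"{left}\x00{right}".encode("utf-8")
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(key) % dim] += 1.0
    return vec
```

Two pages produced by the same tool would tend to share many tag and word pairs, so their count vectors would point in similar directions under this stand-in.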

The preprocessing unit 432 may calculate the size of each HTML document generated by the preprocessing in steps S21 and S31. In this case, in step S34, the similarity calculation unit 435 calculates the difference between the calculated size of each of the plurality of unauthorized HTML documents and the calculated size of the inspection target HTML document, and does not calculate the similarity for any unauthorized HTML document whose size difference is equal to or larger than a predetermined value. Similarly, in step S40, the similarity calculation unit 435 calculates the difference between the calculated size of each of the plurality of regular HTML documents and the calculated size of the inspection target HTML document, and does not calculate the similarity for any regular HTML document whose size difference is equal to or larger than a predetermined value.

If the size of the inspection target HTML document clearly differs from the size of an unauthorized HTML document or a regular HTML document, the two HTML documents are clearly different. Therefore, the unauthorized Web page detection device 4 can speed up the inspection process without reducing the accuracy of determining unauthorized Web pages. Note that the similarity calculation unit 435 may calculate the difference between the sizes of the HTML documents before the preprocessing unit 432 performs the preprocessing. Alternatively, the similarity calculation unit 435 may calculate the difference between the sizes of the HTML documents after the morphological analysis unit 433 has performed the morphological analysis processing.
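The size-based skipping described above amounts to a cheap pre-filter applied before any similarity is computed. A minimal sketch follows, assuming sizes are measured in characters or bytes; the function name `candidates_within_size` and the parameter `max_difference` (standing for the predetermined value) are hypothetical.

```python
def candidates_within_size(target_size, stored_sizes, max_difference):
    """Return indices of stored documents that still need a similarity check.

    A stored document is skipped when its size differs from the target size
    by max_difference or more, as in the variation of steps S34 and S40.
    """
    return [
        index
        for index, size in enumerate(stored_sizes)
        if abs(size - target_size) < max_difference
    ]
```

With a target of 1000 and stored sizes 900, 1500, 1100, and 2000, a predetermined value of 500 leaves only the first and third documents for similarity calculation.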

Also, the morphological analysis unit 433 may perform the morphological analysis processing on the regular HTML documents acquired in step S11 and the inspection target HTML document acquired in step S15, instead of on the HTML documents preprocessed by the preprocessing unit 432.

The vector calculation unit 434 may calculate a feature vector for an HTML document that has been preprocessed by the preprocessing unit 432, instead of the HTML document processed by the morphological analysis unit 433. The vector calculation unit 434 may also calculate a feature vector for each of the regular HTML documents acquired in step S11 and the inspection target HTML document acquired in step S15, instead of the HTML document processed by the morphological analysis unit 433. For example, when the HTML document is written in a language such as English, in which words are separated by spaces, the vector calculation unit 434 may calculate the feature vector based on a plurality of character strings obtained by splitting the input HTML document at HTML tag boundaries and at the spaces between words.
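For a space-delimited language, the splitting at tag boundaries and spaces can be approximated with a single regular expression. The sketch below is one assumed reading of that variation: the pattern, the name `tokenize_html`, and the choice to keep each whole tag (including its attributes) as one token are illustrative and not specified in the description. It does not handle comments, scripts, or malformed markup.

```python
import re

# Either a complete tag such as <p> or </a>, or a run of characters that
# contains neither "<" nor whitespace (that is, one word).
TOKEN_PATTERN = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize_html(html):
    """Split an HTML string into a flat list of tag tokens and word tokens."""
    return TOKEN_PATTERN.findall(html)
```

Applied to `<p>Sign in to your account</p>`, this yields the opening tag, the five words, and the closing tag as seven tokens.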

The determination unit 436 may determine whether the number of unauthorized Web pages whose similarity is equal to or greater than the threshold in step S35 is equal to or greater than a predetermined number. For example, the determination unit 436 may determine that the inspection target Web page is an unauthorized Web page when the number of such unauthorized Web pages is equal to or greater than the predetermined number, and may determine that the inspection target Web page is not an unauthorized Web page when that number is less than the predetermined number.

In addition, the processing of steps S37 to S43 may be omitted, and the determination unit 436 may determine that the inspection target Web page is a regular Web page when the maximum value of the similarities calculated in step S34 is less than the threshold.
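The count-based variation of step S35 can be sketched in a few lines. The function name `is_unauthorized_by_count` and the parameter `min_matches` (standing for the predetermined number) are hypothetical.

```python
def is_unauthorized_by_count(similarities, threshold, min_matches):
    """True when at least min_matches stored pages reach the threshold."""
    matches = sum(1 for s in similarities if s >= threshold)
    return matches >= min_matches
```

With the similarities 0.9, 0.4, and -0.9 from FIG. 9 and a threshold of 0.8, only one stored page qualifies, so requiring two matches would not flag the page, whereas the maximum-value rule of step S35 would.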

The determination unit 436 may execute the processing of steps S37 to S38 before the processing of step S31, with the processing advancing to step S40 in the case of step S35-N. For example, the determination unit 436 performs the processing of steps S37 to S38 at the beginning of the inspection processing. In the case of step S38-Y, the determination unit 436 determines, as in step S39, that the inspection target Web page belongs to a regular Web site and is not an unauthorized Web page, and ends the series of processing. In the case of step S38-N, the determination unit 436 advances the processing to step S31.

Further, the storage unit 42 may store, in the unauthorized Web page table and in association with each unauthorized HTML document, URL information indicating the regular URL that the unauthorized HTML document imitates in carrying out the phishing scam. In this case, in step S34, the similarity calculation unit 435 further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents. Then, for each unauthorized HTML document, the similarity calculation unit 435 calculates the average of the similarity to that unauthorized HTML document and the similarity to the regular HTML document associated with the regular URL indicated by its URL information. In step S35, the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page by determining whether the maximum of the average values calculated by the similarity calculation unit 435 is equal to or greater than the threshold value.
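The averaging in this variation can be sketched as follows. The data layout is an assumption: `imitated_index[i]` is taken to hold the position, in the list of regular documents, of the regular page associated with unauthorized document i through the URL information. The function name `max_average_similarity` is hypothetical.

```python
def max_average_similarity(sims_unauthorized, sims_regular, imitated_index):
    """Return the largest per-document average used in the modified step S35.

    sims_unauthorized[i]: similarity to unauthorized document i
    sims_regular[j]:      similarity to regular document j
    imitated_index[i]:    index j of the regular document that unauthorized
                          document i imitates (assumed layout)
    """
    averages = [
        (sim + sims_regular[imitated_index[i]]) / 2.0
        for i, sim in enumerate(sims_unauthorized)
    ]
    return max(averages)
```

For instance, with unauthorized similarities 0.6 and 0.9, regular similarities 0.8 and 0.2, and the two unauthorized documents associated with regular documents 0 and 1 respectively, the averages are 0.7 and 0.55, so 0.7 is compared with the threshold.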

The unauthorized Web page detection device 4 may acquire the URL of a new unauthorized Web page or regular Web page during operation and calculate a feature vector corresponding to that Web page. In this case, the acquisition unit 431 specifies the acquired URL, acquires the unauthorized HTML document or regular HTML document, and registers the acquired URL and HTML document in the unauthorized Web page table or the regular Web page table. The preprocessing unit 432, the morphological analysis unit 433, and the vector calculation unit 434 execute the initial process of step S12 on the newly acquired HTML document and calculate its feature vector.

The unauthorized Web page detection device 4 can calculate the similarity between the feature vector of the inspection target HTML document and the feature vector of the new HTML document without causing the existing learning device to learn the new HTML document. Because the unauthorized Web page detection device 4 can execute the determination using the new HTML document without retraining the learning device on all of the existing HTML documents together with the new HTML document, the processing load can be reduced.

It should be understood that those skilled in the art can make various changes, substitutions and modifications without departing from the spirit and scope of the invention.

4 Unauthorized Web page detection device
42 Storage unit
431 Acquisition unit
434 Vector calculation unit
435 Similarity calculation unit
436 Determination unit
437 Determination result output unit

Claims

1. An unauthorized Web page detection device comprising:
a storage unit that stores, for each of a plurality of unauthorized HTML (HyperText Markup Language) documents constituting a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in that HTML document;
an acquisition unit that acquires an inspection target HTML document constituting an inspection target Web page;
a vector calculation unit that calculates a feature vector of the inspection target HTML document;
a similarity calculation unit that calculates a similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents;
a determination unit that determines whether the inspection target Web page is an unauthorized Web page based on each of the calculated similarities and a threshold value; and
a determination result output unit that outputs a determination result by the determination unit.
2. The unauthorized Web page detection device according to claim 1, wherein
the storage unit further stores the feature vectors of a plurality of regular HTML documents constituting a plurality of regular Web pages, each in association with a regular URL (Uniform Resource Locator) indicating the regular Web page,
the acquisition unit further acquires an inspection target URL indicating the inspection target Web page, and
when the domain name in the inspection target URL does not match any of the domain names in the plurality of regular URLs, the similarity calculation unit further calculates a similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents.
3. The unauthorized Web page detection device according to claim 1, wherein the similarity calculation unit does not calculate the similarity for an unauthorized HTML document when a difference between the size of that unauthorized HTML document and the size of the inspection target HTML document is equal to or greater than a predetermined value.
4. The unauthorized Web page detection device according to claim 1, wherein the plurality of character strings include an HTML tag and a word.
5. The unauthorized Web page detection device according to any one of claims 1 to 4, wherein the plurality of character strings are consecutive character strings.
6. A method for controlling an unauthorized Web page detection device having a storage unit and an output unit, the method comprising, by the unauthorized Web page detection device:
storing in the storage unit, for each of a plurality of unauthorized HTML (HyperText Markup Language) documents constituting a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in that HTML document;
acquiring an inspection target HTML document constituting an inspection target Web page;
calculating a feature vector of the inspection target HTML document;
calculating a similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents;
determining whether the inspection target Web page is an unauthorized Web page based on each of the calculated similarities and a threshold value; and
outputting the result of the determination to the output unit.
7. A control program for an unauthorized Web page detection device having a storage unit and an output unit, the control program causing the unauthorized Web page detection device to execute:
storing in the storage unit, for each of a plurality of unauthorized HTML (HyperText Markup Language) documents constituting a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in that HTML document;
acquiring an inspection target HTML document constituting an inspection target Web page;
calculating a feature vector of the inspection target HTML document;
calculating a similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents;
determining whether the inspection target Web page is an unauthorized Web page based on each of the calculated similarities and a threshold value; and
outputting the result of the determination to the output unit.