WO2020044469A1 - Illicit webpage detection device, illicit webpage detection device control method, and control program - Google Patents

Info

Publication number
WO2020044469A1
Authority
WO
WIPO (PCT)
Prior art keywords
web page
unauthorized
html document
detection device
html
Prior art date
Application number
PCT/JP2018/031993
Other languages
French (fr)
Japanese (ja)
Inventor
隆一 田代
Original Assignee
Bbソフトサービス株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bbソフトサービス株式会社
Priority to PCT/JP2018/031993 (WO2020044469A1)
Priority to JP2020539928A (JP7182764B2)
Publication of WO2020044469A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures

Definitions

  • the present disclosure relates to an unauthorized Web page detection device, a control method of the unauthorized Web page detection device, and a control program.
  • Patent Document 1 discloses a communication control device that prohibits access to a URL (Uniform Resource Locator) of a phishing site.
  • The communication control device is provided on the communication path between the user's terminal and another device with which the terminal communicates. It compares the URL of the access destination content, included in the communication data transmitted by the terminal, with the URLs included in a phishing site list, that is, a blacklist.
  • When the URL of the access destination content matches a URL in the blacklist, the communication control device prohibits access to the content.
  • the purpose of the unauthorized Web page detection device, the control method of the unauthorized Web page detection device, and the control program is to make it possible to determine with high accuracy whether or not the Web page is an unauthorized Web page.
  • The unauthorized Web page detection device includes: a storage unit that stores, for each of a plurality of unauthorized HTML (HyperText Markup Language) documents constituting a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in the HTML document;
  • an acquisition unit that acquires an inspection target HTML document constituting an inspection target Web page; a vector calculation unit that calculates a feature vector of the inspection target HTML document; a similarity calculation unit that calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents;
  • a determination unit that determines whether the inspection target Web page is an unauthorized Web page based on the calculated similarities and a threshold value; and a determination result output unit that outputs the determination result of the determination unit.
  • Preferably, the storage unit further stores the feature vector of each of a plurality of regular HTML documents constituting a plurality of regular Web pages in association with a regular URL (Uniform Resource Locator) indicating the regular Web page.
  • In that case, the acquisition unit further acquires an inspection target URL indicating the inspection target Web page, and, when the domain name in the inspection target URL does not match any of the domain names in the plurality of regular URLs, the similarity calculation unit further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents.
  • Preferably, the similarity calculation unit does not calculate the similarity for an unauthorized HTML document when the difference between the size of that unauthorized HTML document and the size of the inspection target HTML document is equal to or larger than a predetermined value.
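
The size-based skip described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of UTF-8 byte length as the "size", and the parameter `max_size_diff` (standing in for the predetermined value) are all assumptions.

```python
def should_compare(unauthorized_size: int, target_size: int, max_size_diff: int) -> bool:
    """Return True when the size difference is below the predetermined value."""
    # A difference equal to or larger than max_size_diff means "skip this document".
    return abs(unauthorized_size - target_size) < max_size_diff


def candidate_indices(unauthorized_docs: list[str], target_doc: str, max_size_diff: int) -> list[int]:
    """Indices of unauthorized HTML documents whose similarity should still be calculated."""
    # Assumption: document size is measured as the UTF-8 encoded byte length.
    target_size = len(target_doc.encode("utf-8"))
    return [
        i
        for i, doc in enumerate(unauthorized_docs)
        if should_compare(len(doc.encode("utf-8")), target_size, max_size_diff)
    ]
```

Filtering by size before calculating similarities reduces the number of vector comparisons when the stored set of unauthorized HTML documents is large.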
  • the plurality of character strings include an HTML tag and a word.
  • the plurality of character strings are preferably continuous character strings.
  • In the method for controlling an unauthorized Web page detection device having a storage unit and an output unit, the unauthorized Web page detection device: stores in the storage unit, for each of a plurality of unauthorized HTML documents constituting a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in the HTML document; acquires an inspection target HTML document constituting an inspection target Web page; calculates a feature vector of the inspection target HTML document; calculates the similarity between that feature vector and each of the feature vectors of the plurality of unauthorized HTML documents; determines, based on the calculated similarities and a threshold value, whether the inspection target Web page is an unauthorized Web page; and outputs the determination result to the output unit.
  • The control program of the unauthorized Web page detection device having the storage unit and the output unit causes the unauthorized Web page detection device to: store in the storage unit, for each of a plurality of unauthorized HTML documents constituting a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in the HTML document; acquire an inspection target HTML document constituting an inspection target Web page; calculate a feature vector of the inspection target HTML document; calculate the similarity between that feature vector and each of the feature vectors of the plurality of unauthorized HTML documents; determine, based on the calculated similarities and a threshold value, whether the inspection target Web page is an unauthorized Web page; and output the determination result to the output unit.
  • The unauthorized Web page detection device, the control method of the unauthorized Web page detection device, and the control program make it possible to determine with high accuracy whether or not a Web page is an unauthorized Web page.
  • FIG. 1 is a diagram illustrating an example of a processing outline in an unauthorized Web page detection device.
  • FIG. 2 is a diagram illustrating an example of a schematic configuration of a communication system 1.
  • FIG. 3 is a diagram illustrating an example of a schematic configuration of an unauthorized Web page detection device 4.
  • FIG. 4A is a diagram illustrating an example of a data structure of an unauthorized Web page table.
  • FIG. 4B is a diagram illustrating an example of a data structure of a regular Web page table.
  • FIG. 5 is a flowchart illustrating an example of an operation of the unauthorized Web page detection device 4.
  • FIG. 6 is a flowchart illustrating an example of an initial process.
  • FIG. 7 is a flowchart illustrating an example of inspection processing.
  • FIG. 8A and FIG. 8B are diagrams illustrating examples of input data and output data of the morphological analysis unit 433.
  • FIG. 9 is a diagram illustrating an example of a feature vector processing outline.
  • FIG. 10(a) to FIG. 10(d) are diagrams illustrating an example of the screens displayed by the terminal 2.
  • FIG. 1 is a diagram showing an example of a processing outline in the unauthorized Web page detection device.
  • the unauthorized web page detection device stores a plurality of unauthorized HTML documents constituting each of a plurality of known unauthorized web pages.
  • the fraudulent Web page is a Web page used in phishing scams, and the URL of a known fraudulent Web page is provided by an organization such as the Anti-Phishing Council, for example.
  • the Web page includes an HTML document and an image described in the HTML document.
  • the unauthorized Web page detection device calculates, for each of the plurality of unauthorized HTML documents, feature vectors 1 to n based on the associated state of a plurality of character strings in each HTML document.
  • the character string is an HTML tag or a word.
  • the related state of a plurality of character strings is a relation between the character strings, for example, an arrangement relation of a predetermined plurality of character strings in each HTML document.
  • the plurality of character strings may include HTML tags and words, and may be continuous character strings.
  • The feature vector is a vector having a plurality of dimensions, for example, 1 × 150. Each feature vector is calculated such that the feature vectors of two HTML documents whose character string arrangements are similar are more similar to each other than the feature vectors of two dissimilar HTML documents.
  • the unauthorized Web page detection device acquires the inspection target HTML document included in the inspection target Web page.
  • The inspection target Web page is a Web page to be inspected to determine whether it is used in phishing fraud, for example, a Web page that a terminal other than the unauthorized Web page detection device has requested to access.
  • the unauthorized Web page detection device calculates the feature vector A for the inspection-target HTML document, similarly to the unauthorized HTML document.
  • the unauthorized Web page detection device calculates similarities 1 to n between the calculated feature vector A and each of the feature vectors 1 to n.
  • The unauthorized Web page detection device determines whether the inspection target Web page is an unauthorized Web page by comparing the maximum value of the calculated similarities 1 to n with a threshold value.
  • When the maximum similarity is equal to or greater than the threshold value, the unauthorized Web page detection device determines that the inspection target Web page is an unauthorized Web page similar to the unauthorized Web page corresponding to the feature vector for which the maximum similarity was calculated.
  • the unauthorized Web page detection device calculates a feature vector based on a related state of a plurality of character strings in each HTML document for each of a plurality of known unauthorized HTML documents and the inspection target HTML document.
  • the unauthorized Web page detection device determines whether the inspection target Web page is an unauthorized Web page based on the similarity of the feature vectors.
  • Unauthorized Web pages are often generated by a common tool, and a plurality of unauthorized Web pages generated by a common tool share features attributable to the tool and are likely to be similar. Therefore, even if the URL of the inspection target Web page differs from the URLs of the known unauthorized Web pages, the unauthorized Web page detection device can use the feature vector of the HTML document to determine with high accuracy whether the inspection target Web page is an unauthorized Web page.
  • FIG. 2 is a diagram illustrating an example of a schematic configuration of the communication system 1.
  • the communication system 1 includes a terminal 2, a Web server 3, an unauthorized Web page detection device 4, and the like.
  • the terminal 2, the Web server 3, and the unauthorized Web page detection device 4 are connected via a communication network 5 such as the Internet.
  • the terminal 2 is a terminal used by the user for browsing the Web page.
  • The terminal 2 communicates with the Web server 3 and the unauthorized Web page detection device 4 via the communication network 5 by a communication method such as TCP/IP (Transmission Control Protocol/Internet Protocol), and performs display according to the content of the communication.
  • the Web server 3 is a server that transmits a Web page in response to a request from the terminal 2 and the unauthorized Web page detection device 4.
  • The Web server 3 communicates with the terminal 2 and the unauthorized Web page detection device 4 via the communication network 5 by a communication method such as TCP/IP.
  • When the terminal 2 accesses a Web page of the Web server 3 by specifying its URL, the terminal 2 also transmits the same URL to the unauthorized Web page detection device 4.
  • The unauthorized Web page detection device 4 specifies the received URL, requests the Web server 3 to send the HTML document, and receives the HTML document from the Web server 3.
  • the unauthorized Web page detection device 4 determines whether the received HTML document is an unauthorized HTML document, and transmits the determined result to the terminal 2.
  • The terminal 2 displays either the Web page transmitted from the Web server 3 or a warning screen, according to the received determination result.
  • FIG. 3 is a diagram showing an example of a schematic configuration of the unauthorized Web page detection device 4.
  • the unauthorized Web page detection device 4 includes a communication unit 41, a storage unit 42, and a processing unit 43.
  • the communication unit 41 has a wired communication interface circuit such as a wired LAN or a wireless communication interface circuit such as a wireless LAN.
  • the communication unit 41 communicates with the terminal 2, the Web server 3, and the like via the communication network 5 by a communication method such as TCP / IP.
  • the communication unit 41 supplies data received from the terminal 2, the Web server 3, and the like to the processing unit 43.
  • the communication unit 41 transmits the data supplied from the processing unit 43 to the terminal 2, the Web server 3, and the like.
  • the communication unit 41 is an example of an output unit.
  • the storage unit 42 has, for example, at least one of a semiconductor memory, a magnetic disk device, and an optical disk device.
  • the storage unit 42 stores a driver program, an operating system program, an application program, data, and the like used for processing by the processing unit 43.
  • The storage unit 42 stores a communication device driver program for controlling the communication unit 41 as a driver program. Further, the storage unit 42 stores a connection control program or the like according to a communication method such as TCP/IP as an operating system program. Further, the storage unit 42 stores a data processing program for transmitting and receiving various data and the like as an application program.
  • The computer programs may be installed in the storage unit 42 from a computer-readable portable recording medium such as a CD-ROM (Compact Disk Read Only Memory) or a DVD-ROM (Digital Versatile Disk Read Only Memory) using a known setup program.
  • The storage unit 42 stores an unauthorized Web page table, a regular Web page table, and the like as data. The details of the unauthorized Web page table and the regular Web page table will be described later.
  • The processing unit 43 has one or a plurality of processors and their peripheral circuits, and controls the overall operation of the unauthorized Web page detection device 4.
  • the processing unit 43 is, for example, a CPU (Central Processing Unit).
  • The processing unit 43 may be a DSP (Digital Signal Processor), an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or the like.
  • the processing unit 43 controls operations of the communication unit 41 and the like so that various processes of the unauthorized Web page detection device 4 are executed in an appropriate procedure according to a program or the like stored in the storage unit 42.
  • the processing unit 43 executes a process based on a program (a driver program, an operating system program, an application program, etc.) stored in the storage unit 42. Further, the processing unit 43 can execute a plurality of programs (such as application programs) in parallel.
  • the processing unit 43 includes an acquisition unit 431, a preprocessing unit 432, a morphological analysis unit 433, a vector calculation unit 434, a similarity calculation unit 435, a determination unit 436, a determination result output unit 437, and the like.
  • Each of these units included in the processing unit 43 is a functional module implemented by a program executed on a processor included in the processing unit 43.
  • each of these units included in the processing unit 43 may be mounted on the unauthorized Web page detection device 4 as an independent integrated circuit, microprocessor, or firmware.
  • FIG. 4A is a diagram showing an example of the data structure of the unauthorized Web page table.
  • The unauthorized Web page table stores, in association with one another, an ID for identifying an unauthorized Web page, a URL indicating the unauthorized Web page, the unauthorized HTML document included in the unauthorized Web page, a feature vector calculated based on the unauthorized HTML document, and the like.
  • a plurality of malicious HTML documents are stored in the malicious Web page table, and the plurality of malicious HTML documents constitute each of the plurality of malicious Web pages.
  • the feature vector may be stored in the storage unit 42 in association with an ID, a URL, and the like, separately from the unauthorized Web page table.
  • the URL does not have to be included in the unauthorized Web page table.
  • In either case, the storage unit 42 stores, for each of the plurality of unauthorized HTML documents constituting the plurality of unauthorized Web pages, the feature vector based on the related state of the plurality of character strings in the HTML document.
  • FIG. 4B is a diagram showing an example of the data structure of the regular Web page table.
  • The regular Web page table stores, in association with one another, an ID for identifying a regular Web page, a regular URL indicating the regular Web page, the regular HTML document included in the regular Web page, a feature vector calculated based on the regular HTML document, and the like.
  • The feature vector may be stored in the storage unit 42 in association with an ID, a regular URL, or the like, separately from the regular Web page table. Regardless of whether the feature vectors are stored in the regular Web page table, the storage unit 42 stores the feature vector of each of the plurality of regular HTML documents constituting the plurality of regular Web pages in association with the regular URL indicating the regular Web page.
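
As an illustration of the two tables described above, the following sketch models one table row as a record holding the ID, URL, HTML document, and feature vector. The class name `WebPageRecord`, the field names, the dictionary-based tables, and the example URL are hypothetical; the source does not specify a storage format.

```python
from dataclasses import dataclass, field


@dataclass
class WebPageRecord:
    """One row of the unauthorized or regular Web page table (hypothetical layout)."""
    page_id: int                  # ID identifying the Web page
    url: str                      # URL indicating the Web page
    html_document: str            # HTML document included in the Web page
    feature_vector: list[float] = field(default_factory=list)  # filled in by the initial process


# Assumption: each table is keyed by the page ID.
unauthorized_table: dict[int, WebPageRecord] = {}
regular_table: dict[int, WebPageRecord] = {}


def register(table: dict[int, WebPageRecord], record: WebPageRecord) -> None:
    """Store a record in the given table under its ID."""
    table[record.page_id] = record


register(unauthorized_table, WebPageRecord(1, "http://phish.example/login", "<html></html>"))
```

Leaving `feature_vector` empty at registration mirrors the flow in which the vectors are calculated later and then stored in association with each document.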
  • FIG. 5 is a flowchart showing an example of the operation of the unauthorized Web page detection device 4.
  • The acquisition unit 431 reads out the unauthorized Web page table and the regular Web page table from the storage unit 42, and acquires the plurality of unauthorized HTML documents and the plurality of regular HTML documents, respectively (step S11).
  • the unauthorized Web page detection device 4 executes an initial process (step S12).
  • the vector calculation unit 434 of the unauthorized Web page detection device 4 calculates a feature vector for each of a plurality of unauthorized HTML documents and a plurality of normal HTML documents in the initial processing. Details of the initial processing will be described later.
  • the processing in steps S11 and S12 is executed immediately after the unauthorized Web page detection device 4 is started.
  • the acquisition unit 431 of the unauthorized Web page detection device 4 waits until a URL is received from the terminal 2 (Step S13).
  • the terminal 2 transmits a Web page transmission request to the Web server 3 by specifying a URL, and transmits the same URL to the unauthorized Web page detection device 4.
  • the acquisition unit 431 of the unauthorized Web page detection device 4 receives the URL transmitted from the terminal 2 via the communication unit 41, and acquires the URL as the inspection target URL indicating the inspection target Web page.
  • the acquisition unit 431 specifies the acquired URL and transmits an HTML document transmission request to the Web server 3 via the communication unit 41 (step S14).
  • the Web server 3 transmits the HTML document specified by the URL to the unauthorized Web page detection device 4.
  • the acquisition unit 431 receives the HTML document from the Web server 3 via the communication unit 41, and acquires the HTML document as the inspection-target HTML document constituting the inspection-target Web page (Step S15).
  • the determination unit 436 of the unauthorized Web page detection device 4 performs an inspection process on the inspection-target HTML document (step S16).
  • the determining unit 436 determines whether or not the inspection target Web page including the inspection target HTML document is an unauthorized Web page in the inspection processing. Details of the inspection processing will be described later.
  • the determination result output unit 437 outputs the determination result in the inspection process by transmitting it to the terminal 2 via the communication unit 41 (step S17).
  • the determination result output unit 437 returns the processing to step S13, and repeats the processing from step S13 to step S17.
  • When the terminal 2 receives the determination result, it identifies the content of the received determination result.
  • When the determination result indicates a regular Web page, the terminal 2 displays the Web page received from the Web server 3. When the determination result indicates an unauthorized Web page, the terminal 2 displays a warning screen without displaying the Web page received from the Web server 3.
  • the terminal 2 may receive and display the Web page from the Web server 3 before receiving the determination result indicating that the Web page is the unauthorized Web page from the unauthorized Web page detection device 4. In that case, the terminal 2 displays a warning screen instead of the displayed Web page.
  • FIG. 6 is a flowchart showing an example of the initial process. The initial processing is executed in step S12 of FIG. 5.
  • the preprocessing unit 432 performs preprocessing on each of the plurality of unauthorized HTML documents and the plurality of regular HTML documents acquired in step S11 (step S21).
  • As preprocessing, the preprocessing unit 432 analyzes the contents of each HTML document based on HTML grammar rules and deletes some characters in each HTML document based on the analysis result. For example, the preprocessing unit 432 deletes line feed codes (control characters indicating a line feed), blank characters before and after each line feed code, comment character strings, JavaScript execution code, and the like. The preprocessing unit 432 may also delete the URL paths described in the HTML tags of each HTML document, delete parts of the HTML tags, or otherwise process other parts of the HTML document.
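
The deletions listed above can be sketched with regular expressions. This is a simplification under stated assumptions: the source describes analysis based on HTML grammar rules, whereas the sketch below uses pattern matching only, and the function name `preprocess` is hypothetical.

```python
import re

# Assumption: comments and script elements are well formed, so simple patterns suffice.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_SCRIPT = re.compile(r"<script\b.*?</script\s*>", re.DOTALL | re.IGNORECASE)
# A line feed code together with the blank characters before and after it.
_LINE_FEED = re.compile(r"[ \t]*(?:\r\n|\r|\n)[ \t]*")


def preprocess(html: str) -> str:
    """Delete comments, JavaScript code, and line feeds with surrounding blanks."""
    html = _COMMENT.sub("", html)
    html = _SCRIPT.sub("", html)
    html = _LINE_FEED.sub("", html)
    return html
```

For example, `preprocess("<html>\n  <!-- note -->\n  <p>Log in</p>\n</html>\n")` yields `<html><p>Log in</p></html>`. A production implementation would more likely use a real HTML parser so that malformed markup is handled predictably.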
  • the morphological analysis unit 433 performs a morphological analysis process on each HTML document processed by the preprocessing unit 432 (step S22).
  • the morphological analysis unit 433 performs morphological analysis on each HTML document, thereby converting the contents of each HTML document into a set of a plurality of character strings.
  • the morphological analysis unit 433 performs a morphological analysis process using a known morphological analysis engine such as MeCab.
  • The morphological analysis unit 433 performs processing such that, for example, an HTML tag such as <p> and a word other than an HTML tag are each treated as one character string.
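
The splitting into character strings can be illustrated as follows. Note the substitution: the source uses a morphological analysis engine such as MeCab, while this sketch swaps in a plain regular expression that treats each HTML tag as one token and splits the remaining text on whitespace. That stand-in is only adequate for space-delimited languages; the function name `tokenize` is hypothetical.

```python
import re

# Either a complete HTML tag, or a run of characters containing no "<" and no whitespace.
_TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html: str) -> list[str]:
    """Split a preprocessed HTML document into tags and words, in document order."""
    return _TOKEN.findall(html)
```

With this sketch, `tokenize("<p>Log in</p>")` returns `["<p>", "Log", "in", "</p>"]`, so the arrangement of tags and words is preserved for the later feature vector calculation.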
  • the vector calculation unit 434 calculates, for each HTML document processed by the morphological analysis unit 433, a feature vector based on the associated state of a plurality of character strings in each HTML document (step S23).
  • the vector calculation unit 434 calculates a feature vector by a learning device that is pre-trained so as to output a feature vector of the HTML document when an HTML document having a plurality of character strings is input.
  • This learning device is pre-learned using an HTML document of an existing Web page by, for example, a neural network or the like, and is stored in the storage unit 42 in advance.
  • The learning device is trained to output similar feature vectors for HTML documents in which the arrangement of character strings is similar, and dissimilar feature vectors for HTML documents in which the arrangement of character strings is not similar.
  • the learning device executes this learning using a known method such as Doc2Vec.
  • the HTML document used for the pre-learning is, for example, a Wikipedia HTML document.
  • The vector calculation unit 434 may calculate the feature vector without using a learning device. In that case, the vector calculation unit 434 calculates a feature vector whose elements are the numbers of appearances, in each HTML document, of a predetermined number (two or more) of character strings. These character strings are set in advance and stored in the storage unit 42. In this case, the related state of the plurality of character strings is the magnitude relation among the numbers of appearances of the character strings, and similar HTML documents have similar magnitude relations.
  • The vector calculation unit 434 thus calculates similar feature vectors for HTML documents in which the numbers of appearances of the character strings are similar, and dissimilar feature vectors for HTML documents in which they are not.
  • The vector calculation unit 434 stores each calculated feature vector in the unauthorized Web page table or the regular Web page table in association with the corresponding unauthorized HTML document or regular HTML document (step S24). This ends the series of processing.
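
The count-based alternative to the learning device can be sketched as below. The vocabulary contents and the names `VOCABULARY` and `count_vector` are hypothetical; the source states only that a predetermined set of character strings is stored in advance.

```python
from collections import Counter

# Hypothetical predetermined character strings (HTML tags and words).
VOCABULARY = ["<form>", "<input>", "password", "login"]


def count_vector(tokens: list[str], vocabulary: list[str] = VOCABULARY) -> list[float]:
    """Feature vector whose i-th element is the number of appearances of vocabulary[i]."""
    counts = Counter(tokens)
    # Counter returns 0 for strings that never appear, so the length is always fixed.
    return [float(counts[term]) for term in vocabulary]
```

Because every document is mapped onto the same fixed vocabulary, the resulting vectors have equal length and can be compared directly with a similarity measure.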
  • FIG. 7 is a flowchart showing an example of the inspection processing.
  • The inspection processing is executed in step S16 in FIG. 5.
  • the preprocessing unit 432 performs preprocessing on the inspection target HTML document acquired in step S15 (step S31). This preprocessing is the same as the preprocessing described in step S21 except that the target is an HTML document to be inspected.
  • the morphological analysis unit 433 performs a morphological analysis process on the inspection-target HTML document processed by the preprocessing unit 432 (step S32).
  • This morphological analysis processing is the same as the morphological analysis processing described in step S22 except that the target is an HTML document to be inspected.
  • the vector calculation unit 434 calculates the feature vector of the inspection-target HTML document processed by the morphological analysis unit 433 (step S33).
  • This feature vector calculation process is the same as the feature vector calculation process described in step S23 except that the target is an inspection target HTML document.
  • In this way, the vector calculation unit 434 calculates, for each of the plurality of unauthorized HTML documents, the plurality of regular HTML documents, and the inspection target HTML document, the feature vector based on the related state of the plurality of character strings in each HTML document.
  • the similarity calculator 435 calculates the similarity between the feature vector of the inspection-target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents stored in step S24 (step S34).
  • the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page based on the calculated similarities and the threshold (step S35).
  • If the maximum value of the similarities is equal to or greater than the threshold value (step S35-Y), the determination unit 436 determines that the inspection target Web page is an unauthorized Web page similar to the unauthorized Web page corresponding to the feature vector for which the maximum similarity was calculated (step S36), and the series of processing ends.
  • If the maximum value of the similarities is less than the threshold value (step S35-N), the determination unit 436 reads the regular Web page table and acquires the plurality of regular URLs (step S37).
  • The determination unit 436 determines whether the domain name in the inspection target URL acquired in step S13 matches any of the domain names in the plurality of regular URLs acquired in step S37 (step S38).
  • If the domain name in the inspection target URL matches any of the domain names in the regular URLs, the determination unit 436 determines that the inspection target Web page belongs to a regular Web site and is not an unauthorized Web page (step S39). This ends the series of processing.
  • If the domain name in the inspection target URL matches none of the domain names in the regular URLs, the determination unit 436 determines that the inspection target Web page does not belong to a regular Web site.
  • the similarity calculation unit 435 calculates the similarity between the feature vector of the inspection target HTML and each of the feature vectors of the plurality of normal HTML documents (step S40).
  • the determination unit 436 determines whether or not the inspection target Web page is an unauthorized Web page by comparing the calculated maximum value of each similarity with the second threshold value (Step S41).
  • the second threshold value may be the same value as the threshold value used in step S35 or a different value.
  • The determination unit 436 has already determined in step S38 that the inspection target Web page does not belong to a regular Web site. Therefore, when the maximum value of the similarities is equal to or larger than the second threshold value, the determination unit 436 determines that the inspection target Web page is an unauthorized Web page that resembles a registered regular Web page (step S42).
  • When the maximum value of the similarities is less than the second threshold value, the inspection target Web page does not belong to a regular Web site and its content is not similar to any of the regular Web pages. The determination unit 436 therefore determines that it is an unregistered regular Web page (step S43). This ends the series of processing.
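
The branching in steps S34 to S43 can be summarized in one function. This is a sketch under assumptions: the names `Verdict` and `inspect` are hypothetical, the similarity measure is passed in as a parameter rather than fixed, and "domain name" is approximated by the host name returned by `urllib.parse.urlsplit`.

```python
from enum import Enum
from typing import Callable, Sequence
from urllib.parse import urlsplit

Vector = Sequence[float]


class Verdict(Enum):
    UNAUTHORIZED_KNOWN = "similar to a known unauthorized page"     # step S36
    REGULAR = "belongs to a registered regular site"                # step S39
    UNAUTHORIZED_IMITATION = "imitates a registered regular page"   # step S42
    UNREGISTERED_REGULAR = "unregistered regular page"              # step S43


def inspect(
    target_url: str,
    target_vec: Vector,
    unauthorized_vecs: Sequence[Vector],
    regular_pages: Sequence[tuple[str, Vector]],  # (regular URL, feature vector)
    similarity: Callable[[Vector, Vector], float],
    threshold: float,
    second_threshold: float,
) -> Verdict:
    # Steps S34-S35: maximum similarity against the known unauthorized documents.
    best = max((similarity(target_vec, v) for v in unauthorized_vecs), default=float("-inf"))
    if best >= threshold:
        return Verdict.UNAUTHORIZED_KNOWN
    # Steps S37-S38: compare the domain name with those of the regular URLs.
    host = urlsplit(target_url).hostname
    if any(urlsplit(url).hostname == host for url, _ in regular_pages):
        return Verdict.REGULAR
    # Steps S40-S41: maximum similarity against the regular documents.
    best_regular = max((similarity(target_vec, v) for _, v in regular_pages), default=float("-inf"))
    if best_regular >= second_threshold:
        return Verdict.UNAUTHORIZED_IMITATION
    return Verdict.UNREGISTERED_REGULAR
```

The regular-page similarities are calculated only after the domain check fails, which matches the order of the steps and avoids the extra comparisons for pages on registered regular sites.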
  • FIG. 8A shows an example of input data to the morphological analysis unit 433, and FIG. 8B shows an example of output data of the morphological analysis unit 433.
  • The input data to the morphological analysis unit 433 is an HTML document of an unauthorized Web page, a regular Web page, or the inspection target Web page from which the preprocessing unit 432 has deleted some characters, such as line feed codes.
  • The output data of the morphological analysis unit 433 is obtained by performing morphological analysis on the input data, grouping the resulting morphemes into words, and placing each word between quotation marks. Note that the morphological analysis unit 433 may instead generate the output data by removing the HTML tags from the input data, performing morphological analysis, grouping the morphemes into words, and inserting each HTML tag, enclosed in double quotes, at its original position.
  • FIG. 9 is a diagram showing an example of a feature vector processing outline.
  • the storage unit 42 stores the unauthorized HTML documents 1 to n of the plurality of unauthorized Web pages 1 to n.
  • the vector calculation unit 434 calculates feature vectors 1 to n for the unauthorized HTML documents 1 to n of the unauthorized Web pages 1 to n stored in the storage unit 42, respectively.
  • the vector calculation unit 434 calculates the feature vector A for the inspection target HTML document of the inspection target Web page acquired by the acquisition unit 431.
  • the similarity calculation unit 435 calculates cosine similarities 1 to n between the feature vector A and each of the feature vectors 1 to n.
  • the two feature vectors are similar when the cosine similarity is close to 1, and are not similar when the cosine similarity is close to -1.
  • the similarity 1 is 0.9
  • the similarity 2 is 0.4
  • the similarity n is -0.9.
  • in step S35, the determination unit 436 determines whether or not the inspection target Web page is an unauthorized Web page by comparing the maximum value of the similarities 1 to n, here 0.9, with a threshold value.
  • for example, when the threshold value is 0.8, the maximum value 0.9 of the similarities 1 to n is equal to or larger than the threshold value, and therefore the inspection target Web page is determined to be an unauthorized Web page corresponding to the unauthorized Web page 1.
  • FIGS. 10 (a) to 10 (d) are diagrams showing an example of a screen displayed by the terminal 2.
  • when the user instructs the terminal 2 to start the Web browser, the terminal 2 starts and displays the Web browser.
  • the display screen 60 of the Web browser includes a URL input area 61 and a display area 62.
  • when the terminal 2 activates the Web browser, it also activates an application program that communicates with the unauthorized Web page detection device 4.
  • the terminal 2 accesses the Web server 3 indicated by the specified URL and receives a Web page from the Web server 3. Further, the terminal 2 transmits the URL input to the Web browser to the unauthorized Web page detection device 4 according to the application program.
  • the unauthorized Web page detection device 4 acquires the URL transmitted from the terminal 2 in step S13, executes the processes in steps S14 to S17, and transmits the determination result to the terminal 2.
  • when the terminal 2 receives from the unauthorized Web page detection device 4 a determination result indicating that the Web page corresponding to the URL transmitted from the terminal 2 is a regular Web page, the terminal 2 displays the Web page 81 received from the Web server 3 on the display screen 80.
  • when the terminal 2 receives from the unauthorized Web page detection device 4 a determination result indicating that the Web page corresponding to the URL transmitted from the terminal 2 is an unauthorized Web page, the terminal 2 displays a warning screen 90.
  • the data for the warning screen is stored in the terminal 2 in advance.
  • a character display 91 and an end button 92 are displayed.
  • the character display 91 is a text warning that the Web page received from the Web server 3 may be a phishing page.
  • the terminal 2 closes the warning screen 90.
  • the unauthorized Web page detection device 4 calculates a feature vector based on a related state of a plurality of character strings in each HTML document for each of the plurality of known unauthorized HTML documents and the inspection-target HTML document.
  • the unauthorized Web page detection device 4 determines whether or not the inspection target Web page is an unauthorized Web page based on the calculated similarity of the feature vectors.
  • Unauthorized Web pages are often generated by a common tool, and a plurality of unauthorized Web pages generated by a common tool have common features attributable to the tool and are likely to be similar. For this reason, by using the feature vector of the HTML document, the unauthorized Web page detection device 4 can determine with high accuracy whether the inspection target Web page is an unauthorized Web page, even if its URL differs from the URLs of the known unauthorized Web pages.
  • the unauthorized Web page detection device 4 further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents, and determines whether the inspection target HTML document is similar to a regular HTML document. Therefore, the unauthorized Web page detection device 4 can detect an unauthorized Web page that was created to resemble a regular Web page and has not been registered as an unauthorized Web page.
  • the unauthorized Web page detection device 4 calculates a feature vector based on an associated state of a plurality of character strings including an HTML tag and a word.
  • a plurality of unauthorized Web pages generated by a common tool are likely to share a specific association between HTML tags and words that is attributable to the tool.
  • because the unauthorized Web page detection device 4 determines whether the state of association between HTML tags and words is similar between the inspection target Web page and each unauthorized Web page, it can detect unauthorized Web pages with higher accuracy.
  • the unauthorized Web page detection device 4 calculates the feature vector based on the related state of a plurality of continuous character strings. Web pages that tend to use similar sets of HTML tags and/or words in consecutive character strings are likely to be similar Web pages. Therefore, the unauthorized Web page detection device 4 can detect, with higher accuracy, an unauthorized Web page similar to a Web page registered as an unauthorized Web page.
  • the preprocessing unit 432 may calculate the size of each HTML document generated by the preprocessing in steps S21 and S31.
  • the similarity calculation unit 435 calculates the difference between the calculated size of each of the plurality of unauthorized HTML documents and the calculated size of the inspection target HTML document, and does not calculate the similarity for an unauthorized HTML document when the size difference is equal to or larger than a predetermined value.
  • the similarity calculation unit 435 likewise calculates the difference between the calculated size of each of the plurality of regular HTML documents and the calculated size of the inspection target HTML document, and does not calculate the similarity for a regular HTML document when the size difference is equal to or larger than a predetermined value.
  • the unauthorized Web page detection device 4 can speed up the inspection process without reducing the accuracy of determining unauthorized Web pages.
  • the similarity calculation unit 435 may calculate the difference between the sizes of the HTML documents before the preprocessing unit 432 performs the preprocessing.
  • the similarity calculation unit 435 may calculate the difference between the sizes of the HTML documents after the morphological analysis unit 433 has performed the morphological analysis processing.
  • the morphological analysis unit 433 may instead perform the morphological analysis processing on the regular HTML document acquired in step S11 and the inspection target HTML document acquired in step S15.
  • the vector calculation unit 434 may calculate a feature vector for an HTML document that has been preprocessed by the preprocessing unit 432, instead of the HTML document processed by the morphological analysis unit 433.
  • the vector calculation unit 434 may calculate a feature vector for each of the regular HTML documents acquired in step S11 and the inspection target HTML document acquired in step S15, instead of the HTML document processed by the morphological analysis unit 433. For example, when the HTML document is written in a language such as English, in which words are separated by spaces, the vector calculation unit 434 may calculate the feature vector based on a plurality of character strings obtained by splitting the input HTML document at HTML tag boundaries and at the spaces between words.
  • the determination unit 436 may determine whether or not the number of unauthorized Web pages determined to have a similarity equal to or greater than the threshold value in step S35 is equal to or greater than a predetermined number. For example, the determination unit 436 may determine that the inspection target Web page is an unauthorized Web page when that number is equal to or greater than the predetermined number, and that the inspection target Web page is not an unauthorized Web page when that number is less than the predetermined number.
  • steps S37 to S43 may be omitted, and the determination unit 436 may determine that the inspection target Web page is a regular Web page when the maximum value of the similarities calculated in step S34 is less than the threshold value.
  • the processing of steps S37 to S38 by the determination unit 436 may be moved to before the processing of step S31, and the processing may advance to step S40 in the case of step S35-N.
  • the determination unit 436 performs the processing of steps S37 to S38 at the beginning of the inspection processing.
  • the determination unit 436 determines that the inspection target Web page belongs to the legitimate Web site and is not an unauthorized Web page, and ends a series of processes, as in step S39.
  • the determination unit 436 advances the processing to step S31.
  • the storage unit 42 may further store, in association with each unauthorized HTML document in the unauthorized Web page table, URL information indicating the regular URL that the unauthorized HTML document impersonates in the phishing scam.
  • the similarity calculation unit 435 further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents. Then, for each unauthorized HTML document, the similarity calculation unit 435 calculates the average of the similarity to that unauthorized HTML document and the similarity to the regular HTML document associated with the regular URL indicated by its URL information.
  • the determination unit 436 determines in step S35 whether the maximum of the average values calculated by the similarity calculation unit 435 is equal to or greater than a threshold value, thereby determining whether or not the inspection target Web page is an unauthorized Web page.
  • the unauthorized Web page detection device 4 may acquire a URL of a new unauthorized Web page or a legitimate Web page during operation, and calculate a feature vector corresponding to each Web page.
  • the acquisition unit 431 specifies the acquired URL, acquires the unauthorized HTML document or the regular HTML document, and registers the acquired URL and HTML document in the unauthorized Web page table or the regular Web page table.
  • the preprocessing unit 432, the morphological analysis unit 433, and the vector calculation unit 434 execute the initial process of step S12 on the newly acquired HTML document, and calculate a feature vector.
  • the unauthorized Web page detection device 4 can calculate the similarity between the feature vector of the inspection target HTML document and the feature vector of the new HTML document without causing the existing learning device to learn the new HTML document.
  • the unauthorized Web page detection device 4 can execute the determination using the new HTML document without retraining the learning device on all of the existing HTML documents together with the new HTML document, so the processing load can be reduced.
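The incremental registration described in the last few items can be sketched as follows. This is only an illustration under stated assumptions: the class name `VectorRegistry`, its methods, and the in-memory list are hypothetical and are not taken from the patent. It shows that a newly registered feature vector can take part in the next similarity comparison without any stored vector being recomputed.

```python
import math


class VectorRegistry:
    """Hypothetical store of (url, feature vector) pairs for known pages."""

    def __init__(self):
        self.entries = []

    def register(self, url, vector):
        # Only the new document's vector is appended;
        # vectors that are already stored are left untouched.
        self.entries.append((url, list(vector)))

    def best_match(self, query):
        """Return (url, cosine similarity) of the closest stored vector, or None."""

        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            # Assumption: a zero vector is treated as having similarity 0.
            return dot / norms if norms else 0.0

        if not self.entries:
            return None
        return max(
            ((url, cosine(query, vec)) for url, vec in self.entries),
            key=lambda pair: pair[1],
        )
```

A page registered during operation is returned by the very next `best_match` call, which mirrors the point that no relearning over the existing documents is required.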

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are an illicit webpage detection device, illicit webpage detection device control method, and control program, with which a determination as to whether a webpage is an illicit webpage can be made with high precision. This illicit webpage detection device comprises: a storage part for storing feature vectors based on the states of association of a plurality of character strings in each of a plurality of illicit HTML documents which configure each of a plurality of illicit webpages; an acquisition part for acquiring an HTML document to be inspected which configures a webpage to be inspected; a vector computation part for computing a feature vector of the HTML document under inspection; a similarity computation part for computing similarities between the feature vector of the HTML document under inspection and each of the feature vectors of the plurality of illicit HTML documents; a determination part for, on the basis of each of the computed similarities and a threshold, determining whether the webpage under inspection is an illicit webpage; and a determination result output part for outputting the result of the determination performed by the determination part.

Description

Unauthorized Web page detection device, control method of unauthorized Web page detection device, and control program
The present disclosure relates to an unauthorized Web page detection device, a control method of the unauthorized Web page detection device, and a control program.
To cope with the increase in phishing scams on the Internet, technologies for preventing damage from phishing scams are becoming widespread.
For example, Patent Document 1 describes a communication control device that prohibits access to the URL (Uniform Resource Locator) of a phishing site. The communication control device is provided on a communication path between a user's terminal and another device with which the terminal communicates, and compares the URL of the access destination content included in communication data transmitted by the terminal with the URLs included in a phishing site list, that is, a blacklist. When the URL of the content accessed by the terminal matches a URL included in the blacklist, the communication control device prohibits access to that content.
International Publication No. WO 2006/087908
In recent years, tools for building phishing sites have circulated widely among criminals who commit phishing scams, and by using such tools criminals can generate phishing sites easily and in a short time. A criminal uses a tool to generate a new phishing site, lures users to an unauthorized Web page on the new site to carry out the phishing scam, and then closes the generated phishing site, all within a short period. The criminal can carry out the phishing scam before the unauthorized Web page is placed on a blacklist, so the conventional blacklist method may fail to detect the unauthorized Web page.
The purpose of the unauthorized Web page detection device, the control method of the unauthorized Web page detection device, and the control program is to make it possible to determine with high accuracy whether or not a Web page is an unauthorized Web page.
The unauthorized Web page detection device according to the present embodiment includes: a storage unit that stores feature vectors of a plurality of unauthorized HTML (HyperText Markup Language) documents constituting each of a plurality of unauthorized Web pages, each feature vector being based on the related state of a plurality of character strings in the HTML document; an acquisition unit that acquires an inspection target HTML document constituting an inspection target Web page; a vector calculation unit that calculates a feature vector of the inspection target HTML document; a similarity calculation unit that calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents; a determination unit that determines whether or not the inspection target Web page is an unauthorized Web page based on each calculated similarity and a threshold value; and a determination result output unit that outputs the result of the determination by the determination unit.
In the unauthorized Web page detection device according to the present embodiment, it is preferable that the storage unit further stores feature vectors of a plurality of regular HTML documents constituting each of a plurality of regular Web pages in association with regular URLs (Uniform Resource Locators) indicating the regular Web pages, that the acquisition unit further acquires an inspection target URL indicating the inspection target Web page, and that, when the domain name in the inspection target URL matches none of the domain names in the plurality of regular URLs, the similarity calculation unit further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents.
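The domain-name condition above can be illustrated with a short sketch. The function names `domain_of` and `needs_regular_comparison` are hypothetical, and comparing full host names exactly is an assumption made here; the text only says that the domain names are compared and does not say how a domain name is extracted from a URL.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def needs_regular_comparison(target_url, regular_urls):
    """True when the target's host name matches none of the regular URLs' host names.

    Assumption: an exact host-name match stands in for the "domain name"
    comparison described in the text.
    """
    regular_domains = {domain_of(u) for u in regular_urls}
    return domain_of(target_url) not in regular_domains
```

Only when this check returns `True` would the similarity to the regular HTML documents be calculated, since a page served from a registered regular domain does not need that comparison.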
In the unauthorized Web page detection device according to the present embodiment, it is preferable that the similarity calculation unit does not calculate the similarity for an unauthorized HTML document when the difference between the size of that unauthorized HTML document and the size of the inspection target HTML document is equal to or larger than a predetermined value.
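A minimal sketch of this size-based skip is shown below. The function name `select_by_size`, its parameters, and the choice of measuring size as a plain number (for example, a byte count) are assumptions for illustration; the text does not specify the unit or the predetermined value.

```python
def select_by_size(target_size, known_sizes, max_difference):
    """Return indices of known documents whose size is close to the target.

    A document whose size differs from the target by max_difference or more
    is skipped, so no similarity has to be computed for it.
    """
    return [
        index
        for index, size in enumerate(known_sizes)
        if abs(size - target_size) < max_difference
    ]
```

For instance, `select_by_size(1000, [950, 5000, 1200, 1100], 200)` keeps only the documents at indices 0 and 3, because a difference of exactly 200 already counts as "equal to or larger than" the predetermined value.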
In the unauthorized Web page detection device according to the present embodiment, it is preferable that the plurality of character strings include HTML tags and words.
In the unauthorized Web page detection device according to the present embodiment, it is preferable that the plurality of character strings are continuous character strings.
A control method according to the present embodiment, for an unauthorized Web page detection device having a storage unit and an output unit, includes the unauthorized Web page detection device: storing, in the storage unit, feature vectors of a plurality of unauthorized HTML documents constituting each of a plurality of unauthorized Web pages, each feature vector being based on the related state of a plurality of character strings in the HTML document; acquiring an inspection target HTML document constituting an inspection target Web page; calculating a feature vector of the inspection target HTML document; calculating the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents; determining whether or not the inspection target Web page is an unauthorized Web page based on each calculated similarity and a threshold value; and outputting the result of the determination to the output unit.
A control program according to the present embodiment, for an unauthorized Web page detection device having a storage unit and an output unit, causes the unauthorized Web page detection device to: store, in the storage unit, feature vectors of a plurality of unauthorized HTML documents constituting each of a plurality of unauthorized Web pages, each feature vector being based on the related state of a plurality of character strings in the HTML document; acquire an inspection target HTML document constituting an inspection target Web page; calculate a feature vector of the inspection target HTML document; calculate the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents; determine whether or not the inspection target Web page is an unauthorized Web page based on each calculated similarity and a threshold value; and output the result of the determination to the output unit.
According to the present embodiment, the unauthorized Web page detection device, the control method of the unauthorized Web page detection device, and the control program make it possible to determine with high accuracy whether or not a Web page is an unauthorized Web page.
The objects and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. Both the foregoing general description and the following detailed description are exemplary and explanatory and do not restrict the invention as set forth in the claims.
FIG. 1 is a diagram illustrating an example of a processing outline in the unauthorized Web page detection device.
FIG. 2 is a diagram illustrating an example of a schematic configuration of the communication system 1.
FIG. 3 is a diagram illustrating an example of a schematic configuration of the unauthorized Web page detection device 4.
FIG. 4(a) is a diagram illustrating an example of the data structure of the unauthorized Web page table, and FIG. 4(b) is a diagram illustrating an example of the data structure of the regular Web page table.
FIG. 5 is a flowchart illustrating an example of the operation of the unauthorized Web page detection device 4.
FIG. 6 is a flowchart illustrating an example of the initial process.
FIG. 7 is a flowchart illustrating an example of the inspection process.
FIG. 8(a) is an example of input data to the morphological analysis unit 433, and FIG. 8(b) is an example of output data of the morphological analysis unit 433.
FIG. 9 is a diagram illustrating an example of a feature vector processing outline.
FIGS. 10(a) to 10(d) are diagrams illustrating an example of a screen displayed by the terminal 2.
Hereinafter, various embodiments of the present invention will be described with reference to the drawings. Note, however, that the technical scope of the present invention is not limited to these embodiments and extends to the inventions described in the claims and their equivalents.
FIG. 1 is a diagram illustrating an example of a processing outline in the unauthorized Web page detection device.
The unauthorized Web page detection device stores a plurality of unauthorized HTML documents constituting each of a plurality of known unauthorized Web pages. An unauthorized Web page is a Web page used in a phishing scam, and the URLs of known unauthorized Web pages are provided, for example, by organizations such as the Council of Anti-Phishing Japan. A Web page includes an HTML document and the images and the like referenced in the HTML document.
First, the unauthorized Web page detection device calculates, for each of the plurality of unauthorized HTML documents, feature vectors 1 to n based on the related state of a plurality of character strings in each HTML document. A character string is an HTML tag or a word. The related state of a plurality of character strings is the relationship among the character strings, for example, the arrangement of a predetermined plurality of character strings within each HTML document. The plurality of character strings may include HTML tags and words, and may be continuous character strings. The feature vector is a multi-dimensional vector, for example a 1 × 150 vector. Each feature vector is calculated so that the feature vectors of two HTML documents with similar character string arrangements are more similar to each other than the feature vectors of two dissimilar HTML documents.
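The text does not say how the feature vector is actually computed. Purely as a stand-in, the sketch below hashes each pair of consecutive tokens (HTML tags or space-separated words) into one of 150 buckets and counts the occurrences, so that documents with similar arrangements of consecutive character strings get similar vectors. The names `tokenize` and `feature_vector`, the regular expression, and the hashing scheme are all assumptions and are not the patent's method; Japanese text would first need morphological analysis to be split into words.

```python
import hashlib
import re

DIMENSIONS = 150  # matches the 1 x 150 example size mentioned in the text


def tokenize(html):
    """Split an HTML string into tags and whitespace-separated words."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def feature_vector(html, dimensions=DIMENSIONS):
    """Count hashed pairs of consecutive tokens in a fixed-length vector."""
    tokens = tokenize(html)
    vector = [0.0] * dimensions
    for first, second in zip(tokens, tokens[1:]):
        # A stable hash is used so the same pair always lands in the same bucket.
        digest = hashlib.sha256(f"{first}\x00{second}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dimensions
        vector[bucket] += 1.0
    return vector
```

Any method that maps similar token arrangements to nearby vectors could be substituted here; the hashed pair count is chosen only because it is short and deterministic.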
Next, the unauthorized Web page detection device acquires the inspection target HTML document included in the inspection target Web page. The inspection target Web page is a Web page to be inspected as to whether or not it is used in a phishing scam, for example a Web page to which a terminal other than the unauthorized Web page detection device has requested access. The unauthorized Web page detection device calculates a feature vector A for the inspection target HTML document in the same way as for the unauthorized HTML documents.
Next, the unauthorized Web page detection device calculates similarities 1 to n between the calculated feature vector A and each of the feature vectors 1 to n.
Next, the unauthorized Web page detection device determines whether or not the inspection target Web page is an unauthorized Web page by comparing the maximum value of the calculated similarities 1 to n with a threshold value. When the maximum value of the similarities 1 to n is equal to or greater than the threshold value, the unauthorized Web page detection device determines that the inspection target Web page is similar to the unauthorized Web page corresponding to the feature vector that yielded the maximum similarity, and is therefore an unauthorized Web page.
The unauthorized Web page detection device calculates, for each of the plurality of known unauthorized HTML documents and for the inspection target HTML document, a feature vector based on the related state of a plurality of character strings in the HTML document. The unauthorized Web page detection device determines whether or not the inspection target Web page is an unauthorized Web page based on the similarity of the feature vectors. Unauthorized Web pages are often generated by a common tool, and a plurality of unauthorized Web pages generated by a common tool have common features attributable to the tool and are likely to be similar. For this reason, by using the feature vector of the HTML document, the unauthorized Web page detection device can determine with high accuracy whether or not the inspection target Web page is an unauthorized Web page, even if its URL differs from the URLs of the known unauthorized Web pages.
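The comparison of the maximum similarity with a threshold value can be sketched as follows, using cosine similarity as in the description of FIG. 9. The function names `cosine_similarity` and `judge`, the return shape, and the handling of zero vectors and empty inputs are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def judge(target_vector, known_vectors, threshold):
    """Return (is_unauthorized, index_of_best_match, best_similarity)."""
    if not known_vectors:
        return False, None, None  # assumption: nothing registered, nothing matched
    similarities = [cosine_similarity(target_vector, v) for v in known_vectors]
    best_index = max(range(len(similarities)), key=similarities.__getitem__)
    best = similarities[best_index]
    # The page is judged unauthorized when the maximum similarity
    # is equal to or greater than the threshold value.
    return best >= threshold, best_index, best
```

The index returned alongside the verdict identifies which known unauthorized Web page the inspection target most resembles, which corresponds to the statement that the page is similar to the unauthorized Web page whose feature vector yielded the maximum similarity.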
<Embodiment>
FIG. 2 is a diagram illustrating an example of a schematic configuration of the communication system 1.
 通信システム1は、端末2、Webサーバ3及び不正Webページ検出装置4等を有する。端末2、Webサーバ3及び不正Webページ検出装置4は、インターネット等の通信ネットワーク5を介して接続される。 The communication system 1 includes a terminal 2, a Web server 3, an unauthorized Web page detection device 4, and the like. The terminal 2, the Web server 3, and the unauthorized Web page detection device 4 are connected via a communication network 5 such as the Internet.
 端末2は、ユーザがWebページの閲覧に使用する端末である。端末2は、TCP/IP(Transmission Control Protocol / Internet Protocol)等の通信方式により、通信ネットワーク5を介してWebサーバ3及び不正Webページ検出装置4と通信し、通信の内容に応じた表示を行う。 The terminal 2 is a terminal used by the user for browsing the Web page. The terminal 2 communicates with the Web server 3 and the unauthorized Web page detection device 4 via the communication network 5 by a communication method such as TCP / IP (Transmission Control Protocol / Internet Protocol) and performs display according to the content of the communication. .
 Webサーバ3は、端末2及び不正Webページ検出装置4による要求に応じて、Webページを送信するサーバである。Webサーバ3は、TCP/IP等の通信方式により、通信ネットワーク5を介して端末2及び不正Webページ検出装置4と通信する。 The Web server 3 is a server that transmits a Web page in response to a request from the terminal 2 and the unauthorized Web page detection device 4. The Web server 3 communicates with the terminal 2 and the unauthorized Web page detection device 4 via the communication network 5 by a communication method such as TCP / IP.
 端末2は、URLを指定してWebサーバ3のWebページにアクセスする際に、同一のURLを不正Webページ検出装置4に送信する。不正Webページ検出装置4は、送信されたURLを指定してWebサーバ3にHTML文書の取得を要求し、Webサーバ3からHTML文書を受信する。不正Webページ検出装置4は、受信したHTML文書が不正HTML文書であるか否かを判定し、判定した結果を端末2に送信する。端末2は、送信された検査結果に応じて、Webサーバ3から送信されたWebページ又は警告画面を表示する。 (2) When the terminal 2 accesses the Web page of the Web server 3 by specifying the URL, the terminal 2 transmits the same URL to the unauthorized Web page detection device 4. The unauthorized Web page detection device 4 specifies the transmitted URL, requests the Web server 3 to acquire an HTML document, and receives the HTML document from the Web server 3. The unauthorized Web page detection device 4 determines whether the received HTML document is an unauthorized HTML document, and transmits the determined result to the terminal 2. The terminal 2 displays a Web page or a warning screen transmitted from the Web server 3 according to the transmitted inspection result.
FIG. 3 is a diagram showing an example of a schematic configuration of the unauthorized Web page detection device 4.
The unauthorized Web page detection device 4 includes a communication unit 41, a storage unit 42, and a processing unit 43.
The communication unit 41 has a wired communication interface circuit such as a wired LAN, or a wireless communication interface circuit such as a wireless LAN. The communication unit 41 communicates with the terminal 2, the Web server 3, and the like via the communication network 5 using a communication method such as TCP/IP. The communication unit 41 supplies data received from the terminal 2, the Web server 3, and the like to the processing unit 43, and transmits data supplied from the processing unit 43 to the terminal 2, the Web server 3, and the like. The communication unit 41 is an example of an output unit.
The storage unit 42 has, for example, at least one of a semiconductor memory, a magnetic disk device, and an optical disk device. The storage unit 42 stores driver programs, operating system programs, application programs, data, and the like used for processing by the processing unit 43.
For example, the storage unit 42 stores, as a driver program, a communication device driver program that controls the communication unit 41. The storage unit 42 also stores, as an operating system program, a connection control program for a communication method such as TCP/IP, and, as an application program, a data processing program that transmits and receives various data. The computer programs may be installed in the storage unit 42 from a computer-readable portable recording medium such as a CD-ROM (Compact Disk Read Only Memory) or a DVD-ROM (Digital Versatile Disk Read Only Memory) using a known setup program or the like.
The storage unit 42 stores, as data, an unauthorized Web page table, a regular Web page table, and the like. Details of the unauthorized Web page table and the regular Web page table are described later.
The processing unit 43 has one or more processors and their peripheral circuits, and controls the overall operation of the unauthorized Web page detection device 4. The processing unit 43 is, for example, a CPU (Central Processing Unit). The processing unit 43 may instead be a DSP (Digital Signal Processor), an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or the like.
The processing unit 43 controls the operation of the communication unit 41 and the like so that the various processes of the unauthorized Web page detection device 4 are executed in an appropriate order according to the programs and the like stored in the storage unit 42. The processing unit 43 executes processing based on the programs (driver programs, operating system programs, application programs, and the like) stored in the storage unit 42. The processing unit 43 can also execute a plurality of programs (application programs and the like) in parallel.
The processing unit 43 includes an acquisition unit 431, a preprocessing unit 432, a morphological analysis unit 433, a vector calculation unit 434, a similarity calculation unit 435, a determination unit 436, a determination result output unit 437, and the like. Each of these units is a functional module implemented by a program executed on a processor of the processing unit 43. Alternatively, each of these units may be implemented in the unauthorized Web page detection device 4 as an independent integrated circuit, microprocessor, or firmware.
FIG. 4(a) is a diagram showing an example of the data structure of the unauthorized Web page table.
The unauthorized Web page table stores, in association with one another, an ID for identifying each unauthorized Web page, a URL indicating the unauthorized Web page, the unauthorized HTML document included in the unauthorized Web page, a feature vector calculated from the unauthorized HTML document, and the like. A plurality of unauthorized HTML documents are stored in the unauthorized Web page table, and each of them constitutes one of a plurality of unauthorized Web pages. The feature vector may be stored in the storage unit 42 separately from the unauthorized Web page table, in association with the ID, the URL, and the like. The URL need not be included in the unauthorized Web page table. Regardless of whether the feature vectors are stored in the unauthorized Web page table, the storage unit 42 stores, for each of the plurality of unauthorized HTML documents constituting the plurality of unauthorized Web pages, a feature vector based on the related state of a plurality of character strings in that HTML document.
FIG. 4(b) is a diagram showing an example of the data structure of the regular Web page table.
The regular Web page table stores, in association with one another, an ID for identifying each regular Web page, a regular URL indicating the regular Web page, the regular HTML document included in the regular Web page, a feature vector calculated from the regular HTML document, and the like. The feature vector may be stored in the storage unit 42 separately from the regular Web page table, in association with the ID, the regular URL, and the like. Regardless of whether the feature vectors are stored in the regular Web page table, the storage unit 42 stores the feature vector of each of the plurality of regular HTML documents constituting the plurality of regular Web pages in association with the regular URL indicating the corresponding regular Web page.
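The two tables described above can be pictured as collections of records that share the same fields. The following is only a minimal sketch under assumptions: the class names `PageRecord` and `PageTable`, the field names, and the example URL are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PageRecord:
    """One row of either table (field names are illustrative assumptions)."""
    page_id: int
    html: str
    url: Optional[str] = None                      # may be omitted in the unauthorized table
    feature_vector: Optional[List[float]] = None   # filled in later by the initial process


@dataclass
class PageTable:
    """A table keyed by page ID."""
    rows: Dict[int, PageRecord] = field(default_factory=dict)

    def add(self, record: PageRecord) -> None:
        self.rows[record.page_id] = record

    def vectors(self) -> Dict[int, List[float]]:
        """Return only the rows whose feature vector has been computed."""
        return {i: r.feature_vector for i, r in self.rows.items()
                if r.feature_vector is not None}


unauthorized_table = PageTable()
unauthorized_table.add(PageRecord(1, "<html>...</html>", url="http://phish.example/login"))
unauthorized_table.add(PageRecord(2, "<html>...</html>"))  # URL omitted
unauthorized_table.rows[1].feature_vector = [0.1, 0.2]
```

Keeping the feature vector optional reflects that the text allows it to be stored either inside the table or separately in the storage unit 42.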
FIG. 5 is a flowchart showing an example of the operation of the unauthorized Web page detection device 4.
An example of the operation of the unauthorized Web page detection device 4 is described below with reference to the flowchart shown in FIG. 5. The operation described below is executed mainly by the processing unit 43, in cooperation with the other elements, based on a program stored in advance in the storage unit 42.
First, the acquisition unit 431 reads the unauthorized Web page table and the regular Web page table from the storage unit 42, and acquires the plurality of unauthorized HTML documents and the plurality of regular HTML documents (step S11).
Next, the unauthorized Web page detection device 4 executes an initial process (step S12). In the initial process, the vector calculation unit 434 of the unauthorized Web page detection device 4 calculates a feature vector for each of the plurality of unauthorized HTML documents and the plurality of regular HTML documents. Details of the initial process are described later. Steps S11 and S12 are executed immediately after the unauthorized Web page detection device 4 starts up.
Next, the acquisition unit 431 of the unauthorized Web page detection device 4 waits until it receives a URL from the terminal 2 (step S13). The terminal 2 transmits a Web page transmission request specifying a URL to the Web server 3, and transmits the same URL to the unauthorized Web page detection device 4. The acquisition unit 431 receives the URL transmitted from the terminal 2 via the communication unit 41, and acquires it as the inspection target URL indicating the inspection target Web page.
Next, the acquisition unit 431 transmits, via the communication unit 41, an HTML document transmission request specifying the acquired URL to the Web server 3 (step S14).
Upon receiving the HTML document transmission request, the Web server 3 transmits the HTML document specified by the URL to the unauthorized Web page detection device 4. The acquisition unit 431 receives the HTML document from the Web server 3 via the communication unit 41, and acquires it as the inspection target HTML document constituting the inspection target Web page (step S15).
Next, the determination unit 436 of the unauthorized Web page detection device 4 executes an inspection process on the inspection target HTML document (step S16). In the inspection process, the determination unit 436 determines whether the inspection target Web page containing the inspection target HTML document is an unauthorized Web page. Details of the inspection process are described later.
Next, the determination result output unit 437 outputs the determination result of the inspection process by transmitting it to the terminal 2 via the communication unit 41 (step S17). The determination result output unit 437 then returns the process to step S13, and steps S13 to S17 are repeated.
Meanwhile, upon receiving the determination result, the terminal 2 identifies its content. If the determination result indicates a regular Web page, the terminal 2 displays the Web page received from the Web server 3. If the determination result indicates an unauthorized Web page, the terminal 2 does not display the Web page received from the Web server 3 and displays a warning screen instead.
Note that the terminal 2 may have already received the Web page from the Web server 3 and displayed it before receiving, from the unauthorized Web page detection device 4, a determination result indicating that the Web page is an unauthorized Web page. In that case, the terminal 2 displays the warning screen in place of the displayed Web page.
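The terminal-side behavior described above can be sketched as a small decision function. This is an illustrative sketch only: the function name `screen_to_display`, the verdict strings, and the warning text are assumptions, and the case in which no determination result has arrived yet follows the note that the page may already be displayed.

```python
from typing import Optional

# Hypothetical warning text; the disclosure only says the data is stored on the terminal.
WARNING_SCREEN = "WARNING: this page may be a phishing page."


def screen_to_display(verdict: Optional[str], page: Optional[str]) -> str:
    """Choose what the terminal shows.

    verdict: "regular", "unauthorized", or None if no result has arrived yet.
    page:    the Web page received from the Web server, or None if not yet received.
    """
    if verdict == "unauthorized":
        # The warning replaces the page even if the page was already displayed.
        return WARNING_SCREEN
    if page is None:
        return ""  # nothing to show yet
    return page
```

Calling the function again whenever a new determination result arrives reproduces the replacement of an already displayed page by the warning screen.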
FIG. 6 is a flowchart showing an example of the initial process. The initial process is executed in step S12 of FIG. 5.
First, the preprocessing unit 432 performs preprocessing on each of the plurality of unauthorized HTML documents and the plurality of regular HTML documents acquired in step S11 (step S21). As the preprocessing, the preprocessing unit 432 analyzes the content of each HTML document based on HTML grammar rules and deletes some of the characters in each HTML document based on the analysis result. For example, the preprocessing unit 432 deletes line feed codes (the control characters that represent line breaks), whitespace characters before and after the line feed codes, comment strings, JavaScript code, and the like from each HTML document. The preprocessing unit 432 may also delete the path of a URL written in an HTML tag of each HTML document, and may delete some HTML tags while leaving other HTML tags in the HTML document.
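As a rough illustration of this preprocessing, the sketch below strips comments, script elements, and line breaks with their surrounding whitespace. It is a simplification under assumptions: it uses regular expressions rather than a real HTML grammar analysis, the function name `preprocess` is hypothetical, and the optional removal of URL paths and selected tags is not covered.

```python
import re


def preprocess(html: str) -> str:
    """Remove comments, script elements, and line breaks (regex-based sketch)."""
    # Delete comment strings such as <!-- ... -->.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Delete script elements together with the JavaScript code they contain.
    html = re.sub(r"<script\b.*?</script\s*>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Delete line feed codes and the spaces or tabs directly before and after them.
    html = re.sub(r"[ \t]*(?:\r\n|\r|\n)[ \t]*", "", html)
    return html


sample = "<html>\n  <!-- note -->\n  <script>alert(1)</script>\n  <p>Hello</p>\n</html>\n"
print(preprocess(sample))  # <html><p>Hello</p></html>
```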
Next, the morphological analysis unit 433 performs a morphological analysis process on each HTML document processed by the preprocessing unit 432 (step S22). By performing morphological analysis on each HTML document, the morphological analysis unit 433 converts the content of each HTML document into a collection of character strings. The morphological analysis unit 433 performs the morphological analysis process using a known morphological analysis engine such as MeCab. In the morphological analysis process, the morphological analysis unit 433 treats each HTML tag, such as <p>, and each word other than an HTML tag as one character string.
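The resulting token stream, in which each HTML tag and each word is one character string, can be illustrated as follows. This sketch is a stand-in, not the method of the disclosure: it splits words on whitespace with a regular expression instead of calling a morphological analysis engine, so it only suits whitespace-delimited text. The names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re
from typing import List

# An HTML tag is one token; any other run of non-space characters is one token.
TOKEN_PATTERN = re.compile(r"<[^>]+>|[^\s<>]+")


def tokenize(html: str) -> List[str]:
    """Split preprocessed HTML into tag tokens and word tokens."""
    return TOKEN_PATTERN.findall(html)


print(tokenize("<p>Please log in</p>"))
# ['<p>', 'Please', 'log', 'in', '</p>']
```

For Japanese text, where words are not separated by spaces, the word-splitting part would have to be replaced by an actual morphological analyzer.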
Next, for each HTML document processed by the morphological analysis unit 433, the vector calculation unit 434 calculates a feature vector based on the related state of the plurality of character strings in that HTML document (step S23).
The vector calculation unit 434 calculates the feature vector using a learner that has been trained in advance to output the feature vector of an HTML document when an HTML document having a plurality of character strings is input. This learner is trained in advance, for example by a neural network, using the HTML documents of existing Web pages, and is stored in the storage unit 42 in advance. The learner is trained to output similar feature vectors for HTML documents in which the arrangement of character strings is similar, and dissimilar feature vectors for HTML documents in which the arrangement of character strings is not similar. This training is performed using a known method such as Doc2Vec. The HTML documents used for the training are, for example, the HTML documents of Wikipedia.
Note that the vector calculation unit 434 may calculate the feature vector without using a learner. In that case, the vector calculation unit 434 calculates a feature vector whose elements are the numbers of times each of a predetermined number (two or more) of character strings appears in the document. These character strings are set in advance and stored in the storage unit 42. In this case, the related state of the plurality of character strings is the relative magnitude of their occurrence counts, and similar HTML documents have similar relative magnitudes. The vector calculation unit 434 therefore calculates similar feature vectors for HTML documents in which the occurrence counts of the character strings are similar, and dissimilar feature vectors for HTML documents in which they are not.
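The learner-free variant amounts to counting how often each predetermined character string occurs among a document's tokens. A minimal sketch follows, assuming a hypothetical four-entry list `VOCABULARY` and a hypothetical function name `count_vector`; the disclosure does not say which character strings are chosen.

```python
from collections import Counter
from typing import List, Sequence

# Hypothetical predetermined character strings (two or more are required).
VOCABULARY = ["<form>", "<input>", "password", "login"]


def count_vector(tokens: Sequence[str],
                 vocabulary: Sequence[str] = VOCABULARY) -> List[int]:
    """Return one occurrence count per predetermined character string."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


tokens = ["<form>", "<input>", "<input>", "password", "submit"]
print(count_vector(tokens))  # [1, 2, 1, 0]
```

Tokens outside the predetermined list, such as `submit` here, do not contribute to the vector.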
Next, the vector calculation unit 434 stores each calculated feature vector in the unauthorized Web page table or the regular Web page table, in association with the corresponding unauthorized HTML document or regular HTML document (step S24). This completes the series of processes.
FIG. 7 is a flowchart showing an example of the inspection process. The inspection process is executed in step S16 of FIG. 5.
First, the preprocessing unit 432 performs preprocessing on the inspection target HTML document acquired in step S15 (step S31). This preprocessing is the same as the preprocessing described for step S21, except that its target is the inspection target HTML document.
Next, the morphological analysis unit 433 performs a morphological analysis process on the inspection target HTML document processed by the preprocessing unit 432 (step S32). This morphological analysis process is the same as the one described for step S22, except that its target is the inspection target HTML document.
Next, the vector calculation unit 434 calculates the feature vector of the inspection target HTML document processed by the morphological analysis unit 433 (step S33). This calculation is the same as the feature vector calculation described for step S23, except that its target is the inspection target HTML document. Thus, in steps S23 and S33, the vector calculation unit 434 calculates, for each of the plurality of unauthorized HTML documents, the plurality of regular HTML documents, and the inspection target HTML document, a feature vector based on the related state of the plurality of character strings in that HTML document.
Next, the similarity calculation unit 435 calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents stored in step S24 (step S34).
Next, the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page based on the calculated similarities and a threshold (step S35).
If the maximum similarity is equal to or greater than the threshold (step S35-Y), the determination unit 436 determines that the inspection target Web page is an unauthorized Web page corresponding to the feature vector for which that maximum similarity was calculated (step S36), and ends the series of processes.
On the other hand, if the maximum similarity is less than the threshold (step S35-N), the determination unit 436 reads the regular Web page table and acquires the plurality of regular URLs (step S37).
Next, the determination unit 436 determines whether the domain name in the inspection target URL acquired in step S13 matches any of the domain names in the plurality of regular URLs acquired in step S37 (step S38).
If the domain name in the inspection target URL matches any of the domain names in the plurality of regular URLs (step S38-Y), the determination unit 436 determines that the inspection target Web page belongs to a legitimate Web site and is not an unauthorized Web page (step S39). This completes the series of processes.
If the domain name in the inspection target URL matches none of the domain names in the plurality of regular URLs (step S38-N), the determination unit 436 determines that the inspection target Web page does not belong to a legitimate Web site. The similarity calculation unit 435 then calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents (step S40).
Next, the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page by comparing the maximum of the calculated similarities with a second threshold (step S41). The second threshold may be the same as or different from the threshold used in step S35.
In step S38, the determination unit 436 has already determined that the inspection target Web page does not belong to a legitimate Web site. Therefore, if the maximum similarity is equal to or greater than the second threshold, the determination unit 436 determines that the inspection target Web page is an unauthorized Web page that imitates a registered regular Web page (step S42).
On the other hand, if the maximum similarity is less than the second threshold, the inspection target Web page does not belong to a legitimate Web site, but its content is not similar to any of the regular Web pages, so the determination unit 436 determines that it is an unregistered regular Web page (step S43). This completes the series of processes.
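The branching in steps S34 to S43 can be summarized in one function. The sketch below rests on several assumptions: the function name `inspect`, the verdict strings, and the default threshold values are hypothetical; the similarity measure is passed in as a parameter; and the "domain name" is taken to be the host name returned by `urllib.parse.urlsplit`.

```python
from typing import Callable, Iterable, Sequence
from urllib.parse import urlsplit

Vector = Sequence[float]


def inspect(target_url: str,
            target_vec: Vector,
            unauthorized_vecs: Sequence[Vector],
            regular_vecs: Sequence[Vector],
            regular_urls: Iterable[str],
            similarity: Callable[[Vector, Vector], float],
            threshold: float = 0.8,
            second_threshold: float = 0.8) -> str:
    """Return "unauthorized" or "regular" for the inspection target page."""
    # S34 to S36: similar to a known unauthorized page?
    if unauthorized_vecs and max(
            similarity(target_vec, v) for v in unauthorized_vecs) >= threshold:
        return "unauthorized"
    # S37 to S39: same domain name as a registered regular page?
    regular_hosts = {urlsplit(u).hostname for u in regular_urls}
    host = urlsplit(target_url).hostname
    if host is not None and host in regular_hosts:
        return "regular"
    # S40 to S42: looks like a regular page but is served from another domain.
    if regular_vecs and max(
            similarity(target_vec, v) for v in regular_vecs) >= second_threshold:
        return "unauthorized"
    # S43: resembles nothing registered, so treated as an unregistered regular page.
    return "regular"
```

Computing the similarities to the regular HTML documents only after the domain check fails mirrors the order of the flowchart and avoids the second comparison for pages on registered domains.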
FIG. 8(a) shows an example of input data to the morphological analysis unit 433, and FIG. 8(b) shows an example of output data from the morphological analysis unit 433.
As shown in FIG. 8(a), the input data to the morphological analysis unit 433 is an HTML document of an unauthorized Web page, a regular Web page, or the inspection target Web page from which the preprocessing unit 432 has deleted some characters, such as line feed codes.
As shown in FIG. 8(b), the output data of the morphological analysis unit 433 is obtained by performing morphological analysis on the input data, grouping the resulting morphemes into words, and placing each word between double quotation marks. The morphological analysis unit 433 may instead generate the output data by removing the HTML tags from the input data, performing the morphological analysis, grouping the morphemes into words, and then inserting the HTML tags, enclosed in double quotation marks, at their original positions.
FIG. 9 is a diagram showing an example of an overview of the feature vector processing.
The storage unit 42 stores the unauthorized HTML documents 1 to n of the unauthorized Web pages 1 to n. First, in step S23, the vector calculation unit 434 calculates feature vectors 1 to n for the unauthorized HTML documents 1 to n stored in the storage unit 42. In step S33, the vector calculation unit 434 calculates a feature vector A for the inspection target HTML document of the inspection target Web page acquired by the acquisition unit 431. Then, in step S34, the similarity calculation unit 435 calculates the cosine similarities 1 to n between the feature vector A and each of the feature vectors 1 to n. The closer the cosine similarity is to 1, the more similar the two feature vectors are; the closer it is to -1, the less similar they are. In the example shown in FIG. 9, similarity 1 is 0.9, similarity 2 is 0.4, and similarity n is -0.9.
In step S35, the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page by comparing 0.9, the maximum of the similarities 1 to n, with the threshold. For example, if the threshold is 0.8, the maximum similarity 0.9 is equal to or greater than the threshold, so the inspection target Web page is determined to be an unauthorized Web page corresponding to the unauthorized Web page 1.
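The cosine similarity and the threshold comparison in this example can be written out as below. This is a sketch under assumptions: the function names `cosine_similarity` and `best_match` are hypothetical, and returning 0.0 for a zero vector is a convention chosen here, not something the disclosure specifies.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for a zero vector
    return dot / (norm_a * norm_b)


def best_match(similarities: Sequence[float], threshold: float) -> Optional[int]:
    """Return the 1-based index of the most similar known page,
    or None if the maximum similarity is below the threshold."""
    if not similarities:
        return None
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    return best + 1 if similarities[best] >= threshold else None


# Similarities from the FIG. 9 example (only 1, 2, and n are given), threshold 0.8.
print(best_match([0.9, 0.4, -0.9], threshold=0.8))  # 1
```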
FIGS. 10(a) to 10(d) are diagrams showing examples of screens displayed by the terminal 2.
As shown in FIG. 10(a), when the user instructs the terminal 2 to start a Web browser, the terminal 2 starts and displays the Web browser. The display screen 60 of the Web browser includes a URL input area 61 and a display area 62. When the terminal 2 starts the Web browser, it also starts an application program that communicates with the unauthorized Web page detection device 4.
As shown in FIG. 10(b), when the user enters a URL in the URL input area 61 of the display screen 70 of the Web browser, the terminal 2 accesses the Web server 3 indicated by the entered URL and receives a Web page from the Web server 3. In addition, in accordance with the application program, the terminal 2 transmits the URL entered in the Web browser to the unauthorized Web page detection device 4.
The unauthorized Web page detection device 4 acquires the URL transmitted from the terminal 2 in step S13, executes the processes of steps S14 to S17, and transmits the determination result to the terminal 2.
As shown in FIG. 10(c), when the terminal 2 receives from the unauthorized Web page detection device 4 a determination result indicating that the Web page corresponding to the transmitted URL is a regular Web page, it displays the Web page 81 received from the Web server 3 on the display screen 80.
As shown in FIG. 10(d), when the terminal 2 receives from the unauthorized Web page detection device 4 a determination result indicating that the Web page corresponding to the transmitted URL is an unauthorized Web page, it displays a warning screen 90. The data for the warning screen is stored in the terminal 2 in advance. The warning screen 90 displays a text display 91 and an end button 92. The text display 91 is a message warning that the Web page received from the Web server 3 may be a phishing page. When the end button 92 is pressed, the terminal 2 closes the warning screen 90.
 このように、不正Webページ検出装置4は、既知の複数の不正HTML文書及び検査対象HTML文書毎に、各HTML文書内の複数の文字列の関連状態に基づく特徴ベクトルを算出する。不正Webページ検出装置4は、算出した特徴ベクトルの類似度に基づいて、検査対象Webページが不正Webページであるか否かを判定する。不正Webページは、共通のツールにより生成されていることが多く、共通のツールにより生成された複数の不正Webページは、ツールに起因する共通の特徴を有し、類似する可能性が高い。このため、不正Webページ検出装置4は、HTML文書の特徴ベクトルを使用することにより、検査対象WebページのURLが既知の不正WebページのURLと異なっていても、検査対象Webページが不正Webページか否かを高精度に判定することができる。 As described above, the unauthorized Web page detection device 4 calculates a feature vector based on a related state of a plurality of character strings in each HTML document for each of the plurality of known unauthorized HTML documents and the inspection-target HTML document. The fraudulent Web page detection device 4 determines whether or not the inspection target Web page is a fraudulent Web page based on the calculated similarity of the feature vectors. Unauthorized Web pages are often generated by a common tool, and a plurality of unauthorized Web pages generated by a common tool have common features attributed to the tool and are likely to be similar. For this reason, the fraudulent Web page detection device 4 uses the feature vector of the HTML document, so that even if the URL of the inspection target Web page is different from the URL of the known fraudulent Web page, the fraudulent Web page is detected. Can be determined with high accuracy.
 In addition, when the domain name in the inspection target URL matches none of the domain names in the plurality of regular URLs, the unauthorized Web page detection device 4 further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents. Because the unauthorized Web page detection device 4 also determines whether the inspection target HTML document resembles a regular HTML document, it can detect an unauthorized Web page that was created to resemble a regular Web page but has not yet been registered as an unauthorized Web page.
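 The domain check that gates the comparison against regular pages might look like the sketch below. It assumes that "domain name" can be approximated by the full host name of the URL; the function names are hypothetical and not taken from the specification.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Host part of a URL, lowercased by urlsplit; empty string if absent."""
    return urlsplit(url).hostname or ""


def needs_regular_page_comparison(target_url, regular_urls):
    """True when the target's host matches none of the regular hosts,
    i.e. when the similarity to regular HTML documents should also be computed."""
    regular_hosts = {host_of(u) for u in regular_urls}
    return host_of(target_url) not in regular_hosts
```

 A page on an unregistered host that nevertheless closely resembles a regular page is the case this additional comparison is meant to catch.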
 In addition, the unauthorized Web page detection device 4 calculates the feature vector based on the related state of a plurality of character strings that include HTML tags and words. Unauthorized Web pages generated with a common tool are likely to show a specific, tool-dependent association between HTML tags and words. Because the unauthorized Web page detection device 4 determines whether the association between HTML tags and words in the inspection target Web page resembles that in each unauthorized Web page, it can detect with higher accuracy whether the inspection target Web page is an unauthorized Web page.
 In addition, the unauthorized Web page detection device 4 calculates the feature vector based on the related state of a plurality of consecutive character strings. Web pages that tend to use similar sets of HTML tags and/or words in consecutive character strings are likely to be similar Web pages. Therefore, the unauthorized Web page detection device 4 can detect with higher accuracy an unauthorized Web page that resembles a Web page registered as an unauthorized Web page.
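 One simple way to capture how consecutive HTML tags and words relate to each other is to count adjacent token pairs (bigrams). The sketch below uses that technique purely as a stand-in; this passage does not state how the device derives its feature vectors, and `bigram_counts` is a hypothetical name.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count each pair of adjacent tokens (HTML tags and words alike).

    This is an illustrative stand-in for a feature based on the
    "related state" of consecutive character strings.
    """
    return Counter(zip(tokens, tokens[1:]))


tokens = ["<form>", "<input>", "password", "<input>", "password"]
features = bigram_counts(tokens)
# The pair ("<input>", "password") occurs twice in this token sequence.
```

 Pages produced by the same generation tool would tend to share many such tag-word pairs, which is the intuition behind the paragraph above.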
 Note that, in steps S21 and S31, the preprocessing unit 432 may calculate the size of each HTML document generated by the preprocessing. In that case, in step S34, the similarity calculation unit 435 calculates the difference between the size of each of the plurality of unauthorized HTML documents and the size of the inspection target HTML document and, if the difference is equal to or greater than a predetermined value, does not calculate the similarity for that unauthorized HTML document. Likewise, in step S40, the similarity calculation unit 435 calculates the difference between the size of each of the plurality of regular HTML documents and the size of the inspection target HTML document and, if the difference is equal to or greater than the predetermined value, does not calculate the similarity for that regular HTML document.
 If the size of the inspection target HTML document clearly differs from the size of an unauthorized HTML document or a regular HTML document, the two HTML documents are clearly different. Therefore, the unauthorized Web page detection device 4 can speed up the inspection process without reducing the accuracy of the unauthorized Web page determination. Note that the similarity calculation unit 435 may instead calculate the difference between the sizes of the HTML documents before the preprocessing unit 432 performs the preprocessing. Alternatively, the similarity calculation unit 435 may calculate the difference between the sizes of the HTML documents after the morphological analysis unit 433 has performed the morphological analysis processing.
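 The size-based pre-filter can be sketched as follows. The function and parameter names are hypothetical, and the size unit (for example bytes or characters) is an assumption; the only rule taken from the text is that a registered document is skipped when the size difference is equal to or greater than the predetermined value.

```python
def candidates_within_size(target_size, registered_sizes, max_diff):
    """Return the IDs of registered documents worth comparing.

    registered_sizes maps a document ID to its size. A document is
    skipped when |size - target_size| >= max_diff, as described above.
    """
    return [
        doc_id
        for doc_id, size in registered_sizes.items()
        if abs(size - target_size) < max_diff
    ]


sizes = {"doc_a": 10_000, "doc_b": 10_400, "doc_c": 55_000}
to_compare = candidates_within_size(10_200, sizes, max_diff=500)
```

 Only the documents that survive this filter need the comparatively expensive similarity calculation.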
 In addition, the morphological analysis unit 433 may perform the morphological analysis processing on each regular HTML document acquired in step S11 and on the inspection target HTML document acquired in step S15, instead of on the HTML documents preprocessed by the preprocessing unit 432.
 In addition, the vector calculation unit 434 may calculate the feature vector for an HTML document preprocessed by the preprocessing unit 432, instead of for an HTML document processed by the morphological analysis unit 433. The vector calculation unit 434 may also calculate the feature vector for each regular HTML document acquired in step S11 and for the inspection target HTML document acquired in step S15, instead of for an HTML document processed by the morphological analysis unit 433. For example, when the HTML document is written in a language such as English, in which words are separated by spaces, the vector calculation unit 434 may calculate the feature vector based on a plurality of character strings obtained by splitting the input HTML document at HTML tag boundaries and at the whitespace between words.
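 For a space-delimited language, the splitting at tag boundaries and whitespace described above could be done with a single regular expression. This is an illustrative sketch only; the pattern and the function name `split_html` are assumptions, and the pattern deliberately keeps each HTML tag, including its attributes, as one token.

```python
import re

# Either a whole tag such as <a href="x"> or a run of non-space,
# non-bracket characters (a word).
TOKEN_PATTERN = re.compile(r"<[^>]+>|[^\s<>]+")


def split_html(html):
    """Split an HTML string into tag tokens and word tokens."""
    return TOKEN_PATTERN.findall(html)


tokens = split_html("<p>Sign in now</p>")
```

 No morphological analyzer is needed in this case, because the whitespace already marks the word boundaries.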
 In addition, in step S35, the determination unit 436 may determine whether the number of unauthorized Web pages whose similarity was determined to be equal to or greater than the threshold is equal to or greater than a predetermined number. For example, the determination unit 436 may determine that the inspection target Web page is an unauthorized Web page when that number is equal to or greater than the predetermined number, and that it is not an unauthorized Web page otherwise.
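 The count-based variant of the determination can be sketched as below. The names `threshold` and `min_hits` are hypothetical labels for the threshold and the predetermined number mentioned above.

```python
def is_unauthorized_by_count(similarities, threshold, min_hits):
    """Flag the target page when at least min_hits registered
    unauthorized pages have a similarity of threshold or more."""
    hits = sum(1 for s in similarities if s >= threshold)
    return hits >= min_hits
```

 Requiring several matches instead of one makes the decision less sensitive to a single coincidental resemblance, at the cost of missing pages that resemble only one registered page.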
 Alternatively, the processing of steps S37 to S43 may be omitted, and the determination unit 436 may determine that the inspection target Web page is a regular Web page when the maximum of the similarities calculated in step S34 is less than the threshold.
 Alternatively, the timing at which the determination unit 436 executes the processing of steps S37 and S38 may be moved to before the processing of step S31, and the processing may proceed to step S40 in the case of step S35-N. For example, the determination unit 436 executes the processing of steps S37 and S38 at the beginning of the inspection process. In the case of step S38-Y, the determination unit 436 determines, as in step S39, that the inspection target Web page belongs to a regular Web site and is not an unauthorized Web page, and ends the series of processes. In the case of step S38-N, the determination unit 436 advances the processing to step S31.
 In addition, the storage unit 42 may further store, in association with each unauthorized HTML document in the unauthorized Web page table, URL information indicating which regular URL that unauthorized HTML document targets in its phishing fraud. In this case, in step S34, the similarity calculation unit 435 further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents. The similarity calculation unit 435 then calculates, for each unauthorized HTML document, the average of the similarity for that unauthorized HTML document and the similarity for the regular HTML document associated with the regular URL indicated by its URL information. In step S35, the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page by determining whether the maximum of the averages calculated by the similarity calculation unit 435 is equal to or greater than the threshold.
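 The averaging step above can be illustrated with the following sketch. The dictionary layout and all names are assumptions made for the example; only the rule itself (average the similarity to each unauthorized document with the similarity to the regular document it imitates, then take the maximum) comes from the text.

```python
def max_paired_average(unauthorized_sims, regular_sims, imitated_regular):
    """Return the largest per-document average.

    unauthorized_sims: unauthorized document ID -> similarity to the target
    regular_sims:      regular URL -> similarity to the target
    imitated_regular:  unauthorized document ID -> regular URL it imitates
    """
    averages = [
        (sim + regular_sims[imitated_regular[doc_id]]) / 2
        for doc_id, sim in unauthorized_sims.items()
    ]
    return max(averages, default=0.0)


score = max_paired_average(
    {"f1": 0.8, "f2": 0.6},
    {"https://bank.example/": 0.9, "https://shop.example/": 0.2},
    {"f1": "https://bank.example/", "f2": "https://shop.example/"},
)
```

 The resulting score is then compared with the threshold in the same way as the plain maximum similarity.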
 In addition, the unauthorized Web page detection device 4 may acquire the URL of a new unauthorized Web page or regular Web page during operation and calculate the feature vector corresponding to that Web page. In this case, the acquisition unit 431 acquires the unauthorized HTML document or regular HTML document by specifying the acquired URL, and registers the acquired URL and HTML document in the unauthorized Web page table or the regular Web page table. The preprocessing unit 432, the morphological analysis unit 433, and the vector calculation unit 434 execute the initial processing of step S12 on the newly acquired HTML document and calculate its feature vector.
 The unauthorized Web page detection device 4 can calculate the similarity between the feature vector of the inspection target HTML document and the feature vector of the new HTML document without having the existing learner learn the new HTML document. Because the unauthorized Web page detection device 4 can make determinations using the new HTML document without retraining the learner on the full set of existing and new HTML documents, the processing load associated with learning can be reduced.
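 The point that a new page can be used for determination without retraining can be illustrated with a small registry that only stores vectors. This is a hedged sketch: `VectorRegistry`, `vectorize`, and `similarity` are hypothetical names, and the vectorizer is passed in as a fixed function to stand for the already-trained learner.

```python
class VectorRegistry:
    """Keeps one feature vector per registered URL."""

    def __init__(self, vectorize, similarity):
        self._vectorize = vectorize    # fixed function: HTML -> vector
        self._similarity = similarity  # function: (vector, vector) -> float
        self._vectors = {}

    def register(self, url, html):
        """Add a page by computing and storing its vector only;
        the vectorizer itself is not modified."""
        self._vectors[url] = self._vectorize(html)

    def best_match(self, html):
        """Return (similarity, url) of the closest registered page."""
        target = self._vectorize(html)
        scored = [
            (self._similarity(target, vec), url)
            for url, vec in self._vectors.items()
        ]
        return max(scored, default=(0.0, None))
```

 Registering a page costs one vector calculation, and the page participates in the very next inspection.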
 It should be understood that those skilled in the art can make various changes, substitutions, and modifications without departing from the spirit and scope of the present invention.
 4  Unauthorized Web page detection device
 42  Storage unit
 431  Acquisition unit
 434  Vector calculation unit
 435  Similarity calculation unit
 436  Determination unit
 437  Determination result output unit

Claims (7)

  1.  An unauthorized Web page detection device comprising:
     a storage unit that stores feature vectors of a plurality of unauthorized HTML (HyperText Markup Language) documents respectively constituting a plurality of unauthorized Web pages, each feature vector being based on a related state of a plurality of character strings in the corresponding HTML document;
     an acquisition unit that acquires an inspection target HTML document constituting an inspection target Web page;
     a vector calculation unit that calculates a feature vector of the inspection target HTML document;
     a similarity calculation unit that calculates a similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents;
     a determination unit that determines, based on each of the calculated similarities and a threshold, whether the inspection target Web page is an unauthorized Web page; and
     a determination result output unit that outputs a determination result of the determination unit.
  2.  The unauthorized Web page detection device according to claim 1, wherein
     the storage unit further stores the feature vectors of a plurality of regular HTML documents respectively constituting a plurality of regular Web pages, each in association with a regular URL (Uniform Resource Locator) indicating the corresponding regular Web page,
     the acquisition unit further acquires an inspection target URL indicating the inspection target Web page, and
     the similarity calculation unit, when the domain name in the inspection target URL matches none of the domain names in the plurality of regular URLs, further calculates a similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents.
  3.  The unauthorized Web page detection device according to claim 1 or 2, wherein the similarity calculation unit does not calculate the similarity for an unauthorized HTML document when the difference between the size of that unauthorized HTML document and the size of the inspection target HTML document is equal to or greater than a predetermined value.
  4.  The unauthorized Web page detection device according to any one of claims 1 to 3, wherein the plurality of character strings include HTML tags and words.
  5.  The unauthorized Web page detection device according to any one of claims 1 to 4, wherein the plurality of character strings are consecutive character strings.
  6.  A method for controlling an unauthorized Web page detection device having a storage unit and an output unit, the method comprising, by the unauthorized Web page detection device:
     storing, in the storage unit, feature vectors of a plurality of unauthorized HTML (HyperText Markup Language) documents respectively constituting a plurality of unauthorized Web pages, each feature vector being based on a related state of a plurality of character strings in the corresponding HTML document;
     acquiring an inspection target HTML document constituting an inspection target Web page;
     calculating a feature vector of the inspection target HTML document;
     calculating a similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents;
     determining, based on each of the calculated similarities and a threshold, whether the inspection target Web page is an unauthorized Web page; and
     outputting a result of the determination to the output unit.
  7.  A control program for an unauthorized Web page detection device having a storage unit and an output unit, the control program causing the unauthorized Web page detection device to execute:
     storing, in the storage unit, feature vectors of a plurality of unauthorized HTML (HyperText Markup Language) documents respectively constituting a plurality of unauthorized Web pages, each feature vector being based on a related state of a plurality of character strings in the corresponding HTML document;
     acquiring an inspection target HTML document constituting an inspection target Web page;
     calculating a feature vector of the inspection target HTML document;
     calculating a similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents;
     determining, based on each of the calculated similarities and a threshold, whether the inspection target Web page is an unauthorized Web page; and
     outputting a result of the determination to the output unit.
PCT/JP2018/031993 2018-08-29 2018-08-29 Illicit webpage detection device, illicit webpage detection device control method, and control program WO2020044469A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2018/031993 WO2020044469A1 (en) 2018-08-29 2018-08-29 Illicit webpage detection device, illicit webpage detection device control method, and control program
JP2020539928A JP7182764B2 (en) 2018-08-29 2018-08-29 Fraudulent web page detection device, control method and control program for fraudulent web page detection device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2018/031993 WO2020044469A1 (en) 2018-08-29 2018-08-29 Illicit webpage detection device, illicit webpage detection device control method, and control program

Publications (1)

Publication Number Publication Date
WO2020044469A1 true WO2020044469A1 (en) 2020-03-05

Family

ID=69643425

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/031993 WO2020044469A1 (en) 2018-08-29 2018-08-29 Illicit webpage detection device, illicit webpage detection device control method, and control program

Country Status (2)

Country Link
JP (1) JP7182764B2 (en)
WO (1) WO2020044469A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07319897A (en) * 1994-05-20 1995-12-08 Canon Inc Method and device for processing information
US20130086677A1 (en) * 2010-12-31 2013-04-04 Huawei Technologies Co., Ltd. Method and device for detecting phishing web page
US20160352772A1 (en) * 2015-05-27 2016-12-01 Cisco Technology, Inc. Domain Classification And Routing Using Lexical and Semantic Processing
US20180013789A1 (en) * 2016-07-11 2018-01-11 Bitdefender IPR Management Ltd. Systems and Methods for Detecting Online Fraud

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597107A (en) * 2020-04-22 2020-08-28 北京字节跳动网络技术有限公司 Information output method and device and electronic equipment
CN111597107B (en) * 2020-04-22 2023-04-28 北京字节跳动网络技术有限公司 Information output method and device and electronic equipment
KR20220080703A (en) * 2020-12-07 2022-06-14 주식회사 앰진시큐러스 Method for analyzing a similarity of a website based on a keyword in script
WO2022124573A1 (en) * 2020-12-07 2022-06-16 주식회사 앰진시큐러스 Method for evaluating similarity of website on basis of menu structure and keyword in script
KR102705181B1 (en) * 2020-12-07 2024-09-11 주식회사 앰진 Method for analyzing a similarity of a website based on a keyword in script
JP7138279B1 (en) * 2022-02-17 2022-09-16 株式会社ファイブドライブ Communication system, gateway device, terminal device and program
WO2023157191A1 (en) * 2022-02-17 2023-08-24 株式会社ファイブドライブ Communication system, gateway device, terminal device, and program
KR102595595B1 (en) * 2023-07-24 2023-10-31 (주)에잇스니핏 Method and device for blocking illegal and harmful information sites using website structure information

Also Published As

Publication number Publication date
JP7182764B2 (en) 2022-12-05
JPWO2020044469A1 (en) 2021-08-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18931290

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020539928

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18931290

Country of ref document: EP

Kind code of ref document: A1