KR101814005B1

KR101814005B1 - Apparatus and method for automatically extracting product keyword information according to web page analysis based artificial intelligence

Info

Publication number: KR101814005B1
Application number: KR1020170105316A
Authority: KR
Inventors: 김관호; 이동훈
Original assignee: 인천대학교 산학협력단
Priority date: 2017-08-21
Filing date: 2017-08-21
Publication date: 2018-01-02
Also published as: WO2019039673A1

Abstract

An apparatus and a method for automatically extracting product keyword information through artificial-intelligence-based web page analysis are disclosed. The apparatus and method extract important words from a company's web page according to their frequency of appearance on that page, select, for each important word, the product keyword whose characteristic vector is most similar to that of the important word, and provide the selected product keywords to an administrator, so that the administrator can easily and automatically collect product keyword information for a specific company.

Description

APPARATUS AND METHOD FOR AUTOMATICALLY EXTRACTING PRODUCT KEYWORD INFORMATION ACCORDING TO WEB PAGE ANALYSIS BASED ARTIFICIAL INTELLIGENCE

The present invention relates to an apparatus and a method for automatically extracting product keyword information about the company that operates a web page by analyzing that web page.

As Internet access has spread, web pages containing a wide variety of information have appeared.

As web pages containing such varied information are produced and distributed, big-data analysis techniques have also emerged that analyze the information contained in these pages in order to track economic trends, public opinion, and the like.

Information analysis of web pages is performed through text mining, a technique for finding new and useful information in unstructured text data: the various texts contained in a web page are analyzed and a given meaning is derived from them.

For example, there are systems that analyze the posts members have left on the web pages of a community for exchanging product information, and predict which products are currently popular.

Recently, there have been growing attempts to build databases of company information for purposes such as attracting investment and identifying companies with demand. When information on various companies is collected into a database, it is necessary to determine which products each company produces and to store the product information each company handles in the company information database.

In this regard, since most companies build and operate web pages containing a variety of information for promotional purposes, research is needed on techniques that can analyze the information on each company's web pages, automatically extract keywords for the products the company is expected to handle, and provide them to an administrator.

The apparatus and method for automatically extracting product keyword information based on web page analysis according to the present invention aim to help an administrator easily and automatically collect product keyword information for a specific company by extracting important words from the company's web page according to their frequency of appearance on that page, selecting, from among a plurality of product keywords, the product keyword with the highest characteristic-vector similarity to each of the important words, and providing the selected product keywords to the administrator.

An apparatus for automatically extracting product keyword information based on web page analysis according to an embodiment of the present invention includes: a dictionary database storing a plurality of predetermined words (each of the plurality of words being assigned a different preset characteristic vector such that, under a predetermined word similarity criterion, the more similar two words are, the higher the similarity computed between their vectors); a product keyword database storing a plurality of predetermined product keywords (the plurality of product keywords being words included in the plurality of words); a text extraction unit that, when an access address for a web page of a first company is input, accesses the web page of the first company based on the access address and extracts a plurality of first texts present on the web page of the first company; a word extraction unit that performs morphological analysis on the plurality of first texts to extract a plurality of first words from the plurality of first texts; an important word selection unit that selects at least one important word from among the plurality of first words based on the frequency with which the first words appear on the web page of the first company; a product keyword selection unit that, when the at least one important word is selected, refers to the dictionary database and selects, for each important word, from among the plurality of product keywords stored in the product keyword database, the product keyword whose assigned characteristic vector yields the maximum similarity to the characteristic vector assigned to that important word, thereby selecting at least one product keyword; and a product keyword information transmission unit that, when the at least one product keyword is selected, transmits the at least one product keyword to a terminal of an administrator as main product keyword information of the first company.

In addition, a method of automatically extracting product keyword information based on web page analysis according to an embodiment of the present invention includes: maintaining a dictionary database storing a plurality of predetermined words (each of the plurality of words being assigned a different preset characteristic vector such that, under a predetermined word similarity criterion, the more similar two words are, the higher the similarity computed between their vectors); maintaining a product keyword database storing a plurality of predetermined product keywords (the plurality of product keywords being words included in the plurality of words); when an access address for a web page of a first company is input, accessing the web page of the first company based on the access address and extracting a plurality of first texts present on the web page of the first company; performing morphological analysis on the plurality of first texts to extract a plurality of first words from the plurality of first texts; selecting at least one important word from among the plurality of first words based on the frequency with which the first words appear on the web page of the first company; when the at least one important word is selected, referring to the dictionary database and selecting, for each important word, from among the plurality of product keywords stored in the product keyword database, the product keyword whose assigned characteristic vector yields the maximum similarity to the characteristic vector assigned to that important word, thereby selecting at least one product keyword; and, when the at least one product keyword is selected, transmitting the at least one product keyword to a terminal of an administrator as main product keyword information of the first company.

The apparatus and method for automatically extracting product keyword information based on web page analysis according to the present invention can help an administrator easily and automatically collect product keyword information for a specific company by extracting important words from the company's web page according to their frequency of appearance on that page, selecting, from among a plurality of product keywords, the product keyword with the highest characteristic-vector similarity to each of the important words, and providing the selected product keywords to the administrator.

FIG. 1 is a diagram illustrating the structure of an apparatus for automatically extracting product keyword information based on web page analysis according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method of automatically extracting product keyword information based on web page analysis.

Hereinafter, embodiments according to the present invention will be described in detail with reference to the accompanying drawings. This description is not intended to limit the invention to specific embodiments, and should be understood to cover all modifications, equivalents, and alternatives falling within the spirit and technical scope of the invention. Like reference numerals are used for like elements throughout the drawings and, unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention pertains.

FIG. 1 is a diagram illustrating the structure of an apparatus for automatically extracting product keyword information based on web page analysis according to an embodiment of the present invention.

Referring to FIG. 1, an apparatus 110 for automatically extracting product keyword information based on web page analysis according to an embodiment of the present invention includes a dictionary database 111, a product keyword database 112, a text extraction unit 113, a word extraction unit 114, an important word selection unit 115, a product keyword selection unit 116, and a product keyword information transmission unit 117.

The dictionary database 111 stores a plurality of predetermined words.

Here, each of the plurality of words is assigned a different preset characteristic vector such that, under a predetermined word similarity criterion, the more similar two words are, the higher the similarity computed between their vectors.

For example, the dictionary database 111 may store information as shown in Table 1 below.

Table 1
Word | Characteristic vector
computer | (1, 2, 3, 4, 5)
cellphone | (6, 7, 8, 9, 10)
... | ...

Here, the similarity between vectors may be calculated according to Equation 1 below.

Here, S is the similarity between characteristic vectors A and B and takes a value between -1 and 1, a larger value meaning that the characteristic vectors are more similar; A_i is the i-th component of characteristic vector A, and B_i is the i-th component of characteristic vector B.
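The image of Equation 1 is not reproduced in this text. Given the definitions above (a similarity S between -1 and 1, computed from the components A_i and B_i), the standard cosine similarity is one formula consistent with the description. The following is therefore an assumed reconstruction, not a quotation of the original equation; the symbol n (the number of components of each vector) is introduced here for illustration.

```latex
\[
S \;=\; \frac{\sum_{i=1}^{n} A_i B_i}
             {\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```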

For example, the similarity between the characteristic vectors assigned to the words "computer" and "cellphone" in Table 1 above may be calculated as in Equation 2 below.
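Equation 2 is likewise not reproduced in this text. As a rough illustration only, the sketch below computes the similarity of the two Table 1 vectors under the assumption that Equation 1 is the cosine similarity, which matches the stated range of -1 to 1. The function name `cosine_similarity` is chosen here and does not come from the patent.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed form of Equation 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Characteristic vectors taken from Table 1.
computer = (1, 2, 3, 4, 5)
cellphone = (6, 7, 8, 9, 10)

# dot = 130, |computer| = sqrt(55), |cellphone| = sqrt(330)
similarity = cosine_similarity(computer, cellphone)
print(round(similarity, 3))  # 0.965
```

Under this assumption the two example vectors come out as highly similar, which only reflects the illustrative values in Table 1.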

The plurality of words stored in the dictionary database 111, as in Table 1 above, are words arbitrarily set by the administrator, and the characteristic vectors assigned to the words may be values assigned so that a given similarity is computed in accordance with the inter-word similarity criterion set by the administrator. The similarity criterion between words may be based on the results of collecting various information from the web and analyzing the relationships between various words through analysis and learning of that information.

The product keyword database 112 stores a plurality of predetermined product keywords.

Here, the plurality of product keywords are words included in the plurality of words.

When an access address for the web page of a first company is input, the text extraction unit 113 accesses the web page of the first company based on the access address and extracts a plurality of first texts present on the web page of the first company.

According to an embodiment of the present invention, the text extraction unit 113 may extract the plurality of first texts present on the web page of the first company by parsing the HTML (Hypertext Markup Language) code constituting the web page and extracting the texts inserted through tags associated with text input in the HTML code. If a hyperlink tag exists in the HTML code, the text extraction unit 113 also accesses the sub-page linked through the hyperlink tag and extracts, from the HTML code of the sub-page, the texts inserted through tags associated with text input.

In this regard, the text extraction unit 113 extracts the texts inserted through tags associated with text input in the HTML code constituting the web page of the first company and, if a hyperlink tag such as "<a href>" exists, accesses the sub-page linked through that hyperlink tag and also extracts the texts inserted through tags associated with text input in the HTML code of the sub-page, thereby extracting the plurality of first texts present on the web page of the first company.
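The extraction described above could be sketched roughly as follows, using only the Python standard library. This is an illustrative sketch, not the patented implementation: the patent does not list which tags count as "associated with text input", so the sketch simply collects every text node outside `<script>` and `<style>`, and it follows `<a href>` links one level deep. The names `TextAndLinkParser`, `fetch_html`, and `extract_first_texts` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class TextAndLinkParser(HTMLParser):
    """Collects text nodes and the href targets of <a> tags."""

    SKIPPED = {"script", "style"}  # assumption: their contents are not page text

    def __init__(self):
        super().__init__()
        self.texts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)


def fetch_html(url):
    """Download a page and decode it as UTF-8 (encoding is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_page(html):
    parser = TextAndLinkParser()
    parser.feed(html)
    parser.close()
    return parser


def extract_first_texts(url, fetch=fetch_html):
    """Return texts from the page at `url` plus texts from its linked sub-pages."""
    main = parse_page(fetch(url))
    texts = list(main.texts)
    seen = {url}
    for href in main.links:
        sub_url = urljoin(url, href)  # resolve relative links against the page URL
        if sub_url in seen:
            continue
        seen.add(sub_url)
        texts.extend(parse_page(fetch(sub_url)).texts)
    return texts


# Offline usage example with a stand-in fetch function (hypothetical pages).
pages = {
    "http://example.com/": (
        "<html><body><h1>Acme</h1><p>We make laptops.</p>"
        '<a href="/products">Products</a><script>var x = 1;</script></body></html>'
    ),
    "http://example.com/products": "<p>Laptop X1</p>",
}
texts = extract_first_texts("http://example.com/", fetch=pages.__getitem__)
print(texts)
```

Passing the fetch function as a parameter keeps the sketch testable without network access. A real crawler would likely also restrict links to the company's own domain and handle fetch errors, neither of which the patent text specifies.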

The word extraction unit 114 performs morphological analysis on the plurality of first texts to extract a plurality of first words from the plurality of first texts.

The important word selection unit 115 selects at least one important word from among the plurality of first words based on the frequency with which the first words appear on the web page of the first company.

According to an embodiment of the present invention, the important word selection unit 115 may include a score assignment unit 118, a frequency count unit 119, a score correction unit 120, and a selection unit 121.

When the company name of the first company is input, the score assignment unit 118 refers to the dictionary database 111 and assigns to each of the plurality of first words a score based on the similarity between the characteristic vector for the company name and the characteristic vector for that word.

The frequency count unit 119 counts the frequency with which each of the plurality of first words appears on the web page of the first company.

The score correction unit 120 corrects the score of each of the plurality of first words by applying, to the score assigned to each word, a different weight based on the frequency with which that word appears on the web page of the first company.

The selection unit 121 selects, as the at least one important word, those of the plurality of first words whose corrected score exceeds a predetermined reference score.

According to an embodiment of the present invention, the important word selection unit 115 may further include a weight table maintenance unit 122 that stores and maintains a weight table in which different weights are recorded in correspondence with different predetermined frequency ranges.

In this regard, the weight table may record information as shown in Table 2 below.

Table 2
Frequency range | Weight
1 to 5 times | 1
6 to 10 times | 1.1
11 to 15 times | 1.2
... | ...

In this case, when the company name of the first company is input, the score assignment unit 118 may refer to the dictionary database 111 and assign a first score to those of the plurality of first words for which the similarity between the characteristic vector for the company name and the characteristic vector for the word exceeds a predetermined reference similarity, and assign a second score to those first words for which the similarity does not exceed the predetermined reference similarity.

Here, the second score is lower than the first score.

In this regard, the score assignment unit 118 may calculate, according to Equation 1 above, the similarity between the characteristic vector for the company name stored in the dictionary database 111 and the characteristic vector of each of the plurality of first words, then assign the first score to those first words whose calculated similarity exceeds the predetermined reference similarity and assign the second score, which is lower than the first score, to the remaining words.

The score correction unit 120 may then refer to a weight table such as Table 2 above, match to each of the plurality of first words the weight corresponding to the frequency range in the weight table to which that word's frequency of appearance on the web page of the first company belongs, and correct the score of each first word by applying the matched weight to the score assigned to it.

Once the scores of the plurality of first words have been corrected in this way, the selection unit 121 may select, as the at least one important word, those first words whose corrected score exceeds the predetermined reference score.
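The scoring, weighting, and selection steps above can be sketched as follows. This is a minimal illustration under several assumptions that the patent does not fix: the similarity is taken to be cosine similarity, "applying" a weight is taken to mean multiplication, words without an entry in the dictionary are skipped, and the first score, second score, reference similarity, and reference score are made-up values. Only the three weight rows listed in Table 2 are used; the elided rows are not filled in, so counts outside those rows raise an error.

```python
import math
from collections import Counter

# Hypothetical values; the patent only requires SECOND_SCORE < FIRST_SCORE.
FIRST_SCORE = 2.0
SECOND_SCORE = 1.0
REFERENCE_SIMILARITY = 0.9
REFERENCE_SCORE = 1.5

# The three rows listed in Table 2: (low, high) frequency range -> weight.
WEIGHT_TABLE = [((1, 5), 1.0), ((6, 10), 1.1), ((11, 15), 1.2)]


def cosine_similarity(a, b):
    """Assumed form of Equation 1."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def weight_for(count):
    """Return the weight whose frequency range contains `count`."""
    for (low, high), weight in WEIGHT_TABLE:
        if low <= count <= high:
            return weight
    raise ValueError(f"no weight listed for frequency {count}")


def select_important_words(first_words, company_vector, dictionary):
    """Score each distinct word, correct the score by frequency, and threshold it."""
    counts = Counter(first_words)  # frequency count of each first word
    important = []
    for word, count in counts.items():
        if word not in dictionary:  # assumption: words without a vector are skipped
            continue
        similarity = cosine_similarity(company_vector, dictionary[word])
        score = FIRST_SCORE if similarity > REFERENCE_SIMILARITY else SECOND_SCORE
        corrected = score * weight_for(count)  # assumption: weight is a multiplier
        if corrected > REFERENCE_SCORE:
            important.append(word)
    return important


# Hypothetical dictionary entries and company-name vector.
dictionary = {
    "laptop": (1, 2, 3, 4, 5),
    "phone": (6, 7, 8, 9, 10),
    "contact": (5, -4, 3, -2, 1),
}
company_vector = (1, 2, 3, 4, 6)
first_words = ["laptop"] * 7 + ["phone"] * 2 + ["contact"] * 12

important_words = select_important_words(first_words, company_vector, dictionary)
print(important_words)  # ['laptop', 'phone']
```

In this made-up data, "contact" appears most often but is dissimilar to the company-name vector, so its corrected score (1.0 x 1.2) stays below the reference score, whereas "laptop" and "phone" pass.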

When the at least one important word is selected in this way, the product keyword selection unit 116 refers to the dictionary database 111 and selects, for each important word, from among the plurality of product keywords stored in the product keyword database 112, the product keyword whose assigned characteristic vector yields the maximum similarity to the characteristic vector assigned to that important word, thereby selecting at least one product keyword.

For example, if the at least one important word consists of ten words in total, the product keyword selection unit 116 may select, for each of the ten important words, from among the plurality of product keywords stored in the product keyword database 112, the product keyword whose characteristic vector yields the maximum similarity under Equation 1 above to the characteristic vector of that important word, thereby selecting ten product keywords.
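The selection described above amounts to a nearest-neighbor lookup: for each important word, pick the product keyword with the highest similarity. The sketch below illustrates this, again assuming cosine similarity for Equation 1. The vectors for "computer" and "cellphone" come from Table 1; the words "notebook" and "handset" and their vectors are hypothetical, as is the function name `select_product_keywords`.

```python
import math


def cosine_similarity(a, b):
    """Assumed form of Equation 1."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def select_product_keywords(important_words, product_keywords, dictionary):
    """For each important word, return the most similar product keyword."""
    selected = []
    for word in important_words:
        best = max(
            product_keywords,
            key=lambda keyword: cosine_similarity(dictionary[word], dictionary[keyword]),
        )
        selected.append(best)
    return selected


# Dictionary database: every word, including product keywords, has a vector.
dictionary = {
    "computer": (1, 2, 3, 4, 5),    # from Table 1
    "cellphone": (6, 7, 8, 9, 10),  # from Table 1
    "notebook": (1, 2, 3, 4, 6),    # hypothetical
    "handset": (6, 7, 8, 9, 9),     # hypothetical
}
product_keywords = ["computer", "cellphone"]  # subset of the dictionary words

keywords = select_product_keywords(["notebook", "handset"], product_keywords, dictionary)
print(keywords)  # ['computer', 'cellphone']
```

The sketch returns one product keyword per important word, so ten important words yield ten selections; whether repeated selections of the same keyword are merged is not stated in the text and is left as is here.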

When the at least one product keyword is selected in this way, the product keyword information transmission unit 117 transmits the at least one product keyword to the administrator's terminal as main product keyword information of the first company.

As a result, the apparatus 110 for automatically extracting product keyword information based on web page analysis according to the present invention can help an administrator easily and automatically collect product keyword information for a specific company by extracting important words from the company's web page according to their frequency of appearance on that page, selecting, from among a plurality of product keywords, the product keyword with the highest characteristic-vector similarity to each of the important words, and providing the selected product keywords to the administrator.

FIG. 2 is a flowchart illustrating a method of automatically extracting product keyword information based on web page analysis.

In step S210, a dictionary database is maintained in which a plurality of predetermined words are stored (each of the plurality of words being assigned a different preset characteristic vector such that, under a predetermined word similarity criterion, the more similar two words are, the higher the similarity computed between their vectors).

In step S220, a product keyword database is maintained in which a plurality of predetermined product keywords are stored (the plurality of product keywords being words included in the plurality of words).

In step S230, when an access address for the web page of a first company is input, the web page of the first company is accessed based on the access address and a plurality of first texts present on the web page of the first company are extracted.

In step S240, morphological analysis is performed on the plurality of first texts to extract a plurality of first words from the plurality of first texts.

In step S250, at least one important word is selected from among the plurality of first words based on the frequency with which the first words appear on the web page of the first company.

In step S260, when the at least one important word is selected, the dictionary database is referred to and, for each important word, the product keyword whose assigned characteristic vector yields the maximum similarity to the characteristic vector assigned to that important word is selected from among the plurality of product keywords stored in the product keyword database, so that at least one product keyword is selected.

In step S270, when the at least one product keyword is selected, the at least one product keyword is transmitted to the administrator's terminal as main product keyword information of the first company.

According to an embodiment of the present invention, step S250 may include: when the company name of the first company is input, referring to the dictionary database and assigning to each of the plurality of first words a score based on the similarity between the characteristic vector for the company name and the characteristic vector for that word; counting the frequency with which each of the plurality of first words appears on the web page of the first company; correcting the score of each of the plurality of first words by applying, to the score assigned to each word, a different weight based on the frequency with which that word appears on the web page of the first company; and selecting, as the at least one important word, those of the plurality of first words whose corrected score exceeds a predetermined reference score.

According to an embodiment of the present invention, step S250 may further include storing and maintaining a weight table in which different weights are recorded in correspondence with different predetermined frequency ranges. In the step of assigning the score, when the company name of the first company is input, the dictionary database may be referred to, a first score may be assigned to those of the plurality of first words for which the similarity between the characteristic vector for the company name and the characteristic vector for the word exceeds a predetermined reference similarity, and a second score (the second score being lower than the first score) may be assigned to those first words for which the similarity does not exceed the predetermined reference similarity. In the step of correcting the score, the weight table may be referred to, the weight corresponding to the frequency range in the weight table to which each first word's frequency of appearance on the web page of the first company belongs may be matched to that word, and the score of each first word may be corrected by applying the matched weight to the score assigned to it.

Also, according to an embodiment of the present invention, in step S230 the plurality of first texts present on the web page of the first company may be extracted by parsing the HTML code constituting the web page of the first company and extracting the texts inserted in the HTML code through tags associated with text input. If a hyperlink tag exists in the HTML code, the sub page linked through the hyperlink tag is also accessed, and the texts inserted through tags associated with text input are extracted from the HTML code of the sub page as well, so that they are included in the plurality of first texts extracted for the web page of the first company.
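A rough sketch of this extraction step, using only the Python standard library, is shown below. The set of "text-related" tags, the class and function names, the injected `fetch` callable (used instead of a real HTTP client so the example runs offline), and the decision to follow links only one level deep are assumptions for illustration; the text does not specify them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of tags treated as "associated with text input".
TEXT_TAGS = {"title", "h1", "h2", "h3", "p", "li", "td", "span", "a"}


class TextAndLinkParser(HTMLParser):
    """Collects text found inside text-related tags and hrefs of <a> tags."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.links = []
        self._depth = 0  # number of currently open text-related tags

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self._depth += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._depth > 0:
            self.texts.append(text)


def extract_texts(url, fetch):
    """Extract texts from the page at `url` and from its directly linked sub pages.

    fetch: hypothetical callable returning the HTML source for a URL.
    """
    main = TextAndLinkParser()
    main.feed(fetch(url))
    texts = list(main.texts)
    for href in main.links:
        sub = TextAndLinkParser()
        sub.feed(fetch(urljoin(url, href)))  # resolve relative links
        texts.extend(sub.texts)
    return texts


# Offline stand-in for a web server.
PAGES = {
    "http://example.com/": (
        "<html><body><h1>Acme Pumps</h1><script>var x = 1;</script>"
        "<a href='/products'>Products</a></body></html>"
    ),
    "http://example.com/products": "<p>Industrial valve</p>",
}
print(extract_texts("http://example.com/", PAGES.__getitem__))
```

In a real crawler the `fetch` callable would perform an HTTP request, and visited URLs would need to be tracked to avoid fetching the same sub page twice.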

Also, according to an embodiment of the present invention, the similarity between different feature vectors may be computed according to Equation 1 above.
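The similarity computation and the subsequent choice of the closest product keyword could be sketched as below. The example assumes that Equation 1 is the cosine similarity, which matches the stated properties (a value between -1 and 1 computed from the i-th components of the two vectors) but is not confirmed by the text available here; the function names and the toy two-dimensional vectors are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed form of Equation 1)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar_keyword(word_vector, keyword_vectors):
    """Return the product keyword whose feature vector is most similar to the word's."""
    return max(
        keyword_vectors,
        key=lambda keyword: cosine_similarity(word_vector, keyword_vectors[keyword]),
    )


# Toy feature vectors standing in for entries of the dictionary database.
product_keywords = {"pump": [0.9, 0.2], "valve": [0.0, 1.0]}
print(most_similar_keyword([1.0, 0.1], product_keywords))
```

Because cosine similarity depends only on the angle between vectors, words with feature vectors of very different magnitudes can still be judged similar, which suits embeddings whose length carries no meaning.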

The method for automatically extracting product keyword information based on web page analysis according to an embodiment of the present invention has been described above with reference to FIG. 2. Because this method may correspond to the configuration of the operations of the apparatus 110 for automatically extracting product keyword information based on web page analysis described with reference to FIG. 1, a more detailed description is omitted.

The method for automatically extracting product keyword information based on web page analysis according to an embodiment of the present invention may be implemented as a computer program stored in a storage medium for execution in combination with a computer.

In addition, the method for automatically extracting product keyword information based on web page analysis according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

As described above, the present invention has been described with reference to specific matters such as particular components, and to limited embodiments and drawings. These are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Accordingly, the spirit of the present invention should not be limited to the described embodiments, and not only the claims set forth below but also everything equal or equivalent to those claims falls within the scope of the spirit of the present invention.

110: Apparatus for automatically extracting product keyword information based on web page analysis
111: Dictionary database 112: Product keyword database
113: Text extraction unit 114: Word extraction unit
115: Important word selection unit 116: Product keyword selection unit
117: Product keyword information transmitting unit 118: Score assigning unit
119: Frequency counting unit 120: Score correcting unit
121: Selection unit 122: Weight table holding unit

Claims

1. An apparatus for automatically extracting product keyword information based on web page analysis, the apparatus comprising:
A dictionary database storing a plurality of words, wherein each of the plurality of words is assigned a different preset feature vector such that the similarity between vectors is higher for words that are more similar according to a predetermined word similarity criterion;
A product keyword database storing a plurality of predetermined product keywords, the plurality of product keywords being words included in the plurality of words;
A text extraction unit that, when an access address for a web page of a first company is input, accesses the web page of the first company based on the access address and extracts a plurality of first texts present on the web page of the first company;
A word extraction unit that performs morphological analysis on the plurality of first texts to extract a plurality of first words from the plurality of first texts;
An important word selection unit that selects at least one important word from among the plurality of first words based on the appearance frequency of the plurality of first words on the web page of the first company;
A product keyword selection unit that, when the at least one important word is selected, selects, for each of the at least one important word, from among the plurality of product keywords stored in the product keyword database, at least one product keyword assigned the feature vector whose similarity to the feature vector assigned to that important word is calculated to be the maximum; and
A product keyword information transmitting unit that, when the at least one product keyword is selected, transmits the at least one product keyword to a terminal of a manager as main product keyword information of the first company.

2. The apparatus of claim 1, wherein the important word selection unit comprises:
A score assigning unit that, when the company name of the first company is input, assigns to each of the plurality of first words, with reference to the dictionary database, a score based on the similarity between the feature vector for the company name and the feature vector for that word;
A frequency counting unit that counts the appearance frequency of each of the plurality of first words on the web page of the first company;
A score correcting unit that corrects the score of each of the plurality of first words by applying, to the score assigned to that word, a different weight based on the word's appearance frequency on the web page of the first company; and
A selection unit that selects, as the at least one important word, those words among the plurality of first words whose corrected score exceeds a predetermined reference score.

3. The apparatus of claim 2, wherein
The important word selection unit further comprises a weight table holding unit that stores and maintains a weight table in which different weights are recorded in correspondence with different predetermined frequency ranges,
The score assigning unit, when the company name of the first company is input, assigns a first score to those words among the plurality of first words for which the similarity between the feature vector for the company name and the feature vector for the word, obtained with reference to the dictionary database, exceeds a predetermined reference similarity, and assigns a second score, lower than the first score, to those words among the plurality of first words that do not exceed the predetermined reference similarity, and
The score correcting unit, with reference to the weight table, matches to each of the plurality of first words the weight corresponding to the frequency range in the weight table to which that word's appearance frequency on the web page of the first company belongs, and then corrects the score of each of the plurality of first words by applying the matched weight to the score assigned to that word.

4. The apparatus of claim 1, wherein the text extraction unit
Extracts the plurality of first texts present on the web page of the first company by parsing the HTML (Hypertext Markup Language) code constituting the web page of the first company and extracting the texts inserted in the HTML code through tags associated with text input, and, if a hyperlink tag exists in the HTML code, also accesses the sub page linked through the hyperlink tag and extracts the texts inserted through tags associated with text input from the HTML code of the sub page, thereby performing the extraction of the plurality of first texts present on the web page of the first company.

5. The apparatus of claim 2,
Wherein the calculation of the similarity between different feature vectors is performed according to Equation 1 below.
[Equation 1]

Here, S is the similarity between feature vectors A and B and takes a value between -1 and 1, a larger value indicating more similar feature vectors; A_i is the i-th component of feature vector A, and B_i is the i-th component of feature vector B.

6. A method for automatically extracting product keyword information based on web page analysis, the method comprising:
Maintaining a dictionary database storing a plurality of words, wherein each of the plurality of words is assigned a different preset feature vector such that the similarity between vectors is higher for words that are more similar according to a predetermined word similarity criterion;
Maintaining a product keyword database storing a plurality of predetermined product keywords, the plurality of product keywords being words included in the plurality of words;
When an access address for a web page of a first company is input, accessing the web page of the first company based on the access address and extracting a plurality of first texts present on the web page of the first company;
Performing morphological analysis on the plurality of first texts to extract a plurality of first words from the plurality of first texts;
Selecting at least one important word from among the plurality of first words based on the appearance frequency of the plurality of first words on the web page of the first company;
When the at least one important word is selected, selecting, for each of the at least one important word, from among the plurality of product keywords stored in the product keyword database, at least one product keyword assigned the feature vector whose similarity to the feature vector assigned to that important word is calculated to be the maximum; and
When the at least one product keyword is selected, transmitting the at least one product keyword to a terminal of a manager as main product keyword information of the first company.

7. The method of claim 6, wherein the step of selecting the at least one important word comprises:
When the company name of the first company is input, assigning to each of the plurality of first words, with reference to the dictionary database, a score based on the similarity between the feature vector for the company name and the feature vector for that word;
Counting the appearance frequency of each of the plurality of first words on the web page of the first company;
Correcting the score of each of the plurality of first words by applying, to the score assigned to that word, a different weight based on the word's appearance frequency on the web page of the first company; and
Selecting, as the at least one important word, those words among the plurality of first words whose corrected score exceeds a predetermined reference score.

8. The method of claim 7, wherein
The step of selecting the at least one important word further comprises storing and maintaining a weight table in which different weights are recorded in correspondence with different predetermined frequency ranges,
The step of assigning the score, when the company name of the first company is input, assigns a first score to those words among the plurality of first words for which the similarity between the feature vector for the company name and the feature vector for the word, obtained with reference to the dictionary database, exceeds a predetermined reference similarity, and assigns a second score, lower than the first score, to those words among the plurality of first words that do not exceed the predetermined reference similarity, and
The step of correcting the score, with reference to the weight table, matches to each of the plurality of first words the weight corresponding to the frequency range in the weight table to which that word's appearance frequency on the web page of the first company belongs, and then corrects the score of each of the plurality of first words by applying the matched weight to the score assigned to that word.

9. The method of claim 6, wherein the step of extracting the plurality of first texts
Extracts the plurality of first texts present on the web page of the first company by parsing the HTML (Hypertext Markup Language) code constituting the web page of the first company and extracting the texts inserted in the HTML code through tags associated with text input, and, if a hyperlink tag exists in the HTML code, also accesses the sub page linked through the hyperlink tag and extracts the texts inserted through tags associated with text input from the HTML code of the sub page, thereby performing the extraction of the plurality of first texts present on the web page of the first company.

10. The method of claim 7,
Wherein the calculation of the similarity between different feature vectors is performed according to Equation 2 below.
[Equation 2]

11. A computer-readable recording medium having recorded thereon a program for causing a computer to perform the method according to any one of claims 6 to 10.

12. A computer program stored in a storage medium for executing, in combination with a computer, the method of any one of claims 6 to 10.