KR100837797B1

KR100837797B1 - Method for automatic construction of acronym dictionary based on acronym type, Recording medium thereof and Apparatus for automatic construction of acronym dictionary based on acronym type

Info

Publication number: KR100837797B1
Application number: KR1020060092174A
Authority: KR
Inventors: 임해창; 윤여찬; 송영인
Original assignee: 고려대학교 산학협력단
Priority date: 2006-09-22
Filing date: 2006-09-22
Publication date: 2008-06-13
Also published as: KR20080026931A

Abstract

Disclosed are a method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types, a recording medium therefor, and an apparatus for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types.

The present invention includes: collecting original words from web documents on the Internet; generating, for the collected original words, abbreviation candidates corresponding to at least one of a noun-omission type, a syllable-combination type, and a mixed type; computing the probability that each abbreviation candidate is an actual abbreviation by applying the frequency with which the candidate appears in web documents on the Internet to a predetermined classification model; and registering in an abbreviation dictionary those abbreviation candidates whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold.

According to the present invention, a method for automatically constructing a dictionary of Korean abbreviations is provided that achieves both high efficiency and high accuracy. Because only plausible abbreviation candidates are generated in consideration of the linguistic characteristics of Korean, the time required to construct the dictionary is reduced, and because a probabilistic model based on the characteristic features of abbreviations is used, a dictionary with high precision and high recall can be constructed.

Description

Method for automatic construction of acronym dictionary based on acronym type, Recording medium thereof and Apparatus for automatic construction of acronym dictionary based on acronym type

FIG. 1 is a block diagram of an apparatus for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types according to the present invention.

FIG. 2 is a flowchart of a method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types according to an embodiment of the present invention.

FIGS. 3A and 3B are graphs showing the number of syllables preferred by each generation type, which is the reason the classification model of FIG. 2 is separated by type.

FIG. 4 is a flowchart of a method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types according to another embodiment of the present invention.

FIG. 5 is a flowchart of a method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types according to yet another embodiment of the present invention.

The present invention relates to natural language processing and, more particularly, to a method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types, a recording medium therefor, and an apparatus for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types.

An abbreviation is a word formed by shortening the syllables or morphemes of an original word. For example, '국산품' (domestic product) is shortened to '국산', and the company name '한국전력공사' (Korea Electric Power Corporation) is shortened to '한전'. Abbreviation is a phenomenon that serves linguistic efficiency, and it occurs all the more frequently today, when the Internet is in heavy use.

Identifying abbreviations and finding their original words is required for information retrieval, which searches the web for desired information, and for information extraction, which extracts only the desired information from a given document. In information retrieval, for example, when an abbreviation appears in a user's query, automatically adding its original word to expand the query raises the likelihood of finding the desired documents; conversely, when only the original word is used in the query, performance can be improved by also retrieving documents in which the abbreviation appears. Moreover, constructing a dictionary of abbreviations aids research in linguistics and Korean language studies, and makes documents easier to understand by letting a reader readily check which original word an abbreviation in a document refers to.

However, the number of existing abbreviations is very large, and new ones are constantly being created.

Consequently, conventional abbreviation dictionaries are difficult to construct manually, cannot reflect the characteristics of Korean abbreviations, and cannot be constructed automatically.

Accordingly, the first technical object of the present invention is to provide a method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types, which automatically constructs a dictionary of Korean abbreviations with both high efficiency and high accuracy; which reduces the time needed to construct the dictionary by generating only plausible abbreviation candidates in consideration of the linguistic characteristics of Korean; and which can construct a dictionary with high precision and high recall by using a probabilistic model based on the characteristic features of abbreviations.

The second technical object of the present invention is to provide a computer-readable recording medium on which is recorded a program for executing the above method on a computer.

The third technical object of the present invention is to provide an apparatus for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types, to which the above method is applied.

To achieve the first technical object, the present invention provides a method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types, the method including: collecting original words from web documents on the Internet; generating, for the collected original words, abbreviation candidates corresponding to at least one of a noun-omission type, a syllable-combination type, and a mixed type; computing the probability that each abbreviation candidate is an actual abbreviation by applying the frequency with which the candidate appears in web documents on the Internet to a classification model that computes that probability and takes into account features for distinguishing abbreviations; and registering in an abbreviation dictionary those candidates whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold, the threshold serving to screen out candidates that are likely not actual abbreviations.

To achieve the first technical object, the present invention also provides a method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types, the method including: collecting original words from web documents on the Internet; generating, for the collected original words, abbreviation candidates corresponding to at least one of a noun-omission type, a syllable-combination type, and a mixed type; computing the probability that each abbreviation candidate is an actual abbreviation and the probability that it is not, by applying the frequency with which the candidate appears in web documents on the Internet to a classification model that takes into account features for distinguishing abbreviations; selecting the candidates whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold, the threshold serving to screen out candidates that are likely not actual abbreviations; removing from the selected candidates those for which the difference between the probability of being an actual abbreviation and the probability of not being one is smaller than a precision threshold, which serves to raise precision by further screening out candidates that are likely not actual abbreviations; and registering the selected candidates in an abbreviation dictionary.

To achieve the first technical object, the present invention further provides a method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types, the method including: collecting original words from web documents on the Internet; generating, for the collected original words, abbreviation candidates corresponding to at least one of a noun-omission type, a syllable-combination type, and a mixed type; computing the probability that each abbreviation candidate is an actual abbreviation and the probability that it is not, by applying the frequency with which the candidate appears in web documents on the Internet to a classification model that takes into account features for distinguishing abbreviations; selecting the candidates whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold, the threshold serving to screen out candidates that are likely not actual abbreviations; adding to the selected candidates those candidates whose probability of being an actual abbreviation is smaller than the abbreviation threshold but for which the difference between the probability of being an actual abbreviation and the probability of not being one is greater than or equal to a recall threshold, which serves to raise recall by restoring candidates that may be actual abbreviations; and registering the selected candidates in an abbreviation dictionary.
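The three selection variants described above differ only in how the thresholds are applied, which the following minimal Python sketch illustrates. It is not the patented implementation: the function name `select_candidates`, the score values, and the threshold values are all hypothetical, and folding the precision and recall variants into one function with optional parameters is a choice made here for brevity.

```python
from typing import Dict, List, Optional, Tuple

# candidate -> (score for "is an abbreviation", score for "is not an abbreviation").
# All values below are made up for illustration only.
scores: Dict[str, Tuple[float, float]] = {
    "한전": (0.9, 0.1),
    "한공": (0.6, 0.55),
    "전력공사": (0.4, 0.1),
    "국공": (0.2, 0.7),
}


def select_candidates(
    scores: Dict[str, Tuple[float, float]],
    abbr_threshold: float,
    precision_threshold: Optional[float] = None,
    recall_threshold: Optional[float] = None,
) -> List[str]:
    """Return the candidates to register in the dictionary."""
    # Base rule: keep candidates whose abbreviation probability reaches the threshold.
    selected = [c for c, (p, _) in scores.items() if p >= abbr_threshold]

    if precision_threshold is not None:
        # Precision variant: drop selected candidates whose margin (p - q)
        # is smaller than the precision threshold.
        selected = [
            c for c in selected
            if scores[c][0] - scores[c][1] >= precision_threshold
        ]

    if recall_threshold is not None:
        # Recall variant: restore candidates below the abbreviation threshold
        # whose margin (p - q) still reaches the recall threshold.
        restored = [
            c for c, (p, q) in scores.items()
            if p < abbr_threshold and p - q >= recall_threshold
        ]
        selected.extend(restored)

    return selected


basic = select_candidates(scores, abbr_threshold=0.5)
precise = select_candidates(scores, abbr_threshold=0.5, precision_threshold=0.2)
recalled = select_candidates(scores, abbr_threshold=0.5, recall_threshold=0.25)
```

With these illustrative numbers, the precision variant removes '한공' because its two scores are nearly equal, while the recall variant restores '전력공사' because its margin is large even though its score is below the abbreviation threshold.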

To achieve the second technical object, the present invention provides a computer-readable recording medium on which is recorded a program for executing, on a computer, the above method for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types.

To achieve the third technical object, the present invention provides an apparatus for automatically constructing an abbreviation dictionary in consideration of abbreviation generation types, the apparatus including: an original word collection unit that collects original words from web documents on the Internet; an abbreviation candidate generation unit that generates, for the collected original words, abbreviation candidates corresponding to at least one of a noun-omission type, a syllable-combination type, and a mixed type; a classification model unit that computes the probability that each abbreviation candidate is an actual abbreviation by applying the frequency with which the candidate appears in web documents on the Internet to a classification model that computes that probability and takes into account features for distinguishing abbreviations; and a dictionary generation unit that registers in an abbreviation dictionary those candidates whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold, the threshold serving to screen out candidates that are likely not actual abbreviations.

Hereinafter, preferred embodiments of the present invention are described with reference to the drawings. However, the embodiments illustrated below may be modified in various other forms, and the scope of the present invention is not limited to the embodiments described below.

As shown in FIG. 1, the apparatus for automatically constructing an abbreviation dictionary according to the present invention generates all possible abbreviation candidates for a given original word in consideration of how abbreviations are formed, and then constructs the dictionary by using a probabilistic model to classify which of those candidates are actually used. Original words are easy to obtain from websites and similar sources. For proper nouns in particular, a list of original words can be collected without difficulty from websites related to a specific domain. A list of company names, for example, can easily be obtained from the site of the Korea Chamber of Commerce and Industry or of the stock exchange. A dictionary of sufficient size can therefore be constructed for a given purpose.

The original word collection unit 110 collects original words from web documents on the Internet. The network interface unit 120 may be used to connect to the Internet. For example, the network interface unit 120 may be an interface device such as an Ethernet card or a wireless LAN card.

The abbreviation candidate generation unit 130 generates, for the original words collected by the original word collection unit 110, abbreviation candidates corresponding to at least one of a noun-omission type, a syllable-combination type, and a mixed type.

The classification model unit 140 computes the probability that each abbreviation candidate generated by the abbreviation candidate generation unit 130 is an actual abbreviation by applying the frequency with which the candidate appears in web documents on the Internet to a predetermined classification model. For example, the predetermined classification model may be a naive Bayes classification model.

The dictionary generation unit 150 registers in the abbreviation dictionary those abbreviation candidates for which the probability of being an actual abbreviation, as computed by the classification model unit 140, is greater than or equal to the abbreviation threshold. Registration in the dictionary includes constructing a database. For example, the database may include a hard disk drive, an optical recording medium, and a recording device therefor. The abbreviation threshold is a threshold for screening out candidates that are likely not actual abbreviations, and it may be determined through repeated experiments.

For example, the apparatus according to the present invention may be implemented using a computer system that includes storage means capable of holding a large database, an I/O device connected to the storage means, a microprocessor that performs the main computation as well as the computation for the classification model and the morphological analyzer, memory for the microprocessor, and an interface card that connects to the Internet and is assigned an IP address.

First, original words are collected from web documents on the Internet (step 210).

Next, abbreviation candidates corresponding to at least one of a noun-omission type, a syllable-combination type, and a mixed type are generated for the collected original words (step 220). Preferably, step 220 may include analyzing the morphemes of the collected original words to split them into constituent nouns, and generating, for the original words so split, abbreviation candidates corresponding to at least one of the noun-omission type, the syllable-combination type, and the mixed type.

Once the abbreviation candidates have been generated, the probability that each candidate is an actual abbreviation is computed by applying the frequency with which the candidate appears in web documents on the Internet to a predetermined classification model (step 230). Preferably, step 230 may include constructing a separate classification model for each of the noun-omission, syllable-combination, and mixed types in consideration of the features specific to that type, and applying each abbreviation candidate to the corresponding one of the constructed classification models.

Finally, the abbreviation candidates whose probability of being an actual abbreviation is greater than or equal to the abbreviation threshold are registered in the abbreviation dictionary (step 240).

Each of the above steps is described below by way of example.

Assume first that the original words have been collected. As the first step, plausible abbreviation candidates are generated for the given original words. In Korean, an abbreviation may be formed by omitting particular nouns, as when '교육인적자원부' (Ministry of Education and Human Resources Development) is shortened to '교육부', or by combining particular syllables of particular nouns, as when '정보통신부' (Ministry of Information and Communication) is shortened to '정통부'. The present invention divides the ways of forming abbreviations as shown in Table 1 and generates possible abbreviation candidates accordingly.

To generate abbreviations according to the above classification, the present invention applies one of the following rules to each noun constituting the original word.

Rule 1. Omit the noun.

Rule 2. Select the first or the last syllable of the noun and include it in the abbreviation.

Rule 3. Include the entire noun in the abbreviation.

The type of a generated abbreviation can also be determined from the rules that were applied, as follows.

The noun-omission type is an abbreviation generated by applying Rule 1 or Rule 3, as when "교육/인적/자원/부" becomes "교육부". The syllable-combination type is an abbreviation generated by applying Rule 1 or Rule 2, as when "국가/정보/원" (National Intelligence Service) becomes "국정원". The mixed type is an abbreviation generated by applying both Rule 2 and Rule 3, as when "대우자동차판매" (Daewoo Motor Sales) becomes "대우자판".
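The three rules and the type assignment can be sketched in Python as follows. This is a hedged illustration, not the source's implementation. It assumes the original word has already been split into constituent nouns, and that each Hangul syllable is a single precomposed character. The function names and type labels are hypothetical. One further assumption is made explicit in the code: a single-syllable noun such as '부' or '원' looks the same under Rule 2 and Rule 3, so when it is kept it is treated as neutral and does not influence the type.

```python
from itertools import product
from typing import List, Optional, Set, Tuple


def options_for(noun: str) -> List[Tuple[str, Optional[int]]]:
    """Return the (surface form, rule number) choices for one constituent noun."""
    if len(noun) == 1:
        # Assumption: a kept single-syllable noun is neutral (rule None),
        # because Rule 2 and Rule 3 give the same result for it.
        return [("", 1), (noun, None)]
    return [
        ("", 1),         # Rule 1: omit the noun
        (noun[0], 2),    # Rule 2: first syllable
        (noun[-1], 2),   # Rule 2: last syllable
        (noun, 3),       # Rule 3: whole noun
    ]


def classify(rules: List[Optional[int]]) -> str:
    """Map the set of applied rules to a generation type."""
    used = {r for r in rules if r is not None}
    if 2 in used and 3 in used:
        return "mixed"
    if 2 in used:
        return "syllable_combination"
    return "noun_omission"


def generate_candidates(nouns: List[str]) -> Set[Tuple[str, str]]:
    """Enumerate (candidate, type) pairs for an original word given as nouns."""
    original = "".join(nouns)
    results: Set[Tuple[str, str]] = set()
    for combo in product(*(options_for(n) for n in nouns)):
        surface = "".join(s for s, _ in combo)
        # Skip the empty string and the unabbreviated original word.
        if not surface or surface == original:
            continue
        results.add((surface, classify([r for _, r in combo])))
    return results


ministry = generate_candidates(["교육", "인적", "자원", "부"])
nis = generate_candidates(["국가", "정보", "원"])
daewoo = generate_candidates(["대우", "자동차", "판매"])
```

The enumeration is exhaustive over the rule choices, so most generated candidates are not real abbreviations; separating the real ones is the job of the classification step.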

Generating abbreviation candidates in this way requires splitting the original word into its constituent nouns. Existing morphological analyzers can be used for this purpose in the present invention.

As the second step, a probabilistic model is applied to the (original word, abbreviation candidate) pairs generated in the first step in order to select only the pairs that are actually in use. The probabilistic models that can be used include the naive Bayes classification model commonly used for classification. Using Equation 1, the naive Bayes classification model computes, given an original word and an abbreviation candidate, the probability that the candidate is an actual abbreviation and the probability that it is not.

The original word and the abbreviation candidate can be re-expressed as a feature set, which can in turn be decomposed into the individual features that make up the set. Equation 2 is then applied to make the final decision as to whether the abbreviation candidate is an actual abbreviation.
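For orientation, the standard naive Bayes factorization that the passage above describes in words can be written as follows. This is a sketch in assumed notation, not a reproduction of Equations 1 and 2 of the source: the symbols $A$ (the candidate is an actual abbreviation), $\bar{A}$ (it is not), $O$ (original word), $C$ (abbreviation candidate), $F$ (feature set), and $f_i$ (an individual feature) are all introduced here for illustration.

```latex
P(A \mid O, C) = P(A \mid F) \propto P(A) \prod_{i=1}^{n} P(f_i \mid A),
\qquad
P(\bar{A} \mid O, C) = P(\bar{A} \mid F) \propto P(\bar{A}) \prod_{i=1}^{n} P(f_i \mid \bar{A}),
\qquad
F = \{ f_1, \dots, f_n \}
```

The product form relies on the usual naive Bayes assumption that the features are conditionally independent given the class.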

In Equations 1 and 2, the given original word and abbreviation candidate are expressed as a set of features in order to compute the probability that the candidate is an abbreviation and the probability that it is not, and to decide whether it actually is one. In other words, the candidate is represented as a particular state, and the probabilities of being and of not being an abbreviation are computed for that state. Suppose, for example, that an (original word, candidate) pair is described by two features, the length of the abbreviation and the number of web documents in which it appears, and that a particular pair is in the state (abbreviation length = 3, number of web documents = 10). The probability is then obtained by counting, in the given data, how often a word in this state is an abbreviation. That is, if the given data (the training set) contain 10 words of length 3 that appear in 10 web documents, and 7 of them are actual abbreviations, the probability that a word in this state is an abbreviation is 0.7.
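The counting in the example above can be reproduced with a few lines of Python. This is a minimal sketch of the relative-frequency estimate only. The function name `state_probabilities` and the toy training set are hypothetical, and the estimate is taken over the whole state at once, as in the worked example, rather than per feature.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def state_probabilities(
    training: Iterable[Tuple[Hashable, bool]]
) -> Dict[Hashable, float]:
    """Estimate P(abbreviation | state) as a relative frequency per state."""
    totals: Counter = Counter()
    positives: Counter = Counter()
    for state, is_abbreviation in training:
        totals[state] += 1
        if is_abbreviation:
            positives[state] += 1
    return {state: positives[state] / totals[state] for state in totals}


# State = (abbreviation length, number of web documents), as in the example:
# 10 training words share the state (3, 10), and 7 of them are real abbreviations.
training_set = [((3, 10), True)] * 7 + [((3, 10), False)] * 3
probabilities = state_probabilities(training_set)
```

Here `probabilities[(3, 10)]` evaluates to 0.7, matching the figure in the text.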

The features that can be used to represent a given original word and abbreviation candidate are as follows: the difference between the number of nouns omitted and the number of nouns not omitted when the abbreviation is formed; the number of nouns omitted when the abbreviation is formed; the number of syllables in the abbreviation; the difference between the number of syllables in the original word and in the abbreviation; the number of last syllables selected; the number of combined syllables in the mixed type; and the co-occurrence frequency of the abbreviation and the original word in web documents.

The difference in the number of syllables between the original word and the abbreviation can be illustrated as follows. In the abbreviation '한국전력', one unit noun, '공사', can be regarded as omitted from the original word '한국전력공사'; equally, two syllables, '공' and '사', can be regarded as omitted. The features are therefore built with both the syllable and the unit noun as units of reduction, so that whether a candidate is an actual abbreviation can be judged from the degree of reduction and the length of the abbreviation.

The number of last syllables selected can be illustrated as follows. When syllables are selected in the syllable-combination and mixed types, the first syllable of each unit noun is commonly chosen, as in '한전' for '한국전력공사', while the last syllable is rarely chosen. The number of last syllables included in the abbreviation is therefore taken into account.

The number of combined syllables in the mixed type can be illustrated as follows. When particular syllables of unit nouns are extracted and combined in the mixed type, the number of such syllables used is taken into account.

Abbreviations tend to appear in web documents together with their original words; conversely, when an abbreviation candidate is not used as an abbreviation of the corresponding original word, documents in which the two co-occur are hard to find. Web documents in which the abbreviation and the original word appear together are therefore retrieved, and the co-occurrence frequency is used to judge whether the candidate is an actual abbreviation.

Web documents are collected by information retrieval. Two kinds of documents are used, those retrieved with the abbreviation candidate and the original word together as the query and those retrieved with the abbreviation candidate alone as the query, and multiple documents are collected for each query.
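A co-occurrence count over already-retrieved documents might be computed as in the sketch below. The retrieval step itself is not shown: the documents are assumed to be available as plain strings, and the function name and sample texts are hypothetical. One design choice is worth noting. A noun-omission candidate such as '한국전력' is a substring of its original word '한국전력공사', so occurrences of the original word are masked before the candidate is searched for; otherwise every document containing the original word would count as a co-occurrence.

```python
from typing import List


def cooccurrence_count(documents: List[str], original: str, candidate: str) -> int:
    """Count documents that contain both the original word and the candidate."""
    count = 0
    for text in documents:
        if original not in text:
            continue
        # Mask the original word so a candidate that is a substring of it
        # is only counted when it also occurs on its own.
        masked = text.replace(original, " ")
        if candidate in masked:
            count += 1
    return count


# Made-up sample documents standing in for retrieved web pages.
sample_documents = [
    "한국전력공사(한전)는 전기 요금을 조정했다.",
    "한전 주가가 올랐다.",
    "한국전력공사 채용 공고",
]

both = cooccurrence_count(sample_documents, "한국전력공사", "한전")
substring_only = cooccurrence_count(sample_documents, "한국전력공사", "한국전력")
```

In the sample, only the first document contains both '한국전력공사' and a separate occurrence of '한전', and no document contains '한국전력' outside the original word.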

Korean abbreviations differ in characteristics such as length depending on their generation type; for example, as shown in FIGS. 3A and 3B, the preferred number of syllables differs by generation type. Moreover, some features apply only to a particular type (for example, #OfRepChar). Performance can therefore be expected to improve if the training set is split by generation type and each model is built with its own set of features. Below, two configurations are applied and evaluated: models separated by generation type, and a single model that ignores the type.

The experiment used 314 original terms of four or more syllables, consisting of company names, national organization names, and university names. Of these, 835 (original term, abbreviation) pairs covering 307 original terms were used as positive examples, and 50,139 invalid (original term, abbreviation) pairs that can be generated from the 314 original terms were used as negative examples for training.

In the following, AbbNum is the number of nouns omitted when the abbreviation is generated; DiffNoun is the difference between the number of omitted nouns and the number of retained nouns; AcroLen is the number of syllables in the abbreviation; DiffLen is the difference in the number of syllables between the original term and the abbreviation; #ofLastChar is the number of selected final syllables; #ofRepChar is the number of combined syllables in the mixed type; CoOccurFreq is the co-occurrence frequency in the collected documents; and DefFreq is the frequency with which the original term appears in documents retrieved with the abbreviation candidate as the query.
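The structural features in this list can be computed once it is known how a candidate was derived from the unit nouns. The sketch below is illustrative only. The `derivation` representation (one `(unit_noun, kept_syllable_indices)` pair per unit noun), the function names, the sign of DiffNoun (omitted minus retained), and the choice to count #ofLastChar and #ofRepChar only over partially used nouns are all assumptions, since the text does not fix them. One Hangul syllable is assumed to be one precomposed character.

```python
def structural_features(derivation):
    """Compute the structural features for one (original term, candidate) pair.

    derivation: list of (unit_noun, kept) pairs, where kept is a tuple of the
    syllable indices taken from that noun; an empty tuple means the noun was omitted.
    """
    omitted = sum(1 for _, kept in derivation if not kept)
    retained = len(derivation) - omitted
    orig_len = sum(len(noun) for noun, _ in derivation)
    acro_len = sum(len(kept) for _, kept in derivation)
    # Nouns that contribute some, but not all, of their syllables.
    partial = [(noun, kept) for noun, kept in derivation if 0 < len(kept) < len(noun)]
    return {
        "AbbNum": omitted,
        "DiffNoun": omitted - retained,  # sign convention is an assumption
        "AcroLen": acro_len,
        "DiffLen": orig_len - acro_len,
        "#ofLastChar": sum(1 for noun, kept in partial if len(noun) - 1 in kept),
        "#ofRepChar": sum(len(kept) for _, kept in partial),
    }


def surface_form(derivation):
    """Rebuild the candidate string from its derivation."""
    return "".join(noun[i] for noun, kept in derivation for i in kept)


# '한국전력공사' split into unit nouns, with two ways of abbreviating it.
noun_omitted = [("한국", (0, 1)), ("전력", (0, 1)), ("공사", ())]  # 한국전력
first_syllables = [("한국", (0,)), ("전력", (0,)), ("공사", ())]   # 한전
```

For the '한국전력' derivation this gives one omitted noun and a syllable difference of two, which matches the two readings of the example (one unit noun dropped, or two syllables dropped).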

Five-fold cross-validation was used for training and evaluation, and performance was measured as the average of the results over the folds. Precision and recall for the evaluation are calculated as in Equation 3 below.
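Equation 3 itself is not reproduced in this text. The sketch below therefore assumes the standard definitions of precision and recall, and takes the F-measure to be their harmonic mean; the function names are hypothetical. The reported tables are consistent with that reading: a precision of 52.71% and a recall of 67.20% (the AbbNum row) combine to the reported 59.08%.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Standard precision and recall from confusion counts (assumed form of Equation 3)."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def average_over_folds(fold_scores):
    """Average per-fold scores, as done for the five-fold cross-validation."""
    return sum(fold_scores) / len(fold_scores)
```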

Table 2 shows the performance obtained as each feature is added when the model is not split by generation type.

| Feature | Precision | Recall | F-measure |
|---|---|---|---|
| Baseline | 47.91% | 68.60% | 56.41% |
| AbbNum | 52.71% | 67.20% | 59.08% |
| DiffLen | 50.15% | 67.70% | 57.62% |
| #OfLastChar | 58.78% | 69.60% | 63.74% |
| all | 58.68% | 73.00% | 65.06% |

Table 3 shows the performance when the abbreviation generation type is taken into account according to the present invention.

| Type | Features | Precision | Recall | F-measure |
|---|---|---|---|---|
| Noun omission | DiffNoun | 72.58% | 69.33% | 70.92% |
| Syllable combination | #OfLastChar + DefFreq | 67.47% | 68.29% | 67.88% |
| Mixed | DiffLen + #OfLastChar + #OfRepChar | 51.67% | 79.45% | 62.62% |
| Combined performance | | 64.52% | 72.00% | 68.05% |

The baseline is a classifier that uses only the co-occurrence frequency of the original term and the abbreviation in web documents as a feature, without considering any other characteristic. Table 2 shows the baseline performance and, for the case where the model is not split by generation type, the performance obtained when useful features are added to the baseline; performance improves by about 8.4% over the baseline. Table 3 shows, for the case where the features and the training set are varied by generation type, the performance and the useful features for each type, together with the combined performance over the whole training set. For the noun-contraction type (명사축약형), there is no training set of meaningful size, so its performance is measured only in the configuration where the model is not split by type. Performance improves by about 3% compared with the configuration that ignores the generation type.

Many of the cases in which a word that is not an actual abbreviation is classified as one arise when a misspelled word appears on the web. Since many such words have a single middle character dropped, as in the pair (대우자동차판매, 대우자동판매), removing candidates of this kind yields a precision of 70.51%, a recall of 71%, and an F-measure of 70.75%.

Depending on its purpose, an abbreviation dictionary may need precision more than recall, or the reverse, so a way of adjusting the balance is required. In the present invention, the classifier's probability outputs are normalized, and a candidate that was judged to be an abbreviation is not accepted as one if the difference between its probability of being an abbreviation and its probability of not being one falls short of a given value. This raises precision at the cost of recall. The expression below is used to judge abbreviations in this manner.

Conversely, to raise recall at the cost of precision, a candidate that was not judged to be an abbreviation is accepted as one if the difference between the probability values is greater than or equal to a given value.
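The two adjustments can be sketched as a single decision function. The expression from the original is not reproduced in this text, so the details here are assumptions: the difference is taken as P(abbreviation) minus P(not abbreviation) after normalization, the default abbreviation threshold of 0.5 is arbitrary, and the names `decide`, `precision_margin`, and `recall_margin` are hypothetical. Under this sign convention a recall margin is typically negative, because a rejected candidate has the smaller abbreviation probability.

```python
def normalize(p_abbr, p_not):
    """Scale the two classifier scores so that they sum to 1."""
    total = p_abbr + p_not
    if total <= 0:
        return 0.5, 0.5
    return p_abbr / total, p_not / total


def decide(p_abbr, p_not, abbr_threshold=0.5, precision_margin=None, recall_margin=None):
    """Return True if the candidate should be treated as an abbreviation.

    precision_margin: if set, an accepted candidate is rejected when the
        probability difference is below this value (raises precision).
    recall_margin: if set, a rejected candidate is accepted when the
        probability difference is at least this value (raises recall).
    """
    p_abbr, p_not = normalize(p_abbr, p_not)
    diff = p_abbr - p_not  # sign convention is an assumption
    accepted = p_abbr >= abbr_threshold
    if accepted and precision_margin is not None and diff < precision_margin:
        return False
    if not accepted and recall_margin is not None and diff >= recall_margin:
        return True
    return accepted
```

For example, a candidate scored 0.6 against 0.4 is accepted by default but rejected with `precision_margin=0.3`, while a candidate scored 0.45 against 0.55 is rejected by default but restored with `recall_margin=-0.2`.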

Table 4 shows the precision, recall, and F-measure obtained when the balance is adjusted in this way.

| Adjustment | Precision | Recall | F-measure |
|---|---|---|---|
| Precision raised | 82.30% | 54.40% | 65.50% |
| Recall raised | 53.38% | 83.00% | 64.97% |

First, original terms are collected from web documents on the Internet (step 410).

Next, abbreviation candidates corresponding to at least one of the noun-omission type, the syllable-combination type, or the mixed type are generated for the collected original terms (step 420).

Once the abbreviation candidates have been generated, the frequency with which they appear in web documents on the Internet is applied to a predetermined classification model to calculate, for each candidate, the probability that it is an actual abbreviation and the probability that it is not (step 430).

Next, the candidates whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold are selected (step 440).

From the selected candidates, those for which the difference between the probability of being an actual abbreviation and the probability of not being one is below a precision threshold are then removed (steps 450 and 460).

Here, the precision threshold is a threshold for raising precision by additionally screening out candidates that are likely not to be actual abbreviations; it can be determined through repeated experiments.

Finally, the selected abbreviation candidates are listed in the abbreviation dictionary (step 470).
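The candidate generation of step 420 can be sketched as an enumeration over the unit nouns of an original term. The generation rules are not spelled out in this section, so everything below is an assumption made for illustration: each unit noun is either dropped, kept whole, or contributes exactly one syllable; a candidate built only from dropped and whole nouns is labeled noun omission; one syllable from every noun is labeled syllable combination; anything else is labeled mixed. The enumeration is exhaustive and does not model any linguistic restriction the method may apply to prune candidates.

```python
from itertools import product

DROP, WHOLE = "drop", "whole"


def generate_candidates(unit_nouns):
    """Map each candidate string to the set of (assumed) generation types that yield it."""
    options = []
    for noun in unit_nouns:
        picks = [DROP, WHOLE]
        if len(noun) > 1:
            picks += list(range(len(noun)))  # index of the single syllable taken
        options.append(picks)

    candidates = {}
    for choice in product(*options):
        # Skip the original term itself and the empty string.
        if all(c == WHOLE for c in choice) or all(c == DROP for c in choice):
            continue
        text = "".join(
            noun if c == WHOLE else "" if c == DROP else noun[c]
            for noun, c in zip(unit_nouns, choice)
        )
        single = sum(isinstance(c, int) for c in choice)
        if single == 0:
            kind = "noun_omission"
        elif single == len(choice):
            kind = "syllable_combination"
        else:
            kind = "mixed"
        candidates.setdefault(text, set()).add(kind)
    return candidates


kepco = generate_candidates(["한국", "전력", "공사"])
```

With these labels, '한국전력' comes out as a noun-omission candidate and '한전' as a mixed candidate, since '공사' is dropped while the other two nouns contribute one syllable each.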

First, original terms are collected from web documents on the Internet (step 510).

Next, abbreviation candidates corresponding to at least one of the noun-omission type, the syllable-combination type, or the mixed type are generated for the collected original terms (step 520).

Once the abbreviation candidates have been generated, the frequency with which they appear in web documents on the Internet is applied to a predetermined classification model to calculate, for each candidate, the probability that it is an actual abbreviation and the probability that it is not (step 530).

Next, the candidates whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold are selected (step 540).

Then, among the candidates judged above not to be abbreviations, that is, those whose probability of being an actual abbreviation is below the abbreviation threshold, the candidates for which the difference between the probability of being an actual abbreviation and the probability of not being one is greater than or equal to a recall threshold are added to the selected candidates (steps 550 and 560).

Here, the recall threshold is a threshold for raising recall by restoring candidates that may be abbreviations even though they were judged not to be in the preceding step; it can be determined through repeated experiments.

Finally, the selected abbreviation candidates are listed in the abbreviation dictionary (step 570).
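Steps 430 and 530 both need a classifier that returns the probability of being an abbreviation and the probability of not being one, and the claims name a naive Bayes model for this. The sketch below is a generic categorical naive Bayes with add-one smoothing, not the patent's trained model. The class name, the treatment of every feature as a discrete value, the smoothing scheme, and the four training rows are all illustrative assumptions; frequency features such as CoOccurFreq would first need to be binned.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes over dict-valued feature rows."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> values seen in training
        for row, label in zip(rows, labels):
            for feat, val in row.items():
                self.value_counts[(label, feat)][val] += 1
                self.values[feat].add(val)
        return self

    def predict_proba(self, row):
        """Return normalized class probabilities for one feature row."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for feat, val in row.items():
                count = self.value_counts[(label, feat)][val]
                # The extra +1 reserves probability mass for unseen values.
                denom = n + self.alpha * (len(self.values[feat]) + 1)
                score += math.log((count + self.alpha) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


# Made-up toy rows; True marks an actual abbreviation.
rows = [{"AbbNum": 1, "#ofLastChar": 0}, {"AbbNum": 1, "#ofLastChar": 0},
        {"AbbNum": 2, "#ofLastChar": 1}, {"AbbNum": 2, "#ofLastChar": 1}]
labels = [True, True, False, False]
model = NaiveBayes().fit(rows, labels)
probs = model.predict_proba({"AbbNum": 1, "#ofLastChar": 0})
```

The two values in `probs` already sum to 1, so they can be compared directly against an abbreviation threshold and used to form the probability difference that the precision and recall thresholds act on.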

Preferably, a program for executing on a computer the method of the present invention for automatically building an abbreviation dictionary that takes the abbreviation generation type into account may be recorded on, and provided as, a computer-readable recording medium.

The present invention can be implemented in software. When it is implemented in software, the constituent means of the invention are code segments that perform the necessary tasks. The program or code segments may be stored on a processor-readable medium, or transmitted over a transmission medium or communication network as a computer data signal combined with a carrier wave.

The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, DVD±ROM, DVD-RAM, magnetic tape, floppy disks, hard disks, and optical data storage devices. The computer-readable recording medium may also be distributed over computer devices connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

Although the present invention has been described with reference to an embodiment shown in the drawings, that embodiment is merely illustrative, and those of ordinary skill in the art will understand that various modifications and variations of the embodiment are possible. Such modifications should nevertheless be regarded as falling within the technical scope of protection of the present invention. The true technical scope of protection of the present invention should therefore be determined by the technical spirit of the appended claims.

As described above, the present invention provides a method for automatically building a dictionary of Korean abbreviations with high efficiency and high accuracy. Because only plausible abbreviation candidates are generated in view of the linguistic characteristics of Korean, the time needed to build the dictionary is reduced, and because a probabilistic model based on the characteristics of abbreviations is used, a dictionary with high accuracy and high recall can be built.

Claims

A method for automatically building an abbreviation dictionary that takes the abbreviation generation type into account, the method comprising:

collecting original terms from web documents on the Internet;

generating abbreviation candidates for the collected original terms, each candidate corresponding to at least one of a noun-omission type, a syllable-combination type, or a mixed type;

calculating the probability that each abbreviation candidate is an actual abbreviation by applying the frequency with which the candidate appears in web documents on the Internet to a classification model that considers features for distinguishing abbreviations; and

listing in the abbreviation dictionary those abbreviation candidates whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold, the threshold serving to screen out candidates that are unlikely to be actual abbreviations.

The method of claim 1,

wherein generating the abbreviation candidates comprises:

analyzing the morphemes of the collected original terms and dividing each original term into its constituent nouns; and

generating, for the original terms divided into constituent nouns, abbreviation candidates corresponding to at least one of the noun-omission type, the syllable-combination type, or the mixed type.

The method of claim 1,

wherein calculating the probability that the abbreviation candidates are actual abbreviations comprises:

constructing a classification model for each of the noun-omission type, the syllable-combination type, and the mixed type; and

applying each abbreviation candidate to the corresponding classification model among the constructed classification models.

The method of claim 3,

wherein, in constructing the classification models,

the classification model for the noun-omission type considers, as a feature for distinguishing abbreviations, the number of nouns omitted when the abbreviation is generated.

The method of claim 3,

wherein, in constructing the classification models,

the classification model for the syllable-combination type considers, as features for distinguishing abbreviations, the number of selected final syllables and the frequency with which the original term appears in documents retrieved with the abbreviation candidate as the query.

The method of claim 3,

wherein, in constructing the classification models,

the classification model for the mixed type considers, as features for distinguishing abbreviations, the difference in the number of syllables between the original term and the abbreviation, the number of selected final syllables, and the number of combined syllables in the mixed type.

The method of claim 1,

wherein the classification model is a naive Bayes classification model.

Collecting original terms from web documents on the Internet;

Calculating, for each abbreviation candidate, the probability that it is an actual abbreviation and the probability that it is not, by applying the frequency with which the candidate appears in web documents on the Internet to a classification model that considers features for distinguishing abbreviations;

Selecting, from among the abbreviation candidates, those whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold, the threshold serving to screen out candidates that are unlikely to be actual abbreviations;

Removing, from the selected abbreviation candidates, those for which the difference between the probability of being an actual abbreviation and the probability of not being one is smaller than a precision threshold, the precision threshold serving to raise precision by additionally screening out candidates that are unlikely to be actual abbreviations; and

Automatically listing the selected abbreviation candidates in an abbreviation dictionary.

The method of claim 8,

Collecting original terms from web documents on the Internet;

Adding to the selected abbreviation candidates those candidates whose probability of being an actual abbreviation is smaller than the abbreviation threshold but for which the difference between the probability of being an actual abbreviation and the probability of not being one is greater than or equal to a recall threshold, the recall threshold serving to raise recall by restoring candidates that may be actual abbreviations;

The method of claim 10,

A computer-readable recording medium having recorded thereon a program for executing the method of claim 1 on a computer.

An apparatus for automatically building an abbreviation dictionary that takes the abbreviation generation type into account, the apparatus comprising:

an original-term collection unit for collecting original terms from web documents on the Internet;

an abbreviation candidate generation unit for generating abbreviation candidates for the collected original terms, each candidate corresponding to at least one of a noun-omission type, a syllable-combination type, or a mixed type;

a classification model unit for calculating the probability that each abbreviation candidate is an actual abbreviation by applying the frequency with which the candidate appears in web documents on the Internet to a classification model that considers features for distinguishing abbreviations; and

a dictionary generation unit for listing in the abbreviation dictionary those abbreviation candidates whose probability of being an actual abbreviation is greater than or equal to an abbreviation threshold, the threshold serving to screen out candidates that are unlikely to be actual abbreviations.