KR101377114B1

KR101377114B1 - News snippet generation system and method for generating news snippet

Info

Publication number: KR101377114B1
Application number: KR1020120113021A
Authority: KR
Inventors: 김태환; 신동욱; 김정선
Original assignee: 한양대학교 에리카산학협력단
Priority date: 2012-10-11
Filing date: 2012-10-11
Publication date: 2014-03-24

Abstract

The present invention relates to a news snippet generation system and a method thereof. The system includes a sentence similarity calculation unit, which calculates the sentence similarity between the sentences of a combined news article, generated by combining a seed news article with one or more duplicate news articles, and the titles of the seed news article and the duplicate news articles; and a news snippet generator, which ranks the importance of each sentence of the combined news article based on the calculated sentence similarity and generates a snippet of the combined news article based on the importance ranking. [Reference numerals] (101) Seed news DB; (102) Search engine; (103) Target news DB; (104) News providing server system A; (105) News providing server system B; (106) News providing server system C; (110) Duplicate candidate news detection unit; (120) Duplicate news detection unit; (130) Duplicate news DB; (140) Duplicate news combining unit; (150) Sentence similarity calculation unit; (160) Sentence weight calculation unit; (170) Summary generation unit; (180) News recommendation unit; (AA,CC,EE) Seed news; (BB) Target news; (DD) Duplicate candidate news; (FF) Duplicate news; (GG) Network; (HH) Combined news

Description

NEWS SNIPPET GENERATION SYSTEM AND METHOD FOR GENERATING NEWS SNIPPET

The present invention relates to a system and method for generating a summary of news.

In general, Internet search engines collect information about events occurring in real time from multiple news sites and provide it to users. News sites publish a large volume of news every day, and for an event likely to interest readers, most news sites carry similar articles about it. As a result, the list returned by a search engine includes many duplicate news articles (duplication news) collected from multiple news sites. Likewise, when a user searches for news by entering a specific search term into a search engine, articles with similar content appear repeatedly. Because search engines do not process duplicate news that appears simultaneously on multiple news sites, the collected data contains duplicates. Consequently, users of personal terminals such as smartphones are unnecessarily and repeatedly served news articles similar to information they have already acquired, and must spend a long time finding the new news they want.

Meanwhile, checking the content of search results or recommended news takes considerable time. If the content is summarized in sentences that sufficiently express it, the user can judge more quickly and accurately from the summary whether the news meets his or her needs. Text composed of sentences that sufficiently express the content of the news may be called a summary, of which a snippet is one example. In general, when a user enters a query into a search engine, a snippet presents, for each search result, the sentence in the body that is most closely related to the query, helping the user determine whether the document meets his or her needs. However, conventional snippets have the drawback that a snippet can be composed only when a user query exists.

An object of the present invention is to provide a news summary generation system and method capable of generating a summary of combined news produced by integrating a plurality of news articles.

Another object of the present invention is to provide a news summary generation system and method capable of generating a summary of news even when no query is input by the user.

The problems to be solved by the present invention are not limited to those mentioned above. Other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

A news summary generation system according to one aspect of the present invention includes: a sentence similarity calculation unit that calculates the sentence similarity between the sentences of combined news, generated by combining predetermined seed news with one or more duplicate news articles for the seed news, and the titles of the seed news and the duplicate news; and a summary generation unit that determines an importance ranking of the sentences of the combined news based on the sentence similarity and generates a summary of the combined news based on the determined importance ranking.

The sentence similarity calculation unit extracts words from the title of the seed news, the titles of the duplicate news, and the sentences of the combined news, and calculates the sentence similarity by calculating the similarity between the extracted words of the seed news and the duplicate news and the words of the sentences of the combined news.

The sentence similarity calculation unit detects, in a WordNet glossary-based hierarchy, the least upper noun (the lowest superordinate noun) that commonly subsumes a noun of the title of the seed news and of the duplicate news and a noun appearing in a sentence of the combined news; calculates a word importance by performing an operation proportional to the numbers of synonyms of the noun of the title of the seed news and of the duplicate news, of the noun appearing in the sentence of the combined news, and of the least upper noun; and calculates the sentence similarity as the average of the word importance values of the nouns appearing in the sentence of the combined news.

The news summary generation system further includes a sentence weight calculation unit that calculates a sentence weight based on the frequency with which words appear in a sentence of the combined news and on the position of the sentence, and the summary generation unit determines the importance ranking of the sentences based on the sentence similarity and the sentence importance.

The sentence weight calculation unit includes: a first sentence weight calculation unit that calculates a frequency-based first sentence weight by averaging the frequencies of the words appearing in a sentence of the combined news; a second sentence weight calculation unit that assigns a second sentence weight according to the position of the sentence in the combined news, such that earlier sentences receive higher weights; and an operation unit that calculates the sentence weight by multiplying the first sentence weight by the second sentence weight.
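The two-factor sentence weight described above can be illustrated with a minimal Python sketch. The specification does not give the exact formulas at this point, so several details here are assumptions made only for illustration: word frequencies are counted over the whole combined news, the position weight decays linearly as (n - i) / n, and the function name `sentence_weights` is hypothetical.

```python
from collections import Counter


def sentence_weights(sentences):
    """Return one weight per sentence.

    `sentences` is a list of token lists in document order.
    """
    # Assumption: a word's frequency is counted over the whole combined news.
    freq = Counter(tok for sent in sentences for tok in sent)
    n = len(sentences)
    weights = []
    for i, sent in enumerate(sentences):
        # First weight: average frequency of the words in this sentence.
        w1 = sum(freq[t] for t in sent) / len(sent) if sent else 0.0
        # Second weight: earlier sentences score higher (assumed linear decay).
        w2 = (n - i) / n
        # Final weight: product of the two, as the text describes.
        weights.append(w1 * w2)
    return weights
```

Any monotonically decreasing position weight would satisfy the description equally well; the linear form is chosen only because it is easy to check by hand.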

The summary generation unit generates the summary by extracting, from the sentences of the combined news, a predetermined number of sentences ranked highest in the importance ranking.
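Selecting a predetermined number of top-ranked sentences might look like the following sketch. The function name `build_snippet`, the default of three sentences, and the choice to emit the selected sentences in their original document order are assumptions for illustration; the text only states that the highest-ranked sentences are extracted.

```python
def build_snippet(sentences, scores, k=3):
    """Pick the k highest-scoring sentences and join them into a snippet."""
    # Rank sentence indices by descending importance score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Assumption: keep the chosen sentences in document order for readability.
    chosen = sorted(ranked[:k])
    return " ".join(sentences[i] for i in chosen)
```

For example, `build_snippet(["A.", "B.", "C.", "D."], [0.1, 0.9, 0.5, 0.7], k=2)` selects the second and fourth sentences.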

A news providing system according to another aspect of the present invention includes: a seed news database that stores predetermined seed news; a search engine that searches for target news using a search term including a word appearing in the title of the seed news; a target news database that stores the retrieved target news; a duplicate candidate news detection unit that extracts a title from each of the seed news and the target news, calculates the similarity between the extracted title of the seed news and the title of each target news, and detects duplicate candidate news among the target news based on the calculated title similarity; a duplicate news detection unit that extracts content from the seed news and the duplicate candidate news, calculates the similarity between the sentences included in the extracted content of the seed news and the sentences included in the content of the duplicate candidate news, and detects duplicate news among the duplicate candidate news based on the calculated sentence similarity; a duplicate news database that stores the detected duplicate news; a duplicate news combining unit that generates combined news by integrating the seed news and the duplicate news into one document; a sentence similarity calculation unit that calculates the sentence similarity between the sentences of the combined news and the titles of the seed news and the duplicate news; a summary generation unit that determines an importance ranking of the sentences of the combined news based on the sentence similarity and generates a summary of the combined news based on the determined importance ranking; and a news recommendation unit that provides the combined news together with the summary in the form of a web page.

A news summary generation method according to yet another aspect of the present invention includes: a sentence similarity calculation step of calculating the sentence similarity between the sentences of combined news, generated by combining predetermined seed news with one or more duplicate news articles for the seed news, and the titles of the seed news and the duplicate news; and a summary generation step of determining an importance ranking of the sentences of the combined news based on the sentence similarity and generating a summary of the combined news based on the determined importance ranking.

The sentence similarity calculation step extracts words from the title of the seed news, the titles of the duplicate news, and the sentences of the combined news, and calculates the sentence similarity by calculating the similarity between the extracted words of the seed news and the duplicate news and the words of the sentences of the combined news.

The sentence similarity calculation step detects, in a WordNet glossary-based hierarchy, the least upper noun (the lowest superordinate noun) that commonly subsumes a noun of the title of the seed news and of the duplicate news and a noun appearing in a sentence of the combined news; calculates a word importance by performing an operation proportional to the numbers of synonyms of the noun of the title of the seed news and of the duplicate news, of the noun appearing in the sentence of the combined news, and of the least upper noun; and calculates the sentence similarity as the average of the word importance values of the nouns appearing in the sentence of the combined news.

The news summary generation method further includes, between the sentence similarity calculation step and the summary generation step, a sentence weight calculation step of calculating a sentence weight based on the frequency with which words appear in a sentence of the combined news and on the position of the sentence, and the summary generation step determines the importance ranking of the sentences based on the sentence similarity and the sentence importance.

The sentence weight calculation step includes: a first sentence weight calculation step of calculating a frequency-based first sentence weight by averaging the frequencies of the words appearing in a sentence of the combined news; a second sentence weight calculation step of assigning a second sentence weight according to the position of the sentence in the combined news, such that earlier sentences receive higher weights; and an operation step of calculating the sentence weight by multiplying the first sentence weight by the second sentence weight.

The summary generation step generates the summary by extracting, from the sentences of the combined news, a predetermined number of sentences ranked highest in the importance ranking.

According to yet another aspect of the present invention, a computer-readable recording medium on which a program for executing the news summary generation method is recorded may be provided.

A news providing method according to yet another aspect of the present invention includes: searching for target news using a search term including a word appearing in the title of predetermined seed news; extracting a title from each of the seed news and the target news, calculating the similarity between the extracted title of the seed news and the title of each target news, and detecting duplicate candidate news among the target news based on the calculated title similarity; extracting content from the seed news and the duplicate candidate news, calculating the similarity between the sentences included in the extracted content of the seed news and the sentences included in the content of the duplicate candidate news, and detecting duplicate news among the duplicate candidate news based on the calculated sentence similarity; generating combined news by integrating the seed news and the duplicate news into one document; calculating the sentence similarity between the sentences of the combined news and the titles of the seed news and the duplicate news; determining an importance ranking of the sentences of the combined news based on the sentence similarity and generating a summary of the combined news based on the determined importance ranking; and providing the combined news together with the summary in the form of a web page.

According to an embodiment of the present invention, a summary can be generated for combined news produced by integrating a plurality of news articles.

In addition, according to an embodiment of the present invention, a summary of news can be generated even when no query is input by the user.

FIG. 1 is a block diagram of a news providing system according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of seed news and target news corresponding to duplicate news.
FIG. 3 is a block diagram of the duplicate candidate news detection unit of a news providing system according to an embodiment of the present invention.
FIG. 4 is a block diagram of the duplicate candidate news determination unit of a news providing system according to an embodiment of the present invention.
FIG. 5 is a block diagram of the duplicate news detection unit of a news providing system according to an embodiment of the present invention.
FIG. 6 is a block diagram of the duplicate news detection preprocessing unit of a news providing system according to an embodiment of the present invention.
FIG. 7 is a block diagram of the duplicate news determination unit of a news providing system according to an embodiment of the present invention.
FIG. 8 is a block diagram of the duplicate news combining unit of a news providing system according to an embodiment of the present invention.
FIG. 9 is a block diagram of the sentence weight calculation unit of a news providing system according to an embodiment of the present invention.
FIG. 10 is a flowchart of a news providing method according to an embodiment of the present invention.
FIG. 11 is a diagram showing an example of combined news generated by a news providing method according to an embodiment of the present invention.
FIG. 12 is a graph showing the rate of increase in the number of sentences of the combined news generated by a news providing method according to an embodiment of the present invention, relative to the seed news.
FIG. 13 shows an example of a summary (snippet) generated by a news providing method according to an embodiment of the present invention.
FIG. 14 is a graph showing, for each news category, the number of target news articles used for combining and the average number of sentences in a news providing method according to an embodiment of the present invention.
FIG. 15 is a graph showing the recall of the summaries generated for each news category by a news providing method according to an embodiment of the present invention.
FIG. 16 is a graph showing the snippet suitability rate and the snippet unsuitability rate of the summaries (snippets) generated by a news providing method according to an embodiment of the present invention.

Other advantages and features of the present invention, and methods of achieving them, will become apparent from the embodiments described in detail below with reference to the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and fully conveys the scope of the invention to those of ordinary skill in the art to which the invention pertains, and the present invention is defined only by the scope of the claims. Unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly accepted in the art to which this invention belongs. Terms defined in general dictionaries should be interpreted as having the same meaning as in the related art and/or in the text of this application, and, unless expressly defined herein, should not be interpreted in an idealized or overly formal sense.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form includes the plural form unless otherwise specified. As used herein, 'comprises' and/or its various conjugations do not exclude the presence or addition of one or more components, steps, and/or operations other than those stated. The term 'and/or' refers to each of the listed items or various combinations thereof.

Meanwhile, the term 'unit' used throughout this specification may mean a unit that processes at least one function or operation. For example, it may mean software or a hardware component such as an FPGA or an ASIC. However, a 'unit' is not limited to software or hardware. A 'unit' may be configured to reside in an addressable storage medium or may be configured to execute on one or more processors. Thus, by way of example, a 'unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided by the components and 'units' may be combined into a smaller number of components and 'units' or further separated into additional components and 'units'.

A news summary generation system according to an embodiment of the present invention calculates the sentence similarity between the sentences of combined news, generated by combining seed news with the duplicate news for the seed news, and the titles of the seed news and the duplicate news; determines an importance ranking of the sentences of the combined news based on the sentence similarity; and generates a summary of the combined news based on the determined importance ranking. Accordingly, a summary of the combined news can be generated even when no query is input by the user. In addition, according to an embodiment of the present invention, the summary can be composed of sentences that sufficiently express the content of the combined news, so the user can find the news he or she needs more quickly and accurately.

FIG. 1 is a block diagram of a news providing system according to an embodiment of the present invention. Referring to FIG. 1, the news providing system (100) according to an embodiment of the present invention includes a seed news database (101), a search engine (102), a target news database (103), a duplicate candidate news detection unit (110), a duplicate news detection unit (120), a duplicate news database (130), a duplicate news combining unit (140), a sentence similarity calculation unit (150), a sentence weight calculation unit (160), a summary generation unit (170), and a news recommendation unit (180). The following first describes the components that detect duplicate news (duplication news) for seed news and generate combined news by integrating the seed news with the detected duplicate news, and then describes the components that generate a summary of the combined news.

The seed news database (101) stores seed news. Seed news may mean news that is compared with target news in order to detect duplicate news among the target news. Seed news may include, for example, headline news provided by the news providing subsystems (104, 105, 106). As one example, the seed news database (101) may store the headline news that any one news providing subsystem provides for each category. The categories may include, for example, "politics", "economy", "society", "culture", "entertainment", or "sports".

The target news database (103) stores target news. The duplicate candidate news detection unit (110) and the duplicate news detection unit (120) detect, as duplicate news, target news whose content overlaps with the seed news. Target news may include, for example, news that the search engine (102) retrieved using a search term including a word appearing in the title of the seed news. For example, target news may be news retrieved by the search engine (102) using, as search terms, the words appearing in the titles of the headline news collected for each category.

FIG. 2 is a diagram showing an example of seed news and target news corresponding to duplicate news. Referring to FIG. 2, seed news and target news may contain the same content, in which case the target news needs to be detected as duplicate news. For example, in FIG. 2, the seed news on the left omits the interview content, while the target news on the right includes the interview content (21) without omission, making it an example of near-duplicate news. Even when the seed news and the target news describe the same content, if they express it differently, the target news may fail to be detected as duplicate news, or detecting it as duplicate news may take a long time.

To detect duplicate news among the target news quickly and accurately, the news providing system (100) according to an embodiment of the present invention includes a duplicate candidate news detection unit (110) and a duplicate news detection unit (120). For example, seed news includes a title and contents, and target news includes a text title and contents. When target news is retrieved by a search engine, the anchor title of the target news appears in the search engine's result list. The title of target news thus includes a text title and an anchor title.

The duplicate candidate news detection unit (110) first detects duplicate candidate news among the target news at high speed, based on the similarity between the titles of the seed news and the target news (anchor title and text title). The duplicate news detection unit (120) then accurately detects duplicate news among the duplicate candidate news, based on the similarity between the contents of the seed news and the target news.
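The two-stage filtering described above (a cheap title comparison followed by a more careful content comparison) could be sketched as follows. This passage does not specify the similarity measure or the thresholds, so the Jaccard coefficient over token sets, the threshold values, the dictionary keys, and the function names are all assumptions made for illustration, and the content stage here compares whole token sets rather than individual sentences.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections (assumed metric)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_duplicates(seed, targets, title_thr=0.5, content_thr=0.6):
    """Return the targets judged to be duplicates of the seed article.

    Each article is a dict of token lists. Seed keys: "title", "content".
    Target keys: "text_title", "anchor_title", "content" (hypothetical names).
    """
    # Stage 1: fast title filter using both the text title and the anchor title.
    candidates = [
        t for t in targets
        if max(jaccard(seed["title"], t["text_title"]),
               jaccard(seed["title"], t["anchor_title"])) >= title_thr
    ]
    # Stage 2: slower content comparison, run only on the surviving candidates.
    return [t for t in candidates
            if jaccard(seed["content"], t["content"]) >= content_thr]
```

The point of the split is that the expensive content comparison runs only on articles that already passed the title filter.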

The duplicate candidate news detection unit 110 extracts a title from the seed news and from each target news article, calculates the similarity between the extracted seed news title and the title of each target news article, and detects duplicate candidate news among the target news based on the calculated title similarity. FIG. 3 is a block diagram of the duplicate candidate news detection unit of the news providing system according to an embodiment of the present invention. Referring to FIG. 3, the duplicate candidate news detection unit 110 includes a preprocessing unit 111 and a duplicate candidate news determination unit 112. The preprocessing unit 111 extracts titles from the seed news and the target news, removes stop words from the words in the extracted titles, and converts the remaining words to their stems. In an embodiment of the present invention, the duplicate candidate news detection unit 110 may detect duplicate candidate news by comparing each of the text title and the anchor title of the target news with the title of the seed news; in this case, the preprocessing unit 111 may preprocess the title of the seed news and the text title and anchor title of the target news.

In an embodiment, the preprocessing unit 111 may extract titles from the seed news and the target news using HTML tag information such as <title></title>. The preprocessing unit 111 removes stop words from the words appearing in the extracted title and performs stemming. Stop words are words that occur frequently but carry no meaning on their own, such as the articles 'the', 'a', 'an' and the prepositions 'to', 'of', 'in', 'into', 'for'. Stemming identifies morphological variants of a keyword and converts several words with the same meaning into a single word. For example, English words consist of a stem, which carries the meaning, and a suffix, which marks the form the word takes; the preprocessing unit 111 converts words to their meaning-bearing stems. For example, stemming 'description', 'descriptive', and 'descriptor' reduces all three to the same word, 'descript'.
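The preprocessing steps described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the stop-word list (taken from the examples in the text), and the toy suffix list are assumptions, and a real system would likely use an established stemming algorithm.

```python
import re

# Illustrative stop-word list taken from the examples in the text.
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "into", "for"}

# Hypothetical suffix list for a toy stemmer (an assumption for this sketch).
SUFFIXES = ("ion", "ive", "or")


def extract_title(html):
    """Return the text between <title> and </title>, or an empty string."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_title(html):
    """Extract the title, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", extract_title(html).lower())
    return [toy_stem(w) for w in words if w not in STOP_WORDS]
```

With this sketch, 'description', 'descriptive', and 'descriptor' all map to 'descript', matching the example in the text.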

The duplicate candidate news determination unit 112 calculates the similarity between the title of the seed news and the title of each target news article and, based on the calculated title similarity, determines whether each target news article is duplicate candidate news. As mentioned above, the title of the target news may consist of two parts: the anchor title that appears in the search engine's list and the text title that appears on the detail page. The duplicate candidate news determination unit 112 may extract duplicate candidate news from the target news based on the similarity between the seed news title and the anchor title of the target news and the similarity between the seed news title and the text title of the target news. In an embodiment, the duplicate candidate news determination unit 112 obtains the title similarity using a vector model and a modified Dice's coefficient, described below, and classifies target news whose title similarity is equal to or greater than a preset threshold as duplicate candidate news, that is, news that may turn out to be duplicate news.

FIG. 4 is a block diagram of the duplicate candidate news determination unit of the news providing system according to an embodiment of the present invention. Referring to FIG. 4, the duplicate candidate news determination unit 112 includes a first similarity calculation unit 1121, a second similarity calculation unit 1122, a similarity determination unit 1123, and a threshold comparison unit 1124. The first similarity calculation unit 1121 calculates a first similarity between the title of the seed news and the anchor title of the target news that appears in the search engine's result list. The second similarity calculation unit 1122 calculates a second similarity between the title of the seed news and the title that appears in the contents of the target news. The similarity determination unit 1123 takes the larger of the first similarity and the second similarity as the title similarity. The threshold comparison unit 1124 compares the title similarity with a predetermined threshold to detect duplicate candidate news.

This is described in more detail below. The words that the preprocessing unit 111 extracts from the seed news and the target news, with stop words removed and stemming applied, can be represented as the sets in Equation 1 below.

[Equation 1]

$$ST_i = \{st_{i,1}, st_{i,2}, \ldots, st_{i,k}\}, \quad CT_j = \{ct_{j,1}, ct_{j,2}, \ldots, ct_{j,m}\}, \quad AT_j = \{at_{j,1}, at_{j,2}, \ldots, at_{j,n}\}$$

Here, $ST_i$ denotes the set of words appearing in the title of the i-th seed news, $CT_j$ denotes the set of words appearing in the text title of the j-th collected target news, and $AT_j$ denotes the set of words appearing in the anchor title of the j-th collected target news. $st_{i,k}$ denotes the k-th word in the title of the i-th seed news, $ct_{j,m}$ denotes the m-th word in the text title of the j-th collected target news, and $at_{j,n}$ denotes the n-th word in the anchor title of the j-th collected target news.

The duplicate candidate news determination unit 112 may calculate the weight of a word appearing in the seed news title by an operation that is proportional to the frequency of the word in the seed news title and inversely proportional to the frequency of the word across all titles of the seed news and the target news. Likewise, it may calculate the weight of a word appearing in a target news title by an operation that is proportional to the frequency of the word in that title and inversely proportional to the frequency of the word across all titles of the seed news and the target news. For example, the duplicate candidate news determination unit 112 may calculate word weights using tf-itf (term frequency-inverse title frequency), as in Equation 2 below.

[Equation 2]

$$\omega_t = f_{\omega,t} \times \log\frac{|T|}{n_\omega}$$

Here, $\omega_t$ denotes the weight of a word appearing in a title t of the seed news or the target news (anchor title or content title), $f_{\omega,t}$ denotes the normalized frequency of the word in the title, $\log(|T|/n_\omega)$ denotes the inverse title frequency (itf) of the word, $|T|$ denotes the total number of titles of the seed news and the target news, and $n_\omega$ denotes the number of those titles in which the word appears. The inverse title frequency lowers the weight of words that occur commonly in titles. The normalized frequency $f_{\omega,t}$ of a word in title t in Equation 2 can be defined as in Equation 3 below.

[Equation 3]

$$f_{\omega,t} = \frac{freq_{\omega,t}}{\max_l freq_{l,t}}$$

In Equation 3, $freq_{\omega,t}$ denotes the frequency of the word in title t, and $\max_l freq_{l,t}$ denotes the maximum of the frequencies of the words appearing in title t.
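The tf-itf weighting can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `tf_itf_weights` is hypothetical, titles are assumed to be already preprocessed token lists, and the inverse title frequency is assumed to take the usual logarithmic form, which the surviving text does not confirm.

```python
import math
from collections import Counter


def tf_itf_weights(titles):
    """Return one {word: weight} dict per title.

    titles: list of token lists (already stop-word filtered and stemmed).
    """
    total_titles = len(titles)
    # Number of titles in which each word appears at least once.
    titles_containing = Counter(word for title in titles for word in set(title))

    weights = []
    for title in titles:
        freq = Counter(title)
        max_freq = max(freq.values(), default=1)
        weights.append({
            # Normalized frequency (Equation 3) times the itf factor.
            # The log form of itf is an assumption of this sketch.
            word: (count / max_freq) * math.log(total_titles / titles_containing[word])
            for word, count in freq.items()
        })
    return weights
```

Under this assumed form, a word that occurs in every title receives a weight of zero, which reflects the stated role of the inverse title frequency in down-weighting common title words.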

The first similarity calculation unit 1121 calculates the first similarity between the seed news title and the anchor title of the target news using the weights of the words in the seed news title and the weights of the words in the anchor title. In an embodiment, the first similarity calculation unit 1121 may calculate the first similarity by dividing the sum of the weights of the words contained in both the seed news title and the anchor title by the sum of the weights of the words in the seed news title plus the weights of the words in the anchor title. By applying these word weights to a modified Dice's coefficient, the first similarity calculation unit 1121 can calculate a first similarity that reflects the importance of each word. For example, the first similarity calculation unit 1121 may calculate the first similarity from the word weights of the seed news title and the anchor title using the modified Dice's coefficient in Equation 4 below.

[Equation 4]

$$sim(ST_i, AT_j) = \frac{\sum_{p=1}^{c} \left( st_{i,p} + at_{j,p} \right)}{\sum_{k=1}^{a} st_{i,k} + \sum_{n=1}^{b} at_{j,n}}$$

In Equation 4, $sim(ST_i, AT_j)$ denotes the first similarity between the set $ST_i$ of words in the title of the i-th seed news and the set $AT_j$ of words in the anchor title of the j-th collected target news. p denotes a word that appears in both the seed news title and the anchor title, and c denotes the number of such common words. $st_{i,p}$ denotes the weight of the word in the seed news title, and $at_{j,p}$ denotes the weight of the word in the anchor title. a denotes the number of words in the seed news title, and b denotes the number of words in the anchor title of the target news.

The second similarity calculation unit 1122 calculates the second similarity between the seed news title and the text title of the target news using the weights of the words in the seed news title and the weights of the words in the text title. In an embodiment, the second similarity calculation unit 1122 may calculate the second similarity by dividing the sum of the weights of the words contained in both the seed news title and the text title of the target news contents by the sum of the weights of the words in the seed news title plus the weights of the words in the text title.

By applying the weights of the words in the seed news title and the text title of the target news to a modified Dice's coefficient, the second similarity calculation unit 1122 can calculate a second similarity that reflects the importance of each word. For example, the second similarity calculation unit 1122 may calculate the second similarity from the word weights of the seed news title and the text title using the modified Dice's coefficient in Equation 5 below.

[Equation 5]

$$sim(ST_i, CT_j) = \frac{\sum_{p=1}^{c} \left( st_{i,p} + ct_{j,p} \right)}{\sum_{k=1}^{a} st_{i,k} + \sum_{m=1}^{b} ct_{j,m}}$$

In Equation 5, $sim(ST_i, CT_j)$ denotes the second similarity between the set $ST_i$ of words in the title of the i-th seed news and the set $CT_j$ of words in the text title of the j-th collected target news. p denotes a word that appears in both the seed news title and the text title, and c denotes the number of such common words. $st_{i,k}$ denotes the weight of a word in the seed news title, and $ct_{j,m}$ denotes the weight of a word in the text title of the target news. a denotes the number of words in the seed news title, and b denotes the number of words in the text title of the target news.

The similarity determination unit 1123 determines, for example according to Equation 6 below, the maximum of the first similarity $sim(ST_i, AT_j)$ and the second similarity $sim(ST_i, CT_j)$, and takes it as the similarity $sim(T_i, T_j)$ between the title of the seed news and the title of the target news.

[Equation 6]

$$sim(T_i, T_j) = \max\left( sim(ST_i, AT_j),\ sim(ST_i, CT_j) \right)$$

For example, suppose the words in the title of seed news A are a, b, and c, the words in the anchor title of target news B are a, c, and d, and the words in the content title of target news B are b, c, and d, and suppose Equation 2 yields weights of 0.1, 0.2, 0.3, and 0.2 for the words a, b, c, and d, respectively. The words common to the title of seed news A and the anchor title of target news B are a and c, so the numerator of Equation 4 is (0.1+0.3)+(0.1+0.3) = 0.8. The weights of the seed title words sum to 0.1+0.2+0.3 = 0.6 and the weights of the anchor title words sum to 0.1+0.3+0.2 = 0.6, so the denominator is 1.2. The first similarity is therefore 0.8/1.2 = 0.67.

The words common to the title of seed news A and the content title of target news B are b and c, so the numerator of Equation 5 is (0.2+0.3)+(0.2+0.3) = 1. The weights of the seed title words sum to 0.1+0.2+0.3 = 0.6 and the weights of the content title words sum to 0.2+0.3+0.2 = 0.7, so the denominator is 1.3. The second similarity is therefore 1/1.3 = 0.77.

According to Equation 6 above, the similarity determination unit 1123 takes 0.77, the larger of the first and second similarities, as the similarity between the title of the seed news and the title of the target news. The threshold comparison unit 1124 compares the title similarity determined by the similarity determination unit 1123 with a preset threshold α: if the similarity exceeds α, the target news is classified as duplicate candidate news; otherwise it is excluded from the duplicate candidate news. Through this process, the duplicate candidate news detection unit 110 performs a first-stage clustering in which duplicate candidate news is separated out from the target news.
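The worked example above can be reproduced with a short Python sketch. The function names and the threshold value used in the usage note are hypothetical, chosen only for illustration; the word weights are the ones given in the example.

```python
def weighted_dice(weights_x, weights_y):
    """Modified Dice coefficient over two {word: weight} dicts."""
    common = weights_x.keys() & weights_y.keys()
    numerator = sum(weights_x[w] + weights_y[w] for w in common)
    denominator = sum(weights_x.values()) + sum(weights_y.values())
    return numerator / denominator if denominator else 0.0


def title_similarity(seed, anchor, text):
    """Larger of the anchor-title and text-title similarities."""
    return max(weighted_dice(seed, anchor), weighted_dice(seed, text))


def is_duplicate_candidate(seed, anchor, text, alpha):
    """True when the title similarity exceeds the threshold alpha."""
    return title_similarity(seed, anchor, text) > alpha


# Word weights from the worked example.
WEIGHTS = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.2}
seed_title = {w: WEIGHTS[w] for w in "abc"}
anchor_title = {w: WEIGHTS[w] for w in "acd"}
text_title = {w: WEIGHTS[w] for w in "bcd"}

first = weighted_dice(seed_title, anchor_title)
second = weighted_dice(seed_title, text_title)
print(round(first, 2), round(second, 2))  # prints "0.67 0.77"
```

With an assumed threshold of α = 0.7, `is_duplicate_candidate(seed_title, anchor_title, text_title, 0.7)` returns `True`, because the larger similarity (about 0.77) exceeds the threshold.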

Referring again to FIG. 1, the duplicate news detection unit 120 analyzes the content of the classified duplicate candidate news and compares it with the content of the seed news to make the final determination of whether it is duplicate news. In an embodiment, the duplicate news detection unit 120 extracts the contents from the seed news and the duplicate candidate news, calculates the similarity between the sentences in the seed news contents and the sentences in the duplicate candidate news contents, and detects duplicate news among the duplicate candidate news based on the calculated sentence similarity.

The duplicate news detection unit 120 can calculate the similarity between the content of the seed news and that of the duplicate candidate news through a semantic, sentence-level approach rather than over the text as a whole. When sentence similarity is calculated in the conventional way, a high similarity can result even for sentences with opposite meanings. For example, of the three sentences below, the first and third have opposite meanings, yet a conventional sentence similarity calculation can rate them as highly similar because they share many words.

① I bought a computer from a computer shop in Yongsan.

② The computer was purchased from a computer shop in Yongsan.

③ The computer was sold to a computer shop in Yongsan.

Accordingly, the embodiment of the present invention distinguishes the index terms from the verbs among the words of a sentence in order to calculate sentence similarity accurately.
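The problem with purely word-based comparison can be illustrated with a small Python sketch. It uses a plain, unweighted Dice coefficient over word sets, which is a stand-in chosen for this illustration and not the similarity measure of the embodiment; the function names are hypothetical.

```python
import re


def word_set(sentence):
    """Lowercase the sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def dice(x, y):
    """Plain (unweighted) Dice coefficient between two word sets."""
    return 2 * len(x & y) / (len(x) + len(y))


s1 = "I bought a computer from a computer shop in Yongsan."
s2 = "The computer was purchased from a computer shop in Yongsan."
s3 = "The computer was sold to a computer shop in Yongsan."

print(round(dice(word_set(s1), word_set(s3)), 2))  # prints 0.59
print(round(dice(word_set(s2), word_set(s3)), 2))  # prints 0.78
```

Sentences ② and ③ differ only in "purchased from" versus "sold to", so a word-overlap score rates them as close even though they state opposite facts, which is why the verbs are handled separately.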

FIG. 5 is a block diagram of the duplicate news detection unit of the news providing system according to an embodiment of the present invention. Referring to FIG. 5, the duplicate news detection unit 120 includes a duplicate news detection preprocessing unit 121 and a duplicate news determination unit 122. The duplicate news detection preprocessing unit 121 extracts the contents from the seed news and the duplicate candidate news, parses the extracted contents, and detects sentences in the contents.

FIG. 6 is a block diagram of the duplicate news detection preprocessing unit of the news providing system according to an embodiment of the present invention. Referring to FIG. 6, the duplicate news detection preprocessing unit 121 includes a tag removal unit 1211, a parsing unit 1212, a sentence detection unit 1213, a verb extraction unit 1214, and a stop word removal and stemming unit 1215. The tag removal unit 1211 removes the tags from the seed news and the duplicate candidate news and extracts their contents. For example, the tag removal unit 1211 removes the HTML tags that appear in the web pages of the seed news and the duplicate candidate news, along with other unnecessary tags. The parsing unit 1212 parses the contents. For example, the parsing unit 1212 can use a morphological analyzer such as the Stanford Parser to parse the tag-free web page news and recognize nouns and verbs. The morphological analyzer breaks words down into their constituent morphemes and reports each word's part of speech, such as noun, adjective, or verb.

The sentence detection unit 1213 detects sentences in the contents according to the parsing result of the parsing unit 1212. For example, the sentence detection unit 1213 may recognize a span as a sentence when the analyzer output contains the noun tags /NNP and /NN and the verb tags /VBZ and /VBN. The sentence detection unit 1213 can separate sentences using the positions of periods, and can identify the body of the news by locating the region in which sentence-like spans occur consecutively.
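A minimal Python sketch of this sentence detection step is shown below. It assumes the analyzer output is available as whitespace-separated `word/TAG` tokens; the function names are hypothetical, and the tag sets are limited to the tags named in the text.

```python
# Tag sets limited to the tags named in the text.
NOUN_TAGS = {"NNP", "NN"}
VERB_TAGS = {"VBZ", "VBN"}


def parse_tagged(text):
    """Split well-formed 'word/TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]


def detect_sentences(tagged_tokens):
    """Split on period tokens and keep spans that contain a noun and a verb.

    Tokens after the last period are ignored in this sketch.
    """
    sentences, current = [], []
    for word, tag in tagged_tokens:
        current.append((word, tag))
        if tag == ".":
            tags = {t for _, t in current}
            if tags & NOUN_TAGS and tags & VERB_TAGS:
                sentences.append(current)
            current = []
    return sentences
```

A span such as `Photo/NN credit/NN ./.` contains a noun but no verb, so this sketch would not treat it as a sentence of the news body.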

The verb extraction unit 1214 extracts verbs from the detected sentences. For example, in each separated sentence the verb extraction unit 1214 may extract the words that the morphological analyzer has tagged in the /VBX form. The extracted verbs become candidates for the meaning-bearing verb of the sentence. The meaning-bearing verb can then be selected using, for example, the word dependencies provided by the Stanford Parser. The stop word removal and stemming unit 1215 takes the words of each separated sentence other than the verbs, removes stop words, which carry no meaning on their own, and stems the remaining words.

For example, if the contents of the seed news or the target news include the sentence "Steve Jobs has succeeded as a businessman.", POS tagging this sentence with the morphological analyzer yields "Steve/NNP Jobs/NNP has/VBZ succeeded/VBN as/IN a/DT businessman/NN ./.". Here, /NNP denotes a proper noun (in this sentence, the noun related to the verb), /VBZ and /VBN denote verbs, /IN denotes a preposition, /DT denotes an article, and /NN denotes a noun.

For example, the verb extraction unit 1214 extracts the verb-bearing tag (VP (VBZ has) (VP (VBN succeeded))) from the tagged information as the verb candidates, and selects the meaningful verb among the candidates using the word dependencies provided by the Stanford Parser. The word dependencies for this sentence are nn(Jobs-2, Steve-1), nsubj(succeeded-4, Jobs-2), aux(succeeded-4, has-3), det(businessman-7, a-6), and prep_as(succeeded-4, businessman-7). nn denotes a noun, nsubj denotes the action performed by the noun, aux relates a main verb to its auxiliary verb, det relates a noun to its article, and prep_as denotes the relation between two words linked by the preposition "as" (a collapsed dependency). The aux dependency shows that succeeded is the main verb and has is an auxiliary verb. Since the two candidate verbs are succeeded and has, succeeded is selected as the meaningful verb of the sentence.
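The verb selection described above can be sketched in Python as follows. The representation of dependencies as `(relation, head, dependent)` tuples and the function name `select_main_verbs` are assumptions for illustration; the dependency values are those listed in the text.

```python
def select_main_verbs(candidates, dependencies):
    """Drop candidate verbs that occur as the dependent of an aux relation."""
    auxiliaries = {dep for rel, _head, dep in dependencies if rel == "aux"}
    return [verb for verb in candidates if verb not in auxiliaries]


# Dependencies for "Steve Jobs has succeeded as a businessman."
deps = [
    ("nn", "Jobs-2", "Steve-1"),
    ("nsubj", "succeeded-4", "Jobs-2"),
    ("aux", "succeeded-4", "has-3"),
    ("det", "businessman-7", "a-6"),
    ("prep_as", "succeeded-4", "businessman-7"),
]
candidates = ["has-3", "succeeded-4"]

print(select_main_verbs(candidates, deps))  # prints ['succeeded-4']
```

Because `has-3` is the dependent of the aux relation, only `succeeded-4` remains as the meaningful verb.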

Referring again to FIG. 5, the duplicate news determination unit 122 calculates the similarity between the sentences in the seed news contents and the sentences in the duplicate candidate news contents and, based on the calculated sentence similarity, determines whether the duplicate candidate news is duplicate news. FIG. 7 is a block diagram of the duplicate news determination unit of the news providing system according to an embodiment of the present invention. Referring to FIG. 7, the duplicate news determination unit 122 includes a word similarity calculation unit 1221, a verb similarity calculation unit 1222, a sentence similarity determination unit 1223, a news relation similarity calculation unit 1224, and a duplicate news detection threshold comparison unit 1225.

The word similarity calculation unit 1221 calculates the similarity between the words of the sentences in the seed news contents and the words of the sentences in the duplicate candidate news contents. It does so using the stemmed words that remain in the sentences of the seed news and the duplicate candidate news after verbs and stop words are excluded. In an embodiment, the word similarity calculation unit 1221 may calculate the weight of a seed news word by an operation that is proportional to the frequency of the non-verb words in the sentences of the seed news contents and inversely proportional to the frequency of the non-verb words across all sentences of the contents of the seed news and the duplicate candidate news. Likewise, it may calculate the weight of a duplicate candidate news word by an operation that is proportional to the frequency of the non-verb words in the sentences of the duplicate candidate news contents and inversely proportional to the frequency of the non-verb words across all sentences of the contents of the seed news and the duplicate candidate news.

In an embodiment, the word similarity calculation unit 1221 may calculate the word similarity by dividing the sum of the weights of the non-verb words contained in both the seed news sentence and the duplicate candidate news sentence by the sum of the weights of the non-verb words in the seed news sentence plus the weights of the non-verb words in the duplicate candidate news sentence. The word similarity calculation unit 1221 may calculate the word similarity using, for example, a vector model and a modified Dice's coefficient.

For example, the word similarity calculation unit 1221 may assign weights to the terms used in a sentence using tf-isf (term frequency-inverse sentence frequency) in the vector model. tf-isf assigns each word a weight that is proportional to the number of times the word appears in the sentence and inversely proportional to the total number of sentences containing the word, thereby giving the word an importance. This is described in more detail below. First, the sets of words appearing in the sentences of the seed news and the duplicate candidate news can be expressed, for example, as in Equation 7.

[수식 7][Equation 7]

이때, SC_i는 i번째 시드 뉴스의 컨텐츠(본문)에 나타나는 문장들의 집합을 나타내고, TC_j는 j번째 중복 후보 뉴스의 컨텐츠(본문)에 나타나는 문장들의 집합을 나타내고, sc_i _,k는 i번째 시드 뉴스의 컨텐츠 중 k번째 문장에 나타나는 단어들의 집합을 나타내고,tc_j _,l은 j번째 중복 후보 뉴스 중 l번째 문장에 나타나는 단어들의 집합을 나타내고, t_i _,k,n은 i번째 시드 뉴스의 컨텐츠 중 k번째 문장에 나타나는 n번째 단어를 나타내고, tt_j _,l,m은 j번째 중복 후보 뉴스의 컨텐츠 중 l번째 문장에 나타나는 m번째 단어를 나타낸다. 만약, 시드 뉴스의 컨텐츠의 일 부분만을 차용한 중복 후보 뉴스의 경우 중복 뉴스로 검출되지 않을 수 있으므로, 시드 뉴스와 중복 후보 뉴스 중 문장의 수가 작은 문서를 대상으로 비교 문장의 수가 일치될 수 있다.SC _i represents a set of sentences appearing in the content (body) of the i-th seed news, TC _j represents a set of sentences appearing in the content (body) of the j-th duplicated candidate news, sc _i _, _{T ij} _{, l} represents a set of words appearing in the lth sentence of the jth duplicate candidate news, t _i _{, k, n} represent a set of words appearing in the i th seed news indicates the n-th word appears in the k-th sentence of the content, tt _{_j, l, m} represents the m-th word appears in the l-th sentence of the contents of the j-th redundancy candidate news. If duplicate candidate news that borrowed only a part of the contents of the seed news may not be detected as duplicate news, the number of comparison sentences can be matched to a document having a small number of sentences among seed news and duplicate candidate news.

The word similarity calculation unit 1221 may calculate the weights of the words appearing in the sentences of the seed news and the duplicate candidate news using, for example, Equation 8 below.

[Equation 8]

The terms of Equation 8 denote, in order: the number of times the word appears in the sentence; the inverse sentence frequency (isf); the number of sentences in the body of the news item (the content of the seed news or the duplicate candidate news) that contain the word ω; and the total number of sentences appearing in that news item.
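The weighting described for Equation 8 can be sketched as follows. This is a minimal illustration, not the patent's exact formula: it assumes the common form tf × log(N / n), where N is the total number of sentences and n is the number of sentences containing the word. The function name `tf_isf_weights` and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def tf_isf_weights(sentences):
    """Return one {word: weight} dict per sentence.

    sentences: list of token lists for a single news body, with verbs and
    stop words already removed and tokens already stemmed.
    Assumed form (not confirmed by the source): tf * log(N / n).
    """
    total = len(sentences)  # total number of sentences in the news body
    # Number of sentences that contain each word at least once.
    sentence_freq = Counter(w for s in sentences for w in set(s))
    weights = []
    for sent in sentences:
        tf = Counter(sent)  # occurrences of each word within this sentence
        weights.append({w: tf[w] * math.log(total / sentence_freq[w]) for w in tf})
    return weights

# Hypothetical three-sentence news body.
body = [["fire", "factory", "fire"], ["factory", "worker"], ["worker", "rescue"]]
weights = tf_isf_weights(body)
```

Under this assumed form, a word that occurs in every sentence receives a weight of zero, which matches the intent that widely shared words carry little distinguishing information.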

Using the weights of the words appearing in the sentences of the seed news content and the weights of the words appearing in the sentences of the duplicate candidate news content, the word similarity calculation unit 1221 may calculate the similarity between the words of a seed news sentence and the words of a duplicate candidate news sentence. It may do so according to, for example, a modified Dice coefficient such as Equation 9 below.

[Equation 9]

The result of Equation 9 is the sentence similarity, computed from words, between a sentence of the seed news and a sentence of the duplicate candidate news. p denotes a word that appears in both the seed news sentence and the duplicate candidate news sentence, c denotes the number of such common words, t_{i,k,n} denotes the weight of the n-th word appearing in the k-th sentence of the body of the i-th seed news, and tt_{j,l,m} denotes the weight of the m-th word appearing in the l-th sentence of the j-th duplicate candidate news.

For example, suppose the words in the first sentence of the i-th seed news are a, b, and c, the words in the first sentence of the j-th duplicate candidate news are a, c, and d, and the words in its second sentence are b, c, and d, and suppose the weights of a, b, c, and d calculated with Equation 8 are 0.1, 0.2, 0.3, and 0.2, respectively. The words common to the first sentence of the i-th seed news and the first sentence of the j-th duplicate candidate news are a and c, so the numerator of Equation 9 is (0.1 + 0.3) + (0.1 + 0.3) = 0.8.

The words in the first sentence of the i-th seed news are a, b, and c, and the words in the first sentence of the j-th duplicate candidate news are a, c, and d. In Equation 9, the seed sentence term is therefore (0.1 + 0.2 + 0.3) = 0.6 and the candidate sentence term is (0.1 + 0.3 + 0.2) = 0.6, so the denominator, the sum of the two terms, is 1.2. Accordingly, the similarity between words, that is, the sentence similarity reflecting the word weights, between the first sentence of the i-th seed news and the first sentence of the j-th duplicate candidate news is 0.8 / 1.2 ≈ 0.67. Computed in the same way, the similarity between the first sentence of the i-th seed news and the second sentence of the j-th duplicate candidate news is approximately 0.77.
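The worked example can be reproduced with a short sketch. The function name `weighted_dice` is hypothetical; the computation follows the verbal description of the modified Dice coefficient (weights of the shared words from both sentences, divided by the total weights of both sentences), using the weights 0.1, 0.2, 0.3, and 0.2 from the example.

```python
def weighted_dice(seed_weights, cand_weights):
    """Modified Dice similarity between two sentences.

    Each argument maps the non-verb words of one sentence to their weights.
    """
    common = seed_weights.keys() & cand_weights.keys()
    # Shared words contribute their weight once from each sentence.
    numerator = sum(seed_weights[w] + cand_weights[w] for w in common)
    denominator = sum(seed_weights.values()) + sum(cand_weights.values())
    return numerator / denominator if denominator else 0.0

w = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.2}  # weights from the worked example
seed_1 = {k: w[k] for k in "abc"}  # first sentence of the seed news
cand_1 = {k: w[k] for k in "acd"}  # first sentence of the candidate news
cand_2 = {k: w[k] for k in "bcd"}  # second sentence of the candidate news

sim_11 = weighted_dice(seed_1, cand_1)  # 0.8 / 1.2
sim_12 = weighted_dice(seed_1, cand_2)  # 1.0 / 1.3
```

The second pairing scores higher because its shared words, b and c, carry more weight than a and c.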

The verb similarity calculation unit 1222 calculates the similarity between the verbs of the sentences in the content of the seed news and the verbs of the sentences in the content of the duplicate candidate news. It detects the lowest common subsumer, in the WordNet lexicon-based hierarchy, of the verb appearing in the seed news sentence and the verb appearing in the duplicate candidate news sentence; calculates probability values by an operation proportional to the number of synonym vocabulary items of the seed news verb, the duplicate candidate news verb, and the lowest common subsumer verb; and calculates the similarity between the verbs using those probability values. The verb similarity calculation unit 1222 may calculate the similarity between verbs using, for example, the synonym sets and hierarchical structure of WordNet.

Part of speech | Number of synonym sets | Number of top-level concepts | Maximum number of levels
Verb | 13,508 | 14 | 4
Adjective | 18,563 | - | -
Adverb | 3,664 | - | -
Total | 115,424 | |

Table 1 shows the number and structure of the synonym sets for each part of speech in WordNet 2.0. Verbs comprise 13,508 synonym sets, and the number of levels from the top node to the lowest synonym set does not exceed four. Adjectives and adverbs have a satellite structure rather than the hierarchical structure of nouns and verbs; they are not classified by a hierarchy as nouns and verbs are, and carry only attribute-defining concepts. The semantic word similarity method is therefore applied to verbs and not to adjectives or adverbs.

The verb similarity calculation unit 1222 may calculate the similarity between a verb appearing in a sentence of the i-th seed news and a verb appearing in a sentence of the j-th duplicate candidate news according to, for example, Equations 10 to 13 below.

[Equation 10]

Equation 10 defines the synonym set of the main verb extracted from the k-th sentence. The verb similarity calculation unit 1222 may calculate the frequency of the main verb by counting the vocabulary items related to that verb in its synonym set.

[Equation 11]

Equation 11 gives the probability value for the main verb. It also uses the frequency of each root in the hierarchy from the top node down to the lowest synonym set. Verbs have the 15 top-level roots listed in Table 2 below.

Bodily care and function | Change | Cognition | Communication
Social interaction | Competition | Consumption | Contact
Weather verbs | Creation | Emotion | Motion
State verbs | Perception | Possession | -

[Equation 12]

In Equation 12, IC stands for information content. The IC of a verb is obtained by taking the negative logarithm of the verb's probability value. It is used to extract the most information content that the verb can carry within the WordNet lexicon-based hierarchy.
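The relationship between the probability value and the information content can be illustrated with a small sketch. It assumes, by analogy with the noun case described later in the text, that the probability is the word's frequency divided by the frequency of its root; the function name and the counts are hypothetical.

```python
import math

def information_content(word_freq, root_freq):
    """IC = -log(P), with P assumed to be word_freq / root_freq."""
    p = word_freq / root_freq
    return -math.log(p)

# Hypothetical counts: a general verb covers far more of its root's
# vocabulary than a specific one, so it carries less information.
ic_general = information_content(word_freq=400, root_freq=1000)
ic_specific = information_content(word_freq=5, root_freq=1000)
```

A verb whose frequency equals that of its root has probability 1 and therefore an IC of zero.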

[Equation 13]

Equation 13 gives the similarity between the main verb appearing in the k-th sentence of the i-th seed news and the main verb appearing in the l-th sentence of the j-th duplicate candidate news. The similarity between the verbs is computed from the IC value of the lowest common subsumer (LCS), the lowest superordinate verb that subsumes both verbs, and the IC values of the two verbs. The resulting value is the highest of the semantic similarities obtained within the collection of related verb vocabulary.
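The exact form of Equation 13 is not recoverable from the text. As one possible reading, the sketch below assumes a Lin-style ratio, 2 × IC(LCS) / (IC(a) + IC(b)), which combines the same three quantities the description names; this choice is an assumption, not the patent's confirmed formula. The function names and IC values are hypothetical.

```python
def lin_style_similarity(ic_a, ic_b, ic_lcs):
    """Assumed form: 2 * IC(lcs) / (IC(a) + IC(b)); 0.0 if both ICs are zero."""
    denom = ic_a + ic_b
    return 2.0 * ic_lcs / denom if denom else 0.0

def best_verb_similarity(candidates):
    """Keep the highest score over all related verb pairings.

    candidates: iterable of (ic_a, ic_b, ic_lcs) triples, one per pairing.
    """
    return max((lin_style_similarity(*c) for c in candidates), default=0.0)

# Hypothetical IC values for two pairings of the same verb pair.
pairings = [(4.0, 5.0, 3.0), (4.0, 6.0, 1.0)]
score = best_verb_similarity(pairings)
```

Taking the maximum over pairings mirrors the statement that the final value is the highest similarity found among the related verb vocabulary.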

The sentence similarity determination unit 1223 multiplies the similarity between words by the similarity between verbs, as shown for example in Equation 14 below, and takes as the similarity between sentences the largest of the products obtained between each sentence of the seed news and all sentences of the duplicate candidate news.

[Equation 14]

The result of Equation 14 is the similarity between sentences of the seed news and the duplicate candidate news. It is formed from the sentence similarity calculated using the verbs of the seed news and the duplicate candidate news (the similarity between verbs) and the sentence similarity calculated using their non-verb words (the similarity between words). L denotes the set of all sentences appearing in the duplicate candidate news. A duplicate candidate news sentence that has already been paired with a seed news sentence may be excluded when the similarity to the next seed news sentence is calculated. For example, if the verb in the first sentence of the i-th seed news is as similar to the verb in the first sentence of the j-th duplicate candidate news as it is to the verb in the second sentence, the seed sentence is paired with the second sentence, which has the highest similarity, and that sentence is excluded from subsequent comparisons. In this way the sentence similarity determination unit 1223 calculates the similarity for every news sentence that can be paired.
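The pairing procedure can be sketched as a greedy match over two similarity matrices. This is an illustrative reading of the description, with hypothetical names and values: each seed sentence takes the unused candidate sentence with the largest product of word similarity and verb similarity, and a matched candidate sentence is not reused.

```python
def match_sentences(word_sim, verb_sim):
    """Greedily pair each seed sentence with its best unused candidate sentence.

    word_sim[k][l] and verb_sim[k][l] hold the two similarities between
    seed sentence k and candidate sentence l. Returns (k, l, score) triples.
    """
    used = set()
    pairs = []
    for k, (w_row, v_row) in enumerate(zip(word_sim, verb_sim)):
        # Product of the two similarities for every still-unmatched candidate.
        scores = {l: w * v
                  for l, (w, v) in enumerate(zip(w_row, v_row))
                  if l not in used}
        if not scores:  # no unmatched candidate sentences remain
            break
        best = max(scores, key=scores.get)
        used.add(best)
        pairs.append((k, best, scores[best]))
    return pairs

# Hypothetical similarities for two seed sentences and two candidate sentences.
word_sim = [[0.66, 0.77], [0.90, 0.10]]
verb_sim = [[1.0, 1.0], [0.5, 0.8]]
pairs = match_sentences(word_sim, verb_sim)
```

In this example the first seed sentence claims the second candidate sentence, so the second seed sentence can only be paired with the first.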

The news relation similarity calculation unit 1224 calculates the news relation similarity by dividing the sum of the sentence similarities calculated for all sentences of the seed news by the smaller of the number of sentences in the seed news and the number of sentences in the duplicate candidate news. As shown for example in Equation 15 below, it may calculate the news relation similarity by dividing the sum of the sentence similarities by the number of sentences summed.

[Equation 15]

The result of Equation 15 is the similarity between the content of the i-th seed news and the content of the j-th duplicate candidate news. b is the number of sentences in the seed news, L is the number of sentences in the duplicate candidate news, and the divisor is the smaller of the two. The summed term is the total of the sentence similarities obtained with Equation 14 for the sentences of the seed news. The result is the average similarity between each sentence of the seed news and the corresponding sentence of the duplicate candidate news, and it represents the similarity between the seed news and the duplicate candidate news.

The duplicate news detection threshold comparison unit 1225 detects duplicate news by comparing the news relation similarity with a predetermined threshold. If the calculated news relation similarity exceeds the duplicate news detection threshold, the duplicate candidate news is classified as duplicate news; otherwise it is excluded from the duplicate news. The duplicate news detection unit 120 thereby performs a second clustering that separates the duplicate news from the duplicate candidate news.
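The averaging step and the threshold test can be sketched together. The function names are hypothetical and the threshold of 0.5 is an arbitrary example value; the text does not state the threshold actually used.

```python
def news_relation_similarity(pair_scores, n_seed, n_candidate):
    """Sum of matched sentence similarities divided by min(n_seed, n_candidate)."""
    shorter = min(n_seed, n_candidate)
    return sum(pair_scores) / shorter if shorter else 0.0

def is_duplicate(similarity, threshold):
    """A candidate counts as duplicate news only when it exceeds the threshold."""
    return similarity > threshold

# Two matched sentence pairs between a 2-sentence seed and a 3-sentence candidate.
sim = news_relation_similarity([0.77, 0.45], n_seed=2, n_candidate=3)
duplicate = is_duplicate(sim, threshold=0.5)  # 0.5 is an arbitrary example
```

Dividing by the shorter document's sentence count keeps a candidate that borrows only part of the seed news from being penalized for the sentences it does not cover.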

Referring again to FIG. 1, the duplicate news database 130 stores, by category, the duplicate news detected by the duplicate candidate news detection unit 110 and the duplicate news detection unit 120.

The duplicate news combining unit 140 generates combined news by merging the seed news with the duplicate news detected for it. The duplicate news combining unit 140 includes a similarity calculation unit 141 and a combined news generation unit 142. The similarity calculation unit 141 calculates the similarity between each sentence of the seed news and a sentence of the duplicate news, and determines the maximum similarity, that is, the largest of the calculated similarities.

In one embodiment, the similarity calculation unit 141 may calculate the maximum similarity for each sentence of the duplicate news. In one embodiment, the similarity calculation unit 141 extracts verbs from the seed news sentences and the duplicate news sentences, and calculates the similarity between the verb of a seed news sentence and the verb of a duplicate news sentence to obtain the similarity between the two sentences. For example, the similarity calculation unit 141 may detect the lowest common subsumer, in the WordNet lexicon-based hierarchy, of the verb appearing in the seed news sentence and the verb appearing in the duplicate news sentence; calculate probability values by an operation proportional to the number of synonym vocabulary items of the seed news verb, the duplicate news verb, and the lowest common subsumer verb; and calculate the similarity using those probability values.

The combined news generation unit 142 compares the calculated maximum similarity with a preset threshold and, if the maximum similarity is below the threshold, adds the duplicate news sentence to the seed news to generate the combined news. In one embodiment, the combined news generation unit 142 inserts the duplicate news sentence between the seed news sentence whose similarity to it is highest and the sentence that follows. The sentences of the combined news are thus arranged in a contextually natural order.

The combined news generation unit 142 adds to the seed news those duplicate news sentences whose maximum similarity is below the threshold, and discards, without adding them, those whose maximum similarity is at or above the threshold. In one embodiment, for every duplicate news item in the duplicate news group for a seed news item, the news providing system adds the duplicate news sentences whose maximum similarity is below the threshold to the seed news, merging the seed news and all duplicate news in the group into a single document to generate the combined news. Because the combined news contains all the additionally updated information as well as the information of the seed news, no information is lost, and a news searcher can efficiently obtain the needed news information from the single merged document.
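The merging rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the threshold is an arbitrary example, and a simple Jaccard word overlap stands in for the verb-based similarity that the text describes.

```python
def combine_news(seed, duplicate, similarity, threshold):
    """Insert the novel sentences of `duplicate` into a copy of `seed`.

    A duplicate sentence is kept only when its maximum similarity to the
    seed sentences is below `threshold`; it is placed right after the seed
    sentence it resembles most. Sentences at or above the threshold are dropped.
    """
    if not seed:
        return list(duplicate)
    slots = [[] for _ in seed]  # slots[i]: sentences to insert after seed[i]
    for sent in duplicate:
        scores = [similarity(s, sent) for s in seed]
        best = max(range(len(seed)), key=scores.__getitem__)
        if scores[best] < threshold:
            slots[best].append(sent)
    combined = []
    for s, extra in zip(seed, slots):
        combined.append(s)
        combined.extend(extra)
    return combined

def jaccard(a, b):
    """Stand-in similarity on lowercase word sets (not the patent's measure)."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0

seed = ["A fire broke out at the factory", "Two workers were rescued"]
dup = ["A fire broke out at the factory",
       "Two workers were rescued and one was hospitalized"]
combined = combine_news(seed, dup, jaccard, threshold=0.6)
```

Here the first duplicate sentence repeats the seed verbatim and is dropped, while the second adds new information and is inserted after the seed sentence it most resembles.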

The sentence similarity calculation unit 150, the sentence weight calculation unit 160, and the summary generation unit 170 generate a summary of the combined news by calculating the sentence importance of the combined news sentences on the basis of the similarity between the titles of the duplicate news and the sentences of the combined news.

The sentence similarity calculation unit 150 calculates the sentence similarity between each sentence of the combined news and the title of the seed news, and between each sentence of the combined news and the title of each duplicate news item. In one embodiment, it extracts words from the seed news title, the duplicate news titles, and the combined news sentences, and calculates the sentence similarity from the similarity between the words extracted from the titles and the words extracted from the combined news sentences. For example, the sentence similarity calculation unit 150 may detect the lowest common subsumer, in the WordNet lexicon-based hierarchy, of a noun in the seed news title or a duplicate news title and a noun appearing in a combined news sentence; calculate word importance by an operation proportional to the number of synonym vocabulary items of the title noun, the sentence noun, and the lowest common subsumer noun; and calculate the sentence similarity of each combined news sentence as the average word importance of the nouns appearing in it.

Table 3 shows the number and structure of the noun synonym sets in WordNet. Nouns comprise about 80,000 synonym sets, and the hierarchy from the top node to the lowest synonym set does not exceed 12 levels.

Part of speech | Number of synonym sets | Number of top-level concepts | Maximum number of levels
Noun | 79,689 | 25 | 12

In one embodiment, the sentence similarity calculation unit 150 uses the composition and structure of WordNet's noun synonym sets to obtain, for the words extracted from the combined news sentences, their semantic similarity to the words appearing in the news titles (seed news and duplicate news). To measure the semantic distance between vocabulary items and obtain the word-to-word semantic similarity, the words appearing in the titles of the duplicate news may be defined as in Equation 16, for example.

[Equation 16]

DT_i: the set of duplicate news titles collected for the i-th seed news

dt_{i,k}: the title of the k-th duplicate news for the i-th seed news

dtn_{i,k,l}: the l-th word appearing in the title of the k-th duplicate news for the i-th seed news

Because the words appearing in a duplicate news title are not allowed to repeat, they satisfy Equation 17.

[Equation 17]

The sentences appearing in the content of the combined news, and the words appearing in those sentences, may be defined as in Equation 18, for example.

[Equation 18]

CC_j: the set of sentences appearing in the body of the j-th combined news

cs_{j,p}: the p-th sentence in the body of the j-th combined news

cst_{j,p,n}: the n-th word appearing in the p-th sentence in the body of the j-th combined news

Because the words appearing in the sentences of the detected duplicate news are not allowed to repeat, they satisfy Equation 19.

[Equation 19]

The semantic word similarity between the set of words appearing in the titles of the duplicate news detected for the i-th seed news and the set of words appearing in the k-th sentence of the combined news, which merges the duplicate news detected for the i-th seed news, can be obtained using Equations 20 to 23 below. Equations 20, 21, and 22 are expressed in generalized form for the nouns of a sentence.

[Equation 20]

Equation 20 defines the set of noun synonyms in the WordNet data for a given word. According to Equation 20, the sentence similarity calculation unit 150 calculates the frequency of the word by counting the vocabulary items related to the word in its synonym set.

[Equation 21]

In Equation 21, the divisor is the frequency of each root in the hierarchy from the top node down to the lowest synonym set. The probability value for a word is the value obtained from Equation 20 divided by this root frequency.

[Equation 22]

In Equation 22, TI stands for term information. The TI of a noun is the negative logarithm of the noun's probability value. It is used to extract the most term information that the noun can carry within the WordNet lexicon-based hierarchy.

[Equation 23]

The result of Equation 23 is the importance of the n-th word appearing in the p-th sentence of the content of the j-th combined news, which merges the detected duplicate news. The sentence similarity calculation unit 150 calculates the semantic similarity between that word and every word appearing in the title of the k-th duplicate news detected for the i-th seed news, and takes the largest value as the importance of the word in the combined news sentence.

Because each sentence is a set of words, the word importance calculated for each word can be used to obtain the semantic similarity between the words used in the titles of the seed news and the duplicate news and the words appearing in a combined news sentence. As in Equation 24 below, for example, the sentence similarity calculation unit 150 sums the importance of the words used in each sentence of the combined news and divides by the number of words summed (n) to obtain the average word similarity. This yields the sentence similarity, that is, the similarity between the combined news sentence and the titles of the duplicate news group (the seed news and the duplicate news).

[Equation 24]

The result of Equation 24 is the similarity (sentence similarity) between a combined news sentence and the words appearing in the titles of the seed news and the duplicate news. The summed term is the total of the word importances obtained with Equation 23, and n is the number of words used in the sentence. That is, as Equation 24 shows, the importance of a sentence is obtained by adding up the word importances and dividing by the number n of words appearing in the sentence.

Table 4 shows an example of the semantic similarity between nouns used in the titles of the seed news and the duplicate news and nouns used in the sentences of the combined news.

Title | Contents | Similarity
car | automobile | 1.0
gem | jewel | 1.0
sport | swimming | 0.8950
computer | keyboard | 0.5271
journey | flower | 0.0

Suppose, for example, that the nouns jewel and keyboard appear in one sentence of the content of the combined news. The similarity between each of the nouns jewel and keyboard and all the nouns used in the titles of the seed news and the duplicate news (car, gem, sport, computer, journey) is calculated according to Equations 20 to 23, and the importance of each word in the combined news sentence is set to the largest of the calculated similarities. Among the title words, the word jewel is most similar to gem, so the similarity value for jewel is 1. Among the title words, the word keyboard is most similar to computer, so the similarity value for keyboard is 0.5271.

In Equation 24, the summed term is 1.5271, the sum of the two word similarities (1 + 0.5271), and the number of words n is 2, so the similarity between the sentence and the titles (the sentence similarity) is approximately 0.764. Together with the sentence weight described below, this value is reflected in the final sentence importance of the sentence and is used to generate the summary.
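The jewel and keyboard example can be reproduced with a short sketch that takes the pairwise similarities directly from Table 4. The function names are hypothetical, and treating pairs that are absent from the table as 0.0 is an assumption made only so the lookup is total.

```python
# Pairwise noun similarities from Table 4, keyed by (title noun, content noun).
TABLE_4 = {
    ("car", "automobile"): 1.0,
    ("gem", "jewel"): 1.0,
    ("sport", "swimming"): 0.8950,
    ("computer", "keyboard"): 0.5271,
    ("journey", "flower"): 0.0,
}

def lookup(title_noun, noun):
    # Pairs missing from the table are treated as 0.0 (assumption for this sketch).
    return TABLE_4.get((title_noun, noun), 0.0)

def word_importance(noun, title_nouns, sim):
    """Largest similarity between `noun` and any title noun."""
    return max((sim(t, noun) for t in title_nouns), default=0.0)

def sentence_similarity(nouns, title_nouns, sim):
    """Mean word importance over the nouns of one combined news sentence."""
    if not nouns:
        return 0.0
    return sum(word_importance(n, title_nouns, sim) for n in nouns) / len(nouns)

titles = ["car", "gem", "sport", "computer", "journey"]
score = sentence_similarity(["jewel", "keyboard"], titles, lookup)
```

The score is the mean of the two word importances, 1.0 for jewel and 0.5271 for keyboard.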

The sentence weight calculation unit 160 calculates a sentence weight on the basis of the frequency with which words appear in a sentence of the combined news and the position of the sentence. FIG. 9 is a block diagram of the sentence weight calculation unit of a news providing system according to an embodiment of the present invention. Referring to FIG. 9, the sentence weight calculation unit 160 includes a first sentence weight calculation unit 161, a second sentence weight calculation unit 162, and an operation unit 163. The first sentence weight calculation unit 161 calculates a frequency-based first sentence weight by averaging the frequencies of the words appearing in a sentence of the combined news. The second sentence weight calculation unit 162 assigns a second sentence weight by sentence position in the combined news, so that earlier sentences receive higher weights. The operation unit 163 multiplies the first sentence weight by the second sentence weight to calculate the sentence weight. This is described in more detail below.

The first sentence weight calculator 161 may calculate the importance of a word from its frequency of occurrence using, for example, the tf-isf method according to Equation 25. Here, tf is obtained from the selected sentence of the combined news, and isf is obtained from all sentences of the combined news.

[Equation 25]

The first sentence weight calculator 161 calculates the first sentence weight by averaging the frequency-based word importance values, for example according to Equation 26 below.

[Equation 26]

In Equation 26, the averaged term denotes the weight of the k-th word appearing in the p-th sentence of the body of the j-th combined news, obtained according to Equation 25, and n denotes the number of all nouns appearing in the p-th sentence of the body of the j-th combined news.
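The averaging step can be sketched as follows. The images for Equations 25 and 26 are not reproduced in this text, so the tf-isf form used here (term frequency in the sentence multiplied by the logarithm of the inverse sentence frequency) is an assumption based on the common definition, not the patent's exact formula. The function names and the representation of a sentence as a list of nouns are hypothetical.

```python
import math
from collections import Counter


def tf_isf(word, sentence, all_sentences):
    """Assumed tf-isf weight: tf from one sentence, isf from all sentences."""
    tf = Counter(sentence)[word]
    # Number of sentences of the combined news that contain the word.
    sf = sum(1 for s in all_sentences if word in s)
    if tf == 0 or sf == 0:
        return 0.0
    return tf * math.log(len(all_sentences) / sf)


def first_sentence_weight(sentence, all_sentences):
    """Average of the word weights over the n nouns of one sentence."""
    if not sentence:
        return 0.0
    total = sum(tf_isf(w, sentence, all_sentences) for w in sentence)
    return total / len(sentence)
```

With this form, a noun that occurs in every sentence contributes a weight of zero, so sentences made up of rarer nouns score higher.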

To reflect the fact that sentences nearer the top of an article play a more important role in its summary, the second sentence weight calculator 162 calculates a position-based weight, for example according to Equation 27 below.

[Equation 27]

In Equation 27, the sentence symbol (written cs_{j,p} in Equations 28 and 29) denotes the p-th sentence of the combined news of the j-th detected duplicate news, N is the total number of sentences in the content of the combined news, and i is the position (ordinal number) of the sentence within the body of the combined news. The second sentence weight calculator 162 gives the highest weight to the top sentence of the combined news. For the first sentence of the combined news, i = 1 and the weight is 1. Each subsequent sentence receives a weight that decreases with its distance from the first sentence, and for the last sentence i = N, giving the lowest weight, 1/N.

The operation unit 163 obtains the sentence importance of each sentence of the combined news by multiplying the frequency-based importance of the sentence (the first sentence weight) calculated by the first sentence weight calculator 161 by the position-based importance of the sentence (the second sentence weight) calculated by the second sentence weight calculator 162, for example as in Equation 28 below.

[Equation 28]

In Equation 28, Cweight(cs_{j,p}) denotes the sentence importance (sentence weight) of the p-th sentence of the j-th combined news.
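The position weight and the product of the two weights can be sketched as follows. The images for Equations 27 and 28 are not reproduced in this text, so the linear form (N - i + 1) / N is an assumption. It was chosen only because it matches the stated endpoints (1 for i = 1, 1/N for i = N) and decreases with distance from the first sentence. The function names are hypothetical.

```python
def position_weight(i, n_sentences):
    """Assumed position weight: 1 for the first sentence, 1/N for the last."""
    if not 1 <= i <= n_sentences:
        raise ValueError("i must satisfy 1 <= i <= N")
    return (n_sentences - i + 1) / n_sentences


def sentence_weight(first_weight, i, n_sentences):
    """Cweight: frequency-based weight multiplied by position-based weight."""
    return first_weight * position_weight(i, n_sentences)
```

Because the two weights are multiplied, a sentence with a frequency-based weight of zero keeps a sentence weight of zero regardless of its position.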

The summary generator 170 ranks the sentences of the combined news by importance based on their sentence similarity and sentence weight, and generates a summary of the combined news based on that ranking. In one embodiment, the summary generator 170 may generate the summary by extracting a predetermined number of the highest-ranked sentences of the combined news. The summary generator 170 may determine the ranking by calculating the sentence importance, for example according to Equation 29 below.

[Equation 29]

In Equation 29, Sweight(cs_{j,p}) denotes the sentence importance of the p-th sentence of the j-th combined news; the coefficient is a predetermined constant; Cweight(cs_{j,p}) denotes the sentence weight of the p-th sentence of the j-th combined news calculated according to Equation 28; and Tweight(cs_{j,p}, DT_j) denotes the sentence similarity, calculated according to Equation 24, between the p-th sentence of the j-th combined news and the titles of the seed news and duplicate news of the j-th combined news. The constant may be set to a value of at least 0 and less than 1, for example 0.4. The sentences of the combined news are re-ranked according to the sentence importance values calculated by Equation 29.
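The re-ranking step can be sketched as follows. The image for Equation 29 is not reproduced in this text, so the way the constant combines the two scores is unknown. The sketch assumes a convex combination, alpha * Cweight + (1 - alpha) * Tweight, which is only one plausible reading of "a predetermined constant of at least 0 and less than 1". Which term the constant multiplies, and the names `alpha`, `sentence_importance`, and `rerank`, are assumptions.

```python
def sentence_importance(cweight, tweight, alpha=0.4):
    """Assumed Sweight: weighted mix of sentence weight and title similarity."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    return alpha * cweight + (1.0 - alpha) * tweight


def rerank(sentences, cweights, tweights, alpha=0.4):
    """Return the sentences ordered from most to least important."""
    scores = [sentence_importance(c, t, alpha)
              for c, t in zip(cweights, tweights)]
    order = sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True)
    return [sentences[k] for k in order]
```

Under this assumed form, the example value 0.4 gives slightly more influence to the title similarity than to the frequency and position weight.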

The news recommendation unit 180 may provide the combined news, integrated into a single document, together with its summary in the form of a web page. In one embodiment, the news recommendation unit 180 may mark the portions of the combined news where sentences of the duplicate news were added to the seed news, so that they can be distinguished from the sentences of the seed news, and provide the result as a web page. The marking may be done in various ways, for example by displaying the added portion in bold, shading the text background, changing the font size, switching to a typeface such as italics, changing the font color, or inserting symbols such as "<", ">", "/", "「", and "」".

FIG. 10 is a flowchart of a news providing method according to an embodiment of the present invention. The steps of the embodiment shown in FIG. 10 may be performed by the components of the embodiments shown in FIGS. 1 to 9. Steps S101 and S102 of FIG. 10 correspond to the first clustering stage, and steps S103 and S104 correspond to the second clustering stage. First, in step S101, the preprocessor 111 of the duplicate candidate news detector 110 performs preprocessing: it extracts titles from the seed news and the target news, removes from the extracted titles stop words that carry no meaning on their own, and converts the remaining words to their root forms. Next, in step S102, the duplicate candidate news determination unit 112 of the duplicate candidate news detector 110 calculates the similarity between the title of the seed news and the titles of the target news, and extracts duplicate candidate news from the target news based on the calculated title similarity.

Step S102 of extracting duplicate candidate news is described in more detail as follows. First, in step S1021, the first similarity calculator 1121 and the second similarity calculator 1122 calculate a first similarity and a second similarity, for example according to Equations 2 to 5 above. The first similarity is the similarity between the title of the seed news and the anchor title of the target news as it appears in the search engine's result list, and the second similarity is the similarity between the title of the seed news and the title that appears in the content of the target news. In step S1022, the similarity determination unit 1123 takes the larger of the first and second similarities as the title similarity, for example according to Equation 6 above. Next, in step S1023, the threshold comparison unit 1124 detects duplicate candidate news by comparing the title similarity with a predetermined threshold.
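Steps S1022 and S1023 can be sketched as follows, assuming the two title similarities have already been computed by Equations 2 to 5 (not shown here). The function names and the tuple layout are hypothetical. The default threshold of 0.3 is the value reported for the first clustering stage in the experiments, and the strict comparison is an assumption.

```python
def title_similarity(anchor_sim, text_sim):
    """Step S1022: keep the larger of the two title similarities."""
    return max(anchor_sim, text_sim)


def detect_candidates(targets, threshold=0.3):
    """Step S1023: keep target news whose title similarity exceeds the threshold.

    targets is a list of (news_id, anchor_sim, text_sim) tuples.
    """
    return [news_id
            for news_id, anchor_sim, text_sim in targets
            if title_similarity(anchor_sim, text_sim) > threshold]
```

Taking the maximum means a target article is kept as a candidate if either its anchor title or its on-page title resembles the seed title.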

Next, in step S103, the preprocessor 121 of the duplicate news detector 120 performs preprocessing: it extracts the content from the seed news and the duplicate candidate news, parses the extracted content, and detects the sentences in the content. Next, in step S104, the duplicate news determination unit 122 of the duplicate news detector 120 calculates the similarity between the sentences in the content of the seed news and the sentences in the content of the duplicate candidate news, and determines from the calculated sentence similarity whether the duplicate candidate news is duplicate news.

Step S104 of determining duplicate news is described in more detail as follows. First, in step S1041, the word similarity calculator 1221 calculates the similarity between the words of the sentences in the content of the seed news and the words of the sentences in the content of the duplicate candidate news, for example according to Equations 8 and 9 above, and the verb similarity calculator 1222 calculates the similarity between the verbs of the sentences in the content of the seed news and the verbs of the sentences in the content of the duplicate candidate news, for example according to Equations 10 to 13 above. Next, in step S1042, the sentence similarity determination unit 1223 multiplies the word similarity by the verb similarity, for example according to Equation 14 above, and, for each sentence of the seed news, takes the largest of the products obtained against each sentence of the duplicate candidate news as the sentence similarity. Next, in step S1043, the news relation similarity calculator 1224 calculates the news relation similarity, for example according to Equation 15 above, by summing the sentence similarities calculated for all sentences of the seed news and dividing the sum by the smaller of the number of sentences in the seed news and the number of sentences in the duplicate candidate news. Next, in step S1044, the threshold comparison unit 1225 detects duplicate news by comparing the news relation similarity with a predetermined threshold.
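Steps S1042 to S1044 can be sketched as follows, assuming the per-pair products of word similarity and verb similarity (Equation 14) are already available as a matrix. The function names and the matrix layout are hypothetical. The default threshold of 0.7 is the value reported for the second stage in the experiments, and the strict comparison is an assumption.

```python
def news_relation_similarity(sim_matrix):
    """sim_matrix[a][b] is the word-times-verb similarity between seed
    sentence a and candidate sentence b."""
    n_seed = len(sim_matrix)
    n_cand = len(sim_matrix[0]) if sim_matrix else 0
    if n_seed == 0 or n_cand == 0:
        return 0.0
    # Step S1042: best match for each seed sentence.
    best_per_seed = [max(row) for row in sim_matrix]
    # Step S1043: normalise by the shorter article's sentence count.
    return sum(best_per_seed) / min(n_seed, n_cand)


def is_duplicate(sim_matrix, threshold=0.7):
    """Step S1044: compare the news relation similarity with the threshold."""
    return news_relation_similarity(sim_matrix) > threshold
```

Dividing by the smaller sentence count, as the text describes, means the score can exceed 1 when the seed article has more sentences than the candidate.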

Next, in step S105, the duplicate news combiner 140 generates combined news by merging the seed news and the duplicate news. First, the similarity calculator 141 calculates the similarity between each sentence of the seed news and a sentence of the duplicate news, and determines the maximum similarity, that is, the largest of the calculated similarities. In one embodiment, the similarity calculator 141 may detect the lowest common superordinate verb in the WordNet glossary-based hierarchy that subsumes both the verb appearing in the seed news sentence and the verb appearing in the duplicate news sentence, calculate probability values by an operation proportional to the number of synonym terms of the seed news verb, the duplicate news verb, and the lowest common superordinate verb, and use the calculated probability values to calculate the similarity between the verbs. The similarity calculator 141 may calculate the similarity between the verbs in the seed news sentence and the verbs in the duplicate news sentence, for example according to Equations 10 to 13 above. Whereas Equations 10 to 13 were described for sentences of the seed news and of the duplicate candidate news, step S105 calculates the similarity with sentences of the duplicate news instead of the duplicate candidate news. The similarity calculator 141 finds the seed news sentence that is most similar to the duplicate news sentence and determines its ordinal position within the seed news.

Next, the combined news generator 142 compares the calculated maximum similarity with a threshold. If the maximum similarity is below the threshold, it merges the duplicate news sentence into the seed news; if the maximum similarity is at or above the threshold, it discards the duplicate news sentence without merging it. In this way the seed news and the duplicate news are integrated into a single document, the combined news. In one embodiment, the combined news generator 142 may insert the duplicate news sentence between the seed news sentence with which it has the maximum similarity and the sentence that follows it. The combined news generator 142 may merge into the seed news all sentences of the duplicate news whose maximum similarity to the seed news is below the threshold.
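The merging rule can be sketched as follows. The sketch takes the sentence similarity as a caller-supplied function, because the patent's verb-based measure (Equations 10 to 13) is not reproduced here. The function name, its parameters, and the handling of an empty seed article are assumptions.

```python
def combine_news(seed, duplicate, similarity, threshold):
    """Merge sentences of a duplicate article into the seed article.

    seed and duplicate are lists of sentences; similarity(a, b) returns a float.
    """
    if not seed:
        return list(duplicate)  # assumption: nothing to compare against
    insert_after = {}  # seed index -> duplicate sentences placed after it
    for dup_sentence in duplicate:
        sims = [similarity(s, dup_sentence) for s in seed]
        best = max(range(len(seed)), key=sims.__getitem__)
        # A low maximum similarity means the sentence carries new
        # information; otherwise it is redundant and is dropped.
        if sims[best] < threshold:
            insert_after.setdefault(best, []).append(dup_sentence)
    combined = []
    for idx, seed_sentence in enumerate(seed):
        combined.append(seed_sentence)
        combined.extend(insert_after.get(idx, []))
    return combined
```

Each retained sentence is placed directly after its closest seed sentence, so added material appears next to the part of the article it relates to.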

Next, in step S106, the sentence similarity calculator 150 calculates the sentence similarity between each sentence of the combined news and the titles of the seed news and duplicate news, and the sentence weight calculator 160 calculates a sentence weight based on the word frequencies and the ordinal position of each sentence of the combined news. In one embodiment, the sentence similarity calculator 150 may extract words from the title of the seed news, the titles of the duplicate news, and the sentences of the combined news, and calculate the sentence similarity from the similarity between the words extracted from the titles and the words of the combined news sentence. For example, the sentence similarity calculator 150 detects the lowest common superordinate noun in the WordNet glossary-based hierarchy that subsumes both a noun from the titles of the seed news and duplicate news and a noun appearing in the combined news sentence; calculates a word importance by an operation proportional to the number of synonym terms of the title noun, the sentence noun, and the lowest common superordinate noun; and calculates the sentence similarity as the average of the word importance values of the nouns appearing in the combined news sentence.

Also in step S106, the sentence weight calculator 160 calculates the sentence weight based on the frequency with which words occur in each sentence of the combined news and on the position of the sentence. In one embodiment, the sentence weight calculator 160 may calculate a frequency-based first sentence weight by averaging the frequencies of the words appearing in the sentence, assign a position-based second sentence weight so that sentences nearer the top of the combined news receive higher weights, and multiply the first sentence weight by the second sentence weight to obtain the sentence weight.

Next, in step S107, the summary generator 170 ranks the sentences of the combined news by importance based on their sentence similarity and sentence importance (sentence weight), and generates a summary of the combined news based on that ranking. In one embodiment, the summary generator 170 may generate the summary by extracting a predetermined number of the highest-ranked sentences of the combined news.

To choose the number of sentences in the summary, a summary ratio is determined; for example, Equation 30 below may be used.

[Equation 30]

In Equation 30, the sentence-count term denotes the total number of sentences of the i-th combined news, and R denotes the summary ratio. The number of sentences in the summary may be determined as the product of the total number of sentences of the combined news and the summary ratio, rounded to the nearest integer. The summary generator 170 generates the summary by selecting that many sentences from the combined news according to their importance.
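The selection step can be sketched as follows. The image for Equation 30 is not reproduced in this text, so the sketch follows the prose description only (sentence count multiplied by the summary ratio, then rounded). Rounding halves upward and presenting the selected sentences in their original order are assumptions, and the function names are hypothetical.

```python
import math


def summary_length(n_sentences, ratio):
    """Number of summary sentences: N times R, rounded half up."""
    return int(math.floor(n_sentences * ratio + 0.5))


def make_summary(sentences, scores, ratio):
    """Pick the highest-scoring sentences and keep their original order."""
    k = summary_length(len(sentences), ratio)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

For example, a combined article of 10 sentences with a summary ratio of 0.3 yields a three-sentence summary.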

Next, in step S108, the news recommendation unit 180 provides the combined news, integrated into a single document, together with its summary in the form of a web page. In one embodiment, the news recommendation unit 180 may mark the portions of the combined news where sentences of the duplicate news were added to the seed news so that they are distinguishable from the sentences of the seed news, and provide the combined news as a web page.

The news summary generation method according to an embodiment of the present invention can be written as a computer-executable program and implemented on a general-purpose digital computer that runs the program from a computer-readable recording medium. The computer-readable recording medium may include storage media such as magnetic storage media (for example, ROM, floppy disks, and hard disks) and optically readable media (for example, CD-ROMs and DVDs).

Ten headline news items were extracted from each of six sections (police, business, crime, education, health, and web-exclusive) of a news site. Each extracted item was separated into title and content, and the title of each headline news item was submitted as a query to the news section of the Google search engine. From the query results, the top 20 or fewer news articles were collected for each headline news item. Each collected article was stored as three parts: the title shown in the result list (anchor title), the title shown on the detail page (text title), and the body (content).

The similarity between the title of the seed news and the anchor title of the target news, and the similarity between the title of the seed news and the text title of the target news, were calculated according to Equations 2 to 5 above. From these two similarities, the title similarity between the seed news and the target news was calculated according to Equation 6 above. The resulting title similarity was compared with a threshold: target news above the threshold were extracted as duplicate candidate news, and those below it were excluded from the candidates. The selected duplicate candidate news were then judged to be duplicate news articles or not based on the similarity of their content, with news similarity measured according to Equations 8 to 15 above.

Table 5 below shows the number of headline news items (seed news) and collected news items (target news) gathered over six weeks, excluding Sundays, in the six selected sections of the news site.

[Table 5]

| | Police | Business | Crime | Education | Health | Web-Exclusive |
|---|---|---|---|---|---|---|
| Headline news | 340 | 274 | 215 | 47 | 25 | 38 |
| Target news | 6,800 | 5,480 | 4,300 | 940 | 500 | 760 |
| Average number of duplicates | 16.4 | 13.5 | 2.8 | 2 | 3.8 | 2 |

In Table 5, the headline news row gives the number of groups for which duplicate news must be detected in each section, which equals the number of clusters in which duplicate news must be found. For example, the number of headline news items (seed news) collected in the "Police" section, that is, the number of groups to be detected, is 340. The number of clusters differs across the six sections because it is related to how popular the news articles are: in sections of high public interest there is not only more duplicate news, but the news is also updated daily. Such articles also tend to carry more content for the reader.

In the "Police" section, about 82% of the news is duplicate news, so 5,576 of the 6,800 collected news items (target news) must be detected as duplicate news. That is, each of the 340 clusters must find 16.4 duplicate news items on average. In the "Business" section, there are 274 groups to detect and about 67% of the news is duplicate news, so 3,699 of the 5,480 collected target news items must be detected as duplicate news, an average of 13.5 per group. In the "Crime" section, there are 215 groups to detect and about 14% of the news is duplicate news, so 602 of the 4,300 collected news items must be detected, an average of 2.8 per group. In the "Education" section, there are 47 groups to detect and about 11% of the news is duplicate news, so 94 of the 940 collected news items must be detected, an average of 2 per group. In the "Health" section, there are 25 groups to detect and about 20% of the news is duplicate news, so 95 of the 500 collected news items must be detected, an average of 3.8 per group. In the "Web-exclusive" section, there are 38 groups to detect and about 10.5% of the news is duplicate news, so 76 of the 760 collected news items must be detected, an average of 2 per group.

The number of news items collected for the headline news differs across the six sections because it is related to the popularity of the news. In the "Police" section, more than 90% of the news articles change each day, and about 82% of the collected news is duplicate, the highest share of similar articles. In the "Business" section, too, many similar articles are written, with an update rate of 76% and a duplicate rate of 67.5%. By contrast, the "Crime" section has a high update rate but a low share of duplicate news, and the remaining sections, "Education", "Health", and "Web-exclusive", have update and duplicate rates of about 10%. To account for these differing characteristics, the experiments used different detection ratios for the first-stage duplicate candidate news and the second-stage duplicate news. In the experiments, news was detected as duplicate in the second stage when its content similarity exceeded the threshold of 0.7.

[Table 6]

| | Police | Business | Crime | Education | Health | Web-Exclusive |
|---|---|---|---|---|---|---|
| Number of clusters | 340 | 274 | 215 | 47 | 25 | 38 |
| Total number of news items | 6,800 | 5,480 | 4,300 | 940 | 500 | 760 |
| Number of duplicate news items | 5,576 | 3,699 | 602 | 94 | 95 | 76 |
| Total number of news items compared | 2,312,000 | 1,501,520 | 924,500 | 44,180 | 12,500 | 28,880 |
| Number of news items retrieved | 45,627 | 32,672 | 19,216 | 1,268 | 238 | 809 |
| Number of relevant news items retrieved | 5,513 | 3,616 | 573 | 89 | 91 | 71 |
| Percentage of relevant news items retrieved | 98.9% | 97.7% | 95.1% | 94.7% | 95.8% | 93.4% |
| Reduction in computation | 98.0% | 97.8% | 97.9% | 97.1% | 98.1% | 97.2% |

Table 6 shows the results of the first clustering stage when title similarities greater than 0.3 were accepted as candidates. In the first-stage results of Table 6, the "Police" section has the most clusters because all 10 of its headline news articles are replaced by new articles every day, and the "Business" and "Crime" sections have the next most because about 7 or 8 of their 10 headline articles are replaced. Unlike "Business", however, "Crime" has relatively few news articles available, so it has fewer clusters than "Business". In the "Education", "Health", and "Web-exclusive" sections, about 1 in 10 articles is replaced by a new article. This indicates that news in sections such as "Police", "Business", and "Crime" attracts a great deal of public attention, so new articles are produced daily, whereas news in the "Education", "Health", and "Web-exclusive" sections attracts relatively little public attention and fewer articles are provided.
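The derived rows of Table 6 can be reproduced from its count rows with the short sketch below. The text does not state the formulas, so they are inferred from the table values: the number of comparisons as clusters times total news items, the relevant percentage as relevant retrieved over duplicates, and the reduction as one minus retrieved over comparisons. The function name and dictionary keys are hypothetical.

```python
def first_stage_stats(clusters, total_news, duplicates, retrieved, relevant):
    """Recompute the derived rows of Table 6 from its count rows."""
    compared = clusters * total_news  # comparisons without first-stage filtering
    return {
        "compared": compared,
        "relevant_pct": round(100 * relevant / duplicates, 1),
        "reduction_pct": round(100 * (1 - retrieved / compared), 1),
    }


# "Police" column of Table 6.
police = first_stage_stats(340, 6800, 5576, 45627, 5513)
```

For the "Police" column this gives 2,312,000 comparisons, 98.9% of relevant news retrieved, and a 98.0% reduction in computation, matching the table.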

Table 7 shows the number of duplicate news candidates after the primary clustering; the number of news articles to be examined per cluster drops markedly. Taking "Police" as a representative example, before the primary clustering all 6,800 collected news articles must be compared to detect the duplicate news of a single cluster, whereas after the primary clustering only 134 news articles need to be compared.

| | Police | Business | Crime | Education | Health | Web-Exclusive |
|---|---|---|---|---|---|---|
| Number of clusters | 340 | 274 | 215 | 47 | 25 | 38 |
| Number of candidate news articles per cluster | 6,800 | 5,480 | 4,300 | 940 | 500 | 760 |
| Average number of candidate news articles after clustering | 134 | 119 | 89 | 27 | 10 | 21 |
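The saving described above can be checked arithmetically. The following is a minimal sketch, assuming that the "total number of news articles to compare" is the number of clusters multiplied by the number of collected articles and that the reduction is measured against the number of articles actually retrieved; the function and variable names are illustrative and do not come from the patent.

```python
# Figures copied from Tables 6 and 7; all names are illustrative.
FIELDS = ["Police", "Business", "Crime", "Education", "Health", "Web-Exclusive"]
CLUSTERS = [340, 274, 215, 47, 25, 38]
TOTAL_NEWS = [6800, 5480, 4300, 940, 500, 760]
RETRIEVED = [45627, 32672, 19216, 1268, 238, 809]


def clustering_savings(clusters: int, total_news: int, retrieved: int):
    """Return (exhaustive comparisons, average candidates per cluster, reduction ratio)."""
    exhaustive = clusters * total_news      # every cluster against every collected article
    avg_candidates = retrieved / clusters   # candidates left per cluster after clustering
    reduction = 1 - retrieved / exhaustive  # share of comparisons avoided
    return exhaustive, avg_candidates, reduction


for name, c, n, r in zip(FIELDS, CLUSTERS, TOTAL_NEWS, RETRIEVED):
    exhaustive, avg, reduction = clustering_savings(c, n, r)
    print(f"{name:<14} {exhaustive:>9,} comparisons -> {avg:6.1f} per cluster, {reduction:.1%} saved")
```

Under these assumptions the "Police" row gives 2,312,000 exhaustive comparisons and about 134 candidates per cluster, in line with the tabulated values.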

Table 8 shows the results of detecting duplicate news when similarity values are computed for the duplicate candidate news obtained from the primary clustering and the similarity threshold is set to a value greater than 0.7.

| | Police | Business | Crime | Education | Health | Web-Exclusive |
|---|---|---|---|---|---|---|
| Number of clusters | 340 | 274 | 215 | 47 | 25 | 38 |
| Average number of candidate news articles | 134 | 119 | 89 | 27 | 10 | 21 |
| Average number of duplicates to detect | 16.4 | 13.5 | 2.8 | 2 | 3.8 | 2 |
| Total number of duplicates | 5,576 | 3,699 | 602 | 94 | 95 | 76 |
| Average number of duplicates detected | 16.1 | 13.5 | 3.0 | 2.1 | 3.8 | 2.1 |
| Total number of items detected | 5,479 | 3,687 | 646 | 98 | 96 | 79 |
| Average number of relevant duplicates retrieved | 15.8 | 12.7 | 2.6 | 1.8 | 3.6 | 1.8 |
| Total number of relevant duplicates retrieved | 5,373 | 3,480 | 561 | 86 | 91 | 70 |

As a result, performance was higher when a cluster had a larger average number of duplicate news items and more news had been collected, and lower when both the average number of duplicate news items and the amount of collected news were small. Overall, the duplicate news detection performance was satisfactory, exceeding 90% in all six fields.
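The description does not say which measure the "more than 90%" figure refers to. The sketch below assumes it is recall, that is, the relevant duplicates retrieved divided by the total number of duplicates in Table 8; precision is computed alongside for comparison. The names are hypothetical.

```python
# Totals copied from Table 8; the interpretation as precision/recall is an assumption.
FIELDS = ["Police", "Business", "Crime", "Education", "Health", "Web-Exclusive"]
TOTAL_DUPLICATES = [5576, 3699, 602, 94, 95, 76]   # duplicates that should be found
DETECTED = [5479, 3687, 646, 98, 96, 79]           # items the detector returned
RELEVANT_DETECTED = [5373, 3480, 561, 86, 91, 70]  # returned items that are true duplicates


def precision_recall(relevant_detected: int, detected: int, total_duplicates: int):
    """Return (precision, recall) for one news field."""
    precision = relevant_detected / detected
    recall = relevant_detected / total_duplicates
    return precision, recall


RESULTS = {
    name: precision_recall(rel, det, tot)
    for name, rel, det, tot in zip(FIELDS, RELEVANT_DETECTED, DETECTED, TOTAL_DUPLICATES)
}

for name, (precision, recall) in RESULTS.items():
    print(f"{name:<14} precision={precision:.1%} recall={recall:.1%}")
```

With this reading, recall stays above 90% in each of the six fields, which agrees with the statement in the text.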

FIG. 11 illustrates combined news generated by the news providing method according to an embodiment of the present invention. In FIG. 11, the left part shows the content of the seed news, and the right part shows the content of the combined news obtained by combining the duplicate news with the seed news. The seed news had 55 sentences and the combined news had 60 sentences. The five added sentences do not appear in the seed news; they are inserted between the sentences of the seed news and appear in the combined news. Three of the five added sentences (S1, S2, S3) are shown in bold in the right part of FIG. 11.

| | Police | Business | Crime | Education | Health | Web-Exclusive |
|---|---|---|---|---|---|---|
| Number of clusters | 340 | 274 | 215 | 47 | 25 | 38 |
| Average number of news articles combined | 15.8 | 12.7 | 2.6 | 1.8 | 3.6 | 1.8 |
| Average number of sentences per news article | 25.6 | 10.6 | 9.6 | 9.1 | 15.6 | 5.7 |
| Average number of sentences per combined news article | 28.3 | 12.1 | 10.3 | 9.3 | 16.4 | 6.1 |

Table 9 shows how much the average number of sentences per news article increased in each field when the duplicate news was combined. After the algorithm was run, the combined news had more sentences than the news before combination. As shown in Table 9, the average number of sentences increased from 25.6 to 28.3 in the "Police" field, from 10.6 to 12.1 in "Business", from 9.6 to 10.3 in "Crime", from 9.1 to 9.3 in "Education", from 15.6 to 16.4 in "Health", and from 5.7 to 6.1 in "Web-Exclusive".

FIG. 12 is a graph showing the rate of increase in the number of sentences of the combined news, relative to the seed news, generated by the news providing method according to an embodiment of the present invention. Overall, combining the seed news with the news detected as duplicates increases the number of sentences by about 5% to 14%, which means that the detected duplicate news differs from the seed news by about 5% to 14% of its sentences. According to an embodiment of the present invention, a loss of about 5% to 14% of the news information can therefore be prevented. On the other hand, as the amount of news text grows, the time a user spends finding a desired news article may increase. A summary was therefore generated and provided for the combined news so that users can find news articles more quickly.

FIG. 12 shows an example of combined news. The combined news shown in FIG. 12 consists of one title and 12 sentences; the title is the title of the seed news, and the content is the set of sentences (sentences 1 to 12) appearing in the news. The sentence importance of each sentence was calculated according to Equations 20 to 29 described above. The sentences were ranked in descending order of importance, and each rank is shown in brackets "[]" to the right of its sentence.

The number of sentences in the summary (snippet) is determined from the summary ratio according to Equation 30 described above. Since the combined news shown in FIG. 12 has 12 sentences, the number of sentences in the summary for each summary ratio is as shown in Table 10 below.

| Summary ratio | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| Number of sentences in the snippet | 1 | 2 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 12 |
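Equation 30 is not reproduced in this part of the description, so the following is only a sketch under an assumption: rounding the product of the sentence count and the summary ratio to the nearest integer (halves rounded up) reproduces every entry of Table 10. The function name is hypothetical, and integer arithmetic is used to avoid floating-point rounding surprises.

```python
def snippet_sentence_count(total_sentences: int, ratio_percent: int) -> int:
    """Number of snippet sentences for a summary ratio given in whole percent.

    Assumption: round(total_sentences * ratio) to the nearest integer,
    with halves rounded up. This is inferred from Table 10, not quoted
    from Equation 30.
    """
    return (total_sentences * ratio_percent + 50) // 100


# Recompute the Table 10 row for a combined news item of 12 sentences.
table_10 = [snippet_sentence_count(12, p) for p in range(10, 101, 10)]
print(table_10)
```

The same rule also matches the 14-sentence example given later in the description, where a 10% ratio yields 1 sentence and a 20% ratio yields 3.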

FIG. 13 shows an example of a summary (snippet) generated by the news providing method according to an embodiment of the present invention. When the combined news has 12 sentences, the summary has 1 sentence at a summary ratio of 10%, 2 sentences at 20%, and 4 sentences at 30%.

Next, the accuracy of the summaries of the combined news was evaluated. Users read the combined news, selected the sentences they considered important, and ranked them. Once the number of summary sentences is determined, the sentences are selected and listed in descending order of importance. Recall was obtained by matching the summary sentences selected and listed by the users against the summary sentences generated by the news providing system according to the embodiment of the present invention. The recall between the system-generated summary and the user-selected summary was calculated using Equation 31 below. Precision was not calculated: because the total number of extracted sentences equals the total number of relevant sentences, precision and recall take the same value.

[Equation 31]

A comparison of sentence importance in the summaries generated by the news providing system according to the embodiment of the present invention and in the summaries selected by users showed that the important sentence was often the first sentence of the combined news content. Table 11 below is an example showing the values calculated at each step and the final sentence importance value.

| DNO | SNO | Position | tf-isf | Term similarity | Score |
|---|---|---|---|---|---|
| D1 | 1 | 1 | 0.3976 | 0.1994 | 0.2789 |
| D1 | 2 | 0.9166 | 0.3242 | 0.1227 | 0.1925 |
| D1 | 3 | 0.8333 | 0.3597 | 0.0024 | 0.1213 |
| D1 | 4 | 0.7500 | 0.2594 | 0.2316 | 0.2168 |
| D1 | 5 | 0.6667 | 0.4497 | 0.0231 | 0.1338 |
| D1 | 6 | 0.5833 | 0.2535 | 0.0337 | 0.0794 |
| D1 | 7 | 0.5000 | 0.3890 | 0.0262 | 0.0935 |
| D1 | 8 | 0.4167 | 0.1799 | 0.1930 | 0.1458 |
| D1 | 9 | 0.3333 | 0.3597 | 0.1106 | 0.1143 |
| D1 | 10 | 0.2500 | 0.1799 | 0.0496 | 0.0477 |
| D1 | 11 | 0.1667 | 0.2007 | 0.0233 | 0.0273 |
| D1 | 12 | 0.0833 | 0.6295 | 0.1930 | 0.1368 |
| D2 | 1 | 1 | 0.4663 | 0.1227 | 0.2601 |
| D2 | 2 | 0.9286 | 0.3622 | 0.0481 | 0.1633 |
| D2 | 3 | 0.8571 | 0.5731 | 0.0981 | 0.2553 |
| D2 | 4 | 0.7857 | 0.2415 | 0.1096 | 0.1416 |
| D2 | 5 | 0.7143 | 0.4093 | 0.0035 | 0.1190 |
| D2 | 6 | 0.6429 | 0.2555 | 0.0010 | 0.0663 |
| D2 | 7 | 0.5714 | 0.3622 | 0.0829 | 0.1325 |
| D2 | 8 | 0.5000 | 0.1637 | 0.0021 | 0.0340 |
| D2 | 9 | 0.4286 | 0.3274 | 0.0137 | 0.0643 |
| D2 | 10 | 0.3571 | 0.1637 | 0.0098 | 0.0292 |
| D2 | 11 | 0.2857 | 0.3275 | 0.1236 | 0.1115 |
| D2 | 12 | 0.2143 | 0.5730 | 0.1810 | 0.1577 |
| D2 | 13 | 0.1429 | 0.3275 | 0.0220 | 0.0319 |
| D2 | 14 | 0.0714 | 0.4093 | 0.1738 | 0.1160 |

In Table 11, DNO denotes the combined news document and SNO denotes the position of a sentence within the combined news (DNO). "Position" is the weight computed from the sentence position according to Equation 27. "tf-isf" is the average weight computed according to Equation 26 from the number of occurrences of each word in the sentence and the inverse frequency of the sentences in which that word occurs. "Term similarity" is the weight (sentence similarity) computed according to Equation 24 by measuring the semantic distance between terms. "Score" is the final sentence importance obtained by substituting these values into Equations 28 and 29. In Table 11, the sentence with the highest final value (sentence importance) is the first sentence of each document, which means that the most important sentence in the text is the first sentence.
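Equations 27 to 29 are not reproduced in this part of the description, so the sketch below rests on two assumptions inferred from the numbers in Table 11 rather than quoted from the patent: the Position column matches (N - i + 1) / N for the i-th of N sentences, and the Score column is numerically consistent with 0.4 × (Position × tf-isf) + 0.6 × (Term similarity). The function names and the parameter `alpha` are hypothetical.

```python
def position_weight(index: int, n_sentences: int) -> float:
    """Weight for the sentence at 1-based `index`: 1.0 for the first, 1/N for the last."""
    return (n_sentences - index + 1) / n_sentences


def combined_score(position: float, tf_isf: float, term_similarity: float,
                   alpha: float = 0.4) -> float:
    """Blend the sentence weight (position * tf-isf) with the term similarity.

    alpha = 0.4 is inferred from Table 11, not taken from Equations 28 and 29.
    """
    return alpha * position * tf_isf + (1 - alpha) * term_similarity


# (SNO, tf-isf, term similarity, published Score) for the first rows of D1.
D1_ROWS = [
    (1, 0.3976, 0.1994, 0.2789),
    (2, 0.3242, 0.1227, 0.1925),
    (3, 0.3597, 0.0024, 0.1213),
    (4, 0.2594, 0.2316, 0.2168),
]

for sno, tf_isf, similarity, published in D1_ROWS:
    pos = position_weight(sno, 12)  # D1 has 12 sentences
    score = combined_score(pos, tf_isf, similarity)
    print(f"SNO {sno}: position={pos:.4f} recomputed={score:.4f} published={published:.4f}")
```

Sorting sentences by this score in descending order and keeping the top few would then give the snippet sentences; the recomputed values agree with the published ones to within rounding.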

FIG. 14 is a graph showing, for each news category, the number of target news items used for combination and the average number of sentences. As shown in FIG. 14, "Police" is the field with the most target news items used for combination, and in every field except "Health" the average number of sentences decreases as the number of news articles decreases.

Using the combined news data and the sentence importance values obtained above, the recall for each summary ratio was calculated according to Equation 31. For example, when a news item has 14 sentences and the summary ratio is 10%, the summary has one sentence; if the sentence ranked first by the user is the same as the sentence ranked first by the news providing system, the recall is 100%, and if they differ, the recall is 50%. When the summary ratio is 20%, the summary has three sentences; if the top three sentences selected by the user match the top three selected by the news providing system, the recall is 100%, and otherwise the recall is the number of matching sentences divided by the number of sentences. The recall calculated for each category is shown in FIG. 15.
Referring to FIG. 15, the "Web-exclusive" category performs best among the six fields. The average number of sentences in "Web-exclusive" is 6, the smallest of the six fields, because the amount of collected data is small. In contrast, the "Police" and "Health" fields perform poorly because they have the largest average numbers of sentences among the six fields. The four fields other than the lowest-performing "Police" and "Health" differ in performance at summary ratios below 50% but show consistent performance at summary ratios of 50% or more. Overall, the average recall across all six fields is 74.2% at a summary ratio of 10%, 70.8% at 20%, and 71.4% at 30%. Thus, summaries composed by the news providing method according to the embodiment of the present invention achieve a performance of 70% or more regardless of the summary ratio.

To check whether the summaries generated by the news providing system of the present invention are adequate, 100 recommended news items in each of the six news fields were analyzed using Equations 32 and 33.

[Equation 32]

[Equation 33]

Equation 32 gives the proportion of recommended news items for which the user could judge, from the summary alone and without reading the body, whether the item contained the desired information. Equation 33 gives the non-fit proportion, that is, the proportion of items for which the summary was insufficient and the user had to read the body to make that judgment.

FIG. 16 is a graph showing the snippet fit rate and the snippet non-fit rate for the summaries (snippets) generated by the news providing method according to an embodiment of the present invention. In about 76.2% of cases the user judged relevance from the summary alone without reading the body, and in about 23.8% of cases the user made the judgment after reading the body. The objective performance can be evaluated by considering the time a user spends deciding whether a news item is relevant or not when using the summary and when using the body.

Table 12 below shows the average time users took to judge a recommended news item from its snippet and the average time they took to judge it from the body.

| | Average time to judge as relevant (s) | Average time to judge as not relevant (s) |
|---|---|---|
| Judging from the snippet | 4.3 | 5.6 |
| Judging from the body | 32.4 | 26.8 |

The average time to judge whether the desired information is present is about 5 seconds when looking at the snippet and 29.6 seconds when reading the body. In other words, judging from the snippet reduces the time a user spends by a factor of about 6. Since the news providing system according to the embodiment of the present invention achieves a performance of about 76% as shown in FIG. 16, the time a user spends is reduced by a factor of roughly 2.5 overall.
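The description does not state how the factor of 2.5 is derived. One reading that reproduces it, offered here only as an assumption, is that a user always reads the snippet first and reads the body as well in the roughly 23.8% of cases where the snippet is insufficient. The names below are illustrative.

```python
# Averages taken from Table 12; the fit rate is the value reported for FIG. 16.
T_SNIPPET = (4.3 + 5.6) / 2   # about 5 s to judge from the snippet
T_BODY = (32.4 + 26.8) / 2    # 29.6 s to judge from the body
FIT_RATE = 0.762              # share of items judged from the snippet alone


def expected_time_with_snippet(t_snippet: float, t_body: float, fit_rate: float) -> float:
    """Expected judging time if the body is read only when the snippet is insufficient."""
    return t_snippet + (1 - fit_rate) * t_body


speedup_ideal = T_BODY / T_SNIPPET  # every item judged from the snippet alone
speedup_expected = T_BODY / expected_time_with_snippet(T_SNIPPET, T_BODY, FIT_RATE)

print(f"ideal speed-up:    {speedup_ideal:.2f}x")
print(f"expected speed-up: {speedup_expected:.2f}x")
```

Under this model the ideal ratio is close to 6 and the expected ratio is close to 2.5, which is consistent with the two figures quoted in the text.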

The above embodiments are presented to aid understanding of the present invention and do not limit its scope; it should be understood that various modified embodiments derived from them may also fall within the scope of the present invention. For example, each component shown in the embodiments of the present invention may be implemented in a distributed manner, and conversely, components that are distributed may be implemented in combination. Therefore, the technical scope of protection of the present invention should be determined by the technical idea of the claims, and it should be understood that this scope is not limited to the literal wording of the claims but extends in substance to inventions of equivalent technical value.

100: news providing system 101: seed news database
102: search engine 103: target news database
104 to 106: news providing server systems 110: duplicate candidate news detection unit
111: preprocessing unit 112: duplicate candidate news determination unit
120: duplicate news detection unit 121: duplicate news detection preprocessing unit
122: duplicate news determination unit 130: duplicate news database
140: duplicate news combining unit 141: similarity calculation unit
142: combined news generation unit 150: sentence similarity calculation unit
160: sentence weight calculation unit 170: summary sentence generation unit
180: news recommendation unit

Claims

1. A news summary generation system comprising:
a sentence similarity calculation unit configured to calculate a sentence similarity between each sentence of combined news, generated by combining predetermined seed news with one or more duplicate news items for the seed news, and the titles of the seed news and the duplicate news; and
a summary sentence generation unit configured to determine a priority ranking of the sentences of the combined news based on the sentence similarity, and to generate a summary of the combined news based on the determined priority ranking,
wherein the summary sentence generation unit generates the summary by extracting a predetermined number of sentences having the highest priority ranking from among the sentences of the combined news.

2. The system of claim 1,
wherein the sentence similarity calculation unit
extracts words from the title of the seed news, the title of the duplicate news, and the sentences of the combined news, and calculates the sentence similarity by calculating the similarity between the words extracted from the titles of the seed news and the duplicate news and the words of each sentence of the combined news.

3. The system of claim 1,
wherein the sentence similarity calculation unit
detects, in a WordNet-based lexical hierarchy, the least upper noun that subsumes both a noun in the titles of the seed news and the duplicate news and a noun appearing in a sentence of the combined news; calculates a word importance by an operation proportional to the numbers of synonym terms of the noun in the titles, of the noun appearing in the sentence of the combined news, and of the least upper noun; and calculates the sentence similarity as the average of the word importance over the nouns in the sentence of the combined news.

4. The system of claim 1,
further comprising a sentence weight calculation unit configured to calculate a sentence weight based on the frequency of occurrence of words in a sentence of the combined news and the position of the sentence,
wherein the summary sentence generation unit determines the priority ranking of the sentences based on the sentence similarity and the sentence weight.

5. The system of claim 4,
wherein the sentence weight calculation unit comprises:
a first sentence weight calculation unit configured to calculate a first sentence weight according to word frequency by calculating the average frequency of the words appearing in a sentence of the combined news;
a second sentence weight calculation unit configured to assign a second sentence weight that is higher for sentences positioned earlier in the combined news; and
a calculation unit configured to calculate the sentence weight by multiplying the first sentence weight by the second sentence weight.

6. (Deleted)

7. A news providing system comprising:
a seed news database storing predetermined seed news;
a search engine for searching for target news using a search term including a word appearing in the title of the seed news;
a target news database storing the retrieved target news;
a duplicate candidate news detection unit for extracting a title from each of the seed news and the target news, calculating the similarity between the extracted title of the seed news and the title of the target news, and detecting duplicate candidate news based on the calculated similarity between the seed news and the target news;
a duplicate news detection unit for extracting content from the seed news and the duplicate candidate news, calculating the similarity between the sentences included in the extracted content of the seed news and the sentences included in the content of the duplicate candidate news, and detecting duplicate news among the duplicate candidate news based on the calculated similarity between the sentences;
a duplicate news database for storing the detected duplicate news;
a duplicate news combining unit for combining the seed news and the duplicate news into one document to generate combined news;
a sentence similarity calculation unit for calculating a sentence similarity between each sentence of the combined news and the titles of the seed news and the duplicate news;
a summary sentence generation unit for determining a priority ranking of the sentences of the combined news based on the sentence similarity, and generating a summary of the combined news based on the determined priority ranking; and
a news recommendation unit for providing the combined news together with the summary in the form of a web page.

8. A news summary generation method comprising:
a sentence similarity calculating step of calculating a sentence similarity between each sentence of combined news, generated by combining predetermined seed news with one or more duplicate news items for the seed news, and the titles of the seed news and the duplicate news; and
a summary sentence generating step of determining a priority ranking of the sentences of the combined news based on the sentence similarity, and generating a summary of the combined news based on the determined priority ranking,
wherein the summary sentence generating step generates the summary by extracting a predetermined number of sentences having the highest priority ranking from among the sentences of the combined news.

9. The method of claim 8,
wherein the sentence similarity calculating step
extracts words from the title of the seed news, the title of the duplicate news, and the sentences of the combined news, and calculates the sentence similarity by calculating the similarity between the words extracted from the titles of the seed news and the duplicate news and the words of each sentence of the combined news.

10. The method of claim 8,
wherein the sentence similarity calculating step
detects, in a WordNet-based lexical hierarchy, the least upper noun that subsumes both a noun in the titles of the seed news and the duplicate news and a noun appearing in a sentence of the combined news; calculates a word importance by an operation proportional to the numbers of synonym terms of the noun in the titles, of the noun appearing in the sentence of the combined news, and of the least upper noun; and calculates the sentence similarity as the average of the word importance over the nouns in the sentence of the combined news.

11. The method of claim 8,
further comprising, between the sentence similarity calculating step and the summary sentence generating step, a sentence weight calculating step of calculating a sentence weight based on the frequency of occurrence of words in a sentence of the combined news and the position of the sentence,
wherein the summary sentence generating step determines the priority ranking of the sentences based on the sentence similarity and the sentence weight.

12. The method of claim 11,
wherein the sentence weight calculating step comprises:
a first sentence weight calculating step of calculating a first sentence weight according to word frequency by calculating the average frequency of the words appearing in a sentence of the combined news;
a second sentence weight calculating step of assigning a second sentence weight that is higher for sentences positioned earlier in the combined news; and
a step of calculating the sentence weight by multiplying the first sentence weight by the second sentence weight.

13. (Deleted)

14. A computer-readable recording medium having recorded thereon a program for executing the method according to any one of claims 8 to 12.

15. A news providing method comprising:
searching for target news using a search term including a word appearing in the title of predetermined seed news;
extracting a title from each of the seed news and the target news, calculating the similarity between the extracted title of the seed news and the title of the target news, and detecting duplicate candidate news based on the calculated similarity;
extracting content from the seed news and the duplicate candidate news, calculating the similarity between the sentences included in the extracted content of the seed news and the sentences included in the content of the duplicate candidate news, and detecting duplicate news among the duplicate candidate news based on the similarity between the sentences;
generating combined news by combining the seed news and the duplicate news into one document;
calculating a sentence similarity between each sentence of the combined news and the titles of the seed news and the duplicate news;
determining a priority ranking of the sentences of the combined news based on the sentence similarity, and generating a summary of the combined news based on the determined priority ranking; and
providing the combined news together with the summary in the form of a web page.