KR102084861B1

KR102084861B1 - System and method for analyzing and utilizing how authors refer to an event in a web document to predict the distribution of how much readers consider such document credible

Info

Publication number: KR102084861B1
Application number: KR1020180128806A
Authority: KR
Inventors: 박종철; 양원석; 정진우; 송호윤; 이희제
Original assignee: 한국과학기술원
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2020-03-04

Abstract

Disclosed are a method for analyzing and utilizing the manner in which an author refers to an event in a web document to predict credibility distribution of readers for a given document, and a system thereof. According to an embodiment of the present invention, the method for predicting a reader credibility distribution comprises the steps of: calculating event credibility for each event information extracted from the web documents by considering authors of the web documents as virtual readers; receiving a document for performing credibility distribution prediction from a user; and predicting the credibility distribution of the received document of each of reader groups distinguished according to a plurality of preset demographic characteristics based on the event credibility calculated for each event information.

Description

System and method for analyzing and utilizing how authors refer to an event in a web document to predict the distribution of how much readers consider such document credible

The present invention relates to a technique for predicting the distribution of readers' credibility for a document. More specifically, it relates to a method and system that analyze the expressions web document authors use when referring to events in web documents to calculate the authors' credibility toward those events, and that, for a document input by a user, treat the web document authors as virtual readers to automatically predict the credibility distribution that readers representing the general public would exhibit.

Since the advent of the Internet, the amount of text content encountered in everyday life has grown rapidly, and online text platforms such as Internet news, social media, blogs, and Internet libraries continue to grow in scale.

Given time constraints, it is impossible for an individual user to read all the text content offered by a large online platform. Because of this constraint, users who encounter text content on such a platform decide whether to skip it entirely, skim it quickly, or take the time to read it closely. This decision carries a considerable cognitive burden, so a need has emerged for decision support that reduces it.

In connection with this decision, methods have been proposed that preferentially present text the user is expected to be highly satisfied with after reading. Among them, methods that use credibility indicators for the text content and for its author have been proposed and are now used on many text content platforms.

Korean Registered Patent Publication No. 10-1284788, one example of the prior art, relates to a "credibility-based question answering apparatus and method" and aims to evaluate credibility from multiple angles, including document quality, source, and answer extraction strategy. It measures the source credibility of each candidate answer document using the relevance between the document's source and the user query together with the trustworthiness of that source, and it measures the extraction strategy credibility of each candidate answer document using the fitness between the user query and the extraction strategy for that document. This prior art also uses six quality evaluation features: currency, availability, information-to-noise ratio, authority, popularity, and cohesiveness.

Korean Registered Patent Publication No. 10-1859620, another example of the prior art, relates to a "method and system for credibility-based content recommendation in an online social network". Its aim is to provide a credibility-based content recommendation method that computes user credibility in a composite manner by collecting other user behaviors in addition to user behavior in the online social network. It collects behavior data of users active in the online social network, calculates user credibility from the collected behavior data, collects content data from the online social network, and calculates content credibility using the user behavior data and the collected content data.

However, when the goal is to assist the user in deciding whether to read text content, these prior arts can give rise to the following problems.

First, because credibility is computed and presented as a single number, it is difficult to satisfy two users at once when they trust the same text content to different degrees. That is, after reading a given text, user A may find it highly credible while user B finds it barely credible. If the predicted credibility is set high and shown to both A and B, B's satisfaction after reading may drop. Conversely, if it is set low, A's satisfaction may drop, and if it is set to an intermediate value, the satisfaction of both A and B may drop.

Second, because whether a person trusts text content can depend on individual subjectivity, even a document credibility computed from each user's personal information is likely to differ from the trust a particular user actually shows after reading the document. In particular, when a single-number credibility is used, the situation in which the credibility presented before reading differs from the trust the user shows after reading can recur, lowering user satisfaction.

Third, the decision on whether to read text content is tied to various personal factors such as the user's interests, available time, and emotional state at the time of reading. When credibility is presented simply as a single number, the user may feel that they have not received sufficiently helpful decision support information.

Embodiments of the present invention provide a method and system that analyze the expressions web document authors use when referring to events in web documents to calculate the authors' credibility toward those events, and that, for a document input by a user, treat the web document authors as virtual readers to automatically predict the credibility distribution that readers representing the general public would exhibit.

A method for predicting a reader credibility distribution according to an embodiment of the present invention includes: calculating event credibility for each piece of event information extracted from web documents by regarding the authors of the web documents as virtual readers; receiving, from a user, a document on which credibility distribution prediction is to be performed; and predicting, based on the event credibility calculated for each piece of event information, the credibility distribution for the input document of each of the reader groups distinguished according to a plurality of preset demographic characteristics.

The calculating of the event credibility may include: collecting the web documents and extracting each piece of event information from the collected web documents; calculating event credibility for each piece of extracted event information; and aggregating the event credibility values calculated for the same event information among the extracted event information and, based on the aggregated values, generating an event credibility distribution, which is a statistical distribution of the number of readers by event credibility value. The predicting may predict the credibility distribution for the user's event information extracted from the input document by using the event credibility distribution for each piece of the user's event information.
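The aggregation step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_event_distributions`, the `(event_id, score)` input shape, and the use of an integer scale from -5 to +5 for event credibility are all choices made for the example.

```python
from collections import Counter

# Assumed integer credibility scale; this passage does not fix a scale
# for event credibility.
SCORES = range(-5, 6)

def build_event_distributions(observations):
    """Group per-author scores by event and count authors per score.

    observations: iterable of (event_id, score) pairs, one pair per virtual
    reader (web document author) who mentioned the event.
    Returns {event_id: {score: number_of_virtual_readers}}.
    """
    per_event = {}
    for event_id, score in observations:
        if score not in SCORES:
            raise ValueError(f"score {score!r} is outside the assumed scale")
        per_event.setdefault(event_id, Counter())[score] += 1
    # Fill in zero counts so every distribution covers the whole scale.
    return {
        event_id: {s: counts[s] for s in SCORES}
        for event_id, counts in per_event.items()
    }
```

Keeping raw counts rather than normalized frequencies follows the wording "number of readers by event credibility value"; a caller could divide by the total if probabilities are needed.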

The demographic characteristics may include at least one of: a first item for the gender of the reader who judges the credibility of the document, a second item for the reader's age, a third item for the reader's occupation, a fourth item for the reader's topics of interest, and a fifth item for whether the reader is an expert on the topic of the document.

The calculating of the event credibility may calculate the event credibility based on the numbers of positive expressions, negative expressions, trust expressions, distrust expressions, agreement expressions, and disagreement expressions that an author used when referring to the event in the web document.
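This passage names the six expression counts that feed the event credibility but gives no formula for combining them. The sketch below uses one hypothetical combination purely for illustration: it treats positive, trust, and agreement expressions as supportive, treats negative, distrust, and disagreement expressions as opposing, and maps their balance onto an assumed integer scale. The names `ExpressionCounts` and `event_credibility`, the equal weighting, and the scale are assumptions, not the method of the invention.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpressionCounts:
    """Counts of expressions one author used when referring to one event."""
    positive: int = 0
    negative: int = 0
    trust: int = 0
    distrust: int = 0
    agree: int = 0
    disagree: int = 0

def event_credibility(c: ExpressionCounts, scale: int = 5) -> int:
    """Hypothetical score: balance of supportive vs. opposing expressions.

    Returns an integer in [-scale, +scale]; 0 when no expression was found.
    """
    supportive = c.positive + c.trust + c.agree
    opposing = c.negative + c.distrust + c.disagree
    total = supportive + opposing
    if total == 0:
        return 0  # no evidence either way
    return round(scale * (supportive - opposing) / total)
```

Under these assumptions, an author who uses only supportive expressions scores +5 and one who uses only opposing expressions scores -5. A real system might instead learn weights for the six counts.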

Furthermore, the method for predicting a reader credibility distribution according to an embodiment of the present invention may further include providing the user with the predicted credibility distribution of each of the reader groups.

A system for predicting a reader credibility distribution according to an embodiment of the present invention includes: a virtual reader data collection unit that calculates event credibility for each piece of event information extracted from web documents by regarding the authors of the web documents as virtual readers; and a user input document processing unit that receives, from a user, a document on which credibility distribution prediction is to be performed and predicts, based on the event credibility calculated for each piece of event information, the credibility distribution for the input document of each of the reader groups distinguished according to a plurality of preset demographic characteristics.

The virtual reader data collection unit may collect the web documents, extract each piece of event information from the collected web documents, calculate event credibility for each piece of extracted event information, aggregate the event credibility values calculated for the same event information among the extracted event information, and, based on the aggregated values, generate an event credibility distribution, which is a statistical distribution of the number of readers by event credibility value. The user input document processing unit may predict the credibility distribution for the user's event information extracted from the input document by using the event credibility distribution for each piece of the user's event information.

The virtual reader data collection unit may calculate the event credibility based on the numbers of positive expressions, negative expressions, trust expressions, distrust expressions, agreement expressions, and disagreement expressions that an author used when referring to the event in the web document.

The user input document processing unit may provide the user with the predicted credibility distribution of each of the reader groups.

According to embodiments of the present invention, the expressions web document authors use when referring to events in web documents are analyzed to calculate the authors' credibility toward those events, and, for a document input by a user, the web document authors are treated as virtual readers so that the credibility distribution that readers representing the general public would exhibit can be predicted automatically.

According to embodiments of the present invention, large-scale data that enables improved credibility distribution prediction accuracy can be collected and used at low cost, without expanding the prediction reference corpus, which is expensive to build.

The technique according to an embodiment of the present invention may have the following effects.

(1) Consider two users who trust the same text content to different degrees: after reading it, user A finds it highly credible and user B finds it barely credible. The present invention predicts the overall credibility distribution that readers will exhibit and provides it to both A and B. By showing each of them how people who share their level of credibility are distributed and how people with a different level are distributed, it can give both users high satisfaction.

(2) The present invention provides information on the credibility distribution that readers as a whole exhibit for given text content. The credibility a user shows after reading the text is only partial information contained within the distribution presented to that user. Therefore, unless the user surveys other readers directly, the user can roughly judge whether the distribution agrees with their common sense but can hardly judge whether it is accurate. This prevents the drop in user satisfaction that occurs when the user judges the presented credibility result to be wrong.

(3) The present invention predicts the credibility distribution that each reader group, distinguished by a plurality of demographic characteristics such as gender, age, and expert status, would exhibit for the document input by the user, and provides the result to the user. The user can consult the provided distributions while weighing relevant factors such as available time, interest, and emotional state to decide whether to read the text content. By providing detailed information on how other people regard the credibility of given text content, the present invention elicits the user's empathy and makes the user feel that sufficient decision support information has been provided, thereby increasing overall user satisfaction.

The present invention is a system that assists human decision making based on the credibility information of many readers. It can be applied to various platforms that must provide credible content, and it can be used for public opinion analysis through readers' credibility distributions for news articles, for augmenting existing sentences into more credible ones, and for self-assessment and feedback on the credibility of utterances in conversation.

In addition, the present invention can be applied to online news platforms, social networking services, Internet document platforms, and other services that need to provide credible content.

FIG. 1 shows the configuration of a reader credibility distribution prediction system according to an embodiment of the present invention.
FIG. 2 shows the configuration of an embodiment of the virtual reader data collection unit of FIG. 1.
FIG. 3 shows the configuration of an embodiment of the user input document processing unit of FIG. 1.
FIG. 4 shows an example illustrating the output of the reader credibility distribution prediction system of the present invention.
FIG. 5 shows an example of the prediction reference corpus included in the reader credibility distribution prediction system.
FIG. 6 shows an example of the event credibility database included in the reader credibility distribution prediction system.
FIG. 7 is an operation flowchart of an embodiment performed by the virtual reader data collection unit shown in FIG. 1.
FIG. 8 is an operation flowchart of an embodiment performed by the user input document processing unit shown in FIG. 1.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular includes the plural unless the wording specifically states otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are expressly and specifically defined.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

The gist of the embodiments of the present invention is to analyze the expressions web document authors use when referring to events in web documents to calculate the authors' credibility toward those events, and, for a document input by a user, to treat the web document authors as virtual readers and automatically predict the credibility distribution that readers representing the general public would exhibit.

That is, instead of providing credibility as a single number, the present invention automatically predicts a credibility distribution, which is the statistical distribution of the credibility that readers other than the user are expected to show after reading the document, and presents this distribution to the user.

Automatically predicting such a credibility distribution requires a prediction model. In particular, building the prediction model through machine learning requires reference data for training, namely a corpus: a collection of documents that stores credibility distributions collected through actual surveys. In general, prediction accuracy increases with the number of documents whose credibility distribution has been measured and with the number of survey participants. However, collecting readers' credibility distributions for a large number of documents through such surveys is costly.

The present invention can collect and analyze, at low cost, large-scale data that enables improved credibility distribution prediction accuracy, without relying on expanding the corpus through additional credibility surveys.

FIG. 1 shows the configuration of a reader credibility distribution prediction system according to an embodiment of the present invention.

Referring to FIG. 1, the reader credibility distribution prediction system 100 according to an embodiment of the present invention includes a user input document processing unit 120, a virtual reader data collection unit 110, and a prediction reference corpus 130.

The user input document processing unit 120 automatically predicts and outputs the credibility distribution that the public exhibits for a document input by the user.

For example, the user input document processing unit 120 receives a document from the user and outputs the credibility distribution that readers would exhibit for it. The output distribution may differ for each combination of demographic characteristics that defines a reader group, and the credibility that one reader assigns to one document may be an integer from -5 (not credible at all) to +5 (highly credible). Of course, the scope of the present invention is not limited to this, and it may be practiced with other credibility ranges.

Here, the demographic characteristics may include at least one of: a first item for the gender of the reader who judges the credibility of the document, a second item for that reader's age, a third item for that reader's occupation, a fourth item for that reader's topics of interest, and a fifth item for whether that reader is an expert on the topic of the document.

The demographic characteristics shown in FIG. 4 illustrate the case in which all five items, from the first to the fifth, are included. However, the scope of the present invention is not limited to this case and, as described above, covers every case that includes one or more of the items. As shown in FIG. 4, the user input document processing unit 120 automatically predicts, for a document input by the user, the credibility distribution that each of the various reader groups distinguished by a plurality of demographic characteristics would exhibit, and outputs the result to the user.

Here, the degree of trust of a reader group distinguished by a plurality of demographic characteristics is provided as a distribution rather than a single number because two people in the same reader group can show different credibility for the same document. That is, even if two people have the same gender, age, occupation, topics of interest, and expert status, they may assign different credibility values to one document. To reflect in the system these individual differences in credibility, which are expected to remain even after reader groups are distinguished by a plurality of demographic characteristics, the system according to an embodiment of the present invention outputs a credibility prediction defined as a distribution rather than a single number.

In FIG. 4, reader group 1 consists of men in their 40s who are non-experts, work in the service industry, and are interested in life/health; this group may show a credibility distribution skewed toward low credibility for the input document. Reader group 2 consists of women in their 20s who are experts, work in education, and are interested in politics/policy; this group may show a credibility distribution skewed toward high credibility for the input document.
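The five demographic items and the two groups described for FIG. 4 can be represented with a small data structure. The following Python sketch is illustrative only: the class name `ReaderGroup`, its field names, and the string encodings of each item are assumptions, since the description does not prescribe a representation.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass(frozen=True)
class ReaderGroup:
    """A reader group keyed by up to five demographic items (illustrative names)."""
    gender: Optional[str] = None       # first item
    age_band: Optional[str] = None     # second item
    occupation: Optional[str] = None   # third item
    interest: Optional[str] = None     # fourth item
    is_expert: Optional[bool] = None   # fifth item

    def __post_init__(self):
        # The description requires at least one of the five items.
        if all(getattr(self, f.name) is None for f in fields(self)):
            raise ValueError("a reader group needs at least one demographic item")

# The two groups described for FIG. 4.
GROUP_1 = ReaderGroup("male", "40s", "service industry", "life/health", False)
GROUP_2 = ReaderGroup("female", "20s", "education", "politics/policy", True)
```

Because the dataclass is frozen, instances are hashable and can serve as dictionary keys, for example to map each group to its predicted credibility distribution.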

As described above, the user input document processing unit 120 automatically predicts, for a document input by the user, the credibility distribution that each reader group distinguished by demographic characteristics would exhibit, and outputs the result to the user. Through this, when a user encounters text content, the reader credibility distribution prediction system according to an embodiment of the present invention can provide the user with decision support information for deciding whether to (1) skip the text content entirely, (2) skim it quickly, or (3) take the time to read it closely.

Presenting to the user, in the form of a statistical distribution, whether other readers, and in particular groups of people with different demographic characteristics, trust or distrust a given document can be understood as providing information in a way that elicits empathy between the user and other readers. Through this, the system according to an embodiment of the present invention can be regarded as providing the user with sufficient information in the process of deciding whether to read the given text content.

The prediction reference corpus (130) stores the readers' reliability distribution for each document, collected through direct reliability surveys. The corpus can be used as the training reference when a prediction model that automatically predicts reliability distributions is built through machine learning.

FIG. 5 shows an example of the prediction reference corpus included in the reader reliability distribution prediction system. As shown in FIG. 5, the prediction reference corpus stores a set of documents and, for each document, the reliability distribution that each reader group, distinguished by a plurality of demographic characteristics, actually showed in a survey.

Preferably, the set of documents stored in the prediction reference corpus includes various types of documents on various topics that may be encountered in everyday life.

For example, document topics may include life, health, politics, policy, economy, and environment, and document types may include SNS posts, blog posts, online news, online forum posts, research papers, and books. Preferably, the topics and types of the surveyed documents should not be uniform across the corpus. This is because, in the system according to an embodiment of the present invention, the prediction reference corpus (130) collected through surveys is used as the training reference for automatically predicting reliability distributions.

For instance, if the prediction reference corpus (130) contains only documents on life/health and the reliability survey results for those documents, a prediction model trained on that corpus would be unsuitable for predicting the reliability distribution of a document on politics/policy, and its predictions can be expected to differ from the reliability distribution that actual readers would show.

The virtual reader data collection unit (110) analyzes, for each event in a given web document, the degree of trust that the author directly expressed through the text of the document, performs this analysis on all documents discoverable on the web, statistically processes the analysis results, and stores the authors' reliability distribution for each event in a database, for example, the event trust database.

FIG. 6 shows an example of the event trust database included in the reader reliability distribution prediction system. As shown in FIG. 6, the event trust database converts the degree of trust that authors expressed through text for an event mentioned in web documents into a statistical distribution of the number of authors at each reliability value (hereinafter, the "event reliability distribution") and stores it. That is, the event reliability distribution is generated from the individual reliability values and stored.

Here, one event may be defined as one predicate appearing in the text, the semantic roles obtained for that predicate through semantic role labeling, and the set of words corresponding to each semantic role.

FIG. 6 includes an example distribution of the reliability that authors of web documents directly expressed through text for the event (hereinafter, the "example event") corresponding to the meaning of the sentence "A tense competitive relationship between Company A and Company B is expected over the next 10 years", namely "predicate: is expected", "subject: Company A and Company B", "object: tense relationship", "temporal: over the next 10 years".
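The event definition above (one predicate plus its semantic roles and their words) can be sketched as a small hashable record, so that the same event found in different documents maps to the same database key. The class name `Event`, the helper `make_event`, and the English role strings are illustrative assumptions, not names from the specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Event:
    """One predicate plus its labeled semantic roles (hypothetical layout)."""
    predicate: str
    # Each entry is (role_name, words); a tuple keeps the event hashable.
    roles: Tuple[Tuple[str, str], ...]


def make_event(predicate: str, **roles: str) -> Event:
    # Sort the roles so the same event always produces the same key,
    # regardless of the order in which the roles were extracted.
    return Event(predicate, tuple(sorted(roles.items())))


# The example event described for FIG. 6.
example_event = make_event(
    "is expected",
    subject="Company A and Company B",
    object="tense relationship",
    temporal="over the next 10 years",
)
```

Because the record is frozen and its roles are sorted, it can be used directly as a dictionary key when event reliability distributions are stored per event.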

For example, Author_1, the author of an SNS post (Doc_1), reveals distrust of the example event through the sentence in Doc_1, "It is a groundless rumor that Company A and Company B will have a tense relationship over the next 10 years." Author_2, the author of an online news article (Doc_2), reveals neutral trust in the example event through the sentence in Doc_2, "Whether it is true that Company A and Company B will have a tense relationship over the next 10 years can only be judged after those 10 years have passed." Author_3, the author of a blog post (Doc_3), reveals trust in the example event through the sentence in Doc_3, "Reviewing the surrounding evidence and analyzing the historical trend, it is a very clear and reasonable inference that Company A and Company B will have a tense relationship over the next 10 years." Author_4, the author of a research document (Doc_4), reveals distrust of the example event through the sentence in Doc_4, "This study shows that, from an economic point of view, once two companies above a certain size begin to compete, one company has merged with the other within a short time; it can therefore be inferred that Company A and Company B are unlikely to remain in tight competition over the next 10 years."

Whether each author trusts the event may be statistically converted into an event reliability distribution, as shown in FIG. 6, and stored in the event trust database (115).

The event reliability distributions stored in the event trust database (115) are used as data that assist the user input document processing unit (120) in predicting the reliability distribution of a document input by the user. This use of the data can be understood as regarding the authors of web documents as virtual readers of the user's input document. That is, the authors of web documents wrote their documents to describe their own opinions and knowledge, but within those documents they can be seen to directly reveal, through the text, their own degree of trust in particular events, as in the example described for FIG. 6 (Author_1 to Author_4). Collecting the opinions that web document authors reveal through text can be carried out at lower cost than collecting readers' reliability through direct surveys.

In this way, the reader reliability distribution prediction system according to an embodiment of the present invention uses the virtual reader data collection unit (110) to collect, at low cost and on a large scale, the reliability that authors on the web have revealed about events, and can use this information to improve the reliability distribution prediction of the user input document processing unit (120).

FIG. 2 shows the configuration of an embodiment of the virtual reader data collection unit of FIG. 1.

As shown in FIG. 2, the virtual reader data collection unit (110) includes a web document collection unit (111), an event extraction unit (112), an event reliability processing unit (113), a statistical processing unit (114), and an event trust database (115).

The web document collection unit (111) collects the web documents that can be collected from the web.

Here, the web documents collected by the web document collection unit (111) may include SNS posts, blog posts, online news, online forum posts, research papers, and books.

Furthermore, the web document collection unit (111) collects, along with each web document, information on a plurality of demographic characteristics that characterize the author and, where accessible, the publication date of the web document. Preferably, where the author's unique identifier is accessible, the web document collection unit (111) uses it to distinguish authors, so that multiple documents written by the same author can be recognized as the work of one author.

The event extraction unit (112) extracts event information from the web documents.

Here, as described above, an event may be defined as a predicate determined by semantic role labeling together with the set of words corresponding to each semantic role, and the semantic role labeling may be carried out with a semantic role labeler such as DeepSemanticRoleLabeling or PathLSTM.

The event reliability processing unit (113) calculates the event reliability, that is, the author's reliability for an event, using one or more of the positive, negative, trust, distrust, agreement, and disagreement expressions that the author, as a virtual reader, used in referring to the event.

Here, the event reliability may be calculated by Equation 1 below.

[Equation 1]

Here, PositiveCount is the number of positive expressions the author used in referring to the event within the web document, NegativeCount is the number of negative expressions, TrustCount is the number of trust expressions, DistrustCount is the number of distrust expressions, AgreeCount is the number of agreement expressions, and DisagreeCount is the number of disagreement expressions, each counted over the expressions the author used in referring to the event within the web document.

Here, α₁ to α₆ are non-negative real numbers representing weights. A weight of 0 means that the number of occurrences of the corresponding expression is not considered; for example, if α₅ is 0, the number of occurrences of agreement expressions is not considered. At least one of α₁ to α₆ has a non-zero value.
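The body of Equation 1 is not reproduced in this text, so its exact form cannot be shown here. The following sketch only illustrates how six counts and six non-negative weights of this kind could be combined into one score; the normalized-difference form, the function name `event_reliability`, the dictionary keys, and the resulting range of -1 to 1 are all assumptions made for illustration and are not taken from the specification.

```python
SUPPORTIVE = ("positive", "trust", "agree")
OPPOSING = ("negative", "distrust", "disagree")


def event_reliability(counts, weights):
    """Combine expression counts into one score (ASSUMED form, not Equation 1).

    counts  -- dict mapping expression type to its occurrence count
    weights -- dict mapping expression type to a non-negative weight
    """
    values = [weights.get(k, 0.0) for k in SUPPORTIVE + OPPOSING]
    if any(w < 0 for w in values):
        raise ValueError("weights must be non-negative")
    if not any(values):
        raise ValueError("at least one weight must be non-zero")

    # A weight of 0 removes that expression type from the score entirely.
    pos = sum(weights.get(k, 0.0) * counts.get(k, 0) for k in SUPPORTIVE)
    neg = sum(weights.get(k, 0.0) * counts.get(k, 0) for k in OPPOSING)
    total = pos + neg
    # With no weighted evidence either way, treat the author as neutral.
    return 0.0 if total == 0 else (pos - neg) / total
```

Under this assumed form, a score near 1 would indicate trust, a score near -1 distrust, and 0 a neutral stance; any other monotone combination of the same counts and weights would serve the same role.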

Preferably, the positive, negative, trust, distrust, agreement, and disagreement expressions described above may be identified using, respectively, a positive expression word set, a negative expression word set, a trust expression word set, a distrust expression word set, an agreement expression word set, and a disagreement expression word set. These word sets may be built through annotation by linguistics experts.

In addition, the event reliability processing unit (113) may determine whether a particular expression (for example, a positive expression) was used in referring to a given event by checking whether at least one of the words corresponding to the event's predicate and semantic roles is directly connected, in the dependency tree, to a word in the word set for that expression (for example, the positive expression word set in the case of a positive expression).

Here, the dependency tree of each sentence may be obtained with a parser such as StanfordCoreNLP.

For example, if the predicate of an event is directly connected to a positive expression word in the dependency tree of a sentence, the event reliability processing unit (113) regards that sentence as referring to the event positively. If neither the predicate nor any word corresponding to a semantic role of the event is directly connected to a positive expression word in the dependency tree of the sentence, the sentence is regarded as not referring to the event positively.
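The direct-connection test described above can be sketched as a check over dependency edges. In practice a parser would supply the edges; here they are passed in as plain (head, dependent) word pairs, which is a simplifying assumption, and the function name and example words are hypothetical.

```python
def refers_with_expression(edges, event_words, expression_words):
    """Return True if any event word is directly linked to an expression word.

    edges            -- iterable of (head_word, dependent_word) pairs for one
                        sentence's dependency tree (assumed input format)
    event_words      -- set of the event's predicate and semantic-role words
    expression_words -- word set for one expression type, e.g. positive words
    """
    for head, dependent in edges:
        # A direct connection is a single edge, in either direction.
        if head in event_words and dependent in expression_words:
            return True
        if dependent in event_words and head in expression_words:
            return True
    return False


# Toy sentence: "competition is clearly expected"
edges = [("expected", "competition"), ("expected", "clearly")]
event_words = {"expected", "competition"}
print(refers_with_expression(edges, event_words, {"clearly"}))
print(refers_with_expression(edges, event_words, {"rumor"}))
```

Running the same check once per expression type against its word set yields the six counts that feed the event reliability calculation.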

The statistical processing unit (114) aggregates the event reliability values that multiple authors reveal for the same event across many web documents and converts them into an event reliability distribution, that is, the distribution of the number of authors at each event reliability value. An example of the event reliability distribution is shown in FIG. 6.
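As a rough illustration of this conversion, the sketch below counts how many authors fall at each reliability value. Rounding scores to one decimal place to form the histogram bins, and the numeric scores themselves, are assumptions for illustration; the specification does not state how values are binned.

```python
from collections import Counter


def to_event_reliability_distribution(author_scores, ndigits=1):
    """Map each reliability value (rounded) to the number of authors holding it."""
    return Counter(round(score, ndigits) for score in author_scores)


# Illustrative scores for four authors: two distrust, one neutral, one trusts.
scores = [-1.0, 0.0, 1.0, -1.0]
distribution = to_event_reliability_distribution(scores)
```

Here `distribution[-1.0]` is 2 and the other two values each have one author, which is the kind of per-event histogram that would be stored in the event trust database.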

Here, the statistical processing unit (114) may distinguish authors according to the plurality of demographic characteristics and the unique identifiers of the authors collected by the web document collection unit (111).

For example, when an author's unique identifier cannot be determined or the author's demographic characteristics are unknown, the statistical processing unit (114) sets that author's demographic characteristics to "unknown" and assumes that different documents were each written by a different author. When the same author refers to the same event in different ways in two or more different documents, the author's reliability for that event is taken as the average of the reliability scores calculated for each document using Equation 1.
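The two rules above (average per known author, treat each unidentified document as a separate author) can be sketched as follows. The record layout of (author_id, score) pairs, with `None` standing for an unidentifiable author, is an assumption for illustration.

```python
from collections import defaultdict
from itertools import count
from statistics import mean


def per_author_scores(records):
    """Collapse per-document scores for one event into one score per author.

    records -- iterable of (author_id, score); author_id is None when the
               author cannot be identified (assumed input format)
    """
    by_author = defaultdict(list)
    anonymous = count()
    for author_id, score in records:
        if author_id is None:
            # Unidentified documents are each assumed to have a distinct author.
            key = ("unknown", next(anonymous))
        else:
            key = ("known", author_id)
        by_author[key].append(score)
    # A known author with several documents gets the average of the scores.
    return {key: mean(values) for key, values in by_author.items()}


result = per_author_scores([("a1", 1.0), ("a1", 0.0), (None, -1.0), (None, -1.0)])
```

In this example the known author contributes a single averaged score of 0.5, while the two unidentified documents count as two separate authors.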

Furthermore, the statistical processing unit (114) aggregates the reliability values of the authors belonging to each author group distinguished by the plurality of demographic characteristics, and of the authors in the group whose demographic characteristics are "unknown", and stores the event reliability distribution of each author group for each event in the event trust database (115), as in the example shown in FIG. 6.

The event trust database (115) is a database that stores all event-related information of the present invention.

FIG. 3 shows the configuration of an embodiment of the user input document processing unit of FIG. 1.

As shown in FIG. 3, the user input document processing unit (120) includes an input unit (121), a preprocessing unit (122), a reliability distribution prediction unit (123), and an output unit (124).

The input unit (121) receives from the user the document whose reliability distribution is to be predicted.

The preprocessing unit (122) extracts events from the document input by the user.

Here, the event extraction method may be the same as that performed by the event extraction unit (112) of the virtual reader data collection unit (110) described above.

The reliability distribution prediction unit (123) retrieves from the event trust database (115) the event reliability distribution matching each event extracted from the document input by the user, takes the event reliability distributions of all these events as input, and uses them to predict the document reliability distribution, that is, the reliability distribution that each author group distinguished by a plurality of demographic characteristics is expected to show for the document input by the user.

Here, the reliability distribution prediction unit (123) may predict the document reliability distribution of the document input by the user using a model trained through supervised learning on the event trust database (115) and the prediction reference corpus (130).

Here, during training, the model may take as its input reference the text of a document stored in the prediction reference corpus together with the event reliability distributions, retrieved from the event trust database (115), that correspond to the events extracted from that document, and may take as its output reference the document reliability distribution actually measured for that document by survey, retrieved from the prediction reference corpus (130).
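The pairing of input and output references described above can be sketched as the construction of supervised training examples. The dictionary keys (`text`, `surveyed_distribution`), the function name, and the `extract_events` callable are hypothetical placeholders; the model that would consume these pairs is left unspecified, as in the text.

```python
def build_training_pairs(corpus, event_db, extract_events):
    """Pair each surveyed document with its features and its survey target.

    corpus         -- iterable of dicts with "text" and "surveyed_distribution"
    event_db       -- mapping from event to its event reliability distribution
    extract_events -- callable returning the events found in a text
    """
    pairs = []
    for doc in corpus:
        events = extract_events(doc["text"])
        # Only events already present in the database contribute features.
        event_distributions = [event_db[e] for e in events if e in event_db]
        features = {
            "text": doc["text"],
            "event_distributions": event_distributions,
        }
        target = doc["surveyed_distribution"]
        pairs.append((features, target))
    return pairs
```

Any supervised learner that accepts text plus a variable-length list of distributions could then be fitted on these pairs; the choice of learner is outside what the text specifies.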

The output unit (124) outputs to the user the document reliability distribution that the reliability distribution prediction unit (123) predicted for the document input by the user.

As described above, the system according to an embodiment of the present invention calculates an author's reliability for an event by analyzing the expressions that web document authors used in referring to the event within web documents, and, by regarding web document authors as virtual readers of the document input by the user, can automatically predict the reliability distribution that readers across the general public would show.

In addition, consider two users who trust the same text content to different degrees: after reading the content, user A shows high reliability and user B shows low reliability. The system according to an embodiment of the present invention predicts the overall reliability distribution that readers would show and provides it to both A and B, showing each of them how people whose reliability matches their own are distributed and how people whose reliability differs from their own are distributed. In this way, the system can provide high satisfaction to both users.

FIG. 7 is an operation flowchart of an embodiment performed by the virtual reader data collection unit shown in FIG. 1.

Referring to FIG. 7, the method performed by the virtual reader data collection unit includes a web document collection step (S310), an event extraction step (S320), an event reliability processing step (S330), and an event statistics processing step (S340).

The web document collection step (S310) collects web documents, and the event extraction step (S320) extracts events from the collected web documents. The event reliability processing step (S330) calculates the event reliability, that is, the author's reliability for an event, using one or more of the (1) positive, (2) negative, (3) trust, (4) distrust, (5) agreement, and (6) disagreement expressions that the author, as a virtual reader, used in referring to the event within the collected web documents. The event statistics processing step (S340) aggregates the event reliability values that multiple authors show for the same event across many web documents, converts them into a distribution of the number of authors at each event reliability value, and stores the distribution in the database.

FIG. 8 is an operation flowchart of an embodiment performed by the user input document processing unit shown in FIG. 1.

Referring to FIG. 8, the method performed by the user input document processing unit includes an input step (S410), a preprocessing step (S420), a reliability distribution prediction step (S430), and an output step (S440).

The input step (S410) receives from the user the document whose document reliability distribution is to be predicted, and the preprocessing step (S420) extracts events from the document input by the user. The reliability distribution prediction step (S430) receives from the database, as input, the event statistics matching the events extracted from the input document and predicts the reliability distribution that each reader group distinguished by a plurality of demographic characteristics would show for the input document. The output step (S440) outputs the predicted reliability distribution to the user.
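The flow of steps S410 to S440 can be sketched as a short function that wires the stages together. The extractor and the predictor are passed in as callables because the text does not fix their implementations; all names here are hypothetical.

```python
def predict_document_reliability(text, extract_events, event_db, predict):
    """Run the input, preprocessing, and prediction steps for one document.

    text           -- document received from the user (S410)
    extract_events -- callable returning the events in a text (S420)
    event_db       -- mapping from event to its event reliability distribution
    predict        -- callable(text, matched) returning the per-group
                      document reliability distribution (S430)
    """
    events = extract_events(text)
    # Keep only the events for which the database holds statistics.
    matched = {e: event_db[e] for e in events if e in event_db}
    # The caller is responsible for presenting the result to the user (S440).
    return predict(text, matched)
```

Keeping the extractor and predictor as parameters reflects that the same event extraction is shared with the collection side, while the predictor would be the trained model.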

Although some description is omitted for the methods of FIGS. 7 and 8, each step of FIGS. 7 and 8 may include everything described with reference to FIGS. 1 to 6, as will be apparent to those skilled in the art.

The system or apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the systems, apparatuses, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to limited embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A reader reliability distribution prediction method comprising:
calculating, in a virtual reader data collection unit, an event reliability for each piece of event information extracted from web documents by regarding the authors of the web documents as virtual readers;
receiving, in a user input document processing unit, a document for which a reliability distribution is to be predicted from a user; and
predicting, in the user input document processing unit, a reliability distribution for the input document for each of reader groups distinguished according to a plurality of preset demographic characteristics, based on the event reliability calculated for each piece of event information,
wherein the calculating of the event reliability comprises:
collecting the web documents and extracting each piece of event information from the collected web documents;
calculating an event reliability for each piece of extracted event information; and
aggregating the calculated event reliability values, generating an event reliability distribution, which is a statistical distribution of the number of readers by event reliability value, and storing the event reliability distribution in a database,
wherein the predicting predicts the reliability distribution of the user's event information extracted from the input document by using, among the generated event reliability distributions, the event reliability distribution corresponding to each piece of the user's event information extracted from the document input by the user,
wherein the event information is a set consisting of a predicate appearing in the text of the web document, the semantic roles labeled for the predicate, and the words corresponding to each semantic role,
wherein the predicting predicts the reliability distribution of the document input by the user by using a learning model trained through supervised learning from the information stored in the database and from a prediction reference corpus that stores, for each document, the readers' reliability distribution collected through a direct reliability questionnaire, and
wherein the learning model, during training, takes as its input reference the text information of a document stored in the prediction reference corpus together with the event reliability distributions, retrieved from the database, that correspond to the events extracted from that document, and takes as its output reference the reliability distribution of that document, measured through an actual questionnaire and retrieved from the prediction reference corpus.
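
As an illustration only, the event information structure and the event reliability distribution recited above might be sketched as follows. The names `EventInfo`, `make_event`, and `build_distributions`, the PropBank-style role labels, and the use of integer reliability values are assumptions made for this sketch; the claim does not prescribe them.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class EventInfo:
    """A predicate plus its labeled semantic roles and the words filling each role."""
    predicate: str
    roles: FrozenSet[Tuple[str, str]]  # (role label, words) pairs; labels are hypothetical


def make_event(predicate: str, **roles: str) -> EventInfo:
    # A frozenset makes the event hashable and independent of role order,
    # so identical events from different web documents compare equal.
    return EventInfo(predicate, frozenset(roles.items()))


def build_distributions(
    observations: Iterable[Tuple[EventInfo, int]]
) -> Dict[EventInfo, Counter]:
    """Group reliability values by identical event and count the virtual
    readers (authors) observed at each value."""
    database: Dict[EventInfo, Counter] = defaultdict(Counter)
    for event, reliability in observations:
        database[event][reliability] += 1
    return dict(database)


# Usage: three authors refer to the same event with different reliability values.
event = make_event("raise", ARG0="central bank", ARG1="interest rates")
same_event = make_event("raise", ARG1="interest rates", ARG0="central bank")
database = build_distributions([(event, 2), (same_event, 2), (event, -1)])
```

Keying the stored distributions by the event itself is one way to let the prediction step look up the distribution that corresponds to an event extracted from the user's document.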

(Deleted)

The method of claim 1,
wherein the demographic characteristics include at least one of a first item for the gender of the reader, a second item for the age of the reader, a third item for the occupation of the reader, a fourth item for the reader's subjects of interest, and a fifth item for whether the reader is an expert on the subject of the document whose reliability is to be determined.

The method of claim 1,
wherein the calculating of the event reliability calculates the event reliability based on the number of positive expressions, negative expressions, confidence expressions, disbelief expressions, agreement expressions, and disagreement expressions used by the author in referring to the event.
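
A minimal sketch of a count-based score follows. The claim states only that the event reliability is based on the counts of the six expression types; the specific rule used here (supporting counts minus opposing counts), the function name `event_reliability`, and the dictionary keys are assumptions for illustration and are not taken from the specification.

```python
from typing import Mapping

# Hypothetical grouping of the six expression types into two polarities.
SUPPORTING = ("positive", "confidence", "agreement")
OPPOSING = ("negative", "disbelief", "disagreement")


def event_reliability(counts: Mapping[str, int]) -> int:
    """Assumed scoring rule: expressions that support the event raise the
    score by one each, expressions that oppose it lower the score by one each."""
    up = sum(counts.get(kind, 0) for kind in SUPPORTING)
    down = sum(counts.get(kind, 0) for kind in OPPOSING)
    return up - down


# Usage: an author uses two positive expressions and one disbelief expression.
score = event_reliability({"positive": 2, "disbelief": 1})
```

A weighted or normalized rule would fit the claim language equally well; the unweighted difference is chosen here only because it is the simplest instance.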

The method of claim 1,
further comprising providing the user with the predicted reliability distribution of each of the reader groups.

A reader reliability distribution prediction system comprising:
a virtual reader data collection unit that regards the authors of web documents as virtual readers and calculates an event reliability for each piece of event information extracted from the web documents; and
a user input document processing unit that receives, from a user, a document for which a reliability distribution is to be predicted, and predicts a reliability distribution for the input document for each of reader groups distinguished according to a plurality of preset demographic characteristics, based on the event reliability calculated for each piece of event information,
wherein the virtual reader data collection unit collects the web documents, extracts each piece of event information from the collected web documents, calculates an event reliability for each piece of extracted event information, aggregates the event reliability values calculated for identical event information among the extracted event information, generates an event reliability distribution, which is a statistical distribution of the number of readers by event reliability value, and stores the event reliability distribution in a database,
wherein the user input document processing unit predicts the reliability distribution of the user's event information extracted from the input document by using, among the generated event reliability distributions, the event reliability distribution corresponding to each piece of the user's event information extracted from the document input by the user,
wherein the event information is a set consisting of a predicate appearing in the text of the web document, the semantic roles labeled for the predicate, and the words corresponding to each semantic role,
wherein the user input document processing unit predicts the reliability distribution of the document input by the user by using a learning model trained through supervised learning from the information stored in the database and from a prediction reference corpus that stores, for each document, the readers' reliability distribution collected through a direct reliability questionnaire, and
wherein the learning model, during training, takes as its input reference the text information of a document stored in the prediction reference corpus together with the event reliability distributions, retrieved from the database, that correspond to the events extracted from that document, and takes as its output reference the reliability distribution of that document, measured through an actual questionnaire and retrieved from the prediction reference corpus.
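
The pairing of input and output references for supervised training might be assembled as in the sketch below. The types `CorpusEntry` and `TrainingExample`, the function `build_examples`, and the representation of a distribution as a mapping from reliability value to a count or proportion are all hypothetical; the claim does not specify a data layout or a model architecture.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping, Sequence

Distribution = Dict[int, float]  # reliability value -> count or proportion (assumed)


@dataclass
class CorpusEntry:
    """One document in the prediction reference corpus (assumed layout)."""
    text: str
    events: Sequence[str]   # keys of the events extracted from the document
    surveyed: Distribution  # reliability distribution measured by questionnaire


@dataclass
class TrainingExample:
    text: str
    event_distributions: List[Distribution]  # input reference, from the database
    target: Distribution                     # output reference, from the corpus


def build_examples(
    corpus: Sequence[CorpusEntry],
    event_db: Mapping[str, Distribution],
) -> List[TrainingExample]:
    """Pair each corpus document with the stored distributions of its events."""
    examples = []
    for entry in corpus:
        # Events with no stored distribution are skipped in this sketch.
        found = [event_db[key] for key in entry.events if key in event_db]
        examples.append(TrainingExample(entry.text, found, entry.surveyed))
    return examples


# Usage with toy data: one document, two extracted events, one of them in the database.
event_db = {"e1": {1: 3, -1: 1}}
corpus = [CorpusEntry("sample text", ["e1", "e2"], {1: 0.75, -1: 0.25})]
examples = build_examples(corpus, event_db)
```

Any supervised learner that accepts text plus a variable number of distributions could consume these examples; the choice of learner is left open here, as it is in the claim.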

(Deleted)

The system of claim 6,
wherein the demographic characteristics include at least one of a first item for the gender of the reader, a second item for the age of the reader, a third item for the occupation of the reader, a fourth item for the reader's subjects of interest, and a fifth item for whether the reader is an expert on the subject of the document whose reliability is to be determined.

The system of claim 6,
wherein the virtual reader data collection unit calculates the event reliability based on the number of positive expressions, negative expressions, confidence expressions, disbelief expressions, agreement expressions, and disagreement expressions used by the author in referring to the event.

The system of claim 6,
wherein the user input document processing unit provides the user with the predicted reliability distribution of each of the reader groups.