KR102370729B1

KR102370729B1 - Sentence writing system

Info

Publication number: KR102370729B1
Application number: KR1020210072362A
Authority: KR
Inventors: 최연
Original assignee: 최연
Priority date: 2021-06-03
Filing date: 2021-06-03
Publication date: 2022-03-07

Abstract

Disclosed is a sentence writing system. The system comprises: a user terminal; and a sentence writing apparatus that constructs a database storing documents whose copyright has expired, acquires an input sentence, which is a sentence input by a user, from the user terminal, and extracts a sentence similar to the input sentence from the database as an example sentence to be output on the user terminal.

Description

Sentence writing system {SENTENCE WRITING SYSTEM}

The present invention relates to a sentence writing system and, more particularly, to a sentence writing system that uses material whose copyright has expired to present sentences that can improve a user's writing ability.

Most writing systems currently in service are concerned with translation; they are chiefly useful when a person who has already acquired one language, such as Korean, wants to learn another.

However, even when a user knows the word to be expressed, an auxiliary means is needed that presents a variety of sentences and words so that the user can more easily understand how the word's form changes with its usage. A way of presenting example sentences suited to the user's level is also needed.

Meanwhile, various methods for identifying similar sentences have been proposed.

Patent Document 1 presents a method for identifying similar sentences using a machine translation technique, and Patent Document 2 presents a method for analyzing sentences using artificial intelligence.

There is a need to develop a writing system that uses such similar-sentence identification to present sentences similar to the sentence a user inputs.

(Patent Document 1) US 7,412,385 B2
(Patent Document 2) US 2021/0042586 A1

One aspect of the present invention discloses a sentence writing system that constructs a database storing material whose copyright has expired, extracts from the database a sentence similar to a user input sentence, and outputs it to a user terminal.

The technical problems addressed by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

A sentence writing system according to an embodiment of the present invention includes: a user terminal; and a sentence writing apparatus that constructs a database storing material whose copyright has expired, obtains from the user terminal an input sentence, which is a sentence input by the user, extracts a sentence similar to the input sentence from the database as an example sentence, and outputs it on the user terminal.

The sentence writing apparatus trains on the sentences stored in the database according to the Word2Vec algorithm to build a neural network that extracts a vector value representing context information of an input sentence; inputs the input sentence obtained from the user terminal into the neural network to calculate an input sentence vector value representing context information of the input sentence; calculates the similarity between the input sentence vector value and the example sentence vector values representing context information of each of the sentences stored in the database; and extracts, as the example sentence, the stored sentence whose example sentence vector value has the highest similarity to the input sentence vector value, and transmits it to the user terminal.

The system further includes: an auxiliary output device that is connected to the user terminal by wireless communication and receives and outputs the example sentence that the user terminal receives from the sentence writing apparatus; and an installation module for installing the auxiliary output device on the table on which the user terminal is placed.

The installation module includes: an installation bar provided in the form of a rectangular column; a seating part formed as a plate of a predetermined thickness at one end of the installation bar, with a seating groove in its lower portion into which the auxiliary output device is inserted and installed; and an installation bracket installed at the other end of the installation bar so that the installation bar can pivot.

The installation bracket includes: a bottom surface; a pair of wall surfaces extending upward from both ends of the bottom surface, each having a semicircular upper end; and a plurality of insertion grooves recessed inward along the edge of each of the pair of wall surfaces, each formed in a spiral shape with a locking groove on its inner side.

The installation bar has a long hole formed at its other end, is pivotably installed between the pair of wall surfaces by a pivot pin that passes through the long hole from the outer surfaces of the pair of wall surfaces, and may have, at its other end, an insertion protrusion that can be inserted into the insertion grooves.

In addition, a sentence writing application operated by the sentence writing apparatus may be installed and executed on the user terminal.

In addition, the sentence writing application installed and executed on the user terminal may have an interface in which an input area for receiving a sentence from the user and an output area for outputting the example sentence are arranged side by side.

In addition, the sentence writing apparatus may include an input unit that receives the input sentence from the user terminal.

In addition, the sentence writing apparatus may include a DB construction unit that collects material whose copyright has expired by accessing a website that makes such material available.

In addition, the sentence writing apparatus may include a DB construction unit that collects the material whose copyright has expired, extracts sentences from it, and stores the extracted sentences in the database.

In addition, the sentence writing apparatus may include a DB construction unit that collects the material whose copyright has expired, and the DB construction unit may include a similar sentence processing system that identifies similar-sentence relationships among that material. The similar sentence processing system may perform the steps of: accessing a plurality of documents; identifying, from the plurality of documents, a cluster of related texts written by different authors about a common subject; receiving the cluster of related texts; selecting a set of text segments from the cluster; and using textual alignment to identify similar-sentence relationships between texts within the text segments included in the set of related text segments.

In addition, the sentence writing apparatus may include a DB construction unit that collects the material whose copyright has expired, and the DB construction unit may include an event prediction system that identifies the event (事象) of each item of that material. The event prediction system may include: a word extraction unit that analyzes m sentences (m being any integer of 2 or more) and extracts n words (n being any integer of 2 or more) from the m sentences; a sentence vector calculation unit that vectorizes each of the m sentences in q dimensions (q being any integer of 2 or more) according to a predetermined rule, thereby calculating m sentence vectors each consisting of q axis components; a word vector calculation unit that vectorizes each of the n words in q dimensions according to a predetermined rule, thereby calculating n word vectors each consisting of q axis components; an index value calculation unit that takes the inner product of each of the m sentence vectors with each of the n word vectors, thereby calculating m×n similarity index values reflecting the relationships between the m sentences and the n words; a classification model generation unit that uses the m×n similarity index values calculated by the index value calculation unit to generate a classification model for classifying the m sentences into a plurality of events on the basis of a sentence index value group consisting of n similarity index values per sentence; a prediction data input unit that inputs one or more sentences to be predicted as prediction data; and an event prediction unit that predicts one of the plurality of events from the prediction data by applying, to the classification model generated by the classification model generation unit, the similarity index values obtained by executing the processing of the word extraction unit, the sentence vector calculation unit, the word vector calculation unit, and the index value calculation unit on the prediction data input by the prediction data input unit.
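The index value calculation described above, in which each of the m sentence vectors is combined with each of the n word vectors by an inner (dot) product to give m×n similarity index values, can be illustrated with a minimal sketch. The function names and the toy vectors are hypothetical and are not taken from the specification; the "predetermined rule" for vectorization is not modeled here.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two q-dimensional vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must share the same dimension q")
    return sum(x * y for x, y in zip(a, b))


def similarity_index_matrix(sentence_vecs: List[Vector],
                            word_vecs: List[Vector]) -> List[List[float]]:
    """Return an m x n matrix whose (i, j) entry is the inner product of
    sentence vector i and word vector j."""
    return [[dot(s, w) for w in word_vecs] for s in sentence_vecs]


# Hypothetical toy data: m = 2 sentences, n = 3 words, q = 2 dimensions.
sentence_vecs = [[1.0, 0.0], [0.5, 0.5]]
word_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
index_values = similarity_index_matrix(sentence_vecs, word_vecs)
```

Each row of `index_values` corresponds to the sentence index value group (n values for one sentence) that the classification model would take as input.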

In addition, the sentence writing apparatus may further include a sentence display unit that extracts a sentence similar to the input sentence from the database as an example sentence, and the sentence display unit may extract the example sentence by calculating the Euclidean distance between the vector values representing the sentences stored in the database and the vector value representing the input sentence.

In addition, the sentence writing apparatus may further include a sentence display unit that extracts a sentence similar to the input sentence from the database as an example sentence, and the sentence display unit may extract the example sentence by calculating the cosine similarity between the vector values representing the sentences stored in the database and the vector value representing the input sentence.

In addition, the sentence writing apparatus may further include a sentence display unit that extracts a sentence similar to the input sentence from the database as an example sentence, and the sentence display unit may extract the example sentence by calculating the Tanimoto coefficient between the vector values representing the sentences stored in the database and the vector value representing the input sentence.

In addition, the system may further include a network, provided as a wired communication network, that connects the user terminal and the sentence writing apparatus.

In addition, the system may further include a network, provided as a mobile communication network, that connects the user terminal and the sentence writing apparatus.

In addition, the system may further include a network, provided as a WiBro (Wireless Broadband) network, that connects the user terminal and the sentence writing apparatus.

In addition, the system may further include a network, provided as an HSDPA (High Speed Downlink Packet Access) network, that connects the user terminal and the sentence writing apparatus.

In addition, the system may further include a network, provided as a satellite communication network, that connects the user terminal and the sentence writing apparatus.

In addition, the system may further include a network, provided as a Wi-Fi (Wireless Fidelity) network, that connects the user terminal and the sentence writing apparatus.

In addition, the system may further include a voice output device that is connected to the user terminal by wireless communication and receives from the user terminal, and outputs, a voice file for the example sentence.

Meanwhile, a sentence writing method according to another embodiment of the present invention, performed in a sentence writing apparatus connected to a user terminal through a network, includes: collecting material whose copyright protection period has expired; storing the collected material in a database; receiving a sentence input by a user through the user terminal; extracting from the database a sentence similar to the sentence input by the user, using a neural network that outputs a vector value representing context information of an input sentence; and outputting the extracted sentence to the user terminal.

According to the present invention described above, an example sentence similar to the user input sentence is output while the user is entering a sentence, which helps the user write an improved sentence with reference to the example and also improves the user's writing ability.

FIG. 1 is a conceptual diagram of a sentence writing system according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram of the sentence writing apparatus shown in FIG. 1.
FIG. 3 is a flowchart of a sentence writing method according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of a similar sentence processing system according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating the operation of the system shown in FIG. 4.
FIG. 6 is a diagram for explaining alignment between pairs of two sentences according to an embodiment of the present invention.
FIG. 7 is a diagram showing another example of a similar sentence processing system according to an embodiment of the present invention.
FIG. 8 is a block diagram of an event prediction system according to an embodiment of the present invention.
FIGS. 9 and 10 are flowcharts illustrating an operation example of an event prediction system according to an embodiment of the present invention.
FIG. 11 is a block diagram of an event prediction system according to another embodiment of the present invention.
FIG. 12 is a diagram showing a voice output device according to an embodiment of the present invention.
FIG. 13 is a diagram showing an installation module according to an embodiment of the present invention.
FIGS. 14 to 17 are schematic diagrams of the neural network used in the present invention.
FIGS. 18 to 20 are schematic diagrams of the Word2Vec algorithm used in the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in a variety of different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention; the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the context specifically states otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more elements, steps, or operations other than those stated.

FIG. 1 is a conceptual diagram of a sentence writing system according to an embodiment of the present invention.

Referring to FIG. 1, a sentence writing system 1000 according to an embodiment of the present invention may include a user terminal 10, a network 20, a sentence writing apparatus 30, and a database 40.

The user terminal 10 and the sentence writing apparatus 30 may be connected through the network 20, and the database 40 may store the data required by the sentence writing apparatus 30.

The user terminal 10 may be an ordinary computing terminal used by the user, such as a PC, laptop, netbook, mobile device, or tablet.

The sentence writing application according to an embodiment of the present invention may be installed and executed on the user terminal 10. The user may write a sentence by running the sentence writing application on the user terminal 10.

The network 20 may be one of a wired communication network, a mobile communication network, a WiBro (Wireless Broadband) network, an HSDPA (High Speed Downlink Packet Access) network, a satellite communication network, and a Wi-Fi (Wireless Fidelity) network.

The sentence writing apparatus 30 may be an ordinary server and may be the operating server of the sentence writing application according to an embodiment of the present invention.

The sentence writing apparatus 30 may be connected to the user terminal 10 through the network 20 and may receive from the user terminal 10 a sentence entered through the sentence writing application. The sentence writing apparatus 30 may also construct the database 40 storing material whose copyright has expired (for example, novels), extract from the database 40 a sentence similar to the user input sentence, and transmit it to the user terminal 10. The sentence writing application running on the user terminal 10 then outputs the example sentence received from the sentence writing apparatus 30, so that the user can refer to it and write an improved sentence.

In the following description, a sentence that the user enters through the user terminal 10 is referred to as a user input sentence, and a sentence extracted from the database 40 is referred to as an example sentence.

The sentence writing system 1000 according to an embodiment of the present invention thus outputs an example sentence similar to the user input sentence while the user is entering a sentence, which helps the user write an improved sentence with reference to the example and also improves the user's writing ability.

FIG. 2 is a conceptual diagram of the sentence writing apparatus shown in FIG. 1.

Referring to FIG. 2, the sentence writing apparatus 30 according to an embodiment of the present invention may include an input unit 31, a DB construction unit 33, and a sentence display unit 35.

The input unit 31 may receive from the user terminal 10 the user input sentence, that is, the sentence the user enters through the sentence writing application.

The DB construction unit 33 may access an external server to collect material whose copyright has expired and store the collected material in the database 40.

Here, material whose copyright has expired is a work for which the term of protection of the author's economic rights has ended, that is, a work for which a certain period has passed since the author's death; anyone may use it freely without a separate license or approval procedure.

The DB construction unit 33 may collect such material by accessing a website that makes material whose copyright has expired available. For example, the website may be Gongu Madang (공유마당, http://gongu.dopyright.or.kr/), which provides not only expired works but also shared works such as privately held works of high social preservation value and public content, and where expired works from abroad can also be found.

The DB construction unit 33 may extract sentences from the collected material and store the sentences in the database 40.
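The sentence extraction and storage performed by the DB construction unit 33 could look roughly like the following sketch. The specification does not say how sentences are delimited or which database is used, so the punctuation-based splitter, the SQLite table layout, and the source label are all assumptions made for illustration.

```python
import re
import sqlite3
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def store_sentences(conn: sqlite3.Connection, source: str, sentences: List[str]) -> None:
    """Store each extracted sentence together with a label for the work it came from."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentences "
        "(id INTEGER PRIMARY KEY, source TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO sentences (source, body) VALUES (?, ?)",
        [(source, s) for s in sentences],
    )
    conn.commit()


# Hypothetical collected text standing in for an expired-copyright work.
conn = sqlite3.connect(":memory:")
text = "The rain stopped. She opened the window! Was anyone there?"
store_sentences(conn, "hypothetical-expired-work", split_sentences(text))
stored = [row[0] for row in conn.execute("SELECT body FROM sentences ORDER BY id")]
```

A real collection step would need a splitter suited to the language of the works (for example Korean sentence endings), which this sketch does not attempt.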

The sentence display unit 35 may extract from the database 40 a sentence similar to the user input sentence received by the input unit 31 and transmit it to the user terminal 10.

For example, the sentence display unit 35 may build a neural network that is trained on the sentences stored in the database 40 and outputs a vector value representing the context information of each sentence. Here, the sentence display unit 35 may extract the context information of sentences in the training data using the Word2Vec algorithm. The Word2Vec algorithm may include a neural network language model (NNLM). A neural network language model is basically a neural network consisting of an input layer, a projection layer, a hidden layer, and an output layer, and is used to vectorize words. Since the neural network language model is a known technique, a more detailed description is omitted.

The Word2Vec algorithm is a text-mining algorithm that determines how close words are by looking at the relationship between each word and the words before and after it. It is an unsupervised learning algorithm. As its name indicates, it may be a quantitative technique that expresses the meaning of a word in vector form. The Word2Vec algorithm can represent each word as a vector in a space of about 200 dimensions, so that a vector is obtained for every word. Compared with earlier algorithms, it can enable a dramatic improvement in precision in the field of natural language processing.

Word2Vec can learn the meaning of a word from the relationship between the words in the sentences of the input corpus and their adjacent words. The algorithm is based on an artificial neural network and starts from the premise that words appearing in the same context have close meanings. It learns from text documents, training the artificial neural network to treat the other words that appear near a given word (about 5 to 10 words before and after) as related words. Because words with related meanings are likely to appear close together in a document, the vectors of two such words gradually become closer as training is repeated.

The Word2Vec algorithm can be trained with either the CBOW (Continuous Bag of Words) method or the skip-gram method. The CBOW method predicts a target word from the context formed by its surrounding words. The skip-gram method predicts the words that may appear around a given word. The skip-gram method is known to be more accurate on large datasets; therefore, the embodiment of the present invention uses the Word2Vec algorithm with the skip-gram method.

For example, once training with the Word2Vec algorithm has completed successfully, similar words are located near each other in the high-dimensional space. With the Word2Vec algorithm as described above, the closer the distributions of the surrounding words of two words are in the training documents, the more similar their calculated vector values become, and words with similar vector values can be regarded as similar. Since the Word2Vec algorithm is a known technique, a more detailed description of the vector value calculation is omitted.
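To make the skip-gram method concrete, the sketch below generates the (center word, context word) training pairs that a skip-gram model would learn from. It covers only the pair-generation step, not the neural network training itself, and the function name, the whitespace-tokenized input, and the window size are illustrative assumptions.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """For every position, pair the center word with each word that lies
    within `window` positions before or after it."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical tokenized sentence; window=1 keeps the output small.
pairs = skipgram_pairs(["the", "old", "house", "stood"], window=1)
```

With the window of about 5 to 10 words mentioned above, the same function would simply be called with `window=5` or `window=10`; words that repeatedly share context words end up with nearby vectors after training.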

The sentence display unit 35 may extract the sentences stored in the database 40 as learning data and train on the learning data with the Word2vec algorithm to construct a neural network that extracts context information about the content of the event.

The sentence display unit 35 may input the user input sentence obtained through the input unit 31 into the neural network to calculate an input sentence vector value representing the context information of the user input sentence.

The sentence display unit 35 may calculate the similarity between the input sentence vector value and each of the example sentence vector values representing the context information of the respective sentences stored in the database 40. In this case, the sentence display unit 35 may use at least one of the Euclidean distance, the cosine similarity, and the Tanimoto coefficient as the similarity calculation method.
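The three measures named above can be written out for plain numeric vectors as in the sketch below. The function names are chosen for this illustration only. Note that the Euclidean distance is a distance (smaller means more similar), whereas the cosine similarity and the Tanimoto coefficient grow as the vectors become more alike.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tanimoto_coefficient(a, b):
    """Tanimoto coefficient for real-valued vectors: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0
```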

The sentence display unit 35 may extract, as the example sentence, the sentence whose example sentence vector value has the highest similarity to the input sentence vector value among the sentences stored in the database 40, and transmit it to the user terminal 10.
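Selecting the stored sentence with the highest similarity can be sketched as below. The description does not state how a sentence vector is derived from word vectors, so this example assumes, purely for illustration, that a sentence vector is the mean of its word vectors; the tiny two-dimensional `embeddings` table and all function names are likewise hypothetical.

```python
import math


def mean_vector(tokens, embeddings):
    """Average the vectors of the known tokens (an assumed sentence encoding)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_sentence(input_sentence, stored_sentences, embeddings):
    """Return the stored sentence whose vector is closest to the input's."""
    query = mean_vector(input_sentence.lower().split(), embeddings)
    if query is None:
        return None
    best, best_score = None, -1.0
    for sentence in stored_sentences:
        vec = mean_vector(sentence.lower().split(), embeddings)
        if vec is None:
            continue
        score = cosine(query, vec)
        if score > best_score:
            best, best_score = sentence, score
    return best


# Hypothetical toy word vectors; a trained model would supply ~200 dimensions.
embeddings = {
    "sea": [1.0, 0.0], "ocean": [0.9, 0.1], "wave": [0.8, 0.2],
    "tea": [0.0, 1.0], "cup": [0.1, 0.9],
}
stored = ["the ocean wave", "a cup of tea"]
example = most_similar_sentence("calm sea", stored, embeddings)
```

With a large database, comparing the input against every stored vector becomes costly, so a real system would likely precompute the stored vectors once rather than on every request.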

Here, the application executed on the user terminal 10 may have an interface in which an input area for receiving a sentence from the user and an output area for displaying the example sentence received from the sentence display unit 35 are arranged side by side. Accordingly, the user can check the example sentence while entering a sentence in the input area.

As such, the sentence display unit 35 may use an artificial intelligence neural network to extract the copyright-expired sentence whose context is most similar to the sentence entered through the user terminal 10 and provide it to the user.

FIG. 3 is a flowchart of a sentence writing method according to an embodiment of the present invention.

The sentence writing method according to an embodiment of the present invention may be carried out under substantially the same configuration as the sentence writing apparatus 30 illustrated in FIGS. 1 and 2. Accordingly, components identical to those shown in FIGS. 1 and 2 are given the same reference numerals, and repeated descriptions are omitted.

Referring to FIG. 3, the sentence writing method according to an embodiment of the present invention may include collecting materials whose copyright protection period has expired (S10), storing the collected materials in the database 40 (S20), receiving a sentence from the user (S30), extracting from the database 40 a sentence similar to the sentence received from the user (S40), and outputting the extracted sentence to the user terminal 10 (S50).

The step (S10) in which the DB construction unit 33 collects materials whose copyright protection period has expired may be a step in which the DB construction unit 33 accesses a website that makes copyright-expired materials available and collects the copyright-expired materials.

The step (S20) in which the DB construction unit 33 stores the collected materials in the database 40 may be a step of extracting sentences from the collected materials and storing the sentences in the database 40.

The step (S30) in which the input unit 31 receives a sentence from the user may be a step in which the input unit 31 receives the sentence through the application executed on the user terminal 10.

The step (S40) in which the sentence display unit 35 extracts from the database 40 a sentence similar to the sentence received from the user may be a step in which the sentence display unit 35 uses an artificial intelligence neural network that extracts context information of the input sentence to extract an example sentence having context information similar to that of the sentence received from the user.

The step (S50) in which the sentence display unit 35 outputs the extracted sentence to the user terminal 10 may be a step in which the sentence display unit 35 outputs the extracted example sentence through the application executed on the user terminal 10.

Meanwhile, the DB construction unit 33 may identify similar sentences among the collected sentences, extract a representative sentence representing the identified similar sentences, and store it in the database 40. For example, the collected sentences may include sentences that share the same context but are composed of different combinations of words. To reduce load by preventing such similar sentences from being stored redundantly in the database 40, the DB construction unit 33 may identify the similar sentences and extract a representative sentence that can stand for the identified similar sentences as a whole.

The DB construction unit 33 may include a similar sentence processing system 200 for identifying such similar sentences and extracting representative sentences. This is described by way of example with reference to FIG. 4 and the subsequent figures.

FIG. 4 is a diagram showing an example of a similar sentence processing system according to an embodiment of the present invention.

Referring to FIG. 4, the system 200 accesses a document database 202 and includes a document clustering system 204, a text segment selection system 206, a word/phrase alignment system 210, identification system input text 211, and generation system input text 212.

FIG. 5 is a flowchart illustrating the operation of the system shown in FIG. 4.

As an example, the document database 202 contains the copyright-expired materials as the sentences collected by the DB construction unit 33. It also contains, as an example, a variety of news articles written by a variety of news agencies. As an example, each article includes a time stamp that roughly indicates when the article was written. Also, as an example, multiple articles from different news agencies may be written about a variety of different events.

Hereinafter, the description uses news articles as the example.

The document clustering system 204 accesses the document database 202, as shown by block 214 in FIG. 5. Although a single database 202 is shown in FIG. 4, a plurality of databases may be accessed instead.

The clustering system 204 identifies the articles in the document database 202 that were written about the same event.

In one embodiment, the articles are also identified as having been written at approximately the same time (within a predetermined time range of each other, for example within a month, a week, a day, or several hours). Articles identified as having been written about the same event (or at the same time) form a document cluster 218. This is illustrated by block 216 of FIG. 5.
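The temporal part of this clustering can be sketched as follows. The description does not say how "the same event" is detected, so this example covers only the time-window criterion: articles are sorted by time stamp and grouped whenever they fall within a chosen window of the first article in the current group. The function name, the greedy grouping rule, and the one-day default are assumptions for illustration.

```python
from datetime import datetime, timedelta


def group_by_time(articles, window=timedelta(days=1)):
    """Group (timestamp, text) articles that lie within `window` of the
    earliest article in the current group."""
    groups = []
    for ts, text in sorted(articles, key=lambda a: a[0]):
        if groups and ts - groups[-1][0][0] <= window:
            groups[-1].append((ts, text))   # close enough: same group
        else:
            groups.append([(ts, text)])     # too far apart: start a new group
    return groups


# Hypothetical articles with time stamps.
articles = [
    (datetime(2021, 1, 5, 10, 0), "article C"),
    (datetime(2021, 1, 1, 9, 0), "article A"),
    (datetime(2021, 1, 1, 15, 0), "article B"),
]
clusters = group_by_time(articles)
```

A fuller clustering step would combine this with some measure of topical overlap, so that unrelated articles published on the same day are not placed in one cluster.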

Once articles from related sources have been identified as a cluster 218, the text segments (sentences, phrases, headlines, paragraphs, etc.) within those articles are extracted as desired. For example, the journalistic convention for news articles recommends that the first one or two sentences of an article summarize the rest of the article. Thus, according to one embodiment of the present invention, the articles (written, for example, by different news agencies) are clustered into clusters 218 and provided to the text segment selection system 206, which extracts the first two sentences of each article in each cluster 218. Although the present description proceeds in terms of sentences, this is only an example, and other text segments may readily be used. The sentences of each cluster 218 of articles are output as a sentence set 222 corresponding to the clustered articles. The sentence sets 222 are output by the text segment selection system 206 to the word/phrase alignment system 210. This is indicated by block 220 in FIG. 5.

In the specific example where sentences are used, many of the sentences collected in this way turn out to be versions of a single source sentence that editors at different news agencies have rewritten slightly for stylistic reasons. Frequently, such sets of sentences differ only in small ways, such as the order of the clauses appearing in the sentence.

The text segment selection system 206 generates the sets 222 of sentences for each cluster. Note that the word/phrase alignment system 210 can operate on large sets of sentences by extracting mappings between words or phrases based on a holistic examination of the sentences in a set. However, the present discussion describes, as one embodiment, only the generation of sentence pairs and the performance of alignment on those pairs. Thus, in one embodiment, the identified sets of sentences are formed into pairs of sentences. Accordingly, the text segment selection system 206 pairs each sentence in a set with every other sentence in the set (hereinafter, pairing). In one embodiment, the sentence pairs undergo an optional filtering step, and in another embodiment they are output directly to the word/phrase alignment system 210. Filtering is described for the present embodiment, but note that the steps related to filtering are optional.
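Pairing each sentence in a set with every other sentence amounts to enumerating all unordered pairs, which can be sketched with the standard library as below. The function name `sentence_pairs` is an assumption for this example.

```python
from itertools import combinations


def sentence_pairs(sentence_set):
    """Return every unordered pair of sentences in the set, each pair once."""
    return list(combinations(sentence_set, 2))


# A set of n sentences yields n * (n - 1) / 2 pairs.
pairs = sentence_pairs(["sentence a", "sentence b", "sentence c"])
```

Because the number of pairs grows quadratically with the size of the set, an optional filtering step before alignment can noticeably reduce the work.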

In one embodiment, the text segment selection system 206 runs a heuristic that filters the sentence pairs based on shared content words. For example, in one embodiment, the system 206 filters the sentence pairs to remove those that do not share at least three words of at least four letters each. Of course, filtering is optional, and if it is used, the filtering algorithm may vary widely. Any of a variety of other filtering techniques may be used, such as filtering on past results (which requires a feedback loop that feeds the word/phrase alignment system 210 back to the text segment selection system 206), filtering on a different number of content words, or filtering on semantic and syntactic information. In any case, the sets of sentences may be paired, filtered, and provided to the word/phrase alignment system 210.
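The shared-word heuristic described above can be sketched as follows. The tokenization used here (lowercased runs of Latin letters) is an assumption made for the English example sentences; text in other scripts would need a different tokenizer. The function names are illustrative only.

```python
import re


def content_words(sentence, min_len=4):
    """Lowercased words of at least `min_len` letters, as a set."""
    return {w for w in re.findall(r"[A-Za-z]+", sentence.lower())
            if len(w) >= min_len}


def keep_pair(s1, s2, min_shared=3, min_len=4):
    """True if the pair shares at least `min_shared` words of `min_len`+ letters."""
    shared = content_words(s1, min_len) & content_words(s2, min_len)
    return len(shared) >= min_shared


s1 = ("Storms and tornadoes killed at least 14 people as they ripped "
      "through the central U.S. States of Kansas and Missouri")
s2 = ("A swarm of tornadoes crashed through the Midwest, killing at least "
      "19 people in Kansas and Missouri")
```

Here `keep_pair(s1, s2)` holds because the two sentences share words such as "tornadoes", "through", "people", "Kansas", and "Missouri", while a pair with little lexical overlap would be dropped.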

In one embodiment, the word/phrase alignment system 210 runs a conventional word/phrase alignment algorithm from the statistical machine translation literature in order to learn lexical correspondences between the sentences in the sets 222. For example, suppose the following two sentences are input to the machine translation system 210 as a sentence pair: "Storms and tornadoes killed at least 14 people as they ripped through the central U.S. States of Kansas and Missouri" and "A swarm of tornadoes crashed through the Midwest, killing at least 19 people in Kansas and Missouri".

Despite some differences, these sentences may share a common editorial source. In any case, as an example, they were written about the same event at almost the same time by two different news agencies. The differences between the sentences include the following: "ripped through" corresponds to "crashed through"; there is a difference in clause order, with "central U.S. States" corresponding to "Midwest"; there is a morphological difference between words, with "killed" corresponding to "killing"; and the reported number of victims differs.

FIG. 6 is a diagram for explaining the alignment between a pair of two sentences according to an embodiment of the present invention.

FIG. 6 shows the correspondences between words and multi-word phrases in the sentences after the words and phrases have been aligned by the conventional alignment system 210. For most of the correspondences, the statistical alignment algorithm has established links between different but parallel pieces of information, as indicated by the lines connecting the words. For example, the noun phrases "Storms and tornadoes" and "a swarm of tornadoes" are not directly comparable. Accordingly, as more data becomes available, the links between "storms" and "swarm" and between "storms" and "a" will fade. The difference in clause order can be seen in the crossing pattern of the links between the two sentences.

In one embodiment, the word/phrase alignment system 210 is implemented using the techniques described in P.F. Brown et al., "The Mathematics of Statistical Machine Translation: Parameter Estimation", Computational Linguistics, 19:263-312 (June 1993). Of course, other machine translation or word/phrase alignment techniques may be used to identify the relationships between words and the input text. Using the alignment system 210 on the sentence sets to develop an alignment model and to perform statistical word and/or phrase alignment is indicated by block 230 of FIG. 5.

The word/phrase alignment system 210 then outputs the aligned words and phrases 232 together with the alignment models 234 that were generated from the input data. Basically, in the alignment system cited above, models are trained to identify word correspondences. In the alignment technique, as shown in FIG. 6, word alignments between the words in the text segments are found first. Next, the system assigns a probability to each alignment and optimizes the probabilities based on subsequent training data to produce a more accurate model. Outputting the alignment models 234 and the aligned words and phrases 232 is shown in block 236 of FIG. 5.
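The idea of assigning probabilities to word correspondences and refining them over repeated passes can be illustrated with the simplest model from the cited Brown et al. family, commonly called IBM Model 1, trained by expectation-maximization. This is a sketch only and is not claimed to be the alignment procedure of the embodiment: it omits the NULL word and the movement and fertility parameters, and the toy corpus and function name are hypothetical. The same estimation applies when both sides of each pair are in one language.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(w, s)] = P(word w on one side | word s on the other side).

    pairs: list of (source_tokens, target_tokens).
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-alignment counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm  # share of w explained by s
                    count[(w, s)] += frac
                    total[s] += frac
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(float, {(w, s): c / total[s]
                                for (w, s), c in count.items()})
    return t


# Hypothetical toy corpus of three short sentence pairs.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = train_model1(corpus)
```

After training, the probability mass for each word concentrates on its consistent partner (for instance, "haus" favors "house" over "the"), which mirrors how spurious links weaken as more sentence pairs are observed.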

As an example, the alignment model 234 includes conventional translation model parameters, such as the translation probability assigned to a word alignment, a movement probability indicating the probability that a word or phrase moves within a sentence, and a fertility probability indicating the probability that a single word corresponds to two different words in another text segment.

Blocks 237, 238, and 239 represent optional processing steps used to bootstrap the training of the system. They are described in more detail below with respect to FIG. 7.

In an embodiment in which bootstrapping is not used, the system 211 receives the output of the system 210 and identifies words, phrases, or sentences that are similar sentences (paraphrases) of one another. The identified similar sentences 213 are output by the system 211. This is indicated by block 242 of FIG. 5.

The aligned phrases and models may also be provided to the generation system input text 212. As an example, the system 212 is a conventional decoder that receives words and/or phrases as input and generates similar sentences 238 for that input.

Accordingly, the system 212 may be used to generate similar sentences for input text using the aligned words and phrases 232 and the alignment models 234 generated by the alignment system 210. Generating similar sentences for the input text based on the aligned words and phrases and the alignment models is illustrated by block 240 of FIG. 5. One example of a generation system is known from Y. Wang and A. Waibel, "Decoding Algorithm in Statistical Machine Translation", Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (1997).

FIG. 7 is a diagram showing another example of a similar sentence processing system according to an embodiment of the present invention.

FIG. 7 is similar to FIG. 4, except that the identification system 211 is also used for bootstrap training. This is illustrated in more detail in blocks 237 to 239 of FIG. 5. For example, with respect to FIGS. 7 and 5, assume that the word/phrase alignment system 210 has output the alignment models 234 and the aligned words and phrases 232 as described above. Now, however, the full text of each document cluster 218 is supplied to the identification system 211, which identifies supplementary sentence sets 300 for use in further training the system (again, sentences are used as an example, and other text segments may be used). Using the alignment models 234 and the aligned words and phrases 232, the identification system 211 can process the text of the clustered documents 218 and reselect sentence sets 300 from the respective clusters. This is illustrated by block 237. The reselected sentence sets 300 are then provided to the word/phrase alignment system 210, which generates or recalculates the alignment models 234, the aligned words and phrases 232, and their associated probability matrices based on the reselected sentence sets 300. Performing word and phrase alignment on the reselected sentence sets and generating the alignment models and the aligned words and phrases are illustrated by blocks 238 and 239 of FIG. 5.

Thereafter, the recalculated alignment models 234 and the newly aligned words and phrases 232 are input to the identification system 211 again, and may be used by the system 211 to process the text in the document clusters 218 once more and identify new sentence sets. The new sentence sets are in turn fed back to the word/phrase alignment system 210, and the process may continue so as to further refine the training of the system.

There are various applications for similar sentences processed using the present invention. For example, potential applications of the similar sentence processing system include question answering systems and, more generally, information processing systems, as known in the prior art. Such a system may generate a similar-sentence score to determine the similarity of two text segments when returning a set of documents in response to a query. Likewise, such a system can use its ability to generate similar sentences to perform query expansion (generating multiple forms of a single original query) in order to find better matches or improve recall.

Another application of the recognition and generation of similar sentences is the summarization of multiple documents. Using similar sentence recognition, an automatic document summarization system finds similar sentences across different documents and determines the most salient information in the document set in order to generate a summary.

Yet another application of similar sentence recognition and generation is a dialog system. Such a system can generate a response that echoes the input but is phrased differently, which avoids repeating exactly the same input. This allows the dialog system to sound more natural and conversational.

Similar sentence recognition and generation techniques may also be used in a word processing system. The word processing system can be used to generate rewritten phrasings automatically and suggest them to the user. This is useful, for example, when a user who is writing a document repeats one phrase many times, even within a single paragraph. Likewise, a word processing system may include a feature that flags repeated (but differently expressed) information spread throughout a document. Similarly, such a word processing system may include a feature that rewrites a piece of prose as similar sentences.

As such, the DB construction unit 33 follows a method of training a similar sentence processing system on a plurality of copyright-expired materials, the training method comprising: accessing a plurality of documents; identifying, from the plurality of documents, a cluster of related texts written by different authors on a common subject; receiving the cluster of related texts; selecting a set of text segments from the cluster, wherein the selecting includes grouping the required text segments of the related texts into a set of related text segments; and identifying, using textual alignment, similar-sentence relationships between texts in the text segments included in the set of related text segments, wherein the textual alignment includes aligning the words in the text segments in the set of related text segments using statistical textual alignment and identifying the similar-sentence relationships based on the aligned words.

The DB construction unit 33 may further calculate an alignment model based on the identified similar-sentence relationships, and may use the alignment model to generate a representative sentence representing the similar sentences.

Meanwhile, the DB construction unit 33 may predict an event for each of the plurality of copyright-expired materials, match the predicted event keyword to the material, and store them together in the database 40. In this case, the sentence display unit 35 may receive in advance from the user an event keyword that the user prefers, and may extract the example sentence to be presented to the user from among the sentences to which that event keyword has been assigned.

The DB construction unit 33 may include an event prediction system 300 for predicting an event keyword for each collected sentence. This is described by way of example with reference to FIG. 8 and the subsequent description.

FIG. 8 is a block diagram illustrating an event prediction system according to an embodiment of the present invention.

Referring to FIG. 8, the event prediction system of the present embodiment includes, as its functional configuration, a learning data input unit 310, a word extraction unit 311, a vector calculation unit 312, an index value calculation unit 313, a classification model generation unit 314, a prediction data input unit 320, and an event prediction unit 321. As a more specific functional configuration, the vector calculation unit 312 includes a sentence vector calculation unit 312a and a word vector calculation unit 312b. The event prediction system of the present embodiment also includes a classification model storage unit 330 as a storage medium.

For convenience in the following description, the part composed of the word extraction unit 311, the vector calculation unit 312, and the index value calculation unit 313 is referred to as the similarity index value calculation unit 300. The similarity index value calculation unit 300 receives sentence data about sentences, and calculates and outputs similarity index values that reflect the relationship between each sentence and the words it contains. The event prediction system of the present embodiment uses the similarity index values calculated by the similarity index value calculation unit 300 to predict a specific event from the content of a sentence (that is, to predict which of a plurality of events the sentence corresponds to). The learning data input unit 310, the similarity index value calculation unit 300, and the classification model generation unit 314 constitute the prediction model generating apparatus of the present invention.

Each of the functional blocks 310 to 314 and 320 to 321 can be implemented in hardware, in a DSP (Digital Signal Processor), or in software. When implemented in software, for example, each of the functional blocks 310 to 314 and 320 to 321 is actually built around a computer's CPU, RAM, ROM, and the like, and is realized by running a program stored in a recording medium such as RAM, ROM, a hard disk, or a semiconductor memory.

The learning data input unit 310 receives, as learning data, sentence data relating to m sentences (m is an arbitrary integer of 2 or more) for each of which it is already known which of a plurality of events it corresponds to. Here, the plurality of events may be two, or three or more. For example, they may be two events indicating whether or not one matter is likely to occur, such as the possibility that a specific failure or symptom will occur. Alternatively, they may be a combination of two or more events of mutually different natures, such as a person's personality types or hobbies and preferences. The events mentioned here are merely examples, and the events are not limited to them.

The sentence data to be input is preferably data describing sentences related to the plurality of events to be predicted. For example, when learning data is input in order to build a prediction model for predicting whether a system failure is likely to occur, sentence data from reports describing the results of monitoring or inspecting the system is input.

However, when the purpose is, for example, to predict a person's personality type or hobbies and preferences, the analysis described below may reveal a relationship between a sentence and an event even if the sentence seems at first glance unrelated to the plurality of events to be predicted. It is therefore not essential to use as learning data only sentences that a human has judged to be related to the events to be predicted. That is, depending on the content of the plurality of events to be predicted, the learning data may include not only data describing sentences clearly related to those events but also data describing sentences that seem at first glance unrelated to them.

A sentence input by the learning data input unit 310, that is, a sentence to be analyzed as described later, may consist of a single sentence unit (a unit delimited by a full stop) or of a plurality of sentence units. A sentence consisting of a plurality of sentence units may be some or all of the text contained in one document. When only part of the text contained in one document is used as learning data, the learning data input unit 310 receives the sentence data with a setting that specifies which part of the document is to be used as learning data (strictly speaking, it receives the document data and uses the specified part of the document as the sentence data). For example, for a document containing a plurality of description items, the text relating to a specific description item may be set to be used as learning data. One description item or a plurality of description items may be set.

The word extraction unit 311 analyzes the m sentences input by the learning data input unit 310 and extracts n words (n is an arbitrary integer of 2 or more) from them. A known morphological analysis, for example, can be used to analyze the sentences. Here, the word extraction unit 311 may extract as words the morphemes of all parts of speech produced by the morphological analysis, or may extract as words only the morphemes of specific parts of speech.

The m sentences may contain the same word more than once. In that case, the word extraction unit 311 extracts the word only once rather than multiple times. That is, the n words extracted by the word extraction unit 311 are n distinct types of words. Here, the word extraction unit 311 may measure how frequently each word is extracted from the m sentences and extract the n words (n types) with the highest frequency of appearance, or the n words (n types) whose frequency of appearance is equal to or greater than a threshold.
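The frequency-based extraction of distinct words described above can be sketched as follows. This is only an illustrative sketch: the function name `extract_words`, its parameters, and the sample sentences are hypothetical, and whitespace tokenization stands in for the morphological analysis that the embodiment actually uses.

```python
from collections import Counter


def extract_words(sentences, n=None, min_count=None):
    """Return distinct words ordered by descending frequency of appearance.

    Whitespace splitting is an assumed stand-in for morphological analysis.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # most_common() sorts by count; equal counts keep first-seen order.
    ranked = counts.most_common()
    if min_count is not None:
        # Keep only words whose frequency reaches the threshold.
        ranked = [(word, count) for word, count in ranked if count >= min_count]
    if n is not None:
        # Keep only the n most frequent word types.
        ranked = ranked[:n]
    return [word for word, _ in ranked]


sentences = ["the disk is full", "the disk failed", "the fan is loud"]
top_words = extract_words(sentences, n=3)
frequent_words = extract_words(sentences, min_count=2)
```

Each word appears at most once in the result, so the returned list corresponds to n types of words rather than n occurrences.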

The vector calculation unit 312 calculates m sentence vectors and n word vectors from the m sentences and the n words. Here, the sentence vector calculation unit 312a vectorizes each of the m sentences analyzed by the word extraction unit 311 into q dimensions according to a predetermined rule, thereby calculating m sentence vectors each consisting of q axis components (q is an arbitrary integer of 2 or more). Likewise, the word vector calculation unit 312b vectorizes each of the n words extracted by the word extraction unit 311 into q dimensions according to a predetermined rule, thereby calculating n word vectors each consisting of q axis components.

In the present embodiment, as an example, the sentence vectors and word vectors are calculated as follows. Consider a set S=<d∈D, w∈W> consisting of m sentences and n words. Each sentence di (i=1, 2, …, m) and each word wj (j=1, 2, …, n) is associated with a sentence vector di→ and a word vector wj→, respectively (hereinafter, the symbol "→" indicates a vector). Then, for an arbitrary word wj and an arbitrary sentence di, the probability P(wj|di) given by the following equation (1) is calculated.

This probability P(wj|di) can be calculated in the manner of the probability p disclosed in, for example, the paper "Distributed Representations of Sentences and Documents" by Quoc Le and Tomas Mikolov, Google Inc., Proceedings of the 31st International Conference on Machine Learning, held in Beijing, China, on 22-24 June 2014, which describes evaluating sentences and documents by paragraph vectors. The paper states that, for example, given the three words "the", "cat", and "sat", "on" is predicted as the fourth word, and it gives the formula for the prediction probability p. The probability p(wt|wt-k, …, wt+k) described in the paper is the probability of a correct answer when one word wt is predicted from a plurality of other words wt-k, …, wt+k.

In contrast, the probability P(wj|di) of equation (1) used in the present embodiment represents the probability of a correct answer when one word wj of the n words is predicted from one sentence di of the m sentences. Predicting one word wj from one sentence di means, specifically, predicting the likelihood that the word wj is contained in a sentence di when that sentence appears.

Equation (1) uses the value of an exponential function with base e whose exponent is the inner product of a word vector w→ and a sentence vector d→. The probability of a correct answer when one word wj is predicted from one sentence di is calculated as the ratio of the exponential function value computed for the combination of the target sentence di and word wj to the sum of the n exponential function values computed for the combinations of the sentence di with each of the n words wk (k=1, 2, …, n).

Here, the inner product of the word vector wj→ and the sentence vector di→ can be regarded as the scalar value obtained when the word vector wj→ is projected onto the direction of the sentence vector di→, that is, the component of the word vector wj→ in the direction of the sentence vector di→. This can be considered to represent the degree to which the word wj contributes to the sentence di. Accordingly, taking the ratio of the exponential function value calculated for one word wj to the sum of the exponential function values calculated for the n words wk (k=1, 2, …, n), using such inner products, corresponds to obtaining the probability of a correct answer when one word wj of the n words is predicted from one sentence di.
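The ratio described for equation (1) can be illustrated with a short sketch. The function names and the small two-dimensional vectors below are hypothetical and chosen only for illustration; subtracting the maximum score before exponentiation is an added numerical safeguard that leaves the ratio unchanged and is not part of the description.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def word_given_sentence(word_vectors, sentence_vector, j):
    """P(w_j | d_i): exp(w_j . d_i) divided by the sum of exp(w_k . d_i) over all n words."""
    scores = [dot(w, sentence_vector) for w in word_vectors]
    shift = max(scores)  # assumed safeguard against overflow; the ratio is unaffected
    exps = [math.exp(s - shift) for s in scores]
    return exps[j] / sum(exps)


# Hypothetical example with n = 3 words and q = 2 dimensions.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_vector = [2.0, 0.5]
probs = [word_given_sentence(word_vectors, sentence_vector, j) for j in range(3)]
```

The word whose vector has the largest component in the direction of the sentence vector receives the highest probability, and the n probabilities for one sentence sum to 1.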

Equation (1) treats di and wj symmetrically through their inner product, so the probability P(di|wj) that one sentence di of the m sentences is predicted from one word wj of the n words may be calculated instead. Predicting one sentence di from one word wj means predicting the likelihood that a word wj, when it appears, is contained in the sentence di. In this case, the inner product of the sentence vector di→ and the word vector wj→ can be regarded as the scalar value obtained when the sentence vector di→ is projected onto the direction of the word vector wj→, that is, the component of the sentence vector di→ in the direction of the word vector wj→. This can be considered to represent the degree to which the sentence di contributes to the word wj.

Although a calculation example using the value of an exponential function whose exponent is the inner product of the word vector w→ and the sentence vector d→ has been shown here, using an exponential function value is not essential. Any formula that uses the inner product of the word vector w→ and the sentence vector d→ will do; for example, the probability may be obtained from the ratio of the inner products themselves.

Next, as shown in the following equation (2), the vector calculation unit 312 calculates the sentence vectors di→ and word vectors wj→ that maximize the value L obtained by summing the probability P(wj|di) calculated by equation (1) over the entire set S. That is, the sentence vector calculation unit 312a and the word vector calculation unit 312b calculate the probability P(wj|di) of equation (1) for all combinations of the m sentences and the n words, take the sum of these probabilities as a target variable L, and calculate the sentence vectors di→ and word vectors wj→ that maximize the target variable L.
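Equations (1) and (2) are not reproduced in this text. Based solely on the surrounding description (an exponential of the inner product, normalized over the n words, then summed over all combinations of the m sentences and n words), they can plausibly be written as follows. This is a reconstruction under that assumption and may differ in notation from the equations of the original publication.

```latex
P(w_j \mid d_i) = \frac{\exp\left(\vec{w}_j \cdot \vec{d}_i\right)}{\sum_{k=1}^{n} \exp\left(\vec{w}_k \cdot \vec{d}_i\right)} \tag{1}

L = \sum_{i=1}^{m} \sum_{j=1}^{n} P(w_j \mid d_i) \tag{2}
```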

Maximizing the sum L of the probabilities P(wj|di) calculated for all combinations of the m sentences and the n words means maximizing the probability of a correct answer when a word wj (j=1, 2, …, n) is predicted from a sentence di (i=1, 2, …, m). In other words, the vector calculation unit 312 calculates the sentence vectors di→ and word vectors wj→ for which this probability of a correct answer is maximized.

Here, in the present embodiment, as described above, the vector calculation unit 312 vectorizes each of the m sentences di into q dimensions to calculate m sentence vectors di→ each consisting of q axis components, and vectorizes each of the n words into q dimensions to calculate n word vectors wj→ each consisting of q axis components. This corresponds to treating the q axis directions as variable and calculating the sentence vectors di→ and word vectors wj→ for which the above-described target variable L is maximized.

The index value calculation unit 313 takes the inner product of each of the m sentence vectors di→ with each of the n word vectors wj→ calculated by the vector calculation unit 312, thereby calculating m×n similarity index values that reflect the relationships between the m sentences di and the n words wj. In the present embodiment, as shown in the following equation (3), the index value calculation unit 313 calculates an index value matrix DW whose elements are the m×n similarity index values by taking the product of a sentence matrix D, whose elements are the q axis components (d11 to dmq) of the m sentence vectors di→, and a word matrix W, whose elements are the q axis components (w11 to wnq) of the n word vectors wj→. Here, Wt is the transpose of the word matrix W.

Each element of the index value matrix DW calculated in this way can be said to represent how much a given word contributes to a given sentence. For example, the element dw12 in row 1, column 2 is a value representing how much the word w2 contributes to the sentence d1. Accordingly, each row of the index value matrix DW can be used to evaluate the similarity of sentences, and each column can be used to evaluate the similarity of words.
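The matrix product of equation (3), as described in the text, can be sketched as below. The function name `index_value_matrix` and the numeric values are hypothetical; the sentence matrix D is assumed to be stored as m rows of q components and the word matrix W as n rows of q components, so that DW = D × Wt has m rows and n columns.

```python
def index_value_matrix(D, W):
    """Return DW = D x W^t, where D is m x q and W is n x q (lists of rows)."""
    return [
        [sum(d_k * w_k for d_k, w_k in zip(d, w)) for w in W]
        for d in D
    ]


# Hypothetical values: m = 3 sentences, n = 2 words, q = 2 dimensions.
D = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
W = [[1.0, 0.0], [0.5, 0.5]]
DW = index_value_matrix(D, W)

dw12 = DW[0][1]                          # contribution of word w2 to sentence d1
sentence_group_d1 = DW[0]                # row 1: the n index values for sentence d1
word_column_w2 = [row[1] for row in DW]  # column 2: the m index values for word w2
```

Reading a row gives the values used to compare sentences, and reading a column gives the values used to compare words.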

Using the m×n similarity index values calculated by the index value calculation unit 313, the classification model generation unit 314 generates a classification model for classifying the m sentences di into a plurality of events, based on the sentence index value group consisting of the n similarity index values dwj (j=1, 2, …, n) for each sentence di (i=1, 2, …, m). For example, when generating a classification model that classifies into three events, first to third, the classification model generation unit 314 generates a classification model such that a sentence index value group calculated from a sentence already known to correspond to the first event is classified as the "first event", a sentence index value group calculated from a sentence already known to correspond to the second event is classified as the "second event", and a sentence index value group calculated from a sentence already known to correspond to the third event is classified as the "third event". The classification model generation unit 314 then stores the generated classification model in the classification model storage unit 330.

Here, the sentence index value group for the first sentence d1, for example, is the n similarity index values dw11 to dw1n contained in the first row of the index value matrix DW. Similarly, the sentence index value group for the second sentence d2 is the n similarity index values dw21 to dw2n contained in the second row of the index value matrix DW. The same applies up to the sentence index value group for the m-th sentence dm (the n similarity index values dwm1 to dwmn).

The classification model generation unit 314, for example, calculates a feature amount for the sentence index value group of each sentence di and, according to the calculated feature amounts, optimizes the separation into a plurality of groups by the Markov chain Monte Carlo method, thereby generating a classification model for classifying each sentence di into the plurality of events. Here, the classification model generated by the classification model generation unit 314 is a learning model that takes a sentence index value group as input and outputs one of the plurality of events to be predicted as the solution. Alternatively, it may be a learning model that outputs, as a probability, the likelihood of corresponding to each of the plurality of events to be predicted. The learning model may take any form.

For example, the classification model generated by the classification model generation unit 314 may take the form of any of the following: a regression model (a learning model based on linear regression, logistic regression, a support vector machine, or the like), a tree model (a learning model based on a decision tree, regression tree, random forest, gradient boosting tree, or the like), a neural network model (a learning model based on a perceptron, convolutional neural network, recurrent neural network, residual network, RBF network, stochastic neural network, spiking neural network, complex-valued neural network, or the like), a Bayesian model (a learning model based on Bayesian inference or the like), or a clustering model (a learning model based on the k-nearest neighbor method, hierarchical clustering, non-hierarchical clustering, a topic model, or the like). The classification models mentioned here are merely examples, and the classification model is not limited to them.
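As one hedged illustration of a model that takes a sentence index value group as input and outputs one of the events, the sketch below uses the k-nearest neighbor method, which is among the forms listed above. It is not the Markov chain Monte Carlo optimization described for the embodiment; the function names, the sample index value groups, and the event labels are all hypothetical.

```python
from collections import Counter


def squared_distance(u, v):
    """Squared Euclidean distance between two sentence index value groups."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_predict(train_groups, train_labels, query_group, k=3):
    """Return the majority event label among the k training groups nearest the query."""
    ranked = sorted(
        zip(train_groups, train_labels),
        key=lambda pair: squared_distance(pair[0], query_group),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical sentence index value groups (rows of DW) with known events.
train_groups = [[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2]]
train_labels = ["event 1", "event 1", "event 2", "event 2"]
predicted = knn_predict(train_groups, train_labels, [1.9, 0.2], k=3)
```

Here the "model" is simply the stored training groups and their labels; any of the other listed model forms could be substituted without changing the input and output of the step.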

The prediction data input unit 320 receives, as prediction data, sentence data relating to one or more sentences to be predicted. The sentence data received by the prediction data input unit 320 relates to sentences for which it is unknown which of the plurality of events they correspond to. As with the sentence data received by the learning data input unit 310, the sentence data received by the prediction data input unit 320 may be data describing sentences related to the plurality of events to be predicted, or data describing sentences that seem at first glance unrelated to those events.

The number m' of pieces of sentence data (the number of sentences) received by the prediction data input unit 320 need not be the same as the number m of sentences received by the learning data input unit 310. The prediction data input unit 320 may receive one piece of sentence data or a plurality of pieces. However, similarity index values are also calculated for the sentences received by the prediction data input unit 320. Because a similarity index value represents how much a given word contributes to a given sentence and how much a given sentence contributes to a given word, it is preferable that the prediction data input unit 320 also receive a plurality of sentences.

The event prediction unit 321 predicts one of the plurality of events from the prediction target data by applying, to the classification model generated by the classification model generation unit 314 (the classification model stored in the classification model storage unit 330), the similarity index values obtained by executing the processing of the word extraction unit 311, the vector calculation unit 312, and the index value calculation unit 313 of the similarity index value calculation unit 300 on the prediction data input by the prediction data input unit 320.

For example, when m' pieces of sentence data are input as prediction data by the prediction data input unit 320, the event prediction unit 321 has the similarity index value calculation unit 300 process the m' pieces of sentence data, thereby obtaining m' sentence index value groups. The event prediction unit 321 gives the m' sentence index value groups calculated by the similarity index value calculation unit 300 to the classification model one at a time as input data, thereby predicting, for each of the m' sentences, which of the plurality of events it corresponds to.

Here, it is preferable that the word extraction unit 311 extract from the m' pieces of prediction data the same words as the n words extracted from the m pieces of learning data. This is because the sentence index value groups built from the n words extracted from the prediction data then have the same words as their elements as the sentence index value groups built from the n words extracted from the learning data, which improves the fit to the classification model stored in the classification model storage unit 330. However, it is not essential to extract at prediction time the same n words as at learning time. When sentence index value groups for prediction are generated from a combination of words different from that used at learning time, the fit to the classification model is lower, but the low fit itself can serve as one element of the evaluation, and it remains possible to predict the likelihood of corresponding to an event.

FIGS. 9 and 10 are flowcharts showing an example of the operation of an event prediction system according to an embodiment of the present invention.

Referring to FIG. 9, first, the learning data input unit 310 receives, as learning data, sentence data relating to m sentences for each of which it is already known which of the plurality of events it corresponds to (step S1). The word extraction unit 311 analyzes the m sentences input by the learning data input unit 310 and extracts n words from them (step S2).

Next, the vector calculation unit 312 calculates m sentence vectors di→ and n word vectors wj→ from the m sentences input by the learning data input unit 310 and the n words extracted by the word extraction unit 311 (step S3). The index value calculation unit 313 then takes the inner product of each of the m sentence vectors di→ with each of the n word vectors wj→, thereby calculating m×n similarity index values that reflect the relationships between the m sentences di and the n words wj (an index value matrix DW whose elements are the m×n similarity index values) (step S4).

Further, using the m×n similarity index values calculated by the index value calculation unit 313, the classification model generation unit 314 generates a classification model for classifying the m sentences di into the plurality of events, based on the sentence index value group consisting of the n similarity index values dwj for each sentence di, and stores the generated classification model in the classification model storage unit 330 (step S5).

This completes the operation at learning time.

Referring to FIG. 10, the prediction data input unit 320 receives, as prediction data, sentence data relating to one or more sentences for which it is unknown which of the plurality of events they correspond to (step S11). The event prediction unit 321 supplies the prediction data input by the prediction data input unit 320 to the similarity index value calculation unit 300 and instructs it to calculate similarity index values.

In response to this instruction, the word extraction unit 311 analyzes the m' sentences input by the prediction data input unit 320 and extracts n words (the same words as those extracted from the learning data) from the m' sentences (step S12).

Note that the m' sentences do not necessarily contain all of the n words. A word that does not appear in any of the m' sentences is given a Null value.
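The handling of training words that are absent from the prediction sentences can be sketched as follows. This is an assumption-laden illustration: the function name `count_training_words` is hypothetical, whitespace tokenization stands in for morphological analysis, and Python's `None` is used to represent the Null value mentioned above.

```python
def count_training_words(prediction_sentences, training_words):
    """Map each training word to its count in the prediction sentences, or None if absent."""
    tokens = [t for s in prediction_sentences for t in s.lower().split()]
    result = {}
    for word in training_words:
        count = tokens.count(word)
        # A word that never appears in the prediction sentences is marked as Null (None).
        result[word] = count if count > 0 else None
    return result


# Hypothetical training vocabulary and prediction sentences.
training_words = ["the", "disk", "is"]
found = count_training_words(["the fan stopped", "the fan is loud"], training_words)
```

The result keeps the same n words in the same order as at learning time, so the later index value groups line up element by element with those used to build the classification model.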

Next, the vector calculation unit 312 calculates m' sentence vectors di→ and n word vectors wj→ from the m' sentences input by the prediction data input unit 320 and the n words extracted by the word extraction unit 311 (step S13).

The index value calculation unit 313 then takes the inner product of each of the m' sentence vectors di→ with each of the n word vectors wj→, thereby calculating m'×n similarity index values that reflect the relationships between the m' sentences di and the n words wj (an index value matrix DW whose elements are the m'×n similarity index values) (step S14). The index value calculation unit 313 supplies the calculated m'×n similarity index values to the event prediction unit 321.

Based on the m'×n similarity index values supplied from the similarity index value calculation unit 300, the event prediction unit 321 applies each of the m' sentence index value groups to the classification model stored in the classification model storage unit 330, thereby predicting, for each of the m' sentences, which of the plurality of events it corresponds to (step S15). This completes the operation at prediction time.
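The text here does not say what kind of classification model is stored, so the following sketch of step S15 swaps in a nearest-centroid classifier purely for illustration. The function `predict_event`, the event labels, and the centroid values are all assumptions, not part of the disclosure.

```python
import math
from typing import Dict, List


def predict_event(index_row: List[float], centroids: Dict[str, List[float]]) -> str:
    """Return the event label whose centroid is closest to one sentence
    index value group (one row of n similarity index values).

    Nearest-centroid classification is a stand-in; any classifier trained
    on the learning-time index value groups could be used instead.
    """
    return min(centroids, key=lambda label: math.dist(index_row, centroids[label]))


# Hypothetical stored model: one centroid per event, each of length n = 3.
centroids = {
    "travel": [1.0, 0.0, 2.0],
    "finance": [0.0, 2.0, 0.0],
}

# Hypothetical index value groups for m' = 2 prediction sentences.
index_rows = [[1.0, 0.0, 2.0], [0.2, 1.5, 0.1]]

predictions = [predict_event(row, centroids) for row in index_rows]
print(predictions)  # ['travel', 'finance']
```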

FIG. 11 is a block diagram of an event prediction system according to another embodiment of the present invention.

Referring to FIG. 11, the event prediction system according to another embodiment of the present invention further includes a reward determination unit 322 in addition to the configuration shown in FIG. 8. It also includes a classification model generation unit 314' in place of the classification model generation unit 314 shown in FIG. 8.

The reward determination unit 322 determines the reward to be given to the classification model generation unit 314' according to the actual event corresponding to the event predicted by the event prediction unit 321. For example, the reward determination unit 322 decides to give a positive reward when the event predicted by the event prediction unit 321 matches the actual event, and to give no reward or a negative reward when it does not match. Whether the predicted event matches the actual event can be determined by various methods.
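The reward rule just described reduces to a small function. This is a hedged sketch: the name `determine_reward` and the magnitudes 1.0 and -1.0 are assumptions, since the text only specifies a positive reward on a match and no reward or a negative reward otherwise.

```python
def determine_reward(
    predicted_event: str, actual_event: str, penalty: float = -1.0
) -> float:
    """Return a positive reward when the prediction matches the actual event.

    On a mismatch, return `penalty`: pass 0.0 for "no reward" or a negative
    value for a "negative reward". The magnitudes are illustrative only.
    """
    return 1.0 if predicted_event == actual_event else penalty


print(determine_reward("travel", "travel"))                # 1.0
print(determine_reward("travel", "finance"))               # -1.0
print(determine_reward("travel", "finance", penalty=0.0))  # 0.0
```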

For example, when the user's hobbies and preferences are predicted as the plurality of events, information tailored to the predicted hobbies and preferences is presented to the user, and if the user takes an action on that information, the predicted event can be determined to match the actual event. As a specific example, advertisement information for products or services tailored to the predicted hobbies and preferences is displayed on a web page the user is browsing; if the user takes an action such as clicking the advertisement to view detailed information, or purchasing a product or service shown in the advertisement, the predicted event is determined to match the actual event.

Likewise, when predicting the possibility that a specific failure will occur in a certain system, whether the specific failure has actually occurred is monitored on the basis of history data recording the system's monitoring history, and if the history data shows that the predicted failure actually occurred, the predicted event can be determined to match the actual event. Similarly, when predicting the possibility that a specific symptom will appear in a plurality of users, whether the specific symptom has actually developed is monitored on the basis of history data such as the users' medical examination history, and if the history data shows that the predicted symptom actually developed, the predicted event can be determined to match the actual event.

Like the classification model generation unit 314 shown in FIG. 8, the classification model generation unit 314' generates a classification model based on the learning data input by the learning data input unit 310 and stores it in the classification model storage unit 330. In addition, the classification model generation unit 314' modifies the classification model stored in the classification model storage unit 330 according to the reward determined by the reward determination unit 322.

As described above, the DB construction unit 33 may include: a word extraction unit that analyzes m sentences (m is any integer of 2 or more) and extracts n words (n is any integer of 2 or more) from the m sentences; a sentence vector calculation unit that calculates m sentence vectors, each consisting of q axis components, by vectorizing each of the m sentences in q dimensions (q is any integer of 2 or more) according to a predetermined rule; a word vector calculation unit that calculates n word vectors, each consisting of q axis components, by vectorizing each of the n words in q dimensions according to a predetermined rule; an index value calculation unit that calculates m×n similarity index values reflecting the relationship between the m sentences and the n words by taking the inner product of each of the m sentence vectors with each of the n word vectors; a classification model generation unit that, using the m×n similarity index values calculated by the index value calculation unit, generates a classification model for classifying the m sentences into a plurality of events on the basis of sentence index value groups each consisting of n similarity index values for one sentence; a prediction data input unit that inputs one or more sentences to be predicted as prediction data; and an event prediction unit that predicts one of the plurality of events from the prediction data by applying, to the classification model generated by the classification model generation unit, the similarity index values obtained by executing the processing of the word extraction unit, the sentence vector calculation unit, the word vector calculation unit, and the index value calculation unit on the prediction data input by the prediction data input unit.

Meanwhile, the sentence writing system 1000 according to another embodiment of the present invention may further include a voice output device 50 in addition to the components shown in FIG. 1.

The voice output device 50 may be implemented as a speaker with a wireless communication function. It is connected to the user terminal 10 wirelessly and can output, as voice, the example sentence that the user terminal 10 receives from the sentence display unit 35.

FIG. 12 is a diagram showing a voice output device according to an embodiment of the present invention.

Referring to FIG. 12, the voice output device 50 according to an embodiment of the present invention may include a speaker unit 51 that is attached to the front of the main body and exposed to the outside, in which speakers of multiple diameters are arranged in the vertical direction and an acoustic signal is output from each speaker.

The voice output device 50 may have a Wi-Fi module, a Bluetooth module, or another module supporting wireless communication built into the main body, and plays back the sound source provided by the user terminal 10 within a Wi-Fi or Bluetooth environment so that the user can listen to it. In this case, the user terminal 10 may convert the sentence received from the sentence display unit 35 into voice data and transmit the voice data to the voice output device 50 so that the user can listen to it.

Meanwhile, the sentence writing system 1000 according to yet another embodiment of the present invention may further include an auxiliary output device in addition to the components shown in FIG. 1.

The auxiliary output device may be implemented as a display device with a wireless communication function. It is connected to the user terminal 10 wirelessly and can output the example sentence that the user terminal 10 receives from the sentence display unit 35.

The auxiliary output device may be installed, by means of an installation module, on the table on which the user terminal 10 is placed. This is described below with reference to FIG. 13.

FIG. 13 is a diagram showing an installation module according to an embodiment of the present invention.

Referring to FIG. 13, the installation module 60 according to an embodiment of the present invention may include an installation bar 62, a seating part 63, and an installation bracket 61.

The installation bar 62 may be provided in the form of a rectangular parallelepiped column, and the seating part 63 may be provided at one end of it.

The seating part 63 is formed as a plate of a predetermined thickness, and a seating groove may be formed in its lower portion for seating the auxiliary output device. That is, the lower portion of the auxiliary output device may be inserted into and installed in the seating groove.

The installation bracket 61 may be provided at the other end of the installation bar 62 so that the installation bar 62 can rotate.

The installation bracket 61 may consist of a bottom surface and a pair of wall surfaces extending upward from both ends of the bottom surface, and the upper portion of each wall surface may be formed in a semicircular shape. The other end of the installation bar 62 may be rotatably installed between the pair of wall surfaces of the installation bracket 61.

For example, the installation bar 62 may have a long hole 611 formed at its other end and may be installed on the installation bracket 61 by a rotation pin 612 that penetrates the long hole 611 from the outer surface of the installation bracket 61. The long hole 611 allows the installation bar 62 to move in the front-rear direction with respect to the installation bracket 61.

The bottom surface of the installation bracket 61 may be attached and fixed to the table on which the user terminal 10 is placed.

A plurality of insertion grooves 611b recessed inward may be formed along the edges of the pair of wall surfaces of the installation bracket 61. An insertion protrusion 62a that can be inserted into the insertion grooves 611b may be formed on the outer surface of the other end of the installation bar 62. The insertion grooves 611b may be provided in a spiral shape. That is, each insertion groove 611b is provided in a spiral shape with a locking groove 611a formed on its inner side, which prevents the inserted insertion protrusion 62a from coming out.

To briefly describe how the installation bar 62 is rotated: the user moves the installation bar 62 forward, rotates it to the desired angle, and then moves it backward in that position so that the insertion protrusion 62a fits into an insertion groove 611b and is fixed. The installation module 60 thus has the advantage of being easy for the user to operate.

Meanwhile, the seating part 63 may be made entirely of silver or copper, or may consist of a metal plate whose upper side is integrally coated with one or more antibacterial substances selected from silver, titanium dioxide, and copper.

In addition, the antibacterial substance may be integrally coated on the upper side of the metal plate by a method such as ion-state deposition, plating, or spraying.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention can be embodied in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

1000: sentence writing system
10: user terminal
20: network
30: sentence writing device
40: database

Claims

A sentence writing system comprising:
a user terminal; and
a sentence writing device that constructs a database storing materials whose copyright has expired, obtains from the user terminal an input sentence entered by a user, extracts from the database a sentence similar to the input sentence as an example sentence, and outputs the example sentence on the user terminal,
wherein the sentence writing device
constructs, by learning the sentences stored in the database according to the Word2Vec algorithm, a neural network that extracts a vector value representing context information of an input sentence; inputs the input sentence obtained from the user terminal to the neural network to calculate an input sentence vector value representing context information of the input sentence; calculates the similarity between the input sentence vector value and example sentence vector values representing context information of each of the sentences stored in the database; and extracts, from among the sentences stored in the database, the sentence whose example sentence vector value has the highest similarity to the input sentence vector value as the example sentence and transmits it to the user terminal,
the system further comprising:
an auxiliary output device that is connected to the user terminal wirelessly and receives and outputs the example sentence that the user terminal has received from the sentence writing device; and
an installation module for installing the auxiliary output device on a table on which the user terminal is placed,
wherein the installation module comprises:
an installation bar provided in the form of a rectangular parallelepiped;
a seating part formed as a plate of a predetermined thickness at one end of the installation bar and having a seating groove into which the lower portion of the auxiliary output device is inserted and installed; and
an installation bracket installed at the other end of the installation bar so that the installation bar can rotate,
wherein the installation bracket comprises:
a bottom surface;
a pair of wall surfaces extending upward from both ends of the bottom surface, each having an upper portion formed in a semicircular shape; and
a plurality of insertion grooves recessed inward along the edge of each of the pair of wall surfaces and formed in a spiral shape with a locking groove on the inner side,
wherein the installation bar
has a long hole formed at its other end, is rotatably installed between the pair of wall surfaces by a rotation pin that penetrates the long hole from the outer surfaces of the pair of wall surfaces, and has, at the other end, an insertion protrusion provided so as to be inserted into the insertion grooves, and
wherein the seating part
consists of a metal plate whose upper side is integrally coated with one or more antibacterial substances selected from silver, titanium dioxide, and copper.

(Deleted)

The sentence writing system according to claim 1,
wherein the user terminal
has installed on it, and executes, a sentence writing application operated by the sentence writing device.

The sentence writing system according to claim 1,
wherein the user terminal
has installed on it, and executes, a sentence writing application having an interface in which an input region for receiving a sentence from the user and an output region for outputting the example sentence are arranged side by side.

The sentence writing system according to claim 1,
wherein the sentence writing device
comprises an input unit that receives the input sentence from the user terminal.

The sentence writing system according to claim 1,
wherein the sentence writing device
comprises a DB construction unit that accesses a web site provided so that copyright-expired materials can be used, and collects the copyright-expired materials.

The sentence writing system according to claim 1,
wherein the sentence writing device
comprises a DB construction unit that collects the copyright-expired materials, extracts sentences from them, and stores the extracted sentences in the database.

The sentence writing system according to claim 1,
wherein the sentence writing device
comprises a DB construction unit that collects the copyright-expired materials and identifies similar-sentence relationships among the copyright-expired materials.

The sentence writing system according to claim 1,
wherein the sentence writing device
comprises a DB construction unit that collects the copyright-expired materials,
wherein the DB construction unit
comprises an event prediction system that identifies the event of each of the copyright-expired materials, and
wherein the event prediction system comprises:
a word extraction unit that analyzes m sentences (m is any integer of 2 or more) and extracts n words (n is any integer of 2 or more) from the m sentences;
a sentence vector calculation unit that calculates m sentence vectors, each consisting of q axis components, by vectorizing each of the m sentences in q dimensions (q is any integer of 2 or more) according to a predetermined rule;
a word vector calculation unit that calculates n word vectors, each consisting of q axis components, by vectorizing each of the n words in q dimensions according to a predetermined rule;
an index value calculation unit that calculates m×n similarity index values reflecting the relationship between the m sentences and the n words by taking the inner product of each of the m sentence vectors with each of the n word vectors;
a classification model generation unit that, using the m×n similarity index values calculated by the index value calculation unit, generates a classification model for classifying the m sentences into a plurality of events on the basis of sentence index value groups each consisting of n similarity index values for one sentence;
a prediction data input unit that inputs one or more sentences to be predicted as prediction data; and
an event prediction unit that predicts one of the plurality of events from the prediction data by applying, to the classification model generated by the classification model generation unit, the similarity index values obtained by executing the processing of the word extraction unit, the sentence vector calculation unit, the word vector calculation unit, and the index value calculation unit on the prediction data input by the prediction data input unit.

The sentence writing system according to claim 1,
wherein the sentence writing device
further comprises a sentence display unit that extracts a sentence similar to the input sentence from the database as the example sentence, and
wherein the sentence display unit
extracts the example sentence by calculating the Euclidean distance between vector values representing the sentences stored in the database and a vector value representing the input sentence.

The sentence writing system according to claim 1,
wherein the sentence writing device
further comprises a sentence display unit that extracts a sentence similar to the input sentence from the database as the example sentence, and
wherein the sentence display unit
extracts the example sentence by calculating the cosine similarity between vector values representing the sentences stored in the database and a vector value representing the input sentence.

The sentence writing system according to claim 1,
wherein the sentence writing device
further comprises a sentence display unit that extracts a sentence similar to the input sentence from the database as the example sentence, and
wherein the sentence display unit
extracts the example sentence by calculating the Tanimoto coefficient between vector values representing the sentences stored in the database and a vector value representing the input sentence.

The sentence writing system according to claim 1,
further comprising a network provided as a wired communication network connecting the user terminal and the sentence writing device.

The sentence writing system according to claim 1,
further comprising a network provided as a mobile communication network connecting the user terminal and the sentence writing device.

The sentence writing system according to claim 1,
further comprising a network provided as a WiBro (Wireless Broadband) network connecting the user terminal and the sentence writing device.

The sentence writing system according to claim 1,
further comprising a network provided as a High Speed Downlink Packet Access (HSDPA) network connecting the user terminal and the sentence writing device.

The sentence writing system according to claim 1,
further comprising a network provided as a satellite communication network connecting the user terminal and the sentence writing device.

The sentence writing system according to claim 1,
further comprising a network provided as a Wi-Fi (Wireless Fidelity) network connecting the user terminal and the sentence writing device.

The sentence writing system according to claim 1,
further comprising a voice output device that is connected to the user terminal wirelessly and receives from the user terminal, and outputs, a voice file for the example sentence.

A sentence writing method performed by a sentence writing device connected to a user terminal through a network, the method comprising:
collecting materials whose copyright protection period has expired;
storing the collected materials in a database;
receiving a sentence from a user through the user terminal;
extracting, from the database, a sentence similar to the sentence input by the user, using a neural network that outputs a vector value representing context information of the input sentence; and
outputting the extracted sentence to the user terminal,
wherein the sentence writing device
constructs, by learning the sentences stored in the database according to the Word2Vec algorithm, a neural network that extracts a vector value representing context information of an input sentence; inputs the input sentence obtained from the user terminal to the neural network to calculate an input sentence vector value representing context information of the input sentence; calculates the similarity between the input sentence vector value and example sentence vector values representing context information of each of the sentences stored in the database; and extracts, from among the sentences stored in the database, the sentence whose example sentence vector value has the highest similarity to the input sentence vector value as the example sentence and transmits it to the user terminal,
wherein there are further provided:
an auxiliary output device that is connected to the user terminal wirelessly and receives and outputs the example sentence that the user terminal has received from the sentence writing device; and
an installation module for installing the auxiliary output device on a table on which the user terminal is placed,
wherein the installation module comprises:
an installation bar provided in the form of a rectangular parallelepiped;
a seating part formed as a plate of a predetermined thickness at one end of the installation bar and having a seating groove into which the lower portion of the auxiliary output device is inserted and installed; and
an installation bracket installed at the other end of the installation bar so that the installation bar can rotate,
wherein the installation bracket comprises:
a bottom surface;
a pair of wall surfaces extending upward from both ends of the bottom surface, each having an upper portion formed in a semicircular shape; and
a plurality of insertion grooves recessed inward along the edge of each of the pair of wall surfaces and formed in a spiral shape with a locking groove on the inner side,
wherein the installation bar
has a long hole formed at its other end, is rotatably installed between the pair of wall surfaces by a rotation pin that penetrates the long hole from the outer surfaces of the pair of wall surfaces, and has, at the other end, an insertion protrusion provided so as to be inserted into the insertion grooves, and
wherein the seating part
consists of a metal plate whose upper side is integrally coated with one or more antibacterial substances selected from silver, titanium dioxide, and copper.