KR102663908B1

KR102663908B1 - Method for providing meaning search service through semantic analysis

Info

Publication number: KR102663908B1
Application number: KR1020220108659A
Authority: KR
Inventors: 김영호
Original assignee: 한국데이터플랫폼 주식회사
Priority date: 2022-05-04
Filing date: 2022-08-29
Publication date: 2024-05-10
Also published as: KR20230156242A; KR102439321B1

Abstract

An embodiment of the present invention relates to a method of providing a meaning search service through semantic analysis. The method semantically analyzes a query in the form of a sentence, finds sentences in text documents that are the same as or similar to the query, and provides them as a semantic analysis service together with page information.

Description

{Method for providing meaning search service through semantic analysis}

The present invention relates to a method of providing a meaning search service through semantic analysis.

Semantic search identifies the search intent through query analysis and provides search results that match that intent.

Currently, when a word is entered, portal sites and search sites use a search engine to extract documents that match the entered word. However, a function that semantically analyzes a sentence-form query and finds the matching content, for example when a person asks about something after reading a book, has not yet been implemented.

In text data, it is very difficult to extract sentences corresponding to a sentence query through natural language processing, and the efficiency and performance of extracting sentences that match a sentence query are not high.

Even now, a semantic analysis service that takes a query in sentence form and finds books, papers, and other materials containing sentences of similar meaning through semantic analysis of the sentence is not in wide use, and it remains a subject for further research.

Korean Patent No. 10-2285232

To solve these problems, an object of the present invention is to provide a semantic analysis providing system that analyzes and finds sentence meaning: it semantically analyzes a query in the form of a sentence, finds sentences in text documents that are the same as or similar to the query, and provides them as a semantic analysis service together with page information.

To achieve the above object, a semantic analysis providing system that analyzes and finds sentence meaning according to a feature of the present invention includes a semantic analysis server comprising:

a sentence acquisition unit that receives text information in the form of a sentence;

a morpheme extraction unit that separates the sentence received from the sentence acquisition unit into morphemes, the smallest units of words that carry meaning;

a data storage unit consisting of an associated word database unit that vectorizes and stores semantically related words for an input word (morpheme), namely substitute words, synonyms, inferred words, and emotion morphemes related to the emotion of the word, a subject word database unit that vectorizes and stores one or more subject words matching the input word, and a semantic analysis database unit that analyzes the meaning of sentences composed of words and vectorizes and stores the semantic analysis sentences classified by meaning;

a word embedding processing unit that converts the words extracted by the morpheme extraction unit into word vectors;

a sentence embedding processing unit that, based on the word vectors produced by the word embedding processing unit, converts into sentence vectors those sentences that satisfy a preset context connection criterion taking into account the context of the words (how word meanings connect to what precedes and follows);

a control unit that compares and analyzes the vector-converted sentence against the semantic analysis database unit, vectorizes the semantic analysis sentences classified by the meaning corresponding to the sentence, and extracts one or more subject words from the subject word database unit based on the vectorized semantic analysis sentences;

a target document database unit that stores a plurality of text data from books, papers, magazines, and publications, vectorized in sentence units; and

a sentence similarity calculation unit that calculates the similarity between sentences using the vector set representing the extracted subject words and the vector set of the text data in the target document database unit. The control unit counts the sentences that the sentence similarity calculation unit has determined to be the same or similar as similar sentences in the text data, extracts a preset number of top-ranked text data in descending order of the number of similar sentences, and, for each extracted text data, generates page notification information indicating the page number and the paragraph on that page where each same or similar sentence appears.

In addition, the sentence similarity calculation unit includes a first sentence similarity calculation unit that, using a sentence embedding technique, calculates the Euclidean distance representing the similarity between two sentences p and q according to Equation 1 below and, when the calculated Euclidean distance is greater than or equal to a preset first threshold, determines the sentence to be a same or similar first sentence; and

a second sentence similarity calculation unit that, using a sentence embedding technique, calculates the cosine similarity according to Equation 2 below and, when the calculated cosine similarity is greater than or equal to a preset second threshold, determines the sentence to be a same or similar second sentence.

[Equation 1]

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Here, sentence p represents the sentence input to the sentence acquisition unit, sentence q represents the plurality of text data stored in the target document database unit, the doc2vec of sentence p, which consists of subject words, is $(p_1, p_2, \ldots, p_n)$, and the doc2vec of sentence q is $(q_1, q_2, \ldots, q_n)$.

[Equation 2]

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$

Here, a represents the sentence input to the sentence acquisition unit, and b represents the plurality of text data stored in the target document database unit.

In addition, the control unit compares the one or more first sentences that the first sentence similarity calculation unit determined to be the same or similar with the one or more second sentences that the second sentence similarity calculation unit determined to be the same or similar, and searches the text data of the target document database unit for the first sentences and for the second sentences that do not overlap with the first sentences (the complement of the first sentences). The control unit counts the plurality of first sentences and the non-overlapping second sentences found in the target document database unit as the number of similar sentences, and extracts the preset number of top-ranked text data in descending order of the counted number of similar sentences.

With the configuration described above, the present invention has the effect of providing a semantic analysis service that finds books, papers, and other materials containing sentences of similar meaning through semantic analysis of a sentence.

Because the present invention can search for the meaning of a sentence across all semantically related words, including identical words, substitute words, synonyms, inferred words, and emotion morphemes related to the emotion of a word, it improves the efficiency and performance of sentence search.

Figure 1 is a block diagram showing the configuration of a semantic analysis providing system that analyzes and finds sentence meaning according to an embodiment of the present invention.
Figure 2 is a block diagram showing the internal configuration of a semantic analysis server according to an embodiment of the present invention.
Figure 3 is a diagram showing the configuration of a learning model and an artificial neural processing network of a semantic analysis server according to an embodiment of the present invention.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all changes, equivalents, and substitutes that fall within the spirit and technical scope of the present invention. Similar reference numerals are used for similar components throughout the description of the drawings.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component, and similarly a second component may be named a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. On the other hand, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other components exist in between.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to designate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related technology and, unless explicitly defined in this application, should not be interpreted in an idealized or excessively formal sense.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the attached drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

Figure 1 is a block diagram showing the configuration of a semantic analysis providing system that analyzes and finds sentence meaning according to an embodiment of the present invention.

The semantic analysis providing system 100 that analyzes and finds sentence meaning according to an embodiment of the present invention includes one or more electronic devices 110, which are user terminals, a communication network 120, and a semantic analysis server 130.

The plurality of electronic devices 110 may be fixed terminals or mobile terminals implemented as computer devices.

Examples of the electronic devices 110 include smartphones, mobile phones, navigation devices, computers, laptops, digital broadcasting terminals, PDAs (Personal Digital Assistants), PMPs (Portable Multimedia Players), and tablet PCs. For example, an electronic device 110 may communicate with the semantic analysis server 130 through the communication network 120 using a wireless or wired communication method.

The communication method of the communication network 120 is not limited, and may include not only communication methods that use communication networks (for example, a mobile communication network, the wired Internet, the wireless Internet, or a broadcasting network) but also short-range wireless communication between devices. For example, the communication network 120 may include any one or more of networks such as a PAN (personal area network), a LAN (local area network), a CAN (campus area network), a MAN (metropolitan area network), a WAN (wide area network), a BBN (broadband network), and the Internet. In addition, the communication network 120 may include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited thereto.

The semantic analysis server 130 may be implemented as a computer device or a plurality of computer devices that communicate with the plurality of electronic devices 110 through the communication network 120 to provide commands, code, files, content, services, and the like.

The semantic analysis server 130 may provide a file for installing an application to an electronic device 110 connected through the communication network 120. In this case, the electronic device 110 may install the application using the file provided by the semantic analysis server 130.

In addition, the electronic device 110 may connect to the semantic analysis server 130 under the control of its operating system (OS) and at least one program (for example, a browser or the installed application) and receive services or content provided by the semantic analysis server 130. For example, when the electronic device 110 transmits a service request message to the semantic analysis server 130 through the communication network 120 under the control of the application, the semantic analysis server 130 may transmit code corresponding to the service request message to the electronic device 110, and the electronic device 110 may provide content to the user by composing and displaying a screen according to the code under the control of the application.

The electronic device 110 receives a sentence composed of words from the user and is provided, by the semantic analysis server 130, with a semantic analysis service that semantically analyzes the meaning of the input sentence and searches the text data of books, papers, magazines, and publications to find similar sentences.

The semantic analysis server 130 receives information in the form of a sentence consisting of one or more words from the electronic device 110, analyzes the meaning of the received sentence, and performs semantic analysis to find similar sentences in the text data of books, papers, magazines, and publications.

Figure 2 is a block diagram showing the internal configuration of a semantic analysis server according to an embodiment of the present invention.

The semantic analysis server 130 according to an embodiment of the present invention includes a sentence acquisition unit 131, a morpheme extraction unit 132, a data storage unit 133, a control unit 134, a word embedding processing unit 136, a sentence embedding processing unit 137, a target document database unit 135, a sentence similarity calculation unit 138, and a communication unit.

The sentence acquisition unit 131 receives information in the form of a sentence consisting of one or more words from the electronic device 110, for example, "I'm depressed today, which book should I read?" or "I want to know about 19th-century Indian philosophy."

The morpheme extraction unit 132 separates the sentence received from the sentence acquisition unit 131 into morphemes, the smallest units of words that carry meaning.

At this time, the morpheme extraction unit 132 filters out unnecessary words, that is, stop words, to improve the discriminability of the sentence.

Stop words are words that appear with high frequency in most sentences. They may consist of any one of a particle, an ending, a prefix, or a suffix, and may include words designated by user settings.
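The stop-word filtering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the stop-word list, the function name `filter_morphemes`, and the pre-segmented morpheme list are all assumptions, and the morphological analysis itself is assumed to happen in a separate analyzer that the text does not name.

```python
# Hypothetical default stop words (Korean particles and endings).
# The source only says stop words are high-frequency particles, endings,
# prefixes, or suffixes, plus words designated by user settings.
DEFAULT_STOP_WORDS = {"을", "를", "은", "는", "이", "가", "고", "습니다"}


def filter_morphemes(morphemes, user_stop_words=()):
    """Drop stop words from an already-segmented list of morphemes."""
    stop_words = DEFAULT_STOP_WORDS | set(user_stop_words)
    return [m for m in morphemes if m not in stop_words]


# Morphemes assumed to come from a separate morphological analyzer.
morphemes = ["19세기", "인도", "철학", "을", "알", "고", "싶", "습니다"]

# User-designated stop words are merged with the defaults.
kept = filter_morphemes(morphemes, user_stop_words={"알", "싶"})
print(kept)
```

Keeping the user-designated words as a separate argument mirrors the text's distinction between built-in stop words and words specified by user settings.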

The data storage unit 133 includes an associated word database unit 133a that vectorizes, through word2vec or doc2vec, and stores semantically related words for an input word (morpheme), such as substitute words, synonyms, inferred words, and emotion morphemes related to the emotion of the word; a subject word database unit 133b that vectorizes, through word2vec or doc2vec, and stores one or more subject words matching the input word; and a semantic analysis database unit 133c that analyzes the meaning of sentences that are composed of words and satisfy a preset context connection criterion taking into account the context of the words (how word meanings connect to what precedes and follows), and vectorizes and stores the semantic analysis sentences classified by meaning.

For example, if the input sentence is "19th-century Indian philosophy", it is separated into the morphemes 19th century + India + philosophy, and the meaning of the sentence is analyzed and stored in the semantic analysis database unit 133c.

The target document database unit 135 stores a plurality of text data, such as books, papers, magazines, and publications, vectorized in sentence units.

The target document database unit 135 includes a text conversion unit 135a, a page number identification unit 135b, and a storage unit 135c.

The text conversion unit 135a converts an input PDF file of text data into a text file (TXT file).

The page number identification unit 135b generates a page number each time the text conversion unit 135a converts a text file, inserts an identification tag at the start of the first page of the text file, and inserts identification tags for each paragraph in sequence, so that pages can be identified.

The page number identification unit 135b stores, in the storage unit 135c, the text file into which page information consisting of the identification tags and page numbers has been inserted.
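One way to picture the page and paragraph tagging is the sketch below. The tag format (`[PAGE n]`, `[PARA m]`), the function name `tag_pages`, and the assumption that the converted text arrives as one string per page with blank lines between paragraphs are all hypothetical; the text does not specify the tag syntax or how the PDF is converted.

```python
def tag_pages(pages):
    """Insert page and paragraph identification tags into converted text.

    pages: list of page texts, with paragraphs separated by blank lines.
    Returns one tagged string using hypothetical [PAGE n] / [PARA m] tags.
    """
    lines = []
    for page_no, page_text in enumerate(pages, start=1):
        lines.append(f"[PAGE {page_no}]")
        paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
        # Paragraph tags are numbered in order within each page.
        for para_no, paragraph in enumerate(paragraphs, start=1):
            lines.append(f"[PARA {para_no}] {paragraph}")
    return "\n".join(lines)


pages = [
    "First paragraph.\n\nSecond paragraph.",
    "Only paragraph on page two.",
]
tagged = tag_pages(pages)
print(tagged)
```

Restarting the paragraph counter on every page keeps each tag pair directly usable as "page n, paragraph m" later on.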

The word embedding processing unit 136 converts the words extracted by the morpheme extraction unit 132 into word vectors.

Based on the word vectors produced by the word embedding processing unit 136, the sentence embedding processing unit 137 converts into sentence vectors those sentences that satisfy a preset context connection criterion taking into account the context of the words (how word meanings connect to what precedes and follows).

The word embedding processing unit 136 and the sentence embedding processing unit 137 vectorize words and sentences using a neural-network-based embedding algorithm. Here, the embedding algorithm may use a known embedding technique such as doc2vec, word2vec, or sense2vec.
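To make the word-vector to sentence-vector step concrete, the sketch below averages word vectors into a sentence vector. This is a deliberately simplified stand-in, not doc2vec, word2vec, or sense2vec: the three-dimensional vectors in `WORD_VECTORS` are made-up values, and the function name `sentence_vector` is hypothetical. A real system would obtain vectors from a trained embedding model.

```python
# Hypothetical toy word vectors; a trained embedding model would supply these.
WORD_VECTORS = {
    "india": [1.0, 0.0, 0.0],
    "philosophy": [0.0, 1.0, 0.0],
    "book": [0.0, 0.0, 1.0],
}


def sentence_vector(words, word_vectors=WORD_VECTORS):
    """Average the vectors of known words (a stand-in for a learned sentence embedding)."""
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


vec = sentence_vector(["india", "philosophy"])
print(vec)
```

Averaging ignores word order, so it does not capture the context connection criterion the text describes; it only shows the shape of the data flowing between the two processing units.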

The control unit 134 compares and analyzes the vector-converted sentence against the semantic analysis database unit 133c, vectorizes the semantic analysis sentences classified by the meaning corresponding to the sentence, and extracts one or more subject words from the subject word database unit 133b based on the vectorized semantic analysis sentences. Here, a subject word is a word or phrase that expresses the central idea of a text or literary work.

When the input sentence is "I want to know about 19th-century Indian philosophy", the control unit 134 classifies 19th century + India + philosophy by meaning in the semantic analysis database unit 133c to extract a semantic analysis sentence, and uses the extracted semantic analysis sentence in the subject word database unit 133b to extract related subject words such as Sankhya philosophy and Hindu philosophy.

When the input sentence is "I'm depressed today, which book should I read?", the control unit 134 classifies depressed + book + read by meaning in the semantic analysis database unit 133c to extract a semantic analysis sentence, and uses the extracted semantic analysis sentence in the subject word database unit 133b to extract related subject words such as discouragement, unhappiness, sadness, misery, loneliness, despair, and regret.
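The two examples above can be mimicked with a small lookup table, shown here only to illustrate the input and output of subject word extraction. The source describes a vector-based comparison against the databases; this sketch swaps in a plain set-overlap lookup, and `SUBJECT_WORD_DB` and `extract_subject_words` are hypothetical names with entries copied from the examples in the text.

```python
# Hypothetical stand-in for the subject word database unit (133b).
# Keys are sets of morphemes; values are the subject words from the text's examples.
SUBJECT_WORD_DB = {
    frozenset({"19th century", "India", "philosophy"}): [
        "Sankhya philosophy",
        "Hindu philosophy",
    ],
    frozenset({"depressed", "book", "read"}): [
        "discouragement", "unhappiness", "sadness", "misery",
        "loneliness", "despair", "regret",
    ],
}


def extract_subject_words(morphemes):
    """Return subject words for the database entry that overlaps most with the input."""
    key = frozenset(morphemes)
    best = max(SUBJECT_WORD_DB, key=lambda entry: len(entry & key))
    if not (best & key):
        return []  # no entry shares any morpheme with the input
    return SUBJECT_WORD_DB[best]


print(extract_subject_words(["India", "philosophy", "19th century"]))
```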

The sentence similarity calculation unit 138 calculates the similarity between sentences using the vector set representing the extracted subject words and the vector set of the text data in the target document database unit 135, and includes a first sentence similarity calculation unit 138a and a second sentence similarity calculation unit 138b.

The first and second sentence similarity calculation units 138a and 138b may calculate a similarity score between sentences using the vector set representing the subject words.

In the first sentence similarity calculation unit 138a, when the doc2vec of the sentence p consisting of subject words is $(p_1, p_2, \ldots, p_n)$ and the doc2vec of the sentence q is $(q_1, q_2, \ldots, q_n)$, the Euclidean distance representing the first similarity between the two sentences p and q can be defined as in Equation 1 below.

Sentence p represents the sentence input to the sentence acquisition unit 131, and sentence q represents the plurality of text data stored in the target document database unit 135.

Using a sentence embedding technique, the first sentence similarity calculation unit 138a calculates the Euclidean distance representing the similarity between the two sentences p and q according to Equation 1 below and, when the calculated Euclidean distance is greater than or equal to a preset first threshold, determines the sentence to be a same or similar first sentence.

[Equation 1]

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Using a sentence embedding technique, the second sentence similarity calculation unit 138b calculates the cosine similarity according to Equation 2 below and, when the calculated cosine similarity is greater than or equal to a preset second threshold, determines the sentence to be a same or similar second sentence.

[Equation 2]

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$

Here, a represents the sentence input to the sentence acquisition unit 131, and b represents the plurality of text data stored in the target document database unit 135.
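The two similarity measures named above are standard and can be sketched in plain Python. The function names and the `meets_threshold` helper are illustrative assumptions. The thresholds are described only as preset values, so none are hard-coded here; the helper follows the text's wording (greater than or equal) for both measures.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two equal-length sentence vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sentence vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def meets_threshold(score, threshold):
    """Comparison as worded in the text: score greater than or equal to the preset threshold."""
    return score >= threshold


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Cosine similarity depends only on the angle between vectors, while Euclidean distance also reflects their magnitudes, which is one reason the two units can flag different sentences.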

The control unit 134 counts, as similar sentences in the text data, the sentences that the sentence similarity calculation unit 138 determined to be the same or similar, extracts the text data of a preset top rank in descending order of the number of similar sentences, and, for each extracted text data, generates page notification information indicating on which page and in which paragraph each same or similar sentence appears.

The control unit 134 compares the one or more first sentences determined to be the same or similar by the first sentence similarity calculation unit 138a with the one or more second sentences determined to be the same or similar by the second sentence similarity calculation unit 138b, and searches the text data of the target document database unit 135 for the first sentences and for the second sentences that do not overlap the first sentences (the complement of the first sentences).

The control unit 134 searches the target document database unit 135 for the plurality of first sentences and for the second sentences that do not overlap the first sentences (the complement of the first sentences), and counts them as the number of similar sentences.
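The merging of the first sentences with the non-overlapping second sentences can be illustrated with set operations. This is only a sketch: it assumes each sentence is identified by a hypothetical string identifier, which the specification does not define.

```python
def merge_similar_sentences(first_ids, second_ids):
    """Union of the first sentences and the second sentences not already
    among the first sentences (the complement of the first sentences)."""
    first = set(first_ids)
    second_only = set(second_ids) - first
    return first | second_only


# Hypothetical sentence identifiers within one text data.
first_ids = ["s1", "s4", "s7"]   # judged similar by Euclidean distance
second_ids = ["s4", "s9"]        # judged similar by cosine similarity
similar = merge_similar_sentences(first_ids, second_ids)
print(len(similar))  # 4
```

Counting the merged set once per text data gives the number of similar sentences without counting a sentence twice when both calculation units select it.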

The control unit 134 extracts the text data of a preset top rank in descending order of the counted number of similar sentences. Here, text data means documents capable of natural language processing, such as books, papers, magazines, and publications, stored in the target document database unit 135.

For example, the five text data with the largest counted number of similar sentences are extracted, and the similar sentences in each text data are marked in color.
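A minimal sketch of the top-rank extraction, assuming the per-document counts are already available as a mapping from a hypothetical document identifier to its number of similar sentences:

```python
from collections import Counter


def top_text_data(similar_counts, n=5):
    """Return the n (document_id, count) pairs with the highest counts,
    in descending order of count."""
    return Counter(similar_counts).most_common(n)


# Hypothetical counts of similar sentences per text data.
counts = {"book_a": 12, "paper_b": 3, "magazine_c": 7}
print(top_text_data(counts, n=2))  # [('book_a', 12), ('magazine_c', 7)]
```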

In each extracted text data, the control unit 134 displays the first sentences and second sentences determined to be the same or similar as color-highlighted, blocked screen areas, and, using the page information of each displayed sentence, generates page notification information indicating on which page and in which paragraph the sentence appears.

The control unit 134 generates result information including the page notification information and the type of the text data, and transmits it to the electronic device 110 through the communication unit 139.

In another embodiment, the control unit 134 calculates a sentence search index by substituting the Euclidean distance calculated by Equation 1 and the cosine similarity calculated by Equation 2 into Equation 3 below. The sentence search index is a similarity value indicating the degree of similarity to the input sentence.

The control unit 134 calculates the sentence search index for each of the aforementioned first sentences and second sentences, and displays each sentence as a blocked screen area whose color depends on the preset search range into which its sentence search index falls.

For example, a sentence may be displayed in green when its sentence search index is between a first value and a second value, in blue when it is between the second value and a third value, in yellow when it is between the third value and a fourth value, and in red when it is between the fourth value and a fifth value.

Here, W₁ is a preset weight related to the Euclidean distance in the sentence similarity search, W₂ is a preset weight related to the cosine similarity in the sentence similarity search, ED is the Euclidean distance, and CS is the cosine similarity.
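The following sketch shows one way the sentence search index and its color mapping could be computed. It rests on two assumptions that the text does not confirm: that Equation 3 is a weighted sum W₁·ED + W₂·CS, and that the first to fifth values are the boundaries listed in `BOUNDS`. The weights, boundaries, and function names are hypothetical.

```python
from bisect import bisect_right


def sentence_search_index(ed, cs, w1=0.5, w2=0.5):
    # ASSUMPTION: Equation 3 is taken to be a weighted sum of ED and CS.
    return w1 * ed + w2 * cs


# Hypothetical first..fifth values delimiting the four color ranges.
BOUNDS = [0.0, 0.25, 0.5, 0.75, 1.0]
COLORS = ["green", "blue", "yellow", "red"]


def color_for_index(index):
    """Map a sentence search index to its display color, or None if the
    index lies outside the first..fifth value range."""
    if index < BOUNDS[0] or index > BOUNDS[-1]:
        return None
    # bisect_right finds the range; the upper bound belongs to the last range.
    position = min(bisect_right(BOUNDS, index) - 1, len(COLORS) - 1)
    return COLORS[position]


print(color_for_index(sentence_search_index(0.4, 0.8)))  # yellow
```

In this sketch a value lying exactly on a boundary is assigned to the higher range. The text leaves that choice open.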

A method of extracting, from the subject word data storage unit 133, one or more subject words corresponding to the morphemes separated from the input sentence is described below.

The control unit 134 calculates, as in Equation 4 below, an importance index of a word included in the text data according to the search frequency of the input word and the number of semantically related words corresponding to the input word (morpheme).

If the calculated importance index is greater than or equal to a preset third threshold, the control unit 134 determines that the word included in the text data is a subject word matching the word of the input sentence, and stores it in the subject word database unit 133b.

The importance index is an indicator used to select the subject words stored in the subject word data storage unit 133.

Here, W₃ is a weight related to the search frequency when a subject word corresponding to a word of the input sentence is extracted from the text data, W₄ is a weight related to the semantic relevance when a subject word corresponding to a word of the input sentence is extracted from the text data, FR is the search frequency, and SR is the semantic relevance, which represents the number of semantically related words.

The importance index becomes higher as the word of the input sentence is found more frequently in the text data, and as the number of semantically related words associated with the word of the input sentence in the text data increases.
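As an illustrative sketch only, the subject word selection may be written as below. The form of Equation 4 is assumed here to be the weighted sum W₃·FR + W₄·SR, which matches the stated behavior that the index rises with both FR and SR when the weights are positive. The weights, threshold, and candidate data are hypothetical.

```python
def importance_index(fr, sr, w3=0.6, w4=0.4):
    # ASSUMPTION: Equation 4 is taken to be a weighted sum of the search
    # frequency FR and the semantic relevance SR.
    return w3 * fr + w4 * sr


def select_subject_words(candidates, third_threshold):
    """Return the words whose importance index reaches the threshold.

    candidates maps a word to a (search_frequency, related_word_count) pair.
    """
    return [
        word
        for word, (fr, sr) in candidates.items()
        if importance_index(fr, sr) >= third_threshold
    ]


# Hypothetical candidates found in one text data.
candidates = {"ocean": (10, 5), "the": (1, 0)}
print(select_subject_words(candidates, third_threshold=5.0))  # ['ocean']
```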

The control unit 134, in conjunction with the target document database unit 135, selects the text data with the largest counted number of similar sentences, generates result information including the type of the selected text data and the page notification information, and transmits it to the electronic device 110 through the communication unit 139.

For each text data, the control unit 134 calculates, by Equation 3 above, the sentence search index of the first sentences and of the second sentences that do not overlap the first sentences (the complement of the first sentences), determines whether each sentence search index is greater than or equal to a preset reference value, and counts the number of sentence search indices at or above the reference value. The control unit 134 selects the text data with the largest such count, generates result information including the type of the selected text data and the page notification information, and transmits it to the electronic device 110 through the communication unit 139.
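The counting and selection step can be sketched as follows, assuming the sentence search indices have already been computed for each text data. The document identifiers, index values, and reference value are hypothetical.

```python
def select_best_text_data(indices_by_doc, reference):
    """Count, per text data, the sentence search indices at or above the
    reference value and return the identifier with the largest count."""
    if not indices_by_doc:
        return None, {}
    counts = {
        doc: sum(1 for value in values if value >= reference)
        for doc, values in indices_by_doc.items()
    }
    best = max(counts, key=counts.get)
    return best, counts


# Hypothetical sentence search indices for two text data.
indices = {"book_a": [0.9, 0.2, 0.8], "paper_b": [0.95]}
best, counts = select_best_text_data(indices, reference=0.7)
print(best, counts)  # book_a {'book_a': 2, 'paper_b': 1}
```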

A method of using artificial intelligence to generate subject words associated with an input sentence when a sentence is input, and a method of generating a sentence vector composed of words and a corresponding vectorized semantic analysis sentence, are described in detail below with reference to FIG. 3.

FIG. 3 is a diagram showing the configuration of the learning model and the artificial neural processing network of the semantic analysis server according to an embodiment of the present invention.

The semantic analysis server 130 according to an embodiment of the present invention includes a control unit 134, a data collection unit 140, a display unit 150, a learning model unit 160, and an artificial neural processing network 170.

The data collection unit 140 receives and stores sentence vectors composed of words, vectorized semantic analysis sentences, and the subject words of the corresponding text data.

The semantic analysis server 130 inputs the sentence vector and the vectorized semantic analysis sentence into the artificial neural processing network 170 and, as the response of the artificial neural processing network 170, outputs one or more subject words of the text data corresponding to the semantic analysis sentence.

The data set stored in the data collection unit 140 is further divided into a training set and a test set. The training set is provided to a machine learning or deep learning model.
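A minimal sketch of dividing the stored data set into a training set and a test set is given below. The 80/20 ratio, the shuffling, and the fixed seed are assumptions made for illustration, since the specification does not state how the division is made.

```python
import random


def split_data_set(samples, test_ratio=0.2, seed=0):
    """Shuffle the samples reproducibly and return (training_set, test_set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of ten sample identifiers.
training_set, test_set = split_data_set(range(10))
print(len(training_set), len(test_set))  # 8 2
```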

The learning model unit 160 includes a data processing unit 161, a learning unit 162, and a classification unit 163.

The artificial neural processing network 170 includes an input layer 171, a hidden layer 172 consisting of a convolution layer unit 173, a pooling layer unit 174, and a fully connected layer unit 175, and an output layer 176.

The data processing unit 161 receives from the data collection unit 140 the training set (Train Set) of the data set, consisting of sentence vectors and the corresponding vectorized semantic analysis sentences, and transmits it to the artificial neural processing network 170. The training set represents the learning data.

The data processing unit 161 receives from the data collection unit 140 the test set (Test Set) of the data set, consisting of sentence vectors and the corresponding vectorized semantic analysis sentences, and transmits it to the classification unit 163.

The data processing unit 161 may be formed as a database unit capable of distributed parallel processing.

The artificial neural processing network 170 receives as input the training set (Train Set) consisting of sentence vectors and the corresponding vectorized semantic analysis sentences, corrects its errors on that training set, and, using the corrected errors, outputs the prediction result for subject word generation, that is, the subject word generation result of the text data corresponding to the semantic analysis sentence.

Here, the artificial neural processing network 170 uses deep convolutional neural networks (CNNs) and may include an input layer 171, a hidden layer 172, and an output layer 176.

The artificial neural processing network 170 uses a neural network-based model for predictive analysis.

The artificial neural processing network 170 includes an input layer 171 x, an output layer 176 y, and an arbitrary number of hidden layers 172 containing four neurons.

Each layer except the output layer 176 has a set of biases and weights, denoted b and W. A sigmoid function is used as the activation function of each hidden layer. The biases and weights are fine-tuned from the input data to improve the prediction score of the model. Each iteration of the training process includes the following steps.

The training process of the neural network model consists of two steps: feed-forward, which calculates the predicted output y of the output layer 176, and back-propagation, which updates the weights and biases.

The artificial neural processing network 170 measures the prediction error (loss) and performs backpropagation on the measured error.

The derivative of the loss function with respect to the biases and weights is used to adjust the weights and biases.
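The feed-forward and back-propagation steps can be illustrated with a small, self-contained sketch. It assumes one hidden layer of four sigmoid neurons, a single sigmoid output, a mean squared error loss, and full-batch gradient descent. The class name, learning rate, and toy data (the logical OR function) are hypothetical and stand in for the sentence vectors of the specification.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyNetwork:
    """One hidden layer of four sigmoid neurons and one sigmoid output."""

    def __init__(self, n_inputs, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.b2 = 0.0

    def feed_forward(self, x):
        """Return the hidden activations h and the predicted output y."""
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(self.W2, h)) + self.b2)
        return h, y

    def loss(self, data):
        """Mean squared error over (input, target) pairs."""
        return sum((self.feed_forward(x)[1] - t) ** 2 for x, t in data) / len(data)

    def train_step(self, data, lr=0.5):
        """One iteration: feed-forward, then back-propagation of the loss
        derivative to update every weight and bias."""
        gW1 = [[0.0] * len(row) for row in self.W1]
        gb1 = [0.0] * len(self.b1)
        gW2 = [0.0] * len(self.W2)
        gb2 = 0.0
        for x, t in data:
            h, y = self.feed_forward(x)
            delta_out = 2.0 * (y - t) * y * (1.0 - y)  # dLoss/d(output pre-activation)
            gb2 += delta_out
            for j, hj in enumerate(h):
                gW2[j] += delta_out * hj
                delta_h = delta_out * self.W2[j] * hj * (1.0 - hj)
                gb1[j] += delta_h
                for i, xi in enumerate(x):
                    gW1[j][i] += delta_h * xi
        n = len(data)
        for j in range(len(self.W2)):
            self.W2[j] -= lr * gW2[j] / n
            self.b1[j] -= lr * gb1[j] / n
            for i in range(len(self.W1[j])):
                self.W1[j][i] -= lr * gW1[j][i] / n
        self.b2 -= lr * gb2 / n


# Toy data: logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = TinyNetwork(n_inputs=2)
loss_before = net.loss(data)
for _ in range(200):
    net.train_step(data)
print(loss_before, net.loss(data))
```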

The input layer 171 acquires the learning data stored in the data processing unit 161 and stores the acquired learning data as a layer having a feature map. Here, the feature map has a structure in which a number of nodes are arranged in two dimensions, which facilitates connection with the hidden layer 172 described later.

The hidden layer 172 acquires the feature map of the layer located above it and extracts progressively higher-level features from the acquired feature map. One or more hidden layers 172 may be formed, and the hidden layer 172 includes a convolution layer unit 173, a pooling layer unit 174, and a fully connected layer unit 175.

The convolution layer unit 173 performs a convolution operation on the learning data and includes feature maps connected to a plurality of input feature maps.

The pooling layer unit 174 receives the output of the convolution layer unit 173 as input and performs a pooling operation, that is, a sub-sampling operation. It includes the same number of feature maps as the number of input feature maps of the convolution layer unit 173 located in the lower layer of the hidden layer 172, and each of its feature maps is connected one-to-one with an input feature map.
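The sub-sampling performed by the pooling layer unit 174 may be illustrated with non-overlapping 2×2 max pooling over one two-dimensional feature map. The choice of max pooling and of a 2×2 window is an assumption; the specification only states that sub-sampling is performed.

```python
def max_pool_2x2(feature_map):
    """Sub-sample a 2D feature map by taking the maximum of each
    non-overlapping 2x2 window (a trailing odd row or column is dropped)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Hypothetical 4x4 feature map produced by a convolution layer.
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(max_pool_2x2(feature_map))  # [[6, 8], [14, 16]]
```

Each input feature map yields exactly one pooled feature map, which corresponds to the one-to-one connection described above.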

The fully connected layer unit 175 receives the output of the convolution layer unit 173 as input, learns in accordance with the per-category outputs produced at the output layer 176, and learns abstract content by combining the learned local information, that is, the features.

Here, when the hidden layer 172 includes the pooling layer unit 174, the fully connected layer unit 175 is connected to the pooling layer unit 174 and learns abstract content by combining the features from the output of the pooling layer unit 174.

The output layer 176 maps the output for each category to be classified to a probability value using a function such as softmax. The result output from the output layer 176 may be passed to the learning unit 162 or the classification unit 163, where it is used for error backpropagation or output as response data.
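The mapping of per-category outputs to probability values may be sketched with the standard softmax function. Subtracting the maximum score before exponentiation is a common numerical-stability measure added here and is not taken from the specification.

```python
import math


def softmax(scores):
    """Map raw per-category scores to probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shifted for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for three categories.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
```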

The learning unit 162 performs supervised learning, which infers a function by applying a machine learning algorithm to the learning data and finds answers through the inferred function.

Through supervised learning, the learning unit 162 may generate a linear model representing the learning data and predict future events through that linear model.

Based on the data learned so far, the learning unit 162 determines how new data is classified relative to the previously learned data.

The learning unit 162 trains the artificial neural processing network 170 on the training set (Train Set) received from the data processing unit 161, which consists of sentence vectors and the corresponding vectorized semantic analysis sentences, and, using the deep learning feature values of each type, has the artificial neural processing network 170 learn whether subject words of the text data corresponding to the semantic analysis sentence are generated.

In one embodiment of the present invention, the artificial neural processing network 170 is trained by supervised learning.

Supervised learning is a method in which learning data and the corresponding output data are input together into the artificial neural processing network 170, and the weights of the connected edges are updated so that the output data corresponding to the learning data is output. For example, the artificial neural processing network 170 of the present invention may update the connection weights between artificial neurons using the delta rule and error backpropagation learning.

Error backpropagation learning estimates the error for the given learning data by a feed-forward computation, propagates the estimated error backward, starting at the output layer and moving toward the hidden layer 172 and the input layer 171, and updates the connection weights between artificial neurons in the direction that reduces the error.

The artificial neural processing network 170 may calculate the error from the result obtained through the input layer 171, the hidden layer 172, the fully connected layer unit 175, and the output layer 176 in that order, and, to correct the calculated error, may propagate the error back in the order of the output layer 176, the fully connected layer unit 175, the hidden layer 172, and the input layer 171 to update the connection weights.

The learning unit 162 is trained through supervised learning so that, when the feature values of the training set (Train Set), consisting of the input sentence vectors and the corresponding vectorized semantic analysis sentences, are taken as the input vector and passed through the input layer 171, the hidden layer 172, and the output layer 176 of the artificial neural processing network 170, the subject word generation result of the text data corresponding to the semantic analysis sentence is generated as the output vector.

The learning unit 162 uses the subject word generation results of the text data corresponding to the semantic analysis sentences as learning data and trains the artificial intelligence in conjunction with the artificial neural processing network 170.

The artificial neural processing network 170 knows in advance which output value (the subject word generation result of the text data corresponding to the semantic analysis sentence) should be produced when an input value (a sentence vector and the corresponding vectorized semantic analysis sentence) is input.

The classification unit 163 may output, as response data, the output data of the artificial neural processing network 170 whose connection weights have been updated through error backpropagation in the learning unit 162.

When learning data, test data, or new data not used for learning is input to the artificial neural processing network 170 having the updated connection weights, the classification unit 163 may acquire the result output through the input layer 171, the hidden layer 172, the fully connected layer unit 175, and the output layer 176, and output it as response data.

The artificial neural processing network 170 generates a deep learning-based classifier model through optimization, based on the input sentence vectors and corresponding vectorized semantic analysis sentences and on the subject word generation results of the corresponding text data.

Depending on the sentence vector and the corresponding vectorized semantic analysis sentence, the learning unit 162 may apply different individual element weights for the layers in the artificial neural processing network and for the connection strengths between the layers.

The learning unit 162 is trained through supervised learning to generate the subject word generation result of the text data as an output vector, and minimizes the error by correcting the weights while repeating the computation from the input layer 171 toward the output layer 176 and, conversely, from the output layer 176 toward the input layer 171.

Using the deep learning-based classifier model of the artificial neural processing network 170, the classification unit 163 outputs the result value of the response data (the subject word generation result of the text data) for the test data, namely the input sentence vector and the corresponding vectorized semantic analysis sentence.

In doing so, the classification unit 163 determines, for the vectorized semantic analysis sentence, whether a subject word generation result of the text data exists.

The output unit 164 displays on the display unit 150 whether a subject word generation result exists for the text data corresponding to the semantic analysis sentence, as received from the classification unit 163.

The subject word database unit 133b stores input sentences, semantic analysis sentences, and one or more subject words of the corresponding text data.

Experts such as librarians and critics read text data such as books, papers, magazines, and publications, and extract a plurality of subject words.

The data collection unit 140 receives, through the electronic device 110, the plurality of subject words that the experts extracted from each text data, and stores them.

When the artificial neural processing network 170 outputs, as its response, one or more first subject words of the text data corresponding to the semantic analysis sentence, the control unit 134 compares and analyzes the output first subject words against the second subject words extracted by the experts and stored in the data collection unit 140.

If a first subject word and a second subject word differ, the control unit 134 updates the subject word database unit 133b by adding the differing second subject word as a subject word of the text data.
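The comparison and update may be sketched as follows. The in-memory dictionary standing in for the subject word database unit 133b, the function name, and the example words are hypothetical.

```python
def update_subject_words(database, text_id, first_words, second_words):
    """Add to the database every expert-extracted (second) subject word that
    the network output (first subject words) does not contain.

    Returns the list of second subject words that differed.
    """
    first = set(first_words)
    differing = [word for word in second_words if word not in first]
    stored = database.setdefault(text_id, [])
    for word in differing:
        if word not in stored:
            stored.append(word)
    return differing


# Hypothetical stand-in for the subject word database unit 133b.
database = {"book_1": ["sea"]}
added = update_subject_words(database, "book_1", ["sea", "ship"], ["sea", "whale"])
print(added, database)  # ['whale'] {'book_1': ['sea', 'whale']}
```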

Operations according to the embodiments of the present specification can be implemented as a computer-readable program or code on a computer-readable recording medium. The computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored. In addition, the computer-readable recording medium may be distributed over computer systems connected by a network, so that the computer-readable program or code is stored and executed in a distributed manner.

When an embodiment is implemented in software, the techniques described above may be implemented as modules (processes, functions, and the like) that perform the functions described above. The modules may be stored in a memory and executed by a processor. The memory may be internal or external to the processor and may be connected to the processor by various well-known means.

In addition, the computer-readable recording medium may include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The program instructions may include not only machine language code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although some aspects of the present invention have been described in the context of an apparatus, they may also represent a description of the corresponding method, where a block or apparatus corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method may also be represented by a corresponding block or item or by a feature of a corresponding apparatus. Some or all of the method steps may be performed by (or using) a hardware device such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, one or more of the most important method steps may be performed by such a device.

In embodiments, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functions of the methods described herein. In embodiments, the field programmable gate array may operate together with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by some hardware device.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the present invention as set forth in the claims below.

100: Semantic analysis providing system 110: Electronic device
120: Communication network 130: Semantic analysis server
131: Sentence acquisition unit 132: Morpheme extraction unit
133: Data storage unit 134: Control unit
135: Target document database unit 136: Word embedding processing unit
137: Sentence embedding processing unit 138: Sentence similarity calculation unit
139: Communication unit 140: Data collection unit
150: Display unit 160: Learning model unit
170: Artificial neural processing network

Claims

A sentence acquisition unit that receives text information in the form of a sentence;
a morpheme extraction unit that separates the sentences input from the sentence acquisition unit into morphemes, which are the smallest units of words with meaning;
a data storage unit comprising: an associated word database unit that vectorizes and stores substitute words, synonyms, inferred words, and the semantic associations of emotional morphemes related to the sentiment of a word, each corresponding to the input word (morpheme); a subject word database unit that vectorizes and stores one or more subject words matching the input word; and a semantic analysis database unit that analyzes the meaning of words and of sentences composed of words, and vectorizes and stores the semantic analysis sentences classified by meaning;
a word embedding processing unit that converts the words extracted from the morpheme extraction unit into word vectors;
a sentence embedding processing unit that, based on the word vectors produced by the word embedding processing unit, converts into sentence vectors those sentences that satisfy a preset context connection criterion reflecting the context of the words (the connection between preceding and following word meanings);
a control unit that compares the sentences converted into sentence vectors against the semantic analysis database unit, vectorizes the semantic analysis sentences classified by meaning that correspond to those sentences, and extracts one or more subject words from the subject word database unit based on the vectorized semantic analysis sentences;
a target document database unit that vectorizes and stores a plurality of text data of books, papers, magazines, and publications in sentence units; and
a sentence similarity calculation unit that calculates the similarity between sentences using a vector set representing the extracted subject words and a vector set of the text data in the target document database unit, the foregoing units being comprised in a semantic analysis server, wherein the control unit, based on the similarity between sentences determined by the sentence similarity calculation unit, counts the same or similar sentences in each text data as a number of similar sentences, extracts the text data of a preset top rank in descending order of the number of similar sentences, and, for each extracted top-ranked text data, generates page notification information indicating the page number and the paragraph position of the same or similar sentences,
wherein the sentence similarity calculation unit includes: a first sentence similarity calculation unit that, in the sentence embedding technique, calculates a Euclidean distance indicating the similarity between two sentences p and q according to Equation 1 below, and determines a sentence to be a first sentence that is the same or similar if the calculated Euclidean distance is greater than or equal to a preset first threshold; and
a second sentence similarity calculation unit that, in the sentence embedding technique, calculates a cosine similarity according to Equation 2 below, and determines a sentence to be a second sentence that is the same or similar if the calculated cosine similarity is greater than or equal to a preset second threshold,
[Equation 1]

where sentence p is the sentence input to the sentence acquisition unit, sentence q is a sentence of the plurality of text data stored in the target document database unit, and the doc2vec vectors of sentence p, which consists of the subject words, and of sentence q are the operands of Equation 1,
[Equation 2]

where a represents the sentence input to the sentence acquisition unit, and b represents the plurality of text data stored in the target document database unit,
wherein the control unit compares the one or more first sentences determined to be the same or similar by the first sentence similarity calculation unit with the one or more second sentences determined to be the same or similar by the second sentence similarity calculation unit, searches the text data of the target document database unit for the first sentences and for the second sentences that do not overlap with the first sentences (the complement of the first sentences), counts the plurality of first sentences and the non-overlapping second sentences found in the target document database unit as the number of similar sentences, and extracts the text data of the preset top rank in descending order of the counted number of similar sentences,
wherein the target document database unit includes: a text conversion unit that converts a PDF file, which is the input text data, into a text file (TXT file); and
a page number identification unit that generates a page number each time the text conversion unit converts a text file, inserts an identification tag at the start of the first page of the text file, inserts identification tags in order for each paragraph so that pages can be identified, and stores, in a storage unit, the text file containing page information consisting of the identification tags and the page numbers,
wherein the semantic analysis server further includes a data collection unit that receives and stores sentence vectors composed of words, vectorized semantic analysis sentences, and the subject words of the text data corresponding thereto, and the sentence vectors and the vectorized semantic analysis sentences are input into an artificial neural processing network, which outputs in response one or more subject words of the text data corresponding to the semantic analysis sentences; the foregoing constituting a semantic analysis providing system that analyzes and finds the meaning of a sentence.
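
The similarity test and ranking recited in the claim can be illustrated with a minimal Python sketch. This is not the patented implementation. Equations 1 and 2 are not reproduced in this text, so the sketch assumes the standard Euclidean distance and cosine similarity over plain lists of floats standing in for doc2vec vectors. Because the claim treats a value at or above the first threshold as "same or similar", the sketch further assumes the distance is mapped to a score 1/(1+d). That mapping, the function names, and the document layout (`doc_id -> [(page, paragraph, vector)]`) are all hypothetical.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def cosine_similarity(a, b):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def distance_score(p, q):
    """ASSUMPTION: map distance to a score in (0, 1] so larger means closer."""
    return 1.0 / (1.0 + euclidean_distance(p, q))


def rank_documents(query_vec, documents, first_threshold, second_threshold, top_n):
    """Count similar sentences per document and return the top_n documents.

    documents: hypothetical mapping doc_id -> list of (page, paragraph, vector).
    A sentence is counted once if it passes the first test (distance-based)
    or the second test (cosine), i.e. first sentences plus the second
    sentences that do not overlap with them.
    """
    counts = Counter()
    locations = {}
    for doc_id, sentences in documents.items():
        for page, paragraph, vec in sentences:
            is_first = distance_score(query_vec, vec) >= first_threshold
            is_second = cosine_similarity(query_vec, vec) >= second_threshold
            if is_first or is_second:
                counts[doc_id] += 1
                locations.setdefault(doc_id, []).append((page, paragraph))
    # most_common sorts by descending count of similar sentences
    return [(doc_id, n, locations[doc_id]) for doc_id, n in counts.most_common(top_n)]
```

Each returned tuple carries the document identifier, its number of similar sentences, and the (page, paragraph) positions that would feed the page notification information. Taking the union of the two tests, rather than counting each test separately, keeps a sentence that passes both from being counted twice.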