KR102607826B1

KR102607826B1 - Deep neural network-based document analysis system and method, recording medium storing a program for implementing the same, and computer program stored on a medium

Info

Publication number: KR102607826B1
Application number: KR1020210136807A
Authority: KR
Inventors: 박정진; 오영섭; 백완식; 성희모; 박영민
Original assignee: 비큐리오 주식회사; 삼성엔지니어링 주식회사
Priority date: 2021-10-14
Filing date: 2021-10-14
Publication date: 2023-11-30
Also published as: KR20230053373A

Abstract

This embodiment relates to a deep neural network-based document analysis system and method, a recording medium storing a program for implementing the same, and a computer program stored on a medium. The method includes: extracting keywords from a sentence or document through morphological analysis and base-form processing; determining whether a dictionary comprising a plurality of individual dictionaries exists and, if it does not, generating the dictionary; quantifying each sentence using the extracted keywords and the dictionary, converting it into a numerical value with a number of bits equal to the number of individual dictionaries; and analyzing the sentence or document by applying a deep neural network to the numerical value.

Description

Deep neural network-based document analysis system and method, recording medium storing a program for implementing the same, and computer program stored on the medium

The present invention relates to a deep neural network-based document analysis system and method, a recording medium storing a program for implementing the same, and a computer program stored on the medium. It learns and classifies the semantic elements of a text document sentence by sentence so that the risk of the document can be analyzed.

In general, the form and content of documents such as contracts or proposals are defined in a more or less fixed format by laws, regulations, and the like. It is also very difficult for a user without relevant expertise to read such a document, understand its meaning, and find the risks or omissions in it. This can of course be addressed by having an expert advise on and review the document. However, an expert reviews the document within a limited scope of information, so the accuracy of the review suffers, and even an expert cannot cover every item of a large document. As a result, omissions, incorrect items, unfavorable items, and risks within the document are not analyzed effectively.

(Patent Document 1) Korean Registered Patent Publication No. 10-2118603
(Patent Document 2) Korean Laid-open Patent Publication No. 10-2020-0103152

The present invention was devised to solve the problems described above. To classify documents according to classification criteria set by the user, it quantifies the semantic elements of each sentence, trains a deep neural network on the resulting values, and classifies documents according to the classification criteria based on the training results. To this end, it provides a deep neural network-based document analysis system and method, a recording medium storing a program for implementing the same, and a computer program stored on the medium.

The present invention provides a deep neural network-based document analysis system including: a preprocessing module that extracts keywords for each sentence of a target document; a dictionary module in which the keywords obtained through the preprocessing module and a learning control module are grouped into similar categories, so that a number of words are categorized and sorted into a plurality of individual dictionaries; a quantification module that quantifies the keywords extracted by the preprocessing module on the basis of the dictionary module; a deep neural network module that learns, or performs predictive analysis on, the numerical values from the quantification module through a deep neural network; a learning control module that controls, according to learning criteria, the learning by the deep neural network module of sentences that have passed through the quantification module; and an analysis module that controls the analysis of a document to be analyzed.

To perform the operations described above, the preprocessing module includes a sentence separation module that separates sentences from a document, a word separation module that separates words from the separated sentences, and a base form extraction module that extracts and selects the base form of each separated word.

The base form extraction module includes a morpheme analysis unit that analyzes the morphemes of the separated words, a stop word removal unit that removes stop words and unnecessary particles according to the result of the morpheme analysis, and a base form selection unit that matches words to their base forms and selects the base forms of polysemous words, compound nouns, and the like.

The dictionary module includes a category classification unit that receives and classifies the keywords in a document, an individual dictionary generation unit that generates an individual dictionary for each classified category, a keyword expansion unit that expands the keywords in each generated individual dictionary, and a separate dictionary storage unit that stores the generated individual dictionaries.

The quantification module quantifies each sentence using the keywords extracted for that sentence, on the basis of the dictionary module, and expresses the sentence with a number of bits equal to the number of individual dictionaries. It checks whether each keyword is present in each individual dictionary, and if the keyword is present in the n-th individual dictionary, it sets the position value for that n-th individual dictionary to 1.

For each sentence, the quantification module determines whether a keyword in the sentence matches a word in the category word part of each individual dictionary, assigning 1 if it matches and 0 if it does not.

The quantification module compares the keywords with the individual dictionaries in sequence and quantifies the result as 0s and 1s. If several keywords match a single individual dictionary, the position value of that individual dictionary is set to 1; if a single keyword matches several individual dictionaries, the position value of each of those individual dictionaries is set to 1.
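The quantification rule can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the function name `quantify_sentence`, the use of plain sets for the category word parts, and the English contract keywords are all assumptions made for the example.

```python
from typing import Iterable, List, Set


def quantify_sentence(keywords: Iterable[str],
                      dictionaries: List[Set[str]]) -> List[int]:
    """Return one bit per individual dictionary (1 = at least one keyword matched)."""
    keyword_set = set(keywords)
    # Bit n is 1 if any keyword appears in the n-th dictionary. Several matches
    # in one dictionary still give a single 1, and a keyword found in several
    # dictionaries sets the bit of each of them.
    return [1 if keyword_set & words else 0 for words in dictionaries]


# Hypothetical individual dictionaries, for illustration only.
dictionaries = [
    {"payment", "invoice", "price"},    # dictionary 0
    {"delay", "penalty", "damages"},    # dictionary 1
    {"delivery", "shipment", "delay"},  # dictionary 2
    {"warranty", "defect"},             # dictionary 3
]

vector = quantify_sentence(["penalty", "delay", "payment"], dictionaries)
print(vector)  # [1, 1, 1, 0]
```

Here "penalty" and "delay" both fall in dictionary 1 and still produce a single 1, while "delay" alone sets the bits of dictionaries 1 and 2. The length of the vector always equals the number of individual dictionaries.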

The present invention also provides a deep neural network-based document analysis method including: extracting keywords from a sentence or document through morphological analysis and base-form processing; determining whether a dictionary comprising a plurality of individual dictionaries exists and, if it does not, generating the dictionary; quantifying each sentence using the extracted keywords and the dictionary, converting it into a numerical value with a number of bits equal to the number of individual dictionaries; and analyzing the sentence or document by applying a deep neural network to the numerical value.

The keyword extraction step includes: separating sentences from the document and separating words from the separated sentences; extracting the base form of each separated word, in which unnecessary particles are removed, nouns are reduced to the singular, verbs are replaced with their present-tense base form, the base form of each part is extracted when morphological analysis shows that a single word combines several nouns or meanings, and both base forms are extracted when a noun base form and a verb base form both exist; and deleting the words that are listed in a separate dictionary of unnecessary terms.

The dictionary generation step includes: classifying the categories into which similar keywords can be grouped, through keyword analysis such as TF-IDF analysis or category analysis; sorting the classified categories into individual dictionaries, storing the category name as the dictionary identification part and the keywords belonging to that category as the category word part; and increasing the number of words in each individual dictionary by expanding the keywords sorted into it with synonyms and near-synonyms.
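One possible in-memory shape for an individual dictionary is sketched below, assuming a category name plus a set of words, with a single pass of synonym expansion. The class name `IndividualDictionary`, the `expand` method, and the synonym table are hypothetical; the source does not specify where synonyms come from or how they are stored.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class IndividualDictionary:
    name: str                                     # dictionary identification part (category name)
    words: Set[str] = field(default_factory=set)  # category word part

    def expand(self, synonyms: Dict[str, Iterable[str]]) -> None:
        """Add the listed synonyms of every word currently in the dictionary.

        Only one pass is made, so synonyms of newly added words are not followed.
        """
        for word in list(self.words):
            self.words.update(synonyms.get(word, ()))


# Hypothetical synonym table, for illustration only.
synonyms = {"penalty": ["fine", "sanction"], "delay": ["postponement"]}

liability = IndividualDictionary("liability", {"penalty", "delay"})
liability.expand(synonyms)
print(sorted(liability.words))
# ['delay', 'fine', 'penalty', 'postponement', 'sanction']
```

Keeping the category name separate from the word set mirrors the split between the identification part and the category word part, and a set makes the later membership test per keyword cheap.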

Each sentence is quantified by comparing each keyword of the sentence against the category word part of every individual dictionary, setting the vector value of an individual dictionary to 1 if it matches and to 0 if it does not, so that the sentence is quantified with a number of bits equal to the total number of individual dictionaries.

If one keyword matches words in several individual dictionaries, the output value for every matching individual dictionary is set to 1; if several keywords match within one individual dictionary, the output value of that one individual dictionary is set to 1.

The present invention also provides a recording medium storing a program for implementing the deep neural network-based document analysis method described above.

The present invention also provides a computer program stored on a medium to implement the deep neural network-based document analysis method.

As described above, to classify documents according to classification criteria set by the user, the present invention quantifies the semantic elements of each sentence, trains a deep neural network on them, and can classify documents according to the classification criteria based on the training results.

In addition, the categories of the words appearing in a document are identified and compiled into dictionaries, and the sentences in the document are then quantified using these dictionaries, which can improve the analytical capability of the deep neural network analysis and increase its operating speed.

Furthermore, because the range of the quantification is limited to the number of individual dictionaries in the dictionary, a simpler deep neural network structure can be expected.
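A rough arithmetic sketch shows why a narrower input can simplify the network. It compares the parameter count of a first fully connected layer fed with one bit per individual dictionary against a hypothetical alternative fed with one input per distinct keyword. The alternative encoding, the vocabulary size, the dictionary count, and the hidden width are all assumed figures chosen for illustration; none of them come from the source.

```python
def first_layer_parameters(input_width: int, hidden_units: int) -> int:
    """Weights plus biases of a single fully connected layer."""
    return input_width * hidden_units + hidden_units


vocabulary_size = 20_000  # assumed: one input per distinct keyword
dictionary_count = 64     # assumed: one input per individual dictionary
hidden_units = 128        # assumed hidden layer width

per_keyword = first_layer_parameters(vocabulary_size, hidden_units)
per_dictionary = first_layer_parameters(dictionary_count, hidden_units)
print(per_keyword, per_dictionary)  # 2560128 8320
```

Under these assumed sizes the first layer shrinks by a factor of roughly 300, and the input width stays fixed even when new words are added to existing dictionaries.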

Figure 1 is a conceptual diagram illustrating a deep neural network-based document analysis system according to an embodiment of the present invention.
Figure 2 is a block diagram illustrating a preprocessing module according to an embodiment.
Figure 3 is a block diagram illustrating a base form extraction module according to an embodiment.
Figure 4 is a block diagram illustrating a dictionary module according to an embodiment.
Figure 5 is a conceptual diagram illustrating the operation of a quantification module according to an embodiment.
Figure 6 is a block conceptual diagram illustrating a deep neural network module according to an embodiment.
Figure 7 is a flowchart illustrating a deep neural network-based document analysis method according to an embodiment of the present invention.
Figure 8 is a flowchart illustrating a keyword extraction method according to an embodiment.
Figure 9 is a flowchart illustrating a sentence quantification method according to an embodiment.
Figure 10 is a flowchart illustrating a sentence learning method according to an embodiment.
Figure 11 is a flowchart illustrating a dictionary generation method according to an embodiment.

Hereinafter, embodiments of the present invention are described in more detail with reference to the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art are fully informed of the scope of the invention. In the drawings, like reference numerals refer to like elements.

It should be made clear that the division of components in this specification is merely a division by the main function each component is responsible for. That is, two or more of the components described below may be combined into one component, or one component may be divided into two or more components by more finely subdivided function. In addition to its own main function, each component described below may additionally perform some or all of the functions of other components, and some of the main functions of each component may of course be performed exclusively by another component. Therefore, the presence or absence of each component described in this specification should be interpreted functionally. For this reason, it is made clear that the configuration of the components of the deep neural network-based document analysis system and method, the recording medium storing a program for implementing the same, and the computer program stored on the medium may vary within the limits in which the object of the present invention can be achieved.

In this specification, relational terms such as first and second, or upper and lower, may be used only to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, product, or device that includes a list of components does not include only those components but may include other components not explicitly listed or inherent to such process, method, product, or device. A component introduced by "comprises a ..." excludes, without further limitation, the presence of additional identical components in the process, method, product, or device that includes the component.

The features and advantages of the present invention, and the manner of achieving them, are described in detail below together with the accompanying drawings, with reference to the deep neural network-based document analysis method, the recording medium storing a program for implementing the same, and the computer program stored on the medium.

However, in the drawings and the detailed description, configurations and methods that those working in the field of deep neural network-based document analysis methods, recording media storing programs for implementing them, and computer programs stored on media would ordinarily know are described only briefly or omitted, and the description is limited to the parts needed for a clear understanding of the disclosure. Accordingly, the present invention is not limited to the embodiments disclosed below and may be implemented in various forms and methods; the present invention is defined by the scope of the claims.

Figure 1 is a conceptual diagram illustrating a deep neural network-based document analysis system according to an embodiment of the present invention, Figure 2 is a block diagram illustrating a preprocessing module according to an embodiment, Figure 3 is a block diagram illustrating a base form extraction module according to an embodiment, Figure 4 is a block diagram illustrating a dictionary module according to an embodiment, Figure 5 is a conceptual diagram illustrating the operation of a quantification module according to an embodiment, and Figure 6 is a block conceptual diagram illustrating a deep neural network module according to an embodiment.

As shown in Figures 1 to 6, the deep neural network-based document analysis system according to this embodiment includes: a preprocessing module 100 that extracts keywords for each sentence of a target document; a dictionary module 200 in which the keywords obtained through the preprocessing module 100 and a learning control module 500 are grouped into similar categories, so that a number of words are categorized and sorted into a plurality of individual dictionaries; a quantification module 300 that quantifies the keywords extracted by the preprocessing module 100 on the basis of the dictionary module 200; a deep neural network module 400 that learns, or performs predictive analysis on, the numerical values from the quantification module 300 through a deep neural network; the learning control module 500, which controls, according to learning criteria, the learning by the deep neural network module 400 of sentences that have passed through the quantification module 300; and an analysis module 600 that controls the analysis of a document to be analyzed.

The system may further include a DB module that stores the information and data processed or stored by the system and by each module.

The preprocessing module 100 extracts keywords from a sentence or document through morphological analysis and base-form processing. A computer text document may be used as the document.

The preprocessing module 100 separates sentences from the target document and separates each word from the separated sentences. Sentences may be separated using sentence delimiters and words may be separated based on the spacing between characters, but this is not limiting, and various techniques may be used to separate them. It is effective to assign a unique identification number or symbol to each sentence after the sentences are separated from the document. Then, when keywords are extracted by classifying the words in a sentence, it is effective for the unique identification number of that sentence to be placed in front of each keyword. For this purpose, various morphological analysis techniques may be used, such as classification into free and bound morphemes, or into lexical and grammatical morphemes. Morphological analysis allows words to be classified by their grammatical meaning as well as their substantive meaning.
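A minimal sketch of the separation step, assuming that sentences end with `.`, `!`, or `?` followed by whitespace and that words are delimited by whitespace. The function name `split_document` and the sample text are hypothetical, and a production system would more likely rely on a morphological analyzer, since this naive rule breaks on abbreviations such as "e.g.".

```python
import re
from typing import List, Tuple


def split_document(text: str) -> List[Tuple[int, List[str]]]:
    """Split text into sentences and words, tagging each sentence with an ID."""
    # Split after a sentence delimiter that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    result = []
    for sentence_id, sentence in enumerate(sentences, start=1):
        # Drop the trailing delimiter, then split on whitespace.
        words = sentence.rstrip(".!?").split()
        result.append((sentence_id, words))
    return result


doc = "The buyer shall pay the price. The seller shall deliver the goods!"
for sentence_id, words in split_document(doc):
    print(sentence_id, words)
```

Carrying the sentence ID alongside the word list is what later allows every extracted keyword to be traced back to the sentence it came from.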

Then, the base form of each word is extracted from the separated words and stored as a keyword for the sentence. To extract the base form, unnecessary particles are removed and nouns are reduced to the singular. Pronouns and articles are removed, along with words that are not significant. Verbs are expressed in their present-tense base form. If a word has both a noun form and a verb form, it is effective to extract both as base forms. When morphological analysis shows that two nouns are combined in one word, each of them may also be extracted as a base form. This is of course not limiting; the base form of a word may be identified through various kinds of base-form processing, and it is effective to use verbs and nouns as base forms. Articles may of course also be used as base forms. However, it is effective to remove particles, interjections, and pronouns, and to remove unnecessary words using a separately compiled individual dictionary, that is, a stop word dictionary.
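Particle removal followed by stop word filtering can be sketched as below. This is a deliberately naive suffix-stripping illustration, not a morphological analyzer: the particle list, the stop word set, the function names, and the Korean sample tokens are all assumptions, and the sketch does not handle verb conjugation at all.

```python
from typing import Iterable, List

# Assumed, incomplete particle list; longer particles are checked first.
PARTICLES = ("에서", "은", "는", "이", "가", "을", "를")
# Assumed stop word dictionary.
STOP_WORDS = {"등"}


def strip_particle(token: str) -> str:
    """Remove one trailing particle, if the token ends with one."""
    for particle in PARTICLES:
        if token.endswith(particle) and len(token) > len(particle):
            return token[: -len(particle)]
    return token


def clean_tokens(tokens: Iterable[str]) -> List[str]:
    """Strip particles, then drop anything listed in the stop word dictionary."""
    stems = (strip_particle(token) for token in tokens)
    return [stem for stem in stems if stem not in STOP_WORDS]


tokens = ["이순신장군은", "임진왜란에서", "거북선", "등을", "대승을"]
print(clean_tokens(tokens))
# ['이순신장군', '임진왜란', '거북선', '대승']
```

Blind suffix stripping would damage nouns that happen to end in a particle-like syllable, which is one reason the description relies on morphological analysis rather than string matching.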

To perform the operations described above, the preprocessing module 100 may include a sentence separation module 110 that separates sentences from a document, a word separation module 120 that separates words from the separated sentences, and a base form extraction module 130 that extracts and selects the base form of each separated word. The base form extraction module 130 may include a morpheme analysis unit 131 that analyzes the morphemes of the separated words, a stop word removal unit 132 that removes stop words and unnecessary particles according to the result of the morpheme analysis, and a base form selection unit 133 that matches words to their base forms and selects the base forms of polysemous words, compound nouns, and the like. The morpheme analysis unit may of course instead be included in the word separation module, so that word separation and morpheme classification proceed at the same time.

For example, consider extracting keywords from the sentence "이순신장군은 임진왜란에서 거북선 등을 이용하여, 대승을 거뒀다" ("Admiral Yi Sun-sin won a great victory in the Imjin War using turtle ships and the like"). The sentence is first separated into words: "이순신장군은", "임진왜란에서", "거북선", "등을", "이용하여", "대승을", "거뒀다". Particles, interjections, and stop words are then removed from each word and the base forms are extracted, giving "이순신장군" (Admiral Yi Sun-sin), "임진왜란" (Imjin War), "거북선" (turtle ship), "이용" (use, noun), "이용하다" (to use), "대승" (great victory), and "거두다" (to achieve). As noted above, for a compound noun made of consecutive nouns it is effective to split it, so "이순신장군" is separated into "이순신" (Yi Sun-sin) and "장군" (admiral). In the case of "이용하여", the base form can be both the noun "이용" and the verb "이용하다", and in this embodiment it is effective to treat both as base forms. More generally, for verbs that morphological analysis shows to be formed with "하다" (to do) or "되다" (to become), it is in most cases effective to treat both the noun and the verb as base forms. That is, "공부하다" (to study) yields "공부" (study) and "공부하다", and "발견되다" (to be discovered) yields "발견" (discovery) and "발견되다". Compound nouns are also split, and every form that could be a base form is extracted. For example, "독립문" (Independence Gate) yields "독립", "문", and "독립문", and "영문법" (English grammar) yields "영문", "문법", "법", and "영문법". The base forms extracted in this way are stored. It is effective for the stored data to be prefixed with the unique identification number of the sentence from which it was taken, which makes it possible to check how many keywords a sentence consists of. As mentioned above, the number of keywords may equal the number of words in the sentence, but this is not limiting, and a larger number of keywords may be generated.
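The compound noun case can be sketched as a lookup of every contiguous piece of the word in a lexicon of known base forms. This is only one assumed way to enumerate the candidates: the function name `compound_candidates` and the small per-word lexicons are hypothetical, and the source does not say how the candidate base forms are actually found.

```python
from typing import List, Set


def compound_candidates(word: str, lexicon: Set[str]) -> List[str]:
    """Return every contiguous piece of the word that is a known base form."""
    found: List[str] = []
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon and piece not in found:
                found.append(piece)
    return found


# Hypothetical lexicons holding only the base forms named in the example.
gate_lexicon = {"독립", "문", "독립문"}
grammar_lexicon = {"영문", "문법", "법", "영문법"}

print(compound_candidates("독립문", gate_lexicon))
# ['독립', '독립문', '문']
print(compound_candidates("영문법", grammar_lexicon))
# ['영문', '영문법', '문법', '법']
```

Because overlapping pieces such as "영문" and "문법" are both kept, a sentence can yield more keywords than it has words, which matches the remark at the end of the example.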

After the keywords in the document and its sentences have been extracted by the preprocessing module 100, the dictionary module 200 uses these keywords to generate the individual dictionaries, and the quantification module 300 quantifies each sentence using the dictionary module 200.

The following describes how the dictionary module 200 creates the individual dictionaries used for quantification. The dictionary module 200 is created under the learning control module 500, and its contents can be reinforced, changed, or extended by the analysis module 600.

The dictionary module 200 groups the keywords produced by the preprocessing module 100 into similar categories and classifies each group of keywords as an individual dictionary.

The dictionary module 200 receives all generated keywords and extracts categories by analyzing them. A category is a set of keywords, that is, words, that have similar meanings or belong to the same class and can therefore be grouped. To separate categories, this embodiment can use TF-IDF analysis or build a classification scheme based on criteria such as meaning or similarity. That is, words can be weighted and separated according to their frequency in the document. They can also be separated by evaluating keyword importance rather than frequency, or by evaluating both frequency and importance. This gives a numerical value for how often a word appears in a document and how important it is. On that basis, various classification schemes can be used to determine which categories of words occur most often and whether their values are high or low.
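As an illustration of the frequency-and-importance weighting mentioned above, the following is a minimal TF-IDF sketch in plain Python. The weighting formula (term frequency times the logarithm of inverse document frequency) is one common variant and is assumed here; the patent does not specify which variant is used, and the function name `tf_idf` and the sample keyword lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of keyword lists. Returns one {keyword: score} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each keyword once per document
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [["cost", "cost", "day"], ["day", "bridge"]]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

A keyword that appears in every document ("day" here) scores zero, while a keyword that is frequent in one document and rare elsewhere ("cost") scores highest.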

Once keywords have been divided into identical and/or similar categories, each category is designated as one individual dictionary. An individual dictionary consists of a dictionary division part and a category word part. The dictionary division part may be a name, an ID, or a classification criterion that identifies the category, and the category word part may be the space that holds the words belonging to that category. Each category separated in this way forms its own individual dictionary.

It is effective to store the words of a category without duplicates. A single word, that is, a keyword, may also belong to several categories. The meaning of each word is determined, and a word with several meanings is effectively placed in the individual dictionary of every category that matches one of those meanings.
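One way to picture an individual dictionary is as a label plus a duplicate-free word set. The sketch below assumes that representation; the class name `IndividualDictionary`, the helper `categories_of`, and the placement of "capital" in a PLACE dictionary are illustrative assumptions rather than details from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class IndividualDictionary:
    label: str                               # dictionary division part (category name or ID)
    words: set = field(default_factory=set)  # category word part; a set cannot hold duplicates

    def add(self, word):
        self.words.add(word.lower())


def categories_of(keyword, dictionaries):
    """Labels of every individual dictionary whose word part contains the keyword."""
    return [d.label for d in dictionaries if keyword.lower() in d.words]


money = IndividualDictionary("MONEY")
place = IndividualDictionary("PLACE")
for word in ["capital", "cost", "Cost"]:  # "Cost" duplicates "cost" and is absorbed
    money.add(word)
place.add("capital")                      # assumed second meaning: a capital city
place.add("bridge")

print(sorted(money.words))
print(categories_of("capital", [money, place]))
```

Using a set for the word part enforces the no-duplicates rule within a category, while nothing prevents the same word from appearing in several dictionaries.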

To illustrate the category classification of the dictionary module 200 with a construction company's contracts or proposal (RFP) documents: the dictionary module 200 may analyze the keywords produced by the preprocessing module 100 and divide them into a time category, a material category, a device or equipment category, a place category, a person category, a cost (money) category, and so on. There may be more categories than these, and a category that is absent from the current analysis may be added later. That is, the categories are not limited to these and can be added or changed as the dictionary module 200 analyzes various similar documents.

After the categories have been separated, the category word part of each individual dictionary is expanded on the basis of the words, that is, keywords, it already contains. In other words, synonyms and near-synonyms are added. It is effective to use language-dictionary data for this expansion. Synonym frequencies may be examined and used to assign weights, and dictionary position information, that is, index information, may also be used. For example, if the keywords have placed "가격" (price) and "원가" (cost) in the cost (money) category, expansion can add words such as 대가 (consideration), 비용 (cost), 지출 (expenditure), 경비 (expense), 소비 (consumption), 가치 (value), 시세 (market price), 시가 (current price), 희생 (sacrifice), and 손실 (loss). Likewise, if an individual dictionary contains the word building, expansion can add words such as structure, edifice, and grocerles.
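The expansion step can be sketched as a lookup against a thesaurus. The `THESAURUS` table below is a hypothetical stand-in for the language-dictionary data the text refers to, and `expand` is an assumed helper name; frequency weighting is omitted for brevity.

```python
# Hypothetical thesaurus standing in for language-dictionary data.
THESAURUS = {
    "building": ["structure", "edifice"],
    "cost": ["expense", "expenditure", "price"],
}


def expand(words, thesaurus):
    """Return the category word part extended with known synonyms of each word."""
    expanded = set(words)
    for word in words:
        expanded.update(thesaurus.get(word, []))
    return expanded


place_words = {"building", "bridge"}
print(sorted(expand(place_words, THESAURUS)))
```

Words without a thesaurus entry ("bridge" here) are kept unchanged, so expansion never removes anything from the dictionary.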

As described above, the dictionary module 200 separates categories using keywords, forms an individual dictionary for each category, and expands the keywords. When the resulting dictionaries are visualized, they can be expressed as in the table below.

[Table 1]
Dictionary division part | Category word part
Individual dictionary 1, TIME | Time, april, day, dec, December, everyday, hour, jan, January, jul, july, jun, june, march, may, minute, month …
Individual dictionary 2, PERSON | Adjudicator, advocate, affidavit, arbiter, arbitral, arbitration, employ, employee, employer, engineer, person, people …
Individual dictionary 3, MONEY | Bankrupt, benefit, bidding, bill, bond, capital, cash, cent, coin, cost, debt, debtor, discount, dollar, earn, euro, exchange …
Individual dictionary 4, PLACE | Boundary, bridge, building, central, country, factory, farm, gate, geo, hall, highway, house, indoor, land, location …
... | ...

As the table shows, more individual dictionaries can be added, and more categories can be added to the dictionary division part. The category word part can also grow, both as documents are analyzed and as words are expanded.

To perform these operations, the dictionary module 200 includes a category classification unit 210 that receives and classifies the keywords in a document, an individual dictionary generation unit 220 that generates an individual dictionary for each classified category, and a keyword expansion unit 230 that expands the keywords in each generated individual dictionary. It may further include a separate dictionary storage unit 240 that stores the generated individual dictionaries.

The quantification module 300 converts the keywords extracted for each sentence into numerical form on the basis of the dictionary module 200 described above. The individual dictionaries are independent of one another, so a sentence can be expressed as a vector with as many dimensions as there are dictionaries.

Under the learning control module 500 and the analysis module 600, the quantification module 300 compares the keywords that the preprocessing module 100 extracted for each sentence with each individual dictionary generated by the dictionary module, and records a numerical value according to whether there is a match. A sentence can thus be expressed with as many bits as there are individual dictionaries.

For each sentence, the quantification module 300 determines whether the keywords in the sentence match words in the category word part of each individual dictionary. A match is expressed as 1 and a non-match as 0. A sentence can therefore be expressed with as many bits as there are individual dictionaries. For example, with five individual dictionaries, a sentence is expressed in 5 bits, such as "00000".

The quantification module 300 checks whether each keyword is present in each individual dictionary, and if the keyword is present in the n-th individual dictionary, it sets the value at the n-th position to 1. For example, if a sentence consists of five keywords and there are five individual dictionaries as above, the keywords are compared, sequentially or simultaneously, with the category word parts of the individual dictionaries, and the result can be expressed as in the table below.

[Table 2]
Category | Individual dictionary 1 | Individual dictionary 2 | Individual dictionary 3 | Individual dictionary 4 | Individual dictionary 5
Keyword 1 | 0 | 1 | 0 | 0 | 0
Keyword 2 | 0 | 0 | 0 | 0 | 0
Keyword 3 | 0 | 1 | 0 | 1 | 0
Keyword 4 | 0 | 0 | 0 | 0 | 0
Keyword 5 | 0 | 0 | 0 | 0 | 0
Quantification result | 0 | 1 | 0 | 1 | 0

In Table 2, there are five individual dictionaries, so a sentence is expressed in 5 bits. The sentence has five keywords, which are compared with the individual dictionaries. None of the keywords is in individual dictionary 1, so the first bit is 0. Keyword 1 and keyword 3 are both in individual dictionary 2, so the second bit is 1. A bit is set to 1 as soon as at least one keyword is found in the dictionary, and it is still just 1 even if every keyword is found there. No keywords are in individual dictionaries 3 and 5, so the third and fifth bits are 0. Keyword 3 is in individual dictionary 4, so the fourth bit is 1. Keyword 3 appears in both individual dictionary 2 and individual dictionary 4, but because each individual dictionary is checked independently, each of them is marked 1. Quantifying the example sentence with its five keywords therefore yields the value "01010".
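The procedure behind Table 2 can be sketched as a per-keyword match matrix followed by a column-wise OR. The function names and the placeholder dictionary contents (`k1` to `k5`, `w1`, `w3`, `w5`) are assumptions chosen only so that the matches mirror the table.

```python
def match_matrix(keywords, dictionaries):
    """One row per keyword, one column per individual dictionary (1 = keyword found)."""
    return [[1 if kw in words else 0 for words in dictionaries] for kw in keywords]


def encode_sentence(keywords, dictionaries):
    """A bit is 1 if at least one keyword matched that dictionary, else 0."""
    rows = match_matrix(keywords, dictionaries)
    return "".join(
        "1" if any(row[col] for row in rows) else "0"
        for col in range(len(dictionaries))
    )


# Placeholder word sets: keyword 1 is in dictionary 2, keyword 3 is in dictionaries 2 and 4.
dictionaries = [{"w1"}, {"k1", "k3"}, {"w3"}, {"k3"}, {"w5"}]
keywords = ["k1", "k2", "k3", "k4", "k5"]

for row in match_matrix(keywords, dictionaries):
    print(row)
print(encode_sentence(keywords, dictionaries))  # 01010
```

Because the result is an OR over the rows, two matching keywords in the same dictionary produce the same bit as one.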

As another example, if the quantified value is "000010001000010100", there are 18 individual dictionaries, and words matching the sentence's keywords are found in the 5th, 9th, 14th, and 16th individual dictionaries.
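Reading a quantified value back into dictionary positions is a one-line scan. The helper name `active_dictionaries` is an assumption; positions are counted from 1 to match the text.

```python
def active_dictionaries(bits):
    """1-based positions of the individual dictionaries that matched."""
    return [position for position, bit in enumerate(bits, start=1) if bit == "1"]


value = "000010001000010100"
print(len(value), active_dictionaries(value))  # 18 [5, 9, 14, 16]
```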

In the present invention, the number of dictionaries checked by the quantification module 300 is not limited and may vary with the category classification. Preferably, 10 to 1000 individual dictionaries can be used. However, too many individual dictionaries may load the overall system, and too few may make the document analysis inaccurate. It is therefore preferable to use 50 to 200 individual dictionaries. It is effective to vary this number according to the field of the documents being analyzed. As mentioned above, the number of individual dictionaries may grow depending on the field of the documents, and individual dictionaries may also be merged as needed to reduce their number.

In the earlier example for the preprocessing module 100, extracting keywords from the sentence "이순신장군은 임진왜란에서 거북선 등을 이용하여, 대승을 거뒀다" gives "이순신장군", "임진왜란", "거북선", "이용", "이용하다", "대승", and "거두다". Here the word "이용하여" produces two keywords, "이용" (use, noun) and "이용하다" (to use, verb), which can be viewed as giving that keyword extra weight. However, in the individual dictionaries generated by the dictionary module 200, "이용" and "이용하다" belong to the same category, so the individual dictionary that contains one is the same dictionary that contains the other. The corresponding value is therefore simply 1, and the extra weight cancels out. Even so, classifying keywords in this way, as both "이용" and "이용하다", has the advantage of allowing a more specific analysis of the document.

The deep neural network module 400 uses the numerical values from the quantification module 300 either to train for sentence and document analysis or to analyze sentences and documents. The deep neural network module 400 includes an input unit 410 that receives data, an output unit 420 that outputs results, and a hidden unit 430 that performs the deep neural network analysis. In this embodiment, it is effective for the number of input nodes of the input unit 410 to equal the number of individual dictionaries, that is, the number of bits produced by the quantification module 300. In other words, the deep neural network module 400 has an input unit 410 with a plurality of input nodes, an output unit 420 with one output node, and a hidden unit 430 with a plurality of hidden nodes located between them, and each input node of the input unit 410 effectively corresponds to one bit of the value produced by the quantification module 300.

It is effective for the hidden unit 430 to include one or more fully connected layers, and the deep neural network module 400 of this embodiment is effectively characterized by the relationships between the hidden layers.
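The structure described above (one input node per individual dictionary, fully connected hidden layers, one output node) can be sketched as a plain-Python forward pass. The sigmoid activation, the hidden layer sizes of 8, and the random initial weights are assumptions for illustration; the patent does not specify them.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """Fully connected layer: n_out rows of n_in weights, plus one bias per output node."""
    return {
        "w": [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)],
        "b": [0.0] * n_out,
    }


def forward(bits, layers):
    """Feed a quantified sentence (bit string) through the layers; returns a value in (0, 1)."""
    activations = [float(bit) for bit in bits]  # one input node per individual dictionary
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + bias)
            for row, bias in zip(layer["w"], layer["b"])
        ]
    return activations[0]  # single output node


rng = random.Random(0)
n_dictionaries = 5
layers = [make_layer(n_dictionaries, 8, rng), make_layer(8, 8, rng), make_layer(8, 1, rng)]
print(forward("01010", layers))
```

The weights here are untrained, so the printed value is meaningless as a risk score; the sketch only shows how the input width is tied to the number of individual dictionaries.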

For training, assuming as in this embodiment that documents and sentences are reviewed for risk on the basis of the provided information, it is effective for the output value of the output unit 420 to be 0 or 1. A document or sentence that contains a toxic clause, that is, an issue or risk, is labeled 1, and one that does not is labeled 0.

For predictive analysis, the output is a value between 0 and 1. In practice, values such as 0.94, 0.85, 0.23, and 0.01 are produced. An output close to 1 is interpreted as indicating a toxic clause, that is, an issue or risk, and an output close to 0 is interpreted as indicating no risk.
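Turning the continuous output into a decision requires a cut-off. The sketch below assumes a threshold of 0.5, which the patent does not state; it only says that values close to 1 indicate risk and values close to 0 indicate none.

```python
def label_risk(score, threshold=0.5):
    """Map a network output in [0, 1] to a label; the 0.5 threshold is an assumption."""
    return "risk" if score >= threshold else "no risk"


for score in (0.94, 0.85, 0.23, 0.01):
    print(score, label_risk(score))
```

In practice the threshold could be tuned to trade missed risky clauses against false alarms.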

When the learning control module 500 runs, the whole system learns whether the sentences in the provided documents carry risk. The training material is turned into keywords by the preprocessing module 100. The dictionary module uses these keywords to form individual dictionaries, or adds them to the category word part of individual dictionaries that already exist, and their near-synonyms and synonyms can be added as well.

The quantification module 300 converts each sentence, already reduced to keywords, into a binary value with as many bits as there are individual dictionaries, and the deep neural network module 400 is trained on these values to learn whether each sentence carries risk.

The analysis module 600 causes the deep neural network module 400 to predict, on the basis of the previously learned results, whether a given sentence carries risk.

The following describes a deep neural network-based document analysis method that uses the deep neural network-based document analysis system described above.

In the description of the method below, explanations that overlap with the system description above are omitted or simplified.

FIG. 7 is a flowchart of a deep neural network-based document analysis method according to an embodiment of the present invention, FIG. 8 is a flowchart of a keyword extraction method according to an embodiment, FIG. 9 is a flowchart of a sentence quantification method according to an embodiment, FIG. 10 is a flowchart of a sentence learning method according to an embodiment, and FIG. 11 is a flowchart of a dictionary creation method according to an embodiment.

The deep neural network-based document analysis method is described below with reference to FIGS. 7 to 11. First, as shown in FIG. 7, a document is prepared so that risks, toxic clauses, and the like in it can be detected. The prepared document is preprocessed by the preprocessing module 100, which extracts the keywords corresponding to the constituent words of each sentence in the document (S100).

In detail, with reference to FIG. 8, the sentences in the document are separated (S110). The words in each sentence are then extracted through morphological analysis or the like (S120). The base form of each word is extracted, while unnecessary words and stop words are removed and particles and articles are stripped (S130). More specifically, for verbs, adjectives, and the like, the base form is recovered from the inflected ending, and nouns are reduced to their singular form (S140). In addition, words found to be unimportant through semantic analysis are removed, leaving the keywords that correspond to the sentence. If the base form of a word can be both a verb and a noun, it is effective to keep both as keywords. For compound nouns, it is likewise effective to split them and extract all the resulting nouns as keywords. Stop words are preferably deleted by checking each word against a separate stop-word dictionary (S150). The stop-word dictionary is described later.
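Steps S110 to S150 can be sketched for English text as follows. This is a deliberately naive illustration: the regular-expression sentence splitter, the suffix-based `singularize` rule, and the `STOP_WORDS` and `VERB_BASE` tables are all assumptions standing in for a real morphological analyzer and stop-word dictionary.

```python
import re

# Hypothetical stop-word dictionary and inflected-form table.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "by", "all", "shall", "is", "are", "and"}
VERB_BASE = {"paid": "pay", "penalized": "penalize"}


def singularize(word):
    """Naive singular form: 'penalties' -> 'penalty', 'delays' -> 'delay'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def extract_keywords(document):
    """Return a list of (sentence_id, keywords) pairs."""
    result = []
    # S110: separate the document into sentences.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    for sentence_id, sentence in enumerate(sentences, start=1):
        # S120: extract the words of the sentence.
        words = re.findall(r"[a-z]+", sentence.lower())
        # S130 / S150: drop articles and other entries of the stop-word dictionary.
        kept = [w for w in words if w not in STOP_WORDS]
        # S140: reduce inflected verbs to base form and nouns to singular.
        keywords = [VERB_BASE.get(w, singularize(w)) for w in kept]
        result.append((sentence_id, keywords))
    return result


text = "The contractor shall pay all penalties. Delays are penalized by the employer."
for sentence_id, keywords in extract_keywords(text):
    print(sentence_id, keywords)
```

Each keyword list is returned with its sentence number, matching the idea of prefixing stored keywords with the sentence's identification number.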

Next, the quantification module 300 uses the dictionary, which comprises a plurality of individual dictionaries, together with the keywords of each sentence to quantify the sentence for analysis by the deep neural network (S200).

Referring to FIG. 9, each keyword of the sentence is compared with the category word part of every individual dictionary (S210). The vector value for an individual dictionary that matches is set to 1, and the value for one that does not match is set to 0 (S220). The sentence is thus expressed with as many bits as there are individual dictionaries (S230). When all keywords of the sentence have been compared, an individual dictionary that matches no keyword has the value 0, and one that matches one or more keywords has the value 1. Quantification therefore produces a binary bit value, that is, an n-dimensional coordinate determined by the keywords and the individual dictionaries. In more detail, each keyword is checked against each individual dictionary, and if it is found in the m-th dictionary, the m-th bit is changed to 1. All other keywords are compared in the same way, sequentially, consecutively, or simultaneously, to produce the final quantified data.

The deep neural network module 400 then analyzes the quantified document to determine whether risk is present (S300). The quantified data is fed to the input unit, passes through the hidden unit with its multiple layers, and is output from the output unit as a value between 0 and 1. The closer the value is to 1, the more likely the sentence is judged to carry risk (S400).

The number of nodes in the input layer equals the number of individual dictionaries applied. The number of nodes and the number of layers in the hidden unit are effectively defined according to the model applied.

The description above covered prediction and analysis of a target document. For this, the deep neural network must be trained in advance.

For this training, as shown in FIG. 10, keywords are first extracted from various documents by the preprocessing module (S1100). The keywords are effectively extracted through preprocessing steps such as morphological analysis and base-form processing.

Each sentence is then quantified by the quantification module 300 using the extracted keywords and the dictionary (S1200). The deep neural network is trained on the quantified sentences. It is effective to train with 1 representing a sentence that carries risk and 0 representing a sentence that does not (S1300).
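The training step (S1300) can be illustrated with a very small network trained by stochastic gradient descent. Everything specific here is an assumption: one hidden layer of four sigmoid nodes, cross-entropy loss, the learning rate, the epoch count, and the five labeled bit strings, which are invented toy data in which the second dictionary signals risk.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, n_hidden=4, epochs=2000, lr=0.5, seed=0):
    """samples: list of (bit_string, label) pairs, label 1 = risk, 0 = no risk."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])  # one input node per individual dictionary
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0

    def forward(bits):
        x = [float(c) for c in bits]
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
        return x, h, y

    for _ in range(epochs):
        for bits, label in samples:
            x, h, y = forward(bits)
            d_out = y - label  # gradient of cross-entropy loss at the output node
            for j in range(n_hidden):
                # Hidden gradient uses the output weight before it is updated.
                d_hidden = d_out * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * d_out * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * d_hidden * x[i]
                b1[j] -= lr * d_hidden
            b2 -= lr * d_out

    return lambda bits: forward(bits)[2]


# Invented toy data: sentences that hit the second dictionary are labeled risky.
samples = [("01010", 1), ("01000", 1), ("00000", 0), ("00100", 0), ("10001", 0)]
predict = train(samples)
for bits, label in samples:
    print(bits, label, round(predict(bits), 3))
```

After training, the outputs for the risky samples move toward 1 and the others toward 0, which is the behavior the predictive analysis relies on.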

As described above, training and prediction or analysis with a deep neural network require sentences to be quantified, and quantification in turn requires a dictionary to be created.

In this embodiment, to create the dictionary, keywords are extracted from various documents by the preprocessing module, as shown in FIG. 11 (S2100). The keywords are effectively extracted through preprocessing steps such as morphological analysis and base-form processing.

The extracted keywords are analyzed and grouped into various categories, and each grouped category is classified as one individual dictionary (S2200). The category classification is effectively performed through TF-IDF analysis and category analysis.

Each keyword is then assigned to its individual dictionary (S2300), and the assigned keywords are expanded with near-synonyms and synonyms to complete the final dictionary (S2400).

When a new document is analyzed by the preprocessing module 100, new keywords can be added and new categories can be created, and therefore new individual dictionaries can be created. When a new individual dictionary is added to the dictionary, the vector produced when a sentence is quantified changes: the number of bits grows. This means that the input layer of the deep neural network grows, so a change made by the dictionary module 200 leads to a change in the deep neural network module 400. The deep neural network module 400 must therefore be retrained whenever individual dictionaries are added. Dictionaries with different categories can also be constructed depending on the nature of the work they are applied to, the type of document, and where they are used. In this embodiment, the dictionary can be built first and the deep neural network module 400 built afterwards. The deep neural network can also be made to vary with the dictionary.
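The dependency between the dictionary and the network input can be made concrete with a small check. The helper names `encode` and `needs_retraining`, the sample word sets, and the new MATERIAL dictionary are illustrative assumptions.

```python
def encode(keywords, dictionaries):
    """One bit per individual dictionary: 1 if any keyword is in its word set."""
    return "".join(
        "1" if any(k in words for k in keywords) else "0" for words in dictionaries
    )


def needs_retraining(model_input_width, dictionaries):
    """A network built for a given bit count no longer fits once dictionaries are added."""
    return model_input_width != len(dictionaries)


dictionaries = [{"day", "month"}, {"engineer"}, {"cost"}, {"bridge"}]
keywords = ["engineer", "cost", "steel"]

before = encode(keywords, dictionaries)
model_input_width = len(before)             # network built for 4 input nodes

dictionaries.append({"steel", "concrete"})  # hypothetical new MATERIAL dictionary
after = encode(keywords, dictionaries)

print(before, after, needs_retraining(model_input_width, dictionaries))
```

The same sentence now maps to a longer bit string, so a network with the old input width cannot accept it without being rebuilt and retrained.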

The present invention is not limited to the configuration and method of the embodiments described above, and various modifications and other embodiments are possible. For example, when the base form of a word can be both a noun and a verb, the preprocessing module 100 may select only one of the two rather than both. Likewise, instead of splitting a compound noun, it may classify the compound noun itself as a single base form, that is, a single keyword.

Furthermore, the dictionary module 200 may place all the keywords of a word that has two or more base forms in the same dictionary. Keyword extraction may also use nouns only. Using nouns rather than other parts of speech can make keyword analysis more efficient and prevents the duplication of words with similar meanings that arises when other parts of speech, such as verbs, are used.

Moreover, when only nouns are extracted as keywords, dictionary construction is simplified and words are easier to expand, which may allow more effective deep neural network analysis.

In addition, the present invention is not limited to this, and a preprocessing module dedicated to the dictionary may be provided as a separate component. Because the dictionary module 200 serves as the basic reference for sentence analysis, it collects the results of analyzing various documents. Therefore, instead of separating sentences from the document, splitting them into words, and then identifying their original forms as described above, the preprocessing module can analyze a sentence all at once. In other words, a separate dictionary-dedicated preprocessing module may be provided for dictionary processing. This improves the operating efficiency of dictionary processing, allows keywords to be derived efficiently from the document as a whole rather than sentence by sentence, and can be very effective in the subsequent classification work, that is, category classification.

Although the technical idea of the present invention has been described in detail with reference to preferred embodiments, it should be noted that these embodiments are for illustration and not for limitation. In addition, a person of ordinary skill in the technical field of the present invention will understand that various embodiments are possible within the scope of the technical idea of the present invention.

100: Preprocessing module 110: Sentence separation module
120: Word separation module 130: Prototype extraction module
131: Morpheme analysis unit 132: Stop word removal unit
133: Prototype selection unit 200: Dictionary module
210: Category classification unit 220: Individual dictionary creation unit
230: Keyword expansion unit 240: Dictionary storage unit
300: Quantification module 400: Deep neural network module
500: Learning control module 600: Analysis module

Claims

A deep neural network-based document analysis system comprising:
a preprocessing module that extracts keywords for each sentence of a target document;
a dictionary module that groups keywords obtained through the preprocessing module and the learning control module into similar categories and classifies the words into a plurality of individual dictionaries;
a quantification module that quantifies sentences, based on the dictionary module, using the keywords extracted through the preprocessing module;
a deep neural network module that learns from, or predicts and analyzes, the values produced by the quantification module through a deep neural network;
a learning control module that controls, according to learning criteria, the training of the deep neural network module on sentences that have passed through the quantification module; and
an analysis module that controls the analysis of a document to be analyzed,
wherein the quantification module quantifies each sentence, based on the dictionary module, using the keywords extracted for that sentence, expresses it with a number of bits corresponding to the number of individual dictionaries, checks whether each keyword is present in each individual dictionary, and, if a corresponding keyword is present in the n-th individual dictionary, sets the position value of that individual dictionary to 1,
wherein the quantification module determines, for each sentence, whether a keyword in the sentence matches a word in the category word part of each individual dictionary, assigning 1 on a match and 0 otherwise,
wherein the deep neural network module includes an input unit that receives data, an output unit that outputs results, and a hidden unit that performs deep neural network analysis, and
wherein the deep neural network module determines the risk of a sentence through deep neural network analysis of the quantified sentence by feeding the quantified data into the input unit, passing it through the hidden unit having multiple layers, and outputting a value between 0 and 1 at the output unit.
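The quantification and risk scoring recited in this claim can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the individual dictionaries, the layer sizes, the placeholder weights, and the ReLU activation in the hidden layers are all hypothetical, and the sigmoid output is just one way to obtain a value between 0 and 1.

```python
import math

def quantify(keywords, individual_dictionaries):
    """Return one bit per individual dictionary: 1 if any keyword of the
    sentence appears in that dictionary's category word part, else 0."""
    keyword_set = set(keywords)
    return [1 if keyword_set & words else 0 for words in individual_dictionaries]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def dense(inputs, weights, biases, activation):
    """Fully connected layer; weights[j] holds the weights of output node j."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def risk_score(bits, layers):
    """Input unit -> hidden layers (ReLU, assumed) -> single sigmoid output node."""
    values = bits
    for weights, biases in layers[:-1]:
        values = dense(values, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(values, weights, biases, sigmoid)[0]

# Hypothetical category word parts of three individual dictionaries.
individual_dictionaries = [
    {"payment", "invoice"},
    {"delay", "late"},
    {"penalty", "damages"},
]
bits = quantify(["contractor", "penalty", "delay"], individual_dictionaries)

# Placeholder (untrained) weights: 3 inputs -> 2 hidden -> 2 hidden -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[0.4, 0.6], [-0.3, 0.2]], [0.05, 0.0]),
    ([[0.7, -0.4]], [0.1]),
]
score = risk_score(bits, layers)
```

With these inputs `bits` is `[0, 1, 1]`: no keyword falls in the first dictionary, while the second and third each contain one. The score is meaningful only after training; here it merely shows that the output stays strictly between 0 and 1.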

The system of claim 1,
wherein, to perform the above operation, the preprocessing module includes a sentence separation module that separates sentences from a document, a word separation module that separates words from the separated sentences, and a prototype extraction module that extracts and selects the original form of each separated word.

The system of claim 2,
wherein the prototype extraction module includes a morpheme analysis unit that analyzes the morphemes of the separated words, a stop word removal unit that removes stop words or unnecessary particles according to the morpheme analysis result, and a prototype selection unit that matches the original form of a word and selects the original form for polysemous words, compound nouns, and the like.

The system of claim 1,
wherein the dictionary module includes a category classification unit that receives keywords in a document and classifies them into categories, an individual dictionary creation unit that creates an individual dictionary corresponding to each classified category, a keyword expansion unit that expands the keywords placed in the created individual dictionary, and a dictionary storage unit that stores the created individual dictionaries.

(Deleted)

A document analysis method using the deep neural network-based document analysis system according to any one of claims 1 to 4, the method comprising:
extracting, by the preprocessing module, keywords from a sentence or document through morphological analysis and original-form processing;
determining, by the dictionary module, whether a dictionary having a plurality of individual dictionaries exists, and creating the dictionary if it does not;
quantifying, by the quantification module, each sentence using the extracted keywords and the dictionary, converting it into a value having a number of bits corresponding to the number of individual dictionaries; and
analyzing, by the deep neural network module, the sentence or document from the value using a deep neural network,
wherein the quantification module quantifies each sentence by checking whether each keyword in the sentence matches the category word part of every individual dictionary, expressing the vector value of a matching individual dictionary as 1 and that of a non-matching individual dictionary as 0, so that the sentence is quantified with a number of bits equal to the number of individual dictionaries.

The method of claim 6, wherein the step of extracting the keywords includes:
separating, by a sentence separation module, sentences from a document, and separating, by a word separation module, words from the separated sentences;
extracting, by a prototype extraction module, the original form of each separated word by removing unnecessary particles, converting nouns to the singular and verbs to the present-tense base form, extracting the original form of each component when morphological analysis shows that multiple nouns or meanings are combined in a single word, and extracting both the noun and the verb original forms when both exist; and
deleting, by the prototype extraction module, words that are listed in a separate dictionary of unnecessary terms.
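The keyword extraction steps of this claim can be sketched on English text as below. A real system would use a morphological analyzer; here a small lookup table stands in for it. The tables `BASE_FORMS` and `UNNECESSARY_TERMS`, their contents, and all function names are hypothetical and serve only to illustrate the flow.

```python
import re

# Hypothetical stand-in for morphological analysis: word -> list of original forms.
# A list is used so that a word may yield more than one original form.
BASE_FORMS = {
    "payments": ["payment"],
    "delayed": ["delay"],
    "penalties": ["penalty"],
    "paid": ["pay"],
}

# Hypothetical dictionary of unnecessary terms (stop words, particles).
UNNECESSARY_TERMS = {"the", "a", "of", "is", "are", "be", "will", "if", "for"}

def split_sentences(document):
    """Sentence separation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def split_words(sentence):
    """Word separation: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def extract_keywords(sentence):
    """Map each word to its original form(s), then drop unnecessary terms."""
    keywords = []
    for word in split_words(sentence):
        for base in BASE_FORMS.get(word, [word]):
            if base not in UNNECESSARY_TERMS and base not in keywords:
                keywords.append(base)
    return keywords

document = "Payments are delayed. Penalties will be paid."
sentences = split_sentences(document)
keywords_per_sentence = [extract_keywords(s) for s in sentences]
```

For this input the sketch yields `["payment", "delay"]` for the first sentence and `["penalty", "pay"]` for the second.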

The method of claim 6, wherein the step of creating the dictionary includes:
classifying, by the category classification unit of the dictionary module, categories that can be grouped as similar through keyword analysis, namely TF-IDF analysis or category analysis;
classifying, by the individual dictionary creation unit, the classified categories into individual dictionaries, storing the category name as a dictionary division part and the keywords belonging to the category as a category word part; and
increasing, by the keyword expansion unit, the words in each individual dictionary by expanding the keywords classified into that dictionary with synonyms and similar words.
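A minimal sketch of these dictionary creation steps follows. It assumes a plain TF-IDF score (term frequency times the logarithm of inverse document frequency) for keyword analysis, and it replaces category classification and synonym lookup with small hand-written tables. The class `IndividualDictionary`, the tables `category_of` and `synonyms`, and the score threshold are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

def tf_idf(docs):
    """docs: list of keyword lists. Returns, per document, keyword -> TF-IDF score."""
    n_docs = len(docs)
    doc_freq = Counter(k for doc in docs for k in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        scores.append({k: (c / len(doc)) * math.log(n_docs / doc_freq[k])
                       for k, c in term_freq.items()})
    return scores

@dataclass
class IndividualDictionary:
    division: str                                      # dictionary division part (category name)
    category_words: set = field(default_factory=set)   # category word part

def build_dictionaries(keywords, category_of):
    """Create one individual dictionary per category and file each keyword under it."""
    dictionaries = {}
    for keyword in keywords:
        category = category_of.get(keyword)
        if category is None:
            continue  # keyword with no known category is skipped in this sketch
        entry = dictionaries.setdefault(category, IndividualDictionary(category))
        entry.category_words.add(keyword)
    return dictionaries

def expand_keywords(dictionaries, synonyms):
    """Keyword expansion: add synonyms of every word already in each dictionary."""
    for entry in dictionaries.values():
        for word in list(entry.category_words):
            entry.category_words.update(synonyms.get(word, ()))

docs = [["payment", "delay", "payment"], ["delay", "penalty"]]
scores = tf_idf(docs)
# Keep keywords with a positive score in at least one document (assumed threshold).
selected = sorted({k for s in scores for k, v in s.items() if v > 0})

category_of = {"payment": "cost", "penalty": "liability"}   # hypothetical categories
synonyms = {"penalty": {"fine", "damages"}}                 # hypothetical synonym table

dictionaries = build_dictionaries(selected, category_of)
expand_keywords(dictionaries, synonyms)
```

In this toy corpus "delay" occurs in every document, so its TF-IDF score is zero and it is not selected; "payment" and "penalty" are filed under their categories, and the "liability" dictionary grows to include "fine" and "damages".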

(Deleted)

A recording medium storing a program for implementing the deep neural network-based document analysis method of claim 6.

A computer program stored on a medium for implementing the deep neural network-based document analysis method of claim 6.