KR102342055B1

KR102342055B1 - Apparatus and method for processing natural language using structured and unstructured data

Info

Publication number: KR102342055B1
Application number: KR1020210069470A
Authority: KR
Inventors: 정효용; 윤창오; 정민성; 보아동
Original assignee: 주식회사 애자일소다
Priority date: 2021-05-28
Filing date: 2021-05-28
Publication date: 2021-12-27

Abstract

Disclosed are an apparatus and method for processing natural language using structured data and unstructured data. The invention addresses the classification problem of a language model through modeling that uses feature values of both structured and unstructured data, and improves classification accuracy by using structured data together with the text for problems that are difficult to classify from unstructured text data alone.

Description

Apparatus and method for natural language processing using structured and unstructured data

The present invention relates to a natural language processing apparatus and method using structured data and unstructured data, and more particularly to such an apparatus and method in which the classification problem of a language model is improved through modeling that uses feature values of structured data and unstructured data.

Recently, natural language generation technology that generates natural language using neural networks has been applied to various applications that support conversation between a device terminal and a user.

Such a neural network is a model that expresses the characteristics of human biological neurons mathematically, and it uses algorithms that mimic the human ability to learn.

A neural network also has a generalization ability: based on what it has learned, it can produce relatively correct outputs for input patterns that were not used in training.

Existing NLP (Natural Language Processing) algorithms solve NLP classification problems with Transformer-based architectures such as BERT from Google and GPT-3 from OpenAI.

The Transformer architecture performs only a small, constant number of steps, and at each step it uses a self-attention mechanism that directly models the relationships between all the tokens (words) of a sentence regardless of their positions. This lets it capture the relationships between tokens and has contributed greatly to improved performance.
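As a minimal sketch of the self-attention idea described above (an illustration, not code from the patent), the following pure-Python function computes single-head scaled dot-product attention. It assumes the queries, keys, and values are the token vectors themselves, with no learned projection matrices; the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with Q = K = V = tokens (assumption: no projections).

    tokens: list of d-dimensional vectors, one per token.
    Returns one d-dimensional vector per token, each a weighted mix of all tokens.
    """
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # Score this token against every token, regardless of position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return outputs
```

Every output vector depends on every input token in a single step, which is the property the paragraph above attributes to the Transformer.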

Moreover, existing NLP algorithms solve problems using only the content of unstructured text data.

Here, the relationship between texts refers to predicting the probability of the next word given the preceding words. A model that predicts the next word well reflects the characteristics of the language well and is a language model that computes context well.

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based network that produces context-aware sentence embeddings or contextual word embeddings: when a sentence is split into tokens and fed into the network, it outputs a vector for the whole sentence and a vector corresponding to each word in the sentence.

Training a task such as text classification on top of these vectors readily yields excellent performance, because the entire network has been pre-trained as a masked language model (MLM) on a very large volume of documents.

However, as model size grows, it becomes increasingly difficult to train a large model on a single GPU, and the time required for inference grows with it.

In addition, although performance is very strong when classifying unstructured text data, such models cannot make proper use of structured data when it is available alongside the text.

That is, current NLP models predict the probability of a word sequence (sentence) and perform natural language processing with a BERT model that uses a bidirectional Transformer plus a single classification layer. This gives excellent classification performance when only unstructured text data is used, but the information in structured data is not used at all.

Accordingly, when the data to be classified consists of both unstructured text data and structured text data, it is difficult to classify it as desired with existing models.

Korean Registered Patent Publication No. 10-2166390 (Title of invention: Method and system for modeling unstructured data)

To solve these problems, an object of the present invention is to provide a natural language processing apparatus and method using structured data and unstructured data that improve the classification problem of a language model through modeling that uses feature values of structured and unstructured data.

Another object of the present invention is to provide a natural language processing apparatus and method using structured data and unstructured data that can improve classification accuracy by arranging two different networks in parallel so that, when the input data contains both unstructured and structured data, both can be used together for classification.

To achieve the above objects, one embodiment of the present invention is a natural language processing apparatus using structured data and unstructured data, which may include a data processing unit that, when unstructured data and structured data are input from an input unit, predicts a feature value of the unstructured data and a feature value of the structured data using different machine learning networks, and outputs a prediction result obtained by adding the predicted feature value of the unstructured data and the predicted feature value of the structured data, wherein the machine learning networks may comprise a first network that predicts using the unstructured data as its input and a second network that predicts using the structured data as its input, arranged in parallel.

In this embodiment, the first network is based on a BERT (Bidirectional Encoder Representations from Transformers) model, and the second network is based on a feed-forward neural network.

The apparatus according to this embodiment further includes a classification unit that classifies, based on a classification model, the sum of the feature value of the unstructured data and the feature value of the structured data output from the data processing unit.

The input unit according to this embodiment includes: a text input unit that recognizes text-based data in the input data and extracts and outputs only the text; and a structured data input unit that extracts and outputs, from the input data, structured data consisting of at least one of a schema form based on data entities, attributes, and relationships, whether operations can be performed, data characteristics, and numeric and categorical data.

The input unit according to this embodiment further includes an STT input unit that, when voice-based data is recognized in the input data, converts the voice-based data into text data and extracts only the text.

The data processing unit according to this embodiment includes: a first network unit that analyzes the input unstructured data with a BERT-based model and outputs a feature value of the unstructured data; a second network unit that analyzes the input structured data with a feed-forward neural network and outputs a feature value of the structured data; and an operation unit that adds the feature value of the unstructured data and the feature value of the structured data and outputs the result to the classification unit.

The first and second network units according to this embodiment include a residual network to prevent overfitting.

The first network unit according to this embodiment includes: an embedding layer that converts the input unstructured data into vector values through embedding; a normalization layer that normalizes the converted vector values; and a BERT layer that applies the BERT algorithm to the normalized vector values and outputs a vector for the sentence and vectors corresponding to the individual words in the sentence.

The second network unit according to this embodiment includes: a normalization layer that normalizes the input structured data; and a feed-forward layer that processes the normalized structured data with a feed-forward neural network and outputs a feature value of the structured data.

Another embodiment of the present invention is a natural language processing method using structured data and unstructured data, comprising: a) classifying, by an input unit, input data into unstructured data and structured data; b) receiving, by a data processing unit, the classified unstructured data and structured data and predicting a feature value of the unstructured data and a feature value of the structured data using different machine learning networks; and c) adding, by the data processing unit, the predicted feature value of the unstructured data and the predicted feature value of the structured data and outputting a prediction result, wherein the machine learning networks comprise a first network that predicts using the unstructured data as its input and a second network that predicts using the structured data as its input, arranged in parallel.

The method according to this embodiment further includes: d) classifying, by a classification unit and based on a classification model, the sum of the feature value of the unstructured data and the feature value of the structured data output from the data processing unit 120.

Step a) according to this embodiment includes: a-1) extracting text when the input unit recognizes text-based data in the input data; and a-2) classifying, by the input unit, the extracted text into structured data or unstructured data according to a schema form based on data entities, attributes, and relationships, data characteristics, numeric data, categorical data, and whether operations can be performed.

The method according to this embodiment further includes, in step a-1), when voice-based data is recognized in the input data, converting the voice-based data into text data and extracting only the text.

The present invention has the advantage of improving the classification problem of a language model through modeling that uses feature values of structured and unstructured data.

In addition, when the input data contains both unstructured and structured data, the present invention classifies using predictions from different networks arranged in parallel so that both kinds of data are used together, which has the advantage of improving classification accuracy.

In addition, by using structured data together with the text for problems that are difficult to classify from unstructured text data alone, the present invention can prevent the drop in classification accuracy that occurs in specific fields when classifying with a BERT model, and can improve the classification accuracy of the target data for language models that perform specific tasks, for example in bio, science, and finance.

FIG. 1 is an exemplary diagram schematically illustrating a natural language processing apparatus using structured data and unstructured data according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram illustrating the configuration of the input unit of the natural language processing apparatus using structured data and unstructured data according to the embodiment of FIG. 1.
FIG. 3 is an exemplary diagram illustrating the configuration of the data processing unit of the natural language processing apparatus using structured data and unstructured data according to the embodiment of FIG. 1.
FIG. 4 is a flowchart illustrating a natural language processing method using structured data and unstructured data according to an embodiment of the present invention.

Hereinafter, the present invention will be described in detail with reference to preferred embodiments and the accompanying drawings, on the premise that the same reference numerals in the drawings denote the same components.

Before describing the specific details for carrying out the present invention, it should be noted that components not directly related to the technical gist of the invention have been omitted to the extent that doing so does not obscure that gist.

Terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the invention, based on the principle that an inventor may appropriately define terms in order to describe the invention in the best way.

In this specification, the statement that a part "includes" a component means that it may further include other components, not that other components are excluded.

Terms such as "... unit", "... device", and "... module" denote a unit that processes at least one function or operation, which may be implemented as hardware, software, or a combination of the two.

The term "at least one" is defined to include both the singular and the plural, and even where the term is absent, it is self-evident that each component may exist in, and may denote, the singular or the plural.

Whether each component is provided singly or in plurality may vary depending on the embodiment.

Hereinafter, preferred embodiments of the natural language processing apparatus and method using structured data and unstructured data according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram schematically illustrating a natural language processing apparatus using structured data and unstructured data according to an embodiment of the present invention, FIG. 2 is an exemplary diagram illustrating the configuration of the input unit of the apparatus according to the embodiment of FIG. 1, and FIG. 3 is an exemplary diagram illustrating the configuration of the data processing unit of the apparatus according to the embodiment of FIG. 1.

Referring to FIGS. 1 to 3, the natural language processing apparatus 100 using structured data and unstructured data according to an embodiment of the present invention is intended to improve the classification problem of a language model through modeling that uses feature values of structured and unstructured data. When unstructured data and structured data are input, it predicts a feature value of the unstructured data and a feature value of the structured data using different machine learning networks. The machine learning networks may comprise a first network that predicts using the unstructured data as its input and a second network that predicts using the structured data as its input, arranged in parallel.

The natural language processing apparatus 100 according to an embodiment of the present invention may add the predicted feature value of the unstructured data and the predicted feature value of the structured data and output the result, and may include an input unit 110, a data processing unit 120, and a classification unit 130.

When arbitrary input data is received, the input unit 110 extracts only the text, classifies the extracted text into structured data and unstructured data, and outputs them to the data processing unit 120. It may include a text input unit 111 and a structured data input unit 112.

For example, if "Card company A, 1,000,000 won" is received as input data, classifying the input solely on the basis of the unstructured data does not yield an accurate classification.

Therefore, the input unit 110 additionally identifies periodicity, amount, payment, cancellation, refund, and the like from the input data so that the input can be classified accurately, for example as "a refund of 1,000,000 won deposited by card company A following a payment cancellation".
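The kind of additional information the input unit identifies can be illustrated with a small sketch. The patent does not specify how these fields are extracted; the regular expression, the keyword list, and the field names below are all hypothetical and only show how an amount and a refund flag might be pulled out of a short transaction memo.

```python
import re

# Hypothetical keyword list; a real system would need a far richer rule set or a model.
REFUND_KEYWORDS = ("refund", "cancel")


def extract_structured(memo):
    """Extract a numeric amount and a refund flag from a transaction memo (illustrative)."""
    match = re.search(r"(\d[\d,]*)\s*won", memo)
    amount = int(match.group(1).replace(",", "")) if match else None
    lowered = memo.lower()
    return {
        "amount_won": amount,
        "is_refund": any(keyword in lowered for keyword in REFUND_KEYWORDS),
    }
```

For the memo `"Card company A 1,000,000 won"` the sketch yields an amount but no refund signal, which mirrors the ambiguity the example above points out: the text alone does not say whether the money was paid or refunded.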

The text input unit 111 analyzes the input data and, when it is recognized as text-based data, extracts and outputs only the text from the input data.

Along with the text data, the text input unit 111 may also examine the content of the extracted text, for example periodicity, cost, payment request, payment approval, and payment cancellation, and extract structured data information that can be used for classification.

The input unit 110 may additionally include an STT input unit 111a that, when voice-based data is recognized in the input data, converts the voice-based data into text data using STT (speech-to-text), recognizes the converted text data, and outputs data from which only the text has been extracted.

Along with the text data, the STT input unit 111a may also extract structured data information such as the time of the voice data, the pitch of the speaker's voice, the speaker's age, and the classification category of the voice-based data.

The structured data input unit 112 extracts and outputs structured data from the input data according to, for example, a schema form such as data entities, attributes, and relationships, whether operations can be performed, data characteristics, and numeric and categorical data.

Structured data is data stored in fixed fields according to a predetermined format and structure, and it may be stored in the form of tables in a relational database.

That is, structured data has rules in its structure and management scheme; because it is well-formed, it has commonly used delimiters and a data value for each field.

Because of its fixed format and storage structure, operations such as partial search, selection, update, and deletion can be performed on structured data easily.
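To make the notion of fixed fields concrete, the following sketch models one structured record and turns it into a flat numeric vector that a network could consume. The record type, its field names, and the category set are assumptions chosen for illustration; the patent does not define a particular schema or encoding.

```python
from dataclasses import dataclass

# Hypothetical set of categories for the categorical field.
CATEGORIES = ("payment", "cancellation", "refund")


@dataclass
class Transaction:
    amount_won: int      # numeric field
    category: str        # categorical field, one of CATEGORIES
    is_recurring: bool   # periodicity flag


def encode(record):
    """Encode a record as [amount, recurring flag, one-hot category...]."""
    one_hot = [1.0 if record.category == c else 0.0 for c in CATEGORIES]
    return [float(record.amount_won), float(record.is_recurring)] + one_hot
```

Because every record has the same fields in the same order, selecting or updating one field is a direct attribute access, which is the ease of operation described above.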

Unstructured data may include all data other than structured data. It is data with no fixed form and no schema, such as audio, images, video, text documents, and logs; it is not structured for computation, and it is stored and managed according to the characteristics of each type of data.

In this embodiment, a document consisting of text is used as the example for convenience of description, but the invention is not limited to this.

A document may be composed mainly of text and may also contain data such as dates, numbers, facts, charts, and figures.

Unstructured data is difficult to process once collected, because it requires text mining, web mining, or opinion mining, or, in the case of files, parsing the file into a data form.

Unstructured data is also difficult to standardize and therefore difficult to store and manage. Valuable information can be extracted from it by systematically and automatically analyzing statistical rules or patterns in large volumes of stored data.

Feature values of unstructured data can be set through classification, which infers the class of items from predefined characteristics of a given group; clustering, which groups items that share specific characteristics; association, which defines relationships between events that occur together; and forecasting, which predicts the future based on patterns in large data sets.

When input data 110a consisting of structured data and unstructured data is received, the data processing unit 120 predicts a feature value of the structured data and a feature value of the unstructured data using different machine learning networks.

In the data processing unit 120, the machine learning networks may comprise a first network that predicts using the unstructured data 110a' as its input and a second network that predicts using the structured data 110a" as its input, arranged in parallel. To add the predicted feature value of the unstructured data and the predicted feature value of the structured data and output the result, the data processing unit 120 may include a first network unit 121, a second network unit 122, and an operation unit 123.
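The parallel arrangement and the addition step can be sketched as follows. This is a toy illustration under stated assumptions, not the patented implementation: `text_branch` is a hypothetical stand-in for the BERT-based first network, `structured_branch` is a single dense layer standing in for the feed-forward second network, and all weights are supplied by the caller rather than learned.

```python
def text_branch(text):
    """Hypothetical stand-in for the first network: text -> 2-dim feature vector.

    A real system would run a BERT encoder here.
    """
    word_count = len(text.split())
    return [word_count / 10.0, 1.0 if "refund" in text.lower() else 0.0]


def structured_branch(features, weights, bias):
    """One dense layer with ReLU, standing in for the second (feed-forward) network."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(weights, bias)
    ]


def fuse(text, features, weights, bias):
    """Operation unit: element-wise sum of the two branch outputs."""
    a = text_branch(text)
    b = structured_branch(features, weights, bias)
    return [x + y for x, y in zip(a, b)]


def classify(fused):
    """Toy classification unit: index of the largest fused value."""
    return max(range(len(fused)), key=fused.__getitem__)
```

One consequence of fusing by addition is that both branches must emit vectors of the same dimension; concatenation would lift that constraint but is not what the text describes.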

In this embodiment, the first network may be a network based on a BERT (Bidirectional Encoder Representations from Transformers) model, and the second network may be a network based on a feed-forward neural network.

The first network unit 121 analyzes the input unstructured data 110a' with a BERT-based model and outputs a feature value of the unstructured data. It may include an embedding layer 121a, a normalization layer 121b, and a BERT layer 121c.

The embedding layer 121a converts the input unstructured data 110a' into vector values through embedding: it combines the token embedding, segment embedding, and position embedding into a single embedding value that is the sum of the three.
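The summation of the three embeddings can be sketched with tiny lookup tables. The table contents, the two-dimensional vectors, and the vocabulary are made-up values for illustration; in a trained BERT model these tables are learned parameters.

```python
# Hypothetical 2-dimensional embedding tables.
TOKEN_EMB = {"[CLS]": [0.1, 0.2], "card": [0.3, 0.1], "refund": [0.5, 0.4]}
SEGMENT_EMB = [[0.0, 0.0], [1.0, 1.0]]                    # indexed by segment id
POSITION_EMB = [[0.01, 0.0], [0.02, 0.0], [0.03, 0.0]]    # indexed by position


def embed(tokens, segment_ids):
    """Return one vector per token: token + segment + position embedding."""
    vectors = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        t = TOKEN_EMB[token]
        s = SEGMENT_EMB[segment]
        p = POSITION_EMB[position]
        vectors.append([a + b + c for a, b, c in zip(t, s, p)])
    return vectors
```

Because the three embeddings are summed rather than concatenated, the result keeps the same dimension as each individual embedding.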

The normalization layer 121b normalizes the vector values produced by the embedding layer 121a.
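The text does not say which normalization is used. Assuming layer normalization, which is what BERT applies to its embeddings, a minimal version without the learned scale and shift parameters looks like this:

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance.

    Assumption: plain layer normalization without learned gain and bias.
    """
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]
```

Each token vector is normalized independently, so the result does not depend on the other tokens in the sentence or on other examples in a batch.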

버트 레이어(121c)는 정규화된 벡터 값을 버트 알고리즘(또는 버트 모델)을 이용하여 문장에 대한 벡터 값과 문장 내의 개별 단어에 대응하는 벡터 값을 출력하는 구성으로서, 버트 알고리즘은 'N'개의 인코더 블럭을 가질 수 있다.The vert layer 121c is a configuration that outputs a vector value for a sentence and a vector value corresponding to an individual word in a sentence using a vert algorithm (or bert model) with normalized vector values, and the bert algorithm uses 'N' encoders can have blocks.

또한, 버트 레이어(121c)의 인코더 블록은 이전 출력 값을 현재의 입력 값으로 하는 RNN(Recurrent Neural Network)과 유사한 특징을 지닐 수 있다.Also, the encoder block of the butt layer 121c may have a characteristic similar to a recurrent neural network (RNN) in which a previous output value is a current input value.

In addition, the BERT layer 121c may use a residual network within each encoder block to prevent the input and the processing result from overfitting.
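
The chaining of encoder blocks, in which each block receives the previous block's output, together with a residual (skip) connection, can be sketched as follows. The block function halve is a hypothetical stand-in for a real encoder block, and run_encoder_stack is an illustrative name, not one used in the specification.

```python
def run_encoder_stack(vector, blocks):
    """Pass a vector through a list of blocks, adding each block's input to its output."""
    hidden = list(vector)
    for block in blocks:
        transformed = block(hidden)
        # Residual connection: the block's input is added back to its output
        hidden = [h + t for h, t in zip(hidden, transformed)]
    return hidden


def halve(vector):
    """Hypothetical stand-in for an encoder block."""
    return [0.5 * x for x in vector]


stacked = run_encoder_stack([2.0, 4.0], [halve, halve])  # N = 2 blocks
print(stacked)
```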

In addition, because the BERT layer 121c uses GELU (Gaussian Error Linear Unit) as its non-linear activation, gradients around 0 can be computed more flexibly than with ReLU (Rectified Linear Unit).
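
The difference between the two activations near 0 can be checked numerically. The sketch below uses the exact form GELU(x) = x·Φ(x), where Φ is the standard normal cumulative distribution function, and estimates gradients by central differences; the helper name numerical_gradient and the sample points are illustrative choices.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


def numerical_gradient(fn, x, h=1e-6):
    """Central-difference estimate of the derivative of fn at x."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


for x in (-0.5, -0.1, 0.1, 0.5):
    print(x, numerical_gradient(gelu, x), numerical_gradient(relu, x))
```

For small negative inputs the ReLU gradient is exactly zero, whereas the GELU gradient remains positive and varies smoothly through 0.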

The second network unit 122 is arranged in parallel with the first network unit 121. It analyzes the input structured data 110a" and performs prediction on it using a feed-forward neural network, and outputs feature values of the structured data. It may include a normalization layer 122a and a feed-forward layer 122b.

The normalization layer 122a normalizes the input structured data 110a" and outputs the result.
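
The specification does not name the normalization applied to the structured data. The sketch below assumes per-column standardization (z-scores), one common choice for numeric tabular inputs; the column meanings and the name standardize_columns are hypothetical.

```python
import math


def standardize_columns(rows):
    """Scale each column of a numeric table to zero mean and unit variance."""
    stats = []
    for column in zip(*rows):
        mean = sum(column) / len(column)
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        stats.append((mean, std if std > 0 else 1.0))  # constant columns stay at 0
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


# Hypothetical rows: (age, amount)
rows = [[20.0, 100.0], [30.0, 300.0], [40.0, 200.0]]
scaled = standardize_columns(rows)
print(scaled)
```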

The feed-forward layer 122b performs prediction on the structured data normalized by the normalization layer 122a using a feed-forward neural network, and outputs feature values of the structured data.

In the feed-forward neural network, the input data is passed from the input layer through the hidden layer to the output layer, and a vector value for classification is output.

In addition, the feed-forward neural network refers only to the data of predetermined columns when predicting a classification, and may include a residual network to prevent the distribution of the input values from shifting.
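
A forward pass through an input layer, one hidden layer, and an output layer with a residual connection can be sketched as follows. The weights, the ReLU hidden activation, and the function names are assumptions made for illustration; the specification does not give layer sizes or the activation used in the second network.

```python
def dense(vector, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def feed_forward_residual(x, w1, b1, w2, b2):
    """Input -> hidden -> output, with the input added back to the output."""
    hidden = relu(dense(x, w1, b1))
    out = dense(hidden, w2, b2)
    return [o + xi for o, xi in zip(out, x)]  # residual connection


# Hypothetical weights: 2 inputs, 3 hidden units, 2 outputs
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B1 = [0.0, 0.0, 0.0]
W2 = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
B2 = [0.5, 0.0]

features = feed_forward_residual([1.0, -2.0], W1, B1, W2, B2)
print(features)
```

The residual addition requires the output layer to have the same width as the input, which is why the sketch maps two inputs back to two outputs.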

The operation unit 123 adds the feature values of the unstructured data output from the first network unit 121 to the feature values of the structured data output from the second network unit 122, and outputs the summed prediction result to the classification unit 130.
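
The addition performed by the operation unit 123 can be sketched as an element-wise sum. The sketch assumes that both networks emit feature vectors of the same length, which element-wise addition requires but which the specification does not state explicitly; the name fuse_features is hypothetical.

```python
def fuse_features(unstructured_features, structured_features):
    """Element-wise sum of the two feature vectors."""
    if len(unstructured_features) != len(structured_features):
        raise ValueError("feature vectors must have the same length to be added")
    return [u + s for u, s in zip(unstructured_features, structured_features)]


fused = fuse_features([0.5, -1.0, 2.0], [0.25, 1.0, -0.5])
print(fused)
```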

The classification unit 130 performs the final classification of the prediction result output from the data processing unit 120, that is, the sum of the feature values of the unstructured data and the feature values of the structured data, using a classifier-based classification model.
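
The specification does not say which classifier the classification unit 130 uses. As one possible instance, the sketch below assumes a single linear layer followed by softmax and selects the most probable label; the weights, label names, and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, bias, labels):
    """Linear layer plus softmax; returns the most probable label and all probabilities."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b for row, b in zip(weights, bias)]
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities


# Hypothetical fused feature vector, weights, and class labels
label, probabilities = classify(
    [0.75, 0.0, 1.5],
    weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    bias=[0.0, 0.0],
    labels=["class_a", "class_b"],
)
print(label, probabilities)
```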

Next, a natural language processing method using structured data and unstructured data according to an embodiment of the present invention is described.

FIG. 4 is a flowchart illustrating a natural language processing method using structured data and unstructured data according to an embodiment of the present invention.

Referring to FIGS. 1 to 4, in the natural language processing method using structured data and unstructured data according to an embodiment of the present invention, when arbitrary input data is input to the input unit 110 (S100), the input unit 110 extracts only text from the input data, classifies the extracted text into structured data and unstructured data, and outputs them to the data processing unit 120 (S200).

In step S200, the input unit 110 analyzes the input data and, when the input data is recognized as text-based data, may extract only the text from it.

Meanwhile, in step S200, when the input data is recognized as voice-based data, the input unit 110 may convert the voice-based data into text data, recognize the converted text data, and extract only the text.

In addition, in step S200, the input unit 110 analyzes the extracted text and classifies the data 110a into unstructured data 110a' and structured data 110a" according to, for example, its schema form (data entity, attribute, relationship, and the like), whether it can be operated on, and its data characteristics. The classified unstructured data 110a' and structured data 110a" are provided to the data processing unit 120.
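
The classification in step S200 can be sketched with a simple heuristic: a field is treated as structured if its value is numeric (and therefore can be operated on) or belongs to a known set of categories, and as unstructured otherwise. This rule, the field names, and the function names are assumptions for illustration; the specification lists the criteria but not a concrete procedure.

```python
def is_number(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_record(record, categorical_values):
    """Split a record's fields into structured and unstructured parts."""
    structured, unstructured = {}, {}
    for field, value in record.items():
        text = value.strip()
        if is_number(text) or text in categorical_values.get(field, ()):
            structured[field] = text
        else:
            unstructured[field] = text
    return structured, unstructured


# Hypothetical input record and category sets
record = {
    "age": "42",
    "grade": "gold",
    "memo": "Customer asked about a refund for a late delivery.",
}
structured, unstructured = split_record(record, {"grade": {"bronze", "silver", "gold"}})
print(structured)
print(unstructured)
```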

The data processing unit 120 receives the unstructured data 110a' and the structured data 110a" classified in step S200, and predicts feature values of the unstructured data and feature values of the structured data (S300, S400) using the first network unit 121 and the second network unit 122, in which different machine learning networks are arranged in parallel.

That is, the unstructured data is used as the input to the first network unit 121, and the structured data is used as the input to the second network unit 122.

In step S300, the first network unit 121, which is based on the BERT (Bidirectional Encoder Representations from Transformers) model, analyzes the input unstructured data 110a' and performs prediction on it using the BERT model, and outputs feature values of the unstructured data.

In addition, in step S300, the first network unit 121 converts the input unstructured data 110a' into a single embedded vector value by combining token embedding, segment embedding, and position embedding and summing the three embeddings.

In addition, the first network unit 121 normalizes the vector values obtained through embedding, applies the BERT algorithm (or BERT model) to the normalized vector values, and outputs a vector value for the sentence and vector values corresponding to the individual words in the sentence.

Here, the BERT algorithm may have N encoder blocks.

In addition, in the first network unit 121, the encoder blocks can identify features between tokens (words) using a self-attention mechanism, and a residual network may be used within each encoder block to prevent the input and the processing result from overfitting.
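
How self-attention relates each token to every other token can be shown with a minimal single-head sketch of scaled dot-product attention. For brevity it uses the token vectors themselves as queries, keys, and values and omits the learned projection matrices and multiple heads of a real BERT encoder block; these simplifications and the function names are assumptions.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


attended = self_attention([[1.0, 0.0], [0.0, 1.0]])
print(attended)
```

Each output vector mixes information from all tokens in the sentence, with larger weights on the tokens most similar to the query.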

In addition, because the first network unit 121 uses GELU (Gaussian Error Linear Unit) as its non-linear activation function, gradients around 0 can be computed more flexibly than with ReLU (Rectified Linear Unit).

In step S400, the second network unit 122, which is arranged in parallel with the first network unit 121, normalizes the input structured data 110a", analyzes the normalized structured data and performs prediction on it using a feed-forward neural network, and outputs feature values of the structured data.

In the feed-forward neural network, the input data is passed from the input layer through the hidden layer to the output layer with no recurrent path, and a vector value for expressing similarity is output.

In addition, to predict the next data item (word), the feed-forward neural network refers not to all preceding data but only to a fixed number n of data items, so it cannot refer to the context information carried by the discarded data. It may therefore include a residual network to prevent the input and the processing result from overfitting.

Subsequently, the data processing unit 120 adds the feature values of the unstructured data output from the first network unit 121 to the feature values of the structured data output from the second network unit 122, and outputs the summed prediction result to the classification unit 130 (S500).

The classification unit 130 performs the final classification (S600) of the prediction result output in step S500, that is, the sum of the feature values of the unstructured data and the feature values of the structured data, using a classifier-based classification model.
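
The order of steps S300 to S600 can be summarized in a short orchestration sketch. The three callables passed in are hypothetical stand-ins for the first network unit 121, the second network unit 122, and the classification unit 130; their toy logic and the label names carry no meaning beyond showing how data flows through the two parallel branches.

```python
def run_pipeline(text, table_row, text_network, table_network, classifier):
    """Run both branches, add their feature vectors, and classify the sum."""
    text_features = text_network(text)          # S300: unstructured branch
    table_features = table_network(table_row)   # S400: structured branch
    if len(text_features) != len(table_features):
        raise ValueError("both branches must emit vectors of the same length")
    fused = [t + s for t, s in zip(text_features, table_features)]  # S500
    return classifier(fused)                    # S600


# Hypothetical stand-ins for units 121, 122, and 130
def toy_text_network(text):
    return [float(len(text.split())), 0.0]


def toy_table_network(row):
    return [0.0, sum(row)]


def toy_classifier(fused):
    return "text_dominant" if fused[0] > fused[1] else "table_dominant"


decision = run_pipeline(
    "refund for late delivery", [1.0, 2.0],
    toy_text_network, toy_table_network, toy_classifier,
)
print(decision)
```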

Accordingly, the classification problem of a language model can be improved through modeling that uses the feature values of both structured data and unstructured data. Problems that are difficult to classify using unstructured text data alone can be classified by using the structured data together, which improves the classification accuracy for the data to be classified.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that it may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

In addition, the reference numerals recited in the claims of the present invention are provided only for clarity and convenience of description and are not limiting, and the thickness of lines or the size of components shown in the drawings may be exaggerated for clarity and convenience in describing the embodiments.

In addition, the terms used above are defined in consideration of their functions in the present invention and may vary depending on the intention or practice of a user or operator. These terms should therefore be interpreted on the basis of the content of this specification as a whole.

In addition, even where not explicitly shown or described, it is evident that a person of ordinary skill in the art to which the present invention pertains can make various modifications that incorporate the technical idea of the present invention on the basis of this description, and such modifications still fall within the scope of the present invention.

In addition, the embodiments described above with reference to the accompanying drawings are presented for the purpose of explaining the present invention, and the scope of the present invention is not limited to these embodiments.

100: natural language processing apparatus
110: input unit
110a: data
110a': unstructured data
110a": structured data
111: text input unit
111a: STT input unit
112: structured data input unit
120: data processing unit
121: first network unit
121a: embedding layer
121b: normalization layer
121c: BERT layer
122: second network unit
122a: normalization layer
122b: feed-forward layer
123: operation unit
130: classification unit

Claims

A data processing unit 120 which, when unstructured data and structured data are input from the input unit 110, predicts feature values of the unstructured data and feature values of the structured data using different machine learning networks, the machine learning networks being a first network that predicts using the unstructured data as an input value and a second network that predicts using the structured data as an input value, configured in parallel, and which outputs a single result by adding the predicted feature values of the unstructured data and the predicted feature values of the structured data;
wherein the data processing unit 120 includes: a first network unit 121 in which N encoder blocks, each using the previous output value as the current input value, identify features between words using a self-attention mechanism, a residual network prevents each input and processing result within the encoder blocks from overfitting, the vector values of the unstructured data obtained through embedding are normalized, and the BERT model is applied to the normalized vector values to output a vector value for the sentence and vector values corresponding to the individual words in the sentence, thereby outputting feature values of the unstructured data through analysis and prediction of the unstructured data;
a second network unit 122 in which, because only n data items are referred to when predicting the next word and the context information of the discarded data cannot be referred to, a residual network prevents the input and the processing result from overfitting, and in which the input structured data is passed by a feed-forward neural network from the input layer through the hidden layer to the output layer, with no recurrent path, to output a vector value expressing similarity, thereby outputting feature values of the structured data through analysis and prediction of the structured data; and
an operation unit 123 that adds the feature values of the unstructured data and the feature values of the structured data and outputs the sum to the classification unit 130.

(Deleted)

The apparatus of claim 1,
further comprising a classification unit 130 that classifies, based on a classification model, the result of adding the feature values of the unstructured data and the feature values of the structured data output from the data processing unit 120, the apparatus being a natural language processing apparatus using structured data and unstructured data.

4. The apparatus of claim 1 or 3,
wherein the input unit 110 includes: a text input unit 111 that recognizes text-based data among the input data, extracts only the text, and outputs it; and
a structured data input unit 112 that extracts and outputs, from among the input data, structured data consisting of at least one of a schema form according to a data entity, an attribute, and a relationship, whether operation is possible, data characteristics, numeric data, and categorical data, the apparatus being a natural language processing apparatus using structured data and unstructured data.

5. The apparatus of claim 4,
wherein the input unit 110 further includes an STT input unit 111a that, when voice-based data is recognized in the input data, converts the voice-based data into text data and extracts only the text, the apparatus being a natural language processing apparatus using structured data and unstructured data.

(Deleted)

The apparatus of claim 1,
wherein the first network unit 121 includes: an embedding layer 121a that converts the input unstructured data into vector values through embedding;
a normalization layer 121b that normalizes the converted vector values; and
a BERT layer 121c in which N encoder blocks, each using the previous output value as the current input value, identify features between words using a self-attention mechanism, a residual network prevents each input and processing result within the encoder blocks from overfitting, and the BERT model is applied to the normalized vector values to output a vector value for the sentence and vector values corresponding to the individual words in the sentence, thereby analyzing and predicting the unstructured data and outputting feature values of the unstructured data.

The apparatus of claim 1,
wherein the second network unit 122 includes: a normalization layer 122a that normalizes the input structured data; and
a feed-forward layer 122b in which, because only n data items are referred to when predicting the next word and the context information of the discarded data cannot be referred to, a residual network prevents the input and the processing result from overfitting, and in which the normalized structured data is passed by a feed-forward neural network from the input layer through the hidden layer to the output layer, with no recurrent path, to output a vector value expressing similarity, thereby analyzing and predicting the structured data and outputting feature values of the structured data, the apparatus being a natural language processing apparatus using structured data and unstructured data.

a) classifying, by the input unit 110, input data into unstructured data and structured data;
b) receiving, by the data processing unit 120, the classified unstructured data and structured data, and predicting feature values of the unstructured data and feature values of the structured data using different machine learning networks; and
c) outputting, by the data processing unit 120, a prediction result obtained by adding the predicted feature values of the unstructured data and the predicted feature values of the structured data;
wherein the data processing unit 120 includes: a first network unit 121 in which N encoder blocks, each using the previous output value as the current input value, identify features between words using a self-attention mechanism, a residual network prevents each input and processing result within the encoder blocks from overfitting, the vector values of the unstructured data obtained through embedding are normalized, and the BERT model is applied to the normalized vector values to output a vector value for the sentence and vector values corresponding to the individual words in the sentence, thereby outputting feature values of the unstructured data through analysis and prediction of the unstructured data;
a second network unit 122 in which, because only n data items are referred to when predicting the next word and the context information of the discarded data cannot be referred to, a residual network prevents the input and the processing result from overfitting, and in which the input structured data is passed by a feed-forward neural network from the input layer through the hidden layer to the output layer, with no recurrent path, to output a vector value expressing similarity, thereby outputting feature values of the structured data through analysis and prediction of the structured data; and
an operation unit 123 that adds the feature values of the unstructured data and the feature values of the structured data and outputs the sum to the classification unit 130.

11. The method of claim 10,
further comprising: d) classifying, by the classification unit 130, based on a classification model, the result of adding the feature values of the unstructured data and the feature values of the structured data output from the data processing unit 120, the method being a natural language processing method using structured data and unstructured data.

(Deleted)

13. The method of claim 10,
wherein step a) includes: a-1) extracting text, by the input unit 110, when text-based data is recognized among the input data; and
a-2) classifying, by the input unit 110, the extracted text into structured data or unstructured data according to a schema form based on a data entity, an attribute, and a relationship, data characteristics, numeric data, categorical data, and whether operation is possible, the method being a natural language processing method using structured data and unstructured data.

14. The method of claim 13,
wherein step a-1) further includes, when the input unit 110 recognizes voice-based data in the input data, converting the voice-based data into text data and extracting only the text, the method being a natural language processing method using structured data and unstructured data.