KR20210051293A

KR20210051293A - Legal Document Automatic Classification Technology

Info

Publication number: KR20210051293A
Application number: KR1020190136360A
Authority: KR
Inventors: 이지훈; 이혁준
Original assignee: 광운대학교 산학협력단
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2021-05-10

Abstract

Provided is a technology for automatically classifying legal documents. To implement a machine learning model that classifies legal documents, a new type of input data set is used that combines word embeddings produced with Word2Vec, which has recently attracted attention in the field of natural language processing, with TF-IDF scores, which measure the importance of words. The model is trained on this data set, and legal documents are classified with high accuracy. The present technology comprises a data unit (110), a learning unit (130), and a classification unit (140).

Description

Legal Document Automatic Classification Technology

The present invention relates to a method for accurately classifying legal documents into civil, administrative, and criminal cases, and more specifically to a natural language processing method for legal document classification that uses a word importance score derived from the frequency of a specific word within a document and its frequency across the entire document collection, word embedding using Word2Vec, and a convolutional neural network.

As information processing technology advances, a large amount of legal information is stored electronically, and lawyers who must process this information use machine learning techniques such as text classification to handle large volumes of data efficiently.

Automatic document classification (text classification) is a technique for assigning a given text document to a suitable category, and it is used in various fields such as news classification and chatbots.

Text classification in the legal field is generally called predictive coding, and conventional predictive coding has relied on logistic regression and the support vector machine (SVM) algorithm.

Over the past few years, deep learning has advanced considerably in the fields of machine learning and artificial intelligence, and models using deep neural networks have performed well in natural language processing tasks such as speech recognition and language translation, but applications in the legal field are hard to find.

To determine the importance of words in a document, TF-IDF (term frequency-inverse document frequency) is used, which draws on the frequency of a word within one document (term frequency) and the frequency of the word across the entire document collection (document frequency).

The problem to be solved by the present invention is to provide a legal document classification system that accurately classifies legal documents into civil, administrative, and criminal cases, using a word importance score based on the frequency of a specific word within a document (term frequency) and its frequency across the entire document collection (document frequency), word embedding using Word2Vec, and a convolutional neural network.

The legal document classification method of the present invention calculates a word importance value that takes into account the frequency of the word within a document and its frequency across the document collection, combines this importance value with the word's Word2Vec vector to generate a combined vector, and classifies documents with a deep learning convolutional neural network that takes the combined vectors as input data.

By training a deep neural network on data that reflects both a word embedding technique capable of inferring the meaning a word has within a document and the importance of that word, the present invention can provide a legal document classification apparatus and method that classify legal documents into civil, criminal, and administrative cases with high accuracy.

In addition, because text data other than legal documents can also be classified with high accuracy, a document classification apparatus and method capable of outputting classification results that match the user's intent can also be provided.

FIG. 1 is a schematic diagram of the entire system, which consists of a data unit (110), a preprocessing unit (120), a learning unit (130), and a classification unit (140).
FIG. 2 is a schematic diagram of the internal structure of the preprocessing unit (120).
FIG. 3 shows how the combining unit (124) of the preprocessing unit (120) combines the embedding values from the embedding unit (122) with the weight values from the weighting unit (123) to create new data.
FIG. 4 is an example illustration of the convolutional neural network used in the learning unit.

To achieve the above technical objective, the present invention includes: a data unit (110), which holds the data used for machine learning and testing for classification, collected from only the case-content portion of court precedent documents; a preprocessing unit (120), which vectorizes the data so that it can be learned and classified more accurately and builds new data by combining TF-IDF scores with the vectors; a learning unit (130), which performs machine learning with deep learning on the preprocessed data; and a classification unit (140), which classifies documents using the trained model.

The data unit (110) uses precedent data for 59156 cases (civil cases: 27561, criminal cases: 18483, administrative cases: 13112) obtained through the '국가 법령정보 공동 활용 OPEN API' (the open API for joint use of national legal information).

The preprocessing unit (120) includes a conversion unit (121) that converts a document composed of sentences into word form, an embedding unit (122) that converts the resulting words into vectors, a weighting unit (123) that calculates the importance of each word, and a combining unit (124) that generates a new data set by combining the vectors produced by the embedding unit with the importance values produced by the weighting unit.

In Korean, a noun plays a different role depending on the particle (조사) attached to it, so the conversion unit (121) joins each noun with its particle and treats the pair as a single word in order to preserve the exact meaning.
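
The noun-plus-particle rule can be sketched as a small post-processing step over part-of-speech-tagged morphemes. This is only an illustration: the source does not name a morphological analyzer, so the tag labels `Noun` and `Josa`, the function `merge_noun_particle`, and the hand-tagged sentence are all assumptions.

```python
from typing import List, Tuple

# Hypothetical POS labels; a real morphological analyzer defines its own tag set.
NOUN_TAG = "Noun"
PARTICLE_TAG = "Josa"


def merge_noun_particle(tagged: List[Tuple[str, str]]) -> List[str]:
    """Join each noun with the particle(s) that directly follow it."""
    words: List[str] = []
    attachable = False  # True while the last emitted word is a noun (plus particles)
    for surface, tag in tagged:
        if tag == PARTICLE_TAG and attachable:
            # Glue the particle onto the preceding noun; stacked particles stay attached.
            words[-1] += surface
        else:
            words.append(surface)
            attachable = tag == NOUN_TAG
    return words


# Hand-tagged example sentence (assumed input, not taken from the source).
tagged_sentence = [
    ("법원", "Noun"), ("이", "Josa"),
    ("판결", "Noun"), ("을", "Josa"),
    ("선고했다", "Verb"),
]
tokens = merge_noun_particle(tagged_sentence)
print(tokens)
```

Keeping the particle attached means that the subject form and the object form of the same noun become distinct vocabulary entries, which is the behavior the conversion unit is described as wanting.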

The embedding unit (122) represents each word as an n-dimensional vector using Word2Vec, a method developed at Google in 2013.

Based on Equation 1, the weighting unit (123) produces a weight that indicates how important each word is, using the frequency of the word within the document (term frequency) and the inverse of the word's frequency across the entire document collection (inverse document frequency).
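
Equation 1 is not reproduced in this text, so the following sketch assumes the common formulation, in which a word's score is its raw count in the document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. The function name `tf_idf` and the toy corpus are illustrative assumptions, and the actual Equation 1 may use a different weighting or smoothing.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: score} map per document, scored as tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each word in this document
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores


# Toy tokenized corpus (assumed data): one document per case category.
docs = [
    ["defendant", "imprisonment", "ruling", "defendant"],
    ["plaintiff", "damages", "ruling"],
    ["plaintiff", "disposition", "revocation", "ruling"],
]
weights = tf_idf(docs)
for w, s in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{w}: {s:.3f}")
```

Under this formulation a word that appears in every document, such as "ruling" here, scores zero, while a word concentrated in one document scores highest.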

The combining unit (124) creates new data by combining the output of the weighting unit (123) with the word data that the embedding unit (122) has converted to vectors.
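
The text does not state how the two values are combined. One plausible reading, sketched below, appends the word's importance score to its n-dimensional embedding to form an (n+1)-dimensional vector; multiplying the embedding by the score would be another option. The function `combine`, the 3-dimensional toy embeddings, and the scores are all assumptions made for illustration.

```python
from typing import Dict, List


def combine(tokens: List[str],
            embeddings: Dict[str, List[float]],
            tfidf: Dict[str, float]) -> List[List[float]]:
    """Build one row per token: the embedding with the importance score appended."""
    rows = []
    for word in tokens:
        if word not in embeddings:
            continue  # skip out-of-vocabulary words in this sketch
        rows.append(embeddings[word] + [tfidf.get(word, 0.0)])
    return rows


# Toy values standing in for Word2Vec vectors and TF-IDF scores (assumed data).
embeddings = {
    "defendant": [0.12, -0.40, 0.33],
    "ruling": [0.05, 0.21, -0.17],
}
tfidf_scores = {"defendant": 2.197, "ruling": 0.0}

document_matrix = combine(["defendant", "ruling", "unknown"], embeddings, tfidf_scores)
for row in document_matrix:
    print(row)
```

Each document then becomes a matrix with one row per word, which is the kind of input a convolutional network can slide filters over.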

The learning unit (130) trains the model on the data generated by the preprocessing unit (120), using a convolutional neural network with five convolutional layers and one fully connected layer.
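
To make the layer structure concrete, the following dependency-free sketch runs a forward pass through five convolutional layers and one fully connected layer. Only the layer counts and the three output classes come from the text. The use of 1-D convolution over word positions, the kernel width, the channel count, ReLU, global max pooling, softmax, and the random untrained weights are all assumptions, and training is not shown.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]  # indexed as [word position][channel]


def conv1d_relu(x: Matrix, kernels: List[Matrix], bias: List[float]) -> Matrix:
    """Valid 1-D convolution over positions, followed by ReLU.

    kernels[o][k][c] is the weight for output channel o, offset k, input channel c.
    """
    width = len(kernels[0])
    in_ch = len(x[0])
    out = []
    for t in range(len(x) - width + 1):
        row = []
        for kernel, b in zip(kernels, bias):
            s = b + sum(kernel[k][c] * x[t + k][c]
                        for k in range(width) for c in range(in_ch))
            row.append(max(0.0, s))
        out.append(row)
    return out


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def build_cnn(rng: random.Random, in_ch: int, hidden: int = 8, width: int = 3,
              n_conv: int = 5, n_classes: int = 3) -> Tuple[list, Matrix]:
    """Create random weights for n_conv convolutional layers and one dense layer."""
    convs, ch = [], in_ch
    for _ in range(n_conv):
        kernels = [[[rng.uniform(-0.5, 0.5) for _ in range(ch)]
                    for _ in range(width)] for _ in range(hidden)]
        convs.append((kernels, [0.0] * hidden))
        ch = hidden
    fc = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(n_classes)]
    return convs, fc


def forward(x: Matrix, convs: list, fc: Matrix) -> List[float]:
    for kernels, bias in convs:
        x = conv1d_relu(x, kernels, bias)
    # Global max pooling over positions leaves one value per channel.
    pooled = [max(row[c] for row in x) for c in range(len(x[0]))]
    logits = [sum(w * p for w, p in zip(weights, pooled)) for weights in fc]
    return softmax(logits)  # probabilities for civil, criminal, administrative


rng = random.Random(0)
# Stand-in document: 20 words, each a 4-dim embedding plus 1 importance score.
document = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(20)]
convs, fc = build_cnn(rng, in_ch=5)
probs = forward(document, convs, fc)
print(len(convs), "conv layers ->", len(probs), "class probabilities")
```

With a kernel width of 3 and no padding, each convolutional layer shortens the sequence by two positions, so a document needs at least 11 words to pass through all five layers in this sketch.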

The classification unit (140) classifies legal documents using the trained model and checks the resulting classification accuracy.
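
As a minimal illustration of the accuracy check, the sketch below compares predicted labels with true labels and reports overall and per-category accuracy. The labels are made-up examples, not results from the source, and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of documents whose predicted category matches the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_accuracy(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Accuracy computed separately over the documents of each true category."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels for five test documents (assumed data).
y_true = ["civil", "civil", "criminal", "administrative", "criminal"]
y_pred = ["civil", "criminal", "criminal", "administrative", "criminal"]

print(accuracy(y_true, y_pred))
print(per_class_accuracy(y_true, y_pred))
```

Because the three categories in the data set differ in size, the per-category figures can reveal weaknesses that the overall accuracy hides.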

100: entire system
110: data unit
120: preprocessing unit
130: learning unit
140: classification unit
121: conversion unit
122: embedding unit
123: weighting unit
124: combining unit

Claims

A legal document classification technology comprising a data unit (110), a learning unit (130), and a classification unit (140),
the legal document classification technology further including a preprocessing unit (120).

The legal document classification technology according to claim 1,
wherein the preprocessing unit (120) includes an embedding unit (122), a weighting unit (123), and a combining unit (124).