KR20210051293A - Legal Document Automatic Classification Technology - Google Patents
- Publication number
- KR20210051293A (application number KR1020190136360A)
- Authority
- KR
- South Korea
- Prior art keywords
- unit
- document
- legal
- classification
- legal document
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services
Description
The present invention relates to a method for accurately classifying legal documents into civil, administrative, and criminal cases. More specifically, it relates to a natural language processing method for legal document classification that uses a numerical word-importance value derived from a word's frequency within a single document and its frequency across the entire document collection, word embeddings produced with Word2Vec, and a convolutional neural network.
As information processing technology has advanced, a large amount of legal information is now stored electronically, and lawyers who must process this information use machine learning techniques such as text classification to handle large volumes of data efficiently.
Automatic document classification (text classification) is a technique for assigning a given text document to an appropriate category, and it is used in a variety of fields such as news classification and chatbots.
In the legal field, text classification is generally called predictive coding, and conventional predictive coding has relied on logistic regression and support vector machine (SVM) algorithms.
Over the past few years, deep learning has driven substantial progress in machine learning and artificial intelligence, and models based on deep neural networks have performed well in natural language processing tasks such as speech recognition and machine translation. Applications in the legal field, however, are hard to find.
To judge how important a word is within a document, TF-IDF (Term Frequency - Inverse Document Frequency) is used, which combines the frequency of the word within a single document (term frequency) with the frequency of the word across the entire document collection (document frequency).
The problem the present invention seeks to solve is to provide a legal document classification system that accurately classifies legal documents into civil, administrative, and criminal cases, using a numerical word-importance value based on a word's frequency within a single document (term frequency) and its frequency across the entire document collection (document frequency), word embeddings produced with Word2Vec, and a convolutional neural network.
The legal document classification method of the present invention computes an importance value for each word that takes into account both the word's frequency within a document and its frequency across the document collection, combines that importance value with the word's Word2Vec vector to produce a combined vector, and classifies documents with a deep learning convolutional neural network that takes the combined vectors as input data.
By training a deep neural network on inputs that reflect both a word embedding technique, which can infer the meaning a word carries within a document, and the importance of each word, the present invention can provide a legal document classification apparatus and method that classify legal documents into civil, criminal, and administrative cases with high accuracy.
In addition, because the approach can classify not only legal documents but also other text data with high accuracy, it can also provide a document classification apparatus and method that output classification results matching the user's intent.
FIG. 1 is a schematic diagram of the overall system, which consists of a data unit (110), a preprocessing unit (120), a learning unit (130), and a classification unit (140).
FIG. 2 is a schematic diagram of the internal structure of the preprocessing unit (120).
FIG. 3 shows how the combination unit (124) of the preprocessing unit (120) combines the embedding values from the embedding unit (122) with the weight values from the weight unit (123) to create new data.
FIG. 4 is an example illustration of the convolutional neural network used in the learning unit.
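To achieve the technical objective described above, the present invention includes: a data unit (110), which holds the data used for machine learning and testing of the classifier, gathered from only the case-description portions of court precedent documents; a preprocessing unit (120), which vectorizes the data so that it can be learned and classified more accurately and builds new data by combining TF-IDF scores with the vectors; a learning unit (130), which performs machine learning with deep learning on the preprocessed data; and a classification unit (140), which classifies documents using the trained model.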
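The data unit (110) uses 59,156 precedent records (civil cases: 27,561; criminal cases: 18,483; administrative cases: 13,112) obtained through the '국가 법령정보 공동 활용 OPEN API' (the national legal information sharing OPEN API).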
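The preprocessing unit (120) includes a conversion unit (121) that converts documents made up of sentences into word form, an embedding unit (122) that converts the resulting words into vectors, a weight unit (123) that computes the importance of each word, and a combination unit (124) that combines the vectors generated by the embedding unit with the importance values generated by the weight unit to produce a new data set.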
In Korean, the role of a noun changes depending on the particle attached to it, so the conversion unit (121) keeps each noun together with its particle and treats the pair as a single word in order to preserve the exact meaning.
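The conversion step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that splitting on whitespace is enough to keep a noun and its particle together, since Korean writes the particle directly after the noun without a space. The function name `to_word_units` and the sample sentence are hypothetical.

```python
from __future__ import annotations

import string


def to_word_units(sentence: str) -> list[str]:
    """Split a sentence on whitespace so each noun stays attached to its particle.

    Assumption (not stated in the source): whitespace-delimited units are a
    good enough approximation of 'noun + particle' words.
    """
    punctuation = string.punctuation + "·…“”‘’"
    units = [token.strip(punctuation) for token in sentence.split()]
    # Drop tokens that consisted only of punctuation.
    return [unit for unit in units if unit]


# Hypothetical example sentence: "The defendant shall compensate the plaintiff for damages."
tokens = to_word_units("피고는 원고에게 손해를 배상하라.")
```

Here `피고는` (defendant + topic particle) and `원고에게` (plaintiff + dative particle) each remain one unit, so the grammatical role is preserved in the token itself.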
The embedding unit (122) uses Word2Vec, a method developed at Google in 2013, to represent each word as an n-dimensional vector.
Based on Equation 1, the weight unit (123) uses the frequency of a word within a document (term frequency) and the inverse of the word's frequency across the entire document collection (inverse document frequency) to produce a weight that indicates how important the word is.
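Equation 1 is not reproduced in this text, so the sketch below uses the common textbook form of TF-IDF, weight = (count of the word in the document / document length) × log(N / number of documents containing the word), as an assumption. The function name `tf_idf` and the toy documents are hypothetical and only illustrate how such a weight could be computed.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document.

    Assumed formula (the patent's Equation 1 may differ):
        weight = (count / document length) * log(N / document frequency)
    """
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Toy tokenized documents, purely illustrative.
docs = [
    ["defendant", "sentenced", "prison", "term"],
    ["plaintiff", "claims", "damages", "payment"],
    ["defendant", "fined", "payment", "term"],
]
weights = tf_idf(docs)
```

In the first toy document, `prison` appears in only one of the three documents and therefore receives a higher weight than `defendant`, which appears in two.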
The combination unit (124) creates new data by combining the output of the weight unit (123) with the word data that the embedding unit (122) converted into vectors.
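The text does not say how the weight and the vector are combined, so the sketch below shows two plausible readings, both assumptions: scaling every component of the embedding by the weight, or appending the weight as one extra dimension. The names `combine` and `document_matrix` and the toy values are hypothetical.

```python
def combine(embedding, weight, mode="scale"):
    """Combine one word vector with its importance weight.

    mode="scale":  multiply every component by the weight (assumption).
    mode="append": add the weight as an extra dimension (assumption).
    """
    if mode == "scale":
        return [weight * x for x in embedding]
    if mode == "append":
        return list(embedding) + [weight]
    raise ValueError(f"unknown mode: {mode!r}")


def document_matrix(tokens, embeddings, weights, mode="scale"):
    """Build the per-document input: one combined vector per known token."""
    return [
        combine(embeddings[token], weights.get(token, 0.0), mode)
        for token in tokens
        if token in embeddings
    ]


# Toy 2-dimensional embeddings and weights, purely illustrative.
toy_embeddings = {"defendant": [1.0, -2.0], "prison": [0.5, 0.5]}
toy_weights = {"defendant": 0.5, "prison": 2.0}
matrix = document_matrix(["defendant", "prison", "unknown"], toy_embeddings, toy_weights)
```

Tokens without an embedding are skipped here; a real pipeline would need an explicit policy for out-of-vocabulary words.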
The learning unit (130) trains the model on the data generated by the preprocessing unit (120), using a convolutional neural network with five convolutional layers and one fully connected layer.
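The forward pass of such a network can be sketched in plain Python. Only the layer counts (five convolutional, one fully connected) come from the text; the 1-D convolution over the token sequence, the ReLU activations, the global max pooling, the kernel size, the channel count, and the random untrained weights are all assumptions chosen to keep the example small.

```python
import random


def conv1d(seq, kernels, bias):
    """Valid 1-D convolution over a sequence of feature vectors, then ReLU.

    seq:     list of T vectors, each of length C_in
    kernels: list of C_out kernels, each a list of K vectors of length C_in
    """
    k = len(kernels[0])
    out = []
    for t in range(len(seq) - k + 1):
        window = seq[t:t + k]
        row = []
        for kernel, b in zip(kernels, bias):
            total = b + sum(
                w * x
                for kernel_vec, input_vec in zip(kernel, window)
                for w, x in zip(kernel_vec, input_vec)
            )
            row.append(max(0.0, total))  # ReLU (assumed activation)
        out.append(row)
    return out


def forward(seq, conv_layers, fc_weights, fc_bias):
    """Run the convolutional stack, pool over time, apply the dense layer."""
    for kernels, bias in conv_layers:
        seq = conv1d(seq, kernels, bias)
    pooled = [max(column) for column in zip(*seq)]  # global max pooling (assumed)
    return [b + sum(w * x for w, x in zip(ws, pooled))
            for ws, b in zip(fc_weights, fc_bias)]


def random_kernels(rng, c_out, k, c_in):
    return [[[rng.uniform(-0.5, 0.5) for _ in range(c_in)] for _ in range(k)]
            for _ in range(c_out)]


rng = random.Random(0)
EMB_DIM, CHANNELS, KERNEL, N_CLASSES = 4, 3, 2, 3  # illustrative sizes

conv_layers = []
c_in = EMB_DIM
for _ in range(5):  # five convolutional layers, as stated in the text
    conv_layers.append((random_kernels(rng, CHANNELS, KERNEL, c_in), [0.0] * CHANNELS))
    c_in = CHANNELS

# One fully connected layer mapping pooled features to the three case types.
fc_weights = [[rng.uniform(-0.5, 0.5) for _ in range(CHANNELS)] for _ in range(N_CLASSES)]
fc_bias = [0.0] * N_CLASSES

# A stand-in document: 10 combined word vectors of dimension EMB_DIM.
document = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(10)]
logits = forward(document, conv_layers, fc_weights, fc_bias)
labels = ["civil", "criminal", "administrative"]
predicted = labels[logits.index(max(logits))]
```

Because the weights are random and untrained, `predicted` carries no meaning; the sketch only shows how data flows through the five convolutional layers and the single fully connected layer. In practice a deep learning framework would handle training.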
The classification unit (140) uses the trained model to classify legal documents and checks the resulting classification accuracy.
100: overall system
110: data unit
120: preprocessing unit
130: learning unit
140: classification unit
121: conversion unit
122: embedding unit
123: weight unit
124: combination unit
Claims (2)
1. A legal document classification technology consisting of a data unit (110), a learning unit (130), and a classification unit (140), the technology further including a preprocessing unit (120).
2. The legal document classification technology according to claim 1, wherein the preprocessing unit (120) includes an embedding unit (122), a weight unit (123), and a combination unit (124).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020190136360A KR20210051293A (en) | 2019-10-30 | 2019-10-30 | Legal Document Automatic Classification Technology |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20210051293A (en) | 2021-05-10 |
Family
ID=75917427
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20210051293A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117851602A (en) * | 2024-03-07 | 2024-04-09 | 武汉百智诚远科技有限公司 | Automatic legal document classification method and system based on deep learning |
CN117851602B (en) * | 2024-03-07 | 2024-05-14 | 武汉百智诚远科技有限公司 | Automatic legal document classification method and system based on deep learning |