KR102574337B1

KR102574337B1 - Violent and Nonviolent Situations Recognition Method based on Korean Dialogue Using BERT Language Model

Info

Publication number: KR102574337B1
Application number: KR1020210101038A
Authority: KR
Inventors: 최영석; 김진환; 이현규; 이지학; 박진오
Original assignee: 광운대학교 산학협력단
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2023-09-01
Also published as: KR20230018952A

Abstract

Disclosed is a method for recognizing violent and non-violent situations from Korean dialogue using a BERT language model. The method includes: (a) pre-training a morpheme-based Korean BERT language model on Korean sentence text containing words and phrases (eojeol) from violent and non-violent situations, and running the pre-trained morpheme-based Korean BERT language model on Korean dialogue-based sentence text containing such words and phrases in order to recognize four types of violent situation (intimidation, extortion or blackmail, workplace bullying, and other bullying) as well as non-violent situations; and (b) connecting one of four deep learning classification models, namely a multilayer perceptron (MLP), a convolutional neural network (CNN), a long short-term memory network (LSTM), or a bidirectional long short-term memory network (Bi-LSTM), to the pre-trained Korean BERT language model, and operating that model on the Korean sentence text to recognize violent and non-violent situations.
Hate speech detection, one area of natural language processing (NLP) research, is being actively studied abroad as the need for it grows. However, because that research is confined to online problems, it is not suited to offline problems such as school violence and workplace bullying, which have recently become increasingly serious. Moreover, research on recognizing violent situations from Korean dialogue remains very scarce. To address such offline problems, this work builds a database of 21,594 items covering four types of violent situation and the non-violent situation, and proposes a deep learning neural network model that combines a pre-trained Korean BERT language model with four deep learning-based classification models for recognizing violent and non-violent situations from Korean dialogue. Validation on a test dataset of 1,000 items randomly drawn from the self-built database confirmed that the proposed model performs well in classifying the four violent situations and the non-violent situation.

Description

Violent and Nonviolent Situations Recognition Method based on Korean Dialogue Using BERT Language Model

The present invention relates to a method for recognizing violent and non-violent situations from Korean dialogue using a BERT (Bidirectional Encoder Representations from Transformers) language model. More specifically, to address offline problems, a database was built in-house with reference to the International Classification of Crime for Statistical Purposes (ICCS); it covers five situations in total, namely four types of violent situation (intimidation, extortion or blackmail, workplace bullying, and other bullying) and the non-violent situation. The invention provides a deep learning neural network model that uses natural language processing (NLP) and consists of a morphological-analysis-based Korean BERT language model and four deep learning-based classification models (MLP, CNN, LSTM, Bi-LSTM) for recognizing violent and non-violent situations from Korean dialogue.

Natural language processing (NLP) is a technology by which a computer understands the syntax and semantics of human language through natural language analysis. Natural language analysis is divided into four kinds: morphological analysis, which splits a sentence (a string expressed with a subject and a predicate) into morphemes, the smallest meaning-bearing units of a natural language; syntactic analysis (parsing), which analyzes the string according to a prescribed grammar; semantic analysis, which analyzes the meaning of the sentence; and pragmatic analysis. NLP involves an embedding step: the entire process of converting human natural language into a form a computer can process (numbers, vectors) is called embedding. Word embedding is a technique derived from neural network-based language models that represents lexical meaning by placing similar words close together in a vector space.
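The idea that similar words lie close together in a vector space can be illustrated with cosine similarity. The following is a minimal sketch only: the three-dimensional vectors and the example words are invented for illustration and do not come from any trained model or from the database described in this document.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (real embeddings have hundreds of dimensions).
embeddings = {
    "threat": [0.9, 0.1, 0.0],
    "intimidation": [0.8, 0.2, 0.1],
    "picnic": [0.0, 0.2, 0.9],
}

related = cosine_similarity(embeddings["threat"], embeddings["intimidation"])
unrelated = cosine_similarity(embeddings["threat"], embeddings["picnic"])

# Words with similar meaning should score higher than unrelated words.
print(f"threat vs intimidation: {related:.3f}")
print(f"threat vs picnic:       {unrelated:.3f}")
```

With these toy vectors the first pair scores much higher than the second, which is the property a word embedding is meant to provide.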

A morpheme is the smallest unit that carries meaning in the structure of the words that make up a sentence, and it can combine with a possessive-case marker.

For reference, syntactic analysis is classified into top-down parsing and bottom-up parsing. Syntactic analysis is carried out by a parser included in an interpreter or compiler; a parser is a program that builds a structured syntax tree from the series of tokens produced by lexical analysis. In top-down parsing, the parser reads the sentence from left to right one character at a time and builds the tree from the root node toward the terminal nodes. Conversely, bottom-up parsing builds the parse tree upward, from the terminal nodes toward the root node.
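As a rough illustration of top-down parsing, the sketch below is a tiny recursive-descent parser. The grammar (`expr := NUMBER ('+' NUMBER)*`) and all function names are assumptions chosen for brevity; they are not taken from this document, and a natural-language parser would be far more involved.

```python
import re


def tokenize(text):
    """Lexical analysis: keep digit runs and '+' signs, ignore everything else."""
    return re.findall(r"\d+|\+", text)


def parse_number(tokens, pos):
    """Terminal rule: consume one NUMBER token and return a leaf node."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError(f"expected a number at token {pos}")
    return ("num", int(tokens[pos])), pos + 1


def parse_expr(tokens, pos=0):
    """Root rule expr := NUMBER ('+' NUMBER)*, expanded from the root downward."""
    node, pos = parse_number(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_number(tokens, pos + 1)
        node = ("+", node, right)  # left-associative tree
    return node, pos


tree, consumed = parse_expr(tokenize("1 + 2 + 3"))
print(tree)
```

The parser starts from the root rule and descends to terminals while reading tokens left to right; a bottom-up parser would instead combine terminals into larger subtrees until it reaches the root.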

As related prior art 1, Patent Publication No. 10-2015-0089168 discloses a "Language Analysis Method and System Using Artificial Intelligence." Aimed at character education and at preventing social problems such as suicide, bullying, school assault, and sexual assault, it performs language analysis on all text entered on electronic devices such as computers, mobile phones, and tablet PCs and on language used in phone calls. Bad language that harms character (vulgar words, slang, swear words, coaxing words, threatening words, words hinting at suicide, bullying words, and words of violence and assault) and good language that benefits character development (proper words, kind words, words of thanks, words of good deeds, and so on) are registered in a dictionary in advance. The number of times such language is used is tallied and converted into a score so that users try on their own to use better language, which makes character education more effective. The occurrence counts are also analyzed to identify students at risk of suicide, violence, assault, or bullying, so that parents, homeroom teachers, and other concerned parties can be contacted in advance to prevent it.

The language analysis method and system using artificial intelligence includes all or some of the following steps: a pre-registration step in which, through the pre-registration unit 210 of the artificial intelligence unit 200, bad language that harms character and good language that helps cultivate a sound mind are registered in advance according to the purpose, a score is assigned to each occurrence of such language, and evaluation grades (good, normal, caution, danger, emergency, and so on) based on the scores summed over a given period are registered in advance; a step of collecting the occurrence frequency of such language from all text entered through the input unit 100 and from incoming and outgoing calls made through the call unit 140, whose content is converted to text by speech recognition; a step of automatically deleting already-analyzed data in fixed units to reduce data volume; a step of sending the results to the monitor unit 300 so that the user and concerned parties can monitor them directly; a step of automatically sending the evaluation results to the user and concerned parties at set times; and a step in which, when the analysis indicates an emergency such as suicide, assault, violence, or bullying, the transmission unit 350 automatically and promptly sends the content to the concerned parties.
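The dictionary-and-score scheme of prior art 1 can be sketched as follows. This is only an illustration under stated assumptions: the registered words, their scores, the use of negative scores for good language, and the grade thresholds are all hypothetical, since the cited publication's actual values are not given here.

```python
from collections import Counter

# Hypothetical pre-registered dictionary: word -> score per occurrence.
# Negative scores for "good" words are an assumption made for this sketch.
WORD_SCORES = {"idiot": 2, "kill": 5, "thanks": -1}

# Hypothetical grade table: (lower bound of summed score, grade name).
GRADES = [(0, "good"), (5, "normal"), (10, "caution"), (20, "danger"), (30, "emergency")]


def score_messages(messages):
    """Count registered words over a period and sum their assigned scores."""
    counts = Counter()
    for message in messages:
        # Naive whitespace tokenization; punctuation handling is omitted.
        for token in message.lower().split():
            if token in WORD_SCORES:
                counts[token] += 1
    total = sum(WORD_SCORES[word] * n for word, n in counts.items())
    return counts, total


def grade(total):
    """Map a summed score to the highest grade whose lower bound it reaches."""
    label = GRADES[0][1]
    for lower_bound, name in GRADES:
        if total >= lower_bound:
            label = name
    return label


counts, total = score_messages(["you idiot", "i will kill you", "thanks idiot"])
print(dict(counts), total, grade(total))
```

Such a scheme counts surface forms only, which is the limitation that motivates context-aware models such as BERT later in this description.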

As related prior art 2, Patent Registration No. 10-2260646 is registered for a "Natural Language Processing System and Word Representation Method in Natural Language Processing."

The word representation method in natural language processing, performed by the natural language processing system, includes:

a) providing a vocabulary dictionary dataset that includes a vocabulary of at least one word and pre-trained word embedding information for each word;

b) when vocabulary based on the vocabulary dictionary dataset is provided as input data, extracting subword information for the words in the input data using a word representation model and computing word embedding vectors from the subword information;

c) learning a word representation for each word by matching the computed word embedding vector with the word's pre-trained word embedding information, so that the pre-trained word embedding information is replaced by the computed word embedding vector;

d) when an out-of-vocabulary (OOV) word is provided as input data to the trained word representation model, extracting subword information for the OOV word and using it to compute the OOV word's word embedding vector; and

e) inferring the intrinsic meaning of the OOV word by computing similarities between word embedding vectors through vector operations on the computed embedding vector and extracting the OOV word's neighboring words,

wherein the word representation model includes:

a convolution module, based on a convolutional neural network, that computes subword feature vectors from the subword information; and a highway module, based on a highway network, that adaptively combines the subword feature vectors computed by the convolution module to produce the word embedding vector of the word.

FIG. 1 is a diagram illustrating a word representation model in conventional natural language processing (NLP).

The word representation model 200 learns to generate word representations through supervised learning against pre-trained word embeddings. It includes a convolution module 210, a highway module 220, and an optimization module 230.

The convolution module 210 extracts character-based subword features with a convolutional neural network (CNN). The highway module 220 uses a highway network to adaptively combine the subword features extracted by the convolution module 210 into a word embedding vector. The optimization module 230 then performs optimization so that the word embedding vector computed by the highway module 220 becomes similar to the pre-trained word embedding.

Because a CNN can extract local features in natural language processing, the convolution module 210 applies a CNN to the character sequence to extract subword information. The convolution module 210 contains filters that each extract a different feature from the character sequence and produces feature maps, the matrices computed by each filter. Once the feature maps are extracted, a nonlinear function (tanh, the hyperbolic tangent) is applied to turn them into nonlinear values indicating the presence or absence of each feature.
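A single convolution filter sliding over a character sequence, followed by tanh, might look like the sketch below. The two-dimensional character vectors and the filter weights are made-up values, and the function name is hypothetical; the cited prior art's actual filter sizes and embeddings are not specified here.

```python
import math


def conv1d_feature_map(char_vectors, kernel):
    """Slide one filter over a sequence of character vectors and apply tanh.

    char_vectors: list of per-character embedding vectors.
    kernel: list of weight vectors, one per position in the filter window.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(char_vectors) - width + 1):
        window = char_vectors[start:start + width]
        # Dot product between the window and the filter weights.
        activation = sum(
            x * w
            for vec, k_vec in zip(window, kernel)
            for x, w in zip(vec, k_vec)
        )
        feature_map.append(math.tanh(activation))
    return feature_map


# Hypothetical 2-dimensional embeddings for a 4-character word.
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical filter spanning 2 characters.
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d_feature_map(chars, kernel)
print(feature_map)
```

A module of this kind would use many filters of different widths, each yielding its own feature map; only one filter is shown to keep the arithmetic easy to follow.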

In general, performance improves as the depth of the trained word representation model increases. However, greater depth makes optimization and training harder. A highway network allows the word representation model to be made deeper while controlling the flow of information and maximizing trainability.

The highway module 220 takes the subword feature vectors received from the convolution module 210 as the input y and additionally applies a nonlinear transform (T) and a carry (C). Because they express how much of the output z is transformed from, and how much is carried over from, the input y, T is called the transform gate and C the carry gate.
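One common formulation of a highway layer is z = T(y) * H(y) + C(y) * y with C(y) = 1 - T(y), where H is a candidate nonlinear transform. The sketch below follows that formulation as an assumption; the text above does not state the exact equations, and the choice of tanh for H, the sigmoid gate, and the weight names are illustrative only.

```python
import math


def affine(weights, bias, y):
    """Compute W @ y + b for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, y)) + b for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def highway_layer(y, w_h, b_h, w_t, b_t):
    """One highway layer: gate between a transformed input and the raw input."""
    h = [math.tanh(a) for a in affine(w_h, b_h, y)]  # candidate transform H(y)
    t = [sigmoid(a) for a in affine(w_t, b_t, y)]    # transform gate T(y)
    c = [1.0 - g for g in t]                         # carry gate C(y), assumed 1 - T(y)
    return [ti * hi + ci * yi for ti, hi, ci, yi in zip(t, h, c, y)]


y = [0.5, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# Strongly negative gate bias: the layer carries the input through unchanged.
closed = highway_layer(y, identity, [0.0, 0.0], zeros, [-50.0, -50.0])
# Strongly positive gate bias: the layer outputs the transformed input.
opened = highway_layer(y, identity, [0.0, 0.0], zeros, [50.0, 50.0])
print(closed, opened)
```

The two extreme cases show why deep stacks of such layers remain trainable: when the transform gate is closed, information passes through a layer untouched.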

Recently, natural language processing (NLP) research has advanced considerably through BERT (Bidirectional Encoder Representations from Transformers), a deep learning-based language model published in 2018 that improves on recurrent neural network (RNN)-based language models [1]. Unlike earlier language models such as ELMo and GPT, BERT represents the input text by integrating bidirectional context, based on the encoder of the Transformer proposed in [2]. It is pre-trained on large unlabeled corpora such as dictionaries and news, and the pre-trained model is then fine-tuned on the labeled database of a downstream task, which lets it serve many downstream tasks. In particular, because it captures the meaning implied by words and sentences rather than merely recognizing their surface form, it is being used for hate speech detection.

Research on hate speech detection is being actively conducted in Korea and abroad, generally to address problems such as defamation, invasion of privacy, and personal attacks carried out through malicious online comments and chat [3, 4]. However, hate speech research confined to online problems is not suited to solving offline problems such as school violence and workplace bullying, which have recently become increasingly serious.

Hate speech detection, one area of natural language processing (NLP) research, is being actively studied abroad as the need for it grows. However, because hate speech detection research is confined to online problems, it is not suited to solving offline problems such as school violence and workplace bullying, which have recently become increasingly serious. Moreover, research on recognizing violent situations from Korean dialogue remains very scarce.

Furthermore, no research in Korea or abroad has yet recognized violent and non-violent situations in Korean by understanding and analyzing offline Korean dialogue. Solving this problem with a morphological-analysis-based BERT language model requires a Korean dialogue-based text database of violent and non-violent situations, but no such database exists among those currently public.

Patent Publication No. 10-2015-0089168 (published August 5, 2015), "Language Analysis Method and System Using Artificial Intelligence," Choi Jae-yong
Patent Registration No. 10-2260646 (registered May 31, 2021), "Natural Language Processing System and Word Representation Method in Natural Language Processing," Korea University Industry-University Cooperation Foundation

[1] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv:1810.04805, 2018.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017, pp. 5998-6008.
[3] P. Kapil, A. Ekbal, and D. Das, "Investigating Deep Learning Approaches for Hate Speech Detection in Social Media," in Proceedings of the 20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle: LR, 2020.
[4] Won-Seok Lee and Hyun-Sang Lee, "Detection of Discrimination and Hate Expressions Using Deep Learning Technology: Attention-Based Multi-Channel CNN Modeling," Journal of the Korea Institute of Information and Communication Engineering, Vol. 24, No. 12, pp. 1595-1603, 2020.

To solve the above problems, an objective of the present invention is to provide a method for recognizing violent and non-violent situations from Korean dialogue using a morphological-analysis-based BERT language model. To address offline problems, a database covering five situations in total, namely four types of violent situation (intimidation, extortion or blackmail, workplace bullying, and other bullying) and the non-violent situation, was built in-house with reference to the International Classification of Crime for Statistical Purposes (ICCS). The method provides a deep learning neural network model that uses natural language processing (NLP) and consists of a morphological-analysis-based Korean BERT language model and four deep learning-based classification models (MLP, CNN, LSTM, Bi-LSTM) for recognizing violent and non-violent situations from Korean dialogue.

To achieve this objective, the method for recognizing violent and non-violent situations from Korean dialogue using a BERT language model includes: (a) pre-training a morpheme-based Korean BERT language model on Korean sentence text containing words and phrases from violent and non-violent situations, and running the pre-trained morpheme-based Korean BERT language model on Korean dialogue-based sentence text containing such words and phrases in order to recognize four types of violent situation (intimidation, extortion or blackmail, workplace bullying, and other bullying) as well as non-violent situations; and (b) connecting one of four deep learning classification models, namely a multilayer perceptron (MLP), a convolutional neural network (CNN), a long short-term memory network (LSTM), or a bidirectional long short-term memory network (Bi-LSTM), to the pre-trained Korean BERT language model, and operating that model on Korean sentence text containing words and phrases from Korean dialogue-based violent and non-violent situations to recognize violent and non-violent situations.

For the method of the present invention for recognizing violent and non-violent situations from Korean dialogue using a BERT language model, a database covering five situations in total, namely four types of violent situation (intimidation, extortion or blackmail, workplace bullying, and other bullying) and the non-violent situation, was built in-house with reference to the International Classification of Crime for Statistical Purposes (ICCS) in order to address offline problems. The invention presents a deep learning neural network model that uses natural language processing (NLP) and consists of a morphological-analysis-based Korean BERT language model and four deep learning-based classification models (MLP, CNN, LSTM, and Bi-LSTM) for recognizing violent and non-violent situations from Korean dialogue.

FIG. 1 is a diagram illustrating a word representation model in conventional natural language processing (NLP).
FIG. 2 is a configuration diagram of Korean dialogue-based violent and non-violent situation recognition using a BERT language model according to the present invention.

Hereinafter, the configuration and operation of preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Where a detailed description of a related known function or configuration could unnecessarily obscure the gist of the present invention, that description is omitted. The same reference numerals denote the same components across different drawings.

For the method of the present invention for recognizing violent and non-violent situations from Korean dialogue using a BERT (Bidirectional Encoder Representations from Transformers) language model, a database covering five situations in total, namely four types of violent situation (intimidation, extortion or blackmail, workplace bullying, and other bullying) and the non-violent situation, was built in-house with reference to the International Classification of Crime for Statistical Purposes (ICCS) in order to address offline problems. The invention proposes a deep learning neural network model that uses natural language processing (NLP) and consists of a morphological-analysis-based Korean BERT language model and four deep learning-based classification models (MLP, CNN, LSTM, and Bi-LSTM) for recognizing violent and non-violent situations from Korean dialogue.

Sentence construction in each language is governed by syntax, which follows syntactic rules (for example, subject + object + verb), and by semantics, which concerns the meaning conveyed by words and sentences.

The method of the present invention for recognizing violent and non-violent situations from Korean dialogue using a BERT language model includes:

(a) building a Korean dialogue-based text database of violent and non-violent situations using three collection methods (video media, crawling, and keyword-based collection) together with back-translation-based data augmentation; pre-training a morpheme-based Korean BERT language model on Korean sentence text containing words and phrases from violent and non-violent situations;

feeding Korean dialogue-based sentence text containing words and phrases from the violent and non-violent situations into the pre-trained Korean BERT language model; and running the pre-trained morpheme-based Korean BERT language model to recognize the four types of violent situation (intimidation, extortion or blackmail, workplace bullying, and other bullying) and non-violent situations; and

(b) because pre-training only teaches general language representations, fine-tuning the morphological-analysis-based BERT language model to the specific task of recognizing violent and non-violent situations in Korean sentence text,

by connecting one of four classification models, namely a multilayer perceptron (MLP), a convolutional neural network (CNN), a long short-term memory network (LSTM), or a bidirectional long short-term memory network (Bi-LSTM), to the pre-trained Korean BERT language model, operating that deep learning neural network model, and receiving Korean sentence text containing words and phrases from Korean dialogue-based violent and non-violent situations to recognize violent and non-violent situations.

2. Recognition of Violent and Non-Violent Situations Based on Korean Dialogue Using the BERT Language Model

2.1 Korean BERT language model

In the present invention, the violent and non-violent situation recognition model was trained on Korean sentence text on top of the Korean BERT language model released by the Electronics and Telecommunications Research Institute (ETRI). That Korean BERT language model was pre-trained on a 23 GB corpus of encyclopedia and newspaper text, with the number of Transformer blocks, the hidden size, and the number of self-attention heads set to 12, 768, and 12, respectively.

Two released Korean BERT language models exist: an "eojeol-based (word-phrase-based) BERT language model" and a "morphological-analysis-based BERT language model". The embodiment uses the "morphological-analysis-based BERT language model", which reflects the characteristics of Korean.

The Korean BERT language model used here is the "morphological-analysis-based BERT language model", which applies morpheme tokenization.

Tokenization techniques include word tokenization, sentence tokenization, and morpheme tokenization.

Morpheme tokenization for the Korean BERT language model was performed with the morphological analyzer API released by ETRI.

2.2 Database of violent and non-violent situations

To build a Korean dialogue-based text database of violent and non-violent situations, three methods were used (video media, crawling, and keyword-based writing), and the data was then augmented with back-translation, a representative data augmentation technique from machine translation. To capture semantic links such as context, each data item was built to consist of five sentences, as shown in Table 1.

In addition, the data was built to comply with spelling, spacing, and sentence grammar rules. The first method, based on video media, converts Korean spoken dialogue containing words and phrases of violent and non-violent situations from video media such as YouTube into Korean text with speech-to-text (STT), then corrects and supplements the transcript. The second method, based on crawling, collects Korean text containing words and phrases of violent and non-violent situations from movie and drama scripts, news, and comments on video media, followed by preprocessing such as removing special characters, Chinese characters, and emoticons. The third method, keyword-based, builds a keyword list for each violent situation, as shown in Table 2, and then has people write sentences containing those keywords. Because data written directly by people can be biased toward particular situations, 20 adults were recruited to write data covering a variety of situations, which reduces that bias.

In addition, back-translation is a data augmentation technique that translates the text data built with the three methods above into various other languages (English, German, Japanese, and so on) using translation APIs such as Google Translate and Naver Papago, and then translates it back into Korean to expand the text data.
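The back-translation step can be sketched as follows. This is a minimal illustration, not the procedure used to build the database: the `translate` callable, its `(text, source, target)` signature, the language codes, and the lookup-table stand-in `fake_translate` are all assumptions introduced here, since the source names the translation services but not how they were called.

```python
from typing import Callable, Iterable, List

# Hypothetical signature: translate(text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator,
                   pivots: Iterable[str] = ("en", "de", "ja")) -> List[str]:
    """Return paraphrases of `text` obtained via Korean -> pivot language -> Korean."""
    variants: List[str] = []
    for pivot in pivots:
        foreign = translate(text, "ko", pivot)
        restored = translate(foreign, pivot, "ko")
        # Keep only sentences that differ from the original and are not duplicates.
        if restored != text and restored not in variants:
            variants.append(restored)
    return variants


# Stand-in translator for demonstration only: a tiny lookup table, not a real API.
_TABLE = {
    ("돈 내놔", "ko", "en"): "Give me the money",
    ("Give me the money", "en", "ko"): "돈을 줘",
}


def fake_translate(text: str, src: str, dst: str) -> str:
    """Return the table entry if present, otherwise the text unchanged."""
    return _TABLE.get((text, src, dst), text)


augmented = back_translate("돈 내놔", fake_translate)
```

In practice `fake_translate` would be replaced by a wrapper around whichever translation service is available; round trips that return the sentence unchanged add nothing and are dropped.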

In the present invention, to recognize violent and non-violent situations in Korean sentence text, the three methods (video media, crawling, and keyword-based writing) and back-translation-based data augmentation were used to build a database of 21,594 entries of Korean sentence text containing words and phrases of Korean dialogue-based violent and non-violent situations.

Table 1: Examples of constructed data for each violent and non-violent situation

Table 2: Examples of keyword lists for each violent situation

2.3 Recognition of Violent and Non-Violent Situations Based on Korean Dialogue

For Korean dialogue-based violent and non-violent situation recognition, a deep learning neural network model is proposed that consists of the pre-trained Korean BERT language model and four classification models: a multilayer perceptron (MLP), a convolutional neural network (CNN), a long short-term memory network (LSTM), and a bidirectional long short-term memory network (Bi-LSTM). The structure of the proposed model is shown in FIG. 2. The LSTM and Bi-LSTM models share the same structure apart from directionality.

As shown in FIG. 2, the inputs differ by classifier: the MLP classification model uses the [CLS] feature of the last encoding layer of the BERT language model; the CNN classification model uses the features of all tokens of the last encoding layer; and the LSTM and Bi-LSTM classification models use the features of all tokens of the last four encoding layers. The classifiers are configured as follows: the MLP classification model consists of a pooler layer (a linear layer with a tanh activation function) followed by a multilayer perceptron; the CNN classification model consists of three one-dimensional convolutional networks with kernel sizes of 32, 64, and 128; and the LSTM and Bi-LSTM classification models use four LSTMs (respectively Bi-LSTMs), each with two hidden layers and a hidden state size of half the input size. Each classifier performs Korean dialogue-based violent and non-violent situation recognition.

FIG. 2 is a block diagram of Korean dialogue-based violent and non-violent situation recognition using the BERT language model according to the present invention.

The Korean BERT language model includes:

a Korean sentence text input unit; a morpheme tokenization unit; an embedding layer comprising Position Embedding, which represents the position of each token of a Korean sentence as a vector whose dimensionality is given by the position embedding parameter (set to 768, 384, or 1024), and Token Embedding, which maps each token of the sentence to a vector of that dimensionality and is trained to capture the meaning of the token; and k Transformer encoder layers.

The special tokens of the BERT language model include the classification token [CLS]. After passing through all k Transformer layers, [CLS] carries the combined, condensed meaning of all tokens in the input sentence, so attaching a classifier to [CLS] classifies a single sentence or a sequence of consecutive sentences.

They also include the separator token [SEP], which is appended to the end of each Korean sentence to delimit it. The token sequence delimited by [CLS] and [SEP] is converted to ids such as [2, 153, 10, 56, 145, 498, 3] using the vocabulary learned during tokenization, and each id is then represented as a vector whose dimensionality is given by the position embedding parameter (768, 384, or 1024).

For example, a person knows what the tokens '나' (I), '학교' (school), and '간다' (go) mean, and what a sentence such as ['나', '는', '학교', '에', '간다'] means, but a computer represents each token as a number and initially has no notion of its meaning. The k Transformer encoders therefore learn the linguistic representation of each token and the meaning of the sentence.

The Korean BERT language model learns the meaning of sentences through 6, 12, or 24 Transformer encoder layers.

1) Input of the BERT language model

Korean text is input to the BERT language model, which then performs language model training and morpheme recognition.

Input example) "나는 학교에 간다" ("I go to school")

2) Morpheme Tokenization

Several tokenization techniques exist, including word tokenization, sentence tokenization, and morpheme tokenization. Morpheme tokenization splits a Korean sentence into morphemes, the smallest meaningful units of the language, and so captures the meaning of Korean better than the other tokenization techniques.

The Korean BERT language model of the present invention performs morpheme tokenization with the morphological analyzer API provided by the Electronics and Telecommunications Research Institute (ETRI).

Morpheme tokenization example) "나는 학교에 간다" ("I go to school") -> ['나', '는', '학교', '에', '간다']
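To make the difference between whitespace splitting and morpheme splitting concrete, the toy sketch below separates a trailing particle from each space-delimited word. It is not the ETRI morphological analyzer and does not call its API; the `PARTICLES` list and the function name are illustrative assumptions, and a real analyzer relies on a full lexicon and part-of-speech analysis.

```python
from typing import List

# Toy particle list for illustration only; a real analyzer uses a full lexicon.
PARTICLES = ("는", "은", "에", "를", "을")


def toy_morpheme_tokenize(sentence: str) -> List[str]:
    """Split on spaces, then peel one known particle off the end of each word."""
    tokens: List[str] = []
    for word in sentence.split():
        for particle in PARTICLES:
            if len(word) > len(particle) and word.endswith(particle):
                tokens.extend([word[:-len(particle)], particle])
                break
        else:
            # No particle matched, so the word is kept as a single token.
            tokens.append(word)
    return tokens


word_tokens = "나는 학교에 간다".split()
morpheme_tokens = toy_morpheme_tokenize("나는 학교에 간다")
```

Whitespace splitting yields three tokens, while the particle-aware split yields the five tokens of the example above, which is why the morpheme-level vocabulary can share '학교' across '학교에', '학교는', and similar forms.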

3) BERT special tokens

- [CLS]

○ The classification token. After passing through all Transformer layers, it carries the combined meaning of all tokens in the input sentence.

○ Because it holds the condensed meaning of the input sentence, attaching a simple classifier to [CLS] makes it possible to classify a single sentence or a sequence of consecutive sentences.

- [SEP]

○ The separator token. Because the BERT language model can take two sentences as input, a [SEP] token is appended to the end of each sentence to delimit it.
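The placement of the two special tokens can be sketched as below. The function name `build_input` is hypothetical, and the segment ids (0 for the first sentence, 1 for the second) follow the standard BERT input convention rather than anything stated in this document.

```python
from typing import List, Optional, Tuple


def build_input(first: List[str],
                second: Optional[List[str]] = None) -> Tuple[List[str], List[int]]:
    """Wrap one or two tokenized sentences with [CLS] and [SEP]."""
    tokens = ["[CLS]"] + first + ["[SEP]"]
    segments = [0] * len(tokens)
    if second is not None:
        # The second sentence gets its own closing [SEP] and segment id 1.
        tokens += second + ["[SEP]"]
        segments += [1] * (len(second) + 1)
    return tokens, segments


single_tokens, single_segments = build_input(["나", "는", "학교", "에", "간다"])
pair_tokens, pair_segments = build_input(["나", "는", "학교", "에", "간다"],
                                         ["집", "에", "간다"])
```

The output vector at the [CLS] position (index 0) is what a downstream classifier reads.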

4) Embedding layer

- Position Embedding

○ Given the tokens of a sentence, the position of each token is represented as a 768-dimensional value. The Korean BERT language model uses 768 for the position embedding parameter, which can be changed to values such as 384 or 1024.

○ Example) Given the tokens ['나', '는', '학교', '가', '싫', '어', '집', '에', '갔', '다'] ("I hated school, so I went home"), the sentence means that the speaker went home. If only the positions of '학교' (school) and '집' (home) are swapped, the sentence instead means that the speaker went to school. Because the meaning of a sentence depends on the position of each token, the position of each token is represented through Position Embedding.

- Token Embedding

○ Maps the value of each token to a 768-dimensional value.

○ Because the token '배' can denote a fruit (pear), a body part (belly), or a means of transport (boat), tokens are mapped into 768 dimensions and trained so that the meaning of each token can be captured.

○ Example) The tokens [[CLS], '나', '는', '학교', '에', '간다', [SEP]] are converted to the values [2, 153, 10, 56, 145, 498, 3] using the vocabulary learned during tokenization, and each number is then represented in 768 dimensions.
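The two embeddings can be illustrated with a small lookup sketch. Only the ids [2, 153, 10, 56, 145, 498, 3] come from the example above; the `VOCAB` dictionary, the table sizes, the 8-dimensional vectors (768 in the text), and the random initialization are assumptions made to keep the demonstration small. In a trained model both tables are learned, not random.

```python
import random
from typing import Dict, List

HIDDEN = 8  # the text uses 768; 8 keeps the demo small

# Hypothetical vocabulary; only the ids are taken from the example above.
VOCAB: Dict[str, int] = {"[CLS]": 2, "[SEP]": 3, "는": 10, "학교": 56,
                         "에": 145, "나": 153, "간다": 498}


def make_table(rows: int, dim: int, seed: int) -> List[List[float]]:
    """Build a random lookup table standing in for learned embedding weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


TOKEN_TABLE = make_table(512, HIDDEN, seed=0)    # one row per vocabulary id
POSITION_TABLE = make_table(16, HIDDEN, seed=1)  # one row per position


def embed(tokens: List[str]) -> List[List[float]]:
    """Sum the token embedding and the position embedding for each token."""
    ids = [VOCAB[token] for token in tokens]
    return [[t + p for t, p in zip(TOKEN_TABLE[token_id], POSITION_TABLE[pos])]
            for pos, token_id in enumerate(ids)]


tokens = ["[CLS]", "나", "는", "학교", "에", "간다", "[SEP]"]
ids = [VOCAB[token] for token in tokens]
vectors = embed(tokens)
```

Because the position vector is added to the token vector, the same token placed at a different index produces a different input vector, which is how word order reaches the encoder.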

5) Transformer encoder

A person knows what the tokens '나' (I), '학교' (school), and '간다' (go) mean, and what a sentence such as ['나', '는', '학교', '에', '간다'] means. A computer, however, represents each token as a number and initially has no notion of its meaning. The k Transformer encoders therefore learn the linguistic representation of each token and the meaning of the sentence.

In the embodiment, the Korean BERT language model learns the meaning of sentences through 12 Transformer encoder layers, but 6 or 24 Transformer encoder layers can also be used.

As the number of Transformer encoder layers increases, the capacity to represent tokens linguistically and to learn the meaning of sentences generally improves.

6) Fine-tuning

Because pre-training only learns general language representations, the morphological-analysis-based BERT language model must be fine-tuned to a specific task, here the recognition of violent and non-violent situations in Korean sentence text.

Fine-tuning is performed by adding a neural network model after the BERT language model.

In fine-tuning this BERT language model, four neural network models were used: the MLP, CNN, LSTM, and Bi-LSTM described below.

7) Multilayer Perceptron (MLP) fine-tuning model

For reference, a multilayer perceptron (MLP) model consists of an input layer, one or more hidden layers, and an output layer (a K x M x N structure), and is widely used in pattern recognition.

An MLP stacks several linear layers sequentially; perceptrons in adjacent layers are connected, but perceptrons within the same layer are not.

The input to the MLP fine-tuning model of the present invention is obtained by transforming the output of the [CLS] token from the last Transformer encoder layer of the Korean BERT language model.

The [CLS] token, of size 1x768, is used because it condenses the combined meaning of all tokens in the Korean sentence.

The Pooler layer trained during Next Sentence Prediction (NSP) in BERT pre-training is reused so that the meaning of the sentence is learned better.

The tanh activation function then converts the value of each neural network unit to a real value between -1 and +1.

The tanh activation function is defined as $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, with $-1 < \tanh(z) < +1$.

Finally, the multilayer perceptron (MLP) model, configured with two layers on the Pooler layer, recognizes a total of five classes (four violent situations + the non-violent situation).

① [CLS] token example) [-20.23, 10.68, ..., 197.39, 64.1237] (real values of size 1x768)

② MLP input example) [-0.43, -0.26, 0.84, ..., 0.32] (real values between -1 and 1, of size 1x768)

③ MLP output example) [0.7, 0.1, 0.05, 0.1, 0.05] (probability values of size 1x5)
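The path from the [CLS] vector (①) through the pooler (②) to class probabilities (③) can be sketched in plain Python. This is an illustration under stated assumptions, not the trained model: the vectors are 8-dimensional instead of 768, the weights are random, a single linear output layer stands in for the classifier, and the softmax output is assumed because the text describes the outputs as probabilities without naming the function.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]

HIDDEN = 8       # 768 in the text; reduced so the demo stays readable
NUM_CLASSES = 5  # four violent situations + the non-violent situation


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Compute weights @ x + bias for one input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def pool(cls_vector: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Pooler: a linear layer followed by tanh, squashing values into (-1, 1)."""
    return [math.tanh(v) for v in linear(cls_vector, weights, bias)]


def softmax(logits: Vector) -> Vector:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)


def random_matrix(rows: int, cols: int) -> Matrix:
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


pooler_w, pooler_b = random_matrix(HIDDEN, HIDDEN), [0.0] * HIDDEN
out_w, out_b = random_matrix(NUM_CLASSES, HIDDEN), [0.0] * NUM_CLASSES

cls_vector = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]   # stands in for ①
mlp_input = pool(cls_vector, pooler_w, pooler_b)               # corresponds to ②
probabilities = softmax(linear(mlp_input, out_w, out_b))       # corresponds to ③
predicted_class = probabilities.index(max(probabilities))
```

The predicted situation is the index of the largest probability; mapping indices to the five situation names is left out because the text does not fix an ordering.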

8) Convolutional Neural Network (CNN) fine-tuning model

A CNN is a model designed to learn while preserving spatial information. It addresses the inefficient learning and limited accuracy that result from the loss of spatial information when two-dimensional data such as images is flattened into one-dimensional data.

For reference, a convolutional neural network (CNN) is a multilayer neural network used mainly for character recognition and image analysis. The input layer and output layer are the first and last layers, respectively. Adjacent to the input layer, convolution layers and pooling layers can be stacked in pairs (convolution layer, pooling layer, convolution layer, pooling layer, ...), followed by an MLP made up of several fully connected (FC) layers. When an image is given as input to a CNN, each layer produces intermediate results called feature maps, from which features of the input image are extracted step by step. The weights used to produce a feature map are called filters; the mask used in a convolution layer, the 2×2 window used in a pooling layer, and the set of weights used in an FC layer can all be called filters. The function g used for down-sampling (or sub-sampling) in a pooling layer can be either a mean function, which computes the average, or a max function, which selects the maximum.

The input to the CNN fine-tuning model is the output of all tokens from the last Transformer encoder layer of the BERT language model (size: 512 (token length) x 768 (number of dimensions)).

Because the maximum token length of a sentence is set to 512, ① the CNN input has size 512x768.

② Three ③ convolution layers with kernel sizes of 32x768, 64x768, and 128x768 produce three feature maps (sizes: 1x481, 1x449, 1x385). ④ Max pooling with a 1x3 window then yields three pooled feature maps (sizes: 1x160, 1x149, 1x128), which are concatenated and passed through a fully connected layer to recognize the five classes (four violent situations + the non-violent situation).

① CNN input example) (real values of size 512x768)

⑤ CNN output example) [0.2, 0.6, 0.07, 0.3, 0.1] (probability values of size 1x5)
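The feature-map sizes quoted for the CNN can be checked with simple length arithmetic. The sketch below assumes stride 1 with no padding for the convolutions, a pooling stride equal to the window with any remainder dropped, and one output channel per kernel size; these settings are not stated in the text but reproduce its numbers. The function names are illustrative.

```python
from typing import List

SEQ_LEN = 512             # maximum token length from the text
KERNELS = (32, 64, 128)   # kernel heights; each kernel spans all 768 dimensions
POOL_WINDOW = 3           # 1x3 max pooling


def conv_output_length(seq_len: int, kernel: int) -> int:
    """Length after a stride-1 convolution without padding."""
    return seq_len - kernel + 1


def pool_output_length(length: int, window: int) -> int:
    """Length after non-overlapping pooling, dropping any remainder."""
    return length // window


feature_lengths: List[int] = [conv_output_length(SEQ_LEN, k) for k in KERNELS]
pooled_lengths: List[int] = [pool_output_length(n, POOL_WINDOW) for n in feature_lengths]
combined_length = sum(pooled_lengths)  # size of the vector fed to the FC layer
```

Under these assumptions the lengths come out as 481, 449, 385 and 160, 149, 128, matching the text, and the concatenated vector passed to the fully connected layer has 437 values.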

9) Long Short-Term Memory (LSTM) fine-tuning model

For reference, a recurrent neural network (RNN) is a neural network in which the output of a node is fed back into that node, so the result is derived from both the current input and previously received inputs. The LSTM (Long Short-Term Memory) network was proposed as a remedy for the vanishing gradient problem that arises when training deep neural networks. An LSTM has a gate structure that can add information to or remove it from the cell state. A gate selectively lets information through and consists of a sigmoid neural network layer and an element-wise multiplication.

The unidirectional LSTM was designed to solve the long-term dependency problem: when the sequential (time-series) data of sentences from the Korean BERT language model is processed, gradients shrink as the sequence gets longer (the vanishing gradient problem), and information from earlier time steps is lost.

The LSTM fine-tuning model is a unidirectional LSTM. Because it also considers the sequential information of each token in a Korean sentence, it uses the outputs of BERT Transformer encoder layers 9 to 12 (size: 4x512x768), which carry more information than the tokens of the last BERT layer alone.

The output of all tokens of each layer (size: 512x768) is used as the LSTM input. The LSTM consists of two layers, and each LSTM layer consists of 512 LSTM cells. In the figure, LSTM cell 1 denotes the first LSTM cell and LSTM cell 512 denotes the 512th LSTM cell.

The LSTM outputs of each layer (size: 512x192) are concatenated and passed through a fully connected layer to recognize the four violent situations and the non-violent situation.

① LSTM input example) (real values of size 4x512x768)

② LSTM output example) [0.1, 0.3, 0.5, 0.03, 0.07] (probability values of size 1x5)

10) Bidirectional Long Short-Term Memory (Bi-LSTM) fine-tuning model

In the Bi-LSTM fine-tuning model, the output of all tokens of each layer (size: 512x768) is used as the LSTM input. The LSTM consists of two layers, and each LSTM layer consists of 512 LSTM cells.

The LSTM outputs of each layer (size: 512x192) are concatenated and passed through a fully connected (FC) layer to recognize the four violent situations and the non-violent situation.

The Bi-LSTM uses the same input as the unidirectional LSTM model. The difference is that it is trained on the sequential information of the sentences from the Korean language model in both directions rather than in one direction.

3. Results

The proposed deep learning neural network model was trained with the hyper-parameters epoch, max position length, and batch size set to 5, 128, and 32, respectively. Its performance was evaluated on 1,000 data items randomly drawn from the directly built database of violent and non-violent situations, using two metrics: accuracy and macro F1-score.
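For reference, the two evaluation metrics can be computed as in the sketch below. The label lists are toy values with three classes chosen only to exercise the functions; they are not results from this document, which evaluates five classes.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             labels: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only (not data from this document).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_macro_f1 = macro_f1(toy_true, toy_pred, labels=[0, 1, 2])
```

Macro F1 weights every class equally, so a rare situation class that is classified poorly lowers the score even when overall accuracy stays high.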

Table 3 shows the classification performance of the proposed deep learning neural network model.

As Table 3 shows, the bidirectional long short-term memory network (Bi-LSTM) performed best on every evaluation metric, and the long short-term memory network (LSTM) outperformed the multilayer perceptron (MLP). In addition, the deep learning neural network model proposed in the present invention won a prize in the speech recognition track of stage 2 of the 4th round of the 2020 AI Grand Challenge hosted by the Ministry of Science and ICT, which confirms its strong classification performance.

4. 결론 및 향후 진행 방향 4. Conclusion and future direction

본 발명에서는 학교 폭력, 직장 내 괴롭힘과 같은 오프라인상에서의 문제를 해결하기 위해 4종의 한국어 대화 기반 폭력 상황과 비폭력 상황으로 구성된 데이터베이스를 구축하였으며, 한국어 대화 기반 폭력 및 비폭력 상황 인식을 위한 한국어 BERT 언어 모델과 딥러닝 기반의 4가지 분류 모델(MLP, CNN, LSTM, 및 Bi-LSTM)로 구성된 딥러닝 신경망 모델 제안하였다. 제안된 신경망 모델 검증을 위해 자체적으로 구축한 데이터베이스에서 1,000개의 데이터를 임의로 추출하여 분류 성능을 비교하였다. 한국어 BERT 언어 모델에서 마지막 4개의 인코딩 층의 전체 토큰의 특징을 입력으로 사용한 양방향 장단기 메모리 신경망(Bi-LSTM)의 분류 성능이 가장 우수함을 보였다. In the present invention, in order to solve offline problems such as school violence and workplace bullying, a database consisting of four types of Korean dialogue-based violent and non-violent situations was established, and the Korean BERT language for recognizing Korean dialogue-based violent and non-violent situations A deep learning neural network model consisting of a model and four deep learning-based classification models (MLP, CNN, LSTM, and Bi-LSTM) was proposed. To verify the proposed neural network model, we randomly extracted 1,000 data from our own database and compared the classification performance. In the Korean BERT language model, the classification performance of Bi-LSTM using the features of all tokens of the last four encoding layers as input was the best.

Table 3 compares the classification performance of the proposed deep learning neural network models (MLP, CNN, LSTM, Bi-LSTM).

To apply the Korean BERT language model and neural network models (MLP, CNN, LSTM, Bi-LSTM) proposed in this study to real offline problems, a database was built directly with reference to the International Classification of Crime for Statistical Purposes (ICCS). It consists of five situation types in total: four violent situations (intimidation, extortion or blackmail, workplace bullying, and other bullying) and the non-violent situation. Using natural language processing (NLP), a deep learning neural network model was presented that consists of a morphological analysis-based Korean BERT language model for recognizing Korean dialogue-based violent and non-violent situations and four deep learning-based classification models (MLP, CNN, LSTM, Bi-LSTM).

Using the Korean BERT language model and neural network models (MLP, CNN, LSTM, Bi-LSTM) proposed in this study in products such as voice speakers and CCTV requires further performance improvement and model compression. Future research therefore needs to combine language models that have improved on the original BERT language model, such as RoBERTa and ELECTRA, with model compression techniques such as knowledge distillation.

Embodiments according to the present invention may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, and data structures, alone or in combination. Computer-readable recording media may include hardware devices configured to store and execute program instructions, such as magnetic media (storage, hard disks, floppy disks, and magnetic tape), optical media (CD-ROM, DVD), magneto-optical media (floptical disks), and storage media such as ROM, RAM, flash memory, and storage. Examples of program instructions may include not only machine code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter. The hardware device may be configured to operate as one or more software modules to perform the operations of the present invention.

As described above, the method of the present invention may be implemented as a program and stored on a recording medium (CD-ROM, RAM, ROM, memory card, hard disk, magneto-optical disk, storage device, etc.) in a form readable by computer software.

Although the present invention has been described with reference to specific embodiments, it is not limited to the same configuration and operation as those embodiments, which merely illustrate its technical idea. It may be practiced with various modifications without departing from its technical spirit and scope, and the scope of the present invention should be determined by the claims that follow.

Claims

(a) A step in which a morpheme-based Korean BERT language model is pre-trained on Korean sentence text containing words and phrases from violent and non-violent situations, and the pre-trained morpheme-based Korean BERT language model is executed on input Korean dialogue-based sentence text containing words and phrases from violent and non-violent situations, in order to recognize four types of violent situations (intimidation, extortion or blackmail, workplace bullying, and other bullying) and non-violent situations; and
(b) A step in which any one of four classification models, namely a multilayer perceptron (MLP), a convolutional neural network (CNN), a long short-term memory neural network (LSTM), and a bidirectional long short-term memory neural network (Bi-LSTM), is connected as a deep learning neural network model to the pre-trained Korean BERT language model, and the connected deep learning neural network model is operated to receive Korean sentence text containing words and phrases from Korean dialogue-based violent and non-violent situations and to recognize violent and non-violent situations;
A Korean dialogue-based violent and non-violent situation recognition method using the BERT language model, comprising the above steps.

According to claim 1,
Prior to step (a), the method further comprises a step of constructing a database that stores k Korean sentence texts containing words and phrases from Korean dialogue-based violent and non-violent situations, collected by three methods based on image media, crawling, and keywords and expanded by data augmentation based on the back-translation technique, so that violent and non-violent situations can be recognized in Korean sentence text, and of pre-training the Korean BERT language model on them.
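The back-translation augmentation named in this claim can be illustrated with a minimal sketch. The claim does not name a translation system, so the two translator callables below (`to_pivot`, `from_pivot`) are hypothetical stand-ins backed by toy dictionaries, and the example sentences are invented for illustration only.

```python
from typing import Callable, Dict, Iterable, List

Translator = Callable[[str], str]


def back_translate(sentences: Iterable[str],
                   to_pivot: Translator,
                   from_pivot: Translator) -> List[str]:
    """Return each sentence plus its back-translated variant, without duplicates."""
    augmented: List[str] = []
    seen = set()
    for sentence in sentences:
        # Translate to the pivot language and back to obtain a paraphrase.
        paraphrase = from_pivot(to_pivot(sentence))
        for candidate in (sentence, paraphrase):
            if candidate not in seen:
                seen.add(candidate)
                augmented.append(candidate)
    return augmented


# Toy dictionary-backed translators (assumption: a real system would call a
# machine translation model or service here).
KO_TO_EN: Dict[str, str] = {"내일 학교에 간다": "I go to school tomorrow"}
EN_TO_KO: Dict[str, str] = {"I go to school tomorrow": "나는 내일 학교에 간다"}

augmented = back_translate(
    ["내일 학교에 간다"],
    to_pivot=lambda s: KO_TO_EN.get(s, s),
    from_pivot=lambda s: EN_TO_KO.get(s, s),
)
print(augmented)
```

Each augmented sentence would keep the label of the sentence it was derived from; that labeling policy is also an assumption of this sketch.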

According to claim 2,
The Korean BERT language model is trained as a violent and non-violent situation recognition model for Korean sentence text, with the number of Transformer blocks, the hidden size, and the number of self-attention heads set to 12, 768, and 12, respectively. A Korean dialogue-based violent and non-violent situation recognition method using the BERT language model.
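As a rough illustration of the three settings in this claim, the sketch below stores them in a small configuration object. The class and field names are hypothetical; the derived per-head dimension follows the usual Transformer convention of splitting the hidden size evenly across attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertSizeConfig:
    """Hypothetical container for the sizes stated in the claim."""
    num_transformer_blocks: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12

    @property
    def head_dim(self) -> int:
        # Each self-attention head works on an equal slice of the hidden vector.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads


config = BertSizeConfig()
print(config.head_dim)  # 768 / 12 = 64
```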

According to claim 1,
The Korean BERT language model is a "morpheme analysis-based BERT language model" that uses morpheme tokenization as its tokenization technique. A Korean dialogue-based violent and non-violent situation recognition method using the BERT language model.

According to claim 4,
The tokenization technique includes word tokenization, sentence tokenization, and morpheme tokenization,
and the morpheme tokenization uses a morpheme analyzer API for morpheme tokenization of the Korean BERT language model. A Korean dialogue-based violent and non-violent situation recognition method using the BERT language model.

According to claim 1,
The Korean BERT language model comprises:
a Korean sentence text input unit; a morpheme tokenization unit; an embedding layer including position embedding, which represents the position of each token in a Korean sentence as a vector whose number of dimensions is given by the position embedding parameter (set to 768, 384, or 1024), and token embedding, which maps each token in the sentence to a vector with the same number of dimensions and is trained to capture the meaning of the token; and k Transformer encoder layers,
The tokens of the Korean BERT language model include a classification token [CLS], which, after passing through all k Transformer layers, carries the combined meaning of all tokens in the input sentence and therefore a condensed meaning of the input sentence, so that attaching a classifier to the classification token [CLS] enables classification of a single sentence or of consecutive sentences; and
a separator token [SEP], which is added to the end of a Korean sentence to separate sentences; after the token sequence, including the separator token [SEP], is converted to values such as [2, 153, 10, 56, 145, 498, 3] using the dictionary learned during tokenization, each value is expressed as a vector with the number of dimensions set by the position embedding parameter (768, 384, or 1024),
For example, a person knows what the tokens 'I', 'school', and 'going' mean and what a sentence such as ['I', 'a', 'school', 'to', 'going'] means, but because a computer represents each token as a numeric value, it does not initially know what they mean. Therefore, the linguistic representation of each token in the sentence and the meaning of the sentence are learned through the k Transformer encoders,
The Korean BERT language model learns the meaning of sentences through 6, 12, or 24 Transformer encoder layers. A Korean dialogue-based violent and non-violent situation recognition method using the BERT language model.
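The embedding step described in this claim can be sketched as a table lookup. The token IDs and the 768-dimension setting come from the claim, and the maximum length of 512 comes from the CNN claim; the vocabulary size, the random initial values, and the choice to add the token and position vectors together (the usual BERT convention) are assumptions of this sketch.

```python
import random


def make_table(rows, dim, rng):
    """Create a rows x dim lookup table with small random values."""
    return [[rng.uniform(-0.02, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, token_table, position_table):
    """Add the token embedding and the position embedding for each token."""
    return [
        [t + p for t, p in zip(token_table[token_id], position_table[position])]
        for position, token_id in enumerate(token_ids)
    ]


rng = random.Random(0)
DIM = 768          # position embedding parameter from the claim
MAX_LEN = 512      # maximum token length stated in the CNN claim
VOCAB_SIZE = 500   # hypothetical; only needs to exceed the largest ID used below

token_table = make_table(VOCAB_SIZE, DIM, rng)
position_table = make_table(MAX_LEN, DIM, rng)

token_ids = [2, 153, 10, 56, 145, 498, 3]  # example ID sequence from the claim
embedded = embed(token_ids, token_table, position_table)
print(len(embedded), len(embedded[0]))  # 7 tokens, 768 dimensions each
```

In a trained model both tables are learned parameters; the Transformer encoder layers then refine these vectors into context-dependent representations.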

According to claim 1,
The input of the multilayer perceptron (MLP) fine-tuning model is obtained by transforming the output of the [CLS] token of the last Transformer encoder layer of BERT; when the position embedding parameter is 768, the [CLS] token, which carries the combined meaning of all tokens and has size 1x768, is used,
The Pooler layer trained in the NSP (Next Sentence Prediction) task used in the BERT pre-training step is reused to further learn the meaning of the sentence,
Then, the value of each neural network unit is converted to a real value between -1 and +1 through the tanh activation function,
Finally, a total of five classes (four violent situations plus the non-violent situation) are recognized through a two-layer multilayer perceptron (MLP) model that includes the Pooler layer. A Korean dialogue-based violent and non-violent situation recognition method using the BERT language model.
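A minimal sketch of this classification head, assuming a dense Pooler layer with tanh followed by a dense output layer with softmax over five classes. The weights here are random placeholders rather than pre-trained values, the softmax output is an assumption (the claim only states that five classes are recognized), and the order of the class labels is hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def mlp_head(cls_vector, pooler_w, pooler_b, clf_w, clf_b):
    """Pooler (dense + tanh) followed by a 5-way classifier."""
    pooled = [math.tanh(v) for v in dense(cls_vector, pooler_w, pooler_b)]
    return softmax(dense(pooled, clf_w, clf_b)), pooled


rng = random.Random(0)
HIDDEN = 768
CLASSES = ["intimidation", "extortion or blackmail", "workplace bullying",
           "other bullying", "non-violent"]  # label order is an assumption


def random_matrix(rows, cols):
    return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]


cls_vector = [rng.uniform(-1, 1) for _ in range(HIDDEN)]  # stand-in for the 1x768 [CLS] output
probs, pooled = mlp_head(
    cls_vector,
    random_matrix(HIDDEN, HIDDEN), [0.0] * HIDDEN,
    random_matrix(len(CLASSES), HIDDEN), [0.0] * len(CLASSES),
)
print(CLASSES[probs.index(max(probs))])
```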

According to claim 1,
As the input of the convolutional neural network (CNN) fine-tuning model, the output of all tokens of the last Transformer encoder layer of the Korean BERT language model (size: 512 (maximum token length) x 768 (number of dimensions)) is used,
Since the maximum token length of a sentence is set to 512, ① the CNN input has a size of 512x768; ② convolution layers with three kernel sizes, 32x768, 64x768, and 128x768, ③ produce three feature maps (sizes: 1x481, 1x449, 1x385); ④ max pooling with a 1x3 window yields three pooled feature maps (sizes: 1x160, 1x149, 1x128), which are concatenated and passed through a fully connected layer to recognize five classes (four types of violent situations plus the non-violent situation). A Korean dialogue-based violent and non-violent situation recognition method using the BERT language model.
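The feature-map sizes in this claim can be checked with a few lines of arithmetic. The sketch assumes stride 1 with no padding for the convolution and a non-overlapping window (stride equal to the window size) for max pooling, which is consistent with the stated sizes; it also assumes one feature map per kernel size, as described.

```python
def conv_output_length(input_length: int, kernel_length: int, stride: int = 1) -> int:
    """Length of a 1-D convolution output with no padding."""
    return (input_length - kernel_length) // stride + 1


def pool_output_length(input_length: int, window: int) -> int:
    """Length after non-overlapping max pooling (stride equal to the window)."""
    return input_length // window


MAX_TOKENS = 512
KERNEL_LENGTHS = (32, 64, 128)  # each kernel spans all 768 dimensions
POOL_WINDOW = 3

conv_lengths = [conv_output_length(MAX_TOKENS, k) for k in KERNEL_LENGTHS]
pooled_lengths = [pool_output_length(n, POOL_WINDOW) for n in conv_lengths]
concatenated_length = sum(pooled_lengths)

print(conv_lengths)         # [481, 449, 385]
print(pooled_lengths)       # [160, 149, 128]
print(concatenated_length)  # 437 features fed to the fully connected layer
```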

According to claim 1,
The long short-term memory neural network (LSTM) fine-tuning model is configured such that:
As a unidirectional long short-term memory neural network (LSTM) that also considers the time-series information of each token in a sentence, and in order to use more information than all tokens of only the last layer of the Korean BERT language model, the outputs of BERT's Transformer encoder layers 9 to 12 (size: 4x512x768) are used,
The output of all tokens of each layer (size: 512x768) is used as the input of the LSTM; the LSTM consists of two layers, and each LSTM layer consists of 512 LSTM cells,
After the LSTM outputs for each layer (size: 512x192) are combined, four types of violent situations and the non-violent situation are recognized through a fully connected layer. A Korean dialogue-based violent and non-violent situation recognition method using the BERT language model.
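The data flow in this claim can be sketched with a small pure-Python LSTM. Toy sizes stand in for the real ones (sequence length 6 for 512, feature size 8 for 768, hidden size 2 for 192) so that the example runs quickly. Using a separate two-layer LSTM for each of the four BERT layer outputs, and combining the results by concatenating along the feature axis, are assumptions of this sketch; the claim states only the sizes.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_size, hidden_size, rng):
    """Random weights for the input, forget, output, and candidate gates."""
    def gate():
        rows = [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
        return rows, [0.0] * hidden_size
    return {name: gate() for name in ("i", "f", "o", "g")}


def lstm_scan(sequence, params, hidden_size):
    """Run a unidirectional LSTM over the sequence and return every hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x in sequence:
        xh = list(x) + h  # concatenate the current input with the previous hidden state
        pre = {name: [sum(w * v for w, v in zip(row, xh)) + b
                      for row, b in zip(rows, bias)]
               for name, (rows, bias) in params.items()}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs.append(h)
    return outputs


rng = random.Random(0)
SEQ_LEN, FEATURES, HIDDEN, NUM_BERT_LAYERS = 6, 8, 2, 4  # toy stand-ins for 512, 768, 192, 4

# Random stand-ins for the outputs of BERT encoder layers 9 to 12.
bert_layers = [[[rng.uniform(-1, 1) for _ in range(FEATURES)] for _ in range(SEQ_LEN)]
               for _ in range(NUM_BERT_LAYERS)]

per_layer_outputs = []
for layer_output in bert_layers:
    first = lstm_scan(layer_output, init_lstm(FEATURES, HIDDEN, rng), HIDDEN)
    second = lstm_scan(first, init_lstm(HIDDEN, HIDDEN, rng), HIDDEN)  # second stacked LSTM layer
    per_layer_outputs.append(second)

# Concatenate the four per-layer outputs along the feature axis for each token.
combined = [sum((layer[t] for layer in per_layer_outputs), []) for t in range(SEQ_LEN)]
print(len(combined), len(combined[0]))  # 6 8, mirroring 512 x (4 * 192) = 512 x 768
```

A bidirectional variant would run a second scan over the reversed sequence and join the two sets of hidden states per token.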

According to claim 1,
The bidirectional long short-term memory neural network (Bi-LSTM) fine-tuning model is configured such that:
The output of all tokens of each layer (size: 512x768) is used as the input of the LSTM; the LSTM consists of two layers, and each LSTM layer consists of 512 LSTM cells,
After the LSTM outputs for each layer (size: 512x192) are combined, four types of violent situations and the non-violent situation are recognized through a fully connected layer. A Korean dialogue-based violent and non-violent situation recognition method using the BERT language model.