KR102640315B1 - Deep learning-based voice phishing detection apparatus and method therefor

Info

Publication number: KR102640315B1
Application number: KR1020220141286A
Authority: KR
Inventors: 박동주; 밀란두; 박명규
Original assignee: 숭실대학교산학협력단
Priority date: 2022-10-28
Filing date: 2022-10-28
Publication date: 2024-02-22

Abstract

A deep learning-based voice phishing detection apparatus and a method therefor are disclosed. The deep learning-based voice phishing detection method includes: obtaining a conversion script produced by converting call voice data into text form; preprocessing the conversion script and word-embedding each word to generate a word representation vector for each word; and applying the word representation vectors to an attention-based CNN-BiLSTM model to output a voice phishing classification result. In the attention-based CNN-BiLSTM model, a max pooling layer is placed after a 1D convolution layer, a BiLSTM neural network module is placed after the max pooling layer, an attention layer is placed after the BiLSTM neural network module, a fully connected layer is placed after the attention layer, and the fully connected layer may be connected to a softmax layer.

Description

Deep learning-based voice phishing detection apparatus and method therefor

The present invention relates to a deep learning-based voice phishing detection apparatus and a method therefor.

Voice phishing has recently been harming large numbers of people in Korea and abroad. Public institutions and private organizations such as government offices, police stations, and banks continue working to prevent it, but these crimes keep increasing and cause enormous financial losses every year.

Voice phishing attackers use phone calls to collect victims' sensitive personal information (for example, bank information and card information) and may also induce victims to transfer money to the attacker's account. Nevertheless, research on detecting voice phishing in order to prevent such damage remains insufficient.

Republic of Korea Registered Patent Publication No. 10-1431596 (2014.08.12.)

An object of the present invention is to provide a deep learning-based voice phishing detection apparatus and a method therefor.

Another object of the present invention is to provide a deep learning-based voice phishing detection apparatus and method capable of accurately detecting Korean-language voice phishing.

According to one aspect of the present invention, a deep learning-based voice phishing detection method is provided.

According to an embodiment of the present invention, a voice phishing detection method may be provided that includes: obtaining a conversion script produced by converting call voice data into text form; preprocessing the conversion script and word-embedding each word to generate a word representation vector for each word; and applying the word representation vectors to an attention-based CNN-BiLSTM model to output a voice phishing classification result, wherein, in the attention-based CNN-BiLSTM model, a max pooling layer is placed after a 1D convolution layer, a BiLSTM neural network module is placed after the max pooling layer, an attention layer is placed after the BiLSTM neural network module, a fully connected layer is placed after the attention layer, and the fully connected layer is connected to a softmax layer.

The word representation vectors are applied to the 1D convolution layer, the max pooling layer, and the BiLSTM neural network module to extract context feature vectors. The context feature vectors are applied to the attention layer, which takes the hierarchical structure of the call voice data into account, focuses on the context of each sentence and each word to find the key sentences and words, and outputs a document vector for voice phishing classification. The document vector is applied to the fully connected layer and the softmax layer and classified under either a voice phishing label or a non-voice phishing label, and the voice phishing classification result may be output.

The preprocessing of the conversion script may analyze the conversion script to generate a bag of words or a token list.

The word embedding may use the FastText algorithm, and the attention layer may use a hierarchical attention network (HAN).

According to another aspect of the present invention, a deep learning-based voice phishing detection apparatus is provided.

According to an embodiment of the present invention, a voice phishing detection apparatus may be provided that includes: an input unit that obtains a conversion script produced by converting call voice data into text form; a word embedding module that preprocesses the conversion script and word-embeds each word to generate a word representation vector for each word; and a voice phishing detection unit that applies the word representation vectors to an attention-based CNN-BiLSTM model to output a voice phishing classification result, wherein, in the attention-based CNN-BiLSTM model, a max pooling layer is placed after a 1D convolution layer, a BiLSTM neural network module is placed after the max pooling layer, an attention layer is placed after the BiLSTM neural network module, a fully connected layer is placed after the attention layer, and the fully connected layer is connected to a softmax layer.

The word representation vectors are applied to the 1D convolution layer, the max pooling layer, and the BiLSTM neural network module to extract context feature vectors. The context feature vectors are applied to the attention layer, which takes the hierarchical structure of the call voice data into account, focuses on the context of each sentence and each word to find the key sentences and words, and outputs a document vector for voice phishing classification. The document vector is applied to the fully connected layer and the softmax layer and classified under either a voice phishing label or a non-voice phishing label, and the voice phishing classification result may be output.

By providing the deep learning-based voice phishing detection apparatus and method according to an embodiment of the present invention, the accuracy of Korean voice phishing detection can be improved.

FIG. 1 is a flowchart showing a deep learning-based voice phishing detection method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the architecture of an attention-based CNN-BiLSTM model according to an embodiment of the present invention.
FIGS. 3 to 5 are diagrams comparing Korean voice phishing detection results of the prior art and of an embodiment of the present invention.
FIG. 6 is a block diagram schematically showing the configuration of a voice phishing detection apparatus according to an embodiment of the present invention.

As used herein, singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "consists of" or "includes" should not be construed as necessarily including all of the components or steps described in the specification; some of those components or steps may be absent, or additional components or steps may be included. In addition, terms such as "... unit" and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart showing a deep learning-based voice phishing detection method according to an embodiment of the present invention, FIG. 2 is a diagram illustrating the architecture of an attention-based CNN-BiLSTM model according to an embodiment of the present invention, and FIGS. 3 to 5 are diagrams comparing Korean voice phishing detection results of the prior art and of an embodiment of the present invention.

In step 110, the voice phishing detection apparatus 100 obtains a conversion script produced by converting Korean call voice data into text form.

For example, the conversion script obtained by converting the call voice data into text may include a bag of words (BOW) or a token list for each sentence.

For ease of understanding and explanation, BOW is briefly described here.

A BOW is generated by assigning an index to each word contained in a sentence and recording the frequency of each word.

For example, assume a corpus such as '은행 계좌로 송금해야 한다' ('money must be sent to a bank account'). The word indices for this corpus are assigned as '은행' (bank): 0, '계좌' (account): 1, '는' (particle): 2, '송금' (remittance): 3, '해야' (must): 4, '한다' (do): 5, and the corresponding BOW may be generated as [1,1,1,1,1,1].

Naturally, the BOW or token list may also be generated after stop words and the like are removed during preprocessing.
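The BOW construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `bag_of_words`, the choice to index words by first appearance, and the optional `stop_words` argument are assumptions made for the example, and the token list is the one given in the text.

```python
from collections import Counter

def bag_of_words(tokens, stop_words=frozenset()):
    """Return (word -> index mapping, frequency vector) for one tokenized sentence."""
    kept = [t for t in tokens if t not in stop_words]
    vocab = {}
    for tok in kept:
        vocab.setdefault(tok, len(vocab))   # index assigned at first appearance
    counts = Counter(kept)
    vector = [counts[tok] for tok in vocab]  # dicts keep insertion order
    return vocab, vector

# Token list taken from the example in the text.
tokens = ["은행", "계좌", "는", "송금", "해야", "한다"]
vocab, bow = bag_of_words(tokens)

# Same sentence with the particle treated as a stop word (an assumed stop list).
vocab_filtered, bow_filtered = bag_of_words(tokens, stop_words={"는"})
```

Each word occurs once in this sentence, so every entry of the frequency vector is 1; a repeated word would simply raise its own entry.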

In step 115, the voice phishing detection apparatus 100 word-embeds each word of the conversion script to generate a word representation vector for each word.

For example, the voice phishing detection apparatus 100 may use FastText to convert each word in the word list into a vector.

FastText treats each word as a composition of character-level n-grams and can therefore learn subwords. Because it learns the subwords within each word, FastText has the advantage that similarity to other words can be computed even for unknown (out-of-vocabulary) words. Considering the characteristics of Korean, word embedding may be performed at the syllable level or at the jamo (consonant and vowel letter) level.
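To make the subword idea concrete, the sketch below enumerates the character n-grams of a word in the way FastText is generally described as doing, with `<` and `>` marking the word boundaries. It is only an illustration: `char_ngrams` is a hypothetical helper, the n-gram range is a parameter chosen for the example, and the code works on syllable blocks as they appear in the string (it does not decompose Hangul into jamo).

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List the character n-grams of `word`, using '<' and '>' as boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

english = char_ngrams("where", 3, 3)
korean = char_ngrams("송금", 3, 3)   # syllable-level n-grams
```

Because a word vector is built from the vectors of its n-grams, an unseen word that shares n-grams with known words still receives a usable vector, which is the out-of-vocabulary advantage mentioned above.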

In step 120, the voice phishing detection apparatus 100 applies the word representation vectors to the attention-based CNN-BiLSTM model and outputs a voice phishing classification result.

According to an embodiment of the present invention, the attention-based CNN-BiLSTM model comprises a feature extraction block, a sequence learning block, an attention mechanism block, and a classification block. The architecture of the attention-based CNN-BiLSTM model is shown in FIG. 2. Referring to FIG. 2, in the attention-based CNN-BiLSTM model a max pooling layer is placed after the 1D convolution layer, a BiLSTM neural network module is placed after the max pooling layer, an attention layer is placed after the BiLSTM neural network module, a fully connected layer is placed after the attention layer, and the fully connected layer may be connected to a softmax layer.

The word representation vectors pass through the 1D convolution layer, the max pooling layer, and the BiLSTM neural network module, which output context feature vectors. The context feature vectors are then applied to the attention layer, which takes the hierarchical structure of the call voice data into account, focuses on the context of each sentence and each word to find the key sentences and words, and outputs a document vector for voice phishing classification.
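The first two layers of this flow, the 1D convolution and the max pooling, can be sketched with NumPy as follows. The specification does not state kernel sizes, filter counts, activation, or pooling widths, so every size below, the ReLU activation, and the function names `conv1d_relu` and `max_pool1d` are assumptions chosen only to show how the sequence shape changes.

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution over time followed by ReLU.

    x: (T, d) word vectors; kernels: (k, d, f); bias: (f,). Returns (T - k + 1, f).
    """
    k = kernels.shape[0]
    steps = x.shape[0] - k + 1
    out = np.stack([
        np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1]))
        for t in range(steps)
    ])
    return np.maximum(out + bias, 0.0)

def max_pool1d(x, pool=2):
    """Non-overlapping max pooling over time: (T, f) -> (T // pool, f)."""
    steps = x.shape[0] // pool
    return x[:steps * pool].reshape(steps, pool, -1).max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))            # 20 tokens, 8-dim embeddings (assumed sizes)
kernels = rng.normal(size=(3, 8, 4))    # window of 3 tokens, 4 filters (assumed)
bias = np.zeros(4)

features = conv1d_relu(x, kernels, bias)   # shape (18, 4)
pooled = max_pool1d(features, pool=2)      # shape (9, 4)
```

The pooled sequence is shorter than the input, so the recurrent module that follows processes fewer time steps, each summarizing a local window of words.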

Here, the BiLSTM neural network module may consist of two parallel LSTM layers that operate in opposite directions. One LSTM layer receives the output of the 1D convolution layer and propagates it in the forward direction, and the other LSTM layer propagates it in the backward direction.

The BiLSTM neural network module can thus capture long-term dependencies from both the preceding and the following context. It can therefore offer better performance on text classification tasks and more robust results than a single LSTM layer.
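The two-direction arrangement can be illustrated with a small NumPy LSTM run once over the sequence and once over its reverse. This is a sketch under assumptions: the standard LSTM gate equations are used, the hidden size, weight scale, and helper names (`lstm_pass`, `bilstm`, `random_params`) are invented for the example, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(x, W, U, b):
    """Run one LSTM over x (T, d) and return every hidden state, shape (T, h).

    W: (d, 4h), U: (h, 4h), b: (4h,); gates ordered input, forget, cell, output.
    """
    h_dim = U.shape[0]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    states = []
    for x_t in x:
        z = x_t @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)

def bilstm(x, fwd_params, bwd_params):
    """Concatenate a forward pass and a time-reversed pass at every step."""
    forward = lstm_pass(x, *fwd_params)
    backward = lstm_pass(x[::-1], *bwd_params)[::-1]   # re-align to original order
    return np.concatenate([forward, backward], axis=1)

def random_params(rng, d, h):
    return (rng.normal(size=(d, 4 * h)) * 0.5,
            rng.normal(size=(h, 4 * h)) * 0.5,
            np.zeros(4 * h))

rng = np.random.default_rng(0)
d, h, T = 4, 5, 9                        # assumed sizes
x = rng.normal(size=(T, d))
fwd, bwd = random_params(rng, d, h), random_params(rng, d, h)
context = bilstm(x, fwd, bwd)            # shape (T, 2h)
```

At the first time step the forward half has seen only the first input, while the backward half has already seen the whole sequence, which is how each position receives both earlier and later context.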

Because the BiLSTM neural network module cannot capture the dependencies on all preceding words, an attention layer may be used to compensate. Each word contributes differently in voice phishing scripts and in non-voice phishing scripts. The attention layer reflects this when extracting context features and can concentrate on the important ones.

The attention layer receives the output of the BiLSTM neural network module as input and uses an attention mechanism to assign different attention weights to the semantic encoding of the context vectors. Various attention mechanisms exist, and any known attention mechanism may be applied.

In an embodiment of the present invention, a hierarchical attention network (HAN)-based attention layer may be used. A HAN-based attention layer takes the hierarchical structure of a document into account and focuses on the context of sentences and words to find the most important words and sentences in the document.

The attention layer searches for the most important words and sentences in view of the hierarchical structure of the document and, based on them, may output the document vector used to perform voice phishing classification.
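One level of such an attention layer can be sketched as a weighted sum of the context features, following the form commonly used in hierarchical attention networks: a tanh projection of each feature, a score against a learned context vector, a softmax over time, and a weighted sum. The sketch assumes this form and shows a single level only; a hierarchical network would apply the same step first over the words of each sentence and then over the resulting sentence vectors. The name `attention_pool`, the sizes, and the random weights are illustrative assumptions.

```python
import numpy as np

def attention_pool(H, W, b, u):
    """Collapse context features H (T, h) into one vector.

    W: (h, a), b: (a,), u: (a,). Returns (pooled vector (h,), weights (T,)).
    """
    scores = np.tanh(H @ W + b) @ u        # one relevance score per time step
    scores = scores - scores.max()         # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ H, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(9, 10))               # e.g. BiLSTM output, assumed sizes
W = rng.normal(size=(10, 6))
b = np.zeros(6)
u = rng.normal(size=6)

doc_vector, weights = attention_pool(H, W, b, u)
```

The weights are non-negative and sum to 1, so the time steps with the largest weights are the ones the classifier is effectively attending to.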

The document vector is applied to the fully connected layer and the softmax layer, classified under either a voice phishing label or a non-voice phishing label, and may be output as the voice phishing classification result.

The fully connected layer receives the document vector output by the attention layer and may apply a non-linear transformation to it. A softmax activation function is applied to the final fully connected layer (or dense layer) to output a probability distribution over the Korean text labels, and dropout may be used in each layer to regularize training and prevent overfitting.
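The classification block can be sketched as a dense layer with a non-linearity, dropout during training, and a final softmax over two labels. The layer width, the ReLU activation, the dropout rate, the label order in `LABELS`, and the function names are all assumptions for illustration, and the weights are random rather than trained, so the predicted label here carries no meaning.

```python
import numpy as np

LABELS = ("non-voice phishing", "voice phishing")   # assumed label order

def dropout(x, rate, rng, training):
    """Inverted dropout: active only during training."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def classify(doc_vec, W1, b1, W2, b2, rng=None, rate=0.5, training=False):
    """Dense layer with ReLU, optional dropout, then softmax over the labels."""
    hidden = np.maximum(doc_vec @ W1 + b1, 0.0)
    hidden = dropout(hidden, rate, rng, training)
    logits = hidden @ W2 + b2
    logits = logits - logits.max()          # shift for numerical stability
    return np.exp(logits) / np.exp(logits).sum()

rng = np.random.default_rng(0)
doc_vec = rng.normal(size=10)               # document vector, assumed size
W1, b1 = rng.normal(size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 2)), np.zeros(2)

probs = classify(doc_vec, W1, b1, W2, b2)
label = LABELS[int(np.argmax(probs))]
train_probs = classify(doc_vec, W1, b1, W2, b2, rng=rng, training=True)
```

Dropout is skipped at inference time, so the same document vector always yields the same probabilities once training has finished.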

A comparison of the voice phishing classification results of the prior art and of an embodiment of the present invention is shown in FIG. 3.

As shown in FIG. 3, the attention-based CNN-BiLSTM model according to an embodiment of the present invention achieves an accuracy of 99.32% and outperforms the conventional methods.

The validation accuracy and validation loss over 10 epochs are shown in FIGS. 4 and 5. As FIGS. 4 and 5 show, the attention-based CNN-BiLSTM model according to an embodiment of the present invention detects Korean voice phishing with higher accuracy than the prior art.

FIG. 6 is a block diagram schematically showing the configuration of a voice phishing detection apparatus according to an embodiment of the present invention.

Referring to FIG. 6, the voice phishing detection apparatus 100 according to an embodiment of the present invention includes an input unit 610, a word embedding module 620, a voice phishing detection unit 630, a memory 640, and a processor 650.

The input unit 610 is a means for receiving a conversion script produced by converting call voice data into text form.

The word embedding module 620 is a means for preprocessing the conversion script and word-embedding each word to generate a word representation vector for each word.

The voice phishing detection unit 630 holds the attention-based CNN-BiLSTM model and is a means for applying the word representation vectors to the model to output a voice phishing classification result.

As shown in FIG. 2, in the attention-based CNN-BiLSTM model a max pooling layer is placed after the 1D convolution layer, a BiLSTM neural network module is placed after the max pooling layer, an attention layer is placed after the BiLSTM neural network module, a fully connected layer is placed after the attention layer, and the fully connected layer may be connected to a softmax layer.

Accordingly, the voice phishing detection unit 630 may extract context feature vectors by applying the word representation vectors to the 1D convolution layer, the max pooling layer, and the BiLSTM neural network module. The voice phishing detection unit 630 may then apply the context feature vectors to the attention layer, which takes the hierarchical structure of the call voice data into account, focuses on the context of each sentence and each word to find the key sentences and words, and outputs a document vector for voice phishing classification. Through the fully connected layer and the softmax layer, the document vector is classified under either a voice phishing label or a non-voice phishing label, and the voice phishing classification result may be output.

The memory 640 stores the instructions necessary to perform the voice phishing detection method using the attention-based CNN-BiLSTM model according to an embodiment of the present invention.

The processor 650 is a means for controlling the internal components of the voice phishing detection apparatus 100 according to an embodiment of the present invention (for example, the input unit 610, the word embedding module 620, the voice phishing detection unit 630, and the memory 640).
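The way these components hand data to one another can be sketched as a small Python class that chains three callables. This only illustrates the data flow between the units; the class name `VoicePhishingDetector`, its field names, and the stand-in functions are hypothetical, and the stand-in classifier is a placeholder rule, not the attention-based CNN-BiLSTM model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class VoicePhishingDetector:
    tokenize: Callable[[str], List[str]]      # preprocessing in the word embedding module
    embed: Callable[[List[str]], Sequence]    # word embedding step
    classify: Callable[[Sequence], str]       # voice phishing detection unit

    def detect(self, conversion_script: str) -> str:
        """Run script -> tokens -> word vectors -> classification label."""
        tokens = self.tokenize(conversion_script)
        vectors = self.embed(tokens)
        return self.classify(vectors)

# Placeholder components, used only to show that the pieces connect.
detector = VoicePhishingDetector(
    tokenize=str.split,
    embed=lambda toks: [[float(len(t))] for t in toks],
    classify=lambda vecs: "voice phishing" if len(vecs) > 3 else "non-voice phishing",
)
result = detector.detect("은행 계좌 로 송금 해야 한다")
```

Keeping the three stages behind separate callables mirrors the separation of units in the block diagram and lets any one of them be replaced, for example by a trained model, without touching the others.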

The apparatus and method according to embodiments of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The present invention has been described above with a focus on its embodiments. A person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered from an illustrative rather than a restrictive point of view. The scope of the present invention is set out in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: Voice phishing detection apparatus
610: Input unit
620: Word embedding module
630: Voice phishing detection unit
640: Memory
650: Processor

Claims

A voice phishing detection method comprising:
obtaining a conversion script produced by converting call voice data into text form;
preprocessing the conversion script and word-embedding each word to generate a word representation vector for each word, wherein the preprocessing of the conversion script analyzes the conversion script to generate a bag of words or a token list; and
applying the word representation vectors to an attention-based CNN-BiLSTM model to output a voice phishing classification result,
wherein, in the attention-based CNN-BiLSTM model,
a max pooling layer is placed after a 1D convolution layer, a BiLSTM neural network module is placed after the max pooling layer, an attention layer is placed after the BiLSTM neural network module, a fully connected layer is placed after the attention layer, and the fully connected layer is connected to a softmax layer,
wherein the word representation vectors are applied to the 1D convolution layer, the max pooling layer, and the BiLSTM neural network module to extract context feature vectors,
wherein the context feature vectors are applied to the attention layer, which takes the hierarchical structure of the call voice data into account, focuses on the context of each sentence and each word to find the key sentences and words, and outputs a document vector for voice phishing classification,
wherein the document vector is applied to the fully connected layer and the softmax layer and classified under either a voice phishing label or a non-voice phishing label to output the voice phishing classification result, and
wherein the attention layer uses a hierarchical attention network.

(Deleted)

The voice phishing detection method according to claim 1,
wherein the word embedding uses the FastText algorithm.

A computer-readable recording medium recording program code for performing the method according to claim 1.

A voice phishing detection apparatus comprising:
an input unit that obtains a conversion script produced by converting call voice data into text form, wherein the call voice data is Korean call voice data;
a word embedding module that preprocesses the conversion script and word-embeds each word to generate a word representation vector for each word, wherein the preprocessing of the conversion script analyzes the conversion script to generate a bag of words or a token list; and
a voice phishing detection unit that applies the word representation vectors to an attention-based CNN-BiLSTM model and outputs a voice phishing classification result,
wherein, in the attention-based CNN-BiLSTM model,
a max pooling layer is placed after a 1D convolution layer, a BiLSTM neural network module is placed after the max pooling layer, an attention layer is placed after the BiLSTM neural network module, a fully connected layer is placed after the attention layer, and the fully connected layer is connected to a softmax layer,
wherein the word representation vectors are applied to the 1D convolution layer, the max pooling layer, and the BiLSTM neural network module to extract context feature vectors,
wherein the context feature vectors are applied to the attention layer, which takes the hierarchical structure of the call voice data into account, focuses on the context of each sentence and each word to find the key sentences and words, and outputs a document vector for voice phishing classification,
wherein the document vector is applied to the fully connected layer and the softmax layer and classified under either a voice phishing label or a non-voice phishing label to output the voice phishing classification result, and
wherein the attention layer uses a hierarchical attention network.

(Deleted)