KR102051350B1

KR102051350B1 - Method and system for data acquisition for analyzing transaction of cryptocurrency

Info

Publication number: KR102051350B1
Application number: KR1020190110111A
Authority: KR
Inventors: 서상덕; 신승원; 윤창훈; 이승현
Original assignee: (주)에스투더블유랩
Priority date: 2019-09-05
Filing date: 2019-09-05
Publication date: 2019-12-03
Also published as: JP2022548501A; US20220358493A1; JP7372707B2; CN114730387A; WO2021045332A1

Abstract

The present invention relates to a method and an apparatus for acquiring data for generating a machine learning model for detecting fraudulent accounts for cryptocurrency. The method for acquiring data for generating a machine learning model for detecting fraudulent accounts for cryptocurrency comprises the steps of: receiving a report relating to a fraudulent address from a first database that stores information about the reported fraudulent address; acquiring a first fraudulent address and first description associated with the first fraudulent address from the report; extracting a plurality of first key words associated with the first fraudulent address from the first description by using natural language processing; and storing the first fraudulent address in a second database.

Description

METHOD AND SYSTEM FOR DATA ACQUISITION FOR ANALYZING TRANSACTION OF CRYPTOCURRENCY

The present disclosure relates to a method and apparatus for acquiring training data used to generate a machine learning model that detects fraudulent cryptocurrency accounts.

A cryptocurrency is a digital asset designed to function as a medium of exchange: electronic information that is encrypted and issued in a distributed manner using blockchain technology and that can be used as money within a given network. A cryptocurrency is not issued by a central bank. It is electronic information whose monetary value is represented digitally on the basis of blockchain technology, and it is stored, operated, and managed in a distributed peer-to-peer (P2P) manner over the Internet. The core technique for issuing and managing cryptocurrencies is the blockchain. A blockchain is a continuously growing list of records (blocks) that are linked and secured using cryptography. Each block typically contains a cryptographic hash of the previous block, a timestamp, and transaction data. A blockchain is resistant to modification of its data by design, and it is an open, distributed ledger that can record transactions between two parties verifiably and permanently. Cryptocurrencies therefore enable transparent operation based on tamper resistance.

In addition, unlike conventional currency, cryptocurrency is anonymous, so third parties other than the sender and the recipient cannot learn the details of a transaction. Because accounts are anonymous, the flow of transactions is difficult to trace (non-trackable): all records, such as remittance and receipt records, are public, but the parties to the transactions are not known.

Because of the freedom and transparency described above, cryptocurrency is regarded as an alternative that could replace existing key currencies, and its lower fees and simpler remittance procedures compared with existing currencies are expected to make it effective for uses such as international transactions. Because of its anonymity, however, cryptocurrency is also abused as a means of crime, for example in fraudulent transactions.

Meanwhile, the volume of cryptocurrency transaction data is so large that it is difficult to identify the characteristics of fraudulent transactions manually and thereby determine who is committing fraud. Machine learning can help here, because it can automatically learn the relationships within large volumes of data.

Accordingly, there is a need for a method that uses machine learning to identify transaction parties who use cryptocurrency as a means of crime.

A method according to the present disclosure for acquiring training data for generating a machine learning model for detecting fraudulent cryptocurrency accounts comprises: receiving a report related to a fraudulent address from a first database that stores information on reported fraudulent addresses; obtaining, from the report, a first fraudulent address and a first description related to the first fraudulent address; extracting, from the first description, a plurality of first key words related to the first fraudulent address by using natural language processing (NLP); and storing the first fraudulent address in a second database.
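The four steps above can be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the report layout (a dict with `address` and `description` keys), the stop-word list, the frequency-based key word extraction, and the in-memory dict standing in for the second database are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller NLP toolkit.
STOP_WORDS = {"a", "an", "the", "to", "and", "of", "my", "i", "me",
              "this", "it", "was", "is", "in", "for", "from"}

def extract_key_words(description, top_n=5):
    """Return the most frequent non-stop-word tokens of a description."""
    tokens = re.findall(r"[a-z]+", description.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

def ingest_report(report, second_db):
    """Take one report from the first database and store its address as fraudulent."""
    address = report["address"]          # first fraudulent address
    description = report["description"]  # first description
    key_words = extract_key_words(description)
    second_db[address] = {"label": "fraud", "key_words": key_words}
    return key_words
```

Keeping the key words next to the stored address is a design choice of this sketch; it makes them available to later steps that build on the first key words.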

The method of acquiring training data according to the present disclosure comprises: receiving text information from a publicly accessible website; extracting, from the text information, main text information that includes a cryptocurrency address; extracting a plurality of second key words from the main text information by using natural language processing; obtaining a fraud information detection model; determining whether the cryptocurrency address included in the main text is a fraudulent address by applying the plurality of second key words to the fraud information detection model; when the cryptocurrency address is a fraudulent address, obtaining the cryptocurrency address as a second fraudulent address; and storing the second fraudulent address in the second database.
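A minimal sketch of this crawl-and-classify flow is given below. It makes several assumptions that are not in the source: only Bitcoin-style addresses are recognized, and only by their textual pattern (no checksum validation); "main text" is taken to be any paragraph that mentions an address; and the fraud information detection model is represented by a hypothetical callable `is_fraud` that receives a set of key words.

```python
import re

# Pattern-only match for legacy (Base58) and bech32 Bitcoin addresses.
# Assumption: format check only, no checksum validation.
BTC_ADDRESS = re.compile(
    r"\b(?:[13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[ac-hj-np-z02-9]{11,71})\b"
)

def extract_main_text(page_text):
    """Return (address, paragraph) pairs for paragraphs that mention an address."""
    pairs = []
    for paragraph in re.split(r"\n\s*\n", page_text):
        for address in BTC_ADDRESS.findall(paragraph):
            pairs.append((address, paragraph.strip()))
    return pairs

def key_words_of(text):
    """Very rough key word extraction: lower-case words of three or more letters."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))

def collect_second_fraud_addresses(page_text, is_fraud, second_db):
    """Store every address whose surrounding text the model judges fraudulent."""
    for address, paragraph in extract_main_text(page_text):
        if is_fraud(key_words_of(paragraph)):
            second_db[address] = "fraud"
    return second_db
```

For instance, calling `collect_second_fraud_addresses(page, lambda words: "ransom" in words, {})` with a toy rule in place of a trained model stores only the addresses found in paragraphs that mention a ransom.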

In the method of acquiring training data according to the present disclosure, obtaining the fraud information detection model comprises: obtaining words related to a benign cryptocurrency address from a website determined to contain benign cryptocurrency addresses; obtaining a first frequency with which each of the words related to the benign cryptocurrency address appears on the website; obtaining a second frequency with which each of the first key words appears in the first description; and obtaining the fraud information detection model by performing machine learning on the words related to the benign cryptocurrency address, labeled as benign, the first frequency, the second frequency, and the plurality of first key words, labeled as fraudulent.
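The passage above does not name a learning algorithm. Purely as an illustration, the sketch below substitutes naive-Bayes-style log-odds weights with add-one smoothing (and no class prior), computed from the two word-frequency tables. The function name, the whitespace tokenization, and the decision threshold of zero are assumptions for the example.

```python
import math
from collections import Counter

def train_fraud_detector(benign_texts, fraud_descriptions):
    """Build a word-based classifier from benign and fraud word frequencies."""
    # First frequency: how often each word appears in benign website text.
    benign_freq = Counter(w for t in benign_texts for w in t.lower().split())
    # Second frequency: how often each word appears in fraud report descriptions.
    fraud_freq = Counter(w for t in fraud_descriptions for w in t.lower().split())

    vocab = set(benign_freq) | set(fraud_freq)
    benign_total = sum(benign_freq.values()) + len(vocab)  # add-one smoothing
    fraud_total = sum(fraud_freq.values()) + len(vocab)

    # Positive weight: the word is relatively more frequent in fraud text.
    weights = {
        w: math.log((fraud_freq[w] + 1) / fraud_total)
           - math.log((benign_freq[w] + 1) / benign_total)
        for w in vocab
    }

    def is_fraud(words):
        # Words never seen during training contribute nothing.
        return sum(weights.get(w, 0.0) for w in words) > 0

    return is_fraud
```

The returned callable takes an iterable of key words, so it can stand in wherever a fraud information detection model is applied to extracted key words.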

The method of acquiring training data according to the present disclosure comprises: obtaining a second description from a service that provides tags corresponding to cryptocurrency addresses; obtaining a fraud key word set based on the plurality of first key words; when a word included in the fraud key word set appears in the second description, determining the cryptocurrency address corresponding to the second description to be a third fraudulent address; and storing the third fraudulent address in the second database.

In the method of acquiring training data according to the present disclosure, obtaining the fraud key word set comprises: obtaining, for each of the plurality of first key words, the frequency with which it appears in the first description; and determining a predetermined number of the most frequent words among the plurality of first key words to be the fraud key word set.
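As a rough sketch of the two steps just described, the example below ranks the first key words by how often they occur in the first descriptions, keeps the top `size` of them, and then flags tagged addresses whose second description contains one of those words. The function names, the tokenization, the alphabetical tie-break, and the `address -> description` dict are assumptions for illustration.

```python
import re
from collections import Counter

def build_fraud_key_word_set(first_key_words, first_descriptions, size):
    """Keep the `size` key words that occur most often in the first descriptions."""
    occurrences = Counter(
        w for d in first_descriptions for w in re.findall(r"[a-z]+", d.lower())
    )
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(set(first_key_words), key=lambda w: (-occurrences[w], w))
    return set(ranked[:size])

def third_fraud_addresses(tagged, fraud_words):
    """Return addresses whose second description contains a fraud key word."""
    return [
        address
        for address, description in tagged.items()
        if fraud_words & set(re.findall(r"[a-z]+", description.lower()))
    ]
```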

The method of acquiring training data according to the present disclosure further comprises: obtaining score information indicating the reliability of an address from the service that provides tags corresponding to cryptocurrency addresses; when the score information indicates benign and the second description does not contain any word included in the fraud key word set, determining the cryptocurrency address to be a benign address; when the score information indicates scam and the second description contains a word included in the fraud key word set, determining the cryptocurrency address to be a third fraudulent address; and storing the benign address and the third fraudulent address in the second database.
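The labeling rule above combines two signals, and can be sketched as a small decision function. The string values `"benign"` and `"scam"` for the score, and the choice to return `None` (leave the address unlabeled) when the two signals disagree, are assumptions of this sketch; the passage above does not say how conflicting cases are handled.

```python
import re

def label_tagged_address(score, second_description, fraud_words):
    """Label an address from its score information and its second description."""
    words = set(re.findall(r"[a-z]+", second_description.lower()))
    has_fraud_word = bool(fraud_words & words)

    if score == "benign" and not has_fraud_word:
        return "benign"   # stored as a benign address
    if score == "scam" and has_fraud_word:
        return "fraud"    # stored as a third fraudulent address
    return None           # assumption: conflicting signals are left unlabeled
```

Requiring both signals to agree trades coverage for label quality, which matters when the labels are later used as training data.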

An apparatus according to the present disclosure for acquiring training data for generating a machine learning model for detecting fraudulent cryptocurrency accounts includes a processor and a memory, wherein the processor, in accordance with instructions stored in the memory, performs: receiving a report related to a fraudulent address from a first database that stores information on reported fraudulent addresses; obtaining, from the report, a first fraudulent address and a first description related to the first fraudulent address; extracting, from the first description, a plurality of first key words related to the first fraudulent address by using natural language processing (NLP); and storing the first fraudulent address in a second database.

The processor of the apparatus for acquiring training data according to the present disclosure, in accordance with instructions stored in the memory, performs: receiving text information from a publicly accessible website; extracting, from the text information, main text information that includes a cryptocurrency address; extracting a plurality of second key words from the main text information by using natural language processing; obtaining a fraud information detection model; determining whether the cryptocurrency address included in the main text is a fraudulent address by applying the plurality of second key words to the fraud information detection model; when the cryptocurrency address is a fraudulent address, obtaining the cryptocurrency address as a second fraudulent address; and storing the second fraudulent address in the second database.

The processor of the apparatus for acquiring training data according to the present disclosure, in accordance with instructions stored in the memory, performs: obtaining words related to a benign cryptocurrency address from a website determined to contain benign cryptocurrency addresses; obtaining a first frequency with which each of the words related to the benign cryptocurrency address appears on the website; obtaining a second frequency with which each of the first key words appears in the first description; and obtaining the fraud information detection model by performing machine learning on the words related to the benign cryptocurrency address, labeled as benign, the first frequency, the second frequency, and the plurality of first key words, labeled as fraudulent.

The processor of the apparatus for acquiring training data according to the present disclosure, in accordance with instructions stored in the memory, performs: obtaining a second description from a service that provides tags corresponding to cryptocurrency addresses; obtaining a fraud key word set based on the plurality of first key words; when a word included in the fraud key word set appears in the second description, determining the cryptocurrency address corresponding to the second description to be a third fraudulent address; and storing the third fraudulent address in the second database.

The processor of the apparatus for acquiring training data according to the present disclosure, in accordance with instructions stored in the memory, performs: obtaining, for each of the plurality of first key words, the frequency with which it appears in the first description; and determining a predetermined number of the most frequent words among the plurality of first key words to be the fraud key word set.

The processor of the apparatus for acquiring training data according to the present disclosure, in accordance with instructions stored in the memory, further performs: obtaining score information indicating the reliability of an address from the service that provides tags corresponding to cryptocurrency addresses; when the score information indicates benign and the second description does not contain any word included in the fraud key word set, determining the cryptocurrency address to be a benign address; when the score information indicates scam and the second description contains a word included in the fraud key word set, determining the cryptocurrency address to be a third fraudulent address; and storing the benign address and the third fraudulent address in the second database.

In addition, a program for implementing the method of acquiring training data described above may be recorded on a computer-readable recording medium.

FIG. 1 is a block diagram of a training data acquisition apparatus according to an embodiment of the present disclosure.
FIG. 2 is a diagram illustrating a training data acquisition apparatus according to an embodiment of the present disclosure.
FIG. 3 is a flowchart illustrating the operation of a training data acquisition apparatus according to an embodiment of the present disclosure.
FIG. 4 is a diagram illustrating the operation of a training data acquisition apparatus according to an embodiment of the present disclosure.
FIG. 5 is a flowchart illustrating the operation of a training data acquisition apparatus according to an embodiment of the present disclosure.
FIG. 6 is a diagram illustrating the operation of a training data acquisition apparatus according to an embodiment of the present disclosure.
FIG. 7 is a flowchart illustrating a method of obtaining a fraud information detection model according to an embodiment of the present disclosure.
FIG. 8 is a flowchart illustrating the operation of a training data acquisition apparatus according to an embodiment of the present disclosure.
FIG. 9 is a flowchart illustrating the operation of a training data acquisition apparatus according to an embodiment of the present disclosure.
FIG. 10 is a diagram illustrating the operation of a training data acquisition apparatus according to an embodiment of the present disclosure.
FIG. 11 is a diagram illustrating a configuration for deriving a machine learning model according to an embodiment of the present disclosure.

The advantages and features of the disclosed embodiments, and the methods of achieving them, will become clear from the embodiments described below together with the accompanying drawings. The present disclosure is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the present disclosure complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present disclosure belongs.

The terms used in this specification are briefly explained below, and the disclosed embodiments are then described in detail.

The terms used in this specification were selected, as far as possible, from general terms currently in wide use, in consideration of their functions in the present disclosure. Their meaning may nevertheless vary with the intent of practitioners in the relevant field, with precedent, or with the emergence of new technologies. In certain cases a term was chosen arbitrarily by the applicant, and its meaning is then described in detail in the corresponding part of the description. The terms used in the present disclosure should therefore be defined on the basis of their meaning and the overall content of the present disclosure, not simply by their names.

In this specification, a singular expression includes the plural unless the context clearly specifies the singular. Likewise, a plural expression includes the singular unless the context clearly specifies the plural.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

The term "unit" as used in the specification means a software or hardware component, and a "unit" performs certain roles. A "unit" is not, however, limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium and may be configured to execute on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functionality provided within components and "units" may be combined into a smaller number of components and "units" or further separated into additional components and "units".

According to an embodiment of the present disclosure, a "unit" may be implemented with a processor and a memory. The term "processor" should be interpreted broadly to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, and the like. In some circumstances, "processor" may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), and the like. The term "processor" may also refer to a combination of processing devices, for example a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, a combination of one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The term "memory" should be interpreted broadly to include any electronic component capable of storing electronic information. The term may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, registers, and the like. A memory is said to be in electronic communication with a processor if the processor can read information from the memory and/or write information to it. Memory integrated in a processor is in electronic communication with that processor.

Embodiments are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure belongs can readily practice them. Parts of the drawings that are not relevant to the description are omitted in order to describe the present disclosure clearly.

FIG. 1 is a block diagram of a training data acquisition apparatus 100 according to an embodiment of the present disclosure.

Referring to FIG. 1, the training data acquisition apparatus 100 according to an embodiment may include at least one of a data learner 110 or a data recognizer 120. The training data acquisition apparatus 100 described above may include a processor and a memory.

The data learner 110 may train a machine learning model for performing a target task by using a data set. The data learner 110 may receive the data set and label information related to the target task. The data learner 110 may obtain the machine learning model by performing machine learning on the relationship between the data set and the label information. The machine learning model obtained by the data learner 110 may be a model for generating label information from a data set.

The data recognizer 120 may receive and store the machine learning model from the data learner 110. The data recognizer 120 may output label information by applying the machine learning model to input data. The data recognizer 120 may also use the input data, the label information, and the results output by the machine learning model to update the machine learning model.

At least one of the data learner 110 and the data recognizer 120 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of the data learner 110 and the data recognizer 120 may be manufactured as a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU) and mounted on the various electronic devices described above.

The data learner 110 and the data recognizer 120 may also be mounted on separate electronic devices. For example, one of the data learner 110 and the data recognizer 120 may be included in an electronic device and the other in a server. The data learner 110 and the data recognizer 120 may also communicate over a wired or wireless connection, so that the machine learning model information built by the data learner 110 is provided to the data recognizer 120, and data input to the data recognizer 120 is provided to the data learner 110 as additional training data.

Meanwhile, at least one of the data learner 110 and the data recognizer 120 may be implemented as a software module. When at least one of the data learner 110 and the data recognizer 120 is implemented as a software module (or a program module including instructions), the software module may be stored in a memory or in a non-transitory computer-readable recording medium. In this case, the at least one software module may be provided by an operating system (OS) or by a given application. Alternatively, part of the at least one software module may be provided by the OS and the rest by a given application.

The data learner 110 according to an embodiment of the present disclosure may include a data acquirer 111, a preprocessor 112, a training data selector 113, a model learner 114, and a model evaluator 115.

The data acquirer 111 may acquire the data required for machine learning. Because training requires a large amount of data, the data acquirer 111 may receive a data set that includes a plurality of data items.

Label information may be assigned to each of the plurality of data items. The label information may be information that describes each data item, and it may be the information that the target task is intended to derive. The label information may be obtained from user input, from a memory, or from the output of a machine learning model. For example, if the target task is to determine from the transaction history of a cryptocurrency address whether the address is owned by a fraudster, the data used for machine learning would be data related to the transaction history of cryptocurrency addresses, and the label information would be whether each address is owned by a fraudster.
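To make the pairing of data and label information concrete, the sketch below joins per-address transaction histories with fraud labels into labeled examples. The `LabeledExample` type, the `(counterparty, amount)` transaction layout, and the two input dicts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class LabeledExample:
    address: str
    transactions: List[Tuple[str, float]]  # (counterparty, amount); hypothetical layout
    is_fraud: bool                          # the label information

def build_data_set(histories: Dict[str, list], labels: Dict[str, bool]):
    """Pair each transaction history with its label; skip unlabeled addresses."""
    return [
        LabeledExample(address, histories[address], labels[address])
        for address in sorted(histories)
        if address in labels
    ]
```

Addresses without a label are skipped rather than assigned a default, since a guessed label would contaminate the training data.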

The preprocessor 112 may preprocess the acquired data so that it can be used for machine learning. The preprocessor 112 may process the acquired data set into a preset format so that the model learner 114, described later, can use it.

The training data selector 113 may select, from the preprocessed data, the data needed for training. The selected data may be provided to the model learner 114. The training data selector 113 may select the data according to a preset criterion, or according to a criterion established through training by the model learner 114, described later.

The model learner 114 may learn a criterion for which label information to output based on the data set. The model learner 114 may perform machine learning using the data set and the label information for the data set as training data. The model learner 114 may also perform machine learning by additionally using a previously acquired machine learning model. In this case, the previously acquired machine learning model may be a pre-built model, for example a model built in advance from basic training data.

The machine learning model may be built in consideration of the field in which the model is applied, the purpose of the training, the computing performance of the device, and the like. The machine learning model may be, for example, a model based on a neural network. For example, models such as a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory (LSTM) model, a Bidirectional Recurrent Deep Neural Network (BRDNN), or a Convolutional Neural Network (CNN) may be used as the machine learning model, but the model is not limited to these.

According to various embodiments, when a plurality of pre-built machine learning models exist, the model learner 114 may select, as the model to be trained, a machine learning model whose basic training data is highly relevant to the input training data. In this case, the basic training data may be classified in advance by data type, and a machine learning model may be built in advance for each data type. For example, the basic training data may be classified in advance by various criteria such as the place where the training data was generated, the time at which it was generated, its size, its creator, and the types of objects in it.

In addition, the model learner 114 may train the machine learning model using a learning algorithm that includes, for example, error back-propagation or gradient descent.
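To make the gradient descent mentioned above concrete, here is a minimal sketch that fits a single weight `w` in the toy model `y = w * x` by repeatedly stepping against the gradient of the mean squared error. The function name `fit_slope`, the learning rate, and the toy data are assumptions for illustration only; the disclosure does not tie the model learner 114 to this model or loss.

```python
def fit_slope(xs, ys, learning_rate=0.05, steps=200):
    """Fit w in y = w * x by plain gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is mean(2 * x * (w*x - y)).
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad  # step against the gradient
    return w


# Toy data generated from y = 2 * x, so the fitted weight should approach 2.
w = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Error back-propagation applies the same update rule to every weight of a multi-layer network, with the gradients computed layer by layer through the chain rule.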

In addition, the model learner 114 may train the machine learning model through, for example, supervised learning that uses the training data as input values. The model learner 114 may also obtain the machine learning model through, for example, unsupervised learning, in which it discovers a criterion for the target task by learning on its own, without supervision, the kinds of data needed for the target task. The model learner 114 may also train the machine learning model through, for example, reinforcement learning, which uses feedback on whether the result of the target task produced by the training is correct.

Once the machine learning model has been trained, the model learner 114 may store the trained model. In this case, the model learner 114 may store the trained machine learning model in a memory of the electronic device that includes the data recognizer 120. Alternatively, the model learner 114 may store the trained machine learning model in a memory of a server connected to the electronic device via a wired or wireless network.

The memory in which the trained machine learning model is stored may also store, for example, instructions or data related to at least one other component of the electronic device. The memory may also store software and/or programs. The programs may include, for example, a kernel, middleware, an application programming interface (API), and/or an application program (or "application").

The model evaluator 115 may input evaluation data into the machine learning model and, if the results output for the evaluation data do not satisfy a predetermined criterion, cause the model learner 114 to train again. In this case, the evaluation data may be preset data for evaluating the machine learning model.

For example, the model evaluator 115 may determine that the predetermined criterion is not satisfied when, among the results of the trained machine learning model for the evaluation data, the number or ratio of evaluation data items with inaccurate recognition results exceeds a preset threshold. For example, if the predetermined criterion is defined as a ratio of 2%, and the trained machine learning model outputs incorrect recognition results for more than 20 out of a total of 1,000 evaluation data items, the model evaluator 115 may evaluate the trained model as unsuitable.
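The threshold check in the 2% example can be sketched as follows. The function name `fails_criterion` and its parameters are hypothetical. The comparison is done in integers so that exactly 20 errors out of 1,000 does not exceed the 2% threshold, while 21 does.

```python
def fails_criterion(predictions, expected, max_error_percent=2):
    """Return True when the share of wrong results exceeds the allowed percentage."""
    if len(predictions) != len(expected) or not expected:
        raise ValueError("predictions and expected must be non-empty and equal in length")
    errors = sum(1 for p, e in zip(predictions, expected) if p != e)
    # Integer form of errors / total > max_error_percent / 100.
    return errors * 100 > max_error_percent * len(expected)


# 1,000 evaluation items whose correct label is True.
expected = [True] * 1000
exactly_20_wrong = [False] * 20 + [True] * 980
more_than_20_wrong = [False] * 21 + [True] * 979

retrain_a = fails_criterion(exactly_20_wrong, expected)    # 20 errors: within the criterion
retrain_b = fails_criterion(more_than_20_wrong, expected)  # 21 errors: model judged unsuitable
```

When the function returns True, the model evaluator 115 would have the model learner 114 train again.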

Meanwhile, when there are a plurality of trained machine learning models, the model evaluator 115 may evaluate whether each trained model satisfies the predetermined criterion and may select a model that satisfies it as the final machine learning model. If a plurality of models satisfy the predetermined criterion, the model evaluator 115 may select, as the final machine learning model, one model or a preset number of models in descending order of evaluation score.
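A minimal sketch of this selection step, assuming each trained model already has a numeric evaluation score and that the predetermined criterion is a minimum score. The function name `select_final_models`, the model names, and the score values are hypothetical.

```python
def select_final_models(scores, min_score, top_k=1):
    """Keep models that meet the criterion, then return the top_k by evaluation score."""
    passing = {name: score for name, score in scores.items() if score >= min_score}
    ranked = sorted(passing, key=passing.get, reverse=True)  # highest score first
    return ranked[:top_k]


# Hypothetical evaluation scores for three trained models.
scores = {"model_a": 0.97, "model_b": 0.99, "model_c": 0.90}

best_one = select_final_models(scores, min_score=0.95)           # single final model
best_two = select_final_models(scores, min_score=0.95, top_k=2)  # preset number of models
```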

Meanwhile, at least one of the data acquirer 111, the preprocessor 112, the training data selector 113, the model learner 114, and the model evaluator 115 in the data learner 110 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of these components may be manufactured as a dedicated hardware chip for artificial intelligence (AI), or as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU), and mounted on the various electronic devices described above.

In addition, the data acquirer 111, the preprocessor 112, the training data selector 113, the model learner 114, and the model evaluator 115 may be mounted on a single electronic device or on separate electronic devices. For example, some of these components may be included in the electronic device and the rest may be included in a server.

In addition, at least one of the data acquirer 111, the preprocessor 112, the training data selector 113, the model learner 114, and the model evaluator 115 may be implemented as a software module. When at least one of these components is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable recording medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, part of the at least one software module may be provided by the OS and the remainder by a predetermined application.

The data recognizer 120 according to an embodiment of the present disclosure may include a data acquirer 121, a preprocessor 122, a recognition data selector 123, a recognition result provider 124, and a model updater 125.

The data acquirer 121 may receive input data. The preprocessor 122 may preprocess the acquired input data so that it can be used by the recognition data selector 123 or the recognition result provider 124.

The recognition data selector 123 may select the necessary data from the preprocessed data. The selected data may be provided to the recognition result provider 124. The recognition data selector 123 may select some or all of the preprocessed data according to a preset criterion, or according to a criterion established through training by the model learner 114.

The recognition result provider 124 may obtain result data by applying the selected data to the machine learning model. The machine learning model may be the one generated by the model learner 114. The recognition result provider 124 may output the result data.

The model updater 125 may cause the machine learning model to be updated based on an evaluation of the recognition result provided by the recognition result provider 124. For example, the model updater 125 may provide that recognition result to the model learner 114 so that the model learner 114 updates the machine learning model.

Meanwhile, at least one of the data acquirer 121, the preprocessor 122, the recognition data selector 123, the recognition result provider 124, and the model updater 125 in the data recognizer 120 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of these components may be manufactured as a dedicated hardware chip for artificial intelligence (AI), or as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU), and mounted on the various electronic devices described above.

In addition, the data acquirer 121, the preprocessor 122, the recognition data selector 123, the recognition result provider 124, and the model updater 125 may be mounted on a single electronic device or on separate electronic devices. For example, some of these components may be included in the electronic device and the rest may be included in a server.

In addition, at least one of the data acquirer 121, the preprocessor 122, the recognition data selector 123, the recognition result provider 124, and the model updater 125 may be implemented as a software module. When at least one of these components is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable recording medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, part of the at least one software module may be provided by the OS and the remainder by a predetermined application.

The following describes in more detail a method and apparatus by which the data acquirer 111, the preprocessor 112, and the training data selector 113 of the data learner 110 receive and process training data.

FIG. 2 is a diagram illustrating a training data acquisition apparatus according to an embodiment of the present disclosure.

The training data acquisition apparatus 100 may include a processor 210 and a memory 220. The processor 210 may execute instructions stored in the memory 220.

As described above, the training data acquisition apparatus 100 may include the data learner 110. The data acquirer 111, the preprocessor 112, or the training data selector 113 of the data learner 110 may be implemented by the processor 210 and the memory 220.

Hereinafter, the training data acquisition apparatus is described in detail with reference to FIGS. 3 and 4.

FIG. 3 is a flowchart illustrating the operation of the training data acquisition apparatus according to an embodiment of the present disclosure. FIG. 4 is a diagram illustrating the operation of the training data acquisition apparatus according to an embodiment of the present disclosure.

The training data acquisition apparatus 100 may acquire training data for generating a machine learning model that detects fraudulent accounts. The training data acquisition apparatus 100 may include the data acquirer 111, the preprocessor 112, or the training data selector 113.

The training data acquisition apparatus 100 may perform step 310 of receiving a report related to a fraudulent address from a first database that stores information about reported fraudulent addresses.

The training data acquisition apparatus 100 may further include a receiver 410 for receiving data from the first database 430. The receiver 410 may receive data over a wired or wireless connection.

The first database 430 may be a database built into a service that provides reports on fraudulent cryptocurrency addresses. The first database 430 may also be a database built into a cryptocurrency scam blacklist service (Bitcoin scam blacklist service). Examples of services that provide reports on fraudulent addresses include BitcoinWhosWho and BitcoinAbuse. The first database 430 stores reports for each cryptocurrency address. The training data acquisition apparatus 100 may receive a report and may determine, based on the report, whether a cryptocurrency address is a fraudulent address.

The training data acquisition apparatus 100 may perform step 320 of obtaining, from the report, a first fraudulent address and a first description associated with the first fraudulent address.

The training data acquisition apparatus 100 may further include a first analyzer 420 for obtaining and processing the first fraudulent address and the first description associated with it. The first analyzer 420 may analyze data received from the first database 430. The first analyzer 420 may be implemented in software or hardware. The first analyzer 420 processes different data from the second analyzer or the third analyzer, but may be implemented on the same hardware.

The first fraudulent address may be the address of an account that can send and receive cryptocurrency. The first fraudulent address may be an address that the service containing the first database 430 has already determined to be a cryptocurrency address used for fraud. The first description may explain, in text, the reason why the first fraudulent address was determined to be fraudulent.

The training data acquisition apparatus 100 may use only first descriptions written in a specific language. Because the first description is written in natural language, the accuracy of the fraudulent-address analysis may drop if the apparatus cannot analyze the language properly. Therefore, the training data acquisition apparatus 100 may use only first descriptions written in a language it can analyze. However, the disclosure is not limited to this.

The training data acquisition apparatus 100 may perform step 330 of extracting, from the first description, a plurality of first key words associated with the first fraudulent address by using natural language processing (NLP). The cryptocurrency scam blacklist service containing the first database may be a highly reliable service for identifying fraudulent addresses. Therefore, the training data acquisition apparatus 100 may derive the first key words from the text of the first description and use them to analyze information about cryptocurrency addresses obtained from other databases.

The training data acquisition apparatus 100 may delete from the first description characters that are unnecessary for analysis, such as special characters, URLs, and stopwords. If fewer than a predetermined number of words remain after the unnecessary characters are deleted, the apparatus may not use that first description. The predetermined number may be, for example, 15. When fewer words remain, there may be too few for them to serve as key words for identifying fraudulent addresses. By using only first descriptions that retain at least the predetermined number of words after deletion, the training data acquisition apparatus 100 may improve its reliability. The reliability of a machine learning model based on the data acquired by the apparatus may improve as well.
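The cleaning and minimum-length filter described above can be sketched as follows. The stopword list, the regular expressions, and the function names `clean_description` and `usable_words` are assumptions for illustration; the disclosure names the kinds of characters removed and the example threshold of 15, but not a particular tokenizer or stopword list.

```python
import re

# Tiny illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "was",
             "i", "my", "this", "that", "it", "for", "on", "with"}
MIN_WORDS = 15  # example threshold given in the description

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")  # special characters, applied after lowercasing


def clean_description(text):
    """Remove URLs, special characters, and stopwords; return the remaining words."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_WORD_RE.sub(" ", text)
    return [word for word in text.split() if word not in STOPWORDS]


def usable_words(text, min_words=MIN_WORDS):
    """Return the cleaned words, or None when too few remain to be useful."""
    words = clean_description(text)
    return words if len(words) >= min_words else None


short_report = "This is a SCAM!!! Visit http://evil.example/pay now"
cleaned = clean_description(short_report)  # URL, punctuation, and stopwords removed
decision = usable_words(short_report)      # too few words remain, so it is discarded
```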

The training data acquisition apparatus 100 may perform step 340 of storing the first fraudulent address in a second database 440. The second database 440 may be included in the training data acquisition apparatus 100. The second database 440 may store data for generating the machine learning model. The second database 440 may also store data for identifying other fraudulent addresses and for analyzing descriptions of fraudulent addresses.
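One way to picture step 340 is with an in-memory SQLite database standing in for the second database 440. The table name `fraud_addresses`, its columns, and the choice to store key words next to the address are assumptions; the description here only states that the first fraudulent address is stored.

```python
import sqlite3


def open_second_database(path=":memory:"):
    """Open a stand-in for the second database 440 and create a hypothetical table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fraud_addresses ("
        "address TEXT PRIMARY KEY, keywords TEXT NOT NULL)"
    )
    return conn


def store_fraud_address(conn, address, keywords):
    """Store one fraudulent address; a repeated address overwrites the earlier row."""
    conn.execute(
        "INSERT OR REPLACE INTO fraud_addresses (address, keywords) VALUES (?, ?)",
        (address, " ".join(keywords)),
    )
    conn.commit()


conn = open_second_database()
store_fraud_address(conn, "addr_A", ["scam", "ransom"])  # placeholder address and key words
rows = conn.execute("SELECT address, keywords FROM fraud_addresses").fetchall()
```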

The following describes a method and apparatus for obtaining fraudulent addresses and related information from data obtained from sources other than cryptocurrency scam blacklist services (Bitcoin scam blacklist services).

FIG. 5 is a flowchart illustrating the operation of the training data acquisition apparatus according to an embodiment of the present disclosure. FIG. 6 is a diagram illustrating the operation of the training data acquisition apparatus according to an embodiment of the present disclosure.

The training data acquisition apparatus 100 may perform step 510 of receiving text information from a publicly accessible website. The apparatus may receive the text information from the website using the receiver 410.

The publicly accessible website 610 may include a personal or technical blog. It may also be a fraud analysis report from a cybersecurity company. The website 610 may contain various information related to cryptocurrency addresses. For example, the website 610 may state that a particular cryptocurrency address was used for fraud, that a transaction with a particular address was satisfactory, or simply that a transaction with a particular address took place. The training data acquisition apparatus 100 may perform the following steps to extract the content stating that a particular cryptocurrency address was used for fraud.

Unlike the first database 430, the website 610 may not have a fixed format. The website 610 may also contain various information other than information related to fraudulent addresses.

The training data acquisition apparatus 100 may crawl a predetermined website 610. However, the disclosure is not limited to this, and the apparatus may crawl an arbitrary website 610 and automatically extract the necessary data.

The source code of the website 610 may consist of an HTML document. The HTML document may include not only the content to be displayed on the website 610 but also code related to the format in which the content is displayed. The training data acquisition apparatus 100 may extract the HTML body from the website 610 as the text information.
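As a rough sketch of pulling the body content out of an HTML document, the following uses Python's standard `html.parser` module. The class name `BodyTextExtractor` and the function `extract_body_text` are hypothetical. The sketch also drops tags and script/style content, a simplification that goes slightly beyond taking the raw HTML body.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text found inside <body>, skipping script and style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><title>T</title></head><body><h1>Report</h1>"
        "<script>var x=1;</script><p>Address used in a scam.</p></body></html>")
text = extract_body_text(page)
```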

The training data acquisition apparatus 100 may perform step 520 of extracting, from the text information, main text information that includes a cryptocurrency address.

The training data acquisition apparatus 100 may further include a second analyzer 620. The second analyzer 620 may analyze the text information received from the website 610. The second analyzer 620 may be implemented in software or hardware. The training data acquisition apparatus 100 may extract the main text information using the second analyzer 620.

The training data acquisition apparatus 100 may use only those pages, among the text information of the website 610, that contain a cryptocurrency address. A cryptocurrency address may have a specific format. The apparatus may therefore determine, based on the content of a page of the website 610, whether a cryptocurrency address appears on that page. The apparatus may remove unnecessary information from the text information of a page that contains a cryptocurrency address. For example, it may delete banners and HTML tags. For this purpose, the training data acquisition apparatus 100 may use Boilerpipe.
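Because addresses follow a specific format, a page can be screened with a pattern match. The sketch below assumes Bitcoin addresses and uses a simplified format check (legacy Base58 addresses starting with 1 or 3, and Bech32 addresses starting with bc1). It does not verify checksums, so it can accept strings that are not valid addresses. The names `ADDRESS_RE`, `find_addresses`, and `page_mentions_address` are hypothetical, and the disclosure does not state which pattern is used.

```python
import re

# Simplified Bitcoin address shapes (assumption): format only, no checksum validation.
ADDRESS_RE = re.compile(
    r"\b(?:[13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[ac-hj-np-z02-9]{11,71})\b"
)


def find_addresses(page_text):
    """Return every substring of the page that looks like a Bitcoin address."""
    return ADDRESS_RE.findall(page_text)


def page_mentions_address(page_text):
    """Decide whether a page should be kept for further analysis."""
    return bool(find_addresses(page_text))


sample = "Victims were told to send funds to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa today."
found = find_addresses(sample)
keep_page = page_mentions_address(sample)
skip_page = page_mentions_address("No address here, only the number 12345.")
```

Pages for which the check returns False would be dropped before boilerplate such as banners is removed.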

The second analyzer 620 of the training data acquisition apparatus 100 may perform step 530 of extracting a plurality of second key words from the main text information by using natural language processing. For example, the apparatus may delete from the main text characters that are unnecessary for analysis, such as special characters, URLs, and stopwords.

The second analyzer 620 of the learning data acquisition apparatus 100 may perform step 540 of obtaining a fraud information detection model. The fraud information detection model may be a neural network classifier. The fraud information detection model may be a model obtained by performing machine learning. The fraud information detection model may be a machine learning model for determining, based on key words related to a cryptocurrency address, whether the cryptocurrency address is being used by a fraudster.

The learning data acquisition apparatus 100 may generate the fraud information detection model itself. To do so, the learning data acquisition apparatus 100 may include a data learner 110. Alternatively, the learning data acquisition apparatus 100 may receive the fraud information detection model from another device. The process by which the learning data acquisition apparatus 100 generates the fraud information detection model is described in detail with reference to FIG. 7.

The second analyzer 620 of the learning data acquisition apparatus 100 may perform step 550 of determining whether the cryptocurrency address included in the main text is a fraudulent address by applying the plurality of second key words to the fraud information detection model. More specifically, the learning data acquisition apparatus 100 may derive the frequency with which each of the plurality of second key words appears in the main text. The learning data acquisition apparatus 100 may apply the plurality of second key words and their frequencies to the fraud information detection model. The learning data acquisition apparatus 100 may thereby obtain, from the fraud information detection model, information on whether the cryptocurrency address included in the main text is a fraudulent address.
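
Step 550 can be illustrated with the short sketch below. The model is represented as any callable that maps a word-to-frequency dictionary to a truthy or falsy value; that interface and the function names are assumptions made for illustration, not something the disclosure defines.

```python
from collections import Counter


def keyword_frequencies(key_words):
    """Count how often each second key word appears in the main text."""
    return Counter(key_words)


def is_fraud_address(key_words, model):
    """Apply key words and their frequencies to a (hypothetical) model callable."""
    features = dict(keyword_frequencies(key_words))
    return bool(model(features))
```

A trained classifier would be passed as `model`; a stub such as `lambda f: f.get("scam", 0) >= 2` is enough to exercise the flow.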

When the cryptocurrency address is a fraudulent address, the second analyzer 620 of the learning data acquisition apparatus 100 may perform step 560 of obtaining the cryptocurrency address as a second fraudulent address. More specifically, when the information on whether the cryptocurrency address included in the main text is a fraudulent address indicates that it is, the learning data acquisition apparatus 100 may obtain the cryptocurrency address included in the main text as the second fraudulent address.

The learning data acquisition apparatus 100 may perform step 570 of storing the second fraudulent address in the second database 440. When the second fraudulent address duplicates the first fraudulent address, the second database 440 may ignore either one of the two addresses, or may update the information on either one of them.
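
The duplicate handling in step 570 could look like the following sketch, which uses an in-memory SQLite database as a stand-in for the second database 440. The table layout, column names, and the `update_existing` switch are assumptions chosen for illustration.

```python
import sqlite3


def open_second_database():
    """Create an in-memory stand-in for the second database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE addresses ("
        "address TEXT PRIMARY KEY, label TEXT NOT NULL, source TEXT NOT NULL)"
    )
    return conn


def store_address(conn, address, label, source, update_existing=False):
    """Insert an address; on a duplicate, either ignore it or update its record."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO addresses (address, label, source) VALUES (?, ?, ?)",
        (address, label, source),
    )
    if cur.rowcount == 0 and update_existing:
        # The address was already stored: refresh its information instead.
        conn.execute(
            "UPDATE addresses SET label = ?, source = ? WHERE address = ?",
            (label, source, address),
        )
    conn.commit()
```

With `update_existing=False` a repeated address is ignored; with `update_existing=True` the stored record is refreshed, matching the two alternatives described above.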

FIG. 7 is a flowchart illustrating a method of obtaining a fraud information detection model according to an embodiment of the present disclosure.

The learning data acquisition apparatus 100 may perform step 710 of obtaining words related to a good cryptocurrency address from a website determined to include good cryptocurrency addresses. A good cryptocurrency address is a cryptocurrency address that is not owned by a fraudster.

A website determined to include good cryptocurrency addresses may be a website that provides reliability information on cryptocurrency addresses. After a cryptocurrency transaction, cryptocurrency users may leave a review of the transaction on the website. A user may express the review as a score or as text.

A user may determine which websites include good cryptocurrency addresses. Alternatively, the learning data acquisition apparatus 100 may determine such websites automatically. The learning data acquisition apparatus 100 may also obtain words related to a good cryptocurrency address from a website or web page that includes the good cryptocurrency address. For example, the learning data acquisition apparatus 100 may remove unnecessary characters from the website or web page and then obtain the words related to the good cryptocurrency address. The words related to a good cryptocurrency address may be key words that describe the good cryptocurrency address.

The learning data acquisition apparatus 100 may perform step 720 of obtaining a first frequency with which each of the words related to the good cryptocurrency address appears on the website 610. By using the first frequency in addition to the words related to the good cryptocurrency address, the learning data acquisition apparatus 100 may increase the accuracy of the fraud information detection model.

The learning data acquisition apparatus 100 may perform step 730 of obtaining a second frequency with which each of the first key words appears in the first description. The learning data acquisition apparatus 100 may obtain the first key words from the first database 430. Since the process of obtaining the first key words has been described with reference to FIGS. 3 and 4, a redundant description is omitted.

The learning data acquisition apparatus 100 may perform step 740 of obtaining the fraud information detection model by machine learning the words related to the good cryptocurrency address, labeled as good, the first frequency, the second frequency, and the plurality of first key words, labeled as fraud. The fraud information detection model may learn information related to good addresses from the first frequency and the words related to the good cryptocurrency address, and may learn information related to fraudulent addresses from the second frequency and the plurality of first key words.
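
To make step 740 concrete, the sketch below trains a classifier on labeled word-frequency features. The disclosure only states that the model may be a neural network classifier; a single-layer perceptron written in plain Python is swapped in here purely for illustration, and the function names, the 0/1 label encoding, and the epoch count are all assumptions.

```python
def _score(weights, bias, features):
    """Weighted sum of word frequencies plus bias."""
    return bias + sum(weights.get(word, 0.0) * count for word, count in features.items())


def train_perceptron(samples, epochs=20):
    """Train on (features, label) pairs; label 1 means fraud, 0 means good.

    features is a dict mapping a key word to its frequency.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in samples:
            predicted = 1 if _score(weights, bias, features) > 0 else 0
            error = label - predicted
            if error:
                # Standard perceptron update, scaled by the word frequency.
                for word, count in features.items():
                    weights[word] = weights.get(word, 0.0) + error * count
                bias += error
    return weights, bias


def predict(model, features):
    """Return "fraud" or "good" for a word-frequency dictionary."""
    weights, bias = model
    return "fraud" if _score(weights, bias, features) > 0 else "good"
```

Good-labeled samples would be built from the words and first frequencies of step 720, and fraud-labeled samples from the first key words and second frequencies of step 730.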

The learning data acquisition apparatus 100 may transmit the fraud information detection model to another learning data acquisition apparatus 100 over a wired or wireless connection. The learning data acquisition apparatus 100 may store the fraud information detection model in the memory 220.

The learning data acquisition apparatus 100 may obtain a new cryptocurrency address, second key words corresponding to the new cryptocurrency address, and the frequencies of the second key words. The learning data acquisition apparatus 100 may determine whether the new cryptocurrency address is fraudulent or good by applying the second key words and their frequencies to the fraud information detection model.

The above describes a configuration in which the learning data acquisition apparatus 100 uses the fraud information detection model to identify fraudulent addresses from information posted on a website, but the disclosure is not limited thereto. The learning data acquisition apparatus 100 may also use the fraud information detection model to identify good addresses from information posted on a website.

In addition, the method by which the learning data acquisition apparatus 100 obtains the fraud information detection model is not limited to the method described above. After reviewing a website, a user may label a web page on which a fraudulent address appears as 'fraud' and store it together with the fraudulent address, and may label a web page on which a good address appears as 'good' and store it together with the good address. The learning data acquisition apparatus 100 may obtain the fraud information detection model by machine learning the fraudulent addresses, the web pages labeled 'fraud', the web pages labeled 'good', and the good addresses. The learning data acquisition apparatus 100 may then determine an address from a web page, or whether the address is associated with a fraudster, simply by applying the web page to the fraud information detection model.

FIG. 8 is a flowchart illustrating an operation of the learning data acquisition apparatus according to an embodiment of the present disclosure. FIG. 10 is a diagram illustrating an operation of the learning data acquisition apparatus according to an embodiment of the present disclosure.

The learning data acquisition apparatus 100 may perform step 810 of obtaining a second description from a service 1010 that provides tags corresponding to cryptocurrency addresses. The learning data acquisition apparatus 100 may obtain the second description using the receiver 410.

A tag may be meta information attached to a cryptocurrency address. Services that provide tags corresponding to cryptocurrency addresses may include sites such as "blockchain.info", "BitcoinTalk community", or "bitcoin-otc.com".

A tag may include a Submitted link tag, a Signed message tag, a Bitcointalk profile tag, or a Bitcoin-OTC profile tag (Bitcoin over-the-counter profile tag). The Submitted link tag provides a brief description of the tagged cryptocurrency address. The person who submits a report sometimes provides a fraud description together with a page link indicating the source of the fraud information.

The Signed message tag provides the owner of the address. However, because this identifier is chosen by the owner, a fraudster can claim false ownership.

The Bitcointalk profile tag may provide only a user identifier in the cryptocurrency community.

The Bitcoin-OTC profile tag provides a user identifier on the Bitcoin-OTC website. Unlike the Bitcointalk community, this website provides a reputation score for each user alias. The score may be assigned by a counterparty who carried out a financial transaction with the target cryptocurrency address. The website also provides a brief explanation of why the counterparty assigned the given score to the given cryptocurrency address. Therefore, the Bitcoin-OTC profile tag can be used to obtain information on both fraudulent and good cryptocurrency addresses.

The second description may be obtained from a Signed message tag or a Bitcoin-OTC profile tag. The second description may be text information that expresses the reputation associated with a cryptocurrency address.

The learning data acquisition apparatus 100 may perform step 820 of obtaining a fraud key word set based on the plurality of first key words.

The learning data acquisition apparatus 100 may further include a third analyzer 1020. The third analyzer 1020 may analyze the second description received from the service 1010 that provides tags. The third analyzer 1020 may be implemented in software or hardware. The learning data acquisition apparatus 100 may use the third analyzer 1020 to obtain the fraud key word set from the first key words.

The learning data acquisition apparatus 100 may obtain the first key words from the first database 430. Since the process of obtaining the first key words has been described with reference to FIGS. 3 and 4, a redundant description is omitted.

The fraud key word set may include only nouns. The learning data acquisition apparatus 100 may also remove, from the first key words, items that are unnecessary for analysis. For example, the learning data acquisition apparatus 100 may delete from the first key words terms related to Twitter, Tumblr, and Instagram, which are not related to fraud.

The learning data acquisition apparatus 100 may perform a step of obtaining, for each of the plurality of first key words, the frequency with which it appears in the first description. The learning data acquisition apparatus 100 may perform a step of determining a predetermined number of the most frequent words among the plurality of first key words as the fraud key word set. For example, the learning data acquisition apparatus 100 may select the 11 most frequent words among the first key words to obtain the fraud key word set.
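
The selection of the most frequent first key words can be sketched with `collections.Counter`. The whitespace tokenization and the function name are assumptions; the default of 11 follows the example in the text.

```python
from collections import Counter


def build_fraud_key_word_set(first_descriptions, first_key_words, top_n=11):
    """Pick the top_n first key words by frequency across the first descriptions."""
    wanted = set(first_key_words)
    counts = Counter()
    for description in first_descriptions:
        for token in description.lower().split():
            if token in wanted:
                counts[token] += 1
    return {word for word, _ in counts.most_common(top_n)}
```

Noun filtering and the removal of unrelated terms, mentioned earlier, would be applied to `first_key_words` before this call.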

When a word included in the fraud key word set appears in the second description, the learning data acquisition apparatus 100 may perform step 830 of determining the cryptocurrency address corresponding to the second description as a third fraudulent address. Because a tag contains only a small number of words, the learning data acquisition apparatus 100 may determine whether the tag indicates fraud based on the fraud key word set derived from the first key words.

The learning data acquisition apparatus 100 may further use the frequency, in the first description, of the words included in the fraud key word set. For example, even if the second description includes a word from the fraud key word set, the learning data acquisition apparatus 100 may not determine the cryptocurrency address corresponding to the second description as a third fraudulent address if that word does not appear frequently in the second description. Conversely, if the second description includes a word from the fraud key word set and that word appears frequently in the second description, the learning data acquisition apparatus 100 may determine the cryptocurrency address corresponding to the second description as a third fraudulent address.
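
A minimal sketch of step 830 with the optional frequency condition follows. The text does not define how often a word must occur to count as "frequent", so the `min_count` threshold is an assumption, as are the function name and the whitespace tokenization.

```python
def is_third_fraud_address(second_description, fraud_key_word_set, min_count=1):
    """True when some fraud key word occurs at least min_count times in the tag text."""
    tokens = second_description.lower().split()
    return any(tokens.count(word) >= min_count for word in fraud_key_word_set)
```

With `min_count=1` this reduces to the plain membership check of step 830; a larger value applies the stricter, frequency-based variant.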

The learning data acquisition apparatus 100 may perform step 840 of storing the third fraudulent address in the second database 440. When the third fraudulent address duplicates the first fraudulent address or the second fraudulent address, the second database 440 may ignore any one of the third fraudulent address, the first fraudulent address, and the second fraudulent address, or may update the information on any one of them.

FIG. 9 is a flowchart illustrating an operation of the learning data acquisition apparatus according to an embodiment of the present disclosure.

FIG. 8 described the case in which the learning data acquisition apparatus 100 obtains a second description from the service 1010 that provides tags. FIG. 9 describes the case in which reliability score information for a cryptocurrency address is obtained in addition to the second description.

The learning data acquisition apparatus 100 may perform step 910 of obtaining score information indicating the reliability of an address from a service that provides tags corresponding to cryptocurrency addresses. The score information indicating the reliability of the address may be a score left by a counterparty who transacted with the cryptocurrency address. When a plurality of counterparties have left scores, the average of those scores may be the score information indicating the reliability of the address.

When the score information indicates good (benign) and the second description does not include any word from the fraud key word set, the learning data acquisition apparatus 100 may perform step 920 of determining the cryptocurrency address as a good address. The learning data acquisition apparatus 100 may determine that the score information indicates good when it is greater than or equal to a threshold. However, the disclosure is not limited thereto, and the learning data acquisition apparatus 100 may instead determine that the score information indicates good when it is less than or equal to the threshold.

When the score information indicates fraud (scam) and the second description includes a word from the fraud key word set, the learning data acquisition apparatus 100 may perform step 930 of determining the cryptocurrency address as a third fraudulent address. The learning data acquisition apparatus 100 may determine that the score information indicates fraud when it is less than or equal to a threshold. However, the disclosure is not limited thereto, and the learning data acquisition apparatus 100 may instead determine that the score information indicates fraud when it is greater than or equal to the threshold.

When the score information indicates fraud but the second description does not include any word from the fraud key word set, or when the score information indicates good but the second description includes a word from the fraud key word set, the learning data acquisition apparatus 100 may withhold a determination on the cryptocurrency address. Because the learning data acquisition apparatus 100 determines a cryptocurrency address as a good address or a fraudulent address only in clear cases, the subsequent machine learning can be based on reliable data.
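
Steps 910 to 930 and the withheld case can be combined into one small decision function, sketched below. The threshold value, the choice to treat a score equal to the threshold as good, and the function name are assumptions; the text allows the comparison direction to be reversed.

```python
from statistics import mean


def label_address(scores, second_description, fraud_key_word_set, threshold=0.0):
    """Return "good", "fraud", or None when the two signals disagree.

    scores: reputation scores left by counterparties (averaged, as in step 910).
    """
    score_is_good = mean(scores) >= threshold  # assumption: higher score is better
    tokens = set(second_description.lower().split())
    has_fraud_word = bool(tokens & set(fraud_key_word_set))

    if score_is_good and not has_fraud_word:
        return "good"    # step 920
    if not score_is_good and has_fraud_word:
        return "fraud"   # step 930
    return None          # signals conflict: withhold the determination
```

Only addresses labeled `"good"` or `"fraud"` would then be stored; a `None` result leaves the address out of the learning data.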

The learning data acquisition apparatus 100 may perform step 940 of storing the good address and the third fraudulent address in the second database 440. When the third fraudulent address duplicates the first fraudulent address or the second fraudulent address, the second database 440 may ignore any one of the third fraudulent address, the first fraudulent address, and the second fraudulent address, or may update the information on any one of them.

FIG. 11 is a diagram illustrating a configuration for deriving a machine learning model according to an embodiment of the present disclosure.

The description so far has covered how the learning data acquisition apparatus 100 derives the first fraudulent address, the second fraudulent address, the third fraudulent address, and the good address and stores them in the second database 440. The data learner 110 may perform machine learning based on the data stored in the second database 440 and derive a machine learning model 1130.

The data learner 110 may use not only the first fraudulent address, the second fraudulent address, the third fraudulent address, and the good address, but also information related to these addresses. The information related to the first fraudulent address, the second fraudulent address, the third fraudulent address, and the good address may include a transaction history. The transaction history may include the date and time of a transaction, the address of the counterparty, or the transaction amount.

The data learner 110 may analyze the information related to the first fraudulent address, the second fraudulent address, the third fraudulent address, and the good address to obtain features of the addresses. The data learner 110 may perform machine learning using the features of the addresses and generate the machine learning model 1130.
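
As an illustration of turning a transaction history into address features, consider the sketch below. The text names only the transaction date and time, the counterparty address, and the amount; the specific features computed here, the `Transaction` record, and the function name are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Transaction:
    timestamp: datetime   # date and time of the transaction
    counterparty: str     # address of the counterparty
    amount: float         # transaction amount


def address_features(transactions):
    """Summarize one address's transaction history as a feature dictionary."""
    count = len(transactions)
    total = sum(t.amount for t in transactions)
    return {
        "tx_count": count,
        "unique_counterparties": len({t.counterparty for t in transactions}),
        "total_amount": total,
        "mean_amount": total / count if count else 0.0,
    }
```

Feature dictionaries of this kind, paired with the good or fraud label stored in the second database 440, would form the training input of the data learner 110.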

The data learner 110 may store the generated machine learning model 1130 in a memory or transmit it to another device. The data recognizer 120 may determine whether a cryptocurrency address is a fraudulent address based on the machine learning model 1130. The data recognizer 120 may receive a new cryptocurrency address and apply it to the machine learning model 1130 to determine whether the new cryptocurrency address is a fraudulent address.

Various embodiments have been described above. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in a descriptive sense rather than a limiting one. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Meanwhile, the embodiments of the present invention described above can be written as a program executable on a computer, and can be implemented on a general-purpose digital computer that runs the program using a computer-readable recording medium. The computer-readable recording medium includes storage media such as magnetic storage media (for example, ROMs, floppy disks, and hard disks) and optically readable media (for example, CD-ROMs and DVDs).

Claims

A method of acquiring, in a learning data acquisition apparatus, learning data for generating a machine learning model for detecting fraudulent accounts of cryptocurrency, the method comprising:
receiving a report associated with a fraudulent address from a first database that stores information about reported fraudulent addresses;
obtaining a first fraudulent address and a first description associated with the first fraudulent address from the report;
extracting a plurality of first key words associated with the first fraudulent address from the first description using natural language processing;
storing the first fraudulent address in a second database;
receiving text information from a publicly accessible website;
extracting main text information including a cryptocurrency address from the text information;
extracting a plurality of second key words from the main text information using natural language processing;
obtaining a fraud information detection model;
determining whether the cryptocurrency address included in the main text is a fraudulent address by applying the plurality of second key words to the fraud information detection model;
when the cryptocurrency address is a fraudulent address, obtaining the cryptocurrency address as a second fraudulent address; and
storing the second fraudulent address in the second database.

delete

The method of claim 1,
wherein obtaining the fraud information detection model comprises:
obtaining words associated with a good cryptocurrency address from a website determined to include the good cryptocurrency address;
obtaining a first frequency with which each of the words associated with the good cryptocurrency address appears on the website;
obtaining a second frequency with which each of the first key words appears in the first description; and
obtaining the fraud information detection model by machine learning the words associated with the good cryptocurrency address, labeled as good, the first frequency, the second frequency, and the plurality of first key words, labeled as fraud.

The method of claim 1, further comprising:
obtaining a second description from a service that provides a tag corresponding to a cryptocurrency address;
obtaining a fraud key word set based on the plurality of first key words;
when a word included in the fraud key word set appears in the second description, determining the cryptocurrency address corresponding to the second description as a third fraudulent address; and
storing the third fraudulent address in the second database.

The method of claim 4,
wherein obtaining the fraud key word set comprises:
obtaining, for each of the plurality of first key words, the frequency with which it appears in the first description; and
determining a predetermined number of the most frequent words among the plurality of first key words as the fraud key word set.

The method of claim 4, further comprising:
obtaining score information indicating the reliability of the address from the service that provides the tag corresponding to the cryptocurrency address;
determining the cryptocurrency address as a good address when the score information indicates good (benign) and the second description does not include any word included in the fraud key word set;
determining the cryptocurrency address as the third fraudulent address when the score information indicates fraud (scam) and the second description includes a word included in the fraud key word set; and
storing the good address and the third fraudulent address in the second database.

An apparatus for acquiring learning data for generating a machine learning model for detecting fraudulent cryptocurrency accounts,
Comprising a processor and a memory,
Wherein the processor, according to instructions stored in the memory, performs:
Receiving a report associated with a fraudulent address from a first database that stores information about the reported fraudulent address;
Obtaining a first fraudulent address and a first description associated with the first fraudulent address from the report;
Extracting a plurality of first key words associated with the first fraudulent address from the first description using natural language processing;
Storing the first fraudulent address in a second database;
Receiving textual information from a publicly accessible website;
Extracting main text information including a cryptocurrency address from the text information;
Extracting a plurality of second key words from the main text information using natural language processing;
Obtaining a fraud detection model;
Determining whether a cryptocurrency address included in the main text information is a fraudulent address by applying the plurality of second key words to the fraud detection model;
If the cryptocurrency address is a fraudulent address, obtaining the cryptocurrency address as a second fraudulent address; and
Storing the second fraudulent address in the second database.
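The step of extracting main text information that includes a cryptocurrency address can be sketched as below. The claim is not limited to any particular currency, so the regular expression, which matches only the legacy Base58 Bitcoin address format, is an assumption. The paragraph-splitting rule and function name are also hypothetical. The sample address is the widely published Bitcoin genesis-block address, used purely as test input.

```python
import re

# Legacy Bitcoin address shape: starts with 1 or 3, Base58 characters only.
ADDRESS_RE = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")

def extract_main_text(page_text):
    """Return (paragraph, addresses) pairs for paragraphs containing an address."""
    results = []
    for paragraph in page_text.split("\n\n"):
        addresses = ADDRESS_RE.findall(paragraph)
        if addresses:
            results.append((paragraph.strip(), addresses))
    return results

page = (
    "Welcome to our forum.\n\n"
    "Send 0.1 BTC to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa and get double back!\n\n"
    "Contact the moderators for help."
)
main_text = extract_main_text(page)
```

Only the middle paragraph is kept. Its words would then go to key word extraction and the fraud detection model. The pattern checks shape only, not the address checksum.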

(Deleted)

The apparatus of claim 7, wherein
The processor, according to the instructions stored in the memory, further performs:
Obtaining words associated with a good cryptocurrency address from a website determined to include the good cryptocurrency address;
Obtaining a first frequency with which each of the words associated with the good cryptocurrency address appears on the website;
Obtaining a second frequency with which each of the first key words appears in the first description; and
Obtaining the fraud detection model by performing machine learning on the words associated with the good cryptocurrency address, which are labeled good, the first frequency, the second frequency, and the plurality of first key words, which are labeled fraud.

The apparatus of claim 7, wherein
The processor, according to the instructions stored in the memory, further performs:
Obtaining a second description from a service that provides a tag corresponding to a cryptocurrency address;
Obtaining a fraudulent key word set based on the plurality of first key words;
If the second description contains a word included in the fraudulent key word set, determining the cryptocurrency address corresponding to the second description to be a third fraudulent address; and
Storing the third fraudulent address in the second database.

The apparatus of claim 10, wherein
The processor, according to the instructions stored in the memory, further performs:
Obtaining, for each of the plurality of first key words, a frequency with which the key word appears in the first description; and
Determining, as the fraudulent key word set, a predetermined number of the most frequent words among the plurality of first key words.

The apparatus of claim 10, wherein
The processor, according to the instructions stored in the memory, further performs:
Obtaining score information indicating the reliability of the cryptocurrency address from the service that provides the tag corresponding to the cryptocurrency address;
Determining the cryptocurrency address to be a good address when the score information indicates that the address is benign and the second description does not include any word included in the fraudulent key word set;
Determining the cryptocurrency address to be the third fraudulent address when the score information indicates a scam and the second description includes a word included in the fraudulent key word set; and
Storing the good address and the third fraudulent address in the second database.