KR20220056288A

KR20220056288A - A crowdsourcing system for artificial intelligence data composition

Info

Publication number: KR20220056288A
Application number: KR1020200140357A
Authority: KR
Inventors: 김수경; 김기형; 이형용; 김성은; 진광영
Original assignee: 주식회사 엠아이제이
Priority date: 2020-10-27
Filing date: 2020-10-27
Publication date: 2022-05-06

Abstract

The present invention relates to a crowdsourcing system for artificial intelligence data composition, capable of extracting a keyword through morpheme analysis by collecting source data, and building an ontology knowledge base by extracting knowledge data through deep learning. The crowdsourcing system comprises: a data preprocessing part preprocessing data; a morpheme analysis part performing morpheme analysis with respect to the preprocessed data; a keyword extraction part extracting a keyword from input data; a deep learning processing part performing deep learning with respect to the input data; an ontology composition part building the data, to which deep learning has been performed, as an ontology knowledge base; and a search part receiving an inquiry and performing a search with respect to the inquiry. Through the system, artificial intelligence (AI) data can be generated regardless of location and time, so that costs consumed for building data for artificial intelligence can be saved.

Description

A crowdsourcing system for artificial intelligence data composition

The present invention relates to a crowdsourcing system for constructing artificial intelligence data, which collects raw data, extracts keywords through morphological analysis, extracts knowledge data through deep learning, and builds the result into an ontology knowledge base.

Constructing artificial intelligence (AI) data consumes considerable cost and time. In particular, in a contact-free ("untact") society, the cost and time required to create such data are multiplying.

Therefore, under these social conditions, AI data must be producible regardless of place and time in order to reduce costs.

Korean Laid-open Patent Publication No. 10-2019-0103951 (published September 5, 2019)
Korean Registered Patent Publication No. 10-2091240 (announced March 20, 2020)
Korean Laid-open Patent Publication No. 10-2020-0033707 (published March 30, 2020)

An object of the present invention is to solve the problems described above by providing a crowdsourcing system for constructing artificial intelligence data, which collects raw data, extracts keywords through morphological analysis, extracts knowledge data through deep learning, and builds the result into an ontology knowledge base.

To achieve the above object, the present invention relates to a crowdsourcing system for constructing artificial intelligence data, comprising: a data preprocessing unit that preprocesses data; a morphological analysis unit that performs morphological analysis on the preprocessed data; a keyword extraction unit that extracts keywords from input data; a deep learning processing unit that performs deep learning on input data; an ontology construction unit that builds the data on which deep learning has been performed into an ontology knowledge base; and a search unit that receives a query and performs a search for the query.

In addition, in the crowdsourcing system for constructing artificial intelligence data, the present invention is characterized in that the similar-image derivation unit derives similar frames by performing AI-based similarity learning for each pattern.

As described above, with the crowdsourcing system for constructing artificial intelligence data according to the present invention, artificial intelligence (AI) data can be generated regardless of place and time, so the cost of building data for artificial intelligence can be reduced.

FIG. 1 is a diagram showing the entire process for AI data based on unstructured data.
FIG. 2 is a diagram showing the Hwp and PDF parsing process (data collector).
FIG. 3 is a diagram showing the preprocessing process.
FIG. 4 is a diagram showing data preprocessing monitoring.
FIG. 5 is a diagram showing the morphological analysis process.
FIG. 6 is a diagram showing morphological analysis monitoring.
FIGS. 7 and 8 are diagrams showing the keyword extraction process.
FIG. 9 is a diagram showing the deep learning process.
FIG. 10 is a diagram showing deep learning monitoring.

Hereinafter, specific details for carrying out the present invention will be described with reference to the drawings.

In describing the present invention, identical parts are given identical reference numerals, and repeated description of them is omitted.

First, the overall process according to an embodiment of the present invention will be described.

○ Constructing AI data takes considerable cost and time, and in a contact-free ("untact") society the cost and time needed to create such data are multiplying.

○ Under these social conditions, AI data must be producible regardless of place and time in order to reduce costs.

○ This method is a process for constructing unstructured AI data, and [Figure 23] is a schematic of the entire process.

○ The data collector is the process by which a machine automatically collects data from the domain the user wants.

○ Data preprocessing is the process of cleaning the data to be used for analysis before the analysis is performed.

○ The preprocessing step corrects errors in the natural-language text or converts it into a form suited to the process that follows preprocessing.

○ Among the material gathered by the data collector, unstructured data (Hwp, Pdf, etc.) is automatically stored in a database according to the meta structure designed by each administrator. During preprocessing, the machine automatically corrects spacing errors, handles special tags, and fixes parsing errors in this natural-language text so that it matches the meaning of the original text.

○ Data preprocessing monitoring means that an administrator reviews the results before and after preprocessing and raises issues about parts the machine processed incorrectly. These issues are turned into patterns for future preprocessing runs, which improves the reliability of the program.
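
The preprocessing and monitoring steps described above can be sketched roughly as follows. This is a minimal illustration and not the implementation disclosed in the patent: the function name `preprocess`, the regular expressions, and the `extra_rules` parameter (standing in for correction patterns that an administrator adds after monitoring) are all assumptions. Correcting Korean spacing errors in general would need a trained model, which this sketch does not attempt.

```python
import re
import unicodedata

# Hypothetical cleanup patterns; the source does not specify any of these.
TAG_RE = re.compile(r"<[^>]+>")  # markup-like special tags left behind by a parser
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # stray control characters
SPACE_RE = re.compile(r"[ \t\u00a0\u3000]+")  # runs of spaces, tabs, NBSP, ideographic space


def preprocess(raw, extra_rules=()):
    """Clean parsed text before morphological analysis.

    extra_rules is a sequence of (pattern, replacement) pairs, assumed here
    to represent corrections an administrator registered during monitoring.
    """
    text = unicodedata.normalize("NFC", raw)
    text = TAG_RE.sub(" ", text)        # special-tag handling
    text = CONTROL_RE.sub("", text)     # parsing-error debris
    for pattern, replacement in extra_rules:
        text = re.sub(pattern, replacement, text)
    text = SPACE_RE.sub(" ", text)      # collapse repeated spacing
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Passing the administrator's corrections as data rather than hard-coding them keeps the cleaning function unchanged while the set of patterns grows through monitoring.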

○ Morphological analysis is the process of dividing natural-language text into small units of meaning (morphemes) and assigning a tag to each.

○ Morphological analysis classifies morphemes as nouns, verbs, adjectives, and so on, and supports functions such as keyword extraction.

○ The morphological analysis process can extract the words classified as nouns to produce a set of meaningful keyword candidates.

○ Morphological analysis monitoring is a scheme in which an administrator compares the original natural-language sentences with the morphological analysis results and raises issues about parts the machine segmented incorrectly, so that the analysis patterns can be corrected. Correcting and adding patterns in this way increases the accuracy of later morphological analysis.
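
As a toy illustration of the tagging and noun-candidate steps above, the sketch below segments each word by greedy longest match against a small lexicon. The lexicon entries, the tag names, and the functions `analyze` and `noun_candidates` are hypothetical; the source does not say which analyzer is used, and real Korean text would need a full morphological analyzer rather than this dictionary lookup. English sample words are used only to keep the example easy to read.

```python
from typing import Dict, List, Tuple

# Toy lexicon: every entry and tag name is an assumption made for illustration.
LEXICON: Dict[str, str] = {
    "keyword": "NOUN",
    "data": "NOUN",
    "extract": "VERB",
    "from": "PREP",
    "ed": "SUFFIX",
    "s": "SUFFIX",
}


def analyze(text: str, lexicon: Dict[str, str] = LEXICON) -> List[Tuple[str, str]]:
    """Split each whitespace-delimited word into tagged morphemes."""
    max_len = max(map(len, lexicon), default=1)
    result: List[Tuple[str, str]] = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest possible piece first, then shorter ones.
            for size in range(min(max_len, len(word) - i), 0, -1):
                piece = word[i:i + size]
                if piece in lexicon:
                    result.append((piece, lexicon[piece]))
                    i += size
                    break
            else:
                # Unknown character: keep it tagged so a reviewer can spot it.
                result.append((word[i], "UNK"))
                i += 1
    return result


def noun_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Keep only morphemes tagged as nouns, as keyword candidates."""
    return [morpheme for morpheme, tag in tagged if tag == "NOUN"]
```

In this sketch, the pattern corrections an administrator makes during monitoring would correspond to adding or changing lexicon entries, after which the `UNK` pieces for that word disappear from the output.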

○ Keyword extraction automatically extracts the representative keywords of each document and computes an importance weight for each keyword.

○ Keyword extraction monitoring means that an administrator reviews the original document alongside the keywords extracted from it and can mark keywords as suitable, unsuitable, or core terms, or raise issues about words that were split incorrectly during morphological analysis.
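
The source does not name the weighting scheme used for keyword importance. As one common possibility, the sketch below uses TF-IDF (term frequency multiplied by inverse document frequency) over the noun candidates of each document. The function names `keyword_weights` and `top_keywords` are hypothetical, and TF-IDF itself is a substitute chosen for illustration, not the method confirmed by the patent.

```python
import math
from collections import Counter
from typing import Dict, List


def keyword_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: TF-IDF weight} mapping per document.

    Each document is a list of candidate terms (for example, extracted nouns).
    """
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights: List[Dict[str, float]] = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(weight: Dict[str, float], k: int) -> List[str]:
    """Pick the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(weight.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]
```

With this weighting, a term that occurs in every document gets weight zero, which matches the intent of picking keywords that represent an individual document rather than the whole collection.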

○ Document similarity (deep learning) extraction vectorizes each word of the morphologically analyzed text based on the words located before and after it, and then automatically extracts the similarity between documents by applying cosine similarity to the vectorized words.

○ Document similarity (deep learning) monitoring means that an administrator reviews results in which deep learning has found documents from different domains to be similar and selects whether the association between those documents is suitable or unsuitable. The selection is reflected in the live service and used for inferring related documents.
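
The similarity step above can be illustrated with a simplified stand-in. Instead of a trained neural embedding, the sketch below represents each word by raw counts of its neighbouring words, sums the word vectors of a document, and compares documents by cosine similarity. The window size and the functions `context_vectors`, `document_vector`, and `cosine` are assumptions; a learned embedding model would replace the counting step in a real system.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def context_vectors(docs: List[List[str]], window: int = 1) -> Dict[str, Counter]:
    """Represent each word by counts of the words within `window` positions of it."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for doc in docs:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][doc[j]] += 1
    return vectors


def document_vector(doc: List[str], vectors: Dict[str, Counter]) -> Counter:
    """Sum the context vectors of all words in the document."""
    total: Counter = Counter()
    for word in doc:
        total.update(vectors.get(word, {}))
    return total


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Document pairs whose similarity exceeds some threshold could then be queued for the administrator's suitable/unsuitable review described above; the threshold value is not given in the source.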

Although the invention made by the present inventors has been described in detail above with reference to embodiments, the present invention is not limited to those embodiments and can of course be modified in various ways without departing from its gist.

10: user terminal 11: client
30: crowdsourcing server
40: database
80: network

Claims

A crowdsourcing system for constructing artificial intelligence data, comprising:
a data preprocessing unit that preprocesses data;
a morphological analysis unit that performs morphological analysis on the preprocessed data;
a keyword extraction unit that extracts keywords from input data;
a deep learning processing unit that performs deep learning on input data;
an ontology construction unit that builds the data on which deep learning has been performed into an ontology knowledge base; and
a search unit that receives a query and performs a search for the query.

The crowdsourcing system for constructing artificial intelligence data according to claim 1,
wherein the data preprocessing unit automatically stores unstructured data among the collected data in a database according to a pre-designed meta structure.