KR20210045172A - Big Data Management and System for Livestock Disease Outbreak Analysis

Info

Publication number: KR20210045172A
Application number: KR1020190128598A
Authority: KR
Inventors: 남준
Original assignee: 케이웨어 (주)
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2021-04-26

Abstract

Big data has emerged as a global concern in information technology, and attention is focused on what value public institutions and companies will create from the big data collected so far. Accordingly, the big data management system for a livestock disease outbreak information service is developed in the following stages. First, the present invention develops a big data collector module for livestock disease outbreak information that collects, in real time, outbreak data provided by public and private institutions through various sources such as web portals, SNS, and the intranets of public institutions. Second, the present invention develops a real-time big data storage manager module for livestock disease outbreak information that stores and manages the collected big data in the Hadoop Distributed File System (HDFS) and, after real-time distributed parallel processing through the MapReduce framework, stores and manages it in a relational DB. Third, the present invention develops a big data analysis and visualization processor module that analyzes the stored livestock disease outbreak data by subject using real-time data mining technology and performs classification, grouping, and prediction of outbreak progress. Finally, the present invention develops a livestock disease outbreak information analysis service system that, based on the analyzed data, effectively provides outbreak statistics, outbreak policy, and outbreak prediction services in order to strengthen the national ability to respond to livestock disease outbreaks.

Description

Big Data Management and System for Livestock Disease Outbreak Analysis

The present invention relates to technology for utilizing the big data of public and private institutions, covering not only data collection, storage, processing, and management but also prediction and visualization services based on data analysis.

Big data has rapidly emerged as a worldwide concern in information technology, and attention is focused on what value public institutions and companies will create from the big data they have collected so far. Hadoop, an open source project for big data processing and analysis, includes the Hadoop Distributed File System (HDFS), OS-level abstractions, and the MapReduce engine, together with Hive, which makes it easy to aggregate, query, and analyze large volumes of data, and Mahout, a machine learning library capable of distributed and parallel processing for developing intelligent applications that require large volumes of data. It also consists of the necessary Java archive (JAR) files, scripts for starting Hadoop, source code, and related materials.

In the case of livestock disease occurrence information, the issue is less the overall size of the data than the fact that small amounts of data are scattered across several ministries, are not linked to one another, and are managed and used in isolation. Various approaches and analysis methods are therefore needed for big data of livestock disease occurrence information. An object of the present invention is to provide a livestock disease occurrence information big data collector, a livestock disease occurrence information big data storage manager, a livestock disease occurrence information big data analysis and visualization processor, and livestock disease occurrence information statistics and prediction services.

To achieve the object of the present invention from a technical development approach, the problems and their solutions are described as follows.

First, disease data are divided among and managed separately by public institutions, local governments, government ministries, and private institutions, so the data are not linked. The present invention therefore provides multi-source big data collection, storage management, analysis, and visualization services that take into account the various sources and formats of livestock disease occurrence information data.

Second, government-affiliated agencies have developed services that use some livestock disease outbreak information data, but no disease outbreak information service has been commercialized. The present invention therefore offers convenience by providing various visualization functions, such as charts and graphs, so that users can grasp the data intuitively.

Third, there are no guidelines for developing service technology based on analysis of the livestock disease occurrence information data held by government ministries, public institutions, and local governments. By developing service technology for big data analysis of public data, domestic technology products can be developed and guidelines that public institutions can use can be presented.

In addition, to achieve the object of the present invention from a service development approach, the problems and their solutions are described as follows.

First, there is a dependence on foreign big data analysis platforms. Existing open source software is therefore analyzed to develop a platform suited to the characteristics of big data in domestic public institutions, and an analysis and visualization system suited to the characteristics of disease occurrence data is developed.

Second, there are limits on real-time data processing, limits on processing diverse data, and difficulties in managing large numbers of distributed files. Various distributed query processing techniques are therefore developed for real-time processing, and a batch processing module is developed to support complex and varied processing and analysis of big data.

Third, load balancing policies that take node workload and data utilization into account are inadequate. Data deduplication and load balancing technology is therefore developed for efficient resource management.

According to the present invention, real-time analysis of the livestock disease occurrence information data of public and private institutions allows local governments to make policy decisions and to provide a livestock disease occurrence information statistics service and a livestock disease occurrence information prediction service. In addition, elemental technologies and services for providing customized livestock disease occurrence information services are developed.

FIG. 1 is a flow chart of the livestock disease occurrence information big data collector module of the present invention.
FIG. 2 is a flow chart of the livestock disease occurrence information big data preprocessing and storage/distributed batch processing module of the present invention.
FIG. 3 is a flow chart of the livestock disease occurrence information big data preprocessing module of the present invention.
FIG. 4 is a schematic diagram of the livestock disease occurrence information statistics service of the present invention.
FIG. 5 is a schematic diagram of the livestock disease occurrence information classification service of the present invention.
FIG. 6 is a schematic diagram of disease keyword classification management and setting of the present invention.
FIG. 7 is a visualization service of the present invention showing the monthly disease occurrence status of a specific region.
FIG. 8 is a disease trend analysis and visualization service of the present invention organized by top keyword.

FIG. 1 shows the livestock disease occurrence information big data collector module of the present invention. Livestock disease occurrence data (Open API Source, Blog Source, News Source, Web Source, etc.) are collected by the data collector module, which is described in detail as follows. The Open API collector collects, in real time, the livestock disease occurrence information data of public institutions provided by the public portal (data.go.kr). The Web Crawler collects, in real time, the disease occurrence data presented on the disease occurrence report pages of domestic and foreign public institutions. The Web Scraper, like the Web Crawler, collects disease occurrence information data in real time from the websites of domestic and foreign public institutions.
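The Web Scraper step can be illustrated with a minimal sketch. It assumes, purely for illustration, that a report page lists outbreaks in an HTML table with date, region, and disease columns; the page layout, the field names, the sample rows, and the names OutbreakTableParser and scrape_outbreaks are hypothetical and do not come from the patent. The page is an inline string so the sketch runs without network access.

```python
from html.parser import HTMLParser


class OutbreakTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # cells of the row currently being parsed
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            # Header rows contain only <th> cells, so they stay empty and are skipped.
            self.rows.append(self._row)
            self._row = None


def scrape_outbreaks(html, source):
    """Turn a report page into records tagged with the collector that produced them."""
    parser = OutbreakTableParser()
    parser.feed(html)
    return [
        {"date": r[0], "region": r[1], "disease": r[2], "source": source}
        for r in parser.rows
        if len(r) == 3
    ]


# Made-up sample page; the rows are illustrative, not real outbreak figures.
SAMPLE_PAGE = """
<table>
  <tr><th>Date</th><th>Region</th><th>Disease</th></tr>
  <tr><td>2019-09-17</td><td>Gyeonggi-do</td><td>African swine fever</td></tr>
  <tr><td>2019-09-24</td><td>Incheon</td><td>African swine fever</td></tr>
</table>
"""

records = scrape_outbreaks(SAMPLE_PAGE, source="web")
```

Tagging each record with its source at collection time is a design choice of this sketch; it keeps the collection path available to later stages.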

FIG. 2 shows the big data preprocessing and storage module for livestock disease occurrence information. Here, the Log Aggregator collects and manages the logs of disease occurrence data gathered by the various collectors of FIG. 1. The Connector Module passes the disease occurrence data gathered by the collectors to the data preprocessor and filters for the necessary data before preprocessing begins.
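A minimal sketch of the Connector Module's role follows, under the assumption that "filtering for the necessary data" means dropping records with missing fields before they reach the preprocessor. The field names, the functions is_usable and connect, and the sample records are hypothetical.

```python
# Hypothetical field names; the patent does not define a record schema.
REQUIRED_FIELDS = ("date", "region", "disease", "source")


def is_usable(record):
    """Keep only records in which every required field is present and non-empty."""
    return all(record.get(name) for name in REQUIRED_FIELDS)


def connect(collected, preprocessor):
    """Filter collected records, then hand the usable ones to the preprocessor."""
    usable = [record for record in collected if is_usable(record)]
    return preprocessor(usable)


# Made-up sample input from three collectors.
collected = [
    {"date": "2019-09-17", "region": "Gyeonggi-do", "disease": "African swine fever", "source": "open_api"},
    {"date": "", "region": "Incheon", "disease": "African swine fever", "source": "web"},
    {"date": "2019-09-20", "region": "Incheon", "source": "news"},
]

# A stand-in preprocessor that simply returns what it receives.
forwarded = connect(collected, preprocessor=list)
```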

FIG. 3 is a flow chart of the big data preprocessing module for livestock disease occurrence information. The data preprocessor removes duplicates from the data gathered by the various collectors. It then tags the data according to the collection path so that real-time processing and batch processing can be distinguished by the type and characteristics of the collected data. Finally, the raw disease occurrence data cleaned by the data preprocessor are stored in the distributed file system (HDFS).
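The deduplication and tagging steps can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the identity fields, the REALTIME_PATHS routing table, and the function names are hypothetical, and the final write to HDFS is omitted.

```python
import hashlib
import json

# Hypothetical routing table: collection paths that feed real-time processing.
REALTIME_PATHS = {"open_api", "news"}


def record_key(record):
    """Stable fingerprint of the fields assumed to identify one outbreak report."""
    identity = {k: record[k] for k in ("date", "region", "disease")}
    payload = json.dumps(identity, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def preprocess(records):
    """Drop duplicates, then tag each record by its collection path."""
    seen = set()
    cleaned = []
    for record in records:
        key = record_key(record)
        if key in seen:
            continue  # the same report was already delivered by another collector
        seen.add(key)
        mode = "realtime" if record["source"] in REALTIME_PATHS else "batch"
        cleaned.append({**record, "processing": mode})
    return cleaned


# Made-up sample input; the second record duplicates the first.
raw = [
    {"date": "2019-09-17", "region": "Gyeonggi-do", "disease": "African swine fever", "source": "open_api"},
    {"date": "2019-09-17", "region": "Gyeonggi-do", "disease": "African swine fever", "source": "web"},
    {"date": "2019-09-24", "region": "Incheon", "disease": "African swine fever", "source": "blog"},
]

cleaned = preprocess(raw)
```

In this sketch the first copy of a duplicated report wins, so its collection path decides the tag.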

After the data preprocessor has run, a disk server module, an HDFS server module, a search engine index storage module, and a DB storage module are temporarily provided for real-time data storage, and the data are saved through real-time storage management to disk, HDFS, NoSQL, the search engine, and the database. The data stored in real time are then processed as distributed batch jobs through the MapReduce, Hive, Pig, and Mahout modules.
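The map, shuffle, and reduce phases behind such a batch job can be sketched in plain Python. This runs in a single process and only illustrates the programming model; it does not use Hadoop, and the job itself (counting reports per region and disease), the function names, and the sample records are assumptions.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, 1) pair per outbreak report."""
    yield (record["region"], record["disease"]), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence and return the totals per key."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up sample input.
stored = [
    {"region": "Gyeonggi-do", "disease": "African swine fever"},
    {"region": "Gyeonggi-do", "disease": "African swine fever"},
    {"region": "Incheon", "disease": "African swine fever"},
]

totals = run_job(stored)
```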

The functions required for big data analysis of livestock disease occurrence information, namely data mining, classification/grouping, topic analysis, and disease occurrence trend analysis, are described in detail below.

Data mining comprises distributed processing technology and analytical mining technology for integrated mining of the various forms of disease occurrence data provided by distributed multiple sources, and it provides in-memory, real-time stream mining of disease occurrence big data. It also provides pattern extraction, identification, and search functions for disease occurrence big data, as well as context-based text mining.

Classification/grouping systematically classifies or groups the data and results produced by disease occurrence big data analysis, such as similarity or ranking, in order to provide efficient query processing, analysis, and related services.

Topic analysis automatically analyzes the topics of livestock disease occurrence information data through context-based text mining.

Disease occurrence trend analysis analyzes domestic and foreign disease occurrence trends and types through analysis of livestock disease occurrence information data and social network data. Through analysis of disease occurrence types, topics, and statistics, it can track the trends of the growing variety of disease occurrences.
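One simple form of trend analysis is to count reports per month and look at the change from one month to the next. The sketch below assumes records carry an ISO-formatted date string; the function names and the sample data are hypothetical, and the patent does not specify which statistics are computed.

```python
from collections import Counter


def monthly_counts(records):
    """Count reports per month, keyed by 'YYYY-MM' and sorted chronologically."""
    counts = Counter(record["date"][:7] for record in records)
    return dict(sorted(counts.items()))


def month_over_month(counts):
    """Change in report count between consecutive months that have data.

    Months with no reports are absent from `counts`, so they are skipped
    rather than treated as zero.
    """
    months = list(counts)
    return {cur: counts[cur] - counts[prev] for prev, cur in zip(months, months[1:])}


# Made-up sample input.
reports = [
    {"date": "2019-08-30"},
    {"date": "2019-09-17"},
    {"date": "2019-09-18"},
    {"date": "2019-09-24"},
    {"date": "2019-10-02"},
    {"date": "2019-10-09"},
]

counts = monthly_counts(reports)
changes = month_over_month(counts)
```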

The functions required for public service processing of livestock disease occurrence information, namely the disease occurrence statistics service, the disease occurrence policy service, and the disease occurrence prediction service, are described in detail below.

The disease occurrence statistics service is a disease occurrence inquiry and statistics service that performs a related search on the content of a disease occurrence request and the associated disease occurrence information and provides the results.
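An inquiry of this kind can be expressed as a standard SQL aggregate over the relational store. The sketch below uses an in-memory SQLite database as a stand-in for the RDBMS; the table name, the columns, the function outbreaks_by_region, and the sample rows are hypothetical.

```python
import sqlite3

# Made-up sample rows: (date, region, disease).
rows = [
    ("2019-09-17", "Gyeonggi-do", "African swine fever"),
    ("2019-09-18", "Gyeonggi-do", "African swine fever"),
    ("2019-09-24", "Incheon", "African swine fever"),
    ("2019-10-02", "Gyeonggi-do", "Foot-and-mouth disease"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbreak (date TEXT, region TEXT, disease TEXT)")
conn.executemany("INSERT INTO outbreak VALUES (?, ?, ?)", rows)
conn.commit()


def outbreaks_by_region(connection, disease):
    """Return (region, count) pairs for one disease, most affected region first."""
    cursor = connection.execute(
        "SELECT region, COUNT(*) AS n FROM outbreak "
        "WHERE disease = ? GROUP BY region ORDER BY n DESC, region",
        (disease,),
    )
    return cursor.fetchall()


stats = outbreaks_by_region(conn, "African swine fever")
```

The disease name is passed as a bound parameter rather than formatted into the SQL string, which avoids injection when the value comes from a user request.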

The disease outbreak policy service supports the decision-making and policy-making of public institutions by analyzing the content and trends of recent disease outbreaks.

The disease outbreak prediction service supports public institutions in establishing policy and preparing budgets for the disease outbreak services expected to be needed in the future, based on the analysis of disease outbreaks.

[Example 1]

To demonstrate the effectiveness of the present invention, FIG. 4 shows a schematic diagram of the disease occurrence statistics service. The total number of disease occurrences shown at the top of FIG. 4 is the total collected from January 1, 2019 to October 15, 2019. The figure also shows counts and charts by country and by keyword rank.

[Example 2]

To demonstrate the effectiveness of the present invention, FIG. 5 shows the classification and management of disease occurrence keywords for the disease occurrence classification service, together with the keyword status by region. FIG. 6 shows the disease occurrence keyword classification and setting management function: to classify disease occurrence keywords, the user creates a category and enters the disease occurrence keywords that belong to it. Finally, the user enters a description of the category.
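The category setup described for FIG. 6 (a category name, its keywords, and a description) can be sketched as a small data structure with a keyword matcher. The Category class, the classify function, the matching rule (case-insensitive substring), and the sample categories are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A user-defined keyword category with its description."""
    name: str
    description: str
    keywords: set = field(default_factory=set)


def classify(text, categories):
    """Return the names of all categories with at least one keyword in the text.

    Matching is a plain case-insensitive substring test, so short keywords
    may also match inside longer words.
    """
    lowered = text.lower()
    return [
        category.name
        for category in categories
        if any(keyword.lower() in lowered for keyword in category.keywords)
    ]


# Made-up sample categories.
categories = [
    Category("Swine diseases", "Diseases affecting pigs", {"African swine fever", "ASF"}),
    Category("Avian diseases", "Diseases affecting poultry", {"avian influenza", "HPAI"}),
]

matched = classify("ASF confirmed at a pig farm in Paju", categories)
```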

[Example 3]

To demonstrate the effectiveness of the present invention, FIG. 7 visualizes the monthly disease occurrence status of a specific region, Gyeonggi-do, from January 2019 to October 2019. FIG. 8 shows disease occurrence trend analysis and visualization by top keyword.

Claims

A system for collecting big data of livestock disease occurrence information, comprising a data collector module, a connector module, and a data preprocessor module. By module unit, the system is divided into a module that collects livestock disease occurrence information data, a module that transfers the collected data to the data preprocessor, and the data preprocessor module.

The method of claim 1, wherein the detailed modules of the data collector module are Open API, Web Crawler, and Log Aggregator. The Open API module collects livestock disease occurrence information data of public institutions provided by public portals in real time. The Web Crawler module collects disease occurrence data provided by public institutions' website bulletin boards in real time through a web crawler. The Log Aggregator module collects disease occurrence data logs from various collectors.

The method of claim 1, wherein the data preprocessor module removes redundancy of data collected from various collectors. In addition, it performs a tagging function according to the data collection path to classify it into real-time processing and batch processing according to the collected data type and characteristics. Finally, the raw data of disease occurrence, purified through the data preprocessor, is stored in the distributed file system (HDFS).

The method according to claim 1, wherein the data connector module transmits disease occurrence data collected from an Open API, a Web Crawler, and a Log Aggregator collector to a data preprocessor, and filters necessary data before data preprocessing.

The livestock disease occurrence information big data storage management module unit analyzes and stores, in real time, the livestock disease occurrence information big data collected according to claim 1. Its detailed modules include a real-time processing module and HDFS (Hadoop Distributed File System), MapReduce, RDBMS, and NoSQL (HBase) modules.

The method of claim 5, wherein the real-time processing module performs in-memory caching and indexing functions for real-time processing. Then, it performs hourly or gradual analysis and stores it in NoSQL DB or RDBMS.

The method of claim 5, wherein the HDFS module stores disease occurrence big data collected from various sources in a distributed file system.

The method of claim 5, wherein the MapReduce module divides and stores a large amount of distributed disease occurrence data by mapping the data using a distributed/parallel Map/Reduce algorithm. It is also a Map/Reduce framework module for distributed/parallel processing of large-volume disease occurrence big data in a cluster environment.

The method of claim 5, wherein the RDBMS module is a data storage that stores large-scale disease occurrence big data in a relational DB through a Map/Reduce algorithm. This module supports standard SQL and provides necessary distributed query processing functions.

The method of claim 5, wherein the NoSQL (HBase) module is a module for storing and managing large-scale disease occurrence big data in a NoSQL DB through a Map/Reduce algorithm.

The data mining module for analyzing livestock disease occurrence information big data includes classification/grouping, subject analysis, and disease occurrence trend analysis modules.

The method of claim 11, wherein the classification/grouping module is a module for providing efficient query processing and analysis-related services by systematically classifying or grouping data and results obtained through analysis of big data on disease occurrence such as similarity or ranking.

The method of claim 11, wherein the subject analysis module is a module that automatically analyzes the subject of livestock disease occurrence information data through context-based text mining analysis.

The method of claim 11, wherein the disease occurrence trend analysis module is a module that analyzes domestic and foreign disease occurrence trends and types through analysis of livestock disease occurrence information data and social network data.

The livestock disease outbreak information public service processing module is a disease outbreak statistics service, disease outbreak policy service, and disease outbreak prediction service module.

The method of claim 15, wherein the disease occurrence statistics service module is a disease occurrence inquiry and statistics service module that performs a related search on the content of a disease occurrence request and the related disease occurrence information and provides the results.

The method of claim 15, wherein the disease outbreak policy service module is a service module that supports decision-making and policy-making of public institutions by analyzing the content and trend of recent disease outbreaks.

The method of claim 15, wherein the disease occurrence prediction service module is a service module that supports public institutions in establishing policy and preparing budgets for disease occurrence services predicted for the future based on the analysis of disease occurrences.