KR20230000008A

KR20230000008A - Content publishing automation system using big data

Info

Publication number: KR20230000008A
Application number: KR1020210081305A
Authority: KR
Inventors: 임승환
Original assignee: (주)제스아이앤씨
Priority date: 2021-06-23
Filing date: 2021-06-23
Publication date: 2023-01-02

Abstract

The present invention provides a big data-based content publishing automation system (100) that automatically collects, in a short period of time, large volumes of big data containing useful information people need in their daily lives, processes the collected big data, and dynamically creates and posts website pages without a person in charge having to write, edit, or post the pages directly. The system comprises: a data mining engine (110) that collects online data related to a specific government office in parallel on a multi-process basis; a natural language processor (120) that, through natural language processing of the data, extracts words and phrases related to the region where the government office is located and to regional policies related to the government office; a data analyzer (130) that classifies the words and phrases extracted by the natural language processor (120) into content units and stores them; and a content publisher (140) that uses the words and phrases classified and stored as content units in the data analyzer (130) to automatically generate posts for the web page of the government office and actually posts them on the web page.

Description

Content publishing automation system using big data

The present invention relates to a big data-based content publishing automation system that automatically collects big data, processes the collected big data, and dynamically creates and posts website pages.

With the development of information and communication technology and the spread of PCs and smartphones, many tasks of local governments and public institutions are now performed online.

Using their PCs or smartphones, civil petitioners submit requests online to government agencies, including local governments, and to public institutions without visiting a public office in person. Public officials process the requests submitted online and notify the petitioners of the results online.

However, as the number of civil petitions submitted online increases, public officials must take on website operation and management in addition to their own duties, which include public affairs, civil petition handling, and policy development. This increases the cost and time spent on the officials' own duties, so there is a need for a system that can dynamically create, post, and manage website pages.

The present invention has been devised to solve the above problem. Its object is to provide a big data-based content publishing automation system that automatically collects, in a short time, large volumes of big data containing useful information people need in their daily lives, processes the collected big data, and dynamically creates and posts website pages without a person in charge having to write, edit, or post them directly.

The big data-based content publishing automation system of the present invention may include: a data mining engine that collects online data related to a specific government office in parallel on a multi-process basis; a natural language processor that extracts, through natural language processing of the data, words and phrases associated with the region where the government office is located and with regional policies related to the government office; a data analyzer that classifies the words and phrases extracted by the natural language processor into content units and stores the classified words and phrases; and a content publisher that uses the words and phrases classified and stored in content units by the data analyzer to automatically generate a post for a web page of the government office and actually posts it on the web page.

In addition, the data mining engine may include: an external interface that connects to a plurality of external servers online; a collection policy manager that determines the data to be collected from the external servers and a collection schedule; a crawler that crawls the data to be collected from the external servers; a crawler handler that determines a policy for running the crawler in multiple processes and controls execution of the crawler; and a data extractor that extracts content from the data collected by the crawler.

In addition, the number of crawlers run in multiple processes may be twice the number of CPU cores.

In addition, the number of crawlers run in multiple processes may be 2 to 4.

In addition, the crawler may include a telegram module that compares, on the basis of URL, the data to be collected with data previously collected and stored in a database and, if the data to be collected is determined to be identical to the previously stored data, excludes it from the data to be collected. The data that the crawler collects from the external servers may include images, videos, attachments, and comments in a web page.

In addition, the crawler handler may include: an execution module that runs the crawler in multiple processes; a scheduling module that drives the execution module according to the collection schedule set by the collection policy manager; and a log generation module that generates a log file recording log data including the start time and end time of the execution module's run, whether the run succeeded or failed, and the amount of data collected from the external servers by the execution module.

In addition, the natural language processor may include: a natural language preprocessor that outputs the data collected by the data mining engine as morphemes amenable to natural language analysis; a morpheme analyzer that analyzes the morphemes output by the natural language preprocessor; a dictionary DB that the morpheme analyzer can consult during morpheme analysis and that can be updated by a user; a keyword extractor that extracts keywords from the morphemes analyzed by the morpheme analyzer according to preset rules; and an analysis manager that manages the natural language preprocessor, the morpheme analyzer, the dictionary DB, and the keyword extractor.

In addition, the data analyzer may include: a content classifier that classifies and groups, among the words and phrases extracted by the natural language processor, those usable as content; a content extractor that automatically extracts, from the words and phrases classified and grouped by the content classifier, a group of content candidates that can be posted on a specific web page of the government office; a content repository that stores the words and phrases extracted by the content extractor and stores image, video, and link information from the data collected by the data mining engine; and a content manager that manages the content classifier, the content extractor, and the content repository and determines the criteria by which the content classifier classifies and groups the words and phrases and the criteria by which the content extractor extracts the content candidates.

In addition, the intelligent content publisher may include: a content template manager that creates and manages templates for a specific web page to be posted on the government office's homepage; a content creator that generates content to be posted on the specific web page using the words and phrases extracted by the natural language processor and the template created by the content template manager; a distribution manager that creates and manages settings for posting the content on the specific web page; and an internal interface that connects to and communicates with the web server managing the government office's homepage in order to post the content on the specific web page.

The big data-based content publishing automation system of the present invention can reduce the website-related work that public officials perform in addition to their own duties.

In addition, big data can be collected quickly by using crawlers run in multiple processes, and the collected big data is used to create web pages.

In addition, because web pages can be created automatically, the cost of website management can be reduced.

FIG. 1 is a conceptual diagram of the big data-based content publishing automation system of the present invention.
FIG. 2 is a conceptual diagram of the data mining engine of the present invention.
FIG. 3 is a conceptual diagram of the natural language processor of the present invention.
FIG. 4 is a conceptual diagram of the natural language preprocessor of the present invention.
FIG. 5 is a conceptual diagram of the morpheme analyzer of the present invention.
FIG. 6 is a conceptual diagram of the keyword extractor of the present invention.
FIG. 7 is a conceptual diagram of the analysis manager of the present invention.
FIG. 8 is a conceptual diagram of the data analyzer of the present invention.
FIG. 9 is a conceptual diagram of the content publisher of the present invention.
FIG. 10 is a conceptual diagram of the content template manager of the present invention.
FIG. 11 is a conceptual diagram of the content creator of the present invention.
FIG. 12 is a conceptual diagram of the distribution manager of the present invention.

The present invention can be modified in various ways and can have various embodiments; specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes included in the spirit and technical scope of the present invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to the other component, but it should be understood that intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this regard, the terms of degree "about", "substantially", and the like, as used throughout the specification, mean at or close to the stated value when manufacturing and material tolerances inherent in the stated meaning are presented, and are used to prevent an unscrupulous infringer from unfairly exploiting disclosures in which exact or absolute values are stated to aid understanding of the present invention. As used throughout the specification, the term "step of (doing)" or "step of" does not mean "step for".

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal, apparatus, or device may instead be performed by a server connected to that terminal, apparatus, or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal, apparatus, or device connected to that server.

In this specification, some of the operations or functions described as mapping or matching with a terminal may be interpreted as mapping or matching the terminal's unique number or personal identification information, which is the terminal's identifying data.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this application, are not to be interpreted in an idealized or excessively formal sense.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

The present invention relates to a big data-based content publishing automation system (100) that automatically collects, in a short time, large volumes of big data containing useful information people need in their daily lives, processes the collected big data, and dynamically creates and posts website pages without a person in charge having to write, edit, or post them directly.

FIG. 1 is a conceptual diagram of the big data-based content publishing automation system (100) of the present invention. Referring to FIG. 1, the system (100) connects to a plurality of external servers (10) through an external interface (111) to collect online data related to a specific government office, processes the data so that it can be posted on a specific web page of the government office's homepage, and then connects through an internal interface (144) to the web server (20) for the government office's homepage to upload the web page and post it automatically.

To this end, the present invention may include a data mining engine (110), a natural language processor (120), a data analyzer (130), and a content publisher (140).

FIG. 2 is a conceptual diagram of the data mining engine (110) of the present invention. Referring to FIG. 2, the data mining engine (110) may collect online data related to a specific government office in parallel on a multi-process basis. To this end, the data mining engine (110) may include an external interface (111), a collection policy manager (112), a crawler (113), a crawler handler (114), and a data extractor (115).

The external interface (111) creates communication paths through which a plurality of external servers (10) online can be reached, and the data mining engine (110) connects to the external servers (10) through these paths to collect data related to the government office.

Here, the external interface (111) may receive data from the external server (10) by any one method selected from RSS, OpenAPI, HTML5, Meta, File, and RDB, and the external server (10) may be any one selected from a social media server, a server providing open data, and a server providing public data.

The collection policy manager (112) may determine the data to be collected from the external server (10) and the collection schedule. It may register, modify, and delete the keywords contained in the data that the crawler (113) is to collect, and may set and manage the collection schedule, which includes the operating cycle and period of the data mining engine.
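The keyword and schedule management described above can be illustrated with a minimal Python sketch. The class name `CollectionPolicy`, its fields, and the example dates are hypothetical and are not taken from the specification; the sketch only shows one way keywords could be registered, modified, and deleted and run times derived from an operating cycle and period.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CollectionPolicy:
    """Hypothetical container for the keywords and schedule the manager maintains."""
    keywords: set = field(default_factory=set)
    interval: timedelta = timedelta(hours=1)  # operating cycle (assumed default)
    start: datetime = datetime(2021, 6, 23)   # start of operating period (example value)
    end: datetime = datetime(2021, 6, 24)     # end of operating period (example value)

    def register(self, keyword):
        self.keywords.add(keyword)

    def modify(self, old, new):
        # Replace an existing keyword; unknown keywords are ignored.
        if old in self.keywords:
            self.keywords.remove(old)
            self.keywords.add(new)

    def delete(self, keyword):
        self.keywords.discard(keyword)

    def run_times(self):
        """Return every scheduled run time within the operating period."""
        times, t = [], self.start
        while t < self.end:
            times.append(t)
            t += self.interval
        return times
```

A scheduler could then iterate over `run_times()` and start a crawl for the registered keywords at each entry.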

The crawler (113) may crawl the data to be collected from the external server (10). The crawler (113) may include a telegram module that compares, on the basis of URL, the data to be collected with data previously collected and stored in a database and, if the data to be collected is determined to be identical to the previously stored data, excludes it from the data to be collected. The data that the crawler (113) collects from the external server (10) may include images, videos, attachments, and comments in a web page.

When collecting data whose conditions or title have been updated, the telegram module takes the first of the titles of the specific data on the external server (10) corresponding to the data to be collected and stores it in a "title file" with a .txt extension. It also creates a separate "address file" with a .txt extension in the same folder as the "title file" and stores the URL of the specific data there.

At this time, the title of the most recent post is stored in the "title file". To check whether data found again after crawling is a duplicate, the title stored in the "title file" is compared with the title of the newly found data. If the titles are identical, the newly found data is not collected; otherwise it is collected. This prevents the collection of duplicate data.
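The title-file check can be sketched as follows. This is a minimal illustration under assumptions: the file names `title.txt` and `address.txt` and the function `should_collect` are hypothetical, and the specification does not state how the files are named or encoded.

```python
from pathlib import Path
import tempfile


def should_collect(title, url, folder):
    """Return True and update the files if `title` differs from the stored one."""
    folder = Path(folder)
    title_file = folder / "title.txt"      # hypothetical file name
    address_file = folder / "address.txt"  # hypothetical file name
    if title_file.exists() and title_file.read_text(encoding="utf-8") == title:
        return False  # same title as the last crawl: treat as a duplicate
    title_file.write_text(title, encoding="utf-8")
    address_file.write_text(url, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as folder:
    first = should_collect("Notice A", "https://example.org/a", folder)
    again = should_collect("Notice A", "https://example.org/a", folder)
    newer = should_collect("Notice B", "https://example.org/b", folder)
    print(first, again, newer)  # True False True
```

Because only the most recent title is kept, this check detects a repeat of the latest post but not of older ones; the URL-based comparison against the database covers that case.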

Whereas a crawler run as a single process takes about 7 seconds per unit task, the crawler of the present invention runs in multiple processes, so that parallelized crawling of the same task takes about 2 seconds, a speed improvement of 3 to 4 times.

It was confirmed that the time the crawlers take to crawl the same file decreases each time the number of concurrently running crawlers is doubled, but that once the number of concurrently running crawlers exceeds 64, the speed instead degrades.

The range over which crawling speed improves as the number of parallel crawlers is doubled may be determined by the hardware specifications of the system. Experimentally, a stable high speed could be ensured when the number of concurrently running crawlers was twice the number of CPU cores.

Therefore, running the multi-process crawlers (113) at twice the number of CPU cores of the system helps maintain a stable system environment. Preferably, where a stable network environment is guaranteed, crawling with 2 to 4 crawlers (113) running in parallel can achieve the optimal speed.
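The sizing rule above can be sketched with Python's standard `multiprocessing.Pool`. The function names `worker_count` and `crawl_all` are hypothetical, and clamping to the 2 to 4 range when the network is stable is one possible reading of the text, not a rule stated in the specification.

```python
import os
from multiprocessing import Pool


def worker_count(cpu_cores=None, stable_network=False):
    """Pick the number of parallel crawler processes.

    Default: twice the CPU core count. With a stable network the text
    prefers 2 to 4 crawlers, so the count is clamped to that range
    (how to choose within 2..4 is an assumption of this sketch).
    """
    cores = cpu_cores or os.cpu_count() or 1
    count = 2 * cores
    if stable_network:
        count = max(2, min(4, count))
    return count


def crawl_all(urls, fetch, workers=None):
    """Run `fetch` over `urls` in separate processes.

    `fetch` must be a picklable, module-level function.
    """
    with Pool(processes=workers or worker_count()) as pool:
        return pool.map(fetch, urls)
```

On platforms that start worker processes by spawning, `crawl_all` should be called from under an `if __name__ == "__main__":` guard.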

In practice, the crawler (113) may be controlled by the crawler handler (114), which determines a policy for running the crawler (113) in multiple processes and controls execution of the crawler (113).

To this end, the crawler handler (114) may include: an execution module (114a) that runs the crawler (113) in multiple processes; a scheduling module (114b) that drives the execution module (114a) according to the collection schedule set by the collection policy manager (112); and a log generation module (114c) that generates a log file recording log data including the start time and end time of the execution module's (114a) run, whether the run succeeded or failed, and the amount of data collected from the external server (10) by the execution module (114a).
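The log data listed above could be recorded as one JSON line per run, as in this minimal sketch. The function name `run_with_log`, the field names, and the JSON-lines format are assumptions for illustration; the specification does not describe the log file format.

```python
import json
from datetime import datetime


def run_with_log(job, log_path):
    """Run `job()` (which returns the collected bytes) and append one log line."""
    record = {
        "start": datetime.now().isoformat(),
        "success": False,
        "bytes_collected": 0,
    }
    try:
        data = job()
        record["success"] = True
        record["bytes_collected"] = len(data)
    except Exception as exc:
        # A failed run is still logged, with the error message attached.
        record["error"] = str(exc)
    record["end"] = datetime.now().isoformat()
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Appending rather than overwriting keeps a history of runs, so the success rate and collected volume can be reviewed over time.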

The data extractor (115) may extract content from the data collected by the crawler (113). To this end, it may include a first data extraction module (115a), a second data extraction module (115b), an indexing module (115c), and a storage management module (115d).

The first data extraction module (115a) distinguishes content areas from non-content areas in the data collected by the crawler (113) and extracts only the data in the content areas. The second data extraction module (115b) can automatically extract, from the data extracted by the content extraction module, the metadata or hashtags contained in web pages and social media content. The indexing module (115c) automatically indexes the data extracted by the first data extraction module (115a) and the second data extraction module (115b) by collection target and by content, and the storage management module (115d) stores and manages the data indexed by the indexing module (115c) in a database.
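Hashtag extraction and indexing by collection target might look like the following sketch. The regular expression, the function names, and the nested-dictionary index are assumptions chosen for brevity; the specification does not say how hashtags are recognized or how the index is stored.

```python
import re
from collections import defaultdict

# A hashtag is taken to be '#' followed by word characters (assumption).
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in `text`, without the leading '#'."""
    return HASHTAG.findall(text)


def build_index(items):
    """Index texts by source and hashtag.

    `items` is an iterable of (source, text) pairs; the result maps
    source -> hashtag -> list of texts carrying that hashtag.
    """
    index = defaultdict(lambda: defaultdict(list))
    for source, text in items:
        for tag in extract_hashtags(text):
            index[source][tag].append(text)
    return index
```

In Python 3, `\w` matches Unicode word characters, so Korean hashtags such as `#서울` are captured as well.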

FIG. 3 is a conceptual diagram of the natural language processor (120) of the present invention, and FIGS. 4 to 7 are conceptual diagrams of the natural language preprocessor (121), the morpheme analyzer (122), the keyword extractor (124), and the analysis manager (125) of the natural language processor (120), respectively.

The natural language processor (120) may extract, through natural language processing of the data, words and phrases associated with the region where the government office is located and with regional policies related to the government office. To this end, the natural language processor (120) may include a natural language preprocessor (121), a morpheme analyzer (122), a dictionary DB (123), a keyword extractor (124), and an analysis manager (125).

The natural language preprocessor (121) may output the data collected by the data mining engine (110) as morphemes amenable to natural language analysis.

Referring to FIG. 4, the natural language preprocessor (121) may include: a data cleansing module (121a) that removes HTML tags, special characters, and emoticons from the data collected from the external server (10); a normalization module (121b) that extracts phrases deviating from basic grammar and typos and removes them from the data; a segmentation module (121c) that splits the sentences in the data into lexical units and divides them by part of speech; a POS (Part of Speech) tagging module (121d) that matches each lexical unit to its part of speech; a stemming module (121e) that converts lexical units reflecting verb tense or honorific forms into the corresponding representative verb; a word processing module (121f) that handles homonyms whose meaning risks being distorted when the stemming module (121e) converts them to a representative verb; and a stopword removal module (121g) that removes lexical units whose meaning is unclear or absent.
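Two of the stages above, data cleansing and stopword removal, can be sketched with the standard library alone. The patterns, the stopword list, and the whitespace tokenizer are illustrative assumptions; a real Korean pipeline would use a morphological analyzer for segmentation, POS tagging, and stemming, which this sketch does not attempt.

```python
import re

TAG = re.compile(r"<[^>]+>")          # HTML tags
NON_TEXT = re.compile(r"[^\w\s]")     # special characters and emoticon symbols
STOPWORDS = {"the", "a", "of", "is"}  # illustrative list, not from the source


def clean(text):
    """Strip HTML tags and non-word symbols, then collapse whitespace."""
    text = TAG.sub(" ", text)
    text = NON_TEXT.sub(" ", text)
    return " ".join(text.split())


def tokenize(text):
    # Whitespace splitting stands in for real lexical segmentation.
    return clean(text).lower().split()


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]
```

For example, `remove_stopwords(tokenize("<b>The budget of the city</b>"))` yields `["budget", "city"]`.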

The morpheme analyzer 122 may analyze the morphemes output from the natural language preprocessor 121. Referring to FIG. 5, the morpheme analyzer 122 may include a first morpheme module 122a and a second morpheme module 122b. The first morpheme module 122a recognizes the base form and part of speech of each morpheme, and the second morpheme module 122b may decompose the morphemes into sentence constituents to determine the structure of the sentence.

The dictionary DB 123 can be referenced by the morpheme analyzer 122 during morpheme analysis and can be updated by a user. To this end, the dictionary DB 123 may include a morpheme dictionary DB 123, a named-entity dictionary DB 123, a region/policy dictionary DB 123, and a user-defined dictionary DB 123.

The keyword extractor 124 may extract keywords from the morphemes analyzed by the morpheme analyzer 122 according to preset rules. Referring to FIG. 6, the keyword extractor 124 may include a first extraction module 124a, a second extraction module 124b, a third extraction module 124c, and a fourth extraction module.

The first extraction module 124a may extract keywords from the morphemes analyzed by the morpheme analyzer 122 according to values set by the analysis manager 125. The second extraction module 124b may use the dictionary DB 123 to score the extracted keywords according to their frequency and their degree of association with the region related to the government office. The third extraction module 124c may extract associated words and phrases according to the scores assigned by the rating module (the second extraction module 124b).
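The scoring and selection steps can be sketched as follows. This is a minimal sketch under stated assumptions: the description does not give a scoring formula, so the multiplicative `region_weight`, the `min_score` threshold, and the function names are hypothetical.

```python
from collections import Counter


def score_keywords(tokens, region_terms, region_weight=2.0):
    """Score each keyword by frequency, boosted when it is a region term.

    region_terms stands in for entries of the region/policy dictionary;
    region_weight is an assumed boost factor.
    """
    counts = Counter(tokens)
    scores = {}
    for word, freq in counts.items():
        boost = region_weight if word in region_terms else 1.0
        scores[word] = freq * boost
    return scores


def top_keywords(scores, min_score):
    """Keep keywords at or above min_score, highest score first."""
    kept = [(w, s) for w, s in scores.items() if s >= min_score]
    return sorted(kept, key=lambda ws: (-ws[1], ws[0]))
```

For example, with the tokens `parking` (twice), `festival` (three times), and `weather` (once), and `parking` listed as a region term, `parking` scores 4.0, `festival` 3.0, and `weather` 1.0, so a threshold of 2.0 keeps the first two.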

The analysis manager 125 may manage the natural language preprocessor 121, the morpheme analyzer 122, the dictionary DB 123, and the keyword extractor 124. Referring to FIG. 7, the analysis manager 125 includes a first manager module 125a and a second manager module 125b. The first manager module 125a can upload and download the dictionary DB 123 and can search, add, modify, and delete its contents. The second manager module 125b can set the conditions, including a specific region, under which the natural language preprocessor 121 and the morpheme analyzer 122 extract and analyze the morphemes.

FIG. 8 is a conceptual diagram of the data analyzer 130 of the present invention. The data analyzer 130 may extract and classify the words and phrases extracted by the natural language processor 120 into content units and store the classified words and phrases.

Referring to FIG. 8, the data analyzer 130 may include a content classifier 131, a content extractor 132, a content storage 133, and a content manager 134.

The content classifier 131 may classify and group those words and phrases, among the words and phrases extracted by the natural language processor, that are usable as content.

The content extractor 132 may automatically extract, from the words and phrases classified and grouped by the content classifier 131, a content candidate group that can be posted on a specific web page of the government office.

The content storage 133 may store the words and phrases extracted by the content extractor 132, and may store image, video, and link information from the data collected by the data mining engine 110.

The content manager 134 manages the content classifier 131, the content extractor 132, and the content storage 133, and may determine the criteria by which the content classifier 131 classifies and groups the words and phrases and the criteria by which the content extractor 132 extracts the content candidate group.
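A minimal sketch of how the content classifier 131 and the content extractor 132 might cooperate is given below. The description leaves the classification and extraction criteria to the content manager 134, so the category table `category_terms`, the `min_group_size` criterion, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def classify(keywords, category_terms):
    """Group keywords under every category whose term set contains them.

    category_terms is a hypothetical mapping of category name to a set of
    terms, standing in for criteria set by the content manager.
    """
    groups = defaultdict(list)
    for word in keywords:
        for category, terms in category_terms.items():
            if word in terms:
                groups[category].append(word)
    return dict(groups)


def extract_candidates(groups, min_group_size=2):
    """Keep only groups large enough to form a post (assumed criterion)."""
    return {c: ws for c, ws in groups.items() if len(ws) >= min_group_size}
```

Keywords that match no category are simply dropped, which mirrors the statement that only words and phrases usable as content are grouped.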

Using the words and phrases classified and stored by content unit in the data analyzer 130, the content publisher 140 may automatically generate a post to be published on the web page of the government office and actually publish it on that web page. FIG. 9 is a simplified conceptual diagram of the content publisher 140 of the present invention. Referring to FIG. 9, the content publisher 140 may include a content template manager, a content generator 142, a distribution manager 143, and an internal interface 144. FIGS. 10 to 12 are conceptual diagrams of the content template manager, the content generator 142, and the distribution manager 143 of the content publisher 140, respectively.

The content template manager may create and manage a template for a specific web page to be posted on the government office homepage. To this end, the content template manager may include a first template module 141a that determines a content structure, including the paragraphs and alignment of the content; a second template module 141b that determines the position of the content according to the content structure set by the first template module 141a; a third template module 141c that determines the font used for the content; and a fourth template module 141d that determines the content layout.

Here, the content layout may be any one selected from a text layout consisting only of text, a first mixed layout combining text and images, and a second mixed layout combining text and video.

The content generator 142 may generate content to be posted on the specific web page by using the words and phrases extracted by the natural language processor and the template generated by the content template manager.

To this end, the content generator 142 may include a first generation module 142a that automatically generates the keywords and phrases to be used in the content; a second generation module 142b that automatically completes sentences suitable for the content using the keywords and phrases generated by the first generation module 142a; a third generation module 142c that automatically completes the skeleton of the text using the sentences completed by the second generation module 142b; a fourth generation module 142d that automatically checks the sentences generated by the second generation module 142b for grammatical errors and corrects them; and a fifth generation module 142e that automatically converts the content, including the sentences checked and corrected by the fourth generation module 142d, into HTML tags.
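The final HTML conversion step performed by the fifth generation module 142e can be sketched as follows. The choice of `<h2>` and `<p>` tags and the function name `to_html` are assumptions; the description states only that the content is converted into HTML tags. Sentence generation and grammar checking are not shown.

```python
import html


def to_html(title, sentences):
    """Wrap a title and checked sentences in HTML tags.

    html.escape prevents characters such as & and < in the generated
    text from being interpreted as markup.
    """
    parts = [f"<h2>{html.escape(title)}</h2>"]
    parts.extend(f"<p>{html.escape(s)}</p>" for s in sentences)
    return "\n".join(parts)
```

Escaping the generated text before wrapping it matters here because the source data is crawled from external servers and may contain markup characters.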

The distribution manager 143 may create and manage the settings for posting the content on the specific web page. To this end, the distribution manager 143 may include a first distribution module 143a that reviews and revises the content generated by the content generator 142; a second distribution module 143b that determines, among the web pages of the government office homepage, the web page on which the content is to be posted and executes the posting; a third distribution module 143c that manages the posting schedule of the content; a fourth distribution module 143d that stores and manages the posting history of the content; and a fifth distribution module 143e that generates a dashboard showing statistics on the posting history stored by the fourth distribution module 143d.
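As a rough illustration of the posting history kept by the fourth distribution module 143d and the statistics shown by the fifth distribution module 143e, consider the sketch below. The record fields and the particular statistics are assumptions; the description does not specify which values the dashboard displays.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PostRecord:
    """One entry in the posting history (fields are hypothetical)."""
    page: str
    posted_on: date
    succeeded: bool


def summarize(history):
    """Compute simple dashboard statistics from the posting history."""
    per_page = Counter(r.page for r in history if r.succeeded)
    failures = sum(1 for r in history if not r.succeeded)
    return {
        "total": len(history),
        "failures": failures,
        "per_page": dict(per_page),
    }
```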

The internal interface 144 may connect to and communicate with the web server 20 that manages the government office homepage in order to post the content on the specific web page. That is, the big data-based content publishing automation system 100 of the present invention automatically uploads the content to the web server 20 through the internal interface 144 and can thereby post it on the specific web page of the government office. The internal interface 144 may use a separate internal communication network, distinct from the network used to connect to the external server 10, and the internal communication network may be configured in various ways depending on the system environment, for example using optical cable or a 10G Ethernet network.

In addition, by using the crawler 113 driven in multi-process, the system can automatically collect, in a shorter time than a conventional crawler driven in a single process, large volumes of big data containing useful information that people need in their daily lives. The collected big data is then processed so that website pages can be created and posted dynamically without a person in charge writing, editing, or posting them directly.
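The multi-process crawling described here, together with the claimed worker count of twice the number of CPU cores and the URL-based comparison against previously collected data, can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: the function names are hypothetical, and `fetch` is a caller-supplied function that retrieves one URL.

```python
import os
from multiprocessing import Pool


def worker_count(cpu_cores=None):
    """Return twice the number of CPU cores, as stated in the claims."""
    cores = cpu_cores or os.cpu_count() or 1
    return 2 * cores


def filter_new_urls(candidates, already_collected):
    """Drop URLs that were collected before or repeat within the batch."""
    seen = set(already_collected)
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


def crawl_all(urls, fetch, cpu_cores=None):
    """Run fetch over urls in parallel worker processes.

    fetch must be a module-level function so it can be sent to workers.
    """
    with Pool(processes=worker_count(cpu_cores)) as pool:
        return pool.map(fetch, urls)
```

When used in a script, `crawl_all` should be called under an `if __name__ == "__main__":` guard so that worker processes do not re-run the calling code on platforms that spawn new interpreters.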

As a result, public officials at the government office can concentrate on their core duties, which shortens the time needed to process civil petitions and reduces the cost of managing the website.

The present invention is not limited to the specific preferred embodiments described above. Anyone with ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the invention as claimed, and such modifications fall within the scope of the claims.

10: External server
20: Web server
100: Big data-based content publishing automation system
110: Data mining engine
120: Natural language processor
130: Data analyzer
140: Content publisher

Claims

A big data-based content publishing automation system comprising:
a data mining engine 110 that collects online data related to a specific government office in parallel using multiple processes;
a natural language processor 120 that extracts, by natural language processing of the data, words and phrases associated with the region where the government office is located and with regional policies related to the government office;
a data analyzer 130 that extracts and classifies the words and phrases extracted by the natural language processor 120 into content units and stores the classified words and phrases; and
a content publisher 140 that automatically generates a post to be published on the web page of the government office by using the words and phrases classified and stored by content unit in the data analyzer 130, and actually publishes the post on the web page.

The big data-based content publishing automation system of claim 1, wherein the data mining engine 110 comprises:
an external interface 111 that connects online to a plurality of external servers 10;
a logging policy manager 112 that determines the data to be collected from the external servers 10 and a collection schedule;
a crawler 113 that crawls the data to be collected from the external servers 10;
a crawler handler 114 that controls execution of the crawler 113 by determining a policy for driving the crawler 113 in multi-process; and
a data extractor 115 that extracts content from the data collected by the crawler 113.

The big data-based content publishing automation system of claim 2, wherein the number of crawlers 113 driven in multi-process is twice the number of CPU cores.

The big data-based content publishing automation system of claim 2, wherein the number of crawlers 113 driven in multi-process is 2 to 4.

The big data-based content publishing automation system of claim 2, wherein the crawler 113 includes a module that compares, based on URL, the data to be collected with data previously collected and stored in a database and, if the data to be collected is determined to be identical to the previously collected and stored data, excludes it from the data to be collected, and wherein the data to be collected by the crawler 113 from the external server 10 includes images, videos, attachments, and comments in a web page.

The big data-based content publishing automation system of claim 2, wherein the crawler handler 114 comprises:
an execution module 114a that executes the crawler 113 in multi-process;
a scheduling module 114b that drives the execution module 114a according to the collection schedule set by the logging policy manager 112; and
a log generation module 114c that generates a log file recording log data including the driving start time and driving end time of the execution module 114a, whether the execution module 114a ran successfully or failed, and the amount of data collected from the external server 10 by the execution module 114a.

The big data-based content publishing automation system of claim 1, wherein the natural language processor 120 comprises:
a natural language preprocessor 121 that outputs the data collected by the data mining engine 110 as morphemes suitable for natural language analysis;
a morpheme analyzer 122 that analyzes the morphemes output from the natural language preprocessor 121;
a dictionary DB 123 that can be referenced by the morpheme analyzer 122 during morpheme analysis and can be updated by a user;
a keyword extractor 124 that extracts keywords from the morphemes analyzed by the morpheme analyzer 122 according to preset rules; and
an analysis manager 125 that manages the natural language preprocessor 121, the morpheme analyzer 122, the dictionary DB 123, and the keyword extractor 124.

The big data-based content publishing automation system of claim 1, wherein the data analyzer 130 comprises:
a content classifier 131 that classifies and groups those words and phrases, among the words and phrases extracted by the natural language processor, that are usable as content;
a content extractor 132 that automatically extracts, from the words and phrases classified and grouped by the content classifier 131, a content candidate group that can be posted on a specific web page of the government office;
a content storage 133 that stores the words and phrases extracted by the content extractor 132 and stores image, video, and link information from the data collected by the data mining engine 110; and
a content manager 134 that manages the content classifier 131, the content extractor 132, and the content storage 133, and determines the criteria by which the content classifier 131 classifies and groups the words and phrases and the criteria by which the content extractor 132 extracts the content candidate group.

The big data-based content publishing automation system of claim 1, wherein the content publisher 140 comprises:
a content template manager that creates and manages a template of a specific web page to be posted on the government office homepage;
a content generator 142 that generates content to be posted on the specific web page by using the words and phrases extracted by the natural language processor and the template generated by the content template manager;
a distribution manager 143 that creates and manages settings for posting the content on the specific web page; and
an internal interface 144 that connects to and communicates with the web server 20 that manages the government office homepage in order to post the content on the specific web page.