KR102023490B1

KR102023490B1 - Method and apparatus for collecting and analyzing text data for crawling text data

Info

Publication number: KR102023490B1
Application number: KR1020170142347A
Authority: KR
Inventors: 김의직; 권정혁; 차민기
Original assignee: 한림대학교 산학협력단 (Industry Academic Cooperation Foundation, Hallym University)
Priority date: 2017-10-30
Filing date: 2017-10-30
Publication date: 2019-09-20
Also published as: KR20190047939A

Abstract

A method for collecting and analyzing text data may include: receiving an input regarding a keyword and period information; acquiring, from the web, information on articles containing the keyword based on the keyword and the period information; crawling the web pages that contain the articles based on the article information; collecting the text data of the articles included in the crawled web pages; storing the collected text data in a crawling database; and performing text data analysis based on the collected text data.

Description

Method and apparatus for collecting and analyzing text data for crawling text data {METHOD AND APPARATUS FOR COLLECTING AND ANALYZING TEXT DATA FOR CRAWLING TEXT DATA}

The present application relates to a method and apparatus for collecting and analyzing text data for text data crawling.

Demand for intelligent services such as civil complaint analysis, news curation, preference surveys, and recommendation has recently surged. Most of these services collect and analyze the text data contained in articles or posts published on the web in order to provide users with the information they need.

For example, as everyday wireless devices and wireless communication technologies such as mobile phones, tablet PCs, and Wi-Fi have become widespread, public concern about the harm that the electromagnetic waves they emit may cause to the human body has risen sharply. Civil complaints and lawsuits related to electromagnetic waves are increasing worldwide, so analyzing the causes of these complaints and preparing efficient countermeasures is urgent. However, existing approaches to handling electromagnetic-wave complaints consider only responses to locally occurring complaints, or only the general harmfulness of electromagnetic waves to the human body. This makes it difficult to respond quickly to the electromagnetic-wave complaints that are currently social issues, and such approaches may not be adequate. Solving this problem requires countermeasures that reflect a content analysis of the complaints currently at issue. Existing complaint analysis, however, depends on experts, so it can take a long time and, depending on the expert's proficiency, may fail to produce accurate results. Consequently, relieving heightened public anxiety consumes a long time and considerable cost and manpower, and inaccurate analysis results may cause the response to the complaints to fail.

Accordingly, providing intelligent services smoothly requires both web crawling technology, which analyzes the various kinds of articles published on the web in real time and collects their text data, and big data analysis technology, which analyzes the collected data.

The background art of the present application is disclosed in Korean Patent Application Publication No. 2010-0094263 (published 2010-08-26).

The present application is intended to solve the problems of the prior art described above. It aims to provide a method for collecting and analyzing text data that receives a keyword from a user, uses web crawling to collect articles published on the Internet that relate to the input keyword, performs text data analysis related to the input keyword, and includes a database capable of storing and outputting the collected articles and the analysis results.

The present application also aims to provide a method for collecting and analyzing text data that can visualize, in various forms, the results of collecting and analyzing keyword text data based on the keyword received from the user, and provide those results to the user.

The present application further aims to provide an integrated framework that takes both text data collection technology and data analysis technology into account.

However, the technical problems to be solved by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical problems, according to an embodiment of the present application, a method for collecting and analyzing text data may include: receiving an input regarding a keyword and period information; acquiring, from the web, information on articles containing the keyword based on the keyword and the period information; crawling the web pages that contain the articles based on the article information; collecting the text data of the articles included in the crawled web pages; storing the collected text data in a crawling database; and performing text data analysis based on the collected text data.

According to an embodiment of the present application, the collecting of the text data may include: generating a URL containing the keyword, the period information, and web page information, and transmitting the URL to the web; receiving from the web an HTML file containing the list of articles included in the web page; reconstructing the HTML file into a tree form; extracting the URLs in the article list from the HTML file; transmitting the URLs in the article list to the web and receiving from the web HTML files containing the article text data; reconstructing the HTML files containing the article text data into a tree form and extracting text data; and extracting, from the extracted text data, the text data corresponding to the title and content of each article.
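The parse-and-extract steps above can be illustrated with a minimal Python sketch that uses only the standard library. The specification does not name a parser or any page markup, so everything page-specific here is an assumption made for illustration: article links are taken to be `<a class="article">` elements, the title an `<h1>`, and the body a `<div id="content">`. The names `Node`, `TreeBuilder`, `article_urls`, and `title_and_body` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the open-tag stack.
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class Node:
    """One node of the reconstructed tree (an element or a '#text' leaf)."""

    def __init__(self, tag, attrs=(), text=""):
        self.tag, self.attrs, self.text = tag, dict(attrs), text
        self.children = []

    def walk(self):
        """Yield this node and all descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()

    def inner_text(self):
        """Join the non-empty text leaves under this node with single spaces."""
        return " ".join(n.text.strip() for n in self.walk() if n.text.strip())


class TreeBuilder(HTMLParser):
    """Rebuilds an HTML document as a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].children.append(Node("#text", text=data))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def article_urls(list_html):
    """Extract article URLs from a list page (assumed markup: <a class="article">)."""
    return [n.attrs["href"] for n in parse(list_html).walk()
            if n.tag == "a" and n.attrs.get("class") == "article" and n.attrs.get("href")]


def title_and_body(article_html):
    """Extract title and body from an article page (assumed markup: <h1>, <div id="content">)."""
    root = parse(article_html)
    title = next((n.inner_text() for n in root.walk() if n.tag == "h1"), "")
    body = next((n.inner_text() for n in root.walk()
                 if n.tag == "div" and n.attrs.get("id") == "content"), "")
    return title, body
```

A real news site would need its own selectors in place of the assumed ones, and fetching the pages over the network is left out of the sketch.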

According to an embodiment of the present application, the method for collecting and analyzing text data may further include performing encoding on the HTML file of the article list reconstructed into the tree form and on the HTML file containing the article text data reconstructed into the tree form.

According to an embodiment of the present application, in the step of performing encoding on the HTML file of the article list reconstructed into the tree form and on the HTML file containing the article text data reconstructed into the tree form, the encoding may be UTF-8 (Unicode Transformation Format, 8-bit) encoding.
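As a rough illustration of the encoding step, the following Python sketch normalizes a fetched page to UTF-8. It assumes, which the specification does not state, that some pages arrive in a legacy Korean encoding such as EUC-KR and that the charset declared by the server is known to the caller. The function name `to_utf8` is hypothetical.

```python
def to_utf8(raw: bytes, declared_charset: str = "utf-8") -> bytes:
    """Re-encode page bytes as UTF-8.

    The declared charset is tried first. If it is unknown to Python or the
    bytes do not decode cleanly, the bytes are decoded as UTF-8 with
    undecodable sequences replaced, so the function never raises.
    """
    try:
        text = raw.decode(declared_charset)
    except (LookupError, UnicodeDecodeError):
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```

For example, `to_utf8("전자파".encode("euc-kr"), "euc-kr")` yields the UTF-8 bytes of the same Hangul string, so later steps can treat all collected text uniformly.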

According to an embodiment of the present application, the method for collecting and analyzing text data may further include generating a txt-format file from the extracted text data, and the generated txt-format file may be stored in the crawling database.

According to an embodiment of the present application, the method for collecting and analyzing text data may further include: reading the collected text data from the crawling database; preprocessing the collected text data based on preset predefined words; and forming a base data set from the text data of the preprocessed articles. The text data analysis may then be performed based on the base data set.
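The preprocessing and base-data-set steps can be sketched in Python as follows. This is only an illustration under stated assumptions: articles are kept when they contain at least one predefined word as a substring, words are obtained by whitespace splitting (Korean text would need a morphological analyzer to extract nouns, which is not shown), and the base data set is a binary article-by-word matrix over the most frequent words. The function names and the `top_n` parameter are hypothetical.

```python
from collections import Counter


def preprocess(articles, predefined_words):
    """Keep only the articles that mention at least one predefined word."""
    return [a for a in articles if any(w in a for w in predefined_words)]


def build_base_dataset(articles, top_n=3):
    """Build a binary article-by-word matrix over the top_n most common words.

    Frequency here is document frequency: a word counts once per article.
    Ties are broken alphabetically so the result is deterministic.
    """
    tokenized = [set(a.split()) for a in articles]
    counts = Counter(w for words in tokenized for w in words)
    vocab = sorted(counts, key=lambda w: (-counts[w], w))[:top_n]
    matrix = [[int(w in words) for w in vocab] for words in tokenized]
    return vocab, matrix
```

Each matrix row can be read as one transaction (the set of frequent words present in one article), which is the shape that association rule mining expects.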

According to an embodiment of the present application, an apparatus for collecting and analyzing text data may include: a user interface unit that receives an input regarding a keyword and period information; a web crawler unit that acquires, from the web, information on articles containing the keyword based on the keyword and the period information, crawls the web pages that contain the articles based on the article information, and collects the text data of the articles included in the crawled web pages; a crawling database that stores the collected text data; and a data analysis unit that performs text data analysis based on the collected text data.

According to an embodiment of the present application, the web crawler unit may include a parsing unit and an extraction unit. The parsing unit generates a URL containing the keyword, the period information, and web page information and transmits it to the web; receives from the web an HTML file containing the list of articles included in the web page and reconstructs the HTML file into a tree form; extracts the URLs in the article list from the HTML file and transmits them to the web; receives from the web HTML files containing the article text data; and reconstructs those HTML files into a tree form. The extraction unit extracts text data from the HTML files containing the article text data reconstructed into the tree form, and extracts, from the extracted text data, the text data corresponding to the title and content of each article.

According to an embodiment of the present application, the web crawler unit may further include a language support unit that performs encoding on the HTML file of the article list reconstructed into the tree form and on the HTML file containing the article text data reconstructed into the tree form.

According to an embodiment of the present application, the language support unit performs encoding on the HTML file containing the article text data reconstructed into the tree form, and the encoding may be UTF-8 (Unicode Transformation Format, 8-bit) encoding.

According to an embodiment of the present application, the web crawler unit may further include a file generation unit that generates a txt-format file from the extracted text data, and the crawling database may store the generated txt-format file.

According to an embodiment of the present application, the data analysis unit may include: a preprocessing unit that reads the collected text data from the crawling database and preprocesses it based on preset predefined words; a data forming unit that forms a base data set from the text data of the preprocessed articles; and an analysis unit that performs the text data analysis based on the base data set.

According to an embodiment of the present application, a system for collecting and analyzing text data may include the apparatus for collecting and analyzing text data, and a user terminal that provides the apparatus with an input regarding a keyword and period information and that receives and outputs the results of the text data analysis from the apparatus.

The means for solving the problems described above are merely illustrative and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and in the detailed description of the invention.

According to the means for solving the problems described above, it is possible to provide a method for collecting and analyzing text data that receives a keyword from a user, uses web crawling to collect articles published on the Internet that relate to the input keyword, performs text data analysis related to the input keyword, and includes a database capable of storing the collected articles and the analysis results.

In addition, according to the means for solving the problems described above, it is possible to provide a method for collecting and analyzing text data that can visualize, in various forms, the results of collecting and analyzing keyword text data based on the keyword received from the user, and provide those results to the user.

In addition, according to the means for solving the problems described above, it is possible to collect Korean-language text data published on the web, to reduce analysis time by automating data analysis, and to obtain an integrated framework that covers both text data collection and analysis.

FIG. 1 is a diagram showing the schematic configuration of a text data collection and analysis system according to an embodiment of the present application.
FIG. 2 is a block diagram showing the schematic configuration of a text data collection and analysis apparatus according to an embodiment of the present application.
FIG. 3 is a flowchart schematically showing a text data collection and analysis method according to an embodiment of the present application.
FIG. 4 is a flowchart schematically showing the step of acquiring article information according to an embodiment of the present application.
FIG. 5 is a flowchart schematically showing the step of collecting the text data of articles according to an embodiment of the present application.
FIG. 6 is a diagram showing an example of a collected text data set according to an embodiment of the present application.
FIG. 7 is a flowchart schematically showing the step of preprocessing the collected text data according to an embodiment of the present application.
FIG. 8 is a diagram showing an example of a text data classification process according to an embodiment of the present application.
FIG. 9 is a diagram showing an example of a process of removing preset removal text elements according to an embodiment of the present application.
FIG. 10 is a flowchart schematically showing the step of forming a base data set according to an embodiment of the present application.
FIG. 11 is a diagram schematically showing a matrix data set according to an embodiment of the present application.
FIG. 12 is a flowchart schematically showing the step of performing text data analysis according to an embodiment of the present application.
FIG. 13 is a diagram showing an example in which words at or above a determined minimum frequency are output in word cloud form, with the output position, output size, and output color determined by frequency, according to an embodiment of the present application.
FIG. 14 is a diagram showing an example of a group matrix output that presents, in matrix form, the association between a first word set and a second word set according to an embodiment of the present application.
FIG. 15 is a diagram showing an example of a graph output that presents, in network graph form, the association between words belonging to the first word set and words belonging to the second word set according to an embodiment of the present application.
FIG. 16 is a diagram showing an example of a distribution plot output that graphs the distribution of the determined association rules according to an embodiment of the present application.
FIG. 17 is a diagram showing an example of a graph including extraction frequency and extraction probability according to an embodiment of the present application.
FIG. 18 is a diagram schematically showing the request purpose according to the requestFlag value according to an embodiment of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present application pertains can readily practice them. The present application may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between.

Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member is present between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The present application relates to a method for collecting and analyzing text data. Starting from a keyword input by the user, the method collects, through web crawling, the text data contained in articles or posts published on the web; stores the collected data in a database for analysis; classifies the collected text according to predefined words; performs association rule analysis between the keyword and the extracted words on that basis; and provides the analysis results to the user through a visualization function.

FIG. 1 is a diagram showing the schematic configuration of a text data collection and analysis system according to an embodiment of the present application, and FIG. 2 is a diagram showing the schematic configuration of a text data collection and analysis apparatus according to an embodiment of the present application. Referring to FIGS. 1 and 2, the text data collection and analysis system may include a text data collection and analysis apparatus 100, a user terminal 200, and the web 300. The system collects articles published on the Internet from the web 300 based on a keyword and period information, and provides the user terminal 200 with the results of collecting and analyzing the related keyword text data based on the collected articles. The user terminal 200 may provide the text data collection and analysis apparatus 100 with an input regarding the keyword and period information, and may receive and output the results of the text data analysis from the apparatus 100.

The user terminal 200 may be any kind of wireless communication device capable of processing and outputting data, such as a smartphone, a SmartPad, a smart TV, or a tablet PC, as well as a PCS (Personal Communication System), GSM (Global System for Mobile communication), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (Wideband Code Division Multiple Access), or WiBro (Wireless Broadband Internet) terminal.

The text data collection and analysis apparatus 100, the user terminal 200, and the web 300 may be connected through a network. A network here means a wired or wireless connection structure that allows nodes such as terminals and servers to exchange information with one another. Examples of such a network include, but are not limited to, a 3GPP (3rd Generation Partnership Project) network, an LTE (Long Term Evolution) network, a 5G network, a WiMAX (Worldwide Interoperability for Microwave Access) network, the Internet, a LAN (Local Area Network), a Wireless LAN (Wireless Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), a Bluetooth network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.

Referring to FIG. 1, the information input at the user terminal 200 is transmitted to the text data collection and analysis apparatus 100, which may then collect and analyze text data related to the input keyword. The apparatus 100 may provide the visualized text data collection and analysis results to the user terminal 200 through the network. The apparatus 100 may collect, from the web 300, articles published on the Internet that relate to the input keyword, based on the keyword information input by the user.

The text data collection and analysis apparatus 100 may acquire, from the web, information on articles containing the keyword based on the keyword and the period information, crawl the web pages that contain the articles based on the article information, perform text data analysis on the articles included in the crawled web pages, and store the analysis results.

Referring to FIG. 2, the text data collection and analysis apparatus 100 may broadly include a user interface unit 110, a web crawler unit 120, a data analysis unit 130, and a database 140. The web crawler unit 120 may include a parsing unit 121, an extraction unit 122, a language support unit 123, and a file generation unit 124. The data analysis unit 130 may include a preprocessing unit 131, a data forming unit 132, and an analysis unit 133. The database 140 may include a crawling database 141, an analysis database 142, and a predefined word database 143.

According to an embodiment of the present application, the user interface unit 110 provides the user with a GUI that helps set the detailed values for web crawling and text data analysis, and may output the derived association rule analysis results and Word Cloud results to the user. Although not shown, the user interface unit 110 may include three functional blocks: 1) User Input Panel, 2) Result View Panel, and 3) Pre-define Knowledge Panel. The User Input Panel may be used to configure data collection settings such as the keyword, article publication period, page, and crawling range, and to set various data analysis options such as output conditions and the Word Cloud minimum frequency. The Result View Panel is used to present the visualized results after data analysis, and the Pre-define Knowledge Panel may be used to add or delete words stored in the Pre-define Knowledge DB.

To extract the data set needed for data analysis, the web crawler unit 120 searches for collectable articles using the specific keyword and article publication period received from the user as conditions, and collects the text data of all the articles found. The web crawler unit 120 generates a URL based on the keyword and option values input by the user and uses the generated URL to search for articles published on the web. It then retrieves from the web 300 the HTML files containing the list of found articles and the article bodies. The web crawler unit 120 may also perform encoding to support Korean, extract from the HTML file contents the text data corresponding to the title and content of each article, and generate and store the extracted text data in txt format.
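The URL generation step can be sketched in Python with the standard `urllib.parse.urlencode` function. The specification does not disclose which search service is queried or how its parameters are named, so the base address `https://search.example.com/news` and the parameter names `query`, `start`, `end`, and `page` below are placeholders chosen for illustration only.

```python
from urllib.parse import urlencode


def build_search_url(keyword, start_date, end_date, page=1,
                     base="https://search.example.com/news"):
    """Compose a search URL from the keyword, the period, and the page number.

    urlencode percent-encodes non-ASCII keywords (such as Hangul) as UTF-8
    and turns spaces into '+', so the result is safe to send as-is.
    """
    params = {"query": keyword, "start": start_date, "end": end_date, "page": page}
    return f"{base}?{urlencode(params)}"
```

Calling `build_search_url("5G tower", "2017-01-01", "2017-10-30", page=2)` produces `https://search.example.com/news?query=5G+tower&start=2017-01-01&end=2017-10-30&page=2`. Iterating over `page` would cover the crawling range set by the user.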

The data analysis unit 130 preprocesses the data set collected by the web crawler unit 120 and forms matrix data for association rule analysis. It may then perform the association rule analysis and visualize the results in various forms. The data analysis unit 130 compares the collected article data with preset words to classify the articles to be analyzed. It may also extract substantives (nouns) from the bodies of the classified articles and form a data set for association rule analysis based on the most frequent of the extracted words. The data analysis unit 130 may then perform association rule analysis using the Apriori algorithm and visualize the results as a Word Cloud, Grouped Matrix, Graph, Scatter Plot, or Extraction Probability.

The database 140 may store the data set extracted by the web crawler unit 120, as well as the analysis results and visualized images obtained through the data analysis unit 130. In particular, it stores the user-defined preset words used for the article classification performed during data preprocessing, together with the preprocessing results and the data set for association analysis. The database 140 includes 1) a crawl database, 2) an analysis database, and 3) a predefined word database. The crawl database stores and manages the data collected by the web crawler unit 120, the analysis database stores and manages the data analysis results, and the predefined word database stores and manages the predefined words defined by the user.

Each unit of the text data collection and analysis apparatus 100 is described in detail below.

FIG. 3 is a flowchart schematically illustrating a text data collection and analysis method according to an embodiment of the present application. Each step shown in FIG. 3 is described in detail with reference to FIGS. 4 to 12.

Referring to FIG. 3, in step S310 the user interface unit 110 may transmit a keyword and period information to the web crawler unit 120, and the web crawler unit 120 may receive them. For example, the period information may be the publication period of the articles.

In step S320, the web crawler unit 120 may obtain, from the web, information on articles containing the keyword, based on the keyword and the period information. In step S330, the web crawler unit 120 may crawl the web pages containing those articles based on the article information. In steps S320 and S330, the web crawler unit 120 may generate a URL containing the keyword and period information, send it to the web, and receive from the web an HTML file containing the search results.

In step S340, the web crawler unit 120 may transmit the text data of the articles contained in the crawled web pages to the data analysis unit 130. Alternatively, in step S340 the data analysis unit 130 may read from the database the article text data collected by the web crawler unit 120.

In step S350, the data analysis unit 130 may preprocess the crawled web pages based on preset predefined words. More specifically, the data analysis unit 130 may preprocess the text data of the articles contained in the crawled web pages based on the predefined words. The data analysis unit 130 may separate the text data of articles that contain a predefined word from the text data of articles that do not. It may also remove preset removal text elements from the text data of the articles that contain a predefined word.

In step S360, the data analysis unit 130 may form, from the articles contained in the crawled web pages, the base data set to be used for data analysis. More specifically, it may form the base data set from the article text data of the preprocessed web pages, that is, from the article text data from which the preset removal text elements have been removed. According to an embodiment of the present application, the base data set may include a set of substantives contained in those articles, a set of frequent words within a preset rank, and a matrix data set indicating whether each article contains each frequent word.

In step S370, the data analysis unit 130 may analyze the text data based on the base data set. According to an embodiment of the present application, the text data analysis may include a frequency analysis of the words used in the articles and an association rule analysis between words within the articles. The data analysis unit 130 may output the results of the frequency analysis and of the association rule analysis as different graphics.

In step S380, the database 140 may store the results of the text data analysis. According to an embodiment of the present application, the database 140 may store the results in association with identifiers of the frequency analysis result and the association rule analysis result. For example, the identifiers of the association rule analysis results may include Word Cloud, Grouped Matrix, Graph, Scatter Plot, and Extraction Probability.

FIG. 4 is a flowchart schematically illustrating the step of obtaining article information according to an embodiment of the present application.

Referring to FIG. 4, in step S410 the user interface unit 110 may receive input of a keyword and period information. The period information may be the publication period of the articles. For example, the user may choose to search without restricting the publication period or, to search for articles published within a specific period, may set the period by entering a start date (year, month, day) and an end date (year, month, day). For example, the user interface unit 110 may provide a calendar UI for setting the start and end dates of the publication period.

In step S420, the user interface unit 110 may transmit to the web crawler unit 120 a request flag (RequestFlag) containing request information, together with the keyword and period information. For example, referring to FIG. 18, request 1 may be a request for the number of articles found. The user interface unit 110 may transmit a request flag containing request 1. That is, the user interface unit 110 may transmit the keyword, the period information, and request 1 to the web crawler unit 120.

The web crawler unit 120 may calculate the number of articles found, which is the purpose of request 1 received from the user interface unit 110.

In step S430, the parsing unit 121 of the web crawler unit 120 may generate a URL containing the keyword, the period information, and web page information. As one example, if no period information is specified, the parsing unit 121 may generate a URL for obtaining article search results without restricting the publication period. As another example, if the period information has a start date and an end date set, the parsing unit 121 may generate a URL for obtaining article search results restricted to that publication period.
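The URL generation of step S430 could be sketched as follows. This is only an illustration: the endpoint `SEARCH_URL`, the parameter names (`query`, `page`, `start`, `end`), and the date format are assumptions, since the source does not name the search service or its query syntax.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

# Hypothetical search endpoint; the source does not identify the real one.
SEARCH_URL = "https://news.example.com/search"


def build_search_url(keyword: str, page: int = 1,
                     start: Optional[date] = None,
                     end: Optional[date] = None) -> str:
    """Build a search URL carrying the keyword, page, and optional period."""
    # Parameter names are assumptions made for this sketch.
    params = {"query": keyword, "page": page}
    # Restrict the publication period only when both dates are set;
    # otherwise the search is left unrestricted, as in the first example.
    if start is not None and end is not None:
        params["start"] = start.strftime("%Y.%m.%d")
        params["end"] = end.strftime("%Y.%m.%d")
    # urlencode percent-encodes Korean keywords as UTF-8 by default.
    return SEARCH_URL + "?" + urlencode(params)
```

Calling `build_search_url("전자파")` yields an unrestricted search URL, while passing both dates adds the period parameters.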

In step S440, the parsing unit 121 may send the generated URL to the web 300 and request from the web 300 an HTML file containing the search results.

In step S450, the web 300 generates an HTML file based on the keyword and period information, and in step S460 the web 300 may transmit the HTML file containing the search results to the web crawler unit 120.

In step S470, the parsing unit 121 may restructure the HTML file received from the web 300 into a tree. The extraction unit 122 may extract, from the restructured HTML file, the data having a preset attribute value associated with extracting the article count (for example, span class="result_num").

In step S480, the extraction unit 122 may remove whitespace and special characters from the extracted data and tally the number of articles. For example, once extraction is complete, the web crawler unit 120 may transmit the number of articles found to the user interface unit 110 in the form "total: (number of articles)". If there are no search results, the web crawler unit 120 may transmit "total: NA" to the user interface unit 110.
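Steps S470 and S480 could be sketched with the Python standard library as follows. The class and function names are hypothetical, and the sketch assumes that the `result_num` span holds only the formatted count (for example "1,234"); a real results page may need a stricter pattern.

```python
import re
from html.parser import HTMLParser


class ResultCountParser(HTMLParser):
    """Collect the text found inside <span class="result_num"> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # greater than 0 while inside the target span
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1          # nested span inside the target
        elif "result_num" in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def article_count_message(html: str) -> str:
    """Return 'total: N', or 'total: NA' when no count is present."""
    parser = ResultCountParser()
    parser.feed(html)
    # Drop whitespace and special characters, keeping digits only.
    digits = re.sub(r"\D", "", "".join(parser.chunks))
    return f"total: {digits}" if digits else "total: NA"
```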

In step S490, the web crawler unit 120 may calculate the number of crawlable web pages based on the number of articles. For example, one web page may contain at least one article. If search results exist for the keyword and period information, the web crawler unit 120 may transmit the maximum number of crawlable web pages to the user interface unit 110 as a number. If no search results exist, the web crawler unit 120 may transmit a text message to the user interface unit 110 indicating that there are no results.
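The page-count calculation of step S490 reduces to a ceiling division. The sketch below assumes a fixed number of articles per result page (`articles_per_page`, defaulting to 10), a value the source does not state.

```python
import math


def crawlable_pages(article_count: int, articles_per_page: int = 10) -> int:
    """Return how many result pages are needed to list all found articles."""
    # articles_per_page is an assumed constant, not taken from the source.
    if article_count <= 0:
        return 0
    return math.ceil(article_count / articles_per_page)
```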

FIG. 5 is a flowchart schematically illustrating the step of collecting the text data of articles according to an embodiment of the present application.

The web crawler unit 120 may crawl web pages containing articles that include the keyword, within the range of web pages entered by the user or the like through the user interface unit 110.

In step S510, the user interface unit 110 may receive a crawl range entered by the user or the like. The crawl range is the range of web pages whose articles are to be crawled, and must be set to a range smaller than the number of crawlable web pages calculated in step S490. The user interface unit 110 may present the maximum number of crawlable web pages. The user may set the crawl range by entering a start web page and an end web page.

In step S520, the user interface unit 110 may transmit, together with the keyword, the start web page from which web crawling is to be performed. When transmitting the web page range to the web crawler unit 120, the user interface unit 110 may also transmit a request flag containing request 2. Referring to FIG. 18, request 2 may be a request flag whose purpose is to request article crawling. The web crawler unit 120 may perform article crawling in response to request 2 from the user interface unit 110.

In step S530, the web crawler unit 120 may calculate the number of articles containing the keyword, based on the keyword and period information. To do so, the parsing unit 121 may generate a URL for crawling a specific page. The URL is generated in a manner similar to step S430, where a URL containing the keyword and period information is generated. The parsing unit 121 may generate a URL containing the keyword, the period information, and web page information, send it to the web 300, receive from the web 300 an HTML file containing the list of articles on that web page, and restructure the HTML file into a tree.

In step S540, the web crawler unit 120 may crawl the articles containing the keyword within the entered range of web pages. In step S540, the parsing unit 121 may extract the URLs in the article list from the HTML file, send those URLs to the web 300, receive from the web HTML files containing the article text data, and restructure those HTML files into trees.

Also in step S540, the extraction unit 122 of the web crawler unit 120 may extract text data from the HTML file of the article list restructured into a tree. From the extracted text data, the extraction unit 122 may then extract the text data corresponding to each article's title and content. The language support unit 123 may perform encoding on the tree-structured HTML file containing the article text data. For example, the encoding may be UTF-8 (Unicode Transformation Format-8), to prevent loss of Korean text data.

In step S550, the web crawler unit 120 may repeat the crawling process as many times as there are articles in the article list. When the number of articles crawled so far is greater than or equal to the total number of articles, the web crawler unit 120 may end the loop and transmit the crawled article text data to the crawl database.

In step S560, once the web crawler unit 120 has repeated the article text data request for every URL in the article list of a given page, it may transmit the article text data to the crawl database 141. When the article text data has been stored in the crawl database 141, the user interface unit 110 may compare the start page and the end page of the web page range specified by the user.

In step S570, if the end page number is greater than the number of the web page currently being crawled, the web crawler unit 120 may increment the start page by 1 and repeat the crawling of the article text data on that web page. That is, the web crawler unit 120 may repeat web page crawling throughout the entered range of web pages.
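The nested loop of steps S540 to S570, over pages and over the articles on each page, could be sketched as follows. The two callables are hypothetical stand-ins for the parsing and extraction units; network access is deliberately left out, so the loop structure is the only thing shown.

```python
from typing import Callable, List


def crawl_range(start_page: int, end_page: int,
                list_article_urls: Callable[[int], List[str]],
                fetch_article_text: Callable[[str], str]) -> List[str]:
    """Collect article text for every page in [start_page, end_page]."""
    texts: List[str] = []
    page = start_page
    while page <= end_page:
        # Inner loop: one request per URL in this page's article list.
        for url in list_article_urls(page):
            texts.append(fetch_article_text(url))
        # Advance to the next page until the end page has been crawled.
        page += 1
    return texts
```

Passing the fetch functions as parameters keeps the loop easy to exercise with in-memory fakes before any real HTTP client is wired in.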

The file generation unit 124 may also write the extracted text data to a txt file. The crawl database 141 may store the generated txt file.

FIG. 6 is a diagram illustrating an example of a collected text data set according to an embodiment of the present application.

Referring to FIG. 6, the text data collected from the articles on the crawled web pages may be stored as txt files. Because the language support unit 123 encodes the HTML files containing the article text data during crawling, loss of Korean data is prevented and article titles are obtained in Korean.

FIG. 7 is a flowchart schematically illustrating the step of preprocessing the collected text data according to an embodiment of the present application.

Referring to FIG. 7, the preprocessing unit 131 may preprocess the collected text data set based on preset predefined words. When it receives the information of requests 3 and 4, the data analysis unit 130 may perform data preprocessing to improve the accuracy of the data analysis. Preprocessing may consist of a data classification process followed by a data filtering process. The data filtering process may remove preset removal text elements from the text data of the articles that contain a predefined word.

As preparation for data preprocessing, predefined words may be defined in the predefined word database 143. The predefined words may be defined by the user.

In step S710, the user interface unit 110 may receive predefined words added by the user. The user may add or delete predefined words as needed. The user interface unit 110 may receive the add and delete inputs and transmit the updated predefined words to the predefined word database 143.

Also in step S710, when transmitting the predefined words, the user interface unit 110 may transmit a request flag containing request 3. Request 3 may be a requestFlag value whose purpose is to request data classification.

In step S720, at the request of the data analysis unit 130, the crawl database 141 may transmit the collected text data to the data analysis unit 130. The preprocessing unit 131 of the data analysis unit 130 may receive the text data collected in the crawl database 141 in order to classify it.

In step S730, the preprocessing unit 131 may compare the article body in the collected text data against the predefined words. If a predefined word is present, the preprocessing unit 131 may create a temporary file to perform a first-pass classification of the data set.

In step S740, to speed up data classification and to prevent the same text data from being classified twice, the preprocessing unit 131 may request the crawl database 141 to delete the text data that has already been classified.

That is, the preprocessing unit 131 may keep in the crawl database 141 the text data of articles that do not contain a predefined word, and may delete from the crawl database the text data of articles that do contain a predefined word and store it in the analysis database 142.

In step S750, the crawl database 141 may delete the text data of the articles that contain a predefined word. In step S760, the analysis database 142 stores the text data of the articles that contain a predefined word.

The preprocessing unit 131 may repeat the article classification on the collected text data once for each predefined word.
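The classification of steps S730 to S760 could be sketched as follows, with two in-memory dictionaries standing in for the crawl database 141 and the analysis database 142. The function name and the file-name keys are illustrative assumptions, and the temporary file of step S730 is omitted for brevity.

```python
from typing import Dict, Iterable


def classify_articles(crawl_db: Dict[str, str],
                      analysis_db: Dict[str, str],
                      predefined_words: Iterable[str]) -> None:
    """Move articles containing any predefined word to the analysis store."""
    # One pass per predefined word, as in the repeated classification above.
    for word in predefined_words:
        matched = [name for name, text in crawl_db.items() if word in text]
        for name in matched:
            # Removing the article from the crawl store means later passes
            # cannot classify it a second time.
            analysis_db[name] = crawl_db.pop(name)
```

After the call, articles without any predefined word remain in `crawl_db`, mirroring the behavior described for the crawl database.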

FIG. 8 is a diagram illustrating an example of the text data classification process according to an embodiment of the present application.

FIG. 8(a) may show predefined words set in advance by the user. The user may designate as predefined words the words to classify by, that is, the words an article must contain, and the collected text data is checked for those words. For example, the preprocessing unit 131 may receive from the user interface unit 110 an input defining the predefined words as radio wave (전파), negative (부정), human body (인체), side effect (부작용), and exposure (노출). From the text data set collected in the crawl database, the preprocessing unit 131 may delete from the crawl database 141 the data of articles containing any of those predefined words, and store the text data of those articles in the analysis database 142. FIG. 8(b) shows the process of removing from the text data set the entries that do not contain any predefined word, and FIG. 8(c) may be an example of the text data set after preprocessing by the preprocessing unit 131.

The preprocessing unit 131 may also receive from the user interface unit 110 a request flag containing request 4. Request 4 may be a requestFlag value whose purpose is to request text data filtering. In step S770, to perform the text data filtering, the preprocessing unit 131 may receive from the analysis database 142 the text data of the articles that contain a predefined word.

In step S780, the preprocessing unit 131, having received request 4, may remove preset removal text elements from the text data containing predefined words that is stored in the analysis database 142. For example, the preset removal text elements may include at least one of the name of the reporter who wrote the article, the reporter's email address, advertisements, and a copyright notice. Removing them eliminates elements that are unnecessary for the text data analysis.

Specifically, when removing the preset removal text elements, the preprocessing unit 131 may locate every period in the text data of an article that contains a predefined word, determine the last sentence of the article from those positions, and remove the text following the last sentence as the preset removal text elements.

FIG. 9 is a diagram illustrating an example of the process of removing the preset removal text elements according to an embodiment of the present application.

Referring to FIG. 9, the text data of an article that contains a predefined word may include, after the last sentence, elements such as the reporter's name, the reporter's email address, advertisements, and a copyright notice. The preprocessing unit 131 may locate every "." in the classified data and remove text elements relative to the second-to-last position found. For example, the reporter's name, email address, and copyright notice generally follow the period of the article's last sentence, and the email address itself contains a period. The second-to-last period is therefore taken as the end of the article's last sentence, and the text after it is removed. The preprocessing unit 131 may delete everything after this reference point and store the resulting text data in the analysis database 142.
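The period-based heuristic described for FIG. 9 could be sketched as follows. The function name is hypothetical. The sketch inherits the assumption that the trailing block contains exactly one period (the one in the email address); an address such as name@paper.co.kr, with two periods, would move the cut point one sentence too late.

```python
def strip_trailer(text: str) -> str:
    """Cut everything after the second-to-last period in the article text."""
    positions = [i for i, ch in enumerate(text) if ch == "."]
    if len(positions) < 2:
        # Too few periods to apply the heuristic; leave the text unchanged.
        return text
    cut = positions[-2]           # assumed end of the last real sentence
    return text[:cut + 1]
```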

FIG. 10 is a flowchart schematically illustrating the step of forming the base data set according to an embodiment of the present application.

Referring to FIG. 10, the data forming step forms the base data set to be used for data analysis, and may perform substantive extraction, frequent word extraction, and matrix data set formation in that order.

First, the data forming unit 132 of the data analysis unit 130 may receive from the user interface unit 110 a request flag containing request 5. Request 5 may be a requestFlag value whose purpose is to request substantive extraction. Substantive extraction may be the process of extracting substantives, such as pronouns, nouns, and numerals, from the collected articles.

In step S810, the data forming unit 132 may receive the text data of the preprocessed articles from the analysis database 142.

In step S820, the data forming unit 132 may extract only the substantives from the text data of the preprocessed articles, excluding particles, adjectives, adverbs, and other parts of speech that are unnecessary for data analysis. According to an embodiment of the present application, the data forming unit 132 may select, from the word set made up of the extracted substantives, the words of two or more characters and store them in the analysis database 142.

After substantive extraction, the data forming unit 132 may receive, from the user interface unit 110, a request flag including request 6. Request 6 may be a requestFlag value that requests frequent word extraction. Frequent word extraction may sort the extracted substantives by frequency and extract those ranked at or above a specific rank.

In step S830, the user interface unit 110 may receive the user's input regarding a word frequency rank. The word frequency rank is the rank criterion for extracting frequent words, and may be used in frequent word extraction to extract the words whose frequency rank falls within that value. The data forming unit 132 may request the substantive set data from the analysis database 142.

In step S840, the analysis database 142 may transmit the extracted substantive set data.

In step S850, the data forming unit 132 may extract frequent words based on the substantive set data and the word frequency rank. The data forming unit 132 may extract, from the substantive set data, only the words whose frequency rank is within the word frequency rank value entered by the user, and store them in the analysis database 142.
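The two-character filter of step S820 and the rank-based selection of step S850 can be sketched as follows. This is only an illustration under stated assumptions: the part-of-speech tagging that produces the substantives is assumed to have been done already, the function name `extract_frequent_words` and the sample words are hypothetical, and frequency is counted as total occurrences across all articles, which the description does not specify.

```python
from collections import Counter


def extract_frequent_words(substantives_per_article, rank_limit):
    """Return the rank_limit most frequent substantives of two or more characters."""
    counts = Counter()
    for words in substantives_per_article:
        # Step S820: keep only words of two or more characters.
        counts.update(w for w in words if len(w) >= 2)
    # Step S850: keep the words whose frequency rank is within rank_limit.
    return [word for word, _ in counts.most_common(rank_limit)]


# Hypothetical substantives already extracted from three articles.
articles = [
    ["wave", "repeater", "resident", "I"],
    ["wave", "repeater"],
    ["wave", "tower"],
]
print(extract_frequent_words(articles, 2))  # ['wave', 'repeater']
```

`Counter.most_common` breaks ties by first occurrence, so words with equal counts near the rank limit are kept or dropped depending on input order.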

After frequent word extraction, the data forming unit 132 may receive, from the user interface unit 110, a request flag including request 7. Request 7 may be a requestFlag value that requests formation of the matrix data set. Matrix data set formation may be a process of forming a data set that indicates whether each article contains each frequent word, for use in association rule analysis and in computing word extraction probabilities.

In step S860, the data forming unit 132 may receive the text data set of the preprocessed articles and the frequent word set from the analysis database 142.

In step S870, the data forming unit 132 may form, for association rule analysis, a matrix data set indicating whether each article contains each frequent word, based on the text data set of the preprocessed articles, the frequent word set, and the keyword value included with the request flag of request 7. In step S880, the data forming unit 132 forms a base data set including the substantive set, the frequent word set, and the matrix data set.

FIG. 11 is a diagram schematically illustrating a matrix data set according to an embodiment of the present application.

Referring to FIG. 11, the column names of the matrix data set are the words at or above a specific frequency, and the row names are article identifiers, numbered from 1 to the number of articles. The data forming unit 132 may compare each article with each such word and set the corresponding entry to 1 if the article contains the word and to 0 otherwise. In addition, to analyze the relationship between the keyword entered by the user and the extracted words, the data forming unit 132 puts the user's keyword in the name of the first column and sets every row of the first column to 1. The keyword entered by the user may be the keyword information received in step S310.
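The layout of the matrix data set can be illustrated with a short Python sketch. The function name `build_matrix`, the list-of-lists representation, and the sample words are assumptions made for this example. The description only fixes the meaning of the rows, the columns, and the 0/1 entries.

```python
def build_matrix(articles, frequent_words, keyword):
    """Build a 0/1 article-by-word matrix with the search keyword as column 1."""
    # The first column is named after the user's keyword.
    columns = [keyword] + [w for w in frequent_words if w != keyword]
    rows = []
    for words in articles:
        present = set(words)
        # The keyword column is always 1; other columns mark word presence.
        row = [1] + [1 if w in present else 0 for w in columns[1:]]
        rows.append(row)
    return columns, rows


# Hypothetical substantives of three crawled articles.
articles = [
    ["wave", "repeater", "resident"],
    ["wave", "repeater"],
    ["wave", "tower"],
]
columns, rows = build_matrix(articles, ["repeater", "tower"], "wave")
print(columns)
for article_id, row in enumerate(rows, start=1):
    print(article_id, row)
```

Row identifiers start at 1 to match the article numbering described for FIG. 11.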

FIG. 12 is a flowchart schematically illustrating the step of performing text data analysis according to an embodiment of the present application.

Referring to FIG. 12, the data analysis step may analyze the extracted data set and provide the analysis results to the user in various visualization forms. Data analysis may include frequency analysis of the words used in the articles and association rule analysis between the words in the articles. Word frequency analysis may include analysis of how many times a specific word is used and analysis of the extraction probability of a specific word. Association rule analysis examines the correlation between the words used in the articles and may be performed based on the matrix data set.

The analysis unit 133 may analyze the frequency of the words used in the articles and the association rules between the words in the articles. The analysis unit 133 may perform association rule analysis by comparing every word set that can be formed from the frequent words with the word set contained in each article, finding the word sets that are used together in many articles, and defining them as IF-THEN rules.

For example, an association rule may be defined and expressed as {electromagnetic wave, repeater} -> {opposition}. That is, the rule can be read as: if the words 'electromagnetic wave' and 'repeater' are used in an article, the word 'opposition' is also used. Here, {electromagnetic wave, repeater} is the first word set (left-hand side, LHS), and {opposition} is the second word set (right-hand side, RHS). The first word set (LHS) is a word set that can be formed from the frequent words in the matrix data set, and the second word set (RHS) is a word set consisting of a single one of the frequent words in the matrix data set.

In addition, if the matrix data set contains the three frequent words electromagnetic wave, repeater, and transmission tower, the first word set (LHS) may be any of {electromagnetic wave}, {repeater}, {transmission tower}, {electromagnetic wave, repeater}, {electromagnetic wave, transmission tower}, {repeater, transmission tower}, and {electromagnetic wave, repeater, transmission tower}, and the second word set (RHS) may be any of {electromagnetic wave}, {repeater}, and {transmission tower}. Within a single association rule, the first word set (LHS) and the second word set (RHS) do not share any word. The analysis unit 133 may generate numerous association rules over the first word set (LHS) and the second word set (RHS) as the analysis result. If the user fixes the second word set (RHS) to a specific word, the association rule analysis results for that word can be checked easily.
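The enumeration of candidate rules for three frequent words can be made concrete with a small Python sketch. The function name `candidate_rules` and the shortened word labels are assumptions for illustration. Because the LHS and RHS may not share a word, the three-word LHS has no word left for the RHS and yields no rule in this sketch.

```python
from itertools import combinations


def candidate_rules(words):
    """List every (LHS, RHS) pair with a one-word RHS that is not in the LHS."""
    rules = []
    # An LHS that uses every word leaves nothing for the RHS, so stop one short.
    for size in range(1, len(words)):
        for lhs in combinations(words, size):
            for rhs in words:
                if rhs not in lhs:
                    rules.append((lhs, rhs))
    return rules


rules = candidate_rules(["wave", "repeater", "tower"])
for lhs, rhs in rules:
    print(set(lhs), "->", {rhs})
print(len(rules), "candidate rules")
```

With three words this gives six rules with a one-word LHS and three with a two-word LHS. The count grows quickly as words are added, which is why pruning matters.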

Because association rule analysis compares every set that can be formed from the words in the matrix data set against each article, the number of operations can grow exponentially as the number of words and articles increases. To reduce the number of operations, the analysis unit 133 may use the Apriori algorithm, which eliminates operations on infrequent word sets based on two principles.

The two principles of the Apriori algorithm are: 1) if an itemset is frequent, every subset of that itemset is also a frequent itemset; and 2) if an itemset is infrequent, every set that contains that itemset is also an infrequent itemset.

Because an itemset is a set that can be formed from the words in the data set, applying the Apriori algorithm reduces the number of word sets to be examined and thus the number of operations.
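A compact sketch of Apriori-style frequent itemset mining shows how the second principle removes work. It is a textbook-style illustration, not the implementation used by the analysis unit 133. The function name `apriori`, the `min_support` threshold, and the sample articles are assumptions for the example.

```python
from itertools import combinations


def apriori(articles, min_support):
    """Return {itemset: support} for every frequent itemset."""
    baskets = [frozenset(a) for a in articles]
    total = len(baskets)

    def support(itemset):
        # Fraction of articles that contain every word of the itemset.
        return sum(1 for b in baskets if itemset <= b) / total

    words = sorted({w for b in baskets for w in b})
    level = [frozenset([w]) for w in words]
    level = [s for s in level if support(s) >= min_support]
    frequent = {}
    size = 1
    while level:
        frequent.update((s, support(s)) for s in level)
        size += 1
        known = set(level)
        # Join step: merge frequent sets whose union has exactly `size` words.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step (principle 2): skip a candidate without counting it
        # if any of its (size - 1)-word subsets is infrequent.
        level = [
            c for c in candidates
            if all(frozenset(sub) in known for sub in combinations(c, size - 1))
            and support(c) >= min_support
        ]
    return frequent


articles = [
    ["wave", "repeater", "tower"],
    ["wave", "repeater"],
    ["wave", "tower"],
    ["repeater"],
]
frequent = apriori(articles, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this sample {repeater, tower} appears in only one of four articles, so it is infrequent, and the three-word set is discarded in the prune step without scanning the articles.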

The analysis unit 133 may use three measures, support, confidence, and lift, to establish the relationship between the keyword and the extracted words through association rule analysis. Support, confidence, and lift are calculated as follows, where X denotes the first word set (LHS) and Y denotes the second word set (RHS).

① Support, s(X→Y)

= (number of articles containing both X and Y) / (total number of articles)

② Confidence, c(X→Y)

= (number of articles containing both X and Y) / (number of articles containing X)

③ Lift, Lift(X→Y)

= (proportion of articles containing Y among the articles containing X) / (proportion of articles containing Y among all articles)

That is, support is the proportion of the extracted articles that contain both X and Y, and confidence is the proportion of the articles containing X that also contain Y. Lift is the ratio of the proportion of articles containing Y among the articles containing X to the proportion of articles containing Y among all the articles extracted for the entered keyword.

More specifically, support measures in how many articles X and Y are used together. A very small support value means that very few articles follow the association rule. Support therefore indicates whether a particular word set is used in only a few specific articles or in many articles.

Confidence is the probability that the rule holds, that is, the proportion of the articles using X in which Y is also used. If the confidence between particular word sets is very low, the rule is likely not to hold and does not occur in the general case. Even an association rule with both high support and high confidence cannot be assumed to be meaningful, because some rules with high support and confidence only appear to be strongly associated by chance. Lift is therefore used to measure the usefulness of an association rule.

Lift indicates how much the probability of containing Y increases when X is contained, compared with the probability of containing Y when X is not considered. If lift is 1, the probability of finding Y among the articles containing X equals the probability of finding Y among all articles, so the two word sets X and Y are likely to be uncorrelated. If lift is greater than 1, the two word sets are likely to be closely (positively) correlated, and if it is less than 1, they are likely to be inversely (negatively) correlated. For example, bread and butter are likely to have a positive correlation, whereas an antidiarrheal medicine and a laxative are likely to have a negative correlation.
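The three measures can be computed directly from article word sets, as in the following sketch. The function name `rule_metrics` and the four sample articles are assumptions for illustration, while the formulas follow the definitions of support, confidence, and lift given above.

```python
def rule_metrics(articles, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs."""
    lhs, rhs = set(lhs), set(rhs)
    baskets = [set(a) for a in articles]
    total = len(baskets)
    n_x = sum(1 for b in baskets if lhs <= b)            # articles containing X
    n_y = sum(1 for b in baskets if rhs <= b)            # articles containing Y
    n_xy = sum(1 for b in baskets if (lhs | rhs) <= b)   # articles containing both
    if n_x == 0 or n_y == 0:
        raise ValueError("X and Y must each appear in at least one article")
    support = n_xy / total
    confidence = n_xy / n_x
    lift = confidence / (n_y / total)
    return support, confidence, lift


# Hypothetical word sets of four articles.
articles = [
    ["wave", "repeater", "opposition"],
    ["wave", "repeater", "opposition"],
    ["wave", "tower"],
    ["repeater", "opposition"],
]
support, confidence, lift = rule_metrics(articles, {"wave", "repeater"}, {"opposition"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Here X and Y occur together in 2 of 4 articles, every article containing X also contains Y, and Y occurs in 3 of 4 articles overall, so lift is above 1.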

In step S901, the user interface unit 110 may transmit a request flag and a minimum frequency. The data analysis unit 130 may receive, from the user interface unit 110, a request flag including request 8. Request 8 may be a requestFlag value that requests Word Cloud visualization.

In step S902, the analysis unit 133 may receive the substantive set data from the analysis database 142.

In step S903, the analysis unit 133 may perform frequency analysis by determining, among the words in the substantive set, the words whose frequency is at or above the minimum frequency. The analysis unit 133 may then determine an output position, output size, and output color for each of those words according to its frequency and output them as a word cloud.
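One way to derive an output size from word frequency is sketched below. The description does not say how size is computed, so the linear scaling, the font-size range, the function name `word_cloud_sizes`, and the sample counts are all assumptions made for illustration. Position and color are omitted.

```python
def word_cloud_sizes(counts, min_frequency, min_size=10, max_size=60):
    """Map each word at or above min_frequency to a font size."""
    # Keep only the words that reach the minimum frequency.
    kept = {w: c for w, c in counts.items() if c >= min_frequency}
    if not kept:
        return {}
    lo, hi = min(kept.values()), max(kept.values())
    sizes = {}
    for word, count in kept.items():
        if hi == lo:
            # All kept words are equally frequent; use one size.
            sizes[word] = max_size
        else:
            # Linear interpolation between min_size and max_size.
            scale = (count - lo) / (hi - lo)
            sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes


# Hypothetical word counts.
counts = {"wave": 30, "repeater": 20, "resident": 10, "opposition": 2}
sizes = word_cloud_sizes(counts, min_frequency=10)
print(sizes)
```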

FIG. 13 is a diagram illustrating an example in which words at or above the determined minimum frequency are output as a word cloud, with the output position, output size, and output color of each word determined by its frequency, according to an embodiment of the present application.

Referring to FIG. 13 as an example, word cloud visualization makes key words stand out by rendering the words used in the articles in different sizes and colors according to their frequency.

The analysis unit 133 may also store the generated word cloud visualization result in the analysis database 142.

Next, the data analysis unit 130 may receive, from the user interface unit 110, a request flag including request 9. Request 9 may be a requestFlag value that requests Grouped Matrix visualization.

In step S904, the data analysis unit 130 may receive the minimum support and minimum confidence inputs. The data analysis unit 130 may also receive an input specifying the second word set.

In step S905, the analysis database 142 may transmit the matrix data set at the request of the analysis unit 133.

In step S906, the analysis unit 133 may perform association rule analysis based on the matrix data set. The analysis unit 133 may determine the first word sets, which consist of the word combinations that can be formed from the frequent words in the matrix data set, and determine association rules stating that when an article contains the word combination of a first word set, it also contains the word of the second word set.

The analysis unit 133 may compute the support, confidence, and lift of each determined association rule and output the analysis results for those rules whose support and confidence are at or above the minimum support and the minimum confidence, respectively.
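The threshold filter, together with an optional user-specified RHS word, can be sketched as follows. The `Rule` record, the function name `select_rules`, and the metric values in the sample rules are made-up placeholders for illustration, not results from the described system.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    lhs: frozenset
    rhs: str
    support: float
    confidence: float
    lift: float


def select_rules(rules, min_support, min_confidence, rhs: Optional[str] = None):
    """Keep rules that meet both thresholds and, if given, the requested RHS."""
    kept = [
        r for r in rules
        if r.support >= min_support
        and r.confidence >= min_confidence
        and (rhs is None or r.rhs == rhs)
    ]
    # List the most widely supported rules first.
    return sorted(kept, key=lambda r: r.support, reverse=True)


# Placeholder rules with invented metric values.
rules = [
    Rule(frozenset({"tower"}), "resident", 0.25, 0.80, 1.10),
    Rule(frozenset({"wave", "repeater"}), "opposition", 0.50, 1.00, 1.33),
    Rule(frozenset({"wave"}), "opposition", 0.05, 0.40, 0.90),
]
for rule in select_rules(rules, min_support=0.1, min_confidence=0.1):
    print(sorted(rule.lhs), "->", rule.rhs, rule.support, rule.confidence)
```

Passing `rhs="opposition"` restricts the output to rules about that word, mirroring the user-specified second word set.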

Referring to FIG. 14 as an example, in step S907 the analysis unit 133 may output the association between the first word sets and the second word sets in matrix form. FIG. 14 is a diagram illustrating an example of a grouped matrix output. FIG. 14 shows the association between word sets in matrix form, organized by the first word set (LHS) and second word set (RHS) of each association rule. In the grouped matrix graph, the size of a circle may indicate the support of the rule and the darkness of its color may indicate the lift. The number in front of a first word set (LHS) name is the number of association rules having that LHS as their condition, and the number shown with "+" in a first word set (LHS) is the number of words omitted from the label. The larger the circle and the darker its color, the more frequently the rule can be interpreted to occur.

The analysis unit 133 may generate a different number of rules for each search keyword, because the number of articles in the matrix data set and the number of words in those articles differ from keyword to keyword. Consequently, the number of rules whose values are at or above the option values (support = 0.1, confidence = 0.1) also differs for each search keyword. In other words, when the analysis unit 133 performs association rule analysis, the support and confidence values of the generated rules are determined by the word sets contained in each article. The fewer the articles in the matrix data set and the fewer the words in each article, the more rules fall below the option values and are excluded from the results.

In addition, the data analysis unit 130 may receive, from the user interface unit 110, a request flag including request 10. Request 10 may be a requestFlag value that requests Graph visualization.

In step S908, the data analysis unit 130 may request the matrix data set from the analysis database 142. The analysis unit 133 may perform association rule analysis based on the matrix data set, and may perform graph visualization based on the rules ranked at or above a specific support rank.

In step S909, when performing association rule analysis, the analysis unit 133 may receive the minimum support and minimum confidence values from the user interface unit 110. The analysis unit 133 may also set a second word set (RHS) value so that only the association rule analysis results for the word the user is interested in are shown.

Referring to FIG. 15, in step S910 the analysis unit 133 may produce a graph output that shows, in network form, the association between the words of the first word sets and the words of the second word sets. In the network graph shown in FIG. 15, an arrow pointing from a word to a circle denotes the first word set (LHS), and an arrow pointing from a circle to a word denotes the second word set (RHS). For example, the analysis unit 133 may analyze the association 'transmission tower (LHS) -> O -> resident (RHS)'. That is, the analysis unit 133 may derive the rule that if the word 'transmission tower' is used in an article, the word 'resident' is also used in that article. A result of the form 'O -> electromagnetic wave (RHS)' may denote the rule that 'electromagnetic wave' is used in an article regardless of whether any first word set (LHS) word is used.

In addition, the size of a circle may indicate the support of the rule, and the darkness of the circle's color may indicate its lift. That is, the larger the circle and the darker its color, the more frequently the rule can be interpreted to occur. The position of a word reflects its associations with other words, and FIG. 15 shows that 'repeater' is at the center of the associations. For instance, if two words lie apart from all the other words, those two words may be associated only with each other.

The analysis unit 133 may store the generated network graph output in the analysis database 142.

In addition, the data analysis unit 130 may receive, from the user interface unit 110, a request flag including request 11. Request 11 may be a requestFlag value that requests Scatter Plot visualization (scatter plot output).

In step S911, the data analysis unit 130 may request the matrix data set from the analysis database 142. The analysis unit 133 may perform association rule analysis based on the matrix data set and provide the distribution of the determined association rules visualized as a scatter plot output.

In step S912, when performing association rule analysis, the analysis unit 133 may receive the minimum support and minimum confidence values from the user interface unit 110. The analysis unit 133 may also set a second word set (RHS) value so that only the association rule analysis results for the word the user is interested in are shown.

In step S913, the analysis unit 133 may provide the distribution of the generated association rules as a graph. Referring to FIG. 16 as an example, the X axis of the scatter plot may represent the support of an association rule and the Y axis its confidence. The darkness of each point's color may represent the lift. The scatter plot may thus show the distribution of the support, confidence, and lift of the generated rules. Each point of the scatter plot represents one rule, and the density of the points may be proportional to the number of rules. Regardless of the search keyword, the scatter plot may show more generated rules, and higher lift, at lower support and higher confidence. In association rule analysis, among the rules with a low frequency of occurrence, a rule with higher confidence may indicate a stronger association between the two word sets.

Finally, the data analysis unit 130 may receive, from the user interface unit 110, a request flag including request 12. Request 12 may be a requestFlag value that requests Extraction Probability visualization.

In step S914, the analysis unit 133 may request the matrix data set from the analysis database 142. The analysis unit 133 may perform frequency analysis based on the matrix data set, and may visualize and provide the probability that a specific word is extracted from an article together with its extraction frequency. In step S915, the analysis unit 133 may perform the frequency analysis by reading the matrix data set, computing the probability that each word in the substantive set is extracted from the articles of the matrix data set, and outputting a graph that includes the extraction frequency and extraction probability of each word in the substantive set.

Referring to FIG. 17 as an example, in step S915 the analysis unit 133 may render the graph of extraction frequency and extraction probability as a bar graph. The graph shown in FIG. 17 may represent, in bar graph form, the probability that a specific word is extracted from an article when an arbitrary keyword is entered. The X axis of the graph lists the word sets with high frequency, and the Y axis represents the extraction probability of each word set. The graph shown in FIG. 17 is an example in which the 30 most frequent word sets among all extracted word sets make up the X axis. The number above each bar gives the specific extraction probability of that word set.

In step S915, the probability that a specific word is extracted from an article when an arbitrary search keyword is entered may be calculated based on the matrix data set. The analysis unit 133 may express the extraction probability as (number of articles containing the specific word) / (number of crawled articles).
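Given a 0/1 matrix data set, the extraction probability of each word is its column sum divided by the number of rows. The sketch below assumes the matrix is held as a list of column names plus a list of rows. The function name `extraction_probabilities` and the sample values are hypothetical.

```python
def extraction_probabilities(columns, rows):
    """Return {word: articles containing the word / crawled articles}."""
    total = len(rows)
    if total == 0:
        return {}
    return {
        word: sum(row[j] for row in rows) / total
        for j, word in enumerate(columns)
    }


# Hypothetical matrix data set for three crawled articles.
columns = ["wave", "repeater", "tower"]
rows = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]
probabilities = extraction_probabilities(columns, rows)
for word, p in probabilities.items():
    print(f"{word}: {p:.2f}")
```

The keyword column always yields a probability of 1, because every row of the first column is set to 1 when the matrix is formed.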

The analysis unit 133 may store the generated extraction frequency and extraction probability graph in the analysis database 142.
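The requestFlag values 5 to 12 described for the data forming and data analysis steps could be represented as an enumeration. The numeric values come from the description, whereas the class name `RequestFlag`, the member names, and the helper `describe` are hypothetical. Requests below 5 are outside the steps described here and are omitted.

```python
from enum import IntEnum


class RequestFlag(IntEnum):
    EXTRACT_SUBSTANTIVES = 5       # substantive extraction
    EXTRACT_FREQUENT_WORDS = 6     # frequent word extraction
    BUILD_MATRIX = 7               # matrix data set formation
    WORD_CLOUD = 8                 # Word Cloud visualization
    GROUPED_MATRIX = 9             # Grouped Matrix visualization
    GRAPH = 10                     # Graph visualization
    SCATTER_PLOT = 11              # Scatter Plot visualization
    EXTRACTION_PROBABILITY = 12    # Extraction Probability visualization


def describe(flag_value: int) -> str:
    """Return the name of the request; raises ValueError for unknown values."""
    return RequestFlag(flag_value).name


for value in range(5, 13):
    print(value, describe(value))
```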

In the above description, each step may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

The text data collection and analysis method according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

In addition, the above-described text data collection and analysis method may also be implemented in the form of a computer program or application that is stored in a recording medium and executed by a computer.

The above description of the present application is for illustrative purposes, and those of ordinary skill in the art to which the present application pertains will understand that it can be readily modified into other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise, components described as distributed may be implemented in a combined form.

The scope of the present application is indicated by the claims set forth below rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

100: text data collection and analysis apparatus
110: user interface unit
120: web crawler unit
121: parsing unit
122: extraction unit
123: language support unit
124: file generation unit
130: data analysis unit
131: preprocessing unit
132: data forming unit
133: analysis unit
140: database
141: crawling database
142: analysis database
143: predefined word database
200: user terminal
300: web

Claims

A text data collection and analysis method comprising:
receiving an input regarding a keyword and period information;
obtaining, from the web and based on the keyword and the period information, information on articles that include the keyword, the information including the number of such articles and the number of crawlable web pages;
crawling a web page that includes an article, based on the information on the articles;
collecting text data of the article included in the crawled web page;
storing the collected text data in a crawling database; and
performing text data analysis based on the collected text data,
wherein performing the text data analysis comprises:
determining a first word set comprising word combinations that can be formed from frequent words in a matrix data set, the matrix data set being information on whether each article includes each frequent word;
receiving an input regarding a second word set including at least one word from among the frequent word sets within a preset rank; and
determining an association rule as to whether, when a word combination included in the first word set is included in an article, a word included in the second word set is also included in the article.
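
As an illustration only, the association-rule determination recited above can be sketched in Python as follows. The sketch assumes that the matrix data set is a list of 0/1 rows (one row per article, one column per frequent word) and that each rule is scored by support and confidence, a common choice that the claim itself does not specify. The function name `association_rules`, the parameters `max_size` and `min_support`, and the sample words are hypothetical.

```python
from itertools import combinations


def association_rules(matrix, words, targets, max_size=2, min_support=0.5):
    """Return (combination, target, support, confidence) tuples.

    matrix  : list of rows; row[i] is 1 if the article contains words[i].
    words   : frequent words, in column order (assumed given).
    targets : the user-selected "second word set".
    """
    n = len(matrix)
    col = {w: i for i, w in enumerate(words)}

    def support(items):
        # Fraction of articles that contain every word in `items`.
        return sum(all(row[col[w]] for w in items) for row in matrix) / n

    # Candidate words for the "first word set" (combinations of frequent words).
    candidates = [w for w in words if w not in targets]
    rules = []
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            s = support(combo)
            if s < min_support:
                continue  # combination is too rare to be considered
            for t in targets:
                both = support(combo + (t,))
                # Confidence: share of articles with `combo` that also contain t.
                rules.append((combo, t, both, both / s))
    return rules


# Hypothetical sample: four articles, three frequent words.
sample_words = ["emf", "phone", "health"]
sample_matrix = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
]
sample_rules = association_rules(sample_matrix, sample_words, ["health"])
```

Each returned tuple pairs one word combination from the first word set with one word from the second word set, so the rules can be sorted by confidence before being presented to the user.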

The method of claim 1, wherein collecting the text data comprises:
generating a URL including the keyword and the period information and transmitting the generated URL to the web;
receiving, from the web, an HTML file containing search results;
reconstructing the HTML file into a tree form;
extracting a URL of an article list from the HTML file;
transmitting the URL of the article list to the web and receiving, from the web, an HTML file containing article text data;
reconstructing the HTML file containing the article text data into a tree form and extracting text data; and
extracting, from the extracted text data, text data corresponding to the title and the content of the article.
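
The collection steps above can be sketched, under stated assumptions, with the Python standard library alone. The network transfer is omitted, so the sketch covers only URL generation, reconstruction of an HTML string into a tree, and extraction. The query parameter names (`query`, `start`, `end`), the `article` link class, the `h1` title element, the `content` element id, and the names `Node`, `TreeBuilder`, `parse_tree`, `build_search_url`, `article_urls`, and `title_and_content` are all hypothetical, because the claim does not name a target site or page layout.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class Node:
    """One element (or '#text' leaf) of the reconstructed HTML tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children, self.text = [], ""

    def walk(self):
        # Depth-first traversal in document order.
        yield self
        for child in self.children:
            yield from child.walk()

    def all_text(self):
        return "".join(n.text for n in self.walk()).strip()


class TreeBuilder(HTMLParser):
    VOID = {"br", "img", "meta", "link", "input", "hr"}  # elements without end tags

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        leaf = Node("#text", parent=self.current)
        leaf.text = data
        self.current.children.append(leaf)


def parse_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def build_search_url(base, keyword, start, end):
    # Parameter names are assumptions; a real site defines its own.
    return base + "?" + urlencode({"query": keyword, "start": start, "end": end})


def article_urls(tree):
    # Assumes article links are <a class="article" href="...">.
    return [n.attrs["href"] for n in tree.walk()
            if n.tag == "a" and n.attrs.get("href")
            and "article" in (n.attrs.get("class") or "").split()]


def title_and_content(tree):
    # Assumes the title is the first <h1> and the body is <div id="content">.
    title = next((n.all_text() for n in tree.walk() if n.tag == "h1"), "")
    body = next((n.all_text() for n in tree.walk()
                 if n.tag == "div" and n.attrs.get("id") == "content"), "")
    return title, body
```

In practice the HTML strings passed to `parse_tree` would come from the responses to the generated search URL and to each extracted article URL.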

The method of claim 2, further comprising performing encoding on the HTML file of the article list reconstructed in the tree form and on the HTML file containing the article text data reconstructed in the tree form.

The method of claim 3, wherein the encoding is UTF-8 (Universal Transformation Format-8) encoding.

The method of claim 2, further comprising generating a txt file from the extracted text data,
wherein the generated txt file is stored in the crawling database.
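
A minimal sketch of the UTF-8 encoding and txt-file generation recited in the preceding claims is given below. It assumes that the source page's character set is known (for example, EUC-KR on some Korean news sites) and that the crawling database can be represented by a plain directory. The function names `to_utf8` and `save_article_txt` and the one-file-per-article layout are hypothetical.

```python
from pathlib import Path


def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Re-encode downloaded bytes as UTF-8; undecodable bytes are replaced."""
    return raw.decode(source_charset, errors="replace").encode("utf-8")


def save_article_txt(directory, article_id, title, content):
    """Write one article as a UTF-8 txt file and return its path.

    The directory stands in for the crawling database in this sketch.
    """
    path = Path(directory) / f"{article_id}.txt"
    path.write_text(f"{title}\n\n{content}\n", encoding="utf-8")
    return path
```

Storing every file in a single encoding means the later analysis step can read the collected text without having to detect a character set per article.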

The method of claim 3, further comprising:
reading the collected text data from the crawling database;
preprocessing the collected text data based on predefined words; and
forming a base data set from the text data of the preprocessed articles,
wherein the text data analysis is performed based on the base data set.
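
The preprocessing and data-set formation steps can be illustrated with the following sketch. It assumes, purely for illustration, that the predefined words act as a removal list and that the base data set is a 0/1 article-by-word matrix over the most frequent remaining words; the claim does not fix either choice. The names `preprocess`, `base_data_set`, and `top_k` are hypothetical.

```python
import re
from collections import Counter


def preprocess(text, predefined_words):
    """Lower-case, split into word tokens, and drop the predefined words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in predefined_words]


def base_data_set(articles, predefined_words, top_k=3):
    """Return (frequent_words, matrix) for a list of article texts.

    matrix[i][j] is 1 if article i contains frequent_words[j], else 0.
    """
    docs = [set(preprocess(a, predefined_words)) for a in articles]
    # Count in how many articles each word occurs.
    doc_freq = Counter(w for d in docs for w in d)
    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    frequent = [w for w, _ in ranked[:top_k]]
    matrix = [[int(w in d) for w in frequent] for d in docs]
    return frequent, matrix
```

A matrix of this shape records, for each article, whether each frequent word is present, which is the form of input that an association-rule analysis operates on.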

A text data collection and analysis apparatus comprising:
a user interface unit that receives an input regarding a keyword and period information;
a web crawler unit that obtains, from the web and based on the keyword and the period information, information on articles that include the keyword, the information including the number of such articles and the number of crawlable web pages, crawls a web page that includes an article based on the information on the articles, and collects text data of the article included in the crawled web page;
a crawling database that stores the collected text data; and
a data analysis unit that performs text data analysis based on the collected text data,
wherein the user interface unit receives an input regarding a second word set including at least one word from among the frequent word sets within a preset rank, and
wherein the analysis unit determines a first word set comprising word combinations that can be formed from frequent words in a matrix data set, the matrix data set being information on whether each article includes each frequent word, and determines an association rule as to whether, when a word combination included in the first word set is included in an article, a word included in the second word set is also included in the article.

The apparatus of claim 7, wherein the web crawler unit comprises:
a parsing unit that generates a URL including the keyword and the period information and transmits the URL to the web, receives from the web an HTML file containing search results, reconstructs the HTML file into a tree form, extracts a URL of an article list from the HTML file, transmits the URL of the article list to the web, receives from the web an HTML file containing article text data, and reconstructs the HTML file containing the article text data into a tree form; and
an extraction unit that extracts text data from the HTML file containing the article text data reconstructed in the tree form, and extracts, from the extracted text data, text data corresponding to the title and the content of the article.

The apparatus of claim 8, wherein the web crawler unit further comprises a language support unit that performs encoding on the HTML file of the article list reconstructed in the tree form and on the HTML file containing the article text data reconstructed in the tree form.

The apparatus of claim 9, wherein the encoding is UTF-8 (Universal Transformation Format-8) encoding.

The apparatus of claim 8, wherein the web crawler unit further comprises a file generation unit that generates a txt file from the extracted text data, and
wherein the crawling database stores the generated txt file.

The apparatus of claim 9, wherein the data analysis unit comprises:
a preprocessing unit that reads the collected text data from the crawling database and preprocesses the collected text data based on predefined words;
a data forming unit that forms a base data set from the text data of the preprocessed articles; and
an analysis unit that performs the text data analysis based on the base data set.

A text data collection and analysis system comprising:
the text data collection and analysis apparatus according to any one of claims 7 to 12; and
a user terminal that provides an input regarding a keyword and period information to the text data collection and analysis apparatus, and receives and outputs a result of the text data analysis from the text data collection and analysis apparatus.

A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 1 to 6.