KR20210083510A

KR20210083510A - Crime detection system through fake news decision and web monitoring and Method thereof

Info

Publication number: KR20210083510A
Application number: KR1020190175841A
Authority: KR
Inventors: 양중식; 이영준; 조영준
Original assignee: (주)아이와즈
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2021-07-07
Also published as: KR102318297B1

Abstract

The present invention groups monitoring targets, selected on the basis of expert advice, into general web targets, dark web targets, and fact collection targets; includes reputable websites among the data collection targets; collects data efficiently through distributed collection and incremental indexing; generates training data from the collected crime intelligence data; compares web monitoring data against similar fact documents using natural language processing and artificial intelligence; and calculates the probability that an item of crime intelligence is authentic, thereby improving the reliability of the crime intelligence. The crime detection system includes a selection unit, a collection unit, a crime intelligence extraction model generation unit, a crime intelligence extraction unit, and a determination unit.

Description

Crime detection system through fake news decision and web monitoring and Method thereof

The present invention relates to a system and method for detecting crime intelligence through fake news detection and periodic web monitoring, and more particularly to a technology that detects fake news, periodically monitors the web, and analyzes the collected data to automatically extract the latest crime intelligence.

The dark web is an encrypted network that cannot be accessed with an ordinary internet browser. It began as a means of guaranteeing security, anonymity, and privacy against surveillance and censorship by those in power, but it has become a hotbed of crime in the internet underworld, including counterfeit currency, illegal arms trading, contract killing, forgery of official documents, drug trafficking, and trade in child pornography.

Fake news spread through social networking services (SNS) or media outlets is cleverly fabricated, deceptive news: false information deliberately disseminated in the form of press reports for political or economic gain. It distorts or fabricates core facts, and most of it contains sensational content that is difficult to fact-check.

Because most of the dark web is a hotbed of crime, data collected from the dark web can be used as crime intelligence; and because fake news is a public enemy hiding behind freedom of expression, a method for identifying and eliminating fake news is needed.

As prior art relating to crime intelligence detection systems, Patent Document 1 discloses a dark web crime information analysis system and method that perform criminal profiling by analyzing information scanned from the dark web.

That technology presents a system for analyzing crime information obtained through the dark web. Given that real-world criminal methods are becoming increasingly sophisticated and intelligent, dark web crime information alone is insufficient. It is therefore necessary to collect and compare general web data and fact data in addition to dark web data, and to make a comprehensive judgment that yields the probability that an item of crime intelligence is authentic.

Korean Registered Patent No. 10-1852107

To solve the above problem, the present invention groups monitoring targets, selected on the basis of expert advice, into general web targets, dark web targets, and fact collection targets, includes reputable websites among the data collection targets, and collects data efficiently through distributed collection and incremental indexing.

The invention further aims to generate training data from the collected crime information data, analyze web monitoring data and similar fact documents using natural language processing and artificial intelligence, calculate the probability that an item of crime intelligence is authentic, and provide the final result to the user.

To solve the above problem, the system for detecting crime intelligence through web monitoring according to the present invention includes a selection unit that selects the targets to be monitored for crime intelligence detection; a collection unit that collects crime intelligence from the selected targets; a crime intelligence extraction model generation unit that verifies the collected crime intelligence and trains on it, and retrains after corrections and deletions; a crime intelligence extraction unit that operates through data learning and analysis; and a determination unit that determines whether the crime intelligence is true. The crime intelligence extraction model is built as existing and new crime models using a schema based on the five Ws and one H (who, what, when, where, why, and how), and is trained on a ground-truth dataset.

The selection unit groups the selected targets into general web targets, dark web targets, or fact collection targets.

The collection unit prevents access blocking through distributed collection using a plurality of collectors, and collects efficiently by performing incremental indexing.

The crime intelligence extraction model generation unit generates training data marked with ground-truth tags from crime information data collected by a dedicated team that includes experts, and prepares ground-truth documents for artificial intelligence training, so that crime intelligence can be extracted from the crime information data and analyzed.

The extraction unit analyzes crime intelligence data and extracts the latest crime intelligence. Using the crime intelligence extraction model, it extracts crime intelligence, including intelligence on unsolved cases, from the data collected by the collection unit and from the data found in it by unsolved-case search and similar-document search. It applies natural language processing and artificial intelligence to each collected item to perform morphological analysis and detect unanalyzed words, and it writes a report based on the extracted crime intelligence, the analyzed morphemes, and the detected unanalyzed words.

The determination unit determines whether crime intelligence is true. It comprises similar-fact search, which searches the fact DB for data similar to the web monitoring data and uses the results to judge whether the extracted crime intelligence is true; fake news detection, which analyzes the web monitoring data and the similar fact documents using natural language processing and artificial intelligence and calculates the probability that the web monitoring data is false; and result provision, which stores the supporting evidence each time the fake probability increases and provides the final result to the user.

In another embodiment of the present invention, a method for detecting crime intelligence through web monitoring includes a selection step of selecting the targets to be monitored for crime intelligence detection; a collection step of collecting crime intelligence from the selected targets; a crime intelligence extraction model generation step of verifying the collected crime intelligence and training on it, and retraining after corrections and deletions; a crime intelligence extraction step performed through data learning and analysis; and a determination step of determining whether the crime intelligence is true. The crime intelligence extraction model is built as existing and new crime models using a schema based on the five Ws and one H, and is trained on a ground-truth dataset.

The selection and collection steps include setting the sites to be monitored, setting a collection schedule, and then extracting index terms and inverted-index terms from the fact data collection, the general web collection, and the dark web collection and performing incremental indexing.

The crime intelligence extraction model generation step consists of a model construction algorithm and a result verification and retraining algorithm.

The model construction algorithm includes the steps of securing existing and new crime intelligence data, marking crime intelligence tags, generating a ground-truth dataset for training, training the crime intelligence extraction model, and storing the model once it has passed performance validation. The result verification and retraining algorithm includes the steps of requesting verification from a user (expert), generating training data verified by the user (expert), adding entries to the unanalyzed-word dictionary, and performing additional training of the crime intelligence extraction model.

The extraction step consists of a data analysis and latest-crime-intelligence extraction algorithm and a per-document analysis algorithm.

The latest-crime-intelligence extraction algorithm includes the steps of searching for documents similar to unsolved cases, grouping the documents by web monitoring document and similar unsolved-case document, extracting crime intelligence, analyzing each document, and automatically writing a report. The per-document analysis algorithm includes the steps of document preprocessing, morphological analysis with unanalyzed-word detection, named-entity analysis with unanalyzed-word detection, syntactic parsing, semantic role analysis, and document summarization.

The determination step includes the steps of analyzing the data using natural language processing; searching for fact data similar to that data; if similar data exists, analyzing the fact data using natural language processing; comparing the data under analysis with the fact data; and, when no similar fact documents remain, calculating the final fake probability for the crime intelligence data.

The present invention groups monitoring targets into general web targets, dark web targets, and fact collection targets, includes reputable websites among the data collection targets, and collects data efficiently through distributed collection and incremental indexing. It generates training data from the collected crime intelligence data and compares web monitoring data against similar fact documents using natural language processing and artificial intelligence to calculate the probability that an item of crime intelligence is authentic, and thereby has the notable effect of improving the reliability of crime intelligence.

FIG. 1 is a block diagram illustrating a crime intelligence detection system according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating how crime intelligence is collected from the detection targets in the crime intelligence detection method according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating the method of building the crime intelligence extraction model.
FIG. 4 is a flowchart illustrating the method of extracting the latest crime intelligence and the method of per-document analysis.
FIG. 5 is a flowchart illustrating the method of verifying the results of analysis by the crime intelligence extraction model and retraining.
FIG. 6 is a flowchart illustrating the method of determining whether crime intelligence is true.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings and the content shown in them; however, the present invention is not restricted or limited by the embodiments.

FIG. 1 is a block diagram illustrating a crime intelligence detection system according to an embodiment of the present invention. The system includes a selection unit 10 that selects the targets to be monitored for crime intelligence detection; a collection unit 20 that collects crime intelligence from the selected targets; a crime intelligence extraction model generation unit 30 that verifies the collected crime intelligence and trains on it, and retrains after corrections and deletions; a crime intelligence extraction unit 40 that operates through data learning and analysis; and a determination unit 50 that determines whether the crime intelligence is true.

The selection unit 10 selects monitoring targets on the basis of expert advice and groups them into general web targets 110, dark web targets 120, and fact collection targets 130.

In the present invention, the term "web" covers both the general web and the dark web.

Direct extraction through acquaintances, personal networks, or in-person inquiries is excluded from crime intelligence extraction in the present invention, but an online-offline linkage is desirable in which such sources supply base data or help choose index terms for web-based extraction.

The term "expert" is used broadly and includes cybercrime experts, information security experts, criminal psychologists, and the like; offline experts such as intelligence police or investigative police may also be included to support the online-offline linkage.

Experts also include a dedicated team that includes experts. The targets they monitor include not only general web targets, dark web targets, and fact collection targets but also reputable websites such as major newspapers, blogs of senior journalists, and blogs of legal professionals.

Because the need for international cybersecurity and cybercrime prevention has been growing, the web targets include overseas as well as domestic sites, and the expert group may include experts in international crime.

The dark web in particular cannot be reached with an ordinary internet browser. Its websites reside on an encrypted network and are accessible only through specific browsers, and their addresses usually take a different form from ordinary website domains. Collecting and monitoring dark web information therefore requires procedures and techniques different from those used for the general web, so collection and monitoring must be carried out separately.

Accordingly, a separate expert who can access the dark web, scan it, and gather information may be needed.

The collection unit 20, which collects crime intelligence from the selected targets, is characterized by distributed collection through a plurality of collectors, which prevents access blocking, and by efficient collection through incremental indexing.

Distributed collection is a collection method in which a plurality of collection targets are distributed among a plurality of collectors for access, so that the collection targets do not block the connection.
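The distribution of targets among collectors can be sketched as follows. This is a minimal illustration only: the round-robin policy, the function name `assign_targets`, and the example host and collector names are assumptions, since the description does not specify how targets are allocated.

```python
from collections import defaultdict
from itertools import cycle


def assign_targets(targets, collectors):
    """Distribute collection targets across collectors in round-robin order.

    Round-robin is an assumed policy; the source only states that targets
    are distributed among a plurality of collectors.
    """
    if not collectors:
        raise ValueError("at least one collector is required")
    plan = defaultdict(list)
    # zip stops when the targets run out; cycle repeats the collectors.
    for target, collector in zip(targets, cycle(collectors)):
        plan[collector].append(target)
    return dict(plan)


# Hypothetical targets and collectors, for illustration only.
plan = assign_targets(
    ["site-a.example", "site-b.example", "site-c.example",
     "site-d.example", "site-e.example"],
    ["collector-1", "collector-2"],
)
```

Spreading requests this way keeps any single collector from contacting every target, which is the property the description relies on to avoid being blocked.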

The collected data undergoes as-needed fact verification by the dedicated team including experts and is then classified by source and stored in either a web monitoring data DB or a fact DB; the web monitoring data DB is in turn divided into a general web DB and a dark web DB.
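The storage layout described above can be expressed as a small routing function. This is a sketch under assumptions: the enum `SourceGroup`, the function `storage_path`, and the DB identifiers are hypothetical names chosen for illustration, not names from the source.

```python
from enum import Enum


class SourceGroup(Enum):
    GENERAL_WEB = "general web"
    DARK_WEB = "dark web"
    FACT = "fact collection"


def storage_path(group):
    """Return the (top-level DB, sub-DB) pair for a source group.

    Fact data goes to the fact DB; web data goes to the web monitoring
    data DB, which is split into general web and dark web sub-DBs.
    """
    if group is SourceGroup.FACT:
        return ("fact_db", None)
    if group is SourceGroup.GENERAL_WEB:
        return ("web_monitoring_db", "general_web_db")
    if group is SourceGroup.DARK_WEB:
        return ("web_monitoring_db", "dark_web_db")
    raise ValueError(f"unknown source group: {group!r}")
```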

For fact data collection, if the collected fact data is excessively large and wide-ranging, it may be sampled first and then put through as-needed fact verification by the dedicated team including experts.

Index terms and inverted-index terms are extracted from the collected data.

Indexing is a method of searching a memory, or data in a memory, by using an index (a list of data items and their storage locations) to shorten the time needed to find a desired data item.

Inverted indexing is a technique in which, after morphological analysis and tuning of the target documents, each keyword is mapped to and stored with values such as the documents' primary keys, addresses, or file names. Its advantage is very fast lookup: it requires more processing and memory than a plain index, but searches are handled very quickly.
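A minimal sketch of the keyword-to-document mapping follows. It assumes the documents have already been tokenized (standing in for the morphological analysis step); the function name `build_inverted_index` and the sample documents are illustrative.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to the sorted list of document keys that contain it.

    `docs` maps a document key (e.g. a primary key or file name) to its
    list of tokens.
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical pre-tokenized documents.
docs = {
    "doc1": ["exit", "station", "delivery"],
    "doc2": ["station", "meeting"],
}
index = build_inverted_index(docs)
```

A lookup such as `index["station"]` then returns the matching document keys directly, without scanning every document.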

Incremental indexing is a way of searching large volumes of information, such as big data and GPU (Graphics Processing Unit) data, and refers to an incremental search system that reflects only the additions, deletions, and changes of documents in the search service.
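The idea of applying only additions, deletions, and changes can be sketched as below. The data layout (`index` as term to set of document keys, `stored` as document key to indexed tokens) and the function name `apply_changes` are assumptions made for illustration; the source does not describe an implementation.

```python
def apply_changes(index, stored, changes):
    """Update an inverted index in place from a change list.

    index:   term -> set of document keys
    stored:  document key -> tokens currently indexed for that document
    changes: document key -> new tokens, or None to delete the document
    Only documents named in `changes` are touched.
    """
    for doc_id, new_tokens in changes.items():
        # Remove the old postings for a changed or deleted document.
        for term in stored.pop(doc_id, []):
            postings = index.get(term)
            if postings is not None:
                postings.discard(doc_id)
                if not postings:
                    del index[term]
        # Add postings for a new or changed document.
        if new_tokens is not None:
            stored[doc_id] = list(new_tokens)
            for term in new_tokens:
                index.setdefault(term, set()).add(doc_id)


index, stored = {}, {}
# First pass: two new documents.
apply_changes(index, stored, {"doc1": ["station", "delivery"], "doc2": ["station"]})
# Second pass: doc1 deleted, doc2 changed, doc3 added.
apply_changes(index, stored, {"doc1": None, "doc2": ["station", "meeting"], "doc3": ["exit"]})
```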

The crime intelligence extraction model generation unit 30 verifies the collected crime intelligence and trains on it, and retrains after corrections and deletions. It generates training data marked with ground-truth tags from crime information data collected by the dedicated team including experts, and prepares ground-truth documents for artificial intelligence training, so that crime intelligence can be extracted from the crime information data and analyzed. One example of artificial intelligence training is machine learning by deep learning.

The functions of the crime intelligence extraction model generation unit fall broadly into building the crime intelligence extraction model and verifying results and retraining.

Building the crime intelligence extraction model means securing existing and new crime intelligence data, marking crime intelligence tags, and generating a ground-truth dataset for training; the resulting model is stored after verification by a dedicated team that includes crime intelligence and crime information experts.

The crime intelligence extraction model categorizes criminal cases and sets up the time, place, target of the crime, method of the crime, motive, and so on as a schema based on the five Ws and one H. When a document to be analyzed is input, the model encloses each extracted piece of crime intelligence in parentheses and, after a delimiter, attaches a tag indicating what kind of information it is. For example, if a document reading "Around 10 p.m. on the 10th of this month, Eagle (이글) to be on standby in front of Exit 4 of Hongik University Station. Stick (작대기) to be delivered as scheduled." is input to the model, it produces the result "(Around 10 p.m. on the 10th of this month/time), (Eagle/crime target (drug slang)) to be on standby (in front of Exit 4 of Hongik University Station/place). (Stick/crime method (drug slang)) to be delivered as scheduled."
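The parenthesized output format can be sketched as follows. This shows only how tagged spans might be rendered as "(text/tag)"; the spans here are supplied by hand, whereas in the described system they would come from the trained extraction model. The names `annotate` and `find_span` and the shortened sample sentence are illustrative assumptions.

```python
def annotate(text, spans):
    """Render each (start, end, tag) span as "(text/tag)".

    Spans must not overlap; text outside the spans is copied unchanged.
    """
    out = []
    cursor = 0
    for start, end, tag in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans")
        out.append(text[cursor:start])
        out.append(f"({text[start:end]}/{tag})")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def find_span(text, phrase, tag):
    """Helper for the example: locate the first occurrence of a phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), tag)


# Hand-labeled spans standing in for model output.
text = "Eagle on standby at Exit 4 around 10 p.m."
spans = [
    find_span(text, "Exit 4", "place"),
    find_span(text, "Eagle", "crime target"),
    find_span(text, "around 10 p.m.", "time"),
]
annotated = annotate(text, spans)
```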

Crime intelligence means raw, not-yet-analyzed material that identifies a person, evidence, or an incident, or that relates to witnesses or evidence in a criminal case; it is the material needed to produce crime information. Crime information is the product of an analytic process that evaluates intelligence obtained through various channels: related items of intelligence are integrated into a consolidated package, and scientific problem-solving methods are then applied to reach conclusions or predictions about criminal phenomena. In ordinary usage, however, the two terms are used interchangeably.

Crime intelligence is one kind of investigative intelligence and can be described as any matter that may serve as a reference in a criminal investigation.

A built model does not remain valid for long, so it must be continually renewed as the environment and the available information change. In the present invention, the renewal function is performed internally by the result verification and retraining process.

That is, one feature of the present invention is that when new crime intelligence is added, the parts that conflict with existing crime intelligence are corrected and regenerated automatically in the result verification and retraining process during construction of the crime intelligence extraction model.

The result verification and retraining process includes the steps of requesting verification from an expert, generating training data verified by the expert, adding entries to the unanalyzed-word dictionary, and performing additional training of the crime intelligence extraction model.

Crime intelligence is extracted from the crime intelligence data, and the report written from it is delivered to an expert. After the expert's verification, the report is regenerated as a ground-truth document with the errors corrected and fed back as retraining material for the crime intelligence extraction model. This virtuous cycle runs automatically within the model construction stage of the present invention.

The crime intelligence extraction unit 40, which operates through data learning and analysis, analyzes crime intelligence data and extracts the latest crime intelligence. Using the crime intelligence extraction model, it extracts crime intelligence from the data collected by the collection unit and from the data found by unsolved-case search and similar-document search. It applies natural language processing and artificial intelligence to each collected item to perform morphological analysis and detect unanalyzed words, and it writes a report based on the extracted crime intelligence, the analyzed morphemes, and the detected unanalyzed words.

The data learning and analysis is machine learning, in which a computer program improves its own information processing ability by learning from data and processing experience; more specifically, it is supervised learning, in which a mapping between inputs and outputs exists.

The present invention includes source-specific training sets for the input-output mapping. For example, the ability to detect genuine crime intelligence is improved through supervised learning in which the sources of fake news or fake crime intelligence, and the sources of credible news or genuine crime intelligence, are each learned as input-output pairs.

Artificial intelligence that imitates human learning ability may be used for the machine learning.

When searching the collected data for documents similar to unsolved cases, a similarity analysis algorithm such as TF-IDF (Term Frequency-Inverse Document Frequency) may be used. TF-IDF weights each word by how frequently it occurs and how distinctive it is, and these weights are used to calculate the similarity between texts.

To perform similarity analysis with TF-IDF, morphological analysis is first performed on the documents to be analyzed. TF and IDF values are then calculated from the morphological analysis results: the TF value is obtained from how many times a word appears in a document, the DF value from how many documents in the whole collection contain the word, and the IDF value by taking the reciprocal of the DF value. Finally, the TF value is multiplied by the IDF value, and the resulting TF-IDF value is used for the similarity analysis.
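The calculation just described can be sketched as follows, using the plain reciprocal of DF as the IDF, as stated in the text (many implementations use a log-scaled IDF instead). Two points are assumptions rather than part of the description: the documents are taken as already tokenized in place of morphological analysis, and cosine similarity is used to compare the TF-IDF vectors, since the text does not say how the final similarity is computed.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf * (1 / df)} mapping per tokenized document."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return [
        {term: count / df[term] for term, count in Counter(tokens).items()}
        for tokens in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical tokenized documents.
docs = [
    ["station", "exit", "delivery"],
    ["station", "exit", "meeting"],
    ["bank", "transfer"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # shares "station" and "exit"
sim_02 = cosine(vecs[0], vecs[2])  # shares no terms
```

In this example the terms shared by the first two documents are down-weighted because they appear in two documents, while each document's unique term keeps full weight.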

The previously built crime intelligence extraction model is applied in the course of analyzing crime intelligence data and extracting the latest crime intelligence, and the newly extracted intelligence is in turn used to renew the model, so the virtuous cycle continues.

For each collected item, natural language processing and artificial intelligence are used to perform morphological analysis and detect unanalyzed words, and a report is written based on the extracted crime intelligence, the analyzed morphemes, and the detected unanalyzed words.

For reference, natural language processing here means computational techniques for automating the analysis and representation of human language.

The report is delivered to a user (expert). After the user (expert) verifies it, it is regenerated as a correct-answer document with errors corrected and fed back as retraining data for generating the crime intelligence extraction model, so that the model is continually renewed.

The determination unit (50) determines whether fake news or crime intelligence is true. It includes: similar fact data search, which searches the fact DB for data similar to the web monitoring data and uses it to determine whether the extracted news or crime intelligence is true; fake news detection, which analyzes the web monitoring data and similar fact documents using natural language processing and artificial intelligence technology and calculates the probability that the web monitoring data is false; and result provision, which stores the supporting evidence each time the fake probability rises and provides the final result to the user.

The fact DB is searched for data similar to the web monitoring data. If no similar document exists, the web monitoring data is considered very likely to be fake, so calculating a fake probability is judged meaningless; the data is classified as 'judgment deferred' and is automatically submitted to experts and the dedicated team for review, or reanalysis is performed. If a similar document exists, the web monitoring data and the similar document are compared using natural language processing and artificial intelligence technology in order to calculate the fake probability of the web monitoring data; when differing content is detected, the fake probability is calculated according to the degree of difference, and the document is then stored as evidence.

If similar fact documents remain that have not yet been compared, each is compared in turn, and the fake probability is calculated, or the previously calculated fake probability is updated, according to the degree to which differing content is detected. The same loop is repeated until no similar fact documents remain, at which point the final fake probability for the crime intelligence data is calculated.

In the above process, each time the fake probability rises, the material on which the increase was based is stored as evidence, and the final result is provided to the user.
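The determination loop described above can be sketched as follows. The description does not specify how the degree of difference is measured or how the probability is updated, so both are assumptions here: a token-level Jaccard distance stands in for the NLP/AI comparison, and a running mean stands in for the update rule. The names `jaccard_distance` and `assess` are hypothetical.

```python
def jaccard_distance(a, b):
    """Toy stand-in for the NLP/AI comparison: 0.0 = same tokens, 1.0 = disjoint."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def assess(web_doc, similar_facts, discrepancy=jaccard_distance):
    # No similar fact document: no probability is computed, the item is deferred.
    if not similar_facts:
        return {"status": "judgment deferred", "fake_probability": None, "evidence": []}
    prob = 0.0
    evidence = []
    # Loop until no similar fact documents remain.
    for i, fact in enumerate(similar_facts, start=1):
        score = discrepancy(web_doc, fact)       # degree of difference in [0, 1]
        updated = prob + (score - prob) / i      # assumed update rule: running mean
        if updated > prob:
            evidence.append(fact)                # keep the material that raised it
        prob = updated
    return {"status": "scored", "fake_probability": prob, "evidence": evidence}

result = assess(
    "meeting at station ten",
    ["meeting at station ten", "meeting at harbor noon"],
)
print(result["status"], round(result["fake_probability"], 3), result["evidence"])
```

Only the fact documents that raised the probability are kept as evidence, which mirrors the storage rule in the text.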

Meanwhile, the method of detecting crime intelligence through web monitoring according to the present invention comprises: a selection step of selecting targets to be monitored for crime intelligence detection; a collection step of collecting crime intelligence from the selected targets; a crime intelligence extraction model generation step of verifying the collected crime intelligence and training on it, and of correcting and deleting data and retraining; a crime intelligence extraction step based on data learning and analysis; and a determination step of determining whether the crime intelligence is true.

FIG. 2 is a flowchart illustrating how, in the crime intelligence detection method according to an embodiment of the present invention, detection targets are selected and crime intelligence is collected from them. The selection and collection step includes setting the sites to be monitored, setting a collection schedule, and then extracting index terms and inverted-index terms from the collected fact data, general web data, and dark web data and performing incremental indexing.

In the verification step for collected fact data, if the amount of collected fact data is moderate, it is sent to experts and the dedicated team as fact verification material; if it is excessive, it is sampled first and then sent as verification material.
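The branching between sending everything and sampling first could look like the sketch below. The description gives no threshold or sampling method, so `max_items` and uniform random sampling are assumptions, and `select_for_verification` is a hypothetical name.

```python
import random

def select_for_verification(records, max_items=100, seed=None):
    """Return all records if the volume is moderate, otherwise a random sample."""
    records = list(records)
    if len(records) <= max_items:
        return records                      # moderate volume: send everything
    rng = random.Random(seed)               # seed is only for reproducible demos
    return rng.sample(records, max_items)   # excessive volume: sample first

batch = select_for_verification(range(1000), max_items=100, seed=0)
print(len(batch))
```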

After fact verification by the experts and the dedicated team, index terms and inverted-index terms are extracted from the collected data, incremental indexing is performed for fast retrieval, and the result is stored in the fact DB.

In the processing step for collected web data, the data is first classified as general web or dark web. For data collected from the general web, index terms and inverted-index terms are extracted from the collected data, incremental indexing is performed for fast retrieval, and the result is stored in the fact DB.

For dark web data, index terms and inverted-index terms are extracted in conjunction with hidden services, incremental indexing is performed for fast retrieval, and the result is stored in the fact DB.
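The incremental indexing used in the steps above can be illustrated with a small in-memory inverted index that only processes documents it has not seen before. This is a sketch under assumptions: the class name `IncrementalIndex`, whitespace tokenization, and the in-memory storage are all hypothetical stand-ins for the system's actual indexer and DB.

```python
from collections import defaultdict

class IncrementalIndex:
    """Inverted index that indexes only newly collected documents."""

    def __init__(self):
        self.postings = defaultdict(set)   # index term -> ids of documents containing it
        self.indexed_ids = set()           # documents already indexed in earlier cycles

    def add(self, docs):
        """docs: dict of doc_id -> text. Returns how many documents were newly indexed."""
        added = 0
        for doc_id, text in docs.items():
            if doc_id in self.indexed_ids:
                continue                   # skip documents indexed in a previous cycle
            for term in text.lower().split():
                self.postings[term].add(doc_id)
            self.indexed_ids.add(doc_id)
            added += 1
        return added

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))

index = IncrementalIndex()
index.add({"d1": "drug deal"})                       # first collection cycle
index.add({"d1": "drug deal", "d2": "drug exit"})    # second cycle: only d2 is new
print(index.search("drug"))
```

Because earlier documents are skipped, the cost of each collection cycle depends on the amount of new data rather than on the size of the whole collection.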

FIG. 3 is a flowchart illustrating a method of building the crime intelligence extraction model. It includes a step of securing existing and new crime intelligence data, a crime intelligence tagging step, a step of generating a correct-answer dataset for training, a step of training the crime intelligence extraction model, and a step of storing the crime intelligence extraction model once it has passed extraction model performance verification.

The crime intelligence extraction model categorizes criminal cases and organizes the time, place, crime target, crime method, crime motive, and so on into a schema based on the 5W1H principle (who, when, where, what, how, why). When a document to be analyzed is input to the crime intelligence extraction model, the model encloses each extracted piece of crime intelligence in parentheses and, after a delimiter, attaches a tag indicating what kind of information it is. For example, if a document reading "Eagle will be waiting in front of Exit 4 of Hongik Univ. Station at around 10 p.m. on the 10th of this month. Deliver the jakdaegi as scheduled." is input to the model, the model returns to the user the result "(at around 10 p.m. on the 10th of this month/time) (in front of Exit 4 of Hongik Univ. Station/place) (Eagle/crime target (drug slang)) will be waiting. Deliver (jakdaegi/crime method (drug slang)) as scheduled."
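The output format in this example, each extracted span wrapped as "(text/tag)", can be reproduced with a short rendering routine. The sketch below covers only the formatting step and assumes the model has already produced character spans with tags; `span_of` and `render_tags` are hypothetical helper names, and the sample sentence is a shortened English paraphrase of the example above.

```python
def span_of(text, phrase, tag):
    """Locate a phrase in the text and return a (start, end, tag) span."""
    start = text.index(phrase)
    return (start, start + len(phrase), tag)

def render_tags(text, spans):
    """Wrap each non-overlapping span as '(span text/tag)', leaving other text as is."""
    parts, cursor = [], 0
    for start, end, tag in sorted(spans):
        parts.append(text[cursor:start])             # untouched text before the span
        parts.append(f"({text[start:end]}/{tag})")   # tagged span
        cursor = end
    parts.append(text[cursor:])                      # untouched text after the last span
    return "".join(parts)

text = "Eagle will wait in front of Exit 4 around 10 p.m."
spans = [
    span_of(text, "Eagle", "crime target"),
    span_of(text, "in front of Exit 4", "place"),
    span_of(text, "around 10 p.m.", "time"),
]
print(render_tags(text, spans))
```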

Previously collected crime intelligence data and newly collected crime intelligence data are secured. Using a crime intelligence tagset DB prepared in advance with expert advice, the secured data is annotated with crime intelligence tags, and a correct-answer dataset for training is generated.

The crime intelligence extraction model is trained on the generated correct-answer dataset and its performance is verified. If the model's accuracy is at or above the reference value, the model is stored; if it is below the reference value, training is repeated until the accuracy reaches the reference value.
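The train, verify, and retrain cycle can be sketched as a simple loop. The training and evaluation routines are passed in as callables because the description does not specify the model; `train_step`, `evaluate`, the default `threshold`, and the `max_rounds` safety cap are all assumptions introduced for illustration.

```python
def train_until_threshold(train_step, evaluate, threshold=0.9, max_rounds=20):
    """Repeat training until evaluated accuracy reaches the reference value."""
    for round_no in range(1, max_rounds + 1):
        train_step()                 # one round of (re)training on the answer dataset
        accuracy = evaluate()        # performance verification
        if accuracy >= threshold:
            return round_no, accuracy   # the caller would store the model here
    # The text describes repeating until the reference value is met; the cap is
    # an added safeguard against a model that never converges.
    raise RuntimeError("accuracy did not reach the reference value")

# Toy stand-in for a model whose accuracy improves by 0.25 per training round.
state = {"accuracy": 0.0}

def toy_train():
    state["accuracy"] += 0.25

def toy_evaluate():
    return state["accuracy"]

print(train_until_threshold(toy_train, toy_evaluate, threshold=0.9))
```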

The correct-answer dataset is written using slang expressions and a schema based on the 5W1H principle. In many cases an entry is annotated with 5W1H-style tags, for example: 'Oppa (deliverer) will buy (delivery) you loads (delivery amount) of meat (cannabis) in front of Hongdae (place) today (time).'

FIG. 4 is a flowchart illustrating a method of extracting the latest crime intelligence and a method of performing per-document analysis.

The latest crime intelligence extraction method of FIG. 4 includes a step of searching for documents similar to unsolved cases, a step of grouping the documents into web monitoring documents and documents similar to unsolved cases, a crime intelligence extraction step, a per-document analysis step, and an automatic report generation step.

A similarity analysis algorithm such as TF-IDF can be used to search for documents similar to unsolved cases. TF-IDF is an algorithm that calculates the similarity between sentences from how frequently a word occurs and how specific the word is.

Documents similar to unsolved cases are searched for in the data extracted from the unsolved case DB and the web (general web and dark web) monitoring data DB. The documents are grouped into web monitoring documents and documents similar to unsolved cases, and crime intelligence is then extracted using the crime intelligence extraction model. This extraction covers both overall crime intelligence extraction from the data collected in the current cycle and crime intelligence extraction for those cases in the current cycle's data that resemble unsolved cases.

The per-document analysis method of FIG. 4 includes a document preprocessing step, a morphological analysis and unanalyzed word detection step, a named entity analysis and unanalyzed word detection step, a syntactic analysis step, a semantic role analysis step, and a document summarization step. Natural language processing and artificial intelligence technology can be used for the data analysis. The document summary condenses the document into 5W1H information based on the results of syntactic analysis and semantic role analysis.

An automatic report is generated from the crime intelligence data extracted as above and the analysis results for the collected data. The report follows a predefined format and lists the crime intelligence data extracted from the document and the unanalyzed words detected in it. The report is stored in the crime intelligence report DB and, after verification by the user (expert), is used to support the latest crime intelligence, to support crime intelligence on unsolved cases, and as material for generating training data.
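A report that lists the extracted crime intelligence and the unanalyzed words in a fixed layout might be produced as in the sketch below. The actual predefined report format is not given in the description, so the layout, the function name `build_report`, and the sample values are assumptions.

```python
def build_report(doc_id, intelligence, unanalyzed_words):
    """intelligence: list of (tag, value) pairs; unanalyzed_words: list of strings."""
    lines = [f"Document: {doc_id}", "Extracted crime intelligence:"]
    lines += [f"  - {tag}: {value}" for tag, value in intelligence]
    lines.append("Unanalyzed words:")
    # Fall back to a placeholder entry when no unanalyzed words were detected.
    lines += [f"  - {word}" for word in unanalyzed_words] or ["  - (none)"]
    return "\n".join(lines)

print(build_report(
    "doc-1",
    [("time", "around 10 p.m."), ("place", "in front of Exit 4")],
    ["jakdaegi"],
))
```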

FIG. 5 is a flowchart illustrating a method of verifying the results produced by the crime intelligence extraction model and retraining the model. The user (expert) reviews the automatically generated report submitted for verification, corrects any errors, and regenerates it as a correct-answer document on which the system is retrained, so that results become more accurate over time. This process includes a step of requesting verification from the user (expert), a step of generating verified training data through the user's (expert's) verification, a step of adding entries to the unanalyzed word dictionary, and a step of retraining the crime intelligence extraction model.

In summary, crime intelligence is extracted from crime intelligence data by the crime intelligence extraction model, and the report written from it is delivered to an expert. After the expert's verification, it is regenerated as a correct-answer document with errors corrected and fed back as retraining data for generating the crime intelligence extraction model. This virtuous cycle takes place within the crime intelligence extraction model building stage of the present invention itself.

FIG. 6 is a flowchart illustrating a method of determining whether crime intelligence is true. It includes a data analysis step using natural language processing technology; a step of searching for fact data similar to the data; a fact data analysis step using natural language processing technology when similar data exists; a step of comparing the data under analysis with the fact data; and a step of calculating the final fake probability for the crime intelligence data once no similar fact documents remain.

The data in the web monitoring data DB is analyzed using natural language processing technology, and fact data similar to it is searched for using a similarity analysis algorithm such as TF-IDF.

What happens next depends on whether similar data exists. If no similar document exists, the web monitoring data is considered very likely to be fake, so calculating a fake probability is judged meaningless; the data is classified as 'judgment deferred' and is automatically submitted to experts and the dedicated team for review, or reanalysis is performed. If a similar document exists, the web monitoring data and the similar document are compared using natural language processing and artificial intelligence technology in order to calculate the fake probability of the web monitoring data; when differing content is detected, the fake probability is calculated according to the degree of difference, and the document is then stored as evidence.

If similar fact documents remain that have not yet been compared, each is compared in turn, and the fake probability is calculated, or the previously calculated fake probability is updated, according to the degree to which differing content is detected. The same loop is repeated until no similar fact documents remain, at which point the final fake probability for the crime intelligence data is calculated.

Fake news identification is used to determine whether crime intelligence is true, and the fake news detector operates in the same way as described above.

10: selection unit
20: collection unit
30: generation unit
40: extraction unit
50: determination unit
110: general web
120: dark web
130: fact collection

Claims

A crime intelligence detection system through web monitoring, comprising:
a selection unit that selects targets to be monitored for crime intelligence detection;
a collection unit that collects crime intelligence from the targets selected by the selection unit;
a crime intelligence extraction model generation unit that verifies the crime intelligence collected by the collection unit and trains on it, and corrects and deletes data and retrains;
a crime intelligence extraction unit that extracts crime intelligence through data learning and analysis using the crime intelligence extraction model; and
a determination unit that determines whether the crime intelligence extracted by the crime intelligence extraction unit is true,
wherein the crime intelligence extraction model is a schema based on the 5W1H principle, is generated as existing and new crime models, and is trained on a correct-answer dataset.

The crime intelligence detection system through web monitoring of claim 1,
wherein the selection unit groups the selection targets into general web targets, dark web targets, and fact collection targets.

The crime intelligence detection system through web monitoring of claim 1,
wherein the collection unit avoids access blocking through distributed collection by a plurality of collectors, and collects efficiently by performing incremental indexing.

The crime intelligence detection system through web monitoring of claim 1,
wherein the crime intelligence extraction model generation unit generates training data annotated with answer tags from the collected crime intelligence data, creates correct-answer documents for AI training, and extracts and analyzes crime intelligence from the crime intelligence data.

The crime intelligence detection system through web monitoring of claim 1,
wherein the extraction unit:
extracts, through the crime intelligence extraction model, crime intelligence including unsolved cases from the data collected by the collection unit and from the data found by unsolved case information search and similar document information search;
analyzes morphemes and detects unanalyzed words for each item of collected data using natural language processing and artificial intelligence; and
writes a report based on the extracted crime intelligence, the analyzed morphemes, and the detected unanalyzed words.

The crime intelligence detection system through web monitoring of claim 1,
wherein the determination unit performs:
similar fact data search, which searches the fact DB for data similar to the web monitoring data in order to determine whether the extracted crime intelligence is true;
fake news detection, which analyzes the web monitoring data and similar fact documents using natural language processing technology and artificial intelligence technology and calculates the probability that the web monitoring data is false; and
result provision, which stores the supporting evidence each time the fake probability rises and provides the final result to the user.

A method of detecting crime intelligence through web monitoring, comprising:
a selection step of selecting targets to be monitored for crime intelligence detection;
a collection step of collecting crime intelligence from the selected targets;
a crime intelligence extraction model generation step of verifying the collected crime intelligence and training on it, and of correcting and deleting data and retraining;
a crime intelligence extraction step based on data learning and analysis; and
a determination step of determining whether the crime intelligence is true,
wherein the crime intelligence extraction model is generated as a schema based on the 5W1H principle and is trained on a correct-answer dataset.

The method of claim 7,
wherein the selection and collection steps include setting the sites to be monitored, setting a collection schedule, and then extracting index terms and inverted-index terms from the collected fact data, general web data, and dark web data and performing incremental indexing.

The method of claim 7,
wherein the crime intelligence extraction model generation step consists of a crime intelligence extraction model construction algorithm and a result verification and retraining algorithm,
the crime intelligence extraction model construction algorithm includes a step of securing existing and new crime intelligence data, a crime intelligence tagging step, a step of generating a correct-answer dataset for training, a step of training the crime intelligence extraction model, and a step of storing the crime intelligence extraction model once it has passed extraction model performance verification, and
the result verification and retraining algorithm includes a verification request step, a step of generating verified training data through user verification, a step of adding entries to the unanalyzed word dictionary, and a step of performing additional training of the crime intelligence extraction model.

The method of claim 7,
wherein the extraction step consists of a data analysis and latest crime intelligence extraction algorithm and a per-document analysis algorithm,
the latest crime intelligence extraction algorithm includes a step of searching for documents similar to unsolved cases, a step of grouping the documents into web monitoring documents and documents similar to unsolved cases, a crime intelligence extraction step, a per-document analysis step, and an automatic report generation step, and
the per-document analysis algorithm includes a document preprocessing step, a morphological analysis and unanalyzed word detection step, a named entity analysis and unanalyzed word detection step, a syntactic analysis step, a semantic role analysis step, and a document summarization step.

The method of claim 7,
wherein the determination step includes a data analysis step using natural language processing technology, a step of searching for fact data similar to the data, and a fact data analysis step using natural language processing technology when similar data exists;
a step of comparing the data currently under analysis with the fact data; and
a step of calculating the final fake probability for the crime intelligence data when no similar fact documents remain.