KR102073833B1

KR102073833B1 - Electronic device capable of searching for a similar file with respect to a reference file based on distribution information of features of each of the plurality of files and operating method thereof

Info

Publication number: KR102073833B1
Application number: KR1020190139908A
Authority: KR
Inventors: 이미영
Original assignee: (주)키온비트
Priority date: 2019-11-05
Filing date: 2019-11-05
Publication date: 2020-02-05
Also published as: WO2021091124A1

Abstract

Disclosed are an electronic device capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, and an operating method thereof. For each of a plurality of predetermined files, the device extracts n features by dividing the bit string that constitutes the file's data at the points where a preset data pattern exists, obtains n hash values corresponding to the n features, extracts at least one unique hash value so that no hash value is duplicated, and generates distribution information on the frequency with which each unique hash value occurs. It then calculates the similarity between the reference file and the remaining files among the plurality of files based on this distribution information, thereby supporting the search for files similar to the reference file.

Description

ELECTRONIC DEVICE CAPABLE OF SEARCHING FOR A SIMILAR FILE WITH RESPECT TO A REFERENCE FILE BASED ON DISTRIBUTION INFORMATION OF FEATURES OF EACH OF THE PLURALITY OF FILES AND OPERATING METHOD THEREOF

The present invention relates to an electronic device, and an operating method thereof, capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files.

As technology advances and cybercrime involving information and communications increases rapidly, digital forensics plays an important role in criminal investigation. Digital forensics is an investigative technique that looks for clues to a crime by analyzing the digital information remaining on storage media such as PCs, laptops, and mobile phones, or on the Internet.

In such digital forensic investigations, investigators need to examine a considerable amount of digital information to collect the key evidence of a crime. Checking that much information item by item takes too much time and effort to be efficient, so the digital information to be analyzed in detail should first be gathered through keyword search or similar means.

In this regard, to efficiently extract the necessary information from a large amount of digital information, a technique that groups the many files produced by devices such as storage media according to similar content could be used.

Conventionally, to classify a large number of files into groups of similar content, either natural language processing (NLP) was used, or the text contained in a document file was converted into n-grams (Ngram) and the frequencies of the n-grams present in the document file were compared.

However, these approaches have limitations: the files to be classified must be suited to language-related processing, or their encodings must be identical. This made them poorly suited to digital forensic work, which analyzes many types of files produced by many kinds of digital devices.

Therefore, for efficient digital forensic investigation, research is needed on a technique that extracts distribution information of features for each of a plurality of files and uses it to search for files similar to a specific file.

Japanese Laid-Open Patent Publication No. 2015-201042 (2015.11.12.)
Japanese Patent No. 5598925 (2014.10.01.)
Korean Registered Patent Publication No. 10-0895102 (2009.04.20.)

The electronic device and operating method according to the present invention, capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, aim to support the search for files similar to the reference file as follows. For each of a plurality of predetermined files, n features are extracted by dividing the bit string constituting the data at the points where a preset data pattern exists. From the n hash values corresponding to the n features, distribution information is generated on the frequency with which each of at least one unique hash value, extracted so that no hash value is duplicated, occurs. The similarity between the reference file and the remaining files among the plurality of files is then calculated based on this distribution information.

An electronic device according to an embodiment of the present invention, capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, includes: a feature extraction unit that, when a similar-file search command for a reference file, which is one of a plurality of predetermined files, is received from a user, extracts from each of the plurality of files n features (n being a natural number of 2 or more), where the n features are n partial bit strings generated by dividing the bit string constituting the data of each file at the points where a preset (predetermined) data pattern exists; a hash value generation unit that applies the n features of each file as input to a preset hash function to generate n hash values for each file; a counting unit that, for each file, extracts from the n hash values at least one unique hash value such that no hash value is duplicated, and then counts the frequency with which each unique hash value occurs among the n hash values; a distribution information generation unit that, for each file, sorts the frequencies of the at least one unique hash value in ascending order and generates distribution information on those frequencies based on the sorted frequencies; a similarity calculation unit that calculates the similarity between the distribution information corresponding to the reference file and the distribution information corresponding to each of the remaining files other than the reference file; a file storage unit that selects, from the remaining files, at least one similar file whose similarity to the reference file is equal to or greater than a preset reference value and stores the at least one similar file in a file repository; and a similar file display unit that, in response to the similar-file search command for the reference file received from the user, displays on a screen a similar file list composed of the at least one similar file.

In addition, an operating method of an electronic device according to an embodiment of the present invention, capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, includes: when a similar-file search command for a reference file, which is one of a plurality of predetermined files, is received from a user, extracting from each of the plurality of files n features, where the n features are n partial bit strings generated by dividing the bit string constituting the data of each file at the points where a preset data pattern exists; applying the n features of each file as input to a preset hash function to generate n hash values for each file; for each file, extracting from the n hash values at least one unique hash value such that no hash value is duplicated, and then counting the frequency with which each unique hash value occurs among the n hash values; for each file, sorting the frequencies of the at least one unique hash value in ascending order and generating distribution information on those frequencies based on the sorted frequencies; calculating the similarity between the distribution information corresponding to the reference file and the distribution information corresponding to each of the remaining files other than the reference file; selecting, from the remaining files, at least one similar file whose similarity to the reference file is equal to or greater than a preset reference value and storing the at least one similar file in a file repository; and, in response to the similar-file search command for the reference file received from the user, displaying on a screen a similar file list composed of the at least one similar file.

The electronic device and operating method according to the present invention, capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, can support the search for files similar to the reference file as follows. For each of a plurality of predetermined files, n features are extracted by dividing the bit string constituting the data at the points where a preset data pattern exists. From the n hash values corresponding to the n features, distribution information is generated on the frequency with which each of at least one unique hash value, extracted so that no hash value is duplicated, occurs. The similarity between the reference file and the remaining files among the plurality of files is then calculated based on this distribution information.

FIG. 1 is a diagram illustrating the structure of an electronic device capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, according to an embodiment of the present invention.
FIG. 2 is a diagram for describing an electronic device capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an operating method of an electronic device capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. This description is not intended to limit the invention to specific embodiments, and should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the invention. In describing the drawings, similar reference numerals are used for similar components. Unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains.

In this document, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, in various embodiments of the present invention, each component, functional block, or means may consist of one or more subcomponents. The electrical, electronic, and mechanical functions performed by each component may be implemented with various known elements or mechanical parts such as electronic circuits, integrated circuits, and ASICs (Application Specific Integrated Circuits), and the components may be implemented separately or with two or more integrated into one.

Meanwhile, the blocks of the accompanying block diagram and the steps of the flowchart may be interpreted as computer program instructions that are loaded into the processor or memory of equipment capable of data processing, such as a general-purpose computer, a special-purpose computer, a portable notebook computer, or a network computer, and that perform specified functions. Because these computer program instructions can be stored in a memory provided in a computer device or in a computer-readable memory, the functions described in the blocks of the block diagram or the steps of the flowchart may also be produced as an article of manufacture containing instruction means that perform them. In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that in some alternative embodiments the functions mentioned in the blocks or steps may be executed out of the stated order. For example, two blocks or steps shown in succession may be performed substantially concurrently or in reverse order, and in some cases some blocks or steps may be omitted.

FIG. 1 is a diagram illustrating the structure of an electronic device capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, according to an embodiment of the present invention.

Referring to FIG. 1, the electronic device 110 according to an embodiment of the present invention, capable of searching for files similar to a reference file based on distribution information of the features of each of a plurality of files, includes a feature extraction unit 111, a hash value generation unit 112, a counting unit 113, a distribution information generation unit 114, a similarity calculation unit 115, a file storage unit 116, and a similar file display unit 117.

When a similar-file search command for a reference file, which is one of a plurality of predetermined files, is received from a user, the feature extraction unit 111 extracts from each of the plurality of files n features for that file (n being a natural number of 2 or more).

Here, the n features are the n partial bit strings generated by dividing the bit string constituting the data of each of the plurality of files at the points where a preset (predetermined) data pattern exists.

For example, assume that the plurality of files includes 'File 1' with a size of '100KB (102400B)' and that the preset data pattern is '0000000000000' (a value whose last 13 bits are 0). When a similar-file search command for the reference file, which is one of the plurality of files, is received from the user, the feature extraction unit 111 first divides the bit string constituting the data of 'File 1' at the points where '0000000000000' exists, and thereby extracts n features for 'File 1'.

If the points where '0000000000000' exists in the bit string constituting the data of 'File 1' are as shown in Table 1 below, the feature extraction unit 111 can extract '6' features for 'File 1'.

[Table 1: feature numbers and offsets for 'File 1'; not reproduced in this text]

In Table 1 above, offset means the size of the data (in bytes) from the start of the bit string constituting the data of each of the plurality of files up to the point where the preset data pattern '0000000000000' is found.

In this manner, the feature extraction unit 111 can extract n features for each of the plurality of files, and the number of features extracted may differ from file to file.
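The splitting step described above can be sketched in Python. This is only an illustrative sketch, not the patented implementation: the function names `split_features` and `to_bit_string` are hypothetical, and the choice to place each boundary immediately after a non-overlapping occurrence of the pattern (so the pattern stays at the end of the preceding feature) is an assumption, since the text only says the bit string is divided at the points where the pattern exists.

```python
def split_features(bits: str, pattern: str) -> list[str]:
    """Split a bit string into features at each occurrence of `pattern`.

    Assumption (not stated in the source): a boundary is placed right
    after each non-overlapping occurrence of the pattern.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    features = []
    start = 0
    while True:
        hit = bits.find(pattern, start)
        if hit == -1:
            break
        end = hit + len(pattern)
        features.append(bits[start:end])
        start = end
    if start < len(bits):
        # Whatever follows the last boundary forms the final feature.
        features.append(bits[start:])
    return features


def to_bit_string(data: bytes) -> str:
    """Render bytes as a string of '0'/'1' characters, most significant bit first."""
    return "".join(f"{byte:08b}" for byte in data)


# A short pattern is used here so the example stays readable;
# the example in the text uses thirteen zero bits.
features = split_features("1100011000101", "000")
print(features)  # ['11000', '11000', '101']
```

Because the boundaries depend only on the content, two files that share long runs of identical data tend to yield identical features for those runs, and the number of features naturally varies from file to file.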

The hash value generation unit 112 applies the n features of each of the plurality of files as input to a preset hash function to generate n hash values for each of the plurality of files.

If, as in the preceding example, '6' features have been extracted for 'File 1', the hash value generation unit 112 can generate '6' hash values for 'File 1' by applying the '6' features as input to the preset hash function.

If, among the '6' features shown in Table 1, features 1, 3, and 5 are identical to one another and features 4 and 6 are identical to each other, the hash value generation unit 112 can generate 'H1', 'H2', 'H1', 'H3', 'H1', and 'H3' as the hash values corresponding to the '6' features.

In the same way as for 'File 1', the hash value generation unit 112 can generate n hash values for each of the plurality of files.
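As a minimal sketch of this hashing step: the text only refers to a "preset hash function", so the use of SHA-256 from Python's `hashlib` below is an assumption, as are the helper name `hash_features` and the placeholder feature strings. The point it illustrates is that identical features always map to identical hash values.

```python
import hashlib


def hash_features(features: list[str]) -> list[str]:
    """Hash each feature (a '0'/'1' string) independently.

    SHA-256 is an assumed stand-in for the unspecified preset hash function.
    """
    return [hashlib.sha256(f.encode("ascii")).hexdigest() for f in features]


# Six placeholder features where 1, 3, 5 are identical and 4, 6 are identical,
# mirroring the 'File 1' example in the text.
features = ["1010", "0110", "1010", "0011", "1010", "0011"]
hashes = hash_features(features)

print(hashes[0] == hashes[2] == hashes[4])  # True
print(hashes[3] == hashes[5])               # True
print(len(set(hashes)))                     # 3
```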

For each of the plurality of files, the counting unit 113 extracts from the n hash values at least one unique hash value such that no hash value is duplicated, and then counts the frequency with which each unique hash value occurs among the n hash values.

For example, assuming as described above that 'File 1' has the '6' hash values 'H1', 'H2', 'H1', 'H3', 'H1', and 'H3', the counting unit 113 can extract 'H1', 'H2', and 'H3' from these '6' hash values as the unique hash values for 'File 1', with no hash value duplicated.

The counting unit 113 can then count the frequency with which each of the unique hash values 'H1', 'H2', and 'H3' occurs among the '6' hash values 'H1', 'H2', 'H1', 'H3', 'H1', and 'H3'.

That is, among the '6' hash values, the counting unit 113 counts the frequency of 'H1' as '3', the frequency of 'H2' as '1', and the frequency of 'H3' as '2'. In the same way, for each of the plurality of files, it can count the frequency with which each unique hash value occurs among the n hash values.
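The counting step maps directly onto a frequency table. The sketch below uses `collections.Counter` from the Python standard library as one possible way to do this; the labels 'H1' to 'H3' are the placeholder hash values from the example, not real digests.

```python
from collections import Counter

# The six hash values of 'File 1' from the example in the text.
hashes = ["H1", "H2", "H1", "H3", "H1", "H3"]

counts = Counter(hashes)

# The keys are the unique hash values, in order of first appearance.
unique_hashes = list(counts)
print(unique_hashes)  # ['H1', 'H2', 'H3']
print(dict(counts))   # {'H1': 3, 'H2': 1, 'H3': 2}
```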

For each of the plurality of files, the distribution information generation unit 114 sorts the frequencies of the at least one unique hash value in ascending order and then generates distribution information on those frequencies based on the sorted frequencies.

For example, assume as in the preceding example that the frequencies '3', '1', and '2' have been counted for the unique hash values 'H1', 'H2', and 'H3' of 'File 1'. The distribution information generation unit 114 can then sort these frequencies in the ascending order '1', '2', '3' and thereby generate the distribution information on the frequencies of the unique hash values shown at reference numeral 210 in FIG. 2.

In this manner, the distribution information generation unit 114 can generate distribution information on the frequencies of the at least one unique hash value for each of the plurality of files.
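A short sketch of this step, assuming the distribution information can be represented simply as the ascending list of frequencies (the text does not fix a concrete data structure, and the function name `frequency_distribution` is hypothetical). Note that sorting discards which hash value produced which frequency, so only the shape of the distribution is kept.

```python
def frequency_distribution(counts: dict[str, int]) -> list[int]:
    """Return the frequencies of the unique hash values in ascending order."""
    return sorted(counts.values())


# Frequencies of H1, H2, H3 for 'File 1' from the example in the text.
distribution = frequency_distribution({"H1": 3, "H2": 1, "H3": 2})
print(distribution)  # [1, 2, 3]
```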

According to an embodiment of the present invention, the distribution information generation unit 114 may include a normalization unit 118.

When the distribution information on the frequencies of the at least one unique hash value has been generated for each of the plurality of files, the normalization unit 118 extracts the maximum frequency and the minimum frequency among those frequencies. Based on the maximum and minimum frequencies, it then performs a normalization operation on the frequency of each unique hash value, thereby normalizing the distribution information on the frequencies of the at least one unique hash value.

According to an embodiment of the present invention, the normalization unit 118 may normalize the distribution information on the frequencies of the at least one unique hash value by performing, based on the maximum frequency and the minimum frequency, the normalization operation of Equation 1 below on the frequency of each unique hash value.

[Equation 1]

$$\hat{a}_i = \frac{a_i - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}$$

Here, $\hat{a}_i$ is the normalized value of $a_i$, $a_i$ is the frequency of the i-th hash value among the at least one unique hash value sorted in ascending order, Min is the minimum frequency among the frequencies of the at least one unique hash value, and Max is the maximum frequency among those frequencies.

For example, when the distribution information on the frequencies of the unique hash values has been generated for 'File 1' as shown at reference numeral 210 in FIG. 2, the normalization unit 118 extracts the maximum frequency '3' and the minimum frequency '1' from those frequencies. Based on the maximum frequency '3' and the minimum frequency '1', it can then perform the normalization operation of Equation 1 on the frequency of each unique hash value, thereby normalizing the distribution information.

Specifically, the normalization unit 118 performs the normalization operation of Equation 1 on the frequencies '1', '2', and '3' of the unique hash values to obtain the normalized values '0', '0.5', and '1', thereby normalizing the distribution information on the frequencies of the unique hash values as shown at reference numeral 220 in FIG. 2.

Likewise, for each of the plurality of files, the normalization unit 118 can normalize the distribution information on the frequencies of the at least one unique hash value by performing the normalization operation on the frequency of each unique hash value based on the maximum and minimum frequencies.
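The normalization can be sketched as ordinary min-max scaling, which is an assumption inferred from the description (a computation based on the maximum and minimum frequencies) and from the worked values above, where '1', '2', '3' become '0', '0.5', '1'. The function name `normalize` is hypothetical, and the handling of the case where all frequencies are equal is not specified in the text; returning zeros there is purely an illustrative choice.

```python
def normalize(freqs: list[int]) -> list[float]:
    """Min-max scale a list of frequencies into the range [0, 1]."""
    lo, hi = min(freqs), max(freqs)
    if hi == lo:
        # Assumption: the source does not cover this case; zeros are
        # returned here only to avoid dividing by zero.
        return [0.0 for _ in freqs]
    return [(f - lo) / (hi - lo) for f in freqs]


print(normalize([1, 2, 3]))  # [0.0, 0.5, 1.0]
```

Scaling each file's frequencies to the same range lets distributions from files of very different sizes be compared on a common footing.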

The similarity calculation unit 115 calculates the similarity between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files, that is, the plurality of files excluding the reference file.

In this case, according to an embodiment of the present invention, the similarity calculation unit 115 may calculate the similarity between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files according to Equation 2 below.

Here, the two terms compared in Equation 2 denote, for file a and for file b respectively, the normalized value of the frequency for the i-th hash value among the at least one unique hash value sorted in ascending order, and m denotes the smaller of the number of the at least one unique hash value for file a and the number of the at least one unique hash value for file b.

In this regard, assume that 'file 1' described in the above example is the reference file, and that 'file 2' among the remaining files has normalized distribution information on the frequencies of the at least one unique hash value of '0', '0.25', and '1'.

The similarity calculation unit 115 may calculate, according to Equation 2 above, the similarity between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file 'file 1', as indicated by reference numeral 220, and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files, that is, the plurality of files excluding the reference file.

Specifically, the similarity calculation unit 115 may calculate the similarity between {0, 0.5, 1}, which is the distribution information corresponding to the reference file 'file 1', and {0, 0.25, 1}, which is the distribution information corresponding to 'file 2' among the remaining files, as 0.917 according to Equation 3 below.
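Equations 2 and 3 are not reproduced in this text, so the following Python sketch rests on an assumption: it uses one minus the mean absolute difference of the first m normalized values, a form that is consistent with the definition of m and that reproduces the stated result of 0.917 for this example. The function name `similarity` and the error raised for empty input are illustrative choices, and the formula should not be read as the exact equation of the disclosure.

```python
def similarity(dist_a, dist_b):
    """Assumed form: 1 - (1/m) * sum(|a_i - b_i|) over the first m values."""
    # m is the smaller of the two numbers of unique hash values.
    m = min(len(dist_a), len(dist_b))
    if m == 0:
        raise ValueError("both distributions must contain at least one value")
    total = sum(abs(a - b) for a, b in zip(dist_a[:m], dist_b[:m]))
    return 1.0 - total / m


# 'file 1' (reference file) versus 'file 2' from the example.
score = similarity([0, 0.5, 1], [0, 0.25, 1])
print(round(score, 3))  # 0.917
```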

In this manner, the similarity calculation unit 115 may calculate the similarity between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files.

The file storage unit 116 selects, from among the remaining files, at least one similar file whose similarity to the reference file is equal to or greater than a preset reference value, and then stores the at least one similar file in a file storage.
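The selection by reference value can be sketched as a simple filter. In this Python sketch the file names, the scores other than 0.917, the threshold of 0.9, and the descending sort order are all hypothetical values chosen for illustration; the source only states that files with a similarity equal to or greater than a preset reference value are selected.

```python
def select_similar(scores, threshold):
    """Return the names of files whose similarity is >= threshold."""
    chosen = [(name, s) for name, s in scores.items() if s >= threshold]
    # Assumption: sorting by descending similarity is a presentation
    # choice of this sketch, not something the source specifies.
    chosen.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in chosen]


# Hypothetical similarity scores of the remaining files to the reference file.
scores = {"file 2": 0.917, "file 3": 0.40, "file 4": 0.95}
similar = select_similar(scores, threshold=0.9)
print(similar)  # ['file 4', 'file 2']
```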

In this case, according to an embodiment of the present invention, the file storage unit 116 may convert the at least one similar file into a disk image file format and store it in the file storage.

Here, a disk image file format refers to a file format used to duplicate an entire hard disk. Image formats used in forensics include EWF (Expert Witness Compression Format) and AFF (Advanced Forensics Format). To preserve the original state of the storage medium, such a disk image file format does not simply copy files to the copy storage medium but replicates every physical sector of the original storage medium to the copy storage medium.

That is, the file storage unit 116 converts the at least one similar file into a disk image file format and stores it in the file storage, so that the file can later be used by digital forensic tools.

The similar file display unit 117 displays, on a screen, a similar file list composed of the at least one similar file in response to the similar file search command for the reference file received from the user.

That is, when the similar file search command for the reference file is received from the user, the similar file display unit 117 displays the similar file list composed of the at least one similar file on the screen, thereby providing the user with information on the files, among the plurality of files, that are similar to the reference file.

FIG. 3 is a flowchart illustrating an operating method of an electronic device capable of searching for a similar file with respect to a reference file based on distribution information of features of each of a plurality of files according to an embodiment of the present invention.

In step S310, when a similar file search command for a reference file, which is one of a plurality of predetermined files, is received from a user, n features are extracted from each of the plurality of files (the n features being n partial bit strings generated by dividing the bit string constituting the data of each of the plurality of files at the points where a preset data pattern exists).

In step S320, the n features of each of the plurality of files are applied as inputs to a preset hash function to generate n hash values for each of the plurality of files.

In step S330, for each of the plurality of files, at least one unique hash value is extracted from the n hash values such that no hash value is duplicated, and the frequency with which each of the at least one unique hash value occurs among the n hash values is then counted.

In step S340, for each of the plurality of files, the frequencies of the at least one unique hash value are sorted in ascending order, and distribution information on the frequencies of the at least one unique hash value is then generated based on the frequencies sorted in ascending order.

In step S350, the similarity is calculated between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files, that is, the plurality of files excluding the reference file.

In step S360, at least one similar file whose similarity to the reference file is equal to or greater than a preset reference value is selected from among the remaining files, and the at least one similar file is then stored in a file storage.

In step S370, a similar file list composed of the at least one similar file is displayed on a screen in response to the similar file search command for the reference file received from the user.
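Steps S310 to S340 above can be sketched in Python as follows. Several details are assumptions made for illustration only: the sketch works on bytes rather than individual bits, places each boundary immediately after an occurrence of the preset pattern, uses SHA-256 as the preset hash function, and uses the hypothetical names `extract_features` and `frequency_distribution`. The source does not specify any of these choices.

```python
import hashlib
from collections import Counter


def extract_features(data, pattern):
    """S310: split data right after each occurrence of the preset pattern."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    features, start = [], 0
    while True:
        idx = data.find(pattern, start)
        if idx == -1:
            break
        end = idx + len(pattern)
        features.append(data[start:end])
        start = end
    if start < len(data):
        features.append(data[start:])  # trailing bytes after the last boundary
    return features


def frequency_distribution(features):
    """S320 to S340: hash features, count unique hashes, sort the counts."""
    hashes = [hashlib.sha256(f).hexdigest() for f in features]  # S320
    counts = Counter(hashes)                                    # S330
    return sorted(counts.values())                              # S340


# Hypothetical file content; the zero byte stands in for the preset pattern.
data = b"A\x00B\x00B\x00C\x00C\x00C\x00"
features = extract_features(data, b"\x00")
print(len(features))                     # 6
print(frequency_distribution(features))  # [1, 2, 3]
```

The resulting frequencies 1, 2, and 3 match the 'file 1' example, so the output can be passed on to the normalization and similarity steps.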

In this case, according to an embodiment of the present invention, step S340 may include, when the distribution information on the frequencies of the at least one unique hash value is generated for each of the plurality of files, extracting the maximum frequency and the minimum frequency from the frequencies of the at least one unique hash value, and then normalizing the distribution information on the frequencies of the at least one unique hash value by performing a normalization operation on the frequency of each of the at least one unique hash value based on the maximum frequency and the minimum frequency.

In addition, in the normalizing of the distribution information on the frequencies of the at least one unique hash value, the distribution information may be normalized by performing, based on the maximum frequency and the minimum frequency, the normalization operation according to Equation 1 above on the frequency of each of the at least one unique hash value.

In this case, according to an embodiment of the present invention, in step S350, the similarity between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files may be calculated according to Equation 2 above.

In addition, according to an embodiment of the present invention, in step S360, the at least one similar file may be converted into a disk image file format and stored in the file storage.

The operating method of the electronic device capable of searching for a similar file with respect to a reference file based on distribution information of features of each of a plurality of files according to an embodiment of the present invention has been described above with reference to FIG. 3. Because this operating method may correspond to the configuration for the operation of the electronic device 110 described with reference to FIGS. 1 and 2, a more detailed description thereof is omitted.

The operating method of the electronic device capable of searching for a similar file with respect to a reference file based on distribution information of features of each of a plurality of files according to an embodiment of the present invention may be implemented as a computer program stored in a storage medium for execution in combination with a computer.

In addition, the operating method of the electronic device capable of searching for a similar file with respect to a reference file based on distribution information of features of each of a plurality of files according to an embodiment of the present invention may be implemented in the form of computer program instructions for execution in combination with a computer and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

As described above, the present invention has been described with reference to specific matters such as specific components, limited embodiments, and drawings, but these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains may make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the described embodiments, and not only the claims set forth below but also all equivalents or equivalent modifications of those claims fall within the scope of the spirit of the present invention.

110: electronic device capable of searching for a similar file with respect to a reference file based on distribution information of features of each of a plurality of files
111: feature extraction unit 112: hash value generation unit
113: counting unit 114: distribution information generation unit
115: similarity calculation unit 116: file storage unit
117: similar file display unit 118: normalization unit

Claims

An electronic device capable of searching for a similar file with respect to a reference file based on distribution information of features of each of a plurality of files, the electronic device comprising:
a feature extraction unit configured to, when a similar file search command for a reference file, which is one of a plurality of predetermined files, is received from a user, extract n features (n being a natural number of two or more) from each of the plurality of files, the n features being n partial bit strings generated by dividing the bit string constituting the data of each of the plurality of files at the points where a preset data pattern exists;
a hash value generation unit configured to generate n hash values for each of the plurality of files by applying the n features of each of the plurality of files as inputs to a preset hash function;
a counting unit configured to, for each of the plurality of files, extract from the n hash values at least one unique hash value such that no hash value is duplicated, and then count the frequency with which each of the at least one unique hash value occurs among the n hash values;
a distribution information generation unit configured to, for each of the plurality of files, sort the frequencies of the at least one unique hash value in ascending order and then generate distribution information on the frequencies of the at least one unique hash value based on the frequencies sorted in ascending order;
a similarity calculation unit configured to calculate the similarity between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files, which are the plurality of files excluding the reference file;
a file storage unit configured to select, from among the remaining files, at least one similar file whose similarity to the reference file is equal to or greater than a preset reference value, and then store the at least one similar file in a file storage; and
a similar file display unit configured to display, on a screen, a similar file list composed of the at least one similar file in response to the similar file search command for the reference file received from the user,
wherein the distribution information generation unit comprises
a normalization unit configured to, when the distribution information on the frequencies of the at least one unique hash value is generated for each of the plurality of files, extract the maximum frequency and the minimum frequency from the frequencies of the at least one unique hash value, and then normalize the distribution information on the frequencies of the at least one unique hash value by performing a normalization operation on the frequency of each of the at least one unique hash value based on the maximum frequency and the minimum frequency.

(Deleted)

The electronic device of claim 1,
wherein the normalization unit
normalizes the distribution information on the frequencies of the at least one unique hash value by performing, based on the maximum frequency and the minimum frequency, a normalization operation according to Equation 1 below on the frequency of each of the at least one unique hash value.
[Equation 1]
$$\hat{a}_i = \frac{a_i - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}$$
where $\hat{a}_i$ is the normalized value of $a_i$, $a_i$ is the frequency of the i-th hash value among the at least one unique hash value sorted in ascending order, Min is the minimum frequency among the frequencies of the at least one unique hash value, and Max is the maximum frequency among the frequencies of the at least one unique hash value.

The electronic device of claim 3,
wherein the similarity calculation unit
calculates the similarity between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files according to Equation 2 below.
[Equation 2]

where the two terms compared in Equation 2 denote, for file a and for file b respectively, the normalized value of the frequency for the i-th hash value among the at least one unique hash value sorted in ascending order, and m denotes the smaller of the number of the at least one unique hash value for file a and the number of the at least one unique hash value for file b.

The electronic device of claim 1,
wherein the file storage unit
converts the at least one similar file into a disk image file format and stores it in the file storage.

A method of operating an electronic device capable of searching for a similar file with respect to a reference file based on distribution information of features of each of a plurality of files, the method comprising:
when a similar file search command for a reference file, which is one of a plurality of predetermined files, is received from a user, extracting n features (n being a natural number of two or more) from each of the plurality of files, the n features being n partial bit strings generated by dividing the bit string constituting the data of each of the plurality of files at the points where a preset data pattern exists;
generating n hash values for each of the plurality of files by applying the n features of each of the plurality of files as inputs to a preset hash function;
for each of the plurality of files, extracting from the n hash values at least one unique hash value such that no hash value is duplicated, and then counting the frequency with which each of the at least one unique hash value occurs among the n hash values;
for each of the plurality of files, sorting the frequencies of the at least one unique hash value in ascending order and then generating distribution information on the frequencies of the at least one unique hash value based on the frequencies sorted in ascending order;
calculating the similarity between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files, which are the plurality of files excluding the reference file;
selecting, from among the remaining files, at least one similar file whose similarity to the reference file is equal to or greater than a preset reference value, and then storing the at least one similar file in a file storage; and
displaying, on a screen, a similar file list composed of the at least one similar file in response to the similar file search command for the reference file received from the user,
wherein the generating of the distribution information on the frequencies of the at least one unique hash value comprises
when the distribution information on the frequencies of the at least one unique hash value is generated for each of the plurality of files, extracting the maximum frequency and the minimum frequency from the frequencies of the at least one unique hash value, and then normalizing the distribution information on the frequencies of the at least one unique hash value by performing a normalization operation on the frequency of each of the at least one unique hash value based on the maximum frequency and the minimum frequency.

(Deleted)

The method of claim 6,
wherein the normalizing of the distribution information on the frequencies of the at least one unique hash value comprises
normalizing the distribution information on the frequencies of the at least one unique hash value by performing, based on the maximum frequency and the minimum frequency, a normalization operation according to Equation 1 below on the frequency of each of the at least one unique hash value.
[Equation 1]
$$\hat{a}_i = \frac{a_i - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}$$
where $\hat{a}_i$ is the normalized value of $a_i$, $a_i$ is the frequency of the i-th hash value among the at least one unique hash value sorted in ascending order, Min is the minimum frequency among the frequencies of the at least one unique hash value, and Max is the maximum frequency among the frequencies of the at least one unique hash value.

The method of claim 8,
wherein the calculating of the similarity between the distribution information on the frequencies of the at least one unique hash value comprises
calculating the similarity between the distribution information on the frequencies of the at least one unique hash value corresponding to the reference file and the distribution information on the frequencies of the at least one unique hash value corresponding to the remaining files according to Equation 2 below.
[Equation 2]

where the two terms compared in Equation 2 denote, for file a and for file b respectively, the normalized value of the frequency for the i-th hash value among the at least one unique hash value sorted in ascending order, and m denotes the smaller of the number of the at least one unique hash value for file a and the number of the at least one unique hash value for file b.

The method of claim 6,
wherein the storing in the file storage comprises
converting the at least one similar file into a disk image file format and storing it in the file storage.

A computer-readable recording medium having recorded thereon a computer program for executing, in combination with a computer, the method of any one of claims 6, 8, 9, and 10.

A computer program stored in a storage medium for executing, in combination with a computer, the method of any one of claims 6, 8, 9, and 10.