KR102550868B1

KR102550868B1 - verification system for achievements of faculty

Info

Publication number: KR102550868B1
Application number: KR1020210009072A
Authority: KR
Inventors: 강상길; 조명우; 이한음; 허청환
Original assignee: 인하대학교 산학협력단
Priority date: 2021-01-22
Filing date: 2021-01-22
Publication date: 2023-07-04
Also published as: KR20220106297A

Abstract

The present invention relates to a faculty achievement verification system. More specifically, it automates the entire process of verifying the achievement information entered by faculty members, which substantially reduces manpower, time, and cost, and it judges similarity rather than exact identity during verification, which makes the verification more accurate.
To this end, the invention comprises an information input step in which faculty members enter achievement information, an information extraction step in which achievement-related information is extracted from the web using the entered achievement information, and a data verification step in which the achievement-related information extracted in the information extraction step is compared with the entered achievement information for verification.

Description

Verification system for achievements of faculty

The present invention relates to a faculty achievement verification system. More specifically, it relates to a system that automates the entire process of verifying the achievement information entered by faculty members, thereby substantially reducing manpower, time, and cost, and that judges similarity rather than exact identity during verification so that the verification is more accurate.

To verify the achievements of faculty members employed at universities and similar institutions, the papers, patents, books, and translations reported by each faculty member must be looked up and checked one by one. Typically, the material to be checked amounts to about 3,000 or more papers, about 400 or more patents, and about 200 or more books or translations.

Verifying this volume of papers, patents, books, and translations requires many workers to visit each website and check the entries by hand. The work does not end in a single pass, because the results must also be cross-checked, so a large staff spends a considerable amount of time and the cost is correspondingly high.

Accordingly, performing this verification automatically with a dedicated system rather than by hand saves manpower, time, and cost.

One example of a technology for performing such verification automatically is described in Korean Patent Publication No. 10-2020-0082218, illustrated in FIGS. 1 and 2. Its technical features are the following steps, performed by a server that verifies the reliability of untrusted data: extracting keywords from a post that an individual has published on a social networking service (SNS); crawling news articles on the web that contain the keywords; and evaluating the reliability of the post based on the number of crawled news articles.

The technology of Korean Patent Publication No. 10-2020-0082218 automatically verifies the reliability of the content of posts uploaded to an SNS by crawling news articles published on the web. It has the advantage of verifying uploaded information automatically, but because it considers only cases in which the keywords match exactly, it cannot judge correctly when a keyword has been altered for any of various reasons.

Korean Patent Publication No. 10-2020-0082218 (published July 8, 2020); Korean Registered Patent No. 10-1153138 (registered May 29, 2012)

The present invention has been devised to solve the problems described above. One object of the invention is to provide a faculty achievement verification system in which, based on the achievement information entered by faculty members, an extraction module with a crawler function extracts information from the websites that hold it and the extracted information is compared automatically with the achievement information. The entire process can thus be performed automatically, which substantially reduces manpower, time, and cost.

Another object of the invention is to provide a faculty achievement verification system that increases the accuracy of the verification result. When the information extracted from each website by the extraction module is compared with the achievement information, a preprocessing step converts text in the Latin alphabet to lowercase and decomposes Korean text (Hangul) into consonants and vowels. Verification is then based on sequence matching, with additional credit given for continuity, so that the result reflects similarity.

To solve these problems, the present invention is characterized as follows.

It comprises an input achievement information DB storage step of storing the achievement information entered by faculty members in a faculty DB; an information extraction step of extracting achievement-related information from the web using the input achievement information; and a data verification step of comparing the achievement-related information extracted in the information extraction step with the input achievement information for verification.

Here, the input achievement information is information that each faculty member enters about his or her own papers, patents, books, or translations.

The information extraction step comprises an access step in which an extraction module provided in the server accesses a website from which information on papers, patents, books, or translations can be obtained; a data extraction step of extracting information from the accessed website by crawling; and a data trimming step of extracting only the achievement-related information from the information obtained in the data extraction step.

The data trimming step refers to the format of each website and removes from the extracted information any irrelevant characters, including special characters, that are not part of the achievement-related information.

The data verification step comprises a preprocessing step of converting the input achievement information stored in the faculty DB and the achievement-related information extracted in the information extraction step into a predetermined format, and a comparison step of comparing the preprocessed achievement-related information with the preprocessed input achievement information for verification.

In the preprocessing step, input achievement information and achievement-related information written in the Latin alphabet are converted to lowercase, and Korean text (Hangul) is separated into consonants and vowels.

The comparison step compares the input achievement information with the achievement-related information using the edit distance (Levenshtein distance) algorithm, one of the sequence matching methods, with a weight added for continuity.

The comparison step further judges similarity by the weighted edit distance value D(i,j) derived through the following procedure.

[Procedure]

1. If A[i] and B[j] of the strings being compared match:

D(i,j) = D(i-1,j-1);

if b == True, then w = w + 1;

if b != True, then b = True.

2. If A[i] and B[j] of the strings being compared do not match:

D(i,j) = min( D(i-1,j)+1/w, D(i,j-1)+1/w, D(i-1,j-1)+1/w );

b = False.

{ D(i,j) = weighted edit distance value, with initial value D(0,0) = 0,

A[i] = i-th character of string A,

B[j] = j-th character of string B,

w = continuity weight (initial value of w is 0),

b = Boolean used to judge continuity (initial value of b is False),

i,j = 0 to n }

According to the present invention configured as described above, an extraction module with a crawler function extracts information from the websites that hold it, based on the achievement information entered by faculty members, and the extracted information is compared automatically with the achievement information. The entire process can therefore be performed automatically, which substantially reduces manpower, time, and cost.

In addition, when the information extracted from each website by the extraction module is compared with the achievement information, a preprocessing step converts text in the Latin alphabet to lowercase and decomposes Korean text (Hangul) into consonants and vowels. Verification is based on sequence matching with additional credit given for continuity, so that it reflects similarity, which increases the accuracy of the verification result.

FIG. 1 is a schematic diagram of a conventional verification method based on crawling.
FIG. 2 is an example of a web page from which data including keywords is extracted in the conventional verification method based on crawling.
FIG. 3 is a conceptual diagram of the faculty achievement verification system according to the present invention.
FIG. 4 is a block diagram of the faculty achievement verification system according to the present invention.
FIG. 5 is a flowchart of the faculty achievement verification system according to the present invention.
FIG. 6 is an example of the ordinary Levenshtein distance, which is one sequence matching technique.
FIG. 7 is an example of a screen showing the results when verification is complete in the faculty achievement verification system according to the present invention.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted. It should be understood that the present invention may be embodied in many different forms and is not limited to the described embodiments.

FIG. 3 is a conceptual diagram of the faculty achievement verification system according to the present invention, FIG. 4 is a block diagram of the system, FIG. 5 is a flowchart of the system, FIG. 6 is an example of the ordinary Levenshtein distance, which is one sequence matching technique, and FIG. 7 is an example of a screen showing the results when verification is complete.

The present invention relates to a faculty achievement verification system. As shown in FIGS. 3 to 5, it comprises a step (S100) of storing the achievement information entered by faculty members in a faculty DB (110) provided in a server (100); an information extraction step (S200) of extracting achievement-related information from the web using the entered achievement information; and a data verification step (S300) of comparing the achievement-related information extracted in the information extraction step (S200) with the input achievement information for verification.
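The three steps above can be pictured as a simple loop over the stored entries. The following is only a minimal sketch: the names `Achievement`, `VerificationResult`, and `run_verification` are hypothetical and do not appear in the patent, and the extraction and comparison logic is passed in as functions because their details are described separately.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Achievement:
    """One entry of input achievement information (hypothetical structure)."""
    faculty_id: str
    kind: str    # "paper", "patent", "book", or "translation"
    title: str


@dataclass
class VerificationResult:
    achievement: Achievement
    extracted_title: str
    distance: float


def run_verification(
    faculty_db: List[Achievement],
    extract: Callable[[Achievement], str],
    compare: Callable[[str, str], float],
) -> List[VerificationResult]:
    """Run S200 (extraction) and S300 (verification) for every entry stored in S100."""
    results = []
    for entry in faculty_db:                         # entries stored in step S100
        extracted = extract(entry)                   # step S200: information extraction
        distance = compare(entry.title, extracted)   # step S300: data verification
        results.append(VerificationResult(entry, extracted, distance))
    return results
```

Passing `extract` and `compare` as parameters keeps the sketch independent of any particular website or distance measure.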

As shown in FIGS. 3 and 4, each faculty member uses his or her own terminal (200), such as a smartphone, desktop computer, or tablet, to access the server (100), which is equipped with a communication module (120), and enters his or her achievement information.

The server (100) is provided with the faculty DB (110), which stores the achievement information that each faculty member enters after accessing the server (100). This input achievement information typically includes information on papers, patents, books, or translations.

The information extraction step (S200) comprises an access step (S210) of accessing a website (300) from which information on papers, patents, books, or translations can be obtained; a data extraction step (S220) of extracting information from the accessed website (300) by crawling; and a data trimming step (S230) of extracting only the achievement-related information from the information obtained in the data extraction step (S220).

The server (100) is provided with an extraction module (130), which acts as a crawler and extracts information from the website (300) by crawling. The extraction module (130) includes a web DB (134) that stores information on the websites (300) from which information on papers, patents, books, or translations can be obtained.

In the access step (S210), the extraction module (130) accesses the relevant website (300) using the website information stored in the web DB (134). In the data extraction step (S220), it extracts the information related to the input achievement information by crawling, as described above, and stores it in a temporary storage unit (132) provided in the extraction module (130).

Meanwhile, in the data trimming step (S230), information other than the faculty members' achievement-related information is deleted. The web DB (134) holds information on the format of each website (300), and meaningless characters are deleted by referring to it.

Each website (300) displays its fields in a different format. For example, even for the title of a paper, a site may place a colon (:) or semicolon (;) before and after the title, use quotation marks (") or other special characters, or insert multiple spaces between a special character and the title.

In the data trimming step (S230), the extraction module (130) therefore refers to the format information of each website (300) held in the web DB (134) and removes from the extracted information any irrelevant characters, including special characters, that are not part of the achievement-related information.

The achievement-related information processed in this way is stored in the temporary storage unit (132) provided in the extraction module (130), so that it can be compared more accurately with the input achievement information in the data verification step (S300) described below.

That is, the present invention uses a sequence matching method to compare the achievement-related information extracted from the website (300) with the achievement information entered by the faculty member. Sequence matching compares the characters of the strings in order to decide whether they match, so an accurate comparison becomes difficult if the string being compared contains special characters that belong to the format of the website (300). These meaningless characters are therefore removed to increase the accuracy of the comparison.
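The trimming described above might look like the following minimal sketch. It assumes a single generic set of separator characters; the patent instead looks up per-website format information in the web DB (134), which is not modeled here. The function name `trim_field` is hypothetical.

```python
import re

# Separator characters assumed for illustration only; a real deployment would
# take these from the per-website format information.
_SEPARATORS = " \t\r\n:;\"'“”‘’"


def trim_field(raw: str) -> str:
    """Strip surrounding separators and collapse runs of whitespace."""
    text = raw.strip(_SEPARATORS)
    return re.sub(r"\s+", " ", text)
```

For instance, a raw title field such as ` :  "Deep   Learning for X" ; ` would be reduced to `Deep Learning for X` before comparison.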

The data verification step (S300) comprises a preprocessing step (S310) of converting the input achievement information stored in the faculty DB (110) and the achievement-related information extracted in the information extraction step (S200) into a predetermined format, and a comparison step (S320) of comparing the preprocessed achievement-related information with the preprocessed input achievement information for verification.

In the preprocessing step (S310), input achievement information stored in the faculty DB (110) and achievement-related information extracted by the extraction module (130) are converted entirely to lowercase when written in the Latin alphabet. When they are written in Korean (Hangul), every syllable is separated into its consonants and vowels. The result is stored in a separately allocated part of the faculty DB (110).

Taking Hangul as an example, when 홍길동 (Hong Gil-dong) is compared with 홍갈동 (Hong Gal-dong), the middle syllable of the three differs, so one third of the string is different and the two look quite dissimilar. When they are decomposed into consonants and vowels, however, ㅎㅗㅇㄱㅣㄹㄷㅗㅇ is compared with ㅎㅗㅇㄱㅏㄹㄷㅗㅇ. Only one of the nine letters differs, so one ninth of the string is different and the similarity appears high.

In addition, because names in the Latin alphabet are sometimes written as initials, a version converted to initials may also be stored in an additionally allocated part of the faculty DB (110).
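The preprocessing described above can be approximated with the Python standard library. This is a minimal sketch, not the patent's implementation: it relies on Unicode NFD normalization, which splits each precomposed Hangul syllable into conjoining jamo. These are different code points from the compatibility letters printed above (and initial and final consonants are distinct), but the 홍길동 example still yields nine units. The function name `preprocess` is hypothetical, and NFD also splits accented Latin letters into a base letter plus a combining mark.

```python
import unicodedata


def preprocess(text: str) -> str:
    """Lowercase Latin letters and decompose Hangul syllables into jamo."""
    return unicodedata.normalize("NFD", text.lower())


def differing_positions(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))
```

With this sketch, `preprocess("홍길동")` and `preprocess("홍갈동")` are both nine code points long and differ at a single position, which matches the one-in-nine reasoning above.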

Meanwhile, the comparison step (S330) compares and verifies the input achievement information against the achievement-related information using the edit distance (Levenshtein distance) algorithm, one of the sequence matching methods, with a weight added for continuity.

The ordinary edit distance (Levenshtein distance) algorithm compares two strings and determines how many times one of three operations (insertion, deletion, or replacement) must be performed to make the strings identical.

Expressed programmatically, this process is:

if A[i] and B[j] match, D(i,j) = D(i-1,j-1);

otherwise, D(i,j) = min( D(i-1,j), D(i,j-1), D(i-1,j-1) ) + 1.

Here D(i,j) is the edit distance value (the number of times the three operations above are performed), with initial value D(0,0) = 0. For two strings A and B, A[i] is the i-th character of string A and B[j] is the j-th character of string B. The variables i and j are integers ranging from 0 to n, where n is the number of characters in the respective string.

Carrying out this procedure for each character of the strings yields the edit distance value D(i,j). As stated above, this value indicates how many of the three operations are needed to make the two strings identical, so the larger the edit distance value, the lower the similarity between the two strings.

An example of the edit distance (Levenshtein distance) algorithm can be seen clearly in the table of FIG. 6, which compares the string ABC with the string QWC.
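For reference, the ordinary Levenshtein distance can be computed with the usual dynamic-programming table. The sketch below is the textbook formulation, with the boundary rows D(i,0) = i and D(0,j) = j filled in explicitly; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of insertions, deletions, and replacements
    needed to turn string a into string b."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # delete all i characters of a
    for j in range(m + 1):
        d[0][j] = j          # insert all j characters of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

For the strings ABC and QWC compared in FIG. 6, this returns 2, since only the first two characters need to be replaced.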

However, the ordinary edit distance (Levenshtein distance) algorithm only counts how many characters differ. It therefore cannot recognize a pair of strings that has more differing characters but is actually more similar.

For example, comparison and conpalisen are entirely unrelated strings, but they differ at only three positions (m/n, r/l, and o/e), so their edit distance is 3. By contrast, comparison and compare are closely related, yet they differ in four positions (the trailing "ison" versus "e"), so their edit distance is 4 and they are judged to be less similar.

To solve this problem, the present invention still uses the edit distance (Levenshtein distance) algorithm but, considering that two strings are more similar the more consecutively matching parts they have, judges similarity by adding a weight for continuity as described above.

Expressed programmatically, the edit distance (Levenshtein distance) algorithm with the added weight is:

1. If A[i] and B[j] of the strings being compared match:

D(i,j) = D(i-1,j-1);

if b == True, then w = w + 1;

if b != True, then b = True.

2. If A[i] and B[j] of the strings being compared do not match:

D(i,j) = min( D(i-1,j)+1/w, D(i,j-1)+1/w, D(i-1,j-1)+1/w );

b = False.

{ D(i,j) = weighted edit distance value, with initial value D(0,0) = 0,

A[i] = i-th character of string A,

B[j] = j-th character of string B,

w = continuity weight (initial value of w is 0),

b = Boolean used to judge continuity (initial value of b is False),

i,j = 0 to n }

Comparing two pairs of strings with the weighted edit distance (Levenshtein distance) algorithm of the present invention gives the following.

1. comparison vs. conpalisen (3 operations required)

Edit distance value = 1 + 1/2 + 1/3 = 11/6, or about 1.83.

The value is derived as follows. The value of w increases whenever a character matches immediately after another match (that is, when b is True at the time of the comparison), and the initial edit distance value is 0.

Comparing 'c': the two characters are identical, so D(1,1) = D(0,0) = 0. Because b has its initial value False, w stays 0, and b is then changed to True.

Comparing 'o': the two characters are identical, so D(2,2) = D(1,1) = 0. Because b is True, w increases by 1 to 1, and b is unchanged.

Comparing 'm' with 'n': the two characters do not match, so D(3,3) = D(2,2) + 1/w = 0 + 1/1. b is changed to False, and w is unchanged at 1.

Comparing 'p': the two characters are identical, so D(4,4) = D(3,3) = 1. Because b was False, w is unchanged at 1, and b is changed to True.

Comparing 'a': the two characters are identical, so D(5,5) = D(4,4) = 1. Because b was True, w increases by 1 to 2, and b is unchanged.

Comparing 'r' with 'l': the two characters do not match, so D(6,6) = D(5,5) + 1/w = 1 + 1/2. b is changed to False, and w is unchanged at 2.

Comparing 'i': the two characters are identical, so D(7,7) = D(6,6) = 1 + 1/2. Because b was False, w is unchanged at 2, and b is changed to True.

Comparing 's': the two characters are identical, so D(8,8) = D(7,7) = 1 + 1/2. Because b was True, w increases by 1 to 3, and b is unchanged.

Comparing 'o' with 'e': the two characters do not match, so D(9,9) = D(8,8) + 1/w = 1 + 1/2 + 1/3. b is changed to False, and w is unchanged at 3.

Comparing 'n': the two characters are identical, so D(10,10) = D(9,9) = 1 + 1/2 + 1/3. Because b was False, w is unchanged at 3, and b is changed to True.

The final weighted edit distance value D(10,10) is therefore 1 + 1/2 + 1/3.

2. comparison vs. compare (4 operations required)

Edit distance value = 1/5 + 1/5 + 1/5 + 1/5 = 4/5, or 0.8.

Here the six leading characters of "compar" match consecutively, so w reaches 5 before the first difference, and each of the four operations adds 1/5.

Therefore, when the two pairs of strings are compared with the weighted edit distance (Levenshtein distance) algorithm of the present invention, the second pair yields the lower edit distance value of 0.8 and is judged to be more similar, which agrees with the actual similarity.
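The weighted procedure can be sketched in Python as follows. This is one possible reading of the procedure, not a definitive implementation, and it rests on three assumptions that the text does not state: each table cell carries its own (distance, w, b) state inherited from the predecessor it was computed from; the cost 1/w is treated as 1 while w is still 0, to avoid division by zero; and ties between predecessors are broken in favor of the diagonal. The function name `weighted_edit_distance` is hypothetical, and `Fraction` is used so that results such as 11/6 are exact.

```python
from fractions import Fraction


def weighted_edit_distance(a: str, b: str) -> Fraction:
    """Continuity-weighted edit distance (sketch under the stated assumptions)."""

    def edit(state):
        # Apply one insertion, deletion, or replacement: add 1/w and end the run.
        d, w, _ = state
        return (d + Fraction(1, max(w, 1)), w, False)  # assumption: 1/w = 1 when w == 0

    n, m = len(a), len(b)
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (Fraction(0), 0, False)   # D(0,0) = 0, w = 0, b = False
    for i in range(1, n + 1):
        table[i][0] = edit(table[i - 1][0])
    for j in range(1, m + 1):
        table[0][j] = edit(table[0][j - 1])

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                # Match: keep the distance; raise w only if the previous step also matched.
                d, w, run = table[i - 1][j - 1]
                table[i][j] = (d, w + 1 if run else w, True)
            else:
                # Mismatch: cheapest predecessor plus 1/w; diagonal is listed first for ties.
                candidates = [
                    edit(table[i - 1][j - 1]),
                    edit(table[i - 1][j]),
                    edit(table[i][j - 1]),
                ]
                table[i][j] = min(candidates, key=lambda s: s[0])
    return table[n][m][0]
```

Under these assumptions, `weighted_edit_distance("comparison", "conpalisen")` evaluates to 11/6 and `weighted_edit_distance("comparison", "compare")` evaluates to 4/5, in line with the two worked examples above.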

Accordingly, in the present invention, the comparison step (S330) of the data verification step (S300) compares the achievement information with the achievement-related information using the weighted edit distance (Levenshtein distance) algorithm, so that even if a typo occurs when the teacher enters the information or when it is listed on a website, judging by similarity improves the accuracy of verification.

After the data verification step (S300), a verification result confirmation step (S400) for checking the verification result may further be performed.

Here, in the data verification step (S300), information on the parts that do not match during verification is stored in the result storage unit (142) provided separately in the verification module (140), and in the verification result confirmation step (S400), the result is output together with a brief comment, as shown in FIG. 7, by referring to the information stored in the result storage unit (142).

At this time, if the input data contains an error, the verification result confirmation step (S400) notifies the teacher of the result information, using the information stored in the teacher DB (110), so that the error can be corrected.
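The flow from mismatch storage to result output might be sketched as follows. This is a minimal illustration only: the names `Mismatch`, `ResultStore`, `verify`, and `report`, the dictionary layout of the records, and the wording of the comment are all hypothetical, and the default comparison is plain equality where the invention would use its similarity judgment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Mismatch:
    field_name: str
    entered: str
    extracted: str
    comment: str


@dataclass
class ResultStore:
    """Hypothetical stand-in for the result storage unit (142)."""
    mismatches: List[Mismatch] = field(default_factory=list)


def verify(
    entered: Dict[str, str],
    extracted: Dict[str, str],
    store: ResultStore,
    is_similar: Callable[[str, str], bool] = lambda a, b: a == b,
) -> bool:
    """Compare each entered field with the web-extracted value and store mismatches."""
    ok = True
    for name, value in entered.items():
        found = extracted.get(name, "")
        if not is_similar(value, found):
            comment = f"'{name}' differs from the web source"
            store.mismatches.append(Mismatch(name, value, found, comment))
            ok = False
    return ok


def report(store: ResultStore) -> List[str]:
    """Format the stored mismatches as short messages for the teacher."""
    return [
        f"{m.field_name}: entered {m.entered!r}, found {m.extracted!r} ({m.comment})"
        for m in store.mismatches
    ]


store = ResultStore()
verify({"title": "deep learning", "year": "2020"},
       {"title": "deep learning", "year": "2021"}, store)
for line in report(store):
    print(line)
```

Passing a different `is_similar` function is how a similarity-based comparison, such as a threshold on the weighted edit distance, could be plugged in.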

Although preferred embodiments of the present invention have been described above, the scope of the present invention is not limited thereto and extends to what is substantially equivalent to the embodiments of the present invention; various modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the spirit of the present invention.

100: server 110: teacher DB
120, 210: communication module 130: extraction module
140: verification module 200: terminal
220: display unit 300: website
S100: information input step
S200: information extraction step
S300: data verification step
S400: verification result confirmation step

Claims

An input achievement information DB storage step in which achievement information entered by teachers is stored in a teacher DB provided in a server; an information extraction step in which an extraction module provided in the server extracts achievement-related information from the web using the input achievement information; a data verification step in which a verification module provided in the server verifies the input achievement information by comparing it with the achievement-related information extracted in the information extraction step; and a verification result confirmation step in which the verification result is confirmed,
wherein the input achievement information is information on a thesis, patent, book, or translation entered by each teacher,
the information extraction step includes an access step in which the extraction module accesses a website where information on the thesis, patent, book, or translation can be obtained, a data extraction step in which the extraction module extracts information from the accessed website through a crawling operation, and a data refinement step in which the extraction module extracts only the achievement-related information from the information extracted in the data extraction step,
the data verification step consists of a preprocessing step in which the verification module converts the input achievement information stored in the teacher DB and the achievement-related information extracted in the information extraction step into a set format, and a comparison step in which the verification module compares and verifies the preprocessed achievement-related information and input achievement information,
in the data verification step, information on any part in which the preprocessed achievement-related information and the input achievement information do not match is stored in a separately provided result storage unit,
in the verification result confirmation step, if there is an error in the achievement information entered by the teachers, the result information stored in the result storage unit is output with a brief comment to inform the teacher so that the teacher can correct the error,
in the comparison step, when the verification module compares the input achievement information and the achievement-related information using the Levenshtein distance algorithm, one of the sequence matching methods, a weight is applied to continuity, which is the number of consecutively matched parts, and
the comparison step determines similarity by the weighted edit distance value (D(i,j)) derived through the following procedure.
[procedure]
1. If A[i] and B[j] of the strings to be compared match,
D(i,j) = D(i-1,j-1),
If b == True then w = w + 1,
If b != True then b = True.

2. If A[i] and B[j] of the strings to be compared do not match,
D(i,j) = min( D(i-1,j)+1/w, D(i,j-1)+1/w, D(i-1,j-1)+1/w ), and
b = False.

{ D(i,j) = weighted edit distance value, initial value D(0,0)=0,
A[i] = ith character of string A,
B[j] = jth character of string B,
w = continuity weight (initial value of w is 0),
b = Boolean to determine continuity (initial value of b is False),
i,j = 0 to n }

(Deleted)

According to claim 1,
in the data refinement step, the extraction module refers to the format of each website and removes from the extracted information any irrelevant characters, including special characters, that are not achievement-related information.

(Deleted)

According to claim 1,
in the preprocessing step, the verification module converts the input achievement information and the achievement-related information to lowercase where they are alphabetic, and separates them into consonants and vowels where they are Korean.

(Deleted)
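As an illustration of the preprocessing recited in the claims above (lowercasing alphabetic text and separating Korean into consonants and vowels), one possible Python sketch follows. It assumes that Unicode NFD normalization is an acceptable way to split Hangul syllables; the claims do not name a specific method, and the function name `preprocess` is hypothetical.

```python
import unicodedata


def preprocess(text: str) -> str:
    """Lowercase alphabetic characters and decompose Hangul syllables into jamo."""
    lowered = text.lower()
    # Assumption: NFD decomposition stands in for "separating into consonants and vowels".
    # It yields conjoining jamo (U+1100 block) and also decomposes accented Latin letters.
    return unicodedata.normalize("NFD", lowered)


print(preprocess("Deep Learning"))  # deep learning
print(len(preprocess("교원")))       # 5 (two jamo for the first syllable, three for the second)
```

Comparing strings at the jamo level means that a typo in a single consonant or vowel counts as one differing character rather than a whole differing syllable.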