KR101713483B1

KR101713483B1 - Method for scoring supply type answer sheet, computer program and storage medium for the same

Info

Publication number: KR101713483B1
Application number: KR1020160058372A
Authority: KR
Inventors: 노은희; 성경희; 강승식; 김재훈; 장은서; 천민아; 송미영; 임은영; 김명화; 박종임; 김유향
Original assignee: 한국교육과정평가원
Priority date: 2015-10-08
Filing date: 2016-05-12
Publication date: 2017-03-07
Also published as: KR101691327B1; KR101680007B1; KR101713487B1

Abstract

The present invention relates to an answer scoring method using machine learning, and to a computer program and a recording medium for the same. In the method, at least two first classification criteria are set, each corresponding to one of at least two classification grades defined according to the types of multiple input answers; each first classification criterion includes a feature vector, that is, a coordinate value indicating a position in a feature space whose axes are the individual features. Features corresponding to predetermined characteristic elements are extracted from the input answers, and a feature vector is calculated for each answer. Each input answer is classified into the classification grade of the first classification criterion whose feature vector has the highest similarity to the feature vector of that answer. At least two second classification criteria are then formed by averaging, coordinate by coordinate, the feature vectors of the input answers classified into each grade, and the first classification criteria are replaced with the second classification criteria. Each input answer is classified again, into the classification grade of the second classification criterion with the highest inter-vector similarity, and a score is assigned for each classification grade.

Description

METHOD FOR SCORING SUPPLY TYPE ANSWER SHEET, COMPUTER PROGRAM AND STORAGE MEDIUM FOR THE SAME

The present invention relates to an answer scoring method using a machine learning method, and to a computer program and a recording medium for the same.

A supply-type (constructed-response) item is an item format in which the examinee does not choose the correct answer from among answers presented in advance but composes and writes the answer directly. Supply-type items are usually divided into subtypes such as the short-answer type, the fill-in-the-blank (parenthesis) type, and the essay (or descriptive) type. The short-answer type asks for an answer written as a simple word, phrase, clause, or sentence, and the fill-in-the-blank type asks for an answer written inside parentheses. The essay type and the descriptive type are largely alike, but the essay type emphasizes that the student must present his or her own thoughts or arguments logically and persuasively, whereas the descriptive type is sometimes understood as a format that is relatively short compared with the essay type and has an objectively correct answer.

Because students compose and write what they think themselves, supply-type items are more effective than selection-type items at measuring higher-order mental abilities. However, because of concerns about the fairness of scoring results and the time and cost required for scoring, such items are not actively used in educational practice. In large-scale tests such as the national-level assessment of educational achievement in particular, the number and format of supply-type items are inevitably even more limited.

Korean Patent Registration No. 10-1275146 (published on June 10, 2013)

If large-scale assessments are to be administered on computers, a system that produces scoring results immediately must be built first. In other words, there is a need for a system that drastically reduces the scoring burden in a computer-based assessment framework and thereby allows the share of supply-type items to be expanded, both in school settings involving hundreds of students and in large-scale assessments involving tens of thousands to hundreds of thousands of examinees. To enable such large-scale assessment, the present invention provides an answer scoring method using a machine learning method, and a computer program and a recording medium for the same.

A method of scoring a plurality of input answers using a machine learning method according to an embodiment of the present invention includes the steps of:
a) setting at least two first classification criteria, each corresponding to one of at least two classification grades defined according to the types of the plurality of input answers, wherein each of the at least two first classification criteria includes a feature vector that is a coordinate value indicating a position in a feature space whose axes are the individual features;
b) extracting, from the plurality of input answers, features corresponding to predetermined characteristic elements;
c) calculating, for each input answer, a feature vector that is a coordinate value indicating a position in the feature space whose axes are the individual features;
d) comparing the feature vector of each of the plurality of input answers with the feature vector of each of the at least two first classification criteria, and classifying each input answer into the classification grade corresponding to the first classification criterion with the highest inter-vector similarity;
e) forming at least two second classification criteria by averaging the coordinates of the feature vectors of the input answers classified into each classification grade;
f) replacing the at least two first classification criteria with the at least two second classification criteria;
g) comparing the feature vector of each of the plurality of input answers with the feature vector of each of the at least two second classification criteria, and classifying each input answer into the classification grade corresponding to the second classification criterion with the highest inter-vector similarity; and
h) assigning a score for each classification grade.
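As a minimal sketch of steps a) to h), the following assumes cosine similarity as the inter-vector similarity measure (the text does not name one). The three-feature vectors, the grade names, and the point values are hypothetical and serve only to make the flow concrete.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(vectors, criteria):
    """Steps d) and g): assign each answer to the grade whose criterion is most similar."""
    return [max(criteria, key=lambda g: cosine(vec, criteria[g])) for vec in vectors]

def refine(vectors, labels, criteria):
    """Steps e) and f): replace each criterion with the coordinate-wise mean of its members."""
    new = {}
    for grade, old in criteria.items():
        members = [v for v, l in zip(vectors, labels) if l == grade]
        # Keep the old criterion when no answer was assigned to this grade.
        new[grade] = [sum(c) / len(members) for c in zip(*members)] if members else list(old)
    return new

# Hypothetical feature vectors for four answers (steps b and c are assumed done).
answers = [[1.0, 1.0, 0.0], [1.0, 0.8, 0.1], [0.0, 0.2, 1.0], [0.1, 0.0, 0.9]]
first_criteria = {"correct": [1.0, 0.0, 0.0], "incorrect": [0.0, 0.0, 1.0]}  # step a)
points = {"correct": 1, "incorrect": 0}

labels = classify(answers, first_criteria)                  # step d)
second_criteria = refine(answers, labels, first_criteria)   # steps e) and f)
labels = classify(answers, second_criteria)                 # step g)
scores = [points[label] for label in labels]                # step h)
```

The criteria are held in a dictionary keyed by grade so that the same structure serves a two-grade (correct/incorrect) scheme and a scheme with three or more graded scores.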

In addition, a program according to an embodiment of the present invention is stored in a recording medium so as to cause a computer to execute the above method of scoring a plurality of input answers using the machine learning method.

In addition, a recording medium according to an embodiment of the present invention stores a program for executing the above method of scoring a plurality of input answers using the machine learning method.

According to an embodiment of the present invention, a large volume of answers, including answers to supply-type items composed of multiple sentences, can be scored accurately and quickly.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is an exemplary view showing the configuration of an answer scoring environment according to an embodiment of the present invention.
FIG. 2 is an exemplary view showing the configuration of an answer scoring apparatus according to an embodiment of the present invention.
FIG. 3 is an exemplary view showing the configuration of a processing unit included in the answer scoring apparatus according to an embodiment of the present invention.
FIG. 4 is an exemplary view showing the natural language processing performed by a document normalization unit according to an embodiment of the present invention.
FIG. 5 is an exemplary view showing syllable generation and spacing state transitions for an example sentence according to an embodiment of the present invention.
FIGS. 6A to 6D are exemplary views showing dependency relations of sentences according to an embodiment of the present invention.
FIG. 7 is an exemplary view showing the procedure of a method of scoring a plurality of input answers using unsupervised learning, which is a machine learning method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the following description, detailed descriptions of well-known functions or configurations are omitted where they might unnecessarily obscure the subject matter of the present invention.

FIG. 1 is an exemplary view showing the configuration of an answer scoring environment according to an embodiment of the present invention.

As shown in FIG. 1, the answer scoring environment 100 may include a server 110, a plurality of student terminals 120-1, ..., 120-n, and a plurality of grader terminals 130-1, ..., 130-n. The server 110, the student terminals 120-1, ..., 120-n, and the grader terminals 130-1, ..., 130-n may be connected to one another so as to communicate via a network N.

The server 110 may form scoring-target answers from the input answers received from the student terminals 120-n, either by performing language processing on them or without language processing, form classification criteria for scoring the scoring-target answers, and classify and score the scoring-target answers using the formed classification criteria. The methods of forming the classification criteria and of classifying and scoring the scoring-target answers are described later. The server 110 may form the classification criteria using a preset number of the scoring-target answers, or may form them based on a grader's classification results for a preset number of the scoring-target answers. The server 110 may receive the grader's classification results from the grader terminals 130-1, ..., 130-n via the network N. The server 110 may also transmit the scoring results of scored answers to the grader terminals 130-1, ..., 130-n or the student terminals 120-1, ..., 120-n, and may transmit scoring-target answers that cannot be classified or scored under the formed classification criteria to the grader terminals 130-1, ..., 130-n for manual scoring by a grader.

The student terminals 120-1, ..., 120-n may receive answers from a plurality of students taking a particular test and form input answers in a form that the server 110 and the grader terminals 130-1, ..., 130-n can process. In one embodiment, each student taking the test may enter his or her answer directly into a student terminal 120-1, ..., 120-n, which then forms the input answer; the student terminal may recognize an answer that a student wrote on a test sheet by a handwriting recognition method such as optical character recognition (OCR) and then form the input answer; or an administrator may enter the answers submitted on test sheets into the student terminals in a batch to form the input answers. The method of forming the scoring-target answers is not, however, limited to these implementations. The student terminals 120-1, ..., 120-n may also receive from the server 110 the scoring results of the scored answers. Here, an input answer may include any form of answer that can appear in response to an open-ended test question, such as at least one word, at least one number, or at least one sentence. This embodiment is described mainly for the case where the input answer includes at least one sentence.

The grader terminals 130-1, ..., 130-n may receive from the server 110, via the network N, the scoring results of the scored answers as well as the scoring-target answers that could not be classified or scored under the classification criteria formed in the server 110. The grader terminals 130-1, ..., 130-n also allow a grader to review the classification and scoring results produced by the server 110 easily, and to score easily the scoring-target answers that the server 110 could not classify or score. In addition, so that the server 110 can form the classification criteria for the scoring-target answers, the grader terminals 130-1, ..., 130-n may receive from the grader the classification results for a preset number of the scoring-target answers and transmit them to the server 110.

In one embodiment, the student terminals 120-1, ..., 120-n and the grader terminals 130-1, ..., 130-n may include a personal computer, a tablet, a smartphone, a laptop computer, a personal digital assistant (PDA), and the like, but are not limited to these implementations.

FIG. 2 is an exemplary view showing the configuration of an answer scoring apparatus according to an embodiment of the present invention, and FIG. 3 is an exemplary view showing the configuration of a processing unit included in the answer scoring apparatus according to an embodiment of the present invention.

The answer scoring apparatus 200 may include a processing unit 210, a storage unit 220, a communication unit 230, and a system bus 240. In one embodiment, the processing unit 210, the storage unit 220, and the communication unit 230 may be connected to one another via the system bus 240. The answer scoring apparatus 200 may be included in the server 110 or provided separately from the server 110, but is not limited to these implementations. The processing unit 210 may include a language processing unit 212 and a classification scoring unit 214. The language processing unit 212 may include a document normalization unit 212-1, a morphological analysis unit 212-2, a part-of-speech tagging unit 212-3, a negative expression recognition unit 212-4, a chunking unit 212-5, a paraphrasing unit 212-6, and a dependency analysis unit 212-7. The classification scoring unit 214 may include a training answer generation unit 214-1, a feature extraction unit 214-2, an answer classification unit 214-3, and a scoring unit 214-4.

The processing unit 210 may perform natural language processing on a plurality of input answers to form a plurality of scoring-target answers; select, from the scoring-target answers, a first predetermined number of answers whose frequency is at or above a predetermined threshold; extract from each of the first predetermined number of answers, using the results of the natural language processing, features corresponding to predetermined characteristic elements; and receive classification criteria for scoring according to whether each of the first predetermined number of answers contains the features, where each classification criterion includes a feature vector that is a coordinate value indicating a position in a feature space whose axes are the individual features. The processing unit 210 may then extract features from each of a second predetermined number of answers, namely the scoring-target answers other than the first predetermined number of answers; compare the feature vector of each of these answers with the feature vectors of the classification criteria; classify each answer into the cluster corresponding to the classification criterion whose inter-vector similarity is at or above a predetermined value; and assign a score to each cluster according to its classification grade.
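The first selection step above, which separates frequently occurring answers from the remainder, can be sketched as follows. This is only an illustration: the function name `split_by_frequency`, the threshold value, and the sample answers are hypothetical, and the answers are assumed to have been normalized already so that identical answers compare equal as strings.

```python
from collections import Counter

def split_by_frequency(answers, min_count):
    """Split distinct answers into those seen at least min_count times and the rest."""
    counts = Counter(answers)
    frequent = [a for a, c in counts.items() if c >= min_count]  # first set
    rest = [a for a, c in counts.items() if c < min_count]       # second set
    return frequent, rest

# Hypothetical normalized answers collected for one item.
collected = ["captured", "captured", "captured", "escaped", "escaped", "hid"]
first_set, second_set = split_by_frequency(collected, min_count=2)
```

Grading the frequent answers first means a small number of classification decisions covers a large share of the submitted answers.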

In addition, the processing unit 210 may: a) without performing natural language processing on the plurality of input answers, set at least two first classification criteria, each corresponding to one of at least two classification grades defined according to the types of the input answers, where each first classification criterion includes a feature vector that is a coordinate value indicating a position in a feature space whose axes are the individual features; b) extract from the input answers features corresponding to predetermined characteristic elements; c) calculate feature vectors, each being a coordinate value indicating a position in that feature space; d) compare the feature vector of each input answer with the feature vector of each first classification criterion and classify each input answer into the classification grade corresponding to the first classification criterion with the highest inter-vector similarity; e) form at least two second classification criteria by averaging the coordinates of the feature vectors of the input answers classified into each grade; f) replace the first classification criteria with the second classification criteria; g) compare the feature vector of each input answer with the feature vector of each second classification criterion and classify each input answer into the classification grade corresponding to the second classification criterion with the highest inter-vector similarity; and h) assign a score for each classification grade.

In addition, after step d) and before step h), the processing unit 210 may set the at least two second classification criteria as the new first classification criteria and repeat steps e) to g), performing step h) once the first classification criteria and the second classification criteria are identical.

In addition, the processing unit 210 may perform steps d) to g) using any one of K-means clustering, hierarchical clustering, density-based clustering, or grid-based clustering.
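The repetition of steps e) to g) until the criteria stop changing corresponds to the familiar K-means loop. The sketch below is one possible reading under stated assumptions: it treats the smallest squared Euclidean distance as the highest similarity (the usual K-means choice, not something the text prescribes), and the function name `kmeans_grades`, the seed criteria, and the two-dimensional vectors are hypothetical.

```python
def kmeans_grades(vectors, criteria, max_iter=100):
    """Repeat assignment and averaging until the criteria no longer change."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    labels = []
    for _ in range(max_iter):
        # Assign each answer to the nearest criterion (smallest distance = most similar).
        labels = [min(criteria, key=lambda g: sq_dist(v, criteria[g])) for v in vectors]
        updated = {}
        for grade, old in criteria.items():
            members = [v for v, l in zip(vectors, labels) if l == grade]
            # An empty grade keeps its previous criterion.
            updated[grade] = (
                [sum(c) / len(members) for c in zip(*members)] if members else list(old)
            )
        if updated == criteria:  # first and second criteria are identical: stop
            break
        criteria = updated
    return labels, criteria

# Hypothetical two-feature vectors and seed criteria.
vectors = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0], [1.0, 0.0]]
seeds = {"incorrect": [0.0, 0.0], "correct": [3.0, 3.0]}
labels, final_criteria = kmeans_grades(vectors, seeds)
```

The `max_iter` cap is a safeguard added for the sketch; the stopping rule described in the text is the equality check alone.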

In addition, the processing unit 210 may perform natural language processing on the plurality of input answers and extract the features using the results of the natural language processing.

In addition, the processing unit 210 may compare the feature vector of each of the plurality of input answers with the feature vectors of the at least two first classification criteria and classify the answer into the classification grade corresponding to the classification criterion with the highest inter-vector similarity; when two or more classification grades share the highest similarity, it may classify the answer into the classification grade with the highest score, or as the correct answer.

In addition, the processing unit 210 may compare the feature vector of each of the plurality of input answers with the feature vectors of the at least two second classification criteria and classify the answer into the classification grade corresponding to the classification criterion with the highest inter-vector similarity; when two or more classification grades share the highest similarity, it may classify the answer into the classification grade with the highest score, or as the correct answer.
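The tie-breaking rule can be illustrated with a short sketch. It assumes the similarities to each grade have already been computed; the function name `pick_grade`, the grade names, and the point values are hypothetical.

```python
import math

def pick_grade(similarities, points):
    """Return the most similar grade; on a tie, the tied grade worth the most points."""
    best = max(similarities.values())
    tied = [g for g, s in similarities.items() if math.isclose(s, best)]
    return max(tied, key=lambda g: points[g])

# Hypothetical three-grade scheme.
points = {"full": 2, "partial": 1, "none": 0}
tie_result = pick_grade({"full": 0.8, "partial": 0.8, "none": 0.1}, points)
clear_result = pick_grade({"full": 0.2, "partial": 0.9, "none": 0.1}, points)
```

Resolving ties toward the higher-scoring grade gives the examinee the benefit of the doubt when the criteria cannot separate two grades.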

Here, the natural language processing may include: document normalization of the input answer, comprising a sentence segmentation step, a spacing correction step, a spelling correction step, an abbreviation expansion step, and a symbol removal step; morphological analysis of the input answer; part-of-speech tagging of the morphological analysis results; recognition of negative expressions in the input answer; chunking of at least two morphemes included in the part-of-speech tagging results; paraphrasing, which converts an eojeol (space-delimited word unit) or phrase in the input answer into a predetermined standard expression; and dependency analysis, which analyzes the dependency structure between morphemes or eojeols in the input answer.

Also, the features here may include at least one of: morpheme features formed by performing morphological analysis and part-of-speech tagging; eojeol (space-delimited word unit) features formed from the eojeols contained in the input answer; base-phrase features formed by chunking; dependency features formed by dependency analysis; and n-gram features formed from a predetermined number of adjacent morpheme features or eojeol features.
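Of the feature types listed, n-gram features are the simplest to sketch without a morphological analyzer. The helper below is illustrative only: `ngram_features` is a hypothetical name, and the answer is split on whitespace as a stand-in for eojeol segmentation.

```python
def ngram_features(tokens, n):
    """Return every run of n adjacent tokens as a tuple, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting approximates eojeol units for this example answer.
eojeols = "아름이를 생포하였다".split()
unigrams = ngram_features(eojeols, 1)
bigrams = ngram_features(eojeols, 2)
```

When `n` exceeds the number of tokens the function returns an empty list, so short answers simply contribute no higher-order n-grams.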

Also, the at least two classification grades here may be the two grades of correct and incorrect, or three or more classification grades carrying three or more differentiated scores.

Table 1 shows an example in which eojeol (space-delimited word unit), morpheme, base-phrase, and dependency features are extracted according to an embodiment of the present invention, and each scoring-target answer is expressed as a feature vector using the extracted feature set.

[Table 1]
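As a rough illustration of expressing answers as feature vectors over an extracted feature set, the sketch below uses only whitespace-delimited tokens as features. The function names and the English sample answers are hypothetical and do not reproduce the contents of Table 1; morpheme, base-phrase, and dependency features would require a language analyzer that is not modeled here.

```python
from collections import Counter

def build_feature_set(answers):
    """Collect every distinct whitespace-delimited token, sorted, as the feature axes."""
    return sorted({tok for ans in answers for tok in ans.split()})

def to_vector(answer, features):
    """Count how often each feature occurs in one answer."""
    counts = Counter(answer.split())
    return [counts[f] for f in features]

# Hypothetical answers standing in for scoring-target answers.
sample_answers = ["the cat was captured", "the dog escaped", "the cat escaped"]
features = build_feature_set(sample_answers)
vectors = [to_vector(a, features) for a in sample_answers]
```

Each position in a vector corresponds to one axis of the feature space, so all answers share the same dimensionality regardless of their length.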

In addition, before receiving the classification criteria, the processing unit 210 may extract features from each of the first predetermined number of answers, form a feature vector for each of these answers using the extracted features, compare the formed feature vectors with predetermined reference feature vectors, and classify each of the first predetermined number of answers into the cluster corresponding to the classification criterion of the reference feature vector whose inter-vector similarity is at or above a predetermined value.

In addition, the processing unit 210 may calculate the average of the reference feature vector and the feature vectors of the classified first predetermined number of answers, and update the reference feature vector to the calculated average.

In addition, when receiving the classification criteria, the processing unit 210 may receive scoring results for the first predetermined number of answers, extract the features contained in these answers to form the classification criteria, classify the second predetermined number of answers into the clusters corresponding to the classification criteria using the formed criteria, and then add to the first predetermined number of answers those answers among the second predetermined number whose probability of belonging to a cluster is at or above a predetermined value, and form the classification criteria again.

In addition, when receiving the classification criteria, the processing unit 210 may calculate, for each feature contained in the first predetermined number of answers, the probability that the feature belongs to the classification criterion corresponding to a reference feature vector, and may multiply the probabilities calculated for the features contained in an answer in order to classify the answer into the cluster corresponding to that classification criterion.
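Multiplying per-feature probabilities in this way resembles a naive Bayes classifier. The sketch below is written under that reading and is not taken from the source: add-one smoothing is an assumption (the text does not say how unseen features are handled), class priors are omitted because the text does not mention them, and the function names and training data are hypothetical.

```python
from collections import Counter

def train(labeled):
    """labeled: list of (feature_list, grade). Returns per-grade feature counts and the vocabulary."""
    counts, vocab = {}, set()
    for feats, grade in labeled:
        counts.setdefault(grade, Counter()).update(feats)
        vocab.update(feats)
    return counts, vocab

def feature_prob(feat, grade, counts, vocab):
    """P(feature | grade) with add-one smoothing (an assumption of this sketch)."""
    total = sum(counts[grade].values())
    return (counts[grade][feat] + 1) / (total + len(vocab))

def classify(feats, counts, vocab):
    """Multiply the per-feature probabilities for each grade; return the best grade and all products."""
    products = {}
    for grade in counts:
        p = 1.0
        for f in feats:
            p *= feature_prob(f, grade, counts, vocab)
        products[grade] = p
    return max(products, key=products.get), products

# Hypothetical graded answers, already reduced to feature lists.
graded = [
    (["captured", "alive"], "correct"),
    (["captured", "cat"], "correct"),
    (["escaped", "cat"], "incorrect"),
]
counts, vocab = train(graded)
grade, products = classify(["captured", "cat"], counts, vocab)
```

For long answers the product of many small probabilities can underflow; summing logarithms instead is a common remedy and leaves the ranking of grades unchanged.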

In addition, when forming a feature vector, the processing unit 210 may take into account at least one of a weight for the frequency with which an extracted feature appears in a particular scoring-target answer and a weight for the frequency with which that feature appears across the plurality of scoring-target answers. That is, to improve the accuracy of answer classification, the processing unit 210 may represent each extracted feature by a weight that reflects its importance rather than by its raw frequency. In general, a feature that appears often in a scoring-target answer can be an important feature that represents that answer. However, most sentence-level scoring-target answers contain the morpheme feature "-다/sentence-final ending", which may nevertheless be unimportant for scoring. With this in mind, each feature can be weighted by tfidf (term frequency-inverse document frequency). tfidf is a statistic that indicates how important a word is within a particular document in a collection of documents, and it is obtained by multiplying tf (term frequency) by idf (inverse document frequency). The values tf, df, idf, and tfidf can be calculated using Equations 1 to 4 below.

[Equation 1]

[Equation 2]

[Equation 3]

[Equation 4]

In Equation 1, tf expresses how often a particular feature appears in a particular scoring target answer; the higher this value, the more important the feature can be judged to be in that answer. df indicates how commonly a particular feature is used across all scoring target answers; the higher this value, the less important the feature can be judged to be in a given answer. idf represents the inverse of df; the higher this value, the more important the feature can be judged to be in that answer.

Table 2 expresses the feature vectors of the scoring target answers shown in Table 1 in terms of tfidf, according to an embodiment of the present invention. In the scoring target answer "아름이를 생포하였다" ("captured Areum"), the word feature "생포하였다" and the morpheme feature "생포/동작성명사" (capture/action noun) each occur once, so both have the value 1 and the two feature values do not differ at all. With tfidf, by contrast, the word feature "생포하였다" is calculated as in Equation 5 and the morpheme feature "생포/동작성명사" as in Equation 6, yielding the distinct values 0.397 and 0.222, respectively. In other words, tfidf can show that the two features differ. Reflecting this difference influences the machine learning method and can improve scoring accuracy.

[Equation 5]

[Equation 6]

[Table 2]

In addition, the processing unit 210 may calculate the weight for the frequency with which a given extracted feature appears in a specific scoring target answer as the number of times that feature appears in that answer, and may calculate the weight for the frequency with which the feature appears across the multiple scoring target answers by dividing the total number of scoring target answers by the number of scoring target answers containing that feature and taking the logarithm.
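A minimal sketch of the weighting just described, assuming each answer is given as a list of feature strings. The function name, the base-10 logarithm, and the toy data are illustrative assumptions; the description does not state the logarithm base.

```python
import math
from collections import Counter

def tfidf_vectors(answers, base=10):
    """answers: list of feature lists, one list per scoring target answer."""
    n = len(answers)
    # df: number of answers that contain each feature at least once.
    df = Counter(f for feats in answers for f in set(feats))
    vectors = []
    for feats in answers:
        tf = Counter(feats)  # tf: raw count of the feature within this answer
        # idf = log(total answers / answers containing the feature)
        vectors.append({f: tf[f] * math.log(n / df[f], base) for f in tf})
    return vectors

# Toy data, for illustration only.
answers = [["a", "b"], ["a"], ["a", "c", "c"], ["b"]]
vecs = tfidf_vectors(answers)
```

A feature that occurs in every answer receives a weight of zero under this scheme, which matches the remark that a ubiquitous sentence-final ending carries little information for scoring.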

In addition, when classifying each of the second predetermined number of answers, the processing unit 210 may compare the feature vector of each answer with the feature vectors of the classification criteria and classify the answer into the cluster corresponding to the classification criterion with the highest inter-vector similarity; if two or more classification criteria share the highest similarity, it may classify the answer into the cluster corresponding to the classification criterion with the highest score, or as a correct answer.
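The similarity-based assignment with its tie-breaking rule can be sketched as follows. Cosine similarity is used here as an assumption, since the description only speaks of inter-vector similarity; the function names and the tuple layout of the criteria are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_cluster(answer_vec, criteria):
    """criteria: list of (label, score, feature_vector) tuples."""
    sims = [(cosine(answer_vec, vec), score, label) for label, score, vec in criteria]
    best = max(s for s, _, _ in sims)
    # Among criteria tied for the highest similarity, prefer the highest score.
    tied = [(score, label) for s, score, label in sims if math.isclose(s, best)]
    return max(tied, key=lambda t: t[0])[1]

criteria = [("A", 2, {"x": 1.0}), ("B", 1, {"y": 1.0}), ("C", 3, {"x": 2.0})]
chosen = assign_cluster({"x": 1.0}, criteria)
```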

The language processing unit 212 may perform natural language processing on the multiple input answers generated at the student terminals 120-1, ..., 120-n to form scoring target answers. In one embodiment, the language processing unit 212 may form the scoring target answers by performing language processing such as document normalization, morphological analysis, part-of-speech tagging, negation recognition, chunking, paraphrasing, and dependency analysis, but the language processing that the language processing unit 212 performs on the input answers is not limited to this implementation.

The document normalization unit 212-1 may unify various input answers that have the same meaning into a single expression. FIG. 4 is an exemplary diagram showing the natural language processing performed by the document normalization unit 212-1 according to an embodiment of the present invention. The document normalization unit 212-1 may perform on the input answers a document normalization process, that is, natural language processing, including a sentence separation step (S310), a spacing correction step (S320), a spelling correction step (S330), an abbreviation expansion step (S340), and a symbol removal step (S350), but the natural language processing that the document normalization unit 212-1 performs on the input answers is not limited to this order.

In the sentence separation step (S310), the document normalization unit 212-1 may take an input answer composed of multiple sentences and produce output separated into sentence units. The document normalization unit 212-1 may separate the sentences contained in the input answer based on sentence-ending marks. In one embodiment, the sentence-ending marks may include the period (.), the question mark (?), and the exclamation mark (!), but are not limited to this implementation. Table 3 below is an example of the result of the sentence separation step performed by the document normalization unit 212-1. The document normalization unit 212-1 may receive an unseparated input answer of the form "우리는 열심히 손을 흔들었다. 그러나 선수 중 아무도 돌아보는 사람이 없었다." ("We waved our hands hard. But none of the players looked back.") and, using the period after "흔들었다", output "우리는 열심히 손을 흔들었다" and "그러나 선수 중 아무도 돌아보는 사람이 없었다" as separate sentences.
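The mark-based separation above can be sketched with a regular expression. The function name is hypothetical, and the choice to split only where a sentence-ending mark is followed by whitespace is an assumption of this sketch rather than something the description specifies.

```python
import re

def split_sentences(text):
    """Split on '.', '?' or '!' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    # Drop the trailing sentence-ending mark from each separated sentence.
    return [p.rstrip(".?!") for p in parts if p]

sentences = split_sentences(
    "우리는 열심히 손을 흔들었다. 그러나 선수 중 아무도 돌아보는 사람이 없었다."
)
```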

[Table 3]

However, if the input answer contains no sentence-ending marks, its sentences cannot be separated in this way. In addition, marks resembling sentence-ending marks may appear in an input answer, and sentence-ending marks are not used only to end sentences. For example, the sentence "대통령이 부산(?)에 오셨다." ("The president came to Busan(?).") must not be split at '?', and the sentence "0.002는 매우 작은 숫자다." ("0.002 is a very small number.") must not be split by mistaking the decimal point for a period. For this reason, the document normalization unit 212-1 may separate the sentences contained in an input answer using various machine learning methods, including a conditional random field model. A conditional random field model learns to combine the probabilities of various features so that the desired category can be determined, and classifies data accordingly. That is, the document normalization unit 212-1 removes the sentence-ending marks from the multiple input answers, extracts all words (eojeol) of the sentences contained in each input answer, and calculates a sentence separation probability, namely the probability that two consecutive extracted words belong to different sentences. The document normalization unit 212-1 then compares the sentence separation probability with a predetermined reference probability and, if the sentence separation probability is equal to or greater than the reference probability, may separate the two consecutive words into different sentences. In one embodiment, the predetermined reference probability may be a value in the range of 80% to 90%, but is not limited thereto. Table 4 below is an example of the result of the sentence separation step performed by the document normalization unit 212-1 using a machine learning method. The document normalization unit 212-1 may receive an unseparated input answer of the form "우리는 열심히 손을 흔들었다. 그러나 선수 중 아무도 돌아보는 사람이 없었다.", remove the periods after "흔들었다" and "없었다", and split the text into word pairs such as "우리는 열심히", "열심히 손을", "손을 흔들었다", "흔들었다 그러나", "그러나 아무도", "아무도 돌아보는", "돌아보는 사람이", and "사람이 없었다". The document normalization unit 212-1 may then use the corpus stored in the storage unit 220 to calculate, for each word pair, the probability that its two words fall in different sentences, and may separate "우리는 열심히 손을 흔들었다" and "그러나 선수 중 아무도 돌아보는 사람이 없었다" into different sentences at the pair with the highest probability, "흔들었다 그러나". In one embodiment, the Sejong corpus provided by the National Institute of Korean Language may be used as the corpus, but any corpus may be used as long as it contains information on sentence separation.
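The thresholding step can be sketched as follows. The `boundary_prob` callable is a hypothetical stand-in for whatever trained model (for example a conditional random field) supplies the sentence separation probability; the function name, the 0.8 default, and the toy probabilities are assumptions made for illustration.

```python
def split_by_boundary_probability(words, boundary_prob, threshold=0.8):
    """Split a list of eojeol wherever P(boundary between two words) >= threshold.

    boundary_prob is a hypothetical callable (prev_word, next_word) -> float.
    """
    if not words:
        return []
    sentences, current = [], [words[0]]
    for prev, nxt in zip(words, words[1:]):
        if boundary_prob(prev, nxt) >= threshold:
            # Close the current sentence and start a new one at `nxt`.
            sentences.append(" ".join(current))
            current = []
        current.append(nxt)
    sentences.append(" ".join(current))
    return sentences

# Made-up probabilities standing in for a trained model.
toy_probs = {("흔들었다", "그러나"): 0.93}
words = "우리는 열심히 손을 흔들었다 그러나 선수 중 아무도 돌아보는 사람이 없었다".split()
result = split_by_boundary_probability(words, lambda a, b: toy_probs.get((a, b), 0.05))
```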

[Table 4]

In addition, in the spacing correction step (S320), the document normalization unit 212-1 may correct the word spacing of each sentence contained in the input answer so that spacing does not affect scoring. The document normalization unit 212-1 may perform spacing correction using a probability-based spacing correction method. That is, the document normalization unit 212-1 joins the syllables of the sentences contained in the multiple input answers by removing all spaces, extracts all syllables of the sentences contained in each input answer, calculates for each pair of consecutive syllables the probability that they are written together and the probability that they are written apart, and writes the two syllables together or apart according to the higher of the two probabilities. The joining of the syllables may be performed either before or after the operation, in the sentence separation step (S310), of separating two consecutive words into different sentences when the sentence separation probability is equal to or greater than the predetermined reference probability. If the calculated probability of writing together equals the probability of writing apart, the document normalization unit 212-1 may be set to write the two consecutive syllables together by default.

In one embodiment, the document normalization unit 212-1 may receive the sentence "우리는 열심히 손을 흔들었다" ("We waved our hands hard") contained in an input answer, join all of its syllables as "우리는열심히손을흔들었다", and then split it into the syllable pairs "우리", "리는", "는열", "열심", "심히", "히손", "손을", "을흔", "흔들", "들었", and "었다". The document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "우리" ("우" and "리" written together) by the number of sentences containing the syllable "우" to obtain the probability of writing "우리" together, divides the number of sentences containing "우 리" ("우" and "리" written apart) by the number of sentences containing the syllable "우" to obtain the probability of writing "우리" apart, and selects the form with the higher probability as the spacing of "우리". Because the probability of writing "우리" together will be higher than that of writing it apart, "우리" is determined to be written together.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "리는" ("리" and "는" written together) by the number of sentences containing the syllable "리" to obtain the probability of writing "리는" together, divides the number of sentences containing "리 는" ("리" and "는" written apart) by the number of sentences containing the syllable "리" to obtain the probability of writing "리는" apart, and selects the form with the higher probability as the spacing of "리는". Because the probability of writing "리는" together will be higher than that of writing it apart, "리는" is determined to be written together.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "는열" ("는" and "열" written together) by the number of sentences containing the syllable "는" to obtain the probability of writing "는열" together, divides the number of sentences containing "는 열" ("는" and "열" written apart) by the number of sentences containing the syllable "는" to obtain the probability of writing "는열" apart, and selects the form with the higher probability as the spacing of "는열". Because the probability of writing "는열" apart will be higher than that of writing it together, "는열" is determined to be written apart.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "열심" ("열" and "심" written together) by the number of sentences containing the syllable "열" to obtain the probability of writing "열심" together, divides the number of sentences containing "열 심" ("열" and "심" written apart) by the number of sentences containing the syllable "열" to obtain the probability of writing "열심" apart, and selects the form with the higher probability as the spacing of "열심". Because the probability of writing "열심" together will be higher than that of writing it apart, "열심" is determined to be written together.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "심히" ("심" and "히" written together) by the number of sentences containing the syllable "심" to obtain the probability of writing "심히" together, divides the number of sentences containing "심 히" ("심" and "히" written apart) by the number of sentences containing the syllable "심" to obtain the probability of writing "심히" apart, and selects the form with the higher probability as the spacing of "심히". Because the probability of writing "심히" together will be higher than that of writing it apart, "심히" is determined to be written together.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "히손" ("히" and "손" written together) by the number of sentences containing the syllable "히" to obtain the probability of writing "히손" together, divides the number of sentences containing "히 손" ("히" and "손" written apart) by the number of sentences containing the syllable "히" to obtain the probability of writing "히손" apart, and selects the form with the higher probability as the spacing of that syllable pair. Because the probability of writing "히손" apart will be higher than that of writing it together, "히손" is determined to be written apart.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "손을" ("손" and "을" written together) by the number of sentences containing the syllable "손" to obtain the probability of writing "손을" together, divides the number of sentences containing "손 을" ("손" and "을" written apart) by the number of sentences containing the syllable "손" to obtain the probability of writing "손을" apart, and selects the form with the higher probability as the spacing of "손을". Because the probability of writing "손을" together will be higher than that of writing it apart, "손을" is determined to be written together.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "을흔" ("을" and "흔" written together) by the number of sentences containing the syllable "을" to obtain the probability of writing "을흔" together, divides the number of sentences containing "을 흔" ("을" and "흔" written apart) by the number of sentences containing the syllable "을" to obtain the probability of writing "을흔" apart, and selects the form with the higher probability as the spacing of "을흔". Because the probability of writing "을흔" apart will be higher than that of writing it together, "을흔" is determined to be written apart.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "흔들" ("흔" and "들" written together) by the number of sentences containing the syllable "흔" to obtain the probability of writing "흔들" together, divides the number of sentences containing "흔 들" ("흔" and "들" written apart) by the number of sentences containing the syllable "흔" to obtain the probability of writing "흔들" apart, and selects the form with the higher probability as the spacing of "흔들". Because the probability of writing "흔들" together will be higher than that of writing it apart, "흔들" is determined to be written together.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "들었" ("들" and "었" written together) by the number of sentences containing the syllable "들" to obtain the probability of writing "들었" together, divides the number of sentences containing "들 었" ("들" and "었" written apart) by the number of sentences containing the syllable "들" to obtain the probability of writing "들었" apart, and selects the form with the higher probability as the spacing of "들었". Because the probability of writing "들었" together will be higher than that of writing it apart, "들었" is determined to be written together.

In addition, the document normalization unit 212-1 divides the number of sentences in the corpus stored in the storage unit 220 that contain "었다" ("었" and "다" written together) by the number of sentences containing the syllable "었" to obtain the probability of writing "었다" together, divides the number of sentences containing "었 다" ("었" and "다" written apart) by the number of sentences containing the syllable "었" to obtain the probability of writing "었다" apart, and selects the form with the higher probability as the spacing of "었다". Because the probability of writing "었다" together will be higher than that of writing it apart, "었다" is determined to be written together.

Using the spacing correction method described above, the document normalization unit 212-1 may determine the spacing of every syllable pair and thereby determine the spacing of the whole sentence to be "우리는 열심히 손을 흔들었다". In the spacing correction step, the document normalization unit 212-1 may correct spacing so that content morphemes such as nouns are written apart from the preceding syllable, while functional morphemes such as particles and endings are attached to the preceding syllable.
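The pairwise decision procedure walked through above can be sketched as follows, assuming the corpus is simply a list of sentence strings. The function names and the three-sentence toy corpus are hypothetical. Because both probabilities for a pair share the same denominator (the number of sentences containing the first syllable), the sketch compares the two sentence counts directly, and a tie falls back to writing the syllables together.

```python
def count_sentences_containing(corpus, pattern):
    """Number of corpus sentences in which `pattern` occurs at least once."""
    return sum(1 for s in corpus if pattern in s)

def correct_spacing(text, corpus):
    syllables = text.replace(" ", "")  # join everything first
    if not syllables:
        return ""
    out = [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        joined = count_sentences_containing(corpus, a + b)
        spaced = count_sentences_containing(corpus, a + " " + b)
        # Same denominator on both sides, so comparing counts is enough;
        # equal counts default to writing the two syllables together.
        out.append(" " + b if spaced > joined else b)
    return "".join(out)

# Toy corpus, for illustration only.
corpus = ["우리는 열심히 손을 흔들었다", "그는 열심히 일했다", "손을 흔들었다"]
fixed = correct_spacing("우리 는열심히 손을흔들었다", corpus)
```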

FIG. 5 is an exemplary diagram showing syllable generation and spacing state transitions for an example sentence according to an embodiment of the present invention.

As shown in FIG. 5, an example sentence is represented as a syllable sequence S=s₁s₂…s_n and a spacing tag sequence T=t₁t₂…t_n, where a space is denoted by “띄” and the absence of a space by “붙”. The start of the syllable sequence is denoted by $, and the spacing tag of the start syllable is defined as “띄”. In one embodiment, the example sentence “우리는 열심히 손을 흔들었다” can be represented by the syllable sequence “우리는열심히손을흔들었다” and the spacing tag sequence “붙붙띄붙붙띄붙띄붙붙붙띄”. Because the corpus stored in the storage unit 220 may not contain “우리는 열심히 손을 흔들었다”, the probability may be calculated piece by piece rather than over the whole syllable sequence at once. The probability values P1(우리는 열심히 손을 흔들었다), P2(우리는열심히손을흔들었다), P3(우리는열심 히손을 흔들었다), P4(우리 는열 심히손 을흔들 었다), and so on are calculated, and the spacing candidate with the highest probability value (for example, "우리는 열심히 손을 흔들었다" with P1) may be selected as the spacing of the example sentence.
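The relationship between a spaced sentence and its tag sequence can be sketched as follows. The function name is hypothetical, and the reading that each tag records whether a space follows its syllable is inferred from the example tag sequence given above.

```python
def to_syllables_and_tags(sentence):
    """Return (syllable sequence, spacing tag sequence) for a spaced sentence.

    Each syllable is tagged "띄" if a space follows it (the last syllable of
    a word) and "붙" otherwise.
    """
    syllables, tags = [], []
    for word in sentence.split():
        for i, ch in enumerate(word):
            syllables.append(ch)
            tags.append("띄" if i == len(word) - 1 else "붙")
    return "".join(syllables), "".join(tags)

seq, tag_seq = to_syllables_and_tags("우리는 열심히 손을 흔들었다")
```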

In addition, in the spelling correction step (S330), the document normalization unit 212-1 may correct those spelling errors in each sentence of the input answer that do not affect the scoring of the answer. That is, the document normalization unit 212-1 removes all punctuation marks from each sentence contained in the multiple input answers and checks whether each word extracted in the sentence separation step (S310) is included in the corpus stored in the storage unit 220. If an extracted word is included in the corpus, its spelling is not corrected; if it is not included in the corpus, the word is treated as a correction target word and corrected to a correction candidate word. When there are multiple correction candidate words, the edit distance between the correction target word and each correction candidate word is compared, and the correction target word is corrected to the candidate with the minimum edit distance. The document normalization unit 212-1 may perform spelling correction using a minimum edit distance algorithm. The edit distance is the number of edits (substitution, deletion, insertion, and the like) applied to the letters (jamo) of the correction target word in order to turn it into the correction candidate word. For example, if the correction target word is "렬심히" and the correction candidate word is "열심히", changing "렬심히" to "열심히" requires substituting "ㅇ" for "ㄹ" once, so the edit distance is 1. Changing the correction target word "렬삼히" to the candidate "열심히" requires substituting "ㅇ" for "ㄹ" and "ㅣ" for "ㅏ", that is, two edits, so the edit distance is 2. Changing the correction target word "여심히" to the candidate "열심히" requires inserting "ㄹ" into "여", that is, one insertion, so the edit distance is 1.
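The jamo-level edit distance in these examples can be sketched as follows. Decomposing Hangul syllables with Unicode NFD normalization is an assumption of this sketch (the description does not say how letters are obtained), and the function names are hypothetical. The distance itself is the standard Levenshtein distance with unit cost for substitution, insertion, and deletion.

```python
import unicodedata

def jamo(word):
    """Decompose Hangul syllables into conjoining jamo via NFD normalization."""
    return unicodedata.normalize("NFD", word)

def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def best_candidate(target, candidates):
    """Pick the candidate with the minimum jamo-level edit distance."""
    return min(candidates, key=lambda c: edit_distance(jamo(target), jamo(c)))

d1 = edit_distance(jamo("렬심히"), jamo("열심히"))
d2 = edit_distance(jamo("렬삼히"), jamo("열심히"))
d3 = edit_distance(jamo("여심히"), jamo("열심히"))
```

With this decomposition the three distances come out as 1, 2, and 1, in line with the examples in the text.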

This spelling correction step (S330) may be applied at the level of words (eojeol), syllables, or phonemes. For example, '놀:구: → 고' is a spelling correction rule applied at the word level, and 'ㄱ:ㅜ:ㅅㅣㅍ → ㅗ' is a spelling correction rule applied at the phoneme level. Here, a spelling correction rule 'w:x:z → y' means that x is converted to y when the word, syllable, or phoneme to the left of x is w and the word, syllable, or phoneme to its right is z. Applying correction rules at the word level gives high accuracy but narrows their coverage. Applying correction rules at the phoneme level, by contrast, widens coverage but lowers the accuracy of the rules. The two therefore need to be combined appropriately.
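A context rule of the form 'w:x:z → y' can be sketched as a regular-expression substitution with a left and a right context. The function name is hypothetical, the test strings "놀구" and "하구싶다" are made-up inputs, and applying the phoneme-level rule to NFD-decomposed text is an assumption of this sketch.

```python
import re
import unicodedata

def apply_rule(text, w, x, z, y):
    """Replace x with y where x is preceded by w and followed by z.

    An empty w or z means that side of the context is unconstrained.
    """
    left = f"(?<={re.escape(w)})" if w else ""
    right = f"(?={re.escape(z)})" if z else ""
    return re.sub(left + re.escape(x) + right, lambda m: y, text)

# Rule '놀:구: → 고' applied to a plain string.
word_level = apply_rule("놀구", "놀", "구", "", "고")

# Rule 'ㄱ:ㅜ:ㅅㅣㅍ → ㅗ' applied to jamo obtained by NFD decomposition.
nfd = lambda s: unicodedata.normalize("NFD", s)
g, u = nfd("구")   # initial consonant and vowel of "구"
o = nfd("고")[1]   # vowel of "고"
phoneme_level = unicodedata.normalize(
    "NFC", apply_rule(nfd("하구싶다"), g, u, nfd("싶"), o)
)
```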

Besides the rule-based spelling correction method described above, the spelling correction step may be performed using various other methods. First, the word-based spelling correction method corrects a misspelled word by searching the corpus stored in the storage unit 220 for the most similar word on the basis of edit distance. However, because the word-based method treats any word not included in the corpus as misspelled, it can be vulnerable to neologisms. In addition, because Korean combines multiple morphemes to produce a very wide variety of words (eojeol), it can be difficult to include every word in the corpus.

Second, the morpheme-based spelling correction method uses a morphological analyzer to achieve accurate spelling correction. Morphological analysis is very effective given the characteristics of Korean, but the morpheme-based method runs morphological analysis for every correction candidate, so its efficiency is low, and it is inevitably vulnerable when the morphological analysis itself is performed incorrectly.

Third, the syllable-based spelling correction method corrects spelling errors on the basis of syllables and has the advantage of handling unregistered words robustly. Because all of the required information is learned from the corpus, the construction cost is not high. Performance is relatively good when the training set and the test set have similar characteristics, but may be poor otherwise.

Fourth, the grapheme-based spelling correction method generates and verifies correction candidates at the grapheme (jaso) level for frequently occurring spelling errors. Because this method generates many correction candidates, the complexity of correction increases greatly. Too many candidates actually hinder the search for the best one, so high performance is hard to expect unless the surrounding context is sufficiently taken into account.

Rather than using only one of the correction methods described above, correcting in stages so that the methods compensate for one another's weaknesses may yield better spelling correction performance. In particular, performing grapheme-based or syllable-based correction first and then eojeol-based correction may perform well. This is because some of the errors introduced by grapheme-level or syllable-level correction are corrected again, properly, at the eojeol level.

In one embodiment, the document normalization unit 212-1 may remove punctuation marks from the sentences included in an input answer and split the sentences into eojeols (space-delimited word units). If a split eojeol is included in the corpus stored in the storage unit 220, the document normalization unit 212-1 does not correct its spelling; if the eojeol is not included in the corpus, the unit may extract correction candidate eojeols whose edit distance from that eojeol is within a predetermined value. For example, the document normalization unit 212-1 may find that the eojeol "렬심히" in the input answer is not included in the corpus, and may extract "열심히" ("diligently") as a correction candidate whose edit distance from the correction target "렬심히" is within 1. The document normalization unit 212-1 may then perform spelling correction by replacing the correction target eojeol with the extracted correction candidate eojeol.
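The corpus lookup and edit-distance filtering described here can be sketched as follows. This is only an illustration under stated assumptions: edit distance is computed over syllable characters (Levenshtein distance), the corpus is a plain set of eojeols, and ties between equally distant candidates are broken arbitrarily by string order. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted per character."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def correct(eojeol: str, corpus: set, max_dist: int = 1) -> str:
    """Return the eojeol unchanged if known, else the closest corpus entry."""
    if eojeol in corpus:
        return eojeol
    scored = [(edit_distance(eojeol, w), w) for w in corpus]
    candidates = [c for c in scored if c[0] <= max_dist]
    # No candidate within the threshold: leave the eojeol as written.
    return min(candidates)[1] if candidates else eojeol


corpus = {"열심히", "우리는", "손을", "흔들었다"}
fixed = correct("렬심히", corpus)
```

A real system would need an index rather than a linear scan over the corpus; the scan is kept here for brevity.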

In addition, in the abbreviation expansion step (S340), the document normalization unit 212-1 may replace an abbreviation in each sentence of the input answer with a standard expression having the same concept or meaning. In one embodiment, the document normalization unit 212-1 may convert abbreviations in the input answer into standard expressions using a thesaurus stored in the storage unit 220. That is, the document normalization unit 212-1 checks whether each eojeol extracted in the sentence separation step (S310) is included in the thesaurus stored in the storage unit 220; if the eojeol is not included in the thesaurus, it is left unchanged, and if it is included, it may be changed to the expanded form registered in the thesaurus. A thesaurus may mean a dictionary that represents relations (synonyms or antonyms) between different words. For example, "평가원", "KICE", and "교육과정평가원" may be registered in the thesaurus as synonyms of "한국교육과정평가원" (Korea Institute for Curriculum and Evaluation), and the document normalization unit 212-1 may check the input answer and, if it contains "평가원", "KICE", or "교육과정평가원", change each of them to "한국교육과정평가원".
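The thesaurus lookup can be pictured as a dictionary from each registered variant to its standard expression. The sketch below assumes exact-match lookup on whole eojeols; the dictionary contents come from the example above, while the names `THESAURUS` and `expand_abbreviations` are hypothetical. Eojeols with attached particles (for example "평가원은") would need morphological handling that this sketch does not attempt.

```python
# Variant -> standard expression, taken from the example in the text.
THESAURUS = {
    "평가원": "한국교육과정평가원",
    "KICE": "한국교육과정평가원",
    "교육과정평가원": "한국교육과정평가원",
}


def expand_abbreviations(eojeols, thesaurus=THESAURUS):
    """Replace eojeols found in the thesaurus; leave all others unchanged."""
    return [thesaurus.get(e, e) for e in eojeols]


expanded = expand_abbreviations(["KICE", "발표", "자료"])
```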

In addition, in the symbol removal step (S350), the document normalization unit 212-1 may remove symbols in the input answer that are unnecessary for scoring. In one embodiment, when sentences appear in different forms because they contain punctuation marks {double quotation mark ("), single quotation mark ('), period (.), question mark (?), exclamation mark (!), and the like}, the document normalization unit 212-1 may remove the symbols through the symbol removal step (S350) and output sentences of the same form. For example, when the input answer is "우리는 열심히 손을 흔들었다." ("We waved our hands hard."), the document normalization unit 212-1 may remove punctuation marks such as the question mark and the period and revise the answer to "우리는 열심히 손을 흔들었다". How far the range of symbols to be removed should extend may vary with the application. In input answers in particular, examinees often write the symbol or item number indicated in the passage together with the answer, and these need to be removed. However, if the alphabetic symbol appearing in answers such as "A-새벽별" or "A: 새벽별" to a Korean language item is simply removed so that the answer is normalized to "새벽별" ("morning star"), then in a science item "gas C is produced by reacting gases A and B" can no longer be distinguished from "gas A is produced by reacting gases C and B". The symbol removal step (S350) may therefore be provided as an option that the grader can use selectively, taking the subject and the item into account.
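One way to expose symbol removal as a grader-selectable option is a flag that controls whether leading answer labels are stripped. The sketch below is an assumption-laden illustration: the set of punctuation marks is limited to those listed in the text, a "label" is assumed to be a single Latin letter followed by "-" or ":" at the start of the answer, and the names are hypothetical.

```python
import re

PUNCT = re.compile(r"[\"'.?!]")                  # marks listed in the text
LABEL = re.compile(r"^\s*[A-Za-z]\s*[-:]\s*")    # e.g. "A-" or "A: " (assumed shape)


def remove_symbols(text: str, strip_labels: bool = False) -> str:
    """Remove punctuation; strip a leading label only when the grader opts in."""
    if strip_labels:
        text = LABEL.sub("", text)
    return PUNCT.sub("", text).strip()


plain = remove_symbols("우리는 열심히 손을 흔들었다.")
korean_item = remove_symbols("A: 새벽별", strip_labels=True)
science_item = remove_symbols("기체 A와 B를 반응시켜 기체 C가 생성된다.")
```

With the option off, the letters A, B, and C in the science answer are preserved, which is the distinction the paragraph above is concerned with.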

The morphological analysis unit 212-2 may split each sentence of the input answer into eojeols using a corpus (a part-of-speech-tagged corpus) stored in the storage unit 220, and may extract every possible morphological analysis for each eojeol, taking lexical ambiguity and part-of-speech ambiguity into account. The part-of-speech tagging unit 212-3 may use a probability-based part-of-speech tagging model to select, from the per-eojeol analyses extracted by the morphological analysis unit 212-2, the most probable form as the morphological analysis of that eojeol, and may tag each morpheme with its part of speech.

Table 5 shows an example of the morphological analysis results of the morphological analysis unit 212-2 and the part-of-speech tagging results of the part-of-speech tagging unit 212-3 according to one embodiment.

[Table 5]

Referring to Table 5, when the input answer includes "우리는 손을 열심히 흔들었다" ("We waved our hands hard"), the morphological analysis unit 212-2 may use the part-of-speech-tagged corpus to extract every possible morphological analysis of the sentence, eojeol by eojeol.

For example, the eojeol "우리는" may be analyzed as "우리" (pronoun) + "는" (auxiliary particle), as "우리" (noun) + "는" (auxiliary particle), or as "우" (noun) + "리" (noun) + "는" (auxiliary particle).

The eojeol "손을" may be analyzed as "손" (noun) + "을" (auxiliary particle).

The eojeol "열심히" may be analyzed as "열심히" (adverb), or as "열심" (noun) + "히" (adverbializing suffix).

The eojeol "흔들었다" may be analyzed as "흔들" (verb) + "었" (pre-final ending) + "다" (sentence-final ending).

In one embodiment, the part-of-speech tagging unit 212-3 may use the extracted per-eojeol morphological analyses and the part-of-speech-tagged corpus to select, as the morphological analysis of each eojeol, the analysis that has the highest probability of appearing in the part-of-speech-tagged corpus.

Taking Table 5 as an example, for the eojeol "우리는" the form "우리" (pronoun) + "는" (auxiliary particle) occurs most often (has the highest probability) in the part-of-speech-tagged corpus, so that form is selected as the morphological analysis; for the eojeol "손을" the form "손" (noun) + "을" (auxiliary particle) occurs most often, so that form is selected. Likewise, for the eojeol "열심히" the form in which "열심히" is an adverb occurs most often and is selected, and for the eojeol "흔들었다" the form "흔들" (verb) + "었" (pre-final ending) + "다" (sentence-final ending) occurs most often and is selected. Using this method, the part-of-speech tagging unit 212-3 can tag each morpheme with its part of speech based on the selected morphological analysis.
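Selecting the most frequent analysis for an eojeol can be sketched as a simple maximum over corpus counts. The counts below are invented purely for illustration, the tag names are English glosses rather than a real tag set, and `pick_analysis` is a hypothetical name; a real probability-based tagger would normally also use context.

```python
from collections import Counter

# Each candidate analysis is a tuple of (morpheme, tag) pairs.
candidates = [
    (("우리", "pronoun"), ("는", "aux_particle")),
    (("우리", "noun"), ("는", "aux_particle")),
    (("우", "noun"), ("리", "noun"), ("는", "aux_particle")),
]

# Hypothetical occurrence counts in a part-of-speech-tagged corpus.
corpus_counts = Counter({
    candidates[0]: 120,
    candidates[1]: 7,
    candidates[2]: 1,
})


def pick_analysis(cands, counts):
    """Return the candidate analysis seen most often in the tagged corpus."""
    return max(cands, key=lambda a: counts[a])


best = pick_analysis(candidates, corpus_counts)
```

Because `Counter` returns 0 for unseen keys, analyses that never occur in the corpus are ranked last without special handling.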

The negative expression recognition unit 212-4 may recognize negative expressions in each sentence of the input answer and attach negation tags, using negative expressions stored in the storage unit 220 such as negative adverbs (못, 안, 아니, and the like), negative auxiliary predicate phrases ('~지 못하/않/아니하', '~지 마라', '~지 말자', and the like), negative chunks ('~ㄹ/을 수 없' and the like), negative predicates ('아니다', '없다', and the like), and double negative expressions ('~지 않으면 안 된다' and the like). In one embodiment, the negative expression recognition unit 212-4 checks whether each sentence in the input answers includes a predetermined negative expression; if a sentence does not include one, no negation tag is attached to it, and if a sentence does include one, a negation tag may be attached to it. The negative expression recognition unit 212-4 may classify negative expressions into negation (NOT), inability (CANNOT), and emphasis (STRESS) and attach the corresponding negation tag. Table 6 shows an example of input sentences and the resulting negation tags according to an embodiment of the present invention.

[Table 6]

Referring to Table 6, when "소란스럽지 않다" ("is not noisy") is entered as an input answer, it includes the negative expression "~지 않" stored in the storage unit 220, so the negative expression recognition unit 212-4 may remove that negative expression, attach the negation tag NOT, and revise the answer to the form "소란스럽다(NOT)".

Likewise, when "먹을 수 없다" ("cannot eat") is entered as an input answer, it includes the negative expression "~을 수 없" stored in the storage unit 220, so the negative expression recognition unit 212-4 may remove that negative expression, attach the negation tag CANNOT, which denotes inability, and revise the answer to the form "먹다(CANNOT)".

When "공부를 안 했다" ("did not study") is entered as an input answer, it includes the negative expression "~를 안" stored in the storage unit 220, so the negative expression recognition unit 212-4 may remove that negative expression, attach the negation tag NOT, and revise the answer to the form "공부하다(NOT)".

When "이번에는 내가 가지 않으면 안 된다" ("this time I must go", literally "it will not do unless I go") is entered as an input answer, it includes the double negative expression "~지 않으면 안" stored in the storage unit 220, so the negative expression recognition unit 212-4 may remove that double negative expression, attach the tag STRESS, which denotes emphasis, and revise the answer to the form "이번에는 내가 가다(STRESS)".
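The four examples above can be reproduced with a small pattern table, shown here as a rough sketch only. It works on surface strings with regular expressions, which is an assumption made for brevity; the source describes recognition over stored negative expressions and would in practice rely on morphological analysis to restore the base form. The pattern list and the name `tag_negation` are hypothetical, and the double-negative pattern is tried first so that it is not mistaken for a simple negation.

```python
import re

# (pattern, replacement that restores a base form, tag); order matters.
PATTERNS = [
    (re.compile(r"^(.+)지 않으면 안 된다$"), r"\1다", "STRESS"),
    (re.compile(r"^(.+)을 수 없다$"), r"\1다", "CANNOT"),
    (re.compile(r"^(.+)지 않다$"), r"\1다", "NOT"),
    (re.compile(r"^(.+)를 안 했다$"), r"\1하다", "NOT"),
]


def tag_negation(sentence: str):
    """Return (rewritten sentence, tag), or (sentence, None) if no pattern matches."""
    for pattern, repl, tag in PATTERNS:
        if pattern.match(sentence):
            return pattern.sub(repl, sentence), tag
    return sentence, None


results = [tag_negation(s) for s in (
    "소란스럽지 않다",
    "먹을 수 없다",
    "공부를 안 했다",
    "이번에는 내가 가지 않으면 안 된다",
)]
```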

The chunking unit 212-5 may extract, from each sentence of the input answer, non-recursive phrases: phrases such as noun phrases or verb phrases whose elements are syntactically closely connected and which do not contain another phrase within them. Chunking may also be called partial parsing. Chunking the input answer can reduce the complexity of parsing. In one embodiment, the chunking unit 212-5 may use a probability-based method to split each sentence of the input answer into eojeols, compare the probability that a split eojeol is classified in the corpus (chunking corpus) stored in the storage unit 220 as belonging to the same phrase (Inside) with the probability that it is classified as the beginning of a new phrase (Begin), and decide according to the higher probability whether the eojeol should be grouped into the same phrase or should start a new phrase.

Table 7 shows an example of an input answer and its chunking result according to an embodiment of the present invention.

[Table 7]

Referring to Table 7, the chunking unit 212-5 may split the input answer into eojeols (책사랑, 독서, 퀴즈, 대회는, 우리, 학교의, 독서, 문화를, 뿌리내리게, 했다), count how many times the consecutive eojeols "책사랑 독서 퀴즈 대회" ("Book Love Reading Quiz Contest") are classified as the same phrase in the chunking corpus, count how many times they are classified as separate new phrases, and decide whether to chunk them according to whichever count is larger. In this case, "책사랑 독서 퀴즈 대회" is classified as the same phrase more often than as separate new phrases, so the eojeols are chunked together. By the same principle, the chunking unit 212-5 may chunk "우리 학교" ("our school"), "독서 문화" ("reading culture"), and "뿌리내리게 했다" ("made take root").
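The Inside/Begin decision can be sketched as a left-to-right pass that compares two counts for each eojeol. The sketch assumes, for simplicity, that the counts are keyed on the pair (previous eojeol, current eojeol); the source does not specify the conditioning, and all counts below are invented for illustration. The name `chunk` is hypothetical.

```python
def chunk(eojeols, inside_count, begin_count):
    """Group eojeols into chunks by comparing Inside vs. Begin counts."""
    if not eojeols:
        return []
    chunks = [[eojeols[0]]]
    for prev, cur in zip(eojeols, eojeols[1:]):
        key = (prev, cur)
        if inside_count.get(key, 0) > begin_count.get(key, 0):
            chunks[-1].append(cur)   # continue the current phrase (Inside)
        else:
            chunks.append([cur])     # start a new phrase (Begin)
    return chunks


eojeols = "책사랑 독서 퀴즈 대회는 우리 학교의 독서 문화를 뿌리내리게 했다".split()

# Hypothetical chunking-corpus counts.
inside = {("책사랑", "독서"): 9, ("독서", "퀴즈"): 8, ("퀴즈", "대회는"): 12,
          ("우리", "학교의"): 30, ("독서", "문화를"): 15, ("뿌리내리게", "했다"): 20}
begin = {("대회는", "우리"): 25, ("학교의", "독서"): 6, ("문화를", "뿌리내리게"): 11,
         ("책사랑", "독서"): 2}

chunks = chunk(eojeols, inside, begin)
```

With these counts the ten eojeols collapse into the four chunks named in the text.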

To parse the sentence from Table 7, "책사랑 독서 퀴즈 대회는 우리 학교의 독서 문화를 뿌리내리게 했다" ("The Book Love Reading Quiz Contest made a reading culture take root in our school"), eojeol by eojeol, every possible eojeol pair would have to be extracted from the 10 eojeols and parsed. If the chunking result is used instead, only the possible syntactic relations among the four phrases [책사랑 독서 퀴즈 대회] (noun phrase), [우리 학교] (noun phrase), [독서 문화] (noun phrase), and [뿌리내리게 했다] (verb phrase) need to be examined, so chunking can reduce the complexity of parsing.

The paraphrasing unit 212-6 may perform a paraphrasing process that rewrites each sentence of the input answer as a simpler sentence, using substitute words and expressions while preserving the meaning of the sentence. In one embodiment, the paraphrasing unit 212-6 may convert expressions in the input answer into standard expressions using a thesaurus stored in the storage unit 220. A thesaurus may mean a dictionary that represents relations (synonyms or antonyms) between different words or expressions. For example, "가/particle" may be registered in the thesaurus as a synonym of "이/particle" and "는다/ending" as a synonym of "다/ending"; when "우리+가 밥+을 먹+는다" ("we eat a meal") is entered as an input answer, the paraphrasing unit 212-6 may refer to the thesaurus and rewrite it as "우리+이 밥+을 먹+다".

The dependency analysis unit 212-7 may parse each sentence of the input answer on the basis of dependency grammar. Dependency grammar expresses the relation between a dependent and a governor: the governor is the element that carries the core of the meaning among linguistic elements in a dependency relation, and the dependent is the element that complements the meaning of the governor. In one embodiment, the dependency analysis unit 212-7 may analyze the dependency relations of the eojeols in the reverse order of the sentence and distinguish dependents from governors on the basis of probability. That is, the dependency analysis unit 212-7 may calculate, for each eojeol extracted from the sentences in the input answers, the probability of being classified as a dependent and the probability of being classified as a governor, compare the two probabilities, and classify each eojeol according to the higher probability.

For example, to analyze the dependency relations of "오늘 나는 밥을 먹었다" ("Today I ate a meal"), "먹었다" ("ate") and "밥을" ("meal") are first connected by a dependency relation in the reverse order of the sentence, as shown in FIG. 6a. As shown in FIG. 6b, when the next eojeol in reverse order, "나는" ("I"), is analyzed, it must be decided whether "나는" depends on "먹었다" (A) or on "밥을" (B). The dependency analysis unit 212-7 may use the corpus stored in the storage unit 220 to calculate the probability that "나는" and "먹었다" are in a dependency relation and the probability that "나는" and "밥을" are in a dependency relation, and select the more probable relation (A), in which "나는" depends on "먹었다". As shown in FIG. 6c, when the next eojeol in reverse order, "오늘" ("today"), is analyzed, "오늘" and "밥을" cannot be in a dependency relation because that would produce a crossing in the linguistic structure, so it only needs to be decided whether "오늘" depends on "먹었다" (C) or on "나는" (D). The dependency analysis unit 212-7 may use the corpus stored in the storage unit 220 to calculate the probability that "오늘" and "먹었다" are in a dependency relation and the probability that "오늘" and "나는" are in a dependency relation, and select the more probable relation (D), in which "오늘" depends on "나는". FIG. 6d shows the completed dependency analysis of the input answer sentence "오늘 나는 밥을 먹었다".
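The right-to-left procedure in this example can be sketched as a greedy parser that, for each eojeol, picks the most probable governor to its right among those that do not create a crossing arc. The pairwise probabilities below are invented for illustration (the text obtains them from a corpus), the last eojeol is assumed to be the root, and the name `parse` is hypothetical.

```python
def parse(eojeols, prob):
    """Return {dependent index: governor index}, attaching right to left."""
    n = len(eojeols)
    head = {}
    for i in range(n - 2, -1, -1):          # the last eojeol is the root
        best = None
        for j in range(i + 1, n):
            # Arc (i, j) crosses an existing arc (k, h) when i < k < j < h.
            if any(i < k < j < h for k, h in head.items()):
                continue
            p = prob.get((eojeols[i], eojeols[j]), 0.0)
            if best is None or p > best[0]:
                best = (p, j)
        head[i] = best[1]                   # j = i + 1 never crosses, so best is set
    return head


eojeols = ["오늘", "나는", "밥을", "먹었다"]
prob = {  # hypothetical dependency probabilities
    ("밥을", "먹었다"): 0.9,
    ("나는", "먹었다"): 0.7, ("나는", "밥을"): 0.1,
    ("오늘", "나는"): 0.5, ("오늘", "먹었다"): 0.3,
    ("오늘", "밥을"): 0.99,  # high on purpose: still excluded because it would cross
}

heads = parse(eojeols, prob)
```

The resulting arcs are 밥을→먹었다, 나는→먹었다 (choice A), and 오늘→나는 (choice D), matching the walk-through above.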

The classification scoring unit 214 may score the scoring target answers produced by the language processing unit 212 and form a scoring result. In one embodiment, the classification scoring unit 214 may form the scoring result through procedures such as generating training answers, extracting and selecting features of the scoring target answers, generating a learning model, forming classification criteria for the scoring target answers, classifying the scoring target answers, and scoring the scoring target answers, but the procedures that the classification scoring unit 214 performs on the scoring target answers are not limited to this implementation. The classification scoring unit 214 may include a training answer generation unit 214-1, a feature extraction unit 214-2, an answer classification unit 214-3, and a scoring unit 214-4.

The training answer generation unit 214-1 may select, from the scoring target answers for which language processing has been completed, a first predetermined number of answers whose frequency is at or above a predetermined threshold, and form the first predetermined number of answers into training answers. In one embodiment, the training answer generation unit 214-1 may select, as the first predetermined number of answers, the scoring target answers that have a frequency of 80% or more among all scoring target answers, or may select an arbitrary preset number (for example, 10) of scoring target answers as the first predetermined number of answers, but the method of selecting the first predetermined number of answers is not limited to these embodiments.

The feature extraction unit 214-2 may extract features, which correspond to predetermined characteristic elements, from each of the first predetermined number of answers selected by the training answer generation unit 214-1, using the results of the natural language processing performed by the language processing unit 212. In one embodiment, the language processing unit 212 may perform natural language processing that includes: document normalization of the input answer, comprising a sentence separation step, a spacing correction step, a spelling correction step, an abbreviation expansion step, and a symbol removal step; morphological analysis of the input answer; part-of-speech tagging of the morphemes in the input answer; recognition of negative expressions in the input answer; chunking of at least two eojeols in the input answer; paraphrasing, which converts eojeols or phrases in the input answer into predetermined standard expressions; and dependency analysis, which analyzes the dependency structure between morphemes or eojeols in the input answer.

In addition, the feature extraction unit 214-2 may extract morpheme features using the results of morphological analysis and part-of-speech tagging of the first predetermined number of answers selected by the training answer generation unit 214-1. For example, when a sentence in one of the first predetermined number of answers is "우리는 손을 열심히 흔들었다", the result of morphological analysis and part-of-speech tagging by the morphological analysis unit 212-2 and the part-of-speech tagging unit 212-3 is "우리/pronoun + 는/auxiliary particle + 손/noun + 을/auxiliary particle + 열심히/adverb + 흔들/verb + 었/pre-final ending + 다/sentence-final ending", and "우리/pronoun", "는/auxiliary particle", "손/noun", "을/auxiliary particle", "열심히/adverb", "흔들/verb", "었/pre-final ending", and "다/sentence-final ending" may be extracted as morpheme features.

In addition, the feature extraction unit 214-2 may extract eojeol features based on the eojeols in the first predetermined number of answers selected by the training answer generation unit 214-1. For example, when a sentence in one of the first predetermined number of answers is "우리는 손을 열심히 흔들었다", the eojeols, which are the space-delimited units "우리는", "손을", "열심히", and "흔들었다", may be extracted as eojeol features.

In addition, the feature extraction unit 214-2 may extract base phrase features by chunking the eojeols in the first predetermined number of answers selected by the training answer generation unit 214-1. For example, when a sentence in one of the first predetermined number of answers is "책사랑 독서 퀴즈 대회는 우리 학교의 독서 문화를 뿌리내리게 했다", the chunking result of the chunking unit 212-5 is "[책사랑 독서 퀴즈 대회](noun phrase)는 [우리 학교](noun phrase)의 [독서 문화](noun phrase)를 [뿌리내리게 했다](verb phrase)", and the individual chunks "[책사랑 독서 퀴즈 대회](noun phrase)", "[우리 학교](noun phrase)", "[독서 문화](noun phrase)", and "[뿌리내리게 했다](verb phrase)" may be extracted as base phrase features.

In addition, the feature extraction unit 214-2 may form dependency features by performing dependency analysis on the first predetermined number of answers selected by the training answer generation unit 214-1. For example, when a sentence in one of the first predetermined number of answers is "오늘 나는 밥을 먹었다", the dependency analysis result of the dependency analysis unit 212-7 is as shown in FIG. 6d, and the individual dependency relations [나는, 먹었다], [밥을, 먹었다], and [오늘, 나는, 먹었다] may be formed into dependency features.

In addition, the feature extraction unit 214-2 may form n-gram features by combining morpheme features, or eojeol features, into bigrams or trigrams. In one embodiment, two adjacent morpheme features or eojeol features may be combined to form a morpheme or eojeol bigram feature, and three adjacent morpheme features or eojeol features may be combined to form a morpheme or eojeol trigram feature.
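Forming bigram and trigram features from a sequence of eojeol (or morpheme) features amounts to sliding a window over the sequence. A minimal sketch, with the hypothetical helper name `ngrams`:

```python
def ngrams(features, n):
    """Return all runs of n adjacent features as tuples."""
    return [tuple(features[i:i + n]) for i in range(len(features) - n + 1)]


eojeol_features = ["우리는", "손을", "열심히", "흔들었다"]
bigrams = ngrams(eojeol_features, 2)
trigrams = ngrams(eojeol_features, 3)
```

The same function applies unchanged to morpheme features such as "우리/pronoun" and "는/auxiliary particle".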

In addition, the feature extraction unit 214-2 may extract features from each of a second predetermined number of answers, which are the scoring target answers other than the first predetermined number of answers, using the results of the natural language processing performed by the language processing unit 212.

The answer classification unit 214-3 may select a machine learning method for forming the classification criterion used to classify the answers to be scored, using the features extracted by the feature extraction unit 214-2, form the classification criterion with the selected machine learning method, and classify the answers to be scored into at least two clusters according to the formed criterion (for example, into correct and incorrect answers, or into three or more graded scores such as 3 points, 2 points, and 1 point). In one embodiment, the machine learning method for forming the classification criterion may include an unsupervised learning method, a supervised learning method, a semi-supervised learning method, an ensemble learning method, and the like.

In the unsupervised learning method, the first predetermined number of answers selected by the training answer generation unit 214-1 may be clustered into at least two clusters to form a classification criterion, and the formed criterion may be used to classify the second predetermined number of answers (the answers other than the first predetermined number of answers) into one of the at least two clusters. In one embodiment, the answer classification unit 214-3 may convert each of the first predetermined number of answers into a feature vector using the features that the feature extraction unit 214-2 extracted from those answers, set an initial classification criterion corresponding to each of the at least two clusters using the same features, and convert each initial classification criterion into a reference vector.

In addition, the answer classification unit 214-3 may calculate the similarity (distance) between each reference vector and the feature vector of each of the first predetermined number of answers, and, based on the calculated similarity (distance), classify each answer into the cluster whose reference vector has the highest similarity (shortest distance). After classifying each of the first predetermined number of answers into a cluster, the answer classification unit 214-3 may average, coordinate by coordinate, the feature vectors of the answers in each cluster and set the result as the new reference vector of that cluster. Using the new reference vectors, it may then repeat the similarity calculation and the classification into the most similar cluster. The answer classification unit 214-3 may repeat the similarity calculation and cluster classification until the answers contained in each cluster no longer change, and, once they no longer change, set the reference vector of each cluster as the final reference vector. In one embodiment, the answer classification unit 214-3 may perform clustering and form the classification criterion using K-means clustering, hierarchical clustering, density-based clustering, grid-based clustering, or the like, but is not limited to these methods.

Here, K-means clustering groups the given data into k clusters and may operate so as to minimize the variance of the distances between the data and each cluster. K-means clustering is a kind of unsupervised learning and may serve to attach labels to unlabeled input data. The structure of clusters formed by K-means clustering may be similar to that of clusters formed by the EM (Expectation-Maximization) algorithm.
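For reference, the quantity minimized by K-means is usually written as the within-cluster sum of squared distances. The formulation below is the standard textbook one, not taken from the patent text; the symbols S_i (the i-th cluster), x (a feature vector), and μ_i (the mean of cluster i) are introduced here for illustration.

```latex
\min_{S_1,\dots,S_k} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```

Here μ_i plays the role of the reference vector of cluster i, which is why re-averaging the members of a cluster is the natural update step.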

Hierarchical clustering refers to a method that performs clustering either by merging clusters with a hierarchical structure bottom-up (the agglomerative method) or by recursively splitting them top-down (the divisive method).

Density-based clustering is a method that expands a cluster based on the density of neighboring data.

Grid-based clustering is a method that divides the space in which the data lie into a finite number of cells forming a grid structure and performs clustering on that grid structure.

In addition, the answer classification unit 214-3 may calculate the similarity (distance) between each final reference vector and the feature vector of each of the second predetermined number of answers, and, based on the calculated similarity (distance), classify each of the second predetermined number of answers into the cluster whose reference vector has the highest similarity (shortest distance).

Table 8 shows an answer classification method using the unsupervised learning method according to an embodiment of the present invention. Assume that the first predetermined number of answers are the four answers "아름이를 생포하였다" ("Areum was captured"), "아름이를 잡았다" ("Areum was caught"), "동물을 생포하는 사람이다" ("a person who captures animals"), and "동물들을 보살핀다" ("takes care of animals"). The morpheme features extracted from these answers are "아름이" (Areum), "동물" (animal), "생포" (capture), "잡다" (catch), and "보살피다" (take care of). Of the extracted features, "생포", "아름이", and "동물" may be selected as the features for the unsupervised learning method, and converting each of the first predetermined number of answers into a feature vector gives "아름이를 생포하였다" → (1,1,0), "아름이를 잡았다" → (0,1,0), "동물을 생포하는 사람이다" → (1,0,1), and "동물들을 보살핀다" → (0,0,1).
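The conversion into feature vectors can be sketched as a binary presence vector over the selected features. This is an illustrative sketch: the function name `to_feature_vector` is hypothetical, and the per-answer morpheme lists are written out by hand here, whereas a real system would obtain them from the morphological analysis.

```python
from typing import Dict, List, Sequence, Tuple

# Selected features, in the axis order used in the example:
# 생포 (capture), 아름이 (Areum), 동물 (animal).
SELECTED_FEATURES = ("생포", "아름이", "동물")


def to_feature_vector(
    answer_features: Sequence[str],
    selected: Sequence[str] = SELECTED_FEATURES,
) -> Tuple[int, ...]:
    """Return 1 for each selected feature present in the answer, else 0."""
    present = set(answer_features)
    return tuple(1 if feature in present else 0 for feature in selected)


# Hand-written morpheme features for the four example answers.
extracted: Dict[str, List[str]] = {
    "아름이를 생포하였다": ["아름이", "생포"],
    "아름이를 잡았다": ["아름이", "잡다"],
    "동물을 생포하는 사람이다": ["동물", "생포"],
    "동물들을 보살핀다": ["동물", "보살피다"],
}
vectors = {answer: to_feature_vector(f) for answer, f in extracted.items()}
```

Features that were extracted but not selected, such as "잡다" and "보살피다", do not contribute an axis and are ignored.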

[Table 8]

In one embodiment, the answer classification unit 214-3 may set the initial reference vector of cluster A to (1,1,0) and the initial reference vector of cluster B to (0,0,1). The answer classification unit 214-3 may calculate the similarity (distance) between the initial reference vectors of clusters A and B, {(1,1,0), (0,0,1)}, and the feature vectors of the first predetermined number of answers, {(1,1,0), (0,1,0), (1,0,1), (0,0,1)}, and, based on the calculated similarity, classify each of the first predetermined number of answers into the cluster with the highest similarity (shortest distance). That is, for "아름이를 생포하였다", the distance from cluster A is 0 and the (Euclidean) distance from cluster B is √3, so it may be classified into cluster A, the closest cluster; for "아름이를 잡았다", the distance from cluster A is 1 and the distance from cluster B is √2, so it may be classified into cluster A, the closest cluster. Likewise, for "동물들을 보살핀다", the distance from cluster A is √3 and the distance from cluster B is 0, so it may be classified into cluster B, the closest cluster; for "동물을 생포하는 사람이다", the distance from cluster A is √2 and the distance from cluster B is 1, so it may be classified into cluster B, the closest cluster.

In addition, the answer classification unit 214-3 may reset the reference vectors by averaging, coordinate by coordinate, the feature vectors of the classified answers. That is, averaging the coordinates of "아름이를 생포하였다" (1,1,0) and "아름이를 잡았다" (0,1,0), which were classified into cluster A, gives (0.5, 1, 0), which may be set as the new reference vector of cluster A; averaging the coordinates of "동물을 생포하는 사람이다" (1,0,1) and "동물들을 보살핀다" (0,0,1), which were classified into cluster B, gives (0.5, 0, 1), which may be set as the new reference vector of cluster B.

In addition, the answer classification unit 214-3 may repeat the calculation of similarity between the newly set reference vectors and the feature vectors of the first predetermined number of answers and the classification into the most similar cluster. The answer classification unit 214-3 may repeat the similarity calculation and the classification into the most similar cluster until the answers contained in each cluster no longer change, and set the reference vectors at that point as the final reference vectors. That is, the answer classification unit 214-3 repeats the similarity calculation and cluster classification between the new reference vectors of clusters A and B, {(0.5, 1, 0), (0.5, 0, 1)}, and the feature vectors of the first predetermined number of answers, {(1,1,0), (0,1,0), (1,0,1), (0,0,1)}, classifying "아름이를 생포하였다" (1,1,0) and "아름이를 잡았다" (0,1,0) into cluster A and "동물을 생포하는 사람이다" (1,0,1) and "동물들을 보살핀다" (0,0,1) into cluster B. Since clusters A and B do not change at all, the final reference vector of cluster A may be set to (0.5, 1, 0) and the final reference vector of cluster B to (0.5, 0, 1).
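The assign-and-average loop of this worked example can be sketched in a few lines. The sketch assumes Euclidean distance as the similarity measure, which matches the distances 0 and 1 stated in the example but is not named explicitly in the description; the function names and the rule of keeping the old reference vector for an empty cluster are also assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors: Sequence[Vector], refs: Sequence[Vector]) -> List[int]:
    """Index of the nearest reference vector for each feature vector."""
    return [min(range(len(refs)), key=lambda k: euclidean(v, refs[k]))
            for v in vectors]


def update(vectors: Sequence[Vector], labels: Sequence[int],
           refs: Sequence[Vector]) -> List[Vector]:
    """Coordinate-wise mean of the members of each cluster."""
    new_refs: List[Vector] = []
    for k, old in enumerate(refs):
        members = [v for v, lab in zip(vectors, labels) if lab == k]
        if not members:
            new_refs.append(old)  # assumption: keep the old vector if empty
            continue
        new_refs.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_refs


def cluster(vectors: Sequence[Vector], initial_refs: Sequence[Vector],
            max_iter: int = 100) -> Tuple[List[int], List[Vector]]:
    """Repeat assignment and averaging until cluster membership is stable."""
    refs = list(initial_refs)
    labels = assign(vectors, refs)
    for _ in range(max_iter):
        refs = update(vectors, labels, refs)
        new_labels = assign(vectors, refs)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, refs


# Feature vectors and initial reference vectors from the example
# (cluster A has index 0, cluster B has index 1).
answers = [(1, 1, 0), (0, 1, 0), (1, 0, 1), (0, 0, 1)]
labels, final_refs = cluster(answers, [(1, 1, 0), (0, 0, 1)])
```

With the example data the membership is already stable after the first re-averaging, so the loop stops with reference vectors (0.5, 1, 0) and (0.5, 0, 1), as in the description.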

In addition, the answer scoring device 200 may use the unsupervised learning method to score the plurality of input answers received from the student terminals 120-n without performing natural language processing on them (that is, using the original input answers as they are).

FIG. 7 is an exemplary diagram showing the procedure of a method of scoring a plurality of input answers using the unsupervised learning method, which is a machine learning method according to an embodiment of the present invention.

As shown in FIG. 7, the answer scoring device 200 may set at least two first classification criteria, each corresponding to one of at least two classification grades according to the types of the plurality of input answers (S710), and extract, from the plurality of input answers, features corresponding to predetermined characteristic elements (S720). The answer scoring device 200 may then calculate feature vectors, which are coordinate values indicating positions in a feature space having each feature as an axis (S730), compare the feature vector of each input answer with the feature vector of each of the at least two first classification criteria, and classify each input answer into the classification grade corresponding to the first classification criterion with the highest inter-vector similarity (S740). The answer scoring device 200 may then average the coordinates of the feature vectors of the input answers classified into each classification grade to form at least two second classification criteria (S750), and replace the at least two first classification criteria with the at least two second classification criteria (S760).

The answer scoring device 200 may then compare the feature vector of each input answer with the feature vector of each of the at least two second classification criteria, classify each input answer into the classification grade corresponding to the second classification criterion with the highest inter-vector similarity (S770), and assign a score to each classification grade (S780).

In the supervised learning method, the scorer's scoring results for the first predetermined number of answers selected by the training answer generation unit 214-1 may be received and analyzed to form a consistent classification criterion. In one embodiment, the answer classification unit 214-3 may form the classification criterion by calculating the probability that an answer among the first predetermined number of answers, each classified by the scorer into one of at least two clusters, belongs to each cluster, and the probability with which each feature extracted from the first predetermined number of answers appears in those answers, and may use the formed criterion to classify the second predetermined number of answers into one of the at least two clusters.

Table 9 shows an answer classification method using the supervised learning method according to an embodiment of the present invention. Assume that the first predetermined number of answers are the five answers "아름이를 생포하였다" ("Areum was captured"), "아름이를 잡았다" ("Areum was caught"), "아름이를 겨우 생포하였다" ("Areum was barely captured"), "동물을 생포하는 사람이다" ("a person who captures animals"), and "동물들을 보살핀다" ("takes care of animals"), that each answer is classified into either the [correct] or the [incorrect] cluster, and that the scorer's clustering result is "[correct] 아름이를 생포하였다", "[correct] 아름이를 잡았다", "[correct] 아름이를 겨우 생포하였다", "[incorrect] 동물을 생포하는 사람이다", and "[incorrect] 동물들을 보살핀다". The morpheme features extracted from the first predetermined number of answers are then "아름이" (Areum), "동물" (animal), "생포" (capture), "겨우" (barely), "사람" (person), "잡다" (catch), and "보살피다" (take care of).

The answer classification unit 214-3 may select "생포", "아름이", and "동물" from the extracted features as the features for the supervised learning method. Since three of the five answers are correct, the probability that an answer among the first predetermined number of answers belongs to the correct cluster is calculated as 0.6; since two are incorrect, the probability that it belongs to the incorrect cluster is calculated as 0.4. The answer classification unit 214-3 may also calculate the conditional probability of each selected feature for each cluster as follows. "아름이" appears in all three correct answers, so its probability for the correct cluster is calculated as 1; it appears in neither of the two incorrect answers, so its probability for the incorrect cluster is calculated as 0. "생포" appears in two of the three correct answers, so its probability for the correct cluster is calculated as 0.67; it appears in one of the two incorrect answers, so its probability for the incorrect cluster is calculated as 0.5. "동물" appears in none of the three correct answers, so its probability for the correct cluster is calculated as 0; it appears in both incorrect answers, so its probability for the incorrect cluster is calculated as 1.

In one embodiment, the answer classification unit 214-3 may form the classification criterion using Naive Bayes classification, a hidden Markov model, a neural network, logistic regression, k-NN (the k-nearest neighbor algorithm), a support vector machine, or the like, but is not limited to these methods.

In addition, the answer classification unit 214-3 may classify each of the second predetermined number of answers into either the correct or the incorrect cluster using the probability that an answer among the first predetermined number of answers belongs to each cluster and the probability with which each feature extracted from those answers appears in them. As illustrated in Table 7, if one of the second predetermined number of answers is "아름이를 생포하는데 성공하였다" ("succeeded in capturing Areum"), the extracted morpheme features are "아름이" (Areum), "생포" (capture), and "성공" (success). Of the features selected for the supervised learning method, the answer contains only "아름이" and "생포", so the probability that it is correct is calculated as 0.6 × 1.0 × 0.67 = 0.402 and the probability that it is incorrect as 0.4 × 0.0 × 0.5 = 0. Because the probability of being correct is higher than the probability of being incorrect, the answer classification unit 214-3 may classify "아름이를 생포하는데 성공하였다" into the correct cluster.

[Table 9]
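The probabilities in this example can be reproduced with a small Naive Bayes style calculation. This is a sketch under the assumptions visible in the example: only the selected features that actually occur in the new answer are multiplied in, and no smoothing is applied. The names `fit` and `class_scores` are hypothetical, and the morpheme lists are written out by hand.

```python
from typing import Dict, List, Sequence, Tuple

SELECTED = ("생포", "아름이", "동물")  # capture, Areum, animal

# (label, morpheme features) for the five graded training answers.
TRAINING: List[Tuple[str, List[str]]] = [
    ("correct", ["아름이", "생포"]),
    ("correct", ["아름이", "잡다"]),
    ("correct", ["아름이", "겨우", "생포"]),
    ("incorrect", ["동물", "생포", "사람"]),
    ("incorrect", ["동물", "보살피다"]),
]


def fit(training: Sequence[Tuple[str, List[str]]], selected: Sequence[str]):
    """Return class priors and per-class feature occurrence rates."""
    priors: Dict[str, float] = {}
    cond: Dict[str, Dict[str, float]] = {}
    for label in sorted({lab for lab, _ in training}):
        docs = [set(feats) for lab, feats in training if lab == label]
        priors[label] = len(docs) / len(training)
        cond[label] = {f: sum(f in d for d in docs) / len(docs)
                       for f in selected}
    return priors, cond


def class_scores(features: Sequence[str], priors: Dict[str, float],
                 cond: Dict[str, Dict[str, float]],
                 selected: Sequence[str]) -> Dict[str, float]:
    """Prior times the rates of the selected features present in the answer."""
    present = [f for f in selected if f in set(features)]
    scores: Dict[str, float] = {}
    for label, prior in priors.items():
        score = prior
        for f in present:
            score *= cond[label][f]
        scores[label] = score
    return scores


priors, cond = fit(TRAINING, SELECTED)
# New answer "아름이를 생포하는데 성공하였다" with features 아름이, 생포, 성공.
scores = class_scores(["아름이", "생포", "성공"], priors, cond, SELECTED)
predicted = max(scores, key=scores.get)
```

Computed without rounding, the correct-cluster score is 0.6 × 1.0 × 2/3 = 0.4; the description's 0.402 comes from rounding 2/3 to 0.67 first. Because no smoothing is used, a single feature with rate 0 drives the whole product to 0, as happens for the incorrect cluster here.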

The semi-supervised learning method may form the classification criterion for classifying the answers to be scored into one of the clusters by using the supervised and unsupervised learning methods described above in combination. That is, in the semi-supervised learning method, the scorer's scoring results for the first predetermined number of answers selected by the training answer generation unit 214-1 are received, the features contained in those answers are extracted to form a classification criterion, the second predetermined number of answers are classified into one of the clusters using the formed criterion, and then the answers whose probability of belonging to a cluster is at or above a predetermined value (for example, 90%) are added to the first predetermined number of answers and the classification criterion is formed again.

Table 10 shows a method of forming a classification criterion using the semi-supervised learning method according to an embodiment of the present invention. The answer classification unit 214-3 may form a primary classification criterion using the supervised learning method. That is, assume that the first predetermined number of answers are the four answers "아름이를 생포하였다" ("Areum was captured"), "아름이를 잡았다" ("Areum was caught"), "동물을 생포하는 사람이다" ("a person who captures animals"), and "동물들을 보살핀다" ("takes care of animals"), that each answer is classified into either the [correct] or the [incorrect] cluster, and that the scorer's clustering result is "[correct] 아름이를 생포하였다", "[correct] 아름이를 잡았다", "[incorrect] 동물을 생포하는 사람이다", and "[incorrect] 동물들을 보살핀다". The morpheme features extracted from the first predetermined number of answers are then "아름이" (Areum), "동물" (animal), "생포" (capture), "사람" (person), "잡다" (catch), and "보살피다" (take care of). Of the extracted features, "생포", "아름이", and "동물" may be selected as the primary classification criterion for the semi-supervised learning method.

Using the primary classification criterion to estimate the probability that each unclassified answer among the second predetermined number of answers is correct or incorrect gives the following: "반달가슴곰 아름이를 생포하였다" ("Areum the Asiatic black bear was captured") has an estimated probability of 94% of belonging to the correct cluster; "아름이를 겨우 생포하였다" ("Areum was barely captured"), 95% for the correct cluster; "아름이란 곰을 생포하였다" ("a bear called Areum was captured"), 92% for the correct cluster; "동물들을 돌봐주는 사람이다" ("is a person who looks after animals"), 92% for the correct cluster; "동물들을 돌봐주는 사람" ("a person who looks after animals"), 91% for the incorrect cluster; "동물을 치료해주는 사람이다" ("a person who treats animals"), 93% for the incorrect cluster; "동물을 발견하는 사람" ("a person who finds animals"), 91% for the incorrect cluster; "반달가슴곰을 다시 잡았다" ("the Asiatic black bear was caught again"), 75% for the correct cluster; "동물을 돌봐주는 것" ("looking after animals"), 82% for the incorrect cluster; and "열흘간 곰을 추적해 발견함" ("tracked the bear for ten days and found it"), 59% for the incorrect cluster. Of these, the answers whose estimated probability of being correct or incorrect is 90% or higher, namely "반달가슴곰 아름이를 생포하였다", "아름이를 겨우 생포하였다", "아름이란 곰을 생포했다", "동물들을 돌봐주는 사람", "동물을 치료해주는 사람이다", and "동물을 발견하는 사람", may be added to the first predetermined number of answers.

The answer classification unit 214-3 may form a secondary classification criterion from the expanded first predetermined number of answers, and apply the secondary classification criterion to estimate the probability that each still-unclassified answer is correct or incorrect and reclassify it into one of the clusters. By repeating the formation of the classification criterion and the reclassification of the unclassified answers until the number of answers in the first predetermined number of answers converges to a specific value, the answer classification unit 214-3 may form a more accurate classification criterion based on the expanded first predetermined number of answers. For example, "반달가슴곰을 다시 잡았다" contains only "잡다" among the features of the primary classification criterion, so its estimated probability of being correct is 75%. Once "반달가슴곰 아름이를 생포하였다", "아름이를 겨우 생포하였다", and "아름이란 곰을 생포했다" have been added to the first predetermined number of answers, however, both "잡다" and "반달가슴곰" (Asiatic black bear) among the features of the secondary classification criterion can be used in estimating the probability of being correct. The estimated probability that "반달가슴곰을 다시 잡았다" is correct therefore rises to 93%, and when the tertiary classification criterion is formed, "반달가슴곰을 다시 잡았다" may be added to the first predetermined number of answers.

[Table 10]
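The repeated "train, classify, absorb confident answers" cycle described above can be sketched as a generic self-training loop. Everything here is illustrative: `self_train` and `toy_fit` are hypothetical names, the toy classifier simply counts shared words and is not the classifier of the description, and the English sample answers are simplified stand-ins for the Korean examples.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Labeled = Tuple[str, str]                     # (answer text, label)
Predict = Callable[[str], Tuple[str, float]]  # answer -> (label, probability)


def self_train(labeled: Sequence[Labeled], unlabeled: Sequence[str],
               fit: Callable[[Sequence[Labeled]], Predict],
               threshold: float = 0.9,
               max_rounds: int = 10) -> Tuple[List[Labeled], List[str]]:
    """Grow the training set with answers classified at or above threshold."""
    training = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        predict = fit(training)
        confident: List[Labeled] = []
        remaining: List[str] = []
        for answer in pool:
            label, prob = predict(answer)
            if prob >= threshold:
                confident.append((answer, label))
            else:
                remaining.append(answer)
        if not confident:  # training set size has converged
            break
        training.extend(confident)
        pool = remaining
    return training, pool


def toy_fit(training: Sequence[Labeled]) -> Predict:
    """Toy stand-in classifier: share of the answer's known words per label."""
    vocab: Dict[str, Set[str]] = {}
    for text, label in training:
        vocab.setdefault(label, set()).update(text.split())

    def predict(text: str) -> Tuple[str, float]:
        tokens = text.split()
        counts = {lab: sum(t in words for t in tokens)
                  for lab, words in vocab.items()}
        best = max(counts, key=counts.get)
        total = sum(counts.values())
        return best, (counts[best] / total if total else 0.0)

    return predict


graded = [("captured areum", "correct"), ("cares for animals", "incorrect")]
ungraded = ["captured the bear areum", "caught the bear again", "treats animals"]
final_training, still_unlabeled = self_train(graded, ungraded, toy_fit)
```

In this toy run "caught the bear again" shares no words with the graded answers in the first round, but becomes confidently classifiable in the second round once "captured the bear areum" has been absorbed, which mirrors the "반달가슴곰을 다시 잡았다" example.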

The ensemble learning method refers to a method of clustering the answers to be scored into one of the clusters by using the unsupervised, supervised, and semi-supervised learning methods described above in combination. Because the ensemble learning method mixes the results of several machine learning methods, it can yield more reliable classification results than using any single machine learning method.
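The description does not state how the results of the individual methods are mixed. One common choice, shown here purely as an assumption, is a majority vote over the cluster each method assigns to an answer; the function name `majority_vote` and the sample votes are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the cluster label chosen by the most methods.

    Ties are resolved in favor of the label that appears first.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical cluster assignments for one answer from three methods.
votes: Dict[str, str] = {
    "unsupervised": "correct",
    "supervised": "correct",
    "semi_supervised": "incorrect",
}
ensemble_label = majority_vote(list(votes.values()))
```

Weighted voting or averaging of estimated probabilities would be alternative ways to combine the methods.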

The scoring unit 214-4 may assign scores to the second predetermined number of answers classified into the respective clusters according to their classification grades. In one embodiment, if the second predetermined number of answers are classified into two different clusters, the scoring unit 214-4 may score the clusters as correct and incorrect; if they are classified into four different clusters, it may assign graded scores such as 3 points, 2 points, 1 point, and 0 points to the respective clusters.
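Assigning scores by classification grade amounts to a lookup from cluster to score. The sketch below assumes four clusters with the graded scores from the example; the cluster names, answer identifiers, and the function name `assign_scores` are hypothetical.

```python
from typing import Dict


def assign_scores(cluster_of: Dict[str, str],
                  score_of_cluster: Dict[str, int]) -> Dict[str, int]:
    """Give every answer the score of the cluster it was classified into."""
    return {answer_id: score_of_cluster[cluster]
            for answer_id, cluster in cluster_of.items()}


# Four clusters scored 3, 2, 1, and 0 points.
CLUSTER_SCORES = {"A": 3, "B": 2, "C": 1, "D": 0}
classified = {"answer-1": "A", "answer-2": "D", "answer-3": "B"}
scores = assign_scores(classified, CLUSTER_SCORES)
```

For the two-cluster case the same mapping can hold the scores for the correct and incorrect clusters.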

The storage unit 220 may store the plurality of answers to be scored, the classification criteria formed by the processing unit 210, the feature vectors, the features extracted from each of the first and second predetermined numbers of answers, and information on the scores assigned according to the classification grades. In addition, the storage unit 220 may store the corpora used by the language processing unit 212 for natural language processing (including a part-of-speech tagged corpus, a chunked corpus, and the like), a thesaurus, and the negative expressions used by the negative expression recognition unit 212-4 to recognize negation, such as negative adverbs, negative auxiliary predicate phrases, negative chunks, negative predicates, and double negative expressions. In one embodiment, the storage unit 220 may be implemented as a ROM (Read Only Memory), a RAM (Random Access Memory), a CD (Compact Disc)-ROM, a magnetic tape, a floppy disk, an optical data storage device, or in the form of a carrier wave (for example, transmission over the Internet), but is not limited to these implementations.

The communication unit 230 may transmit and receive various signals for scoring the answers to be scored to and from the plurality of student terminals 120-1, ..., 120-n and the plurality of scorer terminals 130-1, ..., 130-n.

Although the method has been described through particular embodiments, it may also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples of the computer-readable recording medium include a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, and an optical data storage device, and also include media implemented in the form of a carrier wave (for example, transmission over the Internet). In addition, the computer-readable recording medium may be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the above embodiments can be readily inferred by programmers skilled in the art to which the present invention pertains.

Although the present invention has been described herein in connection with some embodiments, it should be understood that various modifications and changes may be made without departing from the spirit and scope of the invention as understood by those skilled in the art to which the invention pertains. Such modifications and changes should also be considered to fall within the scope of the claims appended hereto.

100: Answer scoring environment 110: Server
120-1, ..., 120-n: Student terminals 130-1, ..., 130-n: Scorer terminals
N: Network 200: Answer scoring apparatus
210: Processing unit 220: Storage unit
230: Communication unit 240: System bus
212: Language processing unit 214: Classification and scoring unit
212-1: Document normalization unit 212-2: Morphological analysis unit
212-3: Part-of-speech tagging unit 212-4: Negative expression recognition unit
212-5: Chunking unit 212-6: Paraphrasing unit
212-7: Dependency analysis unit 214-1: Training answer generation unit
214-2: Feature extraction unit 214-3: Answer classification unit
214-4: Scoring unit

Claims

1. A method for scoring a plurality of supply-type input answers, the method comprising the steps of:
a) extracting features corresponding to predetermined characteristic elements from the plurality of supply-type input answers;
b) setting, based on the extracted features, at least two first classification criteria for classifying the types of the supply-type input answers, wherein each of the first classification criteria includes a feature vector that is a coordinate value indicating a position in a feature space having each feature as an axis;
c) forming, using the extracted features, a feature vector that is a coordinate value indicating a position in the feature space having each feature as an axis;
d) comparing the feature vector of each of the plurality of supply-type input answers with the feature vector of each of the first classification criteria, and classifying each of the plurality of supply-type input answers into the classification grade corresponding to the first classification criterion having the highest inter-vector similarity;
e) averaging, for each classification grade, the coordinates of the feature vectors classified into that grade to form a second classification criterion for each grade, wherein each of the second classification criteria includes a feature vector that is a coordinate value indicating a position in the feature space having each feature as an axis;
f) comparing the feature vector of each of the plurality of supply-type input answers with the feature vector of each of the second classification criteria, and classifying each of the plurality of supply-type input answers into the classification grade corresponding to the second classification criterion having the highest inter-vector similarity; and
g) assigning a score according to the classification grade determined in step f).
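
Steps c) through g) can be illustrated with a minimal sketch. The claim says only "inter-vector similarity", so cosine similarity is an assumption here, as are the function names, the toy two-dimensional vectors, and the choice to keep a criterion unchanged when no answer falls into its grade. This is not the patent's implementation, only one reading of a single pass through the steps.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(vectors, criteria):
    """Steps d) and f): pick, for each answer vector, the grade whose
    criterion vector is most similar."""
    return [max(criteria, key=lambda g: cosine_similarity(v, criteria[g]))
            for v in vectors]

def average_by_grade(vectors, grades, old_criteria):
    """Step e): average the coordinates of the vectors in each grade.
    An empty grade keeps its old criterion (an assumption of this sketch)."""
    new_criteria = {}
    for grade, old in old_criteria.items():
        members = [v for v, g in zip(vectors, grades) if g == grade]
        if members:
            new_criteria[grade] = [sum(col) / len(members) for col in zip(*members)]
        else:
            new_criteria[grade] = list(old)
    return new_criteria

def score_once(vectors, first_criteria, points):
    """One pass through steps d) to g)."""
    grades = classify(vectors, first_criteria)                   # step d)
    second = average_by_grade(vectors, grades, first_criteria)   # step e)
    grades = classify(vectors, second)                           # step f)
    return [points[g] for g in grades]                           # step g)

# Hypothetical feature vectors for four answers in a two-feature space.
answers = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
first = {"correct": [1.0, 0.0], "incorrect": [0.0, 1.0]}
print(score_once(answers, first, {"correct": 1, "incorrect": 0}))
```

The second criteria are simply the per-grade centroids of the answers, so the criteria drift toward where the answers actually cluster in the feature space.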

2. The method of claim 1, further comprising:
forming a new second classification criterion by applying step e) to the plurality of supply-type input answers classified by grade in step f), and repeatedly performing step f) with the new second classification criterion to classify each of the plurality of supply-type input answers; and
performing the score assignment of step g) according to the classified grade when the second classification criterion used in step f) is identical to the second classification criterion newly formed by applying step e).
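
The repetition described in this claim, re-forming the second classification criteria and reclassifying until the criteria stop changing, can be sketched as a loop. Cosine similarity, the function names, the `max_rounds` safety cap, and the handling of empty grades are all assumptions of this sketch rather than details taken from the claim.

```python
import math

def similarity(u, v):
    """Cosine similarity (assumed; the claim does not fix the measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign(vectors, criteria):
    """Index of the most similar criterion for each answer vector."""
    return [max(range(len(criteria)), key=lambda k: similarity(v, criteria[k]))
            for v in vectors]

def update(vectors, labels, criteria):
    """Average the member vectors of each grade; empty grades keep their
    previous criterion (an assumption of this sketch)."""
    updated = []
    for k, old in enumerate(criteria):
        members = [v for v, label in zip(vectors, labels) if label == k]
        if members:
            updated.append([sum(col) / len(members) for col in zip(*members)])
        else:
            updated.append(list(old))
    return updated

def cluster_until_stable(vectors, first_criteria, max_rounds=100):
    """Repeat steps e) and f) until the criteria no longer change."""
    criteria = update(vectors, assign(vectors, first_criteria), first_criteria)
    labels = assign(vectors, criteria)
    for _ in range(max_rounds):
        labels = assign(vectors, criteria)                 # step f)
        new_criteria = update(vectors, labels, criteria)   # step e) again
        if new_criteria == criteria:                       # stopping condition
            break
        criteria = new_criteria
    return labels, criteria

answers = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, criteria = cluster_until_stable(answers, [[1.0, 0.0], [0.0, 1.0]])
print(labels)
```

Scores would then be assigned per grade from the final `labels`, as in step g). The loop has the same shape as K-means clustering, except that it uses a similarity measure instead of a distance.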

3. The method of claim 2,
wherein steps d) through g) are performed using any one of K-means clustering, hierarchical clustering, density-based clustering, or grid-based clustering.

4. The method of claim 1,
wherein step a) comprises:
performing natural language processing on the plurality of supply-type input answers; and
extracting the features using the results of the natural language processing.

5. The method of claim 4,
wherein performing the natural language processing comprises:
a document normalization process including a sentence separation step, a spacing correction step, a spelling correction step, an abbreviation expansion step, and a symbol removal step for the plurality of supply-type input answers; analyzing the morphemes of the plurality of supply-type input answers; attaching parts of speech to the result of the morpheme analysis; recognizing negative expressions in the plurality of supply-type input answers; chunking at least two morphemes included in the result of the part-of-speech tagging; a paraphrasing step of converting a word or phrase included in the plurality of supply-type input answers into a predetermined standard expression; and a dependency analysis process of analyzing the dependency structure between the morphemes or word phrases included in the plurality of supply-type input answers.

6. The method of claim 5,
wherein the features include:
morpheme features formed by performing the morpheme analysis and the part-of-speech tagging; base phrase features formed, by performing the chunking, on the basis of the phrases included in the plurality of supply-type input answers; dependency features formed by performing the dependency analysis process; and n-gram features formed by including a predetermined number of adjacent morpheme features or base phrase features.
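
The n-gram features named in this claim, formed from a predetermined number of adjacent morpheme or phrase features, can be illustrated briefly. The sketch assumes the morphemes have already been analyzed and tagged. The tagged tokens, the function names, and the use of simple counts as coordinate values are hypothetical, since the claim does not say how coordinates are computed.

```python
from collections import Counter

def ngram_features(units, n):
    """Return every run of n adjacent units (morphemes or base phrases)."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

def feature_vector(features, axes):
    """Map features to coordinates along a fixed list of axes, one axis per
    feature. Counting occurrences is an assumed weighting scheme."""
    counts = Counter(features)
    return [counts[axis] for axis in axes]

# Hypothetical part-of-speech-tagged morphemes for one answer.
morphemes = ["물/NNG", "이/JKS", "끓/VV", "는다/EF"]
bigrams = ngram_features(morphemes, 2)
axes = [("물/NNG", "이/JKS"), ("끓/VV", "는다/EF"), ("얼/VV", "는다/EF")]
print(feature_vector(bigrams, axes))
```

Each axis in `axes` corresponds to one feature, so the resulting list is the answer's position in the feature space described in claim 1.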

7. The method of claim 1, wherein step d) comprises:
comparing the feature vector of each of the plurality of supply-type input answers with the feature vectors of the at least two first classification criteria and classifying each answer into the classification grade having the highest inter-vector similarity, and, when at least two classification grades have the highest similarity, classifying the answer into the classification grade having the highest score among them.

8. The method of claim 1, wherein step f) comprises:
comparing the feature vector of each of the plurality of supply-type input answers with the feature vectors of the at least two second classification criteria and classifying each answer into the classification grade having the highest inter-vector similarity, and, when at least two classification grades have the highest similarity, classifying the answer into the classification grade having the highest score among them.
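
The tie-handling in steps d) and f) can be sketched as follows, reading the clause as: when two or more grades share the highest similarity, the answer goes to the tied grade with the highest score. That reading, the cosine similarity measure, the tolerance `tol`, and all names are assumptions of this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_with_tie_break(vector, criteria, points, tol=1e-12):
    """Return the most similar grade; among grades tied for the highest
    similarity (within tol), prefer the one with the highest score."""
    sims = {grade: cosine_similarity(vector, c) for grade, c in criteria.items()}
    best = max(sims.values())
    tied = [grade for grade, s in sims.items() if best - s <= tol]
    return max(tied, key=lambda grade: points[grade])

# Hypothetical criteria and scores: the answer is equally similar to both.
criteria = {"partial": [1.0, 0.0], "full": [0.0, 1.0]}
points = {"partial": 1, "full": 2}
print(classify_with_tie_break([1.0, 1.0], criteria, points))
```

Resolving ties toward the higher score favors the student whenever the classifier cannot separate two grades.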

9. The method of claim 1,
wherein the at least two classification grades are two classification grades, for correct answers and incorrect answers, or three or more classification grades having three or more differential scores.

10. A program stored on a recording medium for causing a computer to execute the method for scoring a plurality of supply-type input answers according to any one of claims 1 to 9.

11. A recording medium storing a program for causing a computer to execute the method for scoring a plurality of supply-type input answers according to any one of claims 1 to 9.