WO2018030595A1 - Method and device for extracting character - Google Patents


Info

Publication number
WO2018030595A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
extracting
subject
extracted
text
Prior art date
Application number
PCT/KR2016/015284
Other languages
French (fr)
Korean (ko)
Inventor
김승훈
박태근
Original Assignee
단국대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 단국대학교 산학협력단
Publication of WO2018030595A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • The present invention relates to a method and apparatus for extracting characters, and more particularly to a method and apparatus that extract characters from the text of Korean-translated novels and original Korean novels based on the particles used with animate nouns, so that characters can be extracted from a variety of novels without training data for a machine learning technique.
  • Information extraction is the task of extracting important information, such as entities and events, from natural language text. Named entity recognition, a part of information extraction, finds named entities in the text and classifies them into predefined classes such as person names, place names, and organization names.
  • An object of the present invention is to provide a technique for extracting characters from a variety of novels without the training data needed to apply machine learning techniques.
  • A method by which a character extraction apparatus, running on an information processing apparatus having a processor, extracts characters may include: a preprocessing step of reading text from an electronic document; a step of extracting, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the text; and a step of extracting characters from the subject candidates based on animate-noun particles.
  • In the step of extracting characters, subject candidates that are used in combination with an animate-noun particle may be extracted as characters.
  • In the preprocessing step, text inside quotation marks may be classified as utterance and all other text as narrative.
  • In the step of extracting subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the narrative are extracted as subject candidates. In the step of extracting characters, subject candidates from the narrative that are used with an animate-noun particle may be extracted as characters, and uninflected words in the utterances that are used with an animate-noun particle may also be extracted as characters.
  • The subject-marking and topic-marking particles may include at least one of -이, -가, -은, and -는.
  • The animate-noun particles may include at least one of -한테, -에게, -께, and compound particles in which another particle is attached to -한테/-에게/-께.
  • The method may further include a step of finalizing the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters.
  • An apparatus for extracting characters, having at least one processor, may include: a preprocessing unit that reads text from an electronic document; a subject candidate extracting unit that extracts, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the text; and a character extracting unit that extracts characters from the subject candidates based on animate-noun particles.
  • The character extracting unit may extract, as characters, subject candidates that are used in combination with an animate-noun particle.
  • The preprocessing unit may classify text inside quotation marks as utterance and all other text as narrative.
  • The subject candidate extracting unit extracts, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the narrative.
  • The character extracting unit may extract as characters the subject candidates from the narrative that are used with an animate-noun particle, and may also extract as characters the uninflected words in the utterances that are used with an animate-noun particle.
  • The subject-marking and topic-marking particles may include at least one of -이, -가, -은, and -는.
  • The animate-noun particles may include at least one of -한테, -에게, -께, and compound particles in which another particle is attached to -한테/-에게/-께.
  • The apparatus may further include a character determining unit that finalizes the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters.
  • According to an embodiment, characters can be extracted from a variety of novels without the training data needed to apply machine learning techniques.
  • FIG. 1 is a diagram illustrating a functional block of an apparatus for extracting a character according to an exemplary embodiment.
  • FIG. 2 is a flowchart illustrating a method of extracting a character according to an exemplary embodiment.
  • FIG. 3 is a flowchart illustrating a method of extracting a character according to another exemplary embodiment.
  • FIGS. 4 to 7 are graphs illustrating the experimental results of extracting characters with the character extraction method.
  • Expressions such as 'first' and 'second' are used only to distinguish between multiple components and do not limit the order or other features of the components.
  • FIG. 1 is a diagram illustrating the functional blocks of the character extraction apparatus 100 according to an exemplary embodiment.
  • The character extraction apparatus 100 includes a preprocessing unit 110, a subject candidate extracting unit 120, and a character extracting unit 130, and may further include a character determining unit 140.
  • The preprocessing unit 110 reads, from an electronic document, the text from which characters are to be extracted.
  • The term 'character' (등장인물) as used in this specification covers both the proper nouns and the common nouns among animate nouns.
  • Among animate nouns, proper nouns are person names such as "Harry", "Isabella", and "Billy", and common nouns are words such as "father" and "mother".
  • The subject candidate extracting unit 120 extracts, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the text read by the preprocessing unit 110.
  • The character extracting unit 130 extracts characters from the subject candidates extracted by the subject candidate extracting unit 120, based on animate-noun particles.
  • The character determining unit 140 finalizes the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the characters extracted by the character extracting unit 130.
  • The preprocessing unit 110, the subject candidate extracting unit 120, the character extracting unit 130, and the character determining unit 140 of the above embodiment may be implemented by a computing device that includes a memory holding instructions programmed to perform their functions and a microprocessor that executes those instructions.
  • FIG. 2 is a flowchart illustrating a method by which the character extraction apparatus 100, running on an information processing apparatus having a processor, extracts characters according to an exemplary embodiment.
  • The method of FIG. 2 may be performed by the character extraction apparatus 100 described with reference to FIG. 1.
  • First, the preprocessing unit 110 reads, from an electronic document, the text from which characters are to be extracted (S210).
  • Next, the subject candidate extracting unit 120 extracts, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the read text (S220).
  • In Korean grammar, the subject-marking particles are '-이/-가' and the topic-marking particles are '-은/-는'. In many documents, including novels, an uninflected word that can serve as a subject is made the subject by attaching '-이/-가/-은/-는' to it.
  • For example, "해리가 말했다." and "해리는 말했다." (both "Harry said.") are used in this way.
  • Accordingly, in the subject candidate extraction step (S220), uninflected words used in combination with at least one of the particles '-이/-가/-은/-는' may be extracted from the read text as subject candidates.
  • Because the characters of a novel appear as subjects in the text anywhere from a few times to several hundred times, a character whose name ends in a final consonant (batchim) can appear in the text combined with both of the particles '-이' and '-은', and a character whose name does not end in a final consonant can appear combined with both of the particles '-가' and '-는'.
  • Accordingly, in the subject candidate extraction step (S220), accuracy can be improved by extracting as subject candidates only those uninflected words that have been used in the read text with both particles of the pair '-이/-은' or of the pair '-가/-는'.
  • The character extracting unit 130 extracts characters from the subject candidates based on animate-noun particles (S230).
  • An animate noun is a noun that denotes a person or an animal.
  • An animate-noun particle is a particle that can be attached to an animate noun; examples are shown in Table 1 below.
  • The basic forms of the animate-noun particles are '-한테/-에게/-께'. Characters may be extracted based on these basic forms, or based on all forms of the compound particles built on them.
  • For each extracted subject candidate, the character extracting unit 130 may extract the candidate as a character if it has been used in the text in combination with an animate-noun particle.
  • Subject candidates that occur within a given number of words or syllables of an uninflected word used with an animate-noun particle may also be extracted as characters.
  • The character determining unit 140 may finalize the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters (S240).
  • Examples of pronouns are 'I', 'we', 'he', and 'she'.
  • FIG. 3 is a flowchart illustrating a method by which the character extraction apparatus 100, running on an information processing apparatus having a processor, extracts characters according to another exemplary embodiment.
  • The method of FIG. 3 may be performed by the character extraction apparatus 100 described with reference to FIG. 1.
  • First, the preprocessing unit 110 reads, from an electronic document, the text from which characters are to be extracted.
  • The preprocessing unit 110 may divide the text into utterance and narrative.
  • An utterance is a thought of a character in the novel realized as a sentence; the author marks a sentence as an utterance with quotation marks ("", '').
  • A narrative is the set of sentences that carry the novel's storyline, describing a series of events from a first-person or third-person perspective.
  • The preprocessing unit 110 may classify text inside quotation marks as utterance and all other text as narrative.
  • The subject candidate extracting unit 120 extracts, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the narrative of the read text (S320).
  • In step S320, subject candidates are searched for only in the narrative, so they can be extracted faster.
  • The character extracting unit 130 extracts as characters the subject candidates from the narrative that are used with an animate-noun particle, and also extracts as characters the uninflected words in the utterances that are used with an animate-noun particle (S330).
  • The character determining unit 140 may finalize the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters (S340).
  • FIGS. 4 to 7 are graphs analyzing the results of extracting characters with the above-described embodiment.
  • Precision, recall, and F-measure are defined as follows.
  • Precision is the ratio of actual characters (that is, excluding errors) among the characters extracted by the embodiment to all characters extracted by the embodiment.
  • Recall is the ratio of actual person names found among the characters extracted by the embodiment to all actual person names (the proper nouns among the characters).
  • F-measure is the harmonic mean of precision and recall.
  • FIG. 4 shows the indices of the novels: novels 1 to 11 are novels translated into Korean, and novels 12 to 80 are novels originally written in Korean.
  • FIG. 5 shows the precision and recall of character extraction according to an embodiment for 80 Korean novels.
  • On the x-axis of FIG. 5, numbers 1 to 11 are Korean-translated novels and numbers 12 to 80 are original Korean novels.
  • FIGS. 6 and 7 are graphs analyzing how much recall increases when the goal is to find, among the characters extracted by an embodiment, those with an appearance rate of at least 1% or at least 0.5%.
  • The appearance rate is the ratio of one character's frequency to the sum of all characters' frequencies. For example, an appearance rate of 1% for A means that A's frequency is 1% of the total frequency of all characters in the novel.
  • The extracted character information may be used to derive relationships between characters, or to further extract characters that appear in other sentences with a pattern similar to the sentences in which the extracted characters are used.
  • Embodiments of the present invention may be implemented through various means.
  • Embodiments of the present invention may be implemented by hardware, firmware, software, or a combination thereof.
  • A method according to embodiments of the present invention may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.
  • A method according to embodiments of the present invention may also be implemented in the form of a module, a procedure, or a function that performs the functions or operations described above.
  • The software code may be stored in a memory unit and executed by a processor.
  • The memory unit may be located inside or outside the processor, and may exchange data with the processor by various known means.
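
The evaluation measures defined above (precision, recall, F-measure, and appearance rate) can be written out as a short sketch. The function names, the use of Python sets, and the sample data are assumptions made for illustration and are not taken from the patent; only the ratios themselves follow the definitions in the text.

```python
def precision(extracted, true_characters):
    """Share of extracted items that are actual characters."""
    if not extracted:
        return 0.0
    return len(extracted & true_characters) / len(extracted)


def recall(extracted, true_names):
    """Share of actual person names (proper nouns) that were extracted."""
    if not true_names:
        return 0.0
    return len(extracted & true_names) / len(true_names)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def appearance_rates(frequencies):
    """Each character's frequency divided by the sum of all frequencies."""
    total = sum(frequencies.values())
    return {name: count / total for name, count in frequencies.items()}


# Made-up sample data, for illustration only.
extracted = {"해리", "아버지", "바람", "빌리"}
true_characters = {"해리", "아버지", "빌리", "이사벨라"}
true_names = {"해리", "빌리", "이사벨라"}

p = precision(extracted, true_characters)
r = recall(extracted, true_names)
print(p, r, f_measure(p, r))
print(appearance_rates({"해리": 50, "빌리": 49, "이사벨라": 1}))
```

Note that, as in the text, precision is measured against all actual characters while recall is measured only against actual person names, so the two functions take different reference sets.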

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A method for extracting a character according to an embodiment of the present invention is performed in an information processing device having a processor, and comprises: a pre-processing step of reading a text from an electronic document; a step of extracting, as a subject candidate, an uninflected word coupled to a subject marking particle and a topic marking particle in the text; and a step of extracting a character from the subject candidates on the basis of a particle for an animate noun.

Description

Character Extraction Method and Apparatus
The present invention relates to a method and apparatus for extracting characters, and more particularly to a method and apparatus that extract characters from the text of Korean-translated novels and original Korean novels based on the particles used with animate nouns, so that characters can be extracted from a variety of novels without training data for a machine learning technique.
Information extraction is the task of extracting important information, such as entities and events, from natural language text. Named entity recognition, a part of information extraction, finds named entities in the text and classifies them into predefined classes such as person names, place names, and organization names.
Most named entity recognition techniques use rule-based algorithms or machine-learning-based techniques, and hybrid techniques that reduce the drawbacks and exploit the strengths of both have recently been proposed.
However, according to analytical studies of named entity recognition techniques, although it is not easy to apply such a technique to a text genre other than its target genre, existing techniques do not take text genre and domain into account and are limited to extracting named entities from texts such as newspaper articles.
The problem to be solved by embodiments of the present invention is to provide a technique for extracting characters from a variety of novels without the training data needed to apply machine learning techniques.
However, the technical problems addressed by embodiments of the present invention are not limited to the one mentioned above, and various technical problems may be derived from the following description within a scope apparent to those skilled in the art.
According to an embodiment, a method by which a character extraction apparatus, running on an information processing apparatus having a processor, extracts characters includes: a preprocessing step of reading text from an electronic document; a step of extracting, as subject candidates, uninflected words (체언) combined with a subject-marking particle (주격조사) and a topic-marking particle (보조사) in the text; and a step of extracting characters from the subject candidates based on animate-noun particles (유정명사용 조사).
In the step of extracting characters, subject candidates that are used in combination with an animate-noun particle may be extracted as characters.
In the preprocessing step, text inside quotation marks may be classified as utterance and all other text as narrative.
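
The utterance/narrative split described above can be sketched with a regular expression. This is a minimal illustration, not the patent's implementation: the function name `split_utterance_narrative` is hypothetical, and the sketch handles only straight and curly double quotes.

```python
import re

# Straight and curly double quotes only; single quotes are skipped in this
# sketch because they are hard to tell apart from apostrophes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def split_utterance_narrative(text):
    """Return (utterances, narrative) for a raw text string."""
    # Each match has exactly one non-None group: the text inside the quotes.
    utterances = [m.group(1) or m.group(2) or "" for m in QUOTED.finditer(text)]
    # The narrative is whatever remains once quoted spans are removed.
    narrative = " ".join(QUOTED.sub(" ", text).split())
    return utterances, narrative


sample = '해리가 말했다. "빌리에게 물어봐." 이사벨라는 웃었다.'
utterances, narrative = split_utterance_narrative(sample)
print(utterances)
print(narrative)
```

A real preprocessor would also need to handle single quotation marks, nested quotes, and unbalanced quotes, which this sketch ignores.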
In addition, in the step of extracting subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the narrative are extracted as subject candidates. In the step of extracting characters, subject candidates from the narrative that are used with an animate-noun particle may be extracted as characters, and uninflected words in the utterances that are used with an animate-noun particle may also be extracted as characters.
The subject-marking and topic-marking particles may include at least one of -이, -가, -은, and -는.
The animate-noun particles may include at least one of -한테, -에게, -께, and compound particles in which another particle is attached to -한테/-에게/-께.
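
The animate-noun particle check can be sketched as a filter over subject candidates. The base particles -한테, -에게, and -께 come from the text; the trailing particles used to form compounds, the function name `extract_characters`, and the whitespace-free token matching (in place of a morphological analyzer) are all assumptions for illustration.

```python
import re

ANIMATE_PARTICLES = ("한테", "에게", "께")
# Hypothetical trailing particles that form compound particles; the patent's
# own list of compounds is not reproduced here.
TRAILING = ("", "서", "는", "도", "만")


def extract_characters(candidates, text):
    """Keep candidates that occur at least once with an animate-noun particle."""
    tokens = set(re.findall(r"[가-힣]+", text))
    characters = set()
    for cand in candidates:
        for base in ANIMATE_PARTICLES:
            if any(cand + base + tail in tokens for tail in TRAILING):
                characters.add(cand)
    return characters


text = "이사벨라가 해리에게 말했다. 바람이 불었다. 빌리는 해리한테서 책을 받았다."
print(extract_characters({"해리", "바람"}, text))
```

Here "바람" (wind) is a subject candidate but never takes an animate-noun particle, so it is dropped, which is the effect the method relies on.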
The method may further include a step of finalizing the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters.
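
The exclusion step can be sketched as a stop-list filter. The six categories come from the text; the specific words in each list, the use of the suffix -들 as a plural marker, and the function name `finalize_characters` are illustrative assumptions rather than the patent's actual lists.

```python
# Illustrative (assumed) word lists for each excluded category.
EXCLUDED = {
    "pronoun": {"나", "우리", "그", "그녀"},
    "unspecified noun": {"누구", "아무"},
    "collective noun": {"가족", "군중"},
    "numeral": {"하나", "둘"},
    "dependent noun": {"것", "분"},
}
STOP_WORDS = set().union(*EXCLUDED.values())
PLURAL_SUFFIX = "들"  # assumed marker for plural forms


def finalize_characters(characters):
    """Drop extracted items that fall into any excluded category."""
    return {
        c for c in characters
        if c not in STOP_WORDS and not c.endswith(PLURAL_SUFFIX)
    }


print(finalize_characters({"해리", "그녀", "아이들", "누구", "아버지"}))
```

A suffix test for plurals is crude, since a name could legitimately end in 들; a morphological analyzer would make this distinction more reliably.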
According to an embodiment, an apparatus for extracting characters, having at least one processor, includes: a preprocessing unit that reads text from an electronic document; a subject candidate extracting unit that extracts, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the text; and a character extracting unit that extracts characters from the subject candidates based on animate-noun particles.
The character extracting unit may extract, as characters, subject candidates that are used in combination with an animate-noun particle.
The preprocessing unit may classify text inside quotation marks as utterance and all other text as narrative.
In addition, the subject candidate extracting unit extracts, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the narrative. The character extracting unit may extract as characters the subject candidates from the narrative that are used with an animate-noun particle, and may also extract as characters the uninflected words in the utterances that are used with an animate-noun particle.
The subject-marking and topic-marking particles may include at least one of -이, -가, -은, and -는.
The animate-noun particles may include at least one of -한테, -에게, -께, and compound particles in which another particle is attached to -한테/-에게/-께.
The apparatus may further include a character determining unit that finalizes the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters.
According to an embodiment of the present invention, characters can be extracted from a variety of novels without the training data needed to apply machine learning techniques.
Accordingly, books can be classified by identifying the social network among their characters, and the characters' gender, age, and similar attributes can be identified for use in a text-to-speech-based storytelling system that reads a book aloud in voices suited to its characters.
FIG. 1 is a diagram illustrating the functional blocks of an apparatus for extracting characters according to an exemplary embodiment.
FIG. 2 is a flowchart illustrating a method of extracting characters according to an exemplary embodiment.
FIG. 3 is a flowchart illustrating a method of extracting characters according to another exemplary embodiment.
FIGS. 4 to 7 are graphs illustrating the experimental results of extracting characters with the character extraction method.
The objects and technical configuration of the present invention, and the effects that follow from them, will be understood more clearly from the following detailed description based on the accompanying drawings. Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The embodiments disclosed in this specification should not be interpreted or used as limiting the scope of the invention. To those skilled in the art, it is evident that the description, including the embodiments in this specification, has a variety of applications. Accordingly, the embodiments in the detailed description are examples given to explain the invention better and are not intended to limit its scope to those embodiments.
The functional blocks shown in the drawings and described below are only examples of possible implementations. Other implementations may use other functional blocks without departing from the spirit and scope of the detailed description. Also, although one or more functional blocks of the present invention are shown as separate blocks, one or more of them may be a combination of various hardware and software components that perform the same function.
In addition, a statement that something includes certain components is an open-ended expression that merely indicates those components are present; it should not be understood as excluding additional components.
Furthermore, when a component is said to be connected or coupled to another component, it may be directly connected or coupled to that component, but it should be understood that other components may be present in between.
Expressions such as 'first' and 'second' are used only to distinguish between multiple components and do not limit the order or other features of the components.
Hereinafter, embodiments of the present invention are described with reference to the drawings.
FIG. 1 is a diagram illustrating the functional blocks of the character extraction apparatus 100 according to an exemplary embodiment.
Referring to FIG. 1, the character extraction apparatus 100 includes a preprocessing unit 110, a subject candidate extracting unit 120, and a character extracting unit 130, and may further include a character determining unit 140.
The preprocessing unit 110 reads, from an electronic document, the text from which characters are to be extracted. The term 'character' (등장인물) as used in this specification covers both the proper nouns and the common nouns among animate nouns. For example, proper nouns among animate nouns are person names such as "해리" (Harry), "이사벨라" (Isabella), and "빌리" (Billy), and common nouns among animate nouns are words such as "아버지" (father) and "어머니" (mother).
The subject candidate extracting unit 120 extracts, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the text read by the preprocessing unit 110.
The character extracting unit 130 extracts characters from the subject candidates extracted by the subject candidate extracting unit 120, based on animate-noun particles.
The character determining unit 140 finalizes the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the characters extracted by the character extracting unit 130.
The detailed operation of each component of the character extraction apparatus 100 is described with reference to FIGS. 2 and 3.
The preprocessing unit 110, the subject candidate extracting unit 120, the character extracting unit 130, and the character determining unit 140 of the above embodiment may be implemented by a computing device that includes a memory holding instructions programmed to perform their functions and a microprocessor that executes those instructions.
FIG. 2 is a flowchart illustrating a method by which the character extraction apparatus 100, running on an information processing apparatus having a processor, extracts characters according to an exemplary embodiment. The method of FIG. 2 may be performed by the character extraction apparatus 100 described with reference to FIG. 1, and its steps are as follows.
First, the preprocessing unit 110 reads, from an electronic document, the text from which characters are to be extracted (S210).
Next, the subject candidate extracting unit 120 extracts, as subject candidates, uninflected words combined with a subject-marking particle and a topic-marking particle in the read text (S220).
In Korean grammar, the subject-marking particles are '-이/-가' and the topic-marking particles are '-은/-는'. In many documents, including novels, an uninflected word that can serve as a subject is made the subject by attaching '-이/-가/-은/-는' to it, as in "해리가 말했다." and "해리는 말했다." (both "Harry said.").
Accordingly, in the subject candidate extraction step (S220), uninflected words used in combination with at least one of the particles '-이/-가/-은/-는' may be extracted from the read text as subject candidates.
또한, 소설의 등장인물들은 수차례부터 많게는 수백차례까지 텍스트에서 주어로 등장하기 때문에, 마지막 글자에 받침이 있는 등장인물의 경우 등장인물을 나타내는 글자에 '-이/-은'의 조사와 모두 결합되어 텍스트에 등장할 수 있고, 마지막 글자에 받침이 없는 등장인물의 경우 등장인물을 나타내는 글자에 '-가/-는'의 조사와 무두 결합되어 텍스트에 등장할 수 있다. In addition, since the characters of the novel appear as subjects in the text from several times to as many as hundreds, the characters with the support of the last letter are combined with the '-//-' surveys on the letters that represent the characters. Can appear in the text, and in the case of a character without a backing in the last character, the letter representing the character can be combined with the investigation of '-ga /-' and appear in the text.
이에 따라, 주어후보 추출 단계(S220)에서는 읽어들인 텍스트에서 '-이/-은' 또는 '-가/-는'의 조사 쌍과 함께 결합되어 사용된 적이 있는 체언들을 주어후보로 추출하여, 정확성을 향상시킬 수 있다. Accordingly, in the subject candidate extraction step (S220), the extracted candidates are extracted as the subject candidates, which have been used in combination with the survey pairs of '-//-' or '-//-' as the subject candidates, and the accuracy is correct. Can improve.
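The particle-pair rule of step S220 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes whitespace-delimited tokens and plain suffix matching instead of a morphological analyzer, and the names `has_batchim` and `subject_candidates` are hypothetical. The final-consonant test relies on the Unicode layout of precomposed Hangul syllables (28 final-consonant slots per syllable block, slot 0 meaning no final consonant).

```python
import re
from collections import defaultdict

HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3

def has_batchim(syllable: str) -> bool:
    """True if a precomposed Hangul syllable ends in a final consonant."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return False
    return (code - HANGUL_BASE) % 28 != 0

# Particle pairs from the description: '-이/-은' after a final consonant,
# '-가/-는' otherwise.
PAIR_WITH_BATCHIM = {"이", "은"}
PAIR_WITHOUT_BATCHIM = {"가", "는"}

def subject_candidates(text: str) -> set[str]:
    """Return stems seen with BOTH particles of their pair (assumed heuristic)."""
    seen = defaultdict(set)  # stem -> particles observed with it
    for token in re.findall(r"[가-힣]+", text):
        stem, particle = token[:-1], token[-1]
        if not stem:
            continue
        expected = PAIR_WITH_BATCHIM if has_batchim(stem[-1]) else PAIR_WITHOUT_BATCHIM
        if particle in expected:
            seen[stem].add(particle)
    return {stem for stem, particles in seen.items() if len(particles) == 2}

print(subject_candidates("해리가 말했다. 해리는 웃었다. 론이 왔다."))
```

In this sample only '해리' occurs with both '-가' and '-는', so '론', which occurs only with '-이', is not kept. Requiring the whole pair trades some recall for fewer false candidates.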
Then, the character extraction unit 130 extracts characters from the subject candidates based on animate-noun particles (S230). Here, an animate noun is a noun that denotes a person, an animal, or the like, and an animate-noun particle is a particle that can attach to an animate noun. Examples are shown in Table 1 below.
Table 1. Examples of animate-noun particles (romanization, gloss)
-한테 (-hante, to) | -한테서 (-hanteseo, from)
-한테로 (-hantero, toward) | -한테는 (-hanteneun, to, with topic marker)
-에게 (-ege, to) | -에게서 (-egeseo, from)
-에게로 (-egero, toward) | -에게는 (-egeneun, to, with topic marker)
-에겐 (-egen, contraction of -에게는) | -에게까지 (-egekkaji, even to)
-에게도 (-egedo, also to) | -에게선 (-egeseon, contraction of -에게서는)
-에게서는 (-egeseoneun, from, with topic marker) | -께서 (-kkeseo, honorific subject marker)
The base forms of the animate-noun particles are '-한테/-에게/-께', so characters can be extracted based on these base forms, and also based on every compound particle in which another particle is attached to a base form, such as '-한테서/-에게로/-께서는'.
For example, the character extraction unit 130 may extract as a character any extracted subject candidate that has been used in the text in combination with an animate-noun particle. Alternatively, it may extract as a character a subject candidate that occurs within a given number of eojeol (space-delimited word units) or syllables of a word used in combination with an animate-noun particle.
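The first variant of step S230 (keep a subject candidate if it ever occurs with an animate-noun particle) might look like the sketch below. It is an assumption-laden illustration: the particle set is taken from Table 1 plus the forms '-께' and '-께서는' named in the description, matching is by plain suffix, and `extract_characters` is a hypothetical name. The distance-based variant is not covered.

```python
import re

# Animate-noun particles from Table 1, plus '-께' and '-께서는' from the text.
ANIMATE_PARTICLES = {
    "한테", "한테서", "한테로", "한테는",
    "에게", "에게서", "에게로", "에게는", "에겐",
    "에게까지", "에게도", "에게선", "에게서는",
    "께", "께서", "께서는",
}

def extract_characters(candidates: set[str], text: str) -> set[str]:
    """Keep candidates that occur at least once with an animate-noun particle."""
    characters = set()
    for token in re.findall(r"[가-힣]+", text):
        for particle in ANIMATE_PARTICLES:
            if token.endswith(particle):
                stem = token[: -len(particle)]
                if stem in candidates:
                    characters.add(stem)
    return characters

text = "론은 해리에게 말했다. 학교에 갔다. 해리한테서 편지가 왔다."
print(extract_characters({"해리", "학교"}, text))
```

Here '학교' (school) is dropped because it never takes an animate-noun particle, while '해리' is kept.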
In addition, the character determination unit 140 may finalize the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters (S240).
If every subject candidate that appears in the text together with an animate-noun particle listed in Table 1 is extracted as a character, a considerable number of the extracted characters may be pronouns, nouns denoting unspecified persons, numerals, or dependent nouns.
Therefore, in this step (S240), even if a subject candidate has been extracted as a character because it is used with an animate-noun particle in the text, the characters can be finalized by excluding extracted characters that fall under any of the following:
1) pronouns (e.g., '나' (I), '우리' (we), '그' (he), '그녀' (she)),
2) unspecified nouns (e.g., '사람' (person), '남자' (man), '여자' (woman)),
3) collective nouns (e.g., '일가' (household), '가족' (family), '무리' (group)),
4) plural forms (e.g., '사람들' (people), '남자들' (men), '여자들' (women)),
5) numerals (e.g., '하나' (one), '둘' (two), '셋' (three)),
6) dependent nouns (e.g., '놈' (fellow), '명' (counter for people), '분' (honorific word for a person)).
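The exclusion step S240 amounts to a filter over the extracted characters. The sketch below uses only the example words listed above as stop lists. A real system would need much fuller lists, and treating every word that ends in '들' as a plural is an assumption made for this illustration. `confirm_characters` is a hypothetical name.

```python
# Stop lists built only from the examples given in the description.
PRONOUNS = {"나", "우리", "그", "그녀"}
UNSPECIFIED_NOUNS = {"사람", "남자", "여자"}
COLLECTIVE_NOUNS = {"일가", "가족", "무리"}
NUMERALS = {"하나", "둘", "셋"}
DEPENDENT_NOUNS = {"놈", "명", "분"}

EXCLUDED = (PRONOUNS | UNSPECIFIED_NOUNS | COLLECTIVE_NOUNS
            | NUMERALS | DEPENDENT_NOUNS)

def confirm_characters(characters: set[str]) -> set[str]:
    """Drop excluded word classes; plural detection by the suffix '들' is assumed."""
    return {
        name for name in characters
        if name not in EXCLUDED and not name.endswith("들")
    }

extracted = {"해리", "그녀", "사람들", "가족", "셋", "헤르미온느"}
print(confirm_characters(extracted))
```

Only '해리' and '헤르미온느' survive the filter in this example.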
FIG. 3 is a flowchart illustrating a method by which the character extraction apparatus 100, running on an information processing apparatus having a processor, extracts characters according to another embodiment. The character extraction method of FIG. 3 may be performed by the character extraction apparatus 100 described with reference to FIG. 1, and its steps are as follows.
First, the preprocessing unit 110 reads the text from which characters are to be extracted from an electronic document. At this point, the preprocessing unit 110 may divide the text into utterances and narrative (S310).
An utterance is a thought of a character in the novel realized as a sentence, and the author marks a sentence as an utterance by using quotation marks ("", ''). Narrative is the set of sentences that carries the plot of the novel, and consists of sentences that describe the narrative nature of a series of events from a first-person or third-person point of view.
To this end, the preprocessing unit 110 may classify text inside quotation marks as utterance and all other text as narrative.
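The quotation-mark split of step S310 could be sketched with a regular expression, as below. The set of quote characters (straight and curly, single and double) and the function name `split_utterance_narrative` are assumptions. Nested or unbalanced quotes are not handled.

```python
import re

# Quoted spans: straight or curly, double or single quotation marks.
QUOTED = re.compile("|".join([
    r'"[^"]*"',
    r"'[^']*'",
    r"“[^”]*”",
    r"‘[^’]*’",
]))

def split_utterance_narrative(text: str) -> tuple[list[str], str]:
    """Return (utterances without their quote marks, remaining narrative text)."""
    utterances = [m.group(0)[1:-1] for m in QUOTED.finditer(text)]
    narrative = QUOTED.sub(" ", text)
    return utterances, narrative

sample = '해리가 말했다. "론에게 물어봐." 론은 웃었다.'
utterances, narrative = split_utterance_narrative(sample)
print(utterances)
print(narrative)
```

Replacing each quoted span with a space rather than an empty string keeps the narrative sentences on either side from being glued together.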
Next, the subject candidate extraction unit 120 extracts, as subject candidates, substantives combined with nominative case particles and auxiliary particles in the narrative portion of the read text (S320).
Characters of a novel appear as subjects in the narrative, but in utterances they generally do not appear as subjects, because the subject is omitted or replaced by a pronoun. Characters can, however, appear in utterances together with animate-noun particles. For this reason, subject candidates are extracted only from the narrative text, whereas the subsequent step (S330), which extracts characters using animate-noun particles, covers both the subject candidates extracted from the narrative and the utterances as a whole. Because step S320 searches only the narrative for subject candidates, subject candidates can be extracted faster.
Accordingly, the character extraction unit 130 extracts as characters the subject candidates from the narrative that are used in combination with an animate-noun particle, and also extracts as characters the substantives in the utterances that are used in combination with an animate-noun particle (S330).
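Step S330 treats narrative and utterances differently, which the following sketch tries to make concrete. It assumes the narrative/utterance split and the subject candidates are already available, and it recognizes particles with a simple pattern: a base form ('-한테', '-에게', '-께', or the contraction '-에겐') optionally followed by one further particle. All names are hypothetical, and without a morphological analyzer the suffix match yields false positives such as '함께' ("together").

```python
import re

# Base form, optionally followed by another particle (e.g. '-에게서는').
ANIMATE = re.compile(r"(.+?)(한테|에게|에겐|께)(서는|서|선|로|는|도|까지)?")

def stems_with_animate_particle(text: str) -> set[str]:
    """Stems of tokens that end in an animate-noun particle (naive matching)."""
    stems = set()
    for token in re.findall(r"[가-힣]+", text):
        match = ANIMATE.fullmatch(token)
        if match:
            stems.add(match.group(1))
    return stems

def extract_characters_fig3(candidates, narrative, utterances):
    # Narrative: only previously extracted subject candidates qualify.
    from_narrative = stems_with_animate_particle(narrative) & set(candidates)
    # Utterances: any substantive with an animate-noun particle qualifies.
    from_utterances = set()
    for utterance in utterances:
        from_utterances |= stems_with_animate_particle(utterance)
    return from_narrative | from_utterances

result = extract_characters_fig3(
    {"해리", "론"},
    "해리가 론에게 말했다. 빛한테 물었다.",
    ["덤블도어께서는 아실 거야."],
)
print(result)
```

In this example '빛' (light) is rejected in the narrative because it is not a subject candidate, while '덤블도어' is accepted from the utterance even though it was never a subject candidate.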
In addition, the character determination unit 140 may finalize the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters (S340).
FIGS. 4 to 7 are graphs for analyzing the results of extracting characters with the embodiments described above.
To calculate the precision, recall, and F-measure of the character extraction results, the characters extracted by the embodiment were compared with the actual characters, which were extracted manually.
Precision, recall, and F-measure are defined as follows.
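\[ \text{Precision} = \frac{\text{number of extracted characters that are actual characters}}{\text{number of characters extracted by the embodiment}} \]
\[ \text{Recall} = \frac{\text{number of actual character names among the extracted characters}}{\text{number of actual character names extracted manually}} \]
\[ F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]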
Precision is the proportion of the characters extracted by the embodiment that are actual characters, that is, not errors. Recall is the proportion of the manually extracted actual character names (the characters that are proper nouns) that are found among the characters extracted by the embodiment. The F-measure is the harmonic mean of precision and recall.
Referring to FIG. 4, the experiment was conducted on 80 Korean-language novels, each consisting of roughly 100,000 words. The x-axis of FIG. 4 is the novel index: novels 1 to 11 are novels translated into Korean, and novels 12 to 80 are novels originally written in Korean.
FIG. 5 shows the precision and recall obtained when characters were extracted from the 80 Korean-language novels according to the embodiment. As in FIG. 4, indices 1 to 11 on the x-axis of FIG. 5 are novels translated into Korean, and indices 12 to 80 are novels originally written in Korean.
According to the experimental results of FIG. 5, a total of 1,809 characters were extracted from the 80 novels, of which 1,773 were extracted correctly, giving an overall precision of 98.01%. Expressed per novel, 22.61 characters were extracted on average, of which 22.16 were correct. A total of 36 words that cannot be characters were extracted from the 80 novels, and most of them were words used in a personified sense. For example, in phrases such as "그 착한 목소리에게..." ("to that kind voice...") and "지중해의 빛한테..." ("to the light of the Mediterranean..."), '목소리' (voice) and '빛' (light) are personified and used with animate-noun particles, and were therefore extracted as characters.
Calculating recall from the experimental results of FIG. 5, 1,002 of the 1,431 characters extracted manually from the 80 novels were extracted, giving an overall recall of 70.02%. Expressed per novel, 17.89 characters exist on average, of which the embodiment found 12.53.
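As a quick check of the definitions, the overall figures reported for FIG. 5 can be recomputed from the raw counts. The function names below are illustrative. The counts (1,809 extracted, 1,773 correct, 1,431 actual names, 1,002 found) are the ones stated in the text.

```python
def precision(correct: int, extracted: int) -> float:
    """Share of extracted characters that are actual characters."""
    return correct / extracted

def recall(found: int, actual: int) -> float:
    """Share of manually identified character names that were found."""
    return found / actual

def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

p = precision(1773, 1809)
r = recall(1002, 1431)
print(f"precision = {p:.2%}")  # 98.01%, as reported for FIG. 5
print(f"recall    = {r:.2%}")  # 70.02%, as reported for FIG. 5
print(f"F-measure = {f_measure(p, r):.2%}")
```

Note that the two numerators differ: precision counts every correctly extracted character, while recall counts only extracted characters that match a manually identified proper name.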
FIGS. 6 and 7 are graphs analyzing how much recall rises when the goal is to find the characters whose appearance rate is at least 1% (FIG. 6) or at least 0.5% (FIG. 7).
The appearance rate is the ratio of one character's appearance frequency to the sum of the appearance frequencies of all characters. For example, an appearance rate of 1% for character A means that A's appearance frequency is 1% of the sum of the appearance frequencies of all characters in the novel.
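The appearance rate can be computed from a flat list of character mentions, as in this sketch. The mention counts are invented purely for illustration, and `appearance_rates` and `frequent_characters` are hypothetical names.

```python
from collections import Counter

def appearance_rates(mentions: list[str]) -> dict[str, float]:
    """Each character's mention count divided by the total mention count."""
    counts = Counter(mentions)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

def frequent_characters(mentions: list[str], threshold: float) -> set[str]:
    """Characters whose appearance rate is at least `threshold`."""
    return {
        name for name, rate in appearance_rates(mentions).items()
        if rate >= threshold
    }

# Illustrative data only: 200 mentions in total.
mentions = ["해리"] * 150 + ["론"] * 49 + ["필치"] * 1
print(frequent_characters(mentions, 0.01))   # 1% threshold
print(frequent_characters(mentions, 0.005))  # 0.5% threshold
```

With these made-up counts, '필치' has an appearance rate of 0.5%, so it is included at the 0.5% threshold but not at the 1% threshold.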
Calculating recall from the results of FIG. 6, 1,115 of the 1,431 characters extracted manually from the 80 novels were extracted, so overall recall rose to 77.92%. Calculating recall from the results of FIG. 7, 1,199 of the 1,431 manually extracted characters were extracted, so overall recall rose to 83.79%. Moreover, in FIG. 7 the per-novel recall exceeded 70% for all but five novels.
From the results of FIGS. 6 and 7, the F-measure for finding all characters whose per-novel appearance rate is at least 1% or at least 0.5% was 86.82% and 90.34%, respectively.
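The two F-measure values can be reproduced from the figures already given, assuming the overall precision of 98.01% from FIG. 5 is used together with the recall values of 77.92% and 83.79%. The text does not state this pairing explicitly, but it matches the reported numbers.

```latex
\[
F_{1\%} = \frac{2 \times 0.9801 \times 0.7792}{0.9801 + 0.7792}
        = \frac{1.5274}{1.7593} \approx 0.8682
\]
\[
F_{0.5\%} = \frac{2 \times 0.9801 \times 0.8379}{0.9801 + 0.8379}
          = \frac{1.6425}{1.8180} \approx 0.9034
\]
```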
These results confirm that the embodiments described above can effectively extract characters from a variety of novels without the training data required to apply machine learning techniques.
Furthermore, the extracted character information could be used in further research to derive the relationships between characters, or to additionally extract characters that appear in other sentences whose patterns are similar to those of the sentences in which the extracted characters are used.
Meanwhile, the embodiments of the present invention described above may be implemented through various means. For example, the embodiments may be implemented in hardware, firmware, software, or a combination thereof.
In the case of a hardware implementation, the method according to the embodiments of the present invention may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), processors, controllers, microcontrollers, microprocessors, and the like.
In the case of a firmware or software implementation, the method according to the embodiments of the present invention may be implemented in the form of a module, procedure, or function that performs the functions or operations described above. The software code may be stored in a memory unit and executed by a processor. The memory unit may be located inside or outside the processor and may exchange data with the processor by various known means.
As such, those skilled in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims (16)

  1. A method by which a character extraction apparatus, running on an information processing apparatus having a processor, extracts characters, the method comprising:
    a preprocessing step of reading text from an electronic document;
    a step of extracting, as a subject candidate, a substantive combined with a nominative case particle and an auxiliary particle in the text; and
    a step of extracting a character from the subject candidate based on an animate-noun particle.
  2. The character extraction method of claim 1,
    wherein the step of extracting a character
    extracts, as a character, a subject candidate used in combination with the animate-noun particle.
  3. The character extraction method of claim 1,
    wherein the preprocessing step
    classifies text inside quotation marks as utterance and classifies all other text as narrative.
  4. The character extraction method of claim 3,
    wherein the step of extracting a subject candidate
    extracts, as a subject candidate, a substantive combined with a nominative case particle and an auxiliary particle in the narrative, and
    wherein the step of extracting a character
    extracts, as a character, a subject candidate that is used in combination with the animate-noun particle among the subject candidates extracted from the narrative, and extracts, as a character, a substantive that is used in combination with the animate-noun particle in the utterance.
  5. The character extraction method of claim 1,
    wherein the nominative case particle and the auxiliary particle
    are at least one of '-이', '-가', '-은', and '-는'.
  6. The character extraction method of claim 1,
    wherein the animate-noun particle
    is at least one of '-한테', '-에게', '-께', and compound particles in which another particle is attached to '-한테', '-에게', or '-께'.
  7. The character extraction method of claim 1, further comprising:
    a step of finalizing the character by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters.
  8. A character extraction apparatus having at least one processor that implements:
    a preprocessing unit that reads text from an electronic document;
    a subject candidate extraction unit that extracts, as a subject candidate, a substantive combined with a nominative case particle and an auxiliary particle in the text; and
    a character extraction unit that extracts a character from the subject candidate based on an animate-noun particle.
  9. The character extraction apparatus of claim 8,
    wherein the character extraction unit
    extracts, as a character, a subject candidate used in combination with the animate-noun particle.
  10. The character extraction apparatus of claim 8,
    wherein the preprocessing unit
    classifies text inside quotation marks as utterance and classifies all other text as narrative.
  11. The character extraction apparatus of claim 10,
    wherein the subject candidate extraction unit
    extracts, as a subject candidate, a substantive combined with a nominative case particle and an auxiliary particle in the narrative, and
    wherein the character extraction unit
    extracts, as a character, a subject candidate that is used in combination with the animate-noun particle among the subject candidates extracted from the narrative, and extracts, as a character, a substantive that is used in combination with the animate-noun particle in the utterance.
  12. The character extraction apparatus of claim 8,
    wherein the nominative case particle and the auxiliary particle
    are at least one of '-이', '-가', '-은', and '-는'.
  13. The character extraction apparatus of claim 8,
    wherein the animate-noun particle
    is at least one of '-한테', '-에게', '-께', and compound particles in which another particle is attached to '-한테', '-에게', or '-께'.
  14. The character extraction apparatus of claim 8, further comprising:
    a character determination unit that finalizes the character by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters.
  15. A program stored on a computer-readable recording medium to perform a character extraction method comprising:
    a preprocessing step of reading text from an electronic document;
    a step of extracting, as a subject candidate, a substantive combined with a nominative case particle and an auxiliary particle in the text; and
    a step of extracting a character from the subject candidate based on an animate-noun particle.
  16. A computer-readable recording medium having recorded thereon a program comprising instructions that, when executed on at least one processor, cause the processor to perform:
    an operation of reading text from an electronic document;
    an operation of extracting, as a subject candidate, a substantive combined with a nominative case particle and an auxiliary particle in the text; and
    an operation of extracting a character from the subject candidate based on an animate-noun particle.
PCT/KR2016/015284 2016-08-08 2016-12-26 Method and device for extracting character WO2018030595A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020160100737A KR101869016B1 (en) 2016-08-08 2016-08-08 Method and apparatus for extracting character
KR10-2016-0100737 2016-08-08

Publications (1)

Publication Number Publication Date
WO2018030595A1 true WO2018030595A1 (en) 2018-02-15

Family

ID=61162351

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2016/015284 WO2018030595A1 (en) 2016-08-08 2016-12-26 Method and device for extracting character

Country Status (2)

Country Link
KR (1) KR101869016B1 (en)
WO (1) WO2018030595A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102496958B1 (en) * 2020-12-15 2023-02-08 주식회사 아이포트폴리오 Book story data base generating method for reading evaluation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19980066877A (en) * 1997-01-29 1998-10-15 김광호 Morphological interpretation based on types of unregistered words
JP2008176630A (en) * 2007-01-19 2008-07-31 Toshiba Corp Document data processing apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0772888A (en) * 1993-09-01 1995-03-17 Matsushita Electric Ind Co Ltd Information processor
KR101138822B1 (en) * 2009-11-19 2012-05-10 한국과학기술원 Method and system for managing annoted names of people appearing in digital photos
KR102069697B1 (en) * 2013-07-29 2020-02-24 한국전자통신연구원 Apparatus and method for automatic interpretation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KIM HYEONG JEONG: "The Animacy of the Proceeding Noun Phrase and the Choice of", YONSEI INSTITUTE OF LANGUAGE AND INFORMATION STUDIES, vol. 26, 2010, pages 141 - 195 *
KIM SEO-HEE ET AL.: "A Recognition method for Main Characters Name in Korean Novels", JOURNAL OF KOREA INSTITUTE OF INFORMATION, ELECTRONICS AND COMMUNICATION TECHNOLOGY, vol. 09, no. 1, February 2016 (2016-02-01), pages 75 - 81 *
KO SEOK JU: "The Semantic Information of Noun in Construction Description", KOREAN LINGUISTICS, vol. 52, pages 1 - 24 *

Also Published As

Publication number Publication date
KR101869016B1 (en) 2018-06-19
KR20180016840A (en) 2018-02-20

Similar Documents

Publication Publication Date Title
WO2017010652A1 (en) Automatic question and answer method and device therefor
WO2015023035A1 (en) Preposition error correcting method and device performing same
WO2016208941A1 (en) Text preprocessing method and preprocessing system for performing same
WO2014069779A1 (en) Syntax preprocessing-based syntax analysis apparatus, and method for same
KR20080033325A (en) Definition extraction
WO2014030834A1 (en) Method for detecting grammatical errors, error detection device for same, and computer-readable recording medium having method recorded thereon
Qiu et al. Word segmentation for Chinese novels
WO2020138607A1 (en) Method and device for providing question and answer using chatbot
WO2018101506A1 (en) Document multi-classification device and document multi-classification method for classifying one document into plurality of categories by using lexico-semantic pattern obtained by reconfiguring semantic category of words constituting sentence
Tufiş A cheap and fast way to build useful translation lexicons
Lewis et al. Automatically identifying computationally relevant typological features
Zakharov Corpora of the Russian language
WO2012030053A2 (en) Apparatus and method for recognizing an idiomatic expression using phrase alignment of a parallel corpus
WO2018030595A1 (en) Method and device for extracting character
WO2012060534A1 (en) Device and method for building phrasal verb translation pattern using parallel corpus
Uchimoto et al. Morphological analysis of the Corpus of Spontaneous Japanese
WO2023163383A1 (en) Multimodal-based method and apparatus for recognizing emotion in real time
WO2020111374A1 (en) System for converting voice lecture file into text on basis of lecture related keywords
Oudah et al. Person name recognition using the hybrid approach
Boulaknadel et al. Amazighe Named Entity Recognition using a A rule based approach
Sikdar et al. Adapting a state-of-the-art anaphora resolution system for resource-poor language
Phadte et al. Towards normalising Konkani-English code-mixed social media text
Hossain et al. A framework for political portmanteau decomposition
Hamed et al. Exploring the effects of diacritization on Arabic frequency counts
Aggarwal et al. A survey on parts of speech tagging for Indian languages

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16912793

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16912793

Country of ref document: EP

Kind code of ref document: A1