WO2018030595A1

WO2018030595A1 - Method and device for extracting character

Info

Publication number: WO2018030595A1
Application number: PCT/KR2016/015284
Authority: WO
Inventors: 김승훈; 박태근
Original assignee: 단국대학교 산학협력단
Priority date: 2016-08-08
Filing date: 2016-12-26
Publication date: 2018-02-15
Also published as: KR101869016B1; KR20180016840A

Abstract

A method for extracting a character according to an embodiment of the present invention is performed in an information processing device having a processor, and comprises: a pre-processing step of reading a text from an electronic document; a step of extracting, as a subject candidate, an uninflected word coupled to a subject marking particle and a topic marking particle in the text; and a step of extracting a character from the subject candidates on the basis of a particle for an animate noun.

Description

Character Extraction Method and Apparatus

The present invention relates to a method and apparatus for extracting characters, and more particularly to a method and apparatus that extract characters from the text of novels translated into Korean and novels originally written in Korean on the basis of the particles used with animate nouns, so that characters can be extracted from a wide range of novels without the training data that machine learning techniques require.

Information extraction is the task of extracting important information, such as entities and events, from natural-language text. Named entity recognition, a part of information extraction, is a technique that finds the names of entities in text and classifies them into predefined classes such as person names, place names, and organization names.

Most named entity recognition techniques use rule-based algorithms or machine learning. Recently, hybrid techniques have been proposed that keep the advantages of both approaches while reducing their disadvantages.

However, according to analytical studies of named entity recognition, it is not easy to apply a technique to a text genre other than the one it was designed for. Existing techniques have nevertheless been applied without regard to text genre and domain, and have largely been limited to extracting entity names from texts such as newspaper articles.

An object of the present invention is to provide a technique for extracting characters from various novels without the training data needed to apply machine learning techniques.

However, the technical problems to be solved by embodiments of the present invention are not limited to the problem mentioned above, and various other technical problems can be derived from the following description within a scope that is apparent to those skilled in the art.

According to an embodiment, a method of extracting a character, performed by a character extraction apparatus in an information processing apparatus having a processor, may include: a preprocessing step of reading text from an electronic document; a step of extracting, as a subject candidate, an uninflected word combined with a subject-marking particle or a topic-marking particle in the text; and a step of extracting a character from the subject candidates on the basis of a particle for animate nouns.

Here, the step of extracting the character may extract, as the character, a subject candidate that is used in combination with the particle for animate nouns.

In the preprocessing step, text inside quotation marks may be classified as utterance, and the remaining text may be classified as narrative.

In addition, the step of extracting the subject candidate may extract, as the subject candidate, an uninflected word combined with the subject-marking particle or the topic-marking particle in the narrative, and the step of extracting the character may extract, as characters, both a subject candidate that was extracted from the narrative and is used in combination with the particle for animate nouns, and an uninflected word that is used in combination with the particle for animate nouns in the utterance.

In addition, the subject-marking particle and the topic-marking particle may include at least one of '-이', '-가', '-은', and '-는'.

In addition, the particle for animate nouns may include at least one of '-에게', '-한테', '-께', and compound particles in which '-에게', '-한테', or '-께' is combined with another particle.

In addition, the method may further include a step of determining the characters by excluding, from the extracted characters, at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns.

According to an embodiment, an apparatus for extracting a character, having at least one processor, may include: a preprocessing unit that reads text from an electronic document; a subject candidate extracting unit that extracts, as a subject candidate, an uninflected word combined with a subject-marking particle or a topic-marking particle in the text; and a character extracting unit that extracts a character from the subject candidates on the basis of a particle for animate nouns.

Here, the character extracting unit may extract, as the character, a subject candidate that is used in combination with the particle for animate nouns.

In addition, the preprocessing unit may classify text inside quotation marks as utterance and the remaining text as narrative.

In addition, the subject candidate extracting unit may extract, as the subject candidate, an uninflected word combined with the subject-marking particle or the topic-marking particle in the narrative, and the character extracting unit may extract, as characters, both a subject candidate that was extracted from the narrative and is used in combination with the particle for animate nouns, and an uninflected word that is used in combination with the particle for animate nouns in the utterance.

In this case, the subject-marking particle and the topic-marking particle may include at least one of '-이', '-가', '-은', and '-는'.

In addition, the apparatus may further include a character determining unit that determines the characters by excluding, from the extracted characters, at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns.

According to an embodiment of the present invention, characters can be extracted from various novels without the training data needed to apply machine learning techniques.

As a result, the extracted characters can be used to classify books by identifying the social networks among characters and by analyzing the gender and age of the characters, and to read the content of a book aloud in voices that match its characters in text-to-speech-based storytelling systems.

FIG. 1 is a functional block diagram of an apparatus for extracting a character according to an exemplary embodiment.

FIG. 2 is a flowchart illustrating a method of extracting a character according to an exemplary embodiment.

FIG. 3 is a flowchart illustrating a method of extracting a character according to another exemplary embodiment.

FIGS. 4 to 7 are graphs for explaining experimental results of extracting characters with the character extraction method.

Details of the object and technical configuration of the present invention, and of the resulting effects, will be more clearly understood from the following detailed description based on the accompanying drawings. Embodiments according to the present invention are described in detail below with reference to the accompanying drawings.

The embodiments disclosed herein should not be interpreted or used as limiting the scope of the invention. It is obvious to those skilled in the art that the description, including the embodiments herein, has a variety of applications. Accordingly, certain embodiments described in the detailed description of the present invention are illustrative for better understanding of the present invention and are not intended to limit the scope of the present invention to the embodiments.

The functional blocks shown in the figures and described below are only examples of possible implementations. Other functional blocks may be used in other implementations without departing from the spirit and scope of the detailed description. Also, while one or more functional blocks of the present invention are represented by separate blocks, one or more of the functional blocks of the present invention may be a combination of various hardware and software configurations that perform the same function.

In addition, the expression of including certain components merely refers to the presence of the components as an open expression, and should not be understood as excluding additional components.

Further, when a component is referred to as being connected or coupled to another component, it should be understood that the component may be directly connected or coupled to that other component, or that other components may be present in between.

Also, an expression such as 'first' and 'second' is used only for distinguishing a plurality of configurations, and does not limit the order or other features between the configurations.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a functional block diagram of the character extraction apparatus 100 according to an exemplary embodiment.

Referring to FIG. 1, the character extraction apparatus 100 includes a preprocessing unit 110, a subject candidate extracting unit 120, and a character extracting unit 130, and may further include a character determining unit 140.

The preprocessing unit 110 reads, from an electronic document, the text from which characters are to be extracted. The term 'character' as used in this specification covers both proper nouns and common nouns. For example, proper nouns are personal names such as "Harry", "Isabella", and "Billy", and common nouns are words such as "father" and "mother".

The subject candidate extracting unit 120 extracts, as subject candidates, uninflected words combined with a subject-marking particle or a topic-marking particle in the text read by the preprocessing unit 110.

The character extracting unit 130 extracts characters from the subject candidates extracted by the subject candidate extracting unit 120, on the basis of particles for animate nouns.

The character determining unit 140 determines the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the characters extracted by the character extracting unit 130.

The detailed operation of each component of the character extraction apparatus 100 is described below with reference to FIGS. 2 and 3.

Meanwhile, the preprocessing unit 110, the subject candidate extracting unit 120, the character extracting unit 130, and the character determining unit 140 of the above-described embodiment may each be implemented by a computing device that includes a memory holding instructions programmed to perform the unit's function and a microprocessor that executes those instructions.

FIG. 2 is a flowchart illustrating a method of extracting a character, performed by the character extraction apparatus 100 in an information processing apparatus having a processor, according to an exemplary embodiment. The method of FIG. 2 may be performed by the character extraction apparatus 100 described with reference to FIG. 1.

First, the preprocessing unit 110 reads, from an electronic document, the text from which characters are to be extracted (S210).

Next, the subject candidate extracting unit 120 extracts, as subject candidates, uninflected words combined with a subject-marking particle or a topic-marking particle in the read text (S220).

In Korean grammar, the subject-marking particles are defined as '-이/-가' and the topic-marking particles as '-은/-는'. In many documents, including novels, '-이/-가/-은/-는' is attached to an uninflected word that can serve as the subject. For example, a sentence meaning "Harry said" may be written with either the subject-marking particle or the topic-marking particle attached to "Harry".

Thus, in the subject candidate extraction step (S220), uninflected words used in combination with at least one of '-이/-가/-은/-는' are extracted from the read text as subject candidates.

In addition, a character in a novel appears as a subject in the text anywhere from a few times to hundreds of times. A character whose name ends in a final consonant (batchim) can appear in the text with the particles '-이/-은' attached to the name, and a character whose name does not end in a final consonant can appear with the particles '-가/-는' attached.

Accordingly, in the subject candidate extraction step (S220), accuracy can be improved by extracting as subject candidates only those uninflected words that have been used with the particle pair '-이/-은' or the particle pair '-가/-는'.
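The candidate extraction in step S220 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes whitespace-delimited Hangul tokens and plain suffix matching, whereas a practical system would likely rely on a morphological analyzer to avoid false matches such as verb endings in '-는'. The function names `has_final_consonant` and `extract_subject_candidates` and the `require_pair` flag are hypothetical.

```python
import re
from collections import defaultdict
from typing import Set

# Subject-marking particles (-이/-가) and topic-marking particles (-은/-는).
SUBJECT_TOPIC_PARTICLES = ("이", "가", "은", "는")

def has_final_consonant(syllable: str) -> bool:
    """True if a precomposed Hangul syllable ends in a final consonant (batchim)."""
    code = ord(syllable) - 0xAC00
    return 0 <= code <= 11171 and code % 28 != 0

def extract_subject_candidates(text: str, require_pair: bool = False) -> Set[str]:
    """Collect stems that occur immediately before -이/-가/-은/-는 (assumed heuristic)."""
    seen = defaultdict(set)  # stem -> particles observed with that stem
    for token in re.findall(r"[가-힣]+", text):
        if len(token) < 2 or token[-1] not in SUBJECT_TOPIC_PARTICLES:
            continue
        stem, particle = token[:-1], token[-1]
        # -이/-은 follow a final consonant; -가/-는 follow a vowel.
        expected = ("이", "은") if has_final_consonant(stem[-1]) else ("가", "는")
        if particle in expected:
            seen[stem].add(particle)
    if require_pair:
        # Stricter variant: keep stems seen with both particles of their pair.
        return {stem for stem, particles in seen.items() if len(particles) == 2}
    return set(seen)

sample = "해리가 말했다. 해리는 웃었다. 헤르미온느가 왔다."
candidates = extract_subject_candidates(sample)
paired = extract_subject_candidates(sample, require_pair=True)
```

With `require_pair=True` the sketch keeps only stems observed with both particles of the matching pair, which mirrors the accuracy refinement described above at the cost of dropping rarely mentioned names.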

Subsequently, the character extracting unit 130 extracts characters from the subject candidates on the basis of particles for animate nouns (S230). Here, an animate noun is a noun denoting a person or an animal, and a particle for animate nouns is a particle that can be attached to an animate noun. Examples are shown in Table 1 below.

Table 1

-한테 (-hante)	-한테서 (-hanteseo)
-한테로 (-hantero)	-한테는 (-hanteneun)
-에게 (-ege)	-에게서 (-egeseo)
-에게로 (-egero)	-에게는 (-egeneun)
-에겐 (-egen)	-에게까지 (-egekkaji)
-에게도 (-egedo)	-에게선 (-egeseon)
-에게서는 (-egeseoneun)	-께서 (-kkeseo)

Since the basic forms of the particles for animate nouns are '-에게/-한테/-께', characters can be extracted on the basis of these basic forms, and can also be extracted on the basis of all compound particles in which '-에게/-한테/-께' is combined with another particle.

For example, the character extracting unit 130 may extract as a character any extracted subject candidate that is used in the text in combination with a particle for animate nouns. Alternatively, a subject candidate located within a certain number of words or syllables of a word used in combination with a particle for animate nouns may be extracted as a character.
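The first variant of step S230 (keep a subject candidate only if it also occurs with a particle for animate nouns) might look like the sketch below. It is an assumption-laden illustration: the particle list is taken from Table 1, matching is done by plain suffix comparison on whitespace-delimited tokens, and the names `strip_animate_particle` and `extract_characters` are hypothetical.

```python
import re

# Particles for animate nouns from Table 1, sorted longest first so that
# compound forms are tried before the base forms they contain.
ANIMATE_PARTICLES = sorted(
    ["한테", "한테서", "한테로", "한테는", "에게", "에게서", "에게로",
     "에게는", "에겐", "에게까지", "에게도", "에게선", "에게서는", "께서"],
    key=len,
    reverse=True,
)

def strip_animate_particle(token):
    """Return the stem if the token ends in an animate-noun particle, else None."""
    for particle in ANIMATE_PARTICLES:
        if token.endswith(particle) and len(token) > len(particle):
            return token[: -len(particle)]
    return None

def extract_characters(text, subject_candidates):
    """Keep the subject candidates that also appear with an animate-noun particle."""
    found = set()
    for token in re.findall(r"[가-힣]+", text):
        stem = strip_animate_particle(token)
        if stem in subject_candidates:
            found.add(stem)
    return found

text = "해리가 론에게 말했다. 편지가 해리한테서 왔다."
characters = extract_characters(text, {"해리", "편지"})
```

In this example "편지" (letter) is a subject candidate but never takes an animate-noun particle, so only "해리" survives, which is the filtering effect the step relies on.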

In addition, the character determining unit 140 may determine the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters (S240).

If every subject candidate that appears in the text together with a particle for animate nouns listed in Table 1 is extracted as a character, many of the extracted characters may be pronouns, unspecified nouns, or dependent nouns.

Therefore, in this step (S240), even when a word has been extracted as a character because it is a subject candidate used in combination with a particle for animate nouns, the characters can be determined by excluding the following from the extracted characters:

1) pronouns (e.g., 'I', 'we', 'he', 'she'),

2) unspecified nouns (e.g., 'person', 'man', 'woman'),

3) collective nouns (e.g., 'family', 'group'),

4) plural forms (e.g., 'people', 'men', 'women'),

5) numerals (e.g., 'one', 'two', 'three'), and

6) dependent nouns (e.g., '놈' (nom), '명' (myeong), '분' (bun)).
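The exclusion in step S240 can be illustrated with a simple stop-list filter. The Korean word lists below are assumptions made for this sketch (small back-translations of the examples above), not lexicons given in the source, and treating every word ending in the plural suffix '-들' as a plural form is likewise a simplification. The name `determine_characters` is hypothetical.

```python
# Hypothetical, deliberately tiny stop lists; a real system would need fuller lexicons.
EXCLUDED = {
    "pronoun": {"나", "우리", "그", "그녀"},
    "unspecified noun": {"사람", "남자", "여자"},
    "collective noun": {"가족", "무리"},
    "numeral": {"하나", "둘", "셋"},
    "dependent noun": {"놈", "명", "분"},
}
PLURAL_SUFFIX = "들"  # assumed marker for plural forms

def determine_characters(extracted):
    """Drop extracted words that fall into any excluded category."""
    excluded_words = set().union(*EXCLUDED.values())
    return {
        word
        for word in extracted
        if word not in excluded_words and not word.endswith(PLURAL_SUFFIX)
    }

final = determine_characters({"해리", "그", "사람들", "아버지", "분"})
```

Note that common nouns such as "아버지" (father) are kept, consistent with the definition of a character as covering both proper nouns and common nouns.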

FIG. 3 is a flowchart illustrating a method of extracting a character, performed by the character extraction apparatus 100 in an information processing apparatus having a processor, according to another exemplary embodiment. The method of FIG. 3 may be performed by the character extraction apparatus 100 described with reference to FIG. 1.

First, the preprocessing unit 110 reads, from an electronic document, the text from which characters are to be extracted. At this time, the preprocessing unit 110 may divide the text into utterance and narrative.

An utterance is a character's thought expressed as a sentence, and the author marks a sentence as an utterance by enclosing it in quotation marks (" ", ' '). The narrative is the set of sentences that carries the storyline of the novel, describing a series of events from a first-person or third-person point of view.

To this end, the preprocessing unit 110 may classify text inside quotation marks as utterance and the remaining text as narrative.
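A minimal sketch of this split is shown below, assuming that utterances are delimited by straight double quotes, curly double quotes, or curly single quotes and that quotations are not nested. Straight single quotes are left out because they are ambiguous with apostrophes. The function name `split_utterance_narrative` is hypothetical.

```python
import re

# One alternative per supported quotation style; exactly one group matches.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”|‘([^’]*)’')

def split_utterance_narrative(text):
    """Return (list of utterances, narrative text with the utterances removed)."""
    utterances = ["".join(groups) for groups in QUOTED.findall(text)]
    narrative = " ".join(QUOTED.sub(" ", text).split())
    return utterances, narrative

text = '해리가 말했다. "론에게 물어봐." 그리고 떠났다.'
utterances, narrative = split_utterance_narrative(text)
```

Keeping the two parts separate lets later steps search for subject candidates in the narrative only, while still scanning the utterances for particles for animate nouns.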

Next, the subject candidate extracting unit 120 extracts, as subject candidates, uninflected words combined with a subject-marking particle or a topic-marking particle in the narrative of the read text (S320).

Characters in a novel appear as subjects in the narrative, whereas in utterances they often do not appear as subjects or are replaced by pronouns. Characters can nevertheless appear in utterances together with a particle for animate nouns. For this reason, subject candidates are extracted only from the text classified as narrative, while the extraction of characters using particles for animate nouns in step S330 covers both the subject candidates extracted from the narrative and the entire utterance text. Because step S320 searches only the narrative, subject candidates can be extracted more quickly.

Accordingly, the character extracting unit 130 extracts as characters both the subject candidates that were extracted from the narrative and are used in combination with a particle for animate nouns, and the uninflected words that are used in combination with a particle for animate nouns in the utterances (S330).
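Steps S320 and S330 can be combined as in the following sketch, which takes an already separated narrative string and list of utterances as input. It is only one reading of the description: the particle list is an abbreviated subset of Table 1, matching is plain suffix comparison, and the names `stems_before` and `extract_characters_fig3` are hypothetical.

```python
import re

SUBJECT_TOPIC = ("이", "가", "은", "는")
# Abbreviated subset of Table 1, longest forms first (assumption for brevity).
ANIMATE = ("에게서", "한테서", "에게", "한테")

def stems_before(text, particles):
    """Stems of Hangul tokens that end in one of the given particles."""
    stems = set()
    for token in re.findall(r"[가-힣]+", text):
        for particle in particles:
            if token.endswith(particle) and len(token) > len(particle):
                stems.add(token[: -len(particle)])
                break
    return stems

def extract_characters_fig3(narrative, utterances):
    # S320: subject candidates come from the narrative only.
    candidates = stems_before(narrative, SUBJECT_TOPIC)
    # S330: candidates confirmed by an animate-noun particle in the narrative,
    # plus any stem carrying an animate-noun particle inside an utterance.
    confirmed = candidates & stems_before(narrative, ANIMATE)
    from_speech = stems_before(" ".join(utterances), ANIMATE)
    return confirmed | from_speech

narrative = "해리가 웃었다. 편지가 해리에게 왔다."
utterances = ["론한테 물어봐."]
characters = extract_characters_fig3(narrative, utterances)
```

Here "론" never appears as a subject in the narrative but is still picked up from the utterance, which is the case the second embodiment is designed to cover.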

In addition, the character determining unit 140 may determine the characters by excluding at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns from the extracted characters (S340).

FIGS. 4 to 7 are graphs for analyzing the results of extracting characters with the above-described embodiment.

To calculate precision, recall, and F-measure for the character extraction results of the above-described embodiment, the actual characters were extracted manually and compared with the characters extracted by the embodiment.

Precision, recall, and F-measure are defined as follows.

Precision is the ratio of actual characters (that is, extracted characters excluding errors) to all characters extracted according to the embodiment. Recall is the ratio of actual person names found among the extracted characters to all actual person names (the proper nouns among the characters). The F-measure is the harmonic mean of precision and recall.

Referring to FIG. 4, the experiment was conducted on 80 Korean-language novels, using novels of about 100,000 words. The x-axis of FIG. 4 represents the index of the novel: novels 1 to 11 are novels translated into Korean, and novels 12 to 80 are novels originally written in Korean.

FIG. 5 shows the precision and recall of extracting characters from the 80 Korean-language novels according to an embodiment. As in FIG. 4, indices 1 to 11 on the x-axis of FIG. 5 are novels translated into Korean, and indices 12 to 80 are novels originally written in Korean.

According to the experimental results of FIG. 5, a total of 1,809 characters were extracted from the 80 books, of which 1,773 were extracted correctly, giving an overall precision of 98.01%. In other words, 22.61 characters were extracted per volume, of which 22.16 were correct. Across the 80 books, 36 words that cannot be characters were extracted, most of which were personified nouns. For example, in sentences such as "to the good voice ..." or "to the light of the Mediterranean Sea ...", "voice" and "light" are personified and used with a particle for animate nouns, and were therefore extracted.

As for recall, 1,002 of the 1,431 person names extracted manually from the 80 volumes were found, giving an overall recall of 70.02%. Per volume, there are 17.89 person names on average, of which 12.53 can be found with the embodiment.

FIGS. 6 and 7 are graphs analyzing how much recall increases when the goal is limited to finding characters whose appearance rate is at least 1% or at least 0.5%, respectively.

The appearance rate of a character is the ratio of that character's frequency to the sum of the frequencies of all characters. For example, an appearance rate of 1% for character A means that A's frequency is 1% of the total frequency of all characters in the novel.
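The appearance rate defined above can be computed directly from mention counts, as in this small sketch. The input format (a flat list of character mentions) and the names `appearance_rates` and `frequent_characters` are assumptions made for illustration.

```python
from collections import Counter

def appearance_rates(mentions):
    """Map each character to its share of all character mentions."""
    counts = Counter(mentions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}

def frequent_characters(mentions, threshold):
    """Characters whose appearance rate is at least the threshold (e.g. 0.01)."""
    return {name for name, rate in appearance_rates(mentions).items() if rate >= threshold}

mentions = ["해리"] * 6 + ["론"] * 3 + ["헤드위그"] * 1
rates = appearance_rates(mentions)
main_cast = frequent_characters(mentions, 0.2)
```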

According to the results of FIG. 6, 1,115 of the 1,431 person names extracted manually from the 80 volumes were found, raising the overall recall to 77.92%. According to the results of FIG. 7, 1,199 of the 1,431 person names were found, raising the overall recall to 83.79%. In addition, in FIG. 7 the recall for each novel was higher than 70%, except for five books.

From the results of FIG. 6 and FIG. 7, the F-measure when finding all characters with an appearance rate of at least 1% and at least 0.5% in each novel was 86.82% and 90.34%, respectively.
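The reported figures can be cross-checked from the counts given in the text. The sketch below assumes the usual definitions (precision as correct over extracted, recall as found over actual, F-measure as their harmonic mean) and uses the overall precision of 1,773 out of 1,809 for both F-measure values, which is an assumption that happens to reproduce the reported percentages. The helper names are hypothetical.

```python
def precision(correct, extracted):
    """Share of extracted characters that are actual characters."""
    return correct / extracted

def recall(found, actual):
    """Share of actual person names that were found."""
    return found / actual

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

p = precision(1773, 1809)          # text reports 98.01%
r_all = recall(1002, 1431)         # text reports 70.02%
r_1pct = recall(1115, 1431)        # text reports 77.92%
r_half_pct = recall(1199, 1431)    # text reports 83.79%

f_1pct = f_measure(p, r_1pct)          # text reports 86.82%
f_half_pct = f_measure(p, r_half_pct)  # text reports 90.34%

for label, value in [("precision", p), ("recall", r_all),
                     ("F (>=1%)", f_1pct), ("F (>=0.5%)", f_half_pct)]:
    print(f"{label}: {value * 100:.2f}%")
```

As in the source, precision is computed over all extracted characters while recall is computed over person names only, so the two ratios do not share a numerator.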

These results show that the above-described embodiments can effectively extract characters from various novels without the training data needed to apply machine learning techniques.

In addition, the extracted character information may be used to identify relationships between characters, or to extract further characters that appear in other sentences whose pattern is similar to the sentences in which the extracted characters are used.

Meanwhile, the above-described embodiments of the present invention may be implemented through various means. For example, embodiments of the present invention may be implemented by hardware, firmware, software, or a combination thereof.

For implementation in hardware, a method according to embodiments of the present invention may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.

In the case of an implementation by firmware or software, the method according to the embodiments of the present invention may be implemented in the form of a module, a procedure, or a function that performs the functions or operations described above. The software code may be stored in a memory unit and driven by a processor. The memory unit may be located inside or outside the processor, and may exchange data with the processor by various known means.

As such, those skilled in the art will appreciate that the present invention can be implemented in other specific forms without changing its technical spirit or essential features. Therefore, the above-described embodiments are to be understood as illustrative in all respects and not as restrictive. The scope of the present invention is defined by the following claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the present invention.

Claims

1. A method for extracting a character, performed by a character extraction apparatus in an information processing apparatus having a processor, the method comprising:

a preprocessing step of reading text from an electronic document;

a step of extracting, as a subject candidate, an uninflected word combined with a subject-marking particle or a topic-marking particle in the text; and

a step of extracting a character from the subject candidates on the basis of a particle for animate nouns.

2. The method of claim 1,

wherein the step of extracting the character

extracts, as the character, a subject candidate used in combination with the particle for animate nouns.

3. The method of claim 1,

wherein the preprocessing step

classifies text inside quotation marks as utterance and the remaining text as narrative.

4. The method of claim 3,

wherein the step of extracting the subject candidate

extracts, as the subject candidate, an uninflected word combined with the subject-marking particle or the topic-marking particle in the narrative, and

the step of extracting the character

extracts, as characters, a subject candidate that was extracted from the narrative and is used in combination with the particle for animate nouns, and an uninflected word used in combination with the particle for animate nouns in the utterance.

5. The method of claim 1,

wherein the subject-marking particle and the topic-marking particle

include at least one of '-이', '-가', '-은', and '-는'.

6. The method of claim 1,

wherein the particle for animate nouns

includes at least one of '-에게', '-한테', '-께', and compound particles in which '-에게', '-한테', or '-께' is combined with another particle.

7. The method of claim 1, further comprising:

a step of determining the characters by excluding, from the extracted characters, at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns.

8. A character extraction apparatus having at least one processor, the apparatus comprising:

a preprocessing unit that reads text from an electronic document;

a subject candidate extracting unit that extracts, as a subject candidate, an uninflected word combined with a subject-marking particle or a topic-marking particle in the text; and

a character extracting unit that extracts a character from the subject candidates on the basis of a particle for animate nouns.

9. The apparatus of claim 8,

wherein the character extracting unit

extracts, as the character, a subject candidate used in combination with the particle for animate nouns.

10. The apparatus of claim 8,

wherein the preprocessing unit

classifies text inside quotation marks as utterance and the remaining text as narrative.

11. The apparatus of claim 10,

wherein the subject candidate extracting unit

extracts, as the subject candidate, an uninflected word combined with the subject-marking particle or the topic-marking particle in the narrative, and

the character extracting unit

extracts, as characters, a subject candidate that was extracted from the narrative and is used in combination with the particle for animate nouns, and an uninflected word used in combination with the particle for animate nouns in the utterance.

12. The apparatus of claim 8,

wherein the subject-marking particle and the topic-marking particle

include at least one of '-이', '-가', '-은', and '-는'.

13. The apparatus of claim 8,

wherein the particle for animate nouns

includes at least one of '-에게', '-한테', '-께', and compound particles in which '-에게', '-한테', or '-께' is combined with another particle.

14. The apparatus of claim 8, further comprising:

a character determining unit that determines the characters by excluding, from the extracted characters, at least one of pronouns, unspecified nouns, collective nouns, plural forms, numerals, and dependent nouns.

15. A program stored in a computer-readable recording medium for performing a character extraction method comprising:

a preprocessing step of reading text from an electronic document;

a step of extracting, as a subject candidate, an uninflected word combined with a subject-marking particle or a topic-marking particle in the text; and

a step of extracting a character from the subject candidates on the basis of a particle for animate nouns.

16. A computer-readable recording medium having recorded thereon a program including instructions that, when executed on at least one processor, cause the processor to perform operations comprising:

reading text from an electronic document;

extracting, as a subject candidate, an uninflected word combined with a subject-marking particle or a topic-marking particle in the text; and

extracting a character from the subject candidates on the basis of a particle for animate nouns.