WO2015166606A1 - Natural language processing system, natural language processing method, and natural language processing program - Google Patents


Info

Publication number
WO2015166606A1
WO2015166606A1 (PCT/JP2014/082428)
Authority
WO
WIPO (PCT)
Prior art keywords
score
tag
sentence
feature
natural language
Prior art date
Application number
PCT/JP2014/082428
Other languages
French (fr)
Japanese (ja)
Inventor
萩原 正人 (Masato Hagiwara)
Original Assignee
楽天株式会社 (Rakuten, Inc.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 楽天株式会社
Priority to KR1020167028427A priority Critical patent/KR101729461B1/en
Priority to CN201480076197.5A priority patent/CN106030568B/en
Priority to JP2015512822A priority patent/JP5809381B1/en
Publication of WO2015166606A1 publication Critical patent/WO2015166606A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/268 Morphological analysis
    • G06F 40/205 Parsing
    • G06F 40/226 Validation

Definitions

  • One aspect of the present invention relates to a natural language processing system, a natural language processing method, and a natural language processing program.
  • Patent Document 1 discloses a morphological analyzer that decomposes input text data into morphemes, obtains position information corresponding to the decomposed morphemes by referring to a morpheme dictionary, and determines a morpheme string from the candidate morpheme strings produced by the decomposition, using a cost function based on the position information.
  • Morphological analysis is executed using a division model that contains a score for each feature. Since the division model, which amounts to the knowledge used for morphological analysis, is generally fixed in advance, it is naturally very difficult to obtain correct results when analyzing a sentence from a new field, or with new properties, that the division model does not cover. On the other hand, if the division model is corrected by a method such as machine learning, the time required for the correction may grow unpredictably. It is therefore desirable to correct the division model for morphological analysis automatically within a fixed time.
  • According to one aspect, a natural language processing system includes an analysis unit that divides a sentence by executing morphological analysis on it using a division model obtained by machine learning from one or more items of training data, and that sets at least a tag indicating the part of speech of each word obtained by the division. The division model includes output feature scores, each indicating the correspondence between a divided element and a tag, and transition feature scores, each indicating a combination of two tags corresponding to two consecutive divided elements. The system further includes a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct answer data indicating the correct tags of the sentence, sets the scores of the output features and transition features related to the correct tags corresponding to incorrect tags higher than their current values, sets the scores of the output features and transition features related to the incorrect tags lower than their current values, and thereby corrects the division model used by the analysis unit for the morphological analysis of the next sentence.
  • According to another aspect, a natural language processing method is executed by a natural language processing system including a processor. The method includes an analysis step that divides a sentence by executing morphological analysis on it using a division model obtained by machine learning from one or more items of training data, where the division model includes output feature scores indicating the correspondence between a divided element and a tag and transition feature scores indicating a combination of two tags corresponding to two consecutive divided elements.
  • According to another aspect, a natural language processing program causes a computer to function as an analysis unit that divides a sentence by executing morphological analysis on it using a division model obtained by machine learning from one or more items of training data and that sets at least a tag indicating the part of speech of each word obtained by the division, where the division model includes output feature scores indicating the correspondence between a divided element and a tag and transition feature scores indicating a combination of two tags corresponding to two consecutive divided elements; and as a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct answer data indicating the correct tags of the sentence, sets the scores of the output features and transition features related to the correct tags corresponding to incorrect tags higher than their current values, sets the scores of the output features and transition features related to the incorrect tags lower than their current values, and thereby corrects the division model used by the analysis unit for the morphological analysis of the next sentence.
  • In these aspects, the analysis result is compared with the correct answer data, and the division model is corrected based on the difference between them. As a result, the division model for morphological analysis can be corrected automatically within a certain time (in other words, within a predictable time range).
  • the natural language processing system 10 is a computer system that performs morphological analysis.
  • the morpheme analysis is a process of dividing a sentence into morpheme strings and determining the part of speech of each morpheme.
  • a sentence is a unit of linguistic expression that represents one complete statement, and is expressed by a character string.
  • a morpheme is the smallest language unit that has meaning.
  • a morpheme sequence is a sequence of one or more morphemes obtained by dividing a sentence into one or more morphemes.
  • Part of speech is the division of words by grammatical function or form.
  • the natural language processing system 10 performs morphological analysis on individual sentences using the division model 20.
  • One of the features of the natural language processing system 10 is that, while the division model 20 is being learned, the division model 20 is corrected each time morphological analysis is performed on a sentence.
  • Once the division model 20 is finalized, the natural language processing system 10 including the confirmed division model 20 is provided to the user. The user can then have the natural language processing system 10 execute morphological analysis; at this stage, the morphological analysis is executed without correcting the division model 20.
  • The “division model” in this specification is the set of criteria (cues) used when dividing a sentence into one or more morphemes, and is represented by a score for each feature. The division model is obtained by machine learning using one or more items of training data.
  • the training data is data indicating at least a sentence divided into words and a part of speech of each word obtained by dividing the sentence.
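As a concrete (hypothetical) illustration, one training example of this kind can be represented as a list of (word, part-of-speech) pairs. The patent does not prescribe any particular data format; the structure and single-letter POS codes below are assumptions for illustration only:

```python
# One hypothetical training example: the sentence "本を買って" divided into
# words, each paired with a part-of-speech code (N = noun, P = particle,
# V = verb). The data format itself is an assumption for illustration.
training_example = [
    ("本", "N"),    # "book"
    ("を", "P"),    # case particle
    ("買っ", "V"),  # "buy"
    ("て", "P"),    # conjunctive particle
]

# The segmented words concatenate back into the original sentence.
sentence = "".join(word for word, _ in training_example)
```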
  • a feature is a clue for obtaining a correct result in morphological analysis. In general, what is used as a feature (cue) is not limited.
  • the feature score is a numerical value representing the likelihood of the feature.
  • FIG. 1 briefly shows the concept of processing in the natural language processing system 10 according to the present embodiment.
  • a gear M in FIG. 1 indicates execution of morphological analysis.
  • the natural language processing system 10 divides the sentence s 1 into one or more morphemes by executing a morphological analysis using the division model w 1 .
  • the natural language processing system 10 divides a sentence into one or more morphemes by dividing the sentence into individual characters and executing character unit processing.
  • the divided element to be processed is a character.
  • the natural language processing system 10 indicates the result of morphological analysis by setting a tag for each character (divided element).
  • a “tag” in this specification is a label indicating an attribute or function of a character. Tags will be described in more detail later.
  • When the morphological analysis of the sentence s1 has been executed, the natural language processing system 10 accepts data indicating the correct answer of that analysis (correct answer data), compares the analysis result with the correct answer data, and corrects the division model w1 to obtain a new division model w2. Specifically, the natural language processing system 10 evaluates the entire analysis result as wrong when at least part of the tagging in the morphological analysis of the sentence s1 is wrong. It then evaluates the feature corresponding to each tag in the correct answer data as “correct (+1)” and sets that feature's score higher than its current value, and evaluates the feature corresponding to each tag in the analysis result as “wrong (-1)” and sets that feature's score lower than its current value.
  • Alternatively, the natural language processing system 10 may evaluate each tag in the correct answer data as “correct (+1)” and the tag of each character in the analysis result as “wrong (-1)”, cancel the two evaluations against each other for each character, and then raise the scores of the features corresponding to the tags still evaluated as “correct (+1)” and lower the scores of the features corresponding to the tags still evaluated as “wrong (-1)”.
  • Suppose, for example, that the tags in the correct answer data are ta, tb, tc, td, te and the tags in the analysis result are ta, tg, th, td, te. The natural language processing system 10 evaluates the features corresponding to the correct tags ta, tb, tc, td, te as “correct (+1)” and sets their scores higher than the current values, and evaluates the features corresponding to the result tags ta, tg, th, td, te as “wrong (-1)” and sets their scores lower than the current values.
  • When the two evaluations are offset, the scores of the features corresponding to the tags ta, td, and te are unchanged from before the update, the scores of the features corresponding to the correct tags tb and tc are raised, and the scores of the features corresponding to the incorrect tags tg and th are lowered.
  • When executing morphological analysis on the next sentence s2, the natural language processing system 10 uses the division model w2. It then accepts correct answer data for the morphological analysis of the sentence s2, compares the analysis result with that correct answer data, and corrects the division model w2 in the same way that it corrected w1, obtaining a new division model w3.
  • In this way, the natural language processing system 10 corrects the division model each time one sentence (s1, s2, ..., st) is processed (w1 → w2, w2 → w3, ..., wt → wt+1) and uses the corrected division model in the morphological analysis of the next sentence.
  • Such a method of updating the model every time one piece of training data is processed is also referred to as “online learning” or “online machine learning”.
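The per-sentence correction cycle can be sketched as follows. This is a minimal illustration under simplifying assumptions: a greedy per-character tagger stands in for the real Viterbi-based analysis, the model is a plain dictionary of (tag, character) scores, and the toy tag set is hypothetical; none of this code appears in the patent.

```python
def analyze(chars, model, tags=("N", "P", "V")):
    """Greedy per-character tagger: pick the tag whose output-feature
    score for this character is highest (a stand-in for the real
    Viterbi-based analysis unit)."""
    return [max(tags, key=lambda t: model.get((t, x), 0.0)) for x in chars]

def online_learning(examples, model):
    """Process one sentence at a time: analyze with the current model,
    compare against the gold tags, and nudge the scores of mistagged
    positions (w_t -> w_{t+1}) before moving on to the next sentence."""
    for chars, gold in examples:
        pred = analyze(chars, model)
        for x, p, g in zip(chars, pred, gold):
            if p != g:
                model[(g, x)] = model.get((g, x), 0.0) + 1.0  # raise correct tag's feature
                model[(p, x)] = model.get((p, x), 0.0) - 1.0  # lower wrong tag's feature
    return model
```

After one pass over a sentence whose tagging was wrong, the corrected model already prefers the gold tag for that character.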
  • For example, the natural language processing system 10 divides the Japanese sentence “本を買って (hon wo katte)”, which corresponds to the English sentence “I buy a book”, into five characters: x1 “本 (hon)”, x2 “を (wo)”, x3 “買 (ka)”, x4 “っ (t)”, and x5 “て (te)”. The natural language processing system 10 then sets a tag for each character by executing morphological analysis.
  • The tag is a combination of the character's appearance mode within a word, the part of speech of that word, and the subclass of the part of speech, and is expressed with alphabetic codes such as “S-N-nc”.
  • The appearance mode indicates whether a character forms one word by itself or in combination with other characters and, when the character is part of a word consisting of two or more characters, where within that word the character is located.
  • the appearance mode is indicated by one of S, B, I, and E.
  • the appearance mode “S” indicates that the character becomes a single word by itself.
  • the appearance mode “B” indicates that the character is positioned at the beginning of a word composed of two or more characters.
  • the appearance mode “I” indicates that the character is located in the middle of a word composed of three or more characters.
  • the appearance mode “E” indicates that the character is located at the end of a word composed of two or more characters.
  • the example of FIG. 2 indicates that the characters x 1 , x 2 , and x 5 are each a single word, and the characters x 3 and x 4 form one word.
  • The scheme for the appearance mode is not limited.
  • In this embodiment the scheme “SBIEO” is used, but, for example, the scheme “IOB2”, which is well known to those skilled in the art, may be used instead.
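Deriving the S/B/I/E appearance-mode tags from a word-segmented sentence can be sketched as follows (a small helper written for illustration, not code from the patent):

```python
def position_tags(words):
    """Assign the S/B/I/E appearance-mode tag to every character:
    S = single-character word; B/I/E = beginning/inside/end of a word
    that consists of two or more characters."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags
```

For the example sentence, the two-character word 買っ yields B and E, matching the tagging shown in FIG. 2.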
  • Examples of parts of speech include nouns, verbs, particles, adjectives, adjective verbs, conjunctions, and the like.
  • the noun is represented by “N”
  • the particle is represented by “P”
  • the verb is represented by “V”.
  • FIG. 2 indicates that the character x 1 is a noun, the character x 2 is a particle, the word consisting of the characters x 3 and x 4 is a verb, and the character x 5 is a particle.
  • the part-of-speech subclass indicates the subordinate concept of the corresponding part-of-speech.
  • nouns can be further classified into general nouns and proper nouns, and particles can be further classified into case particles, conjunctive particles, auxiliary particles, and the like.
  • the general noun is represented by “nc”
  • the proper noun is represented by “np”
  • the case particle is represented by “k”
  • the conjunctive particle is represented by “sj”
  • the general verb is represented by “c”.
  • FIG. 2 shows that the character x1 is a general noun, the character x2 is a case particle, the word consisting of the characters x3 and x4 is a general verb, and the character x5 is a conjunctive particle.
  • the scores of the features stored in the division model 20 are the scores of output features and the scores of transition features.
  • the output feature is a clue indicating the correspondence between a tag and a character or character type.
  • the output feature is a clue indicating what kind of character or character type is likely to correspond to what kind of tag.
  • the output feature corresponds to the feature representation of the output matrix of the hidden Markov model.
  • an output feature of a unigram (a character string made up of only one character) and an output feature of a bigram (a character string made up of two consecutive characters) are used.
  • the character type is a character type in a certain language.
  • Japanese character types include kanji, hiragana, katakana, alphabet (uppercase and lowercase), Arabic numerals, kanji numerals, and the middle dot (・).
  • the character type is represented by alphabets. For example, “C” indicates kanji, “H” indicates hiragana, “K” indicates katakana, “L” indicates alphabets, and “A” indicates Arabic numerals.
  • FIG. 2 indicates that the characters x 1 and x 3 are kanji characters and the characters x 2 , x 4 , and x 5 are hiragana characters.
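A simplified character-type classifier for the types listed above might look like this. The Unicode ranges are an implementation assumption (the embodiment instead mentions a comparison table or regular expressions, described later), and this sketch folds kanji numerals into “C” and everything else, including the middle dot, into a catch-all “O”:

```python
def char_type(ch):
    """Classify a character into the types used above:
    C = kanji, H = hiragana, K = katakana, L = alphabet, A = Arabic numeral.
    Kanji numerals fall under 'C' here; anything unclassified
    (punctuation, the middle dot, etc.) is returned as 'O'."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "C"   # CJK unified ideographs (kanji)
    if 0x3041 <= code <= 0x3096:
        return "H"   # hiragana
    if 0x30A1 <= code <= 0x30FA:
        return "K"   # katakana
    if ch.isascii() and ch.isalpha():
        return "L"   # Latin alphabet
    if ch.isascii() and ch.isdigit():
        return "A"   # Arabic numeral
    return "O"       # other
```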
  • the unigram output feature related to the character is a clue indicating the correspondence between the tag t and the character x. Further, the output feature of the unigram regarding the character type is a clue indicating the correspondence between the tag t and the character type c.
  • the likelihood score s of the correspondence between the tag t and the character x is denoted by ⟨t/x, s⟩.
  • a likelihood score s of correspondence between the tag t and the character type c is denoted by ⁇ t / c, s ⁇ .
  • the division model 20 includes scores regarding a plurality of tags for one character or character type.
  • When data on all types of tags are prepared for each character or character type, the division model 20 also includes scores for combinations of a tag and a character or character type that cannot actually occur in the grammar. However, the scores of such grammatically impossible features are relatively low.
  • Data indicating features that do not exist in the grammar can also be prepared. For example, it is impossible in Japanese grammar for a word written with Arabic numerals to be a particle, but data can still be prepared for a feature such as “S-P-k/A”.
  • the bigram output feature related to characters is a clue indicating the correspondence between the tag t and the character string xi xi+1.
  • the bigram output feature related to character types is a clue indicating the correspondence between the tag t and the character type string ci ci+1.
  • the likelihood score s of the tag t and the characters xi and xi+1 is denoted by ⟨t/xi/xi+1, s⟩.
  • the likelihood score s of the tag t and the character types ci and ci+1 is denoted by ⟨t/ci/ci+1, s⟩.
  • A transition feature is a cue indicating the combination of the tag ti of a character xi and the tag ti+1 of the next character xi+1 (a combination consisting of the two tags corresponding to two consecutive characters).
  • This transition feature is a bigram feature.
  • the transition feature corresponds to the feature representation of the transition matrix of the hidden Markov model.
  • the likelihood score s of the combination of the tag t i and the tag t i + 1 is represented by ⁇ t i / t i + 1 , s ⁇ .
  • the division model 20 also stores data on combinations of two tags that cannot actually occur in the grammar.
  • Examples of transition feature scores:
    ⟨S-N-nc/S-P-k, 0.0512⟩
    ⟨E-N-nc/E-N-nc, 0.0000⟩
    ⟨S-P-k/B-V-c, 0.0425⟩
    ⟨B-V-c/I-V-c, 0.0387⟩
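One natural (hypothetical) way to hold such ⟨feature, score⟩ pairs is a dictionary keyed by feature strings; the four entries below mirror the transition-feature examples above, and the lookup helper is an illustration rather than the patent's actual interface:

```python
# A toy division model: feature string -> score. The four entries mirror
# the transition-feature examples listed above; output features (such as
# "S-N-nc/本") would live in the same table.
division_model = {
    "S-N-nc/S-P-k": 0.0512,   # noun word followed by a case particle
    "E-N-nc/E-N-nc": 0.0000,  # grammatically impossible pair: score stays ~0
    "S-P-k/B-V-c": 0.0425,    # case particle followed by the start of a verb
    "B-V-c/I-V-c": 0.0387,    # beginning then middle of a multi-character verb
}

def transition_score(model, tag1, tag2):
    """Look up the transition-feature score for two consecutive tags;
    combinations not present in the model default to 0.0."""
    return model.get(tag1 + "/" + tag2, 0.0)
```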
  • the natural language processing system 10 includes one or more computers. When the natural language processing system 10 includes a plurality of computers, each functional element of the natural language processing system 10 described later is realized by distributed processing.
  • the type of individual computer is not limited. For example, a stationary or portable personal computer (PC) may be used, a workstation may be used, or a portable terminal such as a high-functional portable telephone (smart phone), a portable telephone, or a personal digital assistant (PDA). May be used.
  • the natural language processing system 10 may be constructed by combining various types of computers. When a plurality of computers are used, these computers are connected via a communication network such as the Internet or an intranet.
  • FIG. 3 shows a general hardware configuration of each computer 100 in the natural language processing system 10.
  • Each computer 100 includes a CPU (processor) 101 that executes an operating system, application programs, and the like; a main storage unit 102 consisting of ROM and RAM; an auxiliary storage unit 103 consisting of a hard disk, flash memory, or the like; a communication control unit 104 consisting of a network card or a wireless communication module; an input device 105 such as a keyboard and mouse; and an output device 106 such as a display or printer.
  • the hardware modules to be mounted differ depending on the type of the computer 100.
  • a stationary PC and a workstation often include a keyboard, a mouse, and a monitor as an input device and an output device, but in a smartphone, a touch panel often functions as an input device and an output device.
  • Each functional element of the natural language processing system 10 described later is realized by loading predetermined software onto the CPU 101 or the main storage unit 102 and, under the control of the CPU 101, operating the communication control unit 104, the input device 105, the output device 106, and the like, and reading and writing data in the main storage unit 102 or the auxiliary storage unit 103. The data and databases necessary for processing are stored in the main storage unit 102 or the auxiliary storage unit 103.
  • the division model 20 is stored in the storage device in advance.
  • the specific mounting method of the division model 20 is not limited.
  • the division model 20 may be prepared as a relational database or a text file.
  • the installation location of the division model 20 is not limited.
  • the division model 20 may exist in the natural language processing system 10 or in another computer system different from the natural language processing system 10.
  • the natural language processing system 10 accesses the division model 20 via a communication network.
  • the division model 20 is a set of scores of various features.
  • Each score is updated little by little by the processing of the natural language processing system 10 described later; after a certain number of sentences have been processed, differences arise between the individual feature scores, as described above.
  • the natural language processing system 10 includes an acquisition unit 11, an analysis unit 12, and a correction unit 13 as functional components.
  • the natural language processing system 10 accesses the division model 20 as necessary.
  • Each functional element will be described below. In the present embodiment, the description will be made on the assumption that the natural language processing system 10 processes a Japanese sentence.
  • the language of the sentence processed by the natural language processing system 10 is not limited to Japanese, and sentences in other languages such as Chinese can be analyzed.
  • the acquisition unit 11 is a functional element that acquires a sentence to be divided into morpheme strings.
  • the acquisition method of the sentence by the acquisition part 11 is not limited.
  • the acquisition unit 11 may collect sentences from any website on the Internet (so-called crawling).
  • the acquisition unit 11 may read a sentence stored in advance in a database in the natural language processing system 10, or read a sentence stored in a database on a computer system other than the natural language processing system 10 via a communication network. It may be accessed and read by.
  • the acquisition unit 11 may accept a sentence input by a user of the natural language processing system 10.
  • the acquisition unit 11 acquires one sentence and outputs it to the analysis unit 12.
  • the acquisition unit 11 acquires the next sentence and outputs it to the analysis unit 12.
  • the analysis unit 12 is a functional element that performs morphological analysis on individual sentences.
  • the analysis unit 12 executes the following process every time one sentence is input.
  • the analysis unit 12 divides one sentence into individual characters and determines the character type of each character.
  • the analysis unit 12 stores in advance a comparison table between characters and character types, or a regular expression for determining a character type, and determines a character type using the comparison table or regular expression.
  • The analysis unit 12 determines a tag for each character using the Viterbi algorithm. For the i-th character, the analysis unit 12 determines, for each tag (candidate tag) that may finally be selected, which of the candidate tags of the (i-1)-th character gives the highest score (also referred to as the “connection score”).
  • The connection score is the total of the various scores related to the tag being evaluated (the unigram output feature score, the bigram output feature score, and the transition feature score). For example, when the i-th tag is “S-N-nc”, the analysis unit 12 may find that the connection score is highest when the (i-1)-th tag is “S-P-k”.
  • The analysis unit 12 stores every such highest-scoring combination (for example, (S-P-k, S-N-nc), (E-N-nc, S-V-c), and so on).
  • The analysis unit 12 executes this processing while advancing character by character from the first character to the end-of-sentence symbol.
  • When the end of the sentence is reached, the combination of the last character's tag and the end-of-sentence symbol with the highest connection score is determined uniquely (for example, the combination (E-V-c, EOS)).
  • This fixes the tag of the last character (for example, to “E-V-c”), which in turn fixes the tag of the second character from the end.
  • In this way, the tags are fixed one after another, in order from the end of the sentence back to the beginning.
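The search just described can be sketched roughly as follows, with the connection score reduced to an output score plus a transition score. The bigram output features of the embodiment are omitted for brevity, and `out_score`/`trans_score` are placeholder callbacks, not interfaces from the patent:

```python
def viterbi(chars, tags, out_score, trans_score):
    """Viterbi search as described above: for each character and each
    candidate tag, remember the best-scoring predecessor tag; then
    recover the tag sequence by walking the back-pointers from the
    end of the sentence back to the beginning."""
    # best[i][t] = best total score of any tagging ending at character i with tag t
    best = [{t: out_score(chars[0], t) for t in tags}]
    back = [{}]
    for i in range(1, len(chars)):
        best.append({})
        back.append({})
        for t in tags:
            # connection score = previous best + transition score
            prev, s = max(
                ((p, best[i - 1][p] + trans_score(p, t)) for p in tags),
                key=lambda ps: ps[1],
            )
            best[i][t] = s + out_score(chars[i], t)
            back[i][t] = prev  # remember the highest-scoring predecessor
    # backtrack: fix tags in order from the end of the sentence
    t = max(tags, key=lambda t: best[-1][t])
    path = [t]
    for i in range(len(chars) - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    return path[::-1]
```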
  • FIG. 5 schematically shows such processing by the analysis unit 12.
  • FIG. 5 shows an example of tagging a sentence consisting of four characters.
  • the tags are simplified as “A1”, “B2”, etc., and the number of candidate tags for each character is three.
  • a thick line in FIG. 5 indicates a combination of a tag determined to have the highest connection score obtained by processing a sentence from the front.
  • the tag C1 has the highest connection score with the tag B1
  • the tag C2 has the highest connection score with the tag B1
  • the tag C3 has the highest connection score with the tag B2.
  • the analysis unit 12 determines that the tags of the first to fourth characters are A2, B1, C2, and D1, respectively.
  • the analysis unit 12 outputs a sentence with each character tagged as an analysis result.
  • the analysis unit 12 outputs the analysis result to at least the correction unit 13 because the analysis result is necessary for correcting the divided model 20.
  • the analysis unit 12 may perform further output.
  • The analysis unit 12 may display the analysis result on a monitor, print it on a printer, write it to a text file, or store it in a storage device such as a memory or a database.
  • the analysis unit 12 may transmit the analysis result to an arbitrary computer system other than the natural language processing system 10 via a communication network.
  • the correction unit 13 is a functional element that corrects the divided model 20 based on the difference between the analysis result obtained from the analysis unit 12 and the correct answer of the morphological analysis of the sentence.
  • “modification of the division model” is processing for changing the score of at least one feature in the division model. In some cases, even if an attempt is made to change a certain score, the value may not change as a result.
  • the correction unit 13 executes the following processing every time one analysis result is input.
  • the correction unit 13 acquires correct answer data corresponding to the input analysis result, that is, data indicating the correct answer of the morphological analysis of the sentence processed by the analysis unit 12.
  • the correct answer data in the present embodiment is data indicating tags (combination of appearance mode, part of speech, and part of speech subclass) of each character forming a sentence. This correct answer data is created manually.
  • the method for acquiring correct data by the correction unit 13 is not limited.
  • The correction unit 13 may read correct answer data stored in advance in a database in the natural language processing system 10, or may access and read, via a communication network, correct answer data stored in a database on a computer system other than the natural language processing system 10.
  • the correction unit 13 may accept correct answer data input by the user of the natural language processing system 10.
  • the correction unit 13 compares the input analysis result with the correct data and identifies the difference between them.
  • When the analysis result completely matches the correct answer data, the correction unit 13 ends the process without correcting the division model 20, generates a completion notification, and outputs it to the acquisition unit 11.
  • This completion notification is a signal indicating that the processing in the correction unit 13 has been completed and the morphological analysis for the next sentence can be executed.
  • When the analysis result completely matches the correct answer data, there is no need to correct the division model 20, at least at this point, so the natural language processing system 10 (more specifically, the analysis unit 12) uses the current division model 20 as-is to analyze the next sentence.
  • Suppose that the correct answer data for the sentence “本を買って”, whose characters are expressed as x1 to x5, is as follows.
  • x1: ⟨S-N-nc⟩, x2: ⟨S-P-k⟩, x3: ⟨B-V-c⟩, x4: ⟨E-V-c⟩, x5: ⟨S-P-sj⟩
  • When the analysis result completely matches this correct answer data, the correction unit 13 determines that they match and outputs a completion notification to the acquisition unit 11 without correcting the division model 20.
  • When the analysis result differs from the correct answer data, the correction unit 13 updates at least some of the scores of the division model 20. More specifically, the correction unit 13 sets the feature scores related to the correct tags corresponding to the incorrect tags higher than their current values, and sets the feature scores related to the incorrect tags lower than their current values.
  • Suppose, for example, that the analysis unit 12 obtains the following analysis result for the Japanese sentence “本を買って (buy a book)”.
  • x1: ⟨S-N-nc⟩, x2: ⟨S-P-k⟩, x3: ⟨B-V-c⟩, x4: ⟨I-V-c⟩, x5: ⟨E-V-c⟩
  • In this case, the correction unit 13 evaluates the feature corresponding to each tag in the correct answer data as “correct (+1)” and sets its score higher than the current value, then evaluates the feature corresponding to each tag in the analysis result as “wrong (-1)” and sets its score lower than the current value. Taking into account the parts that cancel out, the correction unit 13 can be said to finally perform the following processing.
  • The correction unit 13 makes the scores for the output features “E-V-c/っ(t)” and “S-P-sj/て(te)”, which relate to the correct tags of the mistagged characters x4 and x5, larger than their current values, and makes the scores for the output features “I-V-c/っ(t)” and “E-V-c/て(te)”, which relate to the incorrect tags, smaller than their current values. This updates the unigram output feature scores (scores for characters) related to the analyzed sentence.
  • Similarly, the correction unit 13 makes the scores for the output features “E-V-c/H” and “S-P-sj/H”, which relate to the correct tags of the mistagged characters x4 and x5, larger than their current values, and makes the scores for the output features “I-V-c/H” and “E-V-c/H”, which relate to the incorrect tags, smaller than their current values. This updates the unigram output feature scores (scores for character types) related to the analyzed sentence.
  • the correction unit 13 makes the score for the output feature "EVc/t(t)/te(te)" related to the correct tags of the incorrectly analyzed characters x4 and x5 higher than its current value,
  • and makes the score for the output feature "IVc/t(t)/te(te)" related to the incorrect tags lower than its current value. As a result, the bigram output feature scores (the scores relating to the characters) associated with the analyzed sentence are updated.
  • the correction unit 13 makes the score for the output feature "EVc/H/H" related to the correct tags of the incorrectly analyzed characters x4 and x5 higher than its current value,
  • and makes the score for the output feature "IVc/H/H" related to the incorrect tag lower than its current value. As a result, the bigram output feature scores (the scores relating to the character types) associated with the analyzed sentence are updated.
  • the correction unit 13 makes the scores for the transition features "BVc/EVc" and "EVc/SP-sj" related to the correct tags of the incorrectly analyzed characters x4 and x5 higher than their current values,
  • and makes the scores for the transition features "BVc/IVc" and "IVc/EVc" related to the incorrect tags lower than their current values. As a result, the transition feature scores associated with the analyzed sentence are updated.
  • Alternatively, the correction unit 13 may evaluate each tag in the correct data as "correct (+1)" while evaluating the tag for each character in the analysis result as "wrong (−1)", cancel the two evaluation results for each tag, and then raise the scores of the features corresponding to tags evaluated as "correct (+1)" and lower the scores of the features corresponding to tags evaluated as "wrong (−1)".
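The offsetting update described above can be sketched in a structured-perceptron style. This is a minimal illustration, not the patent's implementation; the feature encoding, the placeholder characters, and the fixed step size of 1 are assumptions:

```python
from collections import defaultdict

def extract_features(chars, tags):
    """Enumerate unigram output features (character/tag pairs) and
    transition features (consecutive tag pairs) for a tagged sentence."""
    feats = []
    for ch, tag in zip(chars, tags):
        feats.append(("out", ch, tag))      # output feature
    for t1, t2 in zip(tags, tags[1:]):
        feats.append(("trans", t1, t2))     # transition feature
    return feats

def update(scores, chars, predicted, correct, step=1.0):
    """Raise scores of features seen with the correct tags (+1) and
    lower those seen with the predicted, wrong tags (-1).
    Features shared by both tag sequences cancel out."""
    if predicted == correct:
        return scores
    delta = defaultdict(float)
    for f in extract_features(chars, correct):
        delta[f] += step
    for f in extract_features(chars, predicted):
        delta[f] -= step
    for f, d in delta.items():
        if d != 0.0:
            scores[f] = scores.get(f, 0.0) + d
    return scores

scores = {}
chars = ["x1", "x2", "x3", "x4", "x5"]
correct   = ["SN-nc", "SP-k", "BVc", "EVc", "SP-sj"]
predicted = ["SN-nc", "SP-k", "BVc", "IVc", "EVc"]
update(scores, chars, predicted, correct)
# Features for x1..x3 appear in both sequences and cancel out;
# only the scores around x4 and x5 change.
```

The cancellation in `delta` mirrors the text: features related to correct tags that were also predicted correctly end up with a net change of zero.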
  • the correction unit 13 may use SCW (Soft Confidence-Weighted Learning) when updating the scores.
  • SCW is a method that regards a parameter with a large variance as not yet confident (inaccurate) and updates it greatly, while regarding a parameter with a small variance as already somewhat accurate and updating it only slightly.
  • the correction unit 13 determines the amount of change in each score based on the variance of that score, each score having a range of values.
  • a Gaussian distribution is introduced into the division model 20 (vector w), and the correction unit 13 simultaneously updates the average and covariance matrix of each score in addition to updating each score.
  • the average initial value of each score is zero.
  • FIG. 6A shows a mode in which a score with a large variance is changed greatly (that is, the amount of change in the score is large), and FIG. 6B shows a mode in which a score with a small variance is changed only slightly (that is, the amount of change in the score is small).
  • FIGS. 6A and 6B also show that the covariance matrix Σ is updated when the score is updated from Sa to Sb.
  • Since the accuracy of the score calculation can be maintained without considering the correlation between one feature and other features, in this embodiment only the diagonal elements of the covariance matrix are calculated, without calculating the off-diagonal elements. This raises the speed of the score updates.
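The variance-scaled update with a diagonal-only covariance can be illustrated with a much-simplified sketch. This is not the actual SCW update rule; the initial variance, the shrink factor, and the feature name are illustrative assumptions:

```python
class VarianceScaledScores:
    """Each feature keeps a (mean, variance) pair. A feature with a
    large variance (low confidence) receives a large update; the
    variance shrinks after each update, so the score stabilizes over
    time. Off-diagonal covariance terms are never stored, mirroring
    the diagonal-only choice described above."""

    def __init__(self, init_var=1.0, shrink=0.9):
        self.mean = {}            # current score of each feature
        self.var = {}             # large variance = not yet confident
        self.init_var = init_var
        self.shrink = shrink

    def update(self, feature, direction):
        """direction: +1 for correct-tag features, -1 for wrong-tag."""
        v = self.var.get(feature, self.init_var)
        self.mean[feature] = self.mean.get(feature, 0.0) + direction * v
        self.var[feature] = v * self.shrink   # more confident now

m = VarianceScaledScores()
m.update("EVc/ta", +1)   # first update: large step (variance 1.0)
m.update("EVc/ta", +1)   # second update: smaller step (variance 0.9)
```

Storing one variance per feature rather than a full covariance matrix is what keeps the update cost linear in the number of touched features.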
  • the correcting unit 13 may update the feature score using a method other than SCW.
  • methods other than SCW include the Perceptron, Passive-Aggressive (PA), Confidence-Weighted Learning (CW), and Adaptive Regularization of Weight Vectors (AROW).
  • When the division model 20 has been corrected by updating the feature scores related to the analyzed sentence, the correction unit 13 generates a completion notification and outputs it to the acquisition unit 11. In this case, the natural language processing system 10 (more specifically, the analysis unit 12) analyzes the next sentence using the modified division model 20.
  • the acquisition unit 11 acquires one sentence (step S11).
  • the analysis unit 12 performs morphological analysis on the sentence using the division model 20 (step S12, analysis step).
  • a tag such as “SN-nc” is given to each character of the sentence.
  • the correction unit 13 obtains the difference between the result of the morphological analysis by the analysis unit 12 and the correct data for the morphological analysis (step S13).
  • If there is no difference (step S14: NO), the correction unit 13 ends the process without correcting the division model 20.
  • If there is a difference (step S14: YES), the correction unit 13 corrects the division model 20 by updating the feature scores related to the analyzed sentence (step S15, correction step). Specifically, the correction unit 13 makes the feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the feature scores related to the incorrect tag lower than their current values.
  • When the processing in the correction unit 13 is completed, the processing returns to step S11 (see step S16).
  • the acquisition unit 11 acquires the next sentence (step S11), and the analysis unit 12 performs morphological analysis on the sentence (step S12).
  • If the division model 20 was modified (step S15) during the processing of the previous sentence, the analysis unit 12 performs the morphological analysis using the modified division model 20.
  • the correction unit 13 then executes the processes from step S13 onward. This repetition continues as long as sentences to be processed exist (see step S16).
  • the first line in the algorithm means initialization of the division model 20 (variable w 1 ), and for example, the score of each feature is set to 0 by this processing.
  • the For loop on the second line indicates that the processes on and after the third line are executed one sentence at a time.
  • the third line means that the sentence xt is acquired and corresponds to step S11 described above.
  • the fourth line shows a process of assigning a tag to each character by performing a morphological analysis based on the division model 20 (w t ) at that time, and corresponds to step S12 described above.
  • y^t indicates the analysis result. The fifth line means that the correct data y_t of the morphological analysis of the sentence x_t is obtained.
  • The sixth line means that the division model 20 is updated (modified) if there is a difference between the analysis result y^t and the correct data y_t.
  • the seventh line indicates that the correct data y_t is learned as a positive example, and the eighth line indicates that the analysis result y^t, which includes an error, is learned as a negative example.
  • the processing on the seventh and eighth lines corresponds to step S15 described above.
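The eight-line algorithm described above can be sketched as the following per-sentence loop. The function names stand in for the acquisition unit, the analysis unit, the correct-data source, and the correction unit, and are assumptions for illustration:

```python
def train_online(sentences, analyze, get_correct, update_model):
    """Sketch of the per-sentence loop in the algorithm above."""
    model = {}                          # line 1: initialize the model w1
    for x_t in sentences:               # lines 2-3: one sentence at a time
        y_hat = analyze(model, x_t)     # line 4: tag using the current w_t
        y_t = get_correct(x_t)          # line 5: obtain correct data y_t
        if y_hat != y_t:                # line 6: only on a mismatch
            update_model(model, x_t, y_t, +1)    # line 7: correct data as positive
            update_model(model, x_t, y_hat, -1)  # line 8: wrong result as negative
    return model

# Toy demo: the "model" maps (character, tag) pairs to scores; this
# analyzer always guesses tag "A" while the correct tag is "B".
def analyze(model, x):
    return [(c, "A") for c in x]

def correct(x):
    return [(c, "B") for c in x]

def upd(model, x, y, sign):
    for c, t in y:
        model[(c, t)] = model.get((c, t), 0) + sign

m = train_online(["ab"], analyze, correct, upd)
```

Because the model is touched only once per mismatching sentence, the total correction time grows roughly linearly with the number of sentences, which is the point made about predictable correction time.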
  • the natural language processing program P1 includes a main module P10, an acquisition module P11, an analysis module P12, and a correction module P13.
  • the main module P10 is a part that comprehensively controls morphological analysis and related processing.
  • the functions realized by executing the acquisition module P11, the analysis module P12, and the correction module P13 are the same as the functions of the acquisition unit 11, the analysis unit 12, and the correction unit 13, respectively.
  • the natural language processing program P1 may be provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, DVD-ROM, or semiconductor memory.
  • the natural language processing program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.
  • A natural language processing system according to one aspect comprises: an analysis unit that, by performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, sets, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, makes the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature scores and transition feature scores related to the incorrect tag lower than their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
  • A natural language processing method according to one aspect is executed by a natural language processing system comprising a processor, and includes: an analysis step of performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, thereby setting, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction step of comparing the tags indicated by the analysis result with correct data indicating the correct tags of the sentence, making the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values and those related to the incorrect tag lower than their current values, thereby correcting the division model used in the morphological analysis of the next sentence.
  • A natural language processing program according to one aspect causes a computer to function as: an analysis unit that, by performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, sets, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, makes the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature scores and transition feature scores related to the incorrect tag lower than their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
  • the analysis result is compared with the correct answer data, and the division model is corrected based on the difference between them.
  • the division model for morphological analysis can be automatically modified within a certain time (in other words, within a predictable time range).
  • the accuracy of the morphological analysis of the next sentence can be increased by increasing the feature score for the correct tag and decreasing the feature score for the tag that was incorrect.
  • the divided element may be a character.
  • morphological analysis can be executed without using a word dictionary, which generally becomes large.
  • In addition, since the division model is corrected for each sentence using character-level knowledge rather than word-level knowledge, high analysis accuracy can be expected even when the next sentence differs in field or nature from the sentences analyzed so far.
  • Each of the output feature scores and the transition feature scores may have a range of values, with a variance set for each score, and the correction unit may determine, based on the variance of each score, the amount of change applied when making the score higher or lower.
  • In a language with many characters, such as Japanese or Chinese, the division model 20 becomes very large,
  • and the storage capacity required for it also becomes very large. Therefore, a technique called feature hashing may be introduced to convert individual features into numerical values using a hash function.
  • the effect of converting into numerical values the characters and character strings that represent part of the features is high.
  • If the transition features are hashed, however, this does not contribute much to compressing the division model 20, and the processing speed may fall. Therefore, only the output features may be hashed, leaving the transition features unhashed. Note that only one type of hash function may be used, or different hash functions may be used for the output features and the transition features.
  • the division model 20 stores data on features in which individual characters are represented by numerical values. For example, the character "hon" (本) is converted to the numerical value 34, and the character "wo" (を) to the numerical value 4788. Through this conversion, a bounded set of features can be formed. Feature hashing may assign the same numerical value to multiple characters or character strings, but it is very unlikely that the same numerical value is assigned to characters or character strings that appear frequently, so collisions can be ignored.
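Feature hashing as described here can be sketched as follows. The hash function (MD5) and the table size are arbitrary illustrative choices; the patent does not specify a particular hash function:

```python
import hashlib

NUM_BUCKETS = 2 ** 20   # bounded feature space (illustrative size)

def hash_feature(feature: str) -> int:
    """Map an arbitrary character/string feature to a bounded integer.
    Different features may collide, but frequent features rarely do,
    so collisions can in practice be ignored."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

# Scores can then live in a bounded array instead of a dictionary
# keyed by raw strings (here only output features are hashed).
scores = [0.0] * NUM_BUCKETS
scores[hash_feature("SN-nc/本")] += 1.0
```

A fixed-size array bounds the model's memory regardless of how many distinct characters appear, which is the point of hashing in character-rich languages.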
  • the division model may include an output feature quantified by a hash function.
  • the analysis unit 12 may perform morphological analysis using a feature having a relatively high score without using a feature having a relatively low score (ignoring such a feature).
  • techniques for ignoring features with relatively low scores include forward-backward splitting (FOBOS) and feature quantization.
  • FOBOS is a method of compressing scores to 0 by regularization (for example, L1 regularization).
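The L1 shrinkage step underlying FOBOS can be sketched as follows; the regularization strength `lam` and the dictionary representation are illustrative assumptions:

```python
def l1_shrink(scores, lam):
    """FOBOS-style proximal step for L1 regularization: move each
    score toward 0 by lam, and set scores within [-lam, lam] to
    exactly 0, so weak features drop out of the model."""
    out = {}
    for f, s in scores.items():
        if s > lam:
            out[f] = s - lam
        elif s < -lam:
            out[f] = s + lam
        # |s| <= lam: the score becomes exactly 0 and the feature is dropped
    return out

pruned = l1_shrink({"f1": 0.8, "f2": 0.05, "f3": -0.3}, lam=0.1)
# f2 is compressed to 0 and removed; f1 and f3 shrink toward 0.
```

Dropping zeroed features shrinks the stored model, which is why the analysis can then ignore features with relatively low scores.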
  • feature quantization is a technique for converting a feature score into an integer by multiplying it by 10^n (n is a natural number of 1 or more) and truncating the fractional part. For example, multiplying the score "0.123456789" by 1000 and truncating gives the integer score "123". By quantizing the scores, the memory capacity required to store them as text can be saved. In addition, this technique makes it possible to ignore features whose score is equal to or less than a predetermined value (for example, features whose score after integerization is 0 or close to 0).
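The quantization described here can be sketched as follows, assuming n = 3 and truncation of the fractional part:

```python
def quantize(score: float, n: int = 3) -> int:
    """Multiply by 10**n and truncate to an integer, as in the
    example where 0.123456789 * 1000 -> 123."""
    return int(score * 10 ** n)

quantize(0.123456789)  # 123
# Features whose quantized score is 0 (i.e. |score| < 10**-n)
# can then be ignored during analysis.
```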
  • the analysis unit 12 performs the morphological analysis without using the feature Fb.
  • the regularization or quantization process is executed by, for example, the correction unit 13, another functional element in the natural language processing system 10, or a computer system different from the natural language processing system 10.
  • When the correction unit 13 performs the regularization or quantization process, it may, for example, execute that process once on the division model 20 each time the natural language processing system 10 has performed morphological analysis on a set of sentences (for example, a certain number of sentences).
  • the analysis unit may execute the morphological analysis without using features with relatively low scores, that is, features whose score becomes equal to or lower than a predetermined value through regularization or quantization (for example, features whose score becomes 0, or close to 0, through regularization or quantization).
  • the analysis unit 12 divides a sentence into individual characters and sets a tag for each character, but the divided element may be a word instead of a character. In that case, the analysis unit may perform the morphological analysis using a word dictionary and a division model that indicates feature scores for words instead of characters.
  • the natural language processing system according to the present invention can be applied to morphological analysis of an arbitrary language.

Abstract

A natural language processing system according to an embodiment is provided with an analysis unit and a correction unit. The analysis unit, by executing morphological analysis on a single sentence using a division model, sets a tag on each divided element obtained by dividing the sentence. The division model includes output feature scores, each indicating the correspondence between a divided element and a tag, and transition feature scores, each indicating a combination of two tags corresponding to two consecutive divided elements. The correction unit compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, and corrects the division model used in the morphological analysis of the next sentence by the analysis unit by increasing the scores of features related to the correct tag corresponding to an incorrect tag while decreasing the scores of features related to the incorrect tag.

Description

Natural language processing system, natural language processing method, and natural language processing program
 One aspect of the present invention relates to a natural language processing system, a natural language processing method, and a natural language processing program.
 As one of the basic techniques of natural language processing, morphological analysis, which divides a sentence into a sequence of morphemes and determines the part of speech of each morpheme, is known. In relation to this, Patent Document 1 below describes a morphological analyzer that decomposes input text data into morphemes, obtains position information corresponding to the decomposed morphemes by referring to a morpheme dictionary, and determines a morpheme sequence from the candidate morpheme sequences obtained by the decomposition, using a cost function based on the position information.
JP 2013-210856 A
 Morphological analysis is executed using a division model that includes a score for each feature. Since the division model, which can be regarded as the knowledge used for morphological analysis, is generally fixed in advance, it is naturally very difficult to obtain correct results when analyzing a sentence that belongs to a new field, or that has a new property, not covered by the division model. On the other hand, if the division model is corrected by a method such as machine learning, the time required for the correction may grow unpredictably. It is therefore desirable to automatically correct the division model for morphological analysis within a certain time.
 A natural language processing system according to one aspect of the present invention comprises: an analysis unit that, by performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, sets, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, makes the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature scores and transition feature scores related to the incorrect tag lower than their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
 A natural language processing method according to one aspect of the present invention is executed by a natural language processing system comprising a processor, and includes: an analysis step of performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, thereby setting, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction step of comparing the tags indicated by the analysis result obtained in the analysis step with correct data indicating the correct tags of the sentence, making the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values and those related to the incorrect tag lower than their current values, thereby correcting the division model used in the morphological analysis of the next sentence in the analysis step.
 A natural language processing program according to one aspect of the present invention causes a computer to function as: an analysis unit that, by performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, sets, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, makes the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature scores and transition feature scores related to the incorrect tag lower than their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
 In such an aspect, every time one sentence is morphologically analyzed, the analysis result is compared with the correct data, and the division model is corrected based on the difference between them. By correcting the division model sentence by sentence in this way, the time required to correct the model when processing multiple sentences increases only roughly linearly with the number of sentences, so the division model for morphological analysis can be corrected automatically within a certain time (in other words, within a predictable time range).
 According to one aspect of the present invention, the division model for morphological analysis can be automatically corrected within a certain time.
FIG. 1 is a conceptual diagram of processing in the natural language processing system according to the embodiment. FIG. 2 shows an example of morphological analysis in the embodiment. FIG. 3 shows the hardware configuration of a computer constituting the natural language processing system according to the embodiment. FIG. 4 is a block diagram showing the functional configuration of the natural language processing system according to the embodiment. FIG. 5 conceptually shows an example of tagging. FIGS. 6A and 6B each schematically show an example of score updating. FIG. 7 is a flowchart showing the operation of the natural language processing system according to the embodiment. FIG. 8 shows the configuration of the natural language processing program according to the embodiment.
 Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, identical or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.
 First, the functions and configuration of the natural language processing system 10 according to the embodiment will be described with reference to FIGS. 1 to 5. The natural language processing system 10 is a computer system that performs morphological analysis. Morphological analysis is the process of dividing a sentence into a sequence of morphemes and determining the part of speech of each morpheme. A sentence is a unit of linguistic expression representing one complete statement and is expressed as a character string. A morpheme is the smallest language unit that has meaning. A morpheme sequence is the sequence of one or more morphemes obtained by dividing a sentence into those morphemes. A part of speech is a classification of words by grammatical function or form.
 The natural language processing system 10 performs morphological analysis on individual sentences using the division model 20. One feature of the natural language processing system 10 is that, while the division model 20 is being learned, the division model 20 is corrected each time a sentence is morphologically analyzed. When the correction of the division model 20 is finished, the natural language processing system 10 with the finalized division model 20 is provided to users. A user can have the natural language processing system 10 execute morphological analysis; in that case, the morphological analysis is executed without the division model 20 being corrected. The "division model" in this specification is the set of criteria (cues) used when dividing a sentence into one or more morphemes, and is represented by the score of each feature. The division model is obtained by machine learning using one or more training data. The training data are data indicating at least a sentence divided into words and the part of speech of each word obtained by dividing the sentence. A feature is a cue for obtaining a correct result in morphological analysis; in general, what is used as a feature is not limited. The score of a feature is a numerical value representing the likelihood of that feature.
 FIG. 1 briefly shows the concept of processing in the natural language processing system 10 according to the present embodiment. The gear M in FIG. 1 indicates the execution of morphological analysis. At a certain point, the natural language processing system 10 divides a sentence s1 into one or more morphemes by executing morphological analysis using a division model w1. In the present embodiment, the natural language processing system 10 divides a sentence into individual characters and processes it character by character, thereby dividing the sentence into one or more morphemes. In other words, in the present embodiment, the divided elements to be processed are characters. The natural language processing system 10 indicates the result of the morphological analysis by setting a tag on each character (divided element). A "tag" in this specification is a label indicating an attribute or function of a character. Tags are described in more detail later.
 After executing the morphological analysis, the natural language processing system 10 accepts data indicating the correct answer of the morphological analysis of the sentence s1 (correct data), compares the analysis result with the correct data, and corrects the division model w1 to obtain a new division model w2. Specifically, when at least part of the tagging in the morphological analysis of the sentence s1 is wrong, the natural language processing system 10 evaluates the entire analysis result as wrong. The natural language processing system 10 then evaluates the feature corresponding to each tag in the correct data as "correct (+1)" and makes the score of that feature higher than its current value, and evaluates the feature corresponding to each tag in the analysis result as "wrong (−1)" and makes the score of that feature lower than its current value, thereby obtaining the division model w2. When some of the tags in the analysis result are correct, the two evaluations "correct (+1)" and "wrong (−1)" of the features related to those correct tags cancel out. Therefore, the processing of raising and lowering the feature scores as described above can be said to be processing that raises the scores of the features related to the correct tags corresponding to incorrect tags (the correct tags for the incorrectly analyzed parts) and lowers the scores of the features related to the incorrect tags (the tags of the incorrectly analyzed parts).
 Alternatively, the natural language processing system 10 may evaluate each tag in the correct data as "correct (+1)" while evaluating the tag of each character in the analysis result as "wrong (-1)", cancel the two evaluation results for each tag, and then raise the scores of the features corresponding to the tags evaluated as "correct (+1)" and lower the scores of the features corresponding to the tags evaluated as "wrong (-1)".
 For example, suppose that five characters x_a, x_b, x_c, x_d, and x_e exist in the sentence s1, that the correct tags of the characters x_a, x_b, x_c, x_d, and x_e are t_a, t_b, t_c, t_d, and t_e, respectively, and that the morphological analysis assigns the tags t_a, t_g, t_h, t_d, and t_e to those characters. In this case, the natural language processing system 10 evaluates the features corresponding to the tags t_a, t_b, t_c, t_d, and t_e in the correct data as "correct (+1)" and raises their scores above the current values, and evaluates the features corresponding to the tags t_a, t_g, t_h, t_d, and t_e in the execution result as "wrong (-1)" and lowers their scores below the current values. As a result, the scores of the features corresponding to the tags t_a, t_d, and t_e remain unchanged, the scores of the features corresponding to the correct tags t_b and t_c rise, and the scores of the features corresponding to the incorrect tags t_g and t_h fall.
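This update rule can be sketched as a perceptron-style step. The following is a minimal illustration, not the patent's exact implementation; the feature extraction is reduced here to the tag identity alone, so that the cancellation of matching tags is visible directly.

```python
from collections import defaultdict

def update_model(scores, gold_tags, predicted_tags, step=1.0):
    """Raise the scores of features from the correct data ("correct (+1)")
    and lower those from the analysis result ("wrong (-1)"); features
    appearing in both cancel out."""
    for tag in gold_tags:
        scores[tag] += step
    for tag in predicted_tags:
        scores[tag] -= step
    return scores

scores = defaultdict(float)                    # all scores start at 0
gold = ["t_a", "t_b", "t_c", "t_d", "t_e"]     # correct tags
pred = ["t_a", "t_g", "t_h", "t_d", "t_e"]     # analysis result
update_model(scores, gold, pred)
# t_a, t_d, t_e cancel; t_b, t_c rise; t_g, t_h fall
```

After the update, only the scores of the disagreeing tags have moved, matching the example above.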
 When executing the morphological analysis on the next sentence s2, the natural language processing system 10 uses the division model w2. The natural language processing system 10 then accepts the correct data for the morphological analysis of the sentence s2, compares the execution result with that correct data, and corrects the division model w2 in the same manner as the division model w1 was corrected, thereby obtaining a new division model w3.
 In this way, the natural language processing system 10 corrects the division model each time it processes one sentence (s1, s2, ..., st), that is, w1 → w2, w2 → w3, ..., wt → wt+1, and uses the corrected division model in the morphological analysis of the next sentence. Such a technique of updating the model each time one piece of training data is processed is also called "online learning" or "online machine learning".
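The loop just described can be sketched as follows. This is an assumed skeleton with toy stand-ins for the analysis and update steps (here the "model" simply memorizes the gold taggings it was corrected on), shown only to make the one-sentence-at-a-time flow concrete.

```python
def online_train(sentences, gold_taggings, analyze, update, model):
    """Online learning: the model corrected after sentence s_t is the one
    used when analyzing the next sentence s_(t+1)."""
    for sentence, gold in zip(sentences, gold_taggings):
        predicted = analyze(sentence, model)   # uses the current model w_t
        if predicted != gold:                  # corrected only on a mistake
            model = update(model, sentence, gold, predicted)  # w_(t+1)
    return model

# Toy stand-ins (hypothetical, for illustration only):
def toy_analyze(sentence, model):
    return model.get(sentence, [])

def toy_update(model, sentence, gold, predicted):
    new_model = dict(model)
    new_model[sentence] = gold
    return new_model

model = online_train(["s1", "s2"], [["N"], ["V"]],
                     toy_analyze, toy_update, {})
```

After training, the toy model reproduces the gold taggings for both sentences.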
 FIG. 2 shows an example of the result of morphological analysis by the natural language processing system 10. In this example, the natural language processing system 10 divides the Japanese sentence 「本を買って (hon wo katte)」, which corresponds to the English sentence "I bought a book", into five characters x1: 「本 (hon)」, x2: 「を (wo)」, x3: 「買 (ka)」, x4: 「っ (t)」, and x5: 「て (te)」. The natural language processing system 10 then sets a tag for each character by executing morphological analysis. In the present embodiment, a tag is a combination of the appearance mode of a character within a word, the part of speech of that word, and the subclass of that part of speech, and is expressed with alphabetic characters, such as "S-N-nc".
 The appearance mode is information indicating whether a character forms a word by itself or forms a word in combination with other characters and, when the character is part of a word consisting of two or more characters, where in the word the character is located. In the present embodiment, the appearance mode is indicated by one of S, B, I, and E. The appearance mode "S" indicates that the character forms a word by itself. The appearance mode "B" indicates that the character is located at the beginning of a word consisting of two or more characters. The appearance mode "I" indicates that the character is located in the middle of a word consisting of three or more characters. The appearance mode "E" indicates that the character is located at the end of a word consisting of two or more characters. The example of FIG. 2 shows that the characters x1, x2, and x5 are each a single word and that the characters x3 and x4 form one word.
 The scheme for the appearance mode is not limited. The present embodiment uses the scheme "SBIEO", but, for example, the scheme "IOB2", which is well known to those skilled in the art, may be used instead.
 Examples of parts of speech include nouns, verbs, particles, adjectives, adjectival verbs, and conjunctions. In the present embodiment, a noun is represented by "N", a particle by "P", and a verb by "V". The example of FIG. 2 shows that the character x1 is a noun, the character x2 is a particle, the word consisting of the characters x3 and x4 is a verb, and the character x5 is a particle.
 A part-of-speech subclass indicates a subordinate concept of the corresponding part of speech. For example, nouns can be further classified into common nouns and proper nouns, and particles can be further classified into case particles, conjunctive particles, binding particles, and the like. In the present embodiment, a common noun is represented by "nc", a proper noun by "np", a case particle by "k", a conjunctive particle by "sj", and a common verb by "c". The example of FIG. 2 shows that the character x1 is a common noun, the character x2 is a case particle, the word consisting of the characters x3 and x4 is a common verb, and the character x5 is a conjunctive particle.
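For illustration, given a word list annotated with part of speech and subclass, tags of the form appearance-mode/POS/subclass could be generated as follows. This is a sketch under the assumption that the segmentation and annotations are already known; the tag vocabulary matches the example of FIG. 2.

```python
def tags_for_word(word, pos, subclass):
    """Assign S/B/I/E appearance modes to the characters of one word and
    combine each with the part of speech and its subclass, e.g. "S-N-nc"."""
    if len(word) == 1:
        modes = ["S"]                                        # single-character word
    else:
        modes = ["B"] + ["I"] * (len(word) - 2) + ["E"]      # begin/inside/end
    return [f"{m}-{pos}-{subclass}" for m in modes]

# 「本を買って」 segmented as 本 / を / 買っ / て
words = [("本", "N", "nc"), ("を", "P", "k"), ("買っ", "V", "c"), ("て", "P", "sj")]
tags = [t for w, p, s in words for t in tags_for_word(w, p, s)]
```

The resulting sequence is the per-character tagging used in the examples of this description.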
 The feature scores stored in the division model 20 are scores of emission features (output features) and scores of transition features.
 An emission feature is a clue indicating the correspondence between a tag and a character or character type; in other words, it is a clue indicating which characters or character types tend to correspond to which tags. Emission features correspond to the feature representation of the emission matrix of a hidden Markov model. The present embodiment uses emission features of unigrams (strings consisting of a single character) and emission features of bigrams (strings consisting of two consecutive characters).
 Here, a character type is a category of characters in a given language. Japanese character types include, for example, kanji, hiragana, katakana, alphabetic characters (uppercase and lowercase), Arabic numerals, kanji numerals, and the middle dot (・). In the present embodiment, character types are represented by alphabetic characters. For example, "C" indicates kanji, "H" indicates hiragana, "K" indicates katakana, "L" indicates alphabetic characters, and "A" indicates Arabic numerals. The example of FIG. 2 shows that the characters x1 and x3 are kanji and the characters x2, x4, and x5 are hiragana.
 A unigram emission feature for a character is a clue indicating the correspondence between a tag t and a character x, and a unigram emission feature for a character type is a clue indicating the correspondence between a tag t and a character type c. In the present embodiment, the likelihood score s of the correspondence between a tag t and a character x is written {t/x, s}, and the likelihood score s of the correspondence between a tag t and a character type c is written {t/c, s}. The division model 20 contains scores for a plurality of tags for a single character or character type. When data on every kind of tag is prepared for a single character or character type, the division model 20 also contains scores for combinations of a tag and a character or character type that cannot actually occur grammatically. The scores of such grammatically impossible features, however, become relatively low.
 The following shows examples of emission feature scores for the Japanese character 「本 (hon)」. It is impossible in Japanese grammar for this character to be a particle, but, as described above, data can also be prepared for grammatically nonexistent features such as "S-P-k/本(hon)".
{S-N-nc/本(hon), 0.0420}
{B-N-nc/本(hon), 0.0310}
{S-P-k/本(hon), 0.0003}
{B-V-c/本(hon), 0.0031}
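A division model holding such unigram emission scores could be represented as a mapping from feature keys to scores. This is a sketch using the example values above; any feature absent from the model implicitly scores 0.

```python
from collections import defaultdict

# {t/x, s}: emission scores keyed as "tag/character" (example values from
# the text; a missing feature defaults to a score of 0.0)
emission = defaultdict(float, {
    "S-N-nc/本": 0.0420,
    "B-N-nc/本": 0.0310,
    "S-P-k/本": 0.0003,
    "B-V-c/本": 0.0031,
})

def emission_score(tag, char):
    """Look up the likelihood score of the tag/character correspondence."""
    return emission[f"{tag}/{char}"]
```

The grammatically impossible combination "S-P-k/本" is present but carries a relatively low score, as described above.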
 The following shows examples of emission feature scores for the character type "kanji".
{S-N-nc/C, 0.0255}
{E-N-np/C, 0.0488}
{S-P-k/C, 0.0000}
{B-V-c/C, 0.0299}
 For character types as well, data indicating grammatically nonexistent features can be prepared. For example, it is impossible in Japanese grammar for a word written in Arabic numerals to be a particle, but data can also be prepared for a feature such as "S-P-k/A".
 A bigram emission feature for characters is a clue indicating the correspondence between a tag t and a character string x_i x_i+1, and a bigram emission feature for character types is a clue indicating the correspondence between a tag t and a character-type sequence c_i c_i+1. In the present embodiment, the likelihood score s of a tag t and characters x_i and x_i+1 is written {t/x_i/x_i+1, s}, and the likelihood score s of a tag t and character types c_i and c_i+1 is written {t/c_i/c_i+1, s}. When data on every possible tag is prepared for a single bigram, the division model 20 also stores data on combinations of a tag and a bigram that cannot actually occur grammatically.
 The following shows examples of emission feature scores for the bigram 「本を (hon wo)」.
{S-N-nc/本(hon)/を(wo), 0.0420}
{B-N-nc/本(hon)/を(wo), 0.0000}
{S-P-k/本(hon)/を(wo), 0.0001}
{B-V-c/本(hon)/を(wo), 0.0009}
 The following shows examples of emission feature scores for a bigram in which hiragana follows kanji.
{S-N-nc/C/H, 0.0455}
{E-N-np/C/H, 0.0412}
{S-P-k/C/H, 0.0000}
{B-V-c/C/H, 0.0054}
 A transition feature is a clue indicating a combination of the tag t_i of a character x_i and the tag t_i+1 of the next character x_i+1 (a combination of the two tags corresponding to two consecutive characters). A transition feature is a bigram feature and corresponds to the feature representation of the transition matrix of a hidden Markov model. In the present embodiment, the likelihood score s of the combination of a tag t_i and a tag t_i+1 is written {t_i/t_i+1, s}. When transition feature data is prepared for every possible combination, the division model 20 also stores data on combinations of two tags that cannot actually occur grammatically.
 The following shows some examples of transition feature scores.
{S-N-nc/S-P-k, 0.0512}
{E-N-nc/E-N-nc, 0.0000}
{S-P-k/B-V-c, 0.0425}
{B-V-c/I-V-c, 0.0387}
 The natural language processing system 10 includes one or more computers; when it includes a plurality of computers, the functional elements of the natural language processing system 10 described below are realized by distributed processing. The type of each computer is not limited. For example, a stationary or portable personal computer (PC), a workstation, or a mobile terminal such as a high-function mobile phone (smartphone), a mobile phone, or a personal digital assistant (PDA) may be used. Alternatively, the natural language processing system 10 may be built by combining various types of computers. When a plurality of computers are used, they are connected via a communication network such as the Internet or an intranet.
 FIG. 3 shows the general hardware configuration of each computer 100 in the natural language processing system 10. The computer 100 includes a CPU (processor) 101 that executes an operating system, application programs, and the like, a main storage unit 102 composed of a ROM and a RAM, an auxiliary storage unit 103 composed of a hard disk, a flash memory, or the like, a communication control unit 104 composed of a network card or a wireless communication module, an input device 105 such as a keyboard and a mouse, and an output device 106 such as a display and a printer. Naturally, the mounted hardware modules differ depending on the type of the computer 100. For example, a stationary PC and a workstation often include a keyboard, a mouse, and a monitor as input and output devices, whereas in a smartphone a touch panel often functions as both the input device and the output device.
 Each functional element of the natural language processing system 10 described below is realized by loading predetermined software onto the CPU 101 or the main storage unit 102, operating the communication control unit 104, the input device 105, the output device 106, and the like under the control of the CPU 101, and reading and writing data in the main storage unit 102 or the auxiliary storage unit 103. The data and databases necessary for the processing are stored in the main storage unit 102 or the auxiliary storage unit 103.
 The division model 20, on the other hand, is stored in a storage device in advance. The specific implementation of the division model 20 is not limited; for example, it may be prepared as a relational database or as a text file. The location of the division model 20 is also not limited; for example, the division model 20 may exist inside the natural language processing system 10 or in another computer system different from the natural language processing system 10. When the division model 20 is in another computer system, the natural language processing system 10 accesses it via a communication network.
 As described above, the division model 20 can be regarded as a set of scores of various features. Mathematically, the division model 20 containing the scores w1, w2, ..., wn of n features can be expressed as a vector w = {w1, w2, ..., wn}. At the time the division model 20 is newly created, all feature scores are 0, that is, w = {0, 0, ..., 0}. The scores are updated little by little through the processing of the natural language processing system 10 described below. After a certain number of sentences have been processed, differences arise between the scores of the individual features, as in the examples above.
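Since only a small subset of the n features fires for any given sentence, the vector w can in practice be held as a sparse mapping, and scoring a candidate analysis amounts to summing the weights of its active features. The following is a minimal sketch; the feature names are taken from the examples in the text, and the binary (fired / not fired) feature vector is an assumption made for illustration.

```python
from collections import defaultdict

# A newly created division model: every feature score is implicitly 0,
# i.e. w = {0, 0, ..., 0}.
w = defaultdict(float)

def score(features, model):
    """Dot product of the model with the binary feature vector of a
    candidate analysis: the sum of the active features' weights."""
    return sum(model[f] for f in features)

features = ["S-N-nc/本", "S-N-nc/C", "S-N-nc/S-P-k"]
initial = score(features, w)   # 0.0 under the all-zero initial model

# After some training, individual scores differ:
w["S-N-nc/本"] = 0.0420
w["S-N-nc/S-P-k"] = 0.0512
trained = score(features, w)
```

Unset features ("S-N-nc/C" here) continue to contribute 0, so the mapping never needs to materialize the full n-dimensional vector.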
 As shown in FIG. 4, the natural language processing system 10 includes an acquisition unit 11, an analysis unit 12, and a correction unit 13 as functional components, and accesses the division model 20 as necessary. Each functional element is described below on the assumption that the natural language processing system 10 processes Japanese sentences. The language of the sentences processed by the natural language processing system 10, however, is not limited to Japanese; sentences in other languages such as Chinese can also be analyzed.
 The acquisition unit 11 is a functional element that acquires a sentence to be divided into a sequence of morphemes. The method by which the acquisition unit 11 acquires sentences is not limited. For example, the acquisition unit 11 may collect sentences from arbitrary websites on the Internet (so-called crawling). Alternatively, the acquisition unit 11 may read sentences stored in advance in a database within the natural language processing system 10, or may access and read, via a communication network, sentences stored in advance in a database on a computer system other than the natural language processing system 10. Alternatively, the acquisition unit 11 may accept a sentence input by a user of the natural language processing system 10. When an instruction to analyze the first sentence is input, the acquisition unit 11 acquires one sentence and outputs it to the analysis unit 12. Thereafter, each time a completion notification is input from the correction unit 13 described below, the acquisition unit 11 acquires the next sentence and outputs it to the analysis unit 12.
 The analysis unit 12 is a functional element that executes morphological analysis on individual sentences. The analysis unit 12 executes the following processing each time one sentence is input.
 First, the analysis unit 12 divides the sentence into individual characters and determines the character type of each character. The analysis unit 12 stores in advance a lookup table of characters and character types, or regular expressions for determining character types, and determines each character type using that table or those regular expressions.
 Next, the analysis unit 12 determines the tag of each character using the Viterbi algorithm. For the i-th character, the analysis unit 12 determines, for each tag that may ultimately be selected (candidate tag), which of the candidate tags of the (i-1)-th character yields the highest score (also called the "connection score") when connected to it. Here, the connection score is the sum of the various scores related to the tag being computed (the unigram emission feature scores, the bigram emission feature scores, and the transition feature score). For example, the analysis unit 12 determines that, when the i-th tag is "S-N-nc", the connection score is highest when the (i-1)-th tag is "S-P-k", and that, when the i-th tag is "S-V-c", the connection score is highest when the (i-1)-th tag is "E-N-nc". The analysis unit 12 then stores all the combinations with the highest connection scores (for example, (S-P-k, S-N-nc) and (E-N-nc, S-V-c)). The analysis unit 12 executes this processing character by character, from the first character to the end-of-sentence symbol.
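Under the feature notation introduced earlier, the connection score of one candidate tag pair could be computed as the sum of the relevant feature scores. The following is an illustrative sketch; the exact set of features entering each position's sum is an assumption, and `model` maps feature keys to scores with missing features scoring 0.

```python
from collections import defaultdict

def connection_score(model, prev_tag, tag, chars, ctypes, i):
    """Sum the scores related to the tag being computed: unigram emission
    features, bigram emission features, and the transition feature."""
    s = model[f"{tag}/{chars[i]}"] + model[f"{tag}/{ctypes[i]}"]  # unigrams
    if i + 1 < len(chars):                   # bigram emission features
        s += model[f"{tag}/{chars[i]}/{chars[i + 1]}"]
        s += model[f"{tag}/{ctypes[i]}/{ctypes[i + 1]}"]
    if prev_tag is not None:                 # transition feature {t_i-1/t_i}
        s += model[f"{prev_tag}/{tag}"]
    return s

model = defaultdict(float, {
    "S-N-nc/本": 0.0420, "S-N-nc/C": 0.0255,
    "S-N-nc/本/を": 0.0420, "S-N-nc/C/H": 0.0455,
})
total = connection_score(model, None, "S-N-nc", ["本", "を"], ["C", "H"], 0)
```

With the example scores from the text, the four contributing features sum to 0.0420 + 0.0255 + 0.0420 + 0.0455.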
 Since only one kind of tag (EOS) exists for the end-of-sentence symbol, the combination of the tag of the last character and the end-of-sentence symbol with the highest connection score is uniquely determined (for example, the combination is determined to be (E-V-c, EOS)). The tag of the last character is thereby determined (for example, to be "E-V-c"), and as a result the tag of the second-to-last character is also determined. Consequently, the tags are fixed one after another in a chain, in order from the end of the sentence toward the beginning.
 FIG. 5 schematically shows this processing by the analysis unit 12, using an example of tagging a sentence consisting of four characters. To simplify the explanation, the tags are abbreviated as "A1", "B2", and so on, and the number of candidate tags per character is three. The thick lines in FIG. 5 indicate the tag-to-tag combinations determined, by processing the sentence from the front, to have the highest connection scores. For example, in the processing of the third character, the connection score of the tag C1 is highest with the tag B1, that of the tag C2 is highest with the tag B1, and that of the tag C3 is highest with the tag B2. In the example of FIG. 5, when the processing reaches the end of the sentence (EOS), the combination (D1, EOS) is fixed, then the combination (C2, D1) is fixed, and thereafter the combinations (B1, C2) and (A2, B1) are fixed in turn. The analysis unit 12 therefore determines that the tags of the first to fourth characters are A2, B1, C2, and D1, respectively.
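The forward pass with stored best predecessors and the backward chain of fixed tags is the standard Viterbi recursion. A compact sketch, with a hypothetical connection-score function given as a simple lookup table over a three-position toy lattice, might look like this:

```python
def viterbi(candidates, connect_score):
    """candidates[i] lists the candidate tags of position i;
    connect_score(prev_tag, tag, i) is the connection score of linking
    prev_tag at i-1 to tag at i (prev_tag is None at i = 0).
    Returns the best tag sequence, recovered backwards from the end."""
    n = len(candidates)
    # best[i][t] = (best total score reaching tag t at i, best previous tag)
    best = [{t: (connect_score(None, t, 0), None) for t in candidates[0]}]
    for i in range(1, n):
        layer = {}
        for t in candidates[i]:
            # which previous candidate tag gives the highest score?
            prev, s = max(
                ((p, best[i - 1][p][0] + connect_score(p, t, i))
                 for p in candidates[i - 1]),
                key=lambda pair: pair[1])
            layer[t] = (s, prev)
        best.append(layer)
    # only one tag exists at the final (EOS) position, so the path is unique
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(n - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy lattice in the spirit of FIG. 5 (hypothetical scores):
table = {
    (None, "A1", 0): 0.1, (None, "A2", 0): 0.2,
    ("A1", "B1", 1): 0.5, ("A2", "B1", 1): 0.3,
    ("A1", "B2", 1): 0.1, ("A2", "B2", 1): 0.4,
    ("B1", "EOS", 2): 0.2, ("B2", "EOS", 2): 0.1,
}
tags = viterbi([["A1", "A2"], ["B1", "B2"], ["EOS"]],
               lambda p, t, i: table[(p, t, i)])
```

Once (EOS) fixes the last combination, the stored best predecessors determine the remaining tags in a chain, just as described above.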
 The analysis unit 12 outputs the sentence with each character tagged as the analysis result. The analysis unit 12 outputs the analysis result at least to the correction unit 13, because the analysis result is necessary for correcting the division model 20. The analysis unit 12 may also output it in other ways. For example, the analysis unit 12 may display the analysis result on a monitor or print it on a printer, write it to a text file, or store it in a storage device such as a memory or a database. Alternatively, the analysis unit 12 may transmit the analysis result via a communication network to any computer system other than the natural language processing system 10.
 The correction unit 13 is a functional element that corrects the division model 20 based on the difference between the analysis result obtained from the analysis unit 12 and the correct morphological analysis of the sentence. In this specification, "correcting the division model" means processing that changes the score of at least one feature in the division model. In some cases, however, an attempt to change a score may result in no change in its value. The correction unit 13 executes the following processing each time one analysis result is input.
 First, the correction unit 13 acquires the correct data corresponding to the input analysis result, that is, data indicating the correct morphological analysis of the sentence processed by the analysis unit 12. The correct data in the present embodiment is data indicating the tag (the combination of appearance mode, part of speech, and part-of-speech subclass) of each character forming the sentence, and is created manually. The method by which the correction unit 13 acquires the correct data is not limited. For example, the correction unit 13 may read correct data stored in advance in a database within the natural language processing system 10, or may access and read, via a communication network, correct data stored in advance in a database on a computer system other than the natural language processing system 10. Alternatively, the correction unit 13 may accept correct data input by a user of the natural language processing system 10.
 Having acquired the correct data, the correction unit 13 compares the input analysis result with that correct data and identifies the difference between them.
 If the analysis result completely matches the correct data and there is no difference, the correction unit 13 ends the processing without correcting the division model 20, and generates a completion notification and outputs it to the acquisition unit 11. This completion notification is a signal indicating that the processing in the correction unit 13 has finished and the morphological analysis of the next sentence can be executed. Since a complete match between the analysis result and the correct data means that there is no need to correct the division model 20, at least at this point, the natural language processing system 10 (more specifically, the analysis unit 12) analyzes the next sentence using the current division model 20 as it is.
 For example, the correct data for the Japanese sentence 「本を買って (hon wo katte)」 described above is as follows, where the characters are written x1 to x5 for convenience.
x1: {S-N-nc}
x2: {S-P-k}
x3: {B-V-c}
x4: {E-V-c}
x5: {S-P-sj}
 Therefore, when the analysis result shown in FIG. 2 is input, the correction unit 13 determines that the analysis result completely matches the correct data, and outputs a completion notification to the acquisition unit 11 without correcting the division model 20.
 On the other hand, if the analysis result does not completely match the correct data (that is, if there is a difference between the analysis result and the correct data), the correction unit 13 updates at least some of the scores in the division model 20. More specifically, the correction unit 13 raises above its current value the score of each feature related to the correct tag corresponding to an incorrect tag, and lowers below its current value the score of each feature related to that incorrect tag.
 For example, suppose the analysis unit 12 obtains the following analysis result for the Japanese sentence 「本を買って」 (hon wo katte):
x1 (本): {S-N-nc}
x2 (を): {S-P-k}
x3 (買): {B-V-c}
x4 (っ): {I-V-c}
x5 (て): {E-V-c}
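To make the comparison concrete, the mismatch between this analysis result and the correct data can be located position by position. The helper below is an illustrative sketch, not part of the patented system; the tag strings follow the notation used in this example.

```python
# Hypothetical helper: find the character positions (0-based) whose
# predicted tag differs from the correct tag.
def diff_positions(predicted_tags, correct_tags):
    return [i for i, (p, c) in enumerate(zip(predicted_tags, correct_tags)) if p != c]

correct   = ["S-N-nc", "S-P-k", "B-V-c", "E-V-c", "S-P-sj"]
predicted = ["S-N-nc", "S-P-k", "B-V-c", "I-V-c", "E-V-c"]

# Characters x4 and x5 (indices 3 and 4) are mis-tagged.
print(diff_positions(predicted, correct))  # → [3, 4]
```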
 In this case, because the analysis result is wrong as a whole, the correction unit 13 evaluates the feature corresponding to each tag in the correct data as "correct (+1)" and raises that feature's score above its current value, and evaluates the feature corresponding to each tag in the analysis result as "incorrect (-1)" and lowers that feature's score below its current value. Taking into account the portions that cancel out, this is equivalent to saying that the correction unit 13 ultimately performs the following processing.
 The correction unit 13 raises the scores of the output features "E-V-c/っ(t)" and "S-P-sj/て(te)", which correspond to the correct tags of characters x4 and x5, above their current values, and lowers the scores of the output features "I-V-c/っ(t)" and "E-V-c/て(te)", which are associated with the incorrect tags, below their current values. This updates the unigram output-feature scores (scores based on characters) associated with the analyzed sentence.
 The correction unit 13 also raises the scores of the output features "E-V-c/H" and "S-P-sj/H", associated with the correct tags of the mis-analyzed characters x4 and x5, above their current values, and lowers the scores of the output features "I-V-c/H" and "E-V-c/H", associated with the incorrect tags, below their current values. This updates the unigram output-feature scores (scores based on character type) associated with the analyzed sentence.
 The correction unit 13 also raises the score of the output feature "E-V-c/っ(t)/て(te)", associated with the correct tags of the mis-analyzed characters x4 and x5, above its current value, and lowers the score of the output feature "I-V-c/っ(t)/て(te)", associated with the incorrect tag, below its current value. This updates the bigram output-feature scores (scores based on characters) associated with the analyzed sentence.
 The correction unit 13 also raises the score of the output feature "E-V-c/H/H", associated with the correct tags of the mis-analyzed characters x4 and x5, above its current value, and lowers the score of the output feature "I-V-c/H/H", associated with the incorrect tag, below its current value. This updates the bigram output-feature scores (scores based on character type) associated with the analyzed sentence.
 The correction unit 13 also raises the scores of the transition features "B-V-c/E-V-c" and "E-V-c/S-P-sj", associated with the correct tags of the mis-analyzed characters x4 and x5, above their current values, and lowers the scores of the transition features "B-V-c/I-V-c" and "I-V-c/E-V-c", associated with the incorrect tags, below their current values. This updates the transition-feature scores associated with the analyzed sentence.
 As described above, the correction unit 13 may instead evaluate each tag in the correct data as "correct (+1)" while evaluating the tag for each character in the analysis result as "incorrect (-1)", cancel the two evaluations against each other for each tag, and then raise the scores of the features corresponding to tags evaluated as "correct (+1)" and lower the scores of the features corresponding to tags evaluated as "incorrect (-1)".
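The promote/demote processing described above can be sketched as follows. This is a simplified illustration under assumed feature templates (tag/character and tag/character-type for unigram output features, first tag with the character bigram for bigram output features, and tag pairs for transition features); the exact templates are defined elsewhere in the specification, and the `char_types` encoding ("K" for kanji, "H" for hiragana) is hypothetical.

```python
from collections import defaultdict

# Assumed feature templates approximating those in the example above.
def sentence_features(chars, char_types, tags):
    feats = []
    for i in range(len(chars)):
        feats.append(f"{tags[i]}/{chars[i]}")        # unigram output (character)
        feats.append(f"{tags[i]}/{char_types[i]}")   # unigram output (character type)
        if i + 1 < len(chars):
            feats.append(f"{tags[i]}/{chars[i]}/{chars[i + 1]}")            # bigram output (characters)
            feats.append(f"{tags[i]}/{char_types[i]}/{char_types[i + 1]}")  # bigram output (character types)
            feats.append(f"{tags[i]}/{tags[i + 1]}")                        # transition
    return feats

def apply_update(scores, chars, char_types, tags, delta):
    for f in sentence_features(chars, char_types, tags):
        scores[f] += delta   # +1 for the correct data, -1 for the analysis result

scores = defaultdict(float)
chars = ["本", "を", "買", "っ", "て"]
types = ["K", "H", "K", "H", "H"]
correct   = ["S-N-nc", "S-P-k", "B-V-c", "E-V-c", "S-P-sj"]
predicted = ["S-N-nc", "S-P-k", "B-V-c", "I-V-c", "E-V-c"]
apply_update(scores, chars, types, correct, +1.0)
apply_update(scores, chars, types, predicted, -1.0)
# Features shared by both taggings cancel out; only the differing ones remain,
# e.g. "E-V-c/っ" ends at +1, "I-V-c/っ" at -1, and "S-N-nc/本" at 0.
```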
 When updating feature scores, the correction unit 13 may use SCW (Soft Confidence-Weighted learning). SCW treats a parameter with large variance as still uncertain (not yet accurate) and updates it by a large amount, while treating a parameter with small variance as reasonably accurate and updating it by only a small amount. The correction unit 13 determines the amount of change in a score based on the variance of that score, which has a range of values. To run SCW, a Gaussian distribution is introduced over the division model 20 (the vector w), and the correction unit 13 updates not only each score but also, at the same time, the mean and the covariance matrix of the scores. The initial value of each score's mean is 0. In the initial covariance matrix, the diagonal elements are 1 and the other (off-diagonal) elements are 0. Fig. 6(a) shows a case in which a score with large variance is changed greatly (that is, the amount of change in the score is large), and Fig. 6(b) shows a case in which a score with small variance is changed only slightly (that is, the amount of change is small). Figs. 6(a) and 6(b) each show that the covariance matrix Σ is also updated when the score is updated from Sa to Sb. Regarding the covariance-matrix update, the accuracy of the score calculation can be maintained without considering correlations between one feature and another, so in this embodiment only the diagonal elements of the covariance matrix are computed, not the off-diagonal elements. This raises the score update speed.
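The variance-sensitive behavior can be illustrated with a deliberately simplified, diagonal-only update in the spirit of confidence-weighted learning. The exact SCW update rules involve a probabilistic constraint and are more involved; the step-size and variance-shrinkage formulas below are illustrative assumptions only.

```python
# Simplified confidence-weighted style update for a single feature.
# The step size grows with the variance (uncertain scores move more),
# and each update shrinks the variance (confidence increases).
def cw_style_update(mean, variance, direction, rate=1.0):
    new_mean = mean + direction * rate * variance
    new_variance = variance / (1.0 + variance)
    return new_mean, new_variance

m, v = 0.0, 1.0                        # initial mean 0, diagonal variance 1
m, v = cw_style_update(m, v, +1.0)     # large variance → large change
m2, v2 = cw_style_update(m, v, +1.0)   # smaller variance → smaller change
```

Repeated updates thus move a score less and less, which is one way to see why per-feature scores converge quickly under this scheme.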
 The correction unit 13 may also update the feature scores using a method other than SCW. Examples of such methods include the Perceptron, Passive-Aggressive (PA), Confidence-Weighted (CW), and Adaptive Regularization of Weight Vectors (AROW) algorithms.
 After modifying the division model 20 by updating the scores of the features associated with the analyzed sentence, the correction unit 13 generates a completion notification and outputs it to the acquisition unit 11. In this case, the natural language processing system 10 (more specifically, the analysis unit 12) analyzes the next sentence using the modified division model 20.
 Next, the operation of the natural language processing system 10, and the natural language processing method according to this embodiment, will be described with reference to Fig. 7.
 First, the acquisition unit 11 acquires one sentence (step S11). Then the analysis unit 12 performs morphological analysis on that sentence using the division model 20 (step S12, analysis step). This morphological analysis assigns a tag such as "S-N-nc" to each character of the sentence.
 Next, the correction unit 13 obtains the difference between the result of the morphological analysis by the analysis unit 12 and the correct data for that analysis (step S13). If there is no difference (step S14; NO), that is, if the morphological analysis by the analysis unit 12 is completely correct, the correction unit 13 ends the process without modifying the division model 20. On the other hand, if there is a difference between the analysis result and the correct data (step S14; YES), that is, if at least part of the morphological analysis is incorrect, the correction unit 13 modifies the division model 20 by updating the scores of the features associated with the analyzed sentence (step S15, correction step). Specifically, the correction unit 13 raises the score of each feature associated with the correct tag corresponding to an incorrect tag above its current value, and lowers the score of each feature associated with the incorrect tag below its current value.
 When the processing in the correction unit 13 is completed, the flow returns to step S11 (see step S16). The acquisition unit 11 acquires the next sentence (step S11), and the analysis unit 12 performs morphological analysis on it (step S12). If the division model 20 was modified (step S15) while processing the previous sentence, the analysis unit 12 performs the morphological analysis using the modified division model 20. The correction unit 13 then executes the processing from step S13 onward. This loop continues as long as there are sentences to process (see step S16).
 An example of an algorithm describing the operation of the natural language processing system 10 is shown below.
Initialize w_1
For t = 1, 2, ...
  Receive instance x_t
  Predict structure ŷ_t based on w_t
  Receive correct structure y_t
  If ŷ_t ≠ y_t, update
    w_(t+1) = update(w_t, y_t, +1)
    w_(t+1) = update(w_t, ŷ_t, -1)
 The first line of the algorithm initializes the division model 20 (the variable w_1); this processing sets, for example, each feature's score to 0. The For loop on the second line indicates that the processing from the third line onward is executed one sentence at a time. The third line acquires a sentence x_t, corresponding to step S11 above. The fourth line assigns a tag to each character by performing morphological analysis based on the division model 20 (w_t) at that point, corresponding to step S12 above; ŷ_t denotes the analysis result. The fifth line acquires the correct data y_t for the morphological analysis of the sentence x_t. The sixth line updates (modifies) the division model 20 when there is a difference between the analysis result ŷ_t and the correct data y_t. The seventh line learns the correct data y_t as a positive example, and the eighth line learns the erroneous analysis result ŷ_t as a negative example. The processing on the seventh and eighth lines corresponds to step S15 above.
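The algorithm above can be rendered as a runnable sketch. Here `predict` is a placeholder for the system's actual decoder, and `featurize` stands in for the output/transition feature extraction; both are assumptions for illustration, with the update reduced to a bare structured-perceptron form.

```python
from collections import defaultdict

def update(w, features, delta):
    # Add delta to the score of every feature fired by the tagging.
    for f in features:
        w[f] += delta

def train_stream(instances, featurize, predict):
    w = defaultdict(float)                      # Initialize w_1 (all scores 0)
    for chars, gold_tags in instances:          # For t = 1, 2, ...
        pred_tags = predict(w, chars)           # Predict structure ŷ_t based on w_t
        if pred_tags != gold_tags:              # If ŷ_t ≠ y_t, update
            update(w, featurize(chars, gold_tags), +1.0)  # correct data as positive example
            update(w, featurize(chars, pred_tags), -1.0)  # analysis result as negative example
    return w
```

Because each sentence triggers at most one constant-size pass over its own features, total update time grows roughly linearly with the number of sentences, as the description notes.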
 Next, a natural language processing program P1 for realizing the natural language processing system 10 will be described with reference to Fig. 8.
 The natural language processing program P1 comprises a main module P10, an acquisition module P11, an analysis module P12, and a correction module P13.
 The main module P10 is the part that exercises overall control over the morphological analysis and its related processing. The functions realized by executing the acquisition module P11, the analysis module P12, and the correction module P13 are the same as those of the acquisition unit 11, the analysis unit 12, and the correction unit 13 described above, respectively.
 The natural language processing program P1 may be provided in a form fixedly recorded on a tangible recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. Alternatively, the natural language processing program P1 may be provided over a communication network as a data signal superimposed on a carrier wave.
 As described above, a natural language processing system according to one aspect of the present invention comprises: an analysis unit that, using a division model obtained by machine learning with one or more items of training data, performs morphological analysis on one sentence and thereby sets, on each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, the division model including scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags for the sentence, raises the scores of the output features and transition features associated with the correct tag corresponding to an incorrect tag above their current values, and lowers the scores of the output features and transition features associated with the incorrect tag below their current values, thereby modifying the division model used by the analysis unit for the morphological analysis of the next sentence.
 A natural language processing method according to one aspect of the present invention is a natural language processing method executed by a natural language processing system comprising a processor, the method comprising: an analysis step of performing, using a division model obtained by machine learning with one or more items of training data, morphological analysis on one sentence and thereby setting, on each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, the division model including scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements; and a correction step of comparing the tags indicated by the analysis result obtained in the analysis step with correct data indicating the correct tags for the sentence, raising the scores of the output features and transition features associated with the correct tag corresponding to an incorrect tag above their current values, and lowering the scores of the output features and transition features associated with the incorrect tag below their current values, thereby modifying the division model used in the analysis step for the morphological analysis of the next sentence.
 A natural language processing program according to one aspect of the present invention causes a computer to function as: an analysis unit that, using a division model obtained by machine learning with one or more items of training data, performs morphological analysis on one sentence and thereby sets, on each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, the division model including scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags for the sentence, raises the scores of the output features and transition features associated with the correct tag corresponding to an incorrect tag above their current values, and lowers the scores of the output features and transition features associated with the incorrect tag below their current values, thereby modifying the division model used by the analysis unit for the morphological analysis of the next sentence.
 In these aspects, each time one sentence is morphologically analyzed, the analysis result is compared with the correct data, and the division model is modified based on the difference between them. Because the division model is modified one sentence at a time, the time required to modify it when processing multiple sentences grows only roughly linearly with the number of sentences, so the division model for morphological analysis can be modified automatically within a bounded time (in other words, within a predictable range of time).
 In addition, raising the feature scores for correctly assigned tags and lowering the feature scores for incorrectly assigned tags improves the accuracy of the morphological analysis of the next sentence.
 In a natural language processing system according to another aspect, the divided elements may be characters. By processing each character using character-level knowledge (output features and transition features), morphological analysis can be performed without a word dictionary, which generally becomes very large. Moreover, because the division model is modified one sentence at a time using character-level rather than word-level knowledge, the next sentence can be analyzed with high accuracy even if its domain or character differs from every sentence analyzed so far. That is, a natural language processing system according to one aspect of the present invention adapts to sentences from unknown domains or with unknown properties.
 In a natural language processing system according to another aspect, each output-feature score and each transition-feature score may have a range of values, a variance may be set for each score, and the correction unit may determine, based on the variance of each score, the amount by which that score is raised or lowered. Using this approach allows the score of each feature to converge quickly.
 The present invention has been described above in detail based on its embodiments. The present invention is, however, not limited to the above embodiments, and can be modified in various ways without departing from its gist.
 In general, the number of features contained in the division model 20 grows with the number of characters handled, so for languages with many characters, such as Japanese and Chinese, the division model 20 becomes very large, and the storage capacity it requires becomes very large as well. A technique called feature hashing may therefore be introduced to convert individual features into numerical values with a hash function. Converting the characters and character strings that form part of a feature into numbers is particularly effective. Hashing the transition features, on the other hand, contributes little to compressing the division model 20 and may actually slow down processing. Accordingly, only the output features may be hashed, leaving the transition features unhashed. As for the hash function, a single function may be used, or different hash functions may be used for the output features and the transition features.
 In this case, the division model 20 stores data on features in which individual characters are represented by numbers. For example, the character 「本」 (hon) is converted to the number 34, and the character 「を」 (wo) is converted to the number 4788. This conversion yields a bounded set of features. Feature hashing can assign the same number to multiple characters or character strings, but the probability that frequently occurring characters or strings are assigned the same number is very low, so such collisions can be ignored.
 That is, in a natural language processing system according to another aspect, the division model may include output features converted to numbers by a hash function. Handling characters as numbers saves the memory capacity required to store the division model.
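Feature hashing as described can be sketched as follows. The choice of hash function (SHA-1), digest length, and bucket count (2^20) are illustrative assumptions, not values taken from the specification.

```python
import hashlib

NUM_BUCKETS = 2 ** 20   # bounded feature space (illustrative size)

def feature_index(feature: str) -> int:
    """Map an arbitrary feature string to a bounded integer index."""
    digest = hashlib.sha1(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

# The weight vector stays bounded no matter how many distinct
# characters or character strings ever appear in the input.
weights = [0.0] * NUM_BUCKETS
weights[feature_index("S-N-nc/本")] += 1.0
```

Two distinct feature strings can collide in the same bucket, but as the description notes, collisions between frequently occurring features are rare enough to ignore.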
 The analysis unit 12 may perform morphological analysis using features with relatively high scores while ignoring features with relatively low scores. Techniques for ignoring low-scoring features include Forward-Backward Splitting (FOBOS) and feature quantization.
 FOBOS is a technique that compresses scores toward 0 via regularization (for example, L1 regularization). Using FOBOS makes it possible to ignore features whose score is at or below a predetermined value (for example, features whose score is 0 or close to 0).
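The L1 step at the core of FOBOS is the soft-thresholding operator; a minimal sketch follows, in which the regularization strength λ (`lam`) is an illustrative parameter.

```python
def soft_threshold(score, lam):
    """FOBOS L1 proximal step: shrink a score toward 0, clipping small scores to exactly 0."""
    if score > lam:
        return score - lam
    if score < -lam:
        return score + lam
    return 0.0

# Scores whose magnitude is at most lam become exactly 0,
# so the corresponding features can be dropped from the model.
```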
 Feature quantization converts a feature's score to an integer by multiplying its fractional value by 10^n (where n is a natural number of 1 or more). For example, multiplying the score 0.123456789 by 1000 and taking the integer part gives the score 123. Quantizing the scores saves the memory capacity required to store them as text. This technique also makes it possible to ignore features whose score is at or below a predetermined value (for example, features whose integerized score is 0 or close to 0). For example, if features Fa and Fb have scores of 0.0512 and 0.0003 respectively, multiplying these scores by 1000 and taking the integer part gives 51 for Fa and 0 for Fb. In that case, the analysis unit 12 performs the morphological analysis without using feature Fb.
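The quantization step can be sketched directly from the numbers in this example (n = 3, i.e. multiplying by 1000 and truncating):

```python
def quantize(score, n=3):
    """Convert a fractional score to an integer by multiplying by 10**n and truncating."""
    return int(score * 10 ** n)

print(quantize(0.123456789))  # → 123
print(quantize(0.0512))       # → 51  (feature Fa is kept)
print(quantize(0.0003))       # → 0   (feature Fb is ignored)
```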
 The regularization or quantization processing is executed by, for example, the correction unit 13, another functional element in the natural language processing system 10, or a computer system separate from the natural language processing system 10. When the correction unit 13 performs the regularization or quantization, it does so once after a set of sentences (for example, a reasonably large number of sentences) has been morphologically analyzed in the natural language processing system 10 and the division model 20 has been modified many times.
 That is, in a natural language processing system according to another aspect, the analysis unit may perform the morphological analysis without using features whose score has been reduced to a predetermined value or below by regularization or quantization. Not using features with relatively low scores (for example, features whose score becomes 0, or close to 0, through regularization or quantization) reduces the data size of the division model and shortens the morphological analysis time.
 In the above embodiment, the analysis unit 12 divides a sentence into individual characters and sets a tag on each character, but the divided elements may be words rather than characters. In that case, the analysis unit may perform the morphological analysis using a word dictionary together with a division model that holds feature scores for words rather than characters.
 As described above, the natural language processing system according to the present invention can be applied to morphological analysis of any language.
 DESCRIPTION OF REFERENCE NUMERALS: 10: natural language processing system; 11: acquisition unit; 12: analysis unit; 13: correction unit; 20: division model; P1: natural language processing program; P10: main module; P11: acquisition module; P12: analysis module; P13: correction module.

Claims (7)

  1.  A natural language processing system comprising:
     an analysis unit that, using a division model obtained by machine learning with one or more items of training data, performs morphological analysis on one sentence and thereby sets, on each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, the division model including scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements; and
     a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags for the sentence, raises the scores of the output features and the transition features associated with the correct tag corresponding to an incorrect tag above their current values, and lowers the scores of the output features and the transition features associated with the incorrect tag below their current values, thereby modifying the division model used by the analysis unit for the morphological analysis of the next sentence.
  2.  The natural language processing system according to claim 1, wherein the divided elements are characters.
  3.  The natural language processing system according to claim 1 or 2, wherein the division model includes the output features converted to numbers by a hash function.
  4.  The natural language processing system according to any one of claims 1 to 3, wherein each of the output-feature scores and the transition-feature scores has a range of values and a variance is set for each score, and
     the correction unit determines, on the basis of the variance of a score, the amount of change of that score when raising or lowering it.
  5.  The natural language processing system according to any one of claims 1 to 4, wherein the analysis unit performs the morphological analysis without using a feature whose score has become equal to or less than a predetermined value through regularization or quantization.
  6.  A natural language processing method executed by a natural language processing system comprising a processor, the method comprising:
     an analysis step of performing morphological analysis on a sentence by using a division model obtained through machine learning with one or more pieces of training data, thereby setting, for each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, wherein the division model includes scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags assigned to two consecutive divided elements; and
     a correction step of comparing the tags indicated by the analysis result obtained in the analysis step with correct data indicating the correct tags of the sentence, raising the score of the output feature and the score of the transition feature related to the correct tag corresponding to an incorrect tag above their current values, and lowering the score of the output feature and the score of the transition feature related to the incorrect tag below their current values, thereby correcting the division model used in the morphological analysis of the next sentence in the analysis step.
  7.  A natural language processing program causing a computer to function as:
     an analysis unit that performs morphological analysis on a sentence by using a division model obtained through machine learning with one or more pieces of training data, thereby setting, for each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, wherein the division model includes scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags assigned to two consecutive divided elements; and
     a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, raises the score of the output feature and the score of the transition feature related to the correct tag corresponding to an incorrect tag above their current values, and lowers the score of the output feature and the score of the transition feature related to the incorrect tag below their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
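The correction described in claims 1, 6, and 7 can be read as a perceptron-style online update over two feature tables. The following is a minimal sketch of that reading, not the patented implementation: the division model stores output-feature scores keyed by (element, tag) and transition-feature scores keyed by (previous tag, tag), and after each analysed sentence the scores tied to correct tags are raised while those tied to the predicted incorrect tags are lowered. The tag labels and the sample sentence are hypothetical.

```python
# Sketch (assumed reading of claims 1/6/7): perceptron-style correction of a
# division model holding output-feature and transition-feature scores.
from collections import defaultdict

class DivisionModel:
    def __init__(self):
        self.output = defaultdict(float)      # (element, tag) -> score
        self.transition = defaultdict(float)  # (prev_tag, tag) -> score

    def correct(self, elements, predicted_tags, correct_tags, step=1.0):
        """Raise scores of correct-tag features, lower those of incorrect ones,
        so the corrected model is used for the next sentence's analysis."""
        prev_pred, prev_gold = None, None
        for elem, pred, gold in zip(elements, predicted_tags, correct_tags):
            if pred != gold:
                # raise the output and transition features of the correct tag ...
                self.output[(elem, gold)] += step
                self.transition[(prev_gold, gold)] += step
                # ... and lower those related to the incorrect tag
                self.output[(elem, pred)] -= step
                self.transition[(prev_pred, pred)] -= step
            prev_pred, prev_gold = pred, gold

model = DivisionModel()
# character-level divided elements (claim 2); tag labels are hypothetical
model.correct(list("今日は晴れ"),
              ["N-B", "N-I", "P-B", "N-B", "V-B"],   # predicted (partly wrong)
              ["N-B", "N-I", "P-B", "V-B", "V-I"])   # correct data
print(model.output[("晴", "V-B")])   # raised above its initial value 0.0
```

A full implementation would pair this update with a Viterbi-style decoder that sums output and transition scores to pick the best tag sequence for each sentence.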
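Claim 3 has the output features "numericalized by a hash function," which is commonly done so that scores live in a fixed-size array indexed by hash bucket rather than a string-keyed dictionary. A hypothetical illustration follows; the hash function, bucket count, and key encoding are assumptions, as the patent does not specify them.

```python
# Hypothetical sketch of claim 3: hashing an (element, tag) output feature
# into a fixed-size score table (the "hashing trick").
import hashlib

NUM_BUCKETS = 2 ** 20  # assumed table size, not specified in the patent

def feature_index(element: str, tag: str) -> int:
    """Map an (element, tag) output feature to an integer bucket."""
    key = f"{element}\t{tag}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

scores = [0.0] * NUM_BUCKETS
i = feature_index("晴", "V-B")
scores[i] += 1.0  # score updates address the bucket, not the raw string pair
```

The trade-off is the usual one for feature hashing: memory use is bounded and key strings need not be stored, at the cost of occasional bucket collisions between distinct features.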
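Claims 4 and 5 add two refinements: the amount by which a score changes is determined from a per-score variance (in the spirit of confidence-weighted learning), and features whose scores have shrunk to or below a threshold through regularization or quantization are excluded from later analyses. The sketch below is only one plausible concretization; the scaling rule, decay factor, and threshold are assumptions, not taken from the patent.

```python
# Assumed sketch of claims 4 and 5: variance-scaled score updates and
# pruning of features whose scores fall to or below a threshold.
class ScoredFeature:
    def __init__(self, score=0.0, variance=1.0):
        self.score = score
        self.variance = variance

    def update(self, direction, base_step=1.0, decay=0.9):
        # the change amount is scaled by the current variance (claim 4);
        # the exact scaling and decay are illustrative assumptions
        self.score += direction * base_step * self.variance
        self.variance *= decay  # confident features move less next time

def active_features(table, threshold=0.01):
    # claim 5: skip features whose score is at or below a predetermined
    # value (e.g. after regularization/quantization) during analysis
    return {k: f for k, f in table.items() if abs(f.score) > threshold}

table = {("晴", "V-B"): ScoredFeature(), ("れ", "N-B"): ScoredFeature(0.005)}
table[("晴", "V-B")].update(+1)
print(len(active_features(table)))  # the near-zero feature is pruned
```

Pruning low-scoring features this way shrinks the model and speeds up decoding, since the analyzer never touches features that cannot meaningfully affect the tag sequence.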
PCT/JP2014/082428 2014-04-29 2014-12-08 Natural language processing system, natural language processing method, and natural language processing program WO2015166606A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020167028427A KR101729461B1 (en) 2014-04-29 2014-12-08 Natural language processing system, natural language processing method, and natural language processing program
CN201480076197.5A CN106030568B (en) 2014-04-29 2014-12-08 Natural language processing system, natural language processing method and natural language processing program
JP2015512822A JP5809381B1 (en) 2014-04-29 2014-12-08 Natural language processing system, natural language processing method, and natural language processing program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461985615P 2014-04-29 2014-04-29
US61/985615 2014-04-29

Publications (1)

Publication Number Publication Date
WO2015166606A1 (en) 2015-11-05

Family

ID=54358353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/082428 WO2015166606A1 (en) 2014-04-29 2014-12-08 Natural language processing system, natural language processing method, and natural language processing program

Country Status (5)

Country Link
JP (1) JP5809381B1 (en)
KR (1) KR101729461B1 (en)
CN (1) CN106030568B (en)
TW (1) TWI567569B (en)
WO (1) WO2015166606A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021559B (en) * 2018-02-05 2022-05-03 威盛电子股份有限公司 Natural language understanding system and semantic analysis method
CN110020434B (en) * 2019-03-22 2021-02-12 北京语自成科技有限公司 Natural language syntactic analysis method
KR102352481B1 (en) * 2019-12-27 2022-01-18 동국대학교 산학협력단 Sentence analysis device using morpheme analyzer built on machine learning and operating method thereof
CN116153516B (en) * 2023-04-19 2023-07-07 山东中医药大学第二附属医院(山东省中西医结合医院) Disease big data mining analysis system based on distributed computing

Citations (2)

Publication number Priority date Publication date Assignee Title
JPH09114825A (en) * 1995-10-19 1997-05-02 Ricoh Co Ltd Method and device for morpheme analysis
JP2003099426A (en) * 2001-09-25 2003-04-04 Canon Inc Natural language processor, its control method and program

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN100530171C (en) * 2005-01-31 2009-08-19 日电(中国)有限公司 Dictionary learning method and devcie
CN100533431C (en) * 2005-09-21 2009-08-26 富士通株式会社 Natural language component identifying correcting apparatus and method based on morpheme marking
CN102681981A (en) * 2011-03-11 2012-09-19 富士通株式会社 Natural language lexical analysis method, device and analyzer training method
JP5795985B2 (en) 2012-03-30 2015-10-14 Kddi株式会社 Morphological analyzer, morphological analysis method, and morphological analysis program

Non-Patent Citations (1)

Title
Bird, Steven et al., Natural Language Processing with Python, 8 November 2010 (2010-11-08), pages 480-490, ISBN: 978-4-87311-470-5 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN112101030A (en) * 2020-08-24 2020-12-18 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for establishing term mapping model and realizing standard word mapping
CN112101030B (en) * 2020-08-24 2024-01-26 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for establishing term mapping model and realizing standard word mapping
CN113204667A (en) * 2021-04-13 2021-08-03 北京百度网讯科技有限公司 Method and device for training audio labeling model and audio labeling
CN113204667B (en) * 2021-04-13 2024-03-22 北京百度网讯科技有限公司 Method and device for training audio annotation model and audio annotation
JP7352249B1 (en) 2023-05-10 2023-09-28 株式会社Fronteo Information processing device, information processing system, and information processing method

Also Published As

Publication number Publication date
CN106030568A (en) 2016-10-12
KR20160124237A (en) 2016-10-26
TWI567569B (en) 2017-01-21
CN106030568B (en) 2018-11-06
JP5809381B1 (en) 2015-11-10
TW201544976A (en) 2015-12-01
KR101729461B1 (en) 2017-04-21
JPWO2015166606A1 (en) 2017-04-20

Similar Documents

Publication Publication Date Title
JP5809381B1 (en) Natural language processing system, natural language processing method, and natural language processing program
Abandah et al. Automatic diacritization of Arabic text using recurrent neural networks
JP5901001B1 (en) Method and device for acoustic language model training
Duan et al. Online spelling correction for query completion
US20120262461A1 (en) System and Method for the Normalization of Text
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
KR101544690B1 (en) Word division device, word division method, and word division program
CN109117474B (en) Statement similarity calculation method and device and storage medium
US11423237B2 (en) Sequence transduction neural networks
JP6817556B2 (en) Similar sentence generation method, similar sentence generation program, similar sentence generator and similar sentence generation system
US20220391647A1 (en) Application-specific optical character recognition customization
US20200279079A1 (en) Predicting probability of occurrence of a string using sequence of vectors
WO2019208507A1 (en) Language characteristic extraction device, named entity extraction device, extraction method, and program
KR20230061001A (en) Apparatus and method for correcting text
CN113255331B (en) Text error correction method, device and storage medium
JP2015169947A (en) Model learning device, morphological analysis device and method
Yang et al. Spell Checking for Chinese.
CN111090720B (en) Hot word adding method and device
JP5676517B2 (en) Character string similarity calculation device, method, and program
US20180033425A1 (en) Evaluation device and evaluation method
Kim et al. Reliable automatic word spacing using a space insertion and correction model based on neural networks in Korean
JP6303508B2 (en) Document analysis apparatus, document analysis system, document analysis method, and program
Yan et al. A novel approach to improve the Mongolian language model using intermediate characters
CN111667813B (en) Method and device for processing file
Meško et al. Checking the writing of commas in Slovak

Legal Events

Date Code Title Description
ENP Entry into the national phase
    Ref document number: 2015512822
    Country of ref document: JP
    Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 14890522
    Country of ref document: EP
    Kind code of ref document: A1

ENP Entry into the national phase
    Ref document number: 20167028427
    Country of ref document: KR
    Kind code of ref document: A

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: pct application non-entry in european phase
    Ref document number: 14890522
    Country of ref document: EP
    Kind code of ref document: A1