WO2015166606A1 - Natural language processing system, natural language processing method, and natural language processing program - Google Patents


Info

Publication number
WO2015166606A1
WO2015166606A1 (PCT/JP2014/082428)
Authority
WO
WIPO (PCT)
Prior art keywords
score
tag
sentence
feature
natural language
Prior art date
Application number
PCT/JP2014/082428
Other languages
French (fr)
Japanese (ja)
Inventor
萩原 正人 (Masato Hagiwara)
Original Assignee
楽天株式会社 (Rakuten, Inc.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 楽天株式会社
Priority to KR1020167028427A priority Critical patent/KR101729461B1/en
Priority to CN201480076197.5A priority patent/CN106030568B/en
Priority to JP2015512822A priority patent/JP5809381B1/en
Publication of WO2015166606A1 publication Critical patent/WO2015166606A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/268 Morphological analysis
    • G06F 40/205 Parsing
    • G06F 40/226 Validation

Definitions

  • One aspect of the present invention relates to a natural language processing system, a natural language processing method, and a natural language processing program.
  • Patent Document 1 discloses a morphological analyzer that decomposes input text data into morphemes, obtains position information corresponding to the decomposed morphemes by referring to a morpheme dictionary, and determines a morpheme string from the candidate morpheme strings produced by the decomposition, using a cost function based on the position information.
  • Morphological analysis is executed using a division model that contains a score for each feature. Since the division model, which amounts to the knowledge used for morphological analysis, is generally fixed in advance, it is naturally very difficult to obtain correct results when analyzing a sentence from a new field, or with new properties, that the division model does not cover. On the other hand, if the division model is corrected by a method such as machine learning, the time required for the correction may grow unpredictably. It is therefore desirable to correct the division model for morphological analysis automatically within a fixed time.
  • According to one aspect, a natural language processing system includes an analysis unit that divides a sentence by executing morphological analysis on it using a division model obtained by machine learning from one or more items of training data, and that sets at least a tag indicating the part of speech of each word obtained by the division. The division model includes output feature scores, each indicating the correspondence between a divided element and a tag, and transition feature scores, each indicating a combination of two tags corresponding to two consecutive divided elements. The system further includes a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct answer data indicating the correct tags of the sentence, sets the scores of the output features and transition features related to the correct tags corresponding to incorrect tags higher than their current values, sets the scores of the output features and transition features related to the incorrect tags lower than their current values, and thereby corrects the division model used by the analysis unit for the morphological analysis of the next sentence.
  • According to another aspect, a natural language processing method is executed by a natural language processing system including a processor. The method includes an analysis step that divides a sentence by executing morphological analysis on it using a division model obtained by machine learning from one or more items of training data, where the division model includes output feature scores indicating the correspondence between a divided element and a tag and transition feature scores indicating a combination of two tags corresponding to two consecutive divided elements.
  • According to another aspect, a natural language processing program causes a computer to function as an analysis unit that divides a sentence by executing morphological analysis on it using a division model obtained by machine learning from one or more items of training data and that sets at least a tag indicating the part of speech of each word obtained by the division, where the division model includes output feature scores indicating the correspondence between a divided element and a tag and transition feature scores indicating a combination of two tags corresponding to two consecutive divided elements; and as a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct answer data indicating the correct tags of the sentence, sets the scores of the output features and transition features related to the correct tags corresponding to incorrect tags higher than their current values, sets the scores of the output features and transition features related to the incorrect tags lower than their current values, and thereby corrects the division model used by the analysis unit for the morphological analysis of the next sentence.
  • In these aspects, the analysis result is compared with the correct answer data, and the division model is corrected based on the difference between them. As a result, the division model for morphological analysis can be corrected automatically within a certain time (in other words, within a predictable time range).
  • the natural language processing system 10 is a computer system that performs morphological analysis.
  • the morpheme analysis is a process of dividing a sentence into morpheme strings and determining the part of speech of each morpheme.
  • a sentence is a unit of linguistic expression that represents one complete statement, and is expressed by a character string.
  • a morpheme is the smallest language unit that has meaning.
  • a morpheme sequence is a sequence of one or more morphemes obtained by dividing a sentence into one or more morphemes.
  • Part of speech is the division of words by grammatical function or form.
  • the natural language processing system 10 performs morphological analysis on individual sentences using the division model 20.
  • One of the features of the natural language processing system 10 is that, while the division model 20 is being learned, the division model 20 is corrected each time morphological analysis is performed on a sentence.
  • Once the division model 20 is finalized, the natural language processing system 10 including the confirmed division model 20 is provided to the user. The user can then have the natural language processing system 10 execute morphological analysis; at this stage, the morphological analysis is executed without correcting the division model 20.
  • The “division model” in this specification is the set of criteria (cues) used when dividing a sentence into one or more morphemes, and is represented by a score for each feature. The division model is obtained by machine learning using one or more items of training data.
  • the training data is data indicating at least a sentence divided into words and a part of speech of each word obtained by dividing the sentence.
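As a concrete (hypothetical) illustration, one training example of this kind can be represented as a list of (word, part-of-speech) pairs. The patent does not prescribe any particular data format; the structure and single-letter POS codes below are assumptions for illustration only:

```python
# One hypothetical training example: the sentence "本を買って" divided into
# words, each paired with a part-of-speech code (N = noun, P = particle,
# V = verb). The data format itself is an assumption for illustration.
training_example = [
    ("本", "N"),    # "book"
    ("を", "P"),    # case particle
    ("買っ", "V"),  # "buy"
    ("て", "P"),    # conjunctive particle
]

# The segmented words concatenate back into the original sentence.
sentence = "".join(word for word, _ in training_example)
```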
  • a feature is a clue for obtaining a correct result in morphological analysis. In general, what is used as a feature (cue) is not limited.
  • the feature score is a numerical value representing the likelihood of the feature.
  • FIG. 1 briefly shows the concept of processing in the natural language processing system 10 according to the present embodiment.
  • a gear M in FIG. 1 indicates execution of morphological analysis.
  • the natural language processing system 10 divides the sentence s 1 into one or more morphemes by executing a morphological analysis using the division model w 1 .
  • the natural language processing system 10 divides a sentence into one or more morphemes by dividing the sentence into individual characters and executing character unit processing.
  • the divided element to be processed is a character.
  • the natural language processing system 10 indicates the result of morphological analysis by setting a tag for each character (divided element).
  • a “tag” in this specification is a label indicating an attribute or function of a character. Tags will be described in more detail later.
  • When the morphological analysis of the sentence s1 has been executed, the natural language processing system 10 accepts data indicating the correct answer of that analysis (correct answer data), compares the analysis result with the correct answer data, and corrects the division model w1 to obtain a new division model w2. Specifically, the natural language processing system 10 evaluates the entire analysis result as wrong when at least part of the tagging in the morphological analysis of the sentence s1 is wrong. It then evaluates the feature corresponding to each tag in the correct answer data as “correct (+1)” and sets that feature's score higher than its current value, and evaluates the feature corresponding to each tag in the analysis result as “wrong (-1)” and sets that feature's score lower than its current value.
  • Alternatively, the natural language processing system 10 may evaluate each tag in the correct answer data as “correct (+1)” and the tag of each character in the analysis result as “wrong (-1)”, cancel the two evaluations against each other for each character, and then raise the scores of the features corresponding to the tags still evaluated as “correct (+1)” and lower the scores of the features corresponding to the tags still evaluated as “wrong (-1)”.
  • Suppose, for example, that the tags in the correct answer data are ta, tb, tc, td, te and the tags in the analysis result are ta, tg, th, td, te. The natural language processing system 10 evaluates the features corresponding to the correct tags ta, tb, tc, td, te as “correct (+1)” and sets their scores higher than the current values, and evaluates the features corresponding to the result tags ta, tg, th, td, te as “wrong (-1)” and sets their scores lower than the current values.
  • When the two evaluations are offset, the scores of the features corresponding to the tags ta, td, and te are unchanged from before the update, the scores of the features corresponding to the correct tags tb and tc are raised, and the scores of the features corresponding to the incorrect tags tg and th are lowered.
  • When executing morphological analysis on the next sentence s2, the natural language processing system 10 uses the division model w2. It then accepts correct answer data for the morphological analysis of the sentence s2, compares the analysis result with that correct answer data, and corrects the division model w2 in the same way that it corrected w1, obtaining a new division model w3.
  • In this way, the natural language processing system 10 corrects the division model each time one sentence (s1, s2, ..., st) is processed (w1 → w2, w2 → w3, ..., wt → wt+1) and uses the corrected division model in the morphological analysis of the next sentence.
  • Such a method of updating the model every time one piece of training data is processed is also referred to as “online learning” or “online machine learning”.
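The per-sentence correction cycle can be sketched as follows. This is a minimal illustration under simplifying assumptions: a greedy per-character tagger stands in for the real Viterbi-based analysis, the model is a plain dictionary of (tag, character) scores, and the toy tag set is hypothetical; none of this code appears in the patent.

```python
def analyze(chars, model, tags=("N", "P", "V")):
    """Greedy per-character tagger: pick the tag whose output-feature
    score for this character is highest (a stand-in for the real
    Viterbi-based analysis unit)."""
    return [max(tags, key=lambda t: model.get((t, x), 0.0)) for x in chars]

def online_learning(examples, model):
    """Process one sentence at a time: analyze with the current model,
    compare against the gold tags, and nudge the scores of mistagged
    positions (w_t -> w_{t+1}) before moving on to the next sentence."""
    for chars, gold in examples:
        pred = analyze(chars, model)
        for x, p, g in zip(chars, pred, gold):
            if p != g:
                model[(g, x)] = model.get((g, x), 0.0) + 1.0  # raise correct tag's feature
                model[(p, x)] = model.get((p, x), 0.0) - 1.0  # lower wrong tag's feature
    return model
```

After one pass over a sentence whose tagging was wrong, the corrected model already prefers the gold tag for that character.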
  • For example, the natural language processing system 10 divides the Japanese sentence “本を買って (hon wo katte)”, which corresponds to the English sentence “I buy a book”, into five characters: x1 “本 (hon)”, x2 “を (wo)”, x3 “買 (ka)”, x4 “っ (t)”, and x5 “て (te)”. The natural language processing system 10 then sets a tag for each character by executing morphological analysis.
  • The tag is a combination of the character's appearance mode within a word, the part of speech of that word, and the subclass of the part of speech, and is expressed with alphabetic codes such as “S-N-nc”.
  • The appearance mode indicates whether a character forms one word by itself or in combination with other characters and, when the character is part of a word consisting of two or more characters, where within that word the character is located.
  • the appearance mode is indicated by one of S, B, I, and E.
  • the appearance mode “S” indicates that the character becomes a single word by itself.
  • the appearance mode “B” indicates that the character is positioned at the beginning of a word composed of two or more characters.
  • the appearance mode “I” indicates that the character is located in the middle of a word composed of three or more characters.
  • the appearance mode “E” indicates that the character is located at the end of a word composed of two or more characters.
  • the example of FIG. 2 indicates that the characters x 1 , x 2 , and x 5 are each a single word, and the characters x 3 and x 4 form one word.
  • The scheme for the appearance mode is not limited.
  • In this embodiment the scheme “SBIEO” is used, but, for example, the scheme “IOB2”, which is well known to those skilled in the art, may be used instead.
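Deriving the S/B/I/E appearance-mode tags from a word-segmented sentence can be sketched as follows (a small helper written for illustration, not code from the patent):

```python
def position_tags(words):
    """Assign the S/B/I/E appearance-mode tag to every character:
    S = single-character word; B/I/E = beginning/inside/end of a word
    that consists of two or more characters."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags
```

For the example sentence, the two-character word 買っ yields B and E, matching the tagging shown in FIG. 2.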
  • Examples of parts of speech include nouns, verbs, particles, adjectives, adjective verbs, conjunctions, and the like.
  • the noun is represented by “N”
  • the particle is represented by “P”
  • the verb is represented by “V”.
  • FIG. 2 indicates that the character x 1 is a noun, the character x 2 is a particle, the word consisting of the characters x 3 and x 4 is a verb, and the character x 5 is a particle.
  • the part-of-speech subclass indicates the subordinate concept of the corresponding part-of-speech.
  • nouns can be further classified into general nouns and proper nouns, and particles can be further classified into case particles, conjunctive particles, auxiliary particles, and the like.
  • the general noun is represented by “nc”
  • the proper noun is represented by “np”
  • the case particle is represented by “k”
  • the conjunctive particle is represented by “sj”
  • the general verb is represented by “c”.
  • FIG. 2 shows that the character x1 is a general noun, the character x2 is a case particle, the word consisting of the characters x3 and x4 is a general verb, and the character x5 is a conjunctive particle.
  • the scores of the features stored in the division model 20 are the scores of output features and the scores of transition features.
  • the output feature is a clue indicating the correspondence between a tag and a character or character type.
  • the output feature is a clue indicating what kind of character or character type is likely to correspond to what kind of tag.
  • the output feature corresponds to the feature representation of the output matrix of the hidden Markov model.
  • an output feature of a unigram (a character string made up of only one character) and an output feature of a bigram (a character string made up of two consecutive characters) are used.
  • the character type is a character type in a certain language.
  • Japanese character types include kanji, hiragana, katakana, alphabet (uppercase and lowercase), Arabic numerals, kanji numerals, and the middle dot (・).
  • the character type is represented by alphabets. For example, “C” indicates kanji, “H” indicates hiragana, “K” indicates katakana, “L” indicates alphabets, and “A” indicates Arabic numerals.
  • FIG. 2 indicates that the characters x 1 and x 3 are kanji characters and the characters x 2 , x 4 , and x 5 are hiragana characters.
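A simplified character-type classifier for the types listed above might look like this. The Unicode ranges are an implementation assumption (the embodiment instead mentions a comparison table or regular expressions, described later), and this sketch folds kanji numerals into “C” and everything else, including the middle dot, into a catch-all “O”:

```python
def char_type(ch):
    """Classify a character into the types used above:
    C = kanji, H = hiragana, K = katakana, L = alphabet, A = Arabic numeral.
    Kanji numerals fall under 'C' here; anything unclassified
    (punctuation, the middle dot, etc.) is returned as 'O'."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "C"   # CJK unified ideographs (kanji)
    if 0x3041 <= code <= 0x3096:
        return "H"   # hiragana
    if 0x30A1 <= code <= 0x30FA:
        return "K"   # katakana
    if ch.isascii() and ch.isalpha():
        return "L"   # Latin alphabet
    if ch.isascii() and ch.isdigit():
        return "A"   # Arabic numeral
    return "O"       # other
```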
  • the unigram output feature related to the character is a clue indicating the correspondence between the tag t and the character x. Further, the output feature of the unigram regarding the character type is a clue indicating the correspondence between the tag t and the character type c.
  • the likelihood score s of the correspondence between the tag t and the character x is denoted by ⟨t/x, s⟩.
  • a likelihood score s of correspondence between the tag t and the character type c is denoted by ⁇ t / c, s ⁇ .
  • the division model 20 includes scores regarding a plurality of tags for one character or character type.
  • When data on all types of tags are prepared for each character or character type, the division model 20 also includes scores for combinations of a tag and a character or character type that cannot actually occur in the grammar. However, the scores of such grammatically impossible features are relatively low.
  • Data indicating features that do not exist in the grammar can also be prepared. For example, it is impossible in Japanese grammar for a word written with Arabic numerals to be a particle, but data can still be prepared for a feature such as “S-P-k/A”.
  • the bigram output feature related to characters is a clue indicating the correspondence between the tag t and the character string xi xi+1.
  • the bigram output feature related to character types is a clue indicating the correspondence between the tag t and the character type string ci ci+1.
  • the likelihood score s of the tag t and the characters xi and xi+1 is denoted by ⟨t/xi/xi+1, s⟩.
  • the likelihood score s of the tag t and the character types ci and ci+1 is denoted by ⟨t/ci/ci+1, s⟩.
  • A transition feature is a cue indicating the combination of the tag ti of a character xi and the tag ti+1 of the next character xi+1 (a combination consisting of the two tags corresponding to two consecutive characters).
  • This transition feature is a bigram feature.
  • the transition feature corresponds to the feature representation of the transition matrix of the hidden Markov model.
  • the likelihood score s of the combination of the tag t i and the tag t i + 1 is represented by ⁇ t i / t i + 1 , s ⁇ .
  • the division model 20 also stores data on combinations of two tags that cannot actually occur in the grammar.
  • Examples of transition feature scores:
    ⟨S-N-nc/S-P-k, 0.0512⟩
    ⟨E-N-nc/E-N-nc, 0.0000⟩
    ⟨S-P-k/B-V-c, 0.0425⟩
    ⟨B-V-c/I-V-c, 0.0387⟩
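One natural (hypothetical) way to hold such ⟨feature, score⟩ pairs is a dictionary keyed by feature strings; the four entries below mirror the transition-feature examples above, and the lookup helper is an illustration rather than the patent's actual interface:

```python
# A toy division model: feature string -> score. The four entries mirror
# the transition-feature examples listed above; output features (such as
# "S-N-nc/本") would live in the same table.
division_model = {
    "S-N-nc/S-P-k": 0.0512,   # noun word followed by a case particle
    "E-N-nc/E-N-nc": 0.0000,  # grammatically impossible pair: score stays ~0
    "S-P-k/B-V-c": 0.0425,    # case particle followed by the start of a verb
    "B-V-c/I-V-c": 0.0387,    # beginning then middle of a multi-character verb
}

def transition_score(model, tag1, tag2):
    """Look up the transition-feature score for two consecutive tags;
    combinations not present in the model default to 0.0."""
    return model.get(tag1 + "/" + tag2, 0.0)
```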
  • the natural language processing system 10 includes one or more computers. When the natural language processing system 10 includes a plurality of computers, each functional element of the natural language processing system 10 described later is realized by distributed processing.
  • the type of individual computer is not limited. For example, a stationary or portable personal computer (PC) may be used, a workstation may be used, or a portable terminal such as a high-functional portable telephone (smart phone), a portable telephone, or a personal digital assistant (PDA). May be used.
  • the natural language processing system 10 may be constructed by combining various types of computers. When a plurality of computers are used, these computers are connected via a communication network such as the Internet or an intranet.
  • FIG. 3 shows a general hardware configuration of each computer 100 in the natural language processing system 10.
  • Each computer 100 includes a CPU (processor) 101 that executes an operating system, application programs, and the like; a main storage unit 102 consisting of ROM and RAM; an auxiliary storage unit 103 consisting of a hard disk, flash memory, or the like; a communication control unit 104 consisting of a network card or a wireless communication module; an input device 105 such as a keyboard and mouse; and an output device 106 such as a display or printer.
  • the hardware modules to be mounted differ depending on the type of the computer 100.
  • a stationary PC and a workstation often include a keyboard, a mouse, and a monitor as an input device and an output device, but in a smartphone, a touch panel often functions as an input device and an output device.
  • Each functional element of the natural language processing system 10 described later is realized by loading predetermined software onto the CPU 101 or the main storage unit 102 and, under the control of the CPU 101, operating the communication control unit 104, the input device 105, the output device 106, and the like, and reading and writing data in the main storage unit 102 or the auxiliary storage unit 103. The data and databases necessary for processing are stored in the main storage unit 102 or the auxiliary storage unit 103.
  • the division model 20 is stored in the storage device in advance.
  • the specific mounting method of the division model 20 is not limited.
  • the division model 20 may be prepared as a relational database or a text file.
  • the installation location of the division model 20 is not limited.
  • the division model 20 may exist in the natural language processing system 10 or in another computer system different from the natural language processing system 10.
  • the natural language processing system 10 accesses the division model 20 via a communication network.
  • the division model 20 is a set of scores of various features.
  • Each score is updated little by little by the processing of the natural language processing system 10 described later; after a certain number of sentences have been processed, differences arise between the individual feature scores, as described above.
  • the natural language processing system 10 includes an acquisition unit 11, an analysis unit 12, and a correction unit 13 as functional components.
  • the natural language processing system 10 accesses the division model 20 as necessary.
  • Each functional element will be described below. In the present embodiment, the description will be made on the assumption that the natural language processing system 10 processes a Japanese sentence.
  • the language of the sentence processed by the natural language processing system 10 is not limited to Japanese, and sentences in other languages such as Chinese can be analyzed.
  • the acquisition unit 11 is a functional element that acquires a sentence to be divided into morpheme strings.
  • the acquisition method of the sentence by the acquisition part 11 is not limited.
  • the acquisition unit 11 may collect sentences from any website on the Internet (so-called crawling).
  • the acquisition unit 11 may read a sentence stored in advance in a database in the natural language processing system 10, or read a sentence stored in a database on a computer system other than the natural language processing system 10 via a communication network. It may be accessed and read by.
  • the acquisition unit 11 may accept a sentence input by a user of the natural language processing system 10.
  • the acquisition unit 11 acquires one sentence and outputs it to the analysis unit 12.
  • the acquisition unit 11 acquires the next sentence and outputs it to the analysis unit 12.
  • the analysis unit 12 is a functional element that performs morphological analysis on individual sentences.
  • the analysis unit 12 executes the following process every time one sentence is input.
  • the analysis unit 12 divides one sentence into individual characters and determines the character type of each character.
  • the analysis unit 12 stores in advance a comparison table between characters and character types, or a regular expression for determining a character type, and determines a character type using the comparison table or regular expression.
  • The analysis unit 12 determines a tag for each character using the Viterbi algorithm. For the i-th character, the analysis unit 12 determines, for each tag (candidate tag) that may finally be selected, which of the candidate tags of the (i-1)-th character gives the highest score (also referred to as the “connection score”).
  • The connection score is the total of the various scores related to the tag being evaluated (the unigram output feature score, the bigram output feature score, and the transition feature score). For example, when the i-th tag is “S-N-nc”, the analysis unit 12 may find that the connection score is highest when the (i-1)-th tag is “S-P-k”.
  • The analysis unit 12 stores every such highest-scoring combination (for example, (S-P-k, S-N-nc), (E-N-nc, S-V-c), and so on).
  • The analysis unit 12 executes this processing while advancing character by character from the first character to the end-of-sentence symbol.
  • When the end of the sentence is reached, the combination of the last character's tag and the end-of-sentence symbol with the highest connection score is determined uniquely (for example, the combination (E-V-c, EOS)).
  • This fixes the tag of the last character (for example, to “E-V-c”), which in turn fixes the tag of the second character from the end.
  • In this way, the tags are fixed one after another, in order from the end of the sentence back to the beginning.
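The search just described can be sketched roughly as follows, with the connection score reduced to an output score plus a transition score. The bigram output features of the embodiment are omitted for brevity, and `out_score`/`trans_score` are placeholder callbacks, not interfaces from the patent:

```python
def viterbi(chars, tags, out_score, trans_score):
    """Viterbi search as described above: for each character and each
    candidate tag, remember the best-scoring predecessor tag; then
    recover the tag sequence by walking the back-pointers from the
    end of the sentence back to the beginning."""
    # best[i][t] = best total score of any tagging ending at character i with tag t
    best = [{t: out_score(chars[0], t) for t in tags}]
    back = [{}]
    for i in range(1, len(chars)):
        best.append({})
        back.append({})
        for t in tags:
            # connection score = previous best + transition score
            prev, s = max(
                ((p, best[i - 1][p] + trans_score(p, t)) for p in tags),
                key=lambda ps: ps[1],
            )
            best[i][t] = s + out_score(chars[i], t)
            back[i][t] = prev  # remember the highest-scoring predecessor
    # backtrack: fix tags in order from the end of the sentence
    t = max(tags, key=lambda t: best[-1][t])
    path = [t]
    for i in range(len(chars) - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    return path[::-1]
```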
  • FIG. 5 schematically shows such processing by the analysis unit 12.
  • FIG. 5 shows an example of tagging a sentence consisting of four characters.
  • the tags are simplified as “A1”, “B2”, etc., and the number of candidate tags for each character is three.
  • a thick line in FIG. 5 indicates a combination of a tag determined to have the highest connection score obtained by processing a sentence from the front.
  • the tag C1 has the highest connection score with the tag B1
  • the tag C2 has the highest connection score with the tag B1
  • the tag C3 has the highest connection score with the tag B2.
  • the analysis unit 12 determines that the tags of the first to fourth characters are A2, B1, C2, and D1, respectively.
  • the analysis unit 12 outputs a sentence with each character tagged as an analysis result.
  • the analysis unit 12 outputs the analysis result to at least the correction unit 13 because the analysis result is necessary for correcting the divided model 20.
  • the analysis unit 12 may perform further output.
  • The analysis unit 12 may display the analysis result on a monitor, print it on a printer, write it to a text file, or store it in a storage device such as a memory or a database.
  • the analysis unit 12 may transmit the analysis result to an arbitrary computer system other than the natural language processing system 10 via a communication network.
  • the correction unit 13 is a functional element that corrects the divided model 20 based on the difference between the analysis result obtained from the analysis unit 12 and the correct answer of the morphological analysis of the sentence.
  • “modification of the division model” is processing for changing the score of at least one feature in the division model. In some cases, even if an attempt is made to change a certain score, the value may not change as a result.
  • the correction unit 13 executes the following processing every time one analysis result is input.
  • the correction unit 13 acquires correct answer data corresponding to the input analysis result, that is, data indicating the correct answer of the morphological analysis of the sentence processed by the analysis unit 12.
  • the correct answer data in the present embodiment is data indicating tags (combination of appearance mode, part of speech, and part of speech subclass) of each character forming a sentence. This correct answer data is created manually.
  • the method for acquiring correct data by the correction unit 13 is not limited.
  • The correction unit 13 may read correct answer data stored in advance in a database in the natural language processing system 10, or may access and read, via a communication network, correct answer data stored in a database on a computer system other than the natural language processing system 10.
  • the correction unit 13 may accept correct answer data input by the user of the natural language processing system 10.
  • the correction unit 13 compares the input analysis result with the correct data and identifies the difference between them.
  • When the analysis result completely matches the correct answer data, the correction unit 13 ends the process without correcting the division model 20, generates a completion notification, and outputs it to the acquisition unit 11.
  • This completion notification is a signal indicating that the processing in the correction unit 13 has been completed and the morphological analysis for the next sentence can be executed.
  • When the analysis result completely matches the correct answer data, there is no need to correct the division model 20, at least at this point, so the natural language processing system 10 (more specifically, the analysis unit 12) uses the current division model 20 as-is to analyze the next sentence.
  • Suppose that the correct answer data for the sentence “本を買って”, whose characters are expressed as x1 to x5, is as follows.
  • x1: ⟨S-N-nc⟩, x2: ⟨S-P-k⟩, x3: ⟨B-V-c⟩, x4: ⟨E-V-c⟩, x5: ⟨S-P-sj⟩
  • When the analysis result completely matches this correct answer data, the correction unit 13 determines that they match and outputs a completion notification to the acquisition unit 11 without correcting the division model 20.
  • When the analysis result differs from the correct answer data, the correction unit 13 updates at least some of the scores of the division model 20. More specifically, the correction unit 13 sets the feature scores related to the correct tags corresponding to the incorrect tags higher than their current values, and sets the feature scores related to the incorrect tags lower than their current values.
  • Suppose, for example, that the analysis unit 12 obtains the following analysis result for the Japanese sentence “本を買って (buy a book)”.
  • x1: ⟨S-N-nc⟩, x2: ⟨S-P-k⟩, x3: ⟨B-V-c⟩, x4: ⟨I-V-c⟩, x5: ⟨E-V-c⟩
  • In this case, the correction unit 13 evaluates the feature corresponding to each tag in the correct answer data as “correct (+1)” and sets its score higher than the current value, then evaluates the feature corresponding to each tag in the analysis result as “wrong (-1)” and sets its score lower than the current value. Taking into account the parts that cancel out, the correction unit 13 can be said to finally perform the following processing.
  • The correction unit 13 makes the scores for the output features “E-V-c/っ(t)” and “S-P-sj/て(te)”, which relate to the correct tags of the mistagged characters x4 and x5, larger than their current values, and makes the scores for the output features “I-V-c/っ(t)” and “E-V-c/て(te)”, which relate to the incorrect tags, smaller than their current values. This updates the unigram output feature scores (scores for characters) related to the analyzed sentence.
  • Similarly, the correction unit 13 makes the scores for the output features “E-V-c/H” and “S-P-sj/H”, which relate to the correct tags of the mistagged characters x4 and x5, larger than their current values, and makes the scores for the output features “I-V-c/H” and “E-V-c/H”, which relate to the incorrect tags, smaller than their current values. This updates the unigram output feature scores (scores for character types) related to the analyzed sentence.
  • the correction unit 13 makes the score for the output feature "EVc/t(t)/te(te)" related to the correct tags of the incorrectly analyzed characters x4 and x5 higher than its current value,
  • and makes the score for the output feature "IVc/t(t)/te(te)" related to the incorrect tags lower than its current value. As a result, the bigram output feature scores (the scores relating to the characters) associated with the analyzed sentence are updated.
  • the correction unit 13 makes the score for the output feature "EVc/H/H" related to the correct tags of the incorrectly analyzed characters x4 and x5 higher than its current value,
  • and makes the score for the output feature "IVc/H/H" related to the incorrect tag lower than its current value. As a result, the bigram output feature scores (the scores relating to the character types) associated with the analyzed sentence are updated.
  • the correction unit 13 makes the scores for the transition features "BVc/EVc" and "EVc/SP-sj" related to the correct tags of the incorrectly analyzed characters x4 and x5 higher than their current values,
  • and makes the scores for the transition features "BVc/IVc" and "IVc/EVc" related to the incorrect tags lower than their current values. As a result, the transition feature scores associated with the analyzed sentence are updated.
  • Alternatively, the correction unit 13 may evaluate each tag in the correct data as "correct (+1)" while evaluating the tag for each character in the analysis result as "wrong (−1)", cancel the two evaluation results for each tag, and then raise the scores of the features corresponding to tags evaluated as "correct (+1)" and lower the scores of the features corresponding to tags evaluated as "wrong (−1)".
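The offsetting update described above can be sketched in a structured-perceptron style. This is a minimal illustration, not the patent's implementation; the feature encoding, the placeholder characters, and the fixed step size of 1 are assumptions:

```python
from collections import defaultdict

def extract_features(chars, tags):
    """Enumerate unigram output features (character/tag pairs) and
    transition features (consecutive tag pairs) for a tagged sentence."""
    feats = []
    for ch, tag in zip(chars, tags):
        feats.append(("out", ch, tag))      # output feature
    for t1, t2 in zip(tags, tags[1:]):
        feats.append(("trans", t1, t2))     # transition feature
    return feats

def update(scores, chars, predicted, correct, step=1.0):
    """Raise scores of features seen with the correct tags (+1) and
    lower those seen with the predicted, wrong tags (-1).
    Features shared by both tag sequences cancel out."""
    if predicted == correct:
        return scores
    delta = defaultdict(float)
    for f in extract_features(chars, correct):
        delta[f] += step
    for f in extract_features(chars, predicted):
        delta[f] -= step
    for f, d in delta.items():
        if d != 0.0:
            scores[f] = scores.get(f, 0.0) + d
    return scores

scores = {}
chars = ["x1", "x2", "x3", "x4", "x5"]
correct   = ["SN-nc", "SP-k", "BVc", "EVc", "SP-sj"]
predicted = ["SN-nc", "SP-k", "BVc", "IVc", "EVc"]
update(scores, chars, predicted, correct)
# Features for x1..x3 appear in both sequences and cancel out;
# only the scores around x4 and x5 change.
```

The cancellation in `delta` mirrors the text: features related to correct tags that were also predicted correctly end up with a net change of zero.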
  • the correction unit 13 may use SCW (Soft Confidence-Weighted Learning) when updating the scores.
  • SCW is a method that regards a parameter with a large variance as not yet confident (inaccurate) and updates it greatly, while regarding a parameter with a small variance as already somewhat accurate and updating it only slightly.
  • the correction unit 13 determines the amount of change in each score based on the variance of that score, each score having a range of values.
  • a Gaussian distribution is introduced into the division model 20 (vector w), and the correction unit 13 simultaneously updates the average and covariance matrix of each score in addition to updating each score.
  • the average initial value of each score is zero.
  • FIG. 6A shows a mode in which a score with a large variance is changed greatly (that is, the amount of change in the score is large), and FIG. 6B shows a mode in which a score with a small variance is changed only slightly (that is, the amount of change in the score is small).
  • FIGS. 6A and 6B also show that the covariance matrix Σ is updated when the score is updated from Sa to Sb.
  • Since the accuracy of the score calculation can be maintained without considering the correlation between one feature and other features, in this embodiment only the diagonal elements of the covariance matrix are calculated, without calculating the off-diagonal elements. This raises the speed of the score updates.
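The variance-scaled update with a diagonal-only covariance can be illustrated with a much-simplified sketch. This is not the actual SCW update rule; the initial variance, the shrink factor, and the feature name are illustrative assumptions:

```python
class VarianceScaledScores:
    """Each feature keeps a (mean, variance) pair. A feature with a
    large variance (low confidence) receives a large update; the
    variance shrinks after each update, so the score stabilizes over
    time. Off-diagonal covariance terms are never stored, mirroring
    the diagonal-only choice described above."""

    def __init__(self, init_var=1.0, shrink=0.9):
        self.mean = {}            # current score of each feature
        self.var = {}             # large variance = not yet confident
        self.init_var = init_var
        self.shrink = shrink

    def update(self, feature, direction):
        """direction: +1 for correct-tag features, -1 for wrong-tag."""
        v = self.var.get(feature, self.init_var)
        self.mean[feature] = self.mean.get(feature, 0.0) + direction * v
        self.var[feature] = v * self.shrink   # more confident now

m = VarianceScaledScores()
m.update("EVc/ta", +1)   # first update: large step (variance 1.0)
m.update("EVc/ta", +1)   # second update: smaller step (variance 0.9)
```

Storing one variance per feature rather than a full covariance matrix is what keeps the update cost linear in the number of touched features.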
  • the correcting unit 13 may update the feature score using a method other than SCW.
  • methods other than SCW include the Perceptron, Passive-Aggressive (PA), Confidence-Weighted Learning (CW), and Adaptive Regularization of Weight Vectors (AROW).
  • When the division model 20 has been corrected by updating the feature scores related to the analyzed sentence, the correction unit 13 generates a completion notification and outputs it to the acquisition unit 11. In this case, the natural language processing system 10 (more specifically, the analysis unit 12) analyzes the next sentence using the modified division model 20.
  • the acquisition unit 11 acquires one sentence (step S11).
  • the analysis unit 12 performs morphological analysis on the sentence using the division model 20 (step S12, analysis step).
  • a tag such as “SN-nc” is given to each character of the sentence.
  • the correction unit 13 obtains the difference between the result of the morphological analysis by the analysis unit 12 and the correct data for the morphological analysis (step S13).
  • If there is no difference (step S14: NO), the correction unit 13 ends the process without correcting the division model 20.
  • If there is a difference (step S14: YES), the correction unit 13 corrects the division model 20 by updating the feature scores related to the analyzed sentence (step S15, correction step). Specifically, the correction unit 13 makes the feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the feature scores related to the incorrect tag lower than their current values.
  • When the processing in the correction unit 13 is completed, the processing returns to step S11 (see step S16).
  • the acquisition unit 11 acquires the next sentence (step S11), and the analysis unit 12 performs morphological analysis on the sentence (step S12).
  • If the division model 20 was modified (step S15) during the processing of the previous sentence, the analysis unit 12 performs the morphological analysis using the modified division model 20.
  • the correction unit 13 then executes the processes from step S13 onward. This repetition continues as long as sentences to be processed exist (see step S16).
  • the first line in the algorithm means initialization of the division model 20 (variable w 1 ), and for example, the score of each feature is set to 0 by this processing.
  • the For loop on the second line indicates that the processes on and after the third line are executed one sentence at a time.
  • the third line means that the sentence xt is acquired and corresponds to step S11 described above.
  • the fourth line shows a process of assigning a tag to each character by performing a morphological analysis based on the division model 20 (w t ) at that time, and corresponds to step S12 described above.
  • y^t indicates the analysis result. The fifth line means that the correct data y_t of the morphological analysis of the sentence x_t is obtained.
  • The sixth line means that the division model 20 is updated (modified) if there is a difference between the analysis result y^t and the correct data y_t.
  • the seventh line indicates that the correct data y_t is learned as a positive example, and the eighth line indicates that the analysis result y^t, which includes an error, is learned as a negative example.
  • the processing on the seventh and eighth lines corresponds to step S15 described above.
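The eight-line algorithm described above can be sketched as the following per-sentence loop. The function names stand in for the acquisition unit, the analysis unit, the correct-data source, and the correction unit, and are assumptions for illustration:

```python
def train_online(sentences, analyze, get_correct, update_model):
    """Sketch of the per-sentence loop in the algorithm above."""
    model = {}                          # line 1: initialize the model w1
    for x_t in sentences:               # lines 2-3: one sentence at a time
        y_hat = analyze(model, x_t)     # line 4: tag using the current w_t
        y_t = get_correct(x_t)          # line 5: obtain correct data y_t
        if y_hat != y_t:                # line 6: only on a mismatch
            update_model(model, x_t, y_t, +1)    # line 7: correct data as positive
            update_model(model, x_t, y_hat, -1)  # line 8: wrong result as negative
    return model

# Toy demo: the "model" maps (character, tag) pairs to scores; this
# analyzer always guesses tag "A" while the correct tag is "B".
def analyze(model, x):
    return [(c, "A") for c in x]

def correct(x):
    return [(c, "B") for c in x]

def upd(model, x, y, sign):
    for c, t in y:
        model[(c, t)] = model.get((c, t), 0) + sign

m = train_online(["ab"], analyze, correct, upd)
```

Because the model is touched only once per mismatching sentence, the total correction time grows roughly linearly with the number of sentences, which is the point made about predictable correction time.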
  • the natural language processing program P1 includes a main module P10, an acquisition module P11, an analysis module P12, and a correction module P13.
  • the main module P10 is a part that comprehensively controls morphological analysis and related processing.
  • the functions realized by executing the acquisition module P11, the analysis module P12, and the correction module P13 are the same as the functions of the acquisition unit 11, the analysis unit 12, and the correction unit 13, respectively.
  • the natural language processing program P1 may be provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, DVD-ROM, or semiconductor memory.
  • the natural language processing program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.
  • A natural language processing system according to one aspect comprises: an analysis unit that, by performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, sets, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, makes the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature scores and transition feature scores related to the incorrect tag lower than their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
  • A natural language processing method according to one aspect is executed by a natural language processing system comprising a processor, and includes: an analysis step of performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, thereby setting, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction step of comparing the tags indicated by the analysis result with correct data indicating the correct tags of the sentence, making the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values and those related to the incorrect tag lower than their current values, thereby correcting the division model used in the morphological analysis of the next sentence.
  • A natural language processing program according to one aspect causes a computer to function as: an analysis unit that, by performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, sets, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, makes the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature scores and transition feature scores related to the incorrect tag lower than their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
  • the analysis result is compared with the correct answer data, and the division model is corrected based on the difference between them.
  • the division model for morphological analysis can be automatically modified within a certain time (in other words, within a predictable time range).
  • the accuracy of the morphological analysis of the next sentence can be increased by increasing the feature score for the correct tag and decreasing the feature score for the tag that was incorrect.
  • the divided element may be a character.
  • morphological analysis can be executed without using a word dictionary, which generally becomes large.
  • In addition, since the division model is corrected for each sentence using character-level knowledge rather than word-level knowledge, high analysis accuracy can be expected even when the next sentence differs in field or nature from the sentences analyzed so far.
  • Each of the output feature scores and the transition feature scores may have a range of values, with a variance set for each score, and the correction unit may determine, based on the variance of each score, the amount of change applied when making the score higher or lower.
  • In a language with many characters, such as Japanese or Chinese, the division model 20 becomes very large,
  • and the storage capacity required for it also becomes very large. Therefore, a technique called feature hashing may be introduced to convert individual features into numerical values using a hash function.
  • the effect of converting into numerical values the characters and character strings that represent part of the features is high.
  • If the transition features are hashed, however, this does not contribute much to compressing the division model 20, and the processing speed may fall. Therefore, only the output features may be hashed, leaving the transition features unhashed. Note that only one type of hash function may be used, or different hash functions may be used for the output features and the transition features.
  • the division model 20 stores data on features in which individual characters are represented by numerical values. For example, the character "hon" (本) is converted to the numerical value 34, and the character "wo" (を) to the numerical value 4788. Through this conversion, a bounded set of features can be formed. Feature hashing may assign the same numerical value to multiple characters or character strings, but it is very unlikely that the same numerical value is assigned to characters or character strings that appear frequently, so collisions can be ignored.
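Feature hashing as described here can be sketched as follows. The hash function (MD5) and the table size are arbitrary illustrative choices; the patent does not specify a particular hash function:

```python
import hashlib

NUM_BUCKETS = 2 ** 20   # bounded feature space (illustrative size)

def hash_feature(feature: str) -> int:
    """Map an arbitrary character/string feature to a bounded integer.
    Different features may collide, but frequent features rarely do,
    so collisions can in practice be ignored."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

# Scores can then live in a bounded array instead of a dictionary
# keyed by raw strings (here only output features are hashed).
scores = [0.0] * NUM_BUCKETS
scores[hash_feature("SN-nc/本")] += 1.0
```

A fixed-size array bounds the model's memory regardless of how many distinct characters appear, which is the point of hashing in character-rich languages.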
  • the division model may include an output feature quantified by a hash function.
  • the analysis unit 12 may perform morphological analysis using a feature having a relatively high score without using a feature having a relatively low score (ignoring such a feature).
  • techniques for ignoring features with relatively low scores include forward-backward splitting (FOBOS) and feature quantization.
  • FOBOS is a method of compressing scores to 0 by regularization (for example, L1 regularization).
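The L1 shrinkage step underlying FOBOS can be sketched as follows; the regularization strength `lam` and the dictionary representation are illustrative assumptions:

```python
def l1_shrink(scores, lam):
    """FOBOS-style proximal step for L1 regularization: move each
    score toward 0 by lam, and set scores within [-lam, lam] to
    exactly 0, so weak features drop out of the model."""
    out = {}
    for f, s in scores.items():
        if s > lam:
            out[f] = s - lam
        elif s < -lam:
            out[f] = s + lam
        # |s| <= lam: the score becomes exactly 0 and the feature is dropped
    return out

pruned = l1_shrink({"f1": 0.8, "f2": 0.05, "f3": -0.3}, lam=0.1)
# f2 is compressed to 0 and removed; f1 and f3 shrink toward 0.
```

Dropping zeroed features shrinks the stored model, which is why the analysis can then ignore features with relatively low scores.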
  • feature quantization is a technique for converting a feature score into an integer by multiplying it by 10^n (n is a natural number of 1 or more) and truncating the fractional part. For example, multiplying the score "0.123456789" by 1000 and truncating gives the integer score "123". By quantizing the scores, the memory capacity required to store them as text can be saved. In addition, this technique makes it possible to ignore features whose score is equal to or less than a predetermined value (for example, features whose score after integerization is 0 or close to 0).
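The quantization described here can be sketched as follows, assuming n = 3 and truncation of the fractional part:

```python
def quantize(score: float, n: int = 3) -> int:
    """Multiply by 10**n and truncate to an integer, as in the
    example where 0.123456789 * 1000 -> 123."""
    return int(score * 10 ** n)

quantize(0.123456789)  # 123
# Features whose quantized score is 0 (i.e. |score| < 10**-n)
# can then be ignored during analysis.
```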
  • the analysis unit 12 performs the morphological analysis without using the feature Fb.
  • the regularization or quantization process is executed by, for example, the correction unit 13, another functional element in the natural language processing system 10, or a computer system different from the natural language processing system 10.
  • When the correction unit 13 performs the regularization or quantization process, it may, for example, execute that process once on the division model 20 each time the natural language processing system 10 has performed morphological analysis on a set of sentences (for example, a certain number of sentences).
  • the analysis unit may execute the morphological analysis without using features with relatively low scores, that is, features whose score becomes equal to or lower than a predetermined value through regularization or quantization (for example, features whose score becomes 0, or close to 0, through regularization or quantization).
  • the analysis unit 12 divides a sentence into individual characters and sets a tag for each character, but the divided element may be a word instead of a character. In that case, the analysis unit may perform the morphological analysis using a word dictionary and a division model that indicates feature scores for words instead of characters.
  • the natural language processing system according to the present invention can be applied to morphological analysis of an arbitrary language.

Abstract

A natural language processing system according to an embodiment is provided with an analysis unit and a correction unit. The analysis unit, by executing morphological analysis on a single sentence using a division model, sets a tag on each divided element obtained by dividing the sentence. The division model includes output feature scores, each indicating the correspondence between a divided element and a tag, and transition feature scores, each indicating a combination of two tags corresponding to two consecutive divided elements. The correction unit compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, and corrects the division model used in the morphological analysis of the next sentence by the analysis unit by increasing the scores of features related to the correct tag corresponding to an incorrect tag while decreasing the scores of features related to the incorrect tag.

Description

Natural language processing system, natural language processing method, and natural language processing program
 One aspect of the present invention relates to a natural language processing system, a natural language processing method, and a natural language processing program.
 As one of the basic techniques of natural language processing, morphological analysis, which divides a sentence into a sequence of morphemes and determines the part of speech of each morpheme, is known. In relation to this, Patent Document 1 below describes a morphological analyzer that decomposes input text data into morphemes, obtains position information corresponding to the decomposed morphemes by referring to a morpheme dictionary, and determines a morpheme sequence from the candidate morpheme sequences obtained by the decomposition, using a cost function based on the position information.
JP 2013-210856 A
 Morphological analysis is executed using a division model that includes a score for each feature. Since the division model, which can be regarded as the knowledge used for morphological analysis, is generally fixed in advance, it is naturally very difficult to obtain correct results when analyzing a sentence that belongs to a new field, or that has a new property, not covered by the division model. On the other hand, if the division model is corrected by a method such as machine learning, the time required for the correction may grow unpredictably. It is therefore desirable to automatically correct the division model for morphological analysis within a certain time.
 A natural language processing system according to one aspect of the present invention comprises: an analysis unit that, by performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, sets, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, makes the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature scores and transition feature scores related to the incorrect tag lower than their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
 A natural language processing method according to one aspect of the present invention is executed by a natural language processing system comprising a processor, and includes: an analysis step of performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, thereby setting, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction step of comparing the tags indicated by the analysis result obtained in the analysis step with correct data indicating the correct tags of the sentence, making the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values and those related to the incorrect tag lower than their current values, thereby correcting the division model used in the morphological analysis of the next sentence in the analysis step.
 A natural language processing program according to one aspect of the present invention causes a computer to function as: an analysis unit that, by performing morphological analysis on one sentence using a division model obtained by machine learning with one or more training data, sets, on each divided element obtained by dividing the sentence, at least a tag indicating the part of speech of a word, the division model including output feature scores indicating correspondences between divided elements and tags and transition feature scores indicating combinations of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, makes the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature scores and transition feature scores related to the incorrect tag lower than their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
 In such an aspect, every time one sentence is morphologically analyzed, the analysis result is compared with the correct data, and the division model is corrected based on the difference between them. By correcting the division model sentence by sentence in this way, the time required to correct the model when processing multiple sentences increases only roughly linearly with the number of sentences, so the division model for morphological analysis can be corrected automatically within a certain time (in other words, within a predictable time range).
 According to one aspect of the present invention, the division model for morphological analysis can be automatically corrected within a certain time.
FIG. 1 is a conceptual diagram of processing in the natural language processing system according to the embodiment. FIG. 2 shows an example of morphological analysis in the embodiment. FIG. 3 shows the hardware configuration of a computer constituting the natural language processing system according to the embodiment. FIG. 4 is a block diagram showing the functional configuration of the natural language processing system according to the embodiment. FIG. 5 conceptually shows an example of tagging. FIGS. 6A and 6B each schematically show an example of score updating. FIG. 7 is a flowchart showing the operation of the natural language processing system according to the embodiment. FIG. 8 shows the configuration of the natural language processing program according to the embodiment.
 Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, identical or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.
 First, the functions and configuration of the natural language processing system 10 according to the embodiment will be described with reference to FIGS. 1 to 5. The natural language processing system 10 is a computer system that performs morphological analysis. Morphological analysis is the process of dividing a sentence into a sequence of morphemes and determining the part of speech of each morpheme. A sentence is a unit of linguistic expression representing one complete statement and is expressed as a character string. A morpheme is the smallest language unit that has meaning. A morpheme sequence is the sequence of one or more morphemes obtained by dividing a sentence into those morphemes. A part of speech is a classification of words by grammatical function or form.
 The natural language processing system 10 performs morphological analysis on individual sentences using the division model 20. One feature of the natural language processing system 10 is that, while the division model 20 is being learned, the division model 20 is corrected each time a sentence is morphologically analyzed. When the correction of the division model 20 is finished, the natural language processing system 10 with the finalized division model 20 is provided to users. A user can have the natural language processing system 10 execute morphological analysis; in that case, the morphological analysis is executed without the division model 20 being corrected. The "division model" in this specification is the set of criteria (cues) used when dividing a sentence into one or more morphemes, and is represented by the score of each feature. The division model is obtained by machine learning using one or more training data. The training data are data indicating at least a sentence divided into words and the part of speech of each word obtained by dividing the sentence. A feature is a cue for obtaining a correct result in morphological analysis; in general, what is used as a feature is not limited. The score of a feature is a numerical value representing the likelihood of that feature.
 FIG. 1 briefly shows the concept of processing in the natural language processing system 10 according to the present embodiment. The gear M in FIG. 1 indicates the execution of morphological analysis. At a certain point, the natural language processing system 10 divides a sentence s1 into one or more morphemes by executing morphological analysis using a division model w1. In the present embodiment, the natural language processing system 10 divides a sentence into individual characters and processes it character by character, thereby dividing the sentence into one or more morphemes. In other words, in the present embodiment, the divided elements to be processed are characters. The natural language processing system 10 indicates the result of the morphological analysis by setting a tag on each character (divided element). A "tag" in this specification is a label indicating an attribute or function of a character. Tags are described in more detail later.
 After executing the morphological analysis, the natural language processing system 10 accepts data indicating the correct answer of the morphological analysis of the sentence s1 (correct data), compares the analysis result with the correct data, and corrects the division model w1 to obtain a new division model w2. Specifically, when at least part of the tagging in the morphological analysis of the sentence s1 is wrong, the natural language processing system 10 evaluates the entire analysis result as wrong. The natural language processing system 10 then evaluates the feature corresponding to each tag in the correct data as "correct (+1)" and makes the score of that feature higher than its current value, and evaluates the feature corresponding to each tag in the analysis result as "wrong (−1)" and makes the score of that feature lower than its current value, thereby obtaining the division model w2. When some of the tags in the analysis result are correct, the two evaluations "correct (+1)" and "wrong (−1)" of the features related to those correct tags cancel out. Therefore, the processing of raising and lowering the feature scores as described above can be said to be processing that raises the scores of the features related to the correct tags corresponding to incorrect tags (the correct tags for the incorrectly analyzed parts) and lowers the scores of the features related to the incorrect tags (the tags of the incorrectly analyzed parts).
 Alternatively, the natural language processing system 10 may evaluate each tag in the correct data as "correct (+1)" while evaluating the tag of each character in the analysis result as "wrong (-1)", cancel the two evaluation results for each tag, and then raise the scores of the features corresponding to the tags evaluated as "correct (+1)" and lower the scores of the features corresponding to the tags evaluated as "wrong (-1)".
 For example, suppose that five characters x_a, x_b, x_c, x_d, and x_e exist in the sentence s1, that the correct tags of the characters x_a, x_b, x_c, x_d, and x_e are t_a, t_b, t_c, t_d, and t_e, respectively, and that the morphological analysis assigns the tags t_a, t_g, t_h, t_d, and t_e to those characters. In this case, the natural language processing system 10 evaluates the features corresponding to the tags t_a, t_b, t_c, t_d, and t_e in the correct data as "correct (+1)" and raises their scores above the current values, and evaluates the features corresponding to the tags t_a, t_g, t_h, t_d, and t_e in the execution result as "wrong (-1)" and lowers their scores below the current values. As a result, the scores of the features corresponding to the tags t_a, t_d, and t_e remain unchanged, the scores of the features corresponding to the correct tags t_b and t_c rise, and the scores of the features corresponding to the incorrect tags t_g and t_h fall.
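This update rule can be sketched as a perceptron-style step. The following is a minimal illustration, not the patent's exact implementation; the feature extraction is reduced here to the tag identity alone, so that the cancellation of matching tags is visible directly.

```python
from collections import defaultdict

def update_model(scores, gold_tags, predicted_tags, step=1.0):
    """Raise the scores of features from the correct data ("correct (+1)")
    and lower those from the analysis result ("wrong (-1)"); features
    appearing in both cancel out."""
    for tag in gold_tags:
        scores[tag] += step
    for tag in predicted_tags:
        scores[tag] -= step
    return scores

scores = defaultdict(float)                    # all scores start at 0
gold = ["t_a", "t_b", "t_c", "t_d", "t_e"]     # correct tags
pred = ["t_a", "t_g", "t_h", "t_d", "t_e"]     # analysis result
update_model(scores, gold, pred)
# t_a, t_d, t_e cancel; t_b, t_c rise; t_g, t_h fall
```

After the update, only the scores of the disagreeing tags have moved, matching the example above.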
 When executing the morphological analysis on the next sentence s2, the natural language processing system 10 uses the division model w2. The natural language processing system 10 then accepts the correct data for the morphological analysis of the sentence s2, compares the execution result with that correct data, and corrects the division model w2 in the same manner as the division model w1 was corrected, thereby obtaining a new division model w3.
 In this way, the natural language processing system 10 corrects the division model each time it processes one sentence (s1, s2, ..., st), that is, w1 → w2, w2 → w3, ..., wt → wt+1, and uses the corrected division model in the morphological analysis of the next sentence. Such a technique of updating the model each time one piece of training data is processed is also called "online learning" or "online machine learning".
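The loop just described can be sketched as follows. This is an assumed skeleton with toy stand-ins for the analysis and update steps (here the "model" simply memorizes the gold taggings it was corrected on), shown only to make the one-sentence-at-a-time flow concrete.

```python
def online_train(sentences, gold_taggings, analyze, update, model):
    """Online learning: the model corrected after sentence s_t is the one
    used when analyzing the next sentence s_(t+1)."""
    for sentence, gold in zip(sentences, gold_taggings):
        predicted = analyze(sentence, model)   # uses the current model w_t
        if predicted != gold:                  # corrected only on a mistake
            model = update(model, sentence, gold, predicted)  # w_(t+1)
    return model

# Toy stand-ins (hypothetical, for illustration only):
def toy_analyze(sentence, model):
    return model.get(sentence, [])

def toy_update(model, sentence, gold, predicted):
    new_model = dict(model)
    new_model[sentence] = gold
    return new_model

model = online_train(["s1", "s2"], [["N"], ["V"]],
                     toy_analyze, toy_update, {})
```

After training, the toy model reproduces the gold taggings for both sentences.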
 FIG. 2 shows an example of the result of morphological analysis by the natural language processing system 10. In this example, the natural language processing system 10 divides the Japanese sentence 「本を買って (hon wo katte)」, which corresponds to the English sentence "I bought a book", into five characters x1: 「本 (hon)」, x2: 「を (wo)」, x3: 「買 (ka)」, x4: 「っ (t)」, and x5: 「て (te)」. The natural language processing system 10 then sets a tag for each character by executing morphological analysis. In the present embodiment, a tag is a combination of the appearance mode of a character within a word, the part of speech of that word, and the subclass of that part of speech, and is expressed with alphabetic characters, such as "S-N-nc".
 The appearance mode is information indicating whether a character forms a word by itself or forms a word in combination with other characters and, when the character is part of a word consisting of two or more characters, where in the word the character is located. In the present embodiment, the appearance mode is indicated by one of S, B, I, and E. The appearance mode "S" indicates that the character forms a word by itself. The appearance mode "B" indicates that the character is located at the beginning of a word consisting of two or more characters. The appearance mode "I" indicates that the character is located in the middle of a word consisting of three or more characters. The appearance mode "E" indicates that the character is located at the end of a word consisting of two or more characters. The example of FIG. 2 shows that the characters x1, x2, and x5 are each a single word and that the characters x3 and x4 form one word.
 The scheme for the appearance mode is not limited. The present embodiment uses the scheme "SBIEO", but, for example, the scheme "IOB2", which is well known to those skilled in the art, may be used instead.
 Examples of parts of speech include nouns, verbs, particles, adjectives, adjectival verbs, and conjunctions. In the present embodiment, a noun is represented by "N", a particle by "P", and a verb by "V". The example of FIG. 2 shows that the character x1 is a noun, the character x2 is a particle, the word consisting of the characters x3 and x4 is a verb, and the character x5 is a particle.
 A part-of-speech subclass indicates a subordinate concept of the corresponding part of speech. For example, nouns can be further classified into common nouns and proper nouns, and particles can be further classified into case particles, conjunctive particles, binding particles, and the like. In the present embodiment, a common noun is represented by "nc", a proper noun by "np", a case particle by "k", a conjunctive particle by "sj", and a common verb by "c". The example of FIG. 2 shows that the character x1 is a common noun, the character x2 is a case particle, the word consisting of the characters x3 and x4 is a common verb, and the character x5 is a conjunctive particle.
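For illustration, given a word list annotated with part of speech and subclass, tags of the form appearance-mode/POS/subclass could be generated as follows. This is a sketch under the assumption that the segmentation and annotations are already known; the tag vocabulary matches the example of FIG. 2.

```python
def tags_for_word(word, pos, subclass):
    """Assign S/B/I/E appearance modes to the characters of one word and
    combine each with the part of speech and its subclass, e.g. "S-N-nc"."""
    if len(word) == 1:
        modes = ["S"]                                        # single-character word
    else:
        modes = ["B"] + ["I"] * (len(word) - 2) + ["E"]      # begin/inside/end
    return [f"{m}-{pos}-{subclass}" for m in modes]

# 「本を買って」 segmented as 本 / を / 買っ / て
words = [("本", "N", "nc"), ("を", "P", "k"), ("買っ", "V", "c"), ("て", "P", "sj")]
tags = [t for w, p, s in words for t in tags_for_word(w, p, s)]
```

The resulting sequence is the per-character tagging used in the examples of this description.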
 The feature scores stored in the division model 20 are scores of emission features (output features) and scores of transition features.
 An emission feature is a clue indicating the correspondence between a tag and a character or character type; in other words, it is a clue indicating which characters or character types tend to correspond to which tags. Emission features correspond to the feature representation of the emission matrix of a hidden Markov model. The present embodiment uses emission features of unigrams (strings consisting of a single character) and emission features of bigrams (strings consisting of two consecutive characters).
 Here, a character type is a category of characters in a given language. Japanese character types include, for example, kanji, hiragana, katakana, alphabetic characters (uppercase and lowercase), Arabic numerals, kanji numerals, and the middle dot (・). In the present embodiment, character types are represented by alphabetic characters. For example, "C" indicates kanji, "H" indicates hiragana, "K" indicates katakana, "L" indicates alphabetic characters, and "A" indicates Arabic numerals. The example of FIG. 2 shows that the characters x1 and x3 are kanji and the characters x2, x4, and x5 are hiragana.
 A unigram emission feature for a character is a clue indicating the correspondence between a tag t and a character x, and a unigram emission feature for a character type is a clue indicating the correspondence between a tag t and a character type c. In the present embodiment, the likelihood score s of the correspondence between a tag t and a character x is written {t/x, s}, and the likelihood score s of the correspondence between a tag t and a character type c is written {t/c, s}. The division model 20 contains scores for a plurality of tags for a single character or character type. When data on every kind of tag is prepared for a single character or character type, the division model 20 also contains scores for combinations of a tag and a character or character type that cannot actually occur grammatically. The scores of such grammatically impossible features, however, become relatively low.
 The following shows examples of emission feature scores for the Japanese character 「本 (hon)」. It is impossible in Japanese grammar for this character to be a particle, but, as described above, data can also be prepared for grammatically nonexistent features such as "S-P-k/本(hon)".
{S-N-nc/本(hon), 0.0420}
{B-N-nc/本(hon), 0.0310}
{S-P-k/本(hon), 0.0003}
{B-V-c/本(hon), 0.0031}
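A division model holding such unigram emission scores could be represented as a mapping from feature keys to scores. This is a sketch using the example values above; any feature absent from the model implicitly scores 0.

```python
from collections import defaultdict

# {t/x, s}: emission scores keyed as "tag/character" (example values from
# the text; a missing feature defaults to a score of 0.0)
emission = defaultdict(float, {
    "S-N-nc/本": 0.0420,
    "B-N-nc/本": 0.0310,
    "S-P-k/本": 0.0003,
    "B-V-c/本": 0.0031,
})

def emission_score(tag, char):
    """Look up the likelihood score of the tag/character correspondence."""
    return emission[f"{tag}/{char}"]
```

The grammatically impossible combination "S-P-k/本" is present but carries a relatively low score, as described above.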
 The following shows examples of emission feature scores for the character type "kanji".
{S-N-nc/C, 0.0255}
{E-N-np/C, 0.0488}
{S-P-k/C, 0.0000}
{B-V-c/C, 0.0299}
 For character types as well, data indicating grammatically nonexistent features can be prepared. For example, it is impossible in Japanese grammar for a word written in Arabic numerals to be a particle, but data can also be prepared for a feature such as "S-P-k/A".
 A bigram emission feature for characters is a clue indicating the correspondence between a tag t and a character string x_i x_i+1, and a bigram emission feature for character types is a clue indicating the correspondence between a tag t and a character-type sequence c_i c_i+1. In the present embodiment, the likelihood score s of a tag t and characters x_i and x_i+1 is written {t/x_i/x_i+1, s}, and the likelihood score s of a tag t and character types c_i and c_i+1 is written {t/c_i/c_i+1, s}. When data on every possible tag is prepared for a single bigram, the division model 20 also stores data on combinations of a tag and a bigram that cannot actually occur grammatically.
 The following shows examples of emission feature scores for the bigram 「本を (hon wo)」.
{S-N-nc/本(hon)/を(wo), 0.0420}
{B-N-nc/本(hon)/を(wo), 0.0000}
{S-P-k/本(hon)/を(wo), 0.0001}
{B-V-c/本(hon)/を(wo), 0.0009}
 The following shows examples of emission feature scores for a bigram in which hiragana follows kanji.
{S-N-nc/C/H, 0.0455}
{E-N-np/C/H, 0.0412}
{S-P-k/C/H, 0.0000}
{B-V-c/C/H, 0.0054}
 A transition feature is a clue indicating a combination of the tag t_i of a character x_i and the tag t_i+1 of the next character x_i+1 (a combination of the two tags corresponding to two consecutive characters). A transition feature is a bigram feature and corresponds to the feature representation of the transition matrix of a hidden Markov model. In the present embodiment, the likelihood score s of the combination of a tag t_i and a tag t_i+1 is written {t_i/t_i+1, s}. When transition feature data is prepared for every possible combination, the division model 20 also stores data on combinations of two tags that cannot actually occur grammatically.
 The following shows some examples of transition feature scores.
{S-N-nc/S-P-k, 0.0512}
{E-N-nc/E-N-nc, 0.0000}
{S-P-k/B-V-c, 0.0425}
{B-V-c/I-V-c, 0.0387}
 The natural language processing system 10 includes one or more computers; when it includes a plurality of computers, the functional elements of the natural language processing system 10 described below are realized by distributed processing. The type of each computer is not limited. For example, a stationary or portable personal computer (PC), a workstation, or a mobile terminal such as a high-function mobile phone (smartphone), a mobile phone, or a personal digital assistant (PDA) may be used. Alternatively, the natural language processing system 10 may be built by combining various types of computers. When a plurality of computers are used, they are connected via a communication network such as the Internet or an intranet.
 FIG. 3 shows the general hardware configuration of each computer 100 in the natural language processing system 10. The computer 100 includes a CPU (processor) 101 that executes an operating system, application programs, and the like, a main storage unit 102 composed of a ROM and a RAM, an auxiliary storage unit 103 composed of a hard disk, a flash memory, or the like, a communication control unit 104 composed of a network card or a wireless communication module, an input device 105 such as a keyboard and a mouse, and an output device 106 such as a display and a printer. Naturally, the mounted hardware modules differ depending on the type of the computer 100. For example, a stationary PC and a workstation often include a keyboard, a mouse, and a monitor as input and output devices, whereas in a smartphone a touch panel often functions as both the input device and the output device.
 Each functional element of the natural language processing system 10 described below is realized by loading predetermined software onto the CPU 101 or the main storage unit 102, operating the communication control unit 104, the input device 105, the output device 106, and the like under the control of the CPU 101, and reading and writing data in the main storage unit 102 or the auxiliary storage unit 103. The data and databases necessary for the processing are stored in the main storage unit 102 or the auxiliary storage unit 103.
 The division model 20, on the other hand, is stored in a storage device in advance. The specific implementation of the division model 20 is not limited; for example, it may be prepared as a relational database or as a text file. The location of the division model 20 is also not limited; for example, the division model 20 may exist inside the natural language processing system 10 or in another computer system different from the natural language processing system 10. When the division model 20 is in another computer system, the natural language processing system 10 accesses it via a communication network.
 As described above, the division model 20 can be regarded as a set of scores of various features. Mathematically, the division model 20 containing the scores w1, w2, ..., wn of n features can be expressed as a vector w = {w1, w2, ..., wn}. At the time the division model 20 is newly created, all feature scores are 0, that is, w = {0, 0, ..., 0}. The scores are updated little by little through the processing of the natural language processing system 10 described below. After a certain number of sentences have been processed, differences arise between the scores of the individual features, as in the examples above.
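Since only a small subset of the n features fires for any given sentence, the vector w can in practice be held as a sparse mapping, and scoring a candidate analysis amounts to summing the weights of its active features. The following is a minimal sketch; the feature names are taken from the examples in the text, and the binary (fired / not fired) feature vector is an assumption made for illustration.

```python
from collections import defaultdict

# A newly created division model: every feature score is implicitly 0,
# i.e. w = {0, 0, ..., 0}.
w = defaultdict(float)

def score(features, model):
    """Dot product of the model with the binary feature vector of a
    candidate analysis: the sum of the active features' weights."""
    return sum(model[f] for f in features)

features = ["S-N-nc/本", "S-N-nc/C", "S-N-nc/S-P-k"]
initial = score(features, w)   # 0.0 under the all-zero initial model

# After some training, individual scores differ:
w["S-N-nc/本"] = 0.0420
w["S-N-nc/S-P-k"] = 0.0512
trained = score(features, w)
```

Unset features ("S-N-nc/C" here) continue to contribute 0, so the mapping never needs to materialize the full n-dimensional vector.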
 As shown in FIG. 4, the natural language processing system 10 includes an acquisition unit 11, an analysis unit 12, and a correction unit 13 as functional components, and accesses the division model 20 as necessary. Each functional element is described below on the assumption that the natural language processing system 10 processes Japanese sentences. The language of the sentences processed by the natural language processing system 10, however, is not limited to Japanese; sentences in other languages such as Chinese can also be analyzed.
 The acquisition unit 11 is a functional element that acquires a sentence to be divided into a sequence of morphemes. The method by which the acquisition unit 11 acquires sentences is not limited. For example, the acquisition unit 11 may collect sentences from arbitrary websites on the Internet (so-called crawling). Alternatively, the acquisition unit 11 may read sentences stored in advance in a database within the natural language processing system 10, or may access and read, via a communication network, sentences stored in advance in a database on a computer system other than the natural language processing system 10. Alternatively, the acquisition unit 11 may accept a sentence input by a user of the natural language processing system 10. When an instruction to analyze the first sentence is input, the acquisition unit 11 acquires one sentence and outputs it to the analysis unit 12. Thereafter, each time a completion notification is input from the correction unit 13 described below, the acquisition unit 11 acquires the next sentence and outputs it to the analysis unit 12.
 The analysis unit 12 is a functional element that executes morphological analysis on individual sentences. The analysis unit 12 executes the following processing each time one sentence is input.
 First, the analysis unit 12 divides the sentence into individual characters and determines the character type of each character. The analysis unit 12 stores in advance a lookup table of characters and character types, or regular expressions for determining character types, and determines each character type using that table or those regular expressions.
 Next, the analysis unit 12 determines the tag of each character using the Viterbi algorithm. For the i-th character, the analysis unit 12 determines, for each tag that may ultimately be selected (candidate tag), which of the candidate tags of the (i-1)-th character yields the highest score (also called the "connection score") when connected to it. Here, the connection score is the sum of the various scores related to the tag being computed (the unigram emission feature scores, the bigram emission feature scores, and the transition feature score). For example, the analysis unit 12 determines that, when the i-th tag is "S-N-nc", the connection score is highest when the (i-1)-th tag is "S-P-k", and that, when the i-th tag is "S-V-c", the connection score is highest when the (i-1)-th tag is "E-N-nc". The analysis unit 12 then stores all the combinations with the highest connection scores (for example, (S-P-k, S-N-nc) and (E-N-nc, S-V-c)). The analysis unit 12 executes this processing character by character, from the first character to the end-of-sentence symbol.
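Under the feature notation introduced earlier, the connection score of one candidate tag pair could be computed as the sum of the relevant feature scores. The following is an illustrative sketch; the exact set of features entering each position's sum is an assumption, and `model` maps feature keys to scores with missing features scoring 0.

```python
from collections import defaultdict

def connection_score(model, prev_tag, tag, chars, ctypes, i):
    """Sum the scores related to the tag being computed: unigram emission
    features, bigram emission features, and the transition feature."""
    s = model[f"{tag}/{chars[i]}"] + model[f"{tag}/{ctypes[i]}"]  # unigrams
    if i + 1 < len(chars):                   # bigram emission features
        s += model[f"{tag}/{chars[i]}/{chars[i + 1]}"]
        s += model[f"{tag}/{ctypes[i]}/{ctypes[i + 1]}"]
    if prev_tag is not None:                 # transition feature {t_i-1/t_i}
        s += model[f"{prev_tag}/{tag}"]
    return s

model = defaultdict(float, {
    "S-N-nc/本": 0.0420, "S-N-nc/C": 0.0255,
    "S-N-nc/本/を": 0.0420, "S-N-nc/C/H": 0.0455,
})
total = connection_score(model, None, "S-N-nc", ["本", "を"], ["C", "H"], 0)
```

With the example scores from the text, the four contributing features sum to 0.0420 + 0.0255 + 0.0420 + 0.0455.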
 Since only one kind of tag (EOS) exists for the end-of-sentence symbol, the combination of the tag of the last character and the end-of-sentence symbol with the highest connection score is uniquely determined (for example, the combination is determined to be (E-V-c, EOS)). The tag of the last character is thereby determined (for example, to be "E-V-c"), and as a result the tag of the second-to-last character is also determined. Consequently, the tags are fixed one after another in a chain, in order from the end of the sentence toward the beginning.
 FIG. 5 schematically shows this processing by the analysis unit 12, using an example of tagging a sentence consisting of four characters. To simplify the explanation, the tags are abbreviated as "A1", "B2", and so on, and the number of candidate tags per character is three. The thick lines in FIG. 5 indicate the tag-to-tag combinations determined, by processing the sentence from the front, to have the highest connection scores. For example, in the processing of the third character, the connection score of the tag C1 is highest with the tag B1, that of the tag C2 is highest with the tag B1, and that of the tag C3 is highest with the tag B2. In the example of FIG. 5, when the processing reaches the end of the sentence (EOS), the combination (D1, EOS) is fixed, then the combination (C2, D1) is fixed, and thereafter the combinations (B1, C2) and (A2, B1) are fixed in turn. The analysis unit 12 therefore determines that the tags of the first to fourth characters are A2, B1, C2, and D1, respectively.
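The forward pass with stored best predecessors and the backward chain of fixed tags is the standard Viterbi recursion. A compact sketch, with a hypothetical connection-score function given as a simple lookup table over a three-position toy lattice, might look like this:

```python
def viterbi(candidates, connect_score):
    """candidates[i] lists the candidate tags of position i;
    connect_score(prev_tag, tag, i) is the connection score of linking
    prev_tag at i-1 to tag at i (prev_tag is None at i = 0).
    Returns the best tag sequence, recovered backwards from the end."""
    n = len(candidates)
    # best[i][t] = (best total score reaching tag t at i, best previous tag)
    best = [{t: (connect_score(None, t, 0), None) for t in candidates[0]}]
    for i in range(1, n):
        layer = {}
        for t in candidates[i]:
            # which previous candidate tag gives the highest score?
            prev, s = max(
                ((p, best[i - 1][p][0] + connect_score(p, t, i))
                 for p in candidates[i - 1]),
                key=lambda pair: pair[1])
            layer[t] = (s, prev)
        best.append(layer)
    # only one tag exists at the final (EOS) position, so the path is unique
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(n - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy lattice in the spirit of FIG. 5 (hypothetical scores):
table = {
    (None, "A1", 0): 0.1, (None, "A2", 0): 0.2,
    ("A1", "B1", 1): 0.5, ("A2", "B1", 1): 0.3,
    ("A1", "B2", 1): 0.1, ("A2", "B2", 1): 0.4,
    ("B1", "EOS", 2): 0.2, ("B2", "EOS", 2): 0.1,
}
tags = viterbi([["A1", "A2"], ["B1", "B2"], ["EOS"]],
               lambda p, t, i: table[(p, t, i)])
```

Once (EOS) fixes the last combination, the stored best predecessors determine the remaining tags in a chain, just as described above.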
 The analysis unit 12 outputs the sentence with each character tagged as the analysis result. The analysis unit 12 outputs the analysis result at least to the correction unit 13, because the analysis result is necessary for correcting the division model 20. The analysis unit 12 may also output it in other ways. For example, the analysis unit 12 may display the analysis result on a monitor or print it on a printer, write it to a text file, or store it in a storage device such as a memory or a database. Alternatively, the analysis unit 12 may transmit the analysis result via a communication network to any computer system other than the natural language processing system 10.
 The correction unit 13 is a functional element that corrects the division model 20 based on the difference between the analysis result obtained from the analysis unit 12 and the correct morphological analysis of the sentence. In this specification, "correcting the division model" means processing that changes the score of at least one feature in the division model. In some cases, however, an attempt to change a score may result in no change in its value. The correction unit 13 executes the following processing each time one analysis result is input.
 First, the correction unit 13 acquires the correct data corresponding to the input analysis result, that is, data indicating the correct morphological analysis of the sentence processed by the analysis unit 12. The correct data in the present embodiment is data indicating the tag (the combination of appearance mode, part of speech, and part-of-speech subclass) of each character forming the sentence, and is created manually. The method by which the correction unit 13 acquires the correct data is not limited. For example, the correction unit 13 may read correct data stored in advance in a database within the natural language processing system 10, or may access and read, via a communication network, correct data stored in advance in a database on a computer system other than the natural language processing system 10. Alternatively, the correction unit 13 may accept correct data input by a user of the natural language processing system 10.
 Having acquired the correct data, the correction unit 13 compares the input analysis result with that correct data and identifies the difference between them.
 If the analysis result completely matches the correct data and there is no difference, the correction unit 13 ends the processing without correcting the division model 20, and generates a completion notification and outputs it to the acquisition unit 11. This completion notification is a signal indicating that the processing in the correction unit 13 has finished and the morphological analysis of the next sentence can be executed. Since a complete match between the analysis result and the correct data means that there is no need to correct the division model 20, at least at this point, the natural language processing system 10 (more specifically, the analysis unit 12) analyzes the next sentence using the current division model 20 as it is.
 For example, the correct data for the Japanese sentence 「本を買って (hon wo katte)」 described above is as follows, where the characters are written x1 to x5 for convenience.
x1: {S-N-nc}
x2: {S-P-k}
x3: {B-V-c}
x4: {E-V-c}
x5: {S-P-sj}
 Therefore, when the analysis result shown in FIG. 2 is input, the correction unit 13 determines that the analysis result completely matches the correct data, and outputs a completion notification to the acquisition unit 11 without correcting the division model 20.
 On the other hand, if the analysis result does not completely match the correct data (that is, if there is a difference between the analysis result and the correct data), the correction unit 13 updates at least some of the scores in the division model 20. More specifically, the correction unit 13 raises above its current value the score of each feature related to the correct tag corresponding to an incorrect tag, and lowers below its current value the score of each feature related to that incorrect tag.
 For example, suppose the analysis unit 12 obtains the following analysis result for the Japanese sentence 「本を買って」 (hon wo katte):
x1 (本): {S-N-nc}
x2 (を): {S-P-k}
x3 (買): {B-V-c}
x4 (っ): {I-V-c}
x5 (て): {E-V-c}
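To make the comparison concrete, the mismatch between this analysis result and the correct data can be located position by position. The helper below is an illustrative sketch, not part of the patented system; the tag strings follow the notation used in this example.

```python
# Hypothetical helper: find the character positions (0-based) whose
# predicted tag differs from the correct tag.
def diff_positions(predicted_tags, correct_tags):
    return [i for i, (p, c) in enumerate(zip(predicted_tags, correct_tags)) if p != c]

correct   = ["S-N-nc", "S-P-k", "B-V-c", "E-V-c", "S-P-sj"]
predicted = ["S-N-nc", "S-P-k", "B-V-c", "I-V-c", "E-V-c"]

# Characters x4 and x5 (indices 3 and 4) are mis-tagged.
print(diff_positions(predicted, correct))  # → [3, 4]
```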
 In this case, because the analysis result is wrong as a whole, the correction unit 13 evaluates the feature corresponding to each tag in the correct data as "correct (+1)" and raises that feature's score above its current value, and evaluates the feature corresponding to each tag in the analysis result as "incorrect (-1)" and lowers that feature's score below its current value. Taking into account the portions that cancel out, this is equivalent to saying that the correction unit 13 ultimately performs the following processing.
 The correction unit 13 raises the scores of the output features "E-V-c/っ(t)" and "S-P-sj/て(te)", which correspond to the correct tags of characters x4 and x5, above their current values, and lowers the scores of the output features "I-V-c/っ(t)" and "E-V-c/て(te)", which are associated with the incorrect tags, below their current values. This updates the unigram output-feature scores (scores based on characters) associated with the analyzed sentence.
 The correction unit 13 also raises the scores of the output features "E-V-c/H" and "S-P-sj/H", associated with the correct tags of the mis-analyzed characters x4 and x5, above their current values, and lowers the scores of the output features "I-V-c/H" and "E-V-c/H", associated with the incorrect tags, below their current values. This updates the unigram output-feature scores (scores based on character type) associated with the analyzed sentence.
 The correction unit 13 also raises the score of the output feature "E-V-c/っ(t)/て(te)", associated with the correct tags of the mis-analyzed characters x4 and x5, above its current value, and lowers the score of the output feature "I-V-c/っ(t)/て(te)", associated with the incorrect tag, below its current value. This updates the bigram output-feature scores (scores based on characters) associated with the analyzed sentence.
 The correction unit 13 also raises the score of the output feature "E-V-c/H/H", associated with the correct tags of the mis-analyzed characters x4 and x5, above its current value, and lowers the score of the output feature "I-V-c/H/H", associated with the incorrect tag, below its current value. This updates the bigram output-feature scores (scores based on character type) associated with the analyzed sentence.
 The correction unit 13 also raises the scores of the transition features "B-V-c/E-V-c" and "E-V-c/S-P-sj", associated with the correct tags of the mis-analyzed characters x4 and x5, above their current values, and lowers the scores of the transition features "B-V-c/I-V-c" and "I-V-c/E-V-c", associated with the incorrect tags, below their current values. This updates the transition-feature scores associated with the analyzed sentence.
 As described above, the correction unit 13 may instead evaluate each tag in the correct data as "correct (+1)" while evaluating the tag for each character in the analysis result as "incorrect (-1)", cancel the two evaluations against each other for each tag, and then raise the scores of the features corresponding to tags evaluated as "correct (+1)" and lower the scores of the features corresponding to tags evaluated as "incorrect (-1)".
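The promote/demote processing described above can be sketched as follows. This is a simplified illustration under assumed feature templates (tag/character and tag/character-type for unigram output features, first tag with the character bigram for bigram output features, and tag pairs for transition features); the exact templates are defined elsewhere in the specification, and the `char_types` encoding ("K" for kanji, "H" for hiragana) is hypothetical.

```python
from collections import defaultdict

# Assumed feature templates approximating those in the example above.
def sentence_features(chars, char_types, tags):
    feats = []
    for i in range(len(chars)):
        feats.append(f"{tags[i]}/{chars[i]}")        # unigram output (character)
        feats.append(f"{tags[i]}/{char_types[i]}")   # unigram output (character type)
        if i + 1 < len(chars):
            feats.append(f"{tags[i]}/{chars[i]}/{chars[i + 1]}")            # bigram output (characters)
            feats.append(f"{tags[i]}/{char_types[i]}/{char_types[i + 1]}")  # bigram output (character types)
            feats.append(f"{tags[i]}/{tags[i + 1]}")                        # transition
    return feats

def apply_update(scores, chars, char_types, tags, delta):
    for f in sentence_features(chars, char_types, tags):
        scores[f] += delta   # +1 for the correct data, -1 for the analysis result

scores = defaultdict(float)
chars = ["本", "を", "買", "っ", "て"]
types = ["K", "H", "K", "H", "H"]
correct   = ["S-N-nc", "S-P-k", "B-V-c", "E-V-c", "S-P-sj"]
predicted = ["S-N-nc", "S-P-k", "B-V-c", "I-V-c", "E-V-c"]
apply_update(scores, chars, types, correct, +1.0)
apply_update(scores, chars, types, predicted, -1.0)
# Features shared by both taggings cancel out; only the differing ones remain,
# e.g. "E-V-c/っ" ends at +1, "I-V-c/っ" at -1, and "S-N-nc/本" at 0.
```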
 When updating feature scores, the correction unit 13 may use SCW (Soft Confidence-Weighted learning). SCW treats a parameter with large variance as still uncertain (not yet accurate) and updates it by a large amount, while treating a parameter with small variance as reasonably accurate and updating it by only a small amount. The correction unit 13 determines the amount of change in a score based on the variance of that score, which has a range of values. To run SCW, a Gaussian distribution is introduced over the division model 20 (the vector w), and the correction unit 13 updates not only each score but also, at the same time, the mean and the covariance matrix of the scores. The initial value of each score's mean is 0. In the initial covariance matrix, the diagonal elements are 1 and the other (off-diagonal) elements are 0. Fig. 6(a) shows a case in which a score with large variance is changed greatly (that is, the amount of change in the score is large), and Fig. 6(b) shows a case in which a score with small variance is changed only slightly (that is, the amount of change is small). Figs. 6(a) and 6(b) each show that the covariance matrix Σ is also updated when the score is updated from Sa to Sb. Regarding the covariance-matrix update, the accuracy of the score calculation can be maintained without considering correlations between one feature and another, so in this embodiment only the diagonal elements of the covariance matrix are computed, not the off-diagonal elements. This raises the score update speed.
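The variance-sensitive behavior can be illustrated with a deliberately simplified, diagonal-only update in the spirit of confidence-weighted learning. The exact SCW update rules involve a probabilistic constraint and are more involved; the step-size and variance-shrinkage formulas below are illustrative assumptions only.

```python
# Simplified confidence-weighted style update for a single feature.
# The step size grows with the variance (uncertain scores move more),
# and each update shrinks the variance (confidence increases).
def cw_style_update(mean, variance, direction, rate=1.0):
    new_mean = mean + direction * rate * variance
    new_variance = variance / (1.0 + variance)
    return new_mean, new_variance

m, v = 0.0, 1.0                        # initial mean 0, diagonal variance 1
m, v = cw_style_update(m, v, +1.0)     # large variance → large change
m2, v2 = cw_style_update(m, v, +1.0)   # smaller variance → smaller change
```

Repeated updates thus move a score less and less, which is one way to see why per-feature scores converge quickly under this scheme.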
 The correction unit 13 may also update the feature scores using a method other than SCW. Examples of such methods include the Perceptron, Passive-Aggressive (PA), Confidence-Weighted (CW), and Adaptive Regularization of Weight Vectors (AROW) algorithms.
 After modifying the division model 20 by updating the scores of the features associated with the analyzed sentence, the correction unit 13 generates a completion notification and outputs it to the acquisition unit 11. In this case, the natural language processing system 10 (more specifically, the analysis unit 12) analyzes the next sentence using the modified division model 20.
 Next, the operation of the natural language processing system 10, and the natural language processing method according to this embodiment, will be described with reference to Fig. 7.
 First, the acquisition unit 11 acquires one sentence (step S11). Then the analysis unit 12 performs morphological analysis on that sentence using the division model 20 (step S12, analysis step). This morphological analysis assigns a tag such as "S-N-nc" to each character of the sentence.
 Next, the correction unit 13 obtains the difference between the result of the morphological analysis by the analysis unit 12 and the correct data for that analysis (step S13). If there is no difference (step S14; NO), that is, if the morphological analysis by the analysis unit 12 is completely correct, the correction unit 13 ends the process without modifying the division model 20. On the other hand, if there is a difference between the analysis result and the correct data (step S14; YES), that is, if at least part of the morphological analysis is incorrect, the correction unit 13 modifies the division model 20 by updating the scores of the features associated with the analyzed sentence (step S15, correction step). Specifically, the correction unit 13 raises the score of each feature associated with the correct tag corresponding to an incorrect tag above its current value, and lowers the score of each feature associated with the incorrect tag below its current value.
 When the processing in the correction unit 13 is completed, the flow returns to step S11 (see step S16). The acquisition unit 11 acquires the next sentence (step S11), and the analysis unit 12 performs morphological analysis on it (step S12). If the division model 20 was modified (step S15) while processing the previous sentence, the analysis unit 12 performs the morphological analysis using the modified division model 20. The correction unit 13 then executes the processing from step S13 onward. This loop continues as long as there are sentences to process (see step S16).
 An example of an algorithm describing the operation of the natural language processing system 10 is shown below.
Initialize w_1
For t = 1, 2, ...
  Receive instance x_t
  Predict structure ŷ_t based on w_t
  Receive correct structure y_t
  If ŷ_t ≠ y_t, update
    w_(t+1) = update(w_t, y_t, +1)
    w_(t+1) = update(w_t, ŷ_t, -1)
 The first line of the algorithm initializes the division model 20 (the variable w_1); this processing sets, for example, each feature's score to 0. The For loop on the second line indicates that the processing from the third line onward is executed one sentence at a time. The third line acquires a sentence x_t, corresponding to step S11 above. The fourth line assigns a tag to each character by performing morphological analysis based on the division model 20 (w_t) at that point, corresponding to step S12 above; ŷ_t denotes the analysis result. The fifth line acquires the correct data y_t for the morphological analysis of the sentence x_t. The sixth line updates (modifies) the division model 20 when there is a difference between the analysis result ŷ_t and the correct data y_t. The seventh line learns the correct data y_t as a positive example, and the eighth line learns the erroneous analysis result ŷ_t as a negative example. The processing on the seventh and eighth lines corresponds to step S15 above.
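The algorithm above can be rendered as a runnable sketch. Here `predict` is a placeholder for the system's actual decoder, and `featurize` stands in for the output/transition feature extraction; both are assumptions for illustration, with the update reduced to a bare structured-perceptron form.

```python
from collections import defaultdict

def update(w, features, delta):
    # Add delta to the score of every feature fired by the tagging.
    for f in features:
        w[f] += delta

def train_stream(instances, featurize, predict):
    w = defaultdict(float)                      # Initialize w_1 (all scores 0)
    for chars, gold_tags in instances:          # For t = 1, 2, ...
        pred_tags = predict(w, chars)           # Predict structure ŷ_t based on w_t
        if pred_tags != gold_tags:              # If ŷ_t ≠ y_t, update
            update(w, featurize(chars, gold_tags), +1.0)  # correct data as positive example
            update(w, featurize(chars, pred_tags), -1.0)  # analysis result as negative example
    return w
```

Because each sentence triggers at most one constant-size pass over its own features, total update time grows roughly linearly with the number of sentences, as the description notes.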
 Next, a natural language processing program P1 for realizing the natural language processing system 10 will be described with reference to Fig. 8.
 The natural language processing program P1 comprises a main module P10, an acquisition module P11, an analysis module P12, and a correction module P13.
 The main module P10 is the part that exercises overall control over the morphological analysis and its related processing. The functions realized by executing the acquisition module P11, the analysis module P12, and the correction module P13 are the same as those of the acquisition unit 11, the analysis unit 12, and the correction unit 13 described above, respectively.
 The natural language processing program P1 may be provided in a form fixedly recorded on a tangible recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. Alternatively, the natural language processing program P1 may be provided over a communication network as a data signal superimposed on a carrier wave.
 As described above, a natural language processing system according to one aspect of the present invention comprises: an analysis unit that, using a division model obtained by machine learning with one or more items of training data, performs morphological analysis on one sentence and thereby sets, on each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, the division model including scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags for the sentence, raises the scores of the output features and transition features associated with the correct tag corresponding to an incorrect tag above their current values, and lowers the scores of the output features and transition features associated with the incorrect tag below their current values, thereby modifying the division model used by the analysis unit for the morphological analysis of the next sentence.
 A natural language processing method according to one aspect of the present invention is a natural language processing method executed by a natural language processing system comprising a processor, the method comprising: an analysis step of performing, using a division model obtained by machine learning with one or more items of training data, morphological analysis on one sentence and thereby setting, on each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, the division model including scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements; and a correction step of comparing the tags indicated by the analysis result obtained in the analysis step with correct data indicating the correct tags for the sentence, raising the scores of the output features and transition features associated with the correct tag corresponding to an incorrect tag above their current values, and lowering the scores of the output features and transition features associated with the incorrect tag below their current values, thereby modifying the division model used in the analysis step for the morphological analysis of the next sentence.
 A natural language processing program according to one aspect of the present invention causes a computer to function as: an analysis unit that, using a division model obtained by machine learning with one or more items of training data, performs morphological analysis on one sentence and thereby sets, on each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, the division model including scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements; and a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags for the sentence, raises the scores of the output features and transition features associated with the correct tag corresponding to an incorrect tag above their current values, and lowers the scores of the output features and transition features associated with the incorrect tag below their current values, thereby modifying the division model used by the analysis unit for the morphological analysis of the next sentence.
 In these aspects, each time one sentence is morphologically analyzed, the analysis result is compared with the correct data, and the division model is modified based on the difference between them. Because the division model is modified one sentence at a time, the time required to modify it when processing multiple sentences grows only roughly linearly with the number of sentences, so the division model for morphological analysis can be modified automatically within a bounded time (in other words, within a predictable range of time).
 In addition, raising the feature scores for correctly assigned tags and lowering the feature scores for incorrectly assigned tags improves the accuracy of the morphological analysis of the next sentence.
 In a natural language processing system according to another aspect, the divided elements may be characters. By processing each character using character-level knowledge (output features and transition features), morphological analysis can be performed without a word dictionary, which generally becomes very large. Moreover, because the division model is modified one sentence at a time using character-level rather than word-level knowledge, the next sentence can be analyzed with high accuracy even if its domain or character differs from every sentence analyzed so far. That is, a natural language processing system according to one aspect of the present invention adapts to sentences from unknown domains or with unknown properties.
 In a natural language processing system according to another aspect, each output-feature score and each transition-feature score may have a range of values, a variance may be set for each score, and the correction unit may determine, based on the variance of each score, the amount by which that score is raised or lowered. Using this approach allows the score of each feature to converge quickly.
 The present invention has been described above in detail based on its embodiments. The present invention is, however, not limited to the above embodiments, and can be modified in various ways without departing from its gist.
 In general, the number of features contained in the division model 20 grows with the number of characters handled, so for languages with many characters, such as Japanese and Chinese, the division model 20 becomes very large, and the storage capacity it requires becomes very large as well. A technique called feature hashing may therefore be introduced to convert individual features into numerical values with a hash function. Converting the characters and character strings that form part of a feature into numbers is particularly effective. Hashing the transition features, on the other hand, contributes little to compressing the division model 20 and may actually slow down processing. Accordingly, only the output features may be hashed, leaving the transition features unhashed. As for the hash function, a single function may be used, or different hash functions may be used for the output features and the transition features.
 In this case, the division model 20 stores data on features in which individual characters are represented by numbers. For example, the character 「本」 (hon) is converted to the number 34, and the character 「を」 (wo) is converted to the number 4788. This conversion yields a bounded set of features. Feature hashing can assign the same number to multiple characters or character strings, but the probability that frequently occurring characters or strings are assigned the same number is very low, so such collisions can be ignored.
 That is, in a natural language processing system according to another aspect, the division model may include output features converted to numbers by a hash function. Handling characters as numbers saves the memory capacity required to store the division model.
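Feature hashing as described can be sketched as follows. The choice of hash function (SHA-1), digest length, and bucket count (2^20) are illustrative assumptions, not values taken from the specification.

```python
import hashlib

NUM_BUCKETS = 2 ** 20   # bounded feature space (illustrative size)

def feature_index(feature: str) -> int:
    """Map an arbitrary feature string to a bounded integer index."""
    digest = hashlib.sha1(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

# The weight vector stays bounded no matter how many distinct
# characters or character strings ever appear in the input.
weights = [0.0] * NUM_BUCKETS
weights[feature_index("S-N-nc/本")] += 1.0
```

Two distinct feature strings can collide in the same bucket, but as the description notes, collisions between frequently occurring features are rare enough to ignore.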
 The analysis unit 12 may perform morphological analysis using features with relatively high scores while ignoring features with relatively low scores. Techniques for ignoring low-scoring features include Forward-Backward Splitting (FOBOS) and feature quantization.
 FOBOS is a technique that compresses scores toward 0 via regularization (for example, L1 regularization). Using FOBOS makes it possible to ignore features whose score is at or below a predetermined value (for example, features whose score is 0 or close to 0).
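The L1 step at the core of FOBOS is the soft-thresholding operator; a minimal sketch follows, in which the regularization strength λ (`lam`) is an illustrative parameter.

```python
def soft_threshold(score, lam):
    """FOBOS L1 proximal step: shrink a score toward 0, clipping small scores to exactly 0."""
    if score > lam:
        return score - lam
    if score < -lam:
        return score + lam
    return 0.0

# Scores whose magnitude is at most lam become exactly 0,
# so the corresponding features can be dropped from the model.
```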
 Feature quantization converts a feature's score to an integer by multiplying its fractional value by 10^n (where n is a natural number of 1 or more). For example, multiplying the score 0.123456789 by 1000 and taking the integer part gives the score 123. Quantizing the scores saves the memory capacity required to store them as text. This technique also makes it possible to ignore features whose score is at or below a predetermined value (for example, features whose integerized score is 0 or close to 0). For example, if features Fa and Fb have scores of 0.0512 and 0.0003 respectively, multiplying these scores by 1000 and taking the integer part gives 51 for Fa and 0 for Fb. In that case, the analysis unit 12 performs the morphological analysis without using feature Fb.
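The quantization step can be sketched directly from the numbers in this example (n = 3, i.e. multiplying by 1000 and truncating):

```python
def quantize(score, n=3):
    """Convert a fractional score to an integer by multiplying by 10**n and truncating."""
    return int(score * 10 ** n)

print(quantize(0.123456789))  # → 123
print(quantize(0.0512))       # → 51  (feature Fa is kept)
print(quantize(0.0003))       # → 0   (feature Fb is ignored)
```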
 The regularization or quantization processing is executed by, for example, the correction unit 13, another functional element in the natural language processing system 10, or a computer system separate from the natural language processing system 10. When the correction unit 13 performs the regularization or quantization, it does so once after a set of sentences (for example, a reasonably large number of sentences) has been morphologically analyzed in the natural language processing system 10 and the division model 20 has been modified many times.
 That is, in a natural language processing system according to another aspect, the analysis unit may perform the morphological analysis without using features whose score has been reduced to a predetermined value or below by regularization or quantization. Not using features with relatively low scores (for example, features whose score becomes 0, or close to 0, through regularization or quantization) reduces the data size of the division model and shortens the morphological analysis time.
 In the above embodiment, the analysis unit 12 divides a sentence into individual characters and sets a tag on each character, but the divided elements may be words rather than characters. In that case, the analysis unit may perform the morphological analysis using a word dictionary together with a division model that holds feature scores for words rather than characters.
 As described above, the natural language processing system according to the present invention can be applied to morphological analysis of any language.
 DESCRIPTION OF REFERENCE NUMERALS: 10: natural language processing system; 11: acquisition unit; 12: analysis unit; 13: correction unit; 20: division model; P1: natural language processing program; P10: main module; P11: acquisition module; P12: analysis module; P13: correction module.

Claims (7)

  1.  A natural language processing system comprising:
     an analysis unit that, using a division model obtained by machine learning with one or more items of training data, performs morphological analysis on one sentence and thereby sets, on each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, the division model including scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements; and
     a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags for the sentence, raises the scores of the output features and the transition features associated with the correct tag corresponding to an incorrect tag above their current values, and lowers the scores of the output features and the transition features associated with the incorrect tag below their current values, thereby modifying the division model used by the analysis unit for the morphological analysis of the next sentence.
  2.  The natural language processing system according to claim 1, wherein the divided elements are characters.
  3.  The natural language processing system according to claim 1 or 2, wherein the division model includes the output features converted to numbers by a hash function.
  4.  The natural language processing system according to any one of claims 1 to 3, wherein each of the output-feature scores and the transition-feature scores has a range of values and a variance is set for each score, and
     the correction unit determines, on the basis of the variance of a score, the amount of change of that score when raising or lowering it.
  5.  The natural language processing system according to any one of claims 1 to 4, wherein the analysis unit performs the morphological analysis without using a feature whose score has become equal to or less than a predetermined value through regularization or quantization.
  6.  A natural language processing method executed by a natural language processing system comprising a processor, the method comprising:
     an analysis step of performing morphological analysis on a sentence by using a division model obtained through machine learning with one or more pieces of training data, thereby setting, for each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, wherein the division model includes scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags assigned to two consecutive divided elements; and
     a correction step of comparing the tags indicated by the analysis result obtained in the analysis step with correct data indicating the correct tags of the sentence, raising the score of the output feature and the score of the transition feature related to the correct tag corresponding to an incorrect tag above their current values, and lowering the score of the output feature and the score of the transition feature related to the incorrect tag below their current values, thereby correcting the division model used in the morphological analysis of the next sentence in the analysis step.
  7.  A natural language processing program causing a computer to function as:
     an analysis unit that performs morphological analysis on a sentence by using a division model obtained through machine learning with one or more pieces of training data, thereby setting, for each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word, wherein the division model includes scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags assigned to two consecutive divided elements; and
     a correction unit that compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, raises the score of the output feature and the score of the transition feature related to the correct tag corresponding to an incorrect tag above their current values, and lowers the score of the output feature and the score of the transition feature related to the incorrect tag below their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.
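The correction described in claims 1, 6, and 7 can be read as a perceptron-style online update over two feature tables. The following is a minimal sketch of that reading, not the patented implementation: the division model stores output-feature scores keyed by (element, tag) and transition-feature scores keyed by (previous tag, tag), and after each analysed sentence the scores tied to correct tags are raised while those tied to the predicted incorrect tags are lowered. The tag labels and the sample sentence are hypothetical.

```python
# Sketch (assumed reading of claims 1/6/7): perceptron-style correction of a
# division model holding output-feature and transition-feature scores.
from collections import defaultdict

class DivisionModel:
    def __init__(self):
        self.output = defaultdict(float)      # (element, tag) -> score
        self.transition = defaultdict(float)  # (prev_tag, tag) -> score

    def correct(self, elements, predicted_tags, correct_tags, step=1.0):
        """Raise scores of correct-tag features, lower those of incorrect ones,
        so the corrected model is used for the next sentence's analysis."""
        prev_pred, prev_gold = None, None
        for elem, pred, gold in zip(elements, predicted_tags, correct_tags):
            if pred != gold:
                # raise the output and transition features of the correct tag ...
                self.output[(elem, gold)] += step
                self.transition[(prev_gold, gold)] += step
                # ... and lower those related to the incorrect tag
                self.output[(elem, pred)] -= step
                self.transition[(prev_pred, pred)] -= step
            prev_pred, prev_gold = pred, gold

model = DivisionModel()
# character-level divided elements (claim 2); tag labels are hypothetical
model.correct(list("今日は晴れ"),
              ["N-B", "N-I", "P-B", "N-B", "V-B"],   # predicted (partly wrong)
              ["N-B", "N-I", "P-B", "V-B", "V-I"])   # correct data
print(model.output[("晴", "V-B")])   # raised above its initial value 0.0
```

A full implementation would pair this update with a Viterbi-style decoder that sums output and transition scores to pick the best tag sequence for each sentence.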
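Claim 3 has the output features "numericalized by a hash function," which is commonly done so that scores live in a fixed-size array indexed by hash bucket rather than a string-keyed dictionary. A hypothetical illustration follows; the hash function, bucket count, and key encoding are assumptions, as the patent does not specify them.

```python
# Hypothetical sketch of claim 3: hashing an (element, tag) output feature
# into a fixed-size score table (the "hashing trick").
import hashlib

NUM_BUCKETS = 2 ** 20  # assumed table size, not specified in the patent

def feature_index(element: str, tag: str) -> int:
    """Map an (element, tag) output feature to an integer bucket."""
    key = f"{element}\t{tag}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

scores = [0.0] * NUM_BUCKETS
i = feature_index("晴", "V-B")
scores[i] += 1.0  # score updates address the bucket, not the raw string pair
```

The trade-off is the usual one for feature hashing: memory use is bounded and key strings need not be stored, at the cost of occasional bucket collisions between distinct features.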
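Claims 4 and 5 add two refinements: the amount by which a score changes is determined from a per-score variance (in the spirit of confidence-weighted learning), and features whose scores have shrunk to or below a threshold through regularization or quantization are excluded from later analyses. The sketch below is only one plausible concretization; the scaling rule, decay factor, and threshold are assumptions, not taken from the patent.

```python
# Assumed sketch of claims 4 and 5: variance-scaled score updates and
# pruning of features whose scores fall to or below a threshold.
class ScoredFeature:
    def __init__(self, score=0.0, variance=1.0):
        self.score = score
        self.variance = variance

    def update(self, direction, base_step=1.0, decay=0.9):
        # the change amount is scaled by the current variance (claim 4);
        # the exact scaling and decay are illustrative assumptions
        self.score += direction * base_step * self.variance
        self.variance *= decay  # confident features move less next time

def active_features(table, threshold=0.01):
    # claim 5: skip features whose score is at or below a predetermined
    # value (e.g. after regularization/quantization) during analysis
    return {k: f for k, f in table.items() if abs(f.score) > threshold}

table = {("晴", "V-B"): ScoredFeature(), ("れ", "N-B"): ScoredFeature(0.005)}
table[("晴", "V-B")].update(+1)
print(len(active_features(table)))  # the near-zero feature is pruned
```

Pruning low-scoring features this way shrinks the model and speeds up decoding, since the analyzer never touches features that cannot meaningfully affect the tag sequence.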
PCT/JP2014/082428 2014-04-29 2014-12-08 Natural language processing system, natural language processing method, and natural language processing program WO2015166606A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020167028427A KR101729461B1 (en) 2014-04-29 2014-12-08 Natural language processing system, natural language processing method, and natural language processing program
CN201480076197.5A CN106030568B (en) 2014-04-29 2014-12-08 Natural language processing system, natural language processing method and natural language processing program
JP2015512822A JP5809381B1 (en) 2014-04-29 2014-12-08 Natural language processing system, natural language processing method, and natural language processing program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461985615P 2014-04-29 2014-04-29
US61/985615 2014-04-29

Publications (1)

Publication Number Publication Date
WO2015166606A1 (en) 2015-11-05

Family

ID=54358353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/082428 WO2015166606A1 (en) 2014-04-29 2014-12-08 Natural language processing system, natural language processing method, and natural language processing program

Country Status (5)

Country Link
JP (1) JP5809381B1 (en)
KR (1) KR101729461B1 (en)
CN (1) CN106030568B (en)
TW (1) TWI567569B (en)
WO (1) WO2015166606A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021559B (en) * 2018-02-05 2022-05-03 威盛电子股份有限公司 Natural language understanding system and semantic analysis method
CN110020434B (en) * 2019-03-22 2021-02-12 北京语自成科技有限公司 Natural language syntactic analysis method
KR102352481B1 (en) * 2019-12-27 2022-01-18 동국대학교 산학협력단 Sentence analysis device using morpheme analyzer built on machine learning and operating method thereof
CN116153516B (en) * 2023-04-19 2023-07-07 山东中医药大学第二附属医院(山东省中西医结合医院) Disease big data mining analysis system based on distributed computing

Citations (2)

Publication number Priority date Publication date Assignee Title
JPH09114825A (en) * 1995-10-19 1997-05-02 Ricoh Co Ltd Method and device for morpheme analysis
JP2003099426A (en) * 2001-09-25 2003-04-04 Canon Inc Natural language processor, its control method and program

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN100530171C (en) * 2005-01-31 2009-08-19 日电(中国)有限公司 Dictionary learning method and devcie
CN100533431C (en) * 2005-09-21 2009-08-26 富士通株式会社 Natural language component identifying correcting apparatus and method based on morpheme marking
CN102681981A (en) * 2011-03-11 2012-09-19 富士通株式会社 Natural language lexical analysis method, device and analyzer training method
JP5795985B2 (en) 2012-03-30 2015-10-14 Kddi株式会社 Morphological analyzer, morphological analysis method, and morphological analysis program

Non-Patent Citations (1)

Title
Bird, Steven et al., Natural Language Processing with Python, 8 November 2010 (2010-11-08), pages 480-490, ISBN: 978-4-87311-470-5 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN112101030A (en) * 2020-08-24 2020-12-18 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for establishing term mapping model and realizing standard word mapping
CN112101030B (en) * 2020-08-24 2024-01-26 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for establishing term mapping model and realizing standard word mapping
CN113204667A (en) * 2021-04-13 2021-08-03 北京百度网讯科技有限公司 Method and device for training audio labeling model and audio labeling
CN113204667B (en) * 2021-04-13 2024-03-22 北京百度网讯科技有限公司 Method and device for training audio annotation model and audio annotation
JP7352249B1 (en) 2023-05-10 2023-09-28 株式会社Fronteo Information processing device, information processing system, and information processing method

Also Published As

Publication number Publication date
CN106030568A (en) 2016-10-12
KR20160124237A (en) 2016-10-26
TWI567569B (en) 2017-01-21
CN106030568B (en) 2018-11-06
JP5809381B1 (en) 2015-11-10
TW201544976A (en) 2015-12-01
KR101729461B1 (en) 2017-04-21
JPWO2015166606A1 (en) 2017-04-20

Similar Documents

Publication Publication Date Title
JP5809381B1 (en) Natural language processing system, natural language processing method, and natural language processing program
Abandah et al. Automatic diacritization of Arabic text using recurrent neural networks
JP5901001B1 (en) Method and device for acoustic language model training
Duan et al. Online spelling correction for query completion
US20120262461A1 (en) System and Method for the Normalization of Text
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
KR101544690B1 (en) Word division device, word division method, and word division program
CN109117474B (en) Statement similarity calculation method and device and storage medium
US11423237B2 (en) Sequence transduction neural networks
JP6817556B2 (en) Similar sentence generation method, similar sentence generation program, similar sentence generator and similar sentence generation system
US20220391647A1 (en) Application-specific optical character recognition customization
US20200279079A1 (en) Predicting probability of occurrence of a string using sequence of vectors
WO2019208507A1 (en) Language characteristic extraction device, named entity extraction device, extraction method, and program
KR20230061001A (en) Apparatus and method for correcting text
CN113255331B (en) Text error correction method, device and storage medium
JP2015169947A (en) Model learning device, morphological analysis device and method
Yang et al. Spell Checking for Chinese.
CN111090720B (en) Hot word adding method and device
JP5676517B2 (en) Character string similarity calculation device, method, and program
US20180033425A1 (en) Evaluation device and evaluation method
Kim et al. Reliable automatic word spacing using a space insertion and correction model based on neural networks in Korean
JP6303508B2 (en) Document analysis apparatus, document analysis system, document analysis method, and program
Yan et al. A novel approach to improve the Mongolian language model using intermediate characters
CN111667813B (en) Method and device for processing file
Meško et al. Checking the writing of commas in Slovak

Legal Events

Date Code Title Description
ENP Entry into the national phase
    Ref document number: 2015512822
    Country of ref document: JP
    Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 14890522
    Country of ref document: EP
    Kind code of ref document: A1

ENP Entry into the national phase
    Ref document number: 20167028427
    Country of ref document: KR
    Kind code of ref document: A

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: pct application non-entry in european phase
    Ref document number: 14890522
    Country of ref document: EP
    Kind code of ref document: A1