TWI567569B - Natural language processing systems, natural language processing methods, and natural language processing programs - Google Patents


Info

Publication number
TWI567569B
Authority
TW
Taiwan
Prior art keywords
score
label
word
natural language
language processing
Application number
TW104108650A
Other languages
Chinese (zh)
Other versions
TW201544976A (en)
Inventor
Masato Hagiwara
Original Assignee
Rakuten Inc
Application filed by Rakuten Inc
Publication of TW201544976A
Application granted
Publication of TWI567569B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Description

Natural language processing system, natural language processing method, and natural language processing program product

One aspect of the present invention relates to a natural language processing system, a natural language processing method, and a natural language processing program.

Morphological analysis, which divides a sentence into a morpheme string and then determines the part of speech of each morpheme, is known as one of the basic techniques of natural language processing. In this connection, Patent Document 1 below describes a morphological analysis device that decomposes input text data into morphemes, obtains information on the position corresponding to each decomposed morpheme by referring to a morpheme dictionary, and determines a morpheme string from among the candidate morpheme strings obtained by the decomposition, using a cost function that takes the position information into account.

[Prior Art Documents] [Patent Documents]

[Patent Document 1] Japanese Patent Laid-Open Publication No. 2013-210856

Morphological analysis is performed using a segmentation model that contains a score for each feature. The segmentation model, which can be regarded as the knowledge required for morphological analysis, is generally fixed in advance. Therefore, if one attempts to morphologically analyze sentences belonging to a new domain, or sentences with new properties, that the segmentation model does not cover, it is naturally very difficult to obtain correct results. On the other hand, if one attempts to correct the segmentation model by a technique such as machine learning, the time required for the correction may grow unpredictably. It is therefore desirable to correct the segmentation model for morphological analysis automatically within a fixed amount of time.

A natural language processing system according to one aspect of the present invention includes: an analysis unit that performs morphological analysis on a sentence using a segmentation model obtained by machine learning with one or more pieces of training data, and sets, for each segmented element obtained by dividing the sentence, a label indicating at least the part of speech of a word, the segmentation model containing scores of emission features, each representing a correspondence between a segmented element and a label, and scores of transition features, each representing a combination of the two labels corresponding to two consecutive segmented elements; and a correction unit that compares the labels in the analysis result obtained by the analysis unit with correct-answer data indicating the correct labels for the sentence, sets the scores of the emission features and transition features associated with the correct labels corresponding to incorrect labels higher than their current values, sets the scores of the emission features and transition features associated with those incorrect labels lower than their current values, and thereby corrects the segmentation model that the analysis unit uses in the morphological analysis of the next sentence.

A natural language processing method according to one aspect of the present invention is a natural language processing method executed by a natural language processing system including a processor, and includes: an analysis step of performing morphological analysis on a sentence using a segmentation model obtained by machine learning with one or more pieces of training data, and setting, for each segmented element obtained by dividing the sentence, a label indicating at least the part of speech of a word, the segmentation model containing scores of emission features, each representing a correspondence between a segmented element and a label, and scores of transition features, each representing a combination of the two labels corresponding to two consecutive segmented elements; and a correction step of comparing the labels in the analysis result obtained in the analysis step with correct-answer data indicating the correct labels for the sentence, setting the scores of the emission features and transition features associated with the correct labels corresponding to incorrect labels higher than their current values, setting the scores of the emission features and transition features associated with those incorrect labels lower than their current values, and thereby correcting the segmentation model used in the morphological analysis of the next sentence in the analysis step.

A natural language processing program according to one aspect of the present invention causes a computer to function as: an analysis unit that performs morphological analysis on a sentence using a segmentation model obtained by machine learning with one or more pieces of training data, and sets, for each segmented element obtained by dividing the sentence, a label indicating at least the part of speech of a word, the segmentation model containing scores of emission features, each representing a correspondence between a segmented element and a label, and scores of transition features, each representing a combination of the two labels corresponding to two consecutive segmented elements; and a correction unit that compares the labels in the analysis result obtained by the analysis unit with correct-answer data indicating the correct labels for the sentence, sets the scores of the emission features and transition features associated with the correct labels corresponding to incorrect labels higher than their current values, sets the scores of the emission features and transition features associated with those incorrect labels lower than their current values, and thereby corrects the segmentation model that the analysis unit uses in the morphological analysis of the next sentence.

In this aspect, each time one sentence is morphologically analyzed, the analysis result is compared with the correct-answer data, and the segmentation model is corrected based on the differences between them. By correcting the segmentation model sentence by sentence in this way, the time required to correct the segmentation model when processing multiple sentences is kept to roughly linear growth in the number of sentences, so the segmentation model for morphological analysis can be corrected automatically within a fixed amount of time (in other words, within a predictable time range).

According to one aspect of the present invention, the segmentation model for morphological analysis can be corrected automatically within a fixed amount of time.

10‧‧‧Natural language processing system

11‧‧‧Acquisition unit

12‧‧‧Analysis unit

13‧‧‧Correction unit

20‧‧‧Segmentation model

P1‧‧‧Natural language processing program

P10‧‧‧Main module

P11‧‧‧Acquisition module

P12‧‧‧Analysis module

P13‧‧‧Correction module

[Fig. 1] Conceptual diagram of the processing in the natural language processing system according to the embodiment.

[Fig. 2] Diagram showing an example of morphological analysis in the embodiment.

[Fig. 3] Diagram showing the hardware configuration of a computer constituting the natural language processing system according to the embodiment.

[Fig. 4] Block diagram showing the functional configuration of the natural language processing system according to the embodiment.

[Fig. 5] Conceptual diagram of an example of label assignment.

[Fig. 6] (a) and (b) are schematic diagrams each showing an example of score updating.

[Fig. 7] Flowchart showing the operation of the natural language processing system according to the embodiment.

[Fig. 8] Diagram showing the configuration of the natural language processing program according to the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, the same or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.

First, the function and configuration of the natural language processing system 10 according to the embodiment will be described with reference to Figs. 1 to 5. The natural language processing system 10 is a computer system that performs morphological analysis. Morphological analysis is the process of dividing a sentence into a morpheme string and then determining the part of speech of each morpheme. A sentence is a unit of linguistic expression representing one complete statement, and is expressed as a character string. A morpheme is the smallest linguistic unit that has meaning. A morpheme string is the sequence of one or more morphemes obtained by dividing a sentence into one or more morphemes. A part of speech is a classification of words according to their grammatical function or form.

The natural language processing system 10 performs morphological analysis on each sentence using the segmentation model 20. One of the features of the natural language processing system 10 is that, while the segmentation model 20 is being learned, the segmentation model 20 is corrected every time one sentence is morphologically analyzed. When the correction of the segmentation model 20 is finished, the natural language processing system 10 with the finalized segmentation model 20 is provided to the user. The user can have the natural language processing system 10 perform morphological analysis, and at that point morphological analysis can be performed without correcting the segmentation model 20. In this specification, a "segmentation model" is the set of criteria (clues) for dividing a sentence into one or more morphemes, expressed as a score for each feature. The segmentation model is obtained by machine learning using one or more pieces of training data. Training data is data indicating at least a sentence that has been divided into words and the part of speech of each word obtained by dividing the sentence. A feature is a clue needed to obtain a correct result in morphological analysis. In general, there is no limitation on what may be used as a feature (clue). The score of a feature is a numerical value expressing the plausibility of that feature.

Fig. 1 schematically illustrates the concept of the processing in the natural language processing system 10 according to the present embodiment. The gear M in Fig. 1 represents the execution of morphological analysis. At a certain point in time, the natural language processing system 10 divides a sentence s1 into one or more morphemes by performing morphological analysis using a segmentation model w1. In the present embodiment, the natural language processing system 10 divides the sentence into individual characters and performs processing in units of characters in order to divide the sentence into one or more morphemes. That is, in the present embodiment, the segmented element to be processed is a character. The natural language processing system 10 expresses the result of morphological analysis by setting a label for each character (segmented element). In this specification, a "label" is a tag indicating the attribute or function of a character. Labels are described in more detail later.

Once morphological analysis has been performed, the natural language processing system 10 accepts data indicating the correct answer of the morphological analysis of the sentence s1 (correct-answer data), and obtains a new segmentation model w2 by comparing the analysis result with the correct-answer data and correcting the segmentation model w1. Specifically, when the label assignment in at least part of the morphological analysis of the sentence s1 is wrong, the natural language processing system 10 evaluates the entire analysis result as wrong. The natural language processing system 10 then evaluates the features corresponding to the labels in the correct-answer data as "correct (+1)" and sets their scores higher than their current values, and evaluates the features corresponding to the labels in the analysis result as "wrong (-1)" and sets their scores lower than their current values, thereby obtaining the segmentation model w2. If some of the labels in the analysis result are correct answers, the two evaluations "correct (+1)" and "wrong (-1)" of the features associated with those labels (the correctly answered labels) ultimately cancel each other out. Therefore, the process of raising or lowering feature scores as described above can be described as raising the scores of the features associated with the correct labels corresponding to the incorrect labels (the correct labels for the incorrectly answered portions), and lowering the scores of the features associated with those incorrect labels (the labels of the incorrect portions).

Alternatively, the natural language processing system 10 may evaluate each label in the correct-answer data as "correct (+1)" and, on the other hand, evaluate the label of each character in the analysis result as "wrong (-1)", cancel the two evaluation results against each other for each label, and then raise the scores of the features corresponding to labels evaluated as "correct (+1)" and lower the scores of the features corresponding to labels evaluated as "wrong (-1)".

For example, suppose that the sentence s1 contains five characters xa, xb, xc, xd, xe. Suppose further that the correct labels of the characters xa, xb, xc, xd, xe are ta, tb, tc, td, te, respectively, and that the labels obtained for the characters by morphological analysis are ta, tg, th, td, te. In this case, the natural language processing system 10 evaluates the features corresponding to the labels ta, tb, tc, td, te in the correct-answer data as "correct (+1)" and sets their scores higher than their current values, and evaluates the features corresponding to the labels ta, tg, th, td, te in the analysis result as "wrong (-1)" and sets their scores lower than their current values. As a net result, the scores of the features corresponding to the labels ta, td, te remain unchanged from before the update, the scores of the features corresponding to the correct labels tb, tc become higher, and the scores of the features corresponding to the incorrect labels tg, th become lower.
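The net effect of these two passes over the labels can be sketched in a few lines. This is an illustrative sketch only, not the patent's implementation: the label names follow the example above, and the fixed update step of 1.0 is an assumption (the text only requires that scores be set higher or lower than their current values).

```python
from collections import defaultdict

def update_scores(scores, gold_labels, predicted_labels):
    """Raise the score of every feature tied to a gold label (+1) and
    lower the score of every feature tied to a predicted label (-1).
    Where gold and predicted labels agree, the two updates cancel."""
    for label in gold_labels:
        scores[label] += 1.0   # evaluated as "correct (+1)"
    for label in predicted_labels:
        scores[label] -= 1.0   # evaluated as "wrong (-1)"
    return scores

# The example from the text: gold labels ta..te vs. predicted ta,tg,th,td,te
scores = defaultdict(float)
gold = ["ta", "tb", "tc", "td", "te"]
predicted = ["ta", "tg", "th", "td", "te"]
update_scores(scores, gold, predicted)
# ta, td, te cancel out; tb, tc end up higher; tg, th end up lower
```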

When performing morphological analysis on the next sentence s2, the natural language processing system 10 uses the segmentation model w2. The natural language processing system 10 then accepts the correct-answer data for the morphological analysis of the sentence s2, compares the execution result with the correct-answer data, and corrects the segmentation model w2 in the same way as when the segmentation model w1 was corrected, obtaining a new segmentation model w3.

In this way, the natural language processing system 10 corrects the segmentation model (w1→w2, w2→w3, ..., wt→wt+1) each time it processes one sentence (s1, s2, ..., st), and uses the corrected segmentation model in the morphological analysis of the next sentence. This technique of updating the model each time one piece of training data is processed is also called "online learning" or "online machine learning".
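A minimal sketch of such an online learning loop is shown below, assuming a toy model that scores only (label, character) pairs. The label set, the stand-in `analyze` function, and the unit update step are all assumptions made for illustration; the actual system described here also uses bigram and transition features.

```python
def analyze(sentence, model):
    """Toy stand-in for the analysis unit: give each character the
    label whose (label, character) score is currently highest."""
    labels = ["S-N-nc", "S-P-k", "B-V-c"]  # illustrative label set
    return [max(labels, key=lambda t: model.get((t, ch), 0.0))
            for ch in sentence]

def online_train(sentences, gold_answers):
    """One pass of online learning: analyze a sentence with the current
    model w_t, compare against the correct-answer data, correct the
    model into w_{t+1}, and move on. The correction work per sentence is
    bounded, so total time grows roughly linearly in the sentence count."""
    model = {}  # (label, character) -> score
    for sentence, gold in zip(sentences, gold_answers):
        predicted = analyze(sentence, model)
        for ch, g, p in zip(sentence, gold, predicted):
            if g != p:  # matching labels cancel, so only mismatches update
                model[(g, ch)] = model.get((g, ch), 0.0) + 1.0
                model[(p, ch)] = model.get((p, ch), 0.0) - 1.0
    return model
```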

An example of the result of morphological analysis performed by the natural language processing system 10 is shown in Fig. 2. In this example, the natural language processing system 10 divides the Japanese sentence 「本を買って」 (hon wo katte), which corresponds to the English "I bought a book", into five characters, x1: 本 (hon), x2: を (wo), x3: 買 (ka), x4: っ (t), x5: て (te). The natural language processing system 10 then sets a label for each character by performing morphological analysis. In the present embodiment, a label is a combination of the appearance pattern of a character within a word, the part of speech of that word, and the subcategory of that part of speech, and is expressed with alphabetic strings such as "S-N-nc".

The appearance pattern is information indicating whether a character forms a word by itself or combines with other characters to form a word, and, when the character is part of a word of two or more characters, where in the word the character is located. In the present embodiment, the appearance pattern is expressed by one of S, B, I, and E. The appearance pattern "S" indicates that the character forms a word by itself. The appearance pattern "B" indicates that the character is at the beginning of a word of two or more characters. The appearance pattern "I" indicates that the character is in the middle of a word of three or more characters. The appearance pattern "E" indicates that the character is at the end of a word of two or more characters. The example of Fig. 2 shows that the characters x1, x2, x5 are each single-character words, and that the characters x3, x4 form one word.

The scheme for appearance patterns is not limited to this. Although the present embodiment adopts the "SBIEO" scheme, a scheme such as "IOB2", which is well known to those skilled in the art, may also be adopted.

Examples of parts of speech include nouns, verbs, particles, adjectives, adjectival verbs, and conjunctions. In the present embodiment, nouns are represented by "N", particles by "P", and verbs by "V". The example of Fig. 2 shows that the character x1 is a noun, the character x2 is a particle, the word formed by the characters x3, x4 is a verb, and the character x5 is a particle.

The subcategory of a part of speech represents a subordinate concept of the corresponding part of speech. For example, nouns can be further classified into common nouns and proper nouns, and particles can be further classified into case particles, conjunctive particles, binding particles, and so on. In the present embodiment, common nouns are represented by "nc", proper nouns by "np", case particles by "k", conjunctive particles by "sj", and common verbs by "c". The example of Fig. 2 shows that the character x1 is a common noun, the character x2 is a case particle, the word formed by the characters x3, x4 is a common verb, and the character x5 is a conjunctive particle.
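Given a segmented sentence with parts of speech and subcategories, the per-character labels of this scheme can be generated mechanically. The helper below is a sketch under the SBIEO scheme described above; its input format, a list of (word, part of speech, subcategory) tuples, is an assumption made for the example.

```python
def sbieo_tags(words):
    """Produce one 'position-POS-subcategory' label per character from a
    list of (word, part_of_speech, subcategory) tuples, using the SBIEO
    scheme: S = single-character word, B = beginning, I = inside,
    E = end of a multi-character word."""
    tags = []
    for word, pos, sub in words:
        if len(word) == 1:
            tags.append(f"S-{pos}-{sub}")
        else:
            tags.append(f"B-{pos}-{sub}")
            tags.extend(f"I-{pos}-{sub}" for _ in word[1:-1])
            tags.append(f"E-{pos}-{sub}")
    return tags

# The Fig. 2 sentence: 本 / を / 買っ / て
words = [("本", "N", "nc"), ("を", "P", "k"), ("買っ", "V", "c"), ("て", "P", "sj")]
tags = sbieo_tags(words)
# → ['S-N-nc', 'S-P-k', 'B-V-c', 'E-V-c', 'S-P-sj']
```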

The feature scores stored in the segmentation model 20 are the scores of emission features and the scores of transition features.

An emission feature is a clue indicating a correspondence between a label and a character or a character type. In other words, an emission feature indicates which labels tend to correspond to which characters or character types. Emission features are the feature representation corresponding to the emission matrix of a hidden Markov model. The present embodiment uses emission features over unigrams (strings of a single character) and emission features over bigrams (strings of two consecutive characters).

Here, a character type is a category of characters in a given language. Examples of Japanese character types include kanji, hiragana, katakana, Latin letters (uppercase and lowercase), Arabic numerals, kanji numerals, and the middle dot (‧). In the present embodiment, character types are represented by Latin letters. For example, "C" represents kanji, "H" represents hiragana, "K" represents katakana, "L" represents Latin letters, and "A" represents Arabic numerals. The example of Fig. 2 shows that the characters x1, x3 are kanji and the characters x2, x4, x5 are hiragana.

The emission feature of a character unigram is a clue indicating a correspondence between a label t and a character x. The emission feature of a character-type unigram is a clue indicating a correspondence between a label t and a character type c. In the present embodiment, the score s of the plausibility of the correspondence between a label t and a character x is expressed as {t/x,s}, and the score s of the plausibility of the correspondence between a label t and a character type c is expressed as {t/c,s}. The segmentation model 20 contains scores for a plurality of labels for each character or character type. If data is prepared on all kinds of labels for a given character or character type, the segmentation model 20 also contains scores for combinations of labels and characters or character types that are grammatically impossible in practice. However, the scores of grammatically impossible features are relatively low.

Shown below are examples of emission feature scores for the Japanese character 本 (hon). Although this character cannot be a particle in Japanese grammar, data is still prepared, as described above, for features such as "S-P-k/本 (hon)" that do not occur grammatically.

{S-N-nc/本(hon), 0.0420}

{B-N-nc/本(hon), 0.0310}

{S-P-k/本(hon), 0.0003}

{B-V-c/本(hon), 0.0031}

Also shown below are examples of emission feature scores for the character type kanji.

{S-N-nc/C, 0.0255}

{E-N-np/C, 0.0488}

{S-P-k/C, 0.0000}

{B-V-c/C, 0.0299}

Data indicating grammatically non-occurring features may also be prepared for character types. For example, although in Japanese grammar a word written in Arabic numerals cannot be a particle, data is still prepared for features such as "S-P-k/A".

The emission feature of a character bigram is a clue indicating a correspondence between a label t and a character string xixi+1. The emission feature of a character-type bigram is a clue indicating a correspondence between a label t and a sequence of character types cici+1. In the present embodiment, the score s of the plausibility of a label t and characters xi, xi+1 is expressed as {t/xi/xi+1,s}, and the score s of the plausibility of a label t and character types ci, ci+1 is expressed as {t/ci/ci+1,s}. If data is prepared on all possible labels for a given bigram, the segmentation model 20 also stores data on combinations of labels and bigrams that are grammatically impossible in practice.

Shown below are examples of emission feature scores for the bigram 本を (hon wo).

{S-N-nc/本(hon)/を(wo), 0.0420}

{B-N-nc/本(hon)/を(wo), 0.0000}

{S-P-k/本(hon)/を(wo), 0.0001}

{B-V-c/本(hon)/を(wo), 0.0009}

Also shown below are examples of emission feature scores for bigrams in which a hiragana character follows a kanji character.

{S-N-nc/C/H, 0.0455}

{E-N-np/C/H, 0.0412}

{S-P-k/C/H, 0.0000}

{B-V-c/C/H, 0.0054}

所謂遷移素性,係為表示文字xi之標籤ti與其下個文字xi+1之標籤ti+1之組合(對應於連續2文字的二個標籤所成的組合)的線索。該遷移素性係為有關於雙連詞的素性。遷移素性,係對應於隱馬爾可夫模型的遷移矩陣的素性表現。在本實施形態中,標籤ti與標籤ti+1之組合的最合理樣貌的分數s,以{ti/ti+1,s}表示。若有準備關於所有可能存在之組合的遷移素性的資料,則分割模型20也會記憶,關於在文法上實際不可能發生的二個標籤之組合的資料。 A transition feature is a clue indicating the combination of the tag ti of a character xi and the tag ti+1 of the next character xi+1 (that is, the combination of the two tags assigned to two consecutive characters). A transition feature is thus a feature concerning a bigram, and is the feature-based counterpart of the transition matrix of a hidden Markov model. In this embodiment, the score s of the most plausible combination of the tags ti and ti+1 is expressed as {ti/ti+1, s}. If transition feature data is prepared for every possible combination, the segmentation model 20 also stores data on combinations of two tags that are grammatically impossible in practice.

以下,展示遷移素性之分數的數個例子。 Several examples of transition feature scores are shown below.

{S-N-nc/S-P-k,0.0512} {S-N-nc/S-P-k, 0.0512}

{E-N-nc/E-N-nc,0.0000} {E-N-nc/E-N-nc, 0.0000}

{S-P-k/B-V-c,0.0425} {S-P-k/B-V-c, 0.0425}

{B-V-c/I-V-c,0.0387} {B-V-c/I-V-c, 0.0387}

自然語言處理系統10係具備1台以上之電腦,在具備複數台電腦的情況下,後述的自然語言處理系統10之各機能要素係藉由分散處理而實現。每個電腦之種類係沒有限定。例如,亦可使用桌上型或攜帶型之個人電腦(PC),也可使用工作站,也可使用高機能行動電話機(智慧型手機)或行動電話機、行動資訊終端(PDA)等之行動終端。或者,亦可將各式種類之電腦加以組合來建構自然語言處理系統10。使用複數台電腦的情況下,這些電腦係可透過網際網路或內部網路等之通訊網路而連接。 The natural language processing system 10 includes one or more computers; when it includes a plurality of computers, each functional element of the natural language processing system 10 described later is realized by distributed processing. The type of each computer is not limited. For example, a desktop or portable personal computer (PC), a workstation, or a mobile terminal such as a high-performance mobile phone (smartphone), a mobile phone, or a personal digital assistant (PDA) may be used. Alternatively, various types of computers may be combined to construct the natural language processing system 10. When a plurality of computers are used, they can be connected via a communication network such as the Internet or an intranet.

自然語言處理系統10內的每個電腦100的一般硬體構成,示於圖3。電腦100係具備有:執行作業系統或應用程式等的CPU(處理器)101、由ROM及RAM所構成的主記憶部102、由硬碟或快閃記憶體等所構成的輔助記憶部103、由網路卡或無線通訊模組等所構成的通訊控制部104、由鍵盤或滑鼠等所成之輸入裝置105、顯示器或印表機等之輸出裝置106。當然,所搭載的硬體模組係隨電腦100之種類而不同。例如,桌上型之PC及工作站通常作為輸入裝置及輸出裝置會具備鍵盤、滑鼠、及螢幕,但智慧型手機通常是以觸控面板來作為輸入裝置及輸出裝置的功能。 The general hardware configuration of each computer 100 in the natural language processing system 10 is shown in Fig. 3. The computer 100 includes: a CPU (processor) 101 that executes an operating system, application programs, and the like; a main memory unit 102 composed of ROM and RAM; an auxiliary memory unit 103 composed of a hard disk, flash memory, or the like; a communication control unit 104 composed of a network card, a wireless communication module, or the like; an input device 105 such as a keyboard or a mouse; and an output device 106 such as a display or a printer. Of course, the mounted hardware modules differ depending on the type of the computer 100. For example, desktop PCs and workstations usually have a keyboard, a mouse, and a monitor as input and output devices, whereas a smartphone typically uses a touch panel for the functions of both input device and output device.

後述的自然語言處理系統10的各機能要素,係藉由將所定之軟體讀入到CPU101或主記憶部102上,在CPU101的控制之下使通訊控制部104或輸入裝置105、輸出裝置106等作動,進行主記憶部102或輔助記憶部103中的資料讀出或寫入而實現之。處理所需之資料或資料庫,係被儲存在主記憶部102或輔助記憶部103內。 Each function element of the natural language processing system 10 to be described later reads the predetermined software into the CPU 101 or the main memory unit 102, and causes the communication control unit 104, the input device 105, the output device 106, and the like under the control of the CPU 101. The operation is performed by reading or writing data in the main memory unit 102 or the auxiliary memory unit 103. The data or database required for processing is stored in the main memory unit 102 or the auxiliary memory unit 103.

另一方面,分割模型20係預先記憶在記憶裝置中。分割模型20之具體的實作方法係無限定,例如分割模型20係亦可以關連資料庫或是文字檔的方式而被準備。又,分割模型20之設置場所係無限定,例如,分割模型20係可存在於自然語言處理系統10的內部,也可存在於與自然語言處理系統10不同的其他電腦系統內。分割模型20位於其他電腦系統內時,自然語言處理系統10係透過通訊網路而存取分割模型20。 On the other hand, the segmentation model 20 is stored in advance in a memory device. The concrete implementation of the segmentation model 20 is not limited; for example, it may be prepared as a relational database or as a text file. The location of the segmentation model 20 is likewise not limited; for example, it may reside inside the natural language processing system 10, or in another computer system different from the natural language processing system 10. When the segmentation model 20 resides in another computer system, the natural language processing system 10 accesses it via a communication network.

如上述,分割模型20係也可說是各種素性之分數的集合。在數式上,含有n個素性之分數w1,w2,…,wn的分割模型20,可以向量w={w1,w2,…,wn}來表示。分割模型20被新作成的時點上的各素性之分數係全部為0。亦即,係為w={0,0,…,0}。藉由後述的自然語言處理系統10之處理,該分數會漸漸被更新。在處理過某種程度多的語句後,係如上述般地在每個素性的分數之間就會逐漸產生差異。 As described above, the segmentation model 20 can be regarded as a collection of scores of various features. Mathematically, a segmentation model 20 containing the scores w1, w2, ..., wn of n features can be represented by the vector w = {w1, w2, ..., wn}. At the time the segmentation model 20 is newly created, the scores of all features are 0; that is, w = {0, 0, ..., 0}. These scores are gradually updated by the processing of the natural language processing system 10 described later. After a reasonably large number of sentences have been processed, differences gradually arise between the scores of the individual features, as described above.
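As an illustration only (not part of the patented implementation), the segmentation model just described — a collection of feature scores, all zero when newly created — can be sketched as a mapping from feature keys to weights. The key strings follow the {tag/observation} notation of the earlier examples, and the concrete values are the sample scores quoted in the text.

```python
from collections import defaultdict

# A minimal sketch of the segmentation model as a feature-to-score mapping.
# defaultdict(float) captures the initial state w = {0, 0, ..., 0}: any
# feature that has never been updated implicitly has score 0.
model = defaultdict(float)

# A few of the sample scores quoted in the text.
model["S-N-nc/本"] = 0.0420      # character-unigram output feature
model["S-N-nc/C/H"] = 0.0455     # character-type-bigram output feature
model["S-N-nc/S-P-k"] = 0.0512   # transition feature

# A combination that was never reinforced stays at 0, e.g. the
# grammatically impossible tag sequence from the examples:
print(model["E-N-nc/E-N-nc"])  # 0.0
```

Storing unseen features implicitly at 0 mirrors the statement that a newly created model is w = {0, 0, ..., 0} while still allowing data for grammatically impossible combinations to be kept.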

如圖4所示,自然語言處理系統10作為機能性構成要素係具備取得部11、解析部12、及修正部13。自然語言處理系統10係因應需要而存取分割模型20。以下說明各機能要素,但在本實施形態中是以自然語言處理系統10處理日文之語句為前提來說明。當然,自然語言處理系統10所處理的語句之語言係不限定於日文,亦可解析中文等之其他語言的語句。 As shown in FIG. 4, the natural language processing system 10 includes an acquisition unit 11, an analysis unit 12, and a correction unit 13 as functional components. The natural language processing system 10 accesses the segmentation model 20 as needed. Although each functional element will be described below, this embodiment is described on the premise that the natural language processing system 10 processes the Japanese sentence. Of course, the language of the sentence processed by the natural language processing system 10 is not limited to Japanese, and sentences of other languages such as Chinese can also be analyzed.

取得部11,係為用來取得欲分割成詞素串的語句的機能要素。取得部11所做的語句之取得方法係無限定。例如,取得部11係亦可從網際網路上的任意網站收集語句(也就是所謂的爬行(crawling))。或者,取得部11係亦可將自然語言處理系統10內之資料庫中所預先積存的語句予以讀出,也可將位於自然語言處理系統10以外之電腦系統上的資料庫中所預先積存的語句經由通訊網路進行存取而讀出。或者,取得部11係亦可受理自然語言處理系統10之使用者所輸入的語句。一旦最初之語句解析的指示被輸入,取得部11係取得一個語句並輸出至解析部12。其後,一旦從後述的修正部13輸入了完成通知,則取得部11係取得下個語句並輸出至解析部12。 The acquisition unit 11 is a functional element for acquiring a sentence to be divided into a morpheme string. The method by which the acquisition unit 11 obtains sentences is not limited. For example, the acquisition unit 11 may collect sentences from arbitrary websites on the Internet (so-called crawling). Alternatively, the acquisition unit 11 may read out sentences accumulated in advance in a database within the natural language processing system 10, or may read out, via a communication network, sentences accumulated in advance in a database on a computer system other than the natural language processing system 10. Alternatively, the acquisition unit 11 may accept a sentence input by a user of the natural language processing system 10. When an instruction for the first sentence analysis is input, the acquisition unit 11 acquires one sentence and outputs it to the analysis unit 12. Thereafter, each time a completion notification is input from the correction unit 13 described later, the acquisition unit 11 acquires the next sentence and outputs it to the analysis unit 12.

解析部12係為對每個語句執行構詞解析的機能要素。解析部12係每次被輸入一個語句就執行以下的處理。 The analysis unit 12 is a functional element that performs word formation analysis for each sentence. The analysis unit 12 executes the following processing every time a sentence is input.

首先,解析部12係將一個語句分割成各個文字,判定各文字之文字種別。解析部12係預先記憶有文字與文字種別之對比表、或用來判定文字種別所需之正規表現,使用該對比表或正規表現來判定文字種別。 First, the analysis unit 12 divides a single sentence into individual characters, and determines the character type of each character. The analysis unit 12 preliminarily stores a comparison table of characters and character types, or a regular expression required for determining the type of characters, and determines the type of characters using the comparison table or the regular expression.
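As a sketch (not taken from the patent itself), the comparison-table or regular-expression lookup described here can be implemented with Unicode-range patterns. The one-letter type codes C (kanji), H (hiragana), K (katakana), and A (Arabic digit) follow the convention the examples above suggest; the exact inventory of character types is an assumption.

```python
import re

# Hypothetical character-type codes: H = hiragana, K = katakana,
# C = kanji (CJK unified ideograph), A = Arabic digit, O = other.
CHAR_TYPE_PATTERNS = [
    ("H", re.compile(r"[\u3041-\u3096]")),        # hiragana
    ("K", re.compile(r"[\u30A1-\u30FA\u30FC]")),  # katakana (incl. long vowel mark)
    ("C", re.compile(r"[\u4E00-\u9FFF]")),        # kanji
    ("A", re.compile(r"[0-9０-９]")),              # Arabic digits (half/full width)
]

def char_type(ch: str) -> str:
    """Return the character-type code of a single character."""
    for code, pattern in CHAR_TYPE_PATTERNS:
        if pattern.match(ch):
            return code
    return "O"

print([char_type(c) for c in "本を買って"])  # ['C', 'H', 'C', 'H', 'H']
```

The list order gives each character exactly one type, matching the one-type-per-character assumption of the character-type features.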

接下來,解析部12係使用維特比演算法(Viterbi algorithm)來決定各文字的標籤。對第i個文字,解析部12係針對最終有被選擇之可能性的標籤(候補標籤)之每一者,判定與第(i-1)個文字之複數候補標籤之中的哪個候補標籤連接時分數(此亦稱為「連接分數」)會是最高。此處,連接分數係為,關於計算對象之標籤的各種分數(單連詞之輸出素性之分數、雙連詞之輸出素性之分數、及遷移素性之分數)的合計值。例如,解析部12係為,若第i個標籤為「S-N-nc」,則第(i-1)個標籤是「S-P-k」時就認為連接分數最高,若第i個標籤為「S-V-c」,則第(i-1)個標籤是「E-N-nc」時就認為連接分數最高…等,如此判定。然後,解析部12,係將連接分數最高的組合(例如(S-P-k,S-N-nc)、(E-N-nc,S-V-c)等),全部加以記憶。解析部12,係從最初的文字起到文末記號為止每次前進1文字而執行此種處理。 Next, the analysis unit 12 determines the tag of each character using the Viterbi algorithm. For the i-th character, the analysis unit 12 determines, for each tag that may finally be selected (candidate tag), which of the plural candidate tags of the (i-1)th character yields the highest score when connected to it (this score is also called the "connection score"). Here, the connection score is the total of the various scores concerning the tag being computed (the character-unigram output feature score, the character-bigram output feature score, and the transition feature score). For example, the analysis unit 12 determines that if the i-th tag is "S-N-nc", the connection score is highest when the (i-1)th tag is "S-P-k"; that if the i-th tag is "S-V-c", the connection score is highest when the (i-1)th tag is "E-N-nc"; and so on. The analysis unit 12 then stores all of the combinations with the highest connection scores (for example, (S-P-k, S-N-nc), (E-N-nc, S-V-c), and so on). The analysis unit 12 performs this processing one character at a time, from the first character to the end-of-sentence marker.

對文末記號係僅存在一種類之標籤(EOS),因此連接分數為最高、最後文字與文末記號之標籤的組合係決定一個(例如決定成該組合是(E-V-c,EOS))。如此一來,最後文字的標籤係被決定(例如該標籤係被決定為「E-V-c」),其結果為,從最後倒數第2個文字之標籤也被決定。結果,從語句之最後往開頭依序以順藤摸瓜方式,確定標籤。 Since only one kind of tag (EOS) exists for the end-of-sentence marker, exactly one combination of the tag of the last character and the end-of-sentence marker has the highest connection score (for example, the combination is determined to be (E-V-c, EOS)). The tag of the last character is thereby determined (for example, it is determined to be "E-V-c"), and as a result the tag of the second-to-last character is also determined. The tags are thus fixed in sequence by tracing back from the end of the sentence to its beginning.

此種解析部12所做的處理,模式性示於圖5。圖5係表示由4文字所成之語句的標籤賦予之一例。為了簡化說明,在此例中是將標籤簡化成如「A1」「B2」等來表示,將各文字之候補標籤的個數設成3。圖5中的粗線係表示,將語句從前方進行處理所得的,連接分數被判定為最高的標籤與標籤之組合。例如在第3文字之處理中,對於標籤C1係與標籤B1的連接分數為最高,對於標籤C2係與標籤B1的連接分數為最高,對於標籤C3係與標籤B2的連接分數為最高。在圖5之例子中,若處理到語句之最後(EOS),則組合(D1,EOS)係被確定,接下來,組合(C2,D1)係被確定,其後,組合(B1,C2),(A2,B1)係被依序確定。因此,解析部12係判定第1~4文字之標籤分別為A2,B1,C2,D1。 The processing performed by such analysis unit 12 is schematically shown in FIG. Fig. 5 is a diagram showing an example of label assignment of a sentence made of four characters. In order to simplify the description, in this example, the label is simplified to be expressed as "A1", "B2", etc., and the number of candidate labels for each character is set to 3. The thick line in Fig. 5 indicates the combination of the label and the label whose connection score is determined to be the highest, which is obtained by processing the sentence from the front. For example, in the processing of the third character, the connection score between the tag C1 and the tag B1 is the highest, the connection score between the tag C2 and the tag B1 is the highest, and the connection score between the tag C3 and the tag B2 is the highest. In the example of Figure 5, if processed to the end of the statement (EOS), the combination (D1, EOS) is determined, then the combination (C2, D1) is determined, and then, the combination (B1, C2) , (A2, B1) is determined sequentially. Therefore, the analysis unit 12 determines that the labels of the first to fourth characters are A2, B1, C2, and D1, respectively.
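The dynamic-programming procedure just described can be sketched as follows. This is an illustrative Viterbi decoder, not the patented implementation: `score(prev, cur, i)` is a hypothetical stand-in for the connection score (the sum of the output feature and transition feature scores), and the single-EOS property is folded into taking the best-scoring final label and tracing its stored path back.

```python
def viterbi(n_chars, labels, score):
    """Return the best label sequence over n_chars positions.

    score(prev, cur, i) is the connection score for assigning label `cur`
    at position i when position i-1 carries label `prev` (prev is None
    at the first position).
    """
    # best[label] = (total score, best path ending in that label)
    best = {None: (0.0, [])}
    for i in range(n_chars):
        new_best = {}
        for cur in labels:
            # pick the predecessor label with the highest connection score
            prev, (total, path) = max(
                best.items(), key=lambda kv: kv[1][0] + score(kv[0], cur, i)
            )
            new_best[cur] = (total + score(prev, cur, i), path + [cur])
        best = new_best
    # only one terminal label (EOS) exists, so the best-scoring final
    # label fixes the entire sequence by backtracking
    _, path = max(best.values(), key=lambda v: v[0])
    return path
```

With a toy score function, the decoder reproduces the behaviour of Fig. 5: each position keeps, per candidate label, only the best predecessor, and the unique final choice determines the whole path.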

解析部12係將各文字被賦予標籤的語句,當作解析結果而輸出。解析部12係至少將解析結果輸出至修正部13,但其理由為,該解析結果是分割模型20之修正上所必須。解析部12係亦可執行更多輸出。例如,解析部12係亦可將解析結果顯示在螢幕上或令印表機列印,將解析結果寫入文字檔等,也可將解析結果儲存在記憶體或資料庫等之記憶裝置。或者,解析部12係將解析結果經由通訊網路而發送至自然語言處理系統10以外之其他任意電腦系統。 The analysis unit 12 outputs, as the analysis result, the sentence in which each character has been given a tag. The analysis unit 12 outputs the analysis result at least to the correction unit 13, because the analysis result is necessary for correcting the segmentation model 20. The analysis unit 12 may also perform further output. For example, it may display the analysis result on a screen or print it on a printer, write it to a text file or the like, or store it in a memory device such as a memory or a database. Alternatively, the analysis unit 12 may transmit the analysis result via a communication network to any computer system other than the natural language processing system 10.

修正部13係為,根據從解析部12所得之解析結果、與該語句之構詞解析之正確答案的差,來修正分割模型20的機能要素。本說明書中所謂「分割模型之修正」,係為將分割模型內之至少一個素性之分數加以變更的處理。此外,隨著情況不同,有可能即使欲變更某個分數,結果值卻沒有改變。修正部13係每次有一個解析結果被輸入,就執行以下處理。 The correction unit 13 is a functional element that corrects the segmentation model 20 based on the difference between the analysis result obtained from the analysis unit 12 and the correct answer of the word formation analysis of that sentence. In this specification, "correction of the segmentation model" is a process of changing the score of at least one feature in the segmentation model. Depending on the situation, even when a certain score is to be changed, the resulting value may remain unchanged. The correction unit 13 executes the following processing each time one analysis result is input.

首先,修正部13係取得所被輸入的解析結果所對應之正確解答資料,亦即,表示已被解析部12所處理過的語句之構詞解析之正確答案的資料。本實施形態中的所謂正確解答資料,係為表示形成語句的各文字之標籤(出現態樣、品詞、及品詞之子級別之組合)的資料。該正確解答資料係由人手所作成。修正部13所做的正確解答資料之取得方法係無限定。例如,修正部13係亦可將自然語言處理系統10內之資料庫中所預先積存的正確解答資料予以讀出,也可將位於自然語言處理系統10以外之電腦系統上的資料庫中所預先積存的正確解答資料經由通訊網路進行存取而讀出。或者,修正部13係亦可受理自然語言處理系統10之使用者所輸入的正確解答資料。 First, the correction unit 13 acquires the correct answer data corresponding to the input analysis result, that is, data indicating the correct answer of the word formation analysis of the sentence already processed by the analysis unit 12. The correct answer data in this embodiment is data indicating the tag of each character forming the sentence (a combination of the occurrence form, the part of speech, and the part-of-speech subclass). The correct answer data is created manually. The method by which the correction unit 13 obtains the correct answer data is not limited. For example, the correction unit 13 may read out correct answer data accumulated in advance in a database within the natural language processing system 10, or may read out, via a communication network, correct answer data accumulated in advance in a database on a computer system other than the natural language processing system 10. Alternatively, the correction unit 13 may accept correct answer data input by a user of the natural language processing system 10.

一旦取得正確解答資料,則修正部13係將所被輸入的解析結果與該正確解答資料進行比較而特定出兩者之間的差。 When the correct answer data is obtained, the correcting unit 13 compares the input analysis result with the correct answer data to specify the difference between the two.

若解析結果是與正確解答資料完全一致沒有差異,則修正部13係不修正分割模型20就結束處理,生成完成通知並輸出至取得部11。該完成通知係為表示,修正部13的處理結束而可對下個語句執行構詞解析的訊號。解析結果與正確解答資料完全一致這件事情,代表至少此時點上不需要修正分割模型20,因此自然語言處理系統10(更具體而言係為解析部12)係直接使用現在之分割模型20來解析下個語句。 If the analysis result completely matches the correct answer data, with no difference, the correction unit 13 ends the processing without correcting the segmentation model 20, and generates a completion notification and outputs it to the acquisition unit 11. This completion notification is a signal indicating that the processing of the correction unit 13 has finished and that word formation analysis can be performed on the next sentence. The fact that the analysis result completely matches the correct answer data means that, at least at this point, the segmentation model 20 does not need to be corrected; the natural language processing system 10 (more specifically, the analysis unit 12) therefore analyzes the next sentence using the current segmentation model 20 as-is.

例如,針對上述的日文語句「本を買って(hon wo katte)」(買書)的正確解答資料,係如以下。此外,也適宜的將各文字表示成x1~x5。 For example, the correct answer data for the above Japanese sentence "本を買って (hon wo katte)" ("buy a book") is as follows. The characters are denoted x1 to x5 where convenient.

x1:{S-N-nc} x1: {S-N-nc}

x2:{S-P-k} x2: {S-P-k}

x3:{B-V-c} x3: {B-V-c}

x4:{E-V-c} x4: {E-V-c}

x5:{S-P-sj} x5: {S-P-sj}

因此,當圖2所示的解析結果被輸入時,修正部13係判定該解析結果與正確解答資料是完全一致,不修正分割模型20就將完成通知輸出至取得部11。 Therefore, when the analysis result shown in Fig. 2 is input, the correction unit 13 determines that the analysis result completely matches the correct answer data, and outputs the completion notification to the acquisition unit 11 without correcting the segmentation model 20.

另一方面,若解析結果與正確解答資料並非完全一致(亦即解析結果與正確解答資料之間有差異的情況),則修正部13係將分割模型20之至少一部分之分數予以更新。更具體而言,修正部13係將不正確解答之標籤所對應之正確解答之標籤所關連的素性之分數設成高於現在值,並且將該不正確解答之標籤所關連的素性之分數設成低於現在值。 On the other hand, if the analysis result does not completely match the correct answer data (that is, there is a difference between the two), the correction unit 13 updates at least some of the scores in the segmentation model 20. More specifically, the correction unit 13 sets the scores of the features associated with the correct tags that correspond to the incorrect tags higher than their current values, and sets the scores of the features associated with those incorrect tags lower than their current values.

例如,假設解析部12係從日文語句「本を買って(hon wo katte)」得到以下之解析結果。 For example, suppose the analysis unit 12 obtains the following analysis result from the Japanese sentence "本を買って (hon wo katte)".

x1:{S-N-nc} x1: {S-N-nc}

x2:{S-P-k} x2: {S-P-k}

x3:{B-V-c} x3: {B-V-c}

x4:{I-V-c} x4: {I-V-c}

x5:{E-V-c} x5: {E-V-c}

此情況下,解析結果係整體來看都是錯的,因此修正部13係將正確解答資料內之各標籤所對應之素性評價為「正確(+1)」然後將該素性之分數設成高於現在值,將解析結果內之各標籤所對應之素性評價為「錯誤(-1)」然後將該素性之分數設成低於現在值。若考慮結果被抵銷的部分,則可說是修正部13係最終進行以下處理。 In this case, the analysis result is wrong when viewed as a whole. The correction unit 13 therefore evaluates the features corresponding to each tag in the correct answer data as "correct (+1)" and sets their scores higher than the current values, and evaluates the features corresponding to each tag in the analysis result as "wrong (-1)" and sets their scores lower than the current values. Taking into account the portions that cancel each other out, the correction unit 13 can be said ultimately to perform the following processing.

修正部13係將關於文字x4,x5之正確解答之標籤所對應之輸出素性「E-V-c/っ(t)」「S-P-sj/て(te)」的分數設成大於現在值,將關於不正確解答之標籤所關連之輸出素性「I-V-c/っ(t)」「E-V-c/て(te)」的分數設成小於現在值。藉此,所被解析之語句所關連的單連詞之輸出素性之分數(關於文字的分數)係被更新。 The correction unit 13 sets the scores of the output features "E-V-c/っ(t)" and "S-P-sj/て(te)" corresponding to the correct tags of the characters x4 and x5 higher than their current values, and sets the scores of the output features "I-V-c/っ(t)" and "E-V-c/て(te)" associated with the incorrect tags lower than their current values. In this way, the character-unigram output feature scores (scores concerning characters) associated with the analyzed sentence are updated.

又,修正部13係將關於不正確答案的文字x4,x5之正確解答之標籤所關連的輸出素性「E-V-c/H」「S-P-sj/H」的分數設成大於現在值,將關於不正確解答之標籤所關連的輸出素性「I-V-c/H」「E-V-c/H」的分數設成小於現在值。藉此,所被解析之語句所關連的單連詞之輸出素性之分數(關於文字種別的分數)係被更新。 Likewise, the correction unit 13 sets the scores of the output features "E-V-c/H" and "S-P-sj/H" associated with the correct tags of the incorrectly answered characters x4 and x5 higher than their current values, and sets the scores of the output features "I-V-c/H" and "E-V-c/H" associated with the incorrect tags lower than their current values. In this way, the character-type unigram output feature scores (scores concerning character types) associated with the analyzed sentence are updated.

又,修正部13係將關於不正確答案的文字x4,x5之正確解答之標籤所關連的輸出素性「E-V-c/っ(t)/て(te)」的分數設成大於現在值,將關於不正確解答之標籤所關連的輸出素性「I-V-c/っ(t)/て(te)」的分數設成小於現在值。藉此,所被解析之語句所關連的雙連詞之輸出素性之分數(關於文字的分數)係被更新。 Likewise, the correction unit 13 sets the score of the output feature "E-V-c/っ(t)/て(te)" associated with the correct tag of the incorrectly answered characters x4 and x5 higher than its current value, and sets the score of the output feature "I-V-c/っ(t)/て(te)" associated with the incorrect tag lower than its current value. In this way, the character-bigram output feature scores (scores concerning characters) associated with the analyzed sentence are updated.

又,修正部13係將關於不正確答案的文字x4,x5之正確解答之標籤所關連的輸出素性「E-V-c/H/H」的分數設成大於現在值,將關於不正確解答之標籤所關連的輸出素性「I-V-c/H/H」的分數設成小於現在值。藉此,所被解析之語句所關連的雙連詞之輸出素性之分數(關於文字種別的分數)係被更新。 Likewise, the correction unit 13 sets the score of the output feature "E-V-c/H/H" associated with the correct tag of the incorrectly answered characters x4 and x5 higher than its current value, and sets the score of the output feature "I-V-c/H/H" associated with the incorrect tag lower than its current value. In this way, the character-type bigram output feature scores (scores concerning character types) associated with the analyzed sentence are updated.

又,修正部13係將關於不正確答案的文字x4,x5之正確解答之標籤所關連的遷移素性「B-V-c/E-V-c」「E-V-c/S-P-sj」的分數設成大於現在值,將關於不正確解答之標籤所關連的遷移素性「B-V-c/I-V-c」「I-V-c/E-V-c」的分數設成小於現在值。藉此,所被解析之語句所關連的遷移素性之分數係被更新。 Likewise, the correction unit 13 sets the scores of the transition features "B-V-c/E-V-c" and "E-V-c/S-P-sj" associated with the correct tags of the incorrectly answered characters x4 and x5 higher than their current values, and sets the scores of the transition features "B-V-c/I-V-c" and "I-V-c/E-V-c" associated with the incorrect tags lower than their current values. In this way, the transition feature scores associated with the analyzed sentence are updated.

此外,如上述,修正部13係亦可為,將正確解答資料內的各標籤評價為「正確(+1)」,而另一方面,將關於解析結果內之各文字的標籤評價為「錯誤(-1)」,將關於各標籤的二個評價結果互相抵銷後,將被評價為「正確(+1)」之標籤所對應之素性的分數設高,將被評價為「錯誤(-1)」之標籤所對應之素性的分數設低。 Alternatively, as described above, the correction unit 13 may evaluate each tag in the correct answer data as "correct (+1)" and, on the other hand, evaluate the tag of each character in the analysis result as "wrong (-1)", cancel the two evaluation results for each tag against each other, and then raise the scores of the features corresponding to tags evaluated as "correct (+1)" and lower the scores of the features corresponding to tags evaluated as "wrong (-1)".
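The update described above — raise the scores of features fired by the correct tags, lower those fired by the wrongly predicted tags, with features shared by both sides cancelling out — can be sketched as a perceptron-style update. This is an illustration, not the patented code: `model` is a feature-to-score mapping, and the feature lists stand in for the output of the unigram/bigram/transition feature extraction.

```python
from collections import Counter, defaultdict

def update_model(model, gold_features, predicted_features, lr=1.0):
    """Raise scores of gold features and lower scores of (wrong) predicted
    features; features appearing on both sides cancel out, as in the text."""
    delta = Counter(gold_features)
    delta.subtract(predicted_features)  # +1 per gold feature, -1 per predicted
    for feature, d in delta.items():
        if d != 0:
            model[feature] += lr * d

model = defaultdict(float)
update_model(
    model,
    gold_features=["E-V-c/っ", "S-P-sj/て", "S-N-nc/本"],      # correct tags
    predicted_features=["I-V-c/っ", "E-V-c/て", "S-N-nc/本"],  # wrong analysis
)
# "S-N-nc/本" fires on both sides and cancels; the others move by ±1.
```

The cancellation via `Counter.subtract` corresponds to offsetting the "+1" and "-1" evaluations of tags that the analysis got right.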

在素性之分數更新之際,修正部13係亦可使用SCW(Soft Confidence-Weighted learning)。該SCW係為:針對變異數較大的參數,視為還沒有自信(並非正確)而將該參數更新成較大;針對變異數較小的參數,視為某種程度正確而將該參數更新成較小的手法。修正部13,係基於具有值之範圍的分數之變異數來決定該分數之變化量。為了執行該SCW,對分割模型20(向量w)導入高斯分布,修正部13係除了各分數之更新以外還會同時更新該分數之平均及共變異數矩陣。各分數之平均的初期值係為0。關於各分數之共變異數矩陣的初期值,係對角要素為1,其以外之要素(非對角要素)則為0。圖6(a)係圖示,將變異數較大之分數予以大幅變更(亦即分數的變化量較大)的態樣,圖6(b)係圖示,將變異數較小之分數予以小幅變更(亦即分數的變化量較小)的態樣。圖6(a)及圖6(b)係分別圖示,將分數從Sa更新成Sb之際,也將共變異數矩陣Σ予以更新。此外,關於共變異數矩陣之更新,即使不考慮某素性與其他素性之相關關係仍可保持分數計算的精度,因此本實施形態中並不計算共變異數矩陣之非對角要素,而是只計算對角要素。藉此,可提高分數之更新速度。 When updating the feature scores, the correction unit 13 may use SCW (Soft Confidence-Weighted learning). SCW is a technique that regards a parameter with a large variance as not yet confident (possibly incorrect) and updates it by a large amount, while regarding a parameter with a small variance as correct to some degree and updating it by a small amount. The correction unit 13 determines the amount of change of a score based on the variance of that score over its range of values. To execute SCW, a Gaussian distribution is introduced over the segmentation model 20 (the vector w), and the correction unit 13 updates the mean and the covariance matrix of the scores together with the scores themselves. The initial value of the mean of each score is 0; as for the initial covariance matrix, the diagonal elements are 1 and all other (off-diagonal) elements are 0. Fig. 6(a) illustrates a score with a large variance being changed greatly (that is, a large amount of change of the score), and Fig. 6(b) illustrates a score with a small variance being changed slightly (that is, a small amount of change of the score). Figs. 6(a) and 6(b) each illustrate that, when the score is updated from Sa to Sb, the covariance matrix Σ is also updated. Furthermore, regarding the update of the covariance matrix, the accuracy of the score calculation can be maintained even without considering the correlation between one feature and the others; in this embodiment, therefore, only the diagonal elements of the covariance matrix are calculated, not the off-diagonal elements. This increases the update speed of the scores.
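As a loose illustration of the variance-scaled behaviour described here — and emphatically not the actual SCW closed-form update, whose step sizes and variance updates are derived from a probabilistic margin constraint — one can keep a per-weight variance (the diagonal of the covariance matrix), scale each step by it, and shrink it after every update:

```python
def variance_scaled_update(weight, variance, direction, lr=0.5, shrink=0.9):
    """Move `weight` in `direction` (+1 or -1) by a step proportional to its
    variance, then shrink the variance: uncertain weights move a lot,
    confident weights move a little. lr and shrink are illustrative knobs."""
    new_weight = weight + lr * variance * direction  # step scaled by variance
    new_variance = variance * shrink                 # more confident afterwards
    return new_weight, new_variance

w, v = 0.0, 1.0                              # initial mean 0, diagonal variance 1
w, v = variance_scaled_update(w, v, +1)      # large first step: w = 0.5, v = 0.9
w, v = variance_scaled_update(w, v, +1)      # smaller step: w = 0.95, v = 0.81
```

Keeping only a scalar variance per weight mirrors the embodiment's choice to compute only the diagonal elements of the covariance matrix.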

此外,修正部13係亦可使用SCW以外之手法來更新素性之分數。作為SCW以外之手法的例子係可舉出如:Perceptron、Passive Aggressive(PA)、Confidence Weighted(CW)、Adaptive Regularization of Weight Vectors(AROW)。 The correction unit 13 may also update the feature scores using a technique other than SCW. Examples of such techniques include Perceptron, Passive Aggressive (PA), Confidence Weighted (CW), and Adaptive Regularization of Weight Vectors (AROW).

一旦藉由更新已被解析之語句所關連的素性之分數而修正了分割模型20,則修正部13係生成完成通知並輸出至取得部11。此情況下,自然語言處理系統10(更具體而言係為解析部12)係使用已被修正的分割模型20來解析下個語句。 When the segmentation model 20 is corrected by updating the score of the prime factor associated with the sentence being analyzed, the correction unit 13 generates a completion notification and outputs it to the acquisition unit 11. In this case, the natural language processing system 10 (more specifically, the analysis unit 12) analyzes the next sentence using the segmentation model 20 that has been corrected.

接著,使用圖7,說明自然語言處理系統10的動作,並說明本實施形態所述之自然語言處理方法。 Next, the operation of the natural language processing system 10 will be described with reference to Fig. 7, and the natural language processing method according to the embodiment will be described.

首先,取得部11係取得一個語句(步驟S11)。接下來,解析部12係使用分割模型20來將該語句進行構詞解析(步驟S12,解析步驟)。藉由該構詞解析,對語句之各文字會賦予像是「S-N-nc」等這類標籤。 First, the acquisition unit 11 acquires one sentence (step S11). Next, the analysis unit 12 analyzes the word formation using the segmentation model 20 (step S12, analysis step). By the word formation, each character of the sentence is given a label such as "S-N-nc".

接下來,修正部13係求出,解析部12所做的構詞解析之結果、與該構詞解析之正確解答資料的差異 (步驟S13)。若無該差異(步驟S14;NO),亦即,解析部12所做的構詞解析是完全正確時,則修正部13係不修正分割模型20就結束處理。另一方面,解析結果與正確解答資料有差異時(步驟S14;YES),亦即,解析部12所做的構詞解析之至少一部分並不正確時,則修正部13係藉由更新已被解析之語句所關連之素性的分數,以修正分割模型20(步驟S15,修正步驟)。具體而言,修正部13係將不正確解答之標籤所對應之正確解答之標籤所關連的素性之分數設成高於現在值,並且將該不正確解答之標籤所關連的素性之分數設成低於現在值。 Next, the correction unit 13 obtains the difference between the result of the word formation analysis performed by the analysis unit 12 and the correct answer data of the word formation analysis. (Step S13). If there is no such difference (step S14; NO), that is, when the word formation analysis by the analysis unit 12 is completely correct, the correction unit 13 ends the processing without correcting the division model 20. On the other hand, if the analysis result differs from the correct answer data (step S14; YES), that is, when at least a part of the word formation analysis performed by the analysis unit 12 is not correct, the correction unit 13 is updated by the update unit 13 The fraction of the primes associated with the parsed statement is corrected to correct the segmentation model 20 (step S15, correction step). Specifically, the correction unit 13 sets the score of the prime relationship associated with the label of the incorrect answer corresponding to the incorrectly answered label to be higher than the current value, and sets the score of the prime relationship associated with the incorrectly answered label to Below the current value.

一旦修正部13的處理完成,就回到步驟S11之處理(參照步驟S16)。取得部11係取得下個語句(步驟S11),解析部12係將該語句進行構詞解析(步驟S12)。此時,在前個語句之處理中若有執行分割模型20之修正(步驟S15),則解析部12係使用已被修正的分割模型20來執行構詞解析。其後,修正部13係執行步驟S13以後的處理。如此反覆,只要有處理對象的語句存在就會繼續(參照步驟S16)。 When the processing of the correction unit 13 is completed, the processing returns to step S11 (see step S16). The acquisition unit 11 acquires the next sentence (step S11), and the analysis unit 12 performs word formation analysis on it (step S12). At this time, if the segmentation model 20 was corrected during the processing of the previous sentence (step S15), the analysis unit 12 performs the word formation analysis using the corrected segmentation model 20. Thereafter, the correction unit 13 executes the processing from step S13 onward. This iteration continues as long as sentences to be processed remain (see step S16).

表示自然語言處理系統10之動作的演算法之一例,示於以下。 An example of an algorithm representing the operation of the natural language processing system 10 is shown below.

Initialize w1
For t = 1, 2, ...
    Receive instance xt
    Predict structure y^t based on wt
    Receive correct structure yt
    If y^t ≠ yt, update:
        wt+1 = update(wt, yt, +1)
        wt+1 = update(wt, y^t, -1)

上記演算法中的第1行係意味著分割模型20(變數w1)之初期化,藉由該處理,例如各素性之分數係被設定成0。第2行的For迴圈,係表示第3行以後之處理是每一個語句執行一次。第3行係意味著取得語句xt,相當於上記的步驟S11。第4行係表示,藉由根據該時點之分割模型20(wt)進行構詞解析而對各文字賦予標籤的處理,相當於上記的步驟S12。y^t係表示解析結果。第5行係意味著,取得語句xt之構詞解析的正確解答資料yt。第6行係意味著,若解析結果y^t與正確解答資料yt有差異,則將分割模型20予以更新(修正)。第7行係表示將正確解答資料yt視為正例而學習,第8行係表示將含有錯的解析結果y^t視為負例而學習。第7、8行的處理係相當於上記的步驟S15。 The first line of the above algorithm means the initialization of the segmentation model 20 (variable w1); by this processing, for example, the score of every feature is set to 0. The For loop on the second line indicates that the processing from the third line onward is executed once per sentence. The third line means acquiring the sentence xt, corresponding to step S11 above. The fourth line represents the processing of assigning a tag to each character by performing word formation analysis based on the segmentation model 20 (wt) at that point in time, corresponding to step S12 above; y^t denotes the analysis result. The fifth line means acquiring the correct answer data yt of the word formation analysis of the sentence xt. The sixth line means that, if the analysis result y^t differs from the correct answer data yt, the segmentation model 20 is updated (corrected). The seventh line represents learning with the correct answer data yt as a positive example, and the eighth line represents learning with the erroneous analysis result y^t as a negative example. The processing of the seventh and eighth lines corresponds to step S15 above.
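The algorithm above can be rendered as a short Python sketch. `analyze`, `gold_labels`, and `update` are hypothetical stand-ins for the Viterbi decoder, the manually created correct answer data, and the score update of steps S12, S13, and S15; they are not part of the patent text.

```python
def train(model, sentences, analyze, gold_labels, update):
    for sentence in sentences:                 # "For t = 1, 2, ..."
        predicted = analyze(model, sentence)   # predict y^t based on wt
        gold = gold_labels(sentence)           # receive correct structure yt
        if predicted != gold:                  # update only on a mistake
            update(model, gold, +1)            # gold labels as a positive example
            update(model, predicted, -1)       # wrong analysis as a negative example
    return model

# Toy stand-ins: the "analysis" just spells out the characters, the "gold"
# labels are the lower-cased characters, and update() merely records calls.
calls = []
train(
    model={},
    sentences=["AB", "ab"],
    analyze=lambda m, s: list(s),
    gold_labels=lambda s: list(s.lower()),
    update=lambda m, y, sign: calls.append((tuple(y), sign)),
)
# Only "AB" is mislabeled, so exactly two update calls are made for it.
```

Note that, as in the algorithm, a sentence whose prediction already matches the correct answer triggers no update at all.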

接著,使用圖8,說明用來實現自然語言處理系統10所需的自然語言處理程式P1。 Next, the natural language processing program P1 required to implement the natural language processing system 10 will be described using FIG.

自然語言處理程式P1係具備:主要模組P10、取得模組P11、解析模組P12、及修正模組P13。 The natural language processing program P1 includes a main module P10, an acquisition module P11, an analysis module P12, and a correction module P13.

The main module P10 is the part that supervises the morphological analysis and its related processing. The functions realized by executing the acquisition module P11, the analysis module P12, and the correction module P13 are the same as those of the acquisition unit 11, the analysis unit 12, and the correction unit 13 described above, respectively.

The natural language processing program P1 may be provided, for example, fixedly recorded on a tangible recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. Alternatively, the natural language processing program P1 may be provided over a communication network as a data signal superimposed on a carrier wave.

As described above, a natural language processing system according to one aspect of the present invention comprises: an analysis unit that, using a segmentation model obtained by machine learning with one or more pieces of training data, performs morphological analysis on a sentence and sets, for each segmented element obtained by dividing the sentence, a label indicating at least the part of speech of a word, the segmentation model containing a score for each output feature, which represents a correspondence between a segmented element and a label, and a score for each transition feature, which represents a combination of the two labels corresponding to two consecutive segmented elements; and a correction unit that compares the labels in the analysis result obtained by the analysis unit with correct-answer data indicating the correct labels for the sentence, sets the scores of the output features and transition features associated with the correct label corresponding to an incorrect label higher than their current values, sets the scores of the output features and transition features associated with that incorrect label lower than their current values, and thereby corrects the segmentation model used by the analysis unit in the morphological analysis of the next sentence.
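The way the segmentation model combines the two kinds of feature scores can be sketched as follows. This is an illustrative assumption about the scoring function, not code from the patent: the score of one candidate labeling is the sum of output-feature scores (element, label) plus transition-feature scores (label_i, label_{i+1}); the decoder would pick the labeling that maximizes this sum.

```python
def sequence_score(elements, labels, output_w, transition_w):
    # Output features: one (element, label) score per position.
    s = sum(output_w.get((e, t), 0.0) for e, t in zip(elements, labels))
    # Transition features: one (label, next_label) score per adjacent pair.
    s += sum(transition_w.get((a, b), 0.0) for a, b in zip(labels, labels[1:]))
    return s
```

Missing features contribute 0, so a freshly initialized model scores every labeling equally, consistent with line 1 of the learning algorithm above.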

A natural language processing method according to one aspect of the present invention is a natural language processing method executed by a natural language processing system comprising a processor, the method comprising: an analysis step of, using a segmentation model obtained by machine learning with one or more pieces of training data, performing morphological analysis on a sentence and setting, for each segmented element obtained by dividing the sentence, a label indicating at least the part of speech of a word, the segmentation model containing a score for each output feature, which represents a correspondence between a segmented element and a label, and a score for each transition feature, which represents a combination of the two labels corresponding to two consecutive segmented elements; and a correction step of comparing the labels in the analysis result obtained in the analysis step with correct-answer data indicating the correct labels for the sentence, setting the scores of the output features and transition features associated with the correct label corresponding to an incorrect label higher than their current values, setting the scores of the output features and transition features associated with that incorrect label lower than their current values, and thereby correcting the segmentation model used in the morphological analysis of the next sentence in the analysis step.

A natural language processing program according to one aspect of the present invention causes a computer to function as: an analysis unit that, using a segmentation model obtained by machine learning with one or more pieces of training data, performs morphological analysis on a sentence and sets, for each segmented element obtained by dividing the sentence, a label indicating at least the part of speech of a word, the segmentation model containing a score for each output feature, which represents a correspondence between a segmented element and a label, and a score for each transition feature, which represents a combination of the two labels corresponding to two consecutive segmented elements; and a correction unit that compares the labels in the analysis result obtained by the analysis unit with correct-answer data indicating the correct labels for the sentence, sets the scores of the output features and transition features associated with the correct label corresponding to an incorrect label higher than their current values, sets the scores of the output features and transition features associated with that incorrect label lower than their current values, and thereby corrects the segmentation model used by the analysis unit in the morphological analysis of the next sentence.

In these aspects, each time one sentence is morphologically analyzed, the analysis result is compared with the correct-answer data, and the segmentation model is corrected according to the difference between them. Because the segmentation model is corrected sentence by sentence in this way, the time required to correct the model when processing multiple sentences grows only roughly linearly with the number of sentences, so the segmentation model for morphological analysis can be corrected automatically within a fixed amount of time (in other words, within a predictable time range).

Moreover, by raising the feature scores associated with the correct label and lowering the feature scores associated with the incorrect label, the accuracy of the morphological analysis of the next sentence can be further improved.

In the natural language processing system according to another aspect, the segmented elements may be characters. By processing each character using character-level knowledge (output features and transition features), morphological analysis can be performed without a word dictionary, which would generally become large-scale. Moreover, since the segmentation model is corrected for each sentence using character-level rather than word-level knowledge, the next sentence can be analyzed with high accuracy even if it differs in domain or character from every sentence analyzed so far. That is, the natural language processing system according to one aspect of the present invention is adaptive to sentences from unknown domains or sentences with unknown properties.
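As a small illustration of this dictionary-free, character-level approach, word boundaries can be recovered purely from per-character labels. The plain B/I tag scheme below is an illustrative assumption; the patent's labels additionally carry part-of-speech information.

```python
def tags_to_words(chars, tags):
    """Recover a segmentation from per-character tags: B starts a word, I continues it."""
    words, current = [], ""
    for c, t in zip(chars, tags):
        if t == "B" and current:
            words.append(current)   # a new word begins, flush the previous one
            current = ""
        current += c
    if current:
        words.append(current)       # flush the final word
    return words
```

For example, the character tags B I B B I over a five-character sentence yield three words, with no dictionary lookup involved.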

In the natural language processing system according to another aspect, the scores of the output features and the transition features each have a range of values, a variance is set for each score, and the correction unit determines, based on the variance of each score, the amount by which that score is changed when it is raised or lowered. With this technique, the feature scores can be made to converge as quickly as possible.

The present invention has been described above in detail based on its embodiments. However, the present invention is not limited to the above embodiments, and various modifications can be made without departing from its gist.

In general, the number of features contained in the segmentation model 20 increases with the number of characters processed, so in a language with many characters, such as Japanese or Chinese, the segmentation model 20 becomes very large, and the memory capacity required for it also becomes very large. Therefore, feature hashing may be introduced, whereby each feature is converted into a numerical value by a hash function. In particular, numericizing the characters and strings that form part of a feature is highly effective. On the other hand, hashing the transition features contributes little to compressing the segmentation model 20 and may instead slow down processing. Therefore, only the output features may be hashed, leaving the transition features unhashed. As for the hash function, a single kind may be used, or different hash functions may be used for the output features and the transition features.

In this case, the segmentation model 20 stores data in which the features of each character are expressed numerically. For example, the character "本" (hon) is converted into the value 34, and the character "を" (wo) is converted into the value 4788. This numericization yields a bounded set of features. With feature hashing, multiple characters or strings may be assigned the same value, but the probability that frequently occurring characters or strings are assigned the same value as each other is very low, so such collisions can be ignored.
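The mapping sketched in the example above ("本" → 34, "を" → 4788) can be implemented by hashing each feature string into a bounded range of bucket indices, so the model can store a fixed-size array of scores instead of a string-keyed table. The bucket count and the choice of MD5 here are illustrative assumptions, not the patent's hash function.

```python
import hashlib

N_BUCKETS = 2 ** 20  # bounded feature space; an assumed size

def feature_index(feature: str) -> int:
    """Map a feature string deterministically into [0, N_BUCKETS)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_BUCKETS
```

Two distinct strings can collide in the same bucket, but as noted above, the probability that two frequent features collide is low enough to ignore in practice.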

That is, in the natural language processing system according to another aspect, the segmentation model may contain output features numericized by a hash function. By handling characters as numerical values, the memory capacity required to store the segmentation model can be saved.

The analysis unit 12 may also perform morphological analysis using only features with relatively high scores, without using (that is, ignoring) features with relatively low scores. Techniques for ignoring features with relatively low scores include Forward-Backward Splitting (FOBOS) and feature quantization.

FOBOS is a technique that compresses scores toward 0 by regularization (for example, L1 regularization). By using FOBOS, features whose scores are at or below a predetermined value (for example, features with a score of 0, or with a score close to 0) can be ignored.
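The L1 step of FOBOS mentioned above can be sketched as a shrink-and-truncate pass over the weights: every score is pulled toward 0 by a fixed amount and clipped at 0, so small scores become exactly 0 and their features can be dropped from the model. The shrinkage amount `lam` is an illustrative parameter, not a value from the patent.

```python
def l1_shrink(w, lam):
    """One L1 proximal (truncation) step: shrink each score toward 0 by lam,
    dropping features whose magnitude does not exceed lam."""
    return {f: (abs(v) - lam) * (1 if v > 0 else -1)
            for f, v in w.items()
            if abs(v) > lam}
```

Applied periodically during online learning, this keeps the model sparse: features that rarely receive updates drift to 0 and disappear.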

Feature quantization is a technique that converts feature scores into integers by multiplying the fractional value by 10^n (where n is a natural number of 1 or more). For example, if the score 0.123456789 is multiplied by 1000 and converted to an integer, the score becomes 123. By quantizing the scores, the memory capacity needed to store them can be saved. This technique also makes it possible to ignore features whose scores are at or below a predetermined value (for example, features whose integerized score is 0, or close to 0). For example, suppose the scores of features Fa and Fb are 0.0512 and 0.0003, respectively; multiplying these scores by 1000 and converting them to integers yields 51 and 0. In this case, the analysis unit 12 performs morphological analysis without using the feature Fb.
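The quantization described above is a one-line transformation; a minimal sketch, reproducing the worked examples from the text (n = 3, i.e. multiplying by 1000):

```python
def quantize(score: float, n: int = 3) -> int:
    """Integerize a feature score by multiplying by 10**n and truncating."""
    return int(score * 10 ** n)
```

After quantization, features such as Fb in the example, whose integerized score is 0, can simply be skipped during analysis.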

The regularization or quantization processing is executed, for example, by the correction unit 13, by another functional element within the natural language processing system 10, or by a computer system independent of the natural language processing system 10. When the correction unit 13 executes the regularization or quantization processing, it does so once after the natural language processing system 10 has morphologically analyzed a group of sentences (for example, a reasonably large number of sentences) and corrected the segmentation model 20 several times.

That is, in the natural language processing system according to another aspect, the analysis unit may perform morphological analysis without using features whose scores have become equal to or less than a predetermined value through regularization or quantization. By not using features with relatively low scores (for example, features whose score becomes 0, or close to 0, through regularization or quantization), the amount of data in the segmentation model can be suppressed and the time for morphological analysis can be shortened.

In the above embodiment, the analysis unit 12 divides a sentence into individual characters and sets a label for each character, but the segmented elements may be words rather than characters. Accordingly, the analysis unit may perform morphological analysis using a word dictionary and a segmentation model that represents feature scores for words rather than for characters.

As described above, the natural language processing system of the present invention is applicable to morphological analysis in any language.

10  Natural language processing system
11  Acquisition unit
12  Analysis unit
13  Correction unit
20  Segmentation model

Claims (7)

1. A natural language processing system comprising: an analysis unit that, using a segmentation model obtained by machine learning with one or more pieces of training data, performs morphological analysis on a sentence by dividing the sentence character by character, rather than into one or more words using a word dictionary, and sets, for each character obtained by the division, a label indicating at least the part of speech of a word, the segmentation model containing a score for each output feature, which represents a correspondence between a character and a label, and a score for each transition feature, which represents a combination of the two labels corresponding to two consecutive characters; and a correction unit that compares the labels in the analysis result obtained by the analysis unit with correct-answer data indicating the correct labels for the sentence, sets the scores of the output features and transition features associated with the correct label corresponding to an incorrect label higher than their current values, sets the scores of the output features and transition features associated with that incorrect label lower than their current values, and thereby corrects the segmentation model used by the analysis unit in the morphological analysis of the next sentence.
2. The natural language processing system according to claim 1, wherein the segmentation model contains output features numericized by a hash function.
3. The natural language processing system according to claim 1 or 2, wherein the scores of the output features and the transition features each have a range of values, a variance is set for each score, and the correction unit determines, based on the variance of each score, the amount by which that score is changed when it is raised or lowered.
4. The natural language processing system according to claim 1 or 2, wherein the analysis unit performs the morphological analysis without using features whose scores have become equal to or less than a predetermined value through regularization or quantization.
5. The natural language processing system according to claim 3, wherein the analysis unit performs the morphological analysis without using features whose scores have become equal to or less than a predetermined value through regularization or quantization.
6. A natural language processing method executed by a natural language processing system comprising a processor, the method comprising: an analysis step of, using a segmentation model obtained by machine learning with one or more pieces of training data, performing morphological analysis on a sentence by dividing the sentence character by character, rather than into one or more words using a word dictionary, and setting, for each character obtained by the division, a label indicating at least the part of speech of a word, the segmentation model containing a score for each output feature, which represents a correspondence between a character and a label, and a score for each transition feature, which represents a combination of the two labels corresponding to two consecutive characters; and a correction step of comparing the labels in the analysis result obtained in the analysis step with correct-answer data indicating the correct labels for the sentence, setting the scores of the output features and transition features associated with the correct label corresponding to an incorrect label higher than their current values, setting the scores of the output features and transition features associated with that incorrect label lower than their current values, and thereby correcting the segmentation model used in the morphological analysis of the next sentence in the analysis step.
7. A natural language processing program product causing a computer to function as: an analysis unit that, using a segmentation model obtained by machine learning with one or more pieces of training data, performs morphological analysis on a sentence by dividing the sentence character by character, rather than into one or more words using a word dictionary, and sets, for each character obtained by the division, a label indicating at least the part of speech of a word, the segmentation model containing a score for each output feature, which represents a correspondence between a character and a label, and a score for each transition feature, which represents a combination of the two labels corresponding to two consecutive characters; and a correction unit that compares the labels in the analysis result obtained by the analysis unit with correct-answer data indicating the correct labels for the sentence, sets the scores of the output features and transition features associated with the correct label corresponding to an incorrect label higher than their current values, sets the scores of the output features and transition features associated with that incorrect label lower than their current values, and thereby corrects the segmentation model used by the analysis unit in the morphological analysis of the next sentence.
TW104108650A 2014-04-29 2015-03-18 Natural language processing systems, natural language processing methods, and natural language processing programs TWI567569B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US201461985615P 2014-04-29 2014-04-29

Publications (2)

Publication Number Publication Date
TW201544976A TW201544976A (en) 2015-12-01
TWI567569B true TWI567569B (en) 2017-01-21

Family

ID=54358353

Family Applications (1)

Application Number Title Priority Date Filing Date
TW104108650A TWI567569B (en) 2014-04-29 2015-03-18 Natural language processing systems, natural language processing methods, and natural language processing programs

Country Status (5)

Country Link
JP (1) JP5809381B1 (en)
KR (1) KR101729461B1 (en)
CN (1) CN106030568B (en)
TW (1) TWI567569B (en)
WO (1) WO2015166606A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI673705B (en) * 2018-02-05 2019-10-01 威盛電子股份有限公司 Natural language understanding system and semantic analysis method

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020434B (en) * 2019-03-22 2021-02-12 北京语自成科技有限公司 Natural language syntactic analysis method
KR102352481B1 (en) * 2019-12-27 2022-01-18 동국대학교 산학협력단 Sentence analysis device using morpheme analyzer built on machine learning and operating method thereof
CN112101030B (en) * 2020-08-24 2024-01-26 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for establishing term mapping model and realizing standard word mapping
CN113204667B (en) * 2021-04-13 2024-03-22 北京百度网讯科技有限公司 Method and device for training audio annotation model and audio annotation
CN116153516B (en) * 2023-04-19 2023-07-07 山东中医药大学第二附属医院(山东省中西医结合医院) Disease big data mining analysis system based on distributed computing
JP7352249B1 (en) 2023-05-10 2023-09-28 株式会社Fronteo Information processing device, information processing system, and information processing method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09114825A (en) * 1995-10-19 1997-05-02 Ricoh Co Ltd Method and device for morpheme analysis
US20030061030A1 (en) * 2001-09-25 2003-03-27 Canon Kabushiki Kaisha Natural language processing apparatus, its control method, and program
TW200729001A (en) * 2005-01-31 2007-08-01 Nec China Co Ltd Dictionary learning method and device using the same, input method and user terminal device using the same

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100533431C (en) * 2005-09-21 2009-08-26 富士通株式会社 Natural language component identifying correcting apparatus and method based on morpheme marking
CN102681981A (en) * 2011-03-11 2012-09-19 富士通株式会社 Natural language lexical analysis method, device and analyzer training method
JP5795985B2 (en) 2012-03-30 2015-10-14 Kddi株式会社 Morphological analyzer, morphological analysis method, and morphological analysis program



Also Published As

Publication number Publication date
JP5809381B1 (en) 2015-11-10
CN106030568B (en) 2018-11-06
JPWO2015166606A1 (en) 2017-04-20
KR20160124237A (en) 2016-10-26
WO2015166606A1 (en) 2015-11-05
KR101729461B1 (en) 2017-04-21
TW201544976A (en) 2015-12-01
CN106030568A (en) 2016-10-12
