WO2013128684A1 - Dictionary generation device, method, and program - Google Patents
- Publication number
- WO2013128684A1 (PCT/JP2012/072350)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- dictionary
- unit
- text
- boundary
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
Definitions
- One embodiment of the present invention relates to an apparatus, a method, a program, and a computer-readable recording medium for generating a word dictionary.
- Japanese Patent Application Laid-Open No. 2004-228561 describes a technique that searches a word dictionary for words matching partial character strings of an input text and generates them as word candidates; partial strings that match no dictionary word are treated as unknown-word candidates, an unknown-word model is used to estimate the word appearance probability of each unknown-word candidate for each part of speech, and the word sequence that maximizes the joint probability is determined by dynamic programming.
- A dictionary generation apparatus comprises: a model generation unit that generates a word division model using a corpus and a word group prepared in advance, each text included in the corpus being given boundary information indicating word boundaries; an analysis unit that executes word division incorporating the word division model on a collected set of texts and gives boundary information to each text; a selection unit that selects words to be registered in a dictionary from the texts to which boundary information has been given by the analysis unit; and a registration unit that registers the words selected by the selection unit in the dictionary.
- A dictionary generation method is executed by a dictionary generation device and includes: a model generation step of generating a word division model using a corpus and a word group prepared in advance, each text included in the corpus being given boundary information indicating word boundaries; an analysis step of executing word division incorporating the word division model on a collected set of texts and giving boundary information to each text; a selection step of selecting words to be registered in a dictionary from the texts with boundary information; and a registration step of registering the selected words in the dictionary.
- A dictionary generation program causes a computer to function as: a model generation unit that generates a word division model using a corpus and a word group prepared in advance, each text included in the corpus being given boundary information indicating word boundaries; an analysis unit that executes word division incorporating the word division model on a collected set of texts and gives boundary information to each text; a selection unit that selects words to be registered in a dictionary from the texts to which boundary information has been given by the analysis unit; and a registration unit that registers the words selected by the selection unit in the dictionary.
- A computer-readable recording medium stores a dictionary generation program that causes a computer to function as: a model generation unit that generates a word division model using a corpus and a word group prepared in advance, each text included in the corpus being given boundary information indicating word boundaries; an analysis unit that executes word division incorporating the word division model on a collected set of texts and gives boundary information to each text; a selection unit that selects words to be registered in a dictionary from the texts to which boundary information has been given by the analysis unit; and a registration unit that registers the words selected by the selection unit in the dictionary.
- A word division model is generated using a corpus to which boundary information has been given and a word group, and word division incorporating the model is applied to a text set. Words are then selected from the text set to which boundary information has been given by this application and are registered in the dictionary. In this way, by adding boundary information to a text set through analysis that uses a corpus with boundary information, and by registering words extracted from that text set, a large-scale word dictionary can be built easily.
- the selection unit may select a word to be registered in the dictionary based on the appearance frequency of each word calculated from the boundary information given by the analysis unit.
- the accuracy of the dictionary can be increased by considering the appearance frequency calculated in this way.
- the selection unit may select a word whose appearance frequency is equal to or higher than a predetermined threshold.
- The selection unit may extract words whose appearance frequency is equal to or higher than a threshold as registration candidates and select a predetermined number of words from the registration candidates in descending order of appearance frequency, and the registration unit may add the words selected by the selection unit to the dictionary in which the word group is recorded. By registering only words with a relatively high appearance frequency, the accuracy of the dictionary can be improved. Further, by adding words to the dictionary of the word group prepared in advance, the configuration of the dictionary can be simplified.
- Alternatively, the selection unit may extract words whose appearance frequency is equal to or higher than a threshold as registration candidates and select a predetermined number of words from the registration candidates in descending order of appearance frequency, and the registration unit may register the words selected by the selection unit in a dictionary different from the dictionary in which the word group is recorded. By registering only words with a relatively high appearance frequency, the accuracy of the dictionary can be improved. Further, by adding words to a dictionary different from the dictionary of the existing word group (the existing dictionary), a dictionary having characteristics different from those of the existing dictionary can be generated.
- Alternatively, the selection unit may extract words whose appearance frequency is equal to or higher than a threshold as registration candidates and group the candidate words according to appearance frequency, and the registration unit may register the plurality of groups generated by the selection unit individually in a plurality of dictionaries different from the dictionary in which the word group is recorded.
- Each of the collected texts may be associated with information indicating the field of the text, and the registration unit may register the words selected by the selection unit individually in a dictionary prepared for each field, based on the field of the text. By generating a dictionary for each field, a plurality of dictionaries having different characteristics can be generated.
- The boundary information may include first information indicating that no boundary exists at a position between characters, second information indicating that a boundary exists at a position between characters, and third information indicating that a boundary probabilistically exists at a position between characters, and the appearance frequency of each word may be calculated based on the first, second, and third information. By introducing the third information, which represents an intermediate concept, instead of simply choosing whether or not a boundary exists, the text can be divided into words more appropriately.
- The analysis unit may include a first binary classifier and a second binary classifier; the first binary classifier determines, for each inter-character position, whether to assign the first information or information other than the first information, and the second binary classifier determines, for each inter-character position to which the first binary classifier assigned information other than the first information, whether to assign the second information or the third information.
- The collected text set may be divided into a plurality of groups, with the analysis unit, the selection unit, and the registration unit performing processing based on one of the groups; after the model generation unit generates a word division model using the corpus, the word group, and the words registered by the registration unit, the analysis unit, the selection unit, and the registration unit may then execute processing based on another one of the groups.
- a large-scale word dictionary can be easily constructed.
- The dictionary generation apparatus 10 extracts words from a set of collected large amounts of text (hereinafter also referred to as "large-scale text") by analyzing the set, and adds the extracted words to a dictionary.
- The dictionary generation apparatus 10 includes a CPU 101 that executes an operating system, application programs, and the like; a main storage unit 102 composed of a ROM and a RAM; an auxiliary storage unit 103 composed of a hard disk or the like; a communication control unit 104 composed of a network card; an input device 105 such as a keyboard and a mouse; and an output device 106 such as a display.
- Each functional component of the dictionary generation device 10 described later is realized by loading predetermined software onto the CPU 101 and the main storage unit 102, operating the communication control unit 104, the input device 105, the output device 106, and the like under the control of the CPU 101, and reading and writing data in the main storage unit 102 and the auxiliary storage unit 103. Data and databases necessary for processing are stored in the main storage unit 102 and the auxiliary storage unit 103.
- the dictionary generation device 10 is illustrated as being configured by a single computer, but the functions of the dictionary generation device 10 may be distributed to a plurality of computers.
- the dictionary generation apparatus 10 includes a model generation unit 11, an analysis unit 12, a selection unit 13, and a registration unit 14 as functional components.
- the dictionary generation device 10 refers to the learning corpus 20, the existing dictionary 31, and the large-scale text 40 prepared in advance, and stores the extracted words in the word dictionary 30.
- the word dictionary 30 includes at least the existing dictionary 31 and may further include one or more additional dictionaries 32.
- the learning corpus 20 is a set of texts to which boundary information (annotations) indicating word boundaries (division positions when a sentence is divided into words) is attached (associated), and is prepared in advance as a database.
- Text is a sentence or character string consisting of a plurality of words.
- a predetermined number of texts randomly extracted from the titles and descriptions of the products stored in the website of the virtual shopping street are used as the material of the learning corpus 20.
- Boundary information is given to each extracted text manually by the evaluator.
- The setting of boundary information is based on two techniques: word division by point estimation and a three-stage word division corpus.
- The value indicated by this tag b_i can be regarded as the strength of the division.
- the value of the word boundary tag is determined by referring to the feature obtained from the characters existing around it.
- the value of the word boundary tag is set using three types of features, that is, a character feature, a character type feature, and a dictionary feature.
- A character feature is an n-gram of up to length n that is adjacent to the boundary b_i or encloses it, represented together with its position relative to b_i. With n = 3, as in FIG. 3, the boundary b_i between "ン (n)" and "を (wo)" in "ペンを買った (pen wo katta)" yields nine features: "-1/ン (n)", "1/を (wo)", "-2/ペン (pen)", "-1/ンを (n wo)", "1/を買 (wo ka)", "-3/ルペン (rupen)", "-2/ペンを (pen wo)", "-1/ンを買 (n wo ka)", and "1/を買っ (wo kat)".
- The character-type feature is the same as the character feature described above except that it handles character types instead of characters. Eight character types were considered: hiragana, katakana, kanji, upper-case alphabet, lower-case alphabet, Arabic numerals, kanji numerals, and middle dot (・). However, the character types to be used and their number are not limited in any way.
- The dictionary feature represents whether a word of length j (1 ≤ j ≤ k) located around the boundary exists in the dictionary. It is expressed as a flag indicating whether the boundary b_i is located at the end point of the word (L), at its start point (R), or inside the word (M), combined with the word length j. For example, if the words "ペン (pen)" and "を (wo)" are registered in the dictionary, dictionary features L2 and R1 are created for the boundary b_i in FIG. 3. As described later, when a plurality of dictionaries are used, a dictionary identifier is attached to the dictionary feature.
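The L/R/M flags can be sketched as follows. The function name, the `lexicon` set, and the string flags are illustrative stand-ins for the features the text describes:

```python
def dictionary_features(text, b, lexicon, k=8):
    """Dictionary features for the boundary at position b: for each word
    length j (1 <= j <= k), flag whether a dictionary word ends at the
    boundary (Lj), starts at it (Rj), or strictly contains it (Mj)."""
    feats = set()
    for j in range(1, k + 1):
        if b - j >= 0 and text[b - j:b] in lexicon:
            feats.add(f"L{j}")                      # word ends at the boundary
        if len(text[b:b + j]) == j and text[b:b + j] in lexicon:
            feats.add(f"R{j}")                      # word starts at the boundary
        for s in range(max(b - j + 1, 0), b):       # word spans the boundary inside
            if s + j <= len(text) and text[s:s + j] in lexicon:
                feats.add(f"M{j}")
    return sorted(feats)

# The FIG. 3 example: "ペン" ends at the boundary, "を" starts at it
print(dictionary_features("ボールペンを買った", 5, {"ペン", "を"}))  # ['L2', 'R1']
```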
- the maximum n-gram length n in the character feature and the character type feature is 3 and the maximum word length k in the dictionary feature is 8, but these values may be arbitrarily determined.
- A three-stage word division corpus introduces the concept of "half division" in addition to the binary of "division" and "non-division" described above. It develops the idea of probabilistic word division, which indicates the division mode with a probability value. The three-stage corpus is used because the number of word division strengths that humans can actually distinguish is only a few levels at most, and it is not necessary to express the mode of division with continuous probability values.
- Half division is a mode indicating that a boundary probabilistically exists (with a probability greater than 0 and less than 1) at a position between characters; the three-stage word division corpus is generated by division that includes this mode.
- Half division applies, for example, to compound nouns such as "ボール/ペン (bo-ru/pen)" ("ballpoint pen" in English), compound verbs such as "おり/たたみ (ori/tatami)" ("folding" in English), and prefixed forms such as "お/すすめ (o/susume)" ("recommendation" in English). The word "じゅうでんち (juudenchi)" ("rechargeable battery" in English) is a compound of "じゅうでん (juuden)" ("recharge" in English) and "でんち (denchi)" ("battery" in English); since it is a compound of the type "AB + BC → ABC", such a word is half-divided as "じゅう/でん/ち (juu/den/chi)".
- Each text is given word boundary tags as boundary information and stored in the database as the learning corpus 20.
- the method for adding the boundary information to the text is arbitrary.
- For example, boundary information may be embedded in each text such that "division" is indicated by a space, "half division" by a hyphen, and "non-division" by nothing. In this case, the text with boundary information can be recorded as a plain character string.
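This space/hyphen embedding can be sketched as follows (the function name and the tag values 1 / 0.5 / 0 are illustrative; the patent only describes the encoding convention):

```python
def encode_boundaries(chars, tags):
    """Embed boundary information into a plain string. tags[i] is the
    boundary tag between chars[i] and chars[i+1]:
    1 -> "division" (space), 0.5 -> "half division" (hyphen),
    0 -> "non-division" (nothing inserted)."""
    out = [chars[0]]
    for ch, tag in zip(chars[1:], tags):
        if tag == 1:
            out.append(" ")
        elif tag == 0.5:
            out.append("-")
        out.append(ch)
    return "".join(out)

# Half division between ル and ペ, as in the FIG. 3 example
print(encode_boundaries(list("ボールペン"), [0, 0, 0.5, 0]))  # ボール-ペン
```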
- the existing dictionary 31 is a set of a predetermined number of words, and is prepared in advance as a database.
- the existing dictionary 31 may be a generally used electronic dictionary, for example, a UniDic morphological analysis dictionary.
- the large-scale text 40 is a collection of collected text and is prepared in advance as a database.
- the large-scale text 40 may include an arbitrary sentence or character string according to the word to be extracted and the field of the word. For example, a large number of product titles and explanations may be collected from a virtual shopping street website, and the large-scale text 40 may be constructed from these raw data.
- the number of texts prepared as the large-scale text 40 is overwhelmingly larger than the number of texts included in the learning corpus 20.
- the model generation unit 11 is means for generating a word division model using the learning corpus 20 and the word dictionary 30.
- The model generation unit 11 includes a support vector machine (SVM) and generates a word division model by inputting the learning corpus 20 and the word dictionary 30 into the machine and executing a learning process. The word division model represents rules on how to divide text into words and is output as a group of parameters used for word division.
- the algorithm used for machine learning is not limited to SVM, and may be a decision tree or logistic regression.
- the model generation unit 11 causes the SVM to perform learning based on the learning corpus 20 and the existing dictionary 31, thereby generating an initial word division model (baseline model). Then, the model generation unit 11 outputs this word division model to the analysis unit 12.
- The model generation unit 11 then generates a corrected word division model by causing the SVM to re-execute the learning process based on the learning corpus 20 and the entire word dictionary 30. Here, the entire word dictionary 30 means all the words stored in the existing dictionary 31 from the beginning together with the words obtained from the large-scale text 40.
- the analysis unit 12 is means for executing analysis (word division) in which the word division model is incorporated on the large-scale text 40 and adding (associating) boundary information to each text. As a result, a large amount of text as shown in FIG. 3 is obtained.
- the analysis unit 12 performs such word division on each text constituting the large-scale text 40, so that the “division” (second information), “half-division” (third information), and Boundary information indicating “non-divided” (first information) is assigned to each text, and all processed texts are output to the selection unit 13.
- the analysis unit 12 includes two binary classifiers, and uses these classifiers in order to give three types of boundary information to each text.
- The first classifier determines whether an inter-character position is "non-divided" or otherwise; the second classifier determines whether a boundary judged not to be "non-divided" is "divided" or "half-divided". In practice, since the majority of inter-character positions are "non-divided", first determining whether each position is "non-divided" and only then determining the division mode for the remaining positions allows boundary information to be given to a large amount of text efficiently. Moreover, combining binary classifiers simplifies the structure of the analysis unit 12.
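The cascade can be sketched as follows. The `ThresholdClassifier` stub, the `predict` interface, and the `split_score` feature are assumptions standing in for the SVM-based classifiers; only the two-stage decision order comes from the text:

```python
class ThresholdClassifier:
    """Toy stand-in for a binary classifier with a predict() method
    (hypothetical; the patent's classifiers are SVM-based)."""
    def __init__(self, key, threshold):
        self.key, self.threshold = key, threshold

    def predict(self, features):
        return features.get(self.key, 0.0) >= self.threshold

def assign_boundary(features, clf1, clf2):
    """Two-stage cascade described above: clf1 separates 'non-divided'
    from everything else; clf2 then separates 'divided' from
    'half-divided' for the positions clf1 passed on."""
    if not clf1.predict(features):
        return 0.0   # non-divided (first information)
    if clf2.predict(features):
        return 1.0   # divided (second information)
    return 0.5       # half-divided (third information)

clf1 = ThresholdClassifier("split_score", 0.3)
clf2 = ThresholdClassifier("split_score", 0.7)
print(assign_boundary({"split_score": 0.5}, clf1, clf2))  # 0.5 (half-divided)
```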
- the selection unit 13 is a means for selecting a word to be registered in the word dictionary 30 from the text to which boundary information is given by the analysis unit 12.
- The selection unit 13 calculates the total appearance frequency of each word. This calculation means that the appearance frequency can be obtained from the boundary information b_i given to each inter-character position. Here, O_1 denotes the appearance of the notation of the word w.
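The patent does not reproduce its exact frequency formula here, but one standard way to derive an expected frequency from soft boundary tags is the following sketch; the function name, the span interface, and the probabilistic product are assumptions, not the patent's definition:

```python
def expected_frequency(tags, i, j):
    """Expected-count contribution of the character span between boundary
    positions i and j, treating each tag as the probability that a word
    boundary exists there (1 = divided, 0.5 = half-divided,
    0 = non-divided). NOTE: an assumed, common construction; the patent
    does not spell out its total-appearance-frequency formula."""
    p = tags[i] * tags[j]          # a boundary at both ends of the span
    for t in tags[i + 1:j]:        # and no boundary strictly inside it
        p *= 1.0 - t
    return p

# Tags around "ボール-ペン": divided at both ends, half-divided after ボール
tags = [1, 0, 0, 0.5, 0, 1]
print(expected_frequency(tags, 3, 5))  # "ペン" contributes 0.5
print(expected_frequency(tags, 0, 5))  # "ボールペン" also contributes 0.5
```

Summing such contributions over all occurrences in the large-scale text would yield a total appearance frequency for each word.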
- The selection unit 13 selects, as registration candidates V, only the words in the large-scale text 40 whose total appearance frequency is equal to or higher than a first threshold THa (truncation of words by frequency). The selection unit 13 then selects the words to be finally registered in the word dictionary 30 from the registration candidates V and, as necessary, determines the dictionary (database) in which to store them. The method of determining the final words and the destination dictionary is not limited to one; various methods can be used, as described below.
- For example, the selection unit 13 may decide to add to the existing dictionary 31 only those registration candidates V whose total appearance frequency is equal to or higher than a predetermined threshold. In this case, the selection unit 13 may select only words whose total appearance frequency is equal to or higher than a second threshold THb (where THb > THa), or only the top n words by total appearance frequency. Hereinafter, such processing is also referred to as "APPEND".
- Alternatively, the selection unit 13 may decide to register in the additional dictionary 32 only those registration candidates V whose total appearance frequency is equal to or higher than a predetermined threshold. In this case as well, the selection unit 13 may select only words whose total appearance frequency is equal to or higher than the second threshold THb (where THb > THa), or only the top n words. Hereinafter, such processing is also referred to as "TOP".
- the selection unit 13 may determine to register all the registration candidates V in the additional dictionary 32. Hereinafter, such processing is also referred to as “ALL”.
- Alternatively, the selection unit 13 may decide to divide the registration candidates V into a plurality of subsets according to total appearance frequency and register each subset in an individual additional dictionary 32. A subset consisting of the words with the top n total appearance frequencies is denoted V_n. For example, the selection unit 13 generates a subset V_1000 consisting of the top 1000 words, a subset V_2000 consisting of the top 2000 words, and a subset V_3000 consisting of the top 3000 words, and decides to register the subsets V_1000, V_2000, and V_3000 in a first, second, and third additional dictionary 32, respectively. The number of subsets and the size of each subset may be determined arbitrarily. Hereinafter, such processing is referred to as "MULTI".
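The four strategies can be sketched in one function. The function and parameter names (`freqs`, `tha`, `thb`, `top_n`, `sizes`) are illustrative; only the strategy names and their behavior come from the text:

```python
def select_words(freqs, tha, mode, thb=None, top_n=None, sizes=None):
    """Sketch of the selection strategies named above (APPEND, TOP, ALL,
    MULTI). freqs maps each word to its total appearance frequency;
    tha is the first threshold THa."""
    # Truncation by frequency: registration candidates V
    V = {w: f for w, f in freqs.items() if f >= tha}
    ranked = sorted(V, key=lambda w: (-V[w], w))
    if mode in ("APPEND", "TOP"):
        # APPEND targets the existing dictionary, TOP one additional
        # dictionary; both keep words above THb or the top n words.
        if thb is not None:
            return [w for w in ranked if V[w] >= thb]
        return ranked[:top_n]
    if mode == "ALL":                    # every candidate, one dictionary
        return ranked
    if mode == "MULTI":                  # nested top-n subsets V_n
        return [ranked[:n] for n in sizes]
    raise ValueError(mode)

freqs = {"sumaho": 120, "uttororin": 40, "rare": 3, "noise": 1}
print(select_words(freqs, 3, "TOP", top_n=2))         # ['sumaho', 'uttororin']
print(select_words(freqs, 3, "MULTI", sizes=[1, 3]))  # [['sumaho'], ['sumaho', 'uttororin', 'rare']]
```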
- When the words to be finally registered have been selected and the destination dictionary has been determined, the selection unit 13 outputs the selection result to the registration unit 14.
- The registration unit 14 is a means for registering the words selected by the selection unit 13 in the word dictionary 30. Which dictionary within the word dictionary 30 receives the words depends on the processing in the selection unit 13: the registration unit 14 may register words only in the existing dictionary 31, or only in one additional dictionary 32. In the case of the "MULTI" process described above, the registration unit 14 distributes the selected words among a plurality of additional dictionaries 32 and registers them.
- the word added to the word dictionary 30 is used for correcting the word division model, but the word dictionary 30 may be used for purposes other than word division.
- the word dictionary 30 may be used for morphological analysis, display of input candidate words in an input box having an automatic input function, a knowledge database for extracting proper nouns, and the like.
- the model generation unit 11 generates an initial word division model (baseline model) by causing the SVM to perform learning based on the learning corpus 20 and the existing dictionary 31 (step S11, model generation step).
- Next, the analysis unit 12 performs analysis (word division) incorporating the baseline model on the large-scale text 40 and gives (associates) boundary information indicating "division", "half division", or "non-division" to each text (step S12, analysis step).
- Next, the selection unit 13 selects words to be registered in the dictionary (selection step). Specifically, the selection unit 13 calculates the total appearance frequency of each word based on the texts with boundary information (step S13) and selects words whose frequency is equal to or higher than a predetermined threshold as registration candidates (step S14). The selection unit 13 then selects the words to be finally registered in the dictionary from the registration candidates and determines the dictionary in which they are registered (step S15), using the techniques described above such as APPEND, TOP, ALL, and MULTI.
- the registration unit 14 registers the selected word in the designated dictionary based on the processing in the selection unit 13 (step S16, registration step).
- the word division model is corrected using the expanded word dictionary 30. That is, the model generation unit 11 generates a corrected word division model by relearning based on the learning corpus 20 and the entire word dictionary 30 (step S17).
- the dictionary generation program P1 includes a main module P10, a model generation module P11, an analysis module P12, a selection module P13, and a registration module P14.
- the main module P10 is a part that comprehensively controls the dictionary generation function.
- the functions realized by executing the model generation module P11, the analysis module P12, the selection module P13, and the registration module P14 are the functions of the model generation unit 11, the analysis unit 12, the selection unit 13, and the registration unit 14, respectively. It is the same.
- the dictionary generation program P1 is provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, DVD-ROM, or semiconductor memory. Further, the dictionary generation program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.
- A word division model is generated using the learning corpus 20, to which boundary information has been given, and the existing dictionary 31, and word division incorporating the model is applied to the large-scale text 40. Words are then selected from the text set to which boundary information has been given by this application and are registered in the word dictionary 30. In this way, by adding boundary information to a text set through analysis using the learning corpus 20 and registering the words extracted from that text set, a large-scale word dictionary 30 can be constructed easily.
- For example, if "スマホケース (sumahoke-su)" ("smartphone case" in English) is divided into "スマホ (sumaho)" and "ケース (ke-su)", the word "スマホ (sumaho)" can be registered in the dictionary. Note that "スマホ (sumaho)" is a Japanese abbreviation of "スマートフォン (suma-tofon)" ("smartphone"). Similarly, the word "うっとろりん (uttororin)" (an unknown word corresponding to the Japanese "うっとり (uttori)", "fascinated" in English) can be registered in the dictionary. By performing text analysis with the dictionary constructed in this way, word division of sentences containing the registered words (for example, sentences containing "sumaho" or "uttororin") is executed more accurately.
- In the experiments, the UniDic headword list (304,267 distinct entries) was used as the existing dictionary, and LIBLINEAR with default parameters was used as the support vector machine. Half-width characters in the learning corpus and the large-scale text were unified, but no further normalization was performed.
- the field is a concept for grouping sentences and words based on style, contents (genre), and the like.
- For learning in the same field, a learning corpus with three-stage word division was created from the titles and descriptions of 590 products randomly extracted, without genre bias, from the website of virtual shopping mall A, and from the descriptions of 50 products randomly extracted from the website of virtual shopping mall B. This learning corpus contained about 110,000 words and about 340,000 characters, and performance was evaluated on it.
- Table 1 shows the result of learning with the baseline model, the result of relearning using the word dictionary obtained by two-stage word division, and the result of relearning using the word dictionary obtained by three-stage word division. All values in Table 1 are percentages (%). The F-value improved with every method (APPEND / TOP / ALL / MULTI), which shows that the proposed learning using large-scale text is effective. The increase in F-value grew in the order APPEND < TOP < ALL < MULTI. From this result, when adding words, adding them to a separate dictionary is more effective than adding them to the existing dictionary, and registering the words in different dictionaries according to appearance frequency is more effective still than registering them all in a single additional dictionary. This is presumably because the classifier automatically learns different contributions and weights depending on the appearance frequency of words. Furthermore, relearning with three-stage word division improved performance over both the baseline model and two-stage word division in all cases. Specifically, taking half division into account yielded improvements such as accurately acquiring words with affixes.
- the learning corpus used was the same as that used for learning in the same field.
- The large-scale text consisted of user reviews, accommodation facility names, accommodation plan names, and responses from accommodation facilities on travel reservation site C. The number of texts was 348,564, and the number of characters was about 126 million. In addition, 150 and 50 reviews were randomly extracted and manually divided into words, and these were used as a test corpus and an active-learning corpus (additions to the learning corpus), respectively.
- Table 2 shows the results of adding these obtained words to the dictionary and re-learning the model using the learning corpus and the field adaptation corpus. All values in Table 2 are percentages (%).
- In the above embodiment, the selection unit 13 selects words based on appearance frequency, but the selection unit 13 may instead register all words in the existing dictionary 31 or the additional dictionary 32 without referring to appearance frequency; truncation of words by frequency is not an essential process.
- In the above embodiment, the processing by the selection unit 13 and the registration unit 14 is performed after the analysis unit 12 has analyzed the entire large-scale text 40, but the analysis unit 12 may instead analyze the large amount of collected text in multiple batches.
- a series of processes including a model generation step, an analysis step, a selection step, and a registration step are repeated a plurality of times.
- For example, group 1 is analyzed in the first loop and its words are registered; group 2 is analyzed in the second loop and further words are registered; and group 3 is analyzed in the third loop and still further words are registered. In the second and subsequent loops, the model generation unit 11 refers to the entire word dictionary 30 and generates a corrected word division model.
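This loop can be sketched with the unit operations injected as callables. The function names and the callable interface are hypothetical placeholders for the model generation, analysis, selection, and registration units; only the per-group loop structure comes from the text:

```python
def iterative_registration(groups, train, analyze, select):
    """Loop sketch of the variant described above: the collected text is
    divided into groups, and model generation -> analysis -> selection ->
    registration runs once per group, each pass training on the
    dictionary as extended so far."""
    registered = []
    for group in groups:
        model = train(registered)             # corpus + dictionary so far
        annotated = analyze(group, model)     # give boundary information
        registered.extend(select(annotated))  # register newly found words
    return registered

# Toy stand-ins: "analysis" passes texts through, "selection" keeps everything
words = iterative_registration(
    groups=[["sumaho"], ["uttororin", "ke-su"]],
    train=lambda registered: None,
    analyze=lambda group, model: group,
    select=lambda annotated: list(annotated),
)
print(words)  # ['sumaho', 'uttororin', 'ke-su']
```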
- the mode of the boundary information is not limited to this example.
- two-stage word division may be performed using only two types of boundary information “division” and “non-division”.
- word division may be performed in four or more stages using “division”, “non-division”, and a plurality of types of probabilistic division.
- a large-scale word dictionary can be easily constructed.
- DESCRIPTION OF SYMBOLS: 10 ... dictionary generation device
Description
A text (character string) x = x1x2…xn (where x1, x2, …, xn are characters) is assigned word boundary tags b = b1b2…bn. Here, bi is a tag indicating whether a word boundary exists between characters xi and xi+1 (an inter-character position); bi = 1 means division and bi = 0 means non-division. The value indicated by the tag bi can also be regarded as the strength of the division.
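As a minimal illustration of this tagging scheme (a sketch; the function name and the one-tag-per-character list representation are assumptions for illustration), the binary tags can be recovered from a given segmentation as follows:

```python
def boundary_tags(words):
    """Given a word segmentation, produce one binary tag per character:
    the tag is 1 if a word boundary follows that character, else 0."""
    tags = []
    for w in words:
        tags.extend([0] * (len(w) - 1) + [1])
    return tags
```

For example, `boundary_tags(["ball", "pen"])` yields `[0, 0, 0, 1, 0, 0, 1]`: a division tag after the fourth character and after the last.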
Japanese contains words whose boundaries are difficult to determine uniquely, and the appropriate manner of word division differs from situation to situation. As an example, consider a keyword search over a set of texts containing the word 「ボールペン」 (bo-rupen; "ballpoint pen" in English). If 「ボールペン」 is not divided, a search with the keyword 「ペン」 (pen; "pen" in English) will fail to retrieve those texts (a drop in recall). On the other hand, if 「ボールペン」 is divided into 「ボール」 (bo-ru; "ball" in English) and 「ペン」 (pen), a search with the sporting-goods keyword 「ボール」 will wrongly retrieve texts containing 「ボールペン」 (a drop in precision).
The text 「ボールペンを買った」 (bo-rupen wo katta; "I bought a ballpoint pen") is divided, for example, as shown in Fig. 3, using the point-estimation word division described above and the three-stage word division corpus. In the example of Fig. 3, the "division" (bi = 1) word boundary tag is given at the beginning of the text and between 「ン」 (n) and 「を」 (wo), among other positions. The "half-division" (bi = 0.5) word boundary tag is given between 「ル」 (ru) and 「ペ」 (pe). Although the "non-division" (bi = 0) word boundary tags are omitted in Fig. 3, this tag is given at positions where no boundary is shown between characters (for example, between 「ペ」 (pe) and 「ン」 (n)).
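The three-stage splitting in this example can be sketched as follows. This is an illustrative sketch, not the patent's implementation: tags are represented as floats per inter-character position, and a cut is made wherever the tag is at least the requested division strength.

```python
def segment(text, tags, level=1.0):
    """Split `text` at inter-character positions whose boundary tag is at
    least `level`; tags[i] sits between text[i] and text[i+1]."""
    words, start = [], 0
    for i, b in enumerate(tags):
        if b >= level:
            words.append(text[start:i + 1])
            start = i + 1
    words.append(text[start:])
    return words

# "half-division" (0.5) between ル and ペ, "division" (1.0) around を
tags = [0, 0, 0.5, 0, 1.0, 1.0, 0, 0]
coarse = segment("ボールペンを買った", tags, level=1.0)  # keeps ボールペン whole
fine = segment("ボールペンを買った", tags, level=0.5)    # also cuts ボール / ペン
```

Requesting `level=0.5` honors the probabilistic "half-division" tag, so the same tagged text supports both the recall-friendly and the precision-friendly segmentation.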
Here, O1 indicates an occurrence of the written form of the word w, and is defined as follows.
Prec=NCOR/NSYS
Rec=NCOR/NREF
F=2Prec・Rec/(Prec+Rec)
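These three measures can be computed directly from the definitions above (a straightforward sketch; NCOR, NSYS, and NREF are the numbers of correct, system-output, and reference words, respectively):

```python
def f_measure(n_cor, n_sys, n_ref):
    """Word-segmentation precision, recall, and F-measure:
    Prec = NCOR / NSYS, Rec = NCOR / NREF, F = their harmonic mean."""
    prec = n_cor / n_sys
    rec = n_cor / n_ref
    f = 2 * prec * rec / (prec + rec)
    return prec, rec, f
```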
Claims (14)
- A dictionary generation device comprising:
a model generation unit that generates a word division model using a corpus and a word group prepared in advance, wherein each text included in the corpus is given boundary information indicating a word boundary;
an analysis unit that executes, on a set of collected texts, word division in which the word division model is incorporated, and gives the boundary information to each text;
a selection unit that selects words to be registered in a dictionary from the texts given the boundary information by the analysis unit; and
a registration unit that registers the words selected by the selection unit in the dictionary.
- The dictionary generation device according to claim 1, wherein the selection unit selects the words to be registered in the dictionary based on an appearance frequency of each word calculated from the boundary information given by the analysis unit.
- The dictionary generation device according to claim 2, wherein the selection unit selects words whose appearance frequency is equal to or higher than a predetermined threshold.
- The dictionary generation device according to claim 3, wherein the selection unit extracts words whose appearance frequency is equal to or higher than the threshold as registration candidates and selects a predetermined number of words from the candidates in descending order of appearance frequency, and
the registration unit adds the words selected by the selection unit to the dictionary in which the word group is recorded.
- The dictionary generation device according to claim 3, wherein the selection unit extracts words whose appearance frequency is equal to or higher than the threshold as registration candidates and selects a predetermined number of words from the candidates in descending order of appearance frequency, and
the registration unit registers the words selected by the selection unit in a dictionary separate from the dictionary in which the word group is recorded.
- The dictionary generation device according to claim 3, wherein the registration unit registers the words selected by the selection unit in a dictionary separate from the dictionary in which the word group is recorded.
- The dictionary generation device according to claim 3, wherein the selection unit extracts words whose appearance frequency is equal to or higher than the threshold as registration candidates and groups the candidate words according to their appearance frequency, and
the registration unit individually registers the plurality of groups generated by the selection unit in a plurality of dictionaries separate from the dictionary in which the word group is recorded.
- The dictionary generation device according to claim 3, wherein each of the collected texts is associated with information indicating the field of the text, and
the registration unit individually registers each word selected by the selection unit in a dictionary prepared for each field, based on the field of the text in which the word was contained.
- The dictionary generation device according to any one of claims 2 to 8, wherein the boundary information includes first information indicating that no boundary exists at an inter-character position, second information indicating that a boundary exists at an inter-character position, and third information indicating that a boundary exists probabilistically at an inter-character position, and
the appearance frequency of each word is calculated based on the first, second, and third information.
- The dictionary generation device according to claim 9, wherein the analysis unit comprises a first binary classifier and a second binary classifier,
the first binary classifier determines, for each inter-character position, whether to assign the first information or information other than the first information, and
the second binary classifier determines, for each inter-character position to which the first binary classifier has determined to assign information other than the first information, whether to assign the second information or the third information.
- The dictionary generation device according to any one of claims 1 to 10, wherein the set of collected texts is divided into a plurality of groups, and
after the analysis unit, the selection unit, and the registration unit execute processing based on one of the plurality of groups, the model generation unit generates the word division model using the corpus, the word group, and the words registered by the registration unit, and subsequently the analysis unit, the selection unit, and the registration unit execute processing based on another one of the plurality of groups.
- A dictionary generation method executed by a dictionary generation device, the method comprising:
a model generation step of generating a word division model using a corpus and a word group prepared in advance, wherein each text included in the corpus is given boundary information indicating a word boundary;
an analysis step of executing, on a set of collected texts, word division in which the word division model is incorporated, and giving the boundary information to each text;
a selection step of selecting words to be registered in a dictionary from the texts given the boundary information in the analysis step; and
a registration step of registering the words selected in the selection step in the dictionary.
- A dictionary generation program causing a computer to function as:
a model generation unit that generates a word division model using a corpus and a word group prepared in advance, wherein each text included in the corpus is given boundary information indicating a word boundary;
an analysis unit that executes, on a set of collected texts, word division in which the word division model is incorporated, and gives the boundary information to each text;
a selection unit that selects words to be registered in a dictionary from the texts given the boundary information by the analysis unit; and
a registration unit that registers the words selected by the selection unit in the dictionary.
- A computer-readable recording medium storing a dictionary generation program causing a computer to function as:
a model generation unit that generates a word division model using a corpus and a word group prepared in advance, wherein each text included in the corpus is given boundary information indicating a word boundary;
an analysis unit that executes, on a set of collected texts, word division in which the word division model is incorporated, and gives the boundary information to each text;
a selection unit that selects words to be registered in a dictionary from the texts given the boundary information by the analysis unit; and
a registration unit that registers the words selected by the selection unit in the dictionary.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201280030052.2A CN103608805B (zh) | 2012-02-28 | 2012-09-03 | 辞典产生装置及方法 |
KR1020137030410A KR101379128B1 (ko) | 2012-02-28 | 2012-09-03 | 사전 생성 장치, 사전 생성 방법 및 사전 생성 프로그램을 기억하는 컴퓨터 판독 가능 기록 매체 |
JP2013515598A JP5373998B1 (ja) | 2012-02-28 | 2012-09-03 | 辞書生成装置、方法、及びプログラム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201261604266P | 2012-02-28 | 2012-02-28 | |
US61/604266 | 2012-02-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013128684A1 true WO2013128684A1 (ja) | 2013-09-06 |
Family
ID=49081915
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2012/072350 WO2013128684A1 (ja) | 2012-02-28 | 2012-09-03 | 辞書生成装置、方法、及びプログラム |
Country Status (5)
Country | Link |
---|---|
JP (1) | JP5373998B1 (ja) |
KR (1) | KR101379128B1 (ja) |
CN (1) | CN103608805B (ja) |
TW (1) | TWI452475B (ja) |
WO (1) | WO2013128684A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018073047A (ja) * | 2016-10-27 | 2018-05-10 | キヤノンマーケティングジャパン株式会社 | 情報処理装置、その制御方法及びプログラム |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105701133B (zh) * | 2014-11-28 | 2021-03-30 | 方正国际软件(北京)有限公司 | 一种地址输入的方法和设备 |
JP6707483B2 (ja) * | 2017-03-09 | 2020-06-10 | 株式会社東芝 | 情報処理装置、情報処理方法、および情報処理プログラム |
CN108391446B (zh) * | 2017-06-20 | 2022-02-22 | 埃森哲环球解决方案有限公司 | 基于机器学习算法对针对数据分类器的训练语料库的自动提取 |
JP2019049873A (ja) * | 2017-09-11 | 2019-03-28 | 株式会社Screenホールディングス | 同義語辞書作成装置、同義語辞書作成プログラム及び同義語辞書作成方法 |
CN109033183B (zh) * | 2018-06-27 | 2021-06-25 | 清远墨墨教育科技有限公司 | 一种可编辑的云词库的解析方法 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH09288673A (ja) * | 1996-04-23 | 1997-11-04 | Nippon Telegr & Teleph Corp <Ntt> | 日本語形態素解析方法と装置及び辞書未登録語収集方法と装置 |
JP2002351870A (ja) * | 2001-05-29 | 2002-12-06 | Communication Research Laboratory | 形態素の解析方法 |
JP2008257511A (ja) * | 2007-04-05 | 2008-10-23 | Yahoo Japan Corp | 専門用語抽出装置、方法及びプログラム |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1086821C (zh) * | 1998-08-13 | 2002-06-26 | 英业达股份有限公司 | 汉语语句切分的方法及其系统 |
CN100530171C (zh) * | 2005-01-31 | 2009-08-19 | 日电(中国)有限公司 | 字典学习方法和字典学习装置 |
2012
- 2012-09-03 CN CN201280030052.2A patent/CN103608805B/zh active Active
- 2012-09-03 JP JP2013515598A patent/JP5373998B1/ja active Active
- 2012-09-03 KR KR1020137030410A patent/KR101379128B1/ko active IP Right Grant
- 2012-09-03 WO PCT/JP2012/072350 patent/WO2013128684A1/ja active Application Filing
- 2012-09-13 TW TW101133547A patent/TWI452475B/zh active
Non-Patent Citations (1)
Title |
---|
TETSURO SASADA ET AL.: "Kana-Kanji Conversion by Using Unknown Word-Pronunciation Pairs with Contexts", JOURNAL OF NATURAL LANGUAGE PROCESSING, vol. 17, no. 4, 30 July 2010 (2010-07-30), pages 131 - 153 * |
Also Published As
Publication number | Publication date |
---|---|
CN103608805A (zh) | 2014-02-26 |
TWI452475B (zh) | 2014-09-11 |
KR20130137048A (ko) | 2013-12-13 |
JPWO2013128684A1 (ja) | 2015-07-30 |
TW201335776A (zh) | 2013-09-01 |
KR101379128B1 (ko) | 2014-03-27 |
JP5373998B1 (ja) | 2013-12-18 |
CN103608805B (zh) | 2016-09-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111444320B (zh) | 文本检索方法、装置、计算机设备和存储介质 | |
Kadhim et al. | Text document preprocessing and dimension reduction techniques for text document clustering | |
JP5373998B1 (ja) | 辞書生成装置、方法、及びプログラム | |
US8239188B2 (en) | Example based translation apparatus, translation method, and translation program | |
Tkaczyk et al. | Cermine--automatic extraction of metadata and references from scientific literature | |
JP5834883B2 (ja) | 因果関係要約方法、因果関係要約装置及び因果関係要約プログラム | |
CN110472043B (zh) | 一种针对评论文本的聚类方法及装置 | |
CN108875065B (zh) | 一种基于内容的印尼新闻网页推荐方法 | |
CN109558482B (zh) | 一种基于Spark框架的文本聚类模型PW-LDA的并行化方法 | |
Selamat et al. | Word-length algorithm for language identification of under-resourced languages | |
CN111400584A (zh) | 联想词的推荐方法、装置、计算机设备和存储介质 | |
Gunawan et al. | Multi-document summarization by using textrank and maximal marginal relevance for text in Bahasa Indonesia | |
JP6186198B2 (ja) | 学習モデル作成装置、翻訳装置、学習モデル作成方法、及びプログラム | |
Kotenko et al. | Evaluation of text classification techniques for inappropriate web content blocking | |
CN113986950A (zh) | 一种sql语句处理方法、装置、设备及存储介质 | |
CN113076748A (zh) | 弹幕敏感词的处理方法、装置、设备及存储介质 | |
CN106570196B (zh) | 视频节目的搜索方法和装置 | |
CN114912425A (zh) | 演示文稿生成方法及装置 | |
CN103218388A (zh) | 文档相似性评价系统、文档相似性评价方法以及计算机程序 | |
Ashari et al. | Document summarization using TextRank and semantic network | |
JP2011227749A (ja) | 略語完全語復元装置とその方法と、プログラム | |
CN113449063B (zh) | 一种构建文档结构信息检索库的方法及装置 | |
CN113157857B (zh) | 面向新闻的热点话题检测方法、装置及设备 | |
CN111581162B (zh) | 一种基于本体的海量文献数据的聚类方法 | |
JP4567025B2 (ja) | テキスト分類装置、テキスト分類方法及びテキスト分類プログラム並びにそのプログラムを記録した記録媒体 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
ENP | Entry into the national phase |
Ref document number: 2013515598 Country of ref document: JP Kind code of ref document: A |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 12869894 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 20137030410 Country of ref document: KR Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 12869894 Country of ref document: EP Kind code of ref document: A1 |