WO2007088902A1

WO2007088902A1 - Character processing device, method and program, and recording medium

Info

Publication number: WO2007088902A1
Application number: PCT/JP2007/051622
Authority: WO
Inventors: Yoshiharu Sato; Noriko Ishibashi; Miyuki Seki; Hiroaki Kanokogi; Takashi Umeoka
Original assignee: Microsoft Corporation
Priority date: 2006-01-31
Filing date: 2007-01-31
Publication date: 2007-08-09
Also published as: CN101371252B; CN101371252A; JP4845523B2; JP2007206796A; TW200821868A

Abstract

Provided is a character processing device that uses a word lattice to convert a reading character string for words into a character string of the corresponding notation, and that is provided with a phrase dictionary. When a reading identical to one recorded in the phrase dictionary is given, the notation acquired from the phrase dictionary is added to the word lattice as a bypass, in addition to the nodes of the notations acquired from a word dictionary (110).

Description

Character processing apparatus, method, program, and recording medium

Technical field

[0001] The present invention relates to a character processing device, method, program, and recording medium that use a word lattice to convert a reading character string (for example, hiragana or romaji) input for a plurality of words into the corresponding notation (kanji, mixed kanji and kana, alphabetic characters, etc.).

Background art

Conventionally, a method of performing kana-kanji conversion using a word lattice is known. As described in Non-Patent Document 1, the word lattice is a network in which a plurality of input readings, or a plurality of notations corresponding to the input readings, are arranged in order of connection. Each notation constituting this network is called a node, and a sequence of nodes arranged in order of connection is called a path. The notation of each node is obtained by referring to the word dictionary. The word dictionary is a reading-to-notation conversion dictionary composed of a plurality of different records, each of which holds one reading and one notation corresponding to that reading (also called a headword).

[0003] For example, given a reading of “Tokkyocho”,

Path 1: "Tokkyo" (note: hiragana character string pronounced "Tokkyo") → "Machi" (note: kanji pronounced "Chou", meaning town)

Path 2: "Tokkyo" → "Government" (note: kanji pronounced "Chou", meaning office)

Path 3: "Patent" (note: kanji character string pronounced "Tokkyo", meaning patent) → "Government"

...

A word lattice with multiple paths such as these is created by the CPU in the memory of the kana-kanji conversion device.
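The construction of such paths can be sketched in Python as follows. This is only an illustrative sketch: the romanized readings, the English glosses "Town" and "Office" used as notations, and the names `WORD_DICT` and `enumerate_paths` are assumptions made for the example and do not appear in the source.

```python
from typing import Dict, List

# Hypothetical word dictionary: reading -> candidate notations.
# English glosses stand in for the kanji/kana notations.
WORD_DICT: Dict[str, List[str]] = {
    "tokkyo": ["Tokkyo", "Patent"],
    "chou": ["Town", "Office"],
}


def enumerate_paths(reading: str, word_dict: Dict[str, List[str]]) -> List[List[str]]:
    """Return every sequence of notations whose readings concatenate to `reading`."""
    if not reading:
        return [[]]  # one empty path: nothing left to convert
    paths: List[List[str]] = []
    for end in range(1, len(reading) + 1):
        # Try every prefix of the reading as the reading of the next word.
        for notation in word_dict.get(reading[:end], []):
            for rest in enumerate_paths(reading[end:], word_dict):
                paths.append([notation] + rest)
    return paths


paths = enumerate_paths("tokkyochou", WORD_DICT)
for path in paths:
    print(" -> ".join(path))
```

Each returned list corresponds to one path through the lattice. A practical implementation would store shared nodes rather than enumerating every path explicitly.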

[0004] By referring to the language dictionary, the CPU obtains the appearance probability of a predetermined number of consecutive nodes on each path (one to three nodes are commonly used), and then calculates the appearance probability of the path as a whole from the appearance probabilities of all the nodes on that path. The language dictionary is composed of a plurality of records, each of which contains a predetermined number of notation strings and their appearance probability.

[0005] In the above example, “Tokkyo”, “Town”, and A1 (value of appearance probability) are one record in the language dictionary.

[0006] By repeating the above processing procedure, the CPU calculates the appearance probabilities of all paths on the word lattice. The CPU then detects the path with the highest of the calculated appearance probabilities. The node sequence indicated by the path having the highest appearance probability is determined to be the most likely kana-kanji conversion result for the given character string (Non-Patent Document 1).
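A minimal sketch of this conventional scoring step is shown below. It assumes that the language dictionary holds probabilities for pairs of adjacent notations and that the path probability is their product; the probability values, the `FLOOR` value for unseen pairs, and the function names are all hypothetical.

```python
from math import prod
from typing import Dict, List, Sequence, Tuple

# Hypothetical language dictionary: (previous notation, notation) -> probability.
LANGUAGE_DICT: Dict[Tuple[str, str], float] = {
    ("Tokkyo", "Town"): 0.02,
    ("Tokkyo", "Office"): 0.01,
    ("Patent", "Office"): 0.05,
}
FLOOR = 1e-6  # assumed probability for pairs missing from the dictionary


def path_probability(path: Sequence[str], language_dict: Dict[Tuple[str, str], float]) -> float:
    """Multiply the probabilities of every adjacent pair of nodes on the path."""
    return prod(language_dict.get(pair, FLOOR) for pair in zip(path, path[1:]))


def best_path(paths: List[Sequence[str]], language_dict: Dict[Tuple[str, str], float]) -> Sequence[str]:
    """Return the path with the highest appearance probability."""
    return max(paths, key=lambda p: path_probability(p, language_dict))


candidates = [["Tokkyo", "Town"], ["Tokkyo", "Office"], ["Patent", "Office"]]
print(best_path(candidates, LANGUAGE_DICT))
```

Real systems usually sum log probabilities instead of multiplying raw values, to avoid numerical underflow on long paths.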

The determined kana-kanji conversion result is displayed on the display screen and, after a confirmation operation by the user, the confirmed kana-kanji conversion result is passed to the document processing device (word processor program). For notation strings that the user uses frequently, the frequency of use is also reflected in the appearance probabilities of the language dictionary.

[0007] In such a character processing method, a problem arises because the input reading character string is converted to the notation with the highest appearance probability. Expressions consisting of several words, such as place names, company names, proverbs, and collocations, have a fixed notation. In the development of a statistical language model, the appearance probability of a given word path is generally calculated from its frequency in statistical material called a corpus. However, proper nouns such as place names, and proverbs, do not necessarily appear frequently in a corpus, so it is difficult to assign them a reliably high probability. Therefore, although such expressions have a fixed notation, conversion to that notation cannot be guaranteed when the result is determined statistically.

[0008] When the reading of such a collocation is input, the character processing method described above creates a plurality of paths and calculates the appearance probability of each path, so that in the initial use state the kana-kanji conversion result may differ from the conventional notation of the expression.

[0009] Non-Patent Document 1: http://www.jaist.ac.jp/~kshirai/lec/i223/07.pdf

Non-Patent Document 2: “Language Model Adaptation Method to Fixed Expressions by Partial Emphasis on N-gram” (Dentsu IEICE Transactions Vol. J86-D-II No. 12, December 2003)

Disclosure of the invention

[0010] Therefore, an object of the present invention is to provide a character processing device, method, program, and recording medium that use a word lattice and can perform, in a balanced manner, both conversion to a commonly used collocation and conversion to a notation other than the collocation that has the same reading.

[0011] Other conventional techniques include the following.

1. Register the proper nouns in the dictionary in long units. Because the long span is then covered by a single word, the long registered word becomes the first candidate.

2. Adjust the weights of the probabilities given by the dictionary and grammar, either when compiling the dictionary and grammar or when performing kana-kanji conversion, so that the expression is likely to be the first candidate (Non-Patent Document 2).

The first method makes the expression likely to appear as the first candidate, but it is effective only when the input covers exactly that long span, and it has the problem that the candidate list must be generated for that length and span.

[0012] The second method requires complicated calculations, and there is no practical method.

[0013] The present invention does not take these conventional approaches. It retains the information that a long word string is internally composed of a plurality of words while, as in the first method, holding the string virtually in the dictionary as if it were a single word. Matching is then performed in long units, the appearance probability of the path is calculated, and the first candidate of the conversion result is determined to be the desired character string. The string is then divided into its constituent words, and subsequent processing such as candidate generation is performed.

[0014] More specifically, a first aspect of the present invention is a character processing device for converting a reading character string for a plurality of words into a corresponding plurality of notation character strings using a word lattice, the device comprising:

first storage means for storing a collocation dictionary having a plurality of different records, each record containing a reading character string of a collocation and the corresponding notation character string;

search means for searching the collocation dictionary with the reading character string of the plurality of words to be converted, and obtaining the notation character strings of the plurality of words that correspond to the entry in the collocation dictionary having the same reading;

first information processing means for adding the notation character strings obtained by the search means to the word lattice as a plurality of nodes;

second storage means storing a plurality of sets of appearance probabilities, and third storage means storing the appearance probabilities of the notations of the plurality of collocations; and

second information processing means for acquiring, as a conversion candidate, the notation character string on the path having the highest appearance probability on the word lattice to which the notation character strings of the collocation have been added as nodes, based on the appearance probabilities stored in the second storage means and the third storage means.

[0015] In the second aspect of the present invention, the first storage means is used as a third storage means, and the appearance probability is included in the record.

[0016] A third aspect of the present invention is a character processing method for a character processing device that uses a word lattice to convert a reading character string for a plurality of words into a corresponding plurality of notation character strings. The character processing device has: first storage means storing a collocation dictionary having a plurality of different records, each record containing a reading character string of a collocation and the corresponding notation character string; second storage means storing a plurality of sets of appearance probabilities; third storage means storing the appearance probabilities of the notations of the plurality of collocations; search means; first information processing means; and second information processing means. The method is characterized in that:

the search means searches the collocation dictionary with the reading character string of the plurality of words to be converted, and obtains the notation character strings of the plurality of words that correspond to the entry in the collocation dictionary having the same reading;

the notation character strings obtained by the search means are added to the word lattice as a plurality of nodes by the first information processing means; and

based on the appearance probabilities stored in the second storage means and the third storage means, the notation character string on the path having the highest appearance probability on the word lattice to which the notation character strings of the collocation have been added as nodes is acquired as a conversion candidate by the second information processing means.

[0017] In a fourth aspect of the present invention, in the character processing method, the first storage means is used as the third storage means, and the appearance probability is included in the record.

[0018] A fifth aspect of the present invention is a character processing program for a character processing device that uses a word lattice to convert a reading character string for a plurality of words into a corresponding plurality of notation character strings. The character processing device has: first storage means storing a collocation dictionary having a plurality of different records, each record containing a reading character string of a collocation and the corresponding notation character string; second storage means storing a plurality of sets of appearance probabilities; third storage means storing the appearance probabilities of the notations of the plurality of collocations; search means; first information processing means; and second information processing means. The program is characterized by having:

a step in which the search means searches the collocation dictionary with the reading character string of the plurality of words to be converted, and obtains the notation character strings of the plurality of words that correspond to the entry in the collocation dictionary having the same reading;

a step in which the notation character strings obtained by the search means are added to the word lattice as a plurality of nodes by the first information processing means; and

a step in which, based on the appearance probabilities stored in the second storage means and the third storage means, the notation character string on the path having the highest appearance probability on the word lattice to which the notation character strings of the collocation have been added as nodes is acquired as a conversion candidate by the second information processing means.

[0019] In a sixth aspect of the present invention, the first storage means is used as a third storage means, and the appearance probability is included in the record.

[0020] The seventh aspect of the present invention is characterized in that the program of the fifth or sixth aspect is recorded.

Brief Description of Drawings

FIG. 1 is a block diagram showing a hardware configuration of an embodiment of the present invention.

FIG. 2 is a block diagram showing a software configuration according to the embodiment of the present invention.

FIG. 3 is a flowchart showing a character processing procedure according to the embodiment of the present invention.

FIG. 4 is an explanatory diagram showing an example of a word lattice.

FIG. 5 is an explanatory diagram showing a word lattice with nodes added.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(Embodiment 1)

FIG. 1 shows an example of the system configuration of the character processing device according to the first embodiment. As the character processing device, various information processing apparatuses having an information processing function, such as general-purpose personal computers and mobile phones, can be used.

In FIG. 1, reference numeral 10 denotes a CPU, which executes character processing according to the present invention using a character processing program described later. The CPU 10 functions as search means and first and second information processing means of the present invention.

[0025] Reference numeral 20 denotes a system memory having a ROM and a RAM, which temporarily stores input / output data for the CPU 10.

[0026] Reference numeral 30 denotes an input device. For example, a device such as a keyboard for inputting a reading character string can be used. In addition to the keyboard, a data reading device that reads a reading character string from a storage medium storing it, or a communication device that receives a reading character string through external communication, can also be used as the input device 30.

Reference numeral 40 denotes a hard disk (HD), which stores a character processing program and a later-described dictionary used for character processing. The hard disk 40 functions as the first to third storage means of the present invention.

[0028] Reference numeral 50 denotes a display that displays the conversion candidates determined by the character processing program.

FIG. 2 shows a configuration of software installed in the hard disk 40 of FIG.

[0030] Reference numeral 100 denotes a character processing program for converting a reading character string for a plurality of words into a corresponding plurality of notation character strings using a word lattice. Reference numeral 110 denotes a word dictionary which, as in the conventional art, lists a plurality of different records, each holding the reading character string of a single word and the corresponding notation character string.

[0031] Reference numeral 120 denotes a language dictionary, which describes a plurality of words and corresponding appearance probabilities. In the first embodiment, the language dictionary 120 can be the same as the conventional one.

[0032] Reference numeral 130 denotes the collocation dictionary according to the present invention. For each of a plurality of commonly used collocations (for example, proper expressions such as place names and company names, proverbs, etc.), the reading character string, the notation character string, and the appearance probability of the collocation form one record, and a plurality of different records are listed in the collocation dictionary 130.

[0033] When the number of words is small, the same notation character string as a collocation recorded in the collocation dictionary 130 may also be recorded in the language dictionary 120. In this case, it should be noted that the appearance probability in the collocation dictionary 130 is set in advance to be higher than the appearance probability in the language dictionary 120.
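The record layout described above, and the requirement that a collocation outrank the same notation in the language dictionary, could be modelled roughly as follows. The class name `CollocationRecord`, the field names, the probability values, and the choice to key the language dictionary by a tuple of notations are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CollocationRecord:
    reading: str                 # reading character string of the whole collocation
    notations: Tuple[str, ...]   # notation of each constituent word
    probability: float           # appearance probability stored with the record


# Hypothetical dictionary contents; English glosses stand in for the notations.
COLLOCATION_DICT: Dict[str, CollocationRecord] = {
    "tokkyochou": CollocationRecord("tokkyochou", ("Patent", "Office"), 0.60),
}
LANGUAGE_DICT: Dict[Tuple[str, ...], float] = {
    ("Patent", "Office"): 0.20,
    ("Tokkyo", "Town"): 0.30,
}


def outranks_language_dict(record: CollocationRecord,
                           language_dict: Dict[Tuple[str, ...], float]) -> bool:
    """Check that the collocation probability exceeds that of the same notation
    string in the language dictionary (treated as 0.0 when it is absent)."""
    return record.probability > language_dict.get(record.notations, 0.0)


record = COLLOCATION_DICT["tokkyochou"]
print(outranks_language_dict(record, LANGUAGE_DICT))
```

A check of this kind would be run when the dictionaries are prepared, since the text states that the higher probability is set in advance.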

In the first embodiment, the hard disk 40 that stores the word dictionary 110 functions as the second storage means of the present invention, and the hard disk 40 that stores the collocation dictionary 130 functions as the first and third storage means of the present invention.

FIG. 3 shows the processing procedure of the portion of the character processing program 100 that relates to the present invention. FIGS. 4 and 5 show, in network form for ease of understanding, the word lattice constructed in the system memory 20.

[0036] The operation of the character processing device will be described with reference to FIGS.

[0037] The following describes the processing performed by the CPU 10 from the input of "Tokkyocho" as a reading character string from the input device 30 until "Patent Office" is obtained as a conversion candidate.

In step S10 of FIG. 3, the CPU 10 builds the word lattice shown in FIG. 4 in the system memory 20 in the same manner as in the conventional art. Briefly, the CPU 10 searches the word dictionary 110 with the reading character string “Tokkyo” and acquires “Tokkyo”, “Totsuki”, and “patent” as notation character strings to which it can be converted. Each of the three acquired character strings is stored in the system memory 20. Next, by searching the word dictionary 110, the CPU 10 acquires the notation character strings “Cho”, “Chou” (a katakana character string pronounced “Chou”), “Machi”, ..., “Office” for the following word.

[0039] Each acquired notation character string is stored in the system memory 20 in association with the notation character string of the immediately preceding word acquired earlier. Widely known methods of association include attaching the storage address of the immediately preceding notation to the acquired notation as attribute information, and storing the association in table form. Either method may be used.
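The first association method mentioned above, in which each notation carries a reference to the immediately preceding notation, can be sketched with a back-pointer. In this illustrative Python sketch an object reference stands in for the storage address; the class name `Node` and the function `notation_sequence` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    notation: str
    # Reference to the immediately preceding node; plays the role of the
    # "storage address of the immediately preceding notation".
    prev: Optional["Node"] = None


def notation_sequence(last: Node) -> List[str]:
    """Follow the back-pointers from the last node and return the path in order."""
    sequence: List[str] = []
    node: Optional[Node] = last
    while node is not None:
        sequence.append(node.notation)
        node = node.prev
    sequence.reverse()
    return sequence


patent = Node("Patent")
office = Node("Office", prev=patent)
town = Node("Town", prev=patent)  # two nodes may share the same predecessor
print(notation_sequence(office))
```

Because several following nodes can point back to the same preceding node, the shared prefix of many paths is stored only once.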

[0040] In this example, the reading character strings for two words are input. When reading character strings for three or more words are input, the CPU 10 likewise obtains the notation character strings corresponding to the reading character string of each word and constructs the word lattice.

Subsequently, the CPU 10 proceeds to step S20 in FIG. 3. Here, acting as the search means of the present invention, the CPU 10 searches the collocation dictionary 130 with the input reading character string, in this case “Tokkyocho”. As a result of this search, the collocation notation “Patent Office” and its appearance probability A1 are obtained from the collocation dictionary 130.

The procedure proceeds to step S30, and the CPU 10 adds the notations “patent” and “office” of the acquired collocation as nodes to the word lattice (see FIG. 4) in the system memory 20, as shown in FIG. 5. Each word of the acquired collocation may be a node, or the collocation as a whole may be a single node. The example in FIG. 5 uses the individual words as nodes. The path composed of the added nodes is called a bypass (reference numeral 1010) in this embodiment. The bypass 1010 is given attribute information indicating that it is a bypass, to distinguish it from the conventional paths.
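Steps S20 and S30 can be sketched as a dictionary lookup followed by appending a flagged path to the lattice. The sketch below represents the lattice simply as a list of paths; the class `Path`, the flag `is_bypass`, the function `add_bypass`, and the probability value are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Path:
    nodes: Tuple[str, ...]
    is_bypass: bool = False              # attribute distinguishing a bypass
    probability: Optional[float] = None  # known in advance only for a bypass


# Hypothetical collocation dictionary: reading -> (notations, probability).
CollocationDict = Dict[str, Tuple[Tuple[str, ...], float]]


def add_bypass(lattice: List[Path], reading: str, collocation_dict: CollocationDict) -> List[Path]:
    """Search the collocation dictionary (S20) and, on a hit, add the
    collocation's words to the lattice as a bypass (S30)."""
    hit = collocation_dict.get(reading)
    if hit is not None:
        notations, probability = hit
        lattice.append(Path(tuple(notations), is_bypass=True, probability=probability))
    return lattice


lattice = [Path(("Tokkyo", "Town")), Path(("Tokkyo", "Office"))]
collocations: CollocationDict = {"tokkyochou": (("Patent", "Office"), 0.60)}
add_bypass(lattice, "tokkyochou", collocations)
print(lattice[-1])
```

When the reading is not found in the collocation dictionary, the lattice is left unchanged and conversion proceeds as in the conventional method.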

The procedure proceeds to S40, and the CPU 10 calculates the appearance probability of each path on the word lattice of FIG. In the example of FIG. 5, since the first path is “Tokkyo” → “Chow”, the language dictionary 120 is searched for “Tokkyo” + “Chow”, and the corresponding appearance probability B1 is obtained by the CPU 10.

In this way, the appearance probability is acquired from the language dictionary 120 for the path (FIG. 4) from which the node is acquired by the word dictionary 110.

[0045] For the nodes "patent" and "office" on the bypass 1010, the appearance probability A1 has already been obtained from the collocation dictionary 130 in step S20. The CPU 10 therefore compares the appearance probabilities of the paths with one another, using a well-known information processing method such as sorting, and detects the path with the highest appearance probability. The notation character string obtained by joining the nodes on the detected path is acquired by the CPU 10 as the conversion candidate for the reading character string "Tokkyocho" and displayed on the display 50 (step S40). Thereafter, as in the conventional art, the user either confirms the candidate or gives a further conversion instruction to the CPU 10 using the input device 30, and obtains the desired conversion result.
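The comparison of conventional paths and the bypass might look like the following sketch. It assumes pairwise probabilities in the language dictionary, a product over adjacent pairs for conventional paths, and a single stored probability for the bypass; all values and names are hypothetical, and at least one conventional path is assumed to exist.

```python
from math import prod
from typing import Dict, List, Tuple

Notations = Tuple[str, ...]

# Hypothetical dictionaries; English glosses stand in for the notations.
LANGUAGE_DICT: Dict[Tuple[str, str], float] = {
    ("Tokkyo", "Town"): 0.30,
    ("Tokkyo", "Office"): 0.10,
    ("Patent", "Town"): 0.05,
    ("Patent", "Office"): 0.20,
}
COLLOCATION_DICT: Dict[str, Tuple[Notations, float]] = {
    "tokkyochou": (("Patent", "Office"), 0.60),
}
PATHS: List[Notations] = [
    ("Tokkyo", "Town"), ("Tokkyo", "Office"), ("Patent", "Town"), ("Patent", "Office"),
]


def convert(reading: str,
            conventional_paths: List[Notations],
            language_dict: Dict[Tuple[str, str], float],
            collocation_dict: Dict[str, Tuple[Notations, float]]) -> Notations:
    """Score every path, add the bypass if the reading is a known collocation,
    and return the notations on the highest-probability path."""
    scored: List[Tuple[float, Notations]] = []
    for path in conventional_paths:
        p = prod(language_dict.get(pair, 1e-6) for pair in zip(path, path[1:]))
        scored.append((p, path))
    hit = collocation_dict.get(reading)
    if hit is not None:
        notations, probability = hit
        scored.append((probability, notations))  # the bypass
    scored.sort(key=lambda item: item[0], reverse=True)  # compare by sorting
    return scored[0][1]


print(convert("tokkyochou", PATHS, LANGUAGE_DICT, COLLOCATION_DICT))
print(convert("tokkyochou", PATHS, LANGUAGE_DICT, {}))
```

With these illustrative values the bypass wins when the collocation dictionary is used, while the purely statistical lattice prefers a different notation, which is the situation the bypass is meant to correct.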

[0046] In this example, in the initial use state of the character processing device, the appearance probability of the bypass 1010 in FIG. 5, which is composed of the notation obtained from the collocation dictionary 130, has the highest value for "Tokkyocho". Therefore, the nodes (Patent Office) on the bypass 1010 in FIG. 5 are determined as the conversion candidate.

Thereafter, when the user frequently uses the notation “Patent Office”, the appearance frequency corresponding to “Patent Office” in the language dictionary 120 is updated so as to increase, as in the conventional art, and the node (Patent Office) on path 1000 in the figure is determined as the conversion candidate.

[0048] If a person who lives in a town named “Patent Town” inputs an address using this character processing device, the appearance frequency of “Patent Town” recorded in the language dictionary 120 is updated according to that user's use. When the user then inputs “Tokkyocho” to the character processing device, “Patent Town” is obtained as the conversion candidate.
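The effect of usage-based updating can be illustrated with the sketch below. The source does not specify how the language dictionary is updated, so the additive rule in `reinforce`, the step size, the cap, and all probability values are assumptions chosen only to show a frequently used notation overtaking the bypass.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]

# Hypothetical bypass (notations, probability) from the collocation dictionary.
BYPASS: Tuple[Pair, float] = (("Patent", "Office"), 0.60)

# Hypothetical language dictionary entries for the competing paths.
language_dict: Dict[Pair, float] = {
    ("Patent", "Town"): 0.10,
    ("Patent", "Office"): 0.20,
}


def reinforce(language_dict: Dict[Pair, float], pair: Pair,
              step: float = 0.2, cap: float = 0.99) -> None:
    """Assumed update rule: raise the probability each time the user confirms the pair."""
    language_dict[pair] = min(cap, language_dict.get(pair, 0.0) + step)


def first_candidate(language_dict: Dict[Pair, float], bypass: Tuple[Pair, float]) -> Pair:
    """Return the notations with the highest probability among all paths and the bypass."""
    candidates = list(language_dict.items()) + [bypass]
    return max(candidates, key=lambda item: item[1])[0]


before = first_candidate(language_dict, BYPASS)
for _ in range(3):  # the user confirms "Patent Town" three times
    reinforce(language_dict, ("Patent", "Town"))
after = first_candidate(language_dict, BYPASS)
print(before, after)
```

In the initial state the bypass is the first candidate, and after repeated use the user's own notation replaces it, matching the behaviour described for the resident of "Patent Town".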

[0049] As described above, by adding the notation obtained from the collocation dictionary 130 to the word lattice as nodes, it is possible to solve the problem that, in the initial use state, the conventional notation is not necessarily the conversion result.

(Embodiment 2)

In Embodiment 1, the appearance probability of the collocation is recorded in the collocation dictionary 130 together with the reading character string and the notation character string. However, the appearance probability may instead be recorded in the language dictionary 120, or in a separate dictionary or table.

[0051] The above-described embodiments are examples for explaining the present invention. The technical idea of the present invention is shown in the scope of claims, and it will be easily understood by those skilled in the art that various improvements to the above-described embodiment exist based on this technical idea.

Industrial applicability

[0052] According to the present invention, adding the notation obtained from the collocation dictionary to the word lattice as nodes alleviates the problem that the conventional notation is not obtained as the conversion result in the initial use state. In addition, depending on the frequency of use, a notation other than the conventional notation that has the same reading can also be obtained as the conversion result.

Claims

The scope of the claims

[1] A character processing device for converting a reading character string for a plurality of words into a corresponding plurality of notation character strings using a word lattice, the character processing device comprising:

first storage means for storing a collocation dictionary having a plurality of different records, each record containing a reading character string of a collocation and the corresponding notation character string; and

second information processing means for acquiring, as a conversion candidate, the notation character string on the path having the highest appearance probability on the word lattice to which the notation character string of the collocation has been added as a node, based on the appearance probabilities stored in the second storage means and the third storage means.

[2] The character processing device according to claim 1, wherein the first storage unit is used as a third storage unit, and an appearance probability is included in the record.

[3] A character processing method for a character processing device that uses a word lattice to convert a reading character string for a plurality of words into a corresponding plurality of notation character strings,

the character processing device having: first storage means storing a collocation dictionary having a plurality of different records, each record containing a reading character string of a collocation and the corresponding notation character string; second storage means storing a plurality of sets of appearance probabilities; third storage means storing the appearance probabilities of the notations of the plurality of collocations; search means; first information processing means; and second information processing means;

wherein the search means searches the collocation dictionary with the reading character string of the plurality of words to be converted, and obtains the notation character strings of the plurality of words that correspond to the entry in the collocation dictionary having the same reading; and

the notation character strings obtained by the search means are added to the word lattice as a plurality of nodes by the first information processing means.

[4] The character processing method for a character processing device according to claim 3, wherein the first storage means is used as the third storage means, and an appearance probability is included in the record.

[5] A character processing program for a character processing device that uses a word lattice to convert a plurality of word reading character strings into a plurality of corresponding character strings,

The character processing device has: first storage means storing a collocation dictionary having a plurality of different records, each record containing a reading character string of a collocation and the corresponding notation character string; second storage means storing a plurality of sets of appearance probabilities; third storage means storing the appearance probabilities of the notations of the plurality of collocations; search means; first information processing means; and second information processing means;

A character processing program comprising:

[6] The character processing program according to claim 5, wherein the first storage means is used as the third storage means, and an appearance probability is included in the record.

[7] A recording medium on which the program according to claim 5 or 6 is recorded.