WO2006051647A1 - Text data structure and text data processing method - Google Patents

Text data structure and text data processing method

Info

Publication number
WO2006051647A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
text data
character string
processing program
character
Prior art date
Application number
PCT/JP2005/016504
Other languages
French (fr)
Japanese (ja)
Inventor
Tadashi Honda
Original Assignee
Advanced Design Corp.
Priority date
Filing date
Publication date
Application filed by Advanced Design Corp.
Publication of WO2006051647A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G06F40/129 Handling non-Latin characters, e.g. kana-to-kanji conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/53 Processing of non-Latin text

Definitions

  • Text data structure, text data processing method, text data processing program, and recording medium recording the text data processing program
  • The present invention relates to a text data structure for a language that includes at least ideographic characters, a text data processing method for generating text data having that data structure, a text data processing program, and a recording medium on which the text data processing program is recorded.
  • An object of the present invention is to provide a text data structure, a text data processing method, a text data processing program, and a recording medium on which the text data processing program is recorded, so that the breaks (clauses) in a sentence to be translated can be grasped accurately.
  • The text data structure according to claim 1 of the present invention is a text data structure in which character code data capable of specifying the character type of each character, including at least ideographic characters, is arranged.
  • Based on clause information acquired from a conversion processing program that converts an input phonetic character string into a character string including ideographic characters, the clause being the unit of conversion into ideographic characters, clause-specifying data that can specify the character code data included in each converted clause is included together with the character code data.
  • The text data structure according to claim 2 of the present invention is the text data structure according to claim 1.
  • The character code data of the phonetic characters from which the ideographic characters were converted is included, as furigana (reading) data for the converted character string, in association with the clauses of that string. The furigana can therefore be specified accurately and can also be used to aid translation.
  • The text data structure according to claim 3 of the present invention is the text data structure according to claim 1 or 2.
  • Part-of-speech data, acquired from the conversion processing program, that can identify the part of speech of the character string included in each clause is included in association with that clause.
  • The text data processing method according to claim 4 of the present invention is as follows.
  • Clause-specifying data that can specify the character code data included in each clause of the converted character string is inserted into the text data of the converted character string. According to this feature, the characters included in each clause can be specified from the clause-specifying data contained in the text data, so the breaks between clauses in a sentence can be grasped accurately. When a sentence based on this text data is converted into another language, the size and processing time of the translation program can be reduced.
  • The text data processing method according to claim 5 of the present invention is the text data processing method according to claim 4.
  • The character code data of the phonetic characters from which the ideographic characters were converted, acquired from the conversion processing program, is inserted into the text data of the converted character string as furigana data, in association with the clauses of that string. According to this feature, the furigana can be specified accurately and can also be used to aid translation.
  • The text data processing method according to claim 6 of the present invention is the text data processing method according to claim 4 or 5.
  • Part-of-speech data, acquired from the conversion processing program, that can identify the part of speech of the character string included in each clause is inserted into the text data in association with that clause.
  • The text data processing program according to claim 7 of the present invention includes a clause information acquisition step and a clause-specifying data insertion step.
  • The text data processing program according to claim 8 of the present invention is the text data processing program according to claim 7.
  • It includes a furigana data insertion step in which the character code data of the phonetic characters from which the ideographic characters were converted, acquired from the conversion processing program, is inserted into the text data of the converted character string as furigana data, in association with the clauses of that string.
  • The furigana can be specified accurately and can also be used to aid translation.
  • The text data processing program according to claim 9 of the present invention is the text data processing program according to claim 7 or 8.
  • The recording medium according to claim 10 of the present invention is characterized in that the text data processing program according to any one of claims 7 to 9 is recorded on it.
  • The text data processing program can be read from the recording medium and used easily.
  • FIG. 1 is a flowchart showing the processing of the conversion processing program used in an embodiment of the present invention.
  • FIG. 2 is a flowchart showing the processing of the text data processing program according to the embodiment of the present invention.
  • FIG. 3 is a diagram showing the structure of the text data generated by the text data processing program according to the embodiment of the present invention.
  • FIG. 1 is a flowchart showing the processing of the kana-kanji conversion processing program, which is the conversion processing program used in this embodiment.
  • FIG. 2 is a flowchart showing the processing of the text data processing program used in this embodiment.
  • The kana-kanji conversion processing program and the text data processing program used in this embodiment are installed from a recording medium such as a CD-ROM onto a computer such as a personal computer (not shown) and executed on that computer.
  • The text data processing program of this embodiment is a plug-in module of the kana-kanji conversion processing program, which is the main program, and the kana-kanji conversion processing program can operate without the text data processing program.
  • A commercially available, known Japanese input tool can be used as the kana-kanji conversion processing program. Its processing is explained briefly with reference to FIG. 1 and FIG. 3: when, for example, the kana-kanji conversion processing program accepts the input 「がっしゅうこくさいこうさいばんしょちかみち」 as the sentence to be converted, as shown in FIG. 3 (S1), it specifies the clauses of the accepted sentence.
  • The clauses can be specified using, for example, the known minimum cost method. Specifically, the sentence is divided into the clauses 「がっしゅうこく」, 「さいこうさいばんしょ」, and 「ちかみち」 (S2).
  • The kana-kanji conversion processing program then extracts from its dictionary data all candidate kanji (ideographic) character strings corresponding to the kana (phonetic) character string in each specified clause (S3), and in response to a conversion operation displays a representative candidate for each clause as the converted sentence.
  • If a further conversion operation is performed, the candidate character strings extracted in S3 are displayed so that one can be selected (S4).
  • After the converted words are accepted through a selection (confirmation) operation on the candidate character strings (S5), the text data containing the kanji confirmed by that acceptance, specifically the kanji text data 「合衆国最高裁判所近道」 ("United States Supreme Court shortcut"), is output to the text data processing program.
  • In response, the text data processing program detects the output of the kanji text data in Sb1 and proceeds to Sb2, as shown in FIG. 2.
  • In Sb2, for the output kanji text data, it sends the kana-kanji conversion processing program a conversion information output request covering the clause information specified in the conversion, the kana character string before conversion contained in each clause, and the part-of-speech data of the word containing the converted kanji in each clause (the converted word), and acquires that conversion information.
  • That is, in Sb2 the clause information, the clause being the unit of conversion into ideographic characters, is acquired from the kana-kanji conversion processing program, which is a conversion processing program that converts an input phonetic character string into kanji text data, a character string including ideographic characters. Sb2 thus forms the clause information acquisition step of the present invention.
  • In this embodiment the text data processing program sends a conversion information output request to the kana-kanji conversion processing program.
  • The present invention is not limited to this.
  • The kana-kanji conversion processing program may instead output the conversion information for the converted text to the text data processing program together with the converted text, including kanji, that it has produced.
  • The kana-kanji conversion processing program outputs to the text data processing program conversion information comprising the clause information specified in the kana-kanji conversion of the output kanji text data, the kana character string before conversion contained in each clause, and the part-of-speech data of the word containing the converted kanji in each clause (the converted word).
  • Using the converted sentence and the conversion information output from the kana-kanji conversion processing program, the text data processing program of this embodiment specifies the range of each clause in the kanji text data and places clause-specifying data at each clause boundary. Specifically, as shown in FIG. 3, the boundary between the clauses 「合衆国」 and 「最高裁判所」 falls between the character code data of 「国」 and 「最」, and a clause-specifying character [...] is inserted there.
  • The process then proceeds to Sb4. The part-of-speech data of the character string in each clause, contained in the conversion information acquired from the kana-kanji conversion processing program, consists of a part-of-speech code uniquely assigned to each identifiable part of speech (in practice, the part-of-speech code stored for each word in the dictionary data of the kana-kanji conversion program).
  • This part-of-speech data is inserted at the end position between the clause-specifying characters that bound the data range of each clause, and the code data "008F", a special character [...] indicating that the inserted data is not part of the converted sentence, is placed at the head of the part-of-speech data.
  • That is, in Sb4 the part-of-speech data acquired from the kana-kanji conversion processing program in Sb2 is inserted into the text data in association with the clause. Sb4 thus forms the part-of-speech data insertion step of the present invention.
  • The furigana data, that is, the kana for each clause contained in the conversion information acquired from the kana-kanji conversion processing program, is inserted in the same way as the part-of-speech data: it is preceded by the code data "008F", the special character indicating data other than the converted sentence, and by a symbol of two diagonal lines sloping down to the right, as shown in FIG. 3.
  • As shown in FIG. 3, each clause can be identified by the code data "007F", the symbol of a diagonal line sloping down to the left that serves as the clause-specifying character. Extended text data having the text data structure of the present invention is thus generated, with the part-of-speech data and the furigana data of the character string in each clause placed between the clause-specifying characters.
  • That is, in Sb4 the character code data of the phonetic characters (kana) from which the ideographic characters (kanji) were converted, acquired from the kana-kanji conversion processing program in Sb2, is inserted into the text data of the converted character string as furigana data, in association with the clauses of that string. Sb4 thus also forms the furigana data insertion step of the present invention.
  • In this embodiment a special character is used as the clause-specifying character.
  • This is preferable because the clause-specifying character is easily distinguished from other characters, which greatly reduces errors in identifying clauses, but the present invention is not limited to this.
  • The codes and characters used as clause-specifying characters can be selected as appropriate.
  • In this embodiment the clause-specifying data that can specify the character code data included in each converted clause takes the form of the clause-specifying character.
  • The present invention is not limited to this.
  • Other data may be used, for example character-count map data that lists how many characters from the beginning of the sentence form the first clause, how many of the following characters form the next clause, and so on.
  • Any such clause-specifying data that can specify the characters included in each clause may be selected as appropriate according to how the text data is used.
  • In this embodiment the part-of-speech data and the furigana data are included, but the present invention is not limited to this, and a configuration that omits them is also possible.
  • In this embodiment a kana-kanji conversion processing program is used, but the present invention is not limited to this. The conversion processing program may, needless to say, be a Chinese conversion program that converts a Roman character string entered in pinyin into Chinese characters, and the present invention can also be applied to the conversion of other phonograms into ideograms.
  • The character string included in a clause may be the reading of a proper noun or the like.
  • For example, when the American name "Mark" is converted to "Masashi", the converted characters serve not as ideograms but as phonograms.
  • In such a case, a type code that can specify use as a phonogram may be included together with the part-of-speech data.
  • Alternatively, the part-of-speech code for a foreign word may be included as the part-of-speech data.
  • When a converted ideographic character string is used phonetically, as in a name, the conversion processing program may accept input indicating that it is used as phonetic characters.
  • The text data processing program may then acquire conversion information that includes type data indicating that use.
  • In this embodiment the text data processing program is a plug-in module of the kana-kanji conversion processing program and is provided, on a recording medium or over a computer network, separately from the kana-kanji conversion processing program.
  • The present invention is not limited to this.
  • The text data processing program may be built into the kana-kanji conversion processing program so that the two are inseparable, and a kana-kanji conversion processing program that includes the text data processing program may be distributed.

Abstract

The size and processing time of a translation program are reduced. The text data structure is an arrangement of character code data that can identify the character type of each character, including at least ideographic characters. Together with the character code data, the structure contains segment-identifying data that identifies the character code data belonging to each segment after conversion. This data is based on segment information, the segment being the unit of conversion into ideographic characters, acquired from a conversion program that converts an input phonetic character string into a character string including ideographic characters.

Description

Specification
Text data structure, text data processing method, text data processing program, and recording medium recording the text data processing program
Technical field
[0001] The present invention relates to a text data structure for a language that includes at least ideographic characters, a text data processing method for generating text data having that data structure, a text data processing program, and a recording medium on which the text data processing program is recorded.
Background art
[0002] Conventionally, text data containing ideographic characters such as kanji has been entered by typing the reading or pronunciation of the kanji in phonetic characters such as romaji or hiragana and then converting that input into kanji.
Disclosure of the invention
Problems to be solved by the invention
[0003] When text data containing ideographic characters converted in this way is machine-translated into another language, the translator cannot understand the ideographic characters and therefore cannot determine exactly where the breaks between them fall. A sentence containing several ideographic words, for example 「合衆国最高裁判所近道」 ("United States Supreme Court shortcut"), admits several translations depending on where it is divided, so it is difficult to identify the breaks correctly and translate accurately. Identifying them requires processing and program code for selecting among many possible divisions, with the result that the translation program becomes large and translation takes time.
[0004] The present invention was made in view of these problems. Its object is to provide a text data structure, a text data processing method, a text data processing program, and a recording medium on which the text data processing program is recorded that allow the breaks (clauses) in a sentence to be grasped accurately when a sentence containing ideographic characters is converted into another language, thereby reducing the size and processing time of the translation program.
Means for solving the problem
[0005] To solve the above problem, the text data structure according to claim 1 of the present invention is a text data structure in which character code data capable of specifying the character type of each character, including at least ideographic characters, is arranged. It is characterized in that it includes, together with the character code data, clause-specifying data that can specify the character code data included in each clause after conversion. This data is based on clause information, the clause being the unit of conversion into ideographic characters, acquired from a conversion processing program that converts an input phonetic character string into a character string including ideographic characters.
According to this feature, the characters included in each clause can be specified from the clause-specifying data contained in the text data, so the breaks between clauses in a sentence can be grasped accurately. When a sentence containing ideographic characters, based on text data having this structure, is converted into another language, the size and processing time of the translation program can therefore be reduced.
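How a consumer such as a translation program might recover clause boundaries from text data of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the specification: it assumes the clause-specifying data is a single delimiter character placed after each clause, it uses code 007F for that delimiter because the embodiment mentions that code, and the function name `split_clauses` is hypothetical.

```python
# Assumed clause-specifying character (code 007F, as mentioned in the embodiment).
CLAUSE_DELIM = "\u007f"


def split_clauses(tagged_text: str) -> list[str]:
    """Return the clauses of a text in which each clause is followed by CLAUSE_DELIM."""
    # Splitting on the delimiter leaves an empty string after the last one; drop it.
    return [clause for clause in tagged_text.split(CLAUSE_DELIM) if clause]


# The example sentence, already divided into three clauses.
tagged = "合衆国" + CLAUSE_DELIM + "最高裁判所" + CLAUSE_DELIM + "近道" + CLAUSE_DELIM
clauses = split_clauses(tagged)
```

Because the boundaries are carried in the data, the consumer needs no segmentation logic of its own, which is the saving in program size and processing time that the feature describes.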
[0006] The text data structure according to claim 2 of the present invention is the text data structure according to claim 1, characterized in that it includes the character code data of the phonetic characters from which the ideographic characters were converted, as furigana (reading) data for the converted character string, in association with the clauses of the converted character string.
According to this feature, the furigana can be specified accurately and can also be used to aid translation.
[0007] The text data structure according to claim 3 of the present invention is the text data structure according to claim 1 or 2, characterized in that it includes part-of-speech data, acquired from the conversion processing program, that can identify the part of speech of the character string included in each clause, in association with that clause.
According to this feature, the part of speech of the character string included in each clause can be specified, and a more accurate translation can be performed on the basis of the specified part of speech.
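The three kinds of data described in claims 1 to 3 (the converted characters of a clause, its furigana, and its part of speech) can be pictured as one record per clause. The sketch below is an in-memory model only and makes no claim about the on-disk layout; the class name `Clause`, its field names, and the part-of-speech code "N" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clause:
    surface: str                     # converted characters (may include kanji)
    furigana: Optional[str] = None   # phonetic source string, if recorded
    pos_code: Optional[str] = None   # part-of-speech code, if recorded


def plain_text(clauses: list[Clause]) -> str:
    """Rebuild the ordinary converted sentence from its clauses."""
    return "".join(c.surface for c in clauses)


def reading(clauses: list[Clause]) -> str:
    """Rebuild the phonetic input, falling back to the surface form when no furigana is stored."""
    return "".join(c.furigana if c.furigana is not None else c.surface for c in clauses)


# "N" is an invented code standing in for "noun".
sentence = [
    Clause("合衆国", "がっしゅうこく", "N"),
    Clause("最高裁判所", "さいこうさいばんしょ", "N"),
    Clause("近道", "ちかみち", "N"),
]
```

Making the furigana and part-of-speech fields optional mirrors the dependent claims: a structure with only clause boundaries still satisfies claim 1.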
[0008] The text data processing method according to claim 4 of the present invention is characterized in that clause information, the clause being the unit of conversion into ideographic characters, is acquired from a conversion processing program that converts an input phonetic character string into a character string including ideographic characters, and, based on the acquired clause information, clause-specifying data that can specify the character code data included in each clause of the converted character string is inserted into the text data of the converted character string.
According to this feature, the characters included in each clause can be specified from the clause-specifying data contained in the text data, so the breaks between clauses in a sentence can be grasped accurately. When a sentence based on text data containing this clause-specifying data is converted into another language, the size and processing time of the translation program can therefore be reduced.
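The insertion step of this method can be sketched as below. The sketch assumes that the clause information arrives as a list of clause lengths counted in characters and that a delimiter with code 007F is written after each clause; both the representation of the clause information and the function name `insert_clause_delimiters` are illustrative assumptions rather than details confirmed by the specification.

```python
# Assumed clause-specifying character (code 007F, as mentioned in the embodiment).
CLAUSE_DELIM = "\u007f"


def insert_clause_delimiters(converted: str, clause_lengths: list[int]) -> str:
    """Insert CLAUSE_DELIM after each clause of the converted string."""
    if sum(clause_lengths) != len(converted):
        raise ValueError("clause lengths do not cover the converted string")
    parts = []
    pos = 0
    for length in clause_lengths:
        parts.append(converted[pos:pos + length])  # characters of this clause
        parts.append(CLAUSE_DELIM)                 # boundary marker
        pos += length
    return "".join(parts)


# Three clauses of 3, 5 and 2 characters.
tagged = insert_clause_delimiters("合衆国最高裁判所近道", [3, 5, 2])
```

The length check guards against clause information that does not match the converted string, which would otherwise put boundaries in the wrong place without any visible error.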
[0009] The text data processing method according to claim 5 of the present invention is the text data processing method according to claim 4, characterized in that the character code data of the phonetic characters from which the ideographic characters were converted, acquired from the conversion processing program, is inserted into the text data of the converted character string as furigana data for that string, in association with its clauses.
According to this feature, the furigana can be specified accurately and can also be used to aid translation.
[0010] The text data processing method according to claim 6 of the present invention is the text data processing method according to claim 4 or 5, characterized in that part-of-speech data, acquired from the conversion processing program, that can identify the part of speech of the character string included in each clause is inserted into the text data in association with that clause.
According to this feature, the part of speech of the character string included in each clause can be specified, and a more accurate translation can be performed on the basis of the specified part of speech.
[0011] The text data processing program according to claim 7 of the present invention is characterized in that it includes:
a clause information acquisition step of acquiring clause information, the clause being the unit of conversion into ideographic characters, from a conversion processing program that converts an input phonetic character string into a character string including ideographic characters; and
a clause-specifying data insertion step of inserting, based on the acquired clause information, clause-specifying data that can specify the character code data included in each clause of the converted character string into the text data of the converted character string.
According to this feature, the characters included in each clause can be specified from the clause-specifying data contained in the text data, so the breaks between clauses in a sentence can be grasped accurately. When a sentence based on text data containing this clause-specifying data is converted into another language, the size and processing time of the translation program can therefore be reduced.
[0012] The text data processing program according to claim 8 of the present invention is the text data processing program according to claim 7, characterized in that it includes a furigana data insertion step of inserting the character code data of the phonetic characters from which the ideographic characters were converted, acquired from the conversion processing program, into the text data of the converted character string as furigana data for that string, in association with its clauses.
According to this feature, the furigana can be specified accurately and can also be used to aid translation.
[0013] The text data processing program according to claim 9 of the present invention is the text data processing program according to claim 7 or 8, characterized in that it includes a part-of-speech data insertion step of inserting part-of-speech data, acquired from the conversion processing program, that can identify the part of speech of the character string included in each clause into the text data in association with that clause.
According to this feature, the part of speech of the character string included in each clause can be specified, and a more accurate translation can be performed on the basis of the specified part of speech.
[0014] The recording medium according to claim 10 of the present invention is characterized in that the text data processing program according to any one of claims 7 to 9 is recorded on it.
According to this feature, the text data processing program can be read from the recording medium and used easily.
Brief description of the drawings
[0015] [FIG. 1] A flowchart showing the processing of the conversion processing program used in an embodiment of the present invention.
[FIG. 2] A flowchart showing the processing of the text data processing program according to the embodiment of the present invention.
[FIG. 3] A diagram showing the structure of the text data generated by the text data processing program according to the embodiment of the present invention.
Best mode for carrying out the invention
[0016] An embodiment of the present invention is described below.
Embodiment
[0017] FIG. 1 is a flowchart showing the processing of the kana-kanji conversion processing program, which is the conversion processing program used in this embodiment, and FIG. 2 is a flowchart showing the processing of the text data processing program used in this embodiment.
[0018] The kana-kanji conversion processing program and the text data processing program used in this embodiment are installed from a recording medium such as a CD-ROM onto a computer such as a personal computer (not shown) and executed on that computer.
[0019] The text data processing program of this embodiment is a plug-in module of the kana-kanji conversion processing program, which is the main program, and the kana-kanji conversion processing program can operate without the text data processing program.
[0020] A commercially available kana-kanji conversion processing program (FEP) of the kind used as a well-known Japanese input tool can serve as the kana-kanji conversion processing program. Its processing is briefly described with reference to FIGS. 1 and 3. When the kana-kanji conversion processing program accepts, for example, the input 「がっしゅうこくさいこうさいばんしょちかみち」 (gasshuukoku saikou saibansho chikamichi) as the text to be converted, as shown in FIG. 3 (S1), it identifies the phrases (bunsetsu) of the accepted text. The phrases may be identified using, for example, the known minimum-cost method. Specifically, the text is divided into the phrases 「がっしゅうこく」, 「さいこうさいばんしょ」, and 「ちかみち」 (S2).
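The phrase identification in S2 can be illustrated with a much-simplified sketch. The code below is not the method of any particular FEP: it assumes a toy dictionary `TOY_COSTS` (hypothetical readings and costs) and picks the segmentation with the lowest total word cost by dynamic programming. A practical minimum-cost method would normally also weigh costs for connecting adjacent words, which this sketch omits.

```python
# Toy reading dictionary with made-up costs (assumption, for illustration only).
TOY_COSTS = {
    "がっしゅうこく": 2,
    "さいこうさいばんしょ": 2,
    "ちかみち": 2,
    "がっしゅうこくさいこう": 5,
    "さいばんしょ": 2,
}


def segment_min_cost(text, costs):
    """Split text into dictionary entries so that the summed cost is minimal."""
    inf = float("inf")
    best = [inf] * (len(text) + 1)   # best[i]: lowest cost of segmenting text[:i]
    back = [0] * (len(text) + 1)     # back[i]: start index of the last entry
    best[0] = 0
    for end in range(1, len(text) + 1):
        for start in range(end):
            entry = text[start:end]
            if entry in costs and best[start] + costs[entry] < best[end]:
                best[end] = best[start] + costs[entry]
                back[end] = start
    if best[-1] == inf:
        raise ValueError("text cannot be segmented with this dictionary")
    phrases = []
    end = len(text)
    while end > 0:                   # walk the back pointers from the end
        phrases.append(text[back[end]:end])
        end = back[end]
    return phrases[::-1]
```

With these toy costs, the reading from the example is split at the same boundaries as in S2, because the alternative split after 「さいこう」 has a higher total cost.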
[0021] Next, for the kana character string (a phonetic character string) contained in each identified phrase, the program extracts from the dictionary data included in the kana-kanji conversion processing program all candidate character strings containing the corresponding kanji (ideographic characters) (S3). In response to a conversion operation, it displays the representative candidate for each phrase as the converted text, and when a further conversion operation is performed, it displays the candidate character strings extracted in S3 so that one can be selected (S4).
[0022] After accepting the converted words through the operation of selecting (confirming) candidate character strings (S5), the program outputs the text data containing the confirmed kanji, specifically the kanji text data 「合衆国最高裁判所近道」, to the text data processing program described above.
[0023] In response to the output of this kanji text data, the text data processing program, as shown in FIG. 2, detects the output in Sb1 and proceeds to Sb2. There it sends the kana-kanji conversion processing program an output request for conversion information and obtains that information from it (Sb2). The conversion information comprises the phrase information identified during the kana-kanji conversion of the output kanji text data, the furigana of each phrase (the kana character string before conversion), and the part-of-speech data of the words containing converted kanji (converted words) in each phrase. In other words, in Sb2 the phrase information, which gives the units of conversion into ideographic characters, is acquired from the kana-kanji conversion processing program, that is, from the conversion processing program that converts an input phonetic character string into kanji text data (a character string including ideographic characters). Sb2 thus constitutes the phrase information acquisition step of the present invention.
[0024] In this embodiment the text data processing program sends the output request for the conversion information to the kana-kanji conversion processing program, but the present invention is not limited to this. For example, the kana-kanji conversion processing program may output the conversion information to the text data processing program together with the converted text containing the kanji.
[0025] In response to the output request, the kana-kanji conversion processing program outputs to the text data processing program the conversion information comprising the phrase information identified during the kana-kanji conversion of the output kanji text data, the furigana of each phrase (the kana character string before conversion), and the part-of-speech data of the words containing converted kanji (converted words) in each phrase.
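The conversion information described above can be pictured as one record per phrase. The following sketch is only an illustration: the class name `PhraseInfo`, its field names, and the part-of-speech code "N" are hypothetical, since the specification does not define a concrete format for the exchange between the two programs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhraseInfo:
    """One phrase of the conversion information (hypothetical layout)."""
    surface: str      # converted character string of the phrase, e.g. "合衆国"
    furigana: str     # kana character string before conversion
    pos_codes: tuple  # part-of-speech codes of the words in the phrase, in order


def matches_converted_text(kanji_text, phrases):
    """Check that the phrases, in order, cover the converted text exactly."""
    return "".join(p.surface for p in phrases) == kanji_text


# Conversion information for the example text; "N" is a made-up noun code.
EXAMPLE_INFO = [
    PhraseInfo("合衆国", "がっしゅうこく", ("N",)),
    PhraseInfo("最高裁判所", "さいこうさいばんしょ", ("N",)),
    PhraseInfo("近道", "ちかみち", ("N",)),
]
```

A consistency check such as `matches_converted_text` is a design choice of this sketch; it guards against phrase information that does not line up with the kanji text data it describes.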
[0026] Based on the phrase information contained in the conversion information thus obtained from the kana-kanji conversion processing program, the text data processing program of this embodiment identifies the extent of each phrase in the kanji text data (the converted text) output from the kana-kanji conversion processing program. At each boundary between the identified phrases, it inserts a phrase-specifying character (Sb3). Concretely, as shown in FIG. 3, the boundary between the phrase 「合衆国」 ("United States") and the phrase 「最高裁判所」 ("Supreme Court") lies between the character code data of 「国」 and 「最」. The inserted phrase-specifying character is the code data "007F": a character code to which no character type is assigned, specifically the Shift-JIS code "007F", which is here assigned as a special character (a symbol of two diagonal lines descending to the left) for identifying phrases. The characters whose character codes lie between two such "007F" code data can then be identified as the characters belonging to one phrase. In other words, in Sb3, based on the phrase information acquired from the kana-kanji conversion processing program in Sb2, a special character serving as phrase-specifying data, which makes it possible to identify the character code data contained in each phrase of the converted kanji character string, is inserted into the kanji text data of the converted kanji character string. Sb3 thus constitutes the phrase-specifying data insertion step of the present invention.
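The insertion performed in Sb3 can be sketched as follows. The sketch works on Python strings and uses the code point U+007F as a stand-in for the "007F" code data; the byte-level Shift-JIS representation is outside its scope. The function name and the use of per-phrase character counts as the phrase information are assumptions made for illustration. The sketch places a delimiter only between phrases, which is one reading of the description.

```python
PHRASE_DELIM = "\x7f"  # stand-in for the phrase-specifying character "007F"


def insert_phrase_delimiters(kanji_text, phrase_lengths):
    """Insert the phrase-specifying character at every phrase boundary.

    phrase_lengths gives the number of characters in each phrase, in order
    (a hypothetical representation of the phrase information).
    """
    if sum(phrase_lengths) != len(kanji_text):
        raise ValueError("phrase lengths do not cover the text exactly")
    parts = []
    pos = 0
    for length in phrase_lengths:
        parts.append(kanji_text[pos:pos + length])
        pos += length
    return PHRASE_DELIM.join(parts)
```

For the example text, the phrase lengths 3, 5 and 2 place one delimiter between 「国」 and 「最」 and one between 「所」 and 「近」.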
[0027] After inserting the phrase-specifying characters, the process proceeds to Sb4. Here the part-of-speech data of the character strings in each phrase, contained in the conversion information obtained from the kana-kanji conversion processing program, is inserted. Specifically, this data consists of part-of-speech codes, each uniquely assigned to one part of speech so that the part of speech can be identified (in practice these correspond to the part-of-speech codes stored for each word in the dictionary data of the kana-kanji conversion processing program). The codes are inserted, in the order of the parts of speech within the phrase, at the tail end of the data range of each phrase, that is, of the range between phrase-specifying characters. The part-of-speech data is preceded by the code data "008F", a special character indicating that the inserted data is not part of the converted text: a character code to which no character type is assigned, specifically the Shift-JIS code "008F", which is here assigned as a special character (a symbol of two diagonal lines descending to the right). In other words, in Sb4 the part-of-speech data acquired from the kana-kanji conversion processing program in Sb2, consisting of part-of-speech codes that identify the parts of speech of the character strings in each phrase, is inserted into the text data in association with that phrase. Sb4 thus constitutes the part-of-speech data insertion step of the present invention.
[0028] Further, at the position following the part-of-speech data, the kana character data serving as the furigana of each phrase, contained in the conversion information obtained from the kana-kanji conversion processing program, is inserted. Like the part-of-speech data, it is preceded by the code data "008F", the special character (a symbol of two diagonal lines descending to the right) indicating data other than the converted text. As a result, as shown in FIG. 3, extended text data having the text data structure of the present invention is generated: the phrases can be identified by the "007F" code data (the phrase-specifying character, a symbol of two diagonal lines descending to the left), and the part-of-speech data and furigana data of the character string in each phrase are contained between the phrase-specifying characters that delimit that phrase. In other words, in Sb4 the character code data of the phonetic characters (kana) from which the ideographic characters (kanji) were converted, acquired from the kana-kanji conversion processing program in Sb2, is inserted into the text data of the converted character string as furigana data, in association with the phrases of the converted character string. Sb4 thus also constitutes the furigana data insertion step of the present invention.
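The extended text data produced by Sb3 and Sb4 can be sketched as below. U+007F and U+008F stand in for the "007F" and "008F" code data, and each phrase is laid out as converted text, marker, part-of-speech codes, marker, furigana. How several part-of-speech codes are encoded is not specified in the description; the sketch assumes single-character codes concatenated in order, and the code "N" (noun) is hypothetical.

```python
PHRASE_DELIM = "\x7f"  # stand-in for "007F": phrase-specifying character
META_MARK = "\x8f"     # stand-in for "008F": marks data other than converted text


def build_extended_text(phrases):
    """Build extended text data from (surface, pos_codes, furigana) triples.

    pos_codes is a sequence of single-character codes (an assumption).
    """
    fields = []
    for surface, pos_codes, furigana in phrases:
        fields.append(
            surface + META_MARK + "".join(pos_codes) + META_MARK + furigana
        )
    return PHRASE_DELIM.join(fields)


EXAMPLE_PHRASES = [
    ("合衆国", ["N"], "がっしゅうこく"),
    ("最高裁判所", ["N"], "さいこうさいばんしょ"),
    ("近道", ["N"], "ちかみち"),
]
```

Calling `build_extended_text(EXAMPLE_PHRASES)` yields a single string in which each phrase carries its own part-of-speech data and furigana at its tail end.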
[0029] When the converted text 「合衆国最高裁判所近道」 is translated into another language, two kinds of translation are possible: one based on the phrases 「合衆国」 ("United States"), 「最高裁判所」 ("Supreme Court"), and 「近道」 ("shortcut"), and a mistranslation that results from dividing the text into 「合衆国最高」, 「裁判所」, and 「近道」. With the extended text data having the text data structure of this embodiment, the phrase-specifying characters contained in the text data identify the character strings of the phrases as 「合衆国」, 「最高裁判所」, and 「近道」, and the furigana and parts of speech can be identified as well. The translation program therefore does not need to perform processing to determine the phrases or to include a phrase determination program, so the size and processing time of the translation program can be reduced. For example, if this text data structure is applied to the text contained in HTML, which is used as the description language for web pages on the Internet, then when a Chinese user views a Japanese web page, or conversely when a Japanese user views a Chinese web page, the text can be translated and displayed accurately and quickly, which greatly improves convenience for the user.
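A consumer such as a translation program could recover the phrases without running its own segmentation, roughly as sketched here. The layout assumed is the one described for the embodiment (phrases separated by "007F", with part-of-speech data and furigana each introduced by "008F"); U+007F and U+008F are stand-ins for those codes, and the function names are hypothetical.

```python
PHRASE_DELIM = "\x7f"  # stand-in for "007F"
META_MARK = "\x8f"     # stand-in for "008F"


def parse_extended_text(extended):
    """Return a list of (surface, pos_data, furigana) triples, one per phrase.

    Phrases without part-of-speech or furigana data yield empty strings.
    """
    result = []
    for field in extended.split(PHRASE_DELIM):
        surface, _, rest = field.partition(META_MARK)
        pos_data, _, furigana = rest.partition(META_MARK)
        result.append((surface, pos_data, furigana))
    return result


def plain_text(extended):
    """Recover the converted text without any inserted data."""
    return "".join(surface for surface, _, _ in parse_extended_text(extended))
```

Because the phrase boundaries are read directly from the data, the split into 「合衆国最高」 and 「裁判所」 cannot arise at this stage.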
[0030] An embodiment of the present invention has been described above with reference to the drawings, but the specific configuration is not limited to this embodiment. Changes and additions that do not depart from the gist of the present invention are also included in the present invention.
[0031] For example, the embodiment uses a special character as the phrase-specifying character. This is preferable because the phrase-specifying character is easy to distinguish from other characters, which greatly reduces errors in identifying phrases. However, the present invention is not limited to this, and the code or character used as the phrase-specifying character may be selected as appropriate.
[0032] In the embodiment, the phrase-specifying data that identifies the character code data contained in each converted phrase takes the form of a phrase-specifying character, but the present invention is not limited to this. The characters contained in each phrase may instead be identified using, for example, character-count map data in which the numbers of characters in the phrases are arranged in order from the beginning, that is, data indicating how many characters from the start of the text form one phrase, how many of the following characters form the next phrase, and so on. The phrase-specifying data may be selected as appropriate for the way the text data is used.
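The character-count map variant can be sketched as follows. The representation of the map as a plain list of integers and both function names are assumptions made for illustration; the description only states that the numbers of characters per phrase are arranged in order from the beginning.

```python
def to_count_map(phrases):
    """Build a character-count map from the phrases in order."""
    return [len(phrase) for phrase in phrases]


def split_by_count_map(text, count_map):
    """Recover the phrases of text from a character-count map."""
    if sum(count_map) != len(text):
        raise ValueError("count map does not cover the text exactly")
    phrases = []
    pos = 0
    for count in count_map:
        phrases.append(text[pos:pos + count])
        pos += count
    return phrases
```

In this variant the converted text itself stays unmodified and the map is carried alongside it, which may suit uses where inserted characters in the text are undesirable.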
[0033] The embodiment includes part-of-speech data and furigana data, but the present invention is not limited to this, and a configuration that does not include the part-of-speech data or the furigana data is also possible.
[0034] The embodiment illustrates a Japanese kana-kanji conversion processing program as the conversion processing program, but the present invention is not limited to this. The conversion processing program may of course be a Chinese conversion processing program that converts a Roman character string entered in pinyin into Chinese characters, and the present invention can also be applied to other cases in which phonetic characters are converted into ideographic characters.
[0035] The character string in a phrase may be the reading of a proper noun or the like. For example, the American name "Mark" (マーク) may be converted to 「真握」, with the kanji used as phonetic characters rather than as ideographic characters. So that this can be identified in translation and similar processing, a type code indicating whether the character string in the phrase is used as ideographic characters or as phonetic characters may be included in the part-of-speech data together with the part-of-speech code. Likewise, so that substitute characters (ateji) for loanwords can be identified, a part-of-speech code for loanwords may be included in the part-of-speech data. Further, when such proper nouns are entered and the converted ideographic character string is to be used phonetically, as in a name, the conversion processing program may accept an indication of this from the operator when the conversion is specified, and the text data processing program may acquire conversion information that includes the type data indicating phonetic use.
The embodiment shows the text data processing program as a plug-in module program for the kana-kanji conversion processing program, so that the text data processing program can be distributed separately from the kana-kanji conversion processing program on a recording medium or over a computer network. However, the present invention is not limited to this. The text data processing program may be included in the kana-kanji conversion processing program so that the two cannot be separated, and the kana-kanji conversion processing program including the text data processing program may be distributed.

Claims

[1] A text data structure in which character code data capable of identifying each character, including at least ideographic characters, is arranged,
characterized in that it contains, together with the character code data, phrase-specifying data capable of identifying the character code data contained in each converted phrase, based on phrase information that gives the units of conversion into the ideographic characters and is acquired from a conversion processing program that converts an input phonetic character string into a character string including ideographic characters.
[2] The text data structure according to claim 1, characterized in that it contains the character code data of the phonetic characters from which the ideographic characters were converted, as furigana data of the converted character string, in association with the phrases of the converted character string.
[3] The text data structure according to claim 1 or 2, characterized in that it contains part-of-speech data, acquired from the conversion processing program and capable of identifying the part of speech of the character string contained in each phrase, in association with that phrase.
[4] A text data processing method characterized by acquiring, from a conversion processing program that converts an input phonetic character string into a character string including ideographic characters, phrase information that gives the units of conversion into the ideographic characters, and, based on the acquired phrase information, inserting into the text data of the converted character string phrase-specifying data capable of identifying the character code data contained in each phrase of the converted character string.
[5] The text data processing method according to claim 4, characterized in that the character code data of the phonetic characters from which the ideographic characters were converted, acquired from the conversion processing program, is inserted into the text data of the converted character string as furigana data of the converted character string, in association with the phrases of the converted character string.
[6] The text data processing method according to claim 4 or 5, characterized in that part-of-speech data, acquired from the conversion processing program and capable of identifying the part of speech of the character string contained in each phrase, is inserted into the text data in association with that phrase.
[7] A text data processing program characterized by comprising:
a phrase information acquisition step of acquiring, from a conversion processing program that converts an input phonetic character string into a character string including ideographic characters, phrase information that gives the units of conversion into the ideographic characters; and
a phrase-specifying data insertion step of inserting, based on the acquired phrase information, phrase-specifying data capable of identifying the character code data contained in each phrase of the converted character string into the text data of the converted character string.
[8] The text data processing program according to claim 7, characterized by comprising a furigana data insertion step of inserting the character code data of the phonetic characters from which the ideographic characters were converted, acquired from the conversion processing program, into the text data of the converted character string as furigana data of the converted character string, in association with the phrases of the converted character string.
[9] The text data processing program according to claim 7 or 8, characterized by comprising a part-of-speech data insertion step of inserting part-of-speech data, acquired from the conversion processing program and capable of identifying the part of speech of the character string contained in each phrase, into the text data in association with that phrase.
[10] A recording medium on which a text data processing program is recorded, characterized in that the text data processing program according to any one of claims 7 to 9 is recorded on it.
PCT/JP2005/016504 2004-11-15 2005-09-08 Text data structure and text data processing method WO2006051647A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2004330696A JP2006139692A (en) 2004-11-15 2004-11-15 Text data structure, text data processing method, text data processing program, and recording medium having recorded the same
JP2004-330696 2004-11-15

Publications (1)

Publication Number Publication Date
WO2006051647A1 true WO2006051647A1 (en) 2006-05-18

Family

ID=36336330

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2005/016504 WO2006051647A1 (en) 2004-11-15 2005-09-08 Text data structure and text data processing method

Country Status (4)

Country Link
JP (1) JP2006139692A (en)
KR (1) KR20070083757A (en)
CN (1) CN101057234A (en)
WO (1) WO2006051647A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102460437A (en) * 2009-06-26 2012-05-16 乐天株式会社 Information search device, information search method, information search program, and storage medium on which information search program has been stored

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943763A (en) * 2017-11-29 2018-04-20 广州迈安信息科技有限公司 A kind of big text data processing method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61279973A (en) * 1985-06-06 1986-12-10 Ricoh Co Ltd Japanese processor
JPS638860A (en) * 1986-06-27 1988-01-14 Matsushita Electric Ind Co Ltd Kana/kanji converting device
JPH07141382A (en) * 1993-11-19 1995-06-02 Sharp Corp Foreign-language documentation support device

Also Published As

Publication number Publication date
CN101057234A (en) 2007-10-17
KR20070083757A (en) 2007-08-24
JP2006139692A (en) 2006-06-01

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 1020077009140

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: 200580038656.1

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 05782277

Country of ref document: EP

Kind code of ref document: A1