WO2006051647A1 - Text data structure and text data processing method

Info

Publication number: WO2006051647A1
Application number: PCT/JP2005/016504
Authority: WO
Inventors: Tadashi Honda
Original assignee: Advanced Design Corp.
Priority date: 2004-11-15
Filing date: 2005-09-08
Publication date: 2006-05-18
Also published as: CN101057234A; KR20070083757A; JP2006139692A

Abstract

The size and processing time of a translation program are reduced. A text data structure is composed by arranging character code data of characters which include at least ideographic characters and whose character type can be identified. Together with the character code data, the text data structure contains segment identifying data for identifying the character code data included in each converted segment. The segment identifying data is based on information on the segments, which are the units of conversion into ideographic characters, acquired from a conversion program that converts an input phonographic character string into a character string including ideographic characters.

Description

Specification

Text data structure, text data processing method, text data processing program, and recording medium recording the text data processing program

Technical field

[0001] The present invention relates to a text data structure of a language including at least ideographic characters, a text data processing method for generating text data of that data structure, a text data processing program, and a recording medium on which the text data processing program is recorded.

Background art

[0002] Conventionally, text data containing ideographic characters such as kanji has been produced by entering the reading or pronunciation with phonetic characters such as romaji or hiragana and then converting that input into kanji.

Disclosure of the invention

Problems to be solved by the invention

[0003] When text data including these converted ideograms is machine-translated into another language, the translation program cannot by itself determine accurately where the strings of ideographic characters are divided. For a sentence in which several ideographic words run together, such as the sentence “Supreme Court of the United States Supreme Court”, several translations with different meanings are possible depending on where the breaks are placed. It is therefore difficult to identify the phrase breaks correctly and to translate accurately, and identifying them correctly requires processing and programs that choose among a wide variety of possible divisions. As a result, the translation program becomes large and translation takes time.

[0004] The present invention has been made in view of these problems. Its object is to provide a text data structure, a text data processing method, a text data processing program, and a recording medium on which the text data processing program is recorded, with which the breaks used for translation (the phrases) can be grasped accurately, so that the capacity and processing time of a translation program can be reduced when a sentence containing ideograms is converted into another language.

Means for solving the problem

[0005] In order to solve the above problem, the text data structure according to claim 1 of the present invention is a text data structure in which character code data capable of specifying the character type of each character, including at least ideographic characters, is arranged,

characterized in that phrase specifying data is included together with the character code data, the phrase specifying data being able to specify the character code data included in each converted phrase on the basis of phrase information, the unit of conversion to ideographic characters, obtained from a conversion processing program that converts an input phonogram string into a character string including ideograms.

According to this feature, the characters included in each phrase can be specified from the phrase specifying data included in the text data, so the breaks between phrases in a sentence can be grasped accurately. When a sentence containing ideograms is converted into another language on that basis, the capacity and processing time of the translation program can be reduced.

[0006] The text data structure according to claim 2 of the present invention is the text data structure according to claim 1,

characterized in that the character code data of the phonetic characters from which the ideographic characters were converted is included, in association with the phrases of the converted character string, as kana data of the converted character string.

According to this feature, the kana reading can be identified accurately and used for translation.

[0007] The text data structure according to claim 3 of the present invention is the text data structure according to claim 1 or 2,

characterized in that part-of-speech data, acquired from the conversion processing program and able to identify the part of speech of the character string included in each phrase, is included in association with the phrase.

According to this feature, the part of speech of the character string included in each phrase can be specified, and more accurate translation can be performed on the basis of the specified part of speech.

[0008] A text data processing method according to claim 4 of the present invention includes:

acquiring phrase information, the unit of conversion to ideograms, from a conversion processing program that converts an input phonogram string into a character string including ideograms, and, on the basis of the acquired phrase information, inserting phrase specifying data that can specify the character code data included in each phrase of the converted character string into the text data of the converted character string.

According to this feature, the characters included in each phrase can be specified from the phrase specifying data included in the text data, so the breaks between phrases in a sentence can be grasped accurately. When a sentence containing ideograms is converted into another language on that basis, the capacity and processing time of the translation program can be reduced.

[0009] A text data processing method according to claim 5 of the present invention is the text data processing method according to claim 4,

characterized in that the character code data of the phonograms from which the ideographic characters were converted, obtained from the conversion processing program, is inserted into the text data of the converted character string, in association with the phrases of the converted character string, as kana data of the converted character string.

According to this feature, the kana reading can be identified accurately and used for translation.

[0010] The text data processing method according to claim 6 of the present invention is the text data processing method according to claim 4 or 5,

The part-of-speech data that can identify the part-of-speech of the character string included in each clause is inserted into the text data in association with the clause.

[0011] A text data processing program according to claim 7 of the present invention provides:

a phrase information acquisition step of acquiring phrase information, the unit of conversion to ideograms, from a conversion processing program that converts an input phonogram string into a character string including ideograms; and a phrase specifying data insertion step of inserting, on the basis of the acquired phrase information, phrase specifying data that can specify the character code data included in each phrase of the converted character string into the text data of the converted character string.

According to this feature, the characters included in each phrase can be specified from the phrase specifying data included in the text data, so the breaks between phrases in a sentence can be grasped accurately. When a sentence containing ideograms is converted into another language on that basis, the capacity and processing time of the translation program can be reduced.

[0012] A text data processing program according to claim 8 of the present invention is the text data processing program according to claim 7,

characterized by including a kana data insertion step of inserting the character code data of the phonograms from which the ideographic characters were converted, obtained from the conversion processing program, into the text data of the converted character string, in association with the phrases of the converted character string, as kana data of the converted character string.

According to this feature, the kana reading can be identified accurately and used for translation.

[0013] A text data processing program according to claim 9 of the present invention is the text data processing program according to claim 7 or 8,

characterized by including a part-of-speech data insertion step of inserting the acquired part-of-speech data, which can specify the part of speech of the character string included in each phrase, into the text data in association with the phrase.

[0014] A recording medium on which a text data processing program according to claim 10 of the present invention is recorded is characterized in that the text data processing program according to any one of claims 7 to 9 is recorded.

According to this feature, the text data processing program can be used easily by reading it from the recording medium.

Brief Description of Drawings

FIG. 1 is a flowchart showing the processing contents of a conversion processing program used in an embodiment of the present invention.

FIG. 2 is a flowchart showing the processing contents of the text data processing program according to the embodiment of the present invention.

FIG. 3 is a diagram showing a structure of text data generated by a text data processing program according to an embodiment of the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

Examples of the present invention will be described below.

Example

FIG. 1 is a flowchart showing the processing contents of the kana-kanji conversion processing program, which is the conversion processing program used in this embodiment, and FIG. 2 is a flowchart showing the processing contents of the text data processing program used in this embodiment.

[0018] The kana-kanji conversion processing program and the text data processing program used in the present embodiment are installed in a computer such as a personal computer (not shown) from a recording medium such as a CD-ROM and executed on the computer.

[0019] The text data processing program of this embodiment is a plug-in module program of the kana-kanji conversion processing program, which is the main program. The kana-kanji conversion processing program can operate without the text data processing program.

[0020] A known, commercially available Japanese input tool, that is, a kana-kanji conversion processing program (FEP), can be used as the kana-kanji conversion processing program. Its processing is explained briefly with reference to FIG. 1 and FIG. 3. The kana-kanji conversion processing program accepts, for example, the input “Gasshukokusaikousaibanshochikamichi” as the sentence to be converted, as shown in FIG. 3 (S1), and identifies the phrases of the accepted sentence. The phrases can be identified using, for example, the known minimum cost method. Specifically, the phrases “Gasshukoku”, “Saikousaibansho”, and “Chikamichi” are identified from the sentence to be converted (S2).

[0021] Then, the kana-kanji conversion processing program extracts from its data all the kanji conversion candidate character strings, that is, the ideogram strings corresponding to the kana character string (the phonogram string) included in each identified phrase (S3). In response to the conversion operation, it displays the representative conversion candidate of each phrase as the converted sentence, and displays the conversion candidate character strings extracted in S3 so that they can be selected (S4).

[0022] After the converted words are accepted through the selection operation (confirmation operation) on the conversion candidate character strings (S5), the program outputs the text data including the kanji, that is, the ideograms fixed by that acceptance, specifically the kanji text data of “US Supreme Court shortcut”, to the text data processing program.
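The phrase identification in S2 can be pictured with a minimal sketch of a minimum cost segmentation. The dictionary, the cost values, the fixed per-phrase penalty, and the romanised spellings below are all assumptions made for illustration; a commercial FEP would use its own dictionary and a richer cost model (for example, connection costs between adjacent words).

```python
import math

# Hypothetical toy dictionary: reading -> word cost (lower is preferred).
WORD_COST = {
    "gasshukoku": 2,
    "saikou": 3,
    "saibansho": 3,
    "saikousaibansho": 3,
    "chikamichi": 2,
    "chika": 4,
    "michi": 4,
}
PHRASE_PENALTY = 1  # assumed fixed cost per phrase; favours fewer phrases


def segment(reading):
    """Split `reading` into dictionary entries with minimum total cost."""
    n = len(reading)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost of reading[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last phrase
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            word = reading[start:end]
            if word in WORD_COST and best[start] < math.inf:
                cost = best[start] + WORD_COST[word] + PHRASE_PENALTY
                if cost < best[end]:
                    best[end] = cost
                    back[end] = start
    if best[n] == math.inf:
        raise ValueError("no segmentation found")
    # Walk the back-pointers from the end to recover the phrases.
    phrases = []
    end = n
    while end > 0:
        start = back[end]
        phrases.append(reading[start:end])
        end = start
    return phrases[::-1]
```

With this toy dictionary, the single entry "saikousaibansho" is cheaper than the pair "saikou" plus "saibansho", so the input is divided into three phrases rather than four or five.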

[0023] In response to the output of the kanji text data, the text data processing program detects the output of the kanji text data in Sb1, as shown in FIG. 2, and proceeds to Sb2. In Sb2 it outputs a conversion information output request to the kana-kanji conversion processing program and acquires the conversion information. The requested conversion information comprises the phrase information identified in the kana-kanji conversion of the output kanji text data, the kana character string before conversion included in each phrase, and the part-of-speech data of the word containing the converted kanji (the converted word) included in each phrase. That is, in Sb2, phrase information, the unit of conversion to ideograms, is acquired from the kana-kanji conversion processing program, which is the conversion processing program that converts an input phonogram string into kanji text data, a character string including ideograms. Sb2 thus forms the phrase information acquisition step of the present invention.

In this embodiment, the text data processing program outputs a conversion information output request to the kana-kanji conversion processing program. However, the present invention is not limited to this. For example, the kana-kanji conversion processing program may output the conversion information for the converted sentence to the text data processing program together with the converted sentence text including the kanji converted by the kana-kanji conversion processing program.

[0025] In response to the conversion information output request, the kana-kanji conversion processing program outputs the conversion information to the text data processing program. The conversion information includes the phrase information identified in the kana-kanji conversion of the output kanji text data, the kana reading, that is, the kana character string before conversion included in each phrase, and the part-of-speech data of the word (converted word) containing the converted kanji included in each phrase.
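The conversion information exchanged in Sb2 can be sketched as a small record per phrase. Everything named here is hypothetical: the `PhraseInfo` type, the `request_conversion_info` function, and the part-of-speech code "N" are illustrative stand-ins, and the Japanese strings are this sketch's assumed rendering of the "US Supreme Court shortcut" example. A real kana-kanji conversion program would answer from its own conversion state rather than from a canned table.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PhraseInfo:
    surface: str                # converted (kanji) string of the phrase
    kana: str                   # kana string before conversion
    pos_codes: Tuple[str, ...]  # part-of-speech codes, in word order


# Canned answer standing in for the conversion program's internal state.
_CANNED = {
    "合衆国最高裁判所近道": [
        PhraseInfo("合衆国", "がっしゅうこく", ("N",)),
        PhraseInfo("最高裁判所", "さいこうさいばんしょ", ("N",)),
        PhraseInfo("近道", "ちかみち", ("N",)),
    ],
}


def request_conversion_info(kanji_text: str) -> List[PhraseInfo]:
    """Hypothetical stand-in for the conversion information output request."""
    return list(_CANNED[kanji_text])
```

The phrase surfaces, concatenated in order, reproduce the kanji text that was output, which is what lets the later steps locate each phrase boundary.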

[0026] Based on the phrase information included in the conversion information acquired in this way, the text data processing program of this embodiment identifies the range of each phrase in the kanji text data, that is, the converted sentence output from the kana-kanji conversion processing program. At the boundary of each identified phrase, that is, at each break position, it inserts a phrase specifying character (Sb3). Specifically, as shown in FIG. 3, at the break between the phrases “United States” and “Supreme Court”, that is, between the character code data of the characters “country” and “most”, it inserts the code data “007F”. This is a character code to which no character type is assigned, specifically the Shift-JIS code “007F”, and a special character for specifying phrases, the symbol of a diagonal line descending to the left, is allotted to it. The characters whose character codes lie between two “007F” code data can thus be identified as the characters included in one phrase. That is, in Sb3, a special character serving as phrase specifying data, which can specify the character code data included in each phrase of the converted kanji character string, is inserted into the kanji text data of the converted kanji character string on the basis of the phrase information acquired from the kana-kanji conversion processing program in Sb2. Sb3 thus forms the phrase specifying data insertion step of the present invention.
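The insertion in Sb3 can be sketched as follows. This is a minimal illustration under stated assumptions: the Python character U+007F is used as a stand-in for the code "007F" named in the text, a mark is also placed before the first phrase and after the last one so that every phrase lies between two marks, and the function names and Japanese strings are this sketch's own.

```python
PHRASE_MARK = "\u007f"  # stand-in for the phrase specifying code "007F"


def insert_phrase_marks(phrases):
    """Return text in which every phrase is enclosed by PHRASE_MARK."""
    return PHRASE_MARK + PHRASE_MARK.join(phrases) + PHRASE_MARK


def split_phrases(marked_text):
    """Recover the phrases as the runs of characters between marks."""
    return [p for p in marked_text.split(PHRASE_MARK) if p]
```

For example, `insert_phrase_marks(["合衆国", "最高裁判所", "近道"])` places one mark between 国 and 最, and `split_phrases` recovers the three phrases without any dictionary lookup.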

[0027] Then, after inserting these phrase specifying characters, the process proceeds to Sb4. The part-of-speech data of the character string included in each phrase, contained in the conversion information acquired from the kana-kanji conversion processing program, is inserted at the end position between the phrase specifying characters that delimit the data range of each phrase. Specifically, the part-of-speech data consists of part-of-speech codes uniquely assigned to each part of speech so that each can be identified (in practice, they correspond to the part-of-speech codes stored for each word in the dictionary data of the kana-kanji conversion program), arranged in the order of the parts of speech included. At the head of the part-of-speech data, the code data “008F” is inserted. This is a character code to which no character type is assigned, specifically the Shift-JIS code “008F”, and it is allotted to a special character indicating that the inserted data is data other than the converted sentence. That is, in Sb4, the part-of-speech data, made up of part-of-speech codes that can identify the part of speech of the character string included in each phrase and acquired from the kana-kanji conversion processing program in Sb2, is inserted into the text data in association with the phrase. Sb4 thus forms the part-of-speech data insertion step of the present invention.
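A minimal sketch of the part-of-speech insertion in Sb4 follows. The characters U+007F and U+008F stand in for the codes "007F" and "008F"; the part-of-speech code "N", the comma used to join several codes, and the function name are assumptions of this sketch, since the text does not state how multiple codes are laid out.

```python
PHRASE_MARK = "\u007f"  # stand-in for the phrase specifying code "007F"
DATA_MARK = "\u008f"    # stand-in for the code "008F" (data other than the sentence)


def insert_pos_data(phrases_with_pos):
    """phrases_with_pos: list of (surface, pos_codes) pairs.

    Each phrase becomes: surface, DATA_MARK, then its part-of-speech codes,
    placed at the end of the phrase's range between two PHRASE_MARKs.
    """
    parts = [
        surface + DATA_MARK + ",".join(pos_codes)
        for surface, pos_codes in phrases_with_pos
    ]
    return PHRASE_MARK + PHRASE_MARK.join(parts) + PHRASE_MARK
```

Because the data mark sits at the head of the part-of-speech data, a reader can stop at the first DATA_MARK in a phrase to obtain only the characters of the converted sentence.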

[0028] Further, at the position following (subordinate to) the part-of-speech data, the kana data, that is, the kana reading of each phrase included in the conversion information acquired from the kana-kanji conversion processing program, is inserted in the same way as the part-of-speech data, preceded by the code data “008F”, the special character (the symbol of two diagonal lines slanting to the right) indicating data other than the converted sentence. As shown in FIG. 3, this generates extended text data having the text data structure of the present invention, in which the phrases can be identified by the code data “007F”, the phrase specifying character (the symbol of a diagonal line descending to the left), and in which the part-of-speech data and the kana data of the character string included in each phrase are contained between the phrase specifying characters. That is, in Sb4, the character code data of the phonetic characters (kana) from which the ideographic characters (kanji) were converted, acquired from the kana-kanji conversion processing program in Sb2, is inserted into the text data of the converted character string, in association with the phrases of the converted character string, as kana data of the converted character string. Sb4 thus also forms the kana data insertion step of the present invention.
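The resulting extended text data can be sketched as a builder and a matching parser. The layout assumed here (surface, data mark, part-of-speech codes, data mark, kana, all between phrase marks) follows the order described above, but the use of U+007F and U+008F as stand-ins, the comma between codes, the code "N", and the function names are assumptions of this sketch.

```python
PHRASE_MARK = "\u007f"  # stand-in for the phrase specifying code "007F"
DATA_MARK = "\u008f"    # stand-in for the code "008F"


def build_extended_text(phrases):
    """phrases: list of (surface, pos_codes, kana) tuples."""
    body = [
        surface + DATA_MARK + ",".join(pos_codes) + DATA_MARK + kana
        for surface, pos_codes, kana in phrases
    ]
    return PHRASE_MARK + PHRASE_MARK.join(body) + PHRASE_MARK


def parse_extended_text(text):
    """Inverse of build_extended_text for phrases with at least one code."""
    result = []
    for chunk in text.split(PHRASE_MARK):
        if not chunk:
            continue  # empty pieces before the first and after the last mark
        surface, pos, kana = chunk.split(DATA_MARK)
        result.append((surface, pos.split(","), kana))
    return result
```

Building and then parsing returns the original triples, which is the property a downstream program relies on when it reads the phrases, their parts of speech, and their kana from the text data.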

[0029] When this “US Supreme Court shortcut” is translated into another language, there exists, besides the translation based on the phrases “United States”, “Supreme Court”, and “shortcut”, a mistranslation based on the wrong division into “U.S. Supreme”, “Court”, and “shortcut”. With the extended text data having the text data structure of this embodiment, however, the phrase specifying characters included in the text data identify the character strings included in each phrase as “United States”, “Supreme Court”, and “shortcut”, together with their kana and parts of speech. The translation program therefore does not need to carry out processing to determine the phrase breaks and does not need to include a program for that determination, so its capacity and processing time can be reduced. Moreover, if this text data structure is applied to the text contained in HTML, the description language of home pages on the Internet, then when, for example, a Chinese user browses a Japanese home page or, conversely, a Japanese user browses a Chinese home page, the text can be translated and displayed accurately and quickly, and convenience for the user can be significantly improved.

As described above, the embodiments of the present invention have been described with reference to the drawings. However, the specific configuration is not limited to these embodiments, and any change or addition within the scope of the present invention is included in the present invention.
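How a translation program might consume such text can be sketched as a phrase-by-phrase lookup that never has to guess the breaks. The glossary, the stand-in mark characters U+007F and U+008F, the code "N", and the function names are hypothetical; a real translator would also use the kana and part-of-speech data and would handle word order.

```python
PHRASE_MARK = "\u007f"  # stand-in for the phrase specifying code "007F"
DATA_MARK = "\u008f"    # stand-in for the code "008F"

# Hypothetical phrase glossary for the example sentence.
GLOSSARY = {
    "合衆国": "United States",
    "最高裁判所": "Supreme Court",
    "近道": "shortcut",
}


def phrase_surfaces(extended_text):
    """Take the characters before the first DATA_MARK in each phrase."""
    return [
        chunk.split(DATA_MARK)[0]
        for chunk in extended_text.split(PHRASE_MARK)
        if chunk
    ]


def translate_phrases(extended_text):
    """Look each phrase up as a unit; unknown phrases pass through unchanged."""
    return [GLOSSARY.get(s, s) for s in phrase_surfaces(extended_text)]
```

Because the breaks come from the text data itself, the wrong division into "U.S. Supreme" and "Court" cannot arise in this lookup.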

[0031] For example, in the above embodiment, a special character is used as the phrase specifying character. This is preferable because the phrase specifying character can be easily distinguished from other characters, so errors in identifying the phrases can be greatly reduced. However, the present invention is not limited to this, and the codes and characters used as phrase specifying characters can be selected as appropriate.

[0032] In the above embodiment, a phrase specifying character is used as the phrase specifying data that can specify the character code data included in each converted phrase. However, the present invention is not limited to this. The phrase specifying data may be, for example, character-count map data in which data are arranged indicating that a given number of characters from the beginning of the sentence form one phrase, that the next given number of characters form the next phrase, and so on, so that the characters included in each phrase can be specified. The form of the phrase specifying data may be selected as appropriate according to how the text data is used.

[0033] Further, although the above embodiment includes part-of-speech data and kana reading data, the present invention is not limited to this, and a configuration that does not include the part-of-speech data and the kana data is also possible.

[0034] Further, although the above embodiment illustrates a Japanese kana-kanji conversion processing program as the conversion processing program, the present invention is not limited to this. It goes without saying that the conversion processing program may be a Chinese conversion processing program that converts a Roman character string entered in pinyin into Chinese characters, and the present invention can also be applied to other conversions from phonograms to ideograms.
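The character-count map alternative can be sketched as a list of phrase lengths kept alongside the unmodified text. The function name and the representation of the map as a plain list of integers are assumptions of this sketch, as are the Japanese example strings.

```python
from itertools import accumulate


def split_by_counts(text, counts):
    """Split `text` into phrases whose lengths are given by `counts`."""
    if sum(counts) != len(text):
        raise ValueError("character counts do not cover the text")
    ends = list(accumulate(counts))  # end index of each phrase
    starts = [0] + ends[:-1]         # start index of each phrase
    return [text[s:e] for s, e in zip(starts, ends)]
```

For the example sentence, the map `[3, 5, 2]` says that the first three characters form one phrase, the next five the second, and the last two the third. Unlike the inserted marks, this form leaves the character string itself untouched.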

[0035] Also, in the above embodiment, the character string included in a phrase may consist of kanji used for their reading, as with a proper noun. For example, when the American name “Mark” is converted to “Masashi”, the kanji are used not as ideograms but as phonetic characters. So that a translation program or the like can determine whether the character string included in a phrase is used as ideograms or as phonetic characters, a type code that can specify phonetic use may be included together with the part-of-speech data or as one of the parts of speech. Alternatively, so that phonetic use for a foreign word can be specified, a part-of-speech code for foreign words may be included as part-of-speech data. In addition, when such proper nouns are entered and the converted ideographic character string is used as phonetic characters, as for a name, the kana-kanji conversion processing program may accept an indication of that phonetic use, and the text data processing program may acquire conversion information including type data indicating the use.

In the above embodiment, the text data processing program is a plug-in module program of the kana-kanji conversion processing program, and it is distributed separately from the kana-kanji conversion processing program on a recording medium or over a computer network. However, the present invention is not limited to this. The text data processing program may be included in the kana-kanji conversion processing program as an inseparable part of it, and a kana-kanji conversion processing program including the text data processing program may be distributed.

Claims

The scope of the claims

[1] A text data structure in which character code data that can specify the character type of each character, including at least ideographic characters, is arranged,

characterized by including, together with the character code data, phrase specifying data that can specify the character code data included in each converted phrase on the basis of phrase information, the unit of conversion to ideographic characters, obtained from a conversion processing program that converts an input phonogram string into a character string including ideograms.

[2] The text data structure according to claim 1, wherein the character code data of the phonograms from which the ideographic characters were converted is included, in association with the phrases of the converted character string, as kana data of the converted character string.

[3] The text data structure according to claim 1 or 2, wherein part-of-speech data, which is obtained from the conversion processing program and can identify the part of speech of the character string included in each phrase, is included in association with the phrase.

[4] Obtain phrase information as a conversion unit to the ideogram from the conversion processing program that converts the input phonogram string to a character string including the ideogram, and after conversion based on the obtained phrase information A text data processing method comprising: inserting phrase specifying data capable of specifying character code data included in each clause of a character string into the text data of the converted character string.

[5] The text data processing method according to claim 4, wherein the character code data of the phonograms from which the ideograms were converted, obtained from the conversion processing program, is inserted into the text data of the converted character string, in association with the phrases of the converted character string, as kana data of the converted character string.

[6] The text data processing method according to claim 4 or 5, wherein part-of-speech data, which is obtained from the conversion processing program and can identify the part of speech of the character string included in each phrase, is inserted into the text data in association with the phrase.

[7] A text data processing program comprising: a phrase information acquisition step of acquiring phrase information, the unit of conversion to ideograms, from a conversion processing program that converts an input phonogram string into a character string including ideograms; and a phrase specifying data insertion step of inserting, on the basis of the acquired phrase information, phrase specifying data that can specify the character code data included in each phrase of the converted character string into the text data of the converted character string.

[8] The text data processing program according to claim 7, further comprising a kana data insertion step of inserting the character code data of the phonograms from which the ideograms were converted, obtained from the conversion processing program, into the text data of the converted character string, in association with the phrases of the converted character string, as kana data of the converted character string.

[9] The text data processing program according to claim 7 or 8, further comprising a part-of-speech data insertion step of inserting part-of-speech data, which is obtained from the conversion processing program and can specify the part of speech of the character string included in each phrase, into the text data in association with the phrase.

[10] A recording medium on which a text data processing program is recorded, characterized in that the text data processing program according to any one of claims 7 to 9 is recorded.