WO2006051647A1 - Text data structure and text data processing method - Google Patents
- Publication number
- WO2006051647A1 (PCT/JP2005/016504)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- text data
- character string
- processing program
- character
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
- G06F40/129—Handling non-Latin characters, e.g. kana-to-kanji conversion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/53—Processing of non-Latin text
Definitions
- Text data structure, text data processing method, text data processing program, and recording medium recording the text data processing program
- The present invention relates to a text data structure for a language including at least ideographic characters, a text data processing method for generating text data having that data structure, a text data processing program, and a recording medium on which the text data processing program is recorded.
- An object of the present invention is to provide a text data structure, a text data processing method, a text data processing program, and a recording medium on which the text data processing program is recorded so that the translation delimiter (clause) can be accurately grasped.
- The text data structure according to claim 1 of the present invention is a text data structure in which character code data capable of specifying the character type of each character, including at least ideographic characters, is arranged.
- It is characterized in that phrase specifying data, which can specify the character code data included in each converted phrase, is included together with the character code data.
- the text data structure according to claim 2 of the present invention is the text data structure according to claim 1,
- The character code data of the phonetic characters from which the ideographic characters were converted is included, in correspondence with the phrase of the converted character string, as the kana data of the converted character string. The kana can therefore be identified accurately and used for translation.
- the text data structure according to claim 3 of the present invention is the text data structure according to claim 1 or 2
- Part of speech data that can identify the part of speech of the character string included in each clause and that has been acquired is included in association with the clause.
- a text data processing method includes:
- Phrase specifying data that can specify the character code data included in each clause of the converted character string is inserted into the text data of the converted character string. According to this feature, the characters included in each clause can be identified from the clause specifying data included in the text data, so sentence breaks can be grasped accurately clause by clause. When the sentence is translated into another language, the size and processing time of the translation program can therefore be reduced.
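The insertion of phrase specifying data described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of the Unicode code point U+007F as the phrase specifying character, and the kanji string 合衆国最高裁判所近道 (inferred from the "US Supreme Court shortcut" example in the description) are all assumptions.

```python
# Assumed stand-in for the phrase specifying character (code "007F" in the description).
PHRASE_MARK = "\u007f"


def insert_phrase_marks(text, boundaries):
    """Insert a phrase specifying character at each clause boundary offset."""
    pieces = []
    prev = 0
    for b in sorted(boundaries):
        pieces.append(text[prev:b])
        prev = b
    pieces.append(text[prev:])
    return PHRASE_MARK.join(pieces)


def split_phrases(text_data):
    """Recover the clauses from text data that carries phrase specifying characters."""
    return text_data.split(PHRASE_MARK)


# Hypothetical converted string with clause boundaries after characters 3 and 8.
marked = insert_phrase_marks("合衆国最高裁判所近道", [3, 8])
clauses = split_phrases(marked)
```

A downstream translation program would then only need `split_phrases` to obtain clause units, rather than re-running its own segmentation.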
- a text data processing method according to claim 5 of the present invention is the text data processing method according to claim 4,
- The character code data of the phonograms from which the ideographic characters were converted, obtained from the conversion processing program, is associated with the phrase of the converted character string as kana data of the converted character string and inserted into the text data of the converted character string. According to this feature, the kana characters can be accurately specified and used for translation.
- the text data processing method according to claim 6 of the present invention is the text data processing method according to claim 4 or 5
- the part-of-speech data that can identify the part-of-speech of the character string included in each clause is inserted into the text data in association with the clause.
- a text data processing program according to claim 7 of the present invention provides:
- a text data processing program according to claim 8 of the present invention is the text data processing program according to claim 7,
- It includes a kana data insertion step in which the character code data of the phonograms from which the ideographic characters were converted, obtained from the conversion processing program, is associated with the phrase of the converted character string as kana data of the converted character string and inserted into the text data of the converted character string.
- According to this feature, the kana can be specified accurately and used for translation.
- a text data processing program according to claim 9 of the present invention is the text data processing program according to claim 7 or 8,
- a recording medium on which a text data processing program according to claim 10 of the present invention is recorded is characterized in that the text data processing program according to any one of claims 7 to 9 is recorded.
- The text data processing program can be used easily by reading it from the recording medium.
- FIG. 1 is a flowchart showing the processing contents of a conversion processing program used in an embodiment of the present invention.
- FIG. 2 is a flowchart showing the processing contents of the text data processing program according to the embodiment of the present invention.
- FIG. 3 is a diagram showing a structure of text data generated by a text data processing program according to an embodiment of the present invention.
- FIG. 1 is a flowchart showing the processing contents of the kana-kanji conversion processing program, which is the conversion processing program used in this embodiment, and FIG. 2 shows the processing contents of the text data processing program used in this embodiment.
- the kana-kanji conversion processing program and the text data processing program used in the present embodiment are installed in a computer such as a personal computer (not shown) from a recording medium such as a CD-ROM and executed on the computer.
- The text data processing program of this embodiment is a plug-in module of the kana-kanji conversion processing program, which is the main program; the kana-kanji conversion processing program can operate without the text data processing program.
- The kana-kanji conversion processing program may be a known, commercially available Japanese input tool. Briefly, with reference to FIG. 1 and FIG. 3: when the kana-kanji conversion processing program accepts an input such as "Gasshukokusaikousaibanshochikamichi" as the conversion sentence, as shown in FIG. 3 (S1), it specifies the phrases of the accepted conversion sentence.
- These clauses can be identified using, for example, the known minimum cost method. Specifically, the phrases "Gasshukoku", "Saikousaibansho", and "Chikamichi" are specified from the conversion sentence (S2).
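The clause identification step can be illustrated with a simplified cost-minimizing segmentation. The sketch below is an assumption-laden stand-in: it uses word costs only (practical minimum cost methods typically also weigh the connection between adjacent words), and the dictionary, its cost values, and the romanized spellings are hypothetical.

```python
def min_cost_segment(text, costs):
    """Split text into dictionary entries so that the summed entry cost is minimal."""
    inf = float("inf")
    n = len(text)
    best = [inf] * (n + 1)  # best[i]: minimal cost of segmenting text[:i]
    back = [0] * (n + 1)    # back[i]: start offset of the last entry in that segmentation
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in costs and best[start] + costs[piece] < best[end]:
                best[end] = best[start] + costs[piece]
                back[end] = start
    if best[n] == inf:
        raise ValueError("no segmentation found")
    phrases = []
    i = n
    while i > 0:  # walk the back-pointers from the end of the text
        phrases.append(text[back[i]:i])
        i = back[i]
    return phrases[::-1]


# Hypothetical dictionary: lower cost means a more plausible entry.
COSTS = {
    "gasshukoku": 1, "saikousaibansho": 1, "chikamichi": 1,
    "saikou": 2, "saibansho": 2, "chika": 2, "michi": 2,
}
segments = min_cost_segment("gasshukokusaikousaibanshochikamichi", COSTS)
```

With these costs the three-entry split (total cost 3) beats any split that breaks "saikousaibansho" or "chikamichi" into shorter entries.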
- Next, the kana-kanji conversion processing program extracts from the dictionary data all kanji conversion candidate character strings, that is, ideogram strings corresponding to the kana character string (the phonogram string) included in each specified phrase (S3), and displays a representative conversion candidate for each clause as the conversion sentence in response to the conversion operation.
- The conversion candidate character strings extracted in S3 are displayed in a selectable manner (S4).
- When a conversion word is accepted through the selection operation (confirmation operation) on a conversion candidate character string (S5), the program outputs the text data including the kanji (ideograms) confirmed by that acceptance, specifically the kanji text data for "US Supreme Court shortcut", to the text data processing program.
- In response to the output of the kanji text data, the text data processing program detects the output in Sb1 and proceeds to Sb2, as shown in FIG. 2.
- In Sb2, it outputs a conversion information output request to the kana-kanji conversion processing program and acquires the conversion information. The requested conversion information comprises the phrase information specified during kana-kanji conversion of the output kanji text data, the kana character string before conversion included in each phrase, and the part-of-speech data of the word containing the converted kanji (the conversion word) included in each phrase.
- That is, in Sb2, the phrase information that is the unit of conversion to ideograms is acquired from the kana-kanji conversion processing program, the conversion processing program that converts an input phonetic character string into kanji text data (a character string including ideograms). Sb2 forms the phrase information acquisition step of the present invention.
- In this embodiment, the text data processing program outputs the conversion information output request to the kana-kanji conversion processing program, but the present invention is not limited to this.
- The kana-kanji conversion processing program may instead output the conversion information to the text data processing program together with the conversion sentence text including the converted kanji.
- In response to the request, the kana-kanji conversion processing program outputs to the text data processing program the conversion information, which includes the phrase information specified during kana-kanji conversion of the output kanji text data, the kana character string before conversion included in each phrase, and the part-of-speech data of the conversion word included in each phrase.
- The text data processing program of the present embodiment then uses the conversion sentence output from the kana-kanji conversion processing program and the acquired conversion information to specify the range of each phrase in the kanji text data, and inserts a phrase specifying character at the boundary of each specified clause, that is, at each position where the clauses are separated.
- Specifically, as shown in FIG. 3, the boundary between the clauses for "United States" and "Supreme Court" falls between the character code data of the characters for "country" and "most", and the phrase specifying character, a special character, is inserted there.
- Next, the process proceeds to Sb4. The part-of-speech data of the character string included in each phrase, taken from the conversion information acquired from the kana-kanji conversion processing program, is inserted at the end position within the data range of each phrase delimited by the phrase specifying characters. Specifically, the part-of-speech data is a part-of-speech code uniquely assigned to each part of speech so that it can be identified (in practice, it corresponds to the part-of-speech code stored for each word in the dictionary data of the kana-kanji conversion processing program).
- The code data "008F", a special character indicating that the inserted data is data other than the conversion sentence, is inserted at the head of the part-of-speech data composed of the part-of-speech code.
- That is, in Sb4, the part-of-speech data, which is composed of part-of-speech codes that can identify the part of speech of the character string included in each phrase and was acquired from the kana-kanji conversion processing program in Sb2, is inserted into the text data in association with the phrase.
- Sb4 forms the part-of-speech data insertion step of the present invention.
- The kana data for each phrase included in the conversion information obtained from the kana-kanji conversion processing program is inserted in the same way as the part-of-speech data: it is preceded by the code data "008F", the special character indicating data other than the conversion sentence, and by the symbol of two right-slanting lines, as shown in FIG. 3.
- As shown in FIG. 3, each phrase can be identified by the code data "007F", the left-down diagonal line symbol that serves as the phrase specifying character. In this way, extended text data having the text data structure of the present invention is generated, in which the part-of-speech data and the kana data of the character string included in each phrase are placed between the phrase specifying characters.
- That is, in Sb4, the character code data of the phonetic characters (kana) from which the ideographic characters (kanji) were converted, obtained from the kana-kanji conversion processing program in Sb2, is inserted into the text data of the converted character string as kana data, in association with the clause of the converted character string.
- Sb4 forms the kana data insertion step of the present invention.
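The extended text data described above can be sketched as a small builder and parser. The exact byte layout is not recoverable from this text, so the layout below is an assumption: each clause is stored as its converted string, then a marker followed by a part-of-speech code, then a marker followed by two slashes and the kana, with clauses separated by the phrase specifying character. The Unicode code points U+007F and U+008F stand in for the codes "007F" and "008F", and the part-of-speech codes `N01` and `N02` are hypothetical.

```python
from dataclasses import dataclass

PHRASE_MARK = "\u007f"  # assumed stand-in for the phrase specifying character "007F"
AUX_MARK = "\u008f"     # assumed stand-in for "008F": data other than the conversion sentence
KANA_TAG = "//"         # assumed rendering of the two right-slanting lines


@dataclass
class Phrase:
    surface: str   # converted character string (kanji)
    pos_code: str  # part-of-speech code (hypothetical values)
    kana: str      # kana before conversion


def build_extended(phrases):
    """Serialize clauses with part-of-speech data and kana data attached to each."""
    chunks = [
        f"{p.surface}{AUX_MARK}{p.pos_code}{AUX_MARK}{KANA_TAG}{p.kana}"
        for p in phrases
    ]
    return PHRASE_MARK.join(chunks)


def parse_extended(data):
    """Recover the clauses and their auxiliary data from extended text data."""
    result = []
    for chunk in data.split(PHRASE_MARK):
        surface, *aux = chunk.split(AUX_MARK)
        pos_code, kana = "", ""
        for field in aux:
            if field.startswith(KANA_TAG):
                kana = field[len(KANA_TAG):]
            else:
                pos_code = field
        result.append(Phrase(surface, pos_code, kana))
    return result


def plain_text(data):
    """Drop all inserted data and return only the converted sentence."""
    return "".join(p.surface for p in parse_extended(data))


source = [
    Phrase("合衆国", "N01", "がっしゅうこく"),
    Phrase("最高裁判所", "N01", "さいこうさいばんしょ"),
    Phrase("近道", "N02", "ちかみち"),
]
extended = build_extended(source)
```

Because every inserted field starts with the same marker, a reader that only wants the sentence can skip auxiliary data without knowing what kinds of fields exist.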
- In the above embodiment, a special character is used as the phrase specifying character.
- This is preferable because the phrase specifying character can be easily distinguished from other characters, so errors in phrase specification can be greatly reduced, but the present invention is not limited to this.
- The codes and characters used as phrase specifying characters can be selected as appropriate.
- Likewise, in the above embodiment a phrase specifying character is used as the phrase specifying data that can specify the character code data included in each converted phrase, but the present invention is not limited to this.
- For example, the phrase specifying data may be character-count map data that lists, in order, the number of characters in each clause: a given number of characters from the beginning of the sentence form one clause, the next given number of characters form the next clause, and so on.
- Any phrase specifying data that can specify the characters included in each phrase may be selected as appropriate according to how the text data is used.
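The character-count alternative can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the map is modeled as a plain list of clause lengths kept alongside the unmodified text, and the kanji string is inferred from the "US Supreme Court shortcut" example.

```python
def counts_from_phrases(phrases):
    """Build a character-count map: one clause length per clause, in order."""
    return [len(p) for p in phrases]


def phrases_from_counts(text, counts):
    """Recover the clauses from unmodified text plus its character-count map."""
    if sum(counts) != len(text):
        raise ValueError("character-count map does not cover the text")
    phrases = []
    pos = 0
    for n in counts:
        phrases.append(text[pos:pos + n])
        pos += n
    return phrases


count_map = counts_from_phrases(["合衆国", "最高裁判所", "近道"])
recovered = phrases_from_counts("合衆国最高裁判所近道", count_map)
```

Compared with in-band delimiter characters, this keeps the sentence itself untouched, at the cost of having to store and update the map whenever the text is edited.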
- Although the above embodiment includes part-of-speech data and kana data, the present invention is not limited to this, and the text data may be configured without them.
- Although the above embodiment uses a kana-kanji conversion processing program as the conversion processing program, the present invention is not limited to this. The conversion processing program may, for example, be a Chinese romanization-to-Chinese-character conversion processing program that converts a Roman character string entered in pinyin into Chinese characters, and the present invention can also be applied when converting other phonograms to ideograms.
- The character string included in a clause may be the reading of a proper noun or the like.
- For example, when the American name "Mark" is converted to "Masashi", the converted characters are used as phonograms rather than as ideograms.
- In such a case, a type code that can specify use as a phonogram may be used together with the part-of-speech data,
- or the part-of-speech code for a foreign word may be included as the part-of-speech data.
- When the converted ideographic character string is used phonetically, such as for a name, that use should be indicated:
- the conversion processing program accepts the indication that the string is used as phonetic characters,
- and the text data processing program may acquire conversion information including type data indicating that use.
- In the above embodiment, the text data processing program is a plug-in module of the kana-kanji conversion processing program, and is recorded on a recording medium or provided over a computer network separately from the kana-kanji conversion processing program.
- However, the present invention is not limited to this.
- The text data processing program may be included in the kana-kanji conversion processing program as an inseparable part of it, and a kana-kanji conversion processing program including the text data processing program may be distributed.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2004330696A JP2006139692A (en) | 2004-11-15 | 2004-11-15 | Text data structure, text data processing method, text data processing program, and recording medium having recorded the same |
JP2004-330696 | 2004-11-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006051647A1 | 2006-05-18 |
Family
ID=36336330
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2005/016504 WO2006051647A1 (en) | 2004-11-15 | 2005-09-08 | Text data structure and text data processing method |
Country Status (4)
Country | Link |
---|---|
JP (1) | JP2006139692A (en) |
KR (1) | KR20070083757A (en) |
CN (1) | CN101057234A (en) |
WO (1) | WO2006051647A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102460437A (en) * | 2009-06-26 | 2012-05-16 | 乐天株式会社 | Information search device, information search method, information search program, and storage medium on which information search program has been stored |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107943763A (en) * | 2017-11-29 | 2018-04-20 | 广州迈安信息科技有限公司 | A kind of big text data processing method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS61279973A (en) * | 1985-06-06 | 1986-12-10 | Ricoh Co Ltd | Japanese processor |
JPS638860A (en) * | 1986-06-27 | 1988-01-14 | Matsushita Electric Ind Co Ltd | Kana/kanji converting device |
JPH07141382A (en) * | 1993-11-19 | 1995-06-02 | Sharp Corp | Foreign-language documentation support device |
-
2004
- 2004-11-15 JP JP2004330696A patent/JP2006139692A/en active Pending
-
2005
- 2005-09-08 KR KR1020077009140A patent/KR20070083757A/en not_active Application Discontinuation
- 2005-09-08 CN CNA2005800386561A patent/CN101057234A/en active Pending
- 2005-09-08 WO PCT/JP2005/016504 patent/WO2006051647A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN101057234A (en) | 2007-10-17 |
KR20070083757A (en) | 2007-08-24 |
JP2006139692A (en) | 2006-06-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP3277123B2 (en) | System and method for processing Chinese text | |
JP6069211B2 (en) | Text conversion and expression system | |
EP0686286B1 (en) | Text input transliteration system | |
US20070061131A1 (en) | Japanese virtual dictionary | |
US20050010391A1 (en) | Chinese character / Pin Yin / English translator | |
Zeitoun et al. | The Formosan language archive: Linguistic analysis and language processing | |
WO2006051647A1 (en) | Text data structure and text data processing method | |
JPH11238051A (en) | Chinese input conversion processor, chinese input conversion processing method and recording medium stored with chinese input conversion processing program | |
JP2003178087A (en) | Retrieval device and method for electronic foreign language dictionary | |
JP2005250525A (en) | Chinese classics analysis support apparatus, interlingual sentence processing apparatus and translation program | |
Lehal | A Gurmukhi to Shahmukhi transliteration system | |
Joshi et al. | Input Scheme for Hindi Using Phonetic Mapping | |
Lehal et al. | A Hindi to Urdu transliteration system | |
Das et al. | Multilingual Neural Machine Translation System for Indic to Indic Languages | |
KR100268297B1 (en) | System and method for processing chinese language text | |
Chaware et al. | Information retrieval in multilingual environment | |
JPH08272780A (en) | Processor and method for chinese input processing, and processor and method for language processing | |
JP3220133B2 (en) | Kana-Kanji conversion device | |
JP2608384B2 (en) | Machine translation apparatus and method | |
JPH07210571A (en) | Device and method for word retrieval processing | |
Moran | An ontology for accessing transcription systems (OATS) | |
JPH01118961A (en) | Translating device | |
JPS61235978A (en) | Character string correction system | |
JP2006338155A (en) | Computer program for character string conversion and recording medium with recorded conversion rule | |
Asahiah et al. | A survey of approaches to diacritic restoration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 1020077009140 Country of ref document: KR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 200580038656.1 Country of ref document: CN |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 05782277 Country of ref document: EP Kind code of ref document: A1 |