WO2002095614A1 - Method for identifying a language / character code system - Google Patents

Method for identifying a language / character code system

Info

Publication number
WO2002095614A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
character code
code system
list
character
Prior art date
Application number
PCT/JP2001/004350
Other languages
English (en)
Japanese (ja)
Inventor
Izumi Suzuki
Original Assignee
Izumi Suzuki
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Izumi Suzuki
Priority to PCT/JP2001/004350
Priority to JP2002592007A
Publication of WO2002095614A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding
    • G06F40/20: Natural language analysis
    • G06F40/263: Language identification

Definitions

  • The present invention relates to a multilingual processing technique in a computer, and more particularly to a machine processing method for identifying the language and character code system of a text document encoded by a computer.
  • A difference in character code system is more than a difference in character fonts. When a text document encoded in one character code system is displayed using the character font of a different character code system B (that is, when it is decoded with character code system B), completely meaningless text is displayed.
  • Identification methods that meet the above requirements can be powerful information processing tools in relatively large multilingual processing systems, for tasks such as searching, classifying, and statistically surveying documents on the network.
  • The survey systematically accesses pages on the Internet around the world using robot search techniques, and automatically identifies and tabulates the language and character encoding used on the pages.
  • The text document that is input to the device and is to be identified is called the “target text document.”
  • If the text used on a certain page is written in a language / character code system that is not registered in the identification device, the page is checked manually and new languages / character code systems are registered if necessary. (The registered language / character code systems are called “target language / character code systems.”)
  • Both methods output the most likely of the target language / character code systems as the identification result, so it is difficult to determine clearly whether the text is actually written in a registered language / character code system.
  • With method (2), it is difficult to detect unregistered languages / character code systems when they are mixed into a document.
  • Japanese / Shift-JIS is used as a target.
  • Suppose Malay / iso8859-1 is not registered, and the device is given a text document containing both Japanese / Shift-JIS and Malay / iso8859-1.
  • Method (2) will output Japanese / Shift-JIS as the result, and the presence of the unregistered language / character code system will be overlooked.
  • In the conventional methods, other than the method shown in (3) of the problem to be solved, the information units used for identification are words or characters of the language, extracted from the text document on the basis of knowledge of each language and character code system.
  • Disclosure of the invention
  • The unit of information used for identification is every partial byte sequence of a specified number of bytes contained in the text document (hereinafter, the "byte sequence").
  • For each language / character code system, a list (LBSL/C) of the byte sequences of the specified length that may appear in a text document written in that language / character code system is created in advance. If the list covers most of the specified-length byte sequences that can appear in text documents of a given language / character code system, then a text document in which byte sequences absent from the list appear frequently is evidently not written in that language / character code system.
  • The list LBSL/C for each language / character code system can be obtained easily from text documents written in that language / character code system, and a list obtained in this way gives good identification results.
  • the standard of the number of text documents required to obtain LBSL / C is: 1 KB 20 KB for character code, 2 Japanese, etc. It is a pite.
  • FIG. 1 is a diagram schematically showing a system according to the present invention.
  • FIG. 2 is a flowchart of a series of general-purpose steps of a process executed by the system shown in FIG.
  • FIG. 3 is a flowchart of the detailed steps executed in step 204 shown in FIG. 2 for calculating the learned byte sequence appearance rate in the target text document for each language / character code system.
  • FIG. 4 is a flowchart of the detailed steps executed in step 206 shown in FIG. 2 to delete the lower language / character code system when more than one language / character code system has an appearance rate exceeding the upper limit UB.
  • FIG. 5 shows part of the list LBSL/C of 3-byte sequences that can appear when the language / character code system is "Japanese / Shift-JIS".
  • FIG. 6 shows an example in which no language / character code system has a learned byte sequence appearance rate between the lower limit (LB) and the upper limit (UB).
  • FIG. 7 is an example of a list of the relationships described in claim 2 among the “Example of target language / character code system” entries (A to H) described on page 6.
  • The parentheses mean that language / character code system X is higher than the other language / character code system.
  • FIG. 8 is an example of execution of the process described in step 206 of FIG. 2.
  • The language / character code systems are the same as in the example described in FIG. 6, and the relationships are the same as in the example described in FIG. 7. FIG. 9 shows the number of LBSL/C items used in the experiment described in “Possibility of Industrial Use” and the amount of text referred to in creating them.
  • FIG. 10 shows the output results of step 204 shown in FIG. 2 in the experiment described in “Possibility of Industrial Use”.
  • A computer-encoded text document (the target text document) is input, and first, in step 202, it is checked whether the document is long or short.
  • In step 203, all byte sequences of the specified length contained in the target text document are read and stored in the list LBSS.
  • The specified byte length is generally 3 bytes. Lengths of 1 byte and 2 bytes do not provide the desired discrimination performance. On the other hand, as the specified length increases, the discrimination performance improves.
  • For each item of LBSS, the list LBSL/C is searched to check whether the byte sequence exists in it, and the learned byte sequence appearance rate is calculated for each language / character code system (step 204).
  • FIG. 5 shows an example (part) of the list LBSL/C.
  • FIG. 3 shows the detailed steps of step 204.
  • In step 205, it is checked whether there is a language / character code system whose learned byte sequence appearance rate takes a value between the predetermined lower limit LB and the upper limit UB. FIG. 6 illustrates an example for step 205. If one or more such language / character code systems exist, the system outputs "unidentifiable" and terminates the identification process. If there is none, the next step is processed. The values of LB and UB are determined in advance depending on the implementation. When a language / character code system takes a value between LB and UB, there is a high possibility that the target text document contains multiple language / character code systems.
  • After step 206, the relevant language / character code system is output as the identification result.
  • The present invention can be a powerful multilingual information processing means not only in the statistical survey on the Internet described in the background, but also, for the same reason, in the search and classification of documents on the network.
  • Two additional features of the present invention are described below.
  • Hiragana is almost always used in Japanese documents, and moreover it is used very frequently.
  • Hiragana is therefore often used as the character with a high frequency of appearance in conventional technology (1), and the first byte of hiragana is often used as the characteristic character code in conventional technology (2).
  • The target text document is written in 1) language / character code system A, 2) a language / character code system B that differs from A and is registered as a target language / character code system, and 3) A
  • The appearance rate is likely to exceed the upper limit UB if the items of LBSL/C are sufficient, but is likely to fall below UB when the items are insufficient.
  • This does not raise the learned byte sequence appearance rate for other language / character code systems, and consequently the result returned is "unidentifiable".
  • The learned byte sequence appearance rate for the target text document is smaller than the lower limit LB even if the LBSL/C items of A are sufficient. If A's LBSL/C items are insufficient, this figure will be no greater than it is when the items are sufficient, so the insufficiency is not a factor in returning incorrect results.
  • Japan, the world's largest funder, is implementing effective and efficient economic cooperation based on the Official Development Assistance Charter in order to help developing countries become self-sustaining.
  • Text documents in each language / character code system, and a text document in which Japanese / Shift-JIS and English are mixed (language / character code system A, B, G, F, G and H are all about 70 pounds, and about 130 pounds for a mixture of three languages and English), are each input to the identification device described in claim 1.
  • FIG. 10 shows the learned byte sequence appearance rate for each language / character code system.
  • In the comparative experiment in which the list LBSL/C for Indonesian was insufficient, the Indonesian input text was judged unidentifiable. For the other input texts, the correct identification result was obtained by performing step 206 as described in claim 2. For example, for input text in English / L, the learned byte sequence appearance rate exceeded UB for two language / character code systems: "B. English / L only" and "D. Japanese / S, English / L, or mixed". By performing the processing of step 206 on these language / character code systems, the single language / character code system "B. English / L only" is obtained, as shown in example 1 of FIG. 8. (The character code system Shift-JIS is abbreviated to S, and iso8859-1 is abbreviated to L.)
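
The flow described above (collecting byte sequences, computing an appearance rate per language / character code system, checking the rate against LB and UB, and pruning overlapping candidates) can be sketched roughly as follows. This is an illustrative sketch only, not the patent's implementation: the function names, the sample thresholds 0.3 and 0.9, the inclusive treatment of the LB to UB band, counting repeated sequences with multiplicity, and the `composites` mapping (which stands in for the relationships of FIG. 7) are all assumptions introduced here.

```python
from typing import Dict, List, Optional, Set


def byte_sequences(data: bytes, n: int = 3) -> List[bytes]:
    """Return every overlapping n-byte sequence in data (the LBSS of step 203)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_list(samples: List[bytes], n: int = 3) -> Set[bytes]:
    """Collect the n-byte sequences seen in sample documents (one LBSL/C list)."""
    learned: Set[bytes] = set()
    for sample in samples:
        learned.update(byte_sequences(sample, n))
    return learned


def appearance_rates(target: bytes, lists: Dict[str, Set[bytes]],
                     n: int = 3) -> Dict[str, float]:
    """Step 204: fraction of the target's sequences found in each list."""
    seqs = byte_sequences(target, n)
    if not seqs:  # target shorter than n bytes: nothing to measure
        return {name: 0.0 for name in lists}
    return {
        name: sum(1 for s in seqs if s in learned) / len(seqs)
        for name, learned in lists.items()
    }


def identify(target: bytes, lists: Dict[str, Set[bytes]], lb: float, ub: float,
             composites: Optional[Dict[str, Set[str]]] = None,
             n: int = 3) -> List[str]:
    """Return the identified systems; an empty list stands for "unidentifiable"."""
    rates = appearance_rates(target, lists, n)
    # Step 205 (assumed inclusive): any rate inside [LB, UB] is ambiguous.
    if any(lb <= r <= ub for r in rates.values()):
        return []
    candidates = [name for name, r in rates.items() if r > ub]
    # Step 206 (assumed form): drop a composite entry such as
    # "Japanese, English, or mixed" when one of its parts is also a candidate.
    composites = composites or {}
    return [
        name for name in candidates
        if not any(part in candidates for part in composites.get(name, set()))
    ]


# Tiny hypothetical training samples, far smaller than a real list would need.
english = build_list([b"the quick brown fox jumps over the lazy dog"])
indonesian = build_list([b"saya suka makan nasi goreng"])
lists = {"English": english, "Indonesian": indonesian}

print(identify(b"the lazy dog", lists, lb=0.3, ub=0.9))
print(identify(b"the lazy dog saya suka", lists, lb=0.3, ub=0.9))
```

With these toy lists, the first call identifies English and the second, a mixed input, falls inside the band and is reported as unidentifiable. Storing each list as a set keeps the membership test in step 204 cheap, which matters because every sequence of the target is looked up once per registered system.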

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a method for mechanically identifying the language / character code system of a computer-encoded text document. A list LBSL/C of byte sequences of a specified length, created in advance for each language / character code system, stores the sequences of the specified number of bytes that may occur in a text document of that language / character code system. For each language / character code system, a "learned byte sequence appearance rate" is calculated, that is, the proportion of the specified-length byte sequences contained in the target text document that already exist in the list LBSL/C, and only when this value is close to 1 is the name of the language / character code system output as the result.
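
The rate described in the abstract can be written compactly. The notation below ($T$ for the target text document, $S_n(T)$ for its $n$-byte sequences, $r_{L/C}$ for the rate) is introduced here for illustration and does not appear in the source; counting repeated sequences with multiplicity is likewise an assumption.

```latex
% S_n(T): the multiset of all overlapping n-byte sequences of T.
% For a document of m bytes (m >= n), |S_n(T)| = m - n + 1.
\[
  r_{L/C}(T) \;=\;
  \frac{\bigl|\{\, s \in S_n(T) : s \in \mathrm{LBSL/C} \,\}\bigr|}
       {\bigl|S_n(T)\bigr|},
  \qquad 0 \le r_{L/C}(T) \le 1 .
\]
% Decision rule stated in the abstract: the name of L/C is output
% only when r_{L/C}(T) is close to 1.
```

For example, a 12-byte document has 10 sequences of length 3; if all 10 are in the list, the rate is 1, and if 5 are, the rate is 0.5.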
PCT/JP2001/004350 2001-05-24 2001-05-24 Method for identifying a language / character code system WO2002095614A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2001/004350 WO2002095614A1 (fr) 2001-05-24 2001-05-24 Method for identifying a language / character code system
JP2002592007A JPWO2002095614A1 (ja) 2001-05-24 2001-05-24 Language / character code system identification processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2001/004350 WO2002095614A1 (fr) 2001-05-24 2001-05-24 Method for identifying a language / character code system

Publications (1)

Publication Number Publication Date
WO2002095614A1 (fr) 2002-11-28

Family

ID=11737343

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2001/004350 WO2002095614A1 (fr) 2001-05-24 2001-05-24 Method for identifying a language / character code system

Country Status (2)

Country Link
JP (1) JPWO2002095614A1 (fr)
WO (1) WO2002095614A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000194696A (ja) * 1998-12-23 2000-07-14 Xerox Corp Method for automatic language identification based on sample text

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008515107A (ja) * 2004-09-30 2008-05-08 Google Inc. Methods and systems for selecting a language for text segmentation
US8306808B2 (en) 2004-09-30 2012-11-06 Google Inc. Methods and systems for selecting a language for text segmentation
US8849852B2 (en) 2004-09-30 2014-09-30 Google Inc. Text segmentation
JP2015118625A (ja) * 2013-12-19 2015-06-25 株式会社Ji2 Determination device, determination method, and program

Also Published As

Publication number Publication date
JPWO2002095614A1 (ja) 2004-11-25


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2002592007

Country of ref document: JP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase