JP2000148754A

JP2000148754A - Multilingual system, multilingual processing method, and medium storing program for multilingual processing

Info

Publication number: JP2000148754A
Application number: JP10338416A
Authority: JP
Inventors: Shinichi Mukogawa; 信一向川; Tomoyuki Tada; 多田　　智之; Hidenobu Kaneoka; 秀信金岡; Yasuyuki Furukawa; 靖之古河
Original assignee: Omron Corp; Omron Tateisi Electronics Co
Current assignee: Omron Corp
Priority date: 1998-11-13
Filing date: 1998-11-13
Publication date: 2000-05-30

Abstract

PROBLEM TO BE SOLVED: To identify the language and coding system of a character code and to add the identification result to a character code string and output the obtained string. SOLUTION: An appearance probability table wherein the appearance probability of a character code is described for each character is prepared for every combination of languages and character code systems. An inputted character code string is divided into individual characters (step 21) and the appearance probability of the character code is obtained by referring to the appearance probability table (steps 23, 25, 27, 29, and 31). The product of appearance probability values is calculated for every combination of the languages and character code systems (steps 24, 26, 28, 30, and 32) and the combination of the languages and character code systems as to the input character string is judged from the obtained product. Information representing this judgement result is added to the character code string and outputted.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD

The present invention relates to a multilingual system that handles a plurality of languages, a multilingual processing method, and a recording medium storing a program for multilingual processing. More particularly, it relates to a system having a function of determining the language of a character string represented by a character code string (encoded text data, a keyword, etc.) and the type of its character code (character code system), to a method for realizing that function, and to a program recording medium.

[0002]

BACKGROUND OF THE INVENTION

The character codes for kanji (or Hangul) currently used in Japan, China (People's Republic of China), South Korea, and Taiwan (Republic of China) represent one character with two bytes. These character codes (code systems) are defined independently for each language (Japanese, Chinese, Korean, and so on). When the encoding method (character code system, code type, or encoding rule) differs, even characters of the same language are represented by different character codes. Information indicating the language is not normally attached to character code data. Consequently, given a sequence of character codes, it is not easy to tell which language was encoded to produce them.

[0003] Language information processing systems such as database search systems, translation systems, and speech synthesis systems are built on the premise of a specific language and character code system. Even for a language information processing system usable with a plurality of languages, the processing differs for each language, so the language of a given keyword or text data must be known. Appropriate processing cannot be expected if the language and character code system of the given keyword or text data are unknown.

[0004]

DISCLOSURE OF THE INVENTION

An object of the present invention is to make it possible to identify the language of a given character code string and its character code system, and thereby to enable language information processing suited to each language even when the language and character code system of an input keyword or text data are unknown.

[0005] A multilingual system according to the present invention comprises: input means for receiving an input character code string (encoded input text data, a keyword, etc.); storage means storing a plurality of appearance probability tables, one for each combination of a language and a character code system, each describing the probability that a character code appears in that combination; means for reading appearance probabilities from the plurality of appearance probability tables for one or more character codes included in the input character code string received by the input means, and obtaining evaluation data for each combination of language and character code system; means for determining the combination of language and character code system of the input character code string on the basis of the obtained evaluation data; and output means for outputting information representing the combination determined by the determining means together with the character code string.

[0006] In a multilingual processing method according to the present invention, an appearance probability table describing the probability that a character code appears is created in advance for each combination of a language and a character code system. An input character code string is received; appearance probabilities are read from the plurality of appearance probability tables for one or more character codes included in the received input character code string; evaluation data is obtained for each combination of language and character code system; the combination of language and character code system of the input character code string is determined on the basis of the obtained evaluation data; and information representing the determined combination is output together with the character code string.

[0007] The present invention further provides a recording medium storing a program for carrying out the above method. The recording medium stores a program that, in order to identify the combination of language and character code system of an input character code string using appearance probability tables each describing, for a combination of a language and a character code system, the probability that a character code appears in that combination, controls a computer so as to: receive the input character code string; read appearance probabilities from the plurality of appearance probability tables for one or more character codes included in the received input character code string; obtain evaluation data for each combination of language and character code system; determine the combination of language and character code system of the input character code string on the basis of the obtained evaluation data; and output information representing the determined combination together with the character code string.

[0008] Preferably, the recording medium also stores the appearance probability tables. Here, a recording medium means a magnetic disk storage device, a magneto-optical disk storage device, an optical disk storage device, magnetic tape, semiconductor memory, or the like.

[0009] The appearance probability of a character code depends on the combination of the language of the character it represents and the character code system. Even for the same character code, the appearance probability differs from language to language. Likewise, even within the same language, the appearance probability of the same character code differs if the character code system differs. The present invention focuses on the appearance probabilities of character codes that are characteristic of each combination of language and character code system, and uses them to determine the language represented by the character codes and the type of character code system. Here, a character code means a code representing a digit, a letter of the alphabet, kanji, hiragana, katakana, Hangul, or any other character, symbol, or sign.

[0010] According to the present invention, the appearance probability is read from the appearance probability tables for each character code of the received input character code string (text data, a keyword, etc.), and evaluation data is produced for each combination of language and character code system. If the evaluation data, which is derived from the appearance probabilities, is low, the input character code string is judged unlikely to belong to that combination of language and character code system; if the evaluation data is high, the string is considered likely to belong to that combination. In this way, the combination of language and character code system of the input character code string is determined on the basis of the evaluation data.

[0011] From the standpoint of accuracy, it is preferable to calculate the product of the appearance probabilities read from an appearance probability table and to determine the language and character code system (encoding method) of the input character code string on the basis of the calculated value. If the appearance probability of any one character code is 0 or very close to 0, the product also becomes very small, and that combination of language and character code system is clearly excluded.
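
The product evaluation described in this paragraph can be illustrated with a minimal sketch. The function name evaluate and the two small tables are hypothetical and chosen only for illustration; the specification does not prescribe an implementation.

```python
from math import prod

def evaluate(codes, table):
    """Product of the appearance probabilities of each code.

    Codes missing from the table are treated as probability 0
    (an assumption of this sketch).
    """
    return prod(table.get(code, 0.0) for code in codes)

# Hypothetical tables: probabilities as fractions, keyed by 2-byte code.
table_a = {0xC7DF: 0.5, 0xB2D6: 0.25}
table_b = {0xC7DF: 0.5, 0xB2D6: 0.0}

codes = [0xC7DF, 0xB2D6]
score_a = evaluate(codes, table_a)  # both codes are plausible in combination A
score_b = evaluate(codes, table_b)  # a single zero eliminates combination B
```

Because the evaluation is a product, one impossible character code is enough to rule out a combination, which is the property the paragraph relies on.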

[0012] Further, according to the present invention, information representing the determined combination of language and character code system is output together with the character code string (all or part of the input character code string). A device or system that uses this output data can therefore recognize the language and the character code system of the character code string.

[0013] The present invention can be applied to multilingual processing systems such as a multilingual morphological analysis system, a multilingual search system, a multilingual output (for example, printing) system, a multilingual translation system, a multilingual word processor, a multilingual speech synthesis system, a multilingual database system, and a multilingual document management system. When such a system processes a character code string (text data), it can use the information representing the combination of language and character code system to perform processing suited to that combination.

[0014] For example, a multilingual search system is provided with a plurality of search means, each suited at least to a particular language. On the basis of the information representing the combination of language and character code system given from the output means, the search means suited at least to the language indicated by that information is selected, and search processing is executed according to the character code string given from the output means.

[0015] In a multilingual database system, the character code string given from the output means is registered in the database with the information representing the combination of language and character code system, also given from the output means, appended to it.

[0016] A multilingual document management system is provided with a plurality of document processing means, each suited at least to a particular language. On the basis of the information representing the combination of language and character code system given from the output means, the document processing means suited at least to the language indicated by that information is selected, and document processing of the character code string given from the output means is executed.

[0017] The means for receiving the input character code string may be an input device such as a keyboard, or a communication device (including software) that receives transmitted data. Similarly, the output means may directly control a processing system (a search system, a database, a document management system, or the like), or may be a communication device that transmits the character code string and the identification information. A system for language identification and a system that processes character code strings on the basis of the identification result can therefore be connected by a network. In other words, the multilingual processing system can also be realized over a network.

[0018]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 shows the external appearance of a multilingual system having a function of identifying the language of character codes, and FIG. 2 shows an outline of its electrical configuration.

[0019] The multilingual system having the character code language identification function includes a computer 10. A CRT display device (or liquid crystal display panel) 11, a printer 12, and input devices (a keyboard 13A and a mouse 13B) are connected to the computer 10. An FD drive 14, a CD-ROM drive 15, and an HD unit 16 are provided inside the computer 10. The FD drive 14 writes data to and reads data from an FD (floppy disk) 19. The CD-ROM drive 15 reads data from a CD-ROM (compact disc read-only memory) 18. The HD unit 16 writes data to and reads data from an HD (hard disk, not shown). The computer 10 further includes an internal memory (semiconductor memory or the like) 17. The computer 10 can also communicate with other systems over a network under the control of a communication control program.

[0020] The CD-ROM 18 stores a multilingual program, which includes a language identification program for identifying the language of character codes, and the appearance probability data used for language identification. Its contents are shown in FIG. 3. The appearance probability data represents the probability that a character code (that is, a character) appears. The appearance probability of each character is obtained in advance by statistically processing the characters that appear in a variety of existing documents. Appearance probability data may be prepared for representative characters only or for all characters. The data is stored as a table for each combination of a language and a character code system (type of character code, or encoding method). In this embodiment there are tables (appearance probability tables) for Chinese (one for the mainland and one for Taiwan), tables for Japanese (EUC (Extended UNIX Code) and Shift-JIS (JIS: Japanese Industrial Standards)), and a table for Korean. Japanese is generally encoded in EUC or Shift-JIS; EUC and Shift-JIS are the character code systems, or encoding methods. Accordingly, not only the language but also the character code system can be identified. The same applies to languages other than Japanese. However, because Japanese Shift-JIS codes can be converted to Japanese EUC codes, only an EUC table need be provided for Japanese, as in the language identification example described later.
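
One way such a table could be derived from sample documents is sketched here. This assumes a corpus consisting solely of 2-byte codes in a single known encoding and uses plain relative frequency; the specification says only that the probabilities are obtained by statistical processing, so the function build_table and its details are illustrative.

```python
from collections import Counter

def build_table(corpus):
    """Relative frequency of each 2-byte code in an encoded corpus (bytes).

    Assumes the corpus contains only 2-byte character codes.
    """
    codes = [int.from_bytes(corpus[i:i + 2], "big")
             for i in range(0, len(corpus) - 1, 2)]
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

# Tiny hypothetical corpus: four 2-byte codes, one of them repeated.
sample = bytes.fromhex("C7DFC7DFB2D6A4CB")
table = build_table(sample)
```

One table of this kind would be built per combination of language and character code system, each from documents known to be in that combination.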

[0021] The program and data stored on the CD-ROM 18 are read from the CD-ROM 18 and stored on the HD when the system is started up. During multilingual processing, including the character code language identification process, parts of the program and data are temporarily stored in, or expanded into, the internal memory 17 as needed.

[0022] FIG. 4 conceptually shows the relationship among the input process (program) for the text data (character code string) to be processed, the language identification process (program), the output process (program), and the appearance probability data. These programs as a whole are referred to as the multilingual program.

[0023] Text data input by text input processing software such as a browser or communication software (or via the keyboard 13A) is cut out one character at a time by the language identification program, and for each character the appearance probability data is obtained for each combination of language (Japanese, Chinese, Korean) and character code system. The obtained appearance probabilities are statistically processed (the multiplication described later) in an evaluation value work area (part of the internal memory 17), and finally the combination of language and character code system of the input text data is identified. Information representing the identification result is appended to the input text data and output (for example, passed or transmitted to another processing system).

[0024] Whereas FIG. 4 represents the multilingual program and the data used for language identification as a functional block diagram, FIG. 5 represents the multilingual program as a flowchart. When input text data (a character code string) is received (step 41), the combination of language and character code system of the text data is identified (step 42). Finally, the identification result is appended to the text data and output (step 43).

[0025] FIG. 6 is a flowchart showing the procedure of the character code identification process (step 42 in FIG. 5). The process is explained using an example in which a character code string representing the phrase "梅花に鴬" (roughly, "a bush warbler among plum blossoms") is input from the keyboard, and the system identifies which combination of language and character code system the string belongs to. In Japanese EUC this string is 0xC7DF, 0xB2D6, 0xA4CB, 0xB2A9; in Japanese Shift-JIS it is 0x947E, 0x89D4, 0x82C9, 0x89A7, where 0x denotes hexadecimal. FIG. 7 shows, for each combination of language and character code system, the appearance probabilities of the character codes that make up the phrase "梅花に鴬"; the values shown are normalized so that the largest appearance probability in each table is 100%.

[0026] Two bytes of data (one character; 0xC7DF in the above example) are extracted from the input text data (step 21).
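
As a small illustration of step 21, the following sketch cuts an input byte string into successive 2-byte codes. It assumes, as the embodiment does, that every character occupies exactly two bytes; the name split_codes is hypothetical.

```python
def split_codes(data):
    """Yield successive 2-byte character codes (step 21) as integers."""
    for i in range(0, len(data) - 1, 2):
        yield (data[i] << 8) | data[i + 1]

# The example phrase in Japanese EUC, as given in the text.
text = bytes.fromhex("C7DFB2D6A4CBB2A9")
codes = list(split_codes(text))
```

A trailing odd byte, if any, is simply ignored in this sketch.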

[0027] The appearance probability corresponding to the extracted two bytes of data (character code) is read from the appearance probability table for each combination of language and character code system (steps 23, 25, 29, and 31). In parallel, the extracted two bytes are converted from Shift-JIS to EUC (step 22), and the appearance probability of the converted EUC character code is read from the Japanese (EUC) appearance probability table (step 27).
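
The specification does not say how the conversion of step 22 is carried out. A sketch using the usual byte arithmetic between Shift-JIS and EUC for 2-byte JIS X 0208 characters is given here as one possibility; the function name sjis_to_euc is hypothetical, and single-byte characters and invalid byte values are not handled.

```python
def sjis_to_euc(code):
    """Convert one 2-byte Shift-JIS code (JIS X 0208 range) to its EUC code."""
    s1, s2 = code >> 8, code & 0xFF
    if s1 >= 0xE0:              # second lead-byte range (0xE0 and above)
        s1 -= 0x40
    j1 = (s1 - 0x81) * 2 + 0x21  # first JIS row of the pair for this lead byte
    if s2 >= 0x9F:              # trail byte belongs to the second row of the pair
        j1 += 1
        j2 = s2 - 0x7E
    elif s2 >= 0x80:            # first row, upper part (0x7F is skipped in Shift-JIS)
        j2 = s2 - 0x20
    else:                       # first row, lower part
        j2 = s2 - 0x1F
    return ((j1 | 0x80) << 8) | (j2 | 0x80)  # EUC sets the high bit of both bytes

# The Shift-JIS codes of the example phrase, as given in the text.
sjis = [0x947E, 0x89D4, 0x82C9, 0x89A7]
euc = [sjis_to_euc(c) for c in sjis]
```

Applied to the Shift-JIS codes quoted for the example phrase, this yields the EUC codes quoted alongside them.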

[0028] The character code of "梅", the first character of "梅花に鴬", has an appearance probability of 0.0948% in the Japanese EUC code system, 0% in the Japanese Shift-JIS code system, 0.0129% in the Chinese (mainland) EUC code system, 0.0022% in the Chinese (Taiwan) Big5 code system, and 10.941% in the Korean EUC code system.

[0029] The product of the appearance probability just read and the evaluation value calculated so far is computed, and this product becomes the new evaluation value (evaluation value update) (steps 24, 26, 28, 30, and 32). This calculation is also performed for each combination of language and character code system (that is, for each appearance probability table). The initial evaluation value is set to 1, so for the first character code the appearance probability read is multiplied by 1.

[0030] The updated evaluation values are then normalized so that the largest of them becomes 100 (step 33). This is done so that the comparison with a threshold value in step 35, described later, can be performed.

[0031] If the above processing has not yet been completed for all the character codes that make up the input text data (NO in step 34), the sum of all the evaluation values other than the one having the maximum value is calculated. If this sum is equal to or less than a predetermined threshold value (YES in step 35), the input text data is determined to be in the combination of language and character code system that gives the maximum evaluation value. If the sum exceeds the threshold value (NO in step 35), the processing of steps 21 to 34 is repeated.
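
The loop of steps 21 to 35 might be sketched as follows. The function name identify, the default threshold of 1.0 (on the 0 to 100 normalized scale), and the handling of the case where every evaluation value is zero are assumptions of this sketch; the specification does not state a threshold value or that case.

```python
def identify(codes, tables, threshold=1.0):
    """Return (best_combination, characters_consumed).

    tables maps a combination name to {code: probability};
    threshold applies to the evaluation values normalized so the maximum is 100.
    """
    scores = {name: 1.0 for name in tables}      # initial evaluation value is 1
    best, n = None, 0
    for n, code in enumerate(codes, start=1):
        for name, table in tables.items():       # look up and multiply per table
            scores[name] *= table.get(code, 0.0)
        top = max(scores.values())
        if top == 0.0:                           # assumption: nothing left to choose from
            return None, n
        normalized = {k: 100.0 * v / top for k, v in scores.items()}  # step 33
        best = max(normalized, key=normalized.get)
        rest = sum(v for k, v in normalized.items() if k != best)
        if rest <= threshold:                    # step 35: clear winner, stop early
            return best, n
    return best, n                               # all characters processed

# Hypothetical tables for two combinations "A" and "B".
tables = {"A": {1: 0.5, 2: 0.5}, "B": {1: 0.4, 2: 0.001}}
result = identify([1, 2, 1], tables)
```

With these made-up tables the first character leaves both combinations plausible, while the second makes "B" negligible, so the loop stops after two characters. Raw products shrink quickly for long inputs; a practical implementation might keep the normalized values or work with logarithms instead.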

[0032] For the character code of "花", the second character of "梅花に鴬", the appearance probability is 3.2740% in Japanese EUC, 0% in Japanese Shift-JIS, 0.1118% in Chinese (mainland) EUC, 0.2874% in Chinese (Taiwan) Big5, and 0% in Korean EUC.

[0033] For the character code of "に", the third character of "梅花に鴬", the appearance probability is 59.155% in Japanese EUC, 0% in Japanese Shift-JIS, 0.0001% in Chinese (mainland) EUC, 0% in Chinese (Taiwan) Big5, and 0.0001% in Korean EUC.

[0034] For the character code of "鴬", the fourth character of "梅花に鴬", the appearance probability is 0.0001% in Japanese EUC, 0% in Japanese Shift-JIS, 0.3717% in Chinese (mainland) EUC, 0.0048% in Chinese (Taiwan) Big5, and 0.0299% in Korean EUC.

[0035] The product of the appearance probabilities of the four character codes that make up "梅花に鴬" is obtained as the final evaluation value. At this point the above processing has been completed for all the character codes that make up the input text data (YES in step 34). The final evaluation value is 0.000000001836% for Japanese EUC, 0% for Japanese Shift-JIS, 0.000000000005366% for Chinese (mainland) EUC, 0% for Chinese (Taiwan) Big5, and 0% for Korean EUC. Comparing these values, the evaluation value for Japanese EUC is the largest, so "梅花に鴬" is judged to be Japanese expressed in EUC. In this way, the language of the characters represented by the character codes and the type of character code (character code system, or encoding method) are identified. When the input text data contains many character codes, step 35 will normally yield YES once three or four characters have been processed, and the determination of the combination of language and character code system of the input text data will end there.
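
The comparison in this example can be reproduced from the per-character values quoted in paragraphs [0028] and [0032] to [0034]. The sketch converts each percentage to a fraction and multiplies; the combination labels such as "ja/EUC" are hypothetical names introduced here.

```python
from math import prod

# Appearance probabilities in percent for the four characters of the example
# phrase, as quoted in paragraphs [0028] and [0032]-[0034].
percent = {
    "ja/EUC":       [0.0948, 3.2740, 59.155, 0.0001],
    "ja/Shift-JIS": [0.0, 0.0, 0.0, 0.0],
    "zh/EUC":       [0.0129, 0.1118, 0.0001, 0.3717],
    "zh/Big5":      [0.0022, 0.2874, 0.0, 0.0048],
    "ko/EUC":       [10.941, 0.0, 0.0001, 0.0299],
}

# Convert each percentage to a fraction and take the product per combination.
scores = {name: prod(p / 100 for p in values) for name, values in percent.items()}
winner = max(scores, key=scores.get)
```

The Japanese EUC product is about 1.836e-11 as a fraction, that is, about 0.000000001836 when written as a percentage, and it is the largest of the five; the three combinations containing a 0% entry drop to exactly zero.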

[0036] FIG. 8 shows the overall configuration of a multilingual search system. The multilingual system 51 corresponds to the system described with reference to FIGS. 1 to 7.

[0037] The multilingual system 51 accepts, for example, a query such as the following. The language of this query (the part from %C7 to %A9, which is the character code string of "梅花に鴬" described above) is unknown.

[0038] /cgi-bin/recognize?query=%C7%DF%B2%D6%A4%CB%B2%A9

[0039] cgi-bin indicates that an execution command follows. The execution command recognize instructs the system to identify the language and character code system of the character code string (search parameter) that follows the "?" and to append the identification result to the search parameter.

[0040] The multilingual system 51 executes this command and recognizes that the character string following the "?" is Japanese expressed in EUC. The multilingual system 51 then creates and outputs a query with language parameters appended, as follows.

[0041] /cgi-bin/search?query=%C7%DF%B2%D6%A4%CB%B2%A9&lang=ja&encode=EUC

【００４２】search is a command to perform a search. lang=ja&encode=EUC is the language parameter; it indicates that the language is Japanese (ja, in accordance with ISO-639) and that the code system is the EUC code.
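Rewriting the recognize request into the search request with language parameters might look like the sketch below. The function name `add_language_parameters` is hypothetical, and the identification result is passed in as plain strings rather than computed.

```python
def add_language_parameters(recognize_url: str, lang: str, encode: str) -> str:
    """Turn a recognize request into a search request with language parameters."""
    path, _, query = recognize_url.partition("?")
    # Replace the last path segment (the command) with "search".
    search_path = path.rsplit("/", 1)[0] + "/search"
    # The percent-encoded search parameter is passed through unchanged.
    return f"{search_path}?{query}&lang={lang}&encode={encode}"


print(add_language_parameters(
    "/cgi-bin/recognize?query=%C7%DF%B2%D6%A4%CB%B2%A9", "ja", "EUC"
))
```

The search parameter is deliberately left percent-encoded, so the character codes reach the search system byte for byte.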

【００４３】A query with language parameters of this kind is transmitted to the search system 52 via a network (for example, the Internet). The search system splits the received query into a search sentence part and a language parameter part.

【００４４】Search sentence: query=%C7%DF%B2%D6%A4%CB%B2%A9
Language parameter: lang=ja&encode=EUC
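The decomposition can be sketched in a few lines of Python. The function name `split_query` is hypothetical, and the sketch assumes that the search sentence is the single field named `query` and that all remaining fields are language parameters.

```python
def split_query(url: str):
    """Split a query with language parameters into its two parts."""
    _, _, query = url.partition("?")
    fields = query.split("&")
    # Assumption: the field named "query" is the search sentence.
    search = [f for f in fields if f.startswith("query=")]
    language = [f for f in fields if not f.startswith("query=")]
    return "&".join(search), "&".join(language)


sentence, params = split_query(
    "/cgi-bin/search?query=%C7%DF%B2%D6%A4%CB%B2%A9&lang=ja&encode=EUC"
)
print(sentence)
print(params)
```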

【００４５】From among the Japanese index 53, the Chinese index 54, and the Korean index 55, the search system 52 selects the index designated by the language parameter, that is, the Japanese index 53, and executes search processing according to the search sentence. The search result is reported from the search system 52 to the multilingual system 51.
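Index selection by language parameter could be sketched as follows. The toy indexes, the `CODECS` mapping from (language, code system) to a Python codec name, the `("zh", "GB")` entry, and the substring match are all assumptions made for illustration; the actual index structure is not specified here.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical mapping from (language, code system) to a Python codec name.
CODECS = {("ja", "EUC"): "euc_jp", ("ko", "EUC"): "euc_kr", ("zh", "GB"): "gb2312"}

# Toy stand-ins for the Japanese index 53, Chinese index 54, and Korean index 55.
INDEXES = {
    "ja": ["梅花に鴬", "桜と富士"],
    "zh": ["梅花香自苦寒来"],
    "ko": ["매화"],
}


def search(search_sentence: str, language_parameter: str):
    """Select the index named by the language parameter and search it."""
    params = dict(f.split("=", 1) for f in language_parameter.split("&"))
    lang, encode = params["lang"], params["encode"]
    raw = unquote_to_bytes(search_sentence.split("=", 1)[1])
    # Decode the character codes with the identified code system.
    text = raw.decode(CODECS[(lang, encode)])
    # Only the index for the identified language is consulted.
    return [doc for doc in INDEXES[lang] if text in doc]


print(search("query=%C7%DF%B2%D6%A4%CB%B2%A9", "lang=ja&encode=EUC"))
```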

【００４６】The indexes 53 to 55 are, of course, preferably provided for each combination of language and code system. However, once the language is known, conversion between code systems is possible (for example, conversion between the Shift-JIS and EUC codes mentioned above), so it suffices to provide one index per language.
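A conversion between code systems of the kind mentioned here can be sketched with Python's built-in codecs, assuming that `euc_jp` and `shift_jis` correspond to the EUC and Shift-JIS codes of the text. The function name `convert` is hypothetical.

```python
def convert(data: bytes, src: str = "euc_jp", dst: str = "shift_jis") -> bytes:
    """Re-encode character codes from one code system to another."""
    # Decoding is only possible because the source code system is known.
    return data.decode(src).encode(dst)


euc = bytes.fromhex("C7DFB2D6A4CBB2A9")
sjis = convert(euc)
print(sjis.hex())
# Converting back restores the original EUC bytes.
print(convert(sjis, "shift_jis", "euc_jp") == euc)
```

With such a conversion, a query arriving in either code system can be normalized before it is matched against a single Japanese index.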

【００４７】FIG. 9 shows an example of a multilingual database system. The multilingual system 61 receives text data whose language and character code are unknown and determines its language and character code. A language tag (representing the combination of the determined language and character code) is added to the text data, which is then registered in the database 62.

【００４８】An example of the database is shown in FIG. 10. It is a relational database in which a language tag is attached to each document. The language tag consists of a combination of a language (en, ja, fr, zh, etc.) and a code system (ASCII, EUC, ISO-8859-1, etc.).

【００４９】By adding a language tag to the data registered in the database in this way, processing can be performed language by language (including, where necessary, in combination with the character code): for example, retrieving only documents in a specific language, running a full-text search only over documents in a specific language, or creating summaries only for documents in a specific language.
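A language-tagged relational table and a per-language retrieval can be sketched with Python's standard `sqlite3` module. The table name, column names, and sample rows are hypothetical and do not reproduce the layout of FIG. 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each document carries a language tag: a language plus a code system.
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, lang TEXT, encode TEXT, body BLOB)"
)
rows = [
    ("en", "ASCII", "plum blossoms".encode("ascii")),
    ("ja", "EUC", "梅花に鴬".encode("euc_jp")),
    ("fr", "ISO-8859-1", "fleurs de prunier".encode("iso-8859-1")),
]
conn.executemany(
    "INSERT INTO documents (lang, encode, body) VALUES (?, ?, ?)", rows
)


def documents_in(lang: str):
    """Retrieve only the documents tagged with the given language."""
    cur = conn.execute(
        "SELECT encode, body FROM documents WHERE lang = ? ORDER BY id", (lang,)
    )
    return cur.fetchall()


for encode, body in documents_in("ja"):
    print(encode, body.hex())
```

The body is stored as raw bytes, so the code system column is what allows a reader to decode it later.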

【００５０】FIG. 11 shows an example of a document management system. The document management system 72 is provided, for each language, with a summary generation program 73, a keyword search program 74, a grammar and spell check program 75, and so on. Based on the language-tagged text data supplied from the multilingual system 71, the program suited to the language indicated by the language tag is selected, and processing appropriate to that language, such as summary generation, keyword search, or grammar and spell checking, is carried out.
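Selecting a program by language tag amounts to a lookup keyed on the language. The sketch below uses two toy "summary generation" functions that merely return the first sentence; the functions, the `CODECS` mapping, and the tuple layout of the tagged data are hypothetical.

```python
def summarize_en(text: str) -> str:
    # Toy summary: the first English sentence.
    return text.split(".")[0] + "."


def summarize_ja(text: str) -> str:
    # Toy summary: the first Japanese sentence.
    return text.split("。")[0] + "。"


# One program per language, chosen by the language part of the tag.
SUMMARIZERS = {"en": summarize_en, "ja": summarize_ja}
# Hypothetical mapping from language tag to a Python codec name.
CODECS = {("en", "ASCII"): "ascii", ("ja", "EUC"): "euc_jp"}


def process(tagged):
    """Decode language-tagged text data and run the matching program."""
    lang, encode, data = tagged
    text = data.decode(CODECS[(lang, encode)])
    return SUMMARIZERS[lang](text)


print(process(("en", "ASCII", b"First sentence. Second sentence.")))
print(process(("ja", "EUC", "梅が咲いた。鴬が鳴いた。".encode("euc_jp"))))
```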

[Brief description of the drawings]

FIG. 1 shows the external appearance of a multilingual system having a function for identifying the language of character codes.

FIG. 2 shows an outline of the electrical configuration of the multilingual system.

FIG. 3 shows the data structure of a recording medium used in the multilingual system.

FIG. 4 is a block diagram functionally showing the relationships among input processing, language identification processing, output processing, and appearance probability data.

FIG. 5 is a flowchart showing the overall flow of the multilingual program.

FIG. 6 is a flowchart showing the procedure of the language identification processing for character codes.

FIG. 7 shows an appearance probability table storing the appearance probabilities of the character codes representing a given phrase, by language and character code.

FIG. 8 is a block diagram showing the whole of the multilingual search system.

FIG. 9 is a block diagram showing the whole of the multilingual database system.

FIG. 10 shows an example of the database.

FIG. 11 is a block diagram showing the whole of the multilingual document management system.

[Description of reference numerals]
10: computer
13A: keyboard
15: CD-ROM drive
51, 61, 71: multilingual system
52: search system
62: database
72: document management system

Continuation of front page
(72) Inventor: Hidenobu Kaneoka, c/o Omron Corporation, 10 Hanazono Tsuchido-cho, Ukyo-ku, Kyoto-shi, Kyoto
(72) Inventor: Yasuyuki Furukawa, c/o Omron Corporation, 10 Hanazono Tsuchido-cho, Ukyo-ku, Kyoto-shi, Kyoto
F-terms (reference): 5B009 KA04 QA01 VB01 5B091 AA01 CB01 CC01

Claims

[Claims]

1. A multilingual system comprising:
input means for receiving an input character code string;
storage means for storing a plurality of appearance probability tables, one for each combination of a language and a character code system, each describing the appearance probabilities of character codes in that combination;
determination means for reading out appearance probabilities from the plurality of appearance probability tables for one or a plurality of character codes included in the input character code string received by the input means, obtaining evaluation data for each combination of language and character code system, and determining the combination of language and character code system of the input character code string based on the obtained evaluation data; and
output means for outputting information representing the combination of language and character code system determined by the determination means together with the character code string.

2. The multilingual system according to claim 1, further comprising a search system that has a plurality of search means each suited to at least one of a plurality of languages, selects, based on the information representing the combination of language and character code system provided from the output means, at least the search means suited to the language represented by that information, and executes search processing according to the character code string provided from the output means.

3. The multilingual system according to claim 1, further comprising a database in which the character code string provided from the output means is registered with the information representing the combination of language and character code system, also provided from the output means, added to it.

4. The multilingual system according to claim 1, further comprising a document management system that has a plurality of document processing means each suited to at least one of a plurality of languages, selects, based on the information representing the combination of language and character code system provided from the output means, at least the document processing means suited to the language represented by that information, and executes document processing on the character code string provided from the output means.

5. A multilingual processing method comprising: preparing in advance, for each combination of a language and a character code system, an appearance probability table describing the appearance probabilities of character codes in that combination; accepting an input character code string; reading out appearance probabilities from the plurality of appearance probability tables for one or a plurality of character codes included in the accepted input character code string; obtaining evaluation data for each combination of language and character code system; determining the combination of language and character code system of the input character code string based on the obtained evaluation data; and outputting information representing the determined combination of language and character code system together with the character code string.

6. A recording medium storing a program for identifying the combination of language and character code system of an input character code string using appearance probability tables, one for each combination of a language and a character code system, each describing the appearance probabilities of character codes in that combination, the program controlling a computer so as to: accept an input character code string; read out appearance probabilities from the plurality of appearance probability tables for one or a plurality of character codes included in the accepted input character code string; obtain evaluation data for each combination of language and character code system; determine the combination of language and character code system of the input character code string based on the obtained evaluation data; and output information representing the determined combination of language and character code system together with the character code string.

7. The recording medium according to claim 6, wherein said appearance probability table is further stored.