JPS61282965A - Character string dividing method - Google Patents

Character string dividing method

Info

Publication number
JPS61282965A
Authority
JP
Japan
Prior art keywords
character
numeric
character string
punctuation mark
kana
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP60124684A
Other languages
Japanese (ja)
Inventor
Shunichi Fukushima
俊一 福島
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1985-06-07
Filing date
1985-06-07
Publication date
1986-12-13
Application filed by NEC Corp
Priority to JP60124684A
Publication of JPS61282965A

Abstract

PURPOSE: To prevent a numeric string from being divided at a Japanese comma, comma, or period inside it, by not dividing the character string at a punctuation mark when numerals exist both immediately before and immediately after that punctuation mark. CONSTITUTION: A punctuation mark extracting means 3 searches a kana (Japanese syllabary) string stored in a character string memory means 2 and extracts punctuation marks such as commas and periods. An immediately preceding non-numeric deciding means 4 receives the position of the punctuation mark from the punctuation mark extracting means 3 and outputs a signal to an OR gate 6 when the character immediately before the punctuation mark is not a numeral. An immediately following non-numeric deciding means 5 outputs a signal to the OR gate 6 when the character immediately after the punctuation mark is not a numeral. When a signal is transmitted from the OR gate 6, a conversion unit reading means 7 receives the position of the punctuation mark from the punctuation mark extracting means 3, reads the kana string up to that position from the character string memory means 2, and outputs it to a kana/kanji (Chinese character) converting means 8.

Description

Detailed Description of the Invention (Industrial Field of Application) The present invention relates to a method for automatically dividing a character string into a plurality of processing units, as required for preprocessing when kana-kanji conversion or the like is performed on unsegmented (solid-written) input in a word processor or similar device.

(Prior Art and Its Problems) Conventionally, in word processors and the like, a kana character string entered as unsegmented input is sometimes divided into kana-kanji conversion units at delimiters such as Japanese punctuation marks, commas, and periods. An example is shown below.

キョウハ、ヨイテンキデス。 (Kyō wa, yoi tenki desu. "Today, the weather is fine.")

→ /キョウハ、/ヨイテンキデス。/

In this case, 「キョウハ」 and 「ヨイテンキデス」 are each converted from kana to kanji, in two separate passes. Here, "/" is a symbol indicating a division point.
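The conventional division described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `naive_split` and the delimiter set `DELIMITERS` are hypothetical and do not come from the patent, and each unit is assumed to keep its closing delimiter, as in the example.

```python
# Assumed delimiter set: Japanese comma and period plus their Western forms.
DELIMITERS = "、。,.，．"


def naive_split(text, delimiters=DELIMITERS):
    """Cut the text after every delimiter, as in the conventional method."""
    units, start = [], 0
    for i, ch in enumerate(text):
        if ch in delimiters:
            units.append(text[start:i + 1])  # the unit keeps its delimiter
            start = i + 1
    if start < len(text):
        units.append(text[start:])  # trailing text without a closing delimiter
    return units


print(naive_split("キョウハ、ヨイテンキデス。"))
```

Each element of the returned list corresponds to one kana-kanji conversion unit.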

However, this method of simply dividing at punctuation marks has the drawback that a Japanese comma, comma, or period inside a numeric string splits that numeric string, as shown below.

コノトケイヲ、29,800エンデカイマシタ。 (Kono tokei o, 29,800 en de kaimashita. "I bought this watch for 29,800 yen.")

→ /コノトケイヲ、/29,/800エンデカイマシタ。/

コノモンダイノセイカイリツハ、65.55%ダッタ。 (Kono mondai no seikairitsu wa, 65.55% datta. "The correct answer rate for this problem was 65.55%.")

→ /コノモンダイノセイカイリツハ、/65./55%ダッタ。/

Such a division may not produce an error in the kana-kanji conversion result itself, but it is often fatal when calculations are performed on the numerical values or when the text is read aloud for checking. For example, if the numeric parts of the examples above are read aloud during such a check, the division errors lead to erroneous readings like the following.

/ニジューク/ハッピャク・・・ (nijūku / happyaku ...: "twenty-nine" / "eight hundred")
/ロクジューゴ/ゴジューゴ・・・ (rokujūgo / gojūgo ...: "sixty-five" / "fifty-five")

(Object of the Invention) An object of the present invention is to provide a method that divides a character string correctly, without splitting a numeric string at a Japanese comma, comma, or period inside it.
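The drawback of the conventional method can be reproduced with a short Python sketch. The helper names `naive_split` and `numbers_in` are hypothetical, and the input is the watch example transcribed with an ASCII comma inside the number, which is an assumption about the original text.

```python
import re


def naive_split(text, delimiters="、。,."):
    """Conventional method: cut after every delimiter, with no numeric check."""
    units, start = [], 0
    for i, ch in enumerate(text):
        if ch in delimiters:
            units.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        units.append(text[start:])
    return units


def numbers_in(units):
    """Collect every run of ASCII digits found inside each unit."""
    return [int(run) for unit in units for run in re.findall(r"[0-9]+", unit)]


units = naive_split("コノトケイヲ、29,800エンデカイマシタ。")
print(units)
print(numbers_in(units))
```

Because the comma inside the price ends a unit, any later step that works unit by unit sees 29 and 800 as two separate numbers rather than 29,800.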

(Structure of the Invention) The present invention is a character string dividing method characterized in that, when a character string is divided into a plurality of units, delimiters such as Japanese commas, commas, and periods are extracted from the character string; it is determined whether the characters immediately before and after each delimiter are numerals; the character string is divided at the delimiter when no numeral exists immediately before it or immediately after it; and the character string is not divided at the delimiter when numerals exist both immediately before and immediately after it.
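The rule stated above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the names `split_units`, `is_digit`, and `DELIMITERS` are hypothetical, numerals are assumed to be whatever `str.isdecimal()` accepts, and a delimiter at the very start or end of the text is assumed to have a non-numeric neighbour on the missing side.

```python
# Assumed delimiter set; the patent lists these only as examples ("etc.").
DELIMITERS = "、。,.，．"


def is_digit(text, i):
    """True if index i lies inside the text and holds a decimal digit."""
    return 0 <= i < len(text) and text[i].isdecimal()


def split_units(text, delimiters=DELIMITERS):
    """Divide at each delimiter unless digits stand on both sides of it."""
    units, start = [], 0
    for i, ch in enumerate(text):
        if ch not in delimiters:
            continue
        if is_digit(text, i - 1) and is_digit(text, i + 1):
            continue  # digit on both sides: the delimiter belongs to a number
        units.append(text[start:i + 1])
        start = i + 1
    if start < len(text):
        units.append(text[start:])  # trailing text without a closing delimiter
    return units


print(split_units("コノトケイヲ、29,800エンデカイマシタ。"))
print(split_units("コノモンダイノセイカイリツハ、65.55%ダッタ。"))
```

With this rule the numeric strings 29,800 and 65.55 stay inside a single unit, while the Japanese comma and period still end units as before.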

(Embodiment) The configuration of the present invention will be described in detail with reference to the drawing.

FIG. 1 is a block diagram showing an embodiment of a Japanese text input device that implements the character string dividing method of the present invention. In FIG. 1, reference numeral 1 denotes a character string input means; a kana keyboard or the like is used to input a kana character string.

Reference numeral 2 denotes a character string storage means, which stores the kana character string input through the character string input means 1. It can be realized with a magnetic disk device, a magnetic tape device, an IC memory, or the like.

Reference numeral 3 denotes a delimiter extracting means, which searches the kana character string stored in the character string storage means 2, extracts delimiters such as the Japanese period (。), the Japanese comma (、), the comma (,), and the period (.), and outputs the position of each extracted delimiter to the immediately preceding non-numeric determining means 4, the immediately following non-numeric determining means 5, and the conversion unit reading means 7.

Reference numeral 4 denotes the immediately preceding non-numeric determining means, which receives the position of the delimiter from the delimiter extracting means 3 and determines whether the character immediately before the delimiter is a numeral. When that character is not a numeral, it outputs a signal to the OR gate 6.

Reference numeral 5 denotes the immediately following non-numeric determining means, which receives the position of the delimiter from the delimiter extracting means 3 and determines whether the character immediately after the delimiter is a numeral. When that character is not a numeral, it outputs a signal to the OR gate 6.

Reference numeral 6 denotes an OR gate, which outputs a signal to the conversion unit reading means 7 when a signal arrives from at least one of the immediately preceding non-numeric determining means 4 and the immediately following non-numeric determining means 5.
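The combination of the two determining means and the OR gate amounts to a small truth table, which the following Python sketch enumerates. The function name `or_gate` is a hypothetical stand-in for OR gate 6; the booleans model the signals from means 4 and means 5.

```python
from itertools import product


def or_gate(before_non_numeric, after_non_numeric):
    """Model of OR gate 6: fires if either determining means sends a signal."""
    return before_non_numeric or after_non_numeric


for before_is_digit, after_is_digit in product([False, True], repeat=2):
    # Means 4 and 5 signal when the neighbouring character is NOT a numeral.
    signal = or_gate(not before_is_digit, not after_is_digit)
    # De Morgan: (not a) or (not b) is the same as not (a and b).
    assert signal == (not (before_is_digit and after_is_digit))
    print(before_is_digit, after_is_digit, "divide" if signal else "do not divide")
```

Only the case with a numeral on both sides leaves the gate silent, which matches the rule that the string is not divided when numerals exist both immediately before and immediately after the delimiter.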

Reference numeral 7 denotes the conversion unit reading means. When a signal arrives from the OR gate 6, it receives the position of the delimiter from the delimiter extracting means 3, reads the kana character string up to that position from the character string storage means 2, and outputs it to the kana-kanji conversion means 8.

Reference numeral 8 denotes the kana-kanji conversion means, which converts the kana character string sent from the conversion unit reading means 7 into a kanji-kana mixed character string and outputs it. The kana-kanji conversion means 8 can be realized using known techniques.
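The hand-off from the conversion unit reading means 7 to the kana-kanji conversion means 8 can be modelled roughly as below. Everything here is a hypothetical sketch: the class `ConversionUnitReader`, its method `on_gate_signal`, and the lookup table standing in for a real kana-kanji converter are illustrative assumptions, and the sketch assumes the reader remembers where the previous unit ended. Positions are 1-based, as in the description.

```python
class ConversionUnitReader:
    """Hypothetical model of conversion unit reading means 7."""

    def __init__(self, stored_text, convert):
        self.stored_text = stored_text  # contents of storage means 2
        self.convert = convert          # stand-in for conversion means 8
        self.next_start = 1             # 1-based position of first unread character

    def on_gate_signal(self, delimiter_pos):
        """Read from the first unread character through the delimiter, then convert."""
        unit = self.stored_text[self.next_start - 1:delimiter_pos]
        self.next_start = delimiter_pos + 1
        return self.convert(unit)


# Toy converter: a two-entry lookup table, used only so the example runs.
TABLE = {
    "イチロウクンハ,": "一郎君は,",
    "オツカイニイキマシタ.": "お使いに行きました.",
}


def toy_convert(unit):
    return TABLE.get(unit, unit)


reader = ConversionUnitReader("イチロウクンハ,オツカイニイキマシタ.", toy_convert)
print(reader.on_gate_signal(8))
print(reader.on_gate_signal(19))
```

A delimiter for which the OR gate stays silent simply never triggers `on_gate_signal`, so its characters are carried into the next unit that is read.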

Next, the operation of the Japanese text input device of this embodiment will be described using an example.

Assume that the following kana character string has been input through the character string input means 1 and stored in the character string storage means 2.

イチロウクンハ,オツカイニイキマシタ.2,500エンノカサヲカッテ,1マンエンサツヲダシマシタ.オツリハイクラデショウ.コノモンダイノセイカイリツハ,95.5%ダッタ.

(Ichirō-kun wa, otsukai ni ikimashita. 2,500 en no kasa o katte, 1 man en satsu o dashimashita. Otsuri wa ikura deshō. Kono mondai no seikairitsu wa, 95.5% datta.)

At this point, the delimiter extracting means 3 first extracts the comma (,) immediately after 「イチロウクンハ」 and outputs its position, the <8th character>, to the immediately preceding non-numeric determining means 4, the immediately following non-numeric determining means 5, and the conversion unit reading means 7. The immediately preceding non-numeric determining means 4 determines whether the character immediately before the <8th character>, namely the <7th character>, is a numeral. Since 「ハ」 is not a numeral, it outputs a signal to the OR gate 6. At the same time, the immediately following non-numeric determining means 5 determines whether the character immediately after the <8th character>, namely the <9th character>, is a numeral. Since 「オ」 is not a numeral, it outputs a signal to the OR gate 6. As a result, the OR gate 6 outputs a signal to the conversion unit reading means 7. On receiving the signal from the OR gate 6, the conversion unit reading means 7 outputs the character string 「イチロウクンハ,」, which runs up to the delimiter position <8th character> received from the delimiter extracting means 3, to the kana-kanji conversion means 8. The kana-kanji conversion means 8 converts this into the kanji-kana mixed character string 「一郎君は,」 and outputs it.

Next, the delimiter extracting means 3 extracts the period (.) at the <19th character>, immediately after 「オツカイニイキマシタ」. In the immediately preceding non-numeric determining means 4, the <18th character> 「タ」 is not a numeral, so a signal is output to the OR gate 6. In the immediately following non-numeric determining means 5, the <20th character> 「2」 is a numeral, so no signal is output to the OR gate 6. Because the OR gate 6 has received a signal from the immediately preceding non-numeric determining means 4, it outputs a signal to the conversion unit reading means 7. On receiving that signal, the conversion unit reading means 7 outputs the character string 「オツカイニイキマシタ.」, from the <9th character> through the <19th character>, to the kana-kanji conversion means 8, which produces the kanji-kana mixed character string 「お使いに行きました。」 ("went on an errand.").

Next, the delimiter extracting means 3 extracts the comma (,) at the <21st character>, immediately after 「2」. In the immediately preceding non-numeric determining means 4, the <20th character> 「2」 is a numeral, so no signal is output to the OR gate 6. In the immediately following non-numeric determining means 5, the <22nd character> 「5」 is a numeral, so no signal is output to the OR gate 6. As a result, the OR gate 6 receives no signal from either the immediately preceding non-numeric determining means 4 or the immediately following non-numeric determining means 5, and no signal is output to the conversion unit reading means 7. The conversion unit reading means 7 and the kana-kanji conversion means 8 therefore do not operate.

Similarly, the OR gate 6 outputs a signal for the comma (,) at the <34th character>, the period (.) at the <48th character>, the period (.) at the <60th character>, the comma (,) at the <75th character>, and the final period (.), but it outputs no signal for the period (.) at the <78th character>. As a result, the kana character string stored in the character string storage means 2 is divided into the following units and converted from kana to kanji.
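The character positions quoted in this walk-through can be checked with a short trace. The sketch below is illustrative only: the function name `trace` is hypothetical, the example string is transcribed with ASCII commas and periods (an assumption, since the source text is damaged), and numerals are assumed to be whatever `str.isdecimal()` accepts.

```python
TEXT = (
    "イチロウクンハ,オツカイニイキマシタ.2,500エンノカサヲカッテ,"
    "1マンエンサツヲダシマシタ.オツリハイクラデショウ."
    "コノモンダイノセイカイリツハ,95.5%ダッタ."
)


def trace(text, delimiters=",."):
    """Yield (1-based position, delimiter, OR-gate signal) for each delimiter."""
    last = len(text) - 1
    for i, ch in enumerate(text):
        if ch not in delimiters:
            continue
        # Means 4 and 5: signal when the neighbour is missing or not a numeral.
        before_non_numeric = i == 0 or not text[i - 1].isdecimal()
        after_non_numeric = i == last or not text[i + 1].isdecimal()
        yield i + 1, ch, before_non_numeric or after_non_numeric


for pos, ch, signal in trace(TEXT):
    print(f"{pos:2d} {ch} {'divide' if signal else 'do not divide'}")
```

Under these assumptions the trace reports delimiters at positions 8, 19, 21, 34, 48, 60, 75, 78, and 84, and the gate stays silent only at positions 21 and 78, the two delimiters that sit between numerals.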

/イチロウクンハ,/オツカイニイキマシタ./2,500エンノカサヲカッテ,/1マンエンサツヲダシマシタ./オツリハイクラデショウ./コノモンダイノセイカイリツハ,/95.5%ダッタ./

→ 一郎君は、お使いに行きました。2,500円の傘を買って、1万円札を出しました。おつりはいくらでしょう。この問題の正解率は、95.5%だった。

(Ichiro went on an errand. He bought a 2,500-yen umbrella and paid with a 10,000-yen note. How much change will he get? The correct answer rate for this problem was 95.5%.)

(Effects of the Invention) As described above, the character string dividing method of the present invention never makes the fatal error of splitting a numeric string at a Japanese comma, comma, or period inside it, and makes it possible to divide a character string into appropriate processing units.

[Brief Description of the Drawing]

FIG. 1 is a block diagram showing the configuration of an embodiment of a Japanese text input device that implements the character string dividing method of the present invention. In the figure:
1 ... character string input means
2 ... character string storage means
3 ... delimiter extracting means
4 ... immediately preceding non-numeric determining means
5 ... immediately following non-numeric determining means
6 ... OR gate
7 ... conversion unit reading means
8 ... kana-kanji conversion means

Claims (1)

[Claims] A character string dividing method characterized in that, when a character string is divided into a plurality of units, delimiters such as Japanese commas, commas, and periods are extracted from the character string; it is determined whether the characters immediately before and after each delimiter are numerals; the character string is divided at the delimiter when no numeral exists immediately before it or immediately after it; and the character string is not divided at the delimiter when numerals exist both immediately before and immediately after it.
JP60124684A 1985-06-07 1985-06-07 Character string dividing method Pending JPS61282965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP60124684A JPS61282965A (en) 1985-06-07 1985-06-07 Character string dividing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP60124684A JPS61282965A (en) 1985-06-07 1985-06-07 Character string dividing method

Publications (1)

Publication Number Publication Date
JPS61282965A true JPS61282965A (en) 1986-12-13

Family

ID=14891512

Family Applications (1)

Application Number Title Priority Date Filing Date
JP60124684A Pending JPS61282965A (en) 1985-06-07 1985-06-07 Character string dividing method

Country Status (1)

Country Link
JP (1) JPS61282965A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS643773A (en) * 1987-06-26 1989-01-09 Hitachi Ltd Kana/kanji conversion device
JPH01206457A (en) * 1988-02-15 1989-08-18 Ricoh Co Ltd Character processor
JPH0269869A (en) * 1988-09-06 1990-03-08 Ricoh Co Ltd Language analyzing device
JPH02299068A (en) * 1989-04-26 1990-12-11 Internatl Business Mach Corp <Ibm> Word separation method and apparatus
JPH0594474A (en) * 1991-04-12 1993-04-16 Oki Electric Ind Co Ltd Method for recognizing translation objective sentence for translation system
US8938688B2 (en) 1998-12-04 2015-01-20 Nuance Communications, Inc. Contextual prediction of user words and user actions
US9626355B2 (en) 1998-12-04 2017-04-18 Nuance Communications, Inc. Contextual prediction of user words and user actions
US8972905B2 (en) 1999-12-03 2015-03-03 Nuance Communications, Inc. Explicit character filtering of ambiguous text entry
US8990738B2 (en) 1999-12-03 2015-03-24 Nuance Communications, Inc. Explicit character filtering of ambiguous text entry
JP2009116900A (en) * 2002-06-20 2009-05-28 Tegic Communications Inc Explicit character filtering of ambiguous text entry
US9786273B2 (en) 2004-06-02 2017-10-10 Nuance Communications, Inc. Multimodal disambiguation of speech recognition

Similar Documents

Publication Publication Date Title
US11568150B2 (en) Methods and apparatus to improve disambiguation and interpretation in automated text analysis using transducers applied on a structured language space
JPS61282965A (en) Character string dividing method
Parizi et al. Do Character-Level Neural Network Language Models Capture Knowledge of Multiword Expression Compositionality?
JP3692399B2 (en) Notation error detection processing apparatus using supervised machine learning method, its processing method, and its processing program
CN112883717A (en) Wrongly written character detection method and device
JP3953772B2 (en) Reading device and program
JP4040233B2 (en) Important sentence extraction device and storage medium
JPS6174062A (en) Sentence input system
JPS6211385B2 (en)
JPS6395565A (en) Kana/kanji converting method
JPH0544699B2 (en)
JP2575947B2 (en) Phrase extraction device
JPH01287774A (en) Japanese data input processor
JPS63229561A (en) Back-up device for production/correction of document
JP2838850B2 (en) Kana-Kanji conversion device
JPS63284676A (en) Character string processor
Ryu et al. KCAT: a Korean Corpus Annotating Tool minimizing human intervention
Al-TAAni et al. Arabic numerals checker: checking agreement between numerals and counted objects in the Arabic language
JPH07122892B2 (en) Unregistered word recognition processing method in character recognition post-processing
JPS6337472A (en) Article setting system
JPS6255757A (en) Word correcting device
JPH01114983A (en) Method for estimating parts of speech
Jäppinen et al. Knowledge Engineering Applied to Morphological Analysis
JPS62271175A (en) Dictionary correction system
JPS5896376A (en) Japanese input device