CN1095573C

CN1095573C - Quick character and word identification method

Info

Publication number: CN1095573C
Application number: CN99104183A
Authority: CN
Inventors: 何代水; 纪金东
Original assignee: Inventec Group Shanghai Electronic Technology Co Ltd
Current assignee: Beijing Zhigu Tech Co Ltd
Priority date: 1999-03-24
Filing date: 1999-03-24
Publication date: 2002-12-04
Anticipated expiration: 2019-03-24
Also published as: CN1268712A

Abstract

The present invention provides a fast word recognition method, intended in particular for recognizing Spanish derived words. By recognizing derived words quickly, the method lets a Spanish electronic dictionary cover the whole Spanish vocabulary (root words and their derived words) without adding to the memory burden. All rules for converting root words into derived words are recorded, classified, and inverted to obtain restoration rules. The restoration rules are sorted, encoded, and recorded in a lookup table. The invention also provides a fast search method suited to this lookup table.

Description

Quick character and word identification method

The present invention relates to a fast word recognition method, and in particular to a method for quickly recognizing Spanish derived words. The method applies to programmed electronic devices (such as electronic dictionaries) and to software (such as word processors or computerized dictionaries). The invention can also be used for spelling correction.

With many types of electronic or computerized dictionaries, if the query word entered by the user cannot be found in the existing database, the user receives either a refusal such as "no such word" or a list of the words closest in spelling or pronunciation. However, apart from misspellings, the query word may sometimes be a derivative or a compound of a basic word that is simply not included in the database. In that case such a response is inappropriate and of no use.

In European languages such as Spanish and French, each word usually has many different derived forms; in Spanish in particular, a single verb can have more than 100 inflected forms. Spanish vocabulary can be classified into root words, compounds, derivatives, and combinations of these. A root word is itself a basic word; a compound consists of two or more words; a derivative is obtained by changing the middle affix of the root word, its suffix, or both. At present, ordinary electronic dictionaries include only root words and some of their commonly used derivatives, which clearly cannot meet practical needs.

A simple solution to this problem is to store every derivative of every root word in the electronic dictionary. This, however, requires a large amount of storage to hold the entire Spanish vocabulary; it is uneconomical, and entering the data takes a great deal of time. A method that recognizes, looks up, and corrects words more efficiently while needing less storage would therefore be of great benefit and practical value in electronic devices such as Spanish electronic dictionaries and in software such as Spanish word processors.

In view of the shortcomings of conventional word recognition devices and software, the present invention provides a fast word recognition method.

An object of the present invention is to provide a fast search method for Spanish derivatives.

Furthermore, the present invention provides a coding method that supports fast searching of Spanish derivatives in a computer or word processor.

The present invention also provides an efficient method for checking the spelling of Spanish words and for offering candidates close to the query word in spelling or pronunciation as corrections for a misspelled query word.

The disclosed method mainly comprises the following steps:

(1) collecting and classifying all rules by which derivatives are obtained from Spanish root words;

(2) encoding these rules with a coding method; and

(3) sorting the encoded rules to form a lookup table.

In addition, according to the present invention, the search procedure for each query word is as follows:

(1) first search the root word database for the query word; if it is found, output the stored data for the word and stop; otherwise

(2) look up the affixes of the query word in the affix lookup table; if any root word is found, output the stored data for that root word and stop; otherwise

(3) perform spelling correction by offering words close to the query word in spelling or pronunciation.
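The three-stage search above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `look_up`, `restore_candidates`, and `suggest` are hypothetical, and the one-entry database is invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def look_up(
    query: str,
    root_db: Dict[str, str],
    restore_candidates: Callable[[str], Iterable[str]],
    suggest: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Return (stage, words) for the three-stage search of one query word."""
    # (1) Exact match in the root word database.
    if query in root_db:
        return "exact", [query]
    # (2) Treat the query as a derivative: build candidate roots from the
    #     affix lookup table and keep those present in the database.
    roots = [c for c in restore_candidates(query) if c in root_db]
    if roots:
        return "derived", roots
    # (3) Fall back to spelling correction.
    return "suggestion", suggest(query)


# Toy data, assumed for illustration only.
ROOT_DB = {"querer": "to want"}
demo = look_up("quiero", ROOT_DB, lambda q: ["quierar", "querer"], lambda q: [])
print(demo)
```

Passing the restoration and suggestion steps in as callables keeps the control flow separate from the table lookups, which are described later in the specification.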

Furthermore, according to the present invention, each possible suffix-type derivative can be searched by reversing the word. The letter order of the query word is first reversed, so that all possible suffixes can be read off starting from the beginning of the reversed word, although these suffixes are themselves in reverse order. The suffixes obtained are then compared with the known suffixes in the previously prepared lookup table. Following the maximum match principle, the longest matching suffix is kept, the corresponding root-word suffixes are found, and each is substituted for the suffix in the query word to obtain possible candidate root words. Finally, these candidate root words are looked up in the master database.
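The effect of reversing the word can be shown with a short sketch. The function name `reversed_suffixes` is hypothetical; the point is only that every suffix of the original word becomes a prefix of the reversed word.

```python
from typing import List


def reversed_suffixes(word: str) -> List[str]:
    """List every suffix of `word`, each written in reversed letter order."""
    reversed_word = word[::-1]
    # The first n letters of the reversed word are the last n letters of the
    # original word, read backwards.
    return [reversed_word[:n] for n in range(1, len(reversed_word) + 1)]


print(reversed_suffixes("quiero")[:3])  # ['o', 'or', 'ore']
```

Because the candidates share a growing common prefix, they can be matched against a table of reversed suffixes that is sorted alphabetically, one letter at a time.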

Fig. 1 is a diagram of the process of forming the rule lookup table used in the present invention to restore Spanish derivatives to root words.

Fig. 2 is a flowchart of a typical word search procedure in a Spanish electronic dictionary according to the present invention.

Fig. 3 is a functional block diagram of the subroutine that recognizes derivatives in the present invention.

Fig. 4A is a flowchart of the subroutine that finds the suffix restoration rule for a query word.

Fig. 4B is a flowchart of the subroutine that finds the middle-affix restoration rule for a query word.

Fig. 4C is a flowchart of the subroutine that confirms possible root words among the candidates.

Figs. 5A, 5B, 5C and 5D are screens captured from a product, the Oxford electronic dictionary, showing the process from input through search to output, using "quiero" as the example.

To solve the problem of recognizing Spanish derived words, the derivation rules of all Spanish words are classified by part of speech. The following are some example rules for the various parts of speech:

(1) Nouns:

-- words ending in a vowel: plural +s

-- words ending in a consonant: plural +es

-- examples of irregular changes:

"z" -> "ces";

Rubí -> Rubíes;

Bisturí -> bisturíes;

Bambú -> bambúes;

Jersey -> jerseys; etc.

(2) Adjectives:

Adjectives distinguish feminine and masculine gender. Therefore, for example, each adjective ending in 'o' can take four suffix forms: +'o', +'a', +'os' and +'as'. Some adjectives ending in a consonant can, besides the base form, also take the forms +'a', +'as' and +'es'.

(3) Adverbs:

In Spanish, one class of adverbs is formed by putting an adjective into its feminine form and then adding 'mente', so the ending changes as 'o' -> 'amente'.

(4) Verbs:

This is the most complicated case. In Spanish, each verb can have more than 100 forms. Even leaving aside the forms that are rare in modern Spanish, about 60 derived forms remain, and many of these are irregular, with irregular changes not only in the ending but also inside the word.

Table one shows some of the affix change rules collected for the present invention.

Table one

Type of change	Derivative affix	Replacement affix
1. Verb ending changes	a	o, ar, er, ir
	aba	ar
	abais	ar
	abamos	ar
	…	…
2. Irregular verb ending changes	he	haber
	has	haber
	ha	haber
	hemos	haber
	…	…
3. Irregular plural changes	ces	z
	…	…
4. Adjective-to-adverb changes	amente	o
	…	…
5. In-word verb changes	ar	ac
	br	b
	c	z, qu
	dr	d, n
	habéis	haber

Spanish has about 2,400 such change rules, including ending change rules (changes to the suffix), in-word change rules (changes to the middle affix), and rules in which the ending and the inside of the word change together. A lookup table containing all restoration rules can therefore be formed: for the middle affix (or suffix) of a given derivative, all possible middle affixes (or suffixes) of the original word can be found (step 2 in Fig. 1). In other words, each derivative middle affix (or suffix) is associated with several root-word middle affixes (or suffixes). Specifically, the letter order of each suffix in the lookup table is reversed, and the reversed suffixes are arranged in Spanish alphabetical order (step 4). The middle affixes are likewise sorted alphabetically (step 6). This greatly speeds up the subsequent search. In a preferred embodiment of the present invention, the alphabetically sorted rules are furthermore encoded by derivative suffix (step 8), rather than simply being gathered into one large table. The resulting lookup table thus comprises three parts: an alphabetical index table for the suffix restoration table (Table two, which lists some of the index entries); the suffix restoration rule table (Table three, which lists some of the suffix restoration rules); and the middle-affix restoration rule table (Table four, which lists some of the middle-affix restoration rules).
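The construction of the suffix table and its letter index can be sketched as below. This is an illustration under stated assumptions: `RULES` and `build_tables` are hypothetical names, the rules are a handful of rows from Table three written in normal letter order, positions are zero-based rather than the rule numbers of Table two, and Python's default string ordering stands in for Spanish alphabetical order.

```python
from typing import Dict, List, Tuple

# Derivative suffix -> replacement root-word suffixes (a few rows of Table
# three, written here in normal letter order).
RULES: Dict[str, List[str]] = {
    "o": ["ar", "er", "ir", "r"],
    "ado": ["ar"],
    "ido": ["er", "ir"],
    "ando": ["ar"],
    "endo": ["er", "ir"],
    "dran": ["er", "ir"],
}


def build_tables(rules: Dict[str, List[str]]):
    """Return (rows, index): reversed, sorted rules and a first-letter index."""
    # Reverse each suffix, then sort the rows by the reversed suffix.
    rows = sorted(((s[::-1], r) for s, r in rules.items()), key=lambda row: row[0])
    index: Dict[str, Tuple[int, int]] = {}
    for pos, (rev, _) in enumerate(rows):
        low, _high = index.get(rev[0], (pos, pos))
        index[rev[0]] = (low, pos)  # extend the upper limit of this section
    return rows, index


rows, index = build_tables(RULES)
low, high = index["o"]
section = rows[low : high + 1]  # only this slice is searched for a word ending in "o"
print(index)
```

The index plays the role of Table two: the first letter of the reversed query word selects a lower and an upper limit, so the search never has to scan rules that begin with a different letter.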

Table two

Spanish letter	Lower limit	Upper limit
……	…	…
I	37	37
J	--	--
K	--	--
L	38	38
M	--	--
N	39	73
O	74	96
P	--	--
Q	--	--
R	97	98
S	99	193
T	--	--
U	--	--
……	…	…

Table three

Reversed derivative suffix	Replacement root-word suffix
……	……
nard	er, ir
o	ar, er, ir, r
oda	ar
odi	er, ir
odna	ar
odne	er, ir
odnei	er, eír, ir
odney	er, ir
og	cer, er, ir, cir, r
ogi	er
ohce	acer
oj	er, cir
oserpmi	imprimir
otircse	escribir
otirf	freír
otleuv	volver
otor	romper
otreiba	abrir
otreum	morir
otseup	poner
otsiv	ver
otsivorp	proveer
ovu	ar
oy	ir
raey	er, ir
……	……

Table four

Derivative middle affix	Replacement root-word middle affix
……	……
i	e
ic	ac
ie	e, i
is	er
iz	ac
j	g
o	u
qu	c
ub	ab
ue	o
ue	u
up	ab
us	on
uv	en
ye	e
z	c
zc	c
zg	c

According to these tables, for a word that cannot be found directly in the master database of a Spanish dictionary or electronic dictionary, the restoration rules can be applied: each possible root-word middle affix (or suffix) given by the rules replaces the middle affix (or suffix) of the derivative, which constructs all candidate root words for the word, and each candidate is then looked up in the basic root word database.

The present invention can likewise be applied to Spanish electronic dictionaries, Spanish word processors, and the like; for concreteness, however, this specification uses a Spanish electronic dictionary as the example in explaining the invention.

Fig. 2 is a flowchart of a typical word search procedure in a Spanish electronic dictionary according to the present invention. First, the user is asked to enter a query word (step 10).

After receiving the query word, the electronic dictionary searches its master database, which usually contains root words and their common derivatives, for a word with identical spelling (step 12). If the database contains the query word, the electronic dictionary outputs the data for that word directly, then ends the search and waits for the user's next instruction (step 22).

If the database does not contain the query word, the derivative recognition subroutine is started (step 14), as shown in Fig. 3. First, the affix lookup table is loaded (step 26); the suffix search and the middle-affix search are then executed in order (steps 28 and 30), and their results are finally compared with the master database (step 32).

Figs. 4A and 4B are flowcharts of the search procedures for suffix-type and middle-affix-type derivatives, respectively. According to a preferred embodiment of the present invention, the letter order of the query word is reversed before the search for a suffix-type derivative (step 36), because this makes the suffixes easy to extract. The first letter of the reversed word defines the search section within the lookup table (step 38). A search method is then used within this section to find the suffixes whose first n letters are identical to the first n letters of the reversed word, where n is a natural number increasing from 1 (steps 40 and 44). As soon as no matching suffix is found in the table, the procedure stops (step 42). If any suffix has been found, a set of restoration rules for derivative suffixes is obtained (step 46); otherwise, the query word is regarded as having no suffix change (step 52).

In this and the following procedure, the maximum match principle is used to decide among possible affixes. The principle states: if, while searching one class of affix (suffix or middle affix), several possible derivative affixes of the query word are found in the lookup table, only the derivative affix with the most letters is kept to supply the restoration rule.

Therefore, after the suffix-type derivative search described above, at most one rule remains. The root-word suffixes associated with this rule replace the suffix of the query word, forming a group of possible root words (step 50).
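The suffix search with the maximum match principle can be sketched as follows. This is a simplified illustration, not the patented implementation: `SUFFIX_RULES` holds only a few rows of Table three, `suffix_candidates` is a hypothetical name, and a dictionary scan replaces the indexed section search of the specification.

```python
from typing import Dict, List

# Reversed derivative suffix -> replacement root-word suffixes (excerpt of
# Table three; the real table is far larger).
SUFFIX_RULES: Dict[str, List[str]] = {
    "nard": ["er", "ir"],
    "o": ["ar", "er", "ir", "r"],
    "oda": ["ar"],
    "odi": ["er", "ir"],
    "odna": ["ar"],
    "odne": ["er", "ir"],
}


def suffix_candidates(query: str, rules: Dict[str, List[str]] = SUFFIX_RULES) -> List[str]:
    """Apply the longest matching suffix rule and return candidate root words."""
    reversed_word = query[::-1]
    best = None
    for n in range(1, len(reversed_word) + 1):
        prefix = reversed_word[:n]
        if prefix in rules:
            best = prefix  # a longer match replaces any shorter one
        elif not any(key.startswith(prefix) for key in rules):
            break  # no table entry begins with these n letters: stop searching
    if best is None:
        return []  # the query has no suffix change
    stem = query[: len(query) - len(best)]
    return [stem + root_suffix for root_suffix in rules[best]]


print(suffix_candidates("quiero"))
print(suffix_candidates("hablado"))
```

For 'hablado' the reversed prefixes 'o' and 'oda' both match, and only the longer one is kept, so the single candidate 'hablar' is produced instead of four candidates built from the 'o' rule.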

Next, according to Fig. 4B, the first letter and the suffix are removed from the query word (step 54). The remaining middle letters are then searched against the middle-affix restoration table (steps 56 and 60). If any matches exist, another set of rules is obtained (step 62). The maximum match principle is applied again to obtain a single restoration rule (step 64), which is used to replace the middle affix in the candidate root words found earlier (step 66). Otherwise, the query word is regarded as having no middle-affix change (step 68).
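A minimal sketch of the middle-affix search follows, assuming that every contiguous run of middle letters is tried against the table. `MID_RULES` contains only a few rows of Table four, and the name `longest_mid_affix` is hypothetical.

```python
from typing import Dict, List, Optional

# Derivative middle affix -> replacement root-word middle affixes (excerpt of
# Table four).
MID_RULES: Dict[str, List[str]] = {
    "i": ["e"],
    "ie": ["e", "i"],
    "j": ["g"],
    "zc": ["c"],
}


def longest_mid_affix(
    query: str, suffix_len: int, rules: Dict[str, List[str]] = MID_RULES
) -> Optional[str]:
    """Return the longest middle affix of `query` that has a restoration rule."""
    # Drop the first letter and the suffix already matched by the suffix search.
    middle = query[1 : len(query) - suffix_len]
    best: Optional[str] = None
    for start in range(len(middle)):
        for end in range(start + 1, len(middle) + 1):
            piece = middle[start:end]
            if piece in rules and (best is None or len(piece) > len(best)):
                best = piece  # maximum match principle: keep the longest
    return best


print(longest_mid_affix("quiero", 1))  # ie
```

With the suffix 'o' removed, 'quiero' leaves the middle part 'uier'; both 'i' and 'ie' have rules, and the longer 'ie' is the one retained.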

Finally, the results of the two searches (suffix and middle affix) are combined into a new group of possible root words for comparison with the master database. As shown in Fig. 4C, each candidate in the new group is taken in turn and searched for in the master database until all words have been looked up (steps 70, 72, 74 and 76). If any one of them is recognized, that root word and its data are output; if several are found, all of these candidates are output for the user to choose from (step 80).

Otherwise, when no possible root word can be formed because no suffix or middle-affix restoration rule is found, the query word is sent to the spelling correction subroutine (step 82). The words closest in spelling or pronunciation are output for the user to choose from (step 20).

The whole procedure ends once the data for the query word has been output, whether that data belongs to a possible root word or to a word produced by spelling correction (step 22).

How the present invention works is explained below with 'quiero' as an example.

Suppose the query word 'quiero' is not included in the dictionary's master database. The word is then provisionally treated as a possible derivative, and the following steps are carried out.
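The specification does not say how the spelling correction subroutine ranks words. As a stand-in only, the sketch below uses the similarity matching in Python's standard `difflib` module; the function name `suggest`, the cutoff value, and the vocabulary are assumptions made for illustration.

```python
import difflib
from typing import List


def suggest(query: str, vocabulary: List[str], limit: int = 3) -> List[str]:
    """Return up to `limit` vocabulary words that resemble `query` in spelling."""
    # get_close_matches ranks words by a similarity ratio and drops those
    # scoring below the cutoff.
    return difflib.get_close_matches(query, vocabulary, n=limit, cutoff=0.6)


print(suggest("qerer", ["querer", "comer", "vivir"]))
```

This covers closeness in spelling only; matching by pronunciation, which the specification also mentions, would need a separate phonetic comparison.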

To begin, a suffix search (Fig. 4A) is performed on the query word. According to the present invention, 'quiero' is reversed to 'oreiuq' for the suffix search (step 36), so the possible "suffixes" are now 'o', 'or', 'ore' and so on, which are the reversals of the original suffixes. Because the first letter is 'o', according to a preferred embodiment of the present invention the 'o' section of the lookup table, lying between the 74th and the 96th restoration rules, is selected (step 38) so that matching "suffixes" can be retrieved quickly.

First, the first letter 'o' of the reversed word is taken and compared with the selected section of the lookup table (step 40), which yields the restoration rule {'o' -> 'ar', 'er', 'ir', 'r'}. The next letter 'r' is then appended to 'o' to form 'or' for further comparison (step 44). However, 'or' has no restoration rule, so the suffix search for this derivative stops (step 42).

In this example, therefore, according to the maximum match principle 'o' is the only possible "suffix", which means 'o' is the likely suffix of the candidate derivative. Following the restoration rule, the procedure replaces the suffix 'o' of the original query word with 'ar', 'er', 'ir' and 'r'. This forms the first group of possible root words {quierar, quierer, quierir, quierr} for further comparison with the master database (step 50).

The next procedure is the middle-affix search (Fig. 4B). First, the first letter and the maximum-match suffix 'o' are removed from the query word, leaving the middle part 'uier'. The search first tries to match 'u' (step 56), but 'u' has no restoration rule. Then 'i' is compared (step 58) and the restoration rule {'i' -> 'e'} is found (step 56). The search nevertheless continues until no further restoration rule can be found (step 60). After all comparisons, two restoration rules have been found, {'i' -> 'e'} and {'ie' -> 'i', 'e'} (step 62). According to the maximum match principle, 'ie' is the maximum-match middle affix, so only the rule {'ie' -> 'i', 'e'} is kept for the subsequent replacement (step 64). Finally, 'i' and 'e' replace the 'ie' in the words of the first group to form the second group of possible root words (step 66).

The complete set of possible root words is now the union of the first and second groups, namely {quierar, quierer, quierir, quierr, querar, querer, querir, querr, quirar, quirer, quirir, quirr}.

The words in this new set are taken one by one (step 70 in Fig. 4C) and searched for in the master database (step 72) until all candidate words have been looked up (step 74). In this example, only 'querer' is found to be a valid root word of 'quiero' (step 78), so it is output to the user (step 80).

Fig. 5A shows the screen that appears when the user enters the query word 'quiero' on a product, the Oxford Spanish electronic dictionary. At the same time, the electronic dictionary lists the words closest in spelling. Fig. 5B shows the screen when input is complete. Fig. 5C shows the transient screen while the electronic dictionary is searching. Fig. 5D shows the search result that is output.
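The twelve candidates for 'quiero' can be reproduced with a few lines of code. This is a sketch under stated assumptions: the two retained rules are passed in directly, `restore` is a hypothetical name, the first occurrence of the middle affix is the one replaced, and the one-word database is invented for the example.

```python
from typing import List


def restore(query: str, suffix: str, suffix_roots: List[str],
            mid: str, mid_roots: List[str]) -> List[str]:
    """Build candidate roots from one suffix rule and one middle-affix rule."""
    stem = query[: len(query) - len(suffix)]
    first_group = [stem + s for s in suffix_roots]
    # Simplification: replace the first occurrence of `mid` in each candidate.
    second_group = [c.replace(mid, m, 1) for m in mid_roots for c in first_group]
    return first_group + second_group


candidates = restore("quiero", "o", ["ar", "er", "ir", "r"], "ie", ["e", "i"])
ROOT_DB = {"querer"}  # toy master database, assumed for illustration
found = [c for c in candidates if c in ROOT_DB]
print(len(candidates), found)  # 12 ['querer']
```

Only the final database check separates the real root word from the eleven spurious candidates, which is why the rules themselves need no exception handling.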

With the present invention, a great deal of memory can be saved in an electronic dictionary. For example, the Oxford Spanish electronic dictionary includes only 18,361 words, occupying 161 KB, yet can recognize 500,000 words. By contrast, including all 500,000 words would require 4 MB of ROM, so the saving is nearly 25-fold.

Although only one specific embodiment has been fully described above, the scope of the present invention should not be limited to it. Because the spirit of the invention lies in establishing the affix restoration rule lookup table together with a matching search method, various modifications can be made to the lookup table, and other search methods can also be used. The scope of the present invention should be defined by the appended claims.

Claims

1. A method of derivative recognition, the method comprising at least:

establishing a lexical database;

producing a first table, the first table comprising a plurality of derivative suffixes, each derivative suffix being associated with several replacement root-word suffixes;

producing a second table, the second table comprising a plurality of derivative middle affixes, each derivative middle affix being associated with several replacement root-word middle affixes;

inputting a query word;

extracting from the query word the derivative suffix that matches the first table and has the most letters;

selecting the plurality of replacement suffixes associated with the extracted derivative suffix;

replacing the extracted derivative suffix with the plurality of replacement suffixes to produce a first group of words;

extracting from the query word the derivative middle affix that matches the second table and has the most letters;

selecting the plurality of replacement middle affixes associated with the extracted derivative middle affix;

replacing the extracted derivative middle affix with the plurality of replacement middle affixes to produce a second group of words;

combining the first group of words and the second group of words to produce a plurality of candidate words, and obtaining the root word from among them; and

outputting the root word.

2. The method of claim 1, wherein the lexical database comprises at least Spanish root words.

3. The method of claim 1, wherein the first table is sorted by the alphabetical order of the derivative suffixes.

4. The method of claim 3, wherein the sorted derivative suffixes are further divided into a plurality of groups and are encoded according to the classification of the groups.

5. The method of claim 1, wherein the second table is sorted by the alphabetical order of the derivative middle affixes.

6. The method of claim 5, wherein the sorted derivative middle affixes are further divided into a plurality of groups and are encoded according to the classification of the groups.

7. The method of claim 1, wherein the letter order of the derivative suffixes in the first table is reversed, and the letter order of the derivative suffix is also reversed when it is extracted.

8. The method of claim 1, wherein the extracted derivative suffix is obtained by searching all derivative suffixes in the first table and keeping the one that is found and has the most letters.

9. The method of claim 1, wherein the extracted derivative middle affix is obtained by searching all derivative middle affixes in the second table and keeping the one that is found and has the most letters.

10. A method of derivative recognition, the method comprising at least:

establishing a lexical database;

inputting a query word;

extracting from the query word the derivative middle affix that matches the second table and has the most letters;

selecting the several replacement middle affixes associated with the extracted derivative middle affix;

replacing the extracted derivative middle affix with the plurality of replacement middle affixes to produce a first group of words;

extracting from the query word the derivative suffix that matches the first table and has the most letters;

replacing the extracted derivative suffix with the plurality of replacement suffixes to produce a second group of words;

combining the first group of words and the second group of words to produce a plurality of candidate words, and obtaining the root word from among them; and

outputting the root word.

11. The method of claim 10, wherein the lexical database comprises at least Spanish root words.

12. The method of claim 10, wherein the first table is sorted by the alphabetical order of the derivative suffixes.

13. The method of claim 12, wherein the sorted derivative suffixes are further divided into a plurality of groups and are encoded according to the classification of the groups.

14. The method of claim 10, wherein the second table is sorted by the alphabetical order of the derivative middle affixes.

15. The method of claim 14, wherein the sorted derivative middle affixes are further divided into a plurality of groups and are encoded according to the classification of the groups.

16. The method of claim 10, wherein the letter order of the derivative suffixes in the first table is reversed, and the letter order of the derivative suffix is also reversed when it is extracted.

17. The method of claim 10, wherein the extracted derivative suffix is obtained by searching all derivative suffixes in the first table and keeping the one that is found and has the most letters.

18. The method of claim 10, wherein the extracted derivative middle affix is obtained by searching all derivative middle affixes in the second table and keeping the one that is found and has the most letters.