CN1645361A - Device and method for broad normalization - Google Patents

Publication number
CN1645361A
CN1645361A CN 200510023588 CN200510023588A CN1645361A CN 1645361 A CN1645361 A CN 1645361A CN 200510023588 CN200510023588 CN 200510023588 CN 200510023588 A CN200510023588 A CN 200510023588A CN 1645361 A CN1645361 A CN 1645361A
Authority
CN
China
Prior art keywords
linguistic unit
sub
unit
linguistic
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200510023588
Other languages
Chinese (zh)
Inventor
刘健
吴耿锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by University of Shanghai for Science and Technology
Priority to CN 200510023588


Abstract

A broad normalization (generalized reduction) device comprises a storage component for a reductive rule table, a storage component for candidate queues, and a reduction component. The method obtains a linguistic unit, matches it against each sub-former piece of each reductive rule, and adds it to the candidate queue of every sub-former piece it matches. When the unit matches the last sub-former piece of a rule, a traversal algorithm searches the possible linguistic unit combinations, a new linguistic unit is generated for each combination according to the consequent of the rule, and the newly generated linguistic unit is output.

Description

The apparatus and method of broad normalization
Technical field
The present invention relates to an apparatus and method for text analysis in the field of natural language processing, falling under subclass G06F 17/27 of the International Patent Classification (IPC), and in particular to an apparatus and method for reducing entities at different language levels.
Background technology
Most text analysis work comprises two steps: lexical analysis and syntactic analysis. Lexical analysis determines which characters can form single words and which single words can form phrases. Syntactic analysis (also called grammatical analysis) determines which single words or phrases can form grammatical constituents and which grammatical constituents can be reduced further, until a sentence is formed. On the basis of syntactic analysis, semantic analysis can also be carried out, that is, analyzing the concepts represented by the different grammatical constituents and then determining which concepts can be combined into more complex concepts.
Whether it is lexical, grammatical, or semantic analysis, viewed from the perspective of bottom-up parsing its essence is the process of reducing entity languages according to certain rules to form new entity languages.
Abstracting further, it is not hard to see that the purpose of text analysis is to identify the various entity languages in a text and the relations among them. A relation among linguistic units states which linguistic units are reduced, according to a specific reductive rule, to a specific linguistic unit.
What an entity language refers to differs from one language level to another: at the lexical level it can be a single word or a phrase; at the syntactic level it can be a grammatical constituent; at the semantic level it can be a semantic concept. As research on text processing deepens, these language levels may grow or change.
In an implementation, an entity language is generally represented as a data structure expressed by marks (a linguistic unit). At the lexical level a linguistic unit represents a letter, a single word, or a phrase; at the grammatical level it represents a single word, a phrase, or a grammatical constituent; at the semantic level it represents a grammatical constituent or a semantic concept.
Traditional analysis methods start from determining the necessary relation between each linguistic unit and its neighbors, so the reduction they adopt is adjacent reduction, in which only directly neighboring linguistic units are reduced.
In some situations, however, people need (or are forced) to know the possible, rather than necessary, relations among linguistic units. One example is non-strict analysis of a large volume of text in a short time in order to obtain statistically meaningful conclusions. Another is text analysis in the presence of interfering text or characters irrelevant to the analysis: using traditional text analysis techniques there requires text denoising, which depends on specialized domain knowledge and sophisticated knowledge modeling, and the results are unsatisfactory in some applications. In such cases approximate text analysis is adopted.
In these situations the adjacent reduction used by traditional analysis methods does not work well. Approximate text analysis therefore requires broad normalization.
Broad normalization means searching a set of existing linguistic units (the external representation of entity languages) for combinations of linguistic units that match a specific rule, so as to obtain a new linguistic unit. The linguistic units in a matching combination do not necessarily stand in the language-level relation that the rule represents; the match only indicates that the combination may stand in that relation.
Because broad normalization searches the existing linguistic units for every possible combination that matches a rule, the key to an efficient broad normalization device is how to make the search for linguistic unit combinations efficient.
Summary of the invention
The object of the invention is to solve the above problem by providing an apparatus and a method for broad normalization. Using the method, the apparatus stores the linguistic units obtained from outside and determines which linguistic units can be reduced to new linguistic units according to specific rules. In particular, to make the search for linguistic unit combinations more efficient, the method is incremental: a linguistic unit combination that has already been traversed is never traversed again.
To achieve the above object, the apparatus and method for broad normalization of the present invention are as follows:
The broad normalization device is mainly characterized in that it comprises:
(1) a storage component for the reductive rule table, used to store reductive rules; each reductive rule comprises one or more sub-former pieces, each storing the conditions a linguistic unit must satisfy in order to match; each reductive rule also comprises a consequent, which stores all or part of the content of the linguistic unit to be produced once the rule is matched; the information stored in a linguistic unit describes the entity language, at a specific language level, that corresponds to a passage of text;
(2) a storage component for the candidate formations, used to store the linguistic units that match a specific sub-former piece of a specific rule;
(3) a reduction component, used to match each linguistic unit obtained through the input means against each sub-former piece of each rule one by one, and, when the linguistic unit matches a specific sub-former piece of a specific rule, to add it to the candidate formation corresponding to that sub-former piece; during this matching, if the linguistic unit matches the last sub-former piece of a rule that has n sub-former pieces, a traversal algorithm is used to search the linguistic unit combinations, each of which satisfies:
(a) the i-th member comes from the candidate formation of the i-th sub-former piece of this rule (1≤i≤n-1);
(b) the n-th member is this linguistic unit;
(4) an input component, which obtains linguistic units input from outside;
(5) an output component, which outputs the newly produced linguistic units;
The output of the input component is connected to the input of the reduction component, and the output of the reduction component is connected to the input of the output component; the reduction component is also connected to the storage component of the reductive rule table and to the storage component of the candidate formations. For each linguistic unit combination, a new linguistic unit is produced according to the consequent of the rule and output through the output component.
The linguistic unit of the device further comprises a text region indicating the area that the entity language represented by the linguistic unit occupies in the text; in this case the reduction component operates as follows:
(a) when searching for linguistic unit combinations, in every combination the text region of the i-th linguistic unit does not overlap that of the (i+1)-th linguistic unit and lies to its left (1≤i≤n-1);
(b) the text region of the newly obtained linguistic unit is the superposition of the text regions of the linguistic units in the corresponding combination.
The method of broad normalization using the above device is mainly characterized in that it comprises the following steps:
(1) obtaining a linguistic unit through the input means; the information stored in the linguistic unit describes the entity language, at a specific language level, that corresponds to a passage of text;
(2) matching this linguistic unit against each sub-former piece of each reductive rule one by one; each reductive rule contains one or more sub-former pieces, each storing the conditions a linguistic unit must satisfy in order to match; each sub-former piece has a corresponding candidate formation that stores the linguistic units matching it; each reductive rule also comprises a consequent, which stores all or part of the content of the linguistic unit to be produced once the rule is matched; when the linguistic unit matches a specific sub-former piece of a specific rule, it is added to the candidate formation corresponding to that sub-former piece; during this matching, if the linguistic unit matches the last sub-former piece of a rule that has n sub-former pieces, a traversal algorithm is used to search the linguistic unit combinations, each of which satisfies:
(a) the i-th member comes from the candidate formation of the i-th sub-former piece of this rule (1≤i≤n-1);
(b) the n-th member is this linguistic unit;
For each linguistic unit combination, a new linguistic unit is produced according to the consequent of the rule and output through the output means.
The linguistic unit of the method further comprises a text region indicating the area that the entity language represented by the linguistic unit occupies in the text, and the reduction proceeds as follows:
(a) when searching for linguistic unit combinations, in every combination the text region of the i-th linguistic unit does not overlap that of the (i+1)-th linguistic unit and lies to its left (1≤i≤n-1);
(b) the text region of the newly obtained linguistic unit is the superposition of the text regions of the linguistic units in the corresponding combination.
The computer-readable storage medium storing a program that implements the above broad normalization method is mainly characterized in that the program performs the following steps:
(1) obtaining a linguistic unit through the input means; the information stored in the linguistic unit describes the entity language, at a specific language level, that corresponds to a passage of text;
(2) matching this linguistic unit against each sub-former piece of each reductive rule one by one; each reductive rule contains one or more sub-former pieces, each storing the conditions a linguistic unit must satisfy in order to match; each sub-former piece has a corresponding candidate formation that stores the linguistic units matching it; each reductive rule also comprises a consequent, which stores all or part of the content of the linguistic unit to be produced once the rule is matched; when the linguistic unit matches a specific sub-former piece of a specific rule, it is added to the candidate formation corresponding to that sub-former piece; during this matching, if the linguistic unit matches the last sub-former piece of a rule that has n sub-former pieces, a traversal algorithm is used to search the linguistic unit combinations, each of which satisfies:
(a) the i-th member comes from the candidate formation of the i-th sub-former piece of this rule (1≤i≤n-1);
(b) the n-th member is this linguistic unit;
For each linguistic unit combination, a new linguistic unit is produced according to the consequent of the rule and output through the output means.
The linguistic unit of the storage medium further comprises a text region indicating the area that the entity language represented by the linguistic unit occupies in the text, and the reduction proceeds as follows:
(a) when searching for linguistic unit combinations, in every combination the text region of the i-th linguistic unit does not overlap that of the (i+1)-th linguistic unit and lies to its left (1≤i≤n-1);
(b) the text region of the newly obtained linguistic unit is the superposition of the text regions of the linguistic units in the corresponding combination.
Because the apparatus and method of this invention restrict the last element of every linguistic unit combination to the new linguistic unit that has just entered the system, the combinations searched in the current pass necessarily differ from those of every earlier pass, and no combination is missed. The search is therefore incremental, which improves the efficiency of searching for linguistic unit combinations and makes the invention more practical.
Description of drawings
Fig. 1 is a functional block diagram of the device for incremental broad normalization according to the present invention.
Fig. 2 is a hardware block diagram of the device in Fig. 1.
Fig. 3 is a structural block diagram of the linguistic unit of the present invention.
Fig. 4 is a structural block diagram of the reductive rule table of the present invention.
Fig. 5 is a structural block diagram of the candidate formation of the present invention.
Fig. 6 is a schematic diagram of the relation between the sub-former pieces of a reductive rule and the candidate formations.
Fig. 7 is a flowchart of the main procedure MainProc of the incremental broad normalization method.
Fig. 8 is a flowchart of the procedure GenNewUnit, which is called by the main procedure MainProc to search for linguistic unit combinations and produce new linguistic units.
Fig. 9 is a flowchart of the procedure Stk2Unit, which is called by the procedure GenNewUnit to produce a new linguistic unit from a given stack of linguistic units.
Embodiment
For a clearer understanding of the technical content of the invention, the following embodiment is described in detail.
Referring to Fig. 1, the reduction component 102 obtains linguistic units input from outside through the input component 101. Following the reductive rule table 104 and using the incremental broad normalization method described below, it copies each newly obtained linguistic unit into the qualifying candidate formations in the candidate queue table 105. When specific conditions are met, it searches the candidate queue table 105 for linguistic unit combinations that satisfy those conditions, obtains new linguistic units through the corresponding rule in the reductive rule table 104, and sends them out of the device through the output component 103.
Referring to Fig. 2, the processor 201 executes the incremental broad normalization program described below. RAM 202 provides the storage needed while the program runs and also stores the reductive rule table 104 and the candidate queue table 105. ROM 203 stores the incremental broad normalization program. The I/O interface 204 is connected to the input component 101 and the output component 103. The input component 101 can be a keyboard, an OCR device, a receiver, or memory holding the linguistic units to be processed. The output component 103 can be a display device, a printer, a network interface, or memory. The bus 205 connects all of these components.
The information stored in a linguistic unit describes the entity language, at a specific language level, that corresponds to a passage of text; in particular, it does so by means of marks.
A linguistic unit can be implemented in several ways, for example:
1. a single mark represents the category of the entity language;
2. several marks represent the category of the entity language in different respects, and these marks form a set.
The single-mark approach is used by most language analysis devices and software during reduction; its advantage is that it is simple and direct. The multi-mark set approach is used in text analysis methods such as functional unification grammar. For simplicity, this embodiment uses the single-mark implementation. A person with general knowledge of computer science can, with reference to this embodiment, readily implement the invention with linguistic units realized as multi-mark sets.
In an implementation, one can choose whether a linguistic unit includes text region information describing the area the entity language occupies in the text.
If text regions are not used, the sub-former pieces of a reductive rule place no constraint on position, and the reduction is out-of-order reduction. Out-of-order reduction is useful in some situations: analyzing text in certain languages, such as Latin, in which the grammatical role of each word is distinguished by inflection rather than by its position in the sentence; analyzing text when grammatical requirements are not strict; or analyzing text when speed matters more than accuracy.
If text regions are used, the sub-former pieces of a reductive rule constrain position, and the reduction is ordered reduction. Most natural languages, such as Chinese and English, are suited to ordered reduction.
This embodiment uses the ordered reduction mode, so each linguistic unit includes a text region. A person with general knowledge of computer science can, with reference to this embodiment, readily implement the invention in the out-of-order reduction mode.
The structure of the linguistic unit in this embodiment is shown in Fig. 3. Because the ordered reduction mode is used, each linguistic unit 301 comprises a main body 302 and a text region 303; the main body 302 comprises a mark 304 and supplementary information 305. In the out-of-order reduction mode, the linguistic unit 301 does not include the text region 303.
The mark 304 identifies a language concept at the different language levels. English POS (part-of-speech) tags can be used, or marks can be defined as needed. If the invention is implemented as a module of a text analysis device or program, the marks given by that device or program are used.
The following example marks are given for reference when implementing:
Language level    Mark    Meaning
Lexical           WRD     Single word
Syntactic         PHR     Phrase
Syntactic         V       Verb
Syntactic         N       Noun
Syntactic         EVT     Event
Semantic          ENT     Entity
Marks can be peers, with no subordination among them, or they can be hierarchical. With peer marks, comparing marks A and B means judging whether A and B are identical. With hierarchical marks, it means judging whether A is a subclass of B, B is a subclass of A, or A and B are unrelated. In this embodiment the marks are peers.
The supplementary information 305 describes what the mark cannot, for example: when a linguistic unit represents a single word, the supplementary information stores the word as a character string; when it represents a phrase, the supplementary information stores the phrase as a character string.
The text region 303 describes the area that the entity language represented by the linguistic unit occupies in the text. It can be implemented as:
1. an interval written as a pair of numbers, giving the boundaries of the area in the text. For example, (3,10) denotes the area of the text that starts at the character labeled 3 and ends at the character labeled 10;
2. a set of numbers, stating which positions hold characters that belong to this entity language.
For example, {3,4,5,10} means that this entity language covers the characters labeled 3, 4, 5 and 10.
This embodiment uses the number-pair form.
Below, some example linguistic units with the structure of Fig. 3 are described; the marks and their meanings are those of the mark examples given above:
1. Linguistic unit (WRD, "in", (3,3)):
The mark is WRD, meaning that the entity language represented by this linguistic unit is a single word; the supplementary information is "in", meaning that the word is "in"; the text region occupied covers the character labeled 3;
2. Linguistic unit (PHR, "China", (3,5)):
The mark is PHR, meaning that the entity language represented by this linguistic unit is a phrase; the supplementary information is "China", meaning that the phrase is "China"; the text region occupied is (3,5);
3. Linguistic unit (V, NULL, (4,7)):
The mark is V, meaning that the entity language represented by this linguistic unit is a verb; the supplementary information is empty, meaning that the linguistic unit gives no further description of this entity language; the text region occupied is (4,7).
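The three example units above can be written down as a small data structure. The following is a minimal sketch, not the patent's own code: the class name `LinguisticUnit` and the field names `mark`, `aux` and `region` are assumptions chosen for illustration, with `None` standing in for NULL and for a missing text region in the out-of-order mode.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LinguisticUnit:
    """Hypothetical single-mark linguistic unit (names are illustrative)."""
    mark: str                                  # e.g. "WRD", "PHR", "V"
    aux: Optional[str] = None                  # supplementary information; None plays the role of NULL
    region: Optional[Tuple[int, int]] = None   # inclusive (start, end) labels; None in out-of-order mode


# The three example units described in the text.
word = LinguisticUnit("WRD", "in", (3, 3))
phrase = LinguisticUnit("PHR", "China", (3, 5))
verb = LinguisticUnit("V", None, (4, 7))
```

Making the class frozen is a design choice of this sketch: units copied into several candidate formations then cannot be modified by accident.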
The reductive rule table contains a number of reductive rules. The structure of a single reductive rule is shown in Fig. 4. A reductive rule 401 comprises a former piece 402 and a consequent 403. The former piece 402 comprises several sub-former pieces (404, 405, ..., 406).
Each sub-former piece stores the matching condition for a linguistic unit. A sub-former piece can be implemented in several ways:
1. It can be expressed as the main body of a linguistic unit. Matching first checks the marks: if the two marks differ (peer mark system), or the mark of the linguistic unit is neither the mark of the sub-former piece nor a subclass of it (hierarchical mark system), there is no match. If the marks match, the supplementary information is checked: if the supplementary information of the sub-former piece is empty, the two match; if it is not empty and differs from the supplementary information of the linguistic unit, the two do not match.
2. It can be expressed as a set of conditional expressions. Matching checks whether the linguistic unit satisfies every conditional expression in the set.
This embodiment uses the main body of a linguistic unit as the content of a sub-former piece.
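The first matching approach can be sketched for the peer mark system as follows. This is an illustrative sketch only: the function name `matches` and the tuple layout `(mark, supplementary information)` are assumptions, and `None` stands in for empty (NULL) supplementary information.

```python
from typing import Optional, Tuple

Body = Tuple[str, Optional[str]]  # (mark, supplementary information)


def matches(unit_body: Body, sub_former: Body) -> bool:
    """Peer-mark matching of a unit's main body against a sub-former piece."""
    unit_mark, unit_aux = unit_body
    sub_mark, sub_aux = sub_former
    if unit_mark != sub_mark:      # marks must be identical in a peer mark system
        return False
    if sub_aux is None:            # empty supplementary information matches anything
        return True
    return sub_aux == unit_aux     # otherwise the supplementary information must agree


print(matches(("WRD", "people"), ("WRD", "people")))
print(matches(("N", "dog"), ("N", None)))
print(matches(("N", None), ("N", "dog")))
```

A hierarchical mark system would replace the equality test on marks with a subclass check; that variant is not shown here.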
The consequent stores the content of the linguistic unit to be produced once the reductive rule is matched. In the out-of-order reduction mode, the linguistic unit to be produced consists only of a main body, which is described in the consequent. In the ordered reduction mode, the main body of the linguistic unit to be produced is described in the consequent, while its text region is not described there but is determined by the incremental broad normalization method described later.
Below, reductive rules are illustrated with examples:
1. Reductive rule <(WRD, "people")> → (PHR, "mankind")
The sub-former piece is (WRD, "people"); the consequent is (PHR, "mankind"). This rule reduces the single word "people" to the phrase "mankind";
2. Reductive rule <(N, NULL), (V, NULL)> → (EVT, NULL)
Sub-former piece 1 is (N, NULL) and sub-former piece 2 is (V, NULL); the consequent is (EVT, NULL). This rule reduces an entity language representing a noun, together with an entity language representing a verb, to an entity language representing an event.
The structure of a candidate formation is shown in Fig. 5. A candidate formation 501 comprises several linguistic units (502, 503, ..., 504). A candidate formation can be implemented in several ways, such as an array, a linked list, or a doubly linked list. In this embodiment it is implemented as a linked list.
The association between candidate formations and sub-former pieces can be implemented in several ways, for example:
1. All candidate formations are kept together. For example, a candidate queue table is built in which every record comprises a rule identifier, a sub-former piece identifier, and a candidate formation;
2. Each candidate formation is attached to its own sub-former piece. For example, a pointer to the candidate formation is added to the data structure of the sub-former piece, or the candidate formation is added as a member of the data structure or class of the sub-former piece.
For ease of explanation, this embodiment uses the candidate queue table. A person with general knowledge of computer science can, with reference to this embodiment, readily implement the invention with candidate formations attached to sub-former pieces.
A linguistic unit can be stored in a candidate formation in several ways, for example:
1. all the data of the linguistic unit are copied into the candidate formation as a member;
2. an identifier of the linguistic unit is copied into the candidate formation as a member; the identifier can be a label the system assigns to each linguistic unit, the memory address of the data structure of the linguistic unit, and so on.
The embodiment given here copies all the data of the linguistic unit into the candidate formation. If identifiers are stored instead, the linguistic units must be kept in a data structure built in advance so that the reduction method described below can access them.
As shown in Fig. 6, for each reductive rule, every one of its sub-former pieces corresponds to one candidate formation.
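One way to picture the candidate queue table is a mapping from (rule index, sub-former piece index) to a queue of units. The sketch below assumes a Python dictionary of lists instead of the linked lists of the embodiment; the names `CandTable` and `new_cand_table` and the tuple layout of a unit are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

# Key: (rule index I, sub-former piece index J), both 1-based as in the text.
# Value: the candidate formation, with the tail of the queue at the end of the list.
CandTable = DefaultDict[Tuple[int, int], List[tuple]]


def new_cand_table() -> CandTable:
    """Create an empty candidate queue table; missing queues start out empty."""
    return defaultdict(list)


cand = new_cand_table()
cand[(2, 1)].append(("N", None, (1, 2)))   # a noun matching sub-former piece 1 of rule 2
cand[(2, 2)].append(("V", None, (4, 7)))   # a verb matching sub-former piece 2 of rule 2
```

A list gives the same tail-to-head traversal as a doubly linked list when iterated in reverse, which is why it is used in this sketch.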
The main procedure MainProc of incremental broad normalization matches each new linguistic unit against every sub-former piece of every rule one by one, and saves it in the candidate formation of each sub-former piece it matches. If the matched sub-former piece is the last sub-former piece of a rule, an incremental search for linguistic unit combinations starts in the candidate formations associated with that rule.
One possible sequence of steps for the main procedure MainProc, given for reference, is as follows (see Fig. 7):
A01: obtain a linguistic unit U from the input component 101
A02: obtain the number of rules N in the reductive rule table 104
A03: let I=1
A04: if I<=N, go to step A05; otherwise end
A05: obtain the I-th rule R(I) of the reductive rule table
A06: obtain the number of sub-former pieces S of R(I)
A07: let J=1
A08: if J<=S, go to step A09; otherwise go to step A12
A09: obtain the J-th sub-former piece Pre(I, J) of R(I)
A10: if U matches Pre(I, J), go to step A13; otherwise go to step A11
A11: J=J+1; go to step A08
A12: I=I+1; go to step A04
A13: find the candidate formation Cand(I, J) in the candidate queue table 105
A14: add U to the tail of Cand(I, J)
A15: if J=S, go to step A16; otherwise go to step A11
A16: call the procedure GenNewUnit(U, I, S); go to step A12
The matching in step A10 means that the linguistic unit meets the requirements of the sub-former piece of the rule. For details, see the matching approaches described above for the implementation of sub-former pieces.
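Steps A01 to A16 can be sketched with nested loops in place of the explicit jumps. This is an illustration under assumptions, not the patent's implementation: a unit is a tuple `(mark, aux, region)`, a rule is a pair `(sub_formers, consequent)`, the candidate queue table is a dictionary keyed by `(I, J)`, and `gen_new_unit` is passed in as a callback so that the sketch stays self-contained.

```python
from collections import defaultdict


def matches(unit, sub_former):
    """Peer-mark match of a unit (mark, aux, region) against a sub-former piece (mark, aux)."""
    mark, aux, _region = unit
    sub_mark, sub_aux = sub_former
    return mark == sub_mark and (sub_aux is None or sub_aux == aux)


def main_proc(unit, rules, cand, gen_new_unit):
    """Process one input unit; comments name the corresponding steps."""
    for i, (subs, _consequent) in enumerate(rules, start=1):   # A03-A06, A12
        for j, sub in enumerate(subs, start=1):                # A07-A09, A11
            if not matches(unit, sub):                         # A10
                continue
            cand[(i, j)].append(unit)                          # A13-A14: add to the tail
            if j == len(subs):                                 # A15: last sub-former piece?
                gen_new_unit(unit, i, j)                       # A16


# Rule 1: <(N, NULL), (V, NULL)> -> (EVT, NULL)
rules = [([("N", None), ("V", None)], ("EVT", None))]
cand = defaultdict(list)
calls = []


def record(unit, i, s):
    """Stand-in for GenNewUnit that only records its arguments."""
    calls.append((unit, i, s))


main_proc(("N", None, (1, 2)), rules, cand, record)
main_proc(("V", None, (4, 7)), rules, cand, record)
```

Only the second call reaches the callback, because only the verb matches the last sub-former piece of the rule.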
The main procedure calls a traversal algorithm to search for linguistic unit combinations in the candidate formations associated with a specific rule. In this embodiment, this is done in step A16 by calling the procedure GenNewUnit. To keep the search incremental, the last element of every linguistic unit combination is restricted to the new linguistic unit that has just entered the system.
Compared with the out-of-order reduction mode, the ordered reduction mode adds one restriction to the search: each member of a linguistic unit combination and the member after it must not overlap in text region, and the former must lie to the left of the latter. For example, if the text region of linguistic unit A is (4,7) and that of linguistic unit B is (5,8), A and B share part of their text regions, so they overlap. If the text region of linguistic unit C is (8,10), A and C do not overlap, and the right boundary of A, 7, is less than the left boundary of C, 8, so A lies to the left of C.
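The position test on number-pair text regions can be sketched as two small functions. The names `overlaps` and `left_of` are illustrative, and regions are assumed to be inclusive `(start, end)` pairs as in the examples above.

```python
def overlaps(a, b):
    """True if inclusive intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def left_of(a, b):
    """True if a ends before b starts, which also rules out any overlap."""
    return a[1] < b[0]


A, B, C = (4, 7), (5, 8), (8, 10)
print(overlaps(A, B))   # A and B share positions 5 to 7
print(left_of(A, C))    # right boundary 7 is less than left boundary 8
```

Because `left_of` already implies that the two regions do not overlap, a search only needs this single comparison per candidate.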
One possible implementation of the method GenNewUnit, given for reference, is as follows (see Fig. 8):
B01: Initialize: obtain the address PU of U in Cand(I, S); empty the stack Stk; push <S, PU> onto Stk
B02: Read the top element <X, Y> of Stk
B03: Assign the text region of the linguistic unit pointed to by Y to YT, i.e. YT=Y->TxtRgn
B04: If X>1, go to step B05; otherwise go to step B11
B05: X=X-1
B06: Point the pointer PC to the tail of the candidate queue Cand(I, X)
B07: If PC has reached the head of the queue, go to step B12; otherwise go to step B08
B08: Assign the text region of the linguistic unit pointed to by PC to CT, i.e. CT=PC->TxtRgn
B09: If CT and YT do not overlap and CT lies to the left of YT, go to step B18; otherwise go to step B10
B10: Move the pointer PC forward by one unit, i.e. PC=PC->Prv; go to step B07
B11: Call the procedure Stk2Unit(I, Stk)
B12: Pop the top element of Stk
B13: If the stack Stk is empty, finish; otherwise go to step B14
B14: Read the top element <X, Y> of Stk
B15: If Y has reached the head of the queue, go to step B12; otherwise go to step B16
B16: Move the pointer Y forward by one unit, i.e. Y=Y->Prv
B17: Push <X, Y> onto Stk; go to step B02
B18: Push <X, *PC> onto Stk; go to step B02
In step B18, *PC denotes the linguistic unit pointed to by PC.
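The search that GenNewUnit performs can be sketched in Python as a recursive backtracking enumeration. This is a simplified illustration under stated assumptions, not a transcription of the flowchart in Fig. 8: the class Unit, the function ordered_combinations and the list-based queues are hypothetical, text regions are assumed to be inclusive (left, right) pairs, and recursion replaces the explicit stack Stk.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Unit:
    """Hypothetical linguistic unit: a body plus an inclusive text region."""
    body: str
    txt_rgn: Tuple[int, int]


def ordered_combinations(queues: List[List[Unit]],
                         new_unit: Unit) -> Iterator[List[Unit]]:
    """Enumerate combinations for a rule with n = len(queues) + 1 sub-former pieces.

    queues[i] plays the role of the candidate queue of sub-former piece i + 1.
    The last member of every combination is the newly input unit, which keeps
    the search incremental. Each member must lie strictly to the left of the
    member that follows it (ordered reduction mode).
    """
    def extend(pos: int, suffix: List[Unit]) -> Iterator[List[Unit]]:
        if pos < 0:                      # all sub-former pieces are filled
            yield suffix
            return
        right = suffix[0]                # member that the candidate must precede
        for cand in reversed(queues[pos]):          # scan from tail to head
            if cand.txt_rgn[1] < right.txt_rgn[0]:  # left of, hence no overlap
                yield from extend(pos - 1, [cand] + suffix)

    yield from extend(len(queues) - 1, [new_unit])
```

For a rule with a single sub-former piece, queues is empty and the only combination is the new unit itself, which corresponds to going directly from step B04 to step B11.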
The method Stk2Unit called in step B11 produces a new linguistic unit from the linguistic unit combination currently obtained and the specific consequent, and outputs it.
One possible implementation of the method Stk2Unit, given for reference, is as follows (see Fig. 9):
C01: Initialize the text region T0
C02: Point the pointer PS to the bottom of the stack Stk
C03: If PS has reached the top of the stack, go to step C06; otherwise go to step C04
C04: Obtain the text region of the linguistic unit in the stack entry pointed to by PS, T=PS->TxtRgn
C05: Superimpose the text region of the linguistic unit in the stack entry pointed to by PS onto T0, i.e. T0=T0 ∪ T; go to step C03
C06: Create the linguistic unit U0
C07: Set the text region of U0 to T0, i.e. U0.TxtRgn=T0
C08: Obtain the consequent Post(I) of the I-th reductive rule
C09: Copy Post(I) into the body of the linguistic unit U0
C10: Output U0 to the output interface
The text region superposition described in step C05 is the mathematical "union" operation on intervals or sets. For example, if the text region of linguistic unit A is (2,5) and that of linguistic unit B is (4,6), the result of the superposition is A ∪ B=(2,6).
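A minimal Python sketch of the superposition and of the creation of the new linguistic unit follows. The function names superimpose and stk2unit and the dictionary representation of a unit are hypothetical. The sketch also assumes that the result of the superposition is kept as a single (left, right) interval spanning all members; this agrees with the example A ∪ B=(2,6), but for regions that are neither overlapping nor adjacent it is a bounding span rather than a strict set union.

```python
from typing import Iterable, Optional, Tuple

Region = Tuple[int, int]  # (left boundary, right boundary), assumed inclusive


def superimpose(t0: Optional[Region], t: Region) -> Region:
    """Superimpose text region t onto the accumulated region t0 (step C05)."""
    if t0 is None:                # T0 freshly initialized and still empty
        return t
    return (min(t0[0], t[0]), max(t0[1], t[1]))


def stk2unit(consequent: str, regions: Iterable[Region]) -> dict:
    """Build a new unit from a rule consequent and the members' text regions."""
    t0: Optional[Region] = None   # C01: initialize T0
    for t in regions:             # C02 to C05: walk the stack bottom to top
        t0 = superimpose(t0, t)
    # C06 to C09: create U0, set its text region, copy the consequent into it
    return {"body": consequent, "txt_rgn": t0}


print(superimpose((2, 5), (4, 6)))  # the A ∪ B example from the text
```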
If the above method is modified as follows:
1. remove steps B08, B09 and B10;
2. have step B07 go to step B18 when its condition is false;
3. remove steps C04, C05 and C07;
then the reduction operates in the unordered reduction mode.
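Once the text region test is removed, every choice of one candidate per preceding sub-former piece is accepted, so the set of combinations is a Cartesian product of the candidate queues with the new unit appended. The Python sketch below illustrates this under stated assumptions: the function name unordered_combinations and the list-based queues are hypothetical, and the enumeration order of itertools.product is not the tail-to-head order of the flowchart.

```python
from itertools import product
from typing import Iterator, List


def unordered_combinations(queues: List[list], new_unit) -> Iterator[list]:
    """Enumerate combinations in the unordered reduction mode.

    queues[i] plays the role of the candidate queue of sub-former piece i + 1;
    no positional test is applied, and the new unit is always the last member.
    """
    for members in product(*queues):
        yield list(members) + [new_unit]


for combo in unordered_combinations([["a1", "a2"], ["b1"]], "u"):
    print(combo)
```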
The above is one embodiment of the incremental broad normalization method. Those skilled in computer science will have no difficulty implementing the invention for a concrete application on the basis of this embodiment.
The method of performing incremental broad normalization may be stored as a program on a computer-readable storage medium. The storage medium used to store the program may be a floppy disk, hard disk, optical disc, magneto-optical disk, CD-ROM, CD-R, magnetic tape, non-volatile memory or volatile memory.
In this specification, the invention has been described with reference to specific embodiments. It is evident, however, that various modifications and changes may be made without departing from the spirit and scope of the invention. The specification and drawings are therefore to be regarded as illustrative rather than restrictive.

Claims (6)

1. A device for broad normalization, characterized in that the device comprises:
(1) a storage component for the reductive rule table, used to store reductive rules; each reductive rule comprises one or more sub-former pieces, which store the conditions that a linguistic unit must satisfy in order to match; each reductive rule also comprises a consequent, whose stored information describes all or part of the content of the linguistic unit to be produced after the rule is matched; the information stored in a linguistic unit describes the language entity, at a specific linguistic level, that corresponds to a passage of text;
(2) a storage component for candidate queues, used to store the linguistic units that match a specific sub-former piece of a specific rule;
(3) a reduction component, used to match the linguistic unit obtained by the input means against each sub-former piece of each rule one by one and, when the linguistic unit matches a specific sub-former piece of a specific rule, to add it to the candidate queue corresponding to that sub-former piece; during the matching of each sub-former piece, if the linguistic unit matches the last sub-former piece of a rule having n sub-former pieces, a traversal algorithm is used to search for the various linguistic unit combinations, each combination satisfying:
(a) the i-th member comes from the candidate queue of the i-th sub-former piece of that rule (1≤i≤n-1);
(b) the n-th member is this linguistic unit;
(4) an input component, which obtains linguistic units input from outside;
(5) an output component, which outputs the newly produced linguistic units;
The output of the input component is connected to the input of the reduction component, the output of the reduction component is connected to the input of the output component, and the reduction component is also connected to the storage component for the reductive rule table and to the storage component for candidate queues, respectively; for each linguistic unit combination, a new linguistic unit is produced according to the consequent of the rule and output by the output component.
2. The device for broad normalization according to claim 1, characterized in that the linguistic unit further comprises a text region indicating the area that the language entity represented by the linguistic unit occupies in the text; and the reduction component is such that:
(a) when searching for linguistic unit combinations, the text region of the i-th linguistic unit of each combination does not overlap that of the (i+1)-th linguistic unit and lies to its left (1≤i≤n-1);
(b) the text region of the newly obtained linguistic unit is the superposition of the text regions of the linguistic units in the corresponding combination.
3. A method for performing broad normalization using the device of claim 1, characterized in that the method comprises the following steps:
(1) obtaining a linguistic unit by the input means, the information stored in the linguistic unit describing the language entity, at a specific linguistic level, that corresponds to a passage of text;
(2) matching the linguistic unit against each sub-former piece of each reductive rule one by one; each reductive rule contains one or more sub-former pieces, which store the conditions that a linguistic unit must satisfy in order to match; each sub-former piece has a corresponding candidate queue used to store the linguistic units that match it; each reductive rule also comprises a consequent, whose stored information describes all or part of the content of the linguistic unit to be produced after the rule is matched; when the linguistic unit matches a specific sub-former piece of a specific rule, it is added to the candidate queue corresponding to that sub-former piece; during the matching of each sub-former piece, if the linguistic unit matches the last sub-former piece of a rule having n sub-former pieces, a traversal algorithm is used to search for the various linguistic unit combinations, such that each combination satisfies:
(a) the i-th member comes from the candidate queue of the i-th sub-former piece of that rule (1≤i≤n-1);
(b) the n-th member is this linguistic unit;
For each linguistic unit combination, a new linguistic unit is produced according to the consequent of the rule and output by the output means.
4. The method for broad normalization according to claim 3, characterized in that the linguistic unit further comprises a text region indicating the area that the language entity represented by the linguistic unit occupies in the text, and the reductive rule is such that:
(a) when searching for linguistic unit combinations, the text region of the i-th linguistic unit of each combination does not overlap that of the (i+1)-th linguistic unit and lies to its left (1≤i≤n-1);
(b) the text region of the newly obtained linguistic unit is the superposition of the text regions of the linguistic units in the corresponding combination.
5. A computer-readable storage medium storing a program that implements the broad normalization method of claim 3, characterized in that the program performs the following steps:
(1) obtaining a linguistic unit by the input means, the information stored in the linguistic unit describing the language entity, at a specific linguistic level, that corresponds to a passage of text;
(2) matching the linguistic unit against each sub-former piece of each reductive rule one by one; each reductive rule contains one or more sub-former pieces, which store the conditions that a linguistic unit must satisfy in order to match; each sub-former piece has a corresponding candidate queue used to store the linguistic units that match it; each reductive rule also comprises a consequent, whose stored information describes all or part of the content of the linguistic unit to be produced after the rule is matched; when the linguistic unit matches a specific sub-former piece of a specific rule, it is added to the candidate queue corresponding to that sub-former piece; during the matching of each sub-former piece, if the linguistic unit matches the last sub-former piece of a rule having n sub-former pieces, a traversal algorithm is used to search for the various linguistic unit combinations, such that each combination satisfies:
(a) the i-th member comes from the candidate queue of the i-th sub-former piece of that rule (1≤i≤n-1);
(b) the n-th member is this linguistic unit;
For each linguistic unit combination, a new linguistic unit is produced according to the consequent of the rule and output by the output means.
6. The storage medium according to claim 5, characterized in that the linguistic unit further comprises a text region indicating the area that the language entity represented by the linguistic unit occupies in the text, and the reductive rule is such that:
(a) when searching for linguistic unit combinations, the text region of the i-th linguistic unit of each combination does not overlap that of the (i+1)-th linguistic unit and lies to its left (1≤i≤n-1);
(b) the text region of the newly obtained linguistic unit is the superposition of the text regions of the linguistic units in the corresponding combination.
CN 200510023588 2005-01-26 2005-01-26 Device and method for broad normalization Pending CN1645361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510023588 CN1645361A (en) 2005-01-26 2005-01-26 Device and method for broad normalization

Publications (1)

Publication Number Publication Date
CN1645361A true CN1645361A (en) 2005-07-27

Family

ID=34875916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510023588 Pending CN1645361A (en) 2005-01-26 2005-01-26 Device and method for broad normalization

Country Status (1)

Country Link
CN (1) CN1645361A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105706078A (en) * 2013-10-09 2016-06-22 谷歌公司 Automatic definition of entity collections
CN105706078B (en) * 2013-10-09 2021-08-03 谷歌有限责任公司 Automatic definition of entity collections

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication