CN101013421A - Rule-based automatic analysis method of Chinese basic block - Google Patents


Publication number
CN101013421A
Authority
CN
China
Prior art keywords
word
piece
interval
ambiguity
rule
Prior art date
Legal status: Granted
Application number
CN 200710063489
Other languages
Chinese (zh)
Other versions
CN101013421B (en
Inventor
周强
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN2007100634897A priority Critical patent/CN101013421B/en
Publication of CN101013421A publication Critical patent/CN101013421A/en
Application granted granted Critical
Publication of CN101013421B publication Critical patent/CN101013421B/en
Legal status: Expired - Fee Related


Abstract

This is a rule-based method for the automatic analysis of Chinese basic blocks, belonging to the field of natural language processing applications. Its features are: a data- and knowledge-driven basic block analyzer is designed on the basis of automatically acquired basic block rules, so that the different basic blocks in real Chinese sentences can be recognized efficiently; lexical knowledge is used to expand block-combination instances with information at several levels, and through effective matching of this information against automatically acquired multi-level, multi-granularity rules carrying internal lexical constraints and external context restrictions, analysis results are made as reliable as possible; ambiguity is resolved automatically by confidence rules, choosing the more reliable analysis wherever possible while retaining those complex ambiguities that are hard to judge under present conditions for subsequent analyzers to use, thus enhancing the flexibility and effectiveness of the basic block analyzer.

Description

Rule-based automatic analysis method of Chinese basic block
Technical field
The invention belongs to the technical field of natural language processing.
Background technology
Chunk parsing, as an important partial analysis technique, decomposes the full parsing problem into reasonable subtasks and thereby greatly reduces the difficulty of automatic analysis; it has played an important role in applied research such as information extraction, question answering and text mining in the natural language processing field.
For English, Abney (1991) first defined a chunk as a set of adjacent words in a sentence centered on a content word, so that a sentence is split into a linear sequence of chunks. Ramshaw & Marcus (1995) proposed the 'BIO' model: by judging whether each word in the sentence is at the beginning position (B) or an inside position (I) of a chunk, or outside any chunk (O), chunk parsing is converted into a sequence labeling problem, laying a good foundation for the application of various machine learning methods. Tjong Kim Sang & Buchholz (2000) used the Wall Street Journal material of the Penn Treebank to derive automatically from the parse trees a text of about 300,000 words annotated with BIO information, as a unified training and test platform for English chunk parsing. On this platform many researchers have built automatic chunkers with different machine learning methods, including memory-based learning (MBL), support vector machines (SVM) and hidden Markov models (HMM), to recognize the BIO sequences automatically. The principal features used are the two adjacent words on each side of a word and their part-of-speech information. Experimental results show that the F-measure of the best chunking systems reaches about 93%, a preliminary proof of the validity of local part-of-speech distribution information for recognizing English chunk boundaries and syntactic tags. In recent years, related research on Chinese has obtained similar results. Since then, however, related research work has gradually declined. The main reason is that current chunk definitions and recognition models rely mainly on syntactic distribution information and ignore the internal semantic content of chunks, so research on chunk description and recognition technology has lost its inner driving force for further development.
In recent years, the multiword expression (MWE) problem has gradually attracted the attention of theoretical and computational linguists. It originally studied the problem of interpreting the special meanings of multiword combinations between the word and phrase levels in English, but its basic ideas have been extended to the ubiquitous interface between lexicon and syntax in different languages. Fillmore (2003) used the describing method of Construction Grammar, which combines form and meaning, to analyze common English MWEs in depth. Sag et al. (2002) gave a comprehensive summary of the analytical challenges of current MWE work and the available techniques, proposing the basic idea that different MWEs need to be analyzed with different resources and different methods. By introducing the analysis and description of semantic content, these studies have injected new vitality into the exploration of the chunk parsing problem.
Inspired by the MWE research, we believe the chunk parsing problem can be defined from another angle: for an input sentence, first determine, through an analysis of word aggregation and surrounding context constraints, which word combinations can form multiword blocks; then directly promote the remaining content words to single-word blocks. This yields a complete block description sequence for the sentence consisting of multiword blocks, single-word blocks and the remaining function words. Unlike the 'BIO' sequence labeling model, this processing approach puts more emphasis on analyzing the lexical aggregation inside different blocks, and therefore makes it easier to establish the inner links between block descriptions and the corresponding lexical semantic knowledge bases. The key here is how to find an effective knowledge description system that combines the surface block description instances with deep lexical semantic knowledge. In these respects we have carried out some preliminary studies:
● We proposed a basic block description system based on topological structure, which very naturally establishes the inner links between the basic block description instances of real text and the lexical association knowledge base, forming the description basis for subsequent rule learning and evolution;
● Using an automatic learning and extension-evolution tool for Chinese basic block description rules, supported by a basic-block-annotated corpus and a lexical knowledge base, we started from POS-tag-string description rules and, by continually introducing more internal lexical associations and external context restriction knowledge, gradually evolved a multi-level, multi-granularity basic block rule base whose rules have stronger descriptive power and higher processing confidence.
These studies have laid a good foundation for further research on rule-based automatic analysis of Chinese basic blocks.
Summary of the invention
The rule-based automatic analysis method of Chinese basic blocks is characterized in that it contains the following steps in sequence:
(1) computer initialization is set:
A. The input sentence is S, S = {<w_i, t_i>}, where w_i is the i-th word of sentence S, t_i is the POS tag of the i-th word, i ∈ [1, n], and n is the total number of words in the sentence;
B. The packed shared forest PSF[] is represented with a line-chart data structure: with the n words of the sentence arranged in order from left to right, there are n+1 positions from the left side of the 1st word to the right side of the n-th word. Each position is defined as a chart node, and any two nodes form a chart edge, written (l, r), where l is the left node position of the edge, r is the right node position, and r > l. All edges together form a chart array, giving a complete description of all words and of the basic blocks composed of words; words and basic blocks are collectively called syntactic constituents. A basic block is a set of adjacent words in sentence S centered on a content word. Each record of PSF[] includes:
B1. <constituent flag cflag>, indicating the constituent class: W - word, B - single-word block, P - multiword block, D - edge dynamically deleted during disambiguation;
B2. <constituent left boundary cl>, <constituent right boundary cr>, the left and right boundary positions of the constituent edge in sentence S, cl ∈ [0, n-1], cr ∈ [1, n];
B3. <syntactic tag cctag>, the external syntactic function of the constituent:
for a word edge, its POS tag is stored, specifically: n - noun, s - place word, t - time word, f - localizer, r - pronoun, vM - auxiliary verb, v - verb, a - adjective, d - adverb, m - numeral, q - measure word, p - preposition, u - particle, c - conjunction, y - modal particle, e - interjection, w - punctuation mark;
for a basic block edge, the syntactic tag obtained from the rule base is stored, specifically: np - noun block, vp - verb block, sp - space block, tp - time block, mp - quantity block, ap - adjective block, dp - adverb block;
B4. <relation tag crtag>, the internal grammatical relation of the constituent:
for a word edge, its word form is stored;
for a basic block edge, the relation tag obtained from the rule base is stored, specifically: ZX - right-corner center structure, LN - chain relation structure, LH - coordination, PO - predicate-object relation, SB - predicate-complement relation, AM - ambiguity interval, SG - single-word block, where:
Right-corner center structure: all words in the basic block depend directly on the right-corner head word, forming a rightward center dependency structure. The basic pattern is A_1 ... A_n H, with dependencies A_1 → H, ..., A_n → H, where H is the syntactic-semantic head of the whole basic block and A_1, ..., A_n are modifiers;
Chain relation structure: each word in the basic block depends in turn on its immediate right neighbor, forming a multi-center dependency chain arranged from left to right. The basic pattern is H_0 H_1 ... H_n, with dependencies H_0 → H_1, ..., H_{n-1} → H_n, where each H_i, i ∈ [1, n-1], is a semantic aggregation point at a different level and H_n is the syntactic-semantic head of the whole basic block;
Coordination: the words in the basic block form a coordinate structure, e.g. 'teachers (and) classmates';
Predicate-object relation: two words in the basic block form a predicate-object structure, e.g. 'have a meal';
Predicate-complement relation: two words in the basic block form a predicate-complement structure, e.g. 'go down';
Ambiguity interval: some words can form different structural combinations that cannot be disambiguated automatically from the present basic block rule base and lexical knowledge base contents; the multiple combinations are retained for a subsequent system to select among;
B5. <constituent confidence θ>, the processing confidence of the constituent, θ ∈ [0, 1];
B6. <word edge>: the i-th word of the sentence is characterized by cflag = W, cl = i-1, cr = i, cctag = t_i, crtag = w_i, θ = 0;
B7. <single-word block edge>, i.e. a basic block consisting of one word, characterized by cflag = B, cr - cl = 1, crtag = SG;
B8. <multiword block edge>, i.e. a basic block consisting of two or more words, characterized by cflag = P, cr - cl >= 2;
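The edge record of items B1-B8 can be sketched as a small data structure. This is an illustrative sketch only; the field names mirror the patent's notation, but the class and helper below are assumptions, not the patent's own implementation:

```python
from dataclasses import dataclass

@dataclass
class Edge:
    """One PSF chart-edge record (fields B1-B5); illustrative sketch."""
    cflag: str    # 'W' word, 'B' single-word block, 'P' multiword block, 'D' deleted
    cl: int       # left boundary position, in [0, n-1]
    cr: int       # right boundary position, in [1, n]
    cctag: str    # POS tag (word edge) or block tag such as 'np', 'vp'
    crtag: str    # word form (word edge) or relation tag such as 'ZX', 'LN', 'SG'
    theta: float  # processing confidence in [0, 1]

def word_edge(i: int, tag: str, word: str) -> Edge:
    """Word edge for the i-th word per B6: cl = i-1, cr = i, confidence 0."""
    return Edge('W', i - 1, i, tag, word, 0.0)

e = word_edge(1, 'n', '学生')
```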
C. The block tag sequence stack ChkStack[] stores the unambiguous basic blocks extracted from the PSF and the analysis intervals that may produce ambiguity, forming a linear block tag sequence for the input sentence. Its main record format is: [cflag, cl, cr, cctag, crtag, corresponding PSF edge number PSFeno];
D. The primitive rule table BasRules[] stores all POS-tag-string description rules. Its main record format is [r_stru, r_tag, fp, fn, θ, e_sp, e_ep], where:
r_stru is the structural combination of the rule,
r_tag is the reduction tag, consisting of a syntactic tag and a relation tag,
fp is the positive example frequency,
fn is the negative example frequency,
θ is the rule confidence, computed as θ = fp / (fp + fn),
e_sp is the starting position of the corresponding extension rules in the extension rule table,
e_ep is the ending position of the corresponding extension rules in the extension rule table;
if a primitive rule has no corresponding extension rules, e_sp and e_ep are both -1;
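The confidence formula θ = fp / (fp + fn) amounts to a one-liner; a minimal sketch:

```python
def rule_confidence(fp: int, fn: int) -> float:
    """Confidence of a rule from its positive (fp) and negative (fn)
    example frequencies in the training corpus: theta = fp / (fp + fn)."""
    return fp / (fp + fn)

theta = rule_confidence(17, 3)  # a rule seen 17 times correctly, 3 times incorrectly
```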
E. The extension rule table ExpRules[] stores all description rules containing lexical constraints and context restriction conditions obtained through extension-evolution learning. Its main record format is [r_stru, r_tag, fp, fn, θ], with the fields defined as above;
F. The basic block rule base stores the basic block description rules at the different levels needed for basic block recognition, obtained by automatic rule learning and evolution. Its basic format is <structural combination> → <reduction tag> <confidence>, where:
the structural combination describes the internal construction of each basic block and is divided into two levels according to descriptive power:
a) primitive rules, whose structural combination is described by a POS-tag string;
b) extension rules, which form stronger structural descriptions through lexical constraints and context restrictions; the reduction tag and confidence are defined as above;
G. The lexical knowledge base stores the various lexical description knowledge that may be used during analysis, obtained from external knowledge sources, including:
G1. the lexical association knowledge base, containing syntactic relation description pairs formed between common Chinese content words, with the basic data format <word 1> <word 2> <POS 1> <POS 2> <syntactic relation tag>;
G2. the feature verb lists, containing the verbs that can take different types of objects, extracted from a syntactic information dictionary; the basic data format is {<verb entry>}, organized into different verb lists by object type;
G3. the noun semantic information table, containing the 11 semantic category types of common Chinese nouns: organization, person, artifact, natural object, information, mental object, event, attribute, quantity, time and space; the basic data format is <noun entry> <semantic category tag>;
H. Crossing ambiguity interval: if the left and right boundaries of two basic blocks <L1, R1> and <L2, R2> satisfy (L2 < R1 and L2 >= L1 and R2 > R1) or (R2 > L1 and R2 <= R1 and L2 < L1), they form a crossing ambiguity interval <AmL, AmR>, with AmL = min(L1, L2) and AmR = max(R1, R2);
I. Full-coverage block: within some crossing ambiguity interval <AmL, AmR> there exists a basic block that covers the interval completely, i.e. its left and right boundaries cl and cr satisfy cl = AmL and cr = AmR;
J. Combination ambiguity interval: if a word-combination interval in the sentence can form one complete basic block in one context but several separate basic blocks in another, it is called a combination ambiguity interval. The concrete decision condition is: if the analysis confidence of the basic block over the interval is less than InBelTh, a combination ambiguity interval is formed;
K. Whole combination block: the basic block formed by combining the entire combination ambiguity interval;
L. Inner combination blocks: the several basic blocks formed separately by the words inside a combination ambiguity interval;
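Definition H can be checked with a small helper. The condition below is a sketch reconstructed from the garbled boolean expression in the source; note that under it, pure containment of one span inside another does not count as crossing:

```python
def crossing_interval(l1, r1, l2, r2):
    """Return the crossing-ambiguity interval <AmL, AmR> of two block
    spans <l1, r1>, <l2, r2>, or None if they do not cross.
    Condition (as reconstructed): (l2 < r1 and l2 >= l1 and r2 > r1)
    or (r2 > l1 and r2 <= r1 and l2 < l1)."""
    crosses = (l1 <= l2 < r1 < r2) or (l2 < l1 < r2 <= r1)
    if not crosses:
        return None
    return (min(l1, l2), max(r1, r2))
```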
The computer also loads the following modules: the multiword block recognition module, the ambiguity structure discovery and automatic disambiguation module, the automatic single-word block promotion module and the linear block sequence generation module;
At the same time, the following parameters are set:
LowBelTh, the confidence threshold for words in the sentence to be combined into a block, set to 0.5;
InBelTh, the confidence threshold for a word combination in the sentence to form a combination ambiguity interval, set to 0.7;
HighBelTh, the confidence threshold for words in the sentence to combine into a reliable basic block, set to 0.85;
ERSum, the total number of extension rule description strings for a given word combination in the sentence;
CBSum, the total number of basic blocks in the data structure PSF that cross a given basic block;
OASum, the total number of crossing ambiguity intervals found in the sentence;
CASum, the total number of combination ambiguity intervals found in the sentence;
and the following basic processing functions are used:
min, the minimum function: min(x, y) selects the smaller of x and y;
max, the maximum function: max(x, y) selects the larger of x and y;
(2) Input the Chinese sentence S to be analyzed for basic blocks, S = {<w_i, t_i>}, i ∈ [1, n];
(3) Initialize the related data structure PSF as follows:
(3.1) initialize i = 0;
(3.2) obtain the word form w_i and POS tag t_i of the i-th word of the sentence, generate a new word edge record ['W', i, i+1, t_i, w_i, 0] and add it to the PSF, where 0 is the confidence value;
(3.3) let i = i + 1 and repeat step (3.2) until i = n, then stop;
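Step (3) can be sketched directly from the record layout in (3.2). The list-of-lists representation below is an illustrative assumption:

```python
def init_word_edges(tagged_sentence):
    """Step (3): create one 'W' edge per word of the POS-tagged input
    sentence. Record layout follows step (3.2):
    [flag, left, right, POS tag, word form, confidence]."""
    psf = []
    for i, (word, tag) in enumerate(tagged_sentence):
        psf.append(['W', i, i + 1, tag, word, 0.0])
    return psf

psf = init_word_edges([('中国', 'n'), ('发展', 'v'), ('迅速', 'a')])
```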
(4) Find and recognize multiword blocks as follows:
(4.1) initialize i = 0;
(4.2) starting from the i-th word of the sentence, scan the whole sentence from left to right, forming all candidate basic block intervals <i, j> of length 2 to 6 words, j ∈ [i+2, i+6];
(4.3) take in turn a candidate basic block interval <i, j> formed in step (4.2) and search the rule base for the best matching rule number BestRuleNo for this interval, in the following concrete steps:
(4.3.1) obtain the POS-tag string within the interval; if this tag string does not occur in the primitive rule table, return an invalid number and stop;
(4.3.2) judge whether corresponding extension rules exist; if not, go to (4.3.7);
(4.3.3) query the lexical knowledge base to obtain all extension rule structural description strings of this interval, ERSum in total;
(4.3.4) search in order the k = 0, 1, 2, ..., ERSum-1 extension rule structural description strings; whenever a description string occurs in the extension rule table, add the matching extension rule table number to the found-rule list;
(4.3.5) if no matching extension rule is found, go to (4.3.7);
(4.3.6) from the found-rule list select the extension rule with the highest confidence, return its number and stop;
(4.3.7) if the confidence of the primitive rule is < LowBelTh, return an invalid number and stop; otherwise return the primitive rule number and stop;
(4.4) if BestRuleNo is empty, go to (4.9);
(4.5) according to the best matching rule number BestRuleNo, extract from the corresponding rule table record the basic block syntactic tag CCT, relation tag CRT and analysis confidence CB, generate a new multiword block edge record ['P', i, j, CCT, CRT, CB] and add it to the data structure PSF;
(4.6) perform dynamic disambiguation by relative combination strength in the local context, as follows:
(4.6.1) obtain the confidence value θ_T of the current basic block;
(4.6.2) obtain all other basic blocks in the data structure PSF that cross this basic block, CBSum in total, and initialize the crossing edge array index control variable i = 0;
(4.6.3) obtain the confidence value θ_i of the i-th crossing edge and judge:
(4.6.4) if θ_T - θ_i > 0.2, delete the i-th crossing edge (set cflag = 'D' on that edge);
(4.6.5) let i = i + 1 and repeat steps (4.6.3)-(4.6.5) until i = CBSum;
(4.6.6) if there is some crossing edge i with θ_i - θ_T > 0.2, delete the current basic block edge and stop;
(4.7) while j < min(n, i+6), repeat steps (4.3)-(4.6);
(4.8) if i < n, let i = i + 1 and repeat steps (4.2)-(4.7); otherwise stop;
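The rule lookup in step (4.3) can be sketched as follows. This is a simplification: in the patent, extension rules are matched through description strings expanded with the lexical knowledge base, whereas the sketch below simply keys both tables on the POS-tag string; the dict-based indexing is an assumption for illustration:

```python
LOW_BEL_TH = 0.5  # LowBelTh from the initialization step

def best_rule(pos_string, bas_rules, exp_rules):
    """Step (4.3), simplified: prefer the highest-confidence matching
    extension rule; otherwise fall back to the primitive rule if its
    confidence clears LowBelTh; otherwise return None (invalid)."""
    matches = exp_rules.get(pos_string, [])
    if matches:
        return max(matches, key=lambda r: r['theta'])   # step (4.3.6)
    base = bas_rules.get(pos_string)
    if base and base['theta'] >= LOW_BEL_TH:            # step (4.3.7)
        return base
    return None

bas = {'n n': {'r_tag': 'np-ZX', 'theta': 0.80},
       'd v': {'r_tag': 'vp-ZX', 'theta': 0.40}}
exp = {'n n': [{'r_tag': 'np-ZX', 'theta': 0.90},
               {'r_tag': 'np-LH', 'theta': 0.70}]}
```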
(5) Find multiword block ambiguity structures, with the following steps in sequence:
(5.1) extract all multiword block edges from the data structure PSF, PESum in total;
(5.2) automatically sort them in ascending order of their left and right boundary positions, forming a multiword block information table;
(5.3) obtain the left and right boundary positions <L, R> of the 1st block in the table: L = cl_1, R = cr_1; set the boundary buffer of a possible crossing ambiguity interval: BufL = L, BufR = R; initialize the crossing ambiguity block information: AmL = AmR = 0; initialize the table index control variable i = 2;
(5.4) obtain the left and right boundary positions <L, R> of the i-th block in the table: L = cl_i, R = cr_i;
(5.5) if the word intervals of the two adjacent blocks do not intersect, i.e. L > BufR, go to (5.6); otherwise adjust the boundary buffer of the possible crossing ambiguity interval: BufL = min(BufL, L), BufR = max(BufR, R), set the corresponding crossing ambiguity block information AmL = BufL, AmR = BufR, and go to (5.10);
(5.6) obtain the analysis confidences θ_{i-1} and θ_i of the two adjacent multiword block edges;
(5.7) if θ_i < InBelTh, save a combination ambiguity interval <L, R>;
(5.8) if θ_{i-1} < InBelTh, save a combination ambiguity interval <BufL, BufR>;
(5.9) if a crossing ambiguity interval was found, i.e. AmL > 0, save the crossing ambiguity interval <AmL, AmR> and reset the buffers: BufL = L, BufR = R, AmL = AmR = 0;
(5.10) let i = i + 1 and repeat steps (5.4)-(5.9) until i = PESum;
(5.11) return the totals and boundary information tables of the two classes of ambiguity intervals found, then stop;
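The buffer-merging scan of step (5) can be sketched for the crossing case alone (the confidence-based combination intervals of steps (5.7)-(5.8) are omitted here). A simplified sketch under those assumptions:

```python
def find_crossing_regions(chunks):
    """Step (5), simplified: sort multiword block spans (l, r) and merge
    any overlapping ones into crossing-ambiguity intervals <AmL, AmR>."""
    regions = []
    cur_l, cur_r, count = None, None, 0
    for l, r in sorted(chunks):
        if cur_l is None or l >= cur_r:      # no overlap with the buffer
            if count > 1:                     # buffer held >1 block: save it
                regions.append((cur_l, cur_r))
            cur_l, cur_r, count = l, r, 1     # restart the buffer
        else:                                 # overlap: extend the buffer
            cur_l, cur_r = min(cur_l, l), max(cur_r, r)
            count += 1
    if count > 1:
        regions.append((cur_l, cur_r))
    return regions
```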
(6) Perform automatic multiword block disambiguation as follows:
(6.1) obtain the totals OASum and CASum of the two classes of ambiguity intervals found in the sentence and the corresponding boundary information tables;
(6.2) perform disambiguation of the crossing intervals as follows:
(6.2.1) initialize the crossing interval boundary information table index control variable i = 0;
(6.2.2) obtain the i-th crossing ambiguity interval <L, R> and all crossing multiword blocks inside the interval;
(6.2.3) if a full-coverage block exists in the ambiguity interval, or all the crossing blocks form a chain relation structure, go to step (6.2.4); otherwise stop;
(6.2.4) obtain the confidence values θ_i of all crossing blocks in the interval, and record their maximum Max_θ and minimum Min_θ;
(6.2.5) if the following two conditions hold simultaneously, form a uniformly distributed chain relation structure block with CCT = np or vp:
all θ_i >= InBelTh, or all θ_i > LowBelTh and (Max_θ - Min_θ < 0.1);
all crossing blocks are noun blocks or verb blocks;
(6.2.6) if the following two conditions hold simultaneously, form a new chain relation structure noun block with CCT = np:
all crossing blocks are noun blocks np;
the whole forms a noun block whose distribution confidence is greater than InBelTh;
(6.2.7) if the following two conditions hold simultaneously, form a new chain relation structure noun block with CCT = np:
the syntactic tag of every crossing block is np, sp or tp;
a full-coverage block exists, and all θ_i > LowBelTh;
(6.2.8) generate a new chain relation structure multiword block edge record ['P', L, R, CCT, 'LN', Max_θ] and add it to the data structure PSF, where 'LN' denotes the chain relation structure;
(6.2.9) delete all crossing edges within the interval;
(6.2.10) let i = i + 1 and repeat steps (6.2.2)-(6.2.9) until i = OASum, then stop;
(6.3) perform disambiguation of the combination intervals as follows:
(6.3.1) initialize the combination interval boundary information table index control variable i = 0;
(6.3.2) obtain the i-th combination ambiguity interval <L, R>;
(6.3.3) obtain the confidence value Comb_θ of the whole combination block of this interval;
(6.3.4) obtain the confidence values Seg_θ_i of the inner combination blocks of this interval, and record their maximum Max_θ;
(6.3.5) if all Seg_θ_i > LowBelTh and some Seg_θ_i > HighBelTh, go to (6.3.9);
(6.3.6) if all Seg_θ_i <= LowBelTh, go to (6.3.10);
(6.3.7) if Comb_θ > HighBelTh, go to (6.3.10);
(6.3.8) if Comb_θ - Max_θ > 0.1, go to (6.3.10);
(6.3.9) select the 'split' state, i.e. delete the whole combination block, and go to (6.3.11);
(6.3.10) select the 'merge' state, i.e. delete the inner combination blocks, and go to (6.3.11);
(6.3.11) let i = i + 1 and repeat steps (6.3.2)-(6.3.10) until i = CASum, then stop;
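The merge/split decision of step (6.3) can be sketched as a single function over the confidence values, following the branch order (6.3.5)-(6.3.9) as reconstructed above; the function and label names are illustrative assumptions:

```python
LOW_BEL_TH, HIGH_BEL_TH = 0.5, 0.85  # thresholds from the initialization step

def merge_or_split(comb_theta, seg_thetas):
    """Step (6.3): decide between the whole combination block ('merge',
    delete the inner blocks) and the inner blocks ('split', delete the
    whole block) by comparing confidences against the thresholds."""
    max_seg = max(seg_thetas)
    if all(t > LOW_BEL_TH for t in seg_thetas) and max_seg > HIGH_BEL_TH:
        return 'split'   # (6.3.5): the inner analysis is reliable
    if all(t <= LOW_BEL_TH for t in seg_thetas):
        return 'merge'   # (6.3.6): the inner analysis is weak
    if comb_theta > HIGH_BEL_TH:
        return 'merge'   # (6.3.7): the combined block is reliable
    if comb_theta - max_seg > 0.1:
        return 'merge'   # (6.3.8): combined clearly beats the best inner block
    return 'split'       # default (6.3.9)
```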
(7) Promote single-word blocks automatically, as follows:
(7.1) scan all words in the sentence from left to right; if a word is covered by some multiword block, or belongs to the function word set comprising conjunctions, particles, prepositions, modal particles, interjections, punctuation marks etc., skip directly to the next word; otherwise perform the following step:
(7.2) obtain the syntactic tag corresponding to the word, generate a new single-word block edge record and add it to the data structure PSF;
(8) Generate the linear block tag sequence as follows:
(8.1) analyze the PSF array and obtain the position information tables of the following two classes of word intervals covering the complete sentence:
ambiguous intervals, denoted AmbiList[],
unambiguous intervals, denoted NonAmbiList[];
(8.2) obtain the totals of the two classes of word intervals, denoted ALSum and NALSum;
(8.3) initialize i = 0;
(8.4) obtain the left and right boundary positions <L, R> of the i-th ambiguous interval;
(8.5) obtain the syntactic tag CCT of the first basic block in this interval and generate an ambiguity interval block ['P', L, R, CCT, 'AM', -1], adding it to the block tag sequence stack ChkStack; the relation tag is set to AM, denoting an ambiguity interval;
(8.6) let i = i + 1 and repeat steps (8.4)-(8.5) until i = ALSum;
(8.7) for each unambiguous interval, extract in order the information of each basic block covered by it and add it to the block tag sequence stack ChkStack;
(8.8) sort the block information in the block tag sequence stack ChkStack, forming the linear block tag sequence covering the whole sentence, then stop.
To accurately test the processing performance of the Chinese basic block analyzer developed so far, we selected all news texts, about 200,000 words in total, from the annotated corpus of the Chinese syntactic treebank TCT. The material was split into two parts: 80% as the training corpus, mainly used for rule learning and evolution; 20% as the test corpus, mainly used for the performance evaluation of the basic block analyzer. Table 1 lists the basic statistics of these experimental corpora.
Table 1. Basic statistics of the experimental corpora

              Total files  Total sentences  Total words  Total characters  Mean sentence length
Training set  148          6676             170829       268151            25.6
Test set      37           1461             36543        57655             25.0
Total         185          8137             207372       325806            25.5
Through rule learning and extension-evolution processing on the training corpus, we obtained the following multi-level, multi-granularity basic block rule base:
● at the primitive rule level, 211 POS-tag description rules;
● at the extension rule level, 4972 extension rules that introduce further lexical constraints and context restriction descriptions.
At the same time, matching the concrete application of the extension rules above, we also used the following lexical knowledge bases:
1) Lexical association knowledge base: at present mainly the verb-object relation base, containing the verb-object relation description pairs formed by common Chinese verbs with following nouns and verbs. Basic scale: 5346 verb entries and 52390 lexical association pairs, i.e. on average each verb entry has about 10 verb-object relation description pairs.
2) Feature verb lists: the verbs that can take different types of objects, extracted from the Peking University syntactic information dictionary. Basic scale: 4888 verbs taking noun objects, 781 taking place objects, 48 taking time objects, 278 ditransitive verbs, 403 pivot ('double language') verbs, 732 verbs taking verbal objects, 122 taking adjective objects, and 698 taking sentential objects.
3) Noun semantic information table: the 11 major semantic categories of common Chinese nouns, comprising organization, person, artifact, natural object, information, mental object, event, attribute, quantity, time and space. Basic scale: 26905 noun entries.
Considering the concrete processing situation of the present analyzer, we first divided the analysis results into three major classes according to whether they contain ambiguity: 1) unambiguous intervals; 2) combination ambiguity intervals; 3) crossing ambiguity intervals. On the current corpus, under the open test condition, the words covered by these three interval classes account for 0.955, 0.026 and 0.020 of the total processed words, respectively. This shows that for most of the corpus, the present block analyzer can complete the analysis and disambiguation work well.
In the two types of ambiguous intervals, the coverage of the correct analysis by the ambiguous results is measured with the following indices:
1) correct-result recall: the proportion of correct results contained in all ambiguous analysis results, computed as: (number of correct results contained in the ambiguous results / total number of correct results involved in the ambiguous intervals) × 100%;
2) ambiguity distribution rate: the average number of ambiguous analyses formed per correct result, computed as:
total number of ambiguous results / total number of correct results involved in the ambiguous intervals.
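The two indices above can be sketched in code as follows (a minimal illustration; the function and variable names are ours, not from the patent):

```python
def ambiguity_metrics(intervals):
    """Each interval is a pair (candidates, gold): the ambiguous analyses
    kept for that interval and the correct analyses it involves."""
    kept_correct = sum(len(set(cands) & set(gold)) for cands, gold in intervals)
    total_gold = sum(len(gold) for _, gold in intervals)
    total_cands = sum(len(cands) for cands, _ in intervals)
    recall = kept_correct / total_gold * 100    # correct-result recall (%)
    distribution = total_cands / total_gold     # ambiguity distribution rate
    return recall, distribution
```

For example, two intervals that each keep the correct analysis among several candidates give 100% recall and a distribution rate equal to the average number of candidates per correct result.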
Table 2. Analysis results for ambiguous intervals

              Make-up ambiguity              Crossing ambiguity
              Recall (%)   Distribution      Recall (%)   Distribution
Closed test   96.25        1.88              76.60        2.72
Open test     97.75        1.75              67.79        2.58
Table 2 shows the present results. As can be seen, the retained make-up ambiguity results preserve most of the correct analyses; deciding how to split or merge them requires wider contextual information. Complex crossing ambiguity remains the main processing difficulty, and more effective lexical-semantic descriptions need to be introduced.
In the unambiguous intervals, basic-block recognition is measured with the following indices:
1) recognition precision (P), computed as: (number of correctly analyzed basic blocks / total number of automatically recognized basic blocks) × 100%;
2) recognition recall (R), computed as: (number of correctly analyzed basic blocks / total number of correct basic blocks) × 100%;
3) the F-Measure, the harmonic mean of precision and recall, computed as: 2*P*R/(P+R).
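The three recognition indices can be sketched as follows (names are ours):

```python
def prf(correct, recognized, gold):
    """correct: blocks both recognized and correct; recognized: total blocks
    output by the analyzer; gold: total correct blocks in the annotation."""
    p = correct / recognized * 100        # precision (%)
    r = correct / gold * 100              # recall (%)
    f = 2 * p * r / (p + r)               # harmonic mean of P and R
    return p, r, f
```

Note that 2PR/(P+R) equals P (and R) when the two coincide, which is why the balanced entries in Tables 3 and 4 have F-M values close to both.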
Different correctness criteria are set for different block types:
● for multi-word blocks, two levels are considered: 1) identical block boundary, syntactic tag and relation tag (B+C+R); 2) identical block boundary and syntactic tag, relation tag may differ (B+C);
● for single-word blocks, only the identity of the block boundary and syntactic tag is judged.
Tables 3 and 4 show the present results. As can be seen, multi-word quantity blocks (mp), time blocks (tp) and adjective blocks (ap) reach very high F-M values in both closed and open tests, with little difference between the two, showing that the automatically acquired rules for these three block types have good descriptive power and basically cover their various distributions. Multi-word verb blocks (vp), noun blocks (np) and space blocks (sp) still leave much room for improvement; vp and np blocks occupy the largest proportion in real text, and their accurate recognition is the focus of our research. From the present results, whether in open or closed testing, the F-M of vp blocks exceeds that of np blocks by 3-4 percentage points, and this performance difference is more obvious when relation tags are taken into account. This fully demonstrates the important role of lexical association information in improving the boundary recognition and internal-relation analysis of basic blocks. Moreover, the open-test F-M of vp and np blocks is generally 2-3 percentage points lower than in the closed test, showing that their rule descriptions are still insufficient and that new distributions appearing in the test material may not be covered by the training corpus.
Table 3. Closed-test experimental results

        Multi-word block 1: B+C+R     Multi-word block 2: B+C       Single-word block: B+C
Tag     P        R        F-M         P        R        F-M         P        R        F-M
np      78.39%   79.22%   78.80%     85.47%   86.38%   85.92%     93.74%   90.33%   92.00%
vp      86.59%   83.94%   85.24%     91.61%   88.80%   90.18%     90.83%   94.74%   92.74%
mp      96.61%   96.88%   96.75%     96.71%   96.98%   96.85%     63.13%   84.57%   72.30%
ap      93.50%   93.83%   93.66%     94.28%   94.62%   94.45%     93.11%   92.74%   92.92%
tp      93.08%   92.03%   92.55%     93.30%   92.24%   92.77%     88.29%   91.11%   89.68%
sp      81.93%   84.79%   83.33%     82.77%   85.66%   84.19%     79.76%   94.71%   86.59%
Table 4. Open-test experimental results

        Multi-word block 1: B+C+R     Multi-word block 2: B+C       Single-word block: B+C
Tag     P        R        F-M         P        R        F-M         P        R        F-M
np      75.25%   75.76%   75.50%     83.68%   84.25%   83.97%     91.74%   88.28%   89.97%
vp      83.23%   81.46%   82.34%     87.35%   85.49%   86.41%     90.65%   93.69%   92.15%
mp      94.89%   95.26%   95.08%     94.89%   95.26%   95.08%     54.55%   83.33%   65.93%
ap      93.99%   97.33%   95.63%     93.99%   97.33%   95.63%     94.42%   94.83%   94.62%
tp      92.75%   88.18%   90.40%     93.52%   88.92%   91.16%     83.78%   91.63%   87.53%
sp      78.76%   86.41%   82.41%     79.65%   87.38%   83.33%     81.25%   92.86%   86.67%
Description of drawings
Fig. 1. Overall control flow of the Chinese basic-block analyzer
Fig. 2. Processing flow of the multi-word block recognition module
Fig. 3. Rule-matching processing flow
Fig. 4. Processing flow of the dynamic local-context disambiguation module
Fig. 5. Processing flow of the ambiguous-interval discovery module
Fig. 6. Disambiguation flow for make-up ambiguity intervals
Embodiment
The design goal of the basic-block analyzer is, with the support of the basic-block rule base and the lexical-semantic knowledge base, to automatically analyze real Chinese text sentences that have undergone word segmentation and part-of-speech tagging, recognize the boundary position of each basic block, determine its syntactic tag, relation tag and analysis confidence, and so obtain the basic-block annotation of the sentence. A concrete analysis example is given below:
The input sentence: we/r should/vM notes/v selects/v some/m the young and the middle aged/scientist n/n participation/v like this/r /the u world/n meeting/n, / w cultivation/v one/m props up/q understands/v science/n ,/w understands/v diplomacy/n /u "/w national team/n "/w ,/w actively/a carries out/the v people/n diplomacy/n./w
Analysis result: [np-SG we/r] [vp-SG should/vM] [vp-SG notes/v] [vp-SG selects/v] [np-ZX some/m the young and the middle aged/scientist n/n] [vp-SG participation/v] [vp-SG like this/r] /the u[np-ZX world/n meeting/n], / w[vp-SG cultivation/v] [mp-ZX one/m props up/q] [vp-PO understands/v science/n] ,/w[vp-PO understands/v diplomacy/n] /u "/w[np-SG national team/n] "/w ,/w[dp-SG actively/a] [vp-SG carries out/v] [np-ZX people/n diplomacy/n]./w
The analysis resources currently used mainly comprise the following two parts:
1) The basic-block rule base stores the multi-level basic-block description rules needed for recognition; they are obtained through automatic rule learning and evolution. The basic format is: <structural combination> → <reduction tag> <confidence>. Where:
● the structural combination describes the internal composition of each basic block, divided into two levels according to descriptive power:
a) primitive rules, whose structural combination is a part-of-speech tag string; b) extension rules, which add lexical constraints and contextual restrictions to form structural descriptions with stronger descriptive power.
● the reduction tag, mainly comprising the syntactic tag and the relation tag, describes the basic syntactic information of the block.
● the confidence θ gives the expected reliability of a basic block analyzed with this rule.
2) The lexical knowledge base stores the various lexical descriptions that may be used during analysis; they are obtained from external knowledge sources and mainly comprise:
● lexical association knowledge base: syntactic relation pairs formed between common Chinese content words. Basic record format: <word 1> <word 2> <POS 1> <POS 2> <syntactic relation tag>;
● feature verb lists: information, extracted from the grammatical information dictionary, on verbs that can take different types of objects.
The basic record format is {<verb entry>}, organized into different verb lists by object type;
● noun semantic information table: the 11 semantic categories of common Chinese nouns, namely organization, person, artifact, natural object, information, mind, event, attribute, quantity, time and space. Basic record format: <noun entry> <semantic category tag>.
To meet different application demands, the following two data structures are designed to store the basic-block analysis results:
1) Compressed shared forest (PSF): the typical data structure used in chart parsing. The basic design idea is: after the n words of a sentence are arranged from left to right, there are n+1 positions from the left of the 1st word to the right of the nth; each position is defined as a chart node, and any two nodes can form a chart edge, written (l, r), where l is the left node position of the edge, r the right node position, and r > l. All edges together form a chart array, giving a complete description of all words and of the complex constituents, including basic blocks and phrases, formed from them. Each edge record contains the following information: <constituent flag> <constituent left boundary> <constituent right boundary> <syntactic tag> <relation tag> <constituent analysis confidence>, where:
● <constituent flag> denotes the constituent class, currently represented with the following characters:
◆ W - word, B - single-word block, P - multi-word block, D - edge dynamically deleted by disambiguation
● <constituent left boundary> and <constituent right boundary> denote the left and right boundary positions of the constituent edge in the sentence;
● <syntactic tag> denotes the external syntactic function of the constituent: for a word edge, its part-of-speech tag; for a basic-block edge, the syntactic tag obtained from the rule base;
● <relation tag> denotes the internal grammatical relation of the constituent: for a word edge, the word itself; for a basic-block edge, the relation tag obtained from the rule base;
● <constituent confidence> denotes the processing confidence of the constituent: 0 for a word edge; for a basic-block edge, the confidence obtained from the rule base;
2) Block tag sequence stack (ChkStack): stores the unambiguous basic blocks extracted from the PSF and the analysis intervals that may contain ambiguity, forming a linear block tag sequence over the input sentence. Each stack record contains: <constituent flag> <constituent left boundary> <constituent right boundary> <syntactic tag> <relation tag> <PSF edge number of this constituent>, where the first five fields are the same as in the PSF.
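The two record layouts above can be sketched as data types (a minimal illustration; field names follow the claims section, the class names are ours):

```python
from dataclasses import dataclass

@dataclass
class Edge:
    """One PSF chart-edge record."""
    cflag: str    # 'W' word, 'B' single-word block, 'P' multi-word block, 'D' deleted
    cl: int       # left boundary position in the sentence
    cr: int       # right boundary position (cr > cl)
    cctag: str    # POS tag for a word edge; block syntactic tag for a block edge
    crtag: str    # the word itself for a word edge; block relation tag otherwise
    theta: float  # analysis confidence (0 for word edges)

@dataclass
class ChkRecord:
    """One ChkStack record: the first five fields mirror the PSF edge,
    plus the number of the PSF edge it was extracted from."""
    cflag: str
    cl: int
    cr: int
    cctag: str
    crtag: str
    psf_eno: int
```

For instance, the multi-word block generated in the later worked example would be `Edge('P', 4, 7, 'np', 'ZX', 1.0)`.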
In the above dual-data-structure design, the PSF stores all basic-block information obtained by analysis (including the raw data used for disambiguation). Since it uses the same data structure as our existing full parser, it can be linked with the full parser seamlessly, making it easy to exploit all possible basic-block analyses when further building the complete parse tree of the sentence. The ChkStack stores the reliable basic blocks extracted from the PSF and the possibly ambiguous analysis intervals, forming a linear block tag sequence from which the basic-block annotation of the input sentence can easily be generated.
To give full play to the processing power of the automatically acquired multi-level description rules and to improve matching efficiency, the following internal data structures are designed for storing the basic-block rule base:
1) Primitive rule table BasRules[]: stores all part-of-speech tag string description rules. Its record format is [r_stru, r_tag, fp, fn, θ, e_sp, e_ep], where r_stru is the structural combination of the rule, r_tag the reduction tag, fp the positive-example frequency, fn the negative-example frequency, θ the rule confidence computed as θ = fp/(fp+fn), e_sp the start position of the corresponding extension rules in the extension rule table, and e_ep their end position;
2) Extension rule table ExpRules[]: stores all description rules with internal lexical constraints and external contextual restrictions learned through extending evolution. Its record format is [r_stru, r_tag, fp, fn, θ], with r_stru, r_tag, fp, fn and θ defined as in BasRules[];
In this way, the index information e_sp and e_ep recorded in the primitive rule table establishes the internal link between the two tables.
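The e_sp/e_ep linkage can be sketched as follows. The miniature tables below are hypothetical illustrations (not the patent's actual rules), and we assume e_ep marks the position just past the last linked extension rule:

```python
# BasRules record: [r_stru, r_tag, fp, fn, theta, e_sp, e_ep]
bas_rules = [
    ["m+n+n", "np-ZX", 120, 30, 120 / 150, 0, 2],  # extension rules 0..1
    ["v+v",   "vp-LN",  40, 60,  40 / 100, 2, 2],  # no extension rules
]
# ExpRules record: [r_stru, r_tag, fp, fn, theta]
exp_rules = [
    ["v_m+n+n", "np-ZX", 14, 0, 1.00],
    ["m+n+n_v", "np-ZX", 18, 1, 0.95],
]

def extensions_of(bas_record):
    """Return the slice of the extension rule table linked to a primitive rule."""
    e_sp, e_ep = bas_record[5], bas_record[6]
    return exp_rules[e_sp:e_ep]
```

The contiguous layout means the extension rules of a primitive rule are retrieved with a single slice rather than a search.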
In the concrete matching process, the part-of-speech tag string of the position to be analyzed in the sentence is first used to retrieve the primitive rule table. If a matching primitive rule is found, the analyzer further checks whether extension rules exist for it. If they do, the corresponding position in the sentence is expanded with multi-level information, and the analyzer checks whether any expanded combination occurs in the interval [e_sp, e_ep] of the extension rule table. If a matching extension rule is found, the rule with the highest processing confidence among all matches is selected as the output. Otherwise, the primitive rule is used as the default match.
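The rule-selection step of this matching process can be sketched as follows (a simplified illustration; names are ours):

```python
def select_rule(primitive, extension_hits):
    """primitive: the matched BasRules record [r_stru, r_tag, fp, fn, theta, ...];
    extension_hits: the matching ExpRules records found in its [e_sp, e_ep]
    interval (possibly empty)."""
    if extension_hits:
        # choose the matching extension rule with the highest confidence
        return max(extension_hits, key=lambda rule: rule[4])
    return primitive  # default match: the primitive rule itself
```

In the worked example later in the text, two extension rules with confidences 0.95 and 1.0 both match, and the 1.0 rule is selected in exactly this way.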
Fig. 1 gives the complete processing flow of the present basic-block analyzer: first load the analysis resources; then read a sentence to be analyzed; initialize the related data structure PSF, adding all "word + part-of-speech" items of the sentence into the PSF as word edges; scan the whole sentence from left to right to find and recognize all multi-word blocks; on this basis, find all ambiguous structures in the analysis results and disambiguate them automatically; then automatically promote the content words not covered by any multi-word block to form all possible single-word blocks; finally extract and output the best basic-block tag sequence of the sentence; and unload the analysis resources.
The specific implementation of several main processing steps is detailed below. For ease of understanding, some basic symbols and terms are defined first:
● θ: confidence of an analyzed basic block, generally determined by the matching basic-block rule;
● LowBelTh: confidence threshold for words in the sentence to combine into a block; current value 0.5;
● InBelTh: confidence threshold for a word combination to form a make-up ambiguity interval; current value 0.7;
● HighBelTh: confidence threshold for words to combine into a reliable basic block; current value 0.85;
● n: total number of words in the sentence to be analyzed;
● ERSum: total number of extension-rule description strings for a given word combination in the sentence;
● CBSum: total number of basic blocks in the PSF that cross a given basic block;
● OASum: total number of crossing ambiguity intervals found in the sentence;
● CASum: total number of make-up ambiguity intervals found in the sentence;
1) Multi-word block recognition module
Fig. 2 lists the complete processing flow of the multi-word block recognition module. Its basic method is: scan the whole sentence from left to right, forming from each word the possible basic-block combination intervals (of length 2 to 6). If a rule matching the interval can be found in the basic-block rule base (basic flow in Fig. 3), extract the "syntactic tag + relation tag + confidence" information from the rule description, automatically generate a new basic-block record (flagged 'P') and add it to the PSF, and perform dynamic disambiguation by relative combination strength in the local context (basic flow in Fig. 4), finding and excluding the less plausible block combinations among all local-context basic blocks that cross the newly generated block.
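The scanning skeleton of this module can be sketched as follows (a simplified illustration that omits the dynamic local-context disambiguation of Fig. 4; names are ours):

```python
def recognize_multiword_blocks(sentence, find_rule, psf):
    """sentence: list of (word, pos) pairs; find_rule(sentence, i, j): rule-base
    lookup for span [i, j), returning (syntactic_tag, relation_tag, theta) or
    None; psf: list receiving new edge records."""
    n = len(sentence)
    for i in range(n):
        for j in range(i + 2, min(i + 7, n + 1)):   # spans of length 2..6
            hit = find_rule(sentence, i, j)
            if hit is None:
                continue
            cctag, crtag, theta = hit
            psf.append(["P", i, j, cctag, crtag, theta])
            # the real analyzer additionally runs dynamic disambiguation
            # against the blocks in psf that cross this new edge (Fig. 4)
    return psf
```

This makes explicit that every candidate span of length 2 to 6 is checked against the rule base, and only rule-sanctioned spans become 'P' edges.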
2) Automatic ambiguity structure discovery and disambiguation module
The basic method is: extract from the PSF all automatically analyzed multi-word blocks and sort them by the boundary positions of each block (first by left boundary, ascending; then by right boundary, ascending). Then process this block sequence in order, finding all crossing-type and make-up ambiguity intervals as follows:
● crossing ambiguity: if the boundaries of two adjacent basic blocks (<L1, R1> and <L2, R2>) satisfy the condition L2 < R1, a possible crossing ambiguity interval <L1, R2> is formed; this process is repeated until a maximal crossing ambiguity interval is found;
● make-up ambiguity: if the analysis confidence of a basic block is less than InBelTh, a possible make-up ambiguity interval is formed.
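The crossing-ambiguity detection described above can be sketched as follows (a minimal illustration; names are ours):

```python
def crossing_intervals(blocks):
    """blocks: (l, r) boundary pairs sorted by l, then r. Chains of blocks
    satisfying the crossing condition L2 < R1 are merged into maximal
    crossing-ambiguity intervals."""
    intervals = []
    cur_l, cur_r, count = None, None, 0
    for l, r in blocks:
        if cur_r is not None and l < cur_r:   # condition L2 < R1: blocks cross
            cur_r = max(cur_r, r)             # extend the current interval
            count += 1
        else:
            if count > 1:                     # only真 crossings form an interval
                intervals.append((cur_l, cur_r))
            cur_l, cur_r, count = l, r, 1
    if count > 1:
        intervals.append((cur_l, cur_r))
    return intervals
```

For example, the blocks (4, 7) and (5, 7) from the worked example later in the text overlap (5 < 7) and are merged into the single interval (4, 7).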
Fig. 5 shows the basic processing flow of the ambiguous-interval discovery module. Afterwards, each crossing and make-up ambiguity interval can be disambiguated automatically by two loops.
Comparatively speaking, the disambiguation method for crossing ambiguity intervals is more complicated, since the different internal ambiguity structures must be considered. The basic processing flow is as follows:
1. obtain all crossing basic blocks in the interval and check their ambiguity combination state;
2. if there is a fully covering block, or a possible chain relation structure can be formed, continue to the next step; otherwise return;
3. obtain the confidence values θi of all basic blocks in the interval, and set their maximum Max_θ and minimum Min_θ;
4. if the following conditions are met simultaneously, form a new evenly distributed chain relation structure and return:
● ((all θi >= InBelTh) || (all θi > LowBelTh)) && (Max_θ - Min_θ < 0.1);
● all crossing basic blocks are noun blocks (np) or verb blocks (vp);
5. if the following conditions are met simultaneously, form a new chain-relation noun block and return:
● all crossing basic blocks are noun blocks (np);
● the whole interval forms one noun basic block, and its distribution confidence is greater than InBelTh;
6. if the following conditions are met simultaneously, form a new chain-relation noun block and return:
● the syntactic tags of all crossing basic blocks belong to {np, sp, tp};
● a fully covering block exists, and all θi > LowBelTh;
7. in all other cases, return directly;
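The threshold tests in steps 4-6 can be sketched as follows. This is a simplified paraphrase (it omits the full-coverage and whole-interval checks of steps 2, 5 and 6, and the names are ours):

```python
LOW_BEL_TH, IN_BEL_TH = 0.5, 0.7  # thresholds as defined earlier in the text

def chain_conditions(blocks):
    """blocks: list of (syntactic_tag, theta) for the crossing basic blocks.
    Returns which chain-formation case the interval falls under."""
    thetas = [t for _, t in blocks]
    tags = {tag for tag, _ in blocks}
    even = ((all(t >= IN_BEL_TH for t in thetas)
             or all(t > LOW_BEL_TH for t in thetas))
            and max(thetas) - min(thetas) < 0.1)
    if even and tags <= {"np", "vp"}:
        return "even-chain"        # step 4: evenly distributed chain structure
    if tags <= {"np", "sp", "tp"} and all(t > LOW_BEL_TH for t in thetas):
        return "nominal-chain"     # steps 5/6: chain-relation noun block
    return "keep-ambiguous"        # step 7: leave for follow-up processing
```

The point of the sketch is the ordering: the stricter evenly-distributed case is tried before the looser nominal-chain case, and anything that satisfies neither keeps its ambiguity.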
The disambiguation method for make-up ambiguity intervals is simpler, needing only the distribution confidence information of the different cases; the main processing flow is shown in Fig. 6.
3) Single-word block automatic promotion module
The basic method is: scan all words in the sentence from left to right; if a word is covered by some multi-word block, or is a specific function word (conjunction, auxiliary, preposition, modal particle, interjection, punctuation, etc.), skip it directly and process the next word; otherwise obtain the syntactic tag of the automatically promoted single-word block by the following rules:
● if the part-of-speech tag is noun (n), promote to noun block np;
● if the part-of-speech tag is verb (v), promote to verb block vp;
● if the part-of-speech tag is adverb (d), promote to adverb block dp;
● if the part-of-speech tag is adjective (a), promote to adjective block ap;
and accordingly add a new single-word block edge to the PSF. The above process repeats until all words in the sentence have been processed.
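The promotion rules above can be sketched as a table lookup (a minimal illustration covering only the four promotion rules listed; names are ours):

```python
PROMOTION = {"n": "np", "v": "vp", "d": "dp", "a": "ap"}
# function-word POS tags skipped by the module: conjunction, auxiliary,
# preposition, modal particle, interjection, punctuation
FUNCTION_POS = {"c", "u", "p", "y", "e", "w"}

def promote_single_word_blocks(sentence, covered, psf):
    """sentence: list of (word, pos) pairs; covered: set of word indices
    already inside some multi-word block; psf: list receiving the new
    single-word block edges (flag 'B', relation tag 'SG')."""
    for i, (word, pos) in enumerate(sentence):
        if i in covered or pos in FUNCTION_POS:
            continue
        tag = PROMOTION.get(pos)
        if tag is not None:
            psf.append(["B", i, i + 1, tag, "SG", 0.0])
    return psf
```

Function words and already-covered words produce no block, so the output sequence contains block edges only for uncovered content words.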
4) Linear block tag sequence generation module
Through the above processing, all basic-block analysis results (multi-word blocks + single-word blocks) have been obtained, all stored in the PSF array. The final processing step is to extract from the PSF the linear basic-block tag sequence formed over the whole sentence and store the relevant data in the ChkStack. The concrete processing flow is:
1. analyze the PSF array and obtain the position tables of the following two classes of word intervals covering the complete sentence:
● ambiguous intervals: AmbiList;
● unambiguous intervals: NonAmbiList;
2. for each ambiguous interval, automatically generate an ambiguity-interval block and add it to the ChkStack; // the syntactic tag is that of the first basic block in the interval, and the relation tag is set to "AM" (ambiguous interval)
3. for each unambiguous interval, extract in order each basic block covering it and add it to the ChkStack;
4. sort the block information in the ChkStack to form the block description sequence covering the whole sentence;
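The four steps above can be sketched as follows (a simplified illustration; the constituent flag used for an ambiguity-interval block is an assumption of ours, and the names are ours):

```python
def block_tag_sequence(ambi_list, nonambi_blocks, psf_eno_of):
    """ambi_list: ambiguous intervals as (l, r, first_block_tag);
    nonambi_blocks: unambiguous blocks as (cflag, l, r, cctag, crtag);
    psf_eno_of: map from (l, r) to the PSF edge number (-1 if absent)."""
    chk_stack = []
    for l, r, cctag in ambi_list:
        # relation tag 'AM' marks an ambiguous interval kept as one block
        chk_stack.append(["A", l, r, cctag, "AM", psf_eno_of.get((l, r), -1)])
    for cflag, l, r, cctag, crtag in nonambi_blocks:
        chk_stack.append([cflag, l, r, cctag, crtag, psf_eno_of.get((l, r), -1)])
    chk_stack.sort(key=lambda rec: (rec[1], rec[2]))  # step 4: order by position
    return chk_stack
```

Sorting by (left boundary, right boundary) yields the linear block tag sequence covering the whole sentence, from which the annotation string is printed.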
A specific embodiment of the above analysis algorithm is given below. After loading the analysis resources (basic-block rule base + lexical knowledge base), the following input sentence is analyzed automatically:
We/r should/vM notes/v selects/v some/m the young and the middle aged/scientist n/n participation/v like this/r /the u world/n meeting/n, / w cultivation/v one/m props up/q understands/v science/n ,/w understands/v diplomacy/n /u "/w national team/n "/w ,/w actively/a carries out/the v people/n diplomacy/n./w
First initialize the data structure PSF: the basic information of the 31 word items in the sentence (word + part of speech) is added to the PSF as word edges (constituent flag 'W'). Then the whole sentence is scanned from left to right to find and recognize all multi-word blocks (Fig. 2).
When the scan reaches the 2nd word of the sentence (counting from 0), "notes/v", an effective primitive rule combination "v+v" is found, and extension rules exist for it. The rule expansion module is therefore called, yielding the following extension-rule structural description strings:
1. v(winl:VVPLIST)+v  // consider the syntactic feature of the verb: can take a verbal object
2. vM_v+v  // consider the left adjacent part-of-speech restriction
3. v+v_m  // consider the right adjacent part-of-speech restriction
4. vM_v+v_m  // consider both adjacent part-of-speech restrictions
5. vM_v(winl:VVPLIST)+v  // consider the verb's syntactic feature + the left adjacent restriction
6. v(winl:VVPLIST)+v_m  // consider the verb's syntactic feature + the right adjacent restriction
7. vM_v(winl:VVPLIST)+v_m  // consider the verb's syntactic feature + both adjacent restrictions
By retrieving the 254 extension rules corresponding to this primitive rule, no matching extension rule is found, showing that under the present context the word combination "notes/v selects/v" is unlikely to form a basic block.
When the scan continues rightward to the 4th word of the sentence, "some/m", another effective primitive rule combination "m+n+n" is found, and extension rules exist for it. The rule expansion module is called, yielding the following extension-rule structural description strings:
1. v_m+n+n  // consider the left adjacent part-of-speech restriction
2. m+n+n_v  // consider the right adjacent part-of-speech restriction
3. v_m+n+n_v  // consider both adjacent part-of-speech restrictions
By retrieving the 21 extension rules corresponding to this rule, the following two matching extension rules are found:
1. m+n+n_v → np-ZX, 18, 1, 0.95  // matches structural combination 2 above
2. v_m+n+n → np-ZX, 14, 0, 1.0  // matches structural combination 1 above
From these, rule 2, with the higher confidence, is selected as the best matching rule, and accordingly a new multi-word block is generated and added to the PSF: ['P', 4, 7, np, ZX, 1.0].
The above analysis proceeds continuously; after the multi-word block recognition module finishes, 7 multi-word blocks have been obtained in total. The table below lists the detailed description information of these basic blocks.

Edge no.  Flag  Left  Right  Syntactic tag  Relation tag  Confidence
37        P     28    30     np             ZX            7.812500e-001
36        P     19    21     vp             PO            9.166667e-001
35        P     16    18     vp             PO            8.500000e-001
34        P     14    16     mp             ZX            9.187863e-001
33        P     10    12     np             ZX            1.000000e+000
32        P     5     7      np             ZX            8.729776e-001
31        P     4     7      np             ZX            1.000000e+000
The ambiguity structure processing module is then called; a crossing ambiguity interval [4, 6] is found in the sentence. Disambiguating it, one can see that it satisfies the 3rd chain-structure formation condition:
● all crossing basic blocks are nominal blocks (np, sp, tp);
● a fully covering block exists, and all θi > LowBelTh;
Therefore the fully covering block (edge number 31) is selected, and the internal basic block is excluded (edge number 32, its flag set to 'D'), completing the automatic disambiguation.
On this basis, the remaining content words in the sentence not covered by any multi-word block (the 0th, 1st, 2nd, 3rd, 7th, 8th, 13th, 23rd, 26th and 27th words) are further promoted automatically, forming 10 single-word blocks. This completes the basic-block analysis functions in the flow of Fig. 1.
Finally, the best basic-block tag sequence extraction module is run and all information in the ChkStack is output, giving the following basic-block analysis result:
[np-SG we/r] [vp-SG should/vM] [vp-SG notes/v] [vp-SG selects/v] [np-ZX some/m the young and the middle aged/scientist n/n] [vp-SG participation/v] [vp-SG like this/r] /the u[np-ZX world/n meeting/n], / w[vp-SG cultivation/v] [mp-ZX one/m props up/q] [vp-PO understands/v science/n] ,/w[vp-PO understands/v diplomacy/n] /u "/w[np-SG national team/n] "/w ,/w[dp-SG actively/a] [vp-SG carries out/v] [np-ZX people/n diplomacy/n]./w
This basic-block analyzer can be implemented in the standard C/C++ programming language on any PC-compatible machine.

Claims (1)

1. A rule-based automatic analysis method for Chinese basic blocks, characterized in that it contains the following steps in sequence:
(1) computer initialization:
a. the input sentence is S, S = {<w_i, t_i>}, where w_i is the i-th word in sentence S, t_i is the part-of-speech tag of the i-th word, i ∈ [1, n], and n is the total number of words in the sentence;
b. the compressed shared forest PSF[] is represented with the chart data structure: after the n words of the sentence are arranged from left to right, there are n+1 positions from the left of the 1st word to the right of the nth; each position is defined as a chart node, and any two nodes can form a chart edge, written (l, r), where l is the left node position of the edge, r the right node position, and r > l; all edges together form a chart array, giving a complete description of all words and of the basic blocks formed from words; words and basic blocks are collectively called syntactic constituents; a basic block is an aggregate of adjacent words in sentence S centered on some content word; the PSF[] includes:
b1. <constituent flag cflag>, denoting the following constituent classes: W - word, B - single-word block, P - multi-word block, D - edge dynamically deleted by disambiguation;
b2. <constituent left boundary cl> and <constituent right boundary cr>, the left and right boundary positions of the constituent edge in sentence S, cl ∈ [0, n-1], cr ∈ [1, n];
b3. <syntactic tag cctag>, denoting the external syntactic function of the constituent:
for a word edge, its part-of-speech tag, specifically: n - noun, s - place word, t - time word, f - locative, r - pronoun, vM - auxiliary verb, v - verb, a - adjective, d - adverb, m - numeral, q - classifier, p - preposition, u - auxiliary, c - conjunction, y - modal particle, e - interjection, w - punctuation;
for a basic-block edge, the syntactic tag it obtains from the rule base, specifically: np - noun block, vp - verb block, sp - space block, tp - time block, mp - quantity block, ap - adjective block, dp - adverb block;
b4. <relation tag crtag>, denoting the internal grammatical relation of the constituent:
for a word edge, the word itself;
for a basic-block edge, the relation tag it obtains from the rule base, specifically: ZX - right-corner center structure, LN - chain relation structure, LH - coordination, PO - predicate-object relation, SB - predicate-complement relation, AM - ambiguous interval, SG - single-word block, wherein:
the right-corner center structure denotes that all words in the basic block depend directly on the right-corner head word, forming a rightward-centered dependency structure; the basic pattern is A_1 ... A_n H, with dependencies A_1 → H, ..., A_n → H, where H is the syntactic-semantic head of the whole basic block and A_1, ..., A_n are modifiers;
the chain relation structure denotes that each word in the basic block depends in turn on its immediate right-adjacent word, forming a multi-center dependency chain arranged from left to right; the basic pattern is H_0 H_1 ... H_n, with dependencies H_0 → H_1, ..., H_{n-1} → H_n, where the H_i, i ∈ [1, n-1], are semantic aggregation points at different levels and H_n is the syntactic-semantic head of the whole basic block;
coordination denotes that the words in the basic block form a parallel structure, e.g. "teachers (and) students";
the predicate-object relation denotes that the two words in the basic block form a verb-object structure, e.g. "have a meal";
the predicate-complement relation denotes that the two words in the basic block form a verb-complement structure, e.g. "go down";
the ambiguous interval denotes that some words can form different structural combination relations that are difficult to disambiguate automatically with the existing basic-block rule base and lexical knowledge base contents; the multiple structural combinations can only be retained for a follow-up system to select among;
b5. <constituent confidence θ>, denoting the processing confidence of the constituent, θ ∈ [0, 1];
b6. <word edge>: the i-th word in the sentence, characterized by cflag = W, cl = i-1, cr = i, cctag = t_i, crtag = w_i, θ = 0;
b7. <single-word block edge>, i.e. a basic block composed of one word, characterized by cflag = B, cr - cl = 1, crtag = SG;
b8. <multi-word block edge>, i.e. a basic block composed of two or more words, characterized by cflag = P, cr - cl >= 2;
C. block tag sequence stack ChkStack[]: stores the unambiguous basic blocks extracted from the PSF and the analysis intervals that may contain ambiguity, forming the linear block tag sequence of the input sentence; its main record format is: [cflag, cl, cr, cctag, crtag, corresponding PSF edge number PSFeno];
D. basic rule table BasRules[]: stores all POS-tag-string description rules; its main record format is [r_stru, r_tag, fp, fn, θ, e_sp, e_ep], wherein:
r_stru is the structural relation of the rule,
r_tag is the reduction tag, comprising a syntactic tag part and a relation tag part,
fp is the positive-example frequency,
fn is the negative-example frequency,
θ is the rule confidence, computed as: θ = fp / (fp + fn),
e_sp is the start position of the corresponding extended rules in the extended rule table below,
e_ep is the end position of the corresponding extended rules in the extended rule table,
and if the basic rule has no corresponding extended rules, e_sp and e_ep are both -1;
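The rule confidence θ in D follows directly from the positive- and negative-example counts. A small illustrative helper (the function name and the zero-count fallback are our own assumptions):

```python
def rule_confidence(fp: int, fn: int) -> float:
    """theta = fp / (fp + fn): the fraction of positive examples for a rule."""
    if fp + fn == 0:
        return 0.0  # assumption: a rule with no examples gets zero confidence
    return fp / (fp + fn)

print(rule_confidence(17, 3))  # 0.85
```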
E. extended rule table ExpRules[]: stores all description rules with lexical constraints and contextual restrictions obtained through extension learning and evolution; its main record format is: [r_stru, r_tag, fp, fn, θ], with the fields defined as above;
F. basic block rule base: stores the basic block description rules at the different levels needed for basic block recognition, obtained by automatic rule learning and evolution; its basic format is <structural relation> → <reduction tag> <confidence>, wherein:
the structural relation describes the internal combining structure of each basic block and is divided into two levels according to descriptive power:
a) basic rules, whose structural relation is described by a POS tag string,
b) extended rules, which form more expressive structural descriptions through lexical constraints and contextual restrictions; the reduction tag and confidence are defined as above;
G. lexical knowledge base: stores the lexical description knowledge that may be used during analysis, obtained from external knowledge sources, and comprises the following content:
G1. lexical collocation knowledge base: contains descriptions of the syntactic relations formed between pairs of common Chinese content words; the basic data format is: <word 1> <word 2> <POS 1> <POS 2> <syntactic relation tag>;
G2. feature verb lists: contain the verbs, extracted from a syntactic information dictionary, that can take different types of objects; the basic data format is: {<verb entry>}, organized into different verb lists by object type;
G3. noun semantic information table: contains 11 semantic-category labels for common Chinese nouns: organization, person, artifact, natural object, information, spirit, event, attribute, quantity, time and space; the basic data format is: <noun entry> <semantic category tag>;
H. crossing ambiguity interval: if the left and right boundaries of two basic blocks (<L1, R1> and <L2, R2>) satisfy: (L2 < R1) && (L2 >= L1) && (R2 > R1), or (R2 > L1) && (R2 <= R1) && (L2 < L1), they form a crossing ambiguity interval <AmL, AmR>, where AmL = min(L1, L2) and AmR = max(R1, R2);
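Condition H can be checked mechanically. A sketch of the test, assuming the half-open <L, R> word intervals used throughout (the function name is ours):

```python
def crossing_interval(c1, c2):
    """Return the crossing ambiguity interval <AmL, AmR> of two blocks
    <L1, R1> and <L2, R2> per condition H, or None if they do not cross."""
    (L1, R1), (L2, R2) = c1, c2
    # either block 2 starts inside block 1 and extends past it,
    # or block 2 ends inside block 1 and starts before it
    if (L2 < R1 and L2 >= L1 and R2 > R1) or (R2 > L1 and R2 <= R1 and L2 < L1):
        return (min(L1, L2), max(R1, R2))
    return None

print(crossing_interval((0, 3), (2, 5)))  # (0, 5): partial overlap crosses
print(crossing_interval((0, 3), (3, 5)))  # None: merely adjacent
```

Note that full containment (one block entirely inside the other) does not satisfy either clause, so it is not treated as a crossing.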
I. full-coverage block: indicates that within a crossing ambiguity interval <AmL, AmR> there exists a basic block that completely covers the interval, i.e. its left and right boundaries cl and cr satisfy: cl = AmL, cr = AmR;
J. combination ambiguity interval: if a word combination interval in the sentence can form one complete basic block in one context but several separate basic blocks in another, the interval is called a combination ambiguity interval; the concrete decision condition is: if the analysis confidence of the basic block over the word combination interval is less than InBelTh, a combination ambiguity interval is formed;
K. whole combined block: the basic block formed by combining the entire combination ambiguity interval;
L. inner combined blocks: the several basic blocks formed separately by the words inside a combination ambiguity interval;
The computer is also loaded with the following modules: a multi-word block recognition module, an ambiguity structure discovery and automatic disambiguation module, an automatic single-word block promotion module, and a linear block sequence generation module;
Simultaneously, the following parameters are set:
LowBelTh, the confidence threshold for words in the sentence to be combined into a block, set to 0.5;
InBelTh, the confidence threshold for a word combination in the sentence to form a combination ambiguity interval, set to 0.7;
HighBelTh, the confidence threshold for words in the sentence to combine into a reliable basic block, set to 0.85;
ERSum, the total number of extended-rule description strings for a given word combination in the sentence;
CBSum, the total number of basic blocks in the data structure PSF that cross a given basic block;
OASum, the total number of crossing ambiguity intervals found in the sentence;
CASum, the total number of combination ambiguity intervals found in the sentence;
and the following basic processing functions are used:
min, the minimum function: min(x, y) selects the smaller of x and y;
max, the maximum function: max(x, y) selects the larger of x and y;
(2) Input the Chinese sentence S to be analyzed into basic blocks, S = {<wi, ti>}, i ∈ [1, n];
(3) Initialize the related data structure PSF according to the following steps:
(3.1) initialize i=0;
(3.2) obtain the word wi and POS tag ti of the i-th word in the sentence, generate a new word edge record: ['W', i, i+1, ti, wi, 0], and add it to the PSF, where 0 is the confidence value;
(3.3) let i=i+1 and repeat step (3.2) until i=n, then stop;
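Steps (3.1)-(3.3) amount to one pass over the tagged sentence. An illustrative sketch (the function name is ours):

```python
def init_psf(sentence):
    """Step (3): build the initial PSF word edges from a tagged sentence,
    given as a list of (word, pos_tag) pairs.
    Each record follows ['W', i, i+1, t_i, w_i, 0]."""
    psf = []
    for i, (w, t) in enumerate(sentence):
        psf.append(['W', i, i + 1, t, w, 0])  # confidence of a word edge is 0
    return psf

psf = init_psf([('teacher', 'n'), ('give', 'v'), ('lesson', 'n')])
print(psf[1])  # ['W', 1, 2, 'v', 'give', 0]
```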
(4) Find and recognize multi-word blocks according to the following steps:
(4.1) initialize i=0;
(4.2) starting from the i-th word of the sentence, scan the whole sentence from left to right and form all candidate basic block intervals <i, j> of 2 to 6 words in length, j ∈ [i+2, i+6];
(4.3) take in turn a candidate basic block interval <i, j> formed in step (4.2) and look up in the rule base the best matching rule number BestRuleNo for the interval; the concrete steps are:
(4.3.1) obtain the POS tag string inside the interval; if this tag string does not appear in the basic rule table, return an invalid number and stop;
(4.3.2) judge whether corresponding extended rules exist; if not, go to (4.3.7);
(4.3.3) query the lexical knowledge base to obtain all extended-rule structural description strings for the interval, ERSum in total;
(4.3.4) sequentially check each extended-rule structural description string k=0, 1, 2, ..., ERSum-1; if a description string appears in the extended rule table, add the matching extended-rule table number to the found-rule list;
(4.3.5) if no matching extended rule can be found, go to (4.3.7);
(4.3.6) select from the found-rule list the extended rule with the highest confidence, return its number, and stop;
(4.3.7) if the confidence of the basic rule is less than LowBelTh, return an invalid number and stop; otherwise return the basic rule number and stop;
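Steps (4.3.1)-(4.3.7) can be sketched as a two-level table lookup. The dictionary shapes below are our simplification of the BasRules/ExpRules records (only the confidence and a has-extended flag are kept, and the rule "number" is replaced by the matched confidence):

```python
def find_best_rule(pos_string, exp_strings, bas_rules, exp_rules,
                   low_bel_th=0.5):
    """Pick the best-matching rule for one candidate interval.
    bas_rules maps a POS-tag string -> (theta, has_extended);
    exp_rules maps an extended description string -> theta.
    Returns ('exp' or 'bas', theta), or None for no usable rule."""
    if pos_string not in bas_rules:          # (4.3.1): unknown tag string
        return None
    theta, has_ext = bas_rules[pos_string]
    if has_ext:                              # (4.3.2)-(4.3.6)
        found = [exp_rules[s] for s in exp_strings if s in exp_rules]
        if found:
            return ('exp', max(found))       # highest-confidence extended rule
    if theta < low_bel_th:                   # (4.3.7): basic rule too weak
        return None
    return ('bas', theta)

bas = {'n+n': (0.6, True), 'v+n': (0.4, False)}
exp = {'n+n|teacher_lesson': 0.9}
print(find_best_rule('n+n', ['n+n|teacher_lesson'], bas, exp))  # ('exp', 0.9)
```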
(4.4) if BestRuleNo is empty, go to (4.9);
(4.5) according to the best matching rule number BestRuleNo, extract the following information from the corresponding rule table record: the basic block syntactic tag CCT, the relation tag CRT and the analysis confidence CB; generate a new multi-word block edge record: ['P', i, j, CCT, CRT, CB] and add it to the data structure PSF;
(4.6) perform dynamic disambiguation according to the relative combination strength in the local context, as follows:
(4.6.1) obtain the confidence value θT of the current basic block;
(4.6.2) obtain all other basic blocks in the data structure PSF that cross this block, CBSum in total, and initialize the crossing-edge array subscript control variable i=0;
(4.6.3) obtain the confidence value θi of the i-th crossing edge and judge:
(4.6.4) if θT-θi>0.2, delete the i-th crossing edge (by setting cflag='D' on that edge);
(4.6.5) let i=i+1 and repeat steps (4.6.3)-(4.6.5) until i=CBSum;
(4.6.6) if there is some crossing edge i with θi-θT>0.2, delete the current basic block edge and stop;
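The pairwise comparison in (4.6.3)-(4.6.6) keeps whichever of two crossing analyses is more than 0.2 more confident. A sketch (the function name and the returned tuple shape are ours):

```python
def prune_crossing(theta_t, crossing_thetas, margin=0.2):
    """Step (4.6): compare the new block's confidence theta_t with each
    crossing edge.  Returns (keep_current, surviving_crossing_thetas)."""
    # (4.6.4): delete crossing edges that are clearly weaker than the new block
    survivors = [th for th in crossing_thetas if not (theta_t - th > margin)]
    # (4.6.6): delete the new block if some crossing edge is clearly stronger
    keep_current = all(th - theta_t <= margin for th in crossing_thetas)
    return keep_current, survivors

print(prune_crossing(0.9, [0.6, 0.85]))  # (True, [0.85])
print(prune_crossing(0.5, [0.8]))        # (False, [0.8])
```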
(4.7) as long as j<min(n, i+6), repeat steps (4.3)-(4.6);
(4.8) if i<n, let i=i+1 and repeat steps (4.2)-(4.7); otherwise stop;
(5) Find multi-word block ambiguity structures, comprising the following steps in order:
(5.1) extract all multi-word block edges from the data structure PSF, PESum in total;
(5.2) automatically sort these multi-word blocks by their left and right boundary positions in ascending order to form a multi-word block information table;
(5.3) obtain the left and right boundary positions <L, R> of the 1st block in the table: L=cl1, R=cr1; set the boundary buffer of a possible crossing ambiguity interval: BufL=L, BufR=R; initialize the crossing ambiguity block information: AmL=AmR=0; initialize the multi-word block information table subscript control variable i=2;
(5.4) obtain the left and right boundary positions <L, R> of the i-th block in the table: L=cli, R=cri;
(5.5) if the word intervals of the two adjacent blocks do not cross, i.e. L>BufR, go to (5.6); otherwise adjust the boundary buffer of the possible crossing ambiguity interval: BufL=min(BufL, L), BufR=max(BufR, R), set the corresponding crossing ambiguity block information: AmL=BufL, AmR=BufR, and go to (5.10);
(5.6) obtain the analysis confidences θ(i-1) and θ(i) of the two adjacent multi-word block edges;
(5.7) if θ(i)<InBelTh, save a combination ambiguity interval <L, R>;
(5.8) if θ(i-1)<InBelTh, save a combination ambiguity interval <BufL, BufR>;
(5.9) if a crossing ambiguity interval has been found, i.e. AmL>0, save a crossing ambiguity interval <AmL, AmR> and reset the buffers: BufL=L, BufR=R, AmL=AmR=0;
(5.10) let i=i+1 and repeat steps (5.4)-(5.9) until i=PESum;
(5.11) return the totals and boundary information tables of the two classes of ambiguity intervals found, then stop;
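A simplified sketch of the scan in steps (5.3)-(5.11), assuming the blocks are already sorted as in (5.2); it merges overlapping neighbours into crossing ambiguity intervals and flags isolated low-confidence blocks as combination ambiguity intervals (the θ(i-1) check of step (5.8) is omitted for brevity):

```python
def find_ambiguities(chunks, in_bel_th=0.7):
    """Simplified step (5): chunks is a position-sorted list of
    (cl, cr, theta) multi-word block records."""
    crossing, combination = [], []
    buf_l, buf_r = chunks[0][0], chunks[0][1]
    overlapped = False
    for cl, cr, theta in chunks[1:]:
        if cl < buf_r:                      # (5.5): intervals cross -> grow buffer
            buf_l, buf_r = min(buf_l, cl), max(buf_r, cr)
            overlapped = True
        else:
            if overlapped:                  # (5.9): flush a crossing interval
                crossing.append((buf_l, buf_r))
            if theta < in_bel_th:           # (5.7): low-confidence block
                combination.append((cl, cr))
            buf_l, buf_r, overlapped = cl, cr, False
    if overlapped:
        crossing.append((buf_l, buf_r))
    return crossing, combination

print(find_ambiguities([(0, 3, 0.9), (2, 5, 0.8), (6, 8, 0.6)]))
# ([(0, 5)], [(6, 8)])
```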
(6) Automatically disambiguate multi-word blocks, as follows:
(6.1) obtain the totals OASum and CASum and the corresponding boundary information tables of the two classes of ambiguity intervals found in the sentence;
(6.2) perform crossing-interval disambiguation, as follows:
(6.2.1) initialize the crossing-interval boundary information table subscript control variable i=0;
(6.2.2) obtain the i-th crossing ambiguity interval <L, R> and obtain all crossing multi-word blocks inside the interval;
(6.2.3) if a full-coverage block exists in the ambiguity interval, or all the crossing blocks can form a chain relation structure, go to step (6.2.4); otherwise stop;
(6.2.4) obtain the confidence values θi of all crossing blocks in the interval and record their maximum Max_θ and minimum Min_θ;
(6.2.5) if the following two conditions hold simultaneously, form a uniformly distributed chain relation structure block and set CCT=np or vp:
all θi>=InBelTh, or all θi>LowBelTh and (Max_θ-Min_θ<0.1),
all crossing blocks are noun blocks or verb blocks;
(6.2.6) if the following two conditions hold simultaneously, form a new chain relation structure noun block and set CCT=np:
all crossing blocks are noun blocks np,
the whole interval forms a noun block whose distribution confidence is greater than InBelTh;
(6.2.7) if the following two conditions hold simultaneously, form a new chain relation structure noun block and set CCT=np:
the syntactic tags of all crossing blocks are np, sp or tp,
a full-coverage block exists and all θi>LowBelTh;
(6.2.8) generate a new chain relation structure multi-word block edge record: ['P', L, R, CCT, 'LN', Max_θ] and add it to the data structure PSF, where 'LN' denotes the chain relation structure;
(6.2.9) delete all crossing edges inside the interval;
(6.2.10) let i=i+1 and repeat steps (6.2.2)-(6.2.9) until i=OASum, then stop;
(6.3) perform combination-interval disambiguation, as follows:
(6.3.1) initialize the combination-interval boundary information table subscript control variable i=0;
(6.3.2) obtain the i-th combination ambiguity interval <L, R>;
(6.3.3) obtain the confidence value Comb_θ of the whole combined block over the interval;
(6.3.4) obtain the confidence values Seg_θi of the inner combined blocks in the interval and record their maximum Max_θ;
(6.3.5) if all Seg_θi>LowBelTh and some Seg_θi>HighBelTh, go to (6.3.9);
(6.3.6) if all Seg_θi<=LowBelTh, go to (6.3.10);
(6.3.7) if Comb_θ>HighBelTh, go to (6.3.10);
(6.3.8) if Comb_θ-Max_θ>0.1, go to (6.3.10);
(6.3.9) select the 'split' state, i.e. delete the whole combined block, and go to (6.3.11);
(6.3.10) select the 'merge' state, i.e. delete each inner combined block, and go to (6.3.11);
(6.3.11) let i=i+1 and repeat steps (6.3.2)-(6.3.10) until i=CASum, then stop;
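The decision chain (6.3.5)-(6.3.10) reduces to a confidence comparison between the whole combined block and its inner blocks. A sketch (the final fall-through to 'split' reflects our reading that control reaches step (6.3.9) when no earlier condition fires):

```python
def resolve_combination(comb_theta, seg_thetas,
                        low_bel_th=0.5, high_bel_th=0.85):
    """Steps (6.3.5)-(6.3.10): choose 'split' (keep the inner blocks) or
    'merge' (keep the whole combined block) for one combination interval."""
    if all(t > low_bel_th for t in seg_thetas) and \
       any(t > high_bel_th for t in seg_thetas):
        return 'split'                          # (6.3.5)
    if all(t <= low_bel_th for t in seg_thetas):
        return 'merge'                          # (6.3.6)
    if comb_theta > high_bel_th:
        return 'merge'                          # (6.3.7)
    if comb_theta - max(seg_thetas) > 0.1:
        return 'merge'                          # (6.3.8)
    return 'split'  # assumption: fall through to (6.3.9)

print(resolve_combination(0.9, [0.6, 0.6]))  # 'merge'
print(resolve_combination(0.6, [0.6, 0.9]))  # 'split'
```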
(7) Automatically promote single-word blocks, as follows:
(7.1) scan all words in the sentence from left to right; if a word is covered by some multi-word block, or belongs to the function-word set comprising conjunctions, auxiliary words, prepositions, modal particles, interjections, punctuation marks, etc., go directly to the next word; otherwise perform the following step:
(7.2) obtain the syntactic tag corresponding to the word, generate a new single-word block edge record, and add it to the data structure PSF;
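Step (7) can be sketched as a single filter pass. The function-word POS tag set below is our assumption (one possible mapping of the listed categories to common Chinese POS tags), as is the promoted block's confidence value:

```python
# assumption: c=conjunction, u=auxiliary, p=preposition,
# y=modal particle, e=interjection, w=punctuation
FUNCTION_POS = {'c', 'u', 'p', 'y', 'e', 'w'}

def promote_single_words(words, covered):
    """Step (7): words is a list of (word, pos_tag); covered is the set of
    positions already inside some multi-word block.  Returns new single-word
    block edge records ['B', i, i+1, tag, 'SG', theta]."""
    edges = []
    for i, (w, t) in enumerate(words):
        if i in covered or t in FUNCTION_POS:
            continue  # skip covered words and function words
        edges.append(['B', i, i + 1, t, 'SG', 1.0])  # assumption: theta = 1.0
    return edges

print(promote_single_words([('he', 'r'), ('and', 'c'), ('I', 'r')], covered=set()))
# [['B', 0, 1, 'r', 'SG', 1.0], ['B', 2, 3, 'r', 'SG', 1.0]]
```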
(8) Generate the linear block tag sequence, as follows:
(8.1) analyze the PSF array and obtain the position information table of the following two classes of word intervals covering the complete sentence:
ambiguous intervals, denoted AmbiList[],
unambiguous intervals, denoted NonAmbiList[],
(8.2) obtain the totals of the two classes of word intervals, denoted ALSum and NALSum respectively;
(8.3) initialize i=0;
(8.4) obtain the left and right boundary positions <L, R> of the i-th ambiguous interval;
(8.5) obtain the syntactic tag CCT of the first basic block in the interval, generate an ambiguity interval block record: ['P', L, R, CCT, 'AM', -1] and add it to the block tag sequence stack ChkStack; the relation tag is set to AM, denoting an ambiguous interval;
(8.6) let i=i+1 and repeat steps (8.4)-(8.5) until i=ALSum;
(8.7) for each unambiguous interval, extract in order the information of each basic block covered in it and add it to the block tag sequence stack ChkStack;
(8.8) sort the block information in the block tag sequence stack ChkStack to form a linear block tag sequence covering the whole sentence, then stop.
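Step (8) assembles ambiguity-interval records and unambiguous block records into one sequence sorted by position. An illustrative sketch (the function name and input shapes are ours):

```python
def build_chunk_sequence(ambi_list, chunks):
    """Step (8): ambi_list holds ambiguous intervals (L, R, cctag); chunks
    holds unambiguous block records [cflag, cl, cr, cctag, crtag].  Each
    ambiguous interval becomes one ['P', L, R, cctag, 'AM', -1] record and
    the whole stack is sorted by boundaries into a linear sequence."""
    stack = [['P', l, r, t, 'AM', -1] for (l, r, t) in ambi_list]   # (8.5)
    stack.extend(chunks)                                            # (8.7)
    stack.sort(key=lambda rec: (rec[1], rec[2]))                    # (8.8)
    return stack

seq = build_chunk_sequence([(2, 5, 'np')],
                           [['B', 0, 1, 'r', 'SG'], ['P', 5, 7, 'vp', 'ZX']])
print([rec[1:3] for rec in seq])  # [[0, 1], [2, 5], [5, 7]]
```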
CN2007100634897A 2007-02-02 2007-02-02 Rule-based automatic analysis method of Chinese basic block Expired - Fee Related CN101013421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007100634897A CN101013421B (en) 2007-02-02 2007-02-02 Rule-based automatic analysis method of Chinese basic block


Publications (2)

Publication Number Publication Date
CN101013421A true CN101013421A (en) 2007-08-08
CN101013421B CN101013421B (en) 2012-06-27

Family

ID=38700943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007100634897A Expired - Fee Related CN101013421B (en) 2007-02-02 2007-02-02 Rule-based automatic analysis method of Chinese basic block

Country Status (1)

Country Link
CN (1) CN101013421B (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101808210A (en) * 2009-02-18 2010-08-18 索尼公司 Messaging device, information processing method and program
CN101908041A (en) * 2010-05-06 2010-12-08 江苏省现代企业信息化应用支撑软件工程技术研发中心 Multi-agent system-based multi-word expression extraction system and method
CN101609672B (en) * 2009-07-21 2011-09-07 北京邮电大学 Speech recognition semantic confidence feature extraction method and device
CN102298635A (en) * 2011-09-13 2011-12-28 苏州大学 Method and system for fusing event information
CN102323920A (en) * 2011-06-24 2012-01-18 华南理工大学 Text message editing modification method
CN102436442A (en) * 2011-11-03 2012-05-02 中国科学技术信息研究所 Word semantic relativity measurement method based on context
CN102622339A (en) * 2012-02-24 2012-08-01 安徽博约信息科技有限责任公司 Intersection type pseudo ambiguity recognition method based on improved largest matching algorithm
CN102789466A (en) * 2011-05-19 2012-11-21 百度在线网络技术(北京)有限公司 Question title quality judgment method and device and question guiding method and device
CN102906735A (en) * 2010-05-21 2013-01-30 微软公司 Voice stream augmented note taking
CN103177089A (en) * 2013-03-08 2013-06-26 北京理工大学 Sentence meaning composition relationship lamination identification method based on central blocks
CN103324626A (en) * 2012-03-21 2013-09-25 北京百度网讯科技有限公司 Method for setting multi-granularity dictionary and segmenting words and device thereof
CN103493041A (en) * 2011-11-29 2014-01-01 Sk电信有限公司 Automatic sentence evaluation device using shallow parser to automatically evaluate sentence, and error detection apparatus and method for same
CN103902524A (en) * 2012-12-28 2014-07-02 新疆电力信息通信有限责任公司 Uygur language sentence boundary recognition method
CN103927339A (en) * 2014-03-27 2014-07-16 北大方正集团有限公司 System and method for reorganizing knowledge
CN103970731A (en) * 2014-05-19 2014-08-06 无锡国澳实业有限公司 Chinese semantic activity recognition method
WO2015043076A1 (en) * 2013-09-29 2015-04-02 北大方正集团有限公司 Knowledge extraction method and system
CN105528340A (en) * 2015-12-02 2016-04-27 北京信息科技大学 Method for verb hierarchical classification of multi-verb Chinese concept compound chunk
CN106202033A (en) * 2016-06-29 2016-12-07 齐鲁工业大学 A kind of adverbial word Word sense disambiguation method based on interdependent constraint and knowledge and device
CN106202036A (en) * 2016-06-29 2016-12-07 齐鲁工业大学 A kind of verb Word sense disambiguation method based on interdependent constraint and knowledge and device
CN106202034A (en) * 2016-06-29 2016-12-07 齐鲁工业大学 A kind of adjective word sense disambiguation method based on interdependent constraint and knowledge and device
CN106484677A (en) * 2016-09-30 2017-03-08 北京林业大学 A kind of Chinese fast word segmentation system and method based on minimal information amount
CN106844348A (en) * 2017-02-13 2017-06-13 哈尔滨工业大学 A kind of Chinese sentence functional component analysis method
CN107818078A (en) * 2017-07-20 2018-03-20 张宝华 The semantic association and matching process of Chinese natural language dialogue
CN107885844A (en) * 2017-11-10 2018-04-06 南京大学 Automatic question-answering method and system based on systematic searching
CN107885870A (en) * 2017-11-24 2018-04-06 北京神州泰岳软件股份有限公司 A kind of service profile formulas Extraction method and device
CN109145286A (en) * 2018-07-02 2019-01-04 昆明理工大学 Based on BiLSTM-CRF neural network model and merge the Noun Phrase Recognition Methods of Vietnamese language feature
CN110309507A (en) * 2019-05-30 2019-10-08 深圳壹账通智能科技有限公司 Testing material generation method, device, computer equipment and storage medium
CN110839112A (en) * 2019-11-18 2020-02-25 广东电网有限责任公司佛山供电局 Problem voice detection method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100515641B1 (en) * 2003-04-24 2005-09-22 우순조 Method for sentence structure analysis based on mobile configuration concept and method for natural language search using of it
CN100412869C (en) * 2006-04-13 2008-08-20 北大方正集团有限公司 Improved file similarity measure method based on file structure


Also Published As

Publication number Publication date
CN101013421B (en) 2012-06-27


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120627

Termination date: 20130202