CN101013421B - Rule-based automatic analysis method of Chinese basic block - Google Patents


Publication number
CN101013421B
CN101013421B
Authority
CN
China
Prior art keywords
word
interval
block
ambiguity
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2007100634897A
Other languages
Chinese (zh)
Other versions
CN101013421A (en)
Inventor
周强 (Zhou Qiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN2007100634897A priority Critical patent/CN101013421B/en
Publication of CN101013421A publication Critical patent/CN101013421A/en
Application granted granted Critical
Publication of CN101013421B publication Critical patent/CN101013421B/en

Landscapes

  • Machine Translation (AREA)

Abstract

This invention is a rule-based automatic analysis method for Chinese basic blocks, belonging to the field of natural language processing applications. Its features are: use automatically acquired basic block rules to build a data- and knowledge-driven basic block analyzer that efficiently recognizes the different basic blocks in real Chinese sentences; use lexical knowledge to expand block-combination instances with information at different levels, and obtain analysis results that are as reliable as possible through effective matching of the multi-level, multi-granularity automatically acquired rules against internal lexical association constraints and external context restrictions; disambiguate automatically using rule confidence, choosing the more reliable analysis wherever possible while retaining the complex ambiguities that are difficult to judge under present conditions for follow-up analyzers to use, thus enhancing the flexibility and effectiveness of the basic block analyzer.

Description

Rule-based automatic analysis method of Chinese basic block
Technical field
The invention belongs to the field of natural language processing technology.
Background technology
Chunk parsing is an important partial parsing technique. Through reasonable task decomposition it breaks down the full parsing problem and greatly reduces the difficulty of automatic analysis, and it has therefore played an important role in applied research such as information extraction, question answering and text mining in the natural language processing field.
For English, Abney (1991) first defined a chunk as an aggregate of adjacent words in a sentence centered on a content word, so that a sentence is split into a linear sequence of chunks. Ramshaw & Marcus (1995) proposed the 'BIO' model: by judging whether each word in the sentence is at the beginning position of a chunk (B), inside a chunk (I), or outside any chunk (O), chunking is converted into a sequence labeling problem, laying a good foundation for the application of various machine learning methods. Tjong & Buchholz (2000) used the Wall Street Journal portion of the Penn treebank to automatically convert parse trees into about 300,000 words of BIO-annotated text, which served as a unified training and test platform for English chunking. On this platform many researchers applied different machine learning methods, including automatic chunkers built on memory-based learning (MBL), support vector machines (SVM) and hidden Markov models (HMM), to recognize BIO sequences automatically. The main features used were the two adjacent words on either side of each word together with their part-of-speech information. Experimental results show that the best chunking systems reach an F-measure of about 93%, preliminarily proving the effectiveness of local part-of-speech distribution information for recognizing English chunk boundaries and syntactic tags. In recent years, similar results have also been obtained for Chinese. Since then, however, related research has gradually declined. The main reason is that current chunk definitions and recognition models rely mainly on syntactic distribution information and ignore the inherent semantic content of chunks, so research on chunk description and recognition technology has lost the inner driving force for further development.
In recent years the multiword expression (MWE) problem has gradually attracted the attention of theoretical and computational linguists. It mainly studies the problem of interpreting the special meanings of English multiword combinations that lie between words and phrases, but its basic ideas have been extended to the ubiquitous interface problem between the lexicon and syntax in different languages. Fillmore (2003) used the Construction Grammar method of combining form and meaning description to analyze common English MWEs in depth. Sag et al. (2002) comprehensively summarized the analytical challenges and available techniques for MWEs and proposed the basic idea that different MWEs need to be analyzed with different resources and different methods. By introducing the analysis and description of semantic content, these works have injected new vitality into the exploration of the chunking problem.
Inspired by MWE research, we believe the chunking problem can be defined from another angle: for an input sentence, first determine, through analysis of word cohesion and the surrounding context constraints, which word combinations can form multi-word blocks; then directly promote the remaining content words to single-word blocks, thus forming a complete block description sequence for the sentence consisting of multi-word blocks, single-word blocks and the remaining function words. Unlike the 'BIO' sequence labeling model, this approach emphasizes the analysis of lexical cohesion inside the different blocks, so it can more easily establish the inner links between block description content and the corresponding lexical semantic knowledge base. The key is to find an effective knowledge description system that combines surface block description instances with deep lexical semantic knowledge. In this respect we have carried out some preliminary studies:
● We proposed a basic block description system based on topological structure, which very naturally establishes the inner links between basic block description instances in real text and the lexical association knowledge base, forming the description basis for subsequent rule learning and evolution;
● Using a tool for automatic learning and expansion-evolution of Chinese basic block description rules, supported by a basic-block-annotated corpus and a lexical knowledge base, we start from part-of-speech tag string description rules and, by continually introducing more internal lexical association and external context restriction knowledge, gradually evolve a multi-level, multi-granularity basic block rule base whose rules have stronger descriptive power and higher processing confidence;
These studies have laid a good foundation for further research on the rule-based automatic analysis method of Chinese basic blocks.
Summary of the invention
The rule-based automatic analysis method of Chinese basic blocks is characterized in that it comprises the following steps in sequence:
(1) computer initialization is set:
A. The input sentence is S, S = {<w_i, t_i>}, where w_i is the i-th word in sentence S, t_i is the part-of-speech tag of the i-th word, i ∈ [1, n], and n is the total number of words in the sentence;
B. The packed shared forest PSF[] is represented with a chart data structure: after arranging the n words of the sentence from left to right, there are n+1 positions from the left side of the 1st word to the right side of the n-th word. Each position is defined as a chart node; any two nodes can form a chart edge, denoted (l, r), where l is the left node position of the edge, r is the right node position, and r > l. All edges together form a chart array providing complete description information for all words and for the basic blocks composed of words; words and basic blocks are collectively called syntactic constituents. A basic block is an aggregate of adjacent words in sentence S centered on a content word. PSF[] includes:
B1. <constituent flag cflag>, indicating the constituent category: W - word, B - single-word block, P - multi-word block, D - edge dynamically deleted during disambiguation;
B2. <constituent left boundary cl>, <constituent right boundary cr>, the left and right boundary positions of the constituent edge in sentence S, cl ∈ [0, n-1], cr ∈ [1, n];
B3. <syntactic tag cctag>, indicating the external syntactic function of the constituent:
For a word edge, its part-of-speech tag is kept. The tags are: n - noun, s - place word, t - time word, f - locality word, r - pronoun, vM - auxiliary verb, v - verb, a - adjective, d - adverb, m - numeral, q - measure word, p - preposition, u - particle, c - conjunction, y - modal particle, e - interjection, w - punctuation mark;
For a basic block edge, the syntactic tag obtained from the rule base is kept: np - noun block, vp - verb block, sp - space block, tp - time block, mp - quantity block, ap - adjective block, dp - adverb block;
B4. <relation tag crtag>, indicating the internal grammatical relation of the constituent:
For a word edge, its word form is kept;
For a basic block edge, the relation tag obtained from the rule base is kept: ZX - right-corner centered structure, LN - chain relation structure, LH - coordination, PO - predicate-object relation, SB - predicate-complement relation, AM - ambiguity interval, SG - single-word block, where:
Right-corner centered structure: all words in the basic block depend directly on the right-corner head word, forming a rightward centered dependency structure. The basic pattern is A_1 … A_n H, with dependencies A_1 → H, …, A_n → H, where H is the syntactic-semantic head of the whole basic block and A_1, …, A_n are modifiers;
Chain relation structure: each word in the basic block depends in turn on its immediately right-adjacent word, forming a multi-center dependency chain arranged from left to right. The basic pattern is H_0 H_1 … H_n, with dependencies H_0 → H_1, …, H_{n-1} → H_n; each H_i, i ∈ [1, n-1], is a semantic aggregation point at a different level and H_n is the syntactic-semantic head of the whole basic block;
Coordination: the words in the basic block form a coordinate structure, e.g. 'teachers (and) classmates';
Predicate-object relation: two words in the basic block form a verb-object structure, e.g. 'have a meal';
Predicate-complement relation: two words in the basic block form a verb-complement structure, e.g. 'go down';
Ambiguity interval: some words can form several different structural combinations that are difficult to disambiguate automatically with the current basic block rule base and lexical knowledge base contents; the multiple structural combinations are kept for a follow-up system to select from;
B5. <constituent confidence θ>, the processing confidence of the constituent, θ ∈ [0, 1];
B6. <word edge>: the i-th word in the sentence is characterized by cflag = W, cl = i-1, cr = i, cctag = t_i, crtag = w_i, θ = 0;
B7. <single-word block edge>, a basic block composed of one word, characterized by cflag = B, cr - cl = 1, crtag = SG;
B8. <multi-word block edge>, a basic block composed of two or more words, characterized by cflag = P, cr - cl >= 2;
C. The block tag sequence stack ChkStack[] keeps the unambiguous basic blocks extracted from the PSF and the possibly ambiguous block analysis intervals, forming the linear block tag sequence of the input sentence. Its main record format is: [cflag, cl, cr, cctag, crtag, corresponding PSF edge number PSFeno];
D. The base rule table BasRules[] keeps all part-of-speech tag string description rules. Its main record format is [r_stru, r_tag, fp, fn, θ, e_sp, e_ep], where:
r_stru is the structural combination of the rule,
r_tag is the reduction tag, comprising a syntactic tag and a relation tag,
fp is the positive example frequency,
fn is the negative example frequency,
θ is the rule confidence, computed as θ = fp / (fp + fn),
e_sp is the start position of the corresponding expanded rules in the expanded rule table,
e_ep is the end position of the corresponding expanded rules in the expanded rule table,
If the base rule has no corresponding expanded rule, e_sp and e_ep are -1;
E. The expanded rule table ExpRules[] keeps all description rules containing lexical constraints and context restriction conditions obtained through expansion-evolution learning. Its main record format is [r_stru, r_tag, fp, fn, θ], defined as above;
F. The basic block rule base keeps the basic block description rules at the different levels needed for basic block recognition, obtained through automatic rule learning and evolution. Its basic format is {<structural combination> → <reduction tag> <confidence>}, where:
The structural combination describes the internal construction of each basic block and is divided into two levels according to descriptive power:
a) base rules, whose structural combination is described as a part-of-speech tag string,
b) expanded rules, which form stronger structural combination descriptions through lexical constraints and context restrictions; the reduction tag and confidence are defined as above;
G. The lexical knowledge base keeps the various lexical description knowledge that may be used during analysis, obtained from external knowledge sources, including:
G1. the lexical association knowledge base, containing description pairs for the syntactic relations formed between common Chinese content words, with basic data format {<word 1> <word 2> <POS 1> <POS 2> <syntactic relation tag>};
G2. the characteristic verb lists, containing vocabulary information extracted from a syntactic information dictionary about verbs that can take different types of objects, with basic data format {<verb entry>}, organized into different verb lists by object type;
G3. the noun semantic information table, containing 11 semantic category labels for common Chinese nouns: organization, person, artifact, natural object, information, mind, event, attribute, quantity, time and space, with basic data format {<noun entry> <semantic type tag>};
H. Crossing ambiguity interval: if the left and right boundaries of two basic blocks (<L1, R1> and <L2, R2>) satisfy (L2 < R1) & (L2 >= L1) & (R2 > R1), or (R2 > L1) & (R2 <= R1) & (L2 < L1), then they form a crossing ambiguity interval <AmL, AmR>, with AmL = min(L1, L2) and AmR = max(R1, R2);
I. Full-coverage block: a basic block in a crossing ambiguity interval <AmL, AmR> that completely covers the interval, i.e. its left and right boundaries cl and cr satisfy cl = AmL, cr = AmR;
J. Combination ambiguity interval: if a word span in the sentence can form one complete basic block in one context but several separate basic blocks in another context, the span is called a combination ambiguity interval. The concrete decision condition is: if the confidence of the basic block analysis over the span is less than InBelTh, the span forms a combination ambiguity interval;
K. Whole combined block: the basic block formed by combining the entire combination ambiguity interval;
L. Inner component blocks: the several basic blocks formed separately by the words inside a combination ambiguity interval;
The following modules are also loaded into the computer: a multi-word block recognition module, an ambiguity structure discovery and automatic disambiguation module, an automatic single-word block promotion module, and a linear block sequence generation module;
At the same time, the following parameters are set:
LowBelTh, the confidence threshold for words in the sentence to combine into a block, set to 0.5;
InBelTh, the confidence threshold below which a word combination in the sentence forms a combination ambiguity interval, set to 0.7;
HighBelTh, the confidence threshold for words in the sentence to combine into a reliable basic block, set to 0.85;
ERSum, the total number of expanded rule description strings for a given word combination in the sentence;
CBSum, the total number of basic blocks in the data structure PSF that cross a given basic block;
OASum, the total number of crossing ambiguity intervals found in the sentence;
CASum, the total number of combination ambiguity intervals found in the sentence;
And the following basic functions are used:
min, the minimum function; min(x, y) selects the smaller of x and y;
max, the maximum function; max(x, y) selects the larger of x and y;
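The PSF edge record of step (1)B and the crossing-ambiguity test of definition H can be sketched in Python. This is an illustrative reading of the patent's definitions; the class and function names are ours, not the patent's.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    """One chart edge in the packed shared forest (PSF), per step (1)B."""
    cflag: str    # 'W' word, 'B' single-word block, 'P' multi-word block, 'D' deleted
    cl: int       # left boundary position, in [0, n-1]
    cr: int       # right boundary position, in [1, n]
    cctag: str    # POS tag (word edge) or block syntactic tag (np, vp, ...)
    crtag: str    # word form (word edge) or internal relation tag (ZX, LN, ...)
    theta: float  # processing confidence in [0, 1]

def crossing_interval(e1, e2):
    """Return the crossing-ambiguity interval <AmL, AmR> if the two block
    edges overlap without one containing the other (definition H), else None."""
    crossing = (e2.cl < e1.cr and e2.cl >= e1.cl and e2.cr > e1.cr) or \
               (e2.cr > e1.cl and e2.cr <= e1.cr and e2.cl < e1.cl)
    if crossing:
        return (min(e1.cl, e2.cl), max(e1.cr, e2.cr))
    return None
```

Note that nested edges (one span inside the other) correctly return None: only genuinely crossing spans form an ambiguity interval.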
(2) Input the Chinese sentence S to be analyzed for basic blocks, S = {<w_i, t_i>}, i ∈ [1, n];
(3) Initialize the related data structure PSF as follows:
(3.1) initialize i = 0;
(3.2) obtain the word form w_i and part-of-speech tag t_i of the i-th word in the sentence and generate a new word edge record [‘W’, i, i+1, t_i, w_i, 0], where the 0 is the confidence value, and add it to the PSF;
(3.3) let i = i + 1 and repeat step (3.2) until i = n, then stop;
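Step (3) is a single pass over the words. A minimal sketch, assuming the sentence is given as a list of (word, POS-tag) pairs and edges are kept as plain list records:

```python
def init_psf(sentence):
    """Step (3): build one word edge ['W', i, i+1, t_i, w_i, 0] per word.
    `sentence` is a list of (word, pos_tag) pairs."""
    psf = []
    for i, (w, t) in enumerate(sentence):
        # word edge: flag 'W', span <i, i+1>, POS tag, word form, confidence 0
        psf.append(['W', i, i + 1, t, w, 0.0])
    return psf
```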
(4) Find and recognize multi-word blocks as follows:
(4.1) initialize i = 0;
(4.2) starting from the i-th word, scan the whole sentence from left to right and form all candidate basic block intervals <i, j> of length 2 to 6 words, j ∈ [i+2, i+6];
(4.3) take each candidate basic block interval <i, j> formed in step (4.2) in turn and search the rule base for the best matching rule number BestRuleNo for this interval, in the following concrete steps:
(4.3.1) obtain the part-of-speech tag string inside the interval; if this tag string does not occur in the base rule table, return an invalid number and stop;
(4.3.2) judge whether corresponding expanded rules exist; if not, go to (4.3.7);
(4.3.3) query the lexical knowledge base to obtain all expanded rule structural combination description strings for this interval, ERSum in total;
(4.3.4) search all k = 0, 1, 2, …, ERSum-1 expanded rule structural combination description strings in order; if a description string occurs in the expanded rule table, add the matching expanded rule table number to the found-rule list;
(4.3.5) if no matching expanded rule is found, go to (4.3.7);
(4.3.6) select the expanded rule with the highest confidence from the found-rule list, return its rule number, and stop;
(4.3.7) if the confidence of the base rule is < LowBelTh, return an invalid number and stop; otherwise return the base rule number and stop;
(4.4) if BestRuleNo is empty, go to (4.7);
(4.5) according to the best matching rule number BestRuleNo, extract the following information from the corresponding rule table record: basic block syntactic tag CCT, relation tag CRT and analysis confidence CB; generate a new multi-word block edge record [‘P’, i, j, CCT, CRT, CB] and add it to the data structure PSF;
(4.6) perform dynamic disambiguation by relative combination strength in the local context, as follows:
(4.6.1) obtain the confidence value θ_T of the current basic block;
(4.6.2) obtain all other basic blocks in the data structure PSF that cross this basic block, CBSum in total, and initialize the crossing edge array subscript control variable i = 0;
(4.6.3) obtain the confidence value θ_i of the i-th crossing edge and judge:
(4.6.4) if θ_T - θ_i > 0.2, delete the i-th crossing edge (set the cflag of this crossing edge to ‘D’);
(4.6.5) let i = i + 1 and repeat steps (4.6.3)-(4.6.5) until i = CBSum;
(4.6.6) if there is some crossing edge i with θ_i - θ_T > 0.2, delete the current basic block edge and stop;
(4.7) as long as j < min(n, i+6), repeat steps (4.3)-(4.6);
(4.8) if i < n, let i = i + 1 and repeat steps (4.2)-(4.7); otherwise stop;
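The rule lookup of step (4.3) might be sketched as follows. The table layouts here — dicts keyed by description string — are simplifications of the BasRules/ExpRules record tables, and all names are illustrative:

```python
LOW_BEL_TH = 0.5  # threshold below which a bare base rule is rejected, per step (1)

def best_rule(pos_string, bas_rules, exp_rules, expansions):
    """Step (4.3): pick the best-matching rule for one candidate span.
    bas_rules: maps a POS-tag string to (confidence, has_expansions);
    exp_rules: maps an expanded description string to its confidence;
    expansions: the lexically expanded description strings of the span,
    as produced from the lexical knowledge base.
    Returns (confidence, description string) or None."""
    if pos_string not in bas_rules:          # (4.3.1)
        return None
    theta, has_exp = bas_rules[pos_string]
    if has_exp:                              # (4.3.2)-(4.3.6)
        matched = [(exp_rules[d], d) for d in expansions if d in exp_rules]
        if matched:
            return max(matched)              # highest-confidence expanded rule
    if theta < LOW_BEL_TH:                   # (4.3.7)
        return None
    return (theta, pos_string)
```

A base rule below LowBelTh is only usable when one of its expanded rules matches; otherwise the candidate span is rejected.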
(5) Find multi-word block ambiguity structures, in the following steps:
(5.1) extract all multi-word block edges from the data structure PSF, PESum in total;
(5.2) sort them automatically by their left and right boundary positions in ascending order, forming a multi-word block information table;
(5.3) obtain the left and right boundary positions <L, R> of the 1st block in the table: L = cl_1, R = cr_1; set the left and right boundary buffer of a possible crossing ambiguity interval: BufL = L, BufR = R; initialize the crossing ambiguity block information AmL = AmR = 0 and the multi-word block information table subscript control variable i = 2;
(5.4) obtain the left and right boundary positions <L, R> of the i-th block in the table: L = cl_i, R = cr_i;
(5.5) if the word intervals of the two adjacent blocks do not cross, i.e. L > BufR, go to (5.6); otherwise adjust the boundary buffer of the possible crossing ambiguity interval: BufL = min(BufL, L), BufR = max(BufR, R), set the corresponding crossing ambiguity block information AmL = BufL, AmR = BufR, and go to (5.10);
(5.6) obtain the analysis confidences θ_{i-1} and θ_i of the two adjacent multi-word block edges;
(5.7) if θ_i < InBelTh, save a combination ambiguity interval <L, R>;
(5.8) if θ_{i-1} < InBelTh, save a combination ambiguity interval <BufL, BufR>;
(5.9) if a crossing ambiguity interval has been found, i.e. AmL > 0, save a crossing ambiguity interval <AmL, AmR> and reset the buffers: BufL = L, BufR = R, AmL = AmR = 0;
(5.10) let i = i + 1 and repeat steps (5.4)-(5.9) until i = PESum;
(5.11) return the total numbers and boundary information tables of the two types of ambiguity intervals found, then stop;
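The buffered scan of step (5) can be illustrated in simplified form. This sketch detects only the crossing-ambiguity intervals from the position-sorted block list and treats any overlap between successive blocks as crossing; the combination-ambiguity checks of (5.7)-(5.8) are omitted for brevity:

```python
def find_crossing_intervals(chunks):
    """Simplified sketch of steps (5.3)-(5.11).
    chunks: (cl, cr, theta) triples sorted by position.
    Returns the list of crossing ambiguity intervals (AmL, AmR)."""
    crossing = []
    buf_l, buf_r = chunks[0][0], chunks[0][1]   # (5.3): seed the buffer
    am_open = False
    for cl, cr, theta in chunks[1:]:
        if cl < buf_r:                           # (5.5): spans overlap
            buf_l, buf_r = min(buf_l, cl), max(buf_r, cr)
            am_open = True
        else:                                    # (5.9): close any open interval
            if am_open:
                crossing.append((buf_l, buf_r))
                am_open = False
            buf_l, buf_r = cl, cr                # restart the buffer
    if am_open:
        crossing.append((buf_l, buf_r))
    return crossing
```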
(6) Perform automatic multi-word block disambiguation, as follows:
(6.1) obtain the totals OASum and CASum of the two types of ambiguity intervals found in the sentence and the corresponding boundary information tables;
(6.2) perform disambiguation of crossing ambiguity intervals, as follows:
(6.2.1) initialize the crossing interval boundary information table subscript control variable i = 0;
(6.2.2) obtain the i-th crossing ambiguity interval <L, R> and all crossing multi-word blocks inside the interval;
(6.2.3) if a full-coverage block exists in the ambiguity interval, or all crossing blocks can form a chain relation structure, go to step (6.2.4); otherwise stop;
(6.2.4) obtain the confidence values θ_i of all crossing blocks in the interval and record their maximum Max_θ and minimum Min_θ;
(6.2.5) if the following two conditions are satisfied simultaneously, form an evenly distributed chain relation structure block and let CCT = np or vp:
all θ_i >= InBelTh, or all θ_i > LowBelTh and (Max_θ - Min_θ < 0.1);
all crossing blocks are noun blocks or verb blocks;
(6.2.6) if the following two conditions are satisfied simultaneously, form a new chain relation structure noun block and let CCT = np:
all crossing blocks are noun blocks np;
the whole interval forms one noun block whose distribution confidence is greater than InBelTh;
(6.2.7) if the following two conditions are satisfied simultaneously, form a new chain relation structure noun block and let CCT = np:
the syntactic tags of all crossing blocks are np, sp or tp;
a full-coverage block exists, and all θ_i > LowBelTh;
(6.2.8) generate a new chain relation structure multi-word block edge record [‘P’, L, R, CCT, ‘LN’, Max_θ], where ‘LN’ denotes the chain relation structure, and add it to the data structure PSF;
(6.2.9) delete all crossing edges inside the interval;
(6.2.10) let i = i + 1 and repeat steps (6.2.2)-(6.2.9) until i = OASum, then stop;
(6.3) perform disambiguation of combination ambiguity intervals, as follows:
(6.3.1) initialize the combination interval boundary information table subscript control variable i = 0;
(6.3.2) obtain the i-th combination ambiguity interval <L, R>;
(6.3.3) obtain the confidence value Comb_θ of the whole combined block of this interval;
(6.3.4) obtain the confidence values Seg_θ_i of the inner component blocks of this interval and record their maximum Max_θ;
(6.3.5) if all Seg_θ_i > LowBelTh and some Seg_θ_i > HighBelTh, go to (6.3.9);
(6.3.6) if all Seg_θ_i <= LowBelTh, go to (6.3.10);
(6.3.7) if Comb_θ > HighBelTh, go to (6.3.10);
(6.3.8) if Comb_θ - Max_θ > 0.1, go to (6.3.10);
(6.3.9) select the ‘split’ state, i.e. delete the whole combined block, and go to (6.3.11);
(6.3.10) select the ‘merge’ state, i.e. delete the inner component blocks, and go to (6.3.11);
(6.3.11) let i = i + 1 and repeat steps (6.3.2)-(6.3.10) until i = CASum, then stop;
(7) Promote single-word blocks automatically, as follows:
(7.1) scan all the words in the sentence from left to right; if a word is covered by a multi-word block, or belongs to the function word set comprising conjunctions, particles, prepositions, modal particles, interjections and punctuation marks, go directly to the next word; otherwise carry out the following step:
(7.2) obtain the syntactic tag corresponding to this word and generate a new single-word block edge record to add to the data structure PSF;
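A sketch of step (7), assuming word edges are kept as list records and the positions already covered by multi-word blocks have been collected into a set. The function-word tag set follows the POS tags of step (1)B3; deriving the block syntactic tag from the POS tag (e.g. n → np) is left out as an assumption:

```python
FUNCTION_TAGS = {'c', 'u', 'p', 'y', 'e', 'w'}  # conjunction, particle, preposition,
                                                # modal particle, interjection, punctuation

def promote_single_words(word_edges, covered_positions):
    """Step (7): promote every content word not covered by a multi-word
    block to a single-word block edge (cflag 'B', relation tag 'SG').
    word_edges: [cflag, cl, cr, pos_tag, word, theta] records;
    covered_positions: left boundaries of words inside multi-word blocks."""
    new_edges = []
    for _, cl, cr, tag, word, theta in word_edges:
        if cl in covered_positions or tag in FUNCTION_TAGS:
            continue
        # the block syntactic tag would be derived from the POS tag here;
        # this sketch simply keeps the POS tag
        new_edges.append(['B', cl, cr, tag, 'SG', theta])
    return new_edges
```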
(8) Generate the linear block tag sequence, as follows:
(8.1) analyze the PSF array to obtain the position information tables of the following two types of word intervals covering the complete sentence:
ambiguous intervals, denoted AmbiList[],
unambiguous intervals, denoted NonAmbiList[];
(8.2) obtain the totals of the two types of intervals, denoted ALSum and NALSum;
(8.3) initialize i = 0;
(8.4) obtain the left and right boundary positions <L, R> of the i-th ambiguous interval;
(8.5) obtain the syntactic tag CCT of the first basic block in this interval and generate an ambiguity interval block [‘P’, L, R, CCT, ‘AM’, -1] to add to the block tag sequence stack ChkStack, where the relation tag is set to AM to indicate an ambiguity interval;
(8.6) let i = i + 1 and repeat steps (8.4)-(8.5) until i = ALSum;
(8.7) for each unambiguous interval, extract each covered basic block in order and add it to the block tag sequence stack ChkStack;
(8.8) sort the block information in the block tag sequence stack ChkStack to form the linear block tag sequence covering the whole sentence, then stop.
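Step (8) can be sketched as follows; the record fields follow the ChkStack format minus the PSF edge number, and the input shapes are our assumption:

```python
def linear_sequence(ambig_intervals, plain_chunks):
    """Step (8): emit one 'AM' record per ambiguity interval (8.4)-(8.5),
    append the unambiguous block records (8.7), and sort by position (8.8).
    ambig_intervals: (L, R, cctag) triples; records: [cflag, cl, cr, cctag, crtag]."""
    seq = [['P', l, r, cctag, 'AM'] for (l, r, cctag) in ambig_intervals]
    seq.extend(plain_chunks)
    seq.sort(key=lambda e: (e[1], e[2]))  # order by left, then right boundary
    return seq
```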
To accurately test the processing performance of the currently developed Chinese basic block analyzer, we selected all the news texts, about 200,000 words in total, from the annotated corpus of the Chinese syntax treebank TCT. The corpus was split into two parts: 80% as training data, mainly used for rule learning and evolution, and 20% as test data, mainly used for the performance evaluation of the basic block analyzer. Table 1 lists the basic statistics of these experimental corpora.
Table 1. Basic statistics of the experimental corpora

               Files   Sentences    Words   Chinese characters   Mean sentence length
Training set     148        6676   170829               268151                   25.6
Test set          37        1461    36543                57655                   25.0
Total            185        8137   207372               325806                   25.5
Through rule learning and expansion-evolution processing on the training corpus, we obtained the following multi-level, multi-granularity basic block rule base:
● at the base rule level, 211 part-of-speech tag description rules;
● at the expanded rule level, 4972 expanded rules that introduce more lexical constraint and context restriction descriptions. In addition, matching the concrete application of the expanded rules above, we also used the following lexical knowledge bases:
1) Lexical association knowledge base: at present mainly the verb-object relation base, containing verb-object relation description pairs formed by common Chinese verbs and the nouns or verbs following them. Basic scale: 5346 verb entries and 52390 lexical association pairs; on average each verb entry has about 10 verb-object relation description pairs.
2) Characteristic verb lists: vocabulary information about verbs that can take different types of objects, extracted from the Peking University syntactic information dictionary. Basic scale: 4888 verbs taking noun objects, 781 taking place objects, 48 taking time objects, 278 ditransitive verbs, 403 pivotal (jianyu) verbs, 732 verbs taking verbal objects, 122 verbs taking adjective objects, and 698 verbs taking sentential objects;
3) Noun semantic information table: the 11 semantic category labels of common Chinese nouns, comprising organization, person, artifact, natural object, information, mind, event, attribute, quantity, time and space. Basic scale: 26905 noun entries.
Considering the concrete processing conditions of the present analyzer, we first divide the analysis results into three major types according to whether they contain ambiguity: 1) unambiguous intervals; 2) combination ambiguity intervals; 3) crossing ambiguity intervals. For the current processing corpus, under open test conditions, the proportions of the total processed words covered by the three interval types are 0.955, 0.026 and 0.020 respectively. This shows that for most of the corpus the present block analyzer can complete the analysis and disambiguation work well.
In two ambiguity intervals, we are through the following index analysis ambiguity covering power of correct analysis result as a result:
1) the correct result rate of recalling is described the ratio that comprises correct result in all ambiguity analysis results, and computing formula is: the correct result that the correct result sum/the ambiguity interval the relates to sum * 100% that comprises among the ambiguity result;
2) ambiguity distributive law is described the ambiguity that possibly form to a correct result and is analyzed average number, and computing formula is: the ambiguity correct result that sum/the ambiguity interval relates to as a result sum;
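As a minimal sketch, the two evaluation indices can be computed as follows (the function names and the counting of the underlying results are our own illustrative assumptions, not part of the patented method):

```cpp
#include <cassert>

// Correct-result recall (%): how many of the correct results are
// preserved somewhere among the ambiguous analyses.
double correct_recall(int correct_in_ambi, int correct_total) {
    return 100.0 * correct_in_ambi / correct_total;
}

// Ambiguity distribution rate: average number of ambiguous analyses
// produced per correct result in the ambiguity intervals.
double ambi_distribution(int ambi_results_total, int correct_total) {
    return static_cast<double>(ambi_results_total) / correct_total;
}
```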
Table 2. Analysis data for ambiguity intervals

              Combination ambiguity            Crossing ambiguity
              Recall      Distribution rate    Recall      Distribution rate
Closed test   96.25       1.88                 76.60       2.72
Open test     97.75       1.75                 67.79       2.58
Table 2 shows the current results. The remaining combination-ambiguity results preserve most of the correct analyses; splitting or merging them must be decided with reference to wider contextual information. Complex crossing ambiguity remains the main difficulty at present and calls for more effective lexical-semantic information descriptions.
In ambiguity-free intervals, basic-block recognition is measured by the following indices:
1) Basic-block precision (P); formula: (number of correctly analyzed basic blocks / number of automatically identified basic blocks) × 100%;
2) Basic-block recall (R); formula: (number of correctly analyzed basic blocks / total number of correct basic blocks) × 100%;
3) F-Measure, the harmonic mean of precision and recall; formula: 2*P*R/(P+R);
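The three recognition indices above can be sketched as follows (illustrative helper functions, not part of the patented method):

```cpp
#include <cassert>

// Precision: correctly analyzed blocks / automatically identified blocks.
double precision(int correct, int identified) {
    return 100.0 * correct / identified;
}

// Recall: correctly analyzed blocks / blocks in the gold annotation.
double recall(int correct, int gold) {
    return 100.0 * correct / gold;
}

// F-Measure: harmonic mean of precision and recall.
double f_measure(double p, double r) {
    return 2.0 * p * r / (p + r);
}
```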
Different correctness criteria are set for different block types:
● For multiword blocks, two levels are considered: 1) identical block boundary, syntactic tag and relation tag (B+C+R); 2) identical block boundary and syntactic tag, with the relation tag allowed to differ (B+C).
● For single-word blocks, only the identity of the block boundary and syntactic tag is judged.
Tables 3 and 4 show the current results. The multiword quantity block (mp), time block (tp) and adjective block (ap) all reach very high F-M values in both closed and open tests, with little difference between the two settings, indicating that the automatically acquired rules for these three block types have reached good descriptive power and basically cover their various distributions. The multiword verb block (vp), noun block (np) and space block (sp) still leave considerable room for improvement; vp and np occupy the largest proportion of real text, so their accurate recognition is the focus of our research. In both open and closed tests, the F-M of the vp block exceeds that of the np block by 3-4 percentage points, and the gap is even more pronounced when the relation tag is taken into account; this fully demonstrates the important role of lexical association information in improving the boundary recognition and internal-relation analysis of basic blocks. The open-test F-M of the vp and np blocks is generally 2-3 percentage points lower than in the closed test, showing that the current rule descriptions for them are still insufficient and that many new distributions appearing in the test data may not be covered by the training corpus.
Table 3. Closed-test experimental results

       Multiword block 1: B+C+R       Multiword block 2: B+C        Single-word block: B+C
Tag    P        R        F-M          P        R        F-M         P        R        F-M
np     78.39%   79.22%   78.80%       85.47%   86.38%   85.92%      93.74%   90.33%   92.00%
vp     86.59%   83.94%   85.24%       91.61%   88.80%   90.18%      90.83%   94.74%   92.74%
mp     96.61%   96.88%   96.75%       96.71%   96.98%   96.85%      63.13%   84.57%   72.30%
ap     93.50%   93.83%   93.66%       94.28%   94.62%   94.45%      93.11%   92.74%   92.92%
tp     93.08%   92.03%   92.55%       93.30%   92.24%   92.77%      88.29%   91.11%   89.68%
sp     81.93%   84.79%   83.33%       82.77%   85.66%   84.19%      79.76%   94.71%   86.59%
Table 4. Open-test experimental results

       Multiword block 1: B+C+R       Multiword block 2: B+C        Single-word block: B+C
Tag    P        R        F-M          P        R        F-M         P        R        F-M
np     75.25%   75.76%   75.50%       83.68%   84.25%   83.97%      91.74%   88.28%   89.97%
vp     83.23%   81.46%   82.34%       87.35%   85.49%   86.41%      90.65%   93.69%   92.15%
mp     94.89%   95.26%   95.08%       94.89%   95.26%   95.08%      54.55%   83.33%   65.93%
ap     93.99%   97.33%   95.63%       93.99%   97.33%   95.63%      94.42%   94.83%   94.62%
tp     92.75%   88.18%   90.40%       93.52%   88.92%   91.16%      83.78%   91.63%   87.53%
sp     78.76%   86.41%   82.41%       79.65%   87.38%   83.33%      81.25%   92.86%   86.67%
Description of drawings
Fig. 1. Overall control flow of the Chinese basic-block automatic analyzer
Fig. 2. Processing flow of the multiword block recognition module
Fig. 3. Rule-matching processing flow
Fig. 4. Processing flow of the local-context dynamic disambiguation module
Fig. 5. Processing flow of the ambiguity interval discovery module
Fig. 6. Disambiguation flow for combination-ambiguity intervals
Embodiment
The design goal of the basic-block analyzer is, with the support of the basic-block rule base and the lexical-semantic knowledge base, to automatically analyze real Chinese sentences that have been word-segmented and POS-tagged, identify the boundary position of each basic block, determine its syntactic tag, relation tag and analysis confidence, and obtain the basic-block annotation result of the sentence. A concrete analysis example follows:
The input sentence: we/r should/vM notes/v selects/v some/m the young and the middle aged/scientist n/n participation/v like this/r /the u world/n meeting/n; / w cultivation/v one/m props up/q understands/v science/n ,/w understands/v diplomacy/n /u "/w national team/n "/w ,/w actively/a carries out/the v people/n diplomacy/n./w
Analysis result: [np-SG we/r] [vp-SG should/vM] [vp-SG notes/v] [vp-SG selects/v] [np.ZX some/m the young and the middle aged/scientist n/n] [vp-SG participation/v] [vp-SG like this/r] /u [the np-ZX world/n meeting/n]; / w [vp-SG cultivation/v] [mp-ZX one/m props up/q] [vp-PO understands/v science/n] ,/w [vp-PO understands/v diplomacy/n] /u "/w [rp-SG national team/n] "/w ,/w [dp-SG actively/a] [vp-SG carries out/v] [np-ZX people/n diplomacy/n]./w
The analysis resources currently used mainly comprise the following two parts:
1) Basic-block rule base: stores the multi-level basic-block description rules needed for basic-block recognition, obtained through automatic rule learning and evolution. Basic format: {<structural combination> → <reduction tag> <confidence>}, where:
● the structural combination describes the internal composition of each basic block and is divided into two levels according to descriptive power:
a) base rules, whose structural combination is a POS-tag string; b) extension rules, which add lexical constraints and context restrictions to form structural combinations with stronger descriptive power.
● the reduction tag, mainly comprising a syntactic tag and a relation tag, describes the basic syntactic information of the block.
● the confidence θ gives the expected reliability of a basic block obtained by applying the rule.
2) Lexical knowledge base: stores the various lexical description knowledge that may be used during analysis, obtained from external knowledge sources, mainly comprising:
● lexical association knowledge base: syntactic-relation pairs formed between common Chinese content words. Basic record format: {<word 1> <word 2> <POS 1> <POS 2> <syntactic relation tag>};
● feature verb lists: verb information extracted from the grammatical information dictionary, describing the object types each verb can take. Basic record format: {<verb entry>}, organized into different verb lists by object type;
● noun semantic information table: the 11 semantic categories of common Chinese nouns, namely organization, person, artifact, natural object, information, mental, event, attribute, quantity, time and space. Basic record format: {<noun entry> <semantic type tag>}.
To adapt to different application requirements, the following two data structures are designed to store the basic-block analysis results:
1) Compressed parse-shared forest (PSF): the typical data structure used in chart parsing. The basic design idea: the n words of a sentence are arranged from left to right; from the left side of the 1st word to the right side of the n-th word there are n+1 positions. Each position is defined as a chart node, and any two nodes form a chart edge, denoted (l, r), where l is the left node position of the edge, r is the right node position, and r > l. All edges together form a chart array that gives complete description information for all words and for the compound constituents (basic blocks, phrases) formed from them. Each edge record contains the following information: <constituent flag> <constituent left boundary> <constituent right boundary> <syntactic tag> <relation tag> <constituent analysis confidence>, where:
● <constituent flag> indicates the constituent type, currently represented by the following characters:
◆ W - word, B - single-word block, P - multiword block, D - edge dynamically deleted during disambiguation
● <constituent left boundary> and <constituent right boundary> give the left and right boundary positions of the constituent edge in the sentence;
● <syntactic tag> represents the external syntactic function of the constituent: for a word edge, its POS tag is stored; for a basic-block edge, the syntactic tag obtained from the rule base is stored;
● <relation tag> represents the internal grammatical relation of the constituent: for a word edge, the word itself is stored; for a basic-block edge, the relation tag obtained from the rule base is stored;
● <constituent confidence> represents the processing confidence of the constituent: 0 for a word edge; for a basic-block edge, the confidence obtained from the rule base is stored;
2) Block tag sequence stack (ChkStack): stores the ambiguity-free basic blocks extracted from the PSF and the block-analysis intervals that may contain ambiguity, forming a linear block tag sequence for the input sentence. Each stack record contains: <constituent flag> <constituent left boundary> <constituent right boundary> <syntactic tag> <relation tag> <PSF edge number of the constituent>, where the first five fields are the same as in the PSF.
In this dual data-structure design, the PSF stores all basic-block information produced by the analysis (including the raw data used for disambiguation). Because it uses the same data structure as our full parser, it can be linked to that parser seamlessly, making it easy to exploit all possible basic-block analysis results when further building the complete parse tree of the sentence. ChkStack stores the reliable basic blocks extracted from the PSF and the block-analysis intervals that may contain ambiguity, forming the linear block tag sequence of the input sentence, from which the basic-block annotation result can be generated directly.
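A minimal sketch of the two structures in C++ (the language the analyzer is stated to be implemented in); the type and field names are our own illustrative choices, not the patent's identifiers:

```cpp
#include <string>
#include <vector>

struct PsfEdge {
    char flag;          // 'W' word, 'B' single-word block, 'P' multiword block,
                        // 'D' edge dynamically deleted during disambiguation
    int  cl, cr;        // left/right boundary positions, 0 <= cl < cr <= n
    std::string cctag;  // POS tag (word edge) or block syntactic tag, e.g. "np"
    std::string crtag;  // the word itself (word edge) or relation tag, e.g. "ZX"
    double theta;       // analysis confidence; 0 for word edges
};

struct ChkRecord {      // one entry of the linear block tag sequence
    char flag;
    int  cl, cr;
    std::string cctag, crtag;
    int  psf_eno;       // index of the corresponding edge in the PSF array
};

using Psf      = std::vector<PsfEdge>;
using ChkStack = std::vector<ChkRecord>;
```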
To give full play to the processing power of the multi-level description rules obtained by automatic acquisition and to improve matching efficiency, the following internal data structures are designed for storing the basic-block rule base:
1) Base rule table BasRules[]: stores all POS-tag-string description rules. Its record format is [r_stru, r_tag, fp, fn, θ, e_sp, e_ep], where r_stru is the structural combination of the rule, r_tag is the reduction tag, fp is the positive-example frequency, fn is the counter-example frequency, θ is the rule confidence, computed as θ = fp / (fp + fn), e_sp is the start position of the corresponding extension rules in the extension rule table, and e_ep is their end position;
2) Extension rule table ExpRules[]: stores all description rules, obtained through extension-and-evolution learning, that contain internal lexical constraints and external context restriction conditions. Its record format is [r_stru, r_tag, fp, fn, θ], with r_stru, r_tag, fp, fn and θ defined as in BasRules[];
In this way, the index information e_sp and e_ep recorded in the base rule table establishes the internal link between the two tables.
In the concrete matching analysis, the POS-tag string at the position to be analyzed in the sentence is first used to query the base rule table. If a matching base rule is found, it is further checked whether extension rules exist. If they do, multi-level information expansion is performed at the corresponding position of the sentence, and each expanded combination is checked against the interval [e_sp, e_ep] of the extension rule table. If matching extension rules are found, the one with the highest processing confidence is selected and output as the matched rule; otherwise the base rule is used as the default matched rule.
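The matching step described above can be sketched as follows, assuming the rule tables are held in memory as arrays; the function and field names are illustrative, not the patent's identifiers:

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Rule {
    std::string r_stru;  // structural combination, e.g. "m+n+n" or "v_m+n+n"
    std::string r_tag;   // reduction tag, e.g. "np-ZX"
    int fp, fn;          // positive / counter-example frequencies
    double theta;        // confidence = fp / (fp + fn)
    int e_sp, e_ep;      // extension-rule span in ExpRules; -1 if none
};

// Query the base rule table with the POS-tag string; if extension rules
// exist, test each expanded description string against the span
// [e_sp, e_ep] and keep the matching rule with the highest confidence,
// falling back to the base rule when no extension rule matches.
const Rule* match_rule(const std::vector<Rule>& basRules,
                       const std::vector<Rule>& expRules,
                       const std::string& posString,
                       const std::vector<std::string>& expansions) {
    for (const Rule& base : basRules) {
        if (base.r_stru != posString) continue;
        const Rule* best = nullptr;
        if (base.e_sp >= 0) {
            for (int i = base.e_sp; i <= base.e_ep; ++i) {
                const Rule& ext = expRules[i];
                bool hit = std::find(expansions.begin(), expansions.end(),
                                     ext.r_stru) != expansions.end();
                if (hit && (!best || ext.theta > best->theta)) best = &ext;
            }
        }
        return best ? best : &base;  // default to the base rule
    }
    return nullptr;                  // no basic-block combination here
}
```

With the two matching extension rules of the worked example later in the text (confidences 0.95 and 1.0), this sketch returns the rule with confidence 1.0.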
Fig. 1 gives the complete processing flow of the current basic-block automatic analyzer: first load the analysis resources; then read a sentence to be analyzed, initialize the related data structure PSF, and add all "word + POS" items of the sentence to the PSF as word edges; scan the whole sentence from left to right, finding and recognizing all multiword blocks; on this basis, find all ambiguity structures in the analysis results and disambiguate them automatically; then automatically promote the content words in the sentence not covered by any multiword block, forming all possible single-word blocks; finally extract and output the best basic-block tag sequence of the sentence, and unload the analysis resources.
The concrete implementation of the main processing steps is described in detail below. For ease of understanding, the definitions of some basic symbols and terms are given first:
● θ: the confidence of an analyzed basic block, generally determined by the matching basic-block rule;
● LowBelTh: the confidence threshold above which words in the sentence may be combined into a block; current value 0.5;
● InBelTh: the confidence threshold below which a word combination forms a combination-ambiguity interval; current value 0.7;
● HighBelTh: the confidence threshold above which words may be combined into a reliable basic block; current value 0.85;
● n: the number of words in the sentence to be analyzed;
● ERSum: the number of extension-rule description strings for a given word combination in the sentence;
● CBSum: the number of basic blocks in the PSF that cross a given basic block;
● OASum: the number of crossing-ambiguity intervals found in the sentence;
● CASum: the number of combination-ambiguity intervals found in the sentence;
1) Multiword block recognition module
Fig. 2 shows the complete processing flow of the multiword block recognition module. The basic method: scan the whole sentence from left to right, forming from each word all possible basic-block combination intervals (of length 2 to 6). If a rule matching the interval can be found in the basic-block rule base (basic flow in Fig. 3), the "syntactic tag + relation tag + confidence" information is extracted from the rule description, a new basic-block record (flagged 'P') is automatically generated and added to the PSF, and dynamic disambiguation by relative combination strength in the local context is performed (basic flow in Fig. 4), finding and excluding the less likely block combinations among all local-context basic blocks that cross the newly generated basic block.
2) Ambiguity structure discovery and automatic disambiguation module
The basic method: extract from the PSF all automatically analyzed multiword blocks and sort them by the left and right boundary positions of each block (first by left boundary, ascending; then by right boundary, ascending). Then process this block sequence in order, finding all crossing-ambiguity and combination-ambiguity intervals as follows:
● crossing ambiguity: if the boundaries of two adjacent basic blocks <L1, R1> and <L2, R2> satisfy L2 < R1, they form a possible crossing-ambiguity interval <L1, R2>; this process repeats until a maximal crossing-ambiguity interval is found;
● combination ambiguity: if the analysis confidence of a basic block is less than InBelTh, it forms a possible combination-ambiguity interval.
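The crossing-ambiguity discovery step can be sketched as follows, assuming the multiword blocks are already sorted by (left boundary, right boundary) as described above; names are illustrative:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Merge any block that starts before the current interval ends (L2 < R1)
// into the interval, repeating until a maximal crossing-ambiguity
// interval <L1, R2> is obtained.
std::vector<std::pair<int, int>>
find_crossing_intervals(const std::vector<std::pair<int, int>>& blocks) {
    std::vector<std::pair<int, int>> out;
    size_t i = 0;
    while (i < blocks.size()) {
        int L = blocks[i].first, R = blocks[i].second;
        size_t j = i + 1;
        bool crossed = false;
        while (j < blocks.size() && blocks[j].first < R) {  // L2 < R1
            R = std::max(R, blocks[j].second);              // extend interval
            crossed = true;
            ++j;
        }
        if (crossed) out.emplace_back(L, R);                // maximal interval
        i = j;
    }
    return out;
}
```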
Fig. 5 shows the basic processing flow of the ambiguity interval discovery module. Each crossing-ambiguity and combination-ambiguity interval can then be disambiguated automatically through two loops.
Comparatively speaking, the disambiguation method for crossing-ambiguity intervals is more complex, as the different internal ambiguity structures must be considered. The basic processing flow is as follows:
1. collect all crossing basic blocks in the interval and check their ambiguity combination state;
2. if a fully covering block exists or a possible chain relation structure can be formed, continue to the next step; otherwise return;
3. obtain the confidence values θi of all basic blocks in the interval, and record their maximum Max_θ and minimum Min_θ;
4. if the following conditions are met simultaneously, form a new evenly distributed chain relation structure and return:
● ((all θi >= InBelTh) || (all θi > LowBelTh)) && (Max_θ - Min_θ < 0.1);
● all crossing basic blocks are noun blocks (np) or verb blocks (vp);
5. if the following conditions are met simultaneously, form a new chain-relation noun block and return:
● all crossing basic blocks are noun blocks (np);
● the whole interval forms one noun basic block, and its distribution confidence is greater than InBelTh;
6. if the following conditions are met simultaneously, form a new chain-relation noun block and return:
● the syntactic tags of all crossing basic blocks belong to {np, sp, tp};
● a fully covering block exists, and all θi > LowBelTh;
7. in all other cases, return directly.
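The step-4 test for an evenly distributed chain relation structure can be sketched as follows (a simplified check over the crossing blocks' confidences and syntactic tags; the function name is illustrative):

```cpp
#include <algorithm>
#include <string>
#include <vector>

const double LowBelTh = 0.5, InBelTh = 0.7;  // thresholds from the text

// An evenly distributed chain relation structure is formed when all
// confidences clear a threshold, their spread is below 0.1, and every
// crossing block is an np or vp block.
bool forms_even_chain(const std::vector<double>& theta,
                      const std::vector<std::string>& tags) {
    double mx = *std::max_element(theta.begin(), theta.end());
    double mn = *std::min_element(theta.begin(), theta.end());
    bool allIn  = std::all_of(theta.begin(), theta.end(),
                              [](double t) { return t >= InBelTh; });
    bool allLow = std::all_of(theta.begin(), theta.end(),
                              [](double t) { return t > LowBelTh; });
    bool npvp = std::all_of(tags.begin(), tags.end(),
                            [](const std::string& s) {
                                return s == "np" || s == "vp";
                            });
    return (allIn || allLow) && (mx - mn < 0.1) && npvp;
}
```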
The disambiguation method for combination-ambiguity intervals is simpler, needing only to consider the distribution confidence information in the different situations; the main processing flow is shown in Fig. 6.
3) Single-word block automatic promotion module
The basic method: scan all words in the sentence from left to right. If a word is covered by a multiword block, or is a specific function word (conjunction, auxiliary, preposition, modal particle, interjection, punctuation, etc.), skip it directly and process the next word; otherwise, obtain the syntactic tag of the automatically promoted single-word block by the following rules:
● if the POS tag is noun (n), promote to a noun block np;
● if the POS tag is verb (v), promote to a verb block vp;
● if the POS tag is adverb (d), promote to an adverb block dp;
● if the POS tag is adjective (a), promote to an adjective block ap;
A new single-word block edge is added to the PSF accordingly. This process repeats until all words in the sentence have been processed.
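The promotion rules above amount to a small POS-to-block-tag mapping, which can be sketched as (an empty result meaning the word is not promoted; names are illustrative):

```cpp
#include <string>

// Map the POS tag of an uncovered content word to the syntactic tag of
// its automatically promoted single-word block.
std::string promote(const std::string& pos) {
    if (pos == "n") return "np";  // noun      -> noun block
    if (pos == "v") return "vp";  // verb      -> verb block
    if (pos == "d") return "dp";  // adverb    -> adverb block
    if (pos == "a") return "ap";  // adjective -> adjective block
    return "";                    // function words etc. are skipped
}
```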
4) Linear block tag sequence generation module
Through the above processing, all basic-block (multiword block + single-word block) analysis results have been obtained and stored in the PSF array. The final step is to extract from the PSF the linear basic-block tag sequence of the whole sentence and store the relevant data in ChkStack. The concrete processing flow is:
1. scan the PSF array and obtain a position table of the following two types of word intervals covering the whole sentence:
● ambiguous intervals: AmbiList;
● ambiguity-free intervals: NonAmbiList;
2. for each ambiguous interval, automatically generate an ambiguity-interval block and add it to ChkStack; // the syntactic tag is taken from the first basic block in the interval, and the relation tag is set to "AM" (ambiguity interval)
3. for each ambiguity-free interval, extract each covered basic block in order and add it to ChkStack;
4. sort ChkStack by block information, forming a block description sequence covering the whole sentence.
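The final ordering step can be sketched as follows (a simplified record type; real ChkStack records also carry the constituent flag and the PSF edge number):

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct BlockRec { int cl, cr; std::string cctag, crtag; };

// Once the ambiguity-interval blocks (relation tag "AM") and the
// unambiguous blocks have been pushed onto the stack, sorting by left
// boundary yields the linear block tag sequence covering the sentence.
void to_linear_sequence(std::vector<BlockRec>& stack) {
    std::sort(stack.begin(), stack.end(),
              [](const BlockRec& a, const BlockRec& b) { return a.cl < b.cl; });
}
```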
A specific embodiment of the above analysis algorithm is given below. After loading the analysis resources (basic-block rule base + lexical knowledge base), the following input sentence is analyzed into basic blocks automatically:
We/r should/vM notes/v selects/v some/m the young and the middle aged/scientist n/n participation/v like this/r /the u world/n meeting/n; / w cultivation/v one/m props up/q understands/v science/n ,/w understands/v diplomacy/n /u "/w national team/n "/w ,/w actively/a carries out/the v people/n diplomacy/n./w
First the data structure PSF is initialized: the basic information of the 31 word items in the sentence (word + POS) is added to the PSF as word edges (constituent flag 'W'). Then the whole sentence is scanned from left to right to find and recognize all multiword blocks (Fig. 2).
When scanning reaches the 2nd word of the sentence (counting from 0), "notes/v", an effective base rule combination "v+v" is found, and extension rules exist for it. The rule expansion module is therefore called, yielding the following extension-rule structural-combination description strings:
1. v(winl:VVPLIST)+v // considers the syntactic feature of the verb: can take a verbal object
2. vM_v+v // considers the left adjacent POS restriction
3. v+v_m // considers the right adjacent POS restriction
4. vM_v+v_m // considers the left and right adjacent POS restrictions
5. vM_v(winl:VVPLIST)+v // considers the verb syntactic feature and the left adjacent POS restriction
6. v(winl:VVPLIST)+v_m // considers the verb syntactic feature and the right adjacent POS restriction
7. vM_v(winl:VVPLIST)+v_m // considers the verb syntactic feature and both adjacent POS restrictions
Among the 254 extension rules corresponding to this base rule, no matching extension rule is found, indicating that in the present context the word combination "notes/v selects/v" is unlikely to form a basic block.
When scanning continues rightward to the 4th word of the sentence, "some/m", another effective base rule combination "m+n+n" is found, and extension rules exist for it. The rule expansion module is called again, yielding the following extension-rule structural-combination description strings:
1. v_m+n+n // considers the left adjacent POS restriction
2. m+n+n_v // considers the right adjacent POS restriction
3. v_m+n+n_v // considers the left and right adjacent POS restrictions
Among the 21 extension rules corresponding to this base rule, the following two matching extension rules are found:
1. m+n+n_v → np-ZX, 18, 1, 0.95 // matches structural combination 2 above
2. v_m+n+n → np-ZX, 14, 0, 1.0 // matches structural combination 1 above
Rule 2, which has the higher confidence, is selected as the best matching rule, and a new multiword block is generated accordingly and added to the PSF: ['P', 4, 7, np, ZX, 1.0].
The above analysis process continues; after the multiword block recognition module finishes, 7 multiword blocks have been obtained in total. The following table lists their detailed description information.
Edge no.   Flag   Left boundary   Right boundary   Syntactic tag   Relation tag   Confidence
37         P      28              30               np              ZX             7.812500e-001
36         P      19              21               vp              PO             9.166667e-001
35         P      16              18               vp              PO             8.500000e-001
34         P      14              16               mp              ZX             9.187863e-001
33         P      10              12               np              ZX             1.000000e+000
32         P      5               7                np              ZX             8.729776e-001
31         P      4               7                np              ZX             1.000000e+000
Then the ambiguity structure processing module is called. A crossing-ambiguity interval [4, 6] is found in the sentence; during its disambiguation it can be seen that the interval satisfies the 3rd chain-structure formation condition:
● the syntactic tags of all crossing basic blocks belong to {np, sp, tp};
● a fully covering block exists, and all θi > LowBelTh;
Therefore the fully covering block (edge no. 31) is selected, the inner basic block (edge no. 32) is excluded (its constituent flag is set to 'D'), and the automatic disambiguation is completed.
On this basis, the remaining content words not covered by any multiword block (the 0th, 1st, 2nd, 3rd, 7th, 8th, 13th, 23rd, 26th and 27th words of the sentence) are further promoted automatically to form 10 single-word blocks, completing the basic-block analysis module functions in the flow of Fig. 1.
Finally, the best basic-block tag sequence extraction module is run and all information in ChkStack is output, giving the following basic-block analysis result:
[np-SG we/r] [vp-SG should/vM] [vp-SG notes/v] [vp-SG selects/v] [np-ZX some/m the young and the middle aged/scientist n/n] [vp-SG participation/v] [vp-SG like this/r] /u [the np-ZX world/n meeting/n]; / w [vp-SG cultivation/v] [mp-ZX one/m props up/q] [vp-PO understands/v science/n] ,/w [vp-PO understands/v diplomacy/n] /u "/w [np-SG national team/n] "/w ,/w [dp-SG actively/a] [vp-SG carries out/v] [np-ZX people/n diplomacy/n]./w
This basic-block analyzer can be implemented in standard C/C++ on any PC-compatible computer.

Claims (1)

1. A rule-based automatic analysis method for Chinese basic blocks, characterized in that it comprises the following steps in order:
(1) computer initialization:
a. the input sentence is S, S = {<wi, ti>}, where wi is the i-th word in sentence S, ti is the POS tag of the i-th word, i ∈ [1, n], and n is the number of words in the sentence;
b. the compressed parse-shared forest PSF[] is represented by the chart data structure: the n words of the sentence are arranged from left to right; from the left side of the 1st word to the right side of the n-th word there are n+1 positions; each position is defined as a chart node, and any two nodes form a chart edge, denoted (l, r), where l is the left node position of the edge, r is the right node position, and r > l; all edges together form a chart array giving complete description information for all words and for the basic blocks formed from words; words and basic blocks are collectively called syntactic constituents; a basic block is an aggregate of adjacent words in sentence S centered on a content word; PSF[] includes:
b1. <constituent flag cflag>, representing the following constituent types: W - word, B - single-word block, P - multiword block, D - edge dynamically deleted during disambiguation;
b2. <constituent left boundary cl> and <constituent right boundary cr>, the left and right boundary positions of the constituent edge in sentence S, cl ∈ [0, n-1], cr ∈ [1, n];
b3. <syntactic tag cctag>, representing the external syntactic function of the constituent:
for a word edge, its POS tag is stored; the tag set comprises: n - noun, s - place word, t - time word, f - locative, r - pronoun, vM - auxiliary verb, v - verb, a - adjective, d - adverb, m - numeral, q - measure word, p - preposition, u - auxiliary, c - conjunction, y - modal particle, e - interjection, w - punctuation;
for a basic-block edge, the syntactic tag obtained from the rule base is stored; the tag set comprises: np - noun block, vp - verb block, sp - space block, tp - time block, mp - quantity block, ap - adjective block, dp - adverb block;
b4. <relation tag crtag>, representing the internal grammatical relation of the constituent:
for a word edge, the word itself is stored;
for a basic-block edge, the relation tag obtained from the rule base is stored; the tag set comprises: ZX - right-corner center structure, LN - chain relation structure, LH - coordination, PO - predicate-object relation, SB - predicate-complement relation, AM - ambiguity interval, SG - single-word block, wherein:
the right-corner center structure means all words in the basic block depend directly on the right-corner head word, forming a rightward center dependency structure; basic pattern: A1 ... An H, with dependencies A1 → H, ..., An → H, where H is the syntactic-semantic head of the whole basic block and A1, ..., An are modifiers;
the chain relation structure means each word in the basic block depends in turn on its immediate right neighbor, forming a multi-center dependency chain arranged from left to right; basic pattern: H0 H1 ... Hn, with dependencies H0 → H1, ..., Hn-1 → Hn, where the Hi, i ∈ [1, n-1], are semantic aggregation points at different levels and Hn is the syntactic-semantic head of the whole basic block;
coordination means the words in the basic block form a parallel structure;
the predicate-object relation means two words in the basic block form a verb-object structure;
the predicate-complement relation means two words in the basic block form a verb-complement structure;
an ambiguity interval means that some words can form several different structural combinations that cannot be disambiguated automatically with the present basic-block rule base and lexical knowledge base; the multiple structural combinations are retained for subsequent systems to select from;
b5. <constituent confidence θ>, representing the processing confidence of the constituent, θ ∈ [0, 1];
b6. <word edge>: the i-th word in the sentence, characterized by cflag = W, cl = i-1, cr = i, cctag = ti, crtag = wi, θ = 0;
b7. <single-word block edge>, i.e. a basic block composed of one word, characterized by cflag = B, cr - cl = 1, crtag = SG;
b8. <multiword block edge>, i.e. a basic block composed of two or more words, characterized by cflag = P, cr - cl >= 2;
C. piece mark sequence stack ChkStack [] preserves no ambiguity fundamental block that from PSF, extracts and the block analysis that possibly produce ambiguity interval, forms the piece mark sequence to the linearity of input sentence; Its master record form is: [cflag; Cl, cr, cctag; Crtag, corresponding PSF bark mark PSFeno];
D. The base rule table BasRules[] stores all rules whose structures are described by POS-tag strings. Its main record format is [r_stru, r_tag, fp, fn, θ, e_sp, e_ep], where:
r_stru is the rule's structural combination,
r_tag is the reduction tag, consisting of a syntactic tag and a relation tag,
fp is the positive-example frequency,
fn is the negative-example frequency,
θ is the rule confidence, computed as θ = fp / (fp + fn),
e_sp is the start position of the rule's expansion rules in the expansion rule table,
e_ep is the end position of the rule's expansion rules in the expansion rule table,
and if the base rule has no corresponding expansion rules, e_sp and e_ep are both -1;
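As a sketch, a BasRules[] record and the confidence formula θ = fp / (fp + fn) could be rendered as follows; the class name and example values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class BaseRule:
    r_stru: str     # structural combination, e.g. the POS-tag string "a n"
    r_tag: str      # reduction tag: syntactic tag plus relation tag
    fp: int         # positive-example frequency
    fn: int         # negative-example frequency
    e_sp: int = -1  # start of its expansion rules in ExpRules[] (-1: none)
    e_ep: int = -1  # end of its expansion rules in ExpRules[] (-1: none)

    @property
    def theta(self) -> float:
        # rule confidence: theta = fp / (fp + fn)
        return self.fp / (self.fp + self.fn)

rule = BaseRule("a n", "np-ZX", fp=90, fn=10)
print(rule.theta)  # 0.9
```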
E. The expansion rule table ExpRules[] stores all description rules with lexical constraints and context restrictions obtained through expansion learning. Its main record format is [r_stru, r_tag, fp, fn, θ], with the fields defined as above;
F. The basic block rule base stores the basic block description rules, at the different levels needed for basic block recognition, obtained through automatic rule learning and evolution. Its basic format is {<structural combination> → <reduction tag> <confidence>}, where:
the structural combination describes the internal composition of each basic block and is divided into two levels according to descriptive power:
a) base rules, whose structural combination is described by a POS-tag string,
b) expansion rules, which add lexical constraints and context restrictions to form more expressive structural descriptions; the reduction tag and confidence are defined as above;
G. The lexical knowledge base stores the lexical description knowledge that may be used during analysis, obtained from external knowledge sources, and comprises:
G1. a word association knowledge base, containing descriptions of the syntactic relations formed between pairs of common Chinese content words, with the basic data format {<word 1> <word 2> <POS 1> <POS 2> <syntactic relation tag>};
G2. feature verb lists, containing verbs extracted from a syntactic information dictionary that can take objects of different types, with the basic data format {<verb entry>}, organized into separate verb lists by object type;
G3. a noun semantic information table, covering 11 semantic categories of common Chinese nouns: organization, person, artifact, natural object, information, mind, event, attribute, quantity, time and space, with the basic data format {<noun entry> <semantic type tag>};
H. Crossing ambiguity interval: if the left and right boundaries of two basic blocks, <L1, R1> and <L2, R2>, satisfy (L2 < R1) & (L2 >= L1) & (R2 > R1), or (R2 > L1) & (R2 <= R1) & (L2 < L1), the two blocks form a crossing ambiguity interval <AmL, AmR>, where AmL = min(L1, L2) and AmR = max(R1, R2);
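Under the boundary conventions used here, the crossing test of definition H can be sketched as:

```python
def crossing_interval(block1, block2):
    """Return the crossing ambiguity interval <AmL, AmR> formed by two
    basic blocks given as (L, R) boundary pairs, or None if the blocks do
    not cross.  Directly encodes the condition of definition H."""
    (L1, R1), (L2, R2) = block1, block2
    crosses = ((L2 < R1 and L2 >= L1 and R2 > R1) or
               (R2 > L1 and R2 <= R1 and L2 < L1))
    if not crosses:
        return None
    return (min(L1, L2), max(R1, R2))
```

For example, blocks <0, 3> and <2, 5> cross and yield the interval <0, 5>, while the adjacent blocks <0, 3> and <3, 5> do not cross.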
I. Full-coverage block: a basic block inside a crossing ambiguity interval <AmL, AmR> that covers the interval completely, i.e. if its left and right boundaries are cl and cr, then cl = AmL and cr = AmR;
J. Combination ambiguity interval: a word span in the sentence that can form one complete basic block in one context but several separate basic blocks in another. The concrete decision condition is: if the analysis confidence of the basic block covering the span is less than InBelTh, the span forms a combination ambiguity interval;
K. Whole combination block: the basic block formed by combining the entire combination ambiguity interval;
L. Inner combination blocks: the several basic blocks formed separately by the words inside the combination ambiguity interval;
The following modules are also loaded into the computer: a multi-word block recognition module, an ambiguity structure discovery and disambiguation module, an automatic single-word block promotion module, and a linear block sequence generation module;
At the same time, the following parameters are set:
LowBelTh, the confidence threshold for words in the sentence to be combined into a block, set to 0.5;
InBelTh, the confidence threshold below which a word combination forms a combination ambiguity interval, set to 0.7;
HighBelTh, the confidence threshold for words to be combined into a reliable basic block, set to 0.85;
ERSum, the total number of expansion rule description strings for a given word span in the sentence;
CBSum, the total number of basic blocks in the data structure PSF that cross a given basic block;
OASum, the total number of crossing ambiguity intervals found in the sentence;
CASum, the total number of combination ambiguity intervals found in the sentence;
and the following basic functions are used:
min, the minimum function: min(x, y) selects the smaller of x and y;
max, the maximum function: max(x, y) selects the larger of x and y;
(2) Input the Chinese sentence S to be analyzed into basic blocks, S = {<w_i, t_i>}, i ∈ [1, n];
(3) Initialize the related data structure PSF as follows:
(3.1) initialize i = 0;
(3.2) obtain the word w_i and POS tag t_i of the i-th word in the sentence, generate a new word edge record ['W', i, i+1, t_i, w_i, 0], where 0 is the confidence value, and add it to PSF;
(3.3) set i = i + 1 and repeat step (3.2) until i = n, then stop;
(4) Discover and recognize multi-word blocks as follows:
(4.1) initialize i = 0;
(4.2) starting from the i-th word, scan the whole sentence from left to right, forming all candidate basic block intervals <i, j> of 2 to 6 words, j ∈ [i+2, i+6];
(4.3) take each candidate interval <i, j> produced by step (4.2) in order, and look up the best matching rule number BestRuleNo for the interval in the rule base, as follows:
(4.3.1) obtain the POS-tag string inside the interval; if this tag string does not occur in the base rule table, return an invalid number and stop;
(4.3.2) check whether corresponding expansion rules exist; if not, go to (4.3.7);
(4.3.3) query the lexical knowledge base to obtain all expansion rule structural description strings for the interval, ERSum in total;
(4.3.4) sequentially examine the k = 0, 1, 2, ..., ERSum-1 expansion rule structural description strings; whenever one of them occurs in the expansion rule table, add the matching expansion rule table index to the found-rule list;
(4.3.5) if no matching expansion rule is found, go to (4.3.7);
(4.3.6) select the expansion rule with the highest confidence from the found-rule list, return its index, and stop;
(4.3.7) if the confidence of the base rule is less than LowBelTh, return an invalid number and stop; otherwise return the base rule's index and stop;
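Steps (4.3.1)-(4.3.7) amount to a two-tier lookup: prefer the best lexicalized expansion rule, and fall back to the base rule only if it is confident enough. A sketch, with illustrative data shapes (a dict for the base rule table, a precomputed match list standing in for steps 4.3.3-4.3.4):

```python
def best_rule_no(pos_string, bas_rules, exp_matches, low_bel_th=0.5):
    """bas_rules maps a POS-tag string to (rule_no, theta, has_expansions);
    exp_matches lists the (rule_no, theta) expansion rules whose description
    strings matched the span (built with the lexical knowledge base).
    Returns a rule number, or None as the invalid number."""
    base = bas_rules.get(pos_string)
    if base is None:                       # 4.3.1: unseen POS-tag string
        return None
    rule_no, theta, has_expansions = base
    if has_expansions and exp_matches:     # 4.3.6: best expansion rule wins
        return max(exp_matches, key=lambda m: m[1])[0]
    if theta < low_bel_th:                 # 4.3.7: base rule too unreliable
        return None
    return rule_no
```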
(4.4) if BestRuleNo is empty, go to (4.9);
(4.5) using the best matching rule number BestRuleNo, extract the following information from the corresponding rule table record: the basic block syntactic tag CCT, the relation tag CRT and the analysis confidence CB; generate a new multi-word block edge record ['P', i, j, CCT, CRT, CB] and add it to the data structure PSF;
(4.6) perform dynamic disambiguation based on the relative combination strength in the local context, as follows:
(4.6.1) obtain the confidence value θ_T of the current basic block;
(4.6.2) obtain all other basic blocks in the data structure PSF that cross this basic block, CBSum in total, and initialize the crossing edge array index i = 0;
(4.6.3) obtain the confidence value θ_i of the i-th crossing edge and judge:
(4.6.4) if θ_T - θ_i > 0.2, delete the i-th crossing edge by setting its cflag = 'D';
(4.6.5) set i = i + 1 and repeat steps (4.6.3)-(4.6.5) until i = CBSum;
(4.6.6) if there is some crossing edge i with θ_i - θ_T > 0.2, delete the current basic block edge and stop;
(4.7) while j < min(n, i+6), repeat steps (4.3)-(4.6);
(4.8) if i < n, set i = i + 1 and repeat steps (4.2)-(4.7); otherwise stop;
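The local competition of step (4.6) can be sketched as follows, with blocks as dicts carrying a 'theta' confidence and a 'cflag' flag ('D' marks a deleted edge); the 0.2 margin is the one stated above:

```python
def local_disambiguate(current, crossing_edges, margin=0.2):
    """Sketch of step (4.6): keep whichever of the current block and its
    crossing edges is clearly stronger.  Returns True if the current block
    survives, False if it is deleted."""
    theta_t = current["theta"]
    for edge in crossing_edges:
        if theta_t - edge["theta"] > margin:   # 4.6.4: crossing edge loses
            edge["cflag"] = "D"
    for edge in crossing_edges:
        if edge["theta"] - theta_t > margin:   # 4.6.6: current block loses
            current["cflag"] = "D"
            return False
    return True
```

A block thus deletes only crossing edges it dominates by the margin, and is itself deleted when a clearly stronger crossing edge exists; blocks whose confidences are within 0.2 of each other all survive for later disambiguation.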
(5) Discover multi-word block ambiguity structures, in the following steps:
(5.1) extract all multi-word block edges from the data structure PSF, PESum in total;
(5.2) automatically sort them in ascending order of their left and right boundary positions, forming a multi-word block information table;
(5.3) obtain the left and right boundary positions <L, R> of the 1st block in the table: L = cl_1, R = cr_1; set the buffer for a possible crossing ambiguity interval: BufL = L, BufR = R; initialize the crossing ambiguity block information AmL = AmR = 0 and the table index i = 2;
(5.4) obtain the left and right boundary positions <L, R> of the i-th block in the table: L = cl_i, R = cr_i;
(5.5) if the word spans of the two adjacent blocks do not cross, i.e. L > BufR, go to (5.6); otherwise adjust the buffered boundaries of the possible crossing ambiguity interval, BufL = min(BufL, L), BufR = max(BufR, R), set the corresponding crossing ambiguity block information AmL = BufL, AmR = BufR, and go to (5.10);
(5.6) obtain the analysis confidences θ_{i-1} and θ_i of the two adjacent multi-word block edges;
(5.7) if θ_i < InBelTh, save a combination ambiguity interval <L, R>;
(5.8) if θ_{i-1} < InBelTh, save a combination ambiguity interval <BufL, BufR>;
(5.9) if a crossing ambiguity interval has been found, i.e. AmL > 0, save the crossing ambiguity interval <AmL, AmR> and reset the buffers: BufL = L, BufR = R, AmL = AmR = 0;
(5.10) set i = i + 1 and repeat steps (5.4)-(5.9) until i = PESum;
(5.11) return the total numbers and boundary information tables of the two types of ambiguity intervals found, then stop;
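The buffer-based scan of steps (5.3)-(5.10) can be sketched as below; blocks are (L, R, theta) triples already sorted by position, and the helper is a simplified, illustrative rendering rather than the patent's exact procedure:

```python
def find_ambiguities(blocks, in_bel_th=0.7):
    """Scan sorted multi-word blocks, growing a buffer over overlapping
    blocks (crossing ambiguity) and testing block confidences against
    InBelTh at each gap (combination ambiguity).  Returns the two
    interval lists."""
    crossing, combination = [], []
    buf_l, buf_r = blocks[0][0], blocks[0][1]
    am_l = am_r = 0
    for prev, cur in zip(blocks, blocks[1:]):
        L, R, theta = cur
        if L > buf_r:                        # 5.5: no overlap with buffer
            if theta < in_bel_th:            # 5.7
                combination.append((L, R))
            if prev[2] < in_bel_th:          # 5.8
                combination.append((buf_l, buf_r))
            if am_l > 0:                     # 5.9: flush crossing interval
                crossing.append((am_l, am_r))
            buf_l, buf_r, am_l, am_r = L, R, 0, 0
        else:                                # overlap: widen the buffer
            buf_l, buf_r = min(buf_l, L), max(buf_r, R)
            am_l, am_r = buf_l, buf_r
    return crossing, combination
```

For instance, blocks <1,3> and <2,5> overlap and are merged into the crossing interval <1,5>, which is flushed when the non-overlapping block <6,8> (confidence 0.6 < InBelTh) arrives and is itself recorded as a combination ambiguity interval.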
(6) Perform automatic multi-word block disambiguation, as follows:
(6.1) obtain the totals OASum and CASum of the two types of ambiguity intervals found in the sentence and the corresponding boundary information tables;
(6.2) disambiguate the crossing intervals, as follows:
(6.2.1) initialize the crossing interval boundary information table index i = 0;
(6.2.2) obtain the i-th crossing ambiguity interval <L, R> and all the crossing multi-word blocks inside the interval;
(6.2.3) if the ambiguity interval contains a full-coverage block, or all the crossing blocks can form a chain relation structure, go to step (6.2.4); otherwise stop;
(6.2.4) obtain the confidence values θ_i of all crossing blocks in the interval and set their maximum Max_θ and minimum Min_θ;
(6.2.5) if the following two conditions hold simultaneously, form an evenly distributed chain relation structure block and set CCT = np or vp:
all θ_i >= InBelTh, or all θ_i > LowBelTh with Max_θ - Min_θ < 0.1,
all crossing blocks are noun blocks or verb blocks;
(6.2.6) if the following two conditions hold simultaneously, form a new chain relation structure noun block and set CCT = np:
all crossing blocks are noun blocks np,
the whole interval forms a noun block whose distribution confidence is greater than InBelTh;
(6.2.7) if the following two conditions hold simultaneously, form a new chain relation structure noun block and set CCT = np:
the syntactic tags of all crossing blocks are np, sp or tp,
a full-coverage block exists and all θ_i > LowBelTh;
(6.2.8) generate a new chain relation structure multi-word block edge record ['P', L, R, CCT, 'LN', Max_θ], where 'LN' denotes the chain relation structure, and add it to the data structure PSF;
(6.2.9) delete all crossing edges inside the interval;
(6.2.10) set i = i + 1 and repeat steps (6.2.2)-(6.2.9) until i = OASum, then stop;
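The merge conditions of steps (6.2.5)-(6.2.7) can be sketched as a single decision function; blocks are (syntactic_tag, theta, is_full_coverage) triples, and the sketch omits the whole-interval confidence check of step (6.2.6), so it is illustrative only:

```python
def chain_merge_tag(blocks, low_bel_th=0.5, in_bel_th=0.7):
    """Return the syntactic tag CCT for a merged chain ('LN') block built
    over a crossing ambiguity interval, or None if no condition fires."""
    thetas = [b[1] for b in blocks]
    tags = {b[0] for b in blocks}
    max_t, min_t = max(thetas), min(thetas)
    evenly = (all(t >= in_bel_th for t in thetas) or
              (all(t > low_bel_th for t in thetas) and max_t - min_t < 0.1))
    if evenly and tags <= {"np", "vp"}:                  # 6.2.5
        return blocks[-1][0]                             # np or vp
    if tags == {"np"}:                                   # 6.2.6 (simplified)
        return "np"
    has_full = any(b[2] for b in blocks)
    if (tags <= {"np", "sp", "tp"} and has_full
            and all(t > low_bel_th for t in thetas)):    # 6.2.7
        return "np"
    return None
```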
(6.3) disambiguate the combination intervals, as follows:
(6.3.1) initialize the combination interval boundary information table index i = 0;
(6.3.2) obtain the i-th combination ambiguity interval <L, R>;
(6.3.3) obtain the confidence value Comb_θ of the whole combination block of this interval;
(6.3.4) obtain the confidence values Seg_θ_i of each inner combination block in this interval and set their maximum Max_θ;
(6.3.5) if all Seg_θ_i > LowBelTh and some Seg_θ_i > HighBelTh, go to (6.3.9);
(6.3.6) if all Seg_θ_i <= LowBelTh, go to (6.3.10);
(6.3.7) if Comb_θ > HighBelTh, go to (6.3.10);
(6.3.8) if Comb_θ - Max_θ > 0.1, go to (6.3.10);
(6.3.9) select the 'split' state, i.e. delete the whole combination block, and go to (6.3.11);
(6.3.10) select the 'merge' state, i.e. delete each inner combination block, and go to (6.3.11);
(6.3.11) set i = i + 1 and repeat steps (6.3.2)-(6.3.10) until i = CASum, then stop;
(7) Promote single-word blocks automatically, as follows:
(7.1) scan all words in the sentence from left to right; if a word is covered by some multi-word block, or belongs to the function word set comprising conjunctions, auxiliary words, prepositions, modal particles, interjections, punctuation marks, etc., go directly to the next word; otherwise perform the following step:
(7.2) obtain the syntactic tag of the word, generate a new single-word block edge record and add it to the data structure PSF;
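Step (7) can be sketched as a single pass over the sentence; the function word POS codes and the confidence value are illustrative assumptions, not from the patent:

```python
# Illustrative POS codes for the function word set: conjunction, auxiliary,
# preposition, modal particle, interjection, punctuation.
FUNCTION_POS = {"c", "u", "p", "y", "e", "w"}

def promote_single_words(tagged_words, covered):
    """tagged_words is a list of (word, pos) pairs; covered is the set of
    word indices already inside some multi-word block.  Every remaining
    content word becomes a single-word ('SG') block edge."""
    edges = []
    for i, (word, pos) in enumerate(tagged_words):
        if i in covered or pos in FUNCTION_POS:
            continue
        edges.append(["B", i, i + 1, pos, "SG", 1.0])  # confidence assumed
    return edges
```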
(8) Generate the linear block tag sequence, as follows:
(8.1) analyze the PSF array to obtain the position information tables of the following two types of word intervals covering the complete sentence:
ambiguous intervals, denoted AmbiList[],
unambiguous intervals, denoted NonAmbiList[];
(8.2) obtain the totals of the two types of intervals, denoted ALSum and NALSum respectively;
(8.3) initialize i = 0;
(8.4) obtain the left and right boundary positions <L, R> of the i-th ambiguous interval;
(8.5) obtain the syntactic tag CCT of the first basic block in this interval and generate an ambiguity interval block ['P', L, R, CCT, 'AM', -1], adding it to the block tag sequence stack ChkStack; the relation tag is set to AM, marking an ambiguity interval;
(8.6) set i = i + 1 and repeat steps (8.4)-(8.5) until i = ALSum;
(8.7) for each unambiguous interval, extract each covered basic block in order and add it to the block tag sequence stack ChkStack;
(8.8) sort the block information in ChkStack, forming the linear block tag sequence covering the whole sentence, then stop.
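The split/merge decision of step (6.3) above reduces to the following threshold logic; a minimal sketch, with the interval given as its whole-block confidence Comb_θ and the list of its inner-block confidences:

```python
def combination_decision(comb_theta, seg_thetas, low=0.5, high=0.85):
    """Return 'split' (keep the inner blocks, step 6.3.9) or 'merge' (keep
    the whole combination block, step 6.3.10), following the test order of
    steps (6.3.5)-(6.3.8); control falls through to 'split'."""
    max_seg = max(seg_thetas)
    if all(t > low for t in seg_thetas) and max_seg > high:
        return "split"   # 6.3.5: inner analysis clearly reliable
    if all(t <= low for t in seg_thetas):
        return "merge"   # 6.3.6: every inner block unreliable
    if comb_theta > high:
        return "merge"   # 6.3.7: whole block clearly reliable
    if comb_theta - max_seg > 0.1:
        return "merge"   # 6.3.8: whole block clearly stronger
    return "split"       # 6.3.9 reached by fall-through
```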
CN2007100634897A 2007-02-02 2007-02-02 Rule-based automatic analysis method of Chinese basic block Expired - Fee Related CN101013421B (en)


Publications (2)

Publication Number Publication Date
CN101013421A CN101013421A (en) 2007-08-08
CN101013421B 2012-06-27



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100515641B1 (en) * 2003-04-24 2005-09-22 우순조 Method for sentence structure analysis based on mobile configuration concept and method for natural language search using of it
CN1828610A (en) * 2006-04-13 2006-09-06 北大方正集团有限公司 Improved file similarity measure method based on file structure




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120627

Termination date: 20130202