CN103150381A - High-precision Chinese predicate identification method - Google Patents

High-precision Chinese predicate identification method

Info

Publication number: CN103150381A (application CN201310080760.3A; granted as CN103150381B)
Authority: CN (China)
Prior art keywords: predicate, feature, suspicious, sentence, recognition
Inventors: 罗森林, 白建敏, 潘丽敏, 韩磊, 魏超
Assignee: Beijing Institute of Technology BIT
Legal status: Granted; Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)


Abstract

The invention relates to a predicate identification method based on a combination of rules and statistics, belonging to the fields of natural language processing and machine learning. The method aims at high-precision, high-efficiency predicate identification. It identifies predicates step by step from sentences annotated with lexical and syntactic information: lexical analysis is performed on the sentence under test to obtain the suspicious predicates and their number; predicates are preliminarily identified using a set of rule-based judgment conditions; for suspicious predicates that do not satisfy those conditions, the relevant lexical and syntactic features are extracted and the predicates are judged with a decision-tree model obtained by C4.5 training; finally, the results of the two steps are merged to give the predicates of the sentence under test. The method offers high accuracy, fast identification, and a high identification rate for non-verbal predicates. It is applicable wherever high-precision Chinese predicate identification is required, strongly promotes the development of sentence-meaning analysis, and has high application and popularization value.

Description

A high-precision Chinese predicate recognition method
Technical field
The present invention relates to a Chinese predicate recognition method combining rules and statistics, belonging to the fields of natural language processing and machine learning.
Background technology
Natural language processing has made major progress in lexical and syntactic research; by comparison, research on semantics, pragmatics, and contextual knowledge has long been a bottleneck that is difficult to cross. For a computer to truly understand natural language, semantic analysis is the only way forward. Predicate recognition is the basis for further semantic analysis and plays a critical role in its follow-up tasks; therefore, a high-accuracy, high-efficiency predicate recognition method is especially important.
Chinese predicate recognition must solve two basic problems: (1) how to extract representative, highly discriminative rules or feature combinations to constrain or characterize predicates; (2) which high-accuracy, fast-judging model to adopt to identify predicates. Surveying existing predicate recognition methods, they are mainly rule-based methods and statistics-based methods, along with methods that combine rules and statistics.
1. Rule-based methods
Rule-based methods usually have linguists build a rule base from corpora and grammar. The rules are intuitive expressions of linguistic knowledge with good generality and explanatory power, but because the granularity and coverage of rules are hard to control and rules may compete and conflict with one another, rule-based methods have their bottlenecks. The main methods are:
(1) Chinese predicate recognition for a Chinese-English Example-Based Machine Translation (EBMT) system: this method proposed a compromise approach to Chinese sentence analysis, skeleton dependency analysis, which grasps the overall structure of a sentence by determining its predicate, and proposed a strategy that identifies the predicate of a Chinese example sentence from the predicate of the corresponding English example sentence. On 3000 Chinese example sentences, automatic predicate recognition reached an accuracy of 87.3%.
(2) Predicate recognition for technical papers: this method was proposed for syntactic analysis of the specific style of technical papers. It identifies only verbs acting as the central predicate (the predicate at the first layer of the sentence) and does not report a concrete recognition accuracy. Its basic steps are: 1) segment the sentence (ending with a full stop) into words according to a dictionary, and put the words with verbal character into a set D; 2) if D is empty, report an error; if D has only one element, apply one part of the rules to judge it and move on to syntactic analysis; otherwise go to step 3; 3) use another part of the rules to delete from D the verbs that cannot be the central predicate; if D becomes empty, report an error; otherwise go to step 4; 4) use the remaining rules to find the central predicate.
(3) Predicate recognition using the syntactic relation between subject and predicate: building on identifying predicates with the static and dynamic grammatical features of predicate candidates, this method identifies predicates using the syntactic relation between a sentence's subject and predicate. Its concrete steps are: 1) select initial subject candidates and initial predicate candidates according to part of speech; 2) further screen the predicate candidates according to features learned from the training set, moving some predicate candidates that can act as subjects into the subject candidate set; 3) connect the subject candidates appropriately to make the sentence structure clearer and to prepare for judging the sentence type; 4) judge the type of the sentence and, according to the result, select the syntactic features that the predicate candidates possess; 5) organize and compute the features of each predicate candidate, using the computed value as the measure for choosing among them. In open tests, predicate recognition accuracy reached 91.3%.
(4) Predicate recognition in data-oriented parsing (DOP): this method proposes an approach to automatic Chinese syntactic parsing oriented to event-description clauses. Before parsing, real corpus text is preprocessed by clause division: the preprocessing stage uses a method based on predicate recognition and rules to split a Chinese sentence into multiple event-description clauses. Each clause is then parsed based on DOP, and the parse of the complete sentence is finally obtained by combination. The benefit is that parsing can be processed step by step and complex sentences with many words are simplified, improving parsing speed and precision. Here predicate recognition is carried out as part of event-description clause identification. The method identifies verbs and adjectives acting as predicates on the 171 TCT corpus documents provided by CIPS-ParsEval-2009, obtaining a recognition accuracy of 89.94%.
2. Statistics-based methods
(1) Predicate recognition based on a Statistical Decision Tree (SDT): an SDT is a decision mechanism that assigns a probability P(f|h) to each possible choice, where h denotes a feature series and f the choice currently being made. P(f|h) is decided by the first n feature questions q1, q2, ..., qn, where the i-th question depends only on the previous i−1 questions. An internal node is a question node, representing a question about one feature, and the branches extending from it represent the feature's possible values. A leaf node is a choice node, representing the class of a candidate word whose features match the path from the root to that leaf (the classes here are "the candidate word is a predicate" and "the candidate word is not a predicate"), and the choice made at a leaf is expressed as a probability. Identifying the predicate of an example sentence amounts to finding, among all leaf nodes, the one with the maximum probability of being a predicate. Using SDT to identify verbs and adjectives acting as predicates, closed-set test accuracy reached up to 81.3% and open-set test accuracy up to 78.6%.
(2) Predicate recognition based on the Support Vector Machine (SVM): the SVM is built on statistical learning theory and the structural risk minimization principle. From limited sample information it seeks the optimal compromise between model complexity (fitting precision on the given training samples) and learning ability (the ability to classify arbitrary samples without error), so as to obtain the best generalization. Using the SVM method with 1510 sentences of the BFS-CTC corpus as experimental data, the single-template and multi-template predicate recognition accuracies obtained with ten-fold cross-validation were 88.21% and 88.75%, respectively.
(3) Predicate recognition based on the maximum entropy model: the maximum entropy model is the theoretical foundation of the maximum entropy classifier. Its basic idea is to model all known factors and exclude all unknown factors. An outstanding feature of the model is that it does not require features to be conditionally independent; features useful to the final classification can therefore be added relatively freely without worrying about their mutual influence. In addition, compared with space-distance-based classification methods such as SVM, the maximum entropy model can model multi-class classification problems relatively easily and outputs a relatively objective probability distribution over the classes, which is convenient for subsequent inference steps. These advantages have led to its successful application in information extraction, syntactic analysis, and other natural language processing fields.
(4) Predicate recognition based on a statistical probability model: first, the predicate candidate set is determined according to the grammatical attributes of the words in the sentence; the probability that a candidate acts as the predicate is approximated by maximum likelihood estimation, so that automatic predicate recognition is equivalent to selecting the candidate with the maximum probability given its current contextual features; an absolute discounting model is used to smooth the parameters. Experiments were carried out on a Chinese treebank of 3000 sentences, each manually annotated with syntactic constituents. The best predicate recognition rates reached 80.6% for verb predicates and 83.2% for adjective predicates.
(5) Predicate recognition based on a fuzzy relation matrix: this method designs Chinese grammar rules and, through systematic learning, automatically builds a fuzzy relation matrix to identify Chinese predicates. It identifies not only verbs and adjectives acting as predicates but also nouns; the three cases are processed separately but with the same method. The principle of recognition is: for a sentence, first segment it to obtain its word set W; second, preprocess W for predicate recognition by excluding words that obviously cannot be predicates, yielding a quasi-predicate set, and extract the static features and environmental feature factor set of the quasi-predicates; then unify the quasi-predicate set and the factor set into a fuzzy matrix and multiply it by the feature weight matrix, obtaining a first-order matrix; the quasi-predicate corresponding to the subscript of its greatest element is the predicate of the sentence.
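The scoring step of this prior-art fuzzy-matrix approach (multiply the fuzzy matrix by the feature weights and take the subscript of the greatest element) can be sketched as below. The matrix values and weights are illustrative assumptions, not data from the cited work:

```python
def pick_predicate(fuzzy_matrix, weights):
    """Multiply each quasi-predicate's fuzzy feature row by the feature
    weight vector and return the index of the greatest element of the
    resulting first-order matrix (a vector of scores)."""
    scores = [sum(f * w for f, w in zip(row, weights)) for row in fuzzy_matrix]
    return scores.index(max(scores))

# Illustrative: 3 quasi-predicates scored against 4 feature factors.
fuzzy = [
    [0.2, 0.5, 0.1, 0.3],
    [0.9, 0.7, 0.8, 0.6],
    [0.4, 0.3, 0.2, 0.5],
]
best = pick_predicate(fuzzy, [0.4, 0.2, 0.3, 0.1])  # index 1: second candidate
```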
3. Methods combining rules and statistics
Luo Zhensheng et al. (2003) proposed a predicate recognition method combining rules and feature learning, dividing the whole process into three stages: language-chunk binding, predicate coarse screening, and predicate fine screening. In coarse screening, rule-based filtering removes words that obviously cannot serve as the predicate, yielding a quasi-predicate set. In fine screening, features supporting predicates are selected, the support of each feature for predicates is obtained by statistical computation, and the words in the quasi-predicate set are screened again using the features that appear in their context in the sentence, thereby determining the sentence's predicate; the method adopts the linear discounting approach proposed by H. Ney and U. Essen to handle data sparseness. The statistics and test corpus were mainly selected from news texts of the Sina website: 50 articles, 1951 sentences, about 36910 words. System accuracy is about 88% in closed tests, and the recognition rate is about 85% in open tests.
Summing up the above predicate recognition methods: (1) the accuracy obtained by the various methods generally does not exceed 90%, leaving considerable room for improvement; (2) most of the features used in predicate recognition are lexical, and other higher-level features are seldom used; (3) most methods identify only verb predicates, with little analysis of predicates of other parts of speech such as adjectives and idioms.
Summary of the invention
The objective of the invention is to solve the problem of high-accuracy, high-efficiency predicate recognition by proposing a Chinese predicate recognition method that combines rules and statistics: judgment is performed at two levels, a preliminary rule-based judgment and a secondary judgment by a decision-tree model, and the two judgment results are combined to obtain the final predicate recognition result.
The design concept of the present invention is stepwise recognition: predicates are identified from sentences annotated with lexical and syntactic information. First, lexical analysis is performed on the sentence under test to obtain the suspicious predicates (words that may be predicates) and their number. Then preliminary predicate recognition is carried out using judgment conditions such as whether the number of suspicious predicates is 1. Next, for suspicious predicates that do not satisfy the preliminary judgment conditions, the relevant lexical and syntactic features are extracted and predicate recognition is performed on them with the decision-tree model obtained by C4.5 training. Finally, the recognition results of the two steps are merged to give the predicates of each sentence under test. The Chinese predicate recognition scheme is shown in Figure 1.
The technical scheme of the present invention comprises two processes, training and recognition. The concrete implementation steps are as follows:
Step 1: perform part-of-speech analysis on the words in the lexically and syntactically annotated sentence, and count the suspicious predicates and their number in each sentence. In Chinese, words with certain parts of speech, such as prepositions, auxiliary words, and pronouns, cannot serve as predicates or do so only in rare cases. Therefore, to improve algorithm efficiency without affecting recognition, each word in the sentence is first analyzed by part of speech; words that cannot act as predicates undergo no feature extraction or recognition, and only the suspicious predicates are processed further. Here "sentence" refers to training sentences during the training process and to sentences under test during recognition.
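The part-of-speech filtering in Step 1 can be sketched as follows. The tag names loosely follow the Peking University conventions used by the corpus, but the exact exclusion set is an illustrative assumption rather than the patent's definitive list:

```python
# POS tags assumed (illustratively) to rarely or never serve as predicates:
# p = preposition, u = auxiliary word, r = pronoun, c = conjunction, w = punctuation.
NON_PREDICATE_POS = {"p", "u", "r", "c", "w"}

def suspicious_predicates(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs from lexical analysis.
    Returns the (index, word) pairs that may serve as predicates, plus
    their count; excluded words get no feature extraction or recognition."""
    candidates = [(i, w) for i, (w, pos) in enumerate(tagged_sentence)
                  if pos not in NON_PREDICATE_POS]
    return candidates, len(candidates)

cands, n = suspicious_predicates([("他", "r"), ("喜欢", "v"), ("跑步", "v"), ("。", "w")])
# n == 2: only the two verbs remain as suspicious predicates
```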
Step 2: on the basis of Step 1, perform feature extraction on the annotated corpus and train to finally obtain the decision-tree judgment model. This step is divided into two sub-steps: feature extraction and training a decision tree with the C4.5 algorithm. The annotated corpus refers to corpus with predicate annotation. The detailed process is as follows:
Step 2.1: the input to training-stage feature extraction is the lexically and syntactically annotated training sentences together with the suspicious predicates and their numbers obtained in Step 1. The relevant initial lexical and syntactic features are summarized manually, and the final lexical and syntactic features and predicate labels of the training sentences are then obtained through feature selection experiments.
The purpose of the feature selection experiment is to remove useless or low-impact features and finally select the optimal feature combination (feature subset). The feature subset selection problem is to find a succinct subset of the original feature set such that, after a machine learning algorithm runs on data containing only the features in that subset, it produces a classifier that is as accurate as possible. The key to feature subset selection is therefore to find a succinct and good feature subset. The concrete steps are as follows:
Step 2.1.1: remove each single feature in turn, record the recognition result obtained without it, and sort the features from high to low by that result.
Step 2.1.2: a better recognition result after removal indicates that the removed feature contributes less to the feature combination. Therefore, according to the ranking of Step 2.1.1, remove features successively from high to low and test with the remaining features.
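Steps 2.1.1 and 2.1.2 can be sketched as a leave-one-out ranking followed by successive removal. The evaluator below is a toy stand-in for retraining and scoring the classifier, introduced purely for illustration:

```python
def rank_by_removal(features, evaluate):
    """Step 2.1.1: drop one feature at a time, record the score obtained
    with the rest, and sort high-to-low. A high score *without* a feature
    means that feature contributed little (is least important)."""
    scored = [(f, evaluate([g for g in features if g != f])) for f in features]
    return sorted(scored, key=lambda t: t[1], reverse=True)

def remove_successively(features, evaluate):
    """Step 2.1.2: remove features in that ranked order, re-testing with
    the remaining features each time."""
    order = [f for f, _ in rank_by_removal(features, evaluate)]
    remaining, trace = list(features), []
    for f in order:
        remaining.remove(f)
        trace.append((f, evaluate(remaining)))
    return trace

# Toy evaluator: pretend each feature adds a fixed amount of F-score.
weights = {"path": 3, "pos": 2, "phrase_type": 1}
evaluate = lambda feats: sum(weights[f] for f in feats)
trace = remove_successively(list(weights), evaluate)
# "phrase_type" (least useful) is removed first, "path" last.
```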
Step 2.2: the C4.5 training process takes the lexical and syntactic features and predicate labels obtained in Step 2.1, inputs them to the C4.5 algorithm for training, and finally obtains the predicate decision-tree judgment model.
Step 2.2.1: the C4.5 algorithm is an important machine learning algorithm and an improved version of the ID3 algorithm. Its advantages are that the generated classification rules are easy to understand and accuracy is relatively high; its disadvantage is that building the tree requires repeated sequential scans and sorts of the data set, which makes the algorithm inefficient. The concrete algorithm flow is as follows:
1. Create node N. If the training set is empty, mark N as failure and return. If all records in the training set belong to the same class, mark node N with that class.
2. If the candidate attribute set is empty, return N as a leaf node labeled with the most common class in the training set.
3. For each candidate attribute, if it is continuous, discretize it.
4. Select the attribute D with the highest information gain among the candidate attributes and mark node N with D; for each value d of attribute D, grow from node N a branch with condition D = d.
5. Let s be the set of training samples with D = d. If s is empty, add a leaf labeled with the most common class in the training set; otherwise add the node returned by C4.5(R−{D}, C, s).
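The attribute-selection heart of the flow above can be sketched as follows. Note that C4.5 proper selects by gain ratio (information gain normalized by split information) rather than by raw gain; this sketch implements the gain-ratio variant and is an illustration, not the patent's exact code:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5-style score for splitting on attr: information gain divided by
    split information (which penalizes many-valued attributes)."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    n = len(labels)
    info = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    gain = entropy(labels) - info
    return gain / split_info if split_info else 0.0

def best_attribute(rows, labels, attrs):
    """Step 4 of the flow: pick the attribute with the highest score."""
    return max(attrs, key=lambda a: gain_ratio(rows, labels, a))
```

On a toy data set where part of speech perfectly separates predicates from non-predicates, `best_attribute` picks it over an uninformative attribute.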
Step 2.2.2: since the present invention trains the decision tree with the C4.5 algorithm, parameters must be chosen for it. For C4.5, the parameters to adjust are mainly the pruning confidence factor (confidenceFactor) and the minimum number of instances per branch (minNumObj). The parameter selection experiment proceeds as follows: the confidence factor and the minimum instance number are each varied with a step of a certain size; predicate recognition accuracy, recall, and F-score are obtained for each value pair, and the parameters corresponding to the best recognition result are taken as final.
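The parameter selection experiment amounts to a grid sweep over the two parameters. A minimal sketch, with a hypothetical stand-in scorer in place of actually training and evaluating a C4.5 tree:

```python
def choose_parameters(train_and_score, cf_values, min_obj_values):
    """Sweep confidenceFactor and minNumObj over fixed-step value grids and
    keep the pair that yields the best score (e.g. the F-score)."""
    best_params, best_score = None, float("-inf")
    for cf in cf_values:
        for m in min_obj_values:
            s = train_and_score(cf, m)
            if s > best_score:
                best_params, best_score = (cf, m), s
    return best_params, best_score

# Stand-in scorer peaking at cf=0.25, minNumObj=2 (purely illustrative).
score = lambda cf, m: 0.9 - abs(cf - 0.25) - 0.01 * abs(m - 2)
params, f = choose_parameters(score, [0.1, 0.25, 0.5], [2, 5, 10])
# params == (0.25, 2)
```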
Step 3: after the training process finishes, the recognition process comprises three steps: preliminary recognition, feature extraction, and predicate judgment. The concrete steps are:
Step 3.1: the input to preliminary recognition is the suspicious predicates and their numbers obtained in Step 1, together with the lexically and syntactically annotated sentences under test. The suspicious predicates are preliminarily identified with the relevant judgment conditions: those meeting a condition directly receive a recognition result, and those meeting none proceed to the next feature extraction step. This step uses the rule-based method for preliminary predicate recognition.
The judgment conditions are:
1. If the number of suspicious predicates is 1, that suspicious predicate is the predicate. This condition rests on the convention that any complete sentence must contain at least one predicate.
2. If a suspicious predicate is the verb 是 ("to be") inside a 是…的 structure, it is judged to be a non-predicate.
3. If a suspicious predicate is one of the complement-type words glossed "fall, complete, complete" and immediately follows a verb, it is judged to be a non-predicate.
4. If a suspicious predicate is 说 ("say") and forms a prepositional phrase after a preposition glossed "to", "regarding", or "from", it is judged to be a non-predicate.
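The four judgment conditions of Step 3.1 can be sketched as below. The word lists for conditions 3 and 4 are partly garbled in the source text, so the concrete sets used here (complement words, prepositions) are illustrative assumptions:

```python
COMPLEMENT_WORDS = {"掉", "完"}      # assumed reading of condition 3's word list
PREPOSITIONS = {"对", "就", "从"}     # assumed reading of condition 4's prepositions

def preliminary_identify(tagged, suspicious):
    """tagged: [(word, pos)] for the whole sentence;
    suspicious: indices of the suspicious predicates from Step 1.
    Returns {index: True/False}; indices left out remain undecided and go
    on to feature extraction and the decision-tree judgment."""
    decided = {}
    if len(suspicious) == 1:                        # condition 1
        return {suspicious[0]: True}
    for i in suspicious:
        word, pos = tagged[i]
        if word == "是" and any(w == "的" for w, _ in tagged[i + 1:]):
            decided[i] = False                      # condition 2: 是...的 frame
        elif word in COMPLEMENT_WORDS and i > 0 and tagged[i - 1][1] == "v":
            decided[i] = False                      # condition 3: verb complement
        elif word == "说" and any(w in PREPOSITIONS for w, _ in tagged[:i]):
            decided[i] = False                      # condition 4: prepositional frame
    return decided
```

With a single suspicious predicate the function decides immediately (condition 1); with several, it only rules candidates out, leaving the rest to the decision tree.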
Step 3.2: the input to recognition-stage feature extraction is the lexically and syntactically annotated sentences under test that did not meet the preliminary judgment conditions; the output is the lexical and syntactic features of the corresponding suspicious predicates, namely the features listed in Table 1.
Step 3.3: the input to predicate judgment is the features obtained by the feature extraction of Step 3.2 and the decision-tree judgment model obtained in Step 2.2; the output is the judgment for each suspicious predicate, i.e., whether it is a predicate.
Beneficial effects
Compared with rule-based Chinese predicate recognition methods, the combined rule-and-statistics method adopted by the present invention has high accuracy and a high recognition rate for non-verb predicates. Through feature selection and parameter choice, the present invention achieves higher recognition efficiency and lower computational cost while guaranteeing high accuracy.
Compared with machine learning methods such as maximum entropy and SVM, the present invention realizes final predicate recognition with "rule judgment + C4.5 decision-tree judgment", offering faster recognition and higher recognition accuracy, and it can identify predicates of parts of speech other than verbs; it therefore has good application and popularization value.
Description of drawings
Fig. 1 is the schematic diagram of the predicate recognition method of the present invention;
Fig. 2 is a BFS-CTC sentence annotation example in the embodiment;
Fig. 3 is a syntax tree annotation example from the BFS-CTC tagged corpus in the embodiment;
Fig. 4 is a schematic diagram of the path from the top dj node to the first verb in the embodiment;
Fig. 5 shows the recognition results after removing features successively in the embodiment;
Fig. 6 shows how predicate recognition accuracy grows with data volume in the embodiment: with a step of 3,000 on the horizontal axis, the 21,422 test instances are divided into 7 parts (the last part holds 3,422 instances); starting from 3,000 instances and adding 3,000 each time, a recognition result is obtained at each point.
Embodiment
To better illustrate the objects and advantages of the present invention, the method of the invention is described in further detail below in conjunction with the drawings and embodiments.
To identify predicates with high efficiency and high accuracy, predicate recognition experiments were designed and deployed. To achieve better recognition with a small number of features, a feature selection experiment must first be carried out to remove features that constrain one another or reduce accuracy and to obtain the optimal feature combination; to obtain the best recognition result under the same features and algorithm, the algorithm's parameters must be optimized, so a parameter selection experiment is also carried out.
The experimental data come from the BFS-CTC Chinese tagged corpus (Beijing Forest Studio Chinese Tagged Corpus). Compared with the CPB corpus (Chinese Proposition Bank) mainly used in the Chinese semantic role labeling field at present, the corpus in BFS-CTC adds annotation of sentence-meaning types and provides complete semantic role labeling and the syntagmatic relations between sentence-meaning components.
BFS-CTC was developed by the Information Security and Countermeasure Technology Laboratory of Beijing Institute of Technology. Its raw corpus derives from sentences in news corpora (such as Sohu, Sina, and People's Daily), and all sentences have been annotated with lexical, syntactic, and sentence-meaning structure. The lexical annotation set adopts the Peking University part-of-speech tagging standard; the syntactic annotation set adopts the standard of the Institute of Computational Linguistics of Peking University; the sentence-meaning structure annotation set is formulated according to Mr. Jia Yande's Chinese semantic theory. It defines 4 sentence-meaning types (simple, complex, compound, and multiple sentence meaning), semantic case types (7 basic cases, such as agentive and patient, and 11 general cases, such as time case and space case), 4 predicate types (0-order, 1-order, 2-order, multi-order), and 3 predicate tenses (past, present, future), and it standardizes the relations between Chinese sentence-meaning components. The current scale of BFS-CTC is 10,021 sentences and about 92,000 words, covering the various Chinese sentence patterns such as subject-predicate sentences, non-subject-predicate sentences, serial-verb sentences, and pivotal sentences. Fig. 2 is a sentence annotation example of BFS-CTC.
The experiment uses the 10,021 sentences in BFS-CTC; after part-of-speech filtering there are 24,231 words to be judged, of which 16,029 are predicates and 8,202 are non-predicates.
The experiment uses per-class precision (Precision), recall (Recall), and F-score (F-Score), together with the overall precision of the whole algorithm, as evaluation indices. For a class A, precision, recall, and F-score are computed as in formulas (1), (2), and (3):

precision = (instances correctly classified as A) / (instances classified as A)   (1)

recall = (instances correctly classified as A) / (instances actually belonging to A)   (2)

F-Score = (2 × precision × recall) / (precision + recall)   (3)

Finally, the classification results of all classes are combined to give the overall precision of the algorithm, as in formula (4):

Precision_overall = (correctly classified instances over all classes) / (total number of instances)   (4)
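The evaluation indices can be computed as below; this is a direct sketch of the standard precision/recall/F-score definitions that the formulas express:

```python
def prf(tp, fp, fn):
    """Per-class precision, recall and F-score from true positives,
    false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

def overall_precision(correct_counts, class_totals):
    """Whole-algorithm precision: correctly classified instances of all
    classes over the total number of instances."""
    return sum(correct_counts) / sum(class_totals)

p, r, f = prf(tp=8, fp=2, fn=2)                 # 0.8, 0.8, 0.8
overall = overall_precision([8, 9], [10, 10])   # 17/20 = 0.85
```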
The three experiment flows above are described one by one below. All experiments were completed on the same computer with the following configuration: Intel dual-core CPU (3.00 GHz), 2 GB of memory, Windows XP SP3 operating system.
1. Feature Selection experiment
There are 14 initially chosen features, of which 9 are lexical and 5 are syntactic; they mainly reflect part of speech, the word itself, phrase type, number and aspect, and path. The specific features are listed in Table 1.
Table 1. Predicate recognition features
The feature set shown in Table 1 was chosen at algorithm design time from manual annotation experience and existing predicate recognition features. These features comprise lexical and syntactic features; the syntactic features are built on the BFS-CTC syntactic annotation and have good representativeness and very high discrimination.
Among them, the position feature is specific to the lexical level and reflects two kinds of information about the verbs in a sentence: position (POS) and distance (Dis). They are computed as in formulas (5) and (6), where M is the total number of verbs in the sentence and O_i is the word order of the i-th verb:

POS = (1/M) × Σ_{i=1}^{M} O_i for M ≥ 1, and POS = 0 when M = 0   (5)

Dis = (1/(M−1)) × Σ_{i=1}^{M−1} (O_{i+1} − O_i) for M ≥ 2, and Dis = 0 when M = 0 or 1   (6)
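Reading formulas (5) and (6) as the mean verb position and the mean gap between consecutive verbs, the two features can be computed as below (a sketch under that reading):

```python
def position_feature(verb_orders):
    """Formula (5): mean word order of the M verbs in the sentence;
    0 when the sentence contains no verb."""
    m = len(verb_orders)
    return sum(verb_orders) / m if m else 0.0

def distance_feature(verb_orders):
    """Formula (6): mean gap between consecutive verbs; 0 when M is 0 or 1."""
    m = len(verb_orders)
    if m < 2:
        return 0.0
    return sum(verb_orders[i + 1] - verb_orders[i] for i in range(m - 1)) / (m - 1)

# Verbs at word positions 2, 5 and 8:
pos = position_feature([2, 5, 8])   # 5.0
dis = distance_feature([2, 5, 8])   # 3.0
```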
The path refers to the sequence of annotated nodes passed through from one node to another in the syntax tree. Fig. 3 shows a syntax tree annotation example from the BFS-CTC tagged corpus. In this syntax tree, the path from Top-Sentence to the first verb (the first verb appearing in the sentence) is dj↓vp↓v, as shown in Fig. 4.
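Extracting such a path from a bracketed syntax tree can be sketched as follows. The tree below is a made-up example in the BFS-CTC node-label style (dj, np, vp, v, n), not the actual sentence of Fig. 3:

```python
def path_to_first_verb(tree):
    """tree: (label, children) nested tuples; a leaf is (label, word).
    Returns the labels on the path from the root to the first v node,
    joined with the down-arrow notation used in the text."""
    def dfs(node, path):
        label, children = node
        path = path + [label]
        if label == "v":
            return path
        if isinstance(children, str):   # leaf holding a word, not a verb
            return None
        for child in children:
            found = dfs(child, path)
            if found:
                return found
        return None
    result = dfs(tree, [])
    return "↓".join(result) if result else ""

tree = ("dj", [("np", [("n", "研究")]),
               ("vp", [("v", "取得"), ("np", [("n", "进展")])])])
path = path_to_first_verb(tree)   # "dj↓vp↓v"
```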
As stated above, the purpose of the feature selection experiment is to remove useless or low-impact features and to select the optimal feature combination, i.e., a succinct and good feature subset that lets the learning algorithm produce a classifier as accurate as possible.
The specific procedure of the feature selection experiment is:
Step 1: remove each feature singly, record the recognition result, and sort the features from high to low by recognition effect; the recognition results are shown in Table 2.
Table 2 Recognition results after removing a single feature
Table 2 is sorted by F value descending, then by recognition accuracy descending, then by total recognition errors ascending (the total number of words where a predicate is mistaken for a non-predicate or a non-predicate is mistaken for a predicate). Because the F value combines accuracy and recall, it is the primary key: a higher F value after removing a feature indicates that the feature is less important. Sorting by F value descending is therefore equivalent to sorting the features by importance ascending. Ties in F value are broken by accuracy descending, and ties in both F value and accuracy are broken by total errors ascending.
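The three-key ordering just described can be expressed as a single sort key. The feature numbers and scores below are illustrative stand-ins, not the actual values of Table 2.

```python
# Rank features by: F value descending, accuracy descending, errors ascending.
# Keys that sort descending are negated; "errors" stays ascending.

rows = [
    {"feature": 3, "f": 0.993, "acc": 0.994, "errors": 130},
    {"feature": 8, "f": 0.981, "acc": 0.985, "errors": 310},
    {"feature": 12, "f": 0.993, "acc": 0.995, "errors": 120},
]

ranked = sorted(rows, key=lambda r: (-r["f"], -r["acc"], r["errors"]))
print([r["feature"] for r in ranked])  # least important feature first
```

In this toy ranking, features 12 and 3 tie on F value, so the tie is broken by accuracy, yielding the order 12, 3, 8.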
Step 2: following the ranking of Step 1, remove the features one by one and test with the remaining features; the recognition results are shown in Table 3 and Figure 5.
Table 3 Recognition results after removing features one by one
As shown in Figure 5, when the feature numbered 8 is removed, recognition accuracy, recall, and F value all drop noticeably; as subsequent features are removed, the F value declines steadily while accuracy and recall oscillate downward, indicating that the features numbered 8, 5, 2, 10, 7, and 1 have a large influence on predicate recognition. When the features numbered 5, 2, and 7 are removed, accuracy and recall jump sharply, indicating that these three features strongly affect both measures.
Combining the experiments and analysis above, the 7 features numbered 1, 2, 5, 7, 8, 10, and 14 are selected as the optimal feature combination (optimal feature subset). The recognition accuracy at this point, 99.5%, is identical to that obtained with all features; the F value, 99.2%, is only 0.02 percentage points lower than with all features. Using only half of the features cuts model training time by about two thirds (24,231 data, ten-fold cross validation: originally 0.39 s, now 0.13 s).
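The two-step procedure (score single-feature removals, then strip features in that order while quality holds) amounts to greedy backward feature elimination. The sketch below shows the control flow only; `evaluate` is a stub standing in for the ten-fold cross-validated F value, and the notion that features 1, 2, and 5 carry the signal is an assumption for the demo.

```python
# Greedy backward elimination: repeatedly drop the feature whose removal
# hurts the score least, stopping when any removal would cost more than
# `tolerance`. evaluate() is a stub; real use would run the classifier.

def evaluate(features):
    useful = {1, 2, 5}  # stub assumption: only these features matter
    return len(useful & set(features)) / len(useful)

def backward_select(features, tolerance=0.0):
    current = list(features)
    best = evaluate(current)
    while len(current) > 1:
        scored = [(evaluate([f for f in current if f != g]), g) for g in current]
        score, victim = max(scored)
        if best - score > tolerance:
            break  # every remaining feature is load-bearing
        current = [f for f in current if f != victim]
        best = score
    return current

print(sorted(backward_select([1, 2, 3, 4, 5])))  # → [1, 2, 5]
```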
2. Parameter choice experiment
Because the invention trains the decision tree with the C4.5 algorithm, parameters must be chosen for C4.5. The main parameters to tune are the pruning confidence factor, confidenceFactor, and the minimum number of instances per branch, minNumObj, abbreviated below as C and M. Using the WEKA tool, C is swept from 0.1 to 0.7 in steps of 0.05 and M from 0 to 6 in steps of 1, and the recognition accuracy, recall, and F value under each parameter pair are obtained; the recognition results are shown in Table 4. The resulting data are then analyzed to give the optimal parameters.
Table 4 Parameter selection results for the C4.5 algorithm in predicate recognition
As can be seen from Table 4, the F value of predicate recognition is highest, reaching 0.994, when C=0.45 with M=0 or 1, or C=0.5 with M=0 or 1. To improve recognition on open test sets, the rule that a smaller C and a larger M are preferable is applied, and (C, M)=(0.45, 1) is selected as the optimal parameter setting.
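The (C, M) sweep is a plain grid search. In the sketch below, `evaluate_f` is a stub standing in for "train C4.5 in WEKA with these parameters and return the cross-validated F value"; its peak near (0.45, 1) is an assumption so the sweep is runnable end to end.

```python
# Grid sweep over the two C4.5 parameters: C from 0.10 to 0.70 in steps of
# 0.05, M from 0 to 6 in steps of 1, keeping the pair with the best F value.

def evaluate_f(c, m):
    # Stub (assumption): a surface peaked at (0.45, 1), mimicking Table 4.
    return 0.994 - abs(c - 0.45) * 0.01 - abs(m - 1) * 0.001

def grid_search():
    best = (None, None, -1.0)
    c = 0.10
    while c <= 0.70 + 1e-9:            # guard against float drift at 0.70
        for m in range(0, 7):
            f = evaluate_f(round(c, 2), m)
            if f > best[2]:
                best = (round(c, 2), m, f)
        c += 0.05
    return best

c, m, f = grid_search()
print((c, m))  # → (0.45, 1)
```

Note that WEKA's J48 implements C4.5-style pruning directly; scikit-learn's `DecisionTreeClassifier` is CART-based, so its parameters only approximate C and M.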
3. Predicate recognition experiment
From all 24,231 test data, the cases where a sentence has only one suspicious predicate are removed, leaving 21,422 test data. These are divided into 7 parts in steps of 3,000 (the last part holding 3,422 data). Starting from 3,000 data and adding 3,000 each time, ten-fold cross validation is used at every step to train a decision tree decision model and perform predicate recognition, and the corresponding recognition result is recorded. Ten-fold cross validation means dividing the raw data into ten parts; each round, nine parts are used to train the decision model and the remaining part is used for testing. This loops ten times so that every part is tested exactly once, and the ten recognition results are averaged to obtain the final result.
This experiment uses the feature subset and C4.5 parameters obtained in the first two experiments: the 7 features numbered 1, 2, 5, 7, 8, 10, and 14 as the optimal feature combination, and the C4.5 parameters set to (C, M)=(0.45, 1). The concrete steps are:
Step 1: perform part-of-speech analysis on the words in the 21,422 test data, and simultaneously count the suspicious predicates and their number in each sentence.
Step 2: on the basis of Step 1, the training process is divided into two sub-steps, feature extraction and C4.5 decision tree training. The detailed process is as follows:
Step 2.1: the feature extraction of the training stage takes as input the suspicious predicates and their numbers obtained in Step 1, together with the lexically and syntactically annotated sentences, and outputs the corresponding lexical features, syntactic features, and predicate labels of each suspicious predicate used for training. The features are as shown in Table 5.
Table 5 Predicate recognition features
Step 2.2: the parameters of the C4.5 algorithm are set to (C, M)=(0.45, 1). The decision tree training process takes as input the lexical and syntactic features and predicate labels obtained by the feature extraction of Step 2.1, and outputs the decision tree decision model.
Step 3: after training finishes, the identification process comprises three steps, preliminary identification, feature extraction, and predicate judgement. The concrete steps are:
Step 3.1: the preliminary identification takes as input the suspicious predicates and their numbers obtained in Step 1 and the lexically and syntactically annotated sentences, and preliminarily identifies the suspicious predicates with the relevant decision conditions. Candidates that satisfy a decision condition receive a recognition result directly; those that satisfy none proceed to the feature extraction of the next step.
The decision conditions are:
1. If the suspicious predicate is the copular verb rendered "Yes" ("to be") and lies inside a "being ..." structure, it is judged to be a non-predicate.
2. If the suspicious predicate is one of the complements rendered "fall, complete, complete" and immediately follows a verb, it is judged to be a non-predicate.
3. If the suspicious predicate is one of the words rendered ", say,, say" and forms a prepositional phrase after one of the prepositions rendered "to", "just", or "from", it is judged to be a non-predicate.
Step 3.2: the feature extraction takes as input the suspicious predicates that did not satisfy the preliminary decision conditions and the lexically and syntactically annotated sentences, and outputs the lexical and syntactic features of the corresponding suspicious predicates. The features are those listed in Table 5.
Step 3.3: the predicate judgement inputs the feature values of the test data obtained in Step 3.2 into the decision tree decision model obtained in Step 2.2, and outputs the judgement result for each suspicious predicate, namely whether or not it is a predicate.
The recognition results are shown in Table 6 and Figure 6.
Table 6 Incremental data-volume experimental results based on C4.5
As can be seen from Figure 6, the recognition accuracy, recall, and F value of predicate recognition converge toward the same value as the data volume increases, with an overall rising trend. At the present volume of 21,422 data, accuracy, recall, and F value almost converge to a single point. As the data volume continues to grow, these three indicators may diverge or may hold steady and stabilize. It can therefore be judged that a model trained on roughly 21,422 data should have good discrimination.
Step 4: on the same data source (all 24,231 test data), again using ten-fold cross validation, the invention is compared with the traditional SVM-based predicate recognition algorithm; the comparison results are shown in Table 7. The process follows Steps 1-3, where the decision conditions of the preliminary identification are:
1. If the number of suspicious predicates is 1, that suspicious predicate is the predicate.
2. If the suspicious predicate is the copular verb rendered "Yes" ("to be") and lies inside a "being ..." structure, it is judged to be a non-predicate.
3. If the suspicious predicate is one of the complements rendered "fall, complete, complete" and immediately follows a verb, it is judged to be a non-predicate.
4. If the suspicious predicate is one of the words rendered ", say,, say" and forms a prepositional phrase after one of the prepositions rendered "to", "just", or "from", it is judged to be a non-predicate.
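A preliminary rule pass of this kind, which settles candidates that match a condition and forwards the rest to the statistical stage, can be sketched as follows. The rule functions here are simplified stand-ins for conditions of this type, not the patent's actual lexical tests on annotated Chinese tokens.

```python
# Rule-based preliminary identification: each rule returns True (predicate),
# False (non-predicate), or None (no opinion). Candidates no rule decides
# fall through to feature extraction + the C4.5 model.

def preliminary_identify(suspicious, rules):
    """Return (decided, undecided): decided maps candidate word -> verdict,
    undecided collects candidates needing the statistical stage."""
    decided, undecided = {}, []
    for cand in suspicious:
        for rule in rules:
            verdict = rule(cand, suspicious)
            if verdict is not None:
                decided[cand["word"]] = verdict
                break
        else:
            undecided.append(cand)
    return decided, undecided

# Condition 1, schematically: a lone suspicious predicate is the predicate.
only_candidate = lambda cand, all_c: True if len(all_c) == 1 else None
# Condition 3, schematically: a complement right after a verb is non-predicate.
after_verb = lambda cand, all_c: False if cand.get("follows_verb") else None

cands = [{"word": "跑", "follows_verb": False}, {"word": "完", "follows_verb": True}]
decided, undecided = preliminary_identify(cands, [only_candidate, after_verb])
print(decided, [c["word"] for c in undecided])  # {'完': False} ['跑']
```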
Table 7 Predicate recognition comparison experiment
As shown in Table 7, on the same data source the predicate recognition algorithm of the invention improves recognition accuracy by 11 percentage points over the SVM-based algorithm, reaching 99.6%, and improves the F value by 9 percentage points, reaching 99%. This shows not only that the algorithm of the invention recognizes better than the SVM-based algorithm, but also that, with predicate recognition accuracy near 100%, it is a significant improvement for subsequent sentence-meaning analysis and research.
The results of the above three experiments show that the invention has the characteristics of high accuracy and high speed. The feature selection experiment reduces the feature dimensionality while preserving recognition accuracy, greatly improving predicate recognition speed; the parameter choice experiment obtains the best recognition result under the same features and algorithm. The incremental data-volume results on the BFS-CTC corpus show a highest recognition accuracy of 99.6%, a recall of 99%, and an F value of 99.3%.

Claims (6)

1. A high-precision Chinese predicate recognition method, characterized by adopting stepwise identification: first, lexical analysis is performed on the sentence to be tested, obtaining the suspicious predicates and their number; then preliminary predicate recognition is carried out with decision conditions such as whether the number of suspicious predicates is 1; next, for the suspicious predicates that do not satisfy the preliminary decision conditions, the relevant lexical and syntactic features are extracted and predicate recognition is performed on them with the decision tree decision model obtained by C4.5 training; finally, the recognition results of the two steps are gathered to give the predicates of each sentence to be tested. The invention not only further improves predicate recognition accuracy but also effectively reduces the time overhead of training and identification, and can effectively recognize cases where a non-verb acts as the predicate. The method comprises the following steps:
Step 1: perform part-of-speech analysis on the words in the lexically and syntactically annotated sentences, and count the suspicious predicates and their number in each sentence. In Chinese, words of certain parts of speech, such as prepositions, auxiliary words, and pronouns, cannot serve as predicates or do so only in rare cases. Therefore, to improve the efficiency of the algorithm without affecting recognition, each word in the sentence is first analyzed for part of speech, and words that cannot act as predicates are excluded from feature extraction and identification; only the words that may become predicates (the suspicious predicates) are processed further. The sentences here refer to the training sentences during training and to the sentences to be tested during identification.
Step 2: on the basis of Step 1, perform feature extraction on the annotated corpus and train to finally obtain the decision tree decision model. This step is divided into two sub-steps, feature extraction and C4.5 decision tree training. The annotated corpus refers to the corpus carrying predicate labels; the detailed process is as follows:
Step 2.1: the feature extraction of the training stage takes as input the lexically and syntactically annotated training sentences and the suspicious predicates and their numbers obtained in Step 1. Relevant initial lexical and syntactic features are summarized manually, and the final lexical features, syntactic features, and predicate labels of the training sentences are then obtained through the feature selection experiment.
The purpose of the feature selection experiment is to remove useless or low-impact features and finally select the optimal feature combination (feature subset). The feature subset selection problem is to find a concise subset of the original feature set such that a machine learning algorithm, run on data containing only the features of this subset, produces a classifier of the highest possible accuracy. The key of feature subset selection is therefore to find a subset that is both concise and effective. The concrete steps are as follows:
Step 2.1.1: remove each feature singly, record the recognition result, and sort the features from high to low by recognition effect.
Step 2.1.2: the better the recognition effect after removal, the smaller the removed feature's contribution to the feature combination. Therefore, following the ranking of Step 2.1.1, remove the features one by one from high to low recognition effect, and test with the remaining features.
Step 2.2: the C4.5 decision tree training process takes the lexical features, syntactic features, and predicate labels obtained in Step 2.1, inputs them to the C4.5 algorithm for training, and finally obtains the predicate decision tree decision model.
Step 2.2.1: the C4.5 algorithm is an important machine learning algorithm, an improved version of the ID3 algorithm. Its advantages are that the generated classification rules are easy to understand and the accuracy is relatively high; its shortcoming is that building the tree requires repeated sequential scans and sorts of the data set, which lowers efficiency. The concrete algorithm flow is as follows: (1) create node N; if the training set is empty, return node N marked as failure; if all records in the training set belong to the same class, mark node N with that class; (2) if the candidate attribute set is empty, return N as a leaf node marked with the most common class in the training set; (3) for each candidate attribute, if it is continuous, discretize it; (4) select the candidate attribute D with the highest information gain and mark node N with attribute D; for each value d of attribute D, grow from node N a branch with the condition D=d; (5) let s be the set of training samples with D=d in the training set; if s is empty, add a leaf marked with the most common class in the training set; otherwise add the node returned by C4.5(R-{D}, C, s).
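Step (4) of the flow above hinges on choosing the split attribute by information gain. A minimal sketch of that computation follows; note this is the ID3 criterion, while C4.5 proper refines it to the gain ratio, and the toy data (attribute "a" determining the class, "b" being noise) is an assumption for the demo.

```python
# Information gain: entropy of the labels minus the weighted entropy of the
# label distribution after splitting on an attribute.

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, label="y"):
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [{"a": 0, "b": 0, "y": 0}, {"a": 0, "b": 1, "y": 0},
        {"a": 1, "b": 0, "y": 1}, {"a": 1, "b": 1, "y": 1}]
best = max(["a", "b"], key=lambda attr: info_gain(rows, attr))
print(best)  # → a
```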
Step 2.2.2: since the invention trains the decision tree with the C4.5 algorithm, parameters must be chosen for C4.5. The main parameters to adjust are the pruning confidence factor, confidenceFactor, and the minimum number of instances per branch, minNumObj. The concrete method of the parameter choice experiment is: sweep the confidence factor and the minimum instance number each by a step size of a certain magnitude, obtain the predicate recognition accuracy, recall, and F value for each pair of values, and take the parameters corresponding to the best recognition result as the final parameters.
Step 3: after training finishes, the identification process comprises three steps, preliminary identification, feature extraction, and predicate judgement. The concrete steps are:
Step 3.1: the preliminary identification takes as input the suspicious predicates and their numbers obtained in Step 1 and the lexically and syntactically annotated sentences to be tested, and preliminarily identifies the suspicious predicates with the relevant decision conditions. Candidates that satisfy a decision condition receive a recognition result directly; those that satisfy none proceed to the feature extraction of the next step. This step uses a rule-based method to carry out the preliminary identification of predicates.
The decision conditions are:
(1) If the number of suspicious predicates is 1, that suspicious predicate is the predicate. This condition is based on a convention: any complete sentence must contain at least one predicate.
(2) If the suspicious predicate is the copular verb rendered "Yes" ("to be") and lies inside a "being ..." structure, it is judged to be a non-predicate.
(3) If the suspicious predicate is one of the complements rendered "fall, complete, complete" and immediately follows a verb, it is judged to be a non-predicate.
(4) If the suspicious predicate is one of the words rendered ", say,, say" and forms a prepositional phrase after one of the prepositions rendered "to", "just", or "from", it is judged to be a non-predicate.
Step 3.2: the feature extraction of the identification process takes as input the lexically and syntactically annotated sentences to be tested and the suspicious predicates that did not satisfy the preliminary decision conditions, and outputs the lexical and syntactic features of the corresponding suspicious predicates.
Step 3.3: the predicate judgement takes as input the features obtained by the feature extraction of Step 3.2 and the decision tree decision model obtained in Step 2.2, and outputs the judgement result for each suspicious predicate, namely whether or not it is a predicate.
2. The high-precision Chinese predicate recognition method according to claim 1, characterized in that: Step 1 performs part-of-speech analysis on the lexically and syntactically annotated sentences, counts the suspicious predicates and their number, and removes words such as prepositions, auxiliary words, and pronouns that cannot serve as predicates or do so only in rare cases, preparing for the preliminary judgement and final decision of the following steps.
3. The high-precision Chinese predicate recognition method according to claim 1, characterized in that: the features adopted in Step 2.1 are as shown in Table 1. They comprise lexical features and syntactic features, the syntactic features being built on the BFS-CTC syntactic annotation standard; these features have good representativeness and very high discriminative power, and feature selection is carried out in the embodiment to obtain the optimal feature combination.
4. The high-precision Chinese predicate recognition method according to claim 1, characterized in that: the training process described in Step 2.2 uses the C4.5 algorithm to train the decision tree decision model, exploiting the advantages of C4.5 that the generated classification rules are easy to understand and the accuracy is relatively high.
5. The high-precision Chinese predicate recognition method according to claim 1, characterized in that: the preliminary identification described in Step 3.1 applies the relevant decision conditions to the suspicious predicates and their numbers obtained in Step 1 and the lexically and syntactically annotated sentences to be tested, preliminarily identifying the suspicious predicates. Candidates that satisfy a decision condition receive a recognition result directly; those that satisfy none proceed to the feature extraction of the next step. The decision conditions are:
(1) If the number of suspicious predicates is 1, that suspicious predicate is the predicate. This condition is based on a convention: any complete sentence must contain at least one predicate.
(2) If the suspicious predicate is the copular verb rendered "Yes" ("to be") and lies inside a "being ..." structure, it is judged to be a non-predicate.
(3) If the suspicious predicate is one of the complements rendered "fall, complete, complete" and immediately follows a verb, it is judged to be a non-predicate.
(4) If the suspicious predicate is one of the words rendered ", say,, say" and forms a prepositional phrase after one of the prepositions rendered "to", "just", or "from", it is judged to be a non-predicate.
6. The high-precision Chinese predicate recognition method according to claim 1, characterized in that: the feature extraction of the Step 3.2 identification process takes as input the lexically and syntactically annotated sentences to be tested and the suspicious predicates that did not satisfy the preliminary decision conditions, and outputs the lexical and syntactic features of the corresponding suspicious predicates.
CN201310080760.3A 2013-03-14 2013-03-14 A kind of High-precision Chinese predicate identification method Expired - Fee Related CN103150381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310080760.3A CN103150381B (en) 2013-03-14 2013-03-14 A kind of High-precision Chinese predicate identification method


Publications (2)

Publication Number Publication Date
CN103150381A true CN103150381A (en) 2013-06-12
CN103150381B CN103150381B (en) 2016-03-02

Family

ID=48548458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310080760.3A Expired - Fee Related CN103150381B (en) 2013-03-14 2013-03-14 A kind of High-precision Chinese predicate identification method

Country Status (1)

Country Link
CN (1) CN103150381B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090326923A1 (en) * 2006-05-15 2009-12-31 Panasonic Corporatioin Method and apparatus for named entity recognition in natural language
CN101866337A (en) * 2009-04-14 2010-10-20 日电(中国)有限公司 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Yingying et al.: "BFS-CTC: A Chinese Tagged Corpus of Sentential Semantic Structure" (BFS-CTC汉语句义结构标注语料库), Journal of Chinese Information Processing (中文信息学报) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106055537A (en) * 2016-05-23 2016-10-26 王立山 Natural language machine recognition method and system
CN106531160A (en) * 2016-10-26 2017-03-22 安徽省云逸智能科技有限公司 Continuous speech recognition system based on wordnet language model
CN106844333A (en) * 2016-12-20 2017-06-13 竹间智能科技(上海)有限公司 A kind of statement analytical method and system based on semantic and syntactic structure
US11334720B2 (en) 2019-04-17 2022-05-17 International Business Machines Corporation Machine learned sentence span inclusion judgments
CN110716957A (en) * 2019-09-23 2020-01-21 珠海市新德汇信息技术有限公司 Intelligent mining and analyzing method for class case suspicious objects
CN111062210A (en) * 2019-12-25 2020-04-24 贵州大学 Neural network-based predicate center word identification method
CN111538893B (en) * 2020-04-29 2021-01-05 四川大学 Method for extracting network security new words from unstructured data
CN112966477A (en) * 2021-03-05 2021-06-15 浪潮云信息技术股份公司 Method and system for stating words and sentences based on sequence annotation
CN112966477B (en) * 2021-03-05 2023-08-29 浪潮云信息技术股份公司 Method and system for stating words and sentences based on sequence annotation
CN113869069A (en) * 2021-09-10 2021-12-31 厦门大学 Machine translation method based on dynamic selection of decoding path of translation tree structure
CN114417807A (en) * 2022-01-24 2022-04-29 中国电子科技集团公司第五十四研究所 Human-like language description expression method oriented to presence or absence of human collaboration scene
CN114417807B (en) * 2022-01-24 2023-09-22 中国电子科技集团公司第五十四研究所 Human-like language description expression method for collaboration scene of presence or absence

Also Published As

Publication number Publication date
CN103150381B (en) 2016-03-02

Similar Documents

Publication Publication Date Title
CN103150381B (en) A kind of High-precision Chinese predicate identification method
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN108763402A (en) Class center vector Text Categorization Method based on dependence, part of speech and semantic dictionary
El-Shishtawy et al. An accurate arabic root-based lemmatizer for information retrieval purposes
CN110727796A (en) Multi-scale difficulty vector classification method for graded reading materials
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
Kenyon-Dean et al. Verb phrase ellipsis resolution using discriminative and margin-infused algorithms
Küçük Automatic compilation of language resources for named entity recognition in Turkish by utilizing Wikipedia article titles
Batura et al. A method for automatic text summarization based on rhetorical analysis and topic modeling
Das Dawn et al. A comprehensive review of Bengali word sense disambiguation
Subha et al. Quality factor assessment and text summarization of unambiguous natural language requirements
Chader et al. Sentiment Analysis for Arabizi: Application to Algerian Dialect.
Mazari et al. Deep learning-based sentiment analysis of algerian dialect during Hirak 2019
Ibrahim et al. Bel-Arabi: advanced Arabic grammar analyzer
CN111597793A (en) Paper innovation measuring method based on SAO-ADV structure
Bakaev et al. Creation of a morphological analyzer based on finite-state techniques for the Uzbek language
Patel et al. Influence of Gujarati STEmmeR in supervised learning of web page categorization
Das et al. Analysis of bangla transformation of sentences using machine learning
Lu et al. Attributed rhetorical structure grammar for domain text summarization
Maheswari et al. Rule based morphological variation removable stemming algorithm
Kaur et al. Keyword extraction for punjabi language
Sangounpao et al. Ontology-based naive bayes short text classification method for a small dataset
Çelebi et al. Cluster-based mention typing for named entity disambiguation
Wimalasuriya Automatic text summarization for sinhala
Utama et al. An Automatic Construction for Class Diagram from Problem Statement using Natural Language Processing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160302

Termination date: 20170314