CN102955818A

CN102955818A - Method for acquiring full names in Chinese from Web page

Info

Publication number: CN102955818A
Application number: CN2011102531001A
Authority: CN
Inventors: 王石; 丁远钧; 符建辉; 王卫民
Original assignee: KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Current assignee: KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority date: 2011-08-31
Filing date: 2011-08-31
Publication date: 2013-03-06

Abstract

The invention relates to a method for acquiring Chinese full names from Web pages. The method comprises: inputting a known abbreviation; selecting a query pattern and constructing a query term; submitting the query term to Google and collecting the returned anchor texts; extracting a full-name/abbreviation corpus from the anchor texts; extracting candidate full names with an extraction algorithm; and ranking the candidates with a priority composite function. Two query patterns are used, each with its own full-name extraction algorithm. The invention also defines an ontology of the relation between a full name and its abbreviation, consisting of a set of constraint axioms, which express the constraints between full name and abbreviation qualitatively, and a set of constraint functions, which express them quantitatively. Based on this ontology, a full-name verification method and a full-name classification method are proposed. The method acquires full names at large scale and with high accuracy and explores the computer classification of full names, thereby supporting the intelligent acquisition of large-scale knowledge.

Description

A method for acquiring Chinese full names from Web pages

Technical field

The present invention relates to full-name acquisition technology in the fields of Chinese information processing and information retrieval, and in particular to a method for acquiring Chinese full names from Web pages that is multidisciplinary, large-scale, and highly accurate.

Background technology

Natural language processing is a major topic in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. With the widespread use of computers and the Internet, the volume of natural-language text accessible to computers has grown at an unprecedented rate, and demand is growing rapidly for applications such as text mining over massive information, information extraction, cross-language information processing, and human-computer interaction. The object of natural language processing has accordingly shifted from small-scale restricted language to large-scale real text, and this research will have a far-reaching influence on people's lives.

Chinese information processing studies how to process Chinese information automatically by computer. Chinese is a paratactic language: compared with Western languages it lacks explicit markers, and its grammar, semantics, and pragmatics are more flexible, which makes it harder for computers to understand and process. Many difficulties remain before computers can process Chinese information well. At present, Chinese information processing has achieved results in areas such as speech recognition, word segmentation, and machine translation. Greater automation of Chinese information processing will bring considerable benefits to China's science and technology, culture, economy, and security.

Information retrieval studies techniques for obtaining the needed information quickly and accurately from large volumes of complex information. After many years of development, information retrieval technology is now fairly mature, and new retrieval techniques are developing toward intelligence, mobility, diversification, and personalization.

A full name (Fn) is the complete designation of a name. An abbreviation (An) is the designation obtained by compressing and simplifying a full name for the sake of concise expression. If Fn and An stand in the full-name/abbreviation relation, Fn is called the full name of An and An the abbreviation of Fn, written FA(Fn, An). Going from a full name to an abbreviation can be viewed as compressing information, and going from an abbreviation to a full name as decompressing it. For example, compressing c1 = "Inst. of Computing Techn. Academia Sinica" yields c2 = "institute is calculated by the Chinese Academy of Sciences", and compressing c2 again yields c3 = "Computer Department of the Chinese Academy of Science"; decompressing c3 yields c2, and decompressing c2 yields c1. Full name and abbreviation are relative concepts: in this example c2 is an abbreviation relative to c1 but a full name relative to c3, so it is meaningless to say that c2 taken alone is a full name or an abbreviation.

Acquiring the full-name/abbreviation relation is a basic and crucial problem in applications such as knowledge acquisition from text (KAT) and information retrieval. Acquisition methods fall into two broad classes. Pattern-based methods rely mainly on linguistics and natural language processing: they extract relation patterns through lexical and syntactic analysis and then obtain full-name/abbreviation relations by pattern matching, so their accuracy depends on the linguistic knowledge and the pattern base. Statistics-based methods rely mainly on corpora and statistical language models: they obtain full-name/abbreviation relations by computing the degree of association between concepts, and their accuracy and efficiency rarely reach the level that practical use requires. The acquisition problem can also be approached from two angles: mining, which obtains full-name/abbreviation pairs without any external input, and lookup, which finds the abbreviation of a known full name or the full name of a known abbreviation.

Unless otherwise specified, "full name" and "abbreviation" in the present invention refer to a Chinese full name and a Chinese abbreviation.

Summary of the invention

To address the limited applicability and low accuracy of existing techniques for acquiring the full-name/abbreviation relation, the invention provides a highly accurate method for acquiring Chinese full names from Web pages that is suitable for multidisciplinary, very-large-scale use.

To solve the above problem, the invention provides a method for acquiring Chinese full names from Web pages, comprising the steps of:

Step 1: input a given Chinese abbreviation;

Step 2: select a query pattern and construct a query term, submit the query term to the Google search engine, and save the first N anchor texts as the anchor corpus;

Step 3: using regular expressions, extract from the anchor corpus the sentences that contain the relation expressed by the query term, and save them as the full-name/abbreviation corpus;

Step 4: use the full name extraction algorithm EFN to extract candidate full names from the full-name/abbreviation corpus, forming the candidate full name set;

Step 5: verify the candidate full name set against the full-name/abbreviation relation constraint, forming the full name set;

Step 6: classify the full name set against the full-name/abbreviation relation constraint, forming a full name set with classification labels.

In the above technical solution, step 2 uses two query patterns: query pattern 1, "abbreviated as An", and query pattern 2, "An full name". In an experiment with 4000 Chinese abbreviations, query pattern 1 returned an anchor corpus for 88.75% of them, query pattern 2 for 24.76%, and at least one of the two patterns for 91.07%. Therefore, to improve search efficiency, query pattern 1 is tried first and query pattern 2 second.

In the above technical solution, the full name extraction algorithm EFN of step 4 comprises two algorithms, EFN1 and EFN2, which correspond to the two query patterns of step 2: when query pattern 1 is selected in step 2, EFN1 is used to extract Fn in step 4, and when query pattern 2 is selected, EFN2 is used.

In the above technical solution, if the full name set produced by step 5 is empty and an untried query pattern remains in step 2, steps 2 to 6 are executed again; if the full name set is empty and no alternative query pattern remains, the method exits, indicating that the full name of the given abbreviation cannot be found on the Web.
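The control flow of steps 2 to 6, including the fallback from one query pattern to the next, can be sketched as follows. This is a minimal illustration, not the patented implementation: `search`, `verify`, `classify`, and the per-pattern `build_query` and `extract` callables are hypothetical stand-ins supplied by the caller, and the substring filter in step 3 merely stands in for the regular-expression filter.

```python
def acquire_full_names(an, patterns, search, verify, classify):
    """Try each query pattern in priority order (steps 2 to 6).

    patterns: list of (build_query, extract) pairs, e.g. pattern 1 with EFN1
              first and pattern 2 with EFN2 second (hypothetical callables).
    Returns a list of (full_name, label) pairs, or [] if every pattern fails.
    """
    for build_query, extract in patterns:
        anchors = search(build_query(an))             # step 2: anchor corpus
        sentences = [a for a in anchors if an in a]   # step 3: stand-in filter
        candidates = extract(sentences, an)           # step 4: candidate full names
        full_names = [c for c in candidates if verify(c, an)]  # step 5
        if full_names:                                # non-empty: classify and stop
            return [(fn, classify(fn, an)) for fn in full_names]  # step 6
    return []  # no pattern produced a verified full name
```

Passing the patterns as an ordered list keeps the priority rule (pattern 1 before pattern 2) in the data rather than in the control flow.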

In the above technical solution, the full-name/abbreviation relation constraint of step 5 is a four-tuple R = (Fn, An, F, A), where Fn is the full name of an object, An is the abbreviation of the object, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively. Both kinds of constraint are explained further below.

Beneficial effect: the invention obtains from the Web the full name corresponding to a known abbreviation, that is, it acquires the full-name/abbreviation relation from the lookup angle. It uses a pattern-based method to obtain candidate full names from Google and a statistics-based method to verify them. It is multidisciplinary, large-scale, and highly accurate, and it explores the computer classification of full names, thereby providing effective support for the intelligent acquisition of large-scale knowledge.

Description of drawings

Fig. 1 is an overall schematic diagram of obtaining a full name from an abbreviation;

Fig. 2 is a flowchart of obtaining a full name using query pattern 1;

Fig. 3 is a flowchart of obtaining a full name using query pattern 2;

Fig. 4 is a flowchart of post-processing the candidate full name set;

Fig. 5 is the verification decision tree generated from the full-name/abbreviation constraint function set.

Embodiment

The invention is further described below with reference to the drawings and specific embodiments.

Before the method of the invention is described, the formation rules and word-formation patterns of abbreviations in the full-name/abbreviation relation are first summarized. In this relation, going from a full name to an abbreviation can be viewed as compressing information. The compression sometimes involves semantically equivalent substitution and changes of character order, so the full-name/abbreviation relation is divided into the ordinary type, the different-character type, and the different-order type.

Ordinary type: every character of the abbreviation appears in the full name, in the same order as in the full name. For example, Fn = "People's Republic of China (PRC)", An = "China";

Different-character type: some characters of the abbreviation do not occur in the full name, that is, going from the full name to the abbreviation involves not only compression of information but also semantically equivalent substitution. For example, Fn = "Wa Huang Shengmumiao", An = "Chinese mythology goddess mausoleum";

Different-order type: the order of the Chinese characters in the abbreviation differs from the order of the corresponding components in the full name. For example, Fn = "Harbin the 6th pharmaceutical factory", An = "breathes out medicine six factories".
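The three types above can be told apart by a character-level check. The sketch below is one illustrative reading of the definitions, not a procedure given in the text; the function names are hypothetical, and the Chinese strings in the usage note are assumed examples rather than quotations from the source.

```python
def is_subsequence(short: str, full: str) -> bool:
    """True if the characters of `short` occur in `full` in the same order."""
    remaining = iter(full)
    # `ch in remaining` consumes the iterator up to the match, so order is enforced.
    return all(ch in remaining for ch in short)


def relation_type(fn: str, an: str) -> str:
    """Classify a (full name, abbreviation) pair by the three types above."""
    if any(ch not in fn for ch in an):
        return "different-character"   # some character of An is absent from Fn
    if is_subsequence(an, fn):
        return "ordinary"              # all characters present, order preserved
    return "different-order"           # all characters present, order changed
```

For instance, `relation_type("北京大学", "北大")` yields the ordinary type, while `relation_type("哈尔滨第六制药厂", "哈药六厂")` yields the different-order type because "药" follows "六" in the full name.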

The invention defines the full-name/abbreviation relation constraint to express the constraints between Fn and An. The constraint is a four-tuple R = (Fn, An, F, A), where Fn is the full name of an object, An is the abbreviation of the object, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively. Before the two sets are described in detail, the basic symbols used below are listed:

An denotes an abbreviation;

Cfn denotes a candidate full name of An;

Fn denotes the full name of An;

GoogleArchSet(An) denotes the Google anchor text set of An, i.e. the set of the first 100 anchor texts returned when Google is searched for the full name corresponding to An; if the total number N of anchor texts returned is less than 100, GoogleArchSet(An) contains only those N anchor texts;

CfnSet(An) denotes the candidate full name set of An, i.e. the set of candidate full names corresponding to An that are extracted from GoogleArchSet(An);

N_CfnSet(An) denotes the number of candidate full names in CfnSet(An);

FnSet(An) denotes the full name set of An, i.e. the set formed by the elements of CfnSet(An) that pass verification;

AnSet(Fn) denotes the abbreviation set of Fn, i.e. for a given Fn, the set of corresponding abbreviations obtained from Google;

FA(Fn, An) denotes that Fn and An have the full-name/abbreviation relation;

length(str) denotes the length of the Chinese character string str, i.e. the number of Chinese characters in str;

N_word(Fn, An) denotes the number of Chinese characters that appear in both Fn and An;

N_Clas(Fn) denotes the number of word segments obtained after Fn is segmented;

N_Cover(Fn, An) denotes the number of segments of Fn that are covered by An;

CoverSet(Fn, An) denotes the set of segments of Fn that are covered by An;

p denotes a segment contained in a full name;

p1/p2/.../pm denotes the segmentation sequence formed by the segments p1, p2, ..., pm, where "/" is the separator between segments;

centre(Fn) denotes the position of the segment center of Fn, i.e. after Fn is segmented, the position of the middle segment or the mean position of the two middle segments: centre(Fn) = (N_Clas(Fn) + 1)/2;

d_i(Fn) denotes the center offset of the i-th segment of Fn, i.e. the displacement between the position of the i-th segment of Fn and the position of the segment center of Fn: d_i(Fn) = i - centre(Fn);

The maximum center offset of Fn, i.e. the maximum of the center offsets of all segments of Fn, equals (N_Clas(Fn) - 1)/2 (its symbol is not reproduced in this text);

Len_i(Fn, An) denotes the number of segments in the i-th uncovered segment string. After Fn is segmented, the segments not covered by An form uncovered segment strings: uncovered segments that are adjacent in Fn join into one string, and an uncovered segment with no uncovered neighbor forms a string by itself. The number of segments in the i-th uncovered segment string is written Len_i(Fn, An);

freq(Fn, An) denotes the number of times Fn is extracted from GoogleArchSet(An);

An infinitesimal number is also used (its symbol is not reproduced in this text);

loca(Cfn, An) denotes the frequency rank of Cfn in CfnSet(An), i.e. the position of Cfn after the elements of CfnSet(An) are sorted in ascending order of freq(Cfn, An);

NoInclude(s1, Set) denotes that no Chinese character string in the set Set is a substring of the Chinese character string s1;

Interrogative denotes the set of interrogative words, comprising "what", "how", and the like;

concat(s1, s2) denotes the Chinese character string obtained by concatenating the Chinese character strings s1 and s2;

concat(s1, ..., sn) denotes the Chinese character string obtained by concatenating the Chinese character strings s1, ..., sn in order;

Contain(s1, s2) denotes that every character of the Chinese character string s2 appears in the Chinese character string s1;

Include(s1, s2) denotes that the Chinese character string s2 is a proper substring of the Chinese character string s1;

Prefix (s1, s2) expression s1 is with respect to the prefix of s2, and prefix (s1, s2) be sky, i.e. s1=concat (prefix (s1, s2), s2, s3), and wherein s3 can be empty string;

A deletion operator (its symbol and operands are not reproduced in this text) denotes deleting an element from a set.

The meaning of each function in the constraint function set is described below under 11 headings:

Constraint function 1: the proportion of the characters of An that come from Fn.

In general, a full name contains all the Chinese characters of its abbreviation. For example, with An = "Beijing University" and Fn = "Peking University", every Chinese character of An comes from Fn. Within the candidate full name set, a candidate that contains a higher proportion of the characters of An has higher priority.

The formal definition and calculation of constraint function 1 are as follows (note: this function improves on the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (Patent No. ZL200710119513.4)):

For example, let An = "Eight Trigram Palm", Cfn₁ = "eight-diagram palm", and Cfn₂ = "a chain of fist of Eight Diagrams". Evaluating constraint function 1 on both candidates (the computed values are not reproduced in this text) gives Cfn₁ a higher priority than Cfn₂.
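The formula of constraint function 1 is not reproduced in this text. As a hedged illustration of the idea it names, the sketch below assumes the function is simply the fraction of the characters of An that also occur in Fn; the function name is hypothetical and the patented formula may differ.

```python
def char_source_ratio(fn: str, an: str) -> float:
    """Fraction of the characters of `an` that occur somewhere in `fn`.

    Assumption: this stands in for constraint function 1, whose exact
    definition is missing from the source text.
    """
    if not an:
        raise ValueError("abbreviation must be non-empty")
    return sum(1 for ch in an if ch in fn) / len(an)
```

Under this reading a candidate that contains every character of the abbreviation scores 1.0, and a candidate sharing no character with it scores 0.0.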

Constraint function 2: the character order of Fn and An.

In the abbreviation process, most abbreviations preserve the character order of the full name. For example, with An = "Olympic Games" and Fn = "Olympic Games", the three characters of An appear strictly in the order in which they occur in Fn.

The formal definition and calculation of constraint function 2 are as follows (note: this function is identical to the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (Patent No. ZL200710119513.4)):

Note: constraint function 2 requires both that Fn and An have the same character order and that every character of An appears in Fn; if An contains a character that does not appear in Fn, the value of constraint function 2 is 0.

Constraint function 3: the segment coverage rate of An over Fn

A full name usually consists of several word segments. An abbreviation may omit one or more of them, but the omitted segments generally do not exceed half of the segments of the full name. The more segments of a candidate full name are covered by the abbreviation, the more likely the candidate is the full name.

The formal definition and calculation of constraint function 3 are as follows (note: this function improves on the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (Patent No. ZL200710119513.4)):

For example, with An = "Beijing University", Cfn₁ = "Beijing/university", and Cfn₂ = "Beijing/traffic/university", constraint function 3 gives Cfn₁ a higher priority than Cfn₂.
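The coverage idea can be sketched with the symbols N_Cover and N_Clas defined earlier. Two assumptions are made here and neither is confirmed by the text: a segment counts as covered when it shares at least one character with An, and the rate is N_Cover(Fn, An) / N_Clas(Fn). The function name is hypothetical.

```python
def coverage_rate(segments: list[str], an: str) -> float:
    """N_Cover / N_Clas under a simple, assumed coverage rule.

    segments: the word segments of a candidate full name, in order.
    A segment is treated as covered if any of its characters occurs in `an`.
    """
    if not segments:
        raise ValueError("candidate must have at least one segment")
    n_cover = sum(1 for seg in segments if any(ch in an for ch in seg))
    return n_cover / len(segments)
```

With assumed Chinese strings matching the example above, `coverage_rate(["北京", "大学"], "北大")` exceeds `coverage_rate(["北京", "交通", "大学"], "北大")`, which agrees with the stated priority order.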

Constraint function 4: the center of gravity of the segments of Fn covered by An

A full name usually consists of several word segments, and an abbreviation may omit one or more of them. The omitted segments should, however, be distributed evenly across the full name rather than concentrated in its front part or its rear part. For example, with An = "your boat group" and Fn = "China/Guizhou/aviation/industry/group/company", the omitted segments "China", "industry", and "company" lie in the front, middle, and rear parts of Fn respectively.

The formal definition and calculation of constraint function 4 are as follows:

For example, let An = "mountain is large", Cfn₁ = "Shandong/university", and Cfn₂ = "Shandong/university/Weihai/branch school". The segments "Shandong" and "university" covered by An are evenly distributed in Cfn₁, whereas in Cfn₂ they both lie in the first half. Constraint function 4 therefore gives Cfn₁ a higher priority than Cfn₂.
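The quantities that this constraint builds on, centre(Fn) = (N_Clas(Fn) + 1)/2 and d_i(Fn) = i - centre(Fn), can be computed directly. The sketch below implements those two definitions from the symbol list; `covered_offset_sum` is a hypothetical helper that only illustrates why the Shandong example is unbalanced, since the actual formula of constraint function 4 is not reproduced in this text.

```python
def centre(n_clas: int) -> float:
    """Position of the segment center: (N_Clas + 1) / 2, positions 1-based."""
    return (n_clas + 1) / 2


def centre_offsets(n_clas: int) -> list[float]:
    """d_i = i - centre for i = 1..N_Clas."""
    c = centre(n_clas)
    return [i - c for i in range(1, n_clas + 1)]


def covered_offset_sum(n_clas: int, covered_positions: list[int]) -> float:
    """Sum of the center offsets of the covered segments (illustrative only).

    A value near 0 means the covered segments balance around the center;
    a clearly negative value means they cluster in the front half.
    """
    offsets = centre_offsets(n_clas)
    return sum(offsets[i - 1] for i in covered_positions)
```

For the two-segment candidate with both segments covered the sum is 0, while for the four-segment candidate with only the first two segments covered it is -2, reflecting the front-heavy distribution described above.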

Constraint function 5: the largest number of consecutive segments of Fn not covered by An

A candidate full name usually consists of several word segments, and an abbreviation may omit one or more of them. The omitted segments do not usually occur consecutively in the full name, that is, the probability that consecutive segments of the full name are all omitted in the abbreviation is small.

The formal definition and calculation of constraint function 5 are as follows:

where N denotes the number of uncovered segment strings contained in Fn.

For example, let An = "Communist Youth League", Cfn₁ = "common property/doctrine/Communist Youth League", and Cfn₂ = "China/people/republic/common property/doctrine/Communist Youth League". In Cfn₁ the only segment not covered by An is "doctrine", whereas in Cfn₂ the uncovered segments "China", "people", and "republic" are adjacent. Constraint function 5 therefore gives Cfn₁ a higher priority than Cfn₂.
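The lengths Len_i(Fn, An) of the uncovered segment strings can be computed by grouping adjacent uncovered segments. In this sketch the coverage decision is taken as input (one boolean per segment), because the text does not spell out how coverage is decided; the function name is hypothetical, and taking the maximum of the lengths is only the quantity named in the title of constraint function 5, not its missing formula.

```python
from itertools import groupby


def uncovered_run_lengths(covered: list[bool]) -> list[int]:
    """Return [Len_1, Len_2, ...]: sizes of maximal runs of uncovered segments.

    covered[i] is True when the (i+1)-th segment of the candidate is covered
    by the abbreviation.
    """
    return [
        sum(1 for _ in run)                 # length of this run
        for is_covered, run in groupby(covered)
        if not is_covered                   # keep only uncovered runs
    ]
```

For the example above, Cfn₁ corresponds to `[True, False, True]` and gives run lengths `[1]`, while Cfn₂ corresponds to `[False, False, False, True, False, True]` and gives `[3, 1]`, so its longest uncovered run is 3.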

Constraint function 6: the length relation between Fn and An

A well-formed abbreviation is usually not reduced excessively, so that most people can still grasp its meaning from the name. The length of the full name corresponding to most abbreviations therefore falls within a range, generally 1.5 to 5 times the length of the abbreviation, and the probability that the full name length falls outside this range is small.

The formal definition and calculation of constraint function 6 are as follows (note: this function improves on the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (Patent No. ZL200710119513.4)):

For example, with An = "Computer Department of the Chinese Academy of Science", Cfn₁ = "Inst. of Computing Techn. Academia Sinica", and Cfn₂ = "Inst. of Computing Techn. Academia Sinica's residential building", constraint function 6 gives Cfn₁ a higher priority than Cfn₂.

Constraint function 7: the frequency with which Fn occurs in GoogleArchSet(An)

When Google is searched for the full name of an abbreviation, a candidate full name that occurs more frequently in GoogleArchSet(An) has higher priority.

The formal definition and calculation of constraint function 7 are as follows:

For example, with An = "lithium battery", Cfn₁ = "lithium ion battery", Cfn₂ = "lithium-ion-power cell", freq(Cfn₁) = 42, and freq(Cfn₂) = 12, constraint function 7 gives Cfn₁ a higher priority than Cfn₂.

When Fn is looked up from An, several candidate full names are sometimes obtained; they form the candidate full name set Set_CFN. For any candidate Cfn_i in Set_CFN, the analysis of FA(Cfn_i, An) can draw by analogy on the expected values of the other candidate full names in Set_CFN.

The following four constraint functions are defined on the candidate full name set.

Constraint function 8: the relative proportion of the characters of An that come from Cfn

Compared with constraint function 1, constraint function 8 emphasizes the relative standing of a candidate full name within Set_CFN. For example, the abbreviations of some transliterated foreign terms share no characters with their full names, and some abbreviations undergo synonym substitution when they are restored to full names.

The formal definition and calculation of constraint function 8 are as follows:

For example, let An = "acquired immune deficiency syndrome (AIDS)", Cfn₁ = "aids", and Cfn₂ = "acquired immunodeficiency syndrome". An and Cfn₁ have no characters in common, but neither do An and Cfn₂, so Cfn₁ cannot be ruled out as a full name merely because its value under constraint function 1 is 0.

Constraint function 9: the relative coverage rate of Fn within the candidate full name set

Compared with constraint function 3, constraint function 9 emphasizes the relative standing of a candidate full name within Set_CFN. For example, when an abbreviation covers none of the candidate full names well, the candidate with the relatively higher coverage rate has higher priority.

The formal definition and calculation of constraint function 9 are as follows:

For example, let An = "Tsing Hua Tong Fang", Cfn₁ = "Tsing-Hua University/with side/share/limited/company", and Cfn₂ = "Tsing-Hua University/with side/CD/share/limited/company". The segment coverage rate of An over both Cfn₁ and Cfn₂ is low, but it is relatively higher for Cfn₁, so Cfn₁ has a higher priority than Cfn₂.

Constraint function 10: the frequency of Fn within the candidate full name set

When Fn is looked up from An, the frequencies of all candidates in the candidate full name set are sometimes very low, which weakens the effect of constraint function 7. Constraint function 10 therefore considers the relative frequency of each candidate: within the candidate full name set, a candidate with a relatively higher frequency has higher priority.

The formal definition and calculation of constraint function 10 are as follows:

For example, let An = "eel connection", Cfn₁ = "world's eel vegetative propagation joint conference", and Cfn₂ = "Shantou eel community of stock part company limited", with freq(Cfn₁) = 3 and freq(Cfn₂) = 1. By constraint function 7 the frequencies of Cfn₁ and Cfn₂ are both low, but by constraint function 10 their frequencies within the candidate full name set are both relatively high.

Constraint function 11: the relative position of Fn after the elements of the candidate full name set are sorted in ascending order of frequency

When the candidate full name set has many elements, the candidates with lower frequency are relatively less important.

The formal definition and calculation of constraint function 11 are as follows:

The lower the value of constraint function 11, the less important the candidate full name.
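The frequency-based quantities freq(Cfn, An) and loca(Cfn, An) follow directly from their definitions in the symbol list. The sketch below counts extracted candidates and ranks them in ascending order of frequency; `relative_position`, defined here as loca divided by the number of candidates, is an assumption standing in for the formula of constraint function 11, which is not reproduced in this text. Ties keep first-seen order, which is also an assumption.

```python
from collections import Counter


def frequency_ranks(extracted: list[str]) -> tuple[Counter, dict[str, int]]:
    """Return (freq, loca) for the candidates extracted from the anchor texts.

    freq[c] is how many times candidate c was extracted;
    loca[c] is its 1-based position after an ascending sort by frequency.
    """
    freq = Counter(extracted)
    ordered = sorted(freq, key=lambda c: freq[c])   # stable ascending sort
    loca = {c: pos for pos, c in enumerate(ordered, start=1)}
    return freq, loca


def relative_position(cfn: str, loca: dict[str, int]) -> float:
    """Assumed form of constraint function 11: loca / number of candidates."""
    return loca[cfn] / len(loca)
```

Under this reading the most frequent candidate receives the value 1.0 and rarer candidates receive smaller values, consistent with the statement that a lower value means lower importance.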

The meaning of the constraint functions in the constraint function set has been described above under 11 headings. They express the constraints between Fn (or Cfn) and An quantitatively, whereas the constraint axioms express those constraints qualitatively. The constraint axioms are described below:

Constraint axiom 1: unequal-length axiom

Formal representation:

Intuitive meaning: in the full-name/abbreviation relation, the number of characters of Fn must be greater than the number of characters of An.

Constraint axiom 2: declarative-mood axiom

Formal representation:

Intuitive meaning: neither Fn nor An contains an interrogative word such as "what" or "how".
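Constraint axioms 1 and 2 translate into simple predicates. The sketch below follows the intuitive meanings stated above and the definition of NoInclude in the symbol list; the function names and the particular Chinese interrogative words are assumptions chosen for illustration, since the formal representations and the original word list are not reproduced in this text.

```python
# Assumed interrogative words; the original set is not reproduced in the text.
INTERROGATIVES = ("什么", "怎么", "如何")


def satisfies_length_axiom(fn: str, an: str) -> bool:
    """Axiom 1: the full name must have more characters than the abbreviation."""
    return len(fn) > len(an)


def no_include(s1: str, strings) -> bool:
    """NoInclude(s1, Set): no string in `strings` is a substring of s1."""
    return all(s not in s1 for s in strings)


def satisfies_declarative_axiom(fn: str, an: str,
                                interrogatives=INTERROGATIVES) -> bool:
    """Axiom 2: neither Fn nor An contains an interrogative word."""
    return no_include(fn, interrogatives) and no_include(an, interrogatives)
```

A candidate such as "北大是什么大学" would be rejected by the second predicate because it contains an interrogative word.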

Constraint axiom 3: non-repetition-of-form axiom

Formal representation:

Intuitive meaning: in the full-name/abbreviation relation, neither Fn nor An may be a Chinese character string of the form ss, where s is a Chinese character string.

For example, with An = "Hainan Island" and Cfn = "Jade Flowery Islet, Jade Flowery Islet", Cfn has the form ss with s = "Jade Flowery Islet", so Cfn should be corrected to s. This phenomenon arises because no punctuation mark separates the two occurrences of "Jade Flowery Islet" in the corpus.
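The ss check and the correction described above can be sketched as follows. This is an illustrative reading of the axiom with hypothetical function names; the Chinese strings used in the usage note are assumed examples, not quotations from the source.

```python
def repeated_half(s: str):
    """Return t if s has the form tt (t non-empty), otherwise None."""
    half, remainder = divmod(len(s), 2)
    if remainder == 0 and half > 0 and s[:half] == s[half:]:
        return s[:half]
    return None


def repair_ss(cfn: str) -> str:
    """Correct a candidate of the form ss to s; leave other candidates unchanged."""
    t = repeated_half(cfn)
    return t if t is not None else cfn
```

For instance, `repair_ss("琼岛琼岛")` returns "琼岛", while a candidate that is not a doubled string, such as "海南岛", is returned unchanged.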

Constraint axiom 4: non-repetition-of-semantics axiom

Formal representation:

Intuitive meaning: Fn must not repeat itself semantically.

For example, with An = "Hainan Island" and Cfn = "Jade Flowery Islet Hainan Island", Cfn has the form s1s2 with s1 = "Jade Flowery Islet" and s2 = "Hainan Island", so Cfn is incorrect. This phenomenon arises because no punctuation mark separates s1 and s2 in the corpus.

Constraint axiom 5: full-name/abbreviation equivalence axiom

Formal representation:

Intuitive meaning: in the full-name/abbreviation relation, Fn must belong to the full name set of An, and An must belong to the abbreviation set of Fn.

Constraint axiom 5 is not used to verify the full-name/abbreviation relation; it is used to extend the full-name/abbreviation knowledge base.

Now that the full-name/abbreviation relation constraint defined by the invention has been described in detail, the embodiment of the method is introduced with reference to Fig. 1.

The method of identifying a Chinese full name from a Chinese abbreviation comprises two major steps: generating the candidate full name set and post-processing the candidate full name set. They are described in turn below. Because query pattern 1 and query pattern 2 generate the candidate full name set differently, they are introduced separately.

As shown in Fig. 2, the steps for generating the candidate full name set with query pattern 1 are as follows:

Step 1-1: the user inputs a known Chinese abbreviation An;

Step 1-2: a concrete query term is constructed according to query pattern 1, "abbreviated as An";

Step 1-3: the query term is submitted to the Google search engine, and the first 100 anchor texts are saved as the anchor corpus;

Step 1-4: using regular expressions, the full-name/abbreviation sentences that contain the query term are extracted from the anchor corpus and saved as the full-name/abbreviation corpus.

Full-name/abbreviation sentences fall into three main types: the tag-pair type, the no-suffix type, and the suffix type. Tag-pair type: An is not followed by a Chinese character and Cfn is enclosed in paired tags, so the boundaries of Cfn need not be determined and Cfn is extracted directly. No-suffix type: An is not followed by a Chinese character and Cfn is not enclosed in paired tags, so the left boundary of Cfn must be determined. Suffix type: An is followed by a Chinese character, which shows that An is the front part of another abbreviation "An*" and therefore that Cfn is the front part of the full name "Cfn*" corresponding to "An*", so both the left and right boundaries of Cfn must be determined.
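The three sentence types can be distinguished with a few string tests. The sketch below is illustrative only and rests on several assumptions that the text does not confirm: the Chinese marker "简称" for query pattern 1, a small set of paired marks treated as "tags", a single Unicode range used to recognize Chinese characters, and hypothetical function names.

```python
import re

# Assumed paired marks that may enclose a candidate full name.
PAIRED = re.compile(r"[《“「]([^》”」]+)[》”」]")


def is_chinese_char(ch: str) -> bool:
    """Rough test using the CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def sentence_type(sentence: str, an: str, marker: str = "简称"):
    """Return 'tag-pair', 'no-suffix', 'suffix', or None if the query is absent."""
    key = marker + an
    pos = sentence.find(key)
    if pos < 0:
        return None
    following = sentence[pos + len(key): pos + len(key) + 1]
    if following and is_chinese_char(following):
        return "suffix"          # An is the front part of a longer abbreviation
    if PAIRED.search(sentence[:pos]):
        return "tag-pair"        # candidate is enclosed in paired marks
    return "no-suffix"           # left boundary of the candidate must be found
```

For example, "《中华人民共和国》简称中国" with An = "中国" is classified as tag-pair, and "北京大学出版社简称北大出版社" with An = "北大" as suffix.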

Step 1-5: the benchmark candidate full-name set is extracted with algorithm FCFNEA.

Algorithm for extracting the benchmark candidate full-name set (formal candidate fullname extract algorithm, FCFNEA):

Input: the tag-pair-type full-name/abbreviation sentence set […], the no-suffix-type full-name/abbreviation sentence set […], and the suffix-type full-name/abbreviation sentence set […]

Output: the benchmark candidate full-name set […]

Step1: […], extract the entry enclosed by the tag pair → […], and count the frequency of […];

Step2: […], if […] is contained in […], then the frequency of […] is increased by 1, and […] is deleted from […];

Step3: […], if […] is contained in […], then the frequency of […] is increased by 1;

Step4: […], segment with ICTCLAS, and form […] from the first segment […] and the last segment […], […] → […];

Step5: […], if […] contains an entry […] whose prefix is pre and whose suffix is suf, then […] → […], delete […] from […], and use the prioritization strategy PSCF to obtain the best candidate of […] → […];

Step6: return […]

The prioritization strategy PSCF used in Step5 of algorithm FCFNEA is defined as follows:

Prioritization strategy (priority sort comparison function, PSCF):

[…] iff 1) […]; 2) […], if […];

[…] iff 1) […]; 2) […];

If […], then Cfn_k is called the best candidate in […], denoted […].

Step 1-6: the non-benchmark candidate full-name set is extracted with algorithm ICFNEA.

Algorithm for extracting non-benchmark candidate full names (informal candidate fullname extract algorithm, ICFNEA):

Input: the phrase or short sentence to be extracted, Co-referent; the known concept word Inputitem = {C_1, C_2, …, C_n};

Output: the extracted full-name/abbreviation candidate, Candidate;

Step1: segment Co-referent and tag the parts of speech; the segmentation result is {P_1, P_2, …, P_m}

Step2: define the position variables left_flag ← k, left ← 1

Step3: for each C_i ∈ {C_n, C_{n-1}, …, C_1}

for each P_j ∈ {P_{left_flag}, P_{left_flag-1}, …, P_1}

if C_i appears in P_j

then left_flag ← j;

break;

end if

end for each

end for each

Step4: for each P_k ∈ {P_1, P_2, …, P_m}

if the part of speech of P_k ∈ {conjunction, preposition, auxiliary word, verb, measure word, label} and k < left_flag

then left ← k+1;

end if

end for each

Step5: return Candidate ← {P_left, …, P_m};
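The steps of ICFNEA can be sketched as follows. The sketch assumes that segmentation and part-of-speech tagging have already been done (the text uses ICTCLAS), that the scan for abbreviation characters starts at the last segment, and that the part-of-speech names in `BOUNDARY_POS` stand in for the tagger's own labels; none of these details is fixed by the text.

```python
# Hypothetical part-of-speech labels; "punctuation" stands in for "label".
BOUNDARY_POS = {"conjunction", "preposition", "auxiliary", "verb",
                "measure", "punctuation"}


def icfnea(segments, abbreviation):
    """segments: list of (word, pos) pairs P_1..P_m; abbreviation: C_1..C_n."""
    if not segments:
        return []
    left_flag = len(segments) - 1  # assumption: start scanning at the last segment
    left = 0                       # 0-based counterpart of "left <- 1"
    # Step3: walk the abbreviation right to left, moving left_flag leftwards
    # to the segment that contains each character.
    for ch in reversed(abbreviation):
        for j in range(left_flag, -1, -1):
            if ch in segments[j][0]:
                left_flag = j
                break
    # Step4: the last boundary-type segment before left_flag fixes the left edge.
    for k, (_, pos) in enumerate(segments):
        if pos in BOUNDARY_POS and k < left_flag:
            left = k + 1
    # Step5: the candidate runs from the left edge to the end of the sentence.
    return [word for word, _ in segments[left:]]


sentence = [("位于", "verb"), ("北京", "noun"), ("的", "auxiliary"),
            ("北京", "noun"), ("大学", "noun")]
print(icfnea(sentence, "北大"))
```

In this example the auxiliary word "的" is the last boundary-type segment before the segment matched by "北", so only the two segments after it are returned.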

Step 1-7: the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set are re-determined by the analogy method.

The analogy method is given by Method 1 and Method 2 below.

Method 1, formal representation: […]

Intuitive meaning of Method 1: for any two candidates […] and […] in the candidate full-name set, if the following preconditions are satisfied simultaneously:

1) all Chinese characters of An appear in […];

2) […] is a proper substring of […];

3) the frequency of […] […] 2, or the frequency of […] < 10;

4) the prefix of […] relative to […] is not the prefix of any other candidate in the candidate full-name set;

then the frequency of […] is changed to the sum of the frequencies of […] and […], and […] is deleted from the candidate full-name set.

Method 2, formal representation: […]

Intuitive meaning of Method 2: for any two candidates […] and […] in the candidate full-name set, if the following preconditions are satisfied simultaneously:

1) the frequency of […] […] 10;

2) the frequency of […] […] 5 times the frequency of […];

3) […] is a proper substring of […];

4) the number of characters of An contained in […] equals the number of characters of An contained in […];

then the frequency of […] is changed to the sum of the frequencies of […] and […], and […] is deleted from the candidate full-name set.

Step 1-8: the left boundary vocabulary LBV and the right boundary vocabulary RBV are read in and used to re-determine the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set. The algorithm is as follows:

Algorithm for re-determining the left and right boundaries using LBV and RBV (RDLRB):

Input: the candidate full name Cfn, the abbreviation An, the left boundary vocabulary LBV, the right boundary vocabulary RBV

Output: the candidate full name CFN after its left and right boundaries have been re-determined

Step1: segment Cfn with ICTCLAS; the result is Cfn_clas = P_1 P_2 … P_n

Step2: determine the segments P_i and P_j of Cfn that correspond to the first and the last character of An, respectively

Step3: define the left boundary of Cfn, left ← 1;

for each P_k ∈ {P_{i-1}, …, P_1}

if P_k is in the left boundary vocabulary

then left ← k+1;

break;

end if

end for each

Step4: define the right boundary of Cfn, right ← n;

for each P_k ∈ {P_{j+1}, …, P_n}

if P_k is in the right boundary vocabulary

then right ← k-1;

break;

end if

end for each

Step5: CFN ← {P_left, …, P_right};

return CFN;
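A minimal sketch of RDLRB is given below. It assumes the candidate has already been segmented, and it reads Step2 as "the first segment containing the first character of An and the last segment containing the last character of An"; that reading, like the function name `rdlrb`, is an assumption.

```python
def rdlrb(segments, abbreviation, lbv, rbv):
    """Trim a segmented candidate full name using boundary vocabularies.

    segments: list of strings P_1..P_n (ICTCLAS-style segmentation of Cfn)
    lbv, rbv: sets of left and right boundary words
    Raises ValueError if a needed character of An is missing from the segments.
    """
    n = len(segments)
    first_hits = [k for k, seg in enumerate(segments) if abbreviation[0] in seg]
    last_hits = [k for k, seg in enumerate(segments) if abbreviation[-1] in seg]
    if not first_hits or not last_hits:
        raise ValueError("abbreviation characters not found in candidate")
    i, j = first_hits[0], last_hits[-1]  # assumed reading of Step2
    # Step3: scan leftwards from P_{i-1}; stop at the first left boundary word.
    left = 0
    for k in range(i - 1, -1, -1):
        if segments[k] in lbv:
            left = k + 1
            break
    # Step4: scan rightwards from P_{j+1}; stop at the first right boundary word.
    right = n - 1
    for k in range(j + 1, n):
        if segments[k] in rbv:
            right = k - 1
            break
    # Step5: keep P_left..P_right.
    return segments[left:right + 1]


candidate = ["位于", "北京", "大学", "成立", "于"]
print(rdlrb(candidate, "北大", {"位于"}, {"成立"}))
```

With empty vocabularies the candidate is returned unchanged, which matches the defaults left ← 1 and right ← n of the algorithm.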

Step 1-9: the nearest prefix words and nearest suffix words that exist are extracted from the benchmark full-name set and added to the dubious left boundary vocabulary and the dubious right boundary vocabulary, respectively. The nearest left-part words and nearest right-part words that exist are extracted from the non-benchmark full-name set and likewise added to the dubious left boundary vocabulary and the dubious right boundary vocabulary, respectively.

The nearest prefix word, nearest suffix word, nearest left-part word, nearest right-part word, dubious left boundary vocabulary and dubious right boundary vocabulary mentioned in step 1-9 are defined and generated as follows:

Definition 1: for Cfni and Cfnj that satisfy the preconditions of Method 1 or Method 2 above, write Cfnj = left + Cfni + right, where left (if not empty) is called the left part of Cfnj and right (if not empty) is called the right part of Cfnj. left and right are each segmented with ICTCLAS; the last segment of left is called the nearest left-part word of Cfnj, and the first segment of right is called the nearest right-part word of Cfnj.
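Definition 1 amounts to splitting Cfnj around Cfni and taking the segments that touch the split. The sketch below assumes a caller-supplied `segment` function standing in for ICTCLAS and splits at the first occurrence of Cfni; both are illustrative assumptions.

```python
def nearest_part_words(cfn_i, cfn_j, segment):
    """Return (nearest left-part word, nearest right-part word) of cfn_j.

    segment: hypothetical callable mapping a string to a list of words,
    used here in place of ICTCLAS. A missing part yields None.
    """
    left, found, right = cfn_j.partition(cfn_i)  # splits at the first occurrence
    if not found:
        raise ValueError("cfn_i is not a substring of cfn_j")
    left_words = segment(left) if left else []
    right_words = segment(right) if right else []
    nearest_left = left_words[-1] if left_words else None
    nearest_right = right_words[0] if right_words else None
    return nearest_left, nearest_right


# Toy segmenter for the example only.
toy_segment = {"位于": ["位于"], "成立于": ["成立", "于"]}.get
print(nearest_part_words("北京大学", "位于北京大学成立于", toy_segment))
```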

Definition 2: for each candidate full name Cfnk in the benchmark candidate full-name set, if every character of An appears in Cfnk, Cfnk is segmented with ICTCLAS into […]. Let the segments Fi and Fj be the segments of Cfnk that correspond to the first and the last character of An, respectively. Then […] is called the prefix of Cfnk and F(i-1) the nearest prefix word of Cfnk; […] is called the suffix of Cfnk and F(j+1) the nearest suffix word of Cfnk.

Definition 3: dubious left boundary vocabulary (DLBV):

Formal definition:

map<key, map_value> dubious_left_boundary;

key: string prefix: a nearest left-part word or a nearest prefix word

map_value: int qu: the frequency of prefix as a nearest left-part word

int liu: the frequency of prefix as a nearest prefix word

bool flag: whether prefix needs manual verification

Definition 4: dubious right boundary vocabulary (DRBV):

map<key, map_value> dubious_right_boundary;

key: string suffix: a nearest right-part word or a nearest suffix word

map_value: int qu: the frequency of suffix as a nearest right-part word

int liu: the frequency of suffix as a nearest suffix word

bool flag: whether suffix needs manual verification

Step 1-10: the dubious left boundary vocabulary and the dubious right boundary vocabulary are manually verified to generate the left boundary vocabulary and the right boundary vocabulary.

The dubious left boundary vocabulary is manually verified as follows:

If the following are satisfied:

1) prefix has not yet been manually verified;

2) the frequency of prefix as a nearest left-part word > 2;

3) the frequency of prefix as a nearest prefix word < 2;

4) the frequency of prefix as a nearest left-part word > 5 × the frequency of prefix as a nearest prefix word;

then prefix is manually verified to determine whether it is a left boundary word; if it is, it is added to the left boundary vocabulary.

Definition 5: left boundary vocabulary (LBV):

Formal definition:

map<key, map_value> left_boundary;

key: string prefix: a left boundary word

map_value: int num: the number of Cfn whose left boundary was determined with prefix

The dubious right boundary vocabulary is manually verified as follows:

If the following are satisfied:

1) suffix has not yet been manually verified;

2) the frequency of suffix as a nearest right-part word > 2;

3) the frequency of suffix as a nearest suffix word < 2;

4) the frequency of suffix as a nearest right-part word > 9 × the frequency of suffix as a nearest suffix word;

then suffix is manually verified to determine whether it is a right boundary word; if it is, it is added to the right boundary vocabulary.
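The dubious vocabularies and the four selection conditions can be sketched as below. The field names `qu`, `liu` and `flag` follow Definitions 3 and 4; reading `flag` as "still awaiting manual verification", and the names `DubiousEntry`, `needs_review` and `review_queue`, are assumptions. The ratio is 5 for the left vocabulary and 9 for the right vocabulary, as stated in condition 4.

```python
from dataclasses import dataclass


@dataclass
class DubiousEntry:
    qu: int            # frequency as nearest left-part (or right-part) word
    liu: int           # frequency as nearest prefix (or suffix) word
    flag: bool = True  # assumed meaning: True while manual verification is pending


def needs_review(entry, ratio):
    """Apply conditions 1) to 4); ratio is 5 (left) or 9 (right)."""
    return (entry.flag
            and entry.qu > 2
            and entry.liu < 2
            and entry.qu > ratio * entry.liu)


def review_queue(vocabulary, ratio):
    """Return the words of a dubious vocabulary to hand to a human reviewer."""
    return sorted(word for word, entry in vocabulary.items()
                  if needs_review(entry, ratio))


dlbv = {
    "位于": DubiousEntry(qu=6, liu=1),
    "北京": DubiousEntry(qu=3, liu=4),
    "的": DubiousEntry(qu=6, liu=0, flag=False),
}
print(review_queue(dlbv, ratio=5))
```

The stricter ratio on the right side means the same counts that qualify a word as a left boundary candidate may not qualify it as a right boundary candidate.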

Definition 6: right boundary vocabulary (RBV):

map<key, map_value> right_boundary_cfn;

key: string suffix: a right boundary word

map_value: int num: the number of Cfn whose right boundary was determined with suffix

Step 1-11: the benchmark candidate full-name set and the non-benchmark candidate full-name set are merged to generate the candidate full-name set.

We experimented with 4000 Chinese abbreviations (An). A source corpus could be obtained with query pattern 1 for 88.75% of them and with query pattern 2 for 24.76% of them, while only 2.33% could obtain a source corpus with query pattern 2 but not with query pattern 1. In the present invention, query pattern 2 is therefore used only as a supplement to query pattern 1, that is, only when query pattern 1 yields no candidate full name.
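The fallback order described above can be written as a small driver. This is a sketch only: `extract_with_pattern` is a hypothetical callable that runs the whole pipeline for one query pattern and returns a list of candidates. Taken together, the two reported figures imply that 88.75% + 2.33% = 91.08% of the abbreviations obtain a source corpus from at least one pattern.

```python
def acquire_candidates(an, extract_with_pattern):
    """Try query pattern 1 first; use pattern 2 only if pattern 1 finds nothing.

    extract_with_pattern(an, pattern) is a hypothetical callable returning
    the candidate full names found with the given query pattern.
    Returns (pattern used, candidates), or (None, []) if both patterns fail.
    """
    for pattern in (1, 2):
        candidates = extract_with_pattern(an, pattern)
        if candidates:
            return pattern, candidates
    return None, []


def fake_extract(an, pattern):
    # Stand-in for the real pipeline: pattern 1 finds nothing here.
    return ["北京大学"] if pattern == 2 else []


print(acquire_candidates("北大", fake_extract))
```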

As shown in Fig. 3, the candidate full-name set is generated with query pattern 2 by the following steps:

Step 2-1: the user inputs a known Chinese abbreviation An.

Step 2-2: a concrete query term is constructed according to query pattern 2, "An full name".

Step 2-3: the query term is submitted to the Google search engine, and the first 100 anchor texts are saved as the anchor corpus.

Step 2-4: by constructing regular expressions, the full-name/abbreviation sentences that contain the query term are extracted from the anchor corpus and saved as the full-name/abbreviation corpus.

Step 2-5: the candidate full-name set is extracted with algorithm CFNEA.

Algorithm for extracting candidate full names (candidate fullname extract algorithm, CFNEA):

Input: the prefix Prefix; the known abbreviation Inputitem; the phrase or short sentence to be extracted, Co-referent

Output: the extracted full-name/abbreviation candidate, Candidate

Step1: define the flag, flag ← 0; segment Co-referent (the open-source version apparently cannot be used for commercial purposes), and denote the result {P_1, P_2, …, P_n}

Step2: for each P_i ∈ {P_1, P_2, …, P_n}

if flag = 0 and P_i shares a character with Prefix and P_i shares no character with Inputitem

then flag ← 1;

end if

if flag = 1 and P_i shares no character with Prefix

then break;

end if

if P_i shares a character with Inputitem

then break;

end if

end for each

Step3: if flag = 0 then i ← 0;

Step4: Candidate ← {P_i, …, P_n}

return Candidate

The candidate full-name set obtained by the above operations is then post-processed to obtain the final result. Post-processing comprises verifying, classifying and sorting the candidate full names; with reference to Fig. 4, it is carried out as follows:

Step C-1: each candidate full name in the candidate full-name set is verified with constraint axioms 1-4 of the constraint axiom set.

Step C-2: a decision tree (see Fig. 5) is generated from the constraint function set and used to classify the candidate full names in the candidate full-name set; candidates classified as "N" are removed, and candidates classified as "Y" are retained to form the full-name set.

In Fig. 5, "N1" denotes a low-frequency different-character error, "N2" a high-frequency different-character error, "N3" a low-frequency different-order error, and "Y" a correct full name.

Step C-3: the full-name set is classified on the basis of the constraint function set.

In the present invention, full names are divided into the ordinary type, the different-character type and the different-order type according to whether the full name contains different characters or a different order. The ordinary type is further divided by context dependence into the strong context-independent type, the weak context-independent type and the context-dependent type. The context-independent types are further divided into a high-frequency type and a low-frequency type according to the relative frequency of FN within the full-name set, and the context-dependent type is divided into the forward type, the centered type and the backward type according to the coverage centre of gravity of An over FN (see Table 1).

Table 1. Types of full names

The specific classification criteria and the conditions that each type of full name must satisfy are given in Table 2.

Table 2. Classification criteria for full names

Class	Conditions to be satisfied
High-frequency strong context-independent	f₁=1, f₂=1, f₃=1, f₁₁=1
Low-frequency strong context-independent	f₁=1, f₂=1, f₃=1, f₁₁<1
High-frequency weak context-independent	f₁=1, f₂=1, 0.823 […] f₃<1, f₉=1, f₁₁=1
Low-frequency weak context-independent	f₁=1, f₂=1, 0.823 […] f₃<1, f₉=1, f₁₁<1
Forward context-dependent	f₁=1, f₂=1, f₃ […] 1, f₄ […] 0.5
Centered context-dependent	f₁=1, f₂=1, 0.5 […] f₄ […] 0.5, (f₃ […] 0.823 […] f₉ […] 1)
Backward context-dependent	f₁=1, f₂=1, f₃ […] 1, f₄ […] 0.5
Different-order type	f₁=1, f₂=0, f₁₁=1
Different-character type	f₁ […] 1 […] f₇ […] f₁₀ […] f₇ […] 0.05 […] f₉=1 […] f₁₁=1))

Note that, because context is a concept at the semantic level, it is difficult for a computer to judge intelligently whether an FN is context-dependent; the present invention therefore uses the constraint functions to make an approximate judgment at the level of word-formation rules.

In Table 2, the intuitive meaning of high-frequency strong context-independent is: FN contains all characters of An and preserves their order, every segment of FN has a counterpart in An, and FN has the highest frequency in the full-name set.

In Table 2, the intuitive meaning of low-frequency strong context-independent is: FN contains all characters of An and preserves their order, every segment of FN has a counterpart in An, and FN does not have the highest frequency in the full-name set.

In Table 2, the intuitive meaning of high-frequency weak context-independent is: FN contains all characters of An and preserves their order, most segments of FN have a counterpart in An, and FN has the highest frequency in the full-name set.

In Table 2, the intuitive meaning of low-frequency weak context-independent is: FN contains all characters of An and preserves their order, most segments of FN have a counterpart in An, and FN does not have the highest frequency in the full-name set.

In Table 2, the intuitive meaning of forward context-dependent is: FN contains all characters of An and preserves their order, and the segments omitted from FN lie mostly in the latter half of FN.

In Table 2, the intuitive meaning of centered context-dependent is: FN contains all characters of An and preserves their order, and the numbers of segments omitted from the front part and from the rear part of FN are similar.

In Table 2, the intuitive meaning of backward context-dependent is: FN contains all characters of An and preserves their order, and the segments omitted from FN lie mostly in the first half of FN.

In Table 2, the intuitive meaning of the different-order type is: FN contains all characters of An but their order is changed, and FN has the highest frequency in the full-name set.

In Table 2, the intuitive meaning of the different-character type is: FN does not contain all characters of An, but the frequency of FN is very high or its relative frequency in the full-name set is very high.

Step C-4: full names of the same class in the full-name set are sorted according to the priority synthetic function PRI(Cfn, An).

The priority synthetic function PRI(Cfn, An) used in step C-4 is defined as follows:

[…]

where […] is the weight given to each function in the comprehensive evaluation, and the correspondence between F_i and […] is given in Table 3. The values of […] were obtained experimentally according to how strongly each function constrains the full-name/abbreviation relation:

Table 3

No.	Function content	Function weight
F₁	Proportion of the characters of An that come from Fn	0.12
F₂	Character order of Fn and An	0.08
F₃	Character coverage of Fn by An	0.06
F₄	Centre of gravity of the segment coverage of Fn by An	0.08
F₅	Length of the longest run of consecutive segments of Fn not covered by An	0.04
F₆	Length relation between Fn and An	0.06
F₇	Frequency with which Fn occurs in GoogleArchSet(An)	0.10
F₈	Relative proportion of the characters of An that come from Cfn	0.12
F₉	Relative coverage ratio of Fn within the candidate full-name set	0.10
F₁₀	Frequency of Fn within the candidate full-name set	0.12
F₁₁	Relative position of Fn after the elements of the candidate full-name set are sorted by frequency in ascending order	0.14
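The weights of Table 3 can be combined as in the sketch below. The formula of PRI is not reproduced in this text, so the weighted-sum form, the assumption that every F_i is scaled so that larger values are better, and the names `pri` and `rank` are all assumptions. As printed, the eleven weights add up to 1.02 rather than 1, so the resulting score is not normalized.

```python
# Weights copied from Table 3, indexed by the function number i of F_i.
WEIGHTS = {1: 0.12, 2: 0.08, 3: 0.06, 4: 0.08, 5: 0.04, 6: 0.06,
           7: 0.10, 8: 0.12, 9: 0.10, 10: 0.12, 11: 0.14}


def pri(f_values):
    """Assumed weighted-sum form of PRI; f_values maps i (1..11) to F_i."""
    return sum(weight * f_values[i] for i, weight in WEIGHTS.items())


def rank(candidates):
    """candidates: dict mapping a full name to its f_values; best first."""
    return sorted(candidates, key=lambda name: pri(candidates[name]),
                  reverse=True)


same_class = {
    "a": {i: 1.0 for i in WEIGHTS},
    "b": {i: 0.5 for i in WEIGHTS},
}
print(rank(same_class))
```

Sorting is applied within one class at a time, in line with step C-4, which orders only full names of the same class.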

To illustrate the practical effect of the present invention, extensive experiments were carried out in which the method was used to find the full names of abbreviations from multiple disciplines. We randomly drew 3910 Chinese abbreviations (An) from multiple disciplines and used the present invention to find their Fn; the results are shown in Table 4.

Table 4. Experimental results of finding Fn from An

Number of An	Number of An for which Fn was obtained	Percentage of An for which Fn was obtained	Total number of Fn	Accuracy of the retrieved Fn (sampled)
3910	3561	91.07%	9305	94.77%
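The percentage in Table 4 follows directly from the counts, as the short check below shows. The average number of full names per resolved abbreviation is a derived figure that does not appear in the table; the variable names are illustrative.

```python
abbreviations = 3910  # number of An in the experiment
resolved = 3561       # number of An for which at least one Fn was obtained
full_names = 9305     # total number of Fn obtained

coverage = resolved / abbreviations     # share of An with at least one Fn
per_abbreviation = full_names / resolved  # derived: average Fn per resolved An

print(f"coverage: {coverage:.2%}")
print(f"average Fn per resolved An: {per_abbreviation:.2f}")
```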

From the above experiment we randomly drew 3188 full names and verified them with the decision tree; Table 5 gives the result of the decision-tree verification.

Table 5. Results of decision-tree verification

The experiments lead to the following conclusion: the present invention recognizes Chinese full names well, is widely applicable, and can effectively remedy the shortcomings of previous methods of Chinese full-name recognition.

Claims

1. A method for acquiring Chinese full names from Web pages, characterized by comprising the steps of:

Step 1: inputting a given Chinese abbreviation;

2. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in said step 2, if the number of query results returned by Google is greater than 100, N is 100; otherwise N is the number of query results returned by Google.

3. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in said step 2, the query patterns comprise two kinds, query pattern 1: "abbreviated as An", and query pattern 2: "An full name"; query pattern 1 is selected first and query pattern 2 second.

4. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in said step 4, the full-name extraction algorithm EFN comprises two algorithms, CFNEA1 and CFNEA2, which correspond respectively to the two query patterns of step 2; that is, when query pattern 1 is selected in step 2, CFNEA1 is used to extract Fn in step 4, and when query pattern 2 is selected in step 2, CFNEA2 is used to extract Fn in step 4.

5. The method for acquiring Chinese full names from Web pages according to claim 4, characterized in that: when query pattern 1 is selected in step 2, step 4 carries out the following steps:

Full-name/abbreviation sentences fall mainly into three types: the tag-pair type, the no-suffix type and the suffix type; tag-pair type: no Chinese character follows An, and Cfn is enclosed by paired marks, so the boundary of Cfn need not be determined and Cfn is extracted directly; no-suffix type: no Chinese character follows An, and Cfn is not enclosed by paired marks, so the left boundary of Cfn must be determined; suffix type: Chinese characters follow An, which indicates that An is the front part of another abbreviation "An*", so Cfn is likewise the front part of the full name "Cfn*" corresponding to "An*", and both the left and right boundaries of Cfn must be determined;

Step A-1: extracting the benchmark candidate full-name set with algorithm FCFNEA;

Input: the tag-pair-type full-name/abbreviation sentence set […], the no-suffix-type full-name/abbreviation sentence set […], and the suffix-type full-name/abbreviation sentence set […]

Output: the benchmark candidate full-name set […]

[…], extract the entry enclosed by the tag pair → […], and count the frequency of […];

[…], if […] is contained in […], then the frequency of […] is increased by 1, and […] is deleted from […];

[…], if […] is contained in […], then the frequency of […] is increased by 1;

[…], segment with ICTCLAS, and form […] from the first segment […] and the last segment […], […] → […];

[…], if […] contains an entry […] whose prefix is pre and whose suffix is suf, then […] → […], delete […] from […], and use the prioritization strategy PSCF to obtain the best candidate of […] → […];

return […]

The prioritization strategy PSCF used in Step5 of algorithm FCFNEA is defined as follows:

Prioritization strategy (priority sort comparison function, PSCF):

[…] iff 1) […]; 2) […], if […];

[…] iff 1) […]; 2) […];

If […], then Cfn_k is called the best candidate in […], denoted […];

Step A-2: extracting the non-benchmark candidate full-name set with algorithm ICFNEA;

Output: the extracted full-name/abbreviation candidate, Candidate;

segment Co-referent and tag the parts of speech; the segmentation result is {P_1, P_2, …, P_m}

define the position variables left_flag ← k, left ← 1

for each C_i ∈ {C_n, C_{n-1}, …, C_1}

for each P_j ∈ {P_{left_flag}, P_{left_flag-1}, …, P_1}

if C_i appears in P_j

then left_flag ← j;

break;

end if

end for each

for each P_k ∈ {P_1, P_2, …, P_m}

then left ← k+1;

end if

end for each

return Candidate ← {P_left, …, P_m};

Step A-3: re-determining, by the analogy method, the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set;

The analogy method is given by Method 1 and Method 2 below;

Formal representation: […]

for […] and […], if the following preconditions are satisfied simultaneously:

all Chinese characters of An appear in […];

[…] is a proper substring of […];

the frequency of […] […] 2, or the frequency of […] < 10;

[…] relative to […]

then the frequency of […] is changed to the sum of the frequencies of […] and […], and […] is deleted from the candidate full-name set;

Formal representation: […]

for […] and […], if the following preconditions are satisfied simultaneously:

the frequency of […] […] 10;

the frequency of […] […] 5 times the frequency of […];

[…] is a proper substring of […];

the number of characters of An contained in […] equals the number of characters of An contained in […];

then the frequency of […] is changed to the sum of the frequencies of […] and […], and […] is deleted from the candidate full-name set;

Step A-4: reading in the left boundary vocabulary LBV and the right boundary vocabulary RBV, and using LBV and RBV to re-determine the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set; the algorithm is as follows:

determine the segments P_i and P_j of Cfn that correspond to the first and the last character of An, respectively

define the left boundary of Cfn, left ← 1;

for each P_k ∈ {P_{i-1}, …, P_1}

if P_k is in the left boundary vocabulary

then left ← k+1;

break;

end if

end for each

define the right boundary of Cfn, right ← n;

for each P_k ∈ {P_{j+1}, …, P_n}

if P_k is in the right boundary vocabulary

then right ← k-1;

break;

end if

end for each

CFN ← {P_left, …, P_right};

return CFN;

Step A-5: extracting the nearest prefix words and nearest suffix words that exist from the benchmark full-name set and adding them to the dubious left boundary vocabulary and the dubious right boundary vocabulary, respectively; extracting the nearest left-part words and nearest right-part words that exist from the non-benchmark full-name set and adding them to the dubious left boundary vocabulary and the dubious right boundary vocabulary, respectively;

The nearest prefix word, nearest suffix word, nearest left-part word, nearest right-part word, dubious left boundary vocabulary and dubious right boundary vocabulary mentioned in step A-5 are defined and generated as follows:

Definition 1: for Cfni and Cfnj that satisfy the preconditions of Method 1 or Method 2 above, write Cfnj = left + Cfni + right, where left (if not empty) is called the left part of Cfnj and right (if not empty) is called the right part of Cfnj; left and right are each segmented with ICTCLAS, the last segment of left is called the nearest left-part word of Cfnj, and the first segment of right is called the nearest right-part word of Cfnj;

[…] is called the prefix of Cfnk and F(i-1) the nearest prefix word of Cfnk; […] is called the suffix of Cfnk and F(j+1) the nearest suffix word of Cfnk;

Formal definition:

map<key, map_value> dubious_left_boundary;

key: string prefix: a nearest left-part word or a nearest prefix word

map_value: int qu: the frequency of prefix as a nearest left-part word

int liu: the frequency of prefix as a nearest prefix word

bool flag: whether prefix needs manual verification

map<key, map_value> dubious_right_boundary;

key: string suffix: a nearest right-part word or a nearest suffix word

map_value: int qu: the frequency of suffix as a nearest right-part word

int liu: the frequency of suffix as a nearest suffix word

bool flag: whether suffix needs manual verification

Step A-6: manually verifying the dubious left boundary vocabulary and the dubious right boundary vocabulary to generate the left boundary vocabulary and the right boundary vocabulary;

If the following are satisfied:

prefix has not yet been manually verified;

the frequency of prefix as a nearest left-part word > 2;

the frequency of prefix as a nearest prefix word < 2;

the frequency of prefix as a nearest left-part word > 5 × the frequency of prefix as a nearest prefix word;

then prefix is manually verified to determine whether it is a left boundary word; if it is, it is added to the left boundary vocabulary;

Definition 5: left boundary vocabulary (LBV):

Formal definition:

map<key, map_value> left_boundary;

key: string prefix: a left boundary word

map_value: int num: the number of Cfn whose left boundary was determined with prefix

If the following are satisfied:

suffix has not yet been manually verified;

the frequency of suffix as a nearest right-part word > 2;

the frequency of suffix as a nearest suffix word < 2;

the frequency of suffix as a nearest right-part word > 9 × the frequency of suffix as a nearest suffix word;

then suffix is manually verified to determine whether it is a right boundary word; if it is, it is added to the right boundary vocabulary;

Definition 6: right boundary vocabulary (RBV):

map<key, map_value> right_boundary_cfn;

key: string suffix: a right boundary word

map_value: int num: the number of Cfn whose right boundary was determined with suffix

Step A-7: merging the benchmark candidate full-name set and the non-benchmark candidate full-name set to generate the candidate full-name set;

Steps A-1 to A-2 constitute algorithm CFNEA1.

6. The method for acquiring Chinese full names from Web pages according to claim 4, characterized in that: when query pattern 2 is selected in step 2, step 4 carries out the following steps:

Step B-1: extracting the candidate full-name set with algorithm CFNEA2;

Algorithm for extracting candidate full names (candidate fullname extract algorithm, CFNEA2):

Output: the extracted full-name/abbreviation candidate, Candidate

define the flag, flag ← 0; segment Co-referent (the open-source version apparently cannot be used for commercial purposes), and denote the result {P_1, P_2, …, P_n}

for each P_i ∈ {P_1, P_2, …, P_n}

then flag ← 1;

end if

if flag = 1 and P_i shares no character with Prefix

then break;

end if

if P_i shares a character with Inputitem

then break;

end if

end for each

if flag = 0 then i ← 0;

Candidate ← {P_i, …, P_n}

return Candidate

The candidate full-name set is obtained by the above operations.

7. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in said step 5, if the full-name set is empty and a query pattern is still available in step 2, steps 2 to 6 are executed again; if the full-name set is empty and no alternative query pattern remains in step 2, the method exits and reports that the full name of the given abbreviation cannot be found on the Web.

8. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in said step 5, the full-name/abbreviation relation constraint is a four-tuple R = (Fn, An, F, A), where Fn is the full name of an object, An is the abbreviation of the object, F is the constraint function set between Fn and An, and A is the constraint axiom set that Fn and An must satisfy; the constraint function set expresses the constraint between Fn and An quantitatively, and the constraint axiom set expresses the constraint between Fn and An qualitatively.

9. The method for acquiring Chinese full names from Web pages according to claim 8, characterized in that: said steps 5 and 6 are carried out as follows:

Step C-1: verifying each candidate full name in the candidate full-name set with constraint axioms 1-4 of the constraint axiom set;

Step C-2: generating a decision tree from the constraint function set and using it to classify the candidate full names in the candidate full-name set; candidates classified as "F1", "F2" and "F3" are removed, and candidates classified as "T" are retained, thereby generating the full-name set;

"F1" denotes a low-frequency different-character error, "F2" a high-frequency different-character error, "F3" a low-frequency different-order error, and "T" a correct full name;

Step C-3, the full name collection carried out the classification of Constraint-based collection of functions;

According to full name whether different word or different order are arranged, be divided into plain edition, different font and different order type, whether plain edition is correlated with according to linguistic context again is divided into strong linguistic context independent type, weak linguistic context independent type and linguistic context relationship type, the linguistic context independent type concentrates the relative height of frequency to be divided into high-frequency type and low frequency type according to FN at full name again, and the linguistic context relationship type is divided into forward direction type, type placed in the middle and backward type according to An to the covering center of gravity of FN;

The specific classification criteria, and the conditions that each class of full name must satisfy, are as follows:

Intuitive meaning of the high-frequency strongly context-independent type: Fn contains all characters of An and preserves their order, every word segment of Fn has a counterpart in An, and Fn has the highest frequency in the full-name set;

Intuitive meaning of the low-frequency strongly context-independent type: Fn contains all characters of An and preserves their order, every word segment of Fn has a counterpart in An, and Fn does not have the highest frequency in the full-name set;

Intuitive meaning of the high-frequency weakly context-independent type: Fn contains all characters of An and preserves their order, most word segments of Fn have a counterpart in An, and Fn has the highest frequency in the full-name set;

Intuitive meaning of the low-frequency weakly context-independent type: Fn contains all characters of An and preserves their order, most word segments of Fn have a counterpart in An, and Fn does not have the highest frequency in the full-name set;

Intuitive meaning of the forward context-dependent type: Fn contains all characters of An and preserves their order, and the word segments of Fn that are omitted are mostly in the latter half of Fn;

Intuitive meaning of the centered context-dependent type: Fn contains all characters of An and preserves their order, and the numbers of word segments omitted from the front and rear parts of Fn are similar;

Intuitive meaning of the backward context-dependent type: Fn contains all characters of An and preserves their order, and the word segments of Fn that are omitted are mostly in the first half of Fn;

Intuitive meaning of the variant-order type: Fn contains all characters of An but their order is changed, and Fn has the highest frequency in the full-name set;

Intuitive meaning of the variant-character type: Fn does not contain all characters of An, but the frequency of Fn is very high or its relative frequency in the full-name set is very high;

Step C-4: sort the full names of the same class in the full-name set according to the priority synthesis function PRI(Cfn, An);

wherein [formula not reproduced], the coefficients being the weights assigned to each function in the comprehensive evaluation.
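
The ranking in step C-4 can be illustrated with a minimal sketch. It assumes, purely for illustration, that PRI is a weighted sum of constraint-function values; the name `pri`, the weights, and the scores below are hypothetical and are not taken from the claims.

```python
from typing import Sequence


def pri(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed form of PRI: weighted sum of constraint-function values."""
    if len(scores) != len(weights):
        raise ValueError("one weight is needed per constraint function")
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical constraint-function values for two candidates of one class.
candidates = {
    "北京大学": [1.0, 1.0, 1.0],
    "北京大学堂": [1.0, 1.0, 0.67],
}
weights = [0.4, 0.3, 0.3]  # hypothetical weights

# Sort candidates of the same class by descending priority.
ranked = sorted(candidates, key=lambda c: pri(candidates[c], weights), reverse=True)
print(ranked)
```

Only the ordering matters here, so the weights need not sum to 1 under this assumption.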

10. The method for acquiring Chinese full names from Web pages according to claim 8 or 9, characterized in that the constraint function set is specifically defined as follows:

Constraint function 1: the proportion of the characters of An that come from Fn

A full name contains all the Chinese characters of its abbreviation, i.e. every Chinese character in An comes from Fn; within the candidate full-name set, a candidate full name containing a higher proportion of the characters of An has a higher priority;

The formal definition and calculation of constraint function 1 are as follows:

Constraint function 2: the character order of Fn and An

Most abbreviations preserve the character order of the full name during abbreviation: the characters in An are arranged strictly in the order in which they appear in Fn;

The formal definition and calculation of constraint function 2 are as follows:

That is, Fn and An must have the same character order and every character contained in An must appear in Fn; if An contains a character that does not appear in Fn, the value of constraint function 2 is 0;
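
The intuition behind constraint functions 1 and 2 can be sketched as follows. This is a minimal illustration under assumptions, not the claimed formulas: `char_ratio` and `keeps_order` are hypothetical names, and constraint function 2 is modeled as a simple 0/1 subsequence test.

```python
def char_ratio(an: str, fn: str) -> float:
    """Fraction of the characters of An that also occur in Fn."""
    if not an:
        return 0.0
    return sum(1 for ch in an if ch in fn) / len(an)


def keeps_order(an: str, fn: str) -> int:
    """1 if the characters of An occur in Fn in the same order, else 0."""
    remaining = iter(fn)
    # `ch in remaining` consumes the iterator up to the match, so each
    # character of An must be found after the previous one.
    return int(all(ch in remaining for ch in an))


print(char_ratio("北大", "北京大学"))   # every character of An occurs in Fn
print(keeps_order("北大", "北京大学"))  # order preserved
print(keeps_order("北大", "大学北京"))  # order changed
```

A character of An that is absent from Fn also makes `keeps_order` return 0, which matches the stated behavior of constraint function 2.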

Constraint function 3: the word-segment coverage rate of An over Fn

A full name usually consists of several word segments. In some cases one or more word segments of the full name are omitted in the abbreviation, but the omitted segments generally do not exceed one half of the number of word segments of the full name. The more word segments of a candidate full name that are covered by the abbreviation, the more likely the candidate is to be the full name;
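
The coverage idea described above can be sketched as follows. The sketch assumes that Fn is already segmented into words and that a segment counts as covered when it shares at least one character with An; both the name `segment_coverage` and this coverage rule are assumptions for illustration.

```python
from typing import Sequence


def segment_coverage(an: str, fn_segments: Sequence[str]) -> float:
    """Fraction of the word segments of Fn that share a character with An."""
    if not fn_segments:
        return 0.0
    covered = sum(1 for seg in fn_segments if any(ch in an for ch in seg))
    return covered / len(fn_segments)


# Fn is passed pre-segmented; word segmentation is outside this sketch.
print(segment_coverage("北大", ["北京", "大学"]))          # both segments covered
print(segment_coverage("人大", ["中国", "人民", "大学"]))  # first segment omitted
```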

The formal definition and calculation of constraint function 3 are as follows:

Constraint function 4: the center of gravity of An's word-segment coverage of Fn

A full name usually consists of several word segments. In some cases one or more word segments of the full name are omitted in the abbreviation, but the omitted segments should be evenly distributed over the full name and should not all be concentrated in its front part or its rear part;

The formal definition and calculation of constraint function 4 are as follows:

A candidate full name usually consists of several word segments. In some cases one or more word segments of the full name are omitted in the abbreviation, but the omitted segments usually do not occur consecutively in the full name, i.e. the probability that consecutive word segments of the full name are omitted in the abbreviation is small;

The formal definition and calculation of constraint function 5 are as follows:

wherein N denotes the number of uncovered word-segment strings contained in Fn;
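
The quantity N can be illustrated with a short sketch. It assumes that an "uncovered word-segment string" is a maximal run of consecutive segments of Fn that share no character with An; the name `uncovered_runs` and this coverage rule are assumptions, and the formula that uses N is not reconstructed here.

```python
from itertools import groupby
from typing import Sequence


def uncovered_runs(an: str, fn_segments: Sequence[str]) -> int:
    """Count maximal runs of consecutive segments of Fn not covered by An."""
    covered = [any(ch in an for ch in seg) for seg in fn_segments]
    # groupby merges adjacent equal flags, so each False group is one run.
    return sum(1 for is_covered, _ in groupby(covered) if not is_covered)


print(uncovered_runs("北大", ["北京", "大学"]))                  # no uncovered run
print(uncovered_runs("人大", ["中国", "人民", "大学"]))          # one run: 中国
print(uncovered_runs("人大", ["全国", "人民", "代表", "大会"]))  # two separate runs
```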

Constraint function 6: the length relationship between Fn and An

A standard abbreviation is usually not shortened excessively, so that most people can still grasp the meaning from the name; consequently, the length of the full name corresponding to most abbreviations falls within a range, generally 1.5 to 5 times the length of the abbreviation, and the probability that the full-name length falls outside this range is small;
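
The length relationship can be checked with a sketch like the one below. It only tests the 1.5 to 5 range stated above and does not attempt to reproduce the formal definition of constraint function 6; the function name and the use of character counts as lengths are assumptions.

```python
def length_in_typical_range(an: str, fn: str,
                            low: float = 1.5, high: float = 5.0) -> bool:
    """True if len(Fn) is between `low` and `high` times len(An)."""
    if not an:
        return False
    ratio = len(fn) / len(an)
    return low <= ratio <= high


print(length_in_typical_range("北大", "北京大学"))  # ratio 2.0
print(length_in_typical_range("北大", "北大"))      # ratio 1.0, too short
```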

The formal definition and calculation of constraint function 6 are as follows:

Constraint function 7: the frequency with which Fn occurs in GoogleArchSet(An)

When Google is searched for a full name by its abbreviation, a candidate full name that occurs more frequently in GoogleArchSet(An) has a higher priority;

The formal definition and calculation of constraint function 7 are as follows:

When Fn is searched for by An, several candidate full names are sometimes obtained, and they form the candidate full-name set Set_CFN; for any candidate full name Cfn_i in Set_CFN, the values of the other candidate full names in Set_CFN can be used for comparison when analyzing FA(Cfn_i, An);

The following 4 constraint functions are defined on the basis of the candidate full-name set:

Constraint function 8: the relative proportion of the characters of An that come from Cfn

Compared with constraint function 1, constraint function 8 emphasizes the relativity of a candidate full name within Set_CFN;

The formal definition and calculation of constraint function 8 are as follows:

Compared with constraint function 3, constraint function 9 emphasizes the relativity of a candidate full name within Set_CFN: if an abbreviation's coverage of the candidate full names is generally not high, a candidate full name with relatively higher coverage has a higher priority;

The formal definition and calculation of constraint function 9 are as follows:

When Fn is searched for by An, the frequencies of all candidate full names in the candidate full-name set are sometimes very low, which weakens the constraining effect of constraint function 7; constraint function 10 therefore considers the relative frequency of each candidate full name: within the candidate full-name set, a candidate full name with a relatively higher frequency has a higher priority;

When the candidate full-name set has many elements, candidates with lower frequency are relatively less important;

The lower the value of constraint function 11 for a candidate full name, the lower its importance;

The constraint axioms are specifically defined as follows:

Constraint axiom 1: unequal-length axiom

Formal representation:

Intuitive meaning: in a full-name/abbreviation relation, the number of characters of Fn must be greater than the number of characters of An;

Constraint axiom 2: declarative-mood axiom

Formal representation:

Intuitive meaning: Fn and An do not contain interrogative words;

Constraint axiom 3: non-repetition-of-form axiom

Formal representation:

Intuitive meaning: in a full-name/abbreviation relation, neither Fn nor An may be a Chinese character string of the form ss, where s is a Chinese character string;

Constraint axiom 4: non-repetition-of-semantics axiom

Formal representation:

Intuitive meaning: Fn must not be semantically repetitive;

Constraint axiom 5: full-name/abbreviation equivalence axiom

Formal representation:

Intuitive meaning: in a full-name/abbreviation relation, Fn must belong to the full-name set of An, and An must belong to the abbreviation set of Fn;
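
Of the axioms above, axioms 1 and 3 can be checked on the strings alone, as the following minimal sketch shows. The function names are hypothetical, and axioms 2, 4, and 5 are left out because they need an interrogative-word list, semantic knowledge, or the full-name and abbreviation sets.

```python
def is_ss_form(s: str) -> bool:
    """True if s is a non-empty string made of some string repeated twice."""
    half, remainder = divmod(len(s), 2)
    return len(s) > 0 and remainder == 0 and s[:half] == s[half:]


def satisfies_axioms_1_and_3(an: str, fn: str) -> bool:
    """Axiom 1: Fn is longer than An. Axiom 3: neither has the form ss."""
    return len(fn) > len(an) and not is_ss_form(fn) and not is_ss_form(an)


print(satisfies_axioms_1_and_3("北大", "北京大学"))  # passes both axioms
print(satisfies_axioms_1_and_3("北大", "北大北大"))  # Fn has the form ss
```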